Documentation INTEL

 

 


: matrix-matrix product, triangular matrix, double-precision complex.

Sparse BLAS level 1 naming conventions are similar to those of BLAS level 1. For more information, see Naming Conventions.

Fortran 95 Interface Conventions

Fortran 95 interface to BLAS and Sparse BLAS Level 1 routines is implemented through wrappers that call respective FORTRAN 77 routines. This interface uses such features of Fortran 95 as assumed-shape arrays and optional arguments to provide simplified calls to BLAS and Sparse BLAS Level 1 routines with fewer parameters.

Intel® Math Kernel Library Reference Manual

NOTE For BLAS, Intel MKL offers two types of Fortran 95 interfaces:
• using mkl_blas.fi only through include 'mkl_blas_subroutine.fi' statement. Such interfaces allow you to make use of the original LAPACK routines with all their arguments
• using blas.f90 that includes improved interfaces. This file is used to generate the module files blas95.mod and f95_precision.mod. The module files mkl95_blas.mod and mkl95_precision.mod are also generated. See also section "Fortran 95 interfaces and wrappers to LAPACK and BLAS" of Intel® MKL User's Guide for details. The module files are used to process the FORTRAN use clauses referencing the BLAS interface: use blas95 (or an equivalent use mkl95_blas) and use f95_precision (or an equivalent use mkl95_precision).

The main conventions used in Fortran 95 interface are as follows:
• The names of parameters used in Fortran 95 interface are typically the same as those used for the respective generic (FORTRAN 77) interface. In rare cases formal argument names may be different.
• Some input parameters such as array dimensions are not required in Fortran 95 and are skipped from the calling sequence. Array dimensions are reconstructed from the user data that must exactly follow the required array shape.
• A parameter can be skipped if its value is completely defined by the presence or absence of another parameter in the calling sequence, and the restored value is the only meaningful value for the skipped parameter.
• Parameters specifying the increment values incx and incy are skipped. In most cases their values are equal to 1. In Fortran 95, a different increment can be specified directly in the corresponding array argument.
• Some generic parameters are declared as optional in Fortran 95 interface and may or may not be present in the calling sequence. A parameter can be declared optional if it satisfies one of the following conditions:
  1. It can take only a few possible values. The default value of such a parameter is typically the first value in the list; all exceptions to this rule are explicitly stated in the routine description.
  2. It has a natural default value.

Optional parameters are given in square brackets in Fortran 95 call syntax.

The particular rules used for reconstructing the values of omitted optional parameters are specific to each routine and are detailed in the respective "Fortran 95 Notes" subsection at the end of the routine specification section. If this subsection is omitted, the Fortran 95 interface for the given routine does not differ from the corresponding FORTRAN 77 interface.

Note that this interface is not implemented in the current version of Sparse BLAS Level 2 and Level 3 routines.

Matrix Storage Schemes

Matrix arguments of BLAS routines can use the following storage schemes:
• Full storage: a matrix A is stored in a two-dimensional array a, with the matrix element a_ij stored in the array element a(i,j).
• Packed storage scheme allows you to store symmetric, Hermitian, or triangular matrices more compactly: the upper or lower triangle of the matrix is packed by columns in a one-dimensional array.
• Band storage: a band matrix is stored compactly in a two-dimensional array: columns of the matrix are stored in the corresponding columns of the array, and diagonals of the matrix are stored in rows of the array.

For more information on matrix storage schemes, see Matrix Arguments in Appendix B.

BLAS Level 1 Routines and Functions

BLAS Level 1 includes routines and functions that perform vector-vector operations. Table “BLAS Level 1 Routine Groups and Their Data Types” lists the BLAS Level 1 routine and function groups and the data types associated with them.

BLAS and Sparse BLAS Routines

BLAS Level 1 Routine and Function Groups and Their Data Types

Routine or Function Group   Data Types          Description
?asum                       s, d, sc, dz        Sum of vector magnitudes (functions)
?axpy                       s, d, c, z          Scalar-vector product (routines)
?copy                       s, d, c, z          Copy vector (routines)
?dot                        s, d                Dot product (functions)
?sdot                       sd, d               Dot product with extended precision (functions)
?dotc                       c, z                Dot product conjugated (functions)
?dotu                       c, z                Dot product unconjugated (functions)
?nrm2                       s, d, sc, dz        Vector 2-norm (Euclidean norm) (functions)
?rot                        s, d, cs, zd        Plane rotation of points (routines)
?rotg                       s, d, c, z          Generate Givens rotation of points (routines)
?rotm                       s, d                Modified Givens plane rotation of points (routines)
?rotmg                      s, d                Generate modified Givens plane rotation of points (routines)
?scal                       s, d, c, z, cs, zd  Vector-scalar product (routines)
?swap                       s, d, c, z          Vector-vector swap (routines)
i?amax                      s, d, c, z          Index of the maximum absolute value element of a vector (functions)
i?amin                      s, d, c, z          Index of the minimum absolute value element of a vector (functions)
?cabs1                      s, d                Auxiliary functions, compute the absolute value of a complex number of single or double precision

?asum
Computes the sum of magnitudes of the vector elements.
Syntax
Fortran 77:
res = sasum(n, x, incx)
res = scasum(n, x, incx)
res = dasum(n, x, incx)
res = dzasum(n, x, incx)
Fortran 95:
res = asum(x)

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description
The ?asum routine computes the sum of the magnitudes of elements of a real vector, or the sum of magnitudes of the real and imaginary parts of elements of a complex vector:
res = |Re x(1)| + |Im x(1)| + |Re x(2)| + |Im x(2)| + ... + |Re x(n)| + |Im x(n)|,
where x is a vector with a number of elements that equals n.

Input Parameters
n       INTEGER. Specifies the number of elements in vector x.
x       REAL for sasum
        DOUBLE PRECISION for dasum
        COMPLEX for scasum
        DOUBLE COMPLEX for dzasum
        Array, DIMENSION at least (1 + (n-1)*abs(incx)).
incx    INTEGER. Specifies the increment for indexing vector x.

Output Parameters
res     REAL for sasum
        DOUBLE PRECISION for dasum
        REAL for scasum
        DOUBLE PRECISION for dzasum
        Contains the sum of magnitudes of real and imaginary parts of all elements of the vector.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine asum interface are the following:
x       Holds the array of size n.

?axpy
Computes a vector-scalar product and adds the result to a vector.

Syntax
Fortran 77:
call saxpy(n, a, x, incx, y, incy)
call daxpy(n, a, x, incx, y, incy)
call caxpy(n, a, x, incx, y, incy)
call zaxpy(n, a, x, incx, y, incy)
Fortran 95:
call axpy(x, y [,a])

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description
The ?axpy routines perform a vector-vector operation defined as
y := a*x + y
where:
a is a scalar,
x and y are vectors each with a number of elements that equals n.
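As an illustration of the ?asum and ?axpy definitions above, the following is a minimal Python sketch of their reference semantics. It is not the MKL interface: the names ref_asum and ref_axpy are hypothetical, indexing is 0-based, and positive increments are assumed.

```python
def ref_asum(n, x, incx=1):
    """Sum of |Re| + |Im| over n elements of x taken with stride incx.

    Hypothetical reference helper; assumes incx > 0.
    """
    total = 0.0
    for i in range(n):
        v = complex(x[i * incx])
        total += abs(v.real) + abs(v.imag)
    return total


def ref_axpy(n, a, x, incx, y, incy):
    """Update y in place: y := a*x + y over n strided elements.

    Hypothetical reference helper; assumes incx > 0 and incy > 0.
    """
    for i in range(n):
        y[i * incy] += a * x[i * incx]
    return y


x = [1.0, -2.0, 3.0]
y = [10.0, 20.0, 30.0]
print(ref_asum(3, x))                  # 6.0
print(ref_asum(2, [3 - 4j, -1 + 2j]))  # 10.0
print(ref_axpy(3, 2.0, x, 1, y, 1))    # [12.0, 16.0, 36.0]
```

With n not positive, both loops are skipped, so ref_asum returns 0.0 and ref_axpy leaves y unchanged.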
Input Parameters
n       INTEGER. Specifies the number of elements in vectors x and y.
a       REAL for saxpy
        DOUBLE PRECISION for daxpy
        COMPLEX for caxpy
        DOUBLE COMPLEX for zaxpy
        Specifies the scalar a.
x       REAL for saxpy
        DOUBLE PRECISION for daxpy
        COMPLEX for caxpy
        DOUBLE COMPLEX for zaxpy
        Array, DIMENSION at least (1 + (n-1)*abs(incx)).
incx    INTEGER. Specifies the increment for the elements of x.
y       REAL for saxpy
        DOUBLE PRECISION for daxpy
        COMPLEX for caxpy
        DOUBLE COMPLEX for zaxpy
        Array, DIMENSION at least (1 + (n-1)*abs(incy)).
incy    INTEGER. Specifies the increment for the elements of y.

Output Parameters
y       Contains the updated vector y.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine axpy interface are the following:
x       Holds the array of size n.
y       Holds the array of size n.
a       The default value is 1.

?copy
Copies vector to another vector.

Syntax
Fortran 77:
call scopy(n, x, incx, y, incy)
call dcopy(n, x, incx, y, incy)
call ccopy(n, x, incx, y, incy)
call zcopy(n, x, incx, y, incy)
Fortran 95:
call copy(x, y)

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description
The ?copy routines perform a vector-vector operation defined as
y = x,
where x and y are vectors.

Input Parameters
n       INTEGER. Specifies the number of elements in vectors x and y.
x       REAL for scopy
        DOUBLE PRECISION for dcopy
        COMPLEX for ccopy
        DOUBLE COMPLEX for zcopy
        Array, DIMENSION at least (1 + (n-1)*abs(incx)).
incx    INTEGER. Specifies the increment for the elements of x.
y       REAL for scopy
        DOUBLE PRECISION for dcopy
        COMPLEX for ccopy
        DOUBLE COMPLEX for zcopy
        Array, DIMENSION at least (1 + (n-1)*abs(incy)).
incy    INTEGER. Specifies the increment for the elements of y.
Output Parameters
y       Contains a copy of the vector x if n is positive. Otherwise, parameters are unaltered.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine copy interface are the following:
x       Holds the vector with the number of elements n.
y       Holds the vector with the number of elements n.

?dot
Computes a vector-vector dot product.

Syntax
Fortran 77:
res = sdot(n, x, incx, y, incy)
res = ddot(n, x, incx, y, incy)
Fortran 95:
res = dot(x, y)

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description
The ?dot routines perform a vector-vector reduction operation defined as

$$ res = \sum_{i=1}^{n} x_i y_i $$

where x_i and y_i are elements of vectors x and y.

Input Parameters
n       INTEGER. Specifies the number of elements in vectors x and y.
x       REAL for sdot
        DOUBLE PRECISION for ddot
        Array, DIMENSION at least (1+(n-1)*abs(incx)).
incx    INTEGER. Specifies the increment for the elements of x.
y       REAL for sdot
        DOUBLE PRECISION for ddot
        Array, DIMENSION at least (1+(n-1)*abs(incy)).
incy    INTEGER. Specifies the increment for the elements of y.

Output Parameters
res     REAL for sdot
        DOUBLE PRECISION for ddot
        Contains the result of the dot product of x and y, if n is positive. Otherwise, res contains 0.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine dot interface are the following:
x       Holds the vector with the number of elements n.
y       Holds the vector with the number of elements n.
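The relationship between the FORTRAN 77 and Fortran 95 call forms of ?dot can be sketched in Python as follows. This is only an illustration of the semantics under stated assumptions, not the MKL API: ref_dot and dot95 are hypothetical names, indexing is 0-based, and increments are assumed positive.

```python
def ref_dot(n, x, incx, y, incy):
    """FORTRAN 77 style call: explicit length and increments.

    Hypothetical reference helper; returns 0 when n is not positive.
    """
    if n <= 0:
        return 0.0
    return sum(x[i * incx] * y[i * incy] for i in range(n))


def dot95(x, y):
    """Fortran 95 style call: n is taken from the array, increments are 1."""
    return ref_dot(len(x), x, 1, y, 1)


x = [1.0, 2.0, 3.0]
y = [4.0, 5.0, 6.0]
print(ref_dot(3, x, 1, y, 1))                           # 32.0
print(dot95(x, y))                                      # 32.0
# Every second element of a longer array, selected with incx = 2
print(ref_dot(3, [1.0, 0.0, 2.0, 0.0, 3.0], 2, y, 1))   # 32.0
```

The shorter call needs no n, incx, or incy because, as in the conventions described earlier, those values can be reconstructed from the arrays that are passed.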
?sdot
Computes a vector-vector dot product with extended precision.

Syntax
Fortran 77:
res = sdsdot(n, sb, sx, incx, sy, incy)
res = dsdot(n, sx, incx, sy, incy)
Fortran 95:
res = sdot(sx, sy)
res = sdot(sx, sy, sb)

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description
The ?sdot routines compute the inner product of two vectors with extended precision. Both routines use extended precision accumulation of the intermediate results, but the sdsdot routine outputs the final result in single precision, whereas the dsdot routine outputs the double precision result. The function sdsdot also adds scalar value sb to the inner product.

Input Parameters
n       INTEGER. Specifies the number of elements in the input vectors sx and sy.
sb      REAL. Single precision scalar to be added to inner product (for the function sdsdot only).
sx, sy  REAL. Arrays, DIMENSION at least (1+(n-1)*abs(incx)) and (1+(n-1)*abs(incy)), respectively. Contain the input single precision vectors.
incx    INTEGER. Specifies the increment for the elements of sx.
incy    INTEGER. Specifies the increment for the elements of sy.

Output Parameters
res     REAL for sdsdot
        DOUBLE PRECISION for dsdot
        Contains the result of the dot product of sx and sy (with sb added for sdsdot), if n is positive. Otherwise, res contains sb for sdsdot and 0 for dsdot.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine sdot interface are the following:
sx      Holds the vector with the number of elements n.
sy      Holds the vector with the number of elements n.
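To show why extended-precision accumulation matters, here is a small Python sketch. It assumes that "extended precision" can be modelled by accumulating in double precision, and it emulates single precision by rounding through the struct module. All function names are hypothetical, and increments are assumed positive.

```python
import struct


def to_single(v):
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", v))[0]


def ref_dsdot(n, sx, incx, sy, incy):
    """Single-precision inputs, double-precision accumulation and result."""
    acc = 0.0
    for i in range(n):
        acc += to_single(sx[i * incx]) * to_single(sy[i * incy])
    return acc


def ref_sdsdot(n, sb, sx, incx, sy, incy):
    """As ref_dsdot, plus sb, with the final result rounded to single."""
    return to_single(to_single(sb) + ref_dsdot(n, sx, incx, sy, incy))


def naive_single_dot(n, sx, incx, sy, incy):
    """For contrast only: rounds the accumulator to single at every step."""
    acc = 0.0
    for i in range(n):
        prod = to_single(to_single(sx[i * incx]) * to_single(sy[i * incy]))
        acc = to_single(acc + prod)
    return acc


sx = [1.0e8, 1.0, -1.0e8]
sy = [1.0, 1.0, 1.0]
print(naive_single_dot(3, sx, 1, sy, 1))   # 0.0 (the 1.0 is lost in rounding)
print(ref_dsdot(3, sx, 1, sy, 1))          # 1.0
print(ref_sdsdot(3, 0.5, sx, 1, sy, 1))    # 1.5
```

With n not positive the loop is skipped, so ref_dsdot returns 0.0 and ref_sdsdot returns sb, matching the Output Parameters description.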
NOTE Scalar parameter sb is declared as a required parameter in Fortran 95 interface for the function sdot to distinguish between function flavors that output the final result in different precision.

?dotc
Computes a dot product of a conjugated vector with another vector.

Syntax
Fortran 77:
res = cdotc(n, x, incx, y, incy)
res = zdotc(n, x, incx, y, incy)
Fortran 95:
res = dotc(x, y)

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description
The ?dotc routines perform a vector-vector operation defined as:

$$ res = \sum_{i=1}^{n} \overline{x_i}\, y_i $$

where x_i and y_i are elements of vectors x and y, and \overline{x_i} denotes the complex conjugate of x_i.

Input Parameters
n       INTEGER. Specifies the number of elements in vectors x and y.
x       COMPLEX for cdotc
        DOUBLE COMPLEX for zdotc
        Array, DIMENSION at least (1 + (n-1)*abs(incx)).
incx    INTEGER. Specifies the increment for the elements of x.
y       COMPLEX for cdotc
        DOUBLE COMPLEX for zdotc
        Array, DIMENSION at least (1 + (n-1)*abs(incy)).
incy    INTEGER. Specifies the increment for the elements of y.

Output Parameters
res     COMPLEX for cdotc
        DOUBLE COMPLEX for zdotc
        Contains the result of the dot product of the conjugated x and unconjugated y, if n is positive. Otherwise, res contains 0.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine dotc interface are the following:
x       Holds the vector with the number of elements n.
y       Holds the vector with the number of elements n.

?dotu
Computes a vector-vector dot product.
Syntax
Fortran 77:
res = cdotu(n, x, incx, y, incy)
res = zdotu(n, x, incx, y, incy)
Fortran 95:
res = dotu(x, y)

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description
The ?dotu routines perform a vector-vector reduction operation defined as

$$ res = \sum_{i=1}^{n} x_i y_i $$

where x_i and y_i are elements of complex vectors x and y.

Input Parameters
n       INTEGER. Specifies the number of elements in vectors x and y.
x       COMPLEX for cdotu
        DOUBLE COMPLEX for zdotu
        Array, DIMENSION at least (1 + (n-1)*abs(incx)).
incx    INTEGER. Specifies the increment for the elements of x.
y       COMPLEX for cdotu
        DOUBLE COMPLEX for zdotu
        Array, DIMENSION at least (1 + (n-1)*abs(incy)).
incy    INTEGER. Specifies the increment for the elements of y.

Output Parameters
res     COMPLEX for cdotu
        DOUBLE COMPLEX for zdotu
        Contains the result of the dot product of x and y, if n is positive. Otherwise, res contains 0.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine dotu interface are the following:
x       Holds the vector with the number of elements n.
y       Holds the vector with the number of elements n.

?nrm2
Computes the Euclidean norm of a vector.

Syntax
Fortran 77:
res = snrm2(n, x, incx)
res = dnrm2(n, x, incx)
res = scnrm2(n, x, incx)
res = dznrm2(n, x, incx)
Fortran 95:
res = nrm2(x)

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description
The ?nrm2 routines perform a vector reduction operation defined as
res = ||x||,
where:
x is a vector,
res is a value containing the Euclidean norm of the elements of x.

Input Parameters
n       INTEGER. Specifies the number of elements in vector x.
x       REAL for snrm2
        DOUBLE PRECISION for dnrm2
        COMPLEX for scnrm2
        DOUBLE COMPLEX for dznrm2
        Array, DIMENSION at least (1 + (n-1)*abs(incx)).
incx    INTEGER. Specifies the increment for the elements of x.

Output Parameters
res     REAL for snrm2
        DOUBLE PRECISION for dnrm2
        REAL for scnrm2
        DOUBLE PRECISION for dznrm2
        Contains the Euclidean norm of the vector x.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine nrm2 interface are the following:
x       Holds the vector with the number of elements n.

?rot
Performs rotation of points in the plane.

Syntax
Fortran 77:
call srot(n, x, incx, y, incy, c, s)
call drot(n, x, incx, y, incy, c, s)
call csrot(n, x, incx, y, incy, c, s)
call zdrot(n, x, incx, y, incy, c, s)
Fortran 95:
call rot(x, y, c, s)

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description
Given two vectors x and y (real for srot and drot, complex for csrot and zdrot), each vector element of these vectors is replaced as follows:
x(i) = c*x(i) + s*y(i)
y(i) = c*y(i) - s*x(i)

Input Parameters
n       INTEGER. Specifies the number of elements in vectors x and y.
x       REAL for srot
        DOUBLE PRECISION for drot
        COMPLEX for csrot
        DOUBLE COMPLEX for zdrot
        Array, DIMENSION at least (1 + (n-1)*abs(incx)).
incx    INTEGER. Specifies the increment for the elements of x.
y       REAL for srot
        DOUBLE PRECISION for drot
        COMPLEX for csrot
        DOUBLE COMPLEX for zdrot
        Array, DIMENSION at least (1 + (n-1)*abs(incy)).
incy    INTEGER. Specifies the increment for the elements of y.
c       REAL for srot
        DOUBLE PRECISION for drot
        REAL for csrot
        DOUBLE PRECISION for zdrot
        A scalar.
s       REAL for srot
        DOUBLE PRECISION for drot
        REAL for csrot
        DOUBLE PRECISION for zdrot
        A scalar.
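The replacement rule in the ?rot description uses the original x(i) and y(i) on both right-hand sides, so an implementation has to read both values before overwriting either. A minimal Python sketch of that rule follows; ref_rot is a hypothetical name, not an MKL routine, and positive increments with 0-based indexing are assumed.

```python
def ref_rot(n, x, incx, y, incy, c, s):
    """Apply the plane rotation in place to n strided element pairs.

    Hypothetical reference helper; assumes incx > 0 and incy > 0.
    """
    for i in range(n):
        # Read both old values first: the new y(i) depends on the old x(i).
        xi = x[i * incx]
        yi = y[i * incy]
        x[i * incx] = c * xi + s * yi
        y[i * incy] = c * yi - s * xi
    return x, y


x = [1.0, 2.0]
y = [3.0, 4.0]
# c = 0, s = 1 maps each point (x, y) to (y, -x)
ref_rot(2, x, 1, y, 1, 0.0, 1.0)
print(x)  # [3.0, 4.0]
print(y)  # [-1.0, -2.0]
```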
Output Parameters
x       Each element is replaced by c*x + s*y.
y       Each element is replaced by c*y - s*x.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine rot interface are the following:
x       Holds the vector with the number of elements n.
y       Holds the vector with the number of elements n.

?rotg
Computes the parameters for a Givens rotation.

Syntax
Fortran 77:
call srotg(a, b, c, s)
call drotg(a, b, c, s)
call crotg(a, b, c, s)
call zrotg(a, b, c, s)
Fortran 95:
call rotg(a, b, c, s)

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description
Given the Cartesian coordinates (a, b) of a point, these routines return the parameters c, s, r, and z associated with the Givens rotation. The parameters c and s define a unitary matrix such that:

$$ \begin{bmatrix} c & s \\ -s & c \end{bmatrix} \begin{bmatrix} a \\ b \end{bmatrix} = \begin{bmatrix} r \\ 0 \end{bmatrix} $$

The parameter z is defined such that if |a| > |b|, z is s; otherwise if c is not 0 z is 1/c; otherwise z is 1.
See a more accurate LAPACK version ?lartg.

Input Parameters
a       REAL for srotg
        DOUBLE PRECISION for drotg
        COMPLEX for crotg
        DOUBLE COMPLEX for zrotg
        Provides the x-coordinate of the point p.
b       REAL for srotg
        DOUBLE PRECISION for drotg
        COMPLEX for crotg
        DOUBLE COMPLEX for zrotg
        Provides the y-coordinate of the point p.

Output Parameters
a       Contains the parameter r associated with the Givens rotation.
b       Contains the parameter z associated with the Givens rotation.
c       REAL for srotg
        DOUBLE PRECISION for drotg
        REAL for crotg
        DOUBLE PRECISION for zrotg
        Contains the parameter c associated with the Givens rotation.
s       REAL for srotg
        DOUBLE PRECISION for drotg
        COMPLEX for crotg
        DOUBLE COMPLEX for zrotg
        Contains the parameter s associated with the Givens rotation.

?rotm
Performs modified Givens rotation of points in the plane.
Syntax
Fortran 77:
call srotm(n, x, incx, y, incy, param)
call drotm(n, x, incx, y, incy, param)
Fortran 95:
call rotm(x, y, param)

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description
Given two vectors x and y, each vector element of these vectors is replaced as follows:

$$ \begin{bmatrix} x_i \\ y_i \end{bmatrix} = H \begin{bmatrix} x_i \\ y_i \end{bmatrix} $$

for i=1 to n, where H is a modified Givens transformation matrix whose values are stored in the param(2) through param(5) array. See discussion on the param argument.

Input Parameters
n       INTEGER. Specifies the number of elements in vectors x and y.
x       REAL for srotm
        DOUBLE PRECISION for drotm
        Array, DIMENSION at least (1 + (n-1)*abs(incx)).
incx    INTEGER. Specifies the increment for the elements of x.
y       REAL for srotm
        DOUBLE PRECISION for drotm
        Array, DIMENSION at least (1 + (n-1)*abs(incy)).
incy    INTEGER. Specifies the increment for the elements of y.
param   REAL for srotm
        DOUBLE PRECISION for drotm
        Array, DIMENSION 5.
        The elements of the param array are:
        param(1) contains a switch, flag.
        param(2-5) contain h11, h21, h12, and h22, respectively, the components of the array H.
        Depending on the values of flag, the components of H are set as follows:

$$ flag = -1.:\ H = \begin{bmatrix} h11 & h12 \\ h21 & h22 \end{bmatrix} \qquad flag = 0.:\ H = \begin{bmatrix} 1. & h12 \\ h21 & 1. \end{bmatrix} $$

$$ flag = 1.:\ H = \begin{bmatrix} h11 & 1. \\ -1. & h22 \end{bmatrix} \qquad flag = -2.:\ H = \begin{bmatrix} 1. & 0. \\ 0. & 1. \end{bmatrix} $$

        In the last three cases, the matrix entries of 1., -1., and 0. are assumed based on the value of flag and are not required to be set in the param vector.

Output Parameters
x       Each element x(i) is replaced by h11*x(i) + h12*y(i).
y       Each element y(i) is replaced by h21*x(i) + h22*y(i).

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine rotm interface are the following:
x       Holds the vector with the number of elements n.
y       Holds the vector with the number of elements n.
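The way param encodes H can be sketched in Python as follows. This assumes the conventional reference BLAS meaning of the flag values (-1, 0, 1, -2) and the storage order h11, h21, h12, h22 stated above. The helper names are hypothetical, indexing is 0-based, and positive increments are assumed.

```python
def h_from_param(param):
    """Return (h11, h12, h21, h22) implied by param, assuming the
    conventional reference BLAS meaning of the flag in param[0]."""
    flag = param[0]
    h11, h21, h12, h22 = param[1:5]  # stored in this order
    if flag == -1.0:
        return h11, h12, h21, h22
    if flag == 0.0:
        return 1.0, h12, h21, 1.0    # diagonal entries are implied
    if flag == 1.0:
        return h11, 1.0, -1.0, h22   # off-diagonal entries are implied
    if flag == -2.0:
        return 1.0, 0.0, 0.0, 1.0    # identity: nothing to do
    raise ValueError("unexpected flag value")


def ref_rotm(n, x, incx, y, incy, param):
    """Apply H to each pair (x(i), y(i)) in place."""
    h11, h12, h21, h22 = h_from_param(param)
    for i in range(n):
        xi, yi = x[i * incx], y[i * incy]
        x[i * incx] = h11 * xi + h12 * yi
        y[i * incy] = h21 * xi + h22 * yi
    return x, y


# flag = 0: only h21 (param[2]) and h12 (param[3]) are read; 9.9 is ignored
x, y = ref_rotm(2, [1.0, 2.0], 1, [4.0, 6.0], 1, [0.0, 9.9, 2.0, 0.5, 9.9])
print(x)  # [3.0, 5.0]
print(y)  # [6.0, 10.0]
```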
?rotmg
Computes the parameters for a modified Givens rotation.

Syntax
Fortran 77:
call srotmg(d1, d2, x1, y1, param)
call drotmg(d1, d2, x1, y1, param)
Fortran 95:
call rotmg(d1, d2, x1, y1, param)

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description
Given Cartesian coordinates (x1, y1) of an input vector, these routines compute the components of a modified Givens transformation matrix H that zeros the y-component of the resulting vector:

$$ \begin{bmatrix} x1 \\ 0 \end{bmatrix} = H \begin{bmatrix} x1 \sqrt{d1} \\ y1 \sqrt{d2} \end{bmatrix} $$

Input Parameters
d1      REAL for srotmg
        DOUBLE PRECISION for drotmg
        Provides the scaling factor for the x-coordinate of the input vector.
d2      REAL for srotmg
        DOUBLE PRECISION for drotmg
        Provides the scaling factor for the y-coordinate of the input vector.
x1      REAL for srotmg
        DOUBLE PRECISION for drotmg
        Provides the x-coordinate of the input vector.
y1      REAL for srotmg
        DOUBLE PRECISION for drotmg
        Provides the y-coordinate of the input vector.

Output Parameters
d1      REAL for srotmg
        DOUBLE PRECISION for drotmg
        Provides the first diagonal element of the updated matrix.
d2      REAL for srotmg
        DOUBLE PRECISION for drotmg
        Provides the second diagonal element of the updated matrix.
x1      REAL for srotmg
        DOUBLE PRECISION for drotmg
        Provides the x-coordinate of the rotated vector before scaling.
param   REAL for srotmg
        DOUBLE PRECISION for drotmg
        Array, DIMENSION 5.
        The elements of the param array are:
        param(1) contains a switch, flag.
        param(2-5) contain h11, h21, h12, and h22, respectively, the components of the array H.
        Depending on the values of flag, the components of H are set as follows:

$$ flag = -1.:\ H = \begin{bmatrix} h11 & h12 \\ h21 & h22 \end{bmatrix} \qquad flag = 0.:\ H = \begin{bmatrix} 1. & h12 \\ h21 & 1. \end{bmatrix} $$

$$ flag = 1.:\ H = \begin{bmatrix} h11 & 1. \\ -1. & h22 \end{bmatrix} \qquad flag = -2.:\ H = \begin{bmatrix} 1. & 0. \\ 0. & 1. \end{bmatrix} $$

        In the last three cases, the matrix entries of 1., -1., and 0. are assumed based on the value of flag and are not required to be set in the param vector.

?scal
Computes the product of a vector by a scalar.
Syntax
Fortran 77:
call sscal(n, a, x, incx)
call dscal(n, a, x, incx)
call cscal(n, a, x, incx)
call zscal(n, a, x, incx)
call csscal(n, a, x, incx)
call zdscal(n, a, x, incx)
Fortran 95:
call scal(x, a)

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description
The ?scal routines perform a vector operation defined as
x = a*x
where:
a is a scalar,
x is an n-element vector.

Input Parameters
n       INTEGER. Specifies the number of elements in vector x.
a       REAL for sscal and csscal
        DOUBLE PRECISION for dscal and zdscal
        COMPLEX for cscal
        DOUBLE COMPLEX for zscal
        Specifies the scalar a.
x       REAL for sscal
        DOUBLE PRECISION for dscal
        COMPLEX for cscal and csscal
        DOUBLE COMPLEX for zscal and zdscal
        Array, DIMENSION at least (1 + (n-1)*abs(incx)).
incx    INTEGER. Specifies the increment for the elements of x.

Output Parameters
x       Updated vector x.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine scal interface are the following:
x       Holds the vector with the number of elements n.

?swap
Swaps a vector with another vector.

Syntax
Fortran 77:
call sswap(n, x, incx, y, incy)
call dswap(n, x, incx, y, incy)
call cswap(n, x, incx, y, incy)
call zswap(n, x, incx, y, incy)
Fortran 95:
call swap(x, y)

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description
Given two vectors x and y, the ?swap routines return vectors y and x swapped, each replacing the other.

Input Parameters
n       INTEGER. Specifies the number of elements in vectors x and y.
x       REAL for sswap
        DOUBLE PRECISION for dswap
        COMPLEX for cswap
        DOUBLE COMPLEX for zswap
        Array, DIMENSION at least (1 + (n-1)*abs(incx)).
incx    INTEGER. Specifies the increment for the elements of x.
y       REAL for sswap
        DOUBLE PRECISION for dswap
        COMPLEX for cswap
        DOUBLE COMPLEX for zswap
        Array, DIMENSION at least (1 + (n-1)*abs(incy)).
incy    INTEGER. Specifies the increment for the elements of y.

Output Parameters
x       Contains the resultant vector x, that is, the input vector y.
y       Contains the resultant vector y, that is, the input vector x.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine swap interface are the following:
x       Holds the vector with the number of elements n.
y       Holds the vector with the number of elements n.

i?amax
Finds the index of the element with maximum absolute value.

Syntax
Fortran 77:
index = isamax(n, x, incx)
index = idamax(n, x, incx)
index = icamax(n, x, incx)
index = izamax(n, x, incx)
Fortran 95:
index = iamax(x)

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description
This function is declared in mkl_blas.fi for FORTRAN 77 interface, in blas.f90 for Fortran 95 interface, and in mkl_blas.h for C interface.
Given a vector x, the i?amax functions return the position of the vector element x(i) that has the largest absolute value for real flavors, or the largest sum |Re(x(i))|+|Im(x(i))| for complex flavors.
If n is not positive, 0 is returned.
If more than one vector element is found with the same largest absolute value, the index of the first one encountered is returned.

Input Parameters
n       INTEGER. Specifies the number of elements in vector x.
x       REAL for isamax
        DOUBLE PRECISION for idamax
        COMPLEX for icamax
        DOUBLE COMPLEX for izamax
        Array, DIMENSION at least (1+(n-1)*abs(incx)).
incx    INTEGER. Specifies the increment for the elements of x.

Output Parameters
index   INTEGER. Contains the position of vector element x that has the largest absolute value.

Fortran 95 Interface Notes
Functions and routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the function iamax interface are the following:
x       Holds the vector with the number of elements n.

i?amin
Finds the index of the element with the smallest absolute value.

Syntax
Fortran 77:
index = isamin(n, x, incx)
index = idamin(n, x, incx)
index = icamin(n, x, incx)
index = izamin(n, x, incx)
Fortran 95:
index = iamin(x)

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description
This function is declared in mkl_blas.fi for FORTRAN 77 interface, in blas.f90 for Fortran 95 interface, and in mkl_blas.h for C interface.
Given a vector x, the i?amin functions return the position of the vector element x(i) that has the smallest absolute value for real flavors, or the smallest sum |Re(x(i))|+|Im(x(i))| for complex flavors.
If n is not positive, 0 is returned.
If more than one vector element is found with the same smallest absolute value, the index of the first one encountered is returned.

Input Parameters
n       INTEGER. On entry, n specifies the number of elements in vector x.
x       REAL for isamin
        DOUBLE PRECISION for idamin
        COMPLEX for icamin
        DOUBLE COMPLEX for izamin
        Array, DIMENSION at least (1+(n-1)*abs(incx)).
incx    INTEGER. Specifies the increment for the elements of x.

Output Parameters
index   INTEGER. Contains the position of vector element x that has the smallest absolute value.

Fortran 95 Interface Notes
Functions and routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the function iamin interface are the following:
x       Holds the vector with the number of elements n.

?cabs1
Computes absolute value of complex number.

Syntax
Fortran 77:
res = scabs1(z)
res = dcabs1(z)
Fortran 95:
res = cabs1(z)

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description
The ?cabs1 is an auxiliary routine for a few BLAS Level 1 routines. This routine performs an operation defined as
res = |Re(z)| + |Im(z)|,
where z is a scalar, and res is a value containing the absolute value of a complex number z.

Input Parameters
z       COMPLEX scalar for scabs1.
        DOUBLE COMPLEX scalar for dcabs1.

Output Parameters
res     REAL for scabs1.
        DOUBLE PRECISION for dcabs1.
        Contains the absolute value of a complex number z.

BLAS Level 2 Routines

This section describes BLAS Level 2 routines, which perform matrix-vector operations. Table “BLAS Level 2 Routine Groups and Their Data Types” lists the BLAS Level 2 routine groups and the data types associated with them.
BLAS Level 2 Routine Groups and Their Data Types

Routine Groups   Data Types    Description
?gbmv            s, d, c, z    Matrix-vector product using a general band matrix
?gemv            s, d, c, z    Matrix-vector product using a general matrix
?ger             s, d          Rank-1 update of a general matrix
?gerc            c, z          Rank-1 update of a conjugated general matrix
?geru            c, z          Rank-1 update of a general matrix, unconjugated
?hbmv            c, z          Matrix-vector product using a Hermitian band matrix
?hemv            c, z          Matrix-vector product using a Hermitian matrix
?her             c, z          Rank-1 update of a Hermitian matrix
?her2            c, z          Rank-2 update of a Hermitian matrix
?hpmv            c, z          Matrix-vector product using a Hermitian packed matrix
?hpr             c, z          Rank-1 update of a Hermitian packed matrix
?hpr2            c, z          Rank-2 update of a Hermitian packed matrix
?sbmv            s, d          Matrix-vector product using symmetric band matrix
?spmv            s, d          Matrix-vector product using a symmetric packed matrix
?spr             s, d          Rank-1 update of a symmetric packed matrix
?spr2            s, d          Rank-2 update of a symmetric packed matrix
?symv            s, d          Matrix-vector product using a symmetric matrix
?syr             s, d          Rank-1 update of a symmetric matrix
?syr2            s, d          Rank-2 update of a symmetric matrix
?tbmv            s, d, c, z    Matrix-vector product using a triangular band matrix
?tbsv            s, d, c, z    Solution of a linear system of equations with a triangular band matrix
?tpmv            s, d, c, z    Matrix-vector product using a triangular packed matrix
?tpsv            s, d, c, z    Solution of a linear system of equations with a triangular packed matrix
?trmv            s, d, c, z    Matrix-vector product using a triangular matrix
?trsv            s, d, c, z    Solution of a linear system of equations with a triangular matrix

?gbmv
Computes a matrix-vector product using a general band matrix.

Syntax
Fortran 77:
call sgbmv(trans, m, n, kl, ku, alpha, a, lda, x, incx, beta, y, incy)
call dgbmv(trans, m, n, kl, ku, alpha, a, lda, x, incx, beta, y, incy)
call cgbmv(trans, m, n, kl, ku, alpha, a, lda, x, incx, beta, y, incy)
call zgbmv(trans, m, n, kl, ku, alpha, a, lda, x, incx, beta, y, incy)
Fortran 95:
call gbmv(a, x, y [,kl] [,m] [,alpha] [,beta] [,trans])

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description
The ?gbmv routines perform a matrix-vector operation defined as
y := alpha*A*x + beta*y,
or
y := alpha*A'*x + beta*y,
or
y := alpha*conjg(A')*x + beta*y,
where:
alpha and beta are scalars,
x and y are vectors,
A is an m-by-n band matrix, with kl sub-diagonals and ku super-diagonals.

Input Parameters
trans   CHARACTER*1. Specifies the operation:
        If trans = 'N' or 'n', then y := alpha*A*x + beta*y
        If trans = 'T' or 't', then y := alpha*A'*x + beta*y
        If trans = 'C' or 'c', then y := alpha*conjg(A')*x + beta*y
m       INTEGER. Specifies the number of rows of the matrix A. The value of m must be at least zero.
n       INTEGER. Specifies the number of columns of the matrix A. The value of n must be at least zero.
kl      INTEGER. Specifies the number of sub-diagonals of the matrix A. The value of kl must satisfy 0 ≤ kl.
ku      INTEGER. Specifies the number of super-diagonals of the matrix A. The value of ku must satisfy 0 ≤ ku.
alpha   REAL for sgbmv
        DOUBLE PRECISION for dgbmv
        COMPLEX for cgbmv
        DOUBLE COMPLEX for zgbmv
        Specifies the scalar alpha.
a       REAL for sgbmv
        DOUBLE PRECISION for dgbmv
        COMPLEX for cgbmv
        DOUBLE COMPLEX for zgbmv
        Array, DIMENSION (lda, n). Before entry, the leading (kl + ku + 1) by n part of the array a must contain the matrix of coefficients. This matrix must be supplied column-by-column, with the leading diagonal of the matrix in row (ku + 1) of the array, the first super-diagonal starting at position 2 in row ku, the first sub-diagonal starting at position 1 in row (ku + 2), and so on. Elements in the array a that do not correspond to elements in the band matrix (such as the top left ku by ku triangle) are not referenced.
    The following program segment transfers a band matrix from conventional full matrix storage to band storage:

        do 20, j = 1, n
           k = ku + 1 - j
           do 10, i = max(1, j-ku), min(m, j+kl)
              a(k+i, j) = matrix(i,j)
        10 continue
        20 continue

lda
    INTEGER. Specifies the leading dimension of a as declared in the calling (sub)program. The value of lda must be at least (kl + ku + 1).
x
    REAL for sgbmv
    DOUBLE PRECISION for dgbmv
    COMPLEX for cgbmv
    DOUBLE COMPLEX for zgbmv
    Array, DIMENSION at least (1 + (n - 1)*abs(incx)) when trans = 'N' or 'n', and at least (1 + (m - 1)*abs(incx)) otherwise.
    Before entry, the array x must contain the vector x.
incx
    INTEGER. Specifies the increment for the elements of x. The value of incx must not be zero.
beta
    REAL for sgbmv
    DOUBLE PRECISION for dgbmv
    COMPLEX for cgbmv
    DOUBLE COMPLEX for zgbmv
    Specifies the scalar beta. When beta is equal to zero, then y need not be set on input.
y
    REAL for sgbmv
    DOUBLE PRECISION for dgbmv
    COMPLEX for cgbmv
    DOUBLE COMPLEX for zgbmv
    Array, DIMENSION at least (1 + (m - 1)*abs(incy)) when trans = 'N' or 'n', and at least (1 + (n - 1)*abs(incy)) otherwise.
    Before entry, the incremented array y must contain the vector y.
incy
    INTEGER. Specifies the increment for the elements of y. The value of incy must not be zero.

Output Parameters

y
    Updated vector y.

Fortran 95 Interface Notes

Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine gbmv interface are the following:
a
    Holds the array a of size (kl+ku+1, n). Contains an m-by-n banded matrix with kl lower diagonals and ku upper diagonals.
x
    Holds the vector with the number of elements rx, where rx = n if trans = 'N', rx = m otherwise.
y
    Holds the vector with the number of elements ry, where ry = m if trans = 'N', ry = n otherwise.
trans
    Must be 'N', 'C', or 'T'. The default value is 'N'.
kl
    If omitted, kl = ku is assumed, that is, the number of lower diagonals equals the number of upper diagonals.
ku
    Restored as ku = lda-kl-1, where lda is the leading dimension of matrix A.
m
    If omitted, m = n is assumed, that is, a square matrix.
alpha
    The default value is 1.
beta
    The default value is 0.

?gemv
Computes a matrix-vector product using a general matrix.

Syntax

Fortran 77:
    call sgemv(trans, m, n, alpha, a, lda, x, incx, beta, y, incy)
    call dgemv(trans, m, n, alpha, a, lda, x, incx, beta, y, incy)
    call cgemv(trans, m, n, alpha, a, lda, x, incx, beta, y, incy)
    call zgemv(trans, m, n, alpha, a, lda, x, incx, beta, y, incy)
    call scgemv(trans, m, n, alpha, a, lda, x, incx, beta, y, incy)
    call dzgemv(trans, m, n, alpha, a, lda, x, incx, beta, y, incy)
Fortran 95:
    call gemv(a, x, y [,alpha][,beta] [,trans])

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description

The ?gemv routines perform a matrix-vector operation defined as
    y := alpha*A*x + beta*y,
or
    y := alpha*A'*x + beta*y,
or
    y := alpha*conjg(A')*x + beta*y,
where:
alpha and beta are scalars,
x and y are vectors,
A is an m-by-n matrix.

Input Parameters

trans
    CHARACTER*1. Specifies the operation:
    if trans = 'N' or 'n', then y := alpha*A*x + beta*y;
    if trans = 'T' or 't', then y := alpha*A'*x + beta*y;
    if trans = 'C' or 'c', then y := alpha*conjg(A')*x + beta*y.
m
    INTEGER. Specifies the number of rows of the matrix A. The value of m must be at least zero.
n
    INTEGER. Specifies the number of columns of the matrix A. The value of n must be at least zero.
alpha
    REAL for sgemv
    DOUBLE PRECISION for dgemv
    COMPLEX for cgemv, scgemv
    DOUBLE COMPLEX for zgemv, dzgemv
    Specifies the scalar alpha.
a
    REAL for sgemv, scgemv
    DOUBLE PRECISION for dgemv, dzgemv
    COMPLEX for cgemv
    DOUBLE COMPLEX for zgemv
    Array, DIMENSION (lda, n).
    Before entry, the leading m-by-n part of the array a must contain the matrix of coefficients.
lda
    INTEGER. Specifies the leading dimension of a as declared in the calling (sub)program. The value of lda must be at least max(1, m).
x
    REAL for sgemv
    DOUBLE PRECISION for dgemv
    COMPLEX for cgemv, scgemv
    DOUBLE COMPLEX for zgemv, dzgemv
    Array, DIMENSION at least (1 + (n - 1)*abs(incx)) when trans = 'N' or 'n' and at least (1 + (m - 1)*abs(incx)) otherwise.
    Before entry, the incremented array x must contain the vector x.
incx
    INTEGER. Specifies the increment for the elements of x. The value of incx must not be zero.
beta
    REAL for sgemv
    DOUBLE PRECISION for dgemv
    COMPLEX for cgemv, scgemv
    DOUBLE COMPLEX for zgemv, dzgemv
    Specifies the scalar beta. When beta is set to zero, then y need not be set on input.
y
    REAL for sgemv
    DOUBLE PRECISION for dgemv
    COMPLEX for cgemv, scgemv
    DOUBLE COMPLEX for zgemv, dzgemv
    Array, DIMENSION at least (1 + (m - 1)*abs(incy)) when trans = 'N' or 'n' and at least (1 + (n - 1)*abs(incy)) otherwise.
    Before entry with non-zero beta, the incremented array y must contain the vector y.
incy
    INTEGER. Specifies the increment for the elements of y. The value of incy must not be zero.

Output Parameters

y
    Updated vector y.

Fortran 95 Interface Notes

Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine gemv interface are the following:
a
    Holds the matrix A of size (m,n).
x
    Holds the vector with the number of elements rx where rx = n if trans = 'N', rx = m otherwise.
y
    Holds the vector with the number of elements ry where ry = m if trans = 'N', ry = n otherwise.
trans
    Must be 'N', 'C', or 'T'. The default value is 'N'.
alpha
    The default value is 1.
beta
    The default value is 0.
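
The ?gemv operation and the Fortran 95 defaults listed above (alpha = 1, beta = 0, trans = 'N') can be illustrated with a small pure-Python reference sketch. This is not the Intel MKL implementation and does not call the library. The function name gemv_ref, the list-of-rows matrix layout, and the omission of lda, incx, and incy are assumptions made to keep the example short.

```python
def gemv_ref(a, x, y, alpha=1.0, beta=0.0, trans="N"):
    """Return alpha*op(A)*x + beta*y, where op(A) is A, A', or conjg(A').

    a is a list of m rows with n entries each (hypothetical layout).
    """
    m = len(a)
    n = len(a[0]) if m else 0
    t = trans.upper()
    if t not in ("N", "T", "C"):
        raise ValueError("trans must be 'N', 'T', or 'C'")
    # x has n elements and y has m elements for 'N'; the roles swap otherwise.
    rows, cols = (m, n) if t == "N" else (n, m)
    if len(x) != cols or len(y) != rows:
        raise ValueError("x or y has the wrong length for this trans value")

    def op(i, j):
        # Element (i, j) of op(A).
        if t == "N":
            return a[i][j]
        if t == "T":
            return a[j][i]
        return complex(a[j][i]).conjugate()

    result = []
    for i in range(rows):
        acc = sum(op(i, j) * x[j] for j in range(cols))
        # With beta equal to zero, y is never read ("y need not be set on input").
        result.append(alpha * acc if beta == 0 else alpha * acc + beta * y[i])
    return result


# 2-by-3 matrix: y := 2*A*x + 1*y
print(gemv_ref([[1, 2, 3], [4, 5, 6]], [1, 1, 1], [10, 20], alpha=2, beta=1))
```

Returning a new list instead of overwriting y is a simplification of this sketch; the routines themselves update y in place.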

?ger
Performs a rank-1 update of a general matrix.

Syntax

Fortran 77:
    call sger(m, n, alpha, x, incx, y, incy, a, lda)
    call dger(m, n, alpha, x, incx, y, incy, a, lda)
Fortran 95:
    call ger(a, x, y [,alpha])

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description

The ?ger routines perform a matrix-vector operation defined as
    A := alpha*x*y' + A,
where:
alpha is a scalar,
x is an m-element vector,
y is an n-element vector,
A is an m-by-n general matrix.

Input Parameters

m
    INTEGER. Specifies the number of rows of the matrix A. The value of m must be at least zero.
n
    INTEGER. Specifies the number of columns of the matrix A. The value of n must be at least zero.
alpha
    REAL for sger
    DOUBLE PRECISION for dger
    Specifies the scalar alpha.
x
    REAL for sger
    DOUBLE PRECISION for dger
    Array, DIMENSION at least (1 + (m - 1)*abs(incx)).
    Before entry, the incremented array x must contain the m-element vector x.
incx
    INTEGER. Specifies the increment for the elements of x. The value of incx must not be zero.
y
    REAL for sger
    DOUBLE PRECISION for dger
    Array, DIMENSION at least (1 + (n - 1)*abs(incy)).
    Before entry, the incremented array y must contain the n-element vector y.
incy
    INTEGER. Specifies the increment for the elements of y. The value of incy must not be zero.
a
    REAL for sger
    DOUBLE PRECISION for dger
    Array, DIMENSION (lda, n).
    Before entry, the leading m-by-n part of the array a must contain the matrix of coefficients.
lda
    INTEGER. Specifies the leading dimension of a as declared in the calling (sub)program. The value of lda must be at least max(1, m).

Output Parameters

a
    Overwritten by the updated matrix.

Fortran 95 Interface Notes

Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine ger interface are the following:
a
    Holds the matrix A of size (m,n).
x
    Holds the vector with the number of elements m.
y
    Holds the vector with the number of elements n.
alpha
    The default value is 1.

?gerc
Performs a rank-1 update (conjugated) of a general matrix.

Syntax

Fortran 77:
    call cgerc(m, n, alpha, x, incx, y, incy, a, lda)
    call zgerc(m, n, alpha, x, incx, y, incy, a, lda)
Fortran 95:
    call gerc(a, x, y [,alpha])

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description

The ?gerc routines perform a matrix-vector operation defined as
    A := alpha*x*conjg(y') + A,
where:
alpha is a scalar,
x is an m-element vector,
y is an n-element vector,
A is an m-by-n matrix.

Input Parameters

m
    INTEGER. Specifies the number of rows of the matrix A. The value of m must be at least zero.
n
    INTEGER. Specifies the number of columns of the matrix A. The value of n must be at least zero.
alpha
    COMPLEX for cgerc
    DOUBLE COMPLEX for zgerc
    Specifies the scalar alpha.
x
    COMPLEX for cgerc
    DOUBLE COMPLEX for zgerc
    Array, DIMENSION at least (1 + (m - 1)*abs(incx)).
    Before entry, the incremented array x must contain the m-element vector x.
incx
    INTEGER. Specifies the increment for the elements of x. The value of incx must not be zero.
y
    COMPLEX for cgerc
    DOUBLE COMPLEX for zgerc
    Array, DIMENSION at least (1 + (n - 1)*abs(incy)).
    Before entry, the incremented array y must contain the n-element vector y.
incy
    INTEGER. Specifies the increment for the elements of y. The value of incy must not be zero.
a
    COMPLEX for cgerc
    DOUBLE COMPLEX for zgerc
    Array, DIMENSION (lda, n).
    Before entry, the leading m-by-n part of the array a must contain the matrix of coefficients.
lda
    INTEGER. Specifies the leading dimension of a as declared in the calling (sub)program. The value of lda must be at least max(1, m).
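
As an illustration of the ?gerc update A := alpha*x*conjg(y') + A with the inputs just listed, the following pure-Python sketch applies the operation element by element. It is a reference sketch only, not the Intel MKL implementation. The name gerc_ref, the list-of-rows matrix layout, and the omission of lda, incx, and incy are assumptions of this example.

```python
def gerc_ref(a, x, y, alpha=1.0):
    """Update a in place so that a[i][j] += alpha * x[i] * conjg(y[j]).

    a is a list of m rows with n entries each (hypothetical layout).
    """
    m, n = len(x), len(y)
    if len(a) != m or any(len(row) != n for row in a):
        raise ValueError("a must have len(x) rows and len(y) columns")
    for j in range(n):
        # Scale the conjugated element of y once per column.
        t = alpha * complex(y[j]).conjugate()
        for i in range(m):
            a[i][j] += x[i] * t
    return a


a = [[0, 0], [0, 0]]
gerc_ref(a, [1j, 2], [1 + 1j, 3])
print(a)
```

Dropping the conjugate() call gives the unconjugated form A := alpha*x*y' + A that the manual lists for ?geru.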

Output Parameters

a
    Overwritten by the updated matrix.

Fortran 95 Interface Notes

Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine gerc interface are the following:
a
    Holds the matrix A of size (m,n).
x
    Holds the vector with the number of elements m.
y
    Holds the vector with the number of elements n.
alpha
    The default value is 1.

?geru
Performs a rank-1 update (unconjugated) of a general matrix.

Syntax

Fortran 77:
    call cgeru(m, n, alpha, x, incx, y, incy, a, lda)
    call zgeru(m, n, alpha, x, incx, y, incy, a, lda)
Fortran 95:
    call geru(a, x, y [,alpha])

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description

The ?geru routines perform a matrix-vector operation defined as
    A := alpha*x*y' + A,
where:
alpha is a scalar,
x is an m-element vector,
y is an n-element vector,
A is an m-by-n matrix.

Input Parameters

m
    INTEGER. Specifies the number of rows of the matrix A. The value of m must be at least zero.
n
    INTEGER. Specifies the number of columns of the matrix A. The value of n must be at least zero.
alpha
    COMPLEX for cgeru
    DOUBLE COMPLEX for zgeru
    Specifies the scalar alpha.
x
    COMPLEX for cgeru
    DOUBLE COMPLEX for zgeru
    Array, DIMENSION at least (1 + (m - 1)*abs(incx)).
    Before entry, the incremented array x must contain the m-element vector x.
incx
    INTEGER. Specifies the increment for the elements of x. The value of incx must not be zero.
y
    COMPLEX for cgeru
    DOUBLE COMPLEX for zgeru
    Array, DIMENSION at least (1 + (n - 1)*abs(incy)).
    Before entry, the incremented array y must contain the n-element vector y.
incy
    INTEGER. Specifies the increment for the elements of y. The value of incy must not be zero.
a
    COMPLEX for cgeru
    DOUBLE COMPLEX for zgeru
    Array, DIMENSION (lda, n).
    Before entry, the leading m-by-n part of the array a must contain the matrix of coefficients.
lda
    INTEGER. Specifies the leading dimension of a as declared in the calling (sub)program. The value of lda must be at least max(1, m).

Output Parameters

a
    Overwritten by the updated matrix.

Fortran 95 Interface Notes

Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine geru interface are the following:
a
    Holds the matrix A of size (m,n).
x
    Holds the vector with the number of elements m.
y
    Holds the vector with the number of elements n.
alpha
    The default value is 1.

?hbmv
Computes a matrix-vector product using a Hermitian band matrix.

Syntax

Fortran 77:
    call chbmv(uplo, n, k, alpha, a, lda, x, incx, beta, y, incy)
    call zhbmv(uplo, n, k, alpha, a, lda, x, incx, beta, y, incy)
Fortran 95:
    call hbmv(a, x, y [,uplo][,alpha] [,beta])

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description

The ?hbmv routines perform a matrix-vector operation defined as
    y := alpha*A*x + beta*y,
where:
alpha and beta are scalars,
x and y are n-element vectors,
A is an n-by-n Hermitian band matrix, with k super-diagonals.

Input Parameters

uplo
    CHARACTER*1. Specifies whether the upper or lower triangular part of the Hermitian band matrix A is used:
    If uplo = 'U' or 'u', then the upper triangular part of the matrix A is used.
    If uplo = 'L' or 'l', then the lower triangular part of the matrix A is used.
n
    INTEGER. Specifies the order of the matrix A. The value of n must be at least zero.
k
    INTEGER. Specifies the number of super-diagonals of the matrix A. The value of k must satisfy 0 ≤ k.
alpha
    COMPLEX for chbmv
    DOUBLE COMPLEX for zhbmv
    Specifies the scalar alpha.
a
    COMPLEX for chbmv
    DOUBLE COMPLEX for zhbmv
    Array, DIMENSION (lda, n).
    Before entry with uplo = 'U' or 'u', the leading (k + 1) by n part of the array a must contain the upper triangular band part of the Hermitian matrix. The matrix must be supplied column-by-column, with the leading diagonal of the matrix in row (k + 1) of the array, the first super-diagonal starting at position 2 in row k, and so on. The top left k by k triangle of the array a is not referenced.
    The following program segment transfers the upper triangular part of a Hermitian band matrix from conventional full matrix storage to band storage:

        do 20, j = 1, n
           m = k + 1 - j
           do 10, i = max(1, j - k), j
              a(m + i, j) = matrix(i, j)
        10 continue
        20 continue

    Before entry with uplo = 'L' or 'l', the leading (k + 1) by n part of the array a must contain the lower triangular band part of the Hermitian matrix, supplied column-by-column, with the leading diagonal of the matrix in row 1 of the array, the first sub-diagonal starting at position 1 in row 2, and so on. The bottom right k by k triangle of the array a is not referenced.
    The following program segment transfers the lower triangular part of a Hermitian band matrix from conventional full matrix storage to band storage:

        do 20, j = 1, n
           m = 1 - j
           do 10, i = j, min( n, j + k )
              a( m + i, j ) = matrix( i, j )
        10 continue
        20 continue

    The imaginary parts of the diagonal elements need not be set and are assumed to be zero.
lda
    INTEGER. Specifies the leading dimension of a as declared in the calling (sub)program. The value of lda must be at least (k + 1).
x
    COMPLEX for chbmv
    DOUBLE COMPLEX for zhbmv
    Array, DIMENSION at least (1 + (n - 1)*abs(incx)).
    Before entry, the incremented array x must contain the vector x.
incx
    INTEGER. Specifies the increment for the elements of x. The value of incx must not be zero.
beta
    COMPLEX for chbmv
    DOUBLE COMPLEX for zhbmv
    Specifies the scalar beta.
y
    COMPLEX for chbmv
    DOUBLE COMPLEX for zhbmv
    Array, DIMENSION at least (1 + (n - 1)*abs(incy)).
    Before entry, the incremented array y must contain the vector y.
incy
    INTEGER. Specifies the increment for the elements of y. The value of incy must not be zero.

Output Parameters

y
    Overwritten by the updated vector y.

Fortran 95 Interface Notes

Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine hbmv interface are the following:
a
    Holds the array a of size (k+1,n).
x
    Holds the vector with the number of elements n.
y
    Holds the vector with the number of elements n.
uplo
    Must be 'U' or 'L'. The default value is 'U'.
alpha
    The default value is 1.
beta
    The default value is 0.

?hemv
Computes a matrix-vector product using a Hermitian matrix.

Syntax

Fortran 77:
    call chemv(uplo, n, alpha, a, lda, x, incx, beta, y, incy)
    call zhemv(uplo, n, alpha, a, lda, x, incx, beta, y, incy)
Fortran 95:
    call hemv(a, x, y [,uplo][,alpha] [,beta])

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description

The ?hemv routines perform a matrix-vector operation defined as
    y := alpha*A*x + beta*y,
where:
alpha and beta are scalars,
x and y are n-element vectors,
A is an n-by-n Hermitian matrix.

Input Parameters

uplo
    CHARACTER*1. Specifies whether the upper or lower triangular part of the array a is used.
    If uplo = 'U' or 'u', then the upper triangular part of the array a is used.
    If uplo = 'L' or 'l', then the lower triangular part of the array a is used.
n
    INTEGER. Specifies the order of the matrix A. The value of n must be at least zero.
alpha
    COMPLEX for chemv
    DOUBLE COMPLEX for zhemv
    Specifies the scalar alpha.
a
    COMPLEX for chemv
    DOUBLE COMPLEX for zhemv
    Array, DIMENSION (lda, n).
    Before entry with uplo = 'U' or 'u', the leading n-by-n upper triangular part of the array a must contain the upper triangular part of the Hermitian matrix and the strictly lower triangular part of a is not referenced.
    Before entry with uplo = 'L' or 'l', the leading n-by-n lower triangular part of the array a must contain the lower triangular part of the Hermitian matrix and the strictly upper triangular part of a is not referenced.
    The imaginary parts of the diagonal elements need not be set and are assumed to be zero.
lda
    INTEGER. Specifies the leading dimension of a as declared in the calling (sub)program. The value of lda must be at least max(1, n).
x
    COMPLEX for chemv
    DOUBLE COMPLEX for zhemv
    Array, DIMENSION at least (1 + (n - 1)*abs(incx)).
    Before entry, the incremented array x must contain the n-element vector x.
incx
    INTEGER. Specifies the increment for the elements of x. The value of incx must not be zero.
beta
    COMPLEX for chemv
    DOUBLE COMPLEX for zhemv
    Specifies the scalar beta. When beta is supplied as zero then y need not be set on input.
y
    COMPLEX for chemv
    DOUBLE COMPLEX for zhemv
    Array, DIMENSION at least (1 + (n - 1)*abs(incy)).
    Before entry, the incremented array y must contain the n-element vector y.
incy
    INTEGER. Specifies the increment for the elements of y. The value of incy must not be zero.

Output Parameters

y
    Overwritten by the updated vector y.

Fortran 95 Interface Notes

Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine hemv interface are the following:
a
    Holds the matrix A of size (n,n).
x
    Holds the vector with the number of elements n.
y
    Holds the vector with the number of elements n.
uplo
    Must be 'U' or 'L'. The default value is 'U'.
alpha
    The default value is 1.
beta
    The default value is 0.

?her
Performs a rank-1 update of a Hermitian matrix.

Syntax

Fortran 77:
    call cher(uplo, n, alpha, x, incx, a, lda)
    call zher(uplo, n, alpha, x, incx, a, lda)
Fortran 95:
    call her(a, x [,uplo] [, alpha])

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description

The ?her routines perform a matrix-vector operation defined as
    A := alpha*x*conjg(x') + A,
where:
alpha is a real scalar,
x is an n-element vector,
A is an n-by-n Hermitian matrix.

Input Parameters

uplo
    CHARACTER*1. Specifies whether the upper or lower triangular part of the array a is used.
    If uplo = 'U' or 'u', then the upper triangular part of the array a is used.
    If uplo = 'L' or 'l', then the lower triangular part of the array a is used.
n
    INTEGER. Specifies the order of the matrix A. The value of n must be at least zero.
alpha
    REAL for cher
    DOUBLE PRECISION for zher
    Specifies the scalar alpha.
x
    COMPLEX for cher
    DOUBLE COMPLEX for zher
    Array, DIMENSION at least (1 + (n - 1)*abs(incx)).
    Before entry, the incremented array x must contain the n-element vector x.
incx
    INTEGER. Specifies the increment for the elements of x. The value of incx must not be zero.
a
    COMPLEX for cher
    DOUBLE COMPLEX for zher
    Array, DIMENSION (lda, n).
    Before entry with uplo = 'U' or 'u', the leading n-by-n upper triangular part of the array a must contain the upper triangular part of the Hermitian matrix and the strictly lower triangular part of a is not referenced.
    Before entry with uplo = 'L' or 'l', the leading n-by-n lower triangular part of the array a must contain the lower triangular part of the Hermitian matrix and the strictly upper triangular part of a is not referenced.
    The imaginary parts of the diagonal elements need not be set and are assumed to be zero.
lda
    INTEGER. Specifies the leading dimension of a as declared in the calling (sub)program. The value of lda must be at least max(1, n).
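
The role of uplo in the ?her update A := alpha*x*conjg(x') + A can be illustrated with a pure-Python sketch that touches only the selected triangle, leaves the other strict triangle unreferenced, and keeps the diagonal real. This is a reference sketch only, not the Intel MKL implementation. The name her_ref, the list-of-rows matrix layout, and the omission of lda and incx are assumptions of this example.

```python
def her_ref(a, x, uplo="U", alpha=1.0):
    """Update one triangle of a in place: a[i][j] += alpha * x[i] * conjg(x[j]).

    a is a list of n rows with n entries each (hypothetical layout);
    alpha is a real scalar.
    """
    n = len(x)
    u = uplo.upper()
    if u not in ("U", "L"):
        raise ValueError("uplo must be 'U' or 'L'")
    if len(a) != n or any(len(row) != n for row in a):
        raise ValueError("a must be n-by-n with n = len(x)")
    for j in range(n):
        t = alpha * complex(x[j]).conjugate()
        # Column j of the upper triangle is rows 0..j; of the lower, rows j..n-1.
        rows = range(0, j + 1) if u == "U" else range(j, n)
        for i in rows:
            a[i][j] += x[i] * t
        # Input imaginary parts on the diagonal are ignored; the result is real.
        a[j][j] = complex(complex(a[j][j]).real, 0.0)
    return a


a = [[1 + 5j, 0], [99, 0]]   # 99 sits in the strictly lower triangle
her_ref(a, [1 + 1j, 2j], uplo="U")
print(a)
```

In the usage lines the value 99 in the strictly lower triangle is left untouched, and the stray imaginary part 5j on the diagonal is discarded.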

Output Parameters

a
    With uplo = 'U' or 'u', the upper triangular part of the array a is overwritten by the upper triangular part of the updated matrix.
    With uplo = 'L' or 'l', the lower triangular part of the array a is overwritten by the lower triangular part of the updated matrix.
    The imaginary parts of the diagonal elements are set to zero.

Fortran 95 Interface Notes

Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine her interface are the following:
a
    Holds the matrix A of size (n,n).
x
    Holds the vector with the number of elements n.
uplo
    Must be 'U' or 'L'. The default value is 'U'.
alpha
    The default value is 1.

?her2
Performs a rank-2 update of a Hermitian matrix.

Syntax

Fortran 77:
    call cher2(uplo, n, alpha, x, incx, y, incy, a, lda)
    call zher2(uplo, n, alpha, x, incx, y, incy, a, lda)
Fortran 95:
    call her2(a, x, y [,uplo][,alpha])

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description

The ?her2 routines perform a matrix-vector operation defined as
    A := alpha*x*conjg(y') + conjg(alpha)*y*conjg(x') + A,
where:
alpha is a scalar,
x and y are n-element vectors,
A is an n-by-n Hermitian matrix.

Input Parameters

uplo
    CHARACTER*1. Specifies whether the upper or lower triangular part of the array a is used.
    If uplo = 'U' or 'u', then the upper triangular part of the array a is used.
    If uplo = 'L' or 'l', then the lower triangular part of the array a is used.
n
    INTEGER. Specifies the order of the matrix A. The value of n must be at least zero.
alpha
    COMPLEX for cher2
    DOUBLE COMPLEX for zher2
    Specifies the scalar alpha.
x
    COMPLEX for cher2
    DOUBLE COMPLEX for zher2
    Array, DIMENSION at least (1 + (n - 1)*abs(incx)).
    Before entry, the incremented array x must contain the n-element vector x.
incx
    INTEGER. Specifies the increment for the elements of x. The value of incx must not be zero.
y
    COMPLEX for cher2
    DOUBLE COMPLEX for zher2
    Array, DIMENSION at least (1 + (n - 1)*abs(incy)).
    Before entry, the incremented array y must contain the n-element vector y.
incy
    INTEGER. Specifies the increment for the elements of y. The value of incy must not be zero.
a
    COMPLEX for cher2
    DOUBLE COMPLEX for zher2
    Array, DIMENSION (lda, n).
    Before entry with uplo = 'U' or 'u', the leading n-by-n upper triangular part of the array a must contain the upper triangular part of the Hermitian matrix and the strictly lower triangular part of a is not referenced.
    Before entry with uplo = 'L' or 'l', the leading n-by-n lower triangular part of the array a must contain the lower triangular part of the Hermitian matrix and the strictly upper triangular part of a is not referenced.
    The imaginary parts of the diagonal elements need not be set and are assumed to be zero.
lda
    INTEGER. Specifies the leading dimension of a as declared in the calling (sub)program. The value of lda must be at least max(1, n).

Output Parameters

a
    With uplo = 'U' or 'u', the upper triangular part of the array a is overwritten by the upper triangular part of the updated matrix.
    With uplo = 'L' or 'l', the lower triangular part of the array a is overwritten by the lower triangular part of the updated matrix.
    The imaginary parts of the diagonal elements are set to zero.

Fortran 95 Interface Notes

Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine her2 interface are the following:
a
    Holds the matrix A of size (n,n).
x
    Holds the vector with the number of elements n.
y
    Holds the vector with the number of elements n.
uplo
    Must be 'U' or 'L'. The default value is 'U'.
alpha
    The default value is 1.

?hpmv
Computes a matrix-vector product using a Hermitian packed matrix.

Syntax

Fortran 77:
    call chpmv(uplo, n, alpha, ap, x, incx, beta, y, incy)
    call zhpmv(uplo, n, alpha, ap, x, incx, beta, y, incy)
Fortran 95:
    call hpmv(ap, x, y [,uplo][,alpha] [,beta])

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description

The ?hpmv routines perform a matrix-vector operation defined as
    y := alpha*A*x + beta*y,
where:
alpha and beta are scalars,
x and y are n-element vectors,
A is an n-by-n Hermitian matrix, supplied in packed form.

Input Parameters

uplo
    CHARACTER*1. Specifies whether the upper or lower triangular part of the matrix A is supplied in the packed array ap.
    If uplo = 'U' or 'u', then the upper triangular part of the matrix A is supplied in the packed array ap.
    If uplo = 'L' or 'l', then the lower triangular part of the matrix A is supplied in the packed array ap.
n
    INTEGER. Specifies the order of the matrix A. The value of n must be at least zero.
alpha
    COMPLEX for chpmv
    DOUBLE COMPLEX for zhpmv
    Specifies the scalar alpha.
ap
    COMPLEX for chpmv
    DOUBLE COMPLEX for zhpmv
    Array, DIMENSION at least ((n*(n + 1))/2).
    Before entry with uplo = 'U' or 'u', the array ap must contain the upper triangular part of the Hermitian matrix packed sequentially, column-by-column, so that ap(1) contains a(1, 1), ap(2) and ap(3) contain a(1, 2) and a(2, 2) respectively, and so on.
    Before entry with uplo = 'L' or 'l', the array ap must contain the lower triangular part of the Hermitian matrix packed sequentially, column-by-column, so that ap(1) contains a(1, 1), ap(2) and ap(3) contain a(2, 1) and a(3, 1) respectively, and so on.
    The imaginary parts of the diagonal elements need not be set and are assumed to be zero.
x
    COMPLEX for chpmv
    DOUBLE COMPLEX for zhpmv
    Array, DIMENSION at least (1 + (n - 1)*abs(incx)).
    Before entry, the incremented array x must contain the n-element vector x.
incx
    INTEGER. Specifies the increment for the elements of x. The value of incx must not be zero.
beta
    COMPLEX for chpmv
    DOUBLE COMPLEX for zhpmv
    Specifies the scalar beta. When beta is equal to zero then y need not be set on input.
y
    COMPLEX for chpmv
    DOUBLE COMPLEX for zhpmv
    Array, DIMENSION at least (1 + (n - 1)*abs(incy)).
    Before entry, the incremented array y must contain the n-element vector y.
incy
    INTEGER. Specifies the increment for the elements of y. The value of incy must not be zero.

Output Parameters

y
    Overwritten by the updated vector y.

Fortran 95 Interface Notes

Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine hpmv interface are the following:
ap
    Holds the array ap of size (n*(n+1)/2).
x
    Holds the vector with the number of elements n.
y
    Holds the vector with the number of elements n.
uplo
    Must be 'U' or 'L'. The default value is 'U'.
alpha
    The default value is 1.
beta
    The default value is 0.

?hpr
Performs a rank-1 update of a Hermitian packed matrix.

Syntax

Fortran 77:
    call chpr(uplo, n, alpha, x, incx, ap)
    call zhpr(uplo, n, alpha, x, incx, ap)
Fortran 95:
    call hpr(ap, x [,uplo] [, alpha])

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description

The ?hpr routines perform a matrix-vector operation defined as
    A := alpha*x*conjg(x') + A,
where:
alpha is a real scalar,
x is an n-element vector,
A is an n-by-n Hermitian matrix, supplied in packed form.

Input Parameters

uplo
    CHARACTER*1.
    Specifies whether the upper or lower triangular part of the matrix A is supplied in the packed array ap.
    If uplo = 'U' or 'u', the upper triangular part of the matrix A is supplied in the packed array ap.
    If uplo = 'L' or 'l', the lower triangular part of the matrix A is supplied in the packed array ap.
n
    INTEGER. Specifies the order of the matrix A. The value of n must be at least zero.
alpha
    REAL for chpr
    DOUBLE PRECISION for zhpr
    Specifies the scalar alpha.
x
    COMPLEX for chpr
    DOUBLE COMPLEX for zhpr
    Array, DIMENSION at least (1 + (n - 1)*abs(incx)).
    Before entry, the incremented array x must contain the n-element vector x.
incx
    INTEGER. Specifies the increment for the elements of x. The value of incx must not be zero.
ap
    COMPLEX for chpr
    DOUBLE COMPLEX for zhpr
    Array, DIMENSION at least ((n*(n + 1))/2).
    Before entry with uplo = 'U' or 'u', the array ap must contain the upper triangular part of the Hermitian matrix packed sequentially, column-by-column, so that ap(1) contains a(1, 1), ap(2) and ap(3) contain a(1, 2) and a(2, 2) respectively, and so on.
    Before entry with uplo = 'L' or 'l', the array ap must contain the lower triangular part of the Hermitian matrix packed sequentially, column-by-column, so that ap(1) contains a(1, 1), ap(2) and ap(3) contain a(2, 1) and a(3, 1) respectively, and so on.
    The imaginary parts of the diagonal elements need not be set and are assumed to be zero.

Output Parameters

ap
    With uplo = 'U' or 'u', overwritten by the upper triangular part of the updated matrix.
    With uplo = 'L' or 'l', overwritten by the lower triangular part of the updated matrix.
    The imaginary parts of the diagonal elements are set to zero.

Fortran 95 Interface Notes

Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine hpr interface are the following:
ap
    Holds the array ap of size (n*(n+1)/2).
x
    Holds the vector with the number of elements n.
uplo
    Must be 'U' or 'L'. The default value is 'U'.
alpha
    The default value is 1.

?hpr2
Performs a rank-2 update of a Hermitian packed matrix.

Syntax
Fortran 77:
    call chpr2(uplo, n, alpha, x, incx, y, incy, ap)
    call zhpr2(uplo, n, alpha, x, incx, y, incy, ap)
Fortran 95:
    call hpr2(ap, x, y [,uplo][,alpha])

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description
The ?hpr2 routines perform a matrix-vector operation defined as
    A := alpha*x*conjg(y') + conjg(alpha)*y*conjg(x') + A,
where:
alpha is a scalar,
x and y are n-element vectors,
A is an n-by-n Hermitian matrix, supplied in packed form.

Input Parameters
uplo
    CHARACTER*1. Specifies whether the upper or lower triangular part of the matrix A is supplied in the packed array ap.
    If uplo = 'U' or 'u', then the upper triangular part of the matrix A is supplied in the packed array ap.
    If uplo = 'L' or 'l', then the lower triangular part of the matrix A is supplied in the packed array ap.
n
    INTEGER. Specifies the order of the matrix A. The value of n must be at least zero.
alpha
    COMPLEX for chpr2
    DOUBLE COMPLEX for zhpr2
    Specifies the scalar alpha.
x
    COMPLEX for chpr2
    DOUBLE COMPLEX for zhpr2
    Array, DIMENSION at least (1 + (n - 1)*abs(incx)). Before entry, the incremented array x must contain the n-element vector x.
incx
    INTEGER. Specifies the increment for the elements of x. The value of incx must not be zero.
y
    COMPLEX for chpr2
    DOUBLE COMPLEX for zhpr2
    Array, DIMENSION at least (1 + (n - 1)*abs(incy)). Before entry, the incremented array y must contain the n-element vector y.
incy
    INTEGER. Specifies the increment for the elements of y. The value of incy must not be zero.
ap
    COMPLEX for chpr2
    DOUBLE COMPLEX for zhpr2
    Array, DIMENSION at least ((n*(n + 1))/2).
    Before entry with uplo = 'U' or 'u', the array ap must contain the upper triangular part of the Hermitian matrix packed sequentially, column-by-column, so that ap(1) contains a(1,1), ap(2) and ap(3) contain a(1,2) and a(2,2) respectively, and so on.
    Before entry with uplo = 'L' or 'l', the array ap must contain the lower triangular part of the Hermitian matrix packed sequentially, column-by-column, so that ap(1) contains a(1,1), ap(2) and ap(3) contain a(2,1) and a(3,1) respectively, and so on.
    The imaginary parts of the diagonal elements need not be set and are assumed to be zero.

Output Parameters
ap
    With uplo = 'U' or 'u', overwritten by the upper triangular part of the updated matrix.
    With uplo = 'L' or 'l', overwritten by the lower triangular part of the updated matrix.
    The imaginary parts of the diagonal elements are set to zero.

Fortran 95 Interface Notes
Routines in the Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine hpr2 interface are the following:
ap
    Holds the array ap of size (n*(n+1)/2).
x
    Holds the vector with the number of elements n.
y
    Holds the vector with the number of elements n.
uplo
    Must be 'U' or 'L'. The default value is 'U'.
alpha
    The default value is 1.

?sbmv
Computes a matrix-vector product using a symmetric band matrix.
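As an informal illustration of what this routine computes (this is not Intel MKL code), the plain-Python sketch below forms alpha*A*x + beta*y directly from the upper band storage layout described under the a parameter. The function name sbmv_upper_sketch, the 0-based list-of-lists representation, and the unit increments are assumptions made for this example only.

```python
def sbmv_upper_sketch(n, k, alpha, a, x, beta, y):
    """Return alpha*A*x + beta*y for a symmetric band matrix A.

    Assumption: `a` has k+1 rows and n columns (0-based lists). The element
    A[i][j] of the upper triangle, with max(0, j-k) <= i <= j, is stored at
    a[k + i - j][j], which is the 0-based form of a(k+1-j+i, j).
    """
    # When beta is zero, y is not read, matching "y need not be set on input".
    out = [0.0] * n if beta == 0 else [beta * yi for yi in y]
    for j in range(n):
        for i in range(max(0, j - k), j + 1):
            aij = a[k + i - j][j]
            out[i] += alpha * aij * x[j]
            if i != j:
                # Symmetry: A[j][i] equals A[i][j] and is not stored separately.
                out[j] += alpha * aij * x[i]
    return out


# A = [[2, 1, 0], [1, 3, 4], [0, 4, 5]] with k = 1 super-diagonal.
band = [
    [None, 1, 4],  # super-diagonal; the top-left entry is never referenced
    [2, 3, 5],     # main diagonal, stored in row k
]
result = sbmv_upper_sketch(3, 1, 1, band, [1, 2, 3], 0, [0, 0, 0])
```

The sketch visits only the stored upper band and applies each off-diagonal element twice, which is why the lower triangle never has to be supplied.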
Syntax
Fortran 77:
    call ssbmv(uplo, n, k, alpha, a, lda, x, incx, beta, y, incy)
    call dsbmv(uplo, n, k, alpha, a, lda, x, incx, beta, y, incy)
Fortran 95:
    call sbmv(a, x, y [,uplo][,alpha] [,beta])

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description
The ?sbmv routines perform a matrix-vector operation defined as
    y := alpha*A*x + beta*y,
where:
alpha and beta are scalars,
x and y are n-element vectors,
A is an n-by-n symmetric band matrix, with k super-diagonals.

Input Parameters
uplo
    CHARACTER*1. Specifies whether the upper or lower triangular part of the band matrix A is used:
    if uplo = 'U' or 'u', the upper triangular part;
    if uplo = 'L' or 'l', the lower triangular part.
n
    INTEGER. Specifies the order of the matrix A. The value of n must be at least zero.
k
    INTEGER. Specifies the number of super-diagonals of the matrix A. The value of k must satisfy 0 ≤ k.
alpha
    REAL for ssbmv
    DOUBLE PRECISION for dsbmv
    Specifies the scalar alpha.
a
    REAL for ssbmv
    DOUBLE PRECISION for dsbmv
    Array, DIMENSION (lda, n).
    Before entry with uplo = 'U' or 'u', the leading (k + 1) by n part of the array a must contain the upper triangular band part of the symmetric matrix, supplied column-by-column, with the leading diagonal of the matrix in row (k + 1) of the array, the first super-diagonal starting at position 2 in row k, and so on. The top left k by k triangle of the array a is not referenced.
    The following program segment transfers the upper triangular part of a symmetric band matrix from conventional full matrix storage to band storage:

      do 20, j = 1, n
         m = k + 1 - j
         do 10, i = max( 1, j - k ), j
            a( m + i, j ) = matrix( i, j )
   10    continue
   20 continue

    Before entry with uplo = 'L' or 'l', the leading (k + 1) by n part of the array a must contain the lower triangular band part of the symmetric matrix, supplied column-by-column, with the leading diagonal of the matrix in row 1 of the array, the first sub-diagonal starting at position 1 in row 2, and so on. The bottom right k by k triangle of the array a is not referenced.
    The following program segment transfers the lower triangular part of a symmetric band matrix from conventional full matrix storage to band storage:

      do 20, j = 1, n
         m = 1 - j
         do 10, i = j, min( n, j + k )
            a( m + i, j ) = matrix( i, j )
   10    continue
   20 continue

lda
    INTEGER. Specifies the leading dimension of a as declared in the calling (sub)program. The value of lda must be at least (k + 1).
x
    REAL for ssbmv
    DOUBLE PRECISION for dsbmv
    Array, DIMENSION at least (1 + (n - 1)*abs(incx)). Before entry, the incremented array x must contain the vector x.
incx
    INTEGER. Specifies the increment for the elements of x. The value of incx must not be zero.
beta
    REAL for ssbmv
    DOUBLE PRECISION for dsbmv
    Specifies the scalar beta.
y
    REAL for ssbmv
    DOUBLE PRECISION for dsbmv
    Array, DIMENSION at least (1 + (n - 1)*abs(incy)). Before entry, the incremented array y must contain the vector y.
incy
    INTEGER. Specifies the increment for the elements of y. The value of incy must not be zero.

Output Parameters
y
    Overwritten by the updated vector y.

Fortran 95 Interface Notes
Routines in the Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine sbmv interface are the following:
a
    Holds the array a of size (k+1,n).
x
    Holds the vector with the number of elements n.
y
    Holds the vector with the number of elements n.
uplo
    Must be 'U' or 'L'. The default value is 'U'.
alpha
    The default value is 1.
beta
    The default value is 0.

?spmv
Computes a matrix-vector product using a symmetric packed matrix.

Syntax
Fortran 77:
    call sspmv(uplo, n, alpha, ap, x, incx, beta, y, incy)
    call dspmv(uplo, n, alpha, ap, x, incx, beta, y, incy)
Fortran 95:
    call spmv(ap, x, y [,uplo][,alpha] [,beta])

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description
The ?spmv routines perform a matrix-vector operation defined as
    y := alpha*A*x + beta*y,
where:
alpha and beta are scalars,
x and y are n-element vectors,
A is an n-by-n symmetric matrix, supplied in packed form.

Input Parameters
uplo
    CHARACTER*1. Specifies whether the upper or lower triangular part of the matrix A is supplied in the packed array ap.
    If uplo = 'U' or 'u', then the upper triangular part of the matrix A is supplied in the packed array ap.
    If uplo = 'L' or 'l', then the lower triangular part of the matrix A is supplied in the packed array ap.
n
    INTEGER. Specifies the order of the matrix A. The value of n must be at least zero.
alpha
    REAL for sspmv
    DOUBLE PRECISION for dspmv
    Specifies the scalar alpha.
ap
    REAL for sspmv
    DOUBLE PRECISION for dspmv
    Array, DIMENSION at least ((n*(n + 1))/2).
    Before entry with uplo = 'U' or 'u', the array ap must contain the upper triangular part of the symmetric matrix packed sequentially, column-by-column, so that ap(1) contains a(1,1), ap(2) and ap(3) contain a(1,2) and a(2,2) respectively, and so on.
    Before entry with uplo = 'L' or 'l', the array ap must contain the lower triangular part of the symmetric matrix packed sequentially, column-by-column, so that ap(1) contains a(1,1), ap(2) and ap(3) contain a(2,1) and a(3,1) respectively, and so on.
x
    REAL for sspmv
    DOUBLE PRECISION for dspmv
    Array, DIMENSION at least (1 + (n - 1)*abs(incx)). Before entry, the incremented array x must contain the n-element vector x.
incx
    INTEGER. Specifies the increment for the elements of x. The value of incx must not be zero.
beta
    REAL for sspmv
    DOUBLE PRECISION for dspmv
    Specifies the scalar beta. When beta is supplied as zero, y need not be set on input.
y
    REAL for sspmv
    DOUBLE PRECISION for dspmv
    Array, DIMENSION at least (1 + (n - 1)*abs(incy)). Before entry, the incremented array y must contain the n-element vector y.
incy
    INTEGER. Specifies the increment for the elements of y. The value of incy must not be zero.

Output Parameters
y
    Overwritten by the updated vector y.

Fortran 95 Interface Notes
Routines in the Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine spmv interface are the following:
ap
    Holds the array ap of size (n*(n+1)/2).
x
    Holds the vector with the number of elements n.
y
    Holds the vector with the number of elements n.
uplo
    Must be 'U' or 'L'. The default value is 'U'.
alpha
    The default value is 1.
beta
    The default value is 0.

?spr
Performs a rank-1 update of a symmetric packed matrix.
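To make the packed layout and the rank-1 update concrete, here is a minimal plain-Python sketch (not Intel MKL code) of A := alpha*x*x' + A applied to an upper triangle packed by columns. The names packed_index_upper and spr_upper_sketch, the 0-based indexing, and the unit increment are assumptions introduced for illustration.

```python
def packed_index_upper(i, j):
    """0-based position in ap of A[i][j] (i <= j), upper triangle packed by columns.

    In 1-based terms this reproduces ap(1) = a(1,1), ap(2) = a(1,2), ap(3) = a(2,2).
    """
    return i + j * (j + 1) // 2


def spr_upper_sketch(n, alpha, x, ap):
    """Update the packed upper triangle ap in place: A := alpha*x*x' + A."""
    for j in range(n):
        for i in range(j + 1):
            ap[packed_index_upper(i, j)] += alpha * x[i] * x[j]
    return ap


# A = [[1, 2], [2, 3]] packed as [a(1,1), a(1,2), a(2,2)].
ap = [1, 2, 3]
spr_upper_sketch(2, 1, [1, 2], ap)
```

Only the n*(n+1)/2 stored elements are touched, so the updated matrix stays symmetric without the other triangle being written.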
Syntax
Fortran 77:
    call sspr(uplo, n, alpha, x, incx, ap)
    call dspr(uplo, n, alpha, x, incx, ap)
Fortran 95:
    call spr(ap, x [,uplo] [, alpha])

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description
The ?spr routines perform a matrix-vector operation defined as
    A := alpha*x*x' + A,
where:
alpha is a real scalar,
x is an n-element vector,
A is an n-by-n symmetric matrix, supplied in packed form.

Input Parameters
uplo
    CHARACTER*1. Specifies whether the upper or lower triangular part of the matrix A is supplied in the packed array ap.
    If uplo = 'U' or 'u', then the upper triangular part of the matrix A is supplied in the packed array ap.
    If uplo = 'L' or 'l', then the lower triangular part of the matrix A is supplied in the packed array ap.
n
    INTEGER. Specifies the order of the matrix A. The value of n must be at least zero.
alpha
    REAL for sspr
    DOUBLE PRECISION for dspr
    Specifies the scalar alpha.
x
    REAL for sspr
    DOUBLE PRECISION for dspr
    Array, DIMENSION at least (1 + (n - 1)*abs(incx)). Before entry, the incremented array x must contain the n-element vector x.
incx
    INTEGER. Specifies the increment for the elements of x. The value of incx must not be zero.
ap
    REAL for sspr
    DOUBLE PRECISION for dspr
    Array, DIMENSION at least ((n*(n + 1))/2).
    Before entry with uplo = 'U' or 'u', the array ap must contain the upper triangular part of the symmetric matrix packed sequentially, column-by-column, so that ap(1) contains a(1,1), ap(2) and ap(3) contain a(1,2) and a(2,2) respectively, and so on.
    Before entry with uplo = 'L' or 'l', the array ap must contain the lower triangular part of the symmetric matrix packed sequentially, column-by-column, so that ap(1) contains a(1,1), ap(2) and ap(3) contain a(2,1) and a(3,1) respectively, and so on.

Output Parameters
ap
    With uplo = 'U' or 'u', overwritten by the upper triangular part of the updated matrix.
    With uplo = 'L' or 'l', overwritten by the lower triangular part of the updated matrix.

Fortran 95 Interface Notes
Routines in the Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine spr interface are the following:
ap
    Holds the array ap of size (n*(n+1)/2).
x
    Holds the vector with the number of elements n.
uplo
    Must be 'U' or 'L'. The default value is 'U'.
alpha
    The default value is 1.

?spr2
Performs a rank-2 update of a symmetric packed matrix.

Syntax
Fortran 77:
    call sspr2(uplo, n, alpha, x, incx, y, incy, ap)
    call dspr2(uplo, n, alpha, x, incx, y, incy, ap)
Fortran 95:
    call spr2(ap, x, y [,uplo][,alpha])

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description
The ?spr2 routines perform a matrix-vector operation defined as
    A := alpha*x*y' + alpha*y*x' + A,
where:
alpha is a scalar,
x and y are n-element vectors,
A is an n-by-n symmetric matrix, supplied in packed form.

Input Parameters
uplo
    CHARACTER*1. Specifies whether the upper or lower triangular part of the matrix A is supplied in the packed array ap.
    If uplo = 'U' or 'u', then the upper triangular part of the matrix A is supplied in the packed array ap.
    If uplo = 'L' or 'l', then the lower triangular part of the matrix A is supplied in the packed array ap.
n
    INTEGER. Specifies the order of the matrix A. The value of n must be at least zero.
alpha
    REAL for sspr2
    DOUBLE PRECISION for dspr2
    Specifies the scalar alpha.
x
    REAL for sspr2
    DOUBLE PRECISION for dspr2
    Array, DIMENSION at least (1 + (n - 1)*abs(incx)). Before entry, the incremented array x must contain the n-element vector x.
incx
    INTEGER. Specifies the increment for the elements of x. The value of incx must not be zero.
y
    REAL for sspr2
    DOUBLE PRECISION for dspr2
    Array, DIMENSION at least (1 + (n - 1)*abs(incy)). Before entry, the incremented array y must contain the n-element vector y.
incy
    INTEGER. Specifies the increment for the elements of y. The value of incy must not be zero.
ap
    REAL for sspr2
    DOUBLE PRECISION for dspr2
    Array, DIMENSION at least ((n*(n + 1))/2).
    Before entry with uplo = 'U' or 'u', the array ap must contain the upper triangular part of the symmetric matrix packed sequentially, column-by-column, so that ap(1) contains a(1,1), ap(2) and ap(3) contain a(1,2) and a(2,2) respectively, and so on.
    Before entry with uplo = 'L' or 'l', the array ap must contain the lower triangular part of the symmetric matrix packed sequentially, column-by-column, so that ap(1) contains a(1,1), ap(2) and ap(3) contain a(2,1) and a(3,1) respectively, and so on.

Output Parameters
ap
    With uplo = 'U' or 'u', overwritten by the upper triangular part of the updated matrix.
    With uplo = 'L' or 'l', overwritten by the lower triangular part of the updated matrix.

Fortran 95 Interface Notes
Routines in the Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine spr2 interface are the following:
ap
    Holds the array ap of size (n*(n+1)/2).
x
    Holds the vector with the number of elements n.
y
    Holds the vector with the number of elements n.
uplo
    Must be 'U' or 'L'. The default value is 'U'.
alpha
    The default value is 1.

?symv
Computes a matrix-vector product for a symmetric matrix.
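The sketch below is a hedged, plain-Python illustration (not Intel MKL code) of the key property of this routine: only the triangle selected by uplo is read, and the other triangle of the array may hold anything. The function name symv_sketch, the 0-based list-of-lists matrix, and the unit increments are assumptions made for the example.

```python
def symv_sketch(n, alpha, a, x, beta, y, uplo="U"):
    """Return alpha*A*x + beta*y, reading only the uplo triangle of a."""
    # When beta is zero, y is not read.
    out = [0.0] * n if beta == 0 else [beta * yi for yi in y]
    for i in range(n):
        for j in range(n):
            if uplo in ("U", "u"):
                # Mirror any strictly-lower access into the upper triangle.
                aij = a[i][j] if i <= j else a[j][i]
            else:
                # Mirror any strictly-upper access into the lower triangle.
                aij = a[i][j] if i >= j else a[j][i]
            out[i] += alpha * aij * x[j]
    return out


# Symmetric A = [[1, 2], [2, 3]]; the unused triangle holds junk values.
upper_only = [[1, 2], [99, 3]]
lower_only = [[1, -7], [2, 3]]
res_u = symv_sketch(2, 1, upper_only, [1, 1], 0, [0, 0], uplo="U")
res_l = symv_sketch(2, 1, lower_only, [1, 1], 0, [0, 0], uplo="L")
```

Both calls give the same product because the junk entries in the unreferenced triangle are never read.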
Syntax
Fortran 77:
    call ssymv(uplo, n, alpha, a, lda, x, incx, beta, y, incy)
    call dsymv(uplo, n, alpha, a, lda, x, incx, beta, y, incy)
Fortran 95:
    call symv(a, x, y [,uplo][,alpha] [,beta])

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description
The ?symv routines perform a matrix-vector operation defined as
    y := alpha*A*x + beta*y,
where:
alpha and beta are scalars,
x and y are n-element vectors,
A is an n-by-n symmetric matrix.

Input Parameters
uplo
    CHARACTER*1. Specifies whether the upper or lower triangular part of the array a is used.
    If uplo = 'U' or 'u', then the upper triangular part of the array a is used.
    If uplo = 'L' or 'l', then the lower triangular part of the array a is used.
n
    INTEGER. Specifies the order of the matrix A. The value of n must be at least zero.
alpha
    REAL for ssymv
    DOUBLE PRECISION for dsymv
    Specifies the scalar alpha.
a
    REAL for ssymv
    DOUBLE PRECISION for dsymv
    Array, DIMENSION (lda, n).
    Before entry with uplo = 'U' or 'u', the leading n-by-n upper triangular part of the array a must contain the upper triangular part of the symmetric matrix A and the strictly lower triangular part of a is not referenced.
    Before entry with uplo = 'L' or 'l', the leading n-by-n lower triangular part of the array a must contain the lower triangular part of the symmetric matrix A and the strictly upper triangular part of a is not referenced.
lda
    INTEGER. Specifies the leading dimension of a as declared in the calling (sub)program. The value of lda must be at least max(1, n).
x
    REAL for ssymv
    DOUBLE PRECISION for dsymv
    Array, DIMENSION at least (1 + (n - 1)*abs(incx)). Before entry, the incremented array x must contain the n-element vector x.
incx
    INTEGER. Specifies the increment for the elements of x. The value of incx must not be zero.
beta
    REAL for ssymv
    DOUBLE PRECISION for dsymv
    Specifies the scalar beta.
    When beta is supplied as zero, y need not be set on input.
y
    REAL for ssymv
    DOUBLE PRECISION for dsymv
    Array, DIMENSION at least (1 + (n - 1)*abs(incy)). Before entry, the incremented array y must contain the n-element vector y.
incy
    INTEGER. Specifies the increment for the elements of y. The value of incy must not be zero.

Output Parameters
y
    Overwritten by the updated vector y.

Fortran 95 Interface Notes
Routines in the Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine symv interface are the following:
a
    Holds the matrix A of size (n,n).
x
    Holds the vector with the number of elements n.
y
    Holds the vector with the number of elements n.
uplo
    Must be 'U' or 'L'. The default value is 'U'.
alpha
    The default value is 1.
beta
    The default value is 0.

?syr
Performs a rank-1 update of a symmetric matrix.

Syntax
Fortran 77:
    call ssyr(uplo, n, alpha, x, incx, a, lda)
    call dsyr(uplo, n, alpha, x, incx, a, lda)
Fortran 95:
    call syr(a, x [,uplo] [, alpha])

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description
The ?syr routines perform a matrix-vector operation defined as
    A := alpha*x*x' + A,
where:
alpha is a real scalar,
x is an n-element vector,
A is an n-by-n symmetric matrix.

Input Parameters
uplo
    CHARACTER*1. Specifies whether the upper or lower triangular part of the array a is used.
    If uplo = 'U' or 'u', then the upper triangular part of the array a is used.
    If uplo = 'L' or 'l', then the lower triangular part of the array a is used.
n
    INTEGER. Specifies the order of the matrix A. The value of n must be at least zero.
alpha
    REAL for ssyr
    DOUBLE PRECISION for dsyr
    Specifies the scalar alpha.
x
    REAL for ssyr
    DOUBLE PRECISION for dsyr
    Array, DIMENSION at least (1 + (n - 1)*abs(incx)).
    Before entry, the incremented array x must contain the n-element vector x.
incx
    INTEGER. Specifies the increment for the elements of x. The value of incx must not be zero.
a
    REAL for ssyr
    DOUBLE PRECISION for dsyr
    Array, DIMENSION (lda, n).
    Before entry with uplo = 'U' or 'u', the leading n-by-n upper triangular part of the array a must contain the upper triangular part of the symmetric matrix A and the strictly lower triangular part of a is not referenced.
    Before entry with uplo = 'L' or 'l', the leading n-by-n lower triangular part of the array a must contain the lower triangular part of the symmetric matrix A and the strictly upper triangular part of a is not referenced.
lda
    INTEGER. Specifies the leading dimension of a as declared in the calling (sub)program. The value of lda must be at least max(1, n).

Output Parameters
a
    With uplo = 'U' or 'u', the upper triangular part of the array a is overwritten by the upper triangular part of the updated matrix.
    With uplo = 'L' or 'l', the lower triangular part of the array a is overwritten by the lower triangular part of the updated matrix.

Fortran 95 Interface Notes
Routines in the Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine syr interface are the following:
a
    Holds the matrix A of size (n,n).
x
    Holds the vector with the number of elements n.
uplo
    Must be 'U' or 'L'. The default value is 'U'.
alpha
    The default value is 1.

?syr2
Performs a rank-2 update of a symmetric matrix.
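As a rough illustration of a symmetric rank-2 update in full storage (not Intel MKL code), the plain-Python sketch below applies A := alpha*x*y' + alpha*y*x' + A to the triangle selected by uplo and leaves the other triangle untouched. The function name syr2_sketch, the 0-based list-of-lists matrix, and the unit increments are assumptions for this example.

```python
def syr2_sketch(n, alpha, x, y, a, uplo="U"):
    """Update the uplo triangle of a in place: A := alpha*(x*y' + y*x') + A."""
    for j in range(n):
        # Rows 0..j lie in the upper triangle of column j, rows j..n-1 in the lower.
        rows = range(j + 1) if uplo in ("U", "u") else range(j, n)
        for i in rows:
            a[i][j] += alpha * (x[i] * y[j] + y[i] * x[j])
    return a


# Identity matrix stored in the upper triangle; a[1][0] is a junk value that
# must survive the update because the strictly lower triangle is not referenced.
mat = [[1, 0], [-5, 1]]
syr2_sketch(2, 1, [1, 2], [3, 4], mat, uplo="U")
```

The update term x*y' + y*x' is symmetric by construction, which is why updating a single triangle is enough to describe the whole updated matrix.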
Syntax
Fortran 77:
    call ssyr2(uplo, n, alpha, x, incx, y, incy, a, lda)
    call dsyr2(uplo, n, alpha, x, incx, y, incy, a, lda)
Fortran 95:
    call syr2(a, x, y [,uplo][,alpha])

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description
The ?syr2 routines perform a matrix-vector operation defined as
    A := alpha*x*y' + alpha*y*x' + A,
where:
alpha is a scalar,
x and y are n-element vectors,
A is an n-by-n symmetric matrix.

Input Parameters
uplo
    CHARACTER*1. Specifies whether the upper or lower triangular part of the array a is used.
    If uplo = 'U' or 'u', then the upper triangular part of the array a is used.
    If uplo = 'L' or 'l', then the lower triangular part of the array a is used.
n
    INTEGER. Specifies the order of the matrix A. The value of n must be at least zero.
alpha
    REAL for ssyr2
    DOUBLE PRECISION for dsyr2
    Specifies the scalar alpha.
x
    REAL for ssyr2
    DOUBLE PRECISION for dsyr2
    Array, DIMENSION at least (1 + (n - 1)*abs(incx)). Before entry, the incremented array x must contain the n-element vector x.
incx
    INTEGER. Specifies the increment for the elements of x. The value of incx must not be zero.
y
    REAL for ssyr2
    DOUBLE PRECISION for dsyr2
    Array, DIMENSION at least (1 + (n - 1)*abs(incy)). Before entry, the incremented array y must contain the n-element vector y.
incy
    INTEGER. Specifies the increment for the elements of y. The value of incy must not be zero.
a
    REAL for ssyr2
    DOUBLE PRECISION for dsyr2
    Array, DIMENSION (lda, n).
    Before entry with uplo = 'U' or 'u', the leading n-by-n upper triangular part of the array a must contain the upper triangular part of the symmetric matrix and the strictly lower triangular part of a is not referenced.
    Before entry with uplo = 'L' or 'l', the leading n-by-n lower triangular part of the array a must contain the lower triangular part of the symmetric matrix and the strictly upper triangular part of a is not referenced.
lda
    INTEGER.
    Specifies the leading dimension of a as declared in the calling (sub)program. The value of lda must be at least max(1, n).

Output Parameters
a
    With uplo = 'U' or 'u', the upper triangular part of the array a is overwritten by the upper triangular part of the updated matrix.
    With uplo = 'L' or 'l', the lower triangular part of the array a is overwritten by the lower triangular part of the updated matrix.

Fortran 95 Interface Notes
Routines in the Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine syr2 interface are the following:
a
    Holds the matrix A of size (n,n).
x
    Holds the vector x of length n.
y
    Holds the vector y of length n.
uplo
    Must be 'U' or 'L'. The default value is 'U'.
alpha
    The default value is 1.

?tbmv
Computes a matrix-vector product using a triangular band matrix.

Syntax
Fortran 77:
    call stbmv(uplo, trans, diag, n, k, a, lda, x, incx)
    call dtbmv(uplo, trans, diag, n, k, a, lda, x, incx)
    call ctbmv(uplo, trans, diag, n, k, a, lda, x, incx)
    call ztbmv(uplo, trans, diag, n, k, a, lda, x, incx)
Fortran 95:
    call tbmv(a, x [,uplo] [, trans] [,diag])

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description
The ?tbmv routines perform one of the matrix-vector operations defined as
    x := A*x, or x := A'*x, or x := conjg(A')*x,
where:
x is an n-element vector,
A is an n-by-n unit, or non-unit, upper or lower triangular band matrix, with (k + 1) diagonals.

Input Parameters
uplo
    CHARACTER*1. Specifies whether the matrix A is an upper or lower triangular matrix:
    if uplo = 'U' or 'u', then the matrix is upper triangular;
    if uplo = 'L' or 'l', then the matrix is lower triangular.
trans
    CHARACTER*1.
    Specifies the operation:
    if trans = 'N' or 'n', then x := A*x;
    if trans = 'T' or 't', then x := A'*x;
    if trans = 'C' or 'c', then x := conjg(A')*x.
diag
    CHARACTER*1. Specifies whether the matrix A is unit triangular:
    if diag = 'U' or 'u', then the matrix is unit triangular;
    if diag = 'N' or 'n', then the matrix is not unit triangular.
n
    INTEGER. Specifies the order of the matrix A. The value of n must be at least zero.
k
    INTEGER. On entry with uplo = 'U' or 'u', k specifies the number of super-diagonals of the matrix A. On entry with uplo = 'L' or 'l', k specifies the number of sub-diagonals of the matrix A.
    The value of k must satisfy 0 ≤ k.
a
    REAL for stbmv
    DOUBLE PRECISION for dtbmv
    COMPLEX for ctbmv
    DOUBLE COMPLEX for ztbmv
    Array, DIMENSION (lda, n).
    Before entry with uplo = 'U' or 'u', the leading (k + 1) by n part of the array a must contain the upper triangular band part of the matrix of coefficients, supplied column-by-column, with the leading diagonal of the matrix in row (k + 1) of the array, the first super-diagonal starting at position 2 in row k, and so on. The top left k by k triangle of the array a is not referenced.
    The following program segment transfers an upper triangular band matrix from conventional full matrix storage to band storage:

      do 20, j = 1, n
         m = k + 1 - j
         do 10, i = max(1, j - k), j
            a(m + i, j) = matrix(i, j)
   10    continue
   20 continue

    Before entry with uplo = 'L' or 'l', the leading (k + 1) by n part of the array a must contain the lower triangular band part of the matrix of coefficients, supplied column-by-column, with the leading diagonal of the matrix in row 1 of the array, the first sub-diagonal starting at position 1 in row 2, and so on. The bottom right k by k triangle of the array a is not referenced.
    The following program segment transfers a lower triangular band matrix from conventional full matrix storage to band storage:

      do 20, j = 1, n
         m = 1 - j
         do 10, i = j, min(n, j + k)
            a(m + i, j) = matrix(i, j)
   10    continue
   20 continue

    Note that when diag = 'U' or 'u', the elements of the array a corresponding to the diagonal elements of the matrix are not referenced, but are assumed to be unity.
lda
    INTEGER. Specifies the leading dimension of a as declared in the calling (sub)program. The value of lda must be at least (k + 1).
x
    REAL for stbmv
    DOUBLE PRECISION for dtbmv
    COMPLEX for ctbmv
    DOUBLE COMPLEX for ztbmv
    Array, DIMENSION at least (1 + (n - 1)*abs(incx)). Before entry, the incremented array x must contain the n-element vector x.
incx
    INTEGER. Specifies the increment for the elements of x. The value of incx must not be zero.

Output Parameters
x
    Overwritten with the transformed vector x.

Fortran 95 Interface Notes
Routines in the Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine tbmv interface are the following:
a
    Holds the array a of size (k+1,n).
x
    Holds the vector with the number of elements n.
uplo
    Must be 'U' or 'L'. The default value is 'U'.
trans
    Must be 'N', 'C', or 'T'. The default value is 'N'.
diag
    Must be 'N' or 'U'. The default value is 'N'.

?tbsv
Solves a system of linear equations whose coefficients are in a triangular band matrix.
Syntax
Fortran 77:
    call stbsv(uplo, trans, diag, n, k, a, lda, x, incx)
    call dtbsv(uplo, trans, diag, n, k, a, lda, x, incx)
    call ctbsv(uplo, trans, diag, n, k, a, lda, x, incx)
    call ztbsv(uplo, trans, diag, n, k, a, lda, x, incx)
Fortran 95:
    call tbsv(a, x [,uplo] [, trans] [,diag])

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description
The ?tbsv routines solve one of the following systems of equations:
    A*x = b, or A'*x = b, or conjg(A')*x = b,
where:
b and x are n-element vectors,
A is an n-by-n unit, or non-unit, upper or lower triangular band matrix, with (k + 1) diagonals.
The routine does not test for singularity or near-singularity. Such tests must be performed before calling this routine.

Input Parameters
uplo
    CHARACTER*1. Specifies whether the matrix A is an upper or lower triangular matrix:
    if uplo = 'U' or 'u', the matrix is upper triangular;
    if uplo = 'L' or 'l', the matrix is lower triangular.
trans
    CHARACTER*1. Specifies the system of equations:
    if trans = 'N' or 'n', then A*x = b;
    if trans = 'T' or 't', then A'*x = b;
    if trans = 'C' or 'c', then conjg(A')*x = b.
diag
    CHARACTER*1. Specifies whether the matrix A is unit triangular:
    if diag = 'U' or 'u', then the matrix is unit triangular;
    if diag = 'N' or 'n', then the matrix is not unit triangular.
n
    INTEGER. Specifies the order of the matrix A. The value of n must be at least zero.
k
    INTEGER. On entry with uplo = 'U' or 'u', k specifies the number of super-diagonals of the matrix A. On entry with uplo = 'L' or 'l', k specifies the number of sub-diagonals of the matrix A.
    The value of k must satisfy 0 ≤ k.
a
    REAL for stbsv
    DOUBLE PRECISION for dtbsv
    COMPLEX for ctbsv
    DOUBLE COMPLEX for ztbsv
    Array, DIMENSION (lda, n).
    Before entry with uplo = 'U' or 'u', the leading (k + 1) by n part of the array a must contain the upper triangular band part of the matrix of coefficients, supplied column-by-column, with the leading diagonal of the matrix in row (k + 1) of the array, the first super-diagonal starting at position 2 in row k, and so on. The top left k by k triangle of the array a is not referenced.
    The following program segment transfers an upper triangular band matrix from conventional full matrix storage to band storage:

      do 20, j = 1, n
         m = k + 1 - j
         do 10, i = max(1, j - k), j
            a(m + i, j) = matrix(i, j)
   10    continue
   20 continue

    Before entry with uplo = 'L' or 'l', the leading (k + 1) by n part of the array a must contain the lower triangular band part of the matrix of coefficients, supplied column-by-column, with the leading diagonal of the matrix in row 1 of the array, the first sub-diagonal starting at position 1 in row 2, and so on. The bottom right k by k triangle of the array a is not referenced.
    The following program segment transfers a lower triangular band matrix from conventional full matrix storage to band storage:

      do 20, j = 1, n
         m = 1 - j
         do 10, i = j, min(n, j + k)
            a(m + i, j) = matrix(i, j)
   10    continue
   20 continue

    When diag = 'U' or 'u', the elements of the array a corresponding to the diagonal elements of the matrix are not referenced, but are assumed to be unity.
lda
    INTEGER. Specifies the leading dimension of a as declared in the calling (sub)program. The value of lda must be at least (k + 1).
x
    REAL for stbsv
    DOUBLE PRECISION for dtbsv
    COMPLEX for ctbsv
    DOUBLE COMPLEX for ztbsv
    Array, DIMENSION at least (1 + (n - 1)*abs(incx)). Before entry, the incremented array x must contain the n-element right-hand side vector b.
incx
    INTEGER. Specifies the increment for the elements of x. The value of incx must not be zero.

Output Parameters
x
    Overwritten with the solution vector x.
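The parameters above can be tied together with a small worked sketch. The plain-Python function below (not Intel MKL code) performs column-oriented back substitution for the case uplo = 'U', trans = 'N', reading the coefficients from the upper band storage layout described for a. The name tbsv_upper_sketch, the unit_diag flag standing in for diag, the 0-based list-of-lists representation, and the unit increment are assumptions made for illustration. Like the routine it illustrates, the sketch does not test for singularity.

```python
def tbsv_upper_sketch(n, k, a, b, unit_diag=False):
    """Solve A*x = b for an upper triangular band matrix A in band storage.

    Assumption: `a` has k+1 rows and n columns (0-based lists); A[i][j] with
    max(0, j-k) <= i <= j is stored at a[k + i - j][j].
    """
    x = list(b)  # the right-hand side is overwritten step by step
    for j in range(n - 1, -1, -1):
        if not unit_diag:
            # The diagonal lives in row k; it is skipped when assumed to be unity.
            x[j] = x[j] / a[k][j]
        # Eliminate x[j] from the equations above it that fall inside the band.
        for i in range(max(0, j - k), j):
            x[i] -= x[j] * a[k + i - j][j]
    return x


# A = [[2, 1, 0], [0, 3, 4], [0, 0, 5]] with k = 1; b = A * [1, 2, 3].
band = [
    [None, 1, 4],  # super-diagonal; the top-left entry is never referenced
    [2, 3, 5],     # main diagonal
]
solution = tbsv_upper_sketch(3, 1, band, [4, 18, 15])
```

Each column costs at most k updates, so the solve takes on the order of n*k operations rather than the n*n/2 needed for a full triangular matrix.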
Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine tbsv interface are the following:
a Holds the array a of size (k+1,n).
x Holds the vector with the number of elements n.
uplo Must be 'U' or 'L'. The default value is 'U'.
trans Must be 'N', 'C', or 'T'. The default value is 'N'.
diag Must be 'N' or 'U'. The default value is 'N'.

?tpmv
Computes a matrix-vector product using a triangular packed matrix.

Syntax
Fortran 77:
call stpmv(uplo, trans, diag, n, ap, x, incx)
call dtpmv(uplo, trans, diag, n, ap, x, incx)
call ctpmv(uplo, trans, diag, n, ap, x, incx)
call ztpmv(uplo, trans, diag, n, ap, x, incx)
Fortran 95:
call tpmv(ap, x [,uplo] [, trans] [,diag])

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description
The ?tpmv routines perform one of the matrix-vector operations defined as
x := A*x, or x := A'*x, or x := conjg(A')*x,
where:
x is an n-element vector,
A is an n-by-n unit, or non-unit, upper or lower triangular matrix, supplied in packed form.

Input Parameters
uplo CHARACTER*1. Specifies whether the matrix A is upper or lower triangular:
if uplo = 'U' or 'u', then the matrix is upper triangular;
if uplo = 'L' or 'l', then the matrix is lower triangular.
trans CHARACTER*1. Specifies the operation:
if trans = 'N' or 'n', then x := A*x;
if trans = 'T' or 't', then x := A'*x;
if trans = 'C' or 'c', then x := conjg(A')*x.
diag CHARACTER*1. Specifies whether the matrix A is unit triangular:
if diag = 'U' or 'u', then the matrix is unit triangular;
if diag = 'N' or 'n', then the matrix is not unit triangular.
n INTEGER. Specifies the order of the matrix A. The value of n must be at least zero.
ap REAL for stpmv
DOUBLE PRECISION for dtpmv
COMPLEX for ctpmv
DOUBLE COMPLEX for ztpmv
Array, DIMENSION at least ((n*(n + 1))/2).
Before entry with uplo = 'U' or 'u', the array ap must contain the upper triangular matrix packed sequentially, column-by-column, so that ap(1) contains a(1,1), ap(2) and ap(3) contain a(1,2) and a(2,2) respectively, and so on.
Before entry with uplo = 'L' or 'l', the array ap must contain the lower triangular matrix packed sequentially, column-by-column, so that ap(1) contains a(1,1), ap(2) and ap(3) contain a(2,1) and a(3,1) respectively, and so on.
When diag = 'U' or 'u', the diagonal elements of a are not referenced, but are assumed to be unity.
x REAL for stpmv
DOUBLE PRECISION for dtpmv
COMPLEX for ctpmv
DOUBLE COMPLEX for ztpmv
Array, DIMENSION at least (1 + (n - 1)*abs(incx)). Before entry, the incremented array x must contain the n-element vector x.
incx INTEGER. Specifies the increment for the elements of x. The value of incx must not be zero.

Output Parameters
x Overwritten with the transformed vector x.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine tpmv interface are the following:
ap Holds the array ap of size (n*(n+1)/2).
x Holds the vector with the number of elements n.
uplo Must be 'U' or 'L'. The default value is 'U'.
trans Must be 'N', 'C', or 'T'. The default value is 'N'.
diag Must be 'N' or 'U'. The default value is 'N'.

?tpsv
Solves a system of linear equations whose coefficients are in a triangular packed matrix.
Syntax
Fortran 77:
call stpsv(uplo, trans, diag, n, ap, x, incx)
call dtpsv(uplo, trans, diag, n, ap, x, incx)
call ctpsv(uplo, trans, diag, n, ap, x, incx)
call ztpsv(uplo, trans, diag, n, ap, x, incx)
Fortran 95:
call tpsv(ap, x [,uplo] [, trans] [,diag])

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description
The ?tpsv routines solve one of the following systems of equations:
A*x = b, or A'*x = b, or conjg(A')*x = b,
where:
b and x are n-element vectors,
A is an n-by-n unit, or non-unit, upper or lower triangular matrix, supplied in packed form.
This routine does not test for singularity or near-singularity. Such tests must be performed before calling this routine.

Input Parameters
uplo CHARACTER*1. Specifies whether the matrix A is upper or lower triangular:
if uplo = 'U' or 'u', then the matrix is upper triangular;
if uplo = 'L' or 'l', then the matrix is lower triangular.
trans CHARACTER*1. Specifies the system of equations:
if trans = 'N' or 'n', then A*x = b;
if trans = 'T' or 't', then A'*x = b;
if trans = 'C' or 'c', then conjg(A')*x = b.
diag CHARACTER*1. Specifies whether the matrix A is unit triangular:
if diag = 'U' or 'u', then the matrix is unit triangular;
if diag = 'N' or 'n', then the matrix is not unit triangular.
n INTEGER. Specifies the order of the matrix A. The value of n must be at least zero.
ap REAL for stpsv
DOUBLE PRECISION for dtpsv
COMPLEX for ctpsv
DOUBLE COMPLEX for ztpsv
Array, DIMENSION at least ((n*(n + 1))/2).
Before entry with uplo = 'U' or 'u', the array ap must contain the upper triangular matrix packed sequentially, column-by-column, so that ap(1) contains a(1,1), ap(2) and ap(3) contain a(1,2) and a(2,2) respectively, and so on.
Before entry with uplo = 'L' or 'l', the array ap must contain the lower triangular matrix packed sequentially, column-by-column, so that ap(1) contains a(1,1), ap(2) and ap(3) contain a(2,1) and a(3,1) respectively, and so on.
When diag = 'U' or 'u', the diagonal elements of a are not referenced, but are assumed to be unity.
x REAL for stpsv
DOUBLE PRECISION for dtpsv
COMPLEX for ctpsv
DOUBLE COMPLEX for ztpsv
Array, DIMENSION at least (1 + (n - 1)*abs(incx)). Before entry, the incremented array x must contain the n-element right-hand side vector b.
incx INTEGER. Specifies the increment for the elements of x. The value of incx must not be zero.

Output Parameters
x Overwritten with the solution vector x.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine tpsv interface are the following:
ap Holds the array ap of size (n*(n+1)/2).
x Holds the vector with the number of elements n.
uplo Must be 'U' or 'L'. The default value is 'U'.
trans Must be 'N', 'C', or 'T'. The default value is 'N'.
diag Must be 'N' or 'U'. The default value is 'N'.

?trmv
Computes a matrix-vector product using a triangular matrix.
Syntax
Fortran 77:
call strmv(uplo, trans, diag, n, a, lda, x, incx)
call dtrmv(uplo, trans, diag, n, a, lda, x, incx)
call ctrmv(uplo, trans, diag, n, a, lda, x, incx)
call ztrmv(uplo, trans, diag, n, a, lda, x, incx)
Fortran 95:
call trmv(a, x [,uplo] [, trans] [,diag])

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description
The ?trmv routines perform one of the following matrix-vector operations:
x := A*x, or x := A'*x, or x := conjg(A')*x,
where:
x is an n-element vector,
A is an n-by-n unit, or non-unit, upper or lower triangular matrix.

Input Parameters
uplo CHARACTER*1. Specifies whether the matrix A is upper or lower triangular:
if uplo = 'U' or 'u', then the matrix is upper triangular;
if uplo = 'L' or 'l', then the matrix is lower triangular.
trans CHARACTER*1. Specifies the operation:
if trans = 'N' or 'n', then x := A*x;
if trans = 'T' or 't', then x := A'*x;
if trans = 'C' or 'c', then x := conjg(A')*x.
diag CHARACTER*1. Specifies whether the matrix A is unit triangular:
if diag = 'U' or 'u', then the matrix is unit triangular;
if diag = 'N' or 'n', then the matrix is not unit triangular.
n INTEGER. Specifies the order of the matrix A. The value of n must be at least zero.
a REAL for strmv
DOUBLE PRECISION for dtrmv
COMPLEX for ctrmv
DOUBLE COMPLEX for ztrmv
Array, DIMENSION (lda,n).
Before entry with uplo = 'U' or 'u', the leading n-by-n upper triangular part of the array a must contain the upper triangular matrix and the strictly lower triangular part of a is not referenced.
Before entry with uplo = 'L' or 'l', the leading n-by-n lower triangular part of the array a must contain the lower triangular matrix and the strictly upper triangular part of a is not referenced.
When diag = 'U' or 'u', the diagonal elements of a are not referenced either, but are assumed to be unity.
lda INTEGER. Specifies the leading dimension of a as declared in the calling (sub)program.
The value of lda must be at least max(1, n).
x REAL for strmv
DOUBLE PRECISION for dtrmv
COMPLEX for ctrmv
DOUBLE COMPLEX for ztrmv
Array, DIMENSION at least (1 + (n - 1)*abs(incx)). Before entry, the incremented array x must contain the n-element vector x.
incx INTEGER. Specifies the increment for the elements of x. The value of incx must not be zero.

Output Parameters
x Overwritten with the transformed vector x.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine trmv interface are the following:
a Holds the matrix A of size (n,n).
x Holds the vector with the number of elements n.
uplo Must be 'U' or 'L'. The default value is 'U'.
trans Must be 'N', 'C', or 'T'. The default value is 'N'.
diag Must be 'N' or 'U'. The default value is 'N'.

?trsv
Solves a system of linear equations whose coefficients are in a triangular matrix.

Syntax
Fortran 77:
call strsv(uplo, trans, diag, n, a, lda, x, incx)
call dtrsv(uplo, trans, diag, n, a, lda, x, incx)
call ctrsv(uplo, trans, diag, n, a, lda, x, incx)
call ztrsv(uplo, trans, diag, n, a, lda, x, incx)
Fortran 95:
call trsv(a, x [,uplo] [, trans] [,diag])

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description
The ?trsv routines solve one of the systems of equations:
A*x = b, or A'*x = b, or conjg(A')*x = b,
where:
b and x are n-element vectors,
A is an n-by-n unit, or non-unit, upper or lower triangular matrix.
The routine does not test for singularity or near-singularity. Such tests must be performed before calling this routine.

Input Parameters
uplo CHARACTER*1.
Specifies whether the matrix A is upper or lower triangular:
if uplo = 'U' or 'u', then the matrix is upper triangular;
if uplo = 'L' or 'l', then the matrix is lower triangular.
trans CHARACTER*1. Specifies the system of equations:
if trans = 'N' or 'n', then A*x = b;
if trans = 'T' or 't', then A'*x = b;
if trans = 'C' or 'c', then conjg(A')*x = b.
diag CHARACTER*1. Specifies whether the matrix A is unit triangular:
if diag = 'U' or 'u', then the matrix is unit triangular;
if diag = 'N' or 'n', then the matrix is not unit triangular.
n INTEGER. Specifies the order of the matrix A. The value of n must be at least zero.
a REAL for strsv
DOUBLE PRECISION for dtrsv
COMPLEX for ctrsv
DOUBLE COMPLEX for ztrsv
Array, DIMENSION (lda,n).
Before entry with uplo = 'U' or 'u', the leading n-by-n upper triangular part of the array a must contain the upper triangular matrix and the strictly lower triangular part of a is not referenced.
Before entry with uplo = 'L' or 'l', the leading n-by-n lower triangular part of the array a must contain the lower triangular matrix and the strictly upper triangular part of a is not referenced.
When diag = 'U' or 'u', the diagonal elements of a are not referenced either, but are assumed to be unity.
lda INTEGER. Specifies the leading dimension of a as declared in the calling (sub)program. The value of lda must be at least max(1, n).
x REAL for strsv
DOUBLE PRECISION for dtrsv
COMPLEX for ctrsv
DOUBLE COMPLEX for ztrsv
Array, DIMENSION at least (1 + (n - 1)*abs(incx)). Before entry, the incremented array x must contain the n-element right-hand side vector b.
incx INTEGER. Specifies the increment for the elements of x. The value of incx must not be zero.

Output Parameters
x Overwritten with the solution vector x.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts.
For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine trsv interface are the following:
a Holds the matrix A of size (n,n).
x Holds the vector with the number of elements n.
uplo Must be 'U' or 'L'. The default value is 'U'.
trans Must be 'N', 'C', or 'T'. The default value is 'N'.
diag Must be 'N' or 'U'. The default value is 'N'.

BLAS Level 3 Routines
BLAS Level 3 routines perform matrix-matrix operations. Table “BLAS Level 3 Routine Groups and Their Data Types” lists the BLAS Level 3 routine groups and the data types associated with them.

BLAS Level 3 Routine Groups and Their Data Types
Routine Group   Data Types   Description
?gemm           s, d, c, z   Matrix-matrix product of general matrices
?hemm           c, z         Matrix-matrix product of Hermitian matrices
?herk           c, z         Rank-k update of Hermitian matrices
?her2k          c, z         Rank-2k update of Hermitian matrices
?symm           s, d, c, z   Matrix-matrix product of symmetric matrices
?syrk           s, d, c, z   Rank-k update of symmetric matrices
?syr2k          s, d, c, z   Rank-2k update of symmetric matrices
?trmm           s, d, c, z   Matrix-matrix product of triangular matrices
?trsm           s, d, c, z   Linear matrix-matrix solution for triangular matrices

Symmetric Multiprocessing Version of Intel® MKL
Many applications spend considerable time executing BLAS routines. This time can be scaled by the number of processors available on the system by using the symmetric multiprocessing (SMP) feature built into Intel MKL. The performance enhancements based on the parallel use of the processors are available without any programming effort on your part.
To enhance performance, the library uses the following methods:
• The BLAS functions are blocked where possible to restructure the code in a way that increases the localization of data reference, enhances cache memory use, and reduces the dependency on the memory bus.
• The code is distributed across the processors to maximize parallelism.

Optimization Notice
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804

?gemm
Computes a scalar-matrix-matrix product and adds the result to a scalar-matrix product.

Syntax
Fortran 77:
call sgemm(transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
call dgemm(transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
call cgemm(transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
call zgemm(transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
call scgemm(transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
call dzgemm(transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
Fortran 95:
call gemm(a, b, c [,transa][,transb] [,alpha][,beta])

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description
The ?gemm routines perform a matrix-matrix operation with general matrices.
The operation is defined as
C := alpha*op(A)*op(B) + beta*C,
where:
op(x) is one of op(x) = x, or op(x) = x', or op(x) = conjg(x'),
alpha and beta are scalars,
A, B and C are matrices:
op(A) is an m-by-k matrix,
op(B) is a k-by-n matrix,
C is an m-by-n matrix.
See also ?gemm3m, BLAS-like extension routines that use matrix multiplication for similar matrix-matrix operations.

Input Parameters
transa CHARACTER*1. Specifies the form of op(A) used in the matrix multiplication:
if transa = 'N' or 'n', then op(A) = A;
if transa = 'T' or 't', then op(A) = A';
if transa = 'C' or 'c', then op(A) = conjg(A').
transb CHARACTER*1. Specifies the form of op(B) used in the matrix multiplication:
if transb = 'N' or 'n', then op(B) = B;
if transb = 'T' or 't', then op(B) = B';
if transb = 'C' or 'c', then op(B) = conjg(B').
m INTEGER. Specifies the number of rows of the matrix op(A) and of the matrix C. The value of m must be at least zero.
n INTEGER. Specifies the number of columns of the matrix op(B) and the number of columns of the matrix C. The value of n must be at least zero.
k INTEGER. Specifies the number of columns of the matrix op(A) and the number of rows of the matrix op(B). The value of k must be at least zero.
alpha REAL for sgemm
DOUBLE PRECISION for dgemm
COMPLEX for cgemm, scgemm
DOUBLE COMPLEX for zgemm, dzgemm
Specifies the scalar alpha.
a REAL for sgemm, scgemm
DOUBLE PRECISION for dgemm, dzgemm
COMPLEX for cgemm
DOUBLE COMPLEX for zgemm
Array, DIMENSION (lda, ka), where ka is k when transa = 'N' or 'n', and is m otherwise. Before entry with transa = 'N' or 'n', the leading m-by-k part of the array a must contain the matrix A, otherwise the leading k-by-m part of the array a must contain the matrix A.
lda INTEGER. Specifies the leading dimension of a as declared in the calling (sub)program. When transa = 'N' or 'n', then lda must be at least max(1, m), otherwise lda must be at least max(1, k).
b REAL for sgemm
DOUBLE PRECISION for dgemm
COMPLEX for cgemm, scgemm
DOUBLE COMPLEX for zgemm, dzgemm
Array, DIMENSION (ldb, kb), where kb is n when transb = 'N' or 'n', and is k otherwise. Before entry with transb = 'N' or 'n', the leading k-by-n part of the array b must contain the matrix B, otherwise the leading n-by-k part of the array b must contain the matrix B.
ldb INTEGER. Specifies the leading dimension of b as declared in the calling (sub)program. When transb = 'N' or 'n', then ldb must be at least max(1, k), otherwise ldb must be at least max(1, n).
beta REAL for sgemm
DOUBLE PRECISION for dgemm
COMPLEX for cgemm, scgemm
DOUBLE COMPLEX for zgemm, dzgemm
Specifies the scalar beta. When beta is equal to zero, then c need not be set on input.
c REAL for sgemm
DOUBLE PRECISION for dgemm
COMPLEX for cgemm, scgemm
DOUBLE COMPLEX for zgemm, dzgemm
Array, DIMENSION (ldc, n). Before entry, the leading m-by-n part of the array c must contain the matrix C, except when beta is equal to zero, in which case c need not be set on entry.
ldc INTEGER. Specifies the leading dimension of c as declared in the calling (sub)program. The value of ldc must be at least max(1, m).

Output Parameters
c Overwritten by the m-by-n matrix (alpha*op(A)*op(B) + beta*C).

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine gemm interface are the following:
a Holds the matrix A of size (ma,ka) where
ka = k if transa = 'N', ka = m otherwise,
ma = m if transa = 'N', ma = k otherwise.
b Holds the matrix B of size (mb,kb) where
kb = n if transb = 'N', kb = k otherwise,
mb = k if transb = 'N', mb = n otherwise.
c Holds the matrix C of size (m,n).
transa Must be 'N', 'C', or 'T'. The default value is 'N'.
transb Must be 'N', 'C', or 'T'. The default value is 'N'.
alpha The default value is 1.
beta The default value is 0.

?hemm
Computes a scalar-matrix-matrix product (either one of the matrices is Hermitian) and adds the result to a scalar-matrix product.

Syntax
Fortran 77:
call chemm(side, uplo, m, n, alpha, a, lda, b, ldb, beta, c, ldc)
call zhemm(side, uplo, m, n, alpha, a, lda, b, ldb, beta, c, ldc)
Fortran 95:
call hemm(a, b, c [,side][,uplo] [,alpha][,beta])

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description
The ?hemm routines perform a matrix-matrix operation using Hermitian matrices. The operation is defined as
C := alpha*A*B + beta*C
or
C := alpha*B*A + beta*C,
where:
alpha and beta are scalars,
A is a Hermitian matrix,
B and C are m-by-n matrices.

Input Parameters
side CHARACTER*1. Specifies whether the Hermitian matrix A appears on the left or right in the operation as follows:
if side = 'L' or 'l', then C := alpha*A*B + beta*C;
if side = 'R' or 'r', then C := alpha*B*A + beta*C.
uplo CHARACTER*1. Specifies whether the upper or lower triangular part of the Hermitian matrix A is used:
if uplo = 'U' or 'u', then the upper triangular part of the Hermitian matrix A is used;
if uplo = 'L' or 'l', then the lower triangular part of the Hermitian matrix A is used.
m INTEGER. Specifies the number of rows of the matrix C. The value of m must be at least zero.
n INTEGER. Specifies the number of columns of the matrix C. The value of n must be at least zero.
alpha COMPLEX for chemm
DOUBLE COMPLEX for zhemm
Specifies the scalar alpha.
a COMPLEX for chemm
DOUBLE COMPLEX for zhemm
Array, DIMENSION (lda,ka), where ka is m when side = 'L' or 'l' and is n otherwise.
Before entry with side = 'L' or 'l', the m-by-m part of the array a must contain the Hermitian matrix, such that when uplo = 'U' or 'u', the leading m-by-m upper triangular part of the array a must contain the upper triangular part of the Hermitian matrix and the strictly lower triangular part of a is not referenced, and when uplo = 'L' or 'l', the leading m-by-m lower triangular part of the array a must contain the lower triangular part of the Hermitian matrix, and the strictly upper triangular part of a is not referenced.
Before entry with side = 'R' or 'r', the n-by-n part of the array a must contain the Hermitian matrix, such that when uplo = 'U' or 'u', the leading n-by-n upper triangular part of the array a must contain the upper triangular part of the Hermitian matrix and the strictly lower triangular part of a is not referenced, and when uplo = 'L' or 'l', the leading n-by-n lower triangular part of the array a must contain the lower triangular part of the Hermitian matrix, and the strictly upper triangular part of a is not referenced.
The imaginary parts of the diagonal elements need not be set; they are assumed to be zero.
lda INTEGER. Specifies the leading dimension of a as declared in the calling (sub)program. When side = 'L' or 'l', then lda must be at least max(1, m), otherwise lda must be at least max(1, n).
b COMPLEX for chemm
DOUBLE COMPLEX for zhemm
Array, DIMENSION (ldb,n). Before entry, the leading m-by-n part of the array b must contain the matrix B.
ldb INTEGER. Specifies the leading dimension of b as declared in the calling (sub)program. The value of ldb must be at least max(1, m).
beta COMPLEX for chemm
DOUBLE COMPLEX for zhemm
Specifies the scalar beta. When beta is supplied as zero, then c need not be set on input.
c COMPLEX for chemm
DOUBLE COMPLEX for zhemm
Array, DIMENSION (ldc, n).
Before entry, the leading m-by-n part of the array c must contain the matrix C, except when beta is zero, in which case c need not be set on entry.
ldc INTEGER. Specifies the leading dimension of c as declared in the calling (sub)program. The value of ldc must be at least max(1, m).

Output Parameters
c Overwritten by the m-by-n updated matrix.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine hemm interface are the following:
a Holds the matrix A of size (k,k) where k = m if side = 'L', k = n otherwise.
b Holds the matrix B of size (m,n).
c Holds the matrix C of size (m,n).
side Must be 'L' or 'R'. The default value is 'L'.
uplo Must be 'U' or 'L'. The default value is 'U'.
alpha The default value is 1.
beta The default value is 0.

?herk
Performs a rank-k update of a Hermitian matrix.

Syntax
Fortran 77:
call cherk(uplo, trans, n, k, alpha, a, lda, beta, c, ldc)
call zherk(uplo, trans, n, k, alpha, a, lda, beta, c, ldc)
Fortran 95:
call herk(a, c [,uplo] [, trans] [,alpha][,beta])

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description
The ?herk routines perform a matrix-matrix operation using Hermitian matrices. The operation is defined as
C := alpha*A*conjg(A') + beta*C,
or
C := alpha*conjg(A')*A + beta*C,
where:
alpha and beta are real scalars,
C is an n-by-n Hermitian matrix,
A is an n-by-k matrix in the first case and a k-by-n matrix in the second case.

Input Parameters
uplo CHARACTER*1. Specifies whether the upper or lower triangular part of the array c is used.
If uplo = 'U' or 'u', then the upper triangular part of the array c is used.
If uplo = 'L' or 'l', then the lower triangular part of the array c is used.
trans CHARACTER*1.
Specifies the operation:
if trans = 'N' or 'n', then C := alpha*A*conjg(A') + beta*C;
if trans = 'C' or 'c', then C := alpha*conjg(A')*A + beta*C.
n INTEGER. Specifies the order of the matrix C. The value of n must be at least zero.
k INTEGER. With trans = 'N' or 'n', k specifies the number of columns of the matrix A, and with trans = 'C' or 'c', k specifies the number of rows of the matrix A. The value of k must be at least zero.
alpha REAL for cherk
DOUBLE PRECISION for zherk
Specifies the scalar alpha.
a COMPLEX for cherk
DOUBLE COMPLEX for zherk
Array, DIMENSION (lda, ka), where ka is k when trans = 'N' or 'n', and is n otherwise. Before entry with trans = 'N' or 'n', the leading n-by-k part of the array a must contain the matrix A, otherwise the leading k-by-n part of the array a must contain the matrix A.
lda INTEGER. Specifies the leading dimension of a as declared in the calling (sub)program. When trans = 'N' or 'n', then lda must be at least max(1, n), otherwise lda must be at least max(1, k).
beta REAL for cherk
DOUBLE PRECISION for zherk
Specifies the scalar beta.
c COMPLEX for cherk
DOUBLE COMPLEX for zherk
Array, DIMENSION (ldc,n).
Before entry with uplo = 'U' or 'u', the leading n-by-n upper triangular part of the array c must contain the upper triangular part of the Hermitian matrix and the strictly lower triangular part of c is not referenced.
Before entry with uplo = 'L' or 'l', the leading n-by-n lower triangular part of the array c must contain the lower triangular part of the Hermitian matrix and the strictly upper triangular part of c is not referenced.
The imaginary parts of the diagonal elements need not be set; they are assumed to be zero.
ldc INTEGER. Specifies the leading dimension of c as declared in the calling (sub)program. The value of ldc must be at least max(1, n).
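To make the roles of uplo, the real scalars alpha and beta, and the diagonal convention concrete, here is a minimal Python sketch of the rank-k update for the case trans = 'N', uplo = 'U'. It is an illustrative reference loop, not Intel MKL code: the function name herk_upper_n is hypothetical, matrices are nested lists of Python complex values, and indices are 0-based.

```python
def herk_upper_n(alpha, a, beta, c):
    """C := alpha*A*conjg(A') + beta*C for an n-by-k matrix A.

    Only the upper triangle of c (including the diagonal) is read
    and written; the strictly lower triangle is left untouched.
    alpha and beta are real scalars.
    """
    n = len(a)
    k = len(a[0]) if n else 0
    for j in range(n):
        for i in range(j + 1):
            # (A * conjg(A'))(i, j) = sum over l of A(i, l) * conjg(A(j, l))
            s = sum((a[i][l] * a[j][l].conjugate() for l in range(k)), 0j)
            if i == j:
                # Imaginary part of the input diagonal is ignored and
                # the output diagonal is forced to be real.
                c[i][j] = complex(alpha * s.real + beta * c[i][j].real, 0.0)
            else:
                c[i][j] = alpha * s + beta * c[i][j]
    return c
```

Because the strictly lower triangle is never touched, a caller who needs the full Hermitian matrix afterwards has to mirror the upper triangle with conjugation.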
Output Parameters
c With uplo = 'U' or 'u', the upper triangular part of the array c is overwritten by the upper triangular part of the updated matrix.
With uplo = 'L' or 'l', the lower triangular part of the array c is overwritten by the lower triangular part of the updated matrix.
The imaginary parts of the diagonal elements are set to zero.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine herk interface are the following:
a Holds the matrix A of size (ma,ka) where
ka = k if trans = 'N', ka = n otherwise,
ma = n if trans = 'N', ma = k otherwise.
c Holds the matrix C of size (n,n).
uplo Must be 'U' or 'L'. The default value is 'U'.
trans Must be 'N' or 'C'. The default value is 'N'.
alpha The default value is 1.
beta The default value is 0.

?her2k
Performs a rank-2k update of a Hermitian matrix.

Syntax
Fortran 77:
call cher2k(uplo, trans, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
call zher2k(uplo, trans, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
Fortran 95:
call her2k(a, b, c [,uplo][,trans] [,alpha][,beta])

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description
The ?her2k routines perform a rank-2k matrix-matrix operation using Hermitian matrices. The operation is defined as
C := alpha*A*conjg(B') + conjg(alpha)*B*conjg(A') + beta*C,
or
C := alpha*conjg(A')*B + conjg(alpha)*conjg(B')*A + beta*C,
where:
alpha is a scalar and beta is a real scalar,
C is an n-by-n Hermitian matrix,
A and B are n-by-k matrices in the first case and k-by-n matrices in the second case.

Input Parameters
uplo CHARACTER*1. Specifies whether the upper or lower triangular part of the array c is used.
If uplo = 'U' or 'u', then the upper triangular part of the array c is used.
If uplo = 'L' or 'l', then the lower triangular part of the array c is used.
trans CHARACTER*1. Specifies the operation:
if trans = 'N' or 'n', then C := alpha*A*conjg(B') + conjg(alpha)*B*conjg(A') + beta*C;
if trans = 'C' or 'c', then C := alpha*conjg(A')*B + conjg(alpha)*conjg(B')*A + beta*C.
n INTEGER. Specifies the order of the matrix C. The value of n must be at least zero.
k INTEGER. With trans = 'N' or 'n', k specifies the number of columns of the matrix A, and with trans = 'C' or 'c', k specifies the number of rows of the matrix A. The value of k must be at least zero.
alpha COMPLEX for cher2k
DOUBLE COMPLEX for zher2k
Specifies the scalar alpha.
a COMPLEX for cher2k
DOUBLE COMPLEX for zher2k
Array, DIMENSION (lda, ka), where ka is k when trans = 'N' or 'n', and is n otherwise. Before entry with trans = 'N' or 'n', the leading n-by-k part of the array a must contain the matrix A, otherwise the leading k-by-n part of the array a must contain the matrix A.
lda INTEGER. Specifies the leading dimension of a as declared in the calling (sub)program. When trans = 'N' or 'n', then lda must be at least max(1, n), otherwise lda must be at least max(1, k).
beta REAL for cher2k
DOUBLE PRECISION for zher2k
Specifies the scalar beta.
b COMPLEX for cher2k
DOUBLE COMPLEX for zher2k
Array, DIMENSION (ldb, kb), where kb is k when trans = 'N' or 'n', and is n otherwise. Before entry with trans = 'N' or 'n', the leading n-by-k part of the array b must contain the matrix B, otherwise the leading k-by-n part of the array b must contain the matrix B.
ldb INTEGER. Specifies the leading dimension of b as declared in the calling (sub)program. When trans = 'N' or 'n', then ldb must be at least max(1, n), otherwise ldb must be at least max(1, k).
c COMPLEX for cher2k
DOUBLE COMPLEX for zher2k
Array, DIMENSION (ldc,n).
Before entry with uplo = 'U' or 'u', the leading n-by-n upper triangular part of the array c must contain the upper triangular part of the Hermitian matrix and the strictly lower triangular part of c is not referenced.
Before entry with uplo = 'L' or 'l', the leading n-by-n lower triangular part of the array c must contain the lower triangular part of the Hermitian matrix and the strictly upper triangular part of c is not referenced.
The imaginary parts of the diagonal elements need not be set; they are assumed to be zero.
ldc INTEGER. Specifies the leading dimension of c as declared in the calling (sub)program. The value of ldc must be at least max(1, n).

Output Parameters
c With uplo = 'U' or 'u', the upper triangular part of the array c is overwritten by the upper triangular part of the updated matrix.
With uplo = 'L' or 'l', the lower triangular part of the array c is overwritten by the lower triangular part of the updated matrix.
The imaginary parts of the diagonal elements are set to zero.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine her2k interface are the following:
a Holds the matrix A of size (ma,ka) where
ka = k if trans = 'N', ka = n otherwise,
ma = n if trans = 'N', ma = k otherwise.
b Holds the matrix B of size (mb,kb) where
kb = k if trans = 'N', kb = n otherwise,
mb = n if trans = 'N', mb = k otherwise.
c Holds the matrix C of size (n,n).
uplo Must be 'U' or 'L'. The default value is 'U'.
trans Must be 'N' or 'C'. The default value is 'N'.
alpha The default value is 1.
beta The default value is 0.

?symm
Performs a scalar-matrix-matrix product (one matrix operand is symmetric) and adds the result to a scalar-matrix product.
Syntax
Fortran 77:
call ssymm(side, uplo, m, n, alpha, a, lda, b, ldb, beta, c, ldc)
call dsymm(side, uplo, m, n, alpha, a, lda, b, ldb, beta, c, ldc)
call csymm(side, uplo, m, n, alpha, a, lda, b, ldb, beta, c, ldc)
call zsymm(side, uplo, m, n, alpha, a, lda, b, ldb, beta, c, ldc)
Fortran 95:
call symm(a, b, c [,side][,uplo] [,alpha][,beta])

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description
The ?symm routines perform a matrix-matrix operation using symmetric matrices. The operation is defined as
C := alpha*A*B + beta*C,
or
C := alpha*B*A + beta*C,
where: alpha and beta are scalars, A is a symmetric matrix, B and C are m-by-n matrices.

Input Parameters
side CHARACTER*1. Specifies whether the symmetric matrix A appears on the left or right in the operation:
if side = 'L' or 'l', then C := alpha*A*B + beta*C;
if side = 'R' or 'r', then C := alpha*B*A + beta*C.
uplo CHARACTER*1. Specifies whether the upper or lower triangular part of the symmetric matrix A is used:
if uplo = 'U' or 'u', then the upper triangular part is used;
if uplo = 'L' or 'l', then the lower triangular part is used.
m INTEGER. Specifies the number of rows of the matrix C. The value of m must be at least zero.
n INTEGER. Specifies the number of columns of the matrix C. The value of n must be at least zero.
alpha REAL for ssymm, DOUBLE PRECISION for dsymm, COMPLEX for csymm, DOUBLE COMPLEX for zsymm. Specifies the scalar alpha.
a REAL for ssymm, DOUBLE PRECISION for dsymm, COMPLEX for csymm, DOUBLE COMPLEX for zsymm. Array, DIMENSION (lda, ka), where ka is m when side = 'L' or 'l' and is n otherwise.
Before entry with side = 'L' or 'l', the m-by-m part of the array a must contain the symmetric matrix: when uplo = 'U' or 'u', the leading m-by-m upper triangular part of the array a must contain the upper triangular part of the symmetric matrix and the strictly lower triangular part of a is not referenced; when uplo = 'L' or 'l', the leading m-by-m lower triangular part of the array a must contain the lower triangular part of the symmetric matrix and the strictly upper triangular part of a is not referenced.
Before entry with side = 'R' or 'r', the n-by-n part of the array a must contain the symmetric matrix: when uplo = 'U' or 'u', the leading n-by-n upper triangular part of the array a must contain the upper triangular part of the symmetric matrix and the strictly lower triangular part of a is not referenced; when uplo = 'L' or 'l', the leading n-by-n lower triangular part of the array a must contain the lower triangular part of the symmetric matrix and the strictly upper triangular part of a is not referenced.
lda INTEGER. Specifies the leading dimension of a as declared in the calling (sub)program. When side = 'L' or 'l', lda must be at least max(1, m); otherwise lda must be at least max(1, n).
b REAL for ssymm, DOUBLE PRECISION for dsymm, COMPLEX for csymm, DOUBLE COMPLEX for zsymm. Array, DIMENSION (ldb,n). Before entry, the leading m-by-n part of the array b must contain the matrix B.
ldb INTEGER. Specifies the leading dimension of b as declared in the calling (sub)program. The value of ldb must be at least max(1, m).
beta REAL for ssymm, DOUBLE PRECISION for dsymm, COMPLEX for csymm, DOUBLE COMPLEX for zsymm. Specifies the scalar beta. When beta is set to zero, c need not be set on input.
c REAL for ssymm, DOUBLE PRECISION for dsymm, COMPLEX for csymm, DOUBLE COMPLEX for zsymm. Array, DIMENSION (ldc,n).
Before entry, the leading m-by-n part of the array c must contain the matrix C, except when beta is zero, in which case c need not be set on entry.
ldc INTEGER. Specifies the leading dimension of c as declared in the calling (sub)program. The value of ldc must be at least max(1, m).

Output Parameters
c Overwritten by the m-by-n updated matrix.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions. Specific details for the routine symm interface are the following:
a Holds the matrix A of size (k,k) where k = m if side = 'L', k = n otherwise.
b Holds the matrix B of size (m,n).
c Holds the matrix C of size (m,n).
side Must be 'L' or 'R'. The default value is 'L'.
uplo Must be 'U' or 'L'. The default value is 'U'.
alpha The default value is 1.
beta The default value is 0.

?syrk
Performs a rank-k update of a symmetric matrix.

Syntax
Fortran 77:
call ssyrk(uplo, trans, n, k, alpha, a, lda, beta, c, ldc)
call dsyrk(uplo, trans, n, k, alpha, a, lda, beta, c, ldc)
call csyrk(uplo, trans, n, k, alpha, a, lda, beta, c, ldc)
call zsyrk(uplo, trans, n, k, alpha, a, lda, beta, c, ldc)
Fortran 95:
call syrk(a, c [,uplo] [, trans] [,alpha][,beta])

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description
The ?syrk routines perform a matrix-matrix operation using symmetric matrices. The operation is defined as
C := alpha*A*A' + beta*C,
or
C := alpha*A'*A + beta*C,
where: alpha and beta are scalars, C is an n-by-n symmetric matrix, A is an n-by-k matrix in the first case and a k-by-n matrix in the second case.

Input Parameters
uplo CHARACTER*1. Specifies whether the upper or lower triangular part of the array c is used. If uplo = 'U' or 'u', then the upper triangular part of the array c is used.
If uplo = 'L' or 'l', then the lower triangular part of the array c is used.
trans CHARACTER*1. Specifies the operation:
if trans = 'N' or 'n', then C := alpha*A*A' + beta*C;
if trans = 'T' or 't', then C := alpha*A'*A + beta*C;
if trans = 'C' or 'c', then C := alpha*A'*A + beta*C.
n INTEGER. Specifies the order of the matrix C. The value of n must be at least zero.
k INTEGER. On entry with trans = 'N' or 'n', k specifies the number of columns of the matrix A, and on entry with trans = 'T' or 't' or 'C' or 'c', k specifies the number of rows of the matrix A. The value of k must be at least zero.
alpha REAL for ssyrk, DOUBLE PRECISION for dsyrk, COMPLEX for csyrk, DOUBLE COMPLEX for zsyrk. Specifies the scalar alpha.
a REAL for ssyrk, DOUBLE PRECISION for dsyrk, COMPLEX for csyrk, DOUBLE COMPLEX for zsyrk. Array, DIMENSION (lda,ka), where ka is k when trans = 'N' or 'n', and is n otherwise. Before entry with trans = 'N' or 'n', the leading n-by-k part of the array a must contain the matrix A; otherwise the leading k-by-n part of the array a must contain the matrix A.
lda INTEGER. Specifies the leading dimension of a as declared in the calling (sub)program. When trans = 'N' or 'n', lda must be at least max(1, n); otherwise lda must be at least max(1, k).
beta REAL for ssyrk, DOUBLE PRECISION for dsyrk, COMPLEX for csyrk, DOUBLE COMPLEX for zsyrk. Specifies the scalar beta.
c REAL for ssyrk, DOUBLE PRECISION for dsyrk, COMPLEX for csyrk, DOUBLE COMPLEX for zsyrk. Array, DIMENSION (ldc,n). Before entry with uplo = 'U' or 'u', the leading n-by-n upper triangular part of the array c must contain the upper triangular part of the symmetric matrix and the strictly lower triangular part of c is not referenced. Before entry with uplo = 'L' or 'l', the leading n-by-n lower triangular part of the array c must contain the lower triangular part of the symmetric matrix and the strictly upper triangular part of c is not referenced.
ldc INTEGER.
Specifies the leading dimension of c as declared in the calling (sub)program. The value of ldc must be at least max(1, n).

Output Parameters
c With uplo = 'U' or 'u', the upper triangular part of the array c is overwritten by the upper triangular part of the updated matrix. With uplo = 'L' or 'l', the lower triangular part of the array c is overwritten by the lower triangular part of the updated matrix.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions. Specific details for the routine syrk interface are the following:
a Holds the matrix A of size (ma,ka) where ka = k if trans = 'N', ka = n otherwise, ma = n if trans = 'N', ma = k otherwise.
c Holds the matrix C of size (n,n).
uplo Must be 'U' or 'L'. The default value is 'U'.
trans Must be 'N', 'C', or 'T'. The default value is 'N'.
alpha The default value is 1.
beta The default value is 0.

?syr2k
Performs a rank-2k update of a symmetric matrix.

Syntax
Fortran 77:
call ssyr2k(uplo, trans, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
call dsyr2k(uplo, trans, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
call csyr2k(uplo, trans, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
call zsyr2k(uplo, trans, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
Fortran 95:
call syr2k(a, b, c [,uplo][,trans] [,alpha][,beta])

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description
The ?syr2k routines perform a rank-2k matrix-matrix operation using symmetric matrices. The operation is defined as
C := alpha*A*B' + alpha*B*A' + beta*C,
or
C := alpha*A'*B + alpha*B'*A + beta*C,
where: alpha and beta are scalars, C is an n-by-n symmetric matrix, A and B are n-by-k matrices in the first case, and k-by-n matrices in the second case.
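The first form of the rank-2k update can be modeled with a plain triple loop. The sketch below is only an illustration in C for the double-precision real case with the upper triangle stored; it is not the Intel MKL implementation. The function name ref_dsyr2k_upper_n is hypothetical, and the arrays are assumed to be column-major with 0-based offsets, mirroring Fortran storage.

```c
#include <stddef.h>

/* Hypothetical reference model, not the MKL routine:
 * C := alpha*A*B' + alpha*B*A' + beta*C, first form, upper triangle only.
 * Column-major storage: element (i,j) lives at [i + j*ld], i and j 0-based.
 * The strictly lower triangle of c is neither read nor written. */
void ref_dsyr2k_upper_n(int n, int k, double alpha,
                        const double *a, int lda,
                        const double *b, int ldb,
                        double beta, double *c, int ldc)
{
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i <= j; ++i) {
            double sum = 0.0;
            for (int l = 0; l < k; ++l) {
                /* (A*B')(i,j) + (B*A')(i,j) */
                sum += a[i + (size_t)l * lda] * b[j + (size_t)l * ldb]
                     + b[i + (size_t)l * ldb] * a[j + (size_t)l * lda];
            }
            c[i + (size_t)j * ldc] = alpha * sum + beta * c[i + (size_t)j * ldc];
        }
    }
}
```

Because only the entries with i <= j are touched, the caller is free to leave the other triangle of c unset, which matches the storage rule the routine documents for the upper-triangle case.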

Input Parameters
uplo CHARACTER*1. Specifies whether the upper or lower triangular part of the array c is used. If uplo = 'U' or 'u', then the upper triangular part of the array c is used. If uplo = 'L' or 'l', then the lower triangular part of the array c is used.
trans CHARACTER*1. Specifies the operation:
if trans = 'N' or 'n', then C := alpha*A*B' + alpha*B*A' + beta*C;
if trans = 'T' or 't', then C := alpha*A'*B + alpha*B'*A + beta*C;
if trans = 'C' or 'c', then C := alpha*A'*B + alpha*B'*A + beta*C.
n INTEGER. Specifies the order of the matrix C. The value of n must be at least zero.
k INTEGER. On entry with trans = 'N' or 'n', k specifies the number of columns of the matrices A and B, and on entry with trans = 'T' or 't' or 'C' or 'c', k specifies the number of rows of the matrices A and B. The value of k must be at least zero.
alpha REAL for ssyr2k, DOUBLE PRECISION for dsyr2k, COMPLEX for csyr2k, DOUBLE COMPLEX for zsyr2k. Specifies the scalar alpha.
a REAL for ssyr2k, DOUBLE PRECISION for dsyr2k, COMPLEX for csyr2k, DOUBLE COMPLEX for zsyr2k. Array, DIMENSION (lda,ka), where ka is k when trans = 'N' or 'n', and is n otherwise. Before entry with trans = 'N' or 'n', the leading n-by-k part of the array a must contain the matrix A; otherwise the leading k-by-n part of the array a must contain the matrix A.
lda INTEGER. Specifies the leading dimension of a as declared in the calling (sub)program. When trans = 'N' or 'n', lda must be at least max(1, n); otherwise lda must be at least max(1, k).
b REAL for ssyr2k, DOUBLE PRECISION for dsyr2k, COMPLEX for csyr2k, DOUBLE COMPLEX for zsyr2k. Array, DIMENSION (ldb, kb), where kb is k when trans = 'N' or 'n', and is n otherwise. Before entry with trans = 'N' or 'n', the leading n-by-k part of the array b must contain the matrix B; otherwise the leading k-by-n part of the array b must contain the matrix B.
ldb INTEGER. Specifies the leading dimension of b as declared in the calling (sub)program.
When trans = 'N' or 'n', then ldb must be at least max(1, n), otherwise ldb must be at least max(1, k).
beta REAL for ssyr2k, DOUBLE PRECISION for dsyr2k, COMPLEX for csyr2k, DOUBLE COMPLEX for zsyr2k. Specifies the scalar beta.
c REAL for ssyr2k, DOUBLE PRECISION for dsyr2k, COMPLEX for csyr2k, DOUBLE COMPLEX for zsyr2k. Array, DIMENSION (ldc,n). Before entry with uplo = 'U' or 'u', the leading n-by-n upper triangular part of the array c must contain the upper triangular part of the symmetric matrix and the strictly lower triangular part of c is not referenced. Before entry with uplo = 'L' or 'l', the leading n-by-n lower triangular part of the array c must contain the lower triangular part of the symmetric matrix and the strictly upper triangular part of c is not referenced.
ldc INTEGER. Specifies the leading dimension of c as declared in the calling (sub)program. The value of ldc must be at least max(1, n).

Output Parameters
c With uplo = 'U' or 'u', the upper triangular part of the array c is overwritten by the upper triangular part of the updated matrix. With uplo = 'L' or 'l', the lower triangular part of the array c is overwritten by the lower triangular part of the updated matrix.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions. Specific details for the routine syr2k interface are the following:
a Holds the matrix A of size (ma,ka) where ka = k if trans = 'N', ka = n otherwise, ma = n if trans = 'N', ma = k otherwise.
b Holds the matrix B of size (mb,kb) where kb = k if trans = 'N', kb = n otherwise, mb = n if trans = 'N', mb = k otherwise.
c Holds the matrix C of size (n,n).
uplo Must be 'U' or 'L'. The default value is 'U'.
trans Must be 'N', 'C', or 'T'. The default value is 'N'.
alpha The default value is 1.
beta The default value is 0.

?trmm
Computes a scalar-matrix-matrix product (one matrix operand is triangular).

Syntax
Fortran 77:
call strmm(side, uplo, transa, diag, m, n, alpha, a, lda, b, ldb)
call dtrmm(side, uplo, transa, diag, m, n, alpha, a, lda, b, ldb)
call ctrmm(side, uplo, transa, diag, m, n, alpha, a, lda, b, ldb)
call ztrmm(side, uplo, transa, diag, m, n, alpha, a, lda, b, ldb)
Fortran 95:
call trmm(a, b [,side] [, uplo] [,transa][,diag] [,alpha])

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description
The ?trmm routines perform a matrix-matrix operation using triangular matrices. The operation is defined as
B := alpha*op(A)*B
or
B := alpha*B*op(A)
where: alpha is a scalar, B is an m-by-n matrix, A is a unit or non-unit, upper or lower triangular matrix, and op(A) is one of op(A) = A, op(A) = A', or op(A) = conjg(A').

Input Parameters
side CHARACTER*1. Specifies whether op(A) appears on the left or right of B in the operation:
if side = 'L' or 'l', then B := alpha*op(A)*B;
if side = 'R' or 'r', then B := alpha*B*op(A).
uplo CHARACTER*1. Specifies whether the matrix A is upper or lower triangular:
if uplo = 'U' or 'u', then the matrix is upper triangular;
if uplo = 'L' or 'l', then the matrix is lower triangular.
transa CHARACTER*1. Specifies the form of op(A) used in the matrix multiplication:
if transa = 'N' or 'n', then op(A) = A;
if transa = 'T' or 't', then op(A) = A';
if transa = 'C' or 'c', then op(A) = conjg(A').
diag CHARACTER*1. Specifies whether the matrix A is unit triangular:
if diag = 'U' or 'u', then the matrix is unit triangular;
if diag = 'N' or 'n', then the matrix is not unit triangular.
m INTEGER. Specifies the number of rows of B. The value of m must be at least zero.
n INTEGER. Specifies the number of columns of B. The value of n must be at least zero.
alpha REAL for strmm, DOUBLE PRECISION for dtrmm, COMPLEX for ctrmm, DOUBLE COMPLEX for ztrmm. Specifies the scalar alpha. When alpha is zero, then a is not referenced and b need not be set before entry.
a REAL for strmm, DOUBLE PRECISION for dtrmm, COMPLEX for ctrmm, DOUBLE COMPLEX for ztrmm. Array, DIMENSION (lda,k), where k is m when side = 'L' or 'l' and is n when side = 'R' or 'r'. Before entry with uplo = 'U' or 'u', the leading k-by-k upper triangular part of the array a must contain the upper triangular matrix and the strictly lower triangular part of a is not referenced. Before entry with uplo = 'L' or 'l', the leading k-by-k lower triangular part of the array a must contain the lower triangular matrix and the strictly upper triangular part of a is not referenced. When diag = 'U' or 'u', the diagonal elements of a are not referenced either, but are assumed to be unity.
lda INTEGER. Specifies the leading dimension of a as declared in the calling (sub)program. When side = 'L' or 'l', then lda must be at least max(1, m), when side = 'R' or 'r', then lda must be at least max(1, n).
b REAL for strmm, DOUBLE PRECISION for dtrmm, COMPLEX for ctrmm, DOUBLE COMPLEX for ztrmm. Array, DIMENSION (ldb,n). Before entry, the leading m-by-n part of the array b must contain the matrix B.
ldb INTEGER. Specifies the leading dimension of b as declared in the calling (sub)program. The value of ldb must be at least max(1, m).

Output Parameters
b Overwritten by the transformed matrix.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions. Specific details for the routine trmm interface are the following:
a Holds the matrix A of size (k,k) where k = m if side = 'L', k = n otherwise.
b Holds the matrix B of size (m,n).
side Must be 'L' or 'R'.
The default value is 'L'.
uplo Must be 'U' or 'L'. The default value is 'U'.
transa Must be 'N', 'C', or 'T'. The default value is 'N'.
diag Must be 'N' or 'U'. The default value is 'N'.
alpha The default value is 1.

?trsm
Solves a matrix equation (one matrix operand is triangular).

Syntax
Fortran 77:
call strsm(side, uplo, transa, diag, m, n, alpha, a, lda, b, ldb)
call dtrsm(side, uplo, transa, diag, m, n, alpha, a, lda, b, ldb)
call ctrsm(side, uplo, transa, diag, m, n, alpha, a, lda, b, ldb)
call ztrsm(side, uplo, transa, diag, m, n, alpha, a, lda, b, ldb)
Fortran 95:
call trsm(a, b [,side] [, uplo] [,transa][,diag] [,alpha])

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description
The ?trsm routines solve one of the following matrix equations:
op(A)*X = alpha*B,
or
X*op(A) = alpha*B,
where: alpha is a scalar, X and B are m-by-n matrices, A is a unit or non-unit, upper or lower triangular matrix, and op(A) is one of op(A) = A, op(A) = A', or op(A) = conjg(A').
The matrix B is overwritten by the solution matrix X.

Input Parameters
side CHARACTER*1. Specifies whether op(A) appears on the left or right of X in the equation:
if side = 'L' or 'l', then op(A)*X = alpha*B;
if side = 'R' or 'r', then X*op(A) = alpha*B.
uplo CHARACTER*1. Specifies whether the matrix A is upper or lower triangular:
if uplo = 'U' or 'u', then the matrix is upper triangular;
if uplo = 'L' or 'l', then the matrix is lower triangular.
transa CHARACTER*1. Specifies the form of op(A) used in the matrix multiplication:
if transa = 'N' or 'n', then op(A) = A;
if transa = 'T' or 't', then op(A) = A';
if transa = 'C' or 'c', then op(A) = conjg(A').
diag CHARACTER*1. Specifies whether the matrix A is unit triangular:
if diag = 'U' or 'u', then the matrix is unit triangular;
if diag = 'N' or 'n', then the matrix is not unit triangular.
m INTEGER. Specifies the number of rows of B.
The value of m must be at least zero.
n INTEGER. Specifies the number of columns of B. The value of n must be at least zero.
alpha REAL for strsm, DOUBLE PRECISION for dtrsm, COMPLEX for ctrsm, DOUBLE COMPLEX for ztrsm. Specifies the scalar alpha. When alpha is zero, then a is not referenced and b need not be set before entry.
a REAL for strsm, DOUBLE PRECISION for dtrsm, COMPLEX for ctrsm, DOUBLE COMPLEX for ztrsm. Array, DIMENSION (lda, k), where k is m when side = 'L' or 'l' and is n when side = 'R' or 'r'. Before entry with uplo = 'U' or 'u', the leading k-by-k upper triangular part of the array a must contain the upper triangular matrix and the strictly lower triangular part of a is not referenced. Before entry with uplo = 'L' or 'l', the leading k-by-k lower triangular part of the array a must contain the lower triangular matrix and the strictly upper triangular part of a is not referenced. When diag = 'U' or 'u', the diagonal elements of a are not referenced either, but are assumed to be unity.
lda INTEGER. Specifies the leading dimension of a as declared in the calling (sub)program. When side = 'L' or 'l', lda must be at least max(1, m); when side = 'R' or 'r', lda must be at least max(1, n).
b REAL for strsm, DOUBLE PRECISION for dtrsm, COMPLEX for ctrsm, DOUBLE COMPLEX for ztrsm. Array, DIMENSION (ldb,n). Before entry, the leading m-by-n part of the array b must contain the right-hand side matrix B.
ldb INTEGER. Specifies the leading dimension of b as declared in the calling (sub)program. The value of ldb must be at least max(1, m).

Output Parameters
b Overwritten by the solution matrix X.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine trsm interface are the following:
a Holds the matrix A of size (k,k) where k = m if side = 'L', k = n otherwise.
b Holds the matrix B of size (m,n).
side Must be 'L' or 'R'. The default value is 'L'.
uplo Must be 'U' or 'L'. The default value is 'U'.
transa Must be 'N', 'C', or 'T'. The default value is 'N'.
diag Must be 'N' or 'U'. The default value is 'N'.
alpha The default value is 1.

Sparse BLAS Level 1 Routines
This section describes Sparse BLAS Level 1, an extension of BLAS Level 1 included in the Intel® Math Kernel Library beginning with the Intel MKL release 2.1. Sparse BLAS Level 1 is a group of routines and functions that perform a number of common vector operations on sparse vectors stored in compressed form.
Sparse vectors are those in which the majority of elements are zeros. Sparse BLAS routines and functions are specially implemented to take advantage of vector sparsity. This allows you to achieve large savings in computer time and memory. If nz is the number of non-zero vector elements, the computer time taken by Sparse BLAS operations will be O(nz).

Vector Arguments
Compressed sparse vectors. Let a be a vector stored in an array, and assume that the only non-zero elements of a are the following:
a(k1), a(k2), a(k3), ..., a(knz),
where nz is the total number of non-zero elements in a.
In Sparse BLAS, this vector can be represented in compressed form by two FORTRAN arrays, x (values) and indx (indices). Each array has nz elements:
x(1) = a(k1), x(2) = a(k2), ..., x(nz) = a(knz),
indx(1) = k1, indx(2) = k2, ..., indx(nz) = knz.
Thus, a sparse vector is fully determined by the triple (nz, x, indx). If you pass a negative or zero value of nz to Sparse BLAS, the subroutines do not modify any arrays or variables.
Full-storage vectors. Sparse BLAS routines can also use a vector argument fully stored in a single FORTRAN array (a full-storage vector).
If y is a full-storage vector, its elements must be stored contiguously: the first element in y(1), the second in y(2), and so on. This corresponds to an increment incy = 1 in BLAS Level 1. No increment value for full-storage vectors is passed as an argument to Sparse BLAS routines or functions.

Naming Conventions
Similar to BLAS, the names of Sparse BLAS subprograms have prefixes that determine the data type involved: s and d for single- and double-precision real; c and z for single- and double-precision complex respectively.
If a Sparse BLAS routine is an extension of a "dense" one, the subprogram name is formed by appending the suffix i (standing for indexed) to the name of the corresponding "dense" subprogram. For example, the Sparse BLAS routine saxpyi corresponds to the BLAS routine saxpy, and the Sparse BLAS function cdotci corresponds to the BLAS function cdotc.

Routines and Data Types
Routines and data types supported in the Intel MKL implementation of Sparse BLAS are listed in Table “Sparse BLAS Routines and Their Data Types”.
Sparse BLAS Routines and Their Data Types

Routine/Function   Data Types   Description
?axpyi             s, d, c, z   Scalar-vector product plus vector (routines)
?doti              s, d         Dot product (functions)
?dotci             c, z         Complex dot product conjugated (functions)
?dotui             c, z         Complex dot product unconjugated (functions)
?gthr              s, d, c, z   Gathering a full-storage sparse vector into compressed form nz, x, indx (routines)
?gthrz             s, d, c, z   Gathering a full-storage sparse vector into compressed form and assigning zeros to gathered elements in the full-storage vector (routines)
?roti              s, d         Givens rotation (routines)
?sctr              s, d, c, z   Scattering a vector from compressed form to full-storage form (routines)

BLAS Level 1 Routines That Can Work With Sparse Vectors
The following BLAS Level 1 routines will give correct results when you pass to them a compressed-form array x (with the increment incx=1):
?asum sum of absolute values of vector elements
?copy copying a vector
?nrm2 Euclidean norm of a vector
?scal scaling a vector
i?amax index of the element with the largest absolute value for real flavors, or the largest sum |Re(x(i))|+|Im(x(i))| for complex flavors.
i?amin index of the element with the smallest absolute value for real flavors, or the smallest sum |Re(x(i))|+|Im(x(i))| for complex flavors.
The result i returned by i?amax and i?amin should be interpreted as index in the compressed-form array, so that the largest (smallest) value is x(i); the corresponding index in full-storage array is indx(i).
You can also call ?rotg to compute the parameters of Givens rotation and then pass these parameters to the Sparse BLAS routines ?roti.

?axpyi
Adds a scalar multiple of compressed sparse vector to a full-storage vector.
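As a rough illustration of this operation, the C sketch below models the double-precision real flavor. It is not the Intel MKL implementation: the function name ref_daxpyi is hypothetical, and indx is assumed to hold 1-based positions as in the Fortran interface, so each index is shifted by one when addressing the C array.

```c
/* Hypothetical reference model of the daxpyi operation, not the MKL routine.
 * x and indx describe a compressed sparse vector with nz entries;
 * indx[i] is a 1-based position in the full-storage vector y.
 * Only the entries of y listed in indx are read or written. */
void ref_daxpyi(int nz, double a, const double *x, const int *indx, double *y)
{
    /* For nz <= 0 the loop body never runs, so nothing is modified. */
    for (int i = 0; i < nz; ++i) {
        y[indx[i] - 1] += a * x[i];
    }
}
```

The cost is proportional to nz rather than to the length of y, which is the point of keeping x in compressed form.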

Syntax
Fortran 77:
call saxpyi(nz, a, x, indx, y)
call daxpyi(nz, a, x, indx, y)
call caxpyi(nz, a, x, indx, y)
call zaxpyi(nz, a, x, indx, y)
Fortran 95:
call axpyi(x, indx, y [, a])

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description
The ?axpyi routines perform a vector-vector operation defined as
y := a*x + y
where: a is a scalar, x is a sparse vector stored in compressed form, y is a vector in full storage form.
The ?axpyi routines reference or modify only the elements of y whose indices are listed in the array indx. The values in indx must be distinct.

Input Parameters
nz INTEGER. The number of elements in x and indx.
a REAL for saxpyi, DOUBLE PRECISION for daxpyi, COMPLEX for caxpyi, DOUBLE COMPLEX for zaxpyi. Specifies the scalar a.
x REAL for saxpyi, DOUBLE PRECISION for daxpyi, COMPLEX for caxpyi, DOUBLE COMPLEX for zaxpyi. Array, DIMENSION at least nz.
indx INTEGER. Specifies the indices for the elements of x. Array, DIMENSION at least nz.
y REAL for saxpyi, DOUBLE PRECISION for daxpyi, COMPLEX for caxpyi, DOUBLE COMPLEX for zaxpyi. Array, DIMENSION at least max(indx(i)).

Output Parameters
y Contains the updated vector y.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions. Specific details for the routine axpyi interface are the following:
x Holds the vector with the number of elements nz.
indx Holds the vector with the number of elements nz.
y Holds the vector with the number of elements nz.
a The default value is 1.

?doti
Computes the dot product of a compressed sparse real vector by a full-storage real vector.
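A minimal C sketch of this dot product, assuming the double-precision real flavor and 1-based values in indx as in the Fortran interface. The function name ref_ddoti is hypothetical; it models the result and is not the Intel MKL implementation.

```c
/* Hypothetical reference model of the ddoti result, not the MKL routine:
 * res = x(1)*y(indx(1)) + ... + x(nz)*y(indx(nz)).
 * indx[i] is a 1-based position in the full-storage vector y.
 * Returns 0 when nz is not positive. */
double ref_ddoti(int nz, const double *x, const int *indx, const double *y)
{
    double res = 0.0;
    for (int i = 0; i < nz; ++i) {
        res += x[i] * y[indx[i] - 1];
    }
    return res;
}
```

Entries of y that are not listed in indx never enter the sum, so they may hold any values.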

Syntax
Fortran 77:
res = sdoti(nz, x, indx, y)
res = ddoti(nz, x, indx, y)
Fortran 95:
res = doti(x, indx, y)

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description
The ?doti routines return the dot product of x and y defined as
res = x(1)*y(indx(1)) + x(2)*y(indx(2)) +...+ x(nz)*y(indx(nz))
where the triple (nz, x, indx) defines a sparse real vector stored in compressed form, and y is a real vector in full storage form. The functions reference only the elements of y whose indices are listed in the array indx. The values in indx must be distinct.

Input Parameters
nz INTEGER. The number of elements in x and indx.
x REAL for sdoti, DOUBLE PRECISION for ddoti. Array, DIMENSION at least nz.
indx INTEGER. Specifies the indices for the elements of x. Array, DIMENSION at least nz.
y REAL for sdoti, DOUBLE PRECISION for ddoti. Array, DIMENSION at least max(indx(i)).

Output Parameters
res REAL for sdoti, DOUBLE PRECISION for ddoti. Contains the dot product of x and y, if nz is positive. Otherwise, res contains 0.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions. Specific details for the routine doti interface are the following:
x Holds the vector with the number of elements nz.
indx Holds the vector with the number of elements nz.
y Holds the vector with the number of elements nz.

?dotci
Computes the conjugated dot product of a compressed sparse complex vector with a full-storage complex vector.

Syntax
Fortran 77:
res = cdotci(nz, x, indx, y)
res = zdotci(nz, x, indx, y)
Fortran 95:
res = dotci(x, indx, y)

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description
The ?dotci routines return the dot product of x and y defined as
conjg(x(1))*y(indx(1)) + ...
+ conjg(x(nz))*y(indx(nz))
where the triple (nz, x, indx) defines a sparse complex vector stored in compressed form, and y is a complex vector in full storage form. The functions reference only the elements of y whose indices are listed in the array indx. The values in indx must be distinct.

Input Parameters
nz INTEGER. The number of elements in x and indx.
x COMPLEX for cdotci, DOUBLE COMPLEX for zdotci. Array, DIMENSION at least nz.
indx INTEGER. Specifies the indices for the elements of x. Array, DIMENSION at least nz.
y COMPLEX for cdotci, DOUBLE COMPLEX for zdotci. Array, DIMENSION at least max(indx(i)).

Output Parameters
res COMPLEX for cdotci, DOUBLE COMPLEX for zdotci. Contains the conjugated dot product of x and y, if nz is positive. Otherwise, res contains 0.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions. Specific details for the routine dotci interface are the following:
x Holds the vector with the number of elements nz.
indx Holds the vector with the number of elements nz.
y Holds the vector with the number of elements nz.

?dotui
Computes the dot product of a compressed sparse complex vector by a full-storage complex vector.

Syntax
Fortran 77:
res = cdotui(nz, x, indx, y)
res = zdotui(nz, x, indx, y)
Fortran 95:
res = dotui(x, indx, y)

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description
The ?dotui routines return the dot product of x and y defined as
res = x(1)*y(indx(1)) + x(2)*y(indx(2)) +...+ x(nz)*y(indx(nz))
where the triple (nz, x, indx) defines a sparse complex vector stored in compressed form, and y is a complex vector in full storage form. The functions reference only the elements of y whose indices are listed in the array indx.
The values in indx must be distinct.

Input Parameters
nz INTEGER. The number of elements in x and indx.
x COMPLEX for cdotui, DOUBLE COMPLEX for zdotui. Array, DIMENSION at least nz.
indx INTEGER. Specifies the indices for the elements of x. Array, DIMENSION at least nz.
y COMPLEX for cdotui, DOUBLE COMPLEX for zdotui. Array, DIMENSION at least max(indx(i)).

Output Parameters
res COMPLEX for cdotui, DOUBLE COMPLEX for zdotui. Contains the dot product of x and y, if nz is positive. Otherwise, res contains 0.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions. Specific details for the routine dotui interface are the following:
x Holds the vector with the number of elements nz.
indx Holds the vector with the number of elements nz.
y Holds the vector with the number of elements nz.

?gthr
Gathers a full-storage sparse vector's elements into compressed form.

Syntax
Fortran 77:
call sgthr(nz, y, x, indx)
call dgthr(nz, y, x, indx)
call cgthr(nz, y, x, indx)
call zgthr(nz, y, x, indx)
Fortran 95:
call gthr(x, indx, y)

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description
The ?gthr routines gather the specified elements of a full-storage sparse vector y into compressed form (nz, x, indx). The routines reference only the elements of y whose indices are listed in the array indx:
x(i) = y(indx(i)), for i = 1, 2, ..., nz.

Input Parameters
nz INTEGER. The number of elements of y to be gathered.
indx INTEGER. Specifies indices of elements to be gathered. Array, DIMENSION at least nz.
y REAL for sgthr, DOUBLE PRECISION for dgthr, COMPLEX for cgthr, DOUBLE COMPLEX for zgthr. Array, DIMENSION at least max(indx(i)).
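The gather step x(i) = y(indx(i)) can be sketched in C as follows. This is a model of the double-precision real flavor under the assumption that indx holds 1-based positions, as in the Fortran interface; the function name ref_dgthr is hypothetical and the code is not the Intel MKL implementation.

```c
/* Hypothetical reference model of the dgthr operation, not the MKL routine.
 * Copies the entries of the full-storage vector y selected by indx into
 * the compressed array x. indx[i] is a 1-based position in y.
 * y itself is left unchanged. */
void ref_dgthr(int nz, const double *y, double *x, const int *indx)
{
    for (int i = 0; i < nz; ++i) {
        x[i] = y[indx[i] - 1];
    }
}
```

The argument order (nz, y, x, indx) follows the FORTRAN 77 calling sequence of the routine.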

Output Parameters

x       REAL for sgthr
        DOUBLE PRECISION for dgthr
        COMPLEX for cgthr
        DOUBLE COMPLEX for zgthr
        Array, DIMENSION at least nz.
        Contains the vector converted to the compressed form.

Fortran 95 Interface Notes

Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.

Specific details for the routine gthr interface are the following:

x       Holds the vector with the number of elements nz.
indx    Holds the vector with the number of elements nz.
y       Holds the vector with the number of elements nz.

?gthrz

Gathers a sparse vector's elements into compressed form, replacing them by zeros.

Syntax

Fortran 77:
    call sgthrz(nz, y, x, indx)
    call dgthrz(nz, y, x, indx)
    call cgthrz(nz, y, x, indx)
    call zgthrz(nz, y, x, indx)
Fortran 95:
    call gthrz(x, indx, y)

Include Files

• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description

The ?gthrz routines gather the elements with indices specified by the array indx from a full-storage vector y into compressed form (nz, x, indx) and overwrite the gathered elements of y with zeros. Other elements of y are not referenced or modified (see also ?gthr).

Input Parameters

nz      INTEGER. The number of elements of y to be gathered.
indx    INTEGER. Specifies indices of elements to be gathered.
        Array, DIMENSION at least nz.
y       REAL for sgthrz
        DOUBLE PRECISION for dgthrz
        COMPLEX for cgthrz
        DOUBLE COMPLEX for zgthrz
        Array, DIMENSION at least max(indx(i)).

Output Parameters

x       REAL for sgthrz
        DOUBLE PRECISION for dgthrz
        COMPLEX for cgthrz
        DOUBLE COMPLEX for zgthrz
        Array, DIMENSION at least nz.
        Contains the vector converted to the compressed form.
y       The updated vector y.

Fortran 95 Interface Notes

Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts.
For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.

Specific details for the routine gthrz interface are the following:

x       Holds the vector with the number of elements nz.
indx    Holds the vector with the number of elements nz.
y       Holds the vector with the number of elements nz.

?roti

Applies Givens rotation to sparse vectors one of which is in compressed form.

Syntax

Fortran 77:
    call sroti(nz, x, indx, y, c, s)
    call droti(nz, x, indx, y, c, s)
Fortran 95:
    call roti(x, indx, y, c, s)

Include Files

• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description

The ?roti routines apply the Givens rotation to elements of two real vectors, x (in compressed form nz, x, indx) and y (in full storage form):

    x(i) = c*x(i) + s*y(indx(i))
    y(indx(i)) = c*y(indx(i)) - s*x(i)

The routines reference only the elements of y whose indices are listed in the array indx. The values in indx must be distinct.

Input Parameters

nz      INTEGER. The number of elements in x and indx.
x       REAL for sroti
        DOUBLE PRECISION for droti
        Array, DIMENSION at least nz.
indx    INTEGER. Specifies the indices for the elements of x.
        Array, DIMENSION at least nz.
y       REAL for sroti
        DOUBLE PRECISION for droti
        Array, DIMENSION at least max(indx(i)).
c       A scalar: REAL for sroti
        DOUBLE PRECISION for droti.
s       A scalar: REAL for sroti
        DOUBLE PRECISION for droti.

Output Parameters

x and y The updated arrays.

Fortran 95 Interface Notes

Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.

Specific details for the routine roti interface are the following:

x       Holds the vector with the number of elements nz.
indx    Holds the vector with the number of elements nz.
y       Holds the vector with the number of elements nz.
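
The rotation described for ?roti can be sketched in plain C as follows. This is only an illustrative reference loop, not the MKL implementation: the name ref_droti is hypothetical, indx is assumed to hold one-based indices as in the Fortran interface, and both assignments are assumed to use the values of x(i) and y(indx(i)) from before the update, which is why they are saved in temporaries.

```c
/* Hypothetical reference helper (not an MKL routine): applies the Givens
 * rotation described for ?roti to a compressed vector (nz, x, indx) and a
 * full-storage vector y. indx is assumed to hold one-based indices. */
void ref_droti(int nz, double *x, const int *indx, double *y,
               double c, double s)
{
    for (int i = 0; i < nz; ++i) {
        int j = indx[i] - 1;   /* one-based index -> C array offset */
        double xi = x[i];      /* keep the old values so that both   */
        double yj = y[j];      /* updates use pre-rotation data      */
        x[i] = c * xi + s * yj;
        y[j] = c * yj - s * xi;
    }
}
```

Elements of y whose indices do not appear in indx are left untouched, which matches the statement that only the listed elements are referenced.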

?sctr

Converts compressed sparse vectors into full storage form.

Syntax

Fortran 77:
    call ssctr(nz, x, indx, y)
    call dsctr(nz, x, indx, y)
    call csctr(nz, x, indx, y)
    call zsctr(nz, x, indx, y)
Fortran 95:
    call sctr(x, indx, y)

Include Files

• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description

The ?sctr routines scatter the elements of the compressed sparse vector (nz, x, indx) to a full-storage vector y. The routines modify only the elements of y whose indices are listed in the array indx:

    y(indx(i)) = x(i), for i = 1, 2, ..., nz.

Input Parameters

nz      INTEGER. The number of elements of x to be scattered.
indx    INTEGER. Specifies indices of elements to be scattered.
        Array, DIMENSION at least nz.
x       REAL for ssctr
        DOUBLE PRECISION for dsctr
        COMPLEX for csctr
        DOUBLE COMPLEX for zsctr
        Array, DIMENSION at least nz.
        Contains the vector to be converted to full-storage form.

Output Parameters

y       REAL for ssctr
        DOUBLE PRECISION for dsctr
        COMPLEX for csctr
        DOUBLE COMPLEX for zsctr
        Array, DIMENSION at least max(indx(i)).
        Contains the vector y with updated elements.

Fortran 95 Interface Notes

Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.

Specific details for the routine sctr interface are the following:

x       Holds the vector with the number of elements nz.
indx    Holds the vector with the number of elements nz.
y       Holds the vector with the number of elements nz.

Sparse BLAS Level 2 and Level 3 Routines

This section describes Sparse BLAS Level 2 and Level 3 routines included in the Intel® Math Kernel Library (Intel® MKL). Sparse BLAS Level 2 is a group of routines and functions that perform operations between a sparse matrix and dense vectors.
Sparse BLAS Level 3 is a group of routines and functions that perform operations between a sparse matrix and dense matrices.

The terms and concepts required to understand the use of the Intel MKL Sparse BLAS Level 2 and Level 3 routines are discussed in the Linear Solvers Basics appendix.

The Sparse BLAS routines can be useful for implementing iterative methods for solving large sparse systems of equations or eigenvalue problems. For example, these routines can be considered as building blocks for Iterative Sparse Solvers based on Reverse Communication Interface (RCI ISS) described in Chapter 8 of the manual.

Intel MKL provides Sparse BLAS Level 2 and Level 3 routines with a typical (or conventional) interface similar to the interface used in the NIST* Sparse BLAS library [Rem05].

Some software packages and libraries (the PARDISO* Solver used in Intel MKL, Sparskit 2 [Saad94], the Compaq* Extended Math Library (CXML) [CXML01]) use a different (early) variation of the compressed sparse row (CSR) format and support only Level 2 operations with simplified interfaces. Intel MKL provides an additional set of Sparse BLAS Level 2 routines with similar simplified interfaces. Each of these routines operates only on a matrix of a fixed type.

The routines described in this section support both one-based indexing and zero-based indexing of the input data (see details in the section One-based and Zero-based Indexing).

Naming Conventions in Sparse BLAS Level 2 and Level 3

Each Sparse BLAS Level 2 and Level 3 routine has a six- or eight-character base name preceded by the prefix mkl_ or mkl_cspblas_.

The routines with typical (conventional) interface have six-character base names in accordance with the template:

    mkl_<character><data><operation>( )

The routines with simplified interfaces have eight-character base names in accordance with the templates:

    mkl_<character><data><mtype><operation>( )

for routines with one-based indexing; and

    mkl_cspblas_<character><data><mtype><operation>( )

for routines with zero-based indexing.
The <character> field indicates the data type:

s       real, single precision
c       complex, single precision
d       real, double precision
z       complex, double precision

The <data> field indicates the sparse matrix storage format (see section Sparse Matrix Storage Formats):

coo     coordinate format
csr     compressed sparse row format and its variations
csc     compressed sparse column format and its variations
dia     diagonal format
sky     skyline storage format
bsr     block sparse row format and its variations

The <operation> field indicates the type of operation:

mv      matrix-vector product (Level 2)
mm      matrix-matrix product (Level 3)
sv      solving a single triangular system (Level 2)
sm      solving triangular systems with multiple right-hand sides (Level 3)

The <mtype> field indicates the matrix type:

ge      sparse representation of a general matrix
sy      sparse representation of the upper or lower triangle of a symmetric matrix
tr      sparse representation of a triangular matrix

Sparse Matrix Storage Formats

The current version of Intel MKL Sparse BLAS Level 2 and Level 3 routines supports the following point entry [Duff86] storage formats for sparse matrices:

• compressed sparse row format (CSR) and its variations;
• compressed sparse column format (CSC);
• coordinate format;
• diagonal format;
• skyline storage format;

and one block entry storage format:

• block sparse row format (BSR) and its variations.

For more information see "Sparse Matrix Storage Formats" in Appendix A.

Intel MKL provides auxiliary routines (matrix converters) that convert a sparse matrix from one storage format to another.

Routines and Supported Operations

This section describes operations supported by the Intel MKL Sparse BLAS Level 2 and Level 3 routines.
The following notations are used here:

A is a sparse matrix;
B and C are dense matrices;
D is a diagonal scaling matrix;
x and y are dense vectors;
alpha and beta are scalars;
op(A) is one of the possible operations:
    op(A) = A;
    op(A) = A' - transpose of A;
    op(A) = conj(A') - conjugated transpose of A.
inv(op(A)) denotes the inverse of op(A).

The Intel MKL Sparse BLAS Level 2 and Level 3 routines support the following operations:

• computing the vector product between a sparse matrix and a dense vector:
    y := alpha*op(A)*x + beta*y
• solving a single triangular system:
    y := alpha*inv(op(A))*x
• computing a product between a sparse matrix and a dense matrix:
    C := alpha*op(A)*B + beta*C
• solving a sparse triangular system with multiple right-hand sides:
    C := alpha*inv(op(A))*B

Intel MKL provides an additional set of the Sparse BLAS Level 2 routines with simplified interfaces. Each of these routines operates on a matrix of a fixed type. The following operations are supported:

• computing the vector product between a sparse matrix and a dense vector (for general and symmetric matrices):
    y := op(A)*x
• solving a single triangular system (for triangular matrices):
    y := inv(op(A))*x

Matrix type is indicated by the <mtype> field in the routine name (see section Naming Conventions in Sparse BLAS Level 2 and Level 3).

NOTE The routines with simplified interfaces support only four sparse matrix storage formats, specifically:
    CSR format in the 3-array variation accepted in the direct sparse solvers and in the CXML;
    diagonal format accepted in the CXML;
    coordinate format;
    BSR format in the 3-array variation.

Note that routines with both typical (conventional) and simplified interfaces use the same computational kernels that work with certain internal data structures.

The Intel MKL Sparse BLAS Level 2 and Level 3 routines do not support in-place operations.
The complete list of all routines is given in the table "Sparse BLAS Level 2 and Level 3 Routines".

Interface Consideration

One-Based and Zero-Based Indexing

The Intel MKL Sparse BLAS Level 2 and Level 3 routines support one-based and zero-based indexing of data arrays.

Routines with typical interfaces support zero-based indexing for the following sparse data storage formats: CSR, CSC, BSR, and COO. Routines with simplified interfaces support zero-based indexing for the following sparse data storage formats: CSR, BSR, and COO. See the complete list of Sparse BLAS Level 2 and Level 3 Routines.

The one-based indexing uses the convention of starting array indices at 1. The zero-based indexing uses the convention of starting array indices at 0. For example, indices of the 5-element array x can be presented in case of one-based indexing as follows:

    Element index:   1     2     3     4     5
    Element value:   1.0   5.0   7.0   8.0   9.0

and in case of zero-based indexing as follows:

    Element index:   0     1     2     3     4
    Element value:   1.0   5.0   7.0   8.0   9.0

The detailed descriptions of the one-based and zero-based variants of the sparse data storage formats are given in the "Sparse Matrix Storage Formats" in Appendix A.

Most parameters of the routines are identical for both one-based and zero-based indexing, but some of them have certain differences. The following table lists all these differences.

Parameter: val
    One-based indexing:   Array containing non-zero elements of the matrix A, its length is pntre(m) - pntrb(1).
    Zero-based indexing:  Array containing non-zero elements of the matrix A, its length is pntre(m-1) - pntrb(0).

Parameter: pntrb
    One-based indexing:   Array of length m. This array contains row indices, such that pntrb(i) - pntrb(1)+1 is the first index of row i in the arrays val and indx.
    Zero-based indexing:  Array of length m. This array contains row indices, such that pntrb(i) - pntrb(0) is the first index of row i in the arrays val and indx.

Parameter: pntre
    One-based indexing:   Array of length m. This array contains row indices, such that pntre(i) - pntrb(1) is the last index of row i in the arrays val and indx.
    Zero-based indexing:  Array of length m. This array contains row indices, such that pntre(i) - pntrb(0)-1 is the last index of row i in the arrays val and indx.

Parameter: ia
    One-based indexing:   Array of length m + 1, containing indices of elements in the array a, such that ia(i) is the index in the array a of the first non-zero element from the row i. The value of the last element ia(m + 1) is equal to the number of non-zeros plus one.
    Zero-based indexing:  Array of length m + 1, containing indices of elements in the array a, such that ia(i) is the index in the array a of the first non-zero element from the row i. The value of the last element ia(m) is equal to the number of non-zeros.

Parameter: ldb
    One-based indexing:   Specifies the leading dimension of b as declared in the calling (sub)program.
    Zero-based indexing:  Specifies the second dimension of b as declared in the calling (sub)program.

Parameter: ldc
    One-based indexing:   Specifies the leading dimension of c as declared in the calling (sub)program.
    Zero-based indexing:  Specifies the second dimension of c as declared in the calling (sub)program.

Difference Between Fortran and C Interfaces

Intel MKL provides both Fortran and C interfaces to all Sparse BLAS Level 2 and Level 3 routines. Parameter descriptions are common for both interfaces with the exception of data types that refer to the FORTRAN 77 standard types. Correspondence between data types specific to the Fortran and C interfaces is given below:

    Fortran        C
    REAL*4         float
    REAL*8         double
    INTEGER*4      int
    INTEGER*8      long long int
    CHARACTER      char

For routines with C interfaces all parameters (including scalars) must be passed by reference.

Another difference is how two-dimensional arrays are represented. Fortran uses column-major order, and C uses row-major order. This changes the meaning of the parameters ldb and ldc (see the table above).
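
The relationship between the one-based and zero-based index arrays listed above can be illustrated with a short C sketch. The helper names below are hypothetical and are not part of Intel MKL. The sketch assumes a square CSR matrix in the 3-array variation with m rows, and it assumes the usual correspondence between the 3-array row index ia and the pointer arrays of the 4-array variation, namely pntrb(i) = ia(i) and pntre(i) = ia(i+1).

```c
/* Hypothetical helper (not an MKL routine): shifts one-based CSR index
 * arrays ia (length m + 1) and ja (length nnz) to zero-based, in place. */
void csr_one_to_zero_based(int m, int *ia, int *ja)
{
    int nnz = ia[m] - 1;               /* one-based: ia(m+1) = nnz + 1 */
    for (int k = 0; k < nnz; ++k)
        ja[k] -= 1;                    /* column indices now start at 0 */
    for (int i = 0; i <= m; ++i)
        ia[i] -= 1;                    /* afterwards ia[m] == nnz */
}

/* Hypothetical helper: derives the pntrb/pntre arrays (each of length m)
 * from the 3-array row index ia, under the assumption stated above. */
void csr_split_row_pointers(int m, const int *ia, int *pntrb, int *pntre)
{
    for (int i = 0; i < m; ++i) {
        pntrb[i] = ia[i];              /* first entry of row i */
        pntre[i] = ia[i + 1];          /* one past the last entry of row i */
    }
}
```

After the shift, the last element of ia equals the number of non-zeros, and pntre[m-1] - pntrb[0] gives the length of the values array, in line with the zero-based column of the table.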

Differences Between Intel MKL and NIST* Interfaces

The Intel MKL Sparse BLAS Level 3 routines have the following conventional interfaces:

    mkl_xyyymm(transa, m, n, k, alpha, matdescra, arg(A), b, ldb, beta, c, ldc), for matrix-matrix product;
    mkl_xyyysm(transa, m, n, alpha, matdescra, arg(A), b, ldb, c, ldc), for triangular solvers with multiple right-hand sides.

Here x denotes the data type, and yyy denotes the sparse matrix data structure (storage format).

The analogous NIST* Sparse BLAS (NSB) library routines have the following interfaces:

    xyyymm(transa, m, n, k, alpha, descra, arg(A), b, ldb, beta, c, ldc, work, lwork), for matrix-matrix product;
    xyyysm(transa, m, n, unitd, dv, alpha, descra, arg(A), b, ldb, beta, c, ldc, work, lwork), for triangular solvers with multiple right-hand sides.

Some similar arguments are used in both libraries. The argument transa indicates what operation is performed and is slightly different in the NSB library (see Table "Parameter transa"). The arguments m and k are the number of rows and columns in the matrix A, respectively, and n is the number of columns in the matrix C. The arguments alpha and beta are the scalars alpha and beta, respectively (beta is not used in the Intel MKL triangular solvers). The arguments b and c are rectangular arrays with the leading dimensions ldb and ldc, respectively. arg(A) denotes the list of arguments that describe the sparse representation of A.

Parameter transa

                 MKL interface    NSB interface    Operation
    data type    CHARACTER*1      INTEGER
    value        N or n           0                op(A) = A
                 T or t           1                op(A) = A'
                 C or c           2                op(A) = A'

Parameter matdescra

The parameter matdescra describes the relevant characteristics of the matrix A. This manual describes matdescra as an array of six elements in line with the NIST* implementation. However, only the first four elements of the array are used in the current versions of the Intel MKL Sparse BLAS routines.
Elements matdescra(5) and matdescra(6) are reserved for future use. Note that whether matdescra is described in your application as an array of length 6 or 4 is of no importance because the array is declared as a pointer in the Intel MKL routines. To learn more about declaration of the matdescra array, see Sparse BLAS examples located in the following subdirectory of the Intel MKL installation directory: examples/spblas/. The table below lists elements of the parameter matdescra, their values and meanings. The parameter matdescra corresponds to the argument descra from the NSB library.

Possible Values of the Parameter matdescra (descra)

                   MKL interface                    NSB interface    Matrix characteristics
                   one-based       zero-based
                   indexing        indexing
    data type      CHARACTER       Char             INTEGER
    1st element    matdescra(1)    matdescra(0)     descra(1)        matrix structure
      value        G               G                0                general
                   S               S                1                symmetric (A = A')
                   H               H                2                Hermitian (A = conjg(A'))
                   T               T                3                triangular
                   A               A                4                skew(anti)-symmetric (A = -A')
                   D               D                5                diagonal
    2nd element    matdescra(2)    matdescra(1)     descra(2)        upper/lower triangular indicator
      value        L               L                1                lower
                   U               U                2                upper
    3rd element    matdescra(3)    matdescra(2)     descra(3)        main diagonal type
      value        N               N                0                non-unit
                   U               U                1                unit
    4th element    matdescra(4)    matdescra(3)                      type of indexing
      value        F                                                 one-based indexing
                                   C                                 zero-based indexing

In some cases possible element values of the parameter matdescra depend on the values of other elements. The Table "Possible Combinations of Element Values of the Parameter matdescra" lists all possible combinations of element values for both multiplication routines and triangular solvers.

Possible Combinations of Element Values of the Parameter matdescra

    Routines                   matdescra(1)    matdescra(2)    matdescra(3)    matdescra(4)

    Multiplication Routines    G               ignored         ignored         F (default) or C
                               S or H          L (default)     N (default)     F (default) or C
                               S or H          L (default)     U               F (default) or C
                               S or H          U               N (default)     F (default) or C
                               S or H          U               U               F (default) or C
                               A               L (default)     ignored         F (default) or C
                               A               U               ignored         F (default) or C

    Multiplication Routines    T               L               U               F (default) or C
    and Triangular Solvers     T               L               N               F (default) or C
                               T               U               U               F (default) or C
                               T               U               N               F (default) or C
                               D               ignored         N (default)     F (default) or C
                               D               ignored         U               F (default) or C

For a matrix in the skyline format with the main diagonal declared to be unit, diagonal elements must be stored in the sparse representation even if they are zero. In all other formats, diagonal elements can be stored (if needed) in the sparse representation if they are not zero.

Operations with Partial Matrices

One of the distinctive features of the Intel MKL Sparse BLAS routines is the possibility to perform operations only on partial matrices composed of certain parts (triangles and the main diagonal) of the input sparse matrix. This is done by properly setting the first three elements of the parameter matdescra.

An arbitrary sparse matrix A can be decomposed as

    A = L + D + U

where L is the strict lower triangle of A, U is the strict upper triangle of A, and D is the main diagonal.

Table "Output Matrices for Multiplication Routines" shows correspondence between the output matrices and values of the parameter matdescra for the sparse matrix A for multiplication routines.

Output Matrices for Multiplication Routines

    matdescra(1)    matdescra(2)    matdescra(3)    Output Matrix

    G               ignored         ignored         alpha*op(A)*x + beta*y
                                                    alpha*op(A)*B + beta*C
    S or H          L               N               alpha*op(L+D+L')*x + beta*y
                                                    alpha*op(L+D+L')*B + beta*C
    S or H          L               U               alpha*op(L+I+L')*x + beta*y
                                                    alpha*op(L+I+L')*B + beta*C
    S or H          U               N               alpha*op(U'+D+U)*x + beta*y
                                                    alpha*op(U'+D+U)*B + beta*C
    S or H          U               U               alpha*op(U'+I+U)*x + beta*y
                                                    alpha*op(U'+I+U)*B + beta*C
    T               L               U               alpha*op(L+I)*x + beta*y
                                                    alpha*op(L+I)*B + beta*C
    T               L               N               alpha*op(L+D)*x + beta*y
                                                    alpha*op(L+D)*B + beta*C
    T               U               U               alpha*op(U+I)*x + beta*y
                                                    alpha*op(U+I)*B + beta*C
    T               U               N               alpha*op(U+D)*x + beta*y
                                                    alpha*op(U+D)*B + beta*C
    A               L               ignored         alpha*op(L-L')*x + beta*y
                                                    alpha*op(L-L')*B + beta*C
    A               U               ignored         alpha*op(U-U')*x + beta*y
                                                    alpha*op(U-U')*B + beta*C
    D               ignored         N               alpha*D*x + beta*y
                                                    alpha*D*B + beta*C
    D               ignored         U               alpha*x + beta*y
                                                    alpha*B + beta*C

Table "Output Matrices for Triangular Solvers" shows correspondence between the output matrices and values of the parameter matdescra for the sparse matrix A for triangular solvers.

Output Matrices for Triangular Solvers

    matdescra(1)    matdescra(2)    matdescra(3)    Output Matrix

    T               L               N               alpha*inv(op(L+D))*x
                                                    alpha*inv(op(L+D))*B
    T               L               U               alpha*inv(op(L+I))*x
                                                    alpha*inv(op(L+I))*B
    T               U               N               alpha*inv(op(U+D))*x
                                                    alpha*inv(op(U+D))*B
    T               U               U               alpha*inv(op(U+I))*x
                                                    alpha*inv(op(U+I))*B
    D               ignored         N               alpha*inv(D)*x
                                                    alpha*inv(D)*B
    D               ignored         U               alpha*x
                                                    alpha*B

Sparse BLAS Level 2 and Level 3 Routines.

Table "Sparse BLAS Level 2 and Level 3 Routines" lists the sparse BLAS Level 2 and Level 3 routines described in more detail later in this section.

Sparse BLAS Level 2 and Level 3 Routines

Routine/Function    Description

Simplified interface, one-based indexing
mkl_?csrgemv    Computes matrix-vector product of a sparse general matrix in the CSR format (3-array variation).
mkl_?bsrgemv    Computes matrix-vector product of a sparse general matrix in the BSR format (3-array variation).
mkl_?coogemv    Computes matrix-vector product of a sparse general matrix in the coordinate format.
mkl_?diagemv    Computes matrix-vector product of a sparse general matrix in the diagonal format.
mkl_?csrsymv    Computes matrix-vector product of a sparse symmetrical matrix in the CSR format (3-array variation).
mkl_?bsrsymv    Computes matrix-vector product of a sparse symmetrical matrix in the BSR format (3-array variation).
mkl_?coosymv    Computes matrix-vector product of a sparse symmetrical matrix in the coordinate format.
mkl_?diasymv    Computes matrix-vector product of a sparse symmetrical matrix in the diagonal format.
mkl_?csrtrsv    Triangular solvers with simplified interface for a sparse matrix in the CSR format (3-array variation).
mkl_?bsrtrsv    Triangular solver with simplified interface for a sparse matrix in the BSR format (3-array variation).
mkl_?cootrsv    Triangular solvers with simplified interface for a sparse matrix in the coordinate format.
mkl_?diatrsv    Triangular solvers with simplified interface for a sparse matrix in the diagonal format.

Simplified interface, zero-based indexing
mkl_cspblas_?csrgemv    Computes matrix-vector product of a sparse general matrix in the CSR format (3-array variation) with zero-based indexing.
mkl_cspblas_?bsrgemv    Computes matrix-vector product of a sparse general matrix in the BSR format (3-array variation) with zero-based indexing.
mkl_cspblas_?coogemv    Computes matrix-vector product of a sparse general matrix in the coordinate format with zero-based indexing.
mkl_cspblas_?csrsymv    Computes matrix-vector product of a sparse symmetrical matrix in the CSR format (3-array variation) with zero-based indexing.
mkl_cspblas_?bsrsymv    Computes matrix-vector product of a sparse symmetrical matrix in the BSR format (3-array variation) with zero-based indexing.
mkl_cspblas_?coosymv    Computes matrix-vector product of a sparse symmetrical matrix in the coordinate format with zero-based indexing.
mkl_cspblas_?csrtrsv    Triangular solvers with simplified interface for a sparse matrix in the CSR format (3-array variation) with zero-based indexing.
mkl_cspblas_?bsrtrsv    Triangular solver with simplified interface for a sparse matrix in the BSR format (3-array variation) with zero-based indexing.
mkl_cspblas_?cootrsv    Triangular solver with simplified interface for a sparse matrix in the coordinate format with zero-based indexing.

Typical (conventional) interface, one-based and zero-based indexing
mkl_?csrmv    Computes matrix-vector product of a sparse matrix in the CSR format.
mkl_?bsrmv    Computes matrix-vector product of a sparse matrix in the BSR format.
mkl_?cscmv    Computes matrix-vector product for a sparse matrix in the CSC format.
mkl_?coomv    Computes matrix-vector product for a sparse matrix in the coordinate format.
mkl_?csrsv    Solves a system of linear equations for a sparse matrix in the CSR format.
mkl_?bsrsv    Solves a system of linear equations for a sparse matrix in the BSR format.
mkl_?cscsv    Solves a system of linear equations for a sparse matrix in the CSC format.
mkl_?coosv    Solves a system of linear equations for a sparse matrix in the coordinate format.
mkl_?csrmm    Computes matrix-matrix product of a sparse matrix in the CSR format.
mkl_?bsrmm    Computes matrix-matrix product of a sparse matrix in the BSR format.
mkl_?cscmm    Computes matrix-matrix product of a sparse matrix in the CSC format.
mkl_?coomm    Computes matrix-matrix product of a sparse matrix in the coordinate format.
mkl_?csrsm    Solves a system of linear matrix equations for a sparse matrix in the CSR format.
mkl_?bsrsm    Solves a system of linear matrix equations for a sparse matrix in the BSR format.
mkl_?cscsm    Solves a system of linear matrix equations for a sparse matrix in the CSC format.
mkl_?coosm    Solves a system of linear matrix equations for a sparse matrix in the coordinate format.

Typical (conventional) interface, one-based indexing
mkl_?diamv    Computes matrix-vector product of a sparse matrix in the diagonal format.
mkl_?skymv    Computes matrix-vector product for a sparse matrix in the skyline storage format.
mkl_?diasv    Solves a system of linear equations for a sparse matrix in the diagonal format.
mkl_?skysv    Solves a system of linear equations for a sparse matrix in the skyline format.
mkl_?diamm    Computes matrix-matrix product of a sparse matrix in the diagonal format.
mkl_?skymm    Computes matrix-matrix product of a sparse matrix in the skyline storage format.
mkl_?diasm    Solves a system of linear matrix equations for a sparse matrix in the diagonal format.
mkl_?skysm    Solves a system of linear matrix equations for a sparse matrix in the skyline storage format.

Auxiliary routines

Matrix converters
mkl_?dnscsr    Converts a sparse matrix in the dense representation to the CSR format (3-array variation).
mkl_?csrcoo    Converts a sparse matrix in the CSR format (3-array variation) to the coordinate format and vice versa.
mkl_?csrbsr    Converts a sparse matrix in the CSR format to the BSR format (3-array variations) and vice versa.
mkl_?csrcsc    Converts a sparse matrix in the CSR format to the CSC format and vice versa (3-array variations).
mkl_?csrdia    Converts a sparse matrix in the CSR format (3-array variation) to the diagonal format and vice versa.
mkl_?csrsky    Converts a sparse matrix in the CSR format (3-array variation) to the skyline format and vice versa.

Operations on sparse matrices
mkl_?csradd    Computes the sum of two sparse matrices stored in the CSR format (3-array variation) with one-based indexing.
mkl_?csrmultcsr    Computes the product of two sparse matrices stored in the CSR format (3-array variation) with one-based indexing.
mkl_?csrmultd    Computes the product of two sparse matrices stored in the CSR format (3-array variation) with one-based indexing. The result is stored in the dense matrix.

mkl_?csrgemv

Computes matrix-vector product of a sparse general matrix stored in the CSR format (3-array variation) with one-based indexing.

Syntax

Fortran:
    call mkl_scsrgemv(transa, m, a, ia, ja, x, y)
    call mkl_dcsrgemv(transa, m, a, ia, ja, x, y)
    call mkl_ccsrgemv(transa, m, a, ia, ja, x, y)
    call mkl_zcsrgemv(transa, m, a, ia, ja, x, y)
C:
    mkl_scsrgemv(&transa, &m, a, ia, ja, x, y);
    mkl_dcsrgemv(&transa, &m, a, ia, ja, x, y);
    mkl_ccsrgemv(&transa, &m, a, ia, ja, x, y);
    mkl_zcsrgemv(&transa, &m, a, ia, ja, x, y);

Include Files

• FORTRAN 77: mkl_spblas.fi
• C: mkl_spblas.h

Description

The mkl_?csrgemv routine performs a matrix-vector operation defined as

    y := A*x

or

    y := A'*x,

where:
x and y are vectors,
A is an m-by-m sparse square matrix in the CSR format (3-array variation), A' is the transpose of A.

NOTE This routine supports only one-based indexing of the input arrays.

Input Parameters

Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below.

transa  CHARACTER*1. Specifies the operation.
        If transa = 'N' or 'n', then y := A*x
        If transa = 'T' or 't' or 'C' or 'c', then y := A'*x.
m       INTEGER. Number of rows of the matrix A.
a       REAL for mkl_scsrgemv.
        DOUBLE PRECISION for mkl_dcsrgemv.
        COMPLEX for mkl_ccsrgemv.
        DOUBLE COMPLEX for mkl_zcsrgemv.
        Array containing non-zero elements of the matrix A. Its length is equal to the number of non-zero elements in the matrix A. Refer to values array description in Sparse Matrix Storage Formats for more details.
ia      INTEGER. Array of length m + 1, containing indices of elements in the array a, such that ia(i) is the index in the array a of the first non-zero element from the row i. The value of the last element ia(m + 1) is equal to the number of non-zeros plus one. Refer to rowIndex array description in Sparse Matrix Storage Formats for more details.
ja      INTEGER. Array containing the column indices for each non-zero element of the matrix A. Its length is equal to the length of the array a. Refer to columns array description in Sparse Matrix Storage Formats for more details.
x       REAL for mkl_scsrgemv.
        DOUBLE PRECISION for mkl_dcsrgemv.
        COMPLEX for mkl_ccsrgemv.
        DOUBLE COMPLEX for mkl_zcsrgemv.
        Array, DIMENSION is m.
        On entry, the array x must contain the vector x.

Output Parameters

y       REAL for mkl_scsrgemv.
        DOUBLE PRECISION for mkl_dcsrgemv.
        COMPLEX for mkl_ccsrgemv.
        DOUBLE COMPLEX for mkl_zcsrgemv.
        Array, DIMENSION at least m.
        On exit, the array y contains the vector y.
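
To make the roles of a, ia, and ja concrete, the operation described for mkl_?csrgemv can be sketched as a plain C reference loop. This is an illustration only and not the MKL implementation: the name ref_dcsrgemv is hypothetical, only the real double-precision case is shown, scalars are passed by value for brevity (the real C interface passes them by reference), and ia and ja are assumed to hold one-based indices as required by the routine.

```c
/* Hypothetical reference helper (not an MKL routine): computes y := A*x or
 * y := A'*x for an m-by-m matrix A in one-based 3-array CSR storage. */
void ref_dcsrgemv(char transa, int m, const double *a, const int *ia,
                  const int *ja, const double *x, double *y)
{
    int transpose = !(transa == 'N' || transa == 'n');

    for (int i = 0; i < m; ++i)
        y[i] = 0.0;

    for (int i = 0; i < m; ++i) {
        /* Row i occupies positions ia(i) .. ia(i+1)-1 (one-based) in a and ja. */
        for (int k = ia[i] - 1; k < ia[i + 1] - 1; ++k) {
            int j = ja[k] - 1;          /* one-based column -> C offset */
            if (transpose)
                y[j] += a[k] * x[i];    /* A(i,j) contributes to (A'*x)(j) */
            else
                y[i] += a[k] * x[j];    /* A(i,j) contributes to (A*x)(i)  */
        }
    }
}
```

For the 3-by-3 matrix with rows (1 0 2), (0 3 0), (4 0 5), the arrays are a = {1, 2, 3, 4, 5}, ia = {1, 3, 4, 6}, and ja = {1, 3, 2, 1, 3}.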

Interfaces

FORTRAN 77:

    SUBROUTINE mkl_scsrgemv(transa, m, a, ia, ja, x, y)
    CHARACTER*1 transa
    INTEGER m
    INTEGER ia(*), ja(*)
    REAL a(*), x(*), y(*)

    SUBROUTINE mkl_dcsrgemv(transa, m, a, ia, ja, x, y)
    CHARACTER*1 transa
    INTEGER m
    INTEGER ia(*), ja(*)
    DOUBLE PRECISION a(*), x(*), y(*)

    SUBROUTINE mkl_ccsrgemv(transa, m, a, ia, ja, x, y)
    CHARACTER*1 transa
    INTEGER m
    INTEGER ia(*), ja(*)
    COMPLEX a(*), x(*), y(*)

    SUBROUTINE mkl_zcsrgemv(transa, m, a, ia, ja, x, y)
    CHARACTER*1 transa
    INTEGER m
    INTEGER ia(*), ja(*)
    DOUBLE COMPLEX a(*), x(*), y(*)

C:

    void mkl_scsrgemv(char *transa, int *m, float *a, int *ia, int *ja, float *x, float *y);
    void mkl_dcsrgemv(char *transa, int *m, double *a, int *ia, int *ja, double *x, double *y);
    void mkl_ccsrgemv(char *transa, int *m, MKL_Complex8 *a, int *ia, int *ja, MKL_Complex8 *x, MKL_Complex8 *y);
    void mkl_zcsrgemv(char *transa, int *m, MKL_Complex16 *a, int *ia, int *ja, MKL_Complex16 *x, MKL_Complex16 *y);

mkl_?bsrgemv

Computes matrix-vector product of a sparse general matrix stored in the BSR format (3-array variation) with one-based indexing.

Syntax

Fortran:
    call mkl_sbsrgemv(transa, m, lb, a, ia, ja, x, y)
    call mkl_dbsrgemv(transa, m, lb, a, ia, ja, x, y)
    call mkl_cbsrgemv(transa, m, lb, a, ia, ja, x, y)
    call mkl_zbsrgemv(transa, m, lb, a, ia, ja, x, y)
C:
    mkl_sbsrgemv(&transa, &m, &lb, a, ia, ja, x, y);
    mkl_dbsrgemv(&transa, &m, &lb, a, ia, ja, x, y);
    mkl_cbsrgemv(&transa, &m, &lb, a, ia, ja, x, y);
    mkl_zbsrgemv(&transa, &m, &lb, a, ia, ja, x, y);

Include Files

• FORTRAN 77: mkl_spblas.fi
• C: mkl_spblas.h

Description

The mkl_?bsrgemv routine performs a matrix-vector operation defined as

    y := A*x

or

    y := A'*x,

where:
x and y are vectors,
A is an m-by-m block sparse square matrix in the BSR format (3-array variation), A' is the transpose of A.

NOTE This routine supports only one-based indexing of the input arrays.
Input Parameters Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below. transa CHARACTER*1. Specifies the operation. If transa = 'N' or 'n', then the matrix-vector product is computed as y := A*x. If transa = 'T' or 't' or 'C' or 'c', then the matrix-vector product is computed as y := A'*x. m INTEGER. Number of block rows of the matrix A. lb INTEGER. Size of the block in the matrix A. a REAL for mkl_sbsrgemv. DOUBLE PRECISION for mkl_dbsrgemv. COMPLEX for mkl_cbsrgemv. DOUBLE COMPLEX for mkl_zbsrgemv. Array containing elements of non-zero blocks of the matrix A. Its length is equal to the number of non-zero blocks in the matrix A multiplied by lb*lb. Refer to values array description in BSR Format for more details. ia INTEGER. Array of length (m + 1), containing indices of blocks in the array a, such that ia(i) is the index in the array a of the first non-zero element from the row i. The value of the last element ia(m + 1) is equal to the number of non-zero blocks plus one. Refer to rowIndex array description in BSR Format for more details. ja INTEGER. Array containing the column indices for each non-zero block in the matrix A. Its length is equal to the number of non-zero blocks of the matrix A. Refer to columns array description in BSR Format for more details. x REAL for mkl_sbsrgemv. DOUBLE PRECISION for mkl_dbsrgemv. COMPLEX for mkl_cbsrgemv. DOUBLE COMPLEX for mkl_zbsrgemv. Array, DIMENSION (m*lb). On entry, the array x must contain the vector x. Output Parameters y REAL for mkl_sbsrgemv. DOUBLE PRECISION for mkl_dbsrgemv. COMPLEX for mkl_cbsrgemv. DOUBLE COMPLEX for mkl_zbsrgemv. Array, DIMENSION at least (m*lb). On exit, the array y contains the vector y.
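The block layout described above can be illustrated with a short C sketch of the non-transposed product. The helper name ref_dbsrgemv is hypothetical and this is not the MKL implementation. Two assumptions are made that this section does not state: ia is read as the one-based position of the first non-zero block of each block row, and the lb*lb elements of each block are taken to be stored column by column, as is usual for Fortran-style one-based data. Check the BSR Format description before relying on either.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference helper, not an MKL routine: y := A*x for a
 * matrix with m block rows of lb-by-lb blocks in 3-array BSR.
 * Assumes one-based ia/ja counted in blocks and column-major blocks. */
void ref_dbsrgemv(int m, int lb, const double *a, const int *ia,
                  const int *ja, const double *x, double *y)
{
    assert(m >= 0 && lb > 0 && ia[0] == 1);
    for (int i = 0; i < m * lb; ++i)
        y[i] = 0.0;

    for (int ib = 0; ib < m; ++ib) {
        /* Blocks of block row ib+1 sit at one-based ia[ib] .. ia[ib+1]-1. */
        for (int k = ia[ib] - 1; k < ia[ib + 1] - 1; ++k) {
            const double *blk = a + (size_t)k * lb * lb; /* start of block k */
            int jb = ja[k] - 1;                          /* block column */
            for (int c = 0; c < lb; ++c)
                for (int r = 0; r < lb; ++r)
                    y[ib * lb + r] += blk[c * lb + r] * x[jb * lb + c];
        }
    }
}
```

Both x and y have m*lb entries, matching the DIMENSION given for them in the parameter list.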
Interfaces FORTRAN 77: SUBROUTINE mkl_sbsrgemv(transa, m, lb, a, ia, ja, x, y) CHARACTER*1 transa INTEGER m, lb INTEGER ia(*), ja(*) REAL a(*), x(*), y(*) SUBROUTINE mkl_dbsrgemv(transa, m, lb, a, ia, ja, x, y) CHARACTER*1 transa INTEGER m, lb INTEGER ia(*), ja(*) DOUBLE PRECISION a(*), x(*), y(*) SUBROUTINE mkl_cbsrgemv(transa, m, lb, a, ia, ja, x, y) CHARACTER*1 transa INTEGER m, lb INTEGER ia(*), ja(*) COMPLEX a(*), x(*), y(*) SUBROUTINE mkl_zbsrgemv(transa, m, lb, a, ia, ja, x, y) CHARACTER*1 transa INTEGER m, lb INTEGER ia(*), ja(*) DOUBLE COMPLEX a(*), x(*), y(*) C: void mkl_dbsrgemv(char *transa, int *m, int *lb, double *a, int *ia, int *ja, double *x, double *y); void mkl_sbsrgemv(char *transa, int *m, int *lb, float *a, int *ia, int *ja, float *x, float *y); void mkl_cbsrgemv(char *transa, int *m, int *lb, MKL_Complex8 *a, int *ia, int *ja, MKL_Complex8 *x, MKL_Complex8 *y); void mkl_zbsrgemv(char *transa, int *m, int *lb, MKL_Complex16 *a, int *ia, int *ja, MKL_Complex16 *x, MKL_Complex16 *y); mkl_?coogemv Computes matrix-vector product of a sparse general matrix stored in the coordinate format with one-based indexing.
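As a rough illustration of the coordinate-format product, the following C sketch walks the nnz stored triples once, in whatever order they are given. The helper name ref_dcoogemv is hypothetical and the loop is a reference for the semantics only, not the MKL implementation; the argument names follow the parameter list of this routine, and any transa value other than 'N' or 'n' is treated as a request for the transposed product.

```c
#include <assert.h>

/* Hypothetical reference helper, not an MKL routine: y := A*x or
 * y := A'*x for an m-by-m matrix given as nnz (row, column, value)
 * triples with one-based rowind and colind, in arbitrary order. */
void ref_dcoogemv(char transa, int m, const double *val, const int *rowind,
                  const int *colind, int nnz, const double *x, double *y)
{
    int no_trans = (transa == 'N' || transa == 'n');

    for (int i = 0; i < m; ++i)
        y[i] = 0.0;

    for (int k = 0; k < nnz; ++k) {
        int r = rowind[k] - 1;          /* one-based -> C index */
        int c = colind[k] - 1;
        assert(r >= 0 && r < m && c >= 0 && c < m);
        if (no_trans)
            y[r] += val[k] * x[c];
        else
            y[c] += val[k] * x[r];      /* transpose swaps the roles */
    }
}
```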
Syntax Fortran: call mkl_scoogemv(transa, m, val, rowind, colind, nnz, x, y) call mkl_dcoogemv(transa, m, val, rowind, colind, nnz, x, y) call mkl_ccoogemv(transa, m, val, rowind, colind, nnz, x, y) call mkl_zcoogemv(transa, m, val, rowind, colind, nnz, x, y) C: mkl_scoogemv(&transa, &m, val, rowind, colind, &nnz, x, y); mkl_dcoogemv(&transa, &m, val, rowind, colind, &nnz, x, y); mkl_ccoogemv(&transa, &m, val, rowind, colind, &nnz, x, y); mkl_zcoogemv(&transa, &m, val, rowind, colind, &nnz, x, y); Include Files • FORTRAN 77: mkl_spblas.fi • C: mkl_spblas.h Description The mkl_?coogemv routine performs a matrix-vector operation defined as y := A*x or y := A'*x, where: x and y are vectors, A is an m-by-m sparse square matrix in the coordinate format, A' is the transpose of A. NOTE This routine supports only one-based indexing of the input arrays. Input Parameters Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below. transa CHARACTER*1. Specifies the operation. If transa = 'N' or 'n', then the matrix-vector product is computed as y := A*x. If transa = 'T' or 't' or 'C' or 'c', then the matrix-vector product is computed as y := A'*x. m INTEGER. Number of rows of the matrix A. val REAL for mkl_scoogemv. DOUBLE PRECISION for mkl_dcoogemv. COMPLEX for mkl_ccoogemv. DOUBLE COMPLEX for mkl_zcoogemv. Array of length nnz, contains non-zero elements of the matrix A in arbitrary order. Refer to values array description in Coordinate Format for more details. rowind INTEGER. Array of length nnz, contains the row indices for each non-zero element of the matrix A. Refer to rows array description in Coordinate Format for more details. colind INTEGER. Array of length nnz, contains the column indices for each non-zero element of the matrix A.
Refer to columns array description in Coordinate Format for more details. nnz INTEGER. Specifies the number of non-zero elements of the matrix A. Refer to nnz description in Coordinate Format for more details. x REAL for mkl_scoogemv. DOUBLE PRECISION for mkl_dcoogemv. COMPLEX for mkl_ccoogemv. DOUBLE COMPLEX for mkl_zcoogemv. Array, DIMENSION is m. On entry, the array x must contain the vector x. Output Parameters y REAL for mkl_scoogemv. DOUBLE PRECISION for mkl_dcoogemv. COMPLEX for mkl_ccoogemv. DOUBLE COMPLEX for mkl_zcoogemv. Array, DIMENSION at least m. On exit, the array y contains the vector y. Interfaces FORTRAN 77: SUBROUTINE mkl_scoogemv(transa, m, val, rowind, colind, nnz, x, y) CHARACTER*1 transa INTEGER m, nnz INTEGER rowind(*), colind(*) REAL val(*), x(*), y(*) SUBROUTINE mkl_dcoogemv(transa, m, val, rowind, colind, nnz, x, y) CHARACTER*1 transa INTEGER m, nnz INTEGER rowind(*), colind(*) DOUBLE PRECISION val(*), x(*), y(*) SUBROUTINE mkl_ccoogemv(transa, m, val, rowind, colind, nnz, x, y) CHARACTER*1 transa INTEGER m, nnz INTEGER rowind(*), colind(*) COMPLEX val(*), x(*), y(*) SUBROUTINE mkl_zcoogemv(transa, m, val, rowind, colind, nnz, x, y) CHARACTER*1 transa INTEGER m, nnz INTEGER rowind(*), colind(*) DOUBLE COMPLEX val(*), x(*), y(*) C: void mkl_scoogemv(char *transa, int *m, float *val, int *rowind, int *colind, int *nnz, float *x, float *y); void mkl_dcoogemv(char *transa, int *m, double *val, int *rowind, int *colind, int *nnz, double *x, double *y); void mkl_ccoogemv(char *transa, int *m, MKL_Complex8 *val, int *rowind, int *colind, int *nnz, MKL_Complex8 *x, MKL_Complex8 *y); void mkl_zcoogemv(char *transa, int *m, MKL_Complex16 *val, int *rowind, int *colind, int *nnz, MKL_Complex16 *x, MKL_Complex16 *y); mkl_?diagemv Computes matrix-vector product of a sparse general matrix stored in the diagonal format with one-based indexing.
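A small C sketch may help picture the diagonal-format product before reading the parameter list. The helper name ref_ddiagemv is hypothetical and this is not the MKL implementation. It assumes a layout that this section only references: val is an lval-by-ndiag array stored column by column (Fortran order), column k holds the diagonal at distance idiag(k) from the main diagonal (positive above, negative below), and the element A(i, i + idiag(k)) is kept in val(i, k), with unused positions padded. Confirm this against the Diagonal Storage Scheme description before relying on it.

```c
#include <assert.h>

/* Hypothetical reference helper, not an MKL routine: y := A*x for an
 * m-by-m matrix in diagonal storage. Assumes val(i,k) = A(i, i+idiag(k))
 * with val stored column-major and leading dimension lval >= m. */
void ref_ddiagemv(int m, const double *val, int lval, const int *idiag,
                  int ndiag, const double *x, double *y)
{
    assert(m >= 0 && lval >= m);
    for (int i = 0; i < m; ++i)
        y[i] = 0.0;

    for (int k = 0; k < ndiag; ++k) {
        int d = idiag[k];               /* distance from the main diagonal */
        for (int i = 0; i < m; ++i) {
            int j = i + d;              /* column hit by this diagonal in row i */
            if (j >= 0 && j < m)        /* skip the padded positions */
                y[i] += val[k * lval + i] * x[j];
        }
    }
}
```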
Syntax Fortran: call mkl_sdiagemv(transa, m, val, lval, idiag, ndiag, x, y) call mkl_ddiagemv(transa, m, val, lval, idiag, ndiag, x, y) call mkl_cdiagemv(transa, m, val, lval, idiag, ndiag, x, y) call mkl_zdiagemv(transa, m, val, lval, idiag, ndiag, x, y) C: mkl_sdiagemv(&transa, &m, val, &lval, idiag, &ndiag, x, y); mkl_ddiagemv(&transa, &m, val, &lval, idiag, &ndiag, x, y); mkl_cdiagemv(&transa, &m, val, &lval, idiag, &ndiag, x, y); mkl_zdiagemv(&transa, &m, val, &lval, idiag, &ndiag, x, y); Include Files • FORTRAN 77: mkl_spblas.fi • C: mkl_spblas.h Description The mkl_?diagemv routine performs a matrix-vector operation defined as y := A*x or y := A'*x, where: x and y are vectors, A is an m-by-m sparse square matrix in the diagonal storage format, A' is the transpose of A. NOTE This routine supports only one-based indexing of the input arrays. Input Parameters Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below. transa CHARACTER*1. Specifies the operation. If transa = 'N' or 'n', then y := A*x. If transa = 'T' or 't' or 'C' or 'c', then y := A'*x. m INTEGER. Number of rows of the matrix A. val REAL for mkl_sdiagemv. DOUBLE PRECISION for mkl_ddiagemv. COMPLEX for mkl_cdiagemv. DOUBLE COMPLEX for mkl_zdiagemv. Two-dimensional array of size lval by ndiag, contains non-zero diagonals of the matrix A. Refer to values array description in Diagonal Storage Scheme for more details. lval INTEGER. Leading dimension of val, lval ≥ m. Refer to lval description in Diagonal Storage Scheme for more details. idiag INTEGER. Array of length ndiag, contains the distances between the main diagonal and each non-zero diagonal in the matrix A. Refer to distance array description in Diagonal Storage Scheme for more details. ndiag INTEGER.
Specifies the number of non-zero diagonals of the matrix A. x REAL for mkl_sdiagemv. DOUBLE PRECISION for mkl_ddiagemv. COMPLEX for mkl_cdiagemv. DOUBLE COMPLEX for mkl_zdiagemv. Array, DIMENSION is m. On entry, the array x must contain the vector x. Output Parameters y REAL for mkl_sdiagemv. DOUBLE PRECISION for mkl_ddiagemv. COMPLEX for mkl_cdiagemv. DOUBLE COMPLEX for mkl_zdiagemv. Array, DIMENSION at least m. On exit, the array y contains the vector y. Interfaces FORTRAN 77: SUBROUTINE mkl_sdiagemv(transa, m, val, lval, idiag, ndiag, x, y) CHARACTER*1 transa INTEGER m, lval, ndiag INTEGER idiag(*) REAL val(lval,*), x(*), y(*) SUBROUTINE mkl_ddiagemv(transa, m, val, lval, idiag, ndiag, x, y) CHARACTER*1 transa INTEGER m, lval, ndiag INTEGER idiag(*) DOUBLE PRECISION val(lval,*), x(*), y(*) SUBROUTINE mkl_cdiagemv(transa, m, val, lval, idiag, ndiag, x, y) CHARACTER*1 transa INTEGER m, lval, ndiag INTEGER idiag(*) COMPLEX val(lval,*), x(*), y(*) SUBROUTINE mkl_zdiagemv(transa, m, val, lval, idiag, ndiag, x, y) CHARACTER*1 transa INTEGER m, lval, ndiag INTEGER idiag(*) DOUBLE COMPLEX val(lval,*), x(*), y(*) C: void mkl_sdiagemv(char *transa, int *m, float *val, int *lval, int *idiag, int *ndiag, float *x, float *y); void mkl_ddiagemv(char *transa, int *m, double *val, int *lval, int *idiag, int *ndiag, double *x, double *y); void mkl_cdiagemv(char *transa, int *m, MKL_Complex8 *val, int *lval, int *idiag, int *ndiag, MKL_Complex8 *x, MKL_Complex8 *y); void mkl_zdiagemv(char *transa, int *m, MKL_Complex16 *val, int *lval, int *idiag, int *ndiag, MKL_Complex16 *x, MKL_Complex16 *y); mkl_?csrsymv Computes matrix-vector product of a sparse symmetrical matrix stored in the CSR format (3-array variation) with one-based indexing.
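The idea behind a symmetric product from one stored triangle can be sketched in C as follows. The helper name ref_dcsrsymv is hypothetical and this is not the MKL implementation. Each stored off-diagonal entry a(i,j) is used twice, once for row i and once for the mirrored position (j,i). Skipping entries that fall outside the triangle selected by uplo is a choice made for this sketch; how MKL treats such entries is not stated here.

```c
#include <assert.h>

/* Hypothetical reference helper, not an MKL routine: y := A*x where only
 * the uplo triangle of the symmetric m-by-m matrix A is stored in
 * 3-array CSR with one-based ia and ja. */
void ref_dcsrsymv(char uplo, int m, const double *a, const int *ia,
                  const int *ja, const double *x, double *y)
{
    int upper = (uplo == 'U' || uplo == 'u');

    assert(m >= 0 && ia[0] == 1);
    for (int i = 0; i < m; ++i)
        y[i] = 0.0;

    for (int i = 0; i < m; ++i) {
        for (int k = ia[i] - 1; k < ia[i + 1] - 1; ++k) {
            int j = ja[k] - 1;
            if (upper ? (j < i) : (j > i))
                continue;               /* sketch choice: ignore other triangle */
            y[i] += a[k] * x[j];        /* stored entry (i, j) */
            if (j != i)
                y[j] += a[k] * x[i];    /* mirrored entry (j, i) */
        }
    }
}
```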
Syntax Fortran: call mkl_scsrsymv(uplo, m, a, ia, ja, x, y) call mkl_dcsrsymv(uplo, m, a, ia, ja, x, y) call mkl_ccsrsymv(uplo, m, a, ia, ja, x, y) call mkl_zcsrsymv(uplo, m, a, ia, ja, x, y) C: mkl_scsrsymv(&uplo, &m, a, ia, ja, x, y); mkl_dcsrsymv(&uplo, &m, a, ia, ja, x, y); mkl_ccsrsymv(&uplo, &m, a, ia, ja, x, y); mkl_zcsrsymv(&uplo, &m, a, ia, ja, x, y); Include Files • FORTRAN 77: mkl_spblas.fi • C: mkl_spblas.h Description The mkl_?csrsymv routine performs a matrix-vector operation defined as y := A*x where: x and y are vectors, A is an upper or lower triangle of the symmetrical sparse matrix in the CSR format (3-array variation). NOTE This routine supports only one-based indexing of the input arrays. Input Parameters Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below. uplo CHARACTER*1. Specifies whether the upper or lower triangle of the matrix A is used. If uplo = 'U' or 'u', then the upper triangle of the matrix A is used. If uplo = 'L' or 'l', then the lower triangle of the matrix A is used. m INTEGER. Number of rows of the matrix A. a REAL for mkl_scsrsymv. DOUBLE PRECISION for mkl_dcsrsymv. COMPLEX for mkl_ccsrsymv. DOUBLE COMPLEX for mkl_zcsrsymv. Array containing non-zero elements of the matrix A. Its length is equal to the number of non-zero elements in the matrix A. Refer to values array description in Sparse Matrix Storage Formats for more details. ia INTEGER. Array of length m + 1, containing indices of elements in the array a, such that ia(i) is the index in the array a of the first non-zero element from the row i. The value of the last element ia(m + 1) is equal to the number of non-zeros plus one. Refer to rowIndex array description in Sparse Matrix Storage Formats for more details. ja INTEGER.
Array containing the column indices for each non-zero element of the matrix A. Its length is equal to the length of the array a. Refer to columns array description in Sparse Matrix Storage Formats for more details. x REAL for mkl_scsrsymv. DOUBLE PRECISION for mkl_dcsrsymv. COMPLEX for mkl_ccsrsymv. DOUBLE COMPLEX for mkl_zcsrsymv. Array, DIMENSION is m. On entry, the array x must contain the vector x. Output Parameters y REAL for mkl_scsrsymv. DOUBLE PRECISION for mkl_dcsrsymv. COMPLEX for mkl_ccsrsymv. DOUBLE COMPLEX for mkl_zcsrsymv. Array, DIMENSION at least m. On exit, the array y contains the vector y. Interfaces FORTRAN 77: SUBROUTINE mkl_scsrsymv(uplo, m, a, ia, ja, x, y) CHARACTER*1 uplo INTEGER m INTEGER ia(*), ja(*) REAL a(*), x(*), y(*) SUBROUTINE mkl_dcsrsymv(uplo, m, a, ia, ja, x, y) CHARACTER*1 uplo INTEGER m INTEGER ia(*), ja(*) DOUBLE PRECISION a(*), x(*), y(*) SUBROUTINE mkl_ccsrsymv(uplo, m, a, ia, ja, x, y) CHARACTER*1 uplo INTEGER m INTEGER ia(*), ja(*) COMPLEX a(*), x(*), y(*) SUBROUTINE mkl_zcsrsymv(uplo, m, a, ia, ja, x, y) CHARACTER*1 uplo INTEGER m INTEGER ia(*), ja(*) DOUBLE COMPLEX a(*), x(*), y(*) C: void mkl_scsrsymv(char *uplo, int *m, float *a, int *ia, int *ja, float *x, float *y); void mkl_dcsrsymv(char *uplo, int *m, double *a, int *ia, int *ja, double *x, double *y); void mkl_ccsrsymv(char *uplo, int *m, MKL_Complex8 *a, int *ia, int *ja, MKL_Complex8 *x, MKL_Complex8 *y); void mkl_zcsrsymv(char *uplo, int *m, MKL_Complex16 *a, int *ia, int *ja, MKL_Complex16 *x, MKL_Complex16 *y); mkl_?bsrsymv Computes matrix-vector product of a sparse symmetrical matrix stored in the BSR format (3-array variation) with one-based indexing.
Syntax Fortran: call mkl_sbsrsymv(uplo, m, lb, a, ia, ja, x, y) call mkl_dbsrsymv(uplo, m, lb, a, ia, ja, x, y) call mkl_cbsrsymv(uplo, m, lb, a, ia, ja, x, y) call mkl_zbsrsymv(uplo, m, lb, a, ia, ja, x, y) C: mkl_sbsrsymv(&uplo, &m, &lb, a, ia, ja, x, y); mkl_dbsrsymv(&uplo, &m, &lb, a, ia, ja, x, y); mkl_cbsrsymv(&uplo, &m, &lb, a, ia, ja, x, y); mkl_zbsrsymv(&uplo, &m, &lb, a, ia, ja, x, y); Include Files • FORTRAN 77: mkl_spblas.fi • C: mkl_spblas.h Description The mkl_?bsrsymv routine performs a matrix-vector operation defined as y := A*x where: x and y are vectors, A is an upper or lower triangle of the symmetrical sparse matrix in the BSR format (3-array variation). NOTE This routine supports only one-based indexing of the input arrays. Input Parameters Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below. uplo CHARACTER*1. Specifies whether the upper or lower triangle of the matrix A is considered. If uplo = 'U' or 'u', then the upper triangle of the matrix A is used. If uplo = 'L' or 'l', then the lower triangle of the matrix A is used. m INTEGER. Number of block rows of the matrix A. lb INTEGER. Size of the block in the matrix A. a REAL for mkl_sbsrsymv. DOUBLE PRECISION for mkl_dbsrsymv. COMPLEX for mkl_cbsrsymv. DOUBLE COMPLEX for mkl_zbsrsymv. Array containing elements of non-zero blocks of the matrix A. Its length is equal to the number of non-zero blocks in the matrix A multiplied by lb*lb. Refer to values array description in BSR Format for more details. ia INTEGER. Array of length (m + 1), containing indices of blocks in the array a, such that ia(i) is the index in the array a of the first non-zero element from the row i.
The value of the last element ia(m + 1) is equal to the number of non-zero blocks plus one. Refer to rowIndex array description in BSR Format for more details. ja INTEGER. Array containing the column indices for each non-zero block in the matrix A. Its length is equal to the number of non-zero blocks of the matrix A. Refer to columns array description in BSR Format for more details. x REAL for mkl_sbsrsymv. DOUBLE PRECISION for mkl_dbsrsymv. COMPLEX for mkl_cbsrsymv. DOUBLE COMPLEX for mkl_zbsrsymv. Array, DIMENSION (m*lb). On entry, the array x must contain the vector x. Output Parameters y REAL for mkl_sbsrsymv. DOUBLE PRECISION for mkl_dbsrsymv. COMPLEX for mkl_cbsrsymv. DOUBLE COMPLEX for mkl_zbsrsymv. Array, DIMENSION at least (m*lb). On exit, the array y contains the vector y. Interfaces FORTRAN 77: SUBROUTINE mkl_sbsrsymv(uplo, m, lb, a, ia, ja, x, y) CHARACTER*1 uplo INTEGER m, lb INTEGER ia(*), ja(*) REAL a(*), x(*), y(*) SUBROUTINE mkl_dbsrsymv(uplo, m, lb, a, ia, ja, x, y) CHARACTER*1 uplo INTEGER m, lb INTEGER ia(*), ja(*) DOUBLE PRECISION a(*), x(*), y(*) SUBROUTINE mkl_cbsrsymv(uplo, m, lb, a, ia, ja, x, y) CHARACTER*1 uplo INTEGER m, lb INTEGER ia(*), ja(*) COMPLEX a(*), x(*), y(*) SUBROUTINE mkl_zbsrsymv(uplo, m, lb, a, ia, ja, x, y) CHARACTER*1 uplo INTEGER m, lb INTEGER ia(*), ja(*) DOUBLE COMPLEX a(*), x(*), y(*) C: void mkl_sbsrsymv(char *uplo, int *m, int *lb, float *a, int *ia, int *ja, float *x, float *y); void mkl_dbsrsymv(char *uplo, int *m, int *lb, double *a, int *ia, int *ja, double *x, double *y); void mkl_cbsrsymv(char *uplo, int *m, int *lb, MKL_Complex8 *a, int *ia, int *ja, MKL_Complex8 *x, MKL_Complex8 *y); void mkl_zbsrsymv(char *uplo, int *m, int *lb, MKL_Complex16 *a, int *ia, int *ja, MKL_Complex16 *x, MKL_Complex16 *y); mkl_?coosymv Computes matrix-vector product of a sparse symmetrical matrix stored in the coordinate format with one-based indexing.
Syntax Fortran: call mkl_scoosymv(uplo, m, val, rowind, colind, nnz, x, y) call mkl_dcoosymv(uplo, m, val, rowind, colind, nnz, x, y) call mkl_ccoosymv(uplo, m, val, rowind, colind, nnz, x, y) call mkl_zcoosymv(uplo, m, val, rowind, colind, nnz, x, y) C: mkl_scoosymv(&uplo, &m, val, rowind, colind, &nnz, x, y); mkl_dcoosymv(&uplo, &m, val, rowind, colind, &nnz, x, y); mkl_ccoosymv(&uplo, &m, val, rowind, colind, &nnz, x, y); mkl_zcoosymv(&uplo, &m, val, rowind, colind, &nnz, x, y); Include Files • FORTRAN 77: mkl_spblas.fi • C: mkl_spblas.h Description The mkl_?coosymv routine performs a matrix-vector operation defined as y := A*x where: x and y are vectors, A is an upper or lower triangle of the symmetrical sparse matrix in the coordinate format. NOTE This routine supports only one-based indexing of the input arrays. Input Parameters Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below. uplo CHARACTER*1. Specifies whether the upper or lower triangle of the matrix A is used. If uplo = 'U' or 'u', then the upper triangle of the matrix A is used. If uplo = 'L' or 'l', then the lower triangle of the matrix A is used. m INTEGER. Number of rows of the matrix A. val REAL for mkl_scoosymv. DOUBLE PRECISION for mkl_dcoosymv. COMPLEX for mkl_ccoosymv. DOUBLE COMPLEX for mkl_zcoosymv. Array of length nnz, contains non-zero elements of the matrix A in arbitrary order. Refer to values array description in Coordinate Format for more details. rowind INTEGER. Array of length nnz, contains the row indices for each non-zero element of the matrix A. Refer to rows array description in Coordinate Format for more details. colind INTEGER. Array of length nnz, contains the column indices for each non-zero element of the matrix A.
Refer to columns array description in Coordinate Format for more details. nnz INTEGER. Specifies the number of non-zero elements of the matrix A. Refer to nnz description in Coordinate Format for more details. x REAL for mkl_scoosymv. DOUBLE PRECISION for mkl_dcoosymv. COMPLEX for mkl_ccoosymv. DOUBLE COMPLEX for mkl_zcoosymv. Array, DIMENSION is m. On entry, the array x must contain the vector x. Output Parameters y REAL for mkl_scoosymv. DOUBLE PRECISION for mkl_dcoosymv. COMPLEX for mkl_ccoosymv. DOUBLE COMPLEX for mkl_zcoosymv. Array, DIMENSION at least m. On exit, the array y contains the vector y. Interfaces FORTRAN 77: SUBROUTINE mkl_scoosymv(uplo, m, val, rowind, colind, nnz, x, y) CHARACTER*1 uplo INTEGER m, nnz INTEGER rowind(*), colind(*) REAL val(*), x(*), y(*) SUBROUTINE mkl_dcoosymv(uplo, m, val, rowind, colind, nnz, x, y) CHARACTER*1 uplo INTEGER m, nnz INTEGER rowind(*), colind(*) DOUBLE PRECISION val(*), x(*), y(*) SUBROUTINE mkl_ccoosymv(uplo, m, val, rowind, colind, nnz, x, y) CHARACTER*1 uplo INTEGER m, nnz INTEGER rowind(*), colind(*) COMPLEX val(*), x(*), y(*) SUBROUTINE mkl_zcoosymv(uplo, m, val, rowind, colind, nnz, x, y) CHARACTER*1 uplo INTEGER m, nnz INTEGER rowind(*), colind(*) DOUBLE COMPLEX val(*), x(*), y(*) C: void mkl_scoosymv(char *uplo, int *m, float *val, int *rowind, int *colind, int *nnz, float *x, float *y); void mkl_dcoosymv(char *uplo, int *m, double *val, int *rowind, int *colind, int *nnz, double *x, double *y); void mkl_ccoosymv(char *uplo, int *m, MKL_Complex8 *val, int *rowind, int *colind, int *nnz, MKL_Complex8 *x, MKL_Complex8 *y); void mkl_zcoosymv(char *uplo, int *m, MKL_Complex16 *val, int *rowind, int *colind, int *nnz, MKL_Complex16 *x, MKL_Complex16 *y); mkl_?diasymv Computes matrix-vector product of a sparse symmetrical matrix stored in the diagonal format with one-based indexing.
Syntax Fortran: call mkl_sdiasymv(uplo, m, val, lval, idiag, ndiag, x, y) call mkl_ddiasymv(uplo, m, val, lval, idiag, ndiag, x, y) call mkl_cdiasymv(uplo, m, val, lval, idiag, ndiag, x, y) call mkl_zdiasymv(uplo, m, val, lval, idiag, ndiag, x, y) C: mkl_sdiasymv(&uplo, &m, val, &lval, idiag, &ndiag, x, y); mkl_ddiasymv(&uplo, &m, val, &lval, idiag, &ndiag, x, y); mkl_cdiasymv(&uplo, &m, val, &lval, idiag, &ndiag, x, y); mkl_zdiasymv(&uplo, &m, val, &lval, idiag, &ndiag, x, y); Include Files • FORTRAN 77: mkl_spblas.fi • C: mkl_spblas.h Description The mkl_?diasymv routine performs a matrix-vector operation defined as y := A*x where: x and y are vectors, A is an upper or lower triangle of the symmetrical sparse matrix in the diagonal storage format. NOTE This routine supports only one-based indexing of the input arrays. Input Parameters Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below. uplo CHARACTER*1. Specifies whether the upper or lower triangle of the matrix A is used. If uplo = 'U' or 'u', then the upper triangle of the matrix A is used. If uplo = 'L' or 'l', then the lower triangle of the matrix A is used. m INTEGER. Number of rows of the matrix A. val REAL for mkl_sdiasymv. DOUBLE PRECISION for mkl_ddiasymv. COMPLEX for mkl_cdiasymv. DOUBLE COMPLEX for mkl_zdiasymv. Two-dimensional array of size lval by ndiag, contains non-zero diagonals of the matrix A. Refer to values array description in Diagonal Storage Scheme for more details. lval INTEGER. Leading dimension of val, lval ≥ m. Refer to lval description in Diagonal Storage Scheme for more details. idiag INTEGER. Array of length ndiag, contains the distances between the main diagonal and each non-zero diagonal in the matrix A.
Refer to distance array description in Diagonal Storage Scheme for more details. ndiag INTEGER. Specifies the number of non-zero diagonals of the matrix A. x REAL for mkl_sdiasymv. DOUBLE PRECISION for mkl_ddiasymv. COMPLEX for mkl_cdiasymv. DOUBLE COMPLEX for mkl_zdiasymv. Array, DIMENSION is m. On entry, the array x must contain the vector x. Output Parameters y REAL for mkl_sdiasymv. DOUBLE PRECISION for mkl_ddiasymv. COMPLEX for mkl_cdiasymv. DOUBLE COMPLEX for mkl_zdiasymv. Array, DIMENSION at least m. On exit, the array y contains the vector y. Interfaces FORTRAN 77: SUBROUTINE mkl_sdiasymv(uplo, m, val, lval, idiag, ndiag, x, y) CHARACTER*1 uplo INTEGER m, lval, ndiag INTEGER idiag(*) REAL val(lval,*), x(*), y(*) SUBROUTINE mkl_ddiasymv(uplo, m, val, lval, idiag, ndiag, x, y) CHARACTER*1 uplo INTEGER m, lval, ndiag INTEGER idiag(*) DOUBLE PRECISION val(lval,*), x(*), y(*) SUBROUTINE mkl_cdiasymv(uplo, m, val, lval, idiag, ndiag, x, y) CHARACTER*1 uplo INTEGER m, lval, ndiag INTEGER idiag(*) COMPLEX val(lval,*), x(*), y(*) SUBROUTINE mkl_zdiasymv(uplo, m, val, lval, idiag, ndiag, x, y) CHARACTER*1 uplo INTEGER m, lval, ndiag INTEGER idiag(*) DOUBLE COMPLEX val(lval,*), x(*), y(*) C: void mkl_sdiasymv(char *uplo, int *m, float *val, int *lval, int *idiag, int *ndiag, float *x, float *y); void mkl_ddiasymv(char *uplo, int *m, double *val, int *lval, int *idiag, int *ndiag, double *x, double *y); void mkl_cdiasymv(char *uplo, int *m, MKL_Complex8 *val, int *lval, int *idiag, int *ndiag, MKL_Complex8 *x, MKL_Complex8 *y); void mkl_zdiasymv(char *uplo, int *m, MKL_Complex16 *val, int *lval, int *idiag, int *ndiag, MKL_Complex16 *x, MKL_Complex16 *y); mkl_?csrtrsv Triangular solvers with simplified interface for a sparse matrix in the CSR format (3-array variation) with one-based indexing.
Syntax Fortran: call mkl_scsrtrsv(uplo, transa, diag, m, a, ia, ja, x, y) call mkl_dcsrtrsv(uplo, transa, diag, m, a, ia, ja, x, y) call mkl_ccsrtrsv(uplo, transa, diag, m, a, ia, ja, x, y) call mkl_zcsrtrsv(uplo, transa, diag, m, a, ia, ja, x, y) C: mkl_scsrtrsv(&uplo, &transa, &diag, &m, a, ia, ja, x, y); mkl_dcsrtrsv(&uplo, &transa, &diag, &m, a, ia, ja, x, y); mkl_ccsrtrsv(&uplo, &transa, &diag, &m, a, ia, ja, x, y); mkl_zcsrtrsv(&uplo, &transa, &diag, &m, a, ia, ja, x, y); Include Files • FORTRAN 77: mkl_spblas.fi • C: mkl_spblas.h Description The mkl_?csrtrsv routine solves a system of linear equations with matrix-vector operations for a sparse matrix stored in the CSR format (3-array variation): A*y = x or A'*y = x, where: x and y are vectors, A is a sparse upper or lower triangular matrix with unit or non-unit main diagonal, A' is the transpose of A. NOTE This routine supports only one-based indexing of the input arrays. Input Parameters Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below. uplo CHARACTER*1. Specifies whether the upper or lower triangle of the matrix A is used. If uplo = 'U' or 'u', then the upper triangle of the matrix A is used. If uplo = 'L' or 'l', then the lower triangle of the matrix A is used. transa CHARACTER*1. Specifies the system of linear equations. If transa = 'N' or 'n', then A*y = x. If transa = 'T' or 't' or 'C' or 'c', then A'*y = x. diag CHARACTER*1. Specifies whether A is unit triangular. If diag = 'U' or 'u', then A is unit triangular. If diag = 'N' or 'n', then A is not unit triangular. m INTEGER. Number of rows of the matrix A. a REAL for mkl_scsrtrsv. DOUBLE PRECISION for mkl_dcsrtrsv. COMPLEX for mkl_ccsrtrsv. DOUBLE COMPLEX for mkl_zcsrtrsv. Array containing non-zero elements of the matrix A.
Its length is equal to the number of non-zero elements in the matrix A. Refer to values array description in Sparse Matrix Storage Formats for more details. NOTE The non-zero elements of the given row of the matrix must be stored in the same order as they appear in the row (from left to right). No diagonal element can be omitted from a sparse storage if the solver is called with the non-unit indicator. ia INTEGER. Array of length m + 1, containing indices of elements in the array a, such that ia(i) is the index in the array a of the first non-zero element from the row i. The value of the last element ia(m + 1) is equal to the number of non-zeros plus one. Refer to rowIndex array description in Sparse Matrix Storage Formats for more details. ja INTEGER. Array containing the column indices for each non-zero element of the matrix A. Its length is equal to the length of the array a. Refer to columns array description in Sparse Matrix Storage Formats for more details. NOTE Column indices must be sorted in increasing order for each row. x REAL for mkl_scsrtrsv. DOUBLE PRECISION for mkl_dcsrtrsv. COMPLEX for mkl_ccsrtrsv. DOUBLE COMPLEX for mkl_zcsrtrsv. Array, DIMENSION is m. On entry, the array x must contain the vector x. Output Parameters y REAL for mkl_scsrtrsv. DOUBLE PRECISION for mkl_dcsrtrsv. COMPLEX for mkl_ccsrtrsv. DOUBLE COMPLEX for mkl_zcsrtrsv. Array, DIMENSION at least m. Contains the vector y.
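The solve described above amounts to forward substitution for a lower triangle and backward substitution for an upper triangle. The following C sketch shows the non-transposed case A*y = x only. The helper name ref_dcsrtrsv is hypothetical and this is not the MKL implementation. With the unit indicator, any stored diagonal entry is ignored and treated as 1; with the non-unit indicator, the sketch asserts that the diagonal entry is present, in line with the note on the array a.

```c
#include <assert.h>

/* Hypothetical reference helper, not an MKL routine: solves A*y = x for
 * a triangular m-by-m matrix in 3-array CSR with one-based ia and ja.
 * Entries on the wrong side of the diagonal are ignored. */
void ref_dcsrtrsv(char uplo, char diag, int m, const double *a,
                  const int *ia, const int *ja, const double *x, double *y)
{
    int upper = (uplo == 'U' || uplo == 'u');
    int unit = (diag == 'U' || diag == 'u');

    assert(m >= 0 && ia[0] == 1);
    for (int s = 0; s < m; ++s) {
        /* Lower: rows 1..m (forward). Upper: rows m..1 (backward). */
        int i = upper ? m - 1 - s : s;
        double sum = x[i];
        double d = 1.0;
        int have_diag = 0;

        for (int k = ia[i] - 1; k < ia[i + 1] - 1; ++k) {
            int j = ja[k] - 1;
            if (j == i) {
                have_diag = 1;
                if (!unit)
                    d = a[k];           /* non-unit: divide by stored diagonal */
            } else if (upper ? (j > i) : (j < i)) {
                sum -= a[k] * y[j];     /* y[j] was computed in an earlier step */
            }
        }
        assert(unit || have_diag);      /* diagonal must be stored if non-unit */
        y[i] = sum / d;
    }
}
```

Because each row only reads entries of y that were produced in earlier steps, x and y must be distinct arrays in this sketch.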
Interfaces FORTRAN 77: SUBROUTINE mkl_scsrtrsv(uplo, transa, diag, m, a, ia, ja, x, y) CHARACTER*1 uplo, transa, diag INTEGER m INTEGER ia(*), ja(*) REAL a(*), x(*), y(*) SUBROUTINE mkl_dcsrtrsv(uplo, transa, diag, m, a, ia, ja, x, y) CHARACTER*1 uplo, transa, diag INTEGER m INTEGER ia(*), ja(*) DOUBLE PRECISION a(*), x(*), y(*) SUBROUTINE mkl_ccsrtrsv(uplo, transa, diag, m, a, ia, ja, x, y) CHARACTER*1 uplo, transa, diag INTEGER m INTEGER ia(*), ja(*) COMPLEX a(*), x(*), y(*) SUBROUTINE mkl_zcsrtrsv(uplo, transa, diag, m, a, ia, ja, x, y) CHARACTER*1 uplo, transa, diag INTEGER m INTEGER ia(*), ja(*) DOUBLE COMPLEX a(*), x(*), y(*) C: void mkl_scsrtrsv(char *uplo, char *transa, char *diag, int *m, float *a, int *ia, int *ja, float *x, float *y); void mkl_dcsrtrsv(char *uplo, char *transa, char *diag, int *m, double *a, int *ia, int *ja, double *x, double *y); void mkl_ccsrtrsv(char *uplo, char *transa, char *diag, int *m, MKL_Complex8 *a, int *ia, int *ja, MKL_Complex8 *x, MKL_Complex8 *y); void mkl_zcsrtrsv(char *uplo, char *transa, char *diag, int *m, MKL_Complex16 *a, int *ia, int *ja, MKL_Complex16 *x, MKL_Complex16 *y); mkl_?bsrtrsv Triangular solver with simplified interface for a sparse matrix stored in the BSR format (3-array variation) with one-based indexing.
Syntax Fortran: call mkl_sbsrtrsv(uplo, transa, diag, m, lb, a, ia, ja, x, y) call mkl_dbsrtrsv(uplo, transa, diag, m, lb, a, ia, ja, x, y) call mkl_cbsrtrsv(uplo, transa, diag, m, lb, a, ia, ja, x, y) call mkl_zbsrtrsv(uplo, transa, diag, m, lb, a, ia, ja, x, y) C: mkl_sbsrtrsv(&uplo, &transa, &diag, &m, &lb, a, ia, ja, x, y); mkl_dbsrtrsv(&uplo, &transa, &diag, &m, &lb, a, ia, ja, x, y); mkl_cbsrtrsv(&uplo, &transa, &diag, &m, &lb, a, ia, ja, x, y); mkl_zbsrtrsv(&uplo, &transa, &diag, &m, &lb, a, ia, ja, x, y); Include Files • FORTRAN 77: mkl_spblas.fi • C: mkl_spblas.h Description The mkl_?bsrtrsv routine solves a system of linear equations with matrix-vector operations for a sparse matrix stored in the BSR format (3-array variation): A*y = x or A'*y = x, where: x and y are vectors, A is a sparse upper or lower triangular matrix with unit or non-unit main diagonal, A' is the transpose of A. NOTE This routine supports only one-based indexing of the input arrays. Input Parameters Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below. uplo CHARACTER*1. Specifies whether the upper or lower triangle of the matrix A is used. If uplo = 'U' or 'u', then the upper triangle of the matrix A is used. If uplo = 'L' or 'l', then the lower triangle of the matrix A is used. transa CHARACTER*1. Specifies the system of linear equations. If transa = 'N' or 'n', then A*y = x. If transa = 'T' or 't' or 'C' or 'c', then A'*y = x. diag CHARACTER*1. Specifies whether A is a unit triangular matrix. If diag = 'U' or 'u', then A is unit triangular. If diag = 'N' or 'n', then A is not unit triangular. m INTEGER. Number of block rows of the matrix A. lb INTEGER.
Size of the block in the matrix A.
a REAL for mkl_sbsrtrsv. DOUBLE PRECISION for mkl_dbsrtrsv. COMPLEX for mkl_cbsrtrsv. DOUBLE COMPLEX for mkl_zbsrtrsv. Array containing elements of non-zero blocks of the matrix A. Its length is equal to the number of non-zero blocks in the matrix A multiplied by lb*lb. Refer to values array description in BSR Format for more details.
NOTE The non-zero elements of the given row of the matrix must be stored in the same order as they appear in the row (from left to right). No diagonal element can be omitted from a sparse storage if the solver is called with the non-unit indicator.
ia INTEGER. Array of length (m + 1), containing indices of blocks in the array a, such that ia(I) is the index in the array a of the first non-zero element from the row I. The value of the last element ia(m + 1) is equal to the number of non-zero blocks plus one. Refer to rowIndex array description in BSR Format for more details.
ja INTEGER. Array containing the column indices for each non-zero block in the matrix A. Its length is equal to the number of non-zero blocks of the matrix A. Refer to columns array description in BSR Format for more details.
x REAL for mkl_sbsrtrsv. DOUBLE PRECISION for mkl_dbsrtrsv. COMPLEX for mkl_cbsrtrsv. DOUBLE COMPLEX for mkl_zbsrtrsv. Array, DIMENSION (m*lb). On entry, the array x must contain the vector x.

Output Parameters
y REAL for mkl_sbsrtrsv. DOUBLE PRECISION for mkl_dbsrtrsv. COMPLEX for mkl_cbsrtrsv. DOUBLE COMPLEX for mkl_zbsrtrsv. Array, DIMENSION at least (m*lb). On exit, the array y contains the vector y.
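The length rules stated above for a, ia, ja, x and y can be checked before calling the routine. The following C sketch is only an illustration: the helper names bsr_block_count, bsr_values_length and bsr_vector_length are hypothetical and are not part of Intel MKL. It assumes a one-based row index array ia, as this routine requires.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helpers for illustration; they are not part of Intel MKL. */

/* Number of non-zero blocks implied by a one-based row index array ia of
 * length m + 1: the last entry equals the block count plus one.
 * This is also the required length of ja. */
size_t bsr_block_count(const int *ia, int m)
{
    assert(m >= 0 && ia[0] == 1 && ia[m] >= 1);
    return (size_t)(ia[m] - 1);
}

/* Required length of the values array a: one lb-by-lb block per entry. */
size_t bsr_values_length(const int *ia, int m, int lb)
{
    assert(lb > 0);
    return bsr_block_count(ia, m) * (size_t)lb * (size_t)lb;
}

/* Required length of the vectors x and y. */
size_t bsr_vector_length(int m, int lb)
{
    assert(m >= 0 && lb > 0);
    return (size_t)m * (size_t)lb;
}
```

For example, with m = 2, lb = 2 and ia = (1, 2, 4), the matrix holds 3 non-zero blocks, so a has 12 elements, ja has 3 elements, and x and y have 4 elements each.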

Interfaces
FORTRAN 77:
SUBROUTINE mkl_sbsrtrsv(uplo, transa, diag, m, lb, a, ia, ja, x, y)
    CHARACTER*1 uplo, transa, diag
    INTEGER m, lb
    INTEGER ia(*), ja(*)
    REAL a(*), x(*), y(*)
SUBROUTINE mkl_dbsrtrsv(uplo, transa, diag, m, lb, a, ia, ja, x, y)
    CHARACTER*1 uplo, transa, diag
    INTEGER m, lb
    INTEGER ia(*), ja(*)
    DOUBLE PRECISION a(*), x(*), y(*)
SUBROUTINE mkl_cbsrtrsv(uplo, transa, diag, m, lb, a, ia, ja, x, y)
    CHARACTER*1 uplo, transa, diag
    INTEGER m, lb
    INTEGER ia(*), ja(*)
    COMPLEX a(*), x(*), y(*)
SUBROUTINE mkl_zbsrtrsv(uplo, transa, diag, m, lb, a, ia, ja, x, y)
    CHARACTER*1 uplo, transa, diag
    INTEGER m, lb
    INTEGER ia(*), ja(*)
    DOUBLE COMPLEX a(*), x(*), y(*)
C:
void mkl_sbsrtrsv(char *uplo, char *transa, char *diag, int *m, int *lb, float *a, int *ia, int *ja, float *x, float *y);
void mkl_dbsrtrsv(char *uplo, char *transa, char *diag, int *m, int *lb, double *a, int *ia, int *ja, double *x, double *y);
void mkl_cbsrtrsv(char *uplo, char *transa, char *diag, int *m, int *lb, MKL_Complex8 *a, int *ia, int *ja, MKL_Complex8 *x, MKL_Complex8 *y);
void mkl_zbsrtrsv(char *uplo, char *transa, char *diag, int *m, int *lb, MKL_Complex16 *a, int *ia, int *ja, MKL_Complex16 *x, MKL_Complex16 *y);

mkl_?cootrsv
Triangular solvers with simplified interface for a sparse matrix in the coordinate format with one-based indexing.
Syntax
Fortran:
call mkl_scootrsv(uplo, transa, diag, m, val, rowind, colind, nnz, x, y)
call mkl_dcootrsv(uplo, transa, diag, m, val, rowind, colind, nnz, x, y)
call mkl_ccootrsv(uplo, transa, diag, m, val, rowind, colind, nnz, x, y)
call mkl_zcootrsv(uplo, transa, diag, m, val, rowind, colind, nnz, x, y)
C:
mkl_scootrsv(&uplo, &transa, &diag, &m, val, rowind, colind, &nnz, x, y);
mkl_dcootrsv(&uplo, &transa, &diag, &m, val, rowind, colind, &nnz, x, y);
mkl_ccootrsv(&uplo, &transa, &diag, &m, val, rowind, colind, &nnz, x, y);
mkl_zcootrsv(&uplo, &transa, &diag, &m, val, rowind, colind, &nnz, x, y);

Include Files
• FORTRAN 77: mkl_spblas.fi
• C: mkl_spblas.h

Description
The mkl_?cootrsv routine solves a system of linear equations with matrix-vector operations for a sparse matrix stored in the coordinate format:
A*y = x or A'*y = x,
where:
x and y are vectors,
A is a sparse upper or lower triangular matrix with unit or non-unit main diagonal, A' is the transpose of A.
NOTE This routine supports only one-based indexing of the input arrays.

Input Parameters
Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below.
uplo CHARACTER*1. Specifies whether the upper or lower triangle of the matrix A is considered. If uplo = 'U' or 'u', then the upper triangle of the matrix A is used. If uplo = 'L' or 'l', then the lower triangle of the matrix A is used.
transa CHARACTER*1. Specifies the system of linear equations. If transa = 'N' or 'n', then A*y = x. If transa = 'T' or 't' or 'C' or 'c', then A'*y = x.
diag CHARACTER*1. Specifies whether A is unit triangular. If diag = 'U' or 'u', then A is unit triangular. If diag = 'N' or 'n', then A is not unit triangular.
m INTEGER. Number of rows of the matrix A.
val REAL for mkl_scootrsv. DOUBLE PRECISION for mkl_dcootrsv. COMPLEX for mkl_ccootrsv. DOUBLE COMPLEX for mkl_zcootrsv. Array of length nnz, contains non-zero elements of the matrix A in arbitrary order. Refer to values array description in Coordinate Format for more details.
rowind INTEGER. Array of length nnz, contains the row indices for each non-zero element of the matrix A. Refer to rows array description in Coordinate Format for more details.
colind INTEGER. Array of length nnz, contains the column indices for each non-zero element of the matrix A. Refer to columns array description in Coordinate Format for more details.
nnz INTEGER. Specifies the number of non-zero elements of the matrix A. Refer to nnz description in Coordinate Format for more details.
x REAL for mkl_scootrsv. DOUBLE PRECISION for mkl_dcootrsv. COMPLEX for mkl_ccootrsv. DOUBLE COMPLEX for mkl_zcootrsv. Array, DIMENSION is m. On entry, the array x must contain the vector x.

Output Parameters
y REAL for mkl_scootrsv. DOUBLE PRECISION for mkl_dcootrsv. COMPLEX for mkl_ccootrsv. DOUBLE COMPLEX for mkl_zcootrsv. Array, DIMENSION at least m. Contains the vector y.
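To make the meaning of the arguments concrete, the C sketch below performs the same kind of computation for one case only: double precision, lower triangle, no transposition (A*y = x). It is a plain reference loop written for this illustration, not the MKL implementation, and the name coo_lower_solve and the flag unit_diag are hypothetical. It assumes that entries stored above the diagonal are ignored and that, for a unit diagonal, stored diagonal entries are ignored as well.

```c
#include <assert.h>

/* Hypothetical reference routine for illustration; not part of Intel MKL.
 * Solves A*y = x by forward substitution for a lower triangular matrix A
 * given in coordinate format with one-based indices in arbitrary order.
 * unit_diag != 0 treats the main diagonal as all ones. */
void coo_lower_solve(int unit_diag, int m, const double *val,
                     const int *rowind, const int *colind, int nnz,
                     const double *x, double *y)
{
    assert(m >= 0 && nnz >= 0);
    for (int i = 1; i <= m; ++i) {
        double sum = x[i - 1];
        double d = unit_diag ? 1.0 : 0.0;
        /* Entries are unordered, so scan the whole list for row i. */
        for (int k = 0; k < nnz; ++k) {
            if (rowind[k] != i)
                continue;
            if (colind[k] < i)
                sum -= val[k] * y[colind[k] - 1]; /* already solved unknown */
            else if (colind[k] == i && !unit_diag)
                d = val[k];                       /* diagonal element */
            /* Assumption: entries above the diagonal are skipped. */
        }
        assert(d != 0.0); /* a non-unit solve needs every diagonal element */
        y[i - 1] = sum / d;
    }
}
```

The scan over all nnz entries for every row keeps the sketch short at a cost of O(m*nnz) work; a practical solver would first group the entries by row.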

Interfaces
FORTRAN 77:
SUBROUTINE mkl_scootrsv(uplo, transa, diag, m, val, rowind, colind, nnz, x, y)
    CHARACTER*1 uplo, transa, diag
    INTEGER m, nnz
    INTEGER rowind(*), colind(*)
    REAL val(*), x(*), y(*)
SUBROUTINE mkl_dcootrsv(uplo, transa, diag, m, val, rowind, colind, nnz, x, y)
    CHARACTER*1 uplo, transa, diag
    INTEGER m, nnz
    INTEGER rowind(*), colind(*)
    DOUBLE PRECISION val(*), x(*), y(*)
SUBROUTINE mkl_ccootrsv(uplo, transa, diag, m, val, rowind, colind, nnz, x, y)
    CHARACTER*1 uplo, transa, diag
    INTEGER m, nnz
    INTEGER rowind(*), colind(*)
    COMPLEX val(*), x(*), y(*)
SUBROUTINE mkl_zcootrsv(uplo, transa, diag, m, val, rowind, colind, nnz, x, y)
    CHARACTER*1 uplo, transa, diag
    INTEGER m, nnz
    INTEGER rowind(*), colind(*)
    DOUBLE COMPLEX val(*), x(*), y(*)
C:
void mkl_scootrsv(char *uplo, char *transa, char *diag, int *m, float *val, int *rowind, int *colind, int *nnz, float *x, float *y);
void mkl_dcootrsv(char *uplo, char *transa, char *diag, int *m, double *val, int *rowind, int *colind, int *nnz, double *x, double *y);
void mkl_ccootrsv(char *uplo, char *transa, char *diag, int *m, MKL_Complex8 *val, int *rowind, int *colind, int *nnz, MKL_Complex8 *x, MKL_Complex8 *y);
void mkl_zcootrsv(char *uplo, char *transa, char *diag, int *m, MKL_Complex16 *val, int *rowind, int *colind, int *nnz, MKL_Complex16 *x, MKL_Complex16 *y);

mkl_?diatrsv
Triangular solvers with simplified interface for a sparse matrix in the diagonal format with one-based indexing.
Syntax
Fortran:
call mkl_sdiatrsv(uplo, transa, diag, m, val, lval, idiag, ndiag, x, y)
call mkl_ddiatrsv(uplo, transa, diag, m, val, lval, idiag, ndiag, x, y)
call mkl_cdiatrsv(uplo, transa, diag, m, val, lval, idiag, ndiag, x, y)
call mkl_zdiatrsv(uplo, transa, diag, m, val, lval, idiag, ndiag, x, y)
C:
mkl_sdiatrsv(&uplo, &transa, &diag, &m, val, &lval, idiag, &ndiag, x, y);
mkl_ddiatrsv(&uplo, &transa, &diag, &m, val, &lval, idiag, &ndiag, x, y);
mkl_cdiatrsv(&uplo, &transa, &diag, &m, val, &lval, idiag, &ndiag, x, y);
mkl_zdiatrsv(&uplo, &transa, &diag, &m, val, &lval, idiag, &ndiag, x, y);

Include Files
• FORTRAN 77: mkl_spblas.fi
• C: mkl_spblas.h

Description
The mkl_?diatrsv routine solves a system of linear equations with matrix-vector operations for a sparse matrix stored in the diagonal format:
A*y = x or A'*y = x,
where:
x and y are vectors,
A is a sparse upper or lower triangular matrix with unit or non-unit main diagonal, A' is the transpose of A.
NOTE This routine supports only one-based indexing of the input arrays.

Input Parameters
Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below.
uplo CHARACTER*1. Specifies whether the upper or lower triangle of the matrix A is used. If uplo = 'U' or 'u', then the upper triangle of the matrix A is used. If uplo = 'L' or 'l', then the lower triangle of the matrix A is used.
transa CHARACTER*1. Specifies the system of linear equations. If transa = 'N' or 'n', then A*y = x. If transa = 'T' or 't' or 'C' or 'c', then A'*y = x.
diag CHARACTER*1. Specifies whether A is unit triangular. If diag = 'U' or 'u', then A is unit triangular. If diag = 'N' or 'n', then A is not unit triangular.
m INTEGER. Number of rows of the matrix A.
val REAL for mkl_sdiatrsv. DOUBLE PRECISION for mkl_ddiatrsv.
COMPLEX for mkl_cdiatrsv. DOUBLE COMPLEX for mkl_zdiatrsv. Two-dimensional array of size lval by ndiag, contains non-zero diagonals of the matrix A. Refer to values array description in Diagonal Storage Scheme for more details.
lval INTEGER. Leading dimension of val, lval=m. Refer to lval description in Diagonal Storage Scheme for more details.
idiag INTEGER. Array of length ndiag, contains the distances between the main diagonal and each non-zero diagonal in the matrix A.
NOTE All elements of this array must be sorted in increasing order.
Refer to distance array description in Diagonal Storage Scheme for more details.
ndiag INTEGER. Specifies the number of non-zero diagonals of the matrix A.
x REAL for mkl_sdiatrsv. DOUBLE PRECISION for mkl_ddiatrsv. COMPLEX for mkl_cdiatrsv. DOUBLE COMPLEX for mkl_zdiatrsv. Array, DIMENSION is m. On entry, the array x must contain the vector x.

Output Parameters
y REAL for mkl_sdiatrsv. DOUBLE PRECISION for mkl_ddiatrsv. COMPLEX for mkl_cdiatrsv. DOUBLE COMPLEX for mkl_zdiatrsv. Array, DIMENSION at least m. Contains the vector y.
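The roles of val, lval, idiag and ndiag may be easier to follow from a small reference loop. The C sketch below handles one case only: double precision, upper triangle, non-unit diagonal, no transposition. It is not the MKL implementation, and the name dia_upper_solve is hypothetical. It assumes the layout described in Diagonal Storage Scheme as understood here: val is stored column by column (Fortran order), and element A(i, i + idiag(k)) is held in val(i, k), with unused positions padded.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference routine for illustration; not part of Intel MKL.
 * Solves A*y = x by back substitution for an upper triangular matrix A in
 * diagonal format. Assumed layout: val is lval-by-ndiag in column-major
 * order and val(i, k) holds A(i, i + idiag(k)), one-based. */
void dia_upper_solve(int m, const double *val, int lval,
                     const int *idiag, int ndiag,
                     const double *x, double *y)
{
    assert(m >= 0 && lval >= m && ndiag >= 0);
    for (int i = m; i >= 1; --i) {
        double sum = x[i - 1];
        double d = 0.0;
        for (int k = 0; k < ndiag; ++k) {
            int j = i + idiag[k];          /* column of this entry in row i */
            if (idiag[k] < 0 || j > m)
                continue;                  /* lower diagonal or padding */
            double a = val[(size_t)k * (size_t)lval + (size_t)(i - 1)];
            if (idiag[k] == 0)
                d = a;                     /* main diagonal element */
            else
                sum -= a * y[j - 1];       /* already solved unknown */
        }
        assert(d != 0.0);
        y[i - 1] = sum / d;
    }
}
```

For the 3-by-3 matrix with main diagonal (2, 3, 4) and first superdiagonal (1, 1), the inputs would be idiag = (0, 1), ndiag = 2, lval = 3 and val = (2, 3, 4, 1, 1, 0), where the trailing 0 is padding.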

Interfaces
FORTRAN 77:
SUBROUTINE mkl_sdiatrsv(uplo, transa, diag, m, val, lval, idiag, ndiag, x, y)
    CHARACTER*1 uplo, transa, diag
    INTEGER m, lval, ndiag
    INTEGER idiag(*)
    REAL val(lval,*), x(*), y(*)
SUBROUTINE mkl_ddiatrsv(uplo, transa, diag, m, val, lval, idiag, ndiag, x, y)
    CHARACTER*1 uplo, transa, diag
    INTEGER m, lval, ndiag
    INTEGER idiag(*)
    DOUBLE PRECISION val(lval,*), x(*), y(*)
SUBROUTINE mkl_cdiatrsv(uplo, transa, diag, m, val, lval, idiag, ndiag, x, y)
    CHARACTER*1 uplo, transa, diag
    INTEGER m, lval, ndiag
    INTEGER idiag(*)
    COMPLEX val(lval,*), x(*), y(*)
SUBROUTINE mkl_zdiatrsv(uplo, transa, diag, m, val, lval, idiag, ndiag, x, y)
    CHARACTER*1 uplo, transa, diag
    INTEGER m, lval, ndiag
    INTEGER idiag(*)
    DOUBLE COMPLEX val(lval,*), x(*), y(*)
C:
void mkl_sdiatrsv(char *uplo, char *transa, char *diag, int *m, float *val, int *lval, int *idiag, int *ndiag, float *x, float *y);
void mkl_ddiatrsv(char *uplo, char *transa, char *diag, int *m, double *val, int *lval, int *idiag, int *ndiag, double *x, double *y);
void mkl_cdiatrsv(char *uplo, char *transa, char *diag, int *m, MKL_Complex8 *val, int *lval, int *idiag, int *ndiag, MKL_Complex8 *x, MKL_Complex8 *y);
void mkl_zdiatrsv(char *uplo, char *transa, char *diag, int *m, MKL_Complex16 *val, int *lval, int *idiag, int *ndiag, MKL_Complex16 *x, MKL_Complex16 *y);

mkl_cspblas_?csrgemv
Computes matrix-vector product of a sparse general matrix stored in the CSR format (3-array variation) with zero-based indexing.
Syntax
Fortran:
call mkl_cspblas_scsrgemv(transa, m, a, ia, ja, x, y)
call mkl_cspblas_dcsrgemv(transa, m, a, ia, ja, x, y)
call mkl_cspblas_ccsrgemv(transa, m, a, ia, ja, x, y)
call mkl_cspblas_zcsrgemv(transa, m, a, ia, ja, x, y)
C:
mkl_cspblas_scsrgemv(&transa, &m, a, ia, ja, x, y);
mkl_cspblas_dcsrgemv(&transa, &m, a, ia, ja, x, y);
mkl_cspblas_ccsrgemv(&transa, &m, a, ia, ja, x, y);
mkl_cspblas_zcsrgemv(&transa, &m, a, ia, ja, x, y);

Include Files
• FORTRAN 77: mkl_spblas.fi
• C: mkl_spblas.h

Description
The mkl_cspblas_?csrgemv routine performs a matrix-vector operation defined as
y := A*x or y := A'*x,
where:
x and y are vectors,
A is an m-by-m sparse square matrix in the CSR format (3-array variation) with zero-based indexing, A' is the transpose of A.
NOTE This routine supports only zero-based indexing of the input arrays.

Input Parameters
Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below.
transa CHARACTER*1. Specifies the operation. If transa = 'N' or 'n', then the matrix-vector product is computed as y := A*x. If transa = 'T' or 't' or 'C' or 'c', then the matrix-vector product is computed as y := A'*x.
m INTEGER. Number of rows of the matrix A.
a REAL for mkl_cspblas_scsrgemv. DOUBLE PRECISION for mkl_cspblas_dcsrgemv. COMPLEX for mkl_cspblas_ccsrgemv. DOUBLE COMPLEX for mkl_cspblas_zcsrgemv. Array containing non-zero elements of the matrix A. Its length is equal to the number of non-zero elements in the matrix A. Refer to values array description in Sparse Matrix Storage Formats for more details.
ia INTEGER. Array of length m + 1, containing indices of elements in the array a, such that ia(I) is the index in the array a of the first non-zero element from the row I.
The value of the last element ia(m) is equal to the number of non-zeros. Refer to rowIndex array description in Sparse Matrix Storage Formats for more details.
ja INTEGER. Array containing the column indices for each non-zero element of the matrix A. Its length is equal to the length of the array a. Refer to columns array description in Sparse Matrix Storage Formats for more details.
x REAL for mkl_cspblas_scsrgemv. DOUBLE PRECISION for mkl_cspblas_dcsrgemv. COMPLEX for mkl_cspblas_ccsrgemv. DOUBLE COMPLEX for mkl_cspblas_zcsrgemv. Array, DIMENSION is m. On entry, the array x must contain the vector x.

Output Parameters
y REAL for mkl_cspblas_scsrgemv. DOUBLE PRECISION for mkl_cspblas_dcsrgemv. COMPLEX for mkl_cspblas_ccsrgemv. DOUBLE COMPLEX for mkl_cspblas_zcsrgemv. Array, DIMENSION at least m. On exit, the array y contains the vector y.

Interfaces
FORTRAN 77:
SUBROUTINE mkl_cspblas_scsrgemv(transa, m, a, ia, ja, x, y)
    CHARACTER*1 transa
    INTEGER m
    INTEGER ia(*), ja(*)
    REAL a(*), x(*), y(*)
SUBROUTINE mkl_cspblas_dcsrgemv(transa, m, a, ia, ja, x, y)
    CHARACTER*1 transa
    INTEGER m
    INTEGER ia(*), ja(*)
    DOUBLE PRECISION a(*), x(*), y(*)
SUBROUTINE mkl_cspblas_ccsrgemv(transa, m, a, ia, ja, x, y)
    CHARACTER*1 transa
    INTEGER m
    INTEGER ia(*), ja(*)
    COMPLEX a(*), x(*), y(*)
SUBROUTINE mkl_cspblas_zcsrgemv(transa, m, a, ia, ja, x, y)
    CHARACTER*1 transa
    INTEGER m
    INTEGER ia(*), ja(*)
    DOUBLE COMPLEX a(*), x(*), y(*)
C:
void mkl_cspblas_scsrgemv(char *transa, int *m, float *a, int *ia, int *ja, float *x, float *y);
void mkl_cspblas_dcsrgemv(char *transa, int *m, double *a, int *ia, int *ja, double *x, double *y);
void mkl_cspblas_ccsrgemv(char *transa, int *m, MKL_Complex8 *a, int *ia, int *ja, MKL_Complex8 *x, MKL_Complex8 *y);
void mkl_cspblas_zcsrgemv(char *transa, int *m, MKL_Complex16 *a, int *ia, int *ja, MKL_Complex16 *x, MKL_Complex16 *y);

mkl_cspblas_?bsrgemv
Computes matrix-vector product of a sparse general matrix stored in
the BSR format (3-array variation) with zero-based indexing.

Syntax
Fortran:
call mkl_cspblas_sbsrgemv(transa, m, lb, a, ia, ja, x, y)
call mkl_cspblas_dbsrgemv(transa, m, lb, a, ia, ja, x, y)
call mkl_cspblas_cbsrgemv(transa, m, lb, a, ia, ja, x, y)
call mkl_cspblas_zbsrgemv(transa, m, lb, a, ia, ja, x, y)
C:
mkl_cspblas_sbsrgemv(&transa, &m, &lb, a, ia, ja, x, y);
mkl_cspblas_dbsrgemv(&transa, &m, &lb, a, ia, ja, x, y);
mkl_cspblas_cbsrgemv(&transa, &m, &lb, a, ia, ja, x, y);
mkl_cspblas_zbsrgemv(&transa, &m, &lb, a, ia, ja, x, y);

Include Files
• FORTRAN 77: mkl_spblas.fi
• C: mkl_spblas.h

Description
The mkl_cspblas_?bsrgemv routine performs a matrix-vector operation defined as
y := A*x or y := A'*x,
where:
x and y are vectors,
A is an m-by-m block sparse square matrix in the BSR format (3-array variation) with zero-based indexing, A' is the transpose of A.
NOTE This routine supports only zero-based indexing of the input arrays.

Input Parameters
Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below.
transa CHARACTER*1. Specifies the operation. If transa = 'N' or 'n', then the matrix-vector product is computed as y := A*x. If transa = 'T' or 't' or 'C' or 'c', then the matrix-vector product is computed as y := A'*x.
m INTEGER. Number of block rows of the matrix A.
lb INTEGER. Size of the block in the matrix A.
a REAL for mkl_cspblas_sbsrgemv. DOUBLE PRECISION for mkl_cspblas_dbsrgemv. COMPLEX for mkl_cspblas_cbsrgemv. DOUBLE COMPLEX for mkl_cspblas_zbsrgemv. Array containing elements of non-zero blocks of the matrix A. Its length is equal to the number of non-zero blocks in the matrix A multiplied by lb*lb. Refer to values array description in BSR Format for more details.
ia INTEGER.
Array of length (m + 1), containing indices of blocks in the array a, such that ia(i) is the index in the array a of the first non-zero element from the row i. The value of the last element ia(m + 1) is equal to the number of non-zero blocks. Refer to rowIndex array description in BSR Format for more details.
ja INTEGER. Array containing the column indices for each non-zero block in the matrix A. Its length is equal to the number of non-zero blocks of the matrix A. Refer to columns array description in BSR Format for more details.
x REAL for mkl_cspblas_sbsrgemv. DOUBLE PRECISION for mkl_cspblas_dbsrgemv. COMPLEX for mkl_cspblas_cbsrgemv. DOUBLE COMPLEX for mkl_cspblas_zbsrgemv. Array, DIMENSION (m*lb). On entry, the array x must contain the vector x.

Output Parameters
y REAL for mkl_cspblas_sbsrgemv. DOUBLE PRECISION for mkl_cspblas_dbsrgemv. COMPLEX for mkl_cspblas_cbsrgemv. DOUBLE COMPLEX for mkl_cspblas_zbsrgemv. Array, DIMENSION at least (m*lb). On exit, the array y contains the vector y.
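As an illustration of how the zero-based BSR arrays described above drive the product y := A*x, here is a plain C reference loop for the double-precision, non-transposed case. It is a sketch, not the MKL implementation, and the name bsr0_gemv is hypothetical. One point is an assumption that should be checked against BSR Format: the sketch takes the lb*lb elements of each block to be stored row by row when zero-based indexing is used.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference routine for illustration; not part of Intel MKL.
 * Computes y := A*x for a matrix with m block rows of lb-by-lb blocks in
 * zero-based BSR (3-array) storage.
 * Assumption: elements inside each block are stored row by row. */
void bsr0_gemv(int m, int lb, const double *a, const int *ia,
               const int *ja, const double *x, double *y)
{
    assert(m >= 0 && lb > 0);
    for (int i = 0; i < m * lb; ++i)
        y[i] = 0.0;
    for (int bi = 0; bi < m; ++bi) {
        /* Blocks of block row bi occupy positions ia[bi] .. ia[bi+1]-1. */
        for (int k = ia[bi]; k < ia[bi + 1]; ++k) {
            const double *blk = a + (size_t)k * (size_t)lb * (size_t)lb;
            int bj = ja[k];               /* block column of this block */
            for (int r = 0; r < lb; ++r)
                for (int c = 0; c < lb; ++c)
                    y[bi * lb + r] += blk[r * lb + c] * x[bj * lb + c];
        }
    }
}
```

With m = 2, lb = 2, ia = (0, 1, 3) and ja = (0, 0, 1), the array a holds three blocks, 12 values in total, and x and y each have 4 elements.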

Interfaces
FORTRAN 77:
SUBROUTINE mkl_cspblas_sbsrgemv(transa, m, lb, a, ia, ja, x, y)
    CHARACTER*1 transa
    INTEGER m, lb
    INTEGER ia(*), ja(*)
    REAL a(*), x(*), y(*)
SUBROUTINE mkl_cspblas_dbsrgemv(transa, m, lb, a, ia, ja, x, y)
    CHARACTER*1 transa
    INTEGER m, lb
    INTEGER ia(*), ja(*)
    DOUBLE PRECISION a(*), x(*), y(*)
SUBROUTINE mkl_cspblas_cbsrgemv(transa, m, lb, a, ia, ja, x, y)
    CHARACTER*1 transa
    INTEGER m, lb
    INTEGER ia(*), ja(*)
    COMPLEX a(*), x(*), y(*)
SUBROUTINE mkl_cspblas_zbsrgemv(transa, m, lb, a, ia, ja, x, y)
    CHARACTER*1 transa
    INTEGER m, lb
    INTEGER ia(*), ja(*)
    DOUBLE COMPLEX a(*), x(*), y(*)
C:
void mkl_cspblas_sbsrgemv(char *transa, int *m, int *lb, float *a, int *ia, int *ja, float *x, float *y);
void mkl_cspblas_dbsrgemv(char *transa, int *m, int *lb, double *a, int *ia, int *ja, double *x, double *y);
void mkl_cspblas_cbsrgemv(char *transa, int *m, int *lb, MKL_Complex8 *a, int *ia, int *ja, MKL_Complex8 *x, MKL_Complex8 *y);
void mkl_cspblas_zbsrgemv(char *transa, int *m, int *lb, MKL_Complex16 *a, int *ia, int *ja, MKL_Complex16 *x, MKL_Complex16 *y);

mkl_cspblas_?coogemv
Computes matrix-vector product of a sparse general matrix stored in the coordinate format with zero-based indexing.
Syntax
Fortran:
call mkl_cspblas_scoogemv(transa, m, val, rowind, colind, nnz, x, y)
call mkl_cspblas_dcoogemv(transa, m, val, rowind, colind, nnz, x, y)
call mkl_cspblas_ccoogemv(transa, m, val, rowind, colind, nnz, x, y)
call mkl_cspblas_zcoogemv(transa, m, val, rowind, colind, nnz, x, y)
C:
mkl_cspblas_scoogemv(&transa, &m, val, rowind, colind, &nnz, x, y);
mkl_cspblas_dcoogemv(&transa, &m, val, rowind, colind, &nnz, x, y);
mkl_cspblas_ccoogemv(&transa, &m, val, rowind, colind, &nnz, x, y);
mkl_cspblas_zcoogemv(&transa, &m, val, rowind, colind, &nnz, x, y);

Include Files
• FORTRAN 77: mkl_spblas.fi
• C: mkl_spblas.h

Description
The mkl_cspblas_?coogemv routine performs a matrix-vector operation defined as
y := A*x or y := A'*x,
where:
x and y are vectors,
A is an m-by-m sparse square matrix in the coordinate format with zero-based indexing, A' is the transpose of A.
NOTE This routine supports only zero-based indexing of the input arrays.

Input Parameters
Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below.
transa CHARACTER*1. Specifies the operation. If transa = 'N' or 'n', then the matrix-vector product is computed as y := A*x. If transa = 'T' or 't' or 'C' or 'c', then the matrix-vector product is computed as y := A'*x.
m INTEGER. Number of rows of the matrix A.
val REAL for mkl_cspblas_scoogemv. DOUBLE PRECISION for mkl_cspblas_dcoogemv. COMPLEX for mkl_cspblas_ccoogemv. DOUBLE COMPLEX for mkl_cspblas_zcoogemv. Array of length nnz, contains non-zero elements of the matrix A in arbitrary order. Refer to values array description in Coordinate Format for more details.
rowind INTEGER. Array of length nnz, contains the row indices for each non-zero element of the matrix A.
Refer to rows array description in Coordinate Format for more details.
colind INTEGER. Array of length nnz, contains the column indices for each non-zero element of the matrix A. Refer to columns array description in Coordinate Format for more details.
nnz INTEGER. Specifies the number of non-zero elements of the matrix A. Refer to nnz description in Coordinate Format for more details.
x REAL for mkl_cspblas_scoogemv. DOUBLE PRECISION for mkl_cspblas_dcoogemv. COMPLEX for mkl_cspblas_ccoogemv. DOUBLE COMPLEX for mkl_cspblas_zcoogemv. Array, DIMENSION is m. On entry, the array x must contain the vector x.

Output Parameters
y REAL for mkl_cspblas_scoogemv. DOUBLE PRECISION for mkl_cspblas_dcoogemv. COMPLEX for mkl_cspblas_ccoogemv. DOUBLE COMPLEX for mkl_cspblas_zcoogemv. Array, DIMENSION at least m. On exit, the array y contains the vector y.

Interfaces
FORTRAN 77:
SUBROUTINE mkl_cspblas_scoogemv(transa, m, val, rowind, colind, nnz, x, y)
    CHARACTER*1 transa
    INTEGER m, nnz
    INTEGER rowind(*), colind(*)
    REAL val(*), x(*), y(*)
SUBROUTINE mkl_cspblas_dcoogemv(transa, m, val, rowind, colind, nnz, x, y)
    CHARACTER*1 transa
    INTEGER m, nnz
    INTEGER rowind(*), colind(*)
    DOUBLE PRECISION val(*), x(*), y(*)
SUBROUTINE mkl_cspblas_ccoogemv(transa, m, val, rowind, colind, nnz, x, y)
    CHARACTER*1 transa
    INTEGER m, nnz
    INTEGER rowind(*), colind(*)
    COMPLEX val(*), x(*), y(*)
SUBROUTINE mkl_cspblas_zcoogemv(transa, m, val, rowind, colind, nnz, x, y)
    CHARACTER*1 transa
    INTEGER m, nnz
    INTEGER rowind(*), colind(*)
    DOUBLE COMPLEX val(*), x(*), y(*)
C:
void mkl_cspblas_scoogemv(char *transa, int *m, float *val, int *rowind, int *colind, int *nnz, float *x, float *y);
void mkl_cspblas_dcoogemv(char *transa, int *m, double *val, int *rowind, int *colind, int *nnz, double *x, double *y);
void mkl_cspblas_ccoogemv(char *transa, int *m, MKL_Complex8 *val, int *rowind, int *colind, int *nnz, MKL_Complex8 *x, MKL_Complex8 *y);
void
mkl_cspblas_zcoogemv(char *transa, int *m, MKL_Complex16 *val, int *rowind, int *colind, int *nnz, MKL_Complex16 *x, MKL_Complex16 *y);

mkl_cspblas_?csrsymv
Computes matrix-vector product of a sparse symmetrical matrix stored in the CSR format (3-array variation) with zero-based indexing.

Syntax
Fortran:
call mkl_cspblas_scsrsymv(uplo, m, a, ia, ja, x, y)
call mkl_cspblas_dcsrsymv(uplo, m, a, ia, ja, x, y)
call mkl_cspblas_ccsrsymv(uplo, m, a, ia, ja, x, y)
call mkl_cspblas_zcsrsymv(uplo, m, a, ia, ja, x, y)
C:
mkl_cspblas_scsrsymv(&uplo, &m, a, ia, ja, x, y);
mkl_cspblas_dcsrsymv(&uplo, &m, a, ia, ja, x, y);
mkl_cspblas_ccsrsymv(&uplo, &m, a, ia, ja, x, y);
mkl_cspblas_zcsrsymv(&uplo, &m, a, ia, ja, x, y);

Include Files
• FORTRAN 77: mkl_spblas.fi
• C: mkl_spblas.h

Description
The mkl_cspblas_?csrsymv routine performs a matrix-vector operation defined as
y := A*x
where:
x and y are vectors,
A is an upper or lower triangle of the symmetrical sparse matrix in the CSR format (3-array variation) with zero-based indexing.
NOTE This routine supports only zero-based indexing of the input arrays.

Input Parameters
Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below.
uplo CHARACTER*1. Specifies whether the upper or lower triangle of the matrix A is used. If uplo = 'U' or 'u', then the upper triangle of the matrix A is used. If uplo = 'L' or 'l', then the lower triangle of the matrix A is used.
m INTEGER. Number of rows of the matrix A.
a REAL for mkl_cspblas_scsrsymv. DOUBLE PRECISION for mkl_cspblas_dcsrsymv. COMPLEX for mkl_cspblas_ccsrsymv. DOUBLE COMPLEX for mkl_cspblas_zcsrsymv. Array containing non-zero elements of the matrix A. Its length is equal to the number of non-zero elements in the matrix A.
Refer to values array description in Sparse Matrix Storage Formats for more details.
ia INTEGER. Array of length m + 1, containing indices of elements in the array a, such that ia(i) is the index in the array a of the first non-zero element from the row i. The value of the last element ia(m + 1) is equal to the number of non-zeros. Refer to rowIndex array description in Sparse Matrix Storage Formats for more details.
ja INTEGER. Array containing the column indices for each non-zero element of the matrix A. Its length is equal to the length of the array a. Refer to columns array description in Sparse Matrix Storage Formats for more details.
x REAL for mkl_cspblas_scsrsymv. DOUBLE PRECISION for mkl_cspblas_dcsrsymv. COMPLEX for mkl_cspblas_ccsrsymv. DOUBLE COMPLEX for mkl_cspblas_zcsrsymv. Array, DIMENSION is m. On entry, the array x must contain the vector x.

Output Parameters
y REAL for mkl_cspblas_scsrsymv. DOUBLE PRECISION for mkl_cspblas_dcsrsymv. COMPLEX for mkl_cspblas_ccsrsymv. DOUBLE COMPLEX for mkl_cspblas_zcsrsymv. Array, DIMENSION at least m. On exit, the array y contains the vector y.
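Because only one triangle of the symmetric matrix is passed in, each stored off-diagonal element contributes to two entries of y. The C sketch below shows this for the double-precision case with the upper triangle stored in zero-based CSR. It is a plain reference loop written for illustration, not the MKL implementation; the name csr0_symv_upper is hypothetical, and skipping any entries stored below the diagonal is an assumption of the sketch.

```c
#include <assert.h>

/* Hypothetical reference routine for illustration; not part of Intel MKL.
 * Computes y := A*x for a symmetric matrix A whose upper triangle is given
 * in zero-based CSR (3-array) storage. */
void csr0_symv_upper(int m, const double *a, const int *ia,
                     const int *ja, const double *x, double *y)
{
    assert(m >= 0);
    for (int i = 0; i < m; ++i)
        y[i] = 0.0;
    for (int i = 0; i < m; ++i) {
        /* Row i occupies positions ia[i] .. ia[i+1]-1 of a and ja. */
        for (int k = ia[i]; k < ia[i + 1]; ++k) {
            int j = ja[k];
            if (j < i)
                continue;             /* assumption: lower entries ignored */
            y[i] += a[k] * x[j];      /* element A(i,j) */
            if (j != i)
                y[j] += a[k] * x[i];  /* mirrored element A(j,i) */
        }
    }
}
```

For the symmetric matrix with rows (2, 1, 0), (1, 3, 4), (0, 4, 5), the upper triangle is stored as a = (2, 1, 3, 4, 5), ia = (0, 2, 4, 5), ja = (0, 1, 1, 2, 2).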

Interfaces
FORTRAN 77:
SUBROUTINE mkl_cspblas_scsrsymv(uplo, m, a, ia, ja, x, y)
    CHARACTER*1 uplo
    INTEGER m
    INTEGER ia(*), ja(*)
    REAL a(*), x(*), y(*)
SUBROUTINE mkl_cspblas_dcsrsymv(uplo, m, a, ia, ja, x, y)
    CHARACTER*1 uplo
    INTEGER m
    INTEGER ia(*), ja(*)
    DOUBLE PRECISION a(*), x(*), y(*)
SUBROUTINE mkl_cspblas_ccsrsymv(uplo, m, a, ia, ja, x, y)
    CHARACTER*1 uplo
    INTEGER m
    INTEGER ia(*), ja(*)
    COMPLEX a(*), x(*), y(*)
SUBROUTINE mkl_cspblas_zcsrsymv(uplo, m, a, ia, ja, x, y)
    CHARACTER*1 uplo
    INTEGER m
    INTEGER ia(*), ja(*)
    DOUBLE COMPLEX a(*), x(*), y(*)
C:
void mkl_cspblas_scsrsymv(char *uplo, int *m, float *a, int *ia, int *ja, float *x, float *y);
void mkl_cspblas_dcsrsymv(char *uplo, int *m, double *a, int *ia, int *ja, double *x, double *y);
void mkl_cspblas_ccsrsymv(char *uplo, int *m, MKL_Complex8 *a, int *ia, int *ja, MKL_Complex8 *x, MKL_Complex8 *y);
void mkl_cspblas_zcsrsymv(char *uplo, int *m, MKL_Complex16 *a, int *ia, int *ja, MKL_Complex16 *x, MKL_Complex16 *y);

mkl_cspblas_?bsrsymv
Computes matrix-vector product of a sparse symmetrical matrix stored in the BSR format (3-array variation) with zero-based indexing.

Syntax
Fortran:
call mkl_cspblas_sbsrsymv(uplo, m, lb, a, ia, ja, x, y)
call mkl_cspblas_dbsrsymv(uplo, m, lb, a, ia, ja, x, y)
call mkl_cspblas_cbsrsymv(uplo, m, lb, a, ia, ja, x, y)
call mkl_cspblas_zbsrsymv(uplo, m, lb, a, ia, ja, x, y)
C:
mkl_cspblas_sbsrsymv(&uplo, &m, &lb, a, ia, ja, x, y);
mkl_cspblas_dbsrsymv(&uplo, &m, &lb, a, ia, ja, x, y);
mkl_cspblas_cbsrsymv(&uplo, &m, &lb, a, ia, ja, x, y);
mkl_cspblas_zbsrsymv(&uplo, &m, &lb, a, ia, ja, x, y);

Include Files
• FORTRAN 77: mkl_spblas.fi
• C: mkl_spblas.h

Description
The mkl_cspblas_?bsrsymv routine performs a matrix-vector operation defined as
y := A*x
where:
x and y are vectors,
A is an upper or lower triangle of the symmetrical sparse matrix in the BSR format (3-array variation) with zero-based indexing.
NOTE This routine supports only zero-based indexing of the input arrays.

Input Parameters
Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below.
uplo CHARACTER*1. Specifies whether the upper or lower triangle of the matrix A is used. If uplo = 'U' or 'u', then the upper triangle of the matrix A is used. If uplo = 'L' or 'l', then the lower triangle of the matrix A is used.
m INTEGER. Number of block rows of the matrix A.
lb INTEGER. Size of the block in the matrix A.
a REAL for mkl_cspblas_sbsrsymv. DOUBLE PRECISION for mkl_cspblas_dbsrsymv. COMPLEX for mkl_cspblas_cbsrsymv. DOUBLE COMPLEX for mkl_cspblas_zbsrsymv. Array containing elements of non-zero blocks of the matrix A. Its length is equal to the number of non-zero blocks in the matrix A multiplied by lb*lb. Refer to values array description in BSR Format for more details.
ia INTEGER. Array of length (m + 1), containing indices of blocks in the array a, such that ia(i) is the index in the array a of the first non-zero element from the row i. The value of the last element ia(m + 1) is equal to the number of non-zero blocks. Refer to rowIndex array description in BSR Format for more details.
ja INTEGER. Array containing the column indices for each non-zero block in the matrix A. Its length is equal to the number of non-zero blocks of the matrix A. Refer to columns array description in BSR Format for more details.
x REAL for mkl_cspblas_sbsrsymv. DOUBLE PRECISION for mkl_cspblas_dbsrsymv. COMPLEX for mkl_cspblas_cbsrsymv. DOUBLE COMPLEX for mkl_cspblas_zbsrsymv. Array, DIMENSION (m*lb). On entry, the array x must contain the vector x.

Output Parameters
y REAL for mkl_cspblas_sbsrsymv. DOUBLE PRECISION for mkl_cspblas_dbsrsymv. COMPLEX for mkl_cspblas_cbsrsymv.
DOUBLE COMPLEX for mkl_cspblas_zbsrsymv. Array, DIMENSION at least (m*lb). On exit, the array y contains the vector y.

Interfaces
FORTRAN 77:
SUBROUTINE mkl_cspblas_sbsrsymv(uplo, m, lb, a, ia, ja, x, y)
    CHARACTER*1 uplo
    INTEGER m, lb
    INTEGER ia(*), ja(*)
    REAL a(*), x(*), y(*)
SUBROUTINE mkl_cspblas_dbsrsymv(uplo, m, lb, a, ia, ja, x, y)
    CHARACTER*1 uplo
    INTEGER m, lb
    INTEGER ia(*), ja(*)
    DOUBLE PRECISION a(*), x(*), y(*)
SUBROUTINE mkl_cspblas_cbsrsymv(uplo, m, lb, a, ia, ja, x, y)
    CHARACTER*1 uplo
    INTEGER m, lb
    INTEGER ia(*), ja(*)
    COMPLEX a(*), x(*), y(*)
SUBROUTINE mkl_cspblas_zbsrsymv(uplo, m, lb, a, ia, ja, x, y)
    CHARACTER*1 uplo
    INTEGER m, lb
    INTEGER ia(*), ja(*)
    DOUBLE COMPLEX a(*), x(*), y(*)
C:
void mkl_cspblas_sbsrsymv(char *uplo, int *m, int *lb, float *a, int *ia, int *ja, float *x, float *y);
void mkl_cspblas_dbsrsymv(char *uplo, int *m, int *lb, double *a, int *ia, int *ja, double *x, double *y);
void mkl_cspblas_cbsrsymv(char *uplo, int *m, int *lb, MKL_Complex8 *a, int *ia, int *ja, MKL_Complex8 *x, MKL_Complex8 *y);
void mkl_cspblas_zbsrsymv(char *uplo, int *m, int *lb, MKL_Complex16 *a, int *ia, int *ja, MKL_Complex16 *x, MKL_Complex16 *y);

mkl_cspblas_?coosymv
Computes matrix-vector product of a sparse symmetrical matrix stored in the coordinate format with zero-based indexing.
Syntax
Fortran:
call mkl_cspblas_scoosymv(uplo, m, val, rowind, colind, nnz, x, y)
call mkl_cspblas_dcoosymv(uplo, m, val, rowind, colind, nnz, x, y)
call mkl_cspblas_ccoosymv(uplo, m, val, rowind, colind, nnz, x, y)
call mkl_cspblas_zcoosymv(uplo, m, val, rowind, colind, nnz, x, y)
C:
mkl_cspblas_scoosymv(&uplo, &m, val, rowind, colind, &nnz, x, y);
mkl_cspblas_dcoosymv(&uplo, &m, val, rowind, colind, &nnz, x, y);
mkl_cspblas_ccoosymv(&uplo, &m, val, rowind, colind, &nnz, x, y);
mkl_cspblas_zcoosymv(&uplo, &m, val, rowind, colind, &nnz, x, y);

Include Files
• FORTRAN 77: mkl_spblas.fi
• C: mkl_spblas.h

Description
The mkl_cspblas_?coosymv routine performs a matrix-vector operation defined as
y := A*x
where: x and y are vectors, A is an upper or lower triangle of the symmetrical sparse matrix in the coordinate format with zero-based indexing.

NOTE This routine supports only zero-based indexing of the input arrays.

Input Parameters
Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below.

uplo CHARACTER*1. Specifies whether the upper or lower triangle of the matrix A is used. If uplo = 'U' or 'u', then the upper triangle of the matrix A is used. If uplo = 'L' or 'l', then the lower triangle of the matrix A is used.
m INTEGER. Number of rows of the matrix A.
val REAL for mkl_cspblas_scoosymv. DOUBLE PRECISION for mkl_cspblas_dcoosymv. COMPLEX for mkl_cspblas_ccoosymv. DOUBLE COMPLEX for mkl_cspblas_zcoosymv. Array of length nnz, contains non-zero elements of the matrix A in arbitrary order. Refer to values array description in Coordinate Format for more details.
rowind INTEGER. Array of length nnz, contains the row indices for each non-zero element of the matrix A. Refer to rows array description in Coordinate Format for more details.
colind INTEGER. Array of length nnz, contains the column indices for each non-zero element of the matrix A. Refer to columns array description in Coordinate Format for more details.
nnz INTEGER. Specifies the number of non-zero elements of the matrix A. Refer to nnz description in Coordinate Format for more details.
x REAL for mkl_cspblas_scoosymv. DOUBLE PRECISION for mkl_cspblas_dcoosymv. COMPLEX for mkl_cspblas_ccoosymv. DOUBLE COMPLEX for mkl_cspblas_zcoosymv. Array, DIMENSION is m. On entry, the array x must contain the vector x.

Output Parameters
y REAL for mkl_cspblas_scoosymv. DOUBLE PRECISION for mkl_cspblas_dcoosymv. COMPLEX for mkl_cspblas_ccoosymv. DOUBLE COMPLEX for mkl_cspblas_zcoosymv. Array, DIMENSION at least m. On exit, the array y contains the vector y.

Interfaces
FORTRAN 77:
SUBROUTINE mkl_cspblas_scoosymv(uplo, m, val, rowind, colind, nnz, x, y)
  CHARACTER*1 uplo
  INTEGER m, nnz
  INTEGER rowind(*), colind(*)
  REAL val(*), x(*), y(*)
SUBROUTINE mkl_cspblas_dcoosymv(uplo, m, val, rowind, colind, nnz, x, y)
  CHARACTER*1 uplo
  INTEGER m, nnz
  INTEGER rowind(*), colind(*)
  DOUBLE PRECISION val(*), x(*), y(*)
SUBROUTINE mkl_cspblas_ccoosymv(uplo, m, val, rowind, colind, nnz, x, y)
  CHARACTER*1 uplo
  INTEGER m, nnz
  INTEGER rowind(*), colind(*)
  COMPLEX val(*), x(*), y(*)
SUBROUTINE mkl_cspblas_zcoosymv(uplo, m, val, rowind, colind, nnz, x, y)
  CHARACTER*1 uplo
  INTEGER m, nnz
  INTEGER rowind(*), colind(*)
  DOUBLE COMPLEX val(*), x(*), y(*)
C:
void mkl_cspblas_scoosymv(char *uplo, int *m, float *val, int *rowind, int *colind, int *nnz, float *x, float *y);
void mkl_cspblas_dcoosymv(char *uplo, int *m, double *val, int *rowind, int *colind, int *nnz, double *x, double *y);
void mkl_cspblas_ccoosymv(char *uplo, int *m, MKL_Complex8 *val, int *rowind, int *colind, int *nnz, MKL_Complex8 *x, MKL_Complex8 *y);
void mkl_cspblas_zcoosymv(char *uplo, int *m, MKL_Complex16 *val, int *rowind, int *colind, int *nnz, MKL_Complex16 *x,
MKL_Complex16 *y);

mkl_cspblas_?csrtrsv
Triangular solvers with simplified interface for a sparse matrix in the CSR format (3-array variation) with zero-based indexing.

Syntax
Fortran:
call mkl_cspblas_scsrtrsv(uplo, transa, diag, m, a, ia, ja, x, y)
call mkl_cspblas_dcsrtrsv(uplo, transa, diag, m, a, ia, ja, x, y)
call mkl_cspblas_ccsrtrsv(uplo, transa, diag, m, a, ia, ja, x, y)
call mkl_cspblas_zcsrtrsv(uplo, transa, diag, m, a, ia, ja, x, y)
C:
mkl_cspblas_scsrtrsv(&uplo, &transa, &diag, &m, a, ia, ja, x, y);
mkl_cspblas_dcsrtrsv(&uplo, &transa, &diag, &m, a, ia, ja, x, y);
mkl_cspblas_ccsrtrsv(&uplo, &transa, &diag, &m, a, ia, ja, x, y);
mkl_cspblas_zcsrtrsv(&uplo, &transa, &diag, &m, a, ia, ja, x, y);

Include Files
• FORTRAN 77: mkl_spblas.fi
• C: mkl_spblas.h

Description
The mkl_cspblas_?csrtrsv routine solves a system of linear equations with matrix-vector operations for a sparse matrix stored in the CSR format (3-array variation) with zero-based indexing:
A*y = x or A'*y = x,
where: x and y are vectors, A is a sparse upper or lower triangular matrix with unit or non-unit main diagonal, A' is the transpose of A.

NOTE This routine supports only zero-based indexing of the input arrays.

Input Parameters
Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below.

uplo CHARACTER*1. Specifies whether the upper or lower triangle of the matrix A is used. If uplo = 'U' or 'u', then the upper triangle of the matrix A is used. If uplo = 'L' or 'l', then the lower triangle of the matrix A is used.
transa CHARACTER*1. Specifies the system of linear equations. If transa = 'N' or 'n', then A*y = x. If transa = 'T' or 't' or 'C' or 'c', then A'*y = x.
diag CHARACTER*1.
Specifies whether matrix A is unit triangular. If diag = 'U' or 'u', then A is unit triangular. If diag = 'N' or 'n', then A is not unit triangular.
m INTEGER. Number of rows of the matrix A.
a REAL for mkl_cspblas_scsrtrsv. DOUBLE PRECISION for mkl_cspblas_dcsrtrsv. COMPLEX for mkl_cspblas_ccsrtrsv. DOUBLE COMPLEX for mkl_cspblas_zcsrtrsv. Array containing non-zero elements of the matrix A. Its length is equal to the number of non-zero elements in the matrix A. Refer to values array description in Sparse Matrix Storage Formats for more details.
NOTE The non-zero elements of the given row of the matrix must be stored in the same order as they appear in the row (from left to right). No diagonal element can be omitted from a sparse storage if the solver is called with the non-unit indicator.
ia INTEGER. Array of length m+1, containing indices of elements in the array a, such that ia(i) is the index in the array a of the first non-zero element from the row i. The value of the last element ia(m) is equal to the number of non-zeros. Refer to rowIndex array description in Sparse Matrix Storage Formats for more details.
ja INTEGER. Array containing the column indices for each non-zero element of the matrix A. Its length is equal to the length of the array a. Refer to columns array description in Sparse Matrix Storage Formats for more details.
NOTE Column indices must be sorted in increasing order for each row.
x REAL for mkl_cspblas_scsrtrsv. DOUBLE PRECISION for mkl_cspblas_dcsrtrsv. COMPLEX for mkl_cspblas_ccsrtrsv. DOUBLE COMPLEX for mkl_cspblas_zcsrtrsv. Array, DIMENSION is m. On entry, the array x must contain the vector x.

Output Parameters
y REAL for mkl_cspblas_scsrtrsv. DOUBLE PRECISION for mkl_cspblas_dcsrtrsv. COMPLEX for mkl_cspblas_ccsrtrsv. DOUBLE COMPLEX for mkl_cspblas_zcsrtrsv. Array, DIMENSION at least m. Contains the vector y.
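
The operation described above can be illustrated with a small reference sketch in plain C. This is not MKL code and makes no claim about how the library works internally. The function name ref_csrtrsv_n is hypothetical; it covers only double precision and the non-transposed case (transa = 'N'), and it assumes every stored diagonal entry is present when diag = 'N', as the NOTE above requires.

```c
#include <assert.h>

/* Reference sketch (hypothetical helper, not part of MKL):
 * solve A*y = x for a triangular matrix A held in zero-based
 * 3-array CSR storage (a, ia, ja). Only transa = 'N' is covered. */
void ref_csrtrsv_n(char uplo, char diag, int m,
                   const double *a, const int *ia, const int *ja,
                   const double *x, double *y)
{
    const int lower = (uplo == 'L' || uplo == 'l');
    const int unit  = (diag == 'U' || diag == 'u');
    assert(lower || uplo == 'U' || uplo == 'u');
    assert(unit || diag == 'N' || diag == 'n');

    for (int step = 0; step < m; ++step) {
        /* Forward substitution visits rows top-down,
         * backward substitution visits them bottom-up. */
        const int i = lower ? step : m - 1 - step;
        double sum = x[i];
        double d = 1.0;

        for (int p = ia[i]; p < ia[i + 1]; ++p) {
            const int j = ja[p];
            if (j == i) {
                d = a[p];               /* stored diagonal entry */
            } else if (lower ? (j < i) : (j > i)) {
                sum -= a[p] * y[j];     /* y[j] is already known */
            }
            /* entries in the other triangle are ignored */
        }
        /* With the unit indicator the stored diagonal is not used. */
        y[i] = unit ? sum : sum / d;
    }
}
```

For a 3-by-3 lower triangular matrix with rows (2), (1 3), (0 4 5), the zero-based arrays are a = {2, 1, 3, 4, 5}, ia = {0, 1, 3, 5}, ja = {0, 0, 1, 1, 2}. Note that ia has m+1 entries and its last entry equals the number of non-zeros.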

Interfaces
FORTRAN 77:
SUBROUTINE mkl_cspblas_scsrtrsv(uplo, transa, diag, m, a, ia, ja, x, y)
  CHARACTER*1 uplo, transa, diag
  INTEGER m
  INTEGER ia(*), ja(*)
  REAL a(*), x(*), y(*)
SUBROUTINE mkl_cspblas_dcsrtrsv(uplo, transa, diag, m, a, ia, ja, x, y)
  CHARACTER*1 uplo, transa, diag
  INTEGER m
  INTEGER ia(*), ja(*)
  DOUBLE PRECISION a(*), x(*), y(*)
SUBROUTINE mkl_cspblas_ccsrtrsv(uplo, transa, diag, m, a, ia, ja, x, y)
  CHARACTER*1 uplo, transa, diag
  INTEGER m
  INTEGER ia(*), ja(*)
  COMPLEX a(*), x(*), y(*)
SUBROUTINE mkl_cspblas_zcsrtrsv(uplo, transa, diag, m, a, ia, ja, x, y)
  CHARACTER*1 uplo, transa, diag
  INTEGER m
  INTEGER ia(*), ja(*)
  DOUBLE COMPLEX a(*), x(*), y(*)
C:
void mkl_cspblas_scsrtrsv(char *uplo, char *transa, char *diag, int *m, float *a, int *ia, int *ja, float *x, float *y);
void mkl_cspblas_dcsrtrsv(char *uplo, char *transa, char *diag, int *m, double *a, int *ia, int *ja, double *x, double *y);
void mkl_cspblas_ccsrtrsv(char *uplo, char *transa, char *diag, int *m, MKL_Complex8 *a, int *ia, int *ja, MKL_Complex8 *x, MKL_Complex8 *y);
void mkl_cspblas_zcsrtrsv(char *uplo, char *transa, char *diag, int *m, MKL_Complex16 *a, int *ia, int *ja, MKL_Complex16 *x, MKL_Complex16 *y);

mkl_cspblas_?bsrtrsv
Triangular solver with simplified interface for a sparse matrix stored in the BSR format (3-array variation) with zero-based indexing.

Syntax
Fortran:
call mkl_cspblas_sbsrtrsv(uplo, transa, diag, m, lb, a, ia, ja, x, y)
call mkl_cspblas_dbsrtrsv(uplo, transa, diag, m, lb, a, ia, ja, x, y)
call mkl_cspblas_cbsrtrsv(uplo, transa, diag, m, lb, a, ia, ja, x, y)
call mkl_cspblas_zbsrtrsv(uplo, transa, diag, m, lb, a, ia, ja, x, y)
C:
mkl_cspblas_sbsrtrsv(&uplo, &transa, &diag, &m, &lb, a, ia, ja, x, y);
mkl_cspblas_dbsrtrsv(&uplo, &transa, &diag, &m, &lb, a, ia, ja, x, y);
mkl_cspblas_cbsrtrsv(&uplo, &transa, &diag, &m, &lb, a, ia, ja, x, y);
mkl_cspblas_zbsrtrsv(&uplo, &transa, &diag, &m, &lb, a, ia, ja, x, y);

Include Files
• FORTRAN 77: mkl_spblas.fi
• C: mkl_spblas.h

Description
The mkl_cspblas_?bsrtrsv routine solves a system of linear equations with matrix-vector operations for a sparse matrix stored in the BSR format (3-array variation) with zero-based indexing:
A*y = x or A'*y = x,
where: x and y are vectors, A is a sparse upper or lower triangular matrix with unit or non-unit main diagonal, A' is the transpose of A.

NOTE This routine supports only zero-based indexing of the input arrays.

Input Parameters
Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below.

uplo CHARACTER*1. Specifies whether the upper or lower triangle of the matrix A is used. If uplo = 'U' or 'u', then the upper triangle of the matrix A is used. If uplo = 'L' or 'l', then the lower triangle of the matrix A is used.
transa CHARACTER*1. Specifies the system of linear equations. If transa = 'N' or 'n', then A*y = x. If transa = 'T' or 't' or 'C' or 'c', then A'*y = x.
diag CHARACTER*1. Specifies whether matrix A is unit triangular or not. If diag = 'U' or 'u', A is unit triangular.
If diag = 'N' or 'n', A is not unit triangular.
m INTEGER. Number of block rows of the matrix A.
lb INTEGER. Size of the block in the matrix A.
a REAL for mkl_cspblas_sbsrtrsv. DOUBLE PRECISION for mkl_cspblas_dbsrtrsv. COMPLEX for mkl_cspblas_cbsrtrsv. DOUBLE COMPLEX for mkl_cspblas_zbsrtrsv. Array containing elements of non-zero blocks of the matrix A. Its length is equal to the number of non-zero blocks in the matrix A multiplied by lb*lb. Refer to values array description in BSR Format for more details.
NOTE The non-zero elements of the given row of the matrix must be stored in the same order as they appear in the row (from left to right). No diagonal element can be omitted from a sparse storage if the solver is called with the non-unit indicator.
ia INTEGER. Array of length (m + 1), containing indices of blocks in the array a, such that ia(i) is the index in the array a of the first non-zero element from the row i. The value of the last element ia(m + 1) is equal to the number of non-zero blocks. Refer to rowIndex array description in BSR Format for more details.
ja INTEGER. Array containing the column indices for each non-zero block in the matrix A. Its length is equal to the number of non-zero blocks of the matrix A. Refer to columns array description in BSR Format for more details.
x REAL for mkl_cspblas_sbsrtrsv. DOUBLE PRECISION for mkl_cspblas_dbsrtrsv. COMPLEX for mkl_cspblas_cbsrtrsv. DOUBLE COMPLEX for mkl_cspblas_zbsrtrsv. Array, DIMENSION (m*lb). On entry, the array x must contain the vector x.

Output Parameters
y REAL for mkl_cspblas_sbsrtrsv. DOUBLE PRECISION for mkl_cspblas_dbsrtrsv. COMPLEX for mkl_cspblas_cbsrtrsv. DOUBLE COMPLEX for mkl_cspblas_zbsrtrsv. Array, DIMENSION at least (m*lb). On exit, the array y contains the vector y.
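
To make the array sizes concrete (a holds lb*lb values per non-zero block, x and y hold m*lb values), here is a plain C reference sketch of a block forward substitution. It is not MKL code. The name ref_bsrtrsv_lower is hypothetical, and the sketch rests on several assumptions: double precision, lower triangle, transa = 'N', non-unit diagonal, every diagonal block present and itself lower triangular, and elements inside each block stored row by row.

```c
#include <assert.h>
#include <stddef.h>

/* Reference sketch (hypothetical helper, not part of MKL):
 * solve A*y = x where A is block lower triangular in zero-based
 * 3-array BSR storage with lb-by-lb blocks.
 * Assumed: row-major layout inside each block, non-unit diagonal. */
void ref_bsrtrsv_lower(int m, int lb,
                       const double *a, const int *ia, const int *ja,
                       const double *x, double *y)
{
    for (int bi = 0; bi < m; ++bi) {
        double *yi = y + (size_t)bi * lb;
        const double *dblk = NULL;   /* diagonal block of block row bi */

        /* Start from the right-hand side of this block row. */
        for (int r = 0; r < lb; ++r)
            yi[r] = x[(size_t)bi * lb + r];

        for (int p = ia[bi]; p < ia[bi + 1]; ++p) {
            const double *blk = a + (size_t)p * lb * lb;
            const int bj = ja[p];
            if (bj == bi) {
                dblk = blk;
            } else if (bj < bi) {
                /* yi -= blk * y_bj, using the already solved block bj */
                for (int r = 0; r < lb; ++r)
                    for (int c = 0; c < lb; ++c)
                        yi[r] -= blk[r * lb + c] * y[(size_t)bj * lb + c];
            }
        }

        /* The diagonal block must be stored for a non-unit solve. */
        assert(dblk != NULL);

        /* Forward substitution inside the diagonal block. */
        for (int r = 0; r < lb; ++r) {
            for (int c = 0; c < r; ++c)
                yi[r] -= dblk[r * lb + c] * yi[c];
            yi[r] /= dblk[r * lb + r];
        }
    }
}
```

With m = 2 and lb = 2, three stored blocks give an array a of 3*2*2 = 12 values, ia = {0, 1, 3}, and ja = {0, 0, 1}.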

Interfaces
FORTRAN 77:
SUBROUTINE mkl_cspblas_sbsrtrsv(uplo, transa, diag, m, lb, a, ia, ja, x, y)
  CHARACTER*1 uplo, transa, diag
  INTEGER m, lb
  INTEGER ia(*), ja(*)
  REAL a(*), x(*), y(*)
SUBROUTINE mkl_cspblas_dbsrtrsv(uplo, transa, diag, m, lb, a, ia, ja, x, y)
  CHARACTER*1 uplo, transa, diag
  INTEGER m, lb
  INTEGER ia(*), ja(*)
  DOUBLE PRECISION a(*), x(*), y(*)
SUBROUTINE mkl_cspblas_cbsrtrsv(uplo, transa, diag, m, lb, a, ia, ja, x, y)
  CHARACTER*1 uplo, transa, diag
  INTEGER m, lb
  INTEGER ia(*), ja(*)
  COMPLEX a(*), x(*), y(*)
SUBROUTINE mkl_cspblas_zbsrtrsv(uplo, transa, diag, m, lb, a, ia, ja, x, y)
  CHARACTER*1 uplo, transa, diag
  INTEGER m, lb
  INTEGER ia(*), ja(*)
  DOUBLE COMPLEX a(*), x(*), y(*)
C:
void mkl_cspblas_sbsrtrsv(char *uplo, char *transa, char *diag, int *m, int *lb, float *a, int *ia, int *ja, float *x, float *y);
void mkl_cspblas_dbsrtrsv(char *uplo, char *transa, char *diag, int *m, int *lb, double *a, int *ia, int *ja, double *x, double *y);
void mkl_cspblas_cbsrtrsv(char *uplo, char *transa, char *diag, int *m, int *lb, MKL_Complex8 *a, int *ia, int *ja, MKL_Complex8 *x, MKL_Complex8 *y);
void mkl_cspblas_zbsrtrsv(char *uplo, char *transa, char *diag, int *m, int *lb, MKL_Complex16 *a, int *ia, int *ja, MKL_Complex16 *x, MKL_Complex16 *y);

mkl_cspblas_?cootrsv
Triangular solvers with simplified interface for a sparse matrix in the coordinate format with zero-based indexing.
Syntax
Fortran:
call mkl_cspblas_scootrsv(uplo, transa, diag, m, val, rowind, colind, nnz, x, y)
call mkl_cspblas_dcootrsv(uplo, transa, diag, m, val, rowind, colind, nnz, x, y)
call mkl_cspblas_ccootrsv(uplo, transa, diag, m, val, rowind, colind, nnz, x, y)
call mkl_cspblas_zcootrsv(uplo, transa, diag, m, val, rowind, colind, nnz, x, y)
C:
mkl_cspblas_scootrsv(&uplo, &transa, &diag, &m, val, rowind, colind, &nnz, x, y);
mkl_cspblas_dcootrsv(&uplo, &transa, &diag, &m, val, rowind, colind, &nnz, x, y);
mkl_cspblas_ccootrsv(&uplo, &transa, &diag, &m, val, rowind, colind, &nnz, x, y);
mkl_cspblas_zcootrsv(&uplo, &transa, &diag, &m, val, rowind, colind, &nnz, x, y);

Include Files
• FORTRAN 77: mkl_spblas.fi
• C: mkl_spblas.h

Description
The mkl_cspblas_?cootrsv routine solves a system of linear equations with matrix-vector operations for a sparse matrix stored in the coordinate format with zero-based indexing:
A*y = x or A'*y = x,
where: x and y are vectors, A is a sparse upper or lower triangular matrix with unit or non-unit main diagonal, A' is the transpose of A.

NOTE This routine supports only zero-based indexing of the input arrays.

Input Parameters
Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below.

uplo CHARACTER*1. Specifies whether the upper or lower triangle of the matrix A is considered. If uplo = 'U' or 'u', then the upper triangle of the matrix A is used. If uplo = 'L' or 'l', then the lower triangle of the matrix A is used.
transa CHARACTER*1. Specifies the system of linear equations. If transa = 'N' or 'n', then A*y = x. If transa = 'T' or 't' or 'C' or 'c', then A'*y = x.
diag CHARACTER*1. Specifies whether A is unit triangular. If diag = 'U' or 'u', then A is unit triangular.
If diag = 'N' or 'n', then A is not unit triangular.
m INTEGER. Number of rows of the matrix A.
val REAL for mkl_cspblas_scootrsv. DOUBLE PRECISION for mkl_cspblas_dcootrsv. COMPLEX for mkl_cspblas_ccootrsv. DOUBLE COMPLEX for mkl_cspblas_zcootrsv. Array of length nnz, contains non-zero elements of the matrix A in arbitrary order. Refer to values array description in Coordinate Format for more details.
rowind INTEGER. Array of length nnz, contains the row indices for each non-zero element of the matrix A. Refer to rows array description in Coordinate Format for more details.
colind INTEGER. Array of length nnz, contains the column indices for each non-zero element of the matrix A. Refer to columns array description in Coordinate Format for more details.
nnz INTEGER. Specifies the number of non-zero elements of the matrix A. Refer to nnz description in Coordinate Format for more details.
x REAL for mkl_cspblas_scootrsv. DOUBLE PRECISION for mkl_cspblas_dcootrsv. COMPLEX for mkl_cspblas_ccootrsv. DOUBLE COMPLEX for mkl_cspblas_zcootrsv. Array, DIMENSION is m. On entry, the array x must contain the vector x.

Output Parameters
y REAL for mkl_cspblas_scootrsv. DOUBLE PRECISION for mkl_cspblas_dcootrsv. COMPLEX for mkl_cspblas_ccootrsv. DOUBLE COMPLEX for mkl_cspblas_zcootrsv. Array, DIMENSION at least m. Contains the vector y.
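
Because the coordinate format allows the entries of val, rowind, and colind to appear in arbitrary order, a solver cannot rely on a row being contiguous in storage. The following plain C sketch shows the semantics with a deliberately naive scan of all nnz entries for every row. It is not MKL code: the name ref_cootrsv_n is hypothetical, only double precision and transa = 'N' are covered, and duplicate (row, column) entries are assumed not to occur.

```c
#include <assert.h>

/* Reference sketch (hypothetical helper, not part of MKL):
 * solve A*y = x for a triangular matrix A in zero-based coordinate
 * format. Entries may be in any order, so each row is found by
 * scanning all nnz triplets (O(m*nnz), chosen for clarity only). */
void ref_cootrsv_n(char uplo, char diag, int m,
                   const double *val, const int *rowind, const int *colind,
                   int nnz, const double *x, double *y)
{
    const int lower = (uplo == 'L' || uplo == 'l');
    const int unit  = (diag == 'U' || diag == 'u');
    assert(lower || uplo == 'U' || uplo == 'u');
    assert(unit || diag == 'N' || diag == 'n');

    for (int step = 0; step < m; ++step) {
        /* Lower: rows 0..m-1; upper: rows m-1..0. */
        const int i = lower ? step : m - 1 - step;
        double sum = x[i];
        double d = 1.0;

        for (int p = 0; p < nnz; ++p) {
            if (rowind[p] != i)
                continue;
            const int j = colind[p];
            if (j == i) {
                d = val[p];             /* stored diagonal entry */
            } else if (lower ? (j < i) : (j > i)) {
                sum -= val[p] * y[j];   /* y[j] is already known */
            }
        }
        y[i] = unit ? sum : sum / d;
    }
}
```

A practical implementation would typically sort or bucket the triplets by row first; the full scan is used here only to keep the sketch short.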

Interfaces
FORTRAN 77:
SUBROUTINE mkl_cspblas_scootrsv(uplo, transa, diag, m, val, rowind, colind, nnz, x, y)
  CHARACTER*1 uplo, transa, diag
  INTEGER m, nnz
  INTEGER rowind(*), colind(*)
  REAL val(*), x(*), y(*)
SUBROUTINE mkl_cspblas_dcootrsv(uplo, transa, diag, m, val, rowind, colind, nnz, x, y)
  CHARACTER*1 uplo, transa, diag
  INTEGER m, nnz
  INTEGER rowind(*), colind(*)
  DOUBLE PRECISION val(*), x(*), y(*)
SUBROUTINE mkl_cspblas_ccootrsv(uplo, transa, diag, m, val, rowind, colind, nnz, x, y)
  CHARACTER*1 uplo, transa, diag
  INTEGER m, nnz
  INTEGER rowind(*), colind(*)
  COMPLEX val(*), x(*), y(*)
SUBROUTINE mkl_cspblas_zcootrsv(uplo, transa, diag, m, val, rowind, colind, nnz, x, y)
  CHARACTER*1 uplo, transa, diag
  INTEGER m, nnz
  INTEGER rowind(*), colind(*)
  DOUBLE COMPLEX val(*), x(*), y(*)
C:
void mkl_cspblas_scootrsv(char *uplo, char *transa, char *diag, int *m, float *val, int *rowind, int *colind, int *nnz, float *x, float *y);
void mkl_cspblas_dcootrsv(char *uplo, char *transa, char *diag, int *m, double *val, int *rowind, int *colind, int *nnz, double *x, double *y);
void mkl_cspblas_ccootrsv(char *uplo, char *transa, char *diag, int *m, MKL_Complex8 *val, int *rowind, int *colind, int *nnz, MKL_Complex8 *x, MKL_Complex8 *y);
void mkl_cspblas_zcootrsv(char *uplo, char *transa, char *diag, int *m, MKL_Complex16 *val, int *rowind, int *colind, int *nnz, MKL_Complex16 *x, MKL_Complex16 *y);

mkl_?csrmv
Computes matrix-vector product of a sparse matrix stored in the CSR format.
Syntax
Fortran:
call mkl_scsrmv(transa, m, k, alpha, matdescra, val, indx, pntrb, pntre, x, beta, y)
call mkl_dcsrmv(transa, m, k, alpha, matdescra, val, indx, pntrb, pntre, x, beta, y)
call mkl_ccsrmv(transa, m, k, alpha, matdescra, val, indx, pntrb, pntre, x, beta, y)
call mkl_zcsrmv(transa, m, k, alpha, matdescra, val, indx, pntrb, pntre, x, beta, y)
C:
mkl_scsrmv(&transa, &m, &k, &alpha, matdescra, val, indx, pntrb, pntre, x, &beta, y);
mkl_dcsrmv(&transa, &m, &k, &alpha, matdescra, val, indx, pntrb, pntre, x, &beta, y);
mkl_ccsrmv(&transa, &m, &k, &alpha, matdescra, val, indx, pntrb, pntre, x, &beta, y);
mkl_zcsrmv(&transa, &m, &k, &alpha, matdescra, val, indx, pntrb, pntre, x, &beta, y);

Include Files
• FORTRAN 77: mkl_spblas.fi
• C: mkl_spblas.h

Description
The mkl_?csrmv routine performs a matrix-vector operation defined as
y := alpha*A*x + beta*y or y := alpha*A'*x + beta*y,
where: alpha and beta are scalars, x and y are vectors, A is an m-by-k sparse matrix in the CSR format, A' is the transpose of A.

NOTE This routine supports a CSR format both with one-based indexing and zero-based indexing.

Input Parameters
Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below.

transa CHARACTER*1. Specifies the operation. If transa = 'N' or 'n', then y := alpha*A*x + beta*y. If transa = 'T' or 't' or 'C' or 'c', then y := alpha*A'*x + beta*y.
m INTEGER. Number of rows of the matrix A.
k INTEGER. Number of columns of the matrix A.
alpha REAL for mkl_scsrmv. DOUBLE PRECISION for mkl_dcsrmv. COMPLEX for mkl_ccsrmv. DOUBLE COMPLEX for mkl_zcsrmv. Specifies the scalar alpha.
matdescra CHARACTER. Array of six elements, specifies properties of the matrix used for operation.
Only the first four array elements are used; their possible values are given in Table “Possible Values of the Parameter matdescra (descra)”. Possible combinations of element values of this parameter are given in Table “Possible Combinations of Element Values of the Parameter matdescra”.
val REAL for mkl_scsrmv. DOUBLE PRECISION for mkl_dcsrmv. COMPLEX for mkl_ccsrmv. DOUBLE COMPLEX for mkl_zcsrmv. Array containing non-zero elements of the matrix A. For one-based indexing its length is pntre(m) - pntrb(1). For zero-based indexing its length is pntre(m-1) - pntrb(0). Refer to values array description in CSR Format for more details.
indx INTEGER. Array containing the column indices for each non-zero element of the matrix A. Its length is equal to the length of the val array. Refer to columns array description in CSR Format for more details.
pntrb INTEGER. Array of length m. For one-based indexing this array contains row indices, such that pntrb(i) - pntrb(1)+1 is the first index of row i in the arrays val and indx. For zero-based indexing this array contains row indices, such that pntrb(i) - pntrb(0) is the first index of row i in the arrays val and indx. Refer to pointerB array description in CSR Format for more details.
pntre INTEGER. Array of length m. For one-based indexing this array contains row indices, such that pntre(i) - pntrb(1) is the last index of row i in the arrays val and indx. For zero-based indexing this array contains row indices, such that pntre(i) - pntrb(0)-1 is the last index of row i in the arrays val and indx. Refer to pointerE array description in CSR Format for more details.
x REAL for mkl_scsrmv. DOUBLE PRECISION for mkl_dcsrmv. COMPLEX for mkl_ccsrmv. DOUBLE COMPLEX for mkl_zcsrmv. Array, DIMENSION at least k if transa = 'N' or 'n' and at least m otherwise. On entry, the array x must contain the vector x.
beta REAL for mkl_scsrmv. DOUBLE PRECISION for mkl_dcsrmv. COMPLEX for mkl_ccsrmv.
DOUBLE COMPLEX for mkl_zcsrmv. Specifies the scalar beta.
y REAL for mkl_scsrmv. DOUBLE PRECISION for mkl_dcsrmv. COMPLEX for mkl_ccsrmv. DOUBLE COMPLEX for mkl_zcsrmv. Array, DIMENSION at least m if transa = 'N' or 'n' and at least k otherwise. On entry, the array y must contain the vector y.

Output Parameters
y Overwritten by the updated vector y.

Interfaces
FORTRAN 77:
SUBROUTINE mkl_scsrmv(transa, m, k, alpha, matdescra, val, indx, pntrb, pntre, x, beta, y)
  CHARACTER*1 transa
  CHARACTER matdescra(*)
  INTEGER m, k
  INTEGER indx(*), pntrb(m), pntre(m)
  REAL alpha, beta
  REAL val(*), x(*), y(*)
SUBROUTINE mkl_dcsrmv(transa, m, k, alpha, matdescra, val, indx, pntrb, pntre, x, beta, y)
  CHARACTER*1 transa
  CHARACTER matdescra(*)
  INTEGER m, k
  INTEGER indx(*), pntrb(m), pntre(m)
  DOUBLE PRECISION alpha, beta
  DOUBLE PRECISION val(*), x(*), y(*)
SUBROUTINE mkl_ccsrmv(transa, m, k, alpha, matdescra, val, indx, pntrb, pntre, x, beta, y)
  CHARACTER*1 transa
  CHARACTER matdescra(*)
  INTEGER m, k
  INTEGER indx(*), pntrb(m), pntre(m)
  COMPLEX alpha, beta
  COMPLEX val(*), x(*), y(*)
SUBROUTINE mkl_zcsrmv(transa, m, k, alpha, matdescra, val, indx, pntrb, pntre, x, beta, y)
  CHARACTER*1 transa
  CHARACTER matdescra(*)
  INTEGER m, k
  INTEGER indx(*), pntrb(m), pntre(m)
  DOUBLE COMPLEX alpha, beta
  DOUBLE COMPLEX val(*), x(*), y(*)
C:
void mkl_scsrmv(char *transa, int *m, int *k, float *alpha, char *matdescra, float *val, int *indx, int *pntrb, int *pntre, float *x, float *beta, float *y);
void mkl_dcsrmv(char *transa, int *m, int *k, double *alpha, char *matdescra, double *val, int *indx, int *pntrb, int *pntre, double *x, double *beta, double *y);
void mkl_ccsrmv(char *transa, int *m, int *k, MKL_Complex8 *alpha, char *matdescra, MKL_Complex8 *val, int *indx, int *pntrb, int *pntre, MKL_Complex8 *x, MKL_Complex8 *beta, MKL_Complex8 *y);
void mkl_zcsrmv(char *transa, int *m, int *k, MKL_Complex16 *alpha, char *matdescra, MKL_Complex16 *val, int *indx, int *pntrb, int
*pntre, MKL_Complex16 *x, MKL_Complex16 *beta, MKL_Complex16 *y);

mkl_?bsrmv
Computes matrix-vector product of a sparse matrix stored in the BSR format.

Syntax
Fortran:
call mkl_sbsrmv(transa, m, k, lb, alpha, matdescra, val, indx, pntrb, pntre, x, beta, y)
call mkl_dbsrmv(transa, m, k, lb, alpha, matdescra, val, indx, pntrb, pntre, x, beta, y)
call mkl_cbsrmv(transa, m, k, lb, alpha, matdescra, val, indx, pntrb, pntre, x, beta, y)
call mkl_zbsrmv(transa, m, k, lb, alpha, matdescra, val, indx, pntrb, pntre, x, beta, y)
C:
mkl_sbsrmv(&transa, &m, &k, &lb, &alpha, matdescra, val, indx, pntrb, pntre, x, &beta, y);
mkl_dbsrmv(&transa, &m, &k, &lb, &alpha, matdescra, val, indx, pntrb, pntre, x, &beta, y);
mkl_cbsrmv(&transa, &m, &k, &lb, &alpha, matdescra, val, indx, pntrb, pntre, x, &beta, y);
mkl_zbsrmv(&transa, &m, &k, &lb, &alpha, matdescra, val, indx, pntrb, pntre, x, &beta, y);

Include Files
• FORTRAN 77: mkl_spblas.fi
• C: mkl_spblas.h

Description
The mkl_?bsrmv routine performs a matrix-vector operation defined as
y := alpha*A*x + beta*y or y := alpha*A'*x + beta*y,
where: alpha and beta are scalars, x and y are vectors, A is an m-by-k block sparse matrix in the BSR format, A' is the transpose of A.

NOTE This routine supports a BSR format both with one-based indexing and zero-based indexing.

Input Parameters
Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below.

transa CHARACTER*1. Specifies the operation. If transa = 'N' or 'n', then the matrix-vector product is computed as y := alpha*A*x + beta*y. If transa = 'T' or 't' or 'C' or 'c', then the matrix-vector product is computed as y := alpha*A'*x + beta*y.
m INTEGER. Number of block rows of the matrix A.
k INTEGER. Number of block columns of the matrix A.
lb INTEGER.
Size of the block in the matrix A.
alpha REAL for mkl_sbsrmv. DOUBLE PRECISION for mkl_dbsrmv. COMPLEX for mkl_cbsrmv. DOUBLE COMPLEX for mkl_zbsrmv. Specifies the scalar alpha.
matdescra CHARACTER. Array of six elements, specifies properties of the matrix used for operation. Only the first four array elements are used; their possible values are given in Table “Possible Values of the Parameter matdescra (descra)”. Possible combinations of element values of this parameter are given in Table “Possible Combinations of Element Values of the Parameter matdescra”.
val REAL for mkl_sbsrmv. DOUBLE PRECISION for mkl_dbsrmv. COMPLEX for mkl_cbsrmv. DOUBLE COMPLEX for mkl_zbsrmv. Array containing elements of non-zero blocks of the matrix A. Its length is equal to the number of non-zero blocks in the matrix A multiplied by lb*lb. Refer to values array description in BSR Format for more details.
indx INTEGER. Array containing the column indices for each non-zero block in the matrix A. Its length is equal to the number of non-zero blocks in the matrix A. Refer to columns array description in BSR Format for more details.
pntrb INTEGER. Array of length m. For one-based indexing this array contains row indices, such that pntrb(i) - pntrb(1)+1 is the first index of block row i in the array indx. For zero-based indexing this array contains row indices, such that pntrb(i) - pntrb(0) is the first index of block row i in the array indx. Refer to pointerB array description in BSR Format for more details.
pntre INTEGER. Array of length m. For one-based indexing this array contains row indices, such that pntre(i) - pntrb(1) is the last index of block row i in the array indx. For zero-based indexing this array contains row indices, such that pntre(i) - pntrb(0)-1 is the last index of block row i in the array indx. Refer to pointerE array description in BSR Format for more details.
x REAL for mkl_sbsrmv. DOUBLE PRECISION for mkl_dbsrmv.
COMPLEX for mkl_cbsrmv. DOUBLE COMPLEX for mkl_zbsrmv. Array, DIMENSION at least (k*lb) if transa = 'N' or 'n', and at least (m*lb) otherwise. On entry, the array x must contain the vector x.
beta REAL for mkl_sbsrmv. DOUBLE PRECISION for mkl_dbsrmv. COMPLEX for mkl_cbsrmv. DOUBLE COMPLEX for mkl_zbsrmv. Specifies the scalar beta.
y REAL for mkl_sbsrmv. DOUBLE PRECISION for mkl_dbsrmv. COMPLEX for mkl_cbsrmv. DOUBLE COMPLEX for mkl_zbsrmv. Array, DIMENSION at least (m*lb) if transa = 'N' or 'n', and at least (k*lb) otherwise. On entry, the array y must contain the vector y.

Output Parameters
y Overwritten by the updated vector y.

Interfaces
FORTRAN 77:
SUBROUTINE mkl_sbsrmv(transa, m, k, lb, alpha, matdescra, val, indx, pntrb, pntre, x, beta, y)
  CHARACTER*1 transa
  CHARACTER matdescra(*)
  INTEGER m, k, lb
  INTEGER indx(*), pntrb(m), pntre(m)
  REAL alpha, beta
  REAL val(*), x(*), y(*)
SUBROUTINE mkl_dbsrmv(transa, m, k, lb, alpha, matdescra, val, indx, pntrb, pntre, x, beta, y)
  CHARACTER*1 transa
  CHARACTER matdescra(*)
  INTEGER m, k, lb
  INTEGER indx(*), pntrb(m), pntre(m)
  DOUBLE PRECISION alpha, beta
  DOUBLE PRECISION val(*), x(*), y(*)
SUBROUTINE mkl_cbsrmv(transa, m, k, lb, alpha, matdescra, val, indx, pntrb, pntre, x, beta, y)
  CHARACTER*1 transa
  CHARACTER matdescra(*)
  INTEGER m, k, lb
  INTEGER indx(*), pntrb(m), pntre(m)
  COMPLEX alpha, beta
  COMPLEX val(*), x(*), y(*)
SUBROUTINE mkl_zbsrmv(transa, m, k, lb, alpha, matdescra, val, indx, pntrb, pntre, x, beta, y)
  CHARACTER*1 transa
  CHARACTER matdescra(*)
  INTEGER m, k, lb
  INTEGER indx(*), pntrb(m), pntre(m)
  DOUBLE COMPLEX alpha, beta
  DOUBLE COMPLEX val(*), x(*), y(*)
C:
void mkl_sbsrmv(char *transa, int *m, int *k, int *lb, float *alpha, char *matdescra, float *val, int *indx, int *pntrb, int *pntre, float *x, float *beta, float *y);
void mkl_dbsrmv(char *transa, int *m, int *k, int *lb, double *alpha, char *matdescra, double *val, int
*indx, int *pntrb, int *pntre, double *x, double *beta, double *y);
void mkl_cbsrmv(char *transa, int *m, int *k, int *lb, MKL_Complex8 *alpha, char *matdescra, MKL_Complex8 *val, int *indx, int *pntrb, int *pntre, MKL_Complex8 *x, MKL_Complex8 *beta, MKL_Complex8 *y);
void mkl_zbsrmv(char *transa, int *m, int *k, int *lb, MKL_Complex16 *alpha, char *matdescra, MKL_Complex16 *val, int *indx, int *pntrb, int *pntre, MKL_Complex16 *x, MKL_Complex16 *beta, MKL_Complex16 *y);

mkl_?cscmv
Computes matrix-vector product for a sparse matrix in the CSC format.

Syntax
Fortran:
call mkl_scscmv(transa, m, k, alpha, matdescra, val, indx, pntrb, pntre, x, beta, y)
call mkl_dcscmv(transa, m, k, alpha, matdescra, val, indx, pntrb, pntre, x, beta, y)
call mkl_ccscmv(transa, m, k, alpha, matdescra, val, indx, pntrb, pntre, x, beta, y)
call mkl_zcscmv(transa, m, k, alpha, matdescra, val, indx, pntrb, pntre, x, beta, y)
C:
mkl_scscmv(&transa, &m, &k, &alpha, matdescra, val, indx, pntrb, pntre, x, &beta, y);
mkl_dcscmv(&transa, &m, &k, &alpha, matdescra, val, indx, pntrb, pntre, x, &beta, y);
mkl_ccscmv(&transa, &m, &k, &alpha, matdescra, val, indx, pntrb, pntre, x, &beta, y);
mkl_zcscmv(&transa, &m, &k, &alpha, matdescra, val, indx, pntrb, pntre, x, &beta, y);

Include Files
• FORTRAN 77: mkl_spblas.fi
• C: mkl_spblas.h

Description
The mkl_?cscmv routine performs a matrix-vector operation defined as
y := alpha*A*x + beta*y or y := alpha*A'*x + beta*y,
where: alpha and beta are scalars, x and y are vectors, A is an m-by-k sparse matrix in compressed sparse column (CSC) format, A' is the transpose of A.

NOTE This routine supports CSC format both with one-based indexing and zero-based indexing.

Input Parameters
Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types.
Data types specific to the different interfaces are described in the section "Interfaces" below. transa CHARACTER*1. Specifies the operation. If transa = 'N' or 'n', then y := alpha*A*x + beta*y If transa = 'T' or 't' or 'C' or 'c', then y := alpha*A'*x + beta*y, m INTEGER. Number of rows of the matrix A. k INTEGER. Number of columns of the matrix A. alpha REAL for mkl_scscmv. DOUBLE PRECISION for mkl_dcscmv. COMPLEX for mkl_ccscmv. DOUBLE COMPLEX for mkl_zcscmv. Specifies the scalar alpha. matdescra CHARACTER. Array of six elements, specifies properties of the matrix used for operation. Only the first four array elements are used; their possible values are given in Table “Possible Values of the Parameter matdescra (descra)”. Possible combinations of element values of this parameter are given in Table “Possible Combinations of Element Values of the Parameter matdescra”. val REAL for mkl_scscmv. DOUBLE PRECISION for mkl_dcscmv. COMPLEX for mkl_ccscmv. DOUBLE COMPLEX for mkl_zcscmv. Array containing non-zero elements of the matrix A. For one-based indexing its length is pntre(k) - pntrb(1). For zero-based indexing its length is pntre(k-1) - pntrb(0). Refer to values array description in CSC Format for more details. indx INTEGER. Array containing the row indices for each non-zero element of the matrix A. Its length is equal to the length of the val array. Refer to rows array description in CSC Format for more details. pntrb INTEGER. Array of length k. For one-based indexing this array contains column indices, such that pntrb(i) - pntrb(1)+1 is the first index of column i in the arrays val and indx. For zero-based indexing this array contains column indices, such that pntrb(i) - pntrb(0) is the first index of column i in the arrays val and indx. Refer to pointerb array description in CSC Format for more details. pntre INTEGER. Array of length k.
For one-based indexing this array contains column indices, such that pntre(i) - pntrb(1) is the last index of column i in the arrays val and indx. For zero-based indexing this array contains column indices, such that pntre(i) - pntrb(0)-1 is the last index of column i in the arrays val and indx. Refer to pointerE array description in CSC Format for more details. x REAL for mkl_scscmv. DOUBLE PRECISION for mkl_dcscmv. COMPLEX for mkl_ccscmv. DOUBLE COMPLEX for mkl_zcscmv. Array, DIMENSION at least k if transa = 'N' or 'n' and at least m otherwise. On entry, the array x must contain the vector x. beta REAL for mkl_scscmv. DOUBLE PRECISION for mkl_dcscmv. COMPLEX for mkl_ccscmv. DOUBLE COMPLEX for mkl_zcscmv. Specifies the scalar beta. y REAL for mkl_scscmv. DOUBLE PRECISION for mkl_dcscmv. COMPLEX for mkl_ccscmv. DOUBLE COMPLEX for mkl_zcscmv. Array, DIMENSION at least m if transa = 'N' or 'n' and at least k otherwise. On entry, the array y must contain the vector y. Output Parameters y Overwritten by the updated vector y.
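The pntrb/pntre description above can be made concrete with a small reference loop. The following is a minimal sketch, not the library's implementation: ref_dcscmv_n is a hypothetical helper name, and the sketch assumes a general (non-symmetric) matrix, zero-based indexing, double precision, and transa = 'N' only.

```c
#include <assert.h>

/* Hypothetical reference helper (not an MKL routine):
 * y := alpha*A*x + beta*y for an m-by-k general matrix A stored in
 * zero-based CSC form. Column j occupies positions
 * pntrb[j]-pntrb[0] .. pntre[j]-pntrb[0]-1 of val and indx. */
void ref_dcscmv_n(int m, int k, double alpha,
                  const double *val, const int *indx,
                  const int *pntrb, const int *pntre,
                  const double *x, double beta, double *y)
{
    assert(m >= 0 && k >= 0);

    /* Scale y first; with beta == 0 the old contents of y are ignored. */
    for (int i = 0; i < m; ++i)
        y[i] = (beta == 0.0) ? 0.0 : beta * y[i];

    if (k == 0)
        return;

    int base = pntrb[0];
    for (int j = 0; j < k; ++j) {
        double ax = alpha * x[j];
        /* Scatter column j of A, scaled by alpha*x[j], into y. */
        for (int p = pntrb[j] - base; p < pntre[j] - base; ++p) {
            assert(indx[p] >= 0 && indx[p] < m);
            y[indx[p]] += val[p] * ax;
        }
    }
}
```

For the 2-by-3 matrix with rows (1, 0, 2) and (0, 3, 0), the arrays would be val = {1, 3, 2}, indx = {0, 1, 0}, pntrb = {0, 1, 2}, pntre = {1, 2, 3}.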
Interfaces FORTRAN 77: SUBROUTINE mkl_scscmv(transa, m, k, alpha, matdescra, val, indx, pntrb, pntre, x, beta, y) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m, k INTEGER indx(*), pntrb(k), pntre(k) REAL alpha, beta REAL val(*), x(*), y(*) SUBROUTINE mkl_dcscmv(transa, m, k, alpha, matdescra, val, indx, pntrb, pntre, x, beta, y) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m, k INTEGER indx(*), pntrb(k), pntre(k) DOUBLE PRECISION alpha, beta DOUBLE PRECISION val(*), x(*), y(*) SUBROUTINE mkl_ccscmv(transa, m, k, alpha, matdescra, val, indx, pntrb, pntre, x, beta, y) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m, k INTEGER indx(*), pntrb(k), pntre(k) COMPLEX alpha, beta COMPLEX val(*), x(*), y(*) SUBROUTINE mkl_zcscmv(transa, m, k, alpha, matdescra, val, indx, pntrb, pntre, x, beta, y) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m, k INTEGER indx(*), pntrb(k), pntre(k) DOUBLE COMPLEX alpha, beta DOUBLE COMPLEX val(*), x(*), y(*) C: void mkl_scscmv(char *transa, int *m, int *k, float *alpha, char *matdescra, float *val, int *indx, int *pntrb, int *pntre, float *x, float *beta, float *y); void mkl_dcscmv(char *transa, int *m, int *k, double *alpha, char *matdescra, double *val, int *indx, int *pntrb, int *pntre, double *x, double *beta, double *y); void mkl_ccscmv(char *transa, int *m, int *k, MKL_Complex8 *alpha, char *matdescra, MKL_Complex8 *val, int *indx, int *pntrb, int *pntre, MKL_Complex8 *x, MKL_Complex8 *beta, MKL_Complex8 *y); void mkl_zcscmv(char *transa, int *m, int *k, MKL_Complex16 *alpha, char *matdescra, MKL_Complex16 *val, int *indx, int *pntrb, int *pntre, MKL_Complex16 *x, MKL_Complex16 *beta, MKL_Complex16 *y); mkl_?coomv Computes matrix-vector product for a sparse matrix in the coordinate format.
Syntax Fortran: call mkl_scoomv(transa, m, k, alpha, matdescra, val, rowind, colind, nnz, x, beta, y) call mkl_dcoomv(transa, m, k, alpha, matdescra, val, rowind, colind, nnz, x, beta, y) call mkl_ccoomv(transa, m, k, alpha, matdescra, val, rowind, colind, nnz, x, beta, y) call mkl_zcoomv(transa, m, k, alpha, matdescra, val, rowind, colind, nnz, x, beta, y) C: mkl_scoomv(&transa, &m, &k, &alpha, matdescra, val, rowind, colind, &nnz, x, &beta, y); mkl_dcoomv(&transa, &m, &k, &alpha, matdescra, val, rowind, colind, &nnz, x, &beta, y); mkl_ccoomv(&transa, &m, &k, &alpha, matdescra, val, rowind, colind, &nnz, x, &beta, y); mkl_zcoomv(&transa, &m, &k, &alpha, matdescra, val, rowind, colind, &nnz, x, &beta, y); Include Files • FORTRAN 77: mkl_spblas.fi • C: mkl_spblas.h Description The mkl_?coomv routine performs a matrix-vector operation defined as y := alpha*A*x + beta*y or y := alpha*A'*x + beta*y, where: alpha and beta are scalars, x and y are vectors, A is an m-by-k sparse matrix in the coordinate format, A' is the transpose of A. NOTE This routine supports a coordinate format both with one-based indexing and zero-based indexing. Input Parameters Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below. transa CHARACTER*1. Specifies the operation. If transa = 'N' or 'n', then y := alpha*A*x + beta*y If transa = 'T' or 't' or 'C' or 'c', then y := alpha*A'*x + beta*y, m INTEGER. Number of rows of the matrix A. k INTEGER. Number of columns of the matrix A. alpha REAL for mkl_scoomv. DOUBLE PRECISION for mkl_dcoomv. COMPLEX for mkl_ccoomv. DOUBLE COMPLEX for mkl_zcoomv. Specifies the scalar alpha. matdescra CHARACTER. Array of six elements, specifies properties of the matrix used for operation.
Only the first four array elements are used; their possible values are given in Table “Possible Values of the Parameter matdescra (descra)”. Possible combinations of element values of this parameter are given in Table “Possible Combinations of Element Values of the Parameter matdescra”. val REAL for mkl_scoomv. DOUBLE PRECISION for mkl_dcoomv. COMPLEX for mkl_ccoomv. DOUBLE COMPLEX for mkl_zcoomv. Array of length nnz, contains non-zero elements of the matrix A in arbitrary order. Refer to values array description in Coordinate Format for more details. rowind INTEGER. Array of length nnz, contains the row indices for each non-zero element of the matrix A. Refer to rows array description in Coordinate Format for more details. colind INTEGER. Array of length nnz, contains the column indices for each non-zero element of the matrix A. Refer to columns array description in Coordinate Format for more details. nnz INTEGER. Specifies the number of non-zero elements of the matrix A. Refer to nnz description in Coordinate Format for more details. x REAL for mkl_scoomv. DOUBLE PRECISION for mkl_dcoomv. COMPLEX for mkl_ccoomv. DOUBLE COMPLEX for mkl_zcoomv. Array, DIMENSION at least k if transa = 'N' or 'n' and at least m otherwise. On entry, the array x must contain the vector x. beta REAL for mkl_scoomv. DOUBLE PRECISION for mkl_dcoomv. COMPLEX for mkl_ccoomv. DOUBLE COMPLEX for mkl_zcoomv. Specifies the scalar beta. y REAL for mkl_scoomv. DOUBLE PRECISION for mkl_dcoomv. COMPLEX for mkl_ccoomv. DOUBLE COMPLEX for mkl_zcoomv. Array, DIMENSION at least m if transa = 'N' or 'n' and at least k otherwise. On entry, the array y must contain the vector y. Output Parameters y Overwritten by the updated vector y.
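How transa swaps the roles of rowind and colind, and the required lengths of x and y, can be illustrated with a short reference loop. This is a sketch under stated assumptions rather than the library's code: ref_dcoomv is a hypothetical name, the matrix is treated as general, indexing is zero-based, and for real data 'T' and 'C' are handled identically.

```c
#include <assert.h>

/* Hypothetical reference helper (not an MKL routine):
 * y := alpha*op(A)*x + beta*y for an m-by-k general matrix A given as
 * nnz zero-based (rowind, colind, val) triplets in arbitrary order.
 * op(A) = A when transa is 'N' or 'n', otherwise op(A) = A'. */
void ref_dcoomv(char transa, int m, int k, double alpha,
                const double *val, const int *rowind, const int *colind,
                int nnz, const double *x, double beta, double *y)
{
    int notrans = (transa == 'N' || transa == 'n');
    int ylen = notrans ? m : k;   /* y has m entries for A, k for A' */

    for (int i = 0; i < ylen; ++i)
        y[i] = (beta == 0.0) ? 0.0 : beta * y[i];

    for (int p = 0; p < nnz; ++p) {
        int r = rowind[p];
        int c = colind[p];
        assert(r >= 0 && r < m && c >= 0 && c < k);
        if (notrans)
            y[r] += alpha * val[p] * x[c];   /* A(r,c) contributes to y(r) */
        else
            y[c] += alpha * val[p] * x[r];   /* A'(c,r) contributes to y(c) */
    }
}
```

Because every triplet is processed independently, the order of the entries in val, rowind, and colind does not affect the result beyond floating-point rounding.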
Interfaces FORTRAN 77: SUBROUTINE mkl_scoomv(transa, m, k, alpha, matdescra, val, rowind, colind, nnz, x, beta, y) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m, k, nnz INTEGER rowind(*), colind(*) REAL alpha, beta REAL val(*), x(*), y(*) SUBROUTINE mkl_dcoomv(transa, m, k, alpha, matdescra, val, rowind, colind, nnz, x, beta, y) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m, k, nnz INTEGER rowind(*), colind(*) DOUBLE PRECISION alpha, beta DOUBLE PRECISION val(*), x(*), y(*) SUBROUTINE mkl_ccoomv(transa, m, k, alpha, matdescra, val, rowind, colind, nnz, x, beta, y) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m, k, nnz INTEGER rowind(*), colind(*) COMPLEX alpha, beta COMPLEX val(*), x(*), y(*) SUBROUTINE mkl_zcoomv(transa, m, k, alpha, matdescra, val, rowind, colind, nnz, x, beta, y) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m, k, nnz INTEGER rowind(*), colind(*) DOUBLE COMPLEX alpha, beta DOUBLE COMPLEX val(*), x(*), y(*) C: void mkl_scoomv(char *transa, int *m, int *k, float *alpha, char *matdescra, float *val, int *rowind, int *colind, int *nnz, float *x, float *beta, float *y); void mkl_dcoomv(char *transa, int *m, int *k, double *alpha, char *matdescra, double *val, int *rowind, int *colind, int *nnz, double *x, double *beta, double *y); void mkl_ccoomv(char *transa, int *m, int *k, MKL_Complex8 *alpha, char *matdescra, MKL_Complex8 *val, int *rowind, int *colind, int *nnz, MKL_Complex8 *x, MKL_Complex8 *beta, MKL_Complex8 *y); void mkl_zcoomv(char *transa, int *m, int *k, MKL_Complex16 *alpha, char *matdescra, MKL_Complex16 *val, int *rowind, int *colind, int *nnz, MKL_Complex16 *x, MKL_Complex16 *beta, MKL_Complex16 *y); mkl_?csrsv Solves a system of linear equations for a sparse matrix in the CSR format.
Syntax Fortran: call mkl_scsrsv(transa, m, alpha, matdescra, val, indx, pntrb, pntre, x, y) call mkl_dcsrsv(transa, m, alpha, matdescra, val, indx, pntrb, pntre, x, y) call mkl_ccsrsv(transa, m, alpha, matdescra, val, indx, pntrb, pntre, x, y) call mkl_zcsrsv(transa, m, alpha, matdescra, val, indx, pntrb, pntre, x, y) C: mkl_scsrsv(&transa, &m, &alpha, matdescra, val, indx, pntrb, pntre, x, y); mkl_dcsrsv(&transa, &m, &alpha, matdescra, val, indx, pntrb, pntre, x, y); mkl_ccsrsv(&transa, &m, &alpha, matdescra, val, indx, pntrb, pntre, x, y); mkl_zcsrsv(&transa, &m, &alpha, matdescra, val, indx, pntrb, pntre, x, y); Include Files • FORTRAN 77: mkl_spblas.fi • C: mkl_spblas.h Description The mkl_?csrsv routine solves a system of linear equations with matrix-vector operations for a sparse matrix in the CSR format: y := alpha*inv(A)*x or y := alpha*inv(A')*x, where: alpha is scalar, x and y are vectors, A is a sparse upper or lower triangular matrix with unit or non-unit main diagonal, A' is the transpose of A. NOTE This routine supports a CSR format both with one-based indexing and zero-based indexing. Input Parameters Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below. transa CHARACTER*1. Specifies the system of linear equations. If transa = 'N' or 'n', then y := alpha*inv(A)*x If transa = 'T' or 't' or 'C' or 'c', then y := alpha*inv(A')*x, m INTEGER. Number of columns of the matrix A. alpha REAL for mkl_scsrsv. DOUBLE PRECISION for mkl_dcsrsv. COMPLEX for mkl_ccsrsv. DOUBLE COMPLEX for mkl_zcsrsv. Specifies the scalar alpha. matdescra CHARACTER. Array of six elements, specifies properties of the matrix used for operation.
Only the first four array elements are used; their possible values are given in Table “Possible Values of the Parameter matdescra (descra)”. Possible combinations of element values of this parameter are given in Table “Possible Combinations of Element Values of the Parameter matdescra”. val REAL for mkl_scsrsv. DOUBLE PRECISION for mkl_dcsrsv. COMPLEX for mkl_ccsrsv. DOUBLE COMPLEX for mkl_zcsrsv. Array containing non-zero elements of the matrix A. For one-based indexing its length is pntre(m) - pntrb(1). For zero-based indexing its length is pntre(m-1) - pntrb(0). Refer to values array description in CSR Format for more details. NOTE The non-zero elements of the given row of the matrix must be stored in the same order as they appear in the row (from left to right). No diagonal element can be omitted from a sparse storage if the solver is called with the non-unit indicator. indx INTEGER. Array containing the column indices for each non-zero element of the matrix A. Its length is equal to the length of the val array. Refer to columns array description in CSR Format for more details. NOTE Column indices must be sorted in increasing order for each row. pntrb INTEGER. Array of length m. For one-based indexing this array contains row indices, such that pntrb(i) - pntrb(1)+1 is the first index of row i in the arrays val and indx. For zero-based indexing this array contains row indices, such that pntrb(i) - pntrb(0) is the first index of row i in the arrays val and indx. Refer to pointerb array description in CSR Format for more details. pntre INTEGER. Array of length m. For one-based indexing this array contains row indices, such that pntre(i) - pntrb(1) is the last index of row i in the arrays val and indx. For zero-based indexing this array contains row indices, such that pntre(i) - pntrb(0)-1 is the last index of row i in the arrays val and indx. Refer to pointerE array description in CSR Format for more details. x REAL for mkl_scsrsv.
DOUBLE PRECISION for mkl_dcsrsv. COMPLEX for mkl_ccsrsv. DOUBLE COMPLEX for mkl_zcsrsv. Array, DIMENSION at least m. On entry, the array x must contain the vector x. The elements are accessed with unit increment. y REAL for mkl_scsrsv. DOUBLE PRECISION for mkl_dcsrsv. COMPLEX for mkl_ccsrsv. DOUBLE COMPLEX for mkl_zcsrsv. Array, DIMENSION at least m. On entry, the array y must contain the vector y. The elements are accessed with unit increment. Output Parameters y Contains solution vector x. Interfaces FORTRAN 77: SUBROUTINE mkl_scsrsv(transa, m, alpha, matdescra, val, indx, pntrb, pntre, x, y) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m INTEGER indx(*), pntrb(m), pntre(m) REAL alpha REAL val(*) REAL x(*), y(*) SUBROUTINE mkl_dcsrsv(transa, m, alpha, matdescra, val, indx, pntrb, pntre, x, y) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m INTEGER indx(*), pntrb(m), pntre(m) DOUBLE PRECISION alpha DOUBLE PRECISION val(*) DOUBLE PRECISION x(*), y(*) SUBROUTINE mkl_ccsrsv(transa, m, alpha, matdescra, val, indx, pntrb, pntre, x, y) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m INTEGER indx(*), pntrb(m), pntre(m) COMPLEX alpha COMPLEX val(*) COMPLEX x(*), y(*) SUBROUTINE mkl_zcsrsv(transa, m, alpha, matdescra, val, indx, pntrb, pntre, x, y) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m INTEGER indx(*), pntrb(m), pntre(m) DOUBLE COMPLEX alpha DOUBLE COMPLEX val(*) DOUBLE COMPLEX x(*), y(*) C: void mkl_scsrsv(char *transa, int *m, float *alpha, char *matdescra, float *val, int *indx, int *pntrb, int *pntre, float *x, float *y); void mkl_dcsrsv(char *transa, int *m, double *alpha, char *matdescra, double *val, int *indx, int *pntrb, int *pntre, double *x, double *y); void mkl_ccsrsv(char *transa, int *m, MKL_Complex8 *alpha, char *matdescra, MKL_Complex8 *val, int *indx, int *pntrb, int *pntre, MKL_Complex8 *x, MKL_Complex8 *y); void mkl_zcsrsv(char *transa, int
*m, MKL_Complex16 *alpha, char *matdescra, MKL_Complex16 *val, int *indx, int *pntrb, int *pntre, MKL_Complex16 *x, MKL_Complex16 *y); mkl_?bsrsv Solves a system of linear equations for a sparse matrix in the BSR format. Syntax Fortran: call mkl_sbsrsv(transa, m, lb, alpha, matdescra, val, indx, pntrb, pntre, x, y) call mkl_dbsrsv(transa, m, lb, alpha, matdescra, val, indx, pntrb, pntre, x, y) call mkl_cbsrsv(transa, m, lb, alpha, matdescra, val, indx, pntrb, pntre, x, y) call mkl_zbsrsv(transa, m, lb, alpha, matdescra, val, indx, pntrb, pntre, x, y) C: mkl_sbsrsv(&transa, &m, &lb, &alpha, matdescra, val, indx, pntrb, pntre, x, y); mkl_dbsrsv(&transa, &m, &lb, &alpha, matdescra, val, indx, pntrb, pntre, x, y); mkl_cbsrsv(&transa, &m, &lb, &alpha, matdescra, val, indx, pntrb, pntre, x, y); mkl_zbsrsv(&transa, &m, &lb, &alpha, matdescra, val, indx, pntrb, pntre, x, y); Include Files • FORTRAN 77: mkl_spblas.fi • C: mkl_spblas.h Description The mkl_?bsrsv routine solves a system of linear equations with matrix-vector operations for a sparse matrix in the BSR format: y := alpha*inv(A)*x or y := alpha*inv(A')*x, where: alpha is scalar, x and y are vectors, A is a sparse upper or lower triangular matrix with unit or non-unit main diagonal, A' is the transpose of A. NOTE This routine supports a BSR format both with one-based indexing and zero-based indexing. Input Parameters Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below. transa CHARACTER*1. Specifies the operation. If transa = 'N' or 'n', then y := alpha*inv(A)*x If transa = 'T' or 't' or 'C' or 'c', then y := alpha*inv(A')*x, m INTEGER. Number of block columns of the matrix A. lb INTEGER. Size of the block in the matrix A. alpha REAL for mkl_sbsrsv.
DOUBLE PRECISION for mkl_dbsrsv. COMPLEX for mkl_cbsrsv. DOUBLE COMPLEX for mkl_zbsrsv. Specifies the scalar alpha. matdescra CHARACTER. Array of six elements, specifies properties of the matrix used for operation. Only the first four array elements are used; their possible values are given in Table “Possible Values of the Parameter matdescra (descra)”. Possible combinations of element values of this parameter are given in Table “Possible Combinations of Element Values of the Parameter matdescra”. val REAL for mkl_sbsrsv. DOUBLE PRECISION for mkl_dbsrsv. COMPLEX for mkl_cbsrsv. DOUBLE COMPLEX for mkl_zbsrsv. Array containing elements of non-zero blocks of the matrix A. Its length is equal to the number of non-zero blocks in the matrix A multiplied by lb*lb. Refer to the values array description in BSR Format for more details. NOTE The non-zero elements of the given row of the matrix must be stored in the same order as they appear in the row (from left to right). No diagonal element can be omitted from a sparse storage if the solver is called with the non-unit indicator. indx INTEGER. Array containing the column indices for each non-zero block in the matrix A. Its length is equal to the number of non-zero blocks in the matrix A. Refer to the columns array description in BSR Format for more details. pntrb INTEGER. Array of length m. For one-based indexing this array contains row indices, such that pntrb(i) - pntrb(1)+1 is the first index of block row i in the array indx. For zero-based indexing this array contains row indices, such that pntrb(i) - pntrb(0) is the first index of block row i in the array indx. Refer to pointerB array description in BSR Format for more details. pntre INTEGER. Array of length m. For one-based indexing this array contains row indices, such that pntre(i) - pntrb(1) is the last index of block row i in the array indx.
For zero-based indexing this array contains row indices, such that pntre(i) - pntrb(0)-1 is the last index of block row i in the array indx. Refer to pointerE array description in BSR Format for more details. x REAL for mkl_sbsrsv. DOUBLE PRECISION for mkl_dbsrsv. COMPLEX for mkl_cbsrsv. DOUBLE COMPLEX for mkl_zbsrsv. Array, DIMENSION at least (m*lb). On entry, the array x must contain the vector x. The elements are accessed with unit increment. y REAL for mkl_sbsrsv. DOUBLE PRECISION for mkl_dbsrsv. COMPLEX for mkl_cbsrsv. DOUBLE COMPLEX for mkl_zbsrsv. Array, DIMENSION at least (m*lb). On entry, the array y must contain the vector y. The elements are accessed with unit increment. Output Parameters y Contains solution vector x. Interfaces FORTRAN 77: SUBROUTINE mkl_sbsrsv(transa, m, lb, alpha, matdescra, val, indx, pntrb, pntre, x, y) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m, lb INTEGER indx(*), pntrb(m), pntre(m) REAL alpha REAL val(*) REAL x(*), y(*) SUBROUTINE mkl_dbsrsv(transa, m, lb, alpha, matdescra, val, indx, pntrb, pntre, x, y) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m, lb INTEGER indx(*), pntrb(m), pntre(m) DOUBLE PRECISION alpha DOUBLE PRECISION val(*) DOUBLE PRECISION x(*), y(*) SUBROUTINE mkl_cbsrsv(transa, m, lb, alpha, matdescra, val, indx, pntrb, pntre, x, y) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m, lb INTEGER indx(*), pntrb(m), pntre(m) COMPLEX alpha COMPLEX val(*) COMPLEX x(*), y(*) SUBROUTINE mkl_zbsrsv(transa, m, lb, alpha, matdescra, val, indx, pntrb, pntre, x, y) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m, lb INTEGER indx(*), pntrb(m), pntre(m) DOUBLE COMPLEX alpha DOUBLE COMPLEX val(*) DOUBLE COMPLEX x(*), y(*) C: void mkl_sbsrsv(char *transa, int *m, int *lb, float *alpha, char *matdescra, float *val, int *indx, int *pntrb, int *pntre, float *x, float *y); void mkl_dbsrsv(char *transa, int *m, int *lb, double
*alpha, char *matdescra, double *val, int *indx, int *pntrb, int *pntre, double *x, double *y); void mkl_cbsrsv(char *transa, int *m, int *lb, MKL_Complex8 *alpha, char *matdescra, MKL_Complex8 *val, int *indx, int *pntrb, int *pntre, MKL_Complex8 *x, MKL_Complex8 *y); void mkl_zbsrsv(char *transa, int *m, int *lb, MKL_Complex16 *alpha, char *matdescra, MKL_Complex16 *val, int *indx, int *pntrb, int *pntre, MKL_Complex16 *x, MKL_Complex16 *y); mkl_?cscsv Solves a system of linear equations for a sparse matrix in the CSC format. Syntax Fortran: call mkl_scscsv(transa, m, alpha, matdescra, val, indx, pntrb, pntre, x, y) call mkl_dcscsv(transa, m, alpha, matdescra, val, indx, pntrb, pntre, x, y) call mkl_ccscsv(transa, m, alpha, matdescra, val, indx, pntrb, pntre, x, y) call mkl_zcscsv(transa, m, alpha, matdescra, val, indx, pntrb, pntre, x, y) C: mkl_scscsv(&transa, &m, &alpha, matdescra, val, indx, pntrb, pntre, x, y); mkl_dcscsv(&transa, &m, &alpha, matdescra, val, indx, pntrb, pntre, x, y); mkl_ccscsv(&transa, &m, &alpha, matdescra, val, indx, pntrb, pntre, x, y); mkl_zcscsv(&transa, &m, &alpha, matdescra, val, indx, pntrb, pntre, x, y); Include Files • FORTRAN 77: mkl_spblas.fi • C: mkl_spblas.h Description The mkl_?cscsv routine solves a system of linear equations with matrix-vector operations for a sparse matrix in the CSC format: y := alpha*inv(A)*x or y := alpha*inv(A')*x, where: alpha is scalar, x and y are vectors, A is a sparse upper or lower triangular matrix with unit or non-unit main diagonal, A' is the transpose of A. NOTE This routine supports a CSC format both with one-based indexing and zero-based indexing. Input Parameters Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types.
Data types specific to the different interfaces are described in the section "Interfaces" below. transa CHARACTER*1. Specifies the operation. If transa = 'N' or 'n', then y := alpha*inv(A)*x If transa = 'T' or 't' or 'C' or 'c', then y := alpha*inv(A')*x, m INTEGER. Number of columns of the matrix A. alpha REAL for mkl_scscsv. DOUBLE PRECISION for mkl_dcscsv. COMPLEX for mkl_ccscsv. DOUBLE COMPLEX for mkl_zcscsv. Specifies the scalar alpha. matdescra CHARACTER. Array of six elements, specifies properties of the matrix used for operation. Only the first four array elements are used; their possible values are given in Table “Possible Values of the Parameter matdescra (descra)”. Possible combinations of element values of this parameter are given in Table “Possible Combinations of Element Values of the Parameter matdescra”. val REAL for mkl_scscsv. DOUBLE PRECISION for mkl_dcscsv. COMPLEX for mkl_ccscsv. DOUBLE COMPLEX for mkl_zcscsv. Array containing non-zero elements of the matrix A. For one-based indexing its length is pntre(m) - pntrb(1). For zero-based indexing its length is pntre(m-1) - pntrb(0). Refer to values array description in CSC Format for more details. NOTE The non-zero elements of the given row of the matrix must be stored in the same order as they appear in the row (from left to right). No diagonal element can be omitted from a sparse storage if the solver is called with the non-unit indicator. indx INTEGER. Array containing the row indices for each non-zero element of the matrix A. Its length is equal to the length of the val array. Refer to rows array description in CSC Format for more details. NOTE Row indices must be sorted in increasing order for each column. pntrb INTEGER. Array of length m. For one-based indexing this array contains column indices, such that pntrb(i) - pntrb(1)+1 is the first index of column i in the arrays val and indx.
For zero-based indexing this array contains column indices, such that pntrb(i) - pntrb(0) is the first index of column i in the arrays val and indx. Refer to pointerb array description in CSC Format for more details. pntre INTEGER. Array of length m. For one-based indexing this array contains column indices, such that pntre(i) - pntrb(1) is the last index of column i in the arrays val and indx. For zero-based indexing this array contains column indices, such that pntre(i) - pntrb(0)-1 is the last index of column i in the arrays val and indx. Refer to pointerE array description in CSC Format for more details. x REAL for mkl_scscsv. DOUBLE PRECISION for mkl_dcscsv. COMPLEX for mkl_ccscsv. DOUBLE COMPLEX for mkl_zcscsv. Array, DIMENSION at least m. On entry, the array x must contain the vector x. The elements are accessed with unit increment. y REAL for mkl_scscsv. DOUBLE PRECISION for mkl_dcscsv. COMPLEX for mkl_ccscsv. DOUBLE COMPLEX for mkl_zcscsv. Array, DIMENSION at least m. On entry, the array y must contain the vector y. The elements are accessed with unit increment. Output Parameters y Contains the solution vector x.
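The operation y := alpha*inv(A)*x on a CSC matrix can be sketched as column-oriented forward substitution. The code below is an illustration only, not the library's algorithm: ref_dcscsv_lower and its integer status return are hypothetical, and the sketch assumes a lower triangular matrix with a non-unit diagonal, zero-based indexing, transa = 'N', and row indices sorted in increasing order within each column, so that the diagonal entry is stored first in its column.

```c
#include <assert.h>

/* Hypothetical reference helper (not an MKL routine):
 * y := alpha*inv(A)*x for an m-by-m lower triangular matrix A with a
 * non-unit diagonal, stored in zero-based CSC form with sorted row
 * indices. Returns 0 on success, -1 if a diagonal entry is missing
 * or zero. */
int ref_dcscsv_lower(int m, double alpha,
                     const double *val, const int *indx,
                     const int *pntrb, const int *pntre,
                     const double *x, double *y)
{
    assert(m >= 0);
    for (int i = 0; i < m; ++i)
        y[i] = alpha * x[i];
    if (m == 0)
        return 0;

    int base = pntrb[0];
    for (int j = 0; j < m; ++j) {
        int p = pntrb[j] - base;
        int end = pntre[j] - base;

        /* With sorted rows, the first entry of column j is A(j,j). */
        if (p >= end || indx[p] != j || val[p] == 0.0)
            return -1;
        y[j] /= val[p];

        /* Eliminate y[j] from the remaining equations i > j. */
        for (++p; p < end; ++p)
            y[indx[p]] -= val[p] * y[j];
    }
    return 0;
}
```

The early return on a missing diagonal entry reflects the requirement that no diagonal element be omitted when the solver is called with the non-unit indicator.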
Interfaces FORTRAN 77: SUBROUTINE mkl_scscsv(transa, m, alpha, matdescra, val, indx, pntrb, pntre, x, y) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m INTEGER indx(*), pntrb(m), pntre(m) REAL alpha REAL val(*) REAL x(*), y(*) SUBROUTINE mkl_dcscsv(transa, m, alpha, matdescra, val, indx, pntrb, pntre, x, y) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m INTEGER indx(*), pntrb(m), pntre(m) DOUBLE PRECISION alpha DOUBLE PRECISION val(*) DOUBLE PRECISION x(*), y(*) SUBROUTINE mkl_ccscsv(transa, m, alpha, matdescra, val, indx, pntrb, pntre, x, y) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m INTEGER indx(*), pntrb(m), pntre(m) COMPLEX alpha COMPLEX val(*) COMPLEX x(*), y(*) SUBROUTINE mkl_zcscsv(transa, m, alpha, matdescra, val, indx, pntrb, pntre, x, y) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m INTEGER indx(*), pntrb(m), pntre(m) DOUBLE COMPLEX alpha DOUBLE COMPLEX val(*) DOUBLE COMPLEX x(*), y(*) C: void mkl_scscsv(char *transa, int *m, float *alpha, char *matdescra, float *val, int *indx, int *pntrb, int *pntre, float *x, float *y); void mkl_dcscsv(char *transa, int *m, double *alpha, char *matdescra, double *val, int *indx, int *pntrb, int *pntre, double *x, double *y); void mkl_ccscsv(char *transa, int *m, MKL_Complex8 *alpha, char *matdescra, MKL_Complex8 *val, int *indx, int *pntrb, int *pntre, MKL_Complex8 *x, MKL_Complex8 *y); void mkl_zcscsv(char *transa, int *m, MKL_Complex16 *alpha, char *matdescra, MKL_Complex16 *val, int *indx, int *pntrb, int *pntre, MKL_Complex16 *x, MKL_Complex16 *y); mkl_?coosv Solves a system of linear equations for a sparse matrix in the coordinate format.
Syntax Fortran: call mkl_scoosv(transa, m, alpha, matdescra, val, rowind, colind, nnz, x, y) call mkl_dcoosv(transa, m, alpha, matdescra, val, rowind, colind, nnz, x, y) call mkl_ccoosv(transa, m, alpha, matdescra, val, rowind, colind, nnz, x, y) call mkl_zcoosv(transa, m, alpha, matdescra, val, rowind, colind, nnz, x, y) C: mkl_scoosv(&transa, &m, &alpha, matdescra, val, rowind, colind, &nnz, x, y); mkl_dcoosv(&transa, &m, &alpha, matdescra, val, rowind, colind, &nnz, x, y); mkl_ccoosv(&transa, &m, &alpha, matdescra, val, rowind, colind, &nnz, x, y); mkl_zcoosv(&transa, &m, &alpha, matdescra, val, rowind, colind, &nnz, x, y); Include Files • FORTRAN 77: mkl_spblas.fi • C: mkl_spblas.h Description The mkl_?coosv routine solves a system of linear equations with matrix-vector operations for a sparse matrix in the coordinate format: y := alpha*inv(A)*x or y := alpha*inv(A')*x, where: alpha is scalar, x and y are vectors, A is a sparse upper or lower triangular matrix with unit or non-unit main diagonal, A' is the transpose of A. NOTE This routine supports a coordinate format both with one-based indexing and zero-based indexing. Input Parameters Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below. transa CHARACTER*1. Specifies the system of linear equations. If transa = 'N' or 'n', then y := alpha*inv(A)*x If transa = 'T' or 't' or 'C' or 'c', then y := alpha*inv(A')*x, m INTEGER. Number of rows of the matrix A. alpha REAL for mkl_scoosv. DOUBLE PRECISION for mkl_dcoosv. COMPLEX for mkl_ccoosv. DOUBLE COMPLEX for mkl_zcoosv. Specifies the scalar alpha. matdescra CHARACTER. Array of six elements, specifies properties of the matrix used for operation.
Only the first four array elements are used; their possible values are given in Table “Possible Values of the Parameter matdescra (descra)”. Possible combinations of element values of this parameter are given in Table “Possible Combinations of Element Values of the Parameter matdescra”. val REAL for mkl_scoosv. DOUBLE PRECISION for mkl_dcoosv. COMPLEX for mkl_ccoosv. DOUBLE COMPLEX for mkl_zcoosv. Array of length nnz, contains non-zero elements of the matrix A in arbitrary order. Refer to values array description in Coordinate Format for more details. rowind INTEGER. Array of length nnz, contains the row indices for each non-zero element of the matrix A. Refer to rows array description in Coordinate Format for more details. colind INTEGER. Array of length nnz, contains the column indices for each non-zero element of the matrix A. Refer to columns array description in Coordinate Format for more details. nnz INTEGER. Specifies the number of non-zero elements of the matrix A. Refer to nnz description in Coordinate Format for more details. x REAL for mkl_scoosv. DOUBLE PRECISION for mkl_dcoosv. COMPLEX for mkl_ccoosv. DOUBLE COMPLEX for mkl_zcoosv. Array, DIMENSION at least m. On entry, the array x must contain the vector x. The elements are accessed with unit increment. y REAL for mkl_scoosv. DOUBLE PRECISION for mkl_dcoosv. COMPLEX for mkl_ccoosv. DOUBLE COMPLEX for mkl_zcoosv. Array, DIMENSION at least m. On entry, the array y must contain the vector y. The elements are accessed with unit increment. Output Parameters y Contains solution vector x.
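Since the coordinate format stores entries in arbitrary order, a triangular solve cannot simply walk the arrays once. The sketch below shows one straightforward (and deliberately unoptimized) way to obtain y := alpha*inv(A)*x. It is an assumption-laden illustration, not the library's method: ref_dcoosv_lower and its status return are hypothetical, the matrix is taken to be lower triangular with a non-unit diagonal, indexing is zero-based, transa = 'N', and any entries above the diagonal are ignored.

```c
#include <assert.h>

/* Hypothetical reference helper (not an MKL routine):
 * y := alpha*inv(A)*x for an m-by-m lower triangular matrix A with a
 * non-unit diagonal, given as nnz zero-based triplets in arbitrary
 * order. Each row rescans all triplets, so the cost is O(m*nnz).
 * Returns 0 on success, -1 if a diagonal entry is missing or zero. */
int ref_dcoosv_lower(int m, double alpha, const double *val,
                     const int *rowind, const int *colind, int nnz,
                     const double *x, double *y)
{
    assert(m >= 0 && nnz >= 0);
    for (int i = 0; i < m; ++i) {
        double sum = alpha * x[i];
        double diag = 0.0;
        for (int p = 0; p < nnz; ++p) {
            if (rowind[p] != i)
                continue;
            if (colind[p] == i)
                diag = val[p];                    /* A(i,i) */
            else if (colind[p] < i)
                sum -= val[p] * y[colind[p]];     /* uses solved y(j), j < i */
            /* entries with colind[p] > i lie above the diagonal: ignored */
        }
        if (diag == 0.0)
            return -1;
        y[i] = sum / diag;
    }
    return 0;
}
```

A production routine would more likely sort or bucket the triplets by row first; the quadratic scan is kept here only because it makes the dependence of y(i) on earlier components explicit.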
Interfaces FORTRAN 77: SUBROUTINE mkl_scoosv(transa, m, alpha, matdescra, val, rowind, colind, nnz, x, y) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m, nnz INTEGER rowind(*), colind(*) REAL alpha REAL val(*) REAL x(*), y(*) SUBROUTINE mkl_dcoosv(transa, m, alpha, matdescra, val, rowind, colind, nnz, x, y) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m, nnz INTEGER rowind(*), colind(*) DOUBLE PRECISION alpha DOUBLE PRECISION val(*) DOUBLE PRECISION x(*), y(*) SUBROUTINE mkl_ccoosv(transa, m, alpha, matdescra, val, rowind, colind, nnz, x, y) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m, nnz INTEGER rowind(*), colind(*) COMPLEX alpha COMPLEX val(*) COMPLEX x(*), y(*) SUBROUTINE mkl_zcoosv(transa, m, alpha, matdescra, val, rowind, colind, nnz, x, y) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m, nnz INTEGER rowind(*), colind(*) DOUBLE COMPLEX alpha DOUBLE COMPLEX val(*) DOUBLE COMPLEX x(*), y(*) BLAS and Sparse BLAS Routines 2 241 C: void mkl_scoosv(char *transa, int *m, float *alpha, char *matdescra, float *val, int *rowind, int *colind, int *nnz, float *x, float *y); void mkl_dcoosv(char *transa, int *m, double *alpha, char *matdescra, double *val, int *rowind, int *colind, int *nnz, double *x, double *y); void mkl_ccoosv(char *transa, int *m, MKL_Complex8 *alpha, char *matdescra, MKL_Complex8 *val, int *rowind, int *colind, int *nnz, MKL_Complex8 *x, MKL_Complex8 *y); void mkl_zcoosv(char *transa, int *m, MKL_Complex16 *alpha, char *matdescra, MKL_Complex16 *val, int *rowind, int *colind, int *nnz, MKL_Complex16 *x, MKL_Complex16 *y); mkl_?csrmm Computes matrix - matrix product of a sparse matrix stored in the CSR format. 
Syntax Fortran: call mkl_scsrmm(transa, m, n, k, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, beta, c, ldc) call mkl_dcsrmm(transa, m, n, k, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, beta, c, ldc) call mkl_ccsrmm(transa, m, n, k, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, beta, c, ldc) call mkl_zcsrmm(transa, m, n, k, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, beta, c, ldc) C: mkl_scsrmm(&transa, &m, &n, &k, &alpha, matdescra, val, indx, pntrb, pntre, b, &ldb, &beta, c, &ldc); mkl_dcsrmm(&transa, &m, &n, &k, &alpha, matdescra, val, indx, pntrb, pntre, b, &ldb, &beta, c, &ldc); mkl_ccsrmm(&transa, &m, &n, &k, &alpha, matdescra, val, indx, pntrb, pntre, b, &ldb, &beta, c, &ldc); mkl_zcsrmm(&transa, &m, &n, &k, &alpha, matdescra, val, indx, pntrb, pntre, b, &ldb, &beta, c, &ldc); Include Files • FORTRAN 77: mkl_spblas.fi • C: mkl_spblas.h Description The mkl_?csrmm routine performs a matrix-matrix operation defined as C := alpha*A*B + beta*C 2 Intel® Math Kernel Library Reference Manual 242 or C := alpha*A'*B + beta*C, where: alpha and beta are scalars, B and C are dense matrices, A is an m-by-k sparse matrix in compressed sparse row (CSR) format, A' is the transpose of A. NOTE This routine supports a CSR format both with one-based indexing and zero-based indexing. Input Parameters Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below. transa CHARACTER*1. Specifies the operation. If transa = 'N' or 'n', then C := alpha*A*B + beta*C If transa = 'T' or 't' or 'C' or 'c', then C := alpha*A'*B + beta*C, m INTEGER. Number of rows of the matrix A. n INTEGER. Number of columns of the matrix C. k INTEGER. Number of columns of the matrix A. alpha REAL for mkl_scsrmm. DOUBLE PRECISION for mkl_dcsrmm. COMPLEX for mkl_ccsrmm. DOUBLE COMPLEX for mkl_zcsrmm. 
Specifies the scalar alpha.
matdescra CHARACTER. Array of six elements, specifies properties of the matrix used for operation. Only the first four array elements are used; their possible values are given in Table “Possible Values of the Parameter matdescra (descra)”. Possible combinations of element values of this parameter are given in Table “Possible Combinations of Element Values of the Parameter matdescra”.
val REAL for mkl_scsrmm. DOUBLE PRECISION for mkl_dcsrmm. COMPLEX for mkl_ccsrmm. DOUBLE COMPLEX for mkl_zcsrmm. Array containing non-zero elements of the matrix A. For one-based indexing its length is pntre(m) - pntrb(1). For zero-based indexing its length is pntre(m-1) - pntrb(0). Refer to values array description in CSR Format for more details.
indx INTEGER. Array containing the column indices for each non-zero element of the matrix A. Its length is equal to the length of the val array. Refer to columns array description in CSR Format for more details.
pntrb INTEGER. Array of length m. For one-based indexing this array contains row indices, such that pntrb(I) - pntrb(1)+1 is the first index of row I in the arrays val and indx. For zero-based indexing this array contains row indices, such that pntrb(I) - pntrb(0) is the first index of row I in the arrays val and indx. Refer to pointerB array description in CSR Format for more details.
pntre INTEGER. Array of length m. For one-based indexing this array contains row indices, such that pntre(I) - pntrb(1) is the last index of row I in the arrays val and indx. For zero-based indexing this array contains row indices, such that pntre(I) - pntrb(0)-1 is the last index of row I in the arrays val and indx. Refer to pointerE array description in CSR Format for more details.
b REAL for mkl_scsrmm. DOUBLE PRECISION for mkl_dcsrmm. COMPLEX for mkl_ccsrmm. DOUBLE COMPLEX for mkl_zcsrmm.
Array, DIMENSION (ldb, at least n for non-transposed matrix A and at least m for transposed) for one-based indexing, and (at least k for non-transposed matrix A and at least m for transposed, ldb) for zero-based indexing. On entry with transa = 'N' or 'n', the leading k-by-n part of the array b must contain the matrix B, otherwise the leading m-by-n part of the array b must contain the matrix B.
ldb INTEGER. Specifies the leading dimension of b for one-based indexing, and the second dimension of b for zero-based indexing, as declared in the calling (sub)program.
beta REAL for mkl_scsrmm. DOUBLE PRECISION for mkl_dcsrmm. COMPLEX for mkl_ccsrmm. DOUBLE COMPLEX for mkl_zcsrmm. Specifies the scalar beta.
c REAL for mkl_scsrmm. DOUBLE PRECISION for mkl_dcsrmm. COMPLEX for mkl_ccsrmm. DOUBLE COMPLEX for mkl_zcsrmm. Array, DIMENSION (ldc, n) for one-based indexing, and (m, ldc) for zero-based indexing. On entry with transa = 'N' or 'n', the leading m-by-n part of the array c must contain the matrix C, otherwise the leading k-by-n part of the array c must contain the matrix C.
ldc INTEGER. Specifies the leading dimension of c for one-based indexing, and the second dimension of c for zero-based indexing, as declared in the calling (sub)program.
Output Parameters
c Overwritten by the matrix (alpha*A*B + beta*C) or (alpha*A'*B + beta*C).
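The way pntrb and pntre delimit each row is easiest to see in a reference loop. The sketch below, in plain C, computes C := alpha*A*B + beta*C for a general matrix with transa = 'N' and zero-based indexing, where b and c are stored row by row with ldb and ldc as their second dimensions, as described above. It is an illustration under those assumptions, not the MKL code; the name csrmm_ref is hypothetical.

```c
#include <assert.h>

/* Reference sketch (not the MKL implementation) of C := alpha*A*B + beta*C
 * for a general m-by-k matrix A in zero-based CSR format. B (k-by-n) and
 * C (m-by-n) are dense, stored row by row with second dimensions ldb, ldc. */
void csrmm_ref(int m, int n, double alpha,
               const double *val, const int *indx,
               const int *pntrb, const int *pntre,
               const double *b, int ldb, double beta, double *c, int ldc)
{
    assert(m >= 0 && n >= 0 && ldb >= n && ldc >= n);
    if (m == 0)
        return;
    /* pntrb(I) - pntrb(0) is the first index of row I in val and indx */
    const int base = pntrb[0];
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < n; ++j)
            c[i * ldc + j] = (beta == 0.0) ? 0.0 : beta * c[i * ldc + j];
        /* pntre(I) - pntrb(0) - 1 is the last index of row I */
        for (int p = pntrb[i] - base; p < pntre[i] - base; ++p) {
            const double a = alpha * val[p];
            const double *brow = b + indx[p] * ldb;  /* row indx[p] of B */
            for (int j = 0; j < n; ++j)
                c[i * ldc + j] += a * brow[j];
        }
    }
}
```

Keeping separate pntrb and pntre arrays means the rows need not be stored contiguously; when they are, pntre(I) equals pntrb(I+1) and the layout reduces to the usual three-array CSR form.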
Interfaces
FORTRAN 77:
SUBROUTINE mkl_scsrmm(transa, m, n, k, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, beta, c, ldc)
CHARACTER*1 transa
CHARACTER matdescra(*)
INTEGER m, n, k, ldb, ldc
INTEGER indx(*), pntrb(m), pntre(m)
REAL alpha, beta
REAL val(*), b(ldb,*), c(ldc,*)
SUBROUTINE mkl_dcsrmm(transa, m, n, k, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, beta, c, ldc)
CHARACTER*1 transa
CHARACTER matdescra(*)
INTEGER m, n, k, ldb, ldc
INTEGER indx(*), pntrb(m), pntre(m)
DOUBLE PRECISION alpha, beta
DOUBLE PRECISION val(*), b(ldb,*), c(ldc,*)
SUBROUTINE mkl_ccsrmm(transa, m, n, k, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, beta, c, ldc)
CHARACTER*1 transa
CHARACTER matdescra(*)
INTEGER m, n, k, ldb, ldc
INTEGER indx(*), pntrb(m), pntre(m)
COMPLEX alpha, beta
COMPLEX val(*), b(ldb,*), c(ldc,*)
SUBROUTINE mkl_zcsrmm(transa, m, n, k, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, beta, c, ldc)
CHARACTER*1 transa
CHARACTER matdescra(*)
INTEGER m, n, k, ldb, ldc
INTEGER indx(*), pntrb(m), pntre(m)
DOUBLE COMPLEX alpha, beta
DOUBLE COMPLEX val(*), b(ldb,*), c(ldc,*)
C:
void mkl_scsrmm(char *transa, int *m, int *n, int *k, float *alpha, char *matdescra, float *val, int *indx, int *pntrb, int *pntre, float *b, int *ldb, float *beta, float *c, int *ldc);
void mkl_dcsrmm(char *transa, int *m, int *n, int *k, double *alpha, char *matdescra, double *val, int *indx, int *pntrb, int *pntre, double *b, int *ldb, double *beta, double *c, int *ldc);
void mkl_ccsrmm(char *transa, int *m, int *n, int *k, MKL_Complex8 *alpha, char *matdescra, MKL_Complex8 *val, int *indx, int *pntrb, int *pntre, MKL_Complex8 *b, int *ldb, MKL_Complex8 *beta, MKL_Complex8 *c, int *ldc);
void mkl_zcsrmm(char *transa, int *m, int *n, int *k, MKL_Complex16 *alpha, char *matdescra, MKL_Complex16 *val, int *indx, int *pntrb, int *pntre, MKL_Complex16 *b, int *ldb, MKL_Complex16 *beta, MKL_Complex16 *c, int *ldc);
mkl_?bsrmm
Computes matrix-matrix product of a sparse matrix stored in the BSR format.
Syntax
Fortran:
call mkl_sbsrmm(transa, m, n, k, lb, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, beta, c, ldc)
call mkl_dbsrmm(transa, m, n, k, lb, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, beta, c, ldc)
call mkl_cbsrmm(transa, m, n, k, lb, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, beta, c, ldc)
call mkl_zbsrmm(transa, m, n, k, lb, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, beta, c, ldc)
C:
mkl_sbsrmm(&transa, &m, &n, &k, &lb, &alpha, matdescra, val, indx, pntrb, pntre, b, &ldb, &beta, c, &ldc);
mkl_dbsrmm(&transa, &m, &n, &k, &lb, &alpha, matdescra, val, indx, pntrb, pntre, b, &ldb, &beta, c, &ldc);
mkl_cbsrmm(&transa, &m, &n, &k, &lb, &alpha, matdescra, val, indx, pntrb, pntre, b, &ldb, &beta, c, &ldc);
mkl_zbsrmm(&transa, &m, &n, &k, &lb, &alpha, matdescra, val, indx, pntrb, pntre, b, &ldb, &beta, c, &ldc);
Include Files
• FORTRAN 77: mkl_spblas.fi
• C: mkl_spblas.h
Description
The mkl_?bsrmm routine performs a matrix-matrix operation defined as
C := alpha*A*B + beta*C
or
C := alpha*A'*B + beta*C,
where: alpha and beta are scalars, B and C are dense matrices, A is an m-by-k sparse matrix in block sparse row (BSR) format, A' is the transpose of A.
NOTE This routine supports a BSR format both with one-based indexing and zero-based indexing.
Input Parameters
Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below.
transa CHARACTER*1. Specifies the operation. If transa = 'N' or 'n', then the matrix-matrix product is computed as C := alpha*A*B + beta*C. If transa = 'T' or 't' or 'C' or 'c', then the matrix-matrix product is computed as C := alpha*A'*B + beta*C.
m INTEGER. Number of block rows of the matrix A.
n INTEGER. Number of columns of the matrix C.
k INTEGER. Number of block columns of the matrix A. lb INTEGER. Size of the block in the matrix A. alpha REAL for mkl_sbsrmm. DOUBLE PRECISION for mkl_dbsrmm. COMPLEX for mkl_cbsrmm. DOUBLE COMPLEX for mkl_zbsrmm. Specifies the scalar alpha. matdescra CHARACTER. Array of six elements, specifies properties of the matrix used for operation. Only first four array elements are used, their possible values are given in Table “Possible Values of the Parameter matdescra (descra)”. Possible combinations of element values of this parameter are given in Table “Possible Combinations of Element Values of the Parameter matdescra”. val REAL for mkl_sbsrmm. DOUBLE PRECISION for mkl_dbsrmm. COMPLEX for mkl_cbsrmm. DOUBLE COMPLEX for mkl_zbsrmm. Array containing elements of non-zero blocks of the matrix A. Its length is equal to the number of non-zero blocks in the matrix A multiplied by lb*lb. Refer to the values array description in BSR Format for more details. indx INTEGER. Array containing the column indices for each non-zero block in the matrix A. Its length is equal to the number of non-zero blocks in the matrix A. Refer to the columns array description in BSR Format for more details. pntrb INTEGER. Array of length m. For one-based indexing: this array contains row indices, such that pntrb(I) - pntrb(1)+1 is the first index of block row I in the array indx. For zero-based indexing: this array contains row indices, such that pntrb(I) - pntrb(0) is the first index of block row I in the array indx. BLAS and Sparse BLAS Routines 2 247 Refer to pointerB array description in BSR Format for more details. pntre INTEGER. Array of length m. For one-based indexing this array contains row indices, such that pntre(I) - pntrb(1) is the last index of block row I in the array indx. For zero-based indexing this array contains row indices, such that pntre(I) - pntrb(0)-1 is the last index of block row I in the array indx. Refer to pointerE array description in BSR Format for more details. 
b REAL for mkl_sbsrmm. DOUBLE PRECISION for mkl_dbsrmm. COMPLEX for mkl_cbsrmm. DOUBLE COMPLEX for mkl_zbsrmm. Array, DIMENSION (ldb, at least n for non-transposed matrix A and at least m for transposed) for one-based indexing, and (at least k for nontransposed matrix A and at least m for transposed, ldb) for zero-based indexing. On entry with transa= 'N' or 'n', the leading n-by-k block part of the array b must contain the matrix B, otherwise the leading m-by-n block part of the array b must contain the matrix B. ldb INTEGER. Specifies the leading dimension (in blocks) of b as declared in the calling (sub)program. beta REAL for mkl_sbsrmm. DOUBLE PRECISION for mkl_dbsrmm. COMPLEX for mkl_cbsrmm. DOUBLE COMPLEX for mkl_zbsrmm. Specifies the scalar beta. c REAL for mkl_sbsrmm. DOUBLE PRECISION for mkl_dbsrmm. COMPLEX for mkl_cbsrmm. DOUBLE COMPLEX for mkl_zbsrmm. Array, DIMENSION (ldc, n) for one-based indexing, DIMENSION (k, ldc) for zero-based indexing. On entry, the leading m-by-n block part of the array c must contain the matrix C, otherwise the leading n-by-k block part of the array c must contain the matrix C. ldc INTEGER. Specifies the leading dimension (in blocks) of c as declared in the calling (sub)program. Output Parameters c Overwritten by the matrix (alpha*A*B + beta*C) or (alpha*A'*B + beta*C). 
Interfaces
FORTRAN 77:
SUBROUTINE mkl_sbsrmm(transa, m, n, k, lb, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, beta, c, ldc)
CHARACTER*1 transa
CHARACTER matdescra(*)
INTEGER m, n, k, lb, ldb, ldc
INTEGER indx(*), pntrb(m), pntre(m)
REAL alpha, beta
REAL val(*), b(ldb,*), c(ldc,*)
SUBROUTINE mkl_dbsrmm(transa, m, n, k, lb, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, beta, c, ldc)
CHARACTER*1 transa
CHARACTER matdescra(*)
INTEGER m, n, k, lb, ldb, ldc
INTEGER indx(*), pntrb(m), pntre(m)
DOUBLE PRECISION alpha, beta
DOUBLE PRECISION val(*), b(ldb,*), c(ldc,*)
SUBROUTINE mkl_cbsrmm(transa, m, n, k, lb, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, beta, c, ldc)
CHARACTER*1 transa
CHARACTER matdescra(*)
INTEGER m, n, k, lb, ldb, ldc
INTEGER indx(*), pntrb(m), pntre(m)
COMPLEX alpha, beta
COMPLEX val(*), b(ldb,*), c(ldc,*)
SUBROUTINE mkl_zbsrmm(transa, m, n, k, lb, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, beta, c, ldc)
CHARACTER*1 transa
CHARACTER matdescra(*)
INTEGER m, n, k, lb, ldb, ldc
INTEGER indx(*), pntrb(m), pntre(m)
DOUBLE COMPLEX alpha, beta
DOUBLE COMPLEX val(*), b(ldb,*), c(ldc,*)
C:
void mkl_sbsrmm(char *transa, int *m, int *n, int *k, int *lb, float *alpha, char *matdescra, float *val, int *indx, int *pntrb, int *pntre, float *b, int *ldb, float *beta, float *c, int *ldc);
void mkl_dbsrmm(char *transa, int *m, int *n, int *k, int *lb, double *alpha, char *matdescra, double *val, int *indx, int *pntrb, int *pntre, double *b, int *ldb, double *beta, double *c, int *ldc);
void mkl_cbsrmm(char *transa, int *m, int *n, int *k, int *lb, MKL_Complex8 *alpha, char *matdescra, MKL_Complex8 *val, int *indx, int *pntrb, int *pntre, MKL_Complex8 *b, int *ldb, MKL_Complex8 *beta, MKL_Complex8 *c, int *ldc);
void mkl_zbsrmm(char *transa, int *m, int *n, int *k, int *lb, MKL_Complex16 *alpha, char *matdescra, MKL_Complex16 *val, int *indx, int *pntrb, int *pntre, MKL_Complex16 *b, int *ldb, MKL_Complex16 *beta, MKL_Complex16 *c, int *ldc);
mkl_?cscmm
Computes matrix-matrix product of a sparse matrix stored in the CSC format.
Syntax
Fortran:
call mkl_scscmm(transa, m, n, k, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, beta, c, ldc)
call mkl_dcscmm(transa, m, n, k, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, beta, c, ldc)
call mkl_ccscmm(transa, m, n, k, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, beta, c, ldc)
call mkl_zcscmm(transa, m, n, k, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, beta, c, ldc)
C:
mkl_scscmm(&transa, &m, &n, &k, &alpha, matdescra, val, indx, pntrb, pntre, b, &ldb, &beta, c, &ldc);
mkl_dcscmm(&transa, &m, &n, &k, &alpha, matdescra, val, indx, pntrb, pntre, b, &ldb, &beta, c, &ldc);
mkl_ccscmm(&transa, &m, &n, &k, &alpha, matdescra, val, indx, pntrb, pntre, b, &ldb, &beta, c, &ldc);
mkl_zcscmm(&transa, &m, &n, &k, &alpha, matdescra, val, indx, pntrb, pntre, b, &ldb, &beta, c, &ldc);
Include Files
• FORTRAN 77: mkl_spblas.fi
• C: mkl_spblas.h
Description
The mkl_?cscmm routine performs a matrix-matrix operation defined as
C := alpha*A*B + beta*C
or
C := alpha*A'*B + beta*C,
where: alpha and beta are scalars, B and C are dense matrices, A is an m-by-k sparse matrix in compressed sparse column (CSC) format, A' is the transpose of A.
NOTE This routine supports CSC format both with one-based indexing and zero-based indexing.
Input Parameters
Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below.
transa CHARACTER*1. Specifies the operation. If transa = 'N' or 'n', then C := alpha*A*B + beta*C. If transa = 'T' or 't' or 'C' or 'c', then C := alpha*A'*B + beta*C.
m INTEGER. Number of rows of the matrix A.
n INTEGER.
Number of columns of the matrix C.
k INTEGER. Number of columns of the matrix A.
alpha REAL for mkl_scscmm. DOUBLE PRECISION for mkl_dcscmm. COMPLEX for mkl_ccscmm. DOUBLE COMPLEX for mkl_zcscmm. Specifies the scalar alpha.
matdescra CHARACTER. Array of six elements, specifies properties of the matrix used for operation. Only the first four array elements are used; their possible values are given in Table “Possible Values of the Parameter matdescra (descra)”. Possible combinations of element values of this parameter are given in Table “Possible Combinations of Element Values of the Parameter matdescra”.
val REAL for mkl_scscmm. DOUBLE PRECISION for mkl_dcscmm. COMPLEX for mkl_ccscmm. DOUBLE COMPLEX for mkl_zcscmm. Array containing non-zero elements of the matrix A. For one-based indexing its length is pntre(k) - pntrb(1). For zero-based indexing its length is pntre(k-1) - pntrb(0). Refer to values array description in CSC Format for more details.
indx INTEGER. Array containing the row indices for each non-zero element of the matrix A. Its length is equal to the length of the val array. Refer to rows array description in CSC Format for more details.
pntrb INTEGER. Array of length k. For one-based indexing this array contains column indices, such that pntrb(i) - pntrb(1)+1 is the first index of column i in the arrays val and indx. For zero-based indexing this array contains column indices, such that pntrb(i) - pntrb(0) is the first index of column i in the arrays val and indx. Refer to pointerB array description in CSC Format for more details.
pntre INTEGER. Array of length k. For one-based indexing this array contains column indices, such that pntre(i) - pntrb(1) is the last index of column i in the arrays val and indx. For zero-based indexing this array contains column indices, such that pntre(i) - pntrb(0)-1 is the last index of column i in the arrays val and indx. Refer to pointerE array description in CSC Format for more details.
b REAL for mkl_scscmm. DOUBLE PRECISION for mkl_dcscmm. COMPLEX for mkl_ccscmm. DOUBLE COMPLEX for mkl_zcscmm. Array, DIMENSION (ldb, at least n for non-transposed matrix A and at least m for transposed) for one-based indexing, and (at least k for non-transposed matrix A and at least m for transposed, ldb) for zero-based indexing. On entry with transa = 'N' or 'n', the leading k-by-n part of the array b must contain the matrix B, otherwise the leading m-by-n part of the array b must contain the matrix B.
ldb INTEGER. Specifies the leading dimension of b for one-based indexing, and the second dimension of b for zero-based indexing, as declared in the calling (sub)program.
beta REAL for mkl_scscmm. DOUBLE PRECISION for mkl_dcscmm. COMPLEX for mkl_ccscmm. DOUBLE COMPLEX for mkl_zcscmm. Specifies the scalar beta.
c REAL for mkl_scscmm. DOUBLE PRECISION for mkl_dcscmm. COMPLEX for mkl_ccscmm. DOUBLE COMPLEX for mkl_zcscmm. Array, DIMENSION (ldc, n) for one-based indexing, and (m, ldc) for zero-based indexing. On entry with transa = 'N' or 'n', the leading m-by-n part of the array c must contain the matrix C, otherwise the leading k-by-n part of the array c must contain the matrix C.
ldc INTEGER. Specifies the leading dimension of c for one-based indexing, and the second dimension of c for zero-based indexing, as declared in the calling (sub)program.
Output Parameters
c Overwritten by the matrix (alpha*A*B + beta*C) or (alpha*A'*B + beta*C).
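For the CSC variant, pntrb and pntre delimit columns rather than rows. The following plain C sketch shows the one-based case with transa = 'N' and a general matrix: indices are one-based, and b and c are stored column by column with leading dimensions ldb and ldc, as a Fortran caller would pass them. It is a reference illustration under those assumptions, not the MKL implementation; the name cscmm_ref is hypothetical.

```c
#include <assert.h>

/* Reference sketch (not the MKL implementation) of C := alpha*A*B + beta*C
 * for a general m-by-k matrix A in one-based CSC format. B (k-by-n) and
 * C (m-by-n) are dense, stored column by column with leading dimensions
 * ldb and ldc. */
void cscmm_ref(int m, int n, int k, double alpha,
               const double *val, const int *indx,
               const int *pntrb, const int *pntre,
               const double *b, int ldb, double beta, double *c, int ldc)
{
    assert(m >= 0 && n >= 0 && k >= 0 && ldb >= k && ldc >= m);
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < m; ++i)
            c[i + j * ldc] = (beta == 0.0) ? 0.0 : beta * c[i + j * ldc];
    if (k == 0)
        return;
    /* One-based: pntrb(i) - pntrb(1) + 1 is the first and pntre(i) - pntrb(1)
     * the last position of column i. As zero-based C offsets this is the
     * half-open range [pntrb[col] - base, pntre[col] - base). */
    const int base = pntrb[0];
    for (int col = 0; col < k; ++col) {
        for (int p = pntrb[col] - base; p < pntre[col] - base; ++p) {
            const int row = indx[p] - 1;     /* one-based row index */
            const double a = alpha * val[p];
            for (int j = 0; j < n; ++j)
                c[row + j * ldc] += a * b[col + j * ldb];
        }
    }
}
```

Note that the pointer arrays have length k here, so the total number of stored elements is pntre(k) - pntrb(1), not an expression in m.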
Interfaces FORTRAN 77: SUBROUTINE mkl_scscmm(transa, m, n, k, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, beta, c, ldc) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m, n, k, ldb, ldc INTEGER indx(*), pntrb(k), pntre(k) REAL alpha, beta REAL val(*), b(ldb,*), c(ldc,*) 2 Intel® Math Kernel Library Reference Manual 252 SUBROUTINE mkl_dcscmm(transa, m, n, k, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, beta, c, ldc) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m, n, k, ldb, ldc INTEGER indx(*), pntrb(k), pntre(k) DOUBLE PRECISION alpha, beta DOUBLE PRECISION val(*), b(ldb,*), c(ldc,*) SUBROUTINE mkl_ccscmm(transa, m, n, k, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, beta, c, ldc) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m, n, k, ldb, ldc INTEGER indx(*), pntrb(k), pntre(k) COMPLEX alpha, beta COMPLEX val(*), b(ldb,*), c(ldc,*) SUBROUTINE mkl_zcscmm(transa, m, n, k, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, beta, c, ldc) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m, n, k, ldb, ldc INTEGER indx(*), pntrb(k), pntre(k) DOUBLE COMPLEX alpha, beta DOUBLE COMPLEX val(*), b(ldb,*), c(ldc,*) C: void mkl_scscmm(char *transa, int *m, int *n, int *k, float *alpha, char *matdescra, float *val, int *indx, int *pntrb, int *pntre, float *b, int *ldb, float *beta, float *c, int *ldc); void mkl_dcscmm(char *transa, int *m, int *n, int *k, double *alpha, char *matdescra, double *val, int *indx, int *pntrb, int *pntre, double *b, int *ldb, double *beta, double *c, int *ldc); void mkl_ccscmm(char *transa, int *m, int *n, int *k, MKL_Complex8 *alpha, char *matdescra, MKL_Complex8 *val, int *indx, int *pntrb, int *pntre, MKL_Complex8 *b, int *ldb, MKL_Complex8 *beta, MKL_Complex8 *c, int *ldc); BLAS and Sparse BLAS Routines 2 253 void mkl_zcscmm(char *transa, int *m, int *n, int *k, MKL_Complex16 *alpha, char *matdescra, MKL_Complex16 *val, int *indx, int *pntrb, int *pntre, MKL_Complex16 *b, int *ldb, MKL_Complex16 *beta, MKL_Complex16 
*c, int *ldc); mkl_?coomm Computes matrix-matrix product of a sparse matrix stored in the coordinate format. Syntax Fortran: call mkl_scoomm(transa, m, n, k, alpha, matdescra, val, rowind, colind, nnz, b, ldb, beta, c, ldc) call mkl_dcoomm(transa, m, n, k, alpha, matdescra, val, rowind, colind, nnz, b, ldb, beta, c, ldc) call mkl_ccoomm(transa, m, n, k, alpha, matdescra, val, rowind, colind, nnz, b, ldb, beta, c, ldc) call mkl_zcoomm(transa, m, n, k, alpha, matdescra, val, rowind, colind, nnz, b, ldb, beta, c, ldc) C: mkl_scoomm(&transa, &m, &n, &k, &alpha, matdescra, val, rowind, colind, &nnz, b, &ldb, &beta, c, &ldc); mkl_dcoomm(&transa, &m, &n, &k, &alpha, matdescra, val, rowind, colind, &nnz, b, &ldb, &beta, c, &ldc); mkl_ccoomm(&transa, &m, &n, &k, &alpha, matdescra, val, rowind, colind, &nnz, b, &ldb, &beta, c, &ldc); mkl_zcoomm(&transa, &m, &n, &k, &alpha, matdescra, val, rowind, colind, &nnz, b, &ldb, &beta, c, &ldc); Include Files • FORTRAN 77: mkl_spblas.fi • C: mkl_spblas.h Description The mkl_?coomm routine performs a matrix-matrix operation defined as C := alpha*A*B + beta*C or C := alpha*A'*B + beta*C, where: alpha and beta are scalars, B and C are dense matrices, A is an m-by-k sparse matrix in the coordinate format, A' is the transpose of A. 2 Intel® Math Kernel Library Reference Manual 254 NOTE This routine supports a coordinate format both with one-based indexing and zero-based indexing. Input Parameters Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below. transa CHARACTER*1. Specifies the operation. If transa = 'N' or 'n', then C := alpha*A*B + beta*C If transa = 'T' or 't' or 'C' or 'c', then C := alpha*A'*B + beta*C, m INTEGER. Number of rows of the matrix A. n INTEGER. Number of columns of the matrix C. k INTEGER. Number of columns of the matrix A. 
alpha REAL for mkl_scoomm. DOUBLE PRECISION for mkl_dcoomm. COMPLEX for mkl_ccoomm. DOUBLE COMPLEX for mkl_zcoomm. Specifies the scalar alpha. matdescra CHARACTER. Array of six elements, specifies properties of the matrix used for operation. Only first four array elements are used, their possible values are given in Table “Possible Values of the Parameter matdescra (descra)”. Possible combinations of element values of this parameter are given in Table “Possible Combinations of Element Values of the Parameter matdescra”. val REAL for mkl_scoomm. DOUBLE PRECISION for mkl_dcoomm. COMPLEX for mkl_ccoomm. DOUBLE COMPLEX for mkl_zcoomm. Array of length nnz, contains non-zero elements of the matrix A in the arbitrary order. Refer to values array description in Coordinate Format for more details. rowind INTEGER. Array of length nnz, contains the row indices for each non-zero element of the matrix A. Refer to rows array description in Coordinate Format for more details. colind INTEGER. Array of length nnz, contains the column indices for each nonzero element of the matrix A. Refer to columns array description in Coordinate Format for more details. nnz INTEGER. Specifies the number of non-zero element of the matrix A. Refer to nnz description in Coordinate Format for more details. b REAL for mkl_scoomm. DOUBLE PRECISION for mkl_dcoomm. COMPLEX for mkl_ccoomm. DOUBLE COMPLEX for mkl_zcoomm. Array, DIMENSION (ldb, at least n for non-transposed matrix A and at least m for transposed) for one-based indexing, and (at least k for nontransposed matrix A and at least m for transposed, ldb) for zero-based indexing. BLAS and Sparse BLAS Routines 2 255 On entry with transa = 'N' or 'n', the leading k-by-n part of the array b must contain the matrix B, otherwise the leading m-by-n part of the array b must contain the matrix B. ldb INTEGER. 
Specifies the leading dimension of b for one-based indexing, and the second dimension of b for zero-based indexing, as declared in the calling (sub)program. beta REAL for mkl_scoomm. DOUBLE PRECISION for mkl_dcoomm. COMPLEX for mkl_ccoomm. DOUBLE COMPLEX for mkl_zcoomm. Specifies the scalar beta. c REAL for mkl_scoomm. DOUBLE PRECISION for mkl_dcoomm. COMPLEX for mkl_ccoomm. DOUBLE COMPLEX for mkl_zcoomm. Array, DIMENSION (ldc, n) for one-based indexing, and (m, ldc) for zerobased indexing. On entry, the leading m-by-n part of the array c must contain the matrix C, otherwise the leading k-by-n part of the array c must contain the matrix C. ldc INTEGER. Specifies the leading dimension of c for one-based indexing, and the second dimension of c for zero-based indexing, as declared in the calling (sub)program. Output Parameters c Overwritten by the matrix (alpha*A*B + beta*C) or (alpha*A'*B + beta*C). Interfaces FORTRAN 77: SUBROUTINE mkl_scoomm(transa, m, n, k, alpha, matdescra, val, rowind, colind, nnz, b, ldb, beta, c, ldc) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m, n, k, ldb, ldc, nnz INTEGER rowind(*), colind(*) REAL alpha, beta REAL val(*), b(ldb,*), c(ldc,*) SUBROUTINE mkl_dcoomm(transa, m, n, k, alpha, matdescra, val, rowind, colind, nnz, b, ldb, beta, c, ldc) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m, n, k, ldb, ldc, nnz INTEGER rowind(*), colind(*) DOUBLE PRECISION alpha, beta DOUBLE PRECISION val(*), b(ldb,*), c(ldc,*) 2 Intel® Math Kernel Library Reference Manual 256 SUBROUTINE mkl_ccoomm(transa, m, n, k, alpha, matdescra, val, rowind, colind, nnz, b, ldb, beta, c, ldc) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m, n, k, ldb, ldc, nnz INTEGER rowind(*), colind(*) COMPLEX alpha, beta COMPLEX val(*), b(ldb,*), c(ldc,*) SUBROUTINE mkl_zcoomm(transa, m, n, k, alpha, matdescra, val, rowind, colind, nnz, b, ldb, beta, c, ldc) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m, n, k, ldb, ldc, nnz INTEGER rowind(*), colind(*) DOUBLE 
COMPLEX alpha, beta DOUBLE COMPLEX val(*), b(ldb,*), c(ldc,*) C: void mkl_scoomm(char *transa, int *m, int *n, int *k, float *alpha, char *matdescra, float *val, int *rowind, int *colind, int *nnz, float *b, int *ldb, float *beta, float *c, int *ldc); void mkl_dcoomm(char *transa, int *m, int *n, int *k, double *alpha, char *matdescra, double *val, int *rowind, int *colind, int *nnz, double *b, int *ldb, double *beta, double *c, int *ldc); void mkl_ccoomm(char *transa, int *m, int *n, int *k, MKL_Complex8 *alpha, char *matdescra, MKL_Complex8 *val, int *rowind, int *colind, int *nnz, MKL_Complex8 *b, int *ldb, MKL_Complex8 *beta, MKL_Complex8 *c, int *ldc); void mkl_zcoomm(char *transa, int *m, int *n, int *k, MKL_Complex16 *alpha, char *matdescra, MKL_Complex16 *val, int *rowind, int *colind, int *nnz, MKL_Complex16 *b, int *ldb, MKL_Complex16 *beta, MKL_Complex16 *c, int *ldc); mkl_?csrsm Solves a system of linear matrix equations for a sparse matrix in the CSR format. Syntax Fortran: call mkl_scsrsm(transa, m, n, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, c, ldc) call mkl_dcsrsm(transa, m, n, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, c, ldc) BLAS and Sparse BLAS Routines 2 257 call mkl_ccsrsm(transa, m, n, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, c, ldc) call mkl_zcsrsm(transa, m, n, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, c, ldc) C: mkl_scsrsm(&transa, &m, &n, &alpha, matdescra, val, indx, pntrb, pntre, b, &ldb, c, &ldc); mkl_dcsrsm(&transa, &m, &n, &alpha, matdescra, val, indx, pntrb, pntre, b, &ldb, c, &ldc); mkl_ccsrsm(&transa, &m, &n, &alpha, matdescra, val, indx, pntrb, pntre, b, &ldb, c, &ldc); mkl_zcsrsm(&transa, &m, &n, &alpha, matdescra, val, indx, pntrb, pntre, b, &ldb, c, &ldc); Include Files • FORTRAN 77: mkl_spblas.fi • C: mkl_spblas.h Description The mkl_?csrsm routine solves a system of linear equations with matrix-matrix operations for a sparse matrix in the CSR format: C := alpha*inv(A)*B or C := 
alpha*inv(A')*B, where: alpha is scalar, B and C are dense matrices, A is a sparse upper or lower triangular matrix with unit or nonunit main diagonal, A' is the transpose of A. NOTE This routine supports a CSR format both with one-based indexing and zero-based indexing. Input Parameters Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below. transa CHARACTER*1. Specifies the system of linear equations. If transa = 'N' or 'n', then C := alpha*inv(A)*B If transa = 'T' or 't' or 'C' or 'c', then C := alpha*inv(A')*B, m INTEGER. Number of columns of the matrix A. n INTEGER. Number of columns of the matrix C. alpha REAL for mkl_scsrsm. DOUBLE PRECISION for mkl_dcsrsm. COMPLEX for mkl_ccsrsm. DOUBLE COMPLEX for mkl_zcsrsm. Specifies the scalar alpha. 2 Intel® Math Kernel Library Reference Manual 258 matdescra CHARACTER. Array of six elements, specifies properties of the matrix used for operation. Only first four array elements are used, their possible values are given in Table “Possible Values of the Parameter matdescra (descra)”. Possible combinations of element values of this parameter are given in Table “Possible Combinations of Element Values of the Parameter matdescra”. val REAL for mkl_scsrsm. DOUBLE PRECISION for mkl_dcsrsm. COMPLEX for mkl_ccsrsm. DOUBLE COMPLEX for mkl_zcsrsm. Array containing non-zero elements of the matrix A. For one-based indexing its length is pntre(m) - pntrb(1). For zero-based indexing its length is pntre(m-1) - pntrb(0). Refer to values array description in CSR Format for more details. NOTE The non-zero elements of the given row of the matrix must be stored in the same order as they appear in the row (from left to right). No diagonal element can be omitted from a sparse storage if the solver is called with the non-unit indicator. indx INTEGER. 
Array containing the column indices for each non-zero element of the matrix A. Its length is equal to the length of the val array. Refer to the columns array description in CSR Format for more details.
NOTE Column indices must be sorted in increasing order for each row.
pntrb INTEGER. Array of length m. For one-based indexing, this array contains row indices such that pntrb(i) - pntrb(1)+1 is the first index of row i in the arrays val and indx. For zero-based indexing, this array contains row indices such that pntrb(i) - pntrb(0) is the first index of row i in the arrays val and indx. Refer to the pointerB array description in CSR Format for more details.
pntre INTEGER. Array of length m. For one-based indexing, this array contains row indices such that pntre(i) - pntrb(1) is the last index of row i in the arrays val and indx. For zero-based indexing, this array contains row indices such that pntre(i) - pntrb(0)-1 is the last index of row i in the arrays val and indx. Refer to the pointerE array description in CSR Format for more details.
b REAL for mkl_scsrsm. DOUBLE PRECISION for mkl_dcsrsm. COMPLEX for mkl_ccsrsm. DOUBLE COMPLEX for mkl_zcsrsm. Array, DIMENSION (ldb, n) for one-based indexing, and (m, ldb) for zero-based indexing. On entry the leading m-by-n part of the array b must contain the matrix B.
ldb INTEGER. Specifies the leading dimension of b for one-based indexing, and the second dimension of b for zero-based indexing, as declared in the calling (sub)program.
ldc INTEGER. Specifies the leading dimension of c for one-based indexing, and the second dimension of c for zero-based indexing, as declared in the calling (sub)program.
Output Parameters
c REAL for mkl_scsrsm. DOUBLE PRECISION for mkl_dcsrsm. COMPLEX for mkl_ccsrsm. DOUBLE COMPLEX for mkl_zcsrsm. Array, DIMENSION (ldc, n) for one-based indexing, and (m, ldc) for zero-based indexing. The leading m-by-n part of the array c contains the output matrix C.
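The operation C := alpha*inv(A)*B described above can be illustrated with a small reference sketch. It is not the MKL implementation and does not call MKL; the function name csr_lower_solve_ref is hypothetical. It assumes real double-precision data, a lower triangular matrix with a non-unit diagonal, zero-based CSR arrays, and row-major b and c whose second dimensions are ldb and ldc.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative reference only (hypothetical name, not an MKL routine).
 * Solves A * C = alpha * B by forward substitution, where A is lower
 * triangular with a non-unit diagonal, stored in zero-based CSR form
 * (val, indx, pntrb, pntre). b and c are row-major, m rows by n columns,
 * with row strides ldb and ldc. */
void csr_lower_solve_ref(int m, int n, double alpha,
                         const double *val, const int *indx,
                         const int *pntrb, const int *pntre,
                         const double *b, int ldb,
                         double *c, int ldc)
{
    assert(m >= 0 && n >= 0 && ldb >= n && ldc >= n);
    const int base = (m > 0) ? pntrb[0] : 0;
    for (int i = 0; i < m; ++i) {
        const int first = pntrb[i] - base;      /* first entry of row i */
        const int last  = pntre[i] - base - 1;  /* last entry of row i  */
        for (int j = 0; j < n; ++j) {
            double sum  = alpha * b[(size_t)i * ldb + j];
            double diag = 0.0;
            for (int p = first; p <= last; ++p) {
                const int col = indx[p];
                if (col < i)
                    sum -= val[p] * c[(size_t)col * ldc + j]; /* already solved rows */
                else if (col == i)
                    diag = val[p];                            /* diagonal entry */
            }
            assert(diag != 0.0); /* the diagonal must be stored for a non-unit solve */
            c[(size_t)i * ldc + j] = sum / diag;
        }
    }
}
```

The assertion on diag mirrors the note that no diagonal element may be omitted from the sparse storage when the solver is called with the non-unit indicator.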
Interfaces FORTRAN 77: SUBROUTINE mkl_scsrsm(transa, m, n, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, c, ldc) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m, n, ldb, ldc INTEGER indx(*), pntrb(m), pntre(m) REAL alpha REAL val(*), b(ldb,*), c(ldc,*) SUBROUTINE mkl_dcsrsm(transa, m, n, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, c, ldc) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m, n, ldb, ldc INTEGER indx(*), pntrb(m), pntre(m) DOUBLE PRECISION alpha DOUBLE PRECISION val(*), b(ldb,*), c(ldc,*) SUBROUTINE mkl_ccsrsm(transa, m, n, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, c, ldc) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m, n, ldb, ldc INTEGER indx(*), pntrb(m), pntre(m) COMPLEX alpha COMPLEX val(*), b(ldb,*), c(ldc,*) 2 Intel® Math Kernel Library Reference Manual 260 SUBROUTINE mkl_zcsrsm(transa, m, n, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, c, ldc) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m, n, ldb, ldc INTEGER indx(*), pntrb(m), pntre(m) DOUBLE COMPLEX alpha DOUBLE COMPLEX val(*), b(ldb,*), c(ldc,*) C: void mkl_scsrsm(char *transa, int *m, int *n, float *alpha, char *matdescra, float *val, int *indx, int *pntrb, int *pntre, float *b, int *ldb, float *c, int *ldc); void mkl_dcsrsm(char *transa, int *m, int *n, double *alpha, char *matdescra, double *val, int *indx, int *pntrb, int *pntre, double *b, int *ldb, double *c, int *ldc); void mkl_ccsrsm(char *transa, int *m, int *n, MKL_Complex8 *alpha, char *matdescra, MKL_Complex8 *val, int *indx, int *pntrb, int *pntre, MKL_Complex8 *b, int *ldb, MKL_Complex8 *c, int *ldc); void mkl_zcsrsm(char *transa, int *m, int *n, MKL_Complex16 *alpha, char *matdescra, MKL_Complex16 *val, int *indx, int *pntrb, int *pntre, MKL_Complex16 *b, int *ldb, MKL_Complex16 *c, int *ldc); mkl_?cscsm Solves a system of linear matrix equations for a sparse matrix in the CSC format. 
Syntax Fortran: call mkl_scscsm(transa, m, n, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, c, ldc) call mkl_dcscsm(transa, m, n, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, c, ldc) call mkl_ccscsm(transa, m, n, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, c, ldc) call mkl_zcscsm(transa, m, n, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, c, ldc) C: mkl_scscsm(&transa, &m, &n, &alpha, matdescra, val, indx, pntrb, pntre, b, &ldb, c, &ldc); BLAS and Sparse BLAS Routines 2 261 mkl_dcscsm(&transa, &m, &n, &alpha, matdescra, val, indx, pntrb, pntre, b, &ldb, c, &ldc); mkl_ccscsm(&transa, &m, &n, &alpha, matdescra, val, indx, pntrb, pntre, b, &ldb, c, &ldc); mkl_zcscsm(&transa, &m, &n, &alpha, matdescra, val, indx, pntrb, pntre, b, &ldb, c, &ldc); Include Files • FORTRAN 77: mkl_spblas.fi • C: mkl_spblas.h Description The mkl_?cscsm routine solves a system of linear equations with matrix-matrix operations for a sparse matrix in the CSC format: C := alpha*inv(A)*B or C := alpha*inv(A')*B, where: alpha is scalar, B and C are dense matrices, A is a sparse upper or lower triangular matrix with unit or nonunit main diagonal, A' is the transpose of A. NOTE This routine supports a CSC format both with one-based indexing and zero-based indexing. Input Parameters Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below. transa CHARACTER*1. Specifies the system of equations. If transa = 'N' or 'n', then C := alpha*inv(A)*B If transa = 'T' or 't' or 'C' or 'c', then C := alpha*inv(A')*B, m INTEGER. Number of columns of the matrix A. n INTEGER. Number of columns of the matrix C. alpha REAL for mkl_scscsm. DOUBLE PRECISION for mkl_dcscsm. COMPLEX for mkl_ccscsm. DOUBLE COMPLEX for mkl_zcscsm. Specifies the scalar alpha. matdescra CHARACTER. 
Array of six elements, specifies properties of the matrix used for operation. Only the first four array elements are used; their possible values are given in Table “Possible Values of the Parameter matdescra (descra)”. Possible combinations of element values of this parameter are given in Table “Possible Combinations of Element Values of the Parameter matdescra”.
val REAL for mkl_scscsm. DOUBLE PRECISION for mkl_dcscsm. COMPLEX for mkl_ccscsm. DOUBLE COMPLEX for mkl_zcscsm. Array containing non-zero elements of the matrix A. For one-based indexing its length is pntre(m) - pntrb(1). For zero-based indexing its length is pntre(m-1) - pntrb(0). Refer to the values array description in CSC Format for more details.
NOTE The non-zero elements of a given column of the matrix must be stored in the same order as they appear in the column (from top to bottom). No diagonal element can be omitted from a sparse storage if the solver is called with the non-unit indicator.
indx INTEGER. Array containing the row indices for each non-zero element of the matrix A. Its length is equal to the length of the val array. Refer to the rows array description in CSC Format for more details.
NOTE Row indices must be sorted in increasing order for each column.
pntrb INTEGER. Array of length m. For one-based indexing, this array contains column indices such that pntrb(I) - pntrb(1)+1 is the first index of column I in the arrays val and indx. For zero-based indexing, this array contains column indices such that pntrb(I) - pntrb(0) is the first index of column I in the arrays val and indx. Refer to the pointerB array description in CSC Format for more details.
pntre INTEGER. Array of length m. For one-based indexing, this array contains column indices such that pntre(I) - pntrb(1) is the last index of column I in the arrays val and indx.
For zero-based indexing, this array contains column indices such that pntre(I) - pntrb(0)-1 is the last index of column I in the arrays val and indx. Refer to the pointerE array description in CSC Format for more details.
b REAL for mkl_scscsm. DOUBLE PRECISION for mkl_dcscsm. COMPLEX for mkl_ccscsm. DOUBLE COMPLEX for mkl_zcscsm. Array, DIMENSION (ldb, n) for one-based indexing, and (m, ldb) for zero-based indexing. On entry the leading m-by-n part of the array b must contain the matrix B.
ldb INTEGER. Specifies the leading dimension of b for one-based indexing, and the second dimension of b for zero-based indexing, as declared in the calling (sub)program.
ldc INTEGER. Specifies the leading dimension of c for one-based indexing, and the second dimension of c for zero-based indexing, as declared in the calling (sub)program.
Output Parameters
c REAL for mkl_scscsm. DOUBLE PRECISION for mkl_dcscsm. COMPLEX for mkl_ccscsm. DOUBLE COMPLEX for mkl_zcscsm. Array, DIMENSION (ldc, n) for one-based indexing, and (m, ldc) for zero-based indexing. The leading m-by-n part of the array c contains the output matrix C.
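For the CSC case, the same solve is naturally organized by columns. The following is a minimal sketch under stated assumptions, not MKL code: the name csc_lower_solve_ref is hypothetical, the data is real double precision, A is lower triangular with a non-unit diagonal in zero-based CSC form, and b and c are row-major with second dimensions ldb and ldc. Because row indices are sorted within each column, the sketch assumes that the diagonal entry is the first stored entry of every column.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative reference only (hypothetical name, not an MKL routine).
 * Column-oriented forward substitution for A * C = alpha * B, where A is
 * lower triangular, non-unit, in zero-based CSC form. */
void csc_lower_solve_ref(int m, int n, double alpha,
                         const double *val, const int *indx,
                         const int *pntrb, const int *pntre,
                         const double *b, int ldb,
                         double *c, int ldc)
{
    assert(m >= 0 && n >= 0 && ldb >= n && ldc >= n);
    const int base = (m > 0) ? pntrb[0] : 0;

    /* Start from C = alpha * B. */
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < n; ++j)
            c[(size_t)i * ldc + j] = alpha * b[(size_t)i * ldb + j];

    for (int k = 0; k < m; ++k) {
        const int first = pntrb[k] - base;      /* first entry of column k */
        const int last  = pntre[k] - base - 1;  /* last entry of column k  */
        /* Sorted row indices put the diagonal first in a lower-triangular column. */
        assert(first <= last && indx[first] == k && val[first] != 0.0);
        for (int j = 0; j < n; ++j) {
            double *ckj = &c[(size_t)k * ldc + j];
            *ckj /= val[first];                 /* finish row k of the solution */
            for (int p = first + 1; p <= last; ++p)
                c[(size_t)indx[p] * ldc + j] -= val[p] * (*ckj); /* update rows below */
        }
    }
}
```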
Interfaces FORTRAN 77: SUBROUTINE mkl_scscsm(transa, m, n, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, c, ldc) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m, n, ldb, ldc INTEGER indx(*), pntrb(m), pntre(m) REAL alpha REAL val(*), b(ldb,*), c(ldc,*) SUBROUTINE mkl_dcscsm(transa, m, n, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, c, ldc) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m, n, ldb, ldc INTEGER indx(*), pntrb(m), pntre(m) DOUBLE PRECISION alpha DOUBLE PRECISION val(*), b(ldb,*), c(ldc,*) SUBROUTINE mkl_ccscsm(transa, m, n, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, c, ldc) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m, n, ldb, ldc INTEGER indx(*), pntrb(m), pntre(m) COMPLEX alpha COMPLEX val(*), b(ldb,*), c(ldc,*) SUBROUTINE mkl_zcscsm(transa, m, n, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, c, ldc) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m, n, ldb, ldc INTEGER indx(*), pntrb(m), pntre(m) DOUBLE COMPLEX alpha DOUBLE COMPLEX val(*), b(ldb,*), c(ldc,*) 2 Intel® Math Kernel Library Reference Manual 264 C: void mkl_scscsm(char *transa, int *m, int *n, float *alpha, char *matdescra, float *val, int *indx, int *pntrb, int *pntre, float *b, int *ldb, float *c, int *ldc); void mkl_dcscsm(char *transa, int *m, int *n, double *alpha, char *matdescra, double *val, int *indx, int *pntrb, int *pntre, double *b, int *ldb, double *c, int *ldc); void mkl_ccscsm(char *transa, int *m, int *n, MKL_Complex8 *alpha, char *matdescra, MKL_Complex8 *val, int *indx, int *pntrb, int *pntre, MKL_Complex8 *b, int *ldb, MKL_Complex8 *c, int *ldc); void mkl_zcscsm(char *transa, int *m, int *n, MKL_Complex16 *alpha, char *matdescra, MKL_Complex16 *val, int *indx, int *pntrb, int *pntre, MKL_Complex16 *b, int *ldb, MKL_Complex16 *c, int *ldc); mkl_?coosm Solves a system of linear matrix equations for a sparse matrix in the coordinate format. 
Syntax
Fortran:
call mkl_scoosm(transa, m, n, alpha, matdescra, val, rowind, colind, nnz, b, ldb, c, ldc)
call mkl_dcoosm(transa, m, n, alpha, matdescra, val, rowind, colind, nnz, b, ldb, c, ldc)
call mkl_ccoosm(transa, m, n, alpha, matdescra, val, rowind, colind, nnz, b, ldb, c, ldc)
call mkl_zcoosm(transa, m, n, alpha, matdescra, val, rowind, colind, nnz, b, ldb, c, ldc)
C:
mkl_scoosm(&transa, &m, &n, &alpha, matdescra, val, rowind, colind, &nnz, b, &ldb, c, &ldc);
mkl_dcoosm(&transa, &m, &n, &alpha, matdescra, val, rowind, colind, &nnz, b, &ldb, c, &ldc);
mkl_ccoosm(&transa, &m, &n, &alpha, matdescra, val, rowind, colind, &nnz, b, &ldb, c, &ldc);
mkl_zcoosm(&transa, &m, &n, &alpha, matdescra, val, rowind, colind, &nnz, b, &ldb, c, &ldc);
Include Files
• FORTRAN 77: mkl_spblas.fi
• C: mkl_spblas.h
Description
The mkl_?coosm routine solves a system of linear equations with matrix-matrix operations for a sparse matrix in the coordinate format: C := alpha*inv(A)*B or C := alpha*inv(A')*B, where: alpha is scalar, B and C are dense matrices, A is a sparse upper or lower triangular matrix with unit or non-unit main diagonal, A' is the transpose of A.
NOTE This routine supports a coordinate format both with one-based indexing and zero-based indexing.
Input Parameters
Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below.
transa CHARACTER*1. Specifies the system of linear equations. If transa = 'N' or 'n', then C := alpha*inv(A)*B. If transa = 'T' or 't' or 'C' or 'c', then C := alpha*inv(A')*B.
m INTEGER. Number of rows of the matrix A.
n INTEGER. Number of columns of the matrix C.
alpha REAL for mkl_scoosm. DOUBLE PRECISION for mkl_dcoosm. COMPLEX for mkl_ccoosm.
DOUBLE COMPLEX for mkl_zcoosm. Specifies the scalar alpha. matdescra CHARACTER. Array of six elements, specifies properties of the matrix used for operation. Only first four array elements are used, their possible values are given in Table “Possible Values of the Parameter matdescra (descra)”. Possible combinations of element values of this parameter are given in Table “Possible Combinations of Element Values of the Parameter matdescra”. val REAL for mkl_scoosm. DOUBLE PRECISION for mkl_dcoosm. COMPLEX for mkl_ccoosm. DOUBLE COMPLEX for mkl_zcoosm. Array of length nnz, contains non-zero elements of the matrix A in the arbitrary order. Refer to values array description in Coordinate Format for more details. rowind INTEGER. Array of length nnz, contains the row indices for each non-zero element of the matrix A. Refer to rows array description in Coordinate Format for more details. 2 Intel® Math Kernel Library Reference Manual 266 colind INTEGER. Array of length nnz, contains the column indices for each nonzero element of the matrix A. Refer to columns array description in Coordinate Format for more details. nnz INTEGER. Specifies the number of non-zero element of the matrix A. Refer to nnz description in Coordinate Format for more details. b REAL for mkl_scoosm. DOUBLE PRECISION for mkl_dcoosm. COMPLEX for mkl_ccoosm. DOUBLE COMPLEX for mkl_zcoosm. Array, DIMENSION (ldb, n) for one-based indexing, and (m, ldb) for zerobased indexing. Before entry the leading m-by-n part of the array b must contain the matrix B. ldb INTEGER. Specifies the leading dimension of b for one-based indexing, and the second dimension of b for zero-based indexing, as declared in the calling (sub)program. ldc INTEGER. Specifies the leading dimension of c for one-based indexing, and the second dimension of c for zero-based indexing, as declared in the calling (sub)program. Output Parameters c REAL for mkl_scoosm. DOUBLE PRECISION for mkl_dcoosm. COMPLEX for mkl_ccoosm. 
DOUBLE COMPLEX for mkl_zcoosm. Array, DIMENSION (ldc, n) for one-based indexing, and (m, ldc) for zerobased indexing. The leading m-by-n part of the array c contains the output matrix C. Interfaces FORTRAN 77: SUBROUTINE mkl_scoosm(transa, m, n, alpha, matdescra, val, rowind, colind, nnz, b, ldb, c, ldc) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m, n, ldb, ldc, nnz INTEGER rowind(*), colind(*) REAL alpha REAL val(*), b(ldb,*), c(ldc,*) SUBROUTINE mkl_dcoosm(transa, m, n, alpha, matdescra, val, rowind, colind, nnz, b, ldb, c, ldc) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m, n, ldb, ldc, nnz INTEGER rowind(*), colind(*) DOUBLE PRECISION alpha DOUBLE PRECISION val(*), b(ldb,*), c(ldc,*) BLAS and Sparse BLAS Routines 2 267 SUBROUTINE mkl_ccoosm(transa, m, n, alpha, matdescra, val, rowind, colind, nnz, b, ldb, c, ldc) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m, n, ldb, ldc, nnz INTEGER rowind(*), colind(*) COMPLEX alpha COMPLEX val(*), b(ldb,*), c(ldc,*) SUBROUTINE mkl_zcoosm(transa, m, n, alpha, matdescra, val, rowind, colind, nnz, b, ldb, c, ldc) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m, n, ldb, ldc, nnz INTEGER rowind(*), colind(*) DOUBLE COMPLEX alpha DOUBLE COMPLEX val(*), b(ldb,*), c(ldc,*) C: void mkl_scoosm(char *transa, int *m, int *n, float *alpha, char *matdescra, float *val, int *rowind, int *colind, int *nnz, float *b, int *ldb, float *c, int *ldc); void mkl_dcoosm(char *transa, int *m, int *n, double *alpha, char *matdescra, double *val, int *rowind, int *colind, int *nnz, double *b, int *ldb, double *c, int *ldc); void mkl_ccoosm(char *transa, int *m, int *n, MKL_Complex8 *alpha, char *matdescra, MKL_Complex8 *val, int *rowind, int *colind, int *nnz, MKL_Complex8 *b, int *ldb, MKL_Complex8 *c, int *ldc); void mkl_zcoosm(char *transa, int *m, int *n, MKL_Complex16 *alpha, char *matdescra, MKL_Complex16 *val, int *rowind, int *colind, int *nnz, MKL_Complex16 *b, int *ldb, MKL_Complex16 *c, int *ldc); mkl_?bsrsm Solves a 
system of linear matrix equations for a sparse matrix in the BSR format.
Syntax
Fortran:
call mkl_sbsrsm(transa, m, n, lb, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, c, ldc)
call mkl_dbsrsm(transa, m, n, lb, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, c, ldc)
call mkl_cbsrsm(transa, m, n, lb, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, c, ldc)
call mkl_zbsrsm(transa, m, n, lb, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, c, ldc)
C:
mkl_sbsrsm(&transa, &m, &n, &lb, &alpha, matdescra, val, indx, pntrb, pntre, b, &ldb, c, &ldc);
mkl_dbsrsm(&transa, &m, &n, &lb, &alpha, matdescra, val, indx, pntrb, pntre, b, &ldb, c, &ldc);
mkl_cbsrsm(&transa, &m, &n, &lb, &alpha, matdescra, val, indx, pntrb, pntre, b, &ldb, c, &ldc);
mkl_zbsrsm(&transa, &m, &n, &lb, &alpha, matdescra, val, indx, pntrb, pntre, b, &ldb, c, &ldc);
Include Files
• FORTRAN 77: mkl_spblas.fi
• C: mkl_spblas.h
Description
The mkl_?bsrsm routine solves a system of linear equations with matrix-matrix operations for a sparse matrix in the BSR format: C := alpha*inv(A)*B or C := alpha*inv(A')*B, where: alpha is scalar, B and C are dense matrices, A is a sparse upper or lower triangular matrix with unit or non-unit main diagonal, A' is the transpose of A.
NOTE This routine supports a BSR format both with one-based indexing and zero-based indexing.
Input Parameters
Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below.
transa CHARACTER*1. Specifies the system of linear equations. If transa = 'N' or 'n', then C := alpha*inv(A)*B. If transa = 'T' or 't' or 'C' or 'c', then C := alpha*inv(A')*B.
m INTEGER. Number of block columns of the matrix A.
n INTEGER. Number of columns of the matrix C.
lb INTEGER. Size of the block in the matrix A.
alpha REAL for mkl_sbsrsm. DOUBLE PRECISION for mkl_dbsrsm. COMPLEX for mkl_cbsrsm. DOUBLE COMPLEX for mkl_zbsrsm. Specifies the scalar alpha.
matdescra CHARACTER. Array of six elements, specifies properties of the matrix used for operation. Only the first four array elements are used; their possible values are given in Table “Possible Values of the Parameter matdescra (descra)”. Possible combinations of element values of this parameter are given in Table “Possible Combinations of Element Values of the Parameter matdescra”.
val REAL for mkl_sbsrsm. DOUBLE PRECISION for mkl_dbsrsm. COMPLEX for mkl_cbsrsm. DOUBLE COMPLEX for mkl_zbsrsm. Array containing elements of non-zero blocks of the matrix A. Its length is equal to the number of non-zero blocks in the matrix A multiplied by lb*lb. Refer to the values array description in BSR Format for more details.
NOTE The non-zero elements of the given row of the matrix must be stored in the same order as they appear in the row (from left to right). No diagonal element can be omitted from a sparse storage if the solver is called with the non-unit indicator.
indx INTEGER. Array containing the column indices for each non-zero block in the matrix A. Its length is equal to the number of non-zero blocks in the matrix A. Refer to the columns array description in BSR Format for more details.
pntrb INTEGER. Array of length m. For one-based indexing, this array contains row indices such that pntrb(i) - pntrb(1)+1 is the first index of block row i in the array indx. For zero-based indexing, this array contains row indices such that pntrb(i) - pntrb(0) is the first index of block row i in the array indx. Refer to the pointerB array description in BSR Format for more details.
pntre INTEGER. Array of length m. For one-based indexing, this array contains row indices such that pntre(i) - pntrb(1) is the last index of block row i in the array indx.
For zero-based indexing this array contains row indices, such that pntre(i) - pntrb(0)-1 is the last index of block row i in the array indx. Refer to pointerE array description in BSR Format for more details. b REAL for mkl_sbsrsm. DOUBLE PRECISION for mkl_dbsrsm. COMPLEX for mkl_cbsrsm. DOUBLE COMPLEX for mkl_zbsrsm. Array, DIMENSION (ldb, n) for one-based indexing, DIMENSION (m, ldb) for zero-based indexing. On entry the leading m-by-n part of the array b must contain the matrix B. ldb INTEGER. Specifies the leading dimension (in blocks) of b as declared in the calling (sub)program. ldc INTEGER. Specifies the leading dimension (in blocks) of c as declared in the calling (sub)program. 2 Intel® Math Kernel Library Reference Manual 270 Output Parameters c REAL for mkl_sbsrsm. DOUBLE PRECISION for mkl_dbsrsm. COMPLEX for mkl_cbsrsm. DOUBLE COMPLEX for mkl_zbsrsm. Array, DIMENSION (ldc, n) for one-based indexing, DIMENSION (m, ldc) for zero-based indexing. The leading m-by-n part of the array c contains the output matrix C. 
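The relationship between val, indx, pntrb, pntre, and lb in the BSR arguments can be made concrete by expanding a BSR matrix into a dense array. The sketch below is illustrative only and does not call MKL; the name bsr_to_dense_ref is hypothetical. It assumes zero-based indexing, real double-precision data, and that the lb*lb elements of each block are stored row by row inside val (an assumption about the in-block ordering; check the BSR Format description for the ordering that applies to your indexing mode).

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative reference only (hypothetical name, not an MKL routine).
 * Expands a zero-based BSR matrix with m block rows and m block columns
 * of lb-by-lb blocks into a dense row-major array of order m*lb.
 * ASSUMPTION: elements inside each block are stored row-major in val. */
void bsr_to_dense_ref(int m, int lb,
                      const double *val, const int *indx,
                      const int *pntrb, const int *pntre,
                      double *dense)
{
    assert(m >= 0 && lb >= 1);
    const size_t n = (size_t)m * (size_t)lb;    /* order of the dense matrix */
    for (size_t e = 0; e < n * n; ++e)
        dense[e] = 0.0;

    const int base = (m > 0) ? pntrb[0] : 0;
    for (int i = 0; i < m; ++i) {
        /* Blocks of block row i occupy positions first..last of indx. */
        for (int p = pntrb[i] - base; p <= pntre[i] - base - 1; ++p) {
            const double *blk = val + (size_t)p * lb * lb; /* block p in val */
            const size_t row0 = (size_t)i * lb;
            const size_t col0 = (size_t)indx[p] * lb;
            for (int r = 0; r < lb; ++r)
                for (int s = 0; s < lb; ++s)
                    dense[(row0 + r) * n + col0 + s] = blk[(size_t)r * lb + s];
        }
    }
}
```

With three non-zero blocks of size 2, val holds 3*2*2 = 12 elements, which matches the stated length of val: the number of non-zero blocks multiplied by lb*lb.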
Interfaces FORTRAN 77: SUBROUTINE mkl_sbsrsm(transa, m, n, lb, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, c, ldc) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m, n, lb, ldb, ldc INTEGER indx(*), pntrb(m), pntre(m) REAL alpha REAL val(*), b(ldb,*), c(ldc,*) SUBROUTINE mkl_dbsrsm(transa, m, n, lb, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, c, ldc) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m, n, lb, ldb, ldc INTEGER indx(*), pntrb(m), pntre(m) DOUBLE PRECISION alpha DOUBLE PRECISION val(*), b(ldb,*), c(ldc,*) SUBROUTINE mkl_cbsrsm(transa, m, n, lb, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, c, ldc) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m, n, lb, ldb, ldc INTEGER indx(*), pntrb(m), pntre(m) COMPLEX alpha COMPLEX val(*), b(ldb,*), c(ldc,*) SUBROUTINE mkl_zbsrsm(transa, m, n, lb, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, c, ldc) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m, n, lb, ldb, ldc INTEGER indx(*), pntrb(m), pntre(m) DOUBLE COMPLEX alpha DOUBLE COMPLEX val(*), b(ldb,*), c(ldc,*) BLAS and Sparse BLAS Routines 2 271 C: void mkl_sbsrsm(char *transa, int *m, int *n, int *lb, float *alpha, char *matdescra, float *val, int *indx, int *pntrb, int *pntre, float *b, int *ldb, float *c, int *ldc); void mkl_dbsrsm(char *transa, int *m, int *n, int *lb, double *alpha, char *matdescra, double *val, int *indx, int *pntrb, int *pntre, double *b, int *ldb, double *c, int *ldc); void mkl_cbsrsm(char *transa, int *m, int *n, int *lb, MKL_Complex8 *alpha, char *matdescra, MKL_Complex8 *val, int *indx, int *pntrb, int *pntre, MKL_Complex8 *b, int *ldb, MKL_Complex8 *c, int *ldc); void mkl_zbsrsm(char *transa, int *m, int *n, int *lb, MKL_Complex16 *alpha, char *matdescra, MKL_Complex16 *val, int *indx, int *pntrb, int *pntre, MKL_Complex16 *b, int *ldb, MKL_Complex16 *c, int *ldc); mkl_?diamv Computes matrix - vector product for a sparse matrix in the diagonal format with one-based indexing. 
Syntax Fortran: call mkl_sdiamv(transa, m, k, alpha, matdescra, val, lval, idiag, ndiag, x, beta, y) call mkl_ddiamv(transa, m, k, alpha, matdescra, val, lval, idiag, ndiag, x, beta, y) call mkl_cdiamv(transa, m, k, alpha, matdescra, val, lval, idiag, ndiag, x, beta, y) call mkl_zdiamv(transa, m, k, alpha, matdescra, val, lval, idiag, ndiag, x, beta, y) C: mkl_sdiamv(&transa, &m, &k, &alpha, matdescra, val, &lval, idiag, &ndiag, x, &beta, y); mkl_ddiamv(&transa, &m, &k, &alpha, matdescra, val, &lval, idiag, &ndiag, x, &beta, y); mkl_cdiamv(&transa, &m, &k, &alpha, matdescra, val, &lval, idiag, &ndiag, x, &beta, y); mkl_zdiamv(&transa, &m, &k, &alpha, matdescra, val, &lval, idiag, &ndiag, x, &beta, y); Include Files • FORTRAN 77: mkl_spblas.fi • C: mkl_spblas.h Description The mkl_?diamv routine performs a matrix-vector operation defined as y := alpha*A*x + beta*y or y := alpha*A'*x + beta*y, where: alpha and beta are scalars, 2 Intel® Math Kernel Library Reference Manual 272 x and y are vectors, A is an m-by-k sparse matrix stored in the diagonal format, A' is the transpose of A. NOTE This routine supports only one-based indexing of the input arrays. Input Parameters Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below. transa CHARACTER*1. Specifies the operation. If transa = 'N' or 'n', then y := alpha*A*x + beta*y, If transa = 'T' or 't' or 'C' or 'c', then y := alpha*A'*x + beta*y. m INTEGER. Number of rows of the matrix A. k INTEGER. Number of columns of the matrix A. alpha REAL for mkl_sdiamv. DOUBLE PRECISION for mkl_ddiamv. COMPLEX for mkl_cdiamv. DOUBLE COMPLEX for mkl_zdiamv. Specifies the scalar alpha. matdescra CHARACTER. Array of six elements, specifies properties of the matrix used for operation. 
Only the first four array elements are used; their possible values are given in Table “Possible Values of the Parameter matdescra (descra)”. Possible combinations of element values of this parameter are given in Table “Possible Combinations of Element Values of the Parameter matdescra”.
val REAL for mkl_sdiamv. DOUBLE PRECISION for mkl_ddiamv. COMPLEX for mkl_cdiamv. DOUBLE COMPLEX for mkl_zdiamv. Two-dimensional array of size lval by ndiag, contains non-zero diagonals of the matrix A. Refer to the values array description in Diagonal Storage Scheme for more details.
lval INTEGER. Leading dimension of val, lval ≥ m. Refer to the lval description in Diagonal Storage Scheme for more details.
idiag INTEGER. Array of length ndiag, contains the distances between the main diagonal and each non-zero diagonal in the matrix A. Refer to the distance array description in Diagonal Storage Scheme for more details.
ndiag INTEGER. Specifies the number of non-zero diagonals of the matrix A.
x REAL for mkl_sdiamv. DOUBLE PRECISION for mkl_ddiamv. COMPLEX for mkl_cdiamv. DOUBLE COMPLEX for mkl_zdiamv. Array, DIMENSION at least k if transa = 'N' or 'n', and at least m otherwise. On entry, the array x must contain the vector x.
beta REAL for mkl_sdiamv. DOUBLE PRECISION for mkl_ddiamv. COMPLEX for mkl_cdiamv. DOUBLE COMPLEX for mkl_zdiamv. Specifies the scalar beta.
y REAL for mkl_sdiamv. DOUBLE PRECISION for mkl_ddiamv. COMPLEX for mkl_cdiamv. DOUBLE COMPLEX for mkl_zdiamv. Array, DIMENSION at least m if transa = 'N' or 'n', and at least k otherwise. On entry, the array y must contain the vector y.
Output Parameters
y Overwritten by the updated vector y.
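The non-transposed operation y := alpha*A*x + beta*y on diagonal storage can be sketched as follows. This is an illustrative reference, not the MKL implementation; the name dia_matvec_ref is hypothetical. It assumes real double-precision data, a general matrix, and that val is laid out column by column with leading dimension lval, so that the entry of row i on the diagonal at distance idiag[d] from the main diagonal sits at val[d*lval + i] (array offsets in the C code are zero-based, while the distances keep their usual sign convention).

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative reference only (hypothetical name, not an MKL routine).
 * Computes y := alpha*A*x + beta*y for an m-by-k matrix A held in
 * diagonal storage: val is lval-by-ndiag (column-major) and idiag[d]
 * is the distance of stored diagonal d from the main diagonal
 * (negative below, positive above). */
void dia_matvec_ref(int m, int k, double alpha,
                    const double *val, int lval,
                    const int *idiag, int ndiag,
                    const double *x, double beta, double *y)
{
    assert(m >= 0 && k >= 0 && ndiag >= 0 && lval >= m);
    for (int i = 0; i < m; ++i)
        y[i] = (beta == 0.0) ? 0.0 : beta * y[i];

    for (int d = 0; d < ndiag; ++d) {
        const int dist = idiag[d];
        for (int i = 0; i < m; ++i) {
            const int j = i + dist;            /* column of the entry in row i */
            if (j >= 0 && j < k)               /* skip padding outside the matrix */
                y[i] += alpha * val[(size_t)d * lval + i] * x[j];
        }
    }
}
```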
Interfaces FORTRAN 77: SUBROUTINE mkl_sdiamv(transa, m, k, alpha, matdescra, val, lval, idiag, ndiag, x, beta, y) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m, k, lval, ndiag INTEGER idiag(*) REAL alpha, beta REAL val(lval,*), x(*), y(*) SUBROUTINE mkl_ddiamv(transa, m, k, alpha, matdescra, val, lval, idiag, ndiag, x, beta, y) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m, k, lval, ndiag INTEGER idiag(*) DOUBLE PRECISION alpha, beta DOUBLE PRECISION val(lval,*), x(*), y(*) SUBROUTINE mkl_cdiamv(transa, m, k, alpha, matdescra, val, lval, idiag, ndiag, x, beta, y) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m, k, lval, ndiag INTEGER idiag(*) COMPLEX alpha, beta COMPLEX val(lval,*), x(*), y(*) 2 Intel® Math Kernel Library Reference Manual 274 SUBROUTINE mkl_zdiamv(transa, m, k, alpha, matdescra, val, lval, idiag, ndiag, x, beta, y) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m, k, lval, ndiag INTEGER idiag(*) DOUBLE COMPLEX alpha, beta DOUBLE COMPLEX val(lval,*), x(*), y(*) C: void mkl_sdiamv(char *transa, int *m, int *k, float *alpha, char *matdescra, float *val, int *lval, int *idiag, int *ndiag, float *x, float *beta, float *y); void mkl_ddiamv(char *transa, int *m, int *k, double *alpha, char *matdescra, double *val, int *lval, int *idiag, int *ndiag, double *x, double *beta, double *y); void mkl_cdiamv(char *transa, int *m, int *k, MKL_Complex8 *alpha, char *matdescra, MKL_Complex8 *val, int *lval, int *idiag, int *ndiag, MKL_Complex8 *x, MKL_Complex8 *beta, MKL_Complex8 *y); void mkl_zdiamv(char *transa, int *m, int *k, MKL_Complex16 *alpha, char *matdescra, MKL_Complex16 *val, int *lval, int *idiag, int *ndiag, MKL_Complex16 *x, MKL_Complex16 *beta, MKL_Complex16 *y); mkl_?skymv Computes matrix - vector product for a sparse matrix in the skyline storage format with one-based indexing. 
Syntax Fortran: call mkl_sskymv(transa, m, k, alpha, matdescra, val, pntr, x, beta, y) call mkl_dskymv(transa, m, k, alpha, matdescra, val, pntr, x, beta, y) call mkl_cskymv(transa, m, k, alpha, matdescra, val, pntr, x, beta, y) call mkl_zskymv(transa, m, k, alpha, matdescra, val, pntr, x, beta, y) C: mkl_sskymv(&transa, &m, &k, &alpha, matdescra, val, pntr, x, &beta, y); mkl_dskymv(&transa, &m, &k, &alpha, matdescra, val, pntr, x, &beta, y); mkl_cskymv(&transa, &m, &k, &alpha, matdescra, val, pntr, x, &beta, y); mkl_zskymv(&transa, &m, &k, &alpha, matdescra, val, pntr, x, &beta, y); Include Files • FORTRAN 77: mkl_spblas.fi BLAS and Sparse BLAS Routines 2 275 • C: mkl_spblas.h Description The mkl_?skymv routine performs a matrix-vector operation defined as y := alpha*A*x + beta*y or y := alpha*A'*x + beta*y, where: alpha and beta are scalars, x and y are vectors, A is an m-by-k sparse matrix stored using the skyline storage scheme, A' is the transpose of A. NOTE This routine supports only one-based indexing of the input arrays. Input Parameters Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below. transa CHARACTER*1. Specifies the operation. If transa = 'N' or 'n', then y := alpha*A*x + beta*y If transa = 'T' or 't' or 'C' or 'c', then y := alpha*A'*x + beta*y, m INTEGER. Number of rows of the matrix A. k INTEGER. Number of columns of the matrix A. alpha REAL for mkl_sskymv. DOUBLE PRECISION for mkl_dskymv. COMPLEX for mkl_cskymv. DOUBLE COMPLEX for mkl_zskymv. Specifies the scalar alpha. matdescra CHARACTER. Array of six elements, specifies properties of the matrix used for operation. Only first four array elements are used, their possible values are given in Table “Possible Values of the Parameter matdescra (descra)”. 
Possible combinations of element values of this parameter are given in Table “Possible Combinations of Element Values of the Parameter matdescra”. NOTE General matrices (matdescra (1)='G') is not supported. val REAL for mkl_sskymv. DOUBLE PRECISION for mkl_dskymv. COMPLEX for mkl_cskymv. DOUBLE COMPLEX for mkl_zskymv. Array containing the set of elements of the matrix A in the skyline profile form. If matdescrsa(2)= 'L', then val contains elements from the low triangle of the matrix A. 2 Intel® Math Kernel Library Reference Manual 276 If matdescrsa(2)= 'U', then val contains elements from the upper triangle of the matrix A. Refer to values array description in Skyline Storage Scheme for more details. pntr INTEGER. Array of length (m+m) for lower triangle, and (k+k) for upper triangle. It contains the indices specifying in the val the positions of the first element in each row (column) of the matrix A. Refer to pointers array description in Skyline Storage Scheme for more details. x REAL for mkl_sskymv. DOUBLE PRECISION for mkl_dskymv. COMPLEX for mkl_cskymv. DOUBLE COMPLEX for mkl_zskymv. Array, DIMENSION at least k if transa = 'N' or 'n' and at least m otherwise. On entry, the array x must contain the vector x. beta REAL for mkl_sskymv. DOUBLE PRECISION for mkl_dskymv. COMPLEX for mkl_cskymv. DOUBLE COMPLEX for mkl_zskymv. Specifies the scalar beta. y REAL for mkl_sskymv. DOUBLE PRECISION for mkl_dskymv. COMPLEX for mkl_cskymv. DOUBLE COMPLEX for mkl_zskymv. Array, DIMENSION at least m if transa = 'N' or 'n' and at least k otherwise. On entry, the array y must contain the vector y. Output Parameters y Overwritten by the updated vector y. 
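A small sketch may help to see how val and pntr drive the skyline product for a lower triangular matrix. It is illustrative only and is not the MKL implementation; the name sky_lower_matvec_ref is hypothetical. It assumes real double-precision data, a square lower triangular matrix (matdescra(2) = 'L'), the non-transposed operation, a pointer array with m+1 one-based entries, and that each row is stored in val from its first non-zero element up to and including the diagonal.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative reference only (hypothetical name, not an MKL routine).
 * Computes y := alpha*A*x + beta*y for a lower triangular m-by-m matrix
 * in skyline form.
 * ASSUMPTIONS: pntr has m+1 one-based entries; row i occupies
 * val[pntr[i]-pntr[0] .. pntr[i+1]-pntr[0]-1] and ends at the diagonal. */
void sky_lower_matvec_ref(int m, double alpha,
                          const double *val, const int *pntr,
                          const double *x, double beta, double *y)
{
    assert(m >= 0);
    for (int i = 0; i < m; ++i) {
        const int start = pntr[i] - pntr[0];    /* offset of row i in val */
        const int len   = pntr[i + 1] - pntr[i];/* stored entries of row i */
        assert(len >= 1 && len <= i + 1);
        const int first_col = i - len + 1;      /* column of the first stored entry */
        double sum = 0.0;
        for (int t = 0; t < len; ++t)
            sum += val[start + t] * x[first_col + t];
        y[i] = alpha * sum + ((beta == 0.0) ? 0.0 : beta * y[i]);
    }
}
```

Note that zeros lying inside the profile of a row (between its first non-zero element and the diagonal) are stored explicitly in val under this assumption.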
Interfaces

FORTRAN 77:
SUBROUTINE mkl_sskymv(transa, m, k, alpha, matdescra, val, pntr, x, beta, y)
  CHARACTER*1 transa
  CHARACTER matdescra(*)
  INTEGER m, k
  INTEGER pntr(*)
  REAL alpha, beta
  REAL val(*), x(*), y(*)
SUBROUTINE mkl_dskymv(transa, m, k, alpha, matdescra, val, pntr, x, beta, y)
  CHARACTER*1 transa
  CHARACTER matdescra(*)
  INTEGER m, k
  INTEGER pntr(*)
  DOUBLE PRECISION alpha, beta
  DOUBLE PRECISION val(*), x(*), y(*)
SUBROUTINE mkl_cskymv(transa, m, k, alpha, matdescra, val, pntr, x, beta, y)
  CHARACTER*1 transa
  CHARACTER matdescra(*)
  INTEGER m, k
  INTEGER pntr(*)
  COMPLEX alpha, beta
  COMPLEX val(*), x(*), y(*)
SUBROUTINE mkl_zskymv(transa, m, k, alpha, matdescra, val, pntr, x, beta, y)
  CHARACTER*1 transa
  CHARACTER matdescra(*)
  INTEGER m, k
  INTEGER pntr(*)
  DOUBLE COMPLEX alpha, beta
  DOUBLE COMPLEX val(*), x(*), y(*)

C:
void mkl_sskymv(char *transa, int *m, int *k, float *alpha, char *matdescra, float *val, int *pntr, float *x, float *beta, float *y);
void mkl_dskymv(char *transa, int *m, int *k, double *alpha, char *matdescra, double *val, int *pntr, double *x, double *beta, double *y);
void mkl_cskymv(char *transa, int *m, int *k, MKL_Complex8 *alpha, char *matdescra, MKL_Complex8 *val, int *pntr, MKL_Complex8 *x, MKL_Complex8 *beta, MKL_Complex8 *y);
void mkl_zskymv(char *transa, int *m, int *k, MKL_Complex16 *alpha, char *matdescra, MKL_Complex16 *val, int *pntr, MKL_Complex16 *x, MKL_Complex16 *beta, MKL_Complex16 *y);

mkl_?diasv
Solves a system of linear equations for a sparse matrix in the diagonal format with one-based indexing.
Syntax

Fortran:
call mkl_sdiasv(transa, m, alpha, matdescra, val, lval, idiag, ndiag, x, y)
call mkl_ddiasv(transa, m, alpha, matdescra, val, lval, idiag, ndiag, x, y)
call mkl_cdiasv(transa, m, alpha, matdescra, val, lval, idiag, ndiag, x, y)
call mkl_zdiasv(transa, m, alpha, matdescra, val, lval, idiag, ndiag, x, y)

C:
mkl_sdiasv(&transa, &m, &alpha, matdescra, val, &lval, idiag, &ndiag, x, y);
mkl_ddiasv(&transa, &m, &alpha, matdescra, val, &lval, idiag, &ndiag, x, y);
mkl_cdiasv(&transa, &m, &alpha, matdescra, val, &lval, idiag, &ndiag, x, y);
mkl_zdiasv(&transa, &m, &alpha, matdescra, val, &lval, idiag, &ndiag, x, y);

Include Files
• FORTRAN 77: mkl_spblas.fi
• C: mkl_spblas.h

Description
The mkl_?diasv routine solves a system of linear equations with matrix-vector operations for a sparse matrix stored in the diagonal format:
y := alpha*inv(A)*x
or
y := alpha*inv(A')*x,
where:
alpha is a scalar,
x and y are vectors,
A is a sparse upper or lower triangular matrix with unit or non-unit main diagonal, A' is the transpose of A.

NOTE This routine supports only one-based indexing of the input arrays.

Input Parameters
Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below.

transa CHARACTER*1. Specifies the system of linear equations.
If transa = 'N' or 'n', then y := alpha*inv(A)*x.
If transa = 'T' or 't' or 'C' or 'c', then y := alpha*inv(A')*x.
m INTEGER. Number of rows of the matrix A.
alpha REAL for mkl_sdiasv. DOUBLE PRECISION for mkl_ddiasv. COMPLEX for mkl_cdiasv. DOUBLE COMPLEX for mkl_zdiasv. Specifies the scalar alpha.
matdescra CHARACTER. Array of six elements that specifies properties of the matrix used for the operation.
Only the first four array elements are used; their possible values are given in Table “Possible Values of the Parameter matdescra (descra)”. Possible combinations of element values of this parameter are given in Table “Possible Combinations of Element Values of the Parameter matdescra”.
val REAL for mkl_sdiasv. DOUBLE PRECISION for mkl_ddiasv. COMPLEX for mkl_cdiasv. DOUBLE COMPLEX for mkl_zdiasv. Two-dimensional array of size lval by ndiag, contains non-zero diagonals of the matrix A. Refer to values array description in Diagonal Storage Scheme for more details.
lval INTEGER. Leading dimension of val, lval=m. Refer to lval description in Diagonal Storage Scheme for more details.
idiag INTEGER. Array of length ndiag, contains the distances between the main diagonal and each non-zero diagonal in the matrix A.
NOTE All elements of this array must be sorted in increasing order.
Refer to distance array description in Diagonal Storage Scheme for more details.
ndiag INTEGER. Specifies the number of non-zero diagonals of the matrix A.
x REAL for mkl_sdiasv. DOUBLE PRECISION for mkl_ddiasv. COMPLEX for mkl_cdiasv. DOUBLE COMPLEX for mkl_zdiasv. Array, DIMENSION at least m. On entry, the array x must contain the vector x. The elements are accessed with unit increment.
y REAL for mkl_sdiasv. DOUBLE PRECISION for mkl_ddiasv. COMPLEX for mkl_cdiasv. DOUBLE COMPLEX for mkl_zdiasv. Array, DIMENSION at least m. On entry, the array y must contain the vector y. The elements are accessed with unit increment.

Output Parameters
y Contains the solution vector, alpha*inv(A)*x or alpha*inv(A')*x.
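The triangular solve described above can be sketched in plain C as a forward substitution. This is an illustration only, not the MKL implementation: dia_lower_solve is a hypothetical helper restricted to the non-transposed case of a lower triangular matrix with a non-unit diagonal, in double precision. It assumes the layout from Diagonal Storage Scheme, in which val is stored column by column with leading dimension lval, and the entry in row i of diagonal d holds the matrix element in row i and column i + idiag(d).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference helper (not an MKL routine):
 * y := alpha*inv(A)*x for an m-by-m lower triangular matrix A in
 * diagonal storage, non-unit diagonal, by forward substitution.
 * Assumed layout: val is column-major with leading dimension lval, and
 * val[i + d*lval] holds A(i, i + idiag[d]) with zero-based i.
 * Returns 0 on success, -1 if a diagonal element is zero or missing. */
int dia_lower_solve(int m, double alpha, const double *val, int lval,
                    const int *idiag, int ndiag, const double *x, double *y)
{
    assert(m >= 0 && lval >= m && ndiag >= 0);
    for (int i = 0; i < m; ++i) {
        double sum = alpha * x[i];
        double diag = 0.0;
        for (int d = 0; d < ndiag; ++d) {
            int j = i + idiag[d];      /* column of this stored entry */
            if (j < 0 || j > i)
                continue;              /* padding, or strictly upper part */
            double a = val[i + (size_t)d * lval];
            if (j == i)
                diag = a;              /* main diagonal element */
            else
                sum -= a * y[j];       /* y[j] is already solved for j < i */
        }
        if (diag == 0.0)
            return -1;
        y[i] = sum / diag;
    }
    return 0;
}
```

For the 3-by-3 matrix with rows (2 0 0), (1 4 0), (3 0 5), the assumed layout gives idiag = {-2, -1, 0} and, with lval = 3, val = {0, 0, 3, 0, 1, 0, 2, 4, 5}, where the leading zeros of the first two columns are padding.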
Interfaces

FORTRAN 77:
SUBROUTINE mkl_sdiasv(transa, m, alpha, matdescra, val, lval, idiag, ndiag, x, y)
  CHARACTER*1 transa
  CHARACTER matdescra(*)
  INTEGER m, lval, ndiag
  INTEGER idiag(*)
  REAL alpha
  REAL val(lval,*), x(*), y(*)
SUBROUTINE mkl_ddiasv(transa, m, alpha, matdescra, val, lval, idiag, ndiag, x, y)
  CHARACTER*1 transa
  CHARACTER matdescra(*)
  INTEGER m, lval, ndiag
  INTEGER idiag(*)
  DOUBLE PRECISION alpha
  DOUBLE PRECISION val(lval,*), x(*), y(*)
SUBROUTINE mkl_cdiasv(transa, m, alpha, matdescra, val, lval, idiag, ndiag, x, y)
  CHARACTER*1 transa
  CHARACTER matdescra(*)
  INTEGER m, lval, ndiag
  INTEGER idiag(*)
  COMPLEX alpha
  COMPLEX val(lval,*), x(*), y(*)
SUBROUTINE mkl_zdiasv(transa, m, alpha, matdescra, val, lval, idiag, ndiag, x, y)
  CHARACTER*1 transa
  CHARACTER matdescra(*)
  INTEGER m, lval, ndiag
  INTEGER idiag(*)
  DOUBLE COMPLEX alpha
  DOUBLE COMPLEX val(lval,*), x(*), y(*)

C:
void mkl_sdiasv(char *transa, int *m, float *alpha, char *matdescra, float *val, int *lval, int *idiag, int *ndiag, float *x, float *y);
void mkl_ddiasv(char *transa, int *m, double *alpha, char *matdescra, double *val, int *lval, int *idiag, int *ndiag, double *x, double *y);
void mkl_cdiasv(char *transa, int *m, MKL_Complex8 *alpha, char *matdescra, MKL_Complex8 *val, int *lval, int *idiag, int *ndiag, MKL_Complex8 *x, MKL_Complex8 *y);
void mkl_zdiasv(char *transa, int *m, MKL_Complex16 *alpha, char *matdescra, MKL_Complex16 *val, int *lval, int *idiag, int *ndiag, MKL_Complex16 *x, MKL_Complex16 *y);

mkl_?skysv
Solves a system of linear equations for a sparse matrix in the skyline format with one-based indexing.
Syntax

Fortran:
call mkl_sskysv(transa, m, alpha, matdescra, val, pntr, x, y)
call mkl_dskysv(transa, m, alpha, matdescra, val, pntr, x, y)
call mkl_cskysv(transa, m, alpha, matdescra, val, pntr, x, y)
call mkl_zskysv(transa, m, alpha, matdescra, val, pntr, x, y)

C:
mkl_sskysv(&transa, &m, &alpha, matdescra, val, pntr, x, y);
mkl_dskysv(&transa, &m, &alpha, matdescra, val, pntr, x, y);
mkl_cskysv(&transa, &m, &alpha, matdescra, val, pntr, x, y);
mkl_zskysv(&transa, &m, &alpha, matdescra, val, pntr, x, y);

Include Files
• FORTRAN 77: mkl_spblas.fi
• C: mkl_spblas.h

Description
The mkl_?skysv routine solves a system of linear equations with matrix-vector operations for a sparse matrix in the skyline storage format:
y := alpha*inv(A)*x
or
y := alpha*inv(A')*x,
where:
alpha is a scalar,
x and y are vectors,
A is a sparse upper or lower triangular matrix with unit or non-unit main diagonal, A' is the transpose of A.

NOTE This routine supports only one-based indexing of the input arrays.

Input Parameters
Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below.

transa CHARACTER*1. Specifies the system of linear equations.
If transa = 'N' or 'n', then y := alpha*inv(A)*x.
If transa = 'T' or 't' or 'C' or 'c', then y := alpha*inv(A')*x.
m INTEGER. Number of rows of the matrix A.
alpha REAL for mkl_sskysv. DOUBLE PRECISION for mkl_dskysv. COMPLEX for mkl_cskysv. DOUBLE COMPLEX for mkl_zskysv. Specifies the scalar alpha.
matdescra CHARACTER. Array of six elements that specifies properties of the matrix used for the operation. Only the first four array elements are used; their possible values are given in Table “Possible Values of the Parameter matdescra (descra)”.
Possible combinations of element values of this parameter are given in Table “Possible Combinations of Element Values of the Parameter matdescra”.
NOTE General matrices (matdescra(1)='G') are not supported.
val REAL for mkl_sskysv. DOUBLE PRECISION for mkl_dskysv. COMPLEX for mkl_cskysv. DOUBLE COMPLEX for mkl_zskysv. Array containing the set of elements of the matrix A in the skyline profile form.
If matdescra(2)= 'L', then val contains elements from the lower triangle of the matrix A.
If matdescra(2)= 'U', then val contains elements from the upper triangle of the matrix A.
Refer to values array description in Skyline Storage Scheme for more details.
pntr INTEGER. Array of length (m+m) for lower triangle, and (k+k) for upper triangle. It contains the indices specifying the positions in val of the first element in each row (column) of the matrix A. Refer to pointers array description in Skyline Storage Scheme for more details.
x REAL for mkl_sskysv. DOUBLE PRECISION for mkl_dskysv. COMPLEX for mkl_cskysv. DOUBLE COMPLEX for mkl_zskysv. Array, DIMENSION at least m. On entry, the array x must contain the vector x. The elements are accessed with unit increment.
y REAL for mkl_sskysv. DOUBLE PRECISION for mkl_dskysv. COMPLEX for mkl_cskysv. DOUBLE COMPLEX for mkl_zskysv. Array, DIMENSION at least m. On entry, the array y must contain the vector y. The elements are accessed with unit increment.

Output Parameters
y Contains the solution vector, alpha*inv(A)*x or alpha*inv(A')*x.
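As a rough illustration of the solve described above, the following plain-C sketch applies forward substitution to a lower triangular matrix in skyline storage. It is not the MKL implementation: sky_lower_solve is a hypothetical helper covering only the non-transposed, non-unit-diagonal case in double precision, and it assumes the Skyline Storage Scheme layout in which pntr holds m+1 one-based entries and each row ends with its diagonal element.

```c
#include <assert.h>

/* Hypothetical reference helper (not an MKL routine):
 * y := alpha*inv(A)*x for an m-by-m lower triangular matrix A in
 * skyline storage, non-unit diagonal, by forward substitution.
 * Assumed layout: pntr has m+1 one-based entries; row i occupies
 * val[pntr[i]-pntr[0]] .. val[pntr[i+1]-pntr[0]-1] and ends at the
 * diagonal element.
 * Returns 0 on success, -1 on a malformed row or a zero diagonal. */
int sky_lower_solve(int m, double alpha, const double *val, const int *pntr,
                    const double *x, double *y)
{
    assert(m >= 0);
    for (int i = 0; i < m; ++i) {
        int start = pntr[i] - pntr[0];    /* zero-based offset of row i in val */
        int len = pntr[i + 1] - pntr[i];  /* stored entries, diagonal included */
        int first_col = i - len + 1;      /* column of the first stored entry */
        if (len < 1 || first_col < 0)
            return -1;

        double sum = alpha * x[i];
        for (int j = 0; j < len - 1; ++j)  /* off-diagonal part of row i */
            sum -= val[start + j] * y[first_col + j];

        double diag = val[start + len - 1];
        if (diag == 0.0)
            return -1;
        y[i] = sum / diag;
    }
    return 0;
}
```

Because row i only references columns up to i, every y value it needs has already been computed, which is why a single top-to-bottom pass suffices for the lower triangular case.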
Interfaces

FORTRAN 77:
SUBROUTINE mkl_sskysv(transa, m, alpha, matdescra, val, pntr, x, y)
  CHARACTER*1 transa
  CHARACTER matdescra(*)
  INTEGER m
  INTEGER pntr(*)
  REAL alpha
  REAL val(*), x(*), y(*)
SUBROUTINE mkl_dskysv(transa, m, alpha, matdescra, val, pntr, x, y)
  CHARACTER*1 transa
  CHARACTER matdescra(*)
  INTEGER m
  INTEGER pntr(*)
  DOUBLE PRECISION alpha
  DOUBLE PRECISION val(*), x(*), y(*)
SUBROUTINE mkl_cskysv(transa, m, alpha, matdescra, val, pntr, x, y)
  CHARACTER*1 transa
  CHARACTER matdescra(*)
  INTEGER m
  INTEGER pntr(*)
  COMPLEX alpha
  COMPLEX val(*), x(*), y(*)
SUBROUTINE mkl_zskysv(transa, m, alpha, matdescra, val, pntr, x, y)
  CHARACTER*1 transa
  CHARACTER matdescra(*)
  INTEGER m
  INTEGER pntr(*)
  DOUBLE COMPLEX alpha
  DOUBLE COMPLEX val(*), x(*), y(*)

C:
void mkl_sskysv(char *transa, int *m, float *alpha, char *matdescra, float *val, int *pntr, float *x, float *y);
void mkl_dskysv(char *transa, int *m, double *alpha, char *matdescra, double *val, int *pntr, double *x, double *y);
void mkl_cskysv(char *transa, int *m, MKL_Complex8 *alpha, char *matdescra, MKL_Complex8 *val, int *pntr, MKL_Complex8 *x, MKL_Complex8 *y);
void mkl_zskysv(char *transa, int *m, MKL_Complex16 *alpha, char *matdescra, MKL_Complex16 *val, int *pntr, MKL_Complex16 *x, MKL_Complex16 *y);

mkl_?diamm
Computes matrix-matrix product of a sparse matrix stored in the diagonal format with one-based indexing.
Syntax

Fortran:
call mkl_sdiamm(transa, m, n, k, alpha, matdescra, val, lval, idiag, ndiag, b, ldb, beta, c, ldc)
call mkl_ddiamm(transa, m, n, k, alpha, matdescra, val, lval, idiag, ndiag, b, ldb, beta, c, ldc)
call mkl_cdiamm(transa, m, n, k, alpha, matdescra, val, lval, idiag, ndiag, b, ldb, beta, c, ldc)
call mkl_zdiamm(transa, m, n, k, alpha, matdescra, val, lval, idiag, ndiag, b, ldb, beta, c, ldc)

C:
mkl_sdiamm(&transa, &m, &n, &k, &alpha, matdescra, val, &lval, idiag, &ndiag, b, &ldb, &beta, c, &ldc);
mkl_ddiamm(&transa, &m, &n, &k, &alpha, matdescra, val, &lval, idiag, &ndiag, b, &ldb, &beta, c, &ldc);
mkl_cdiamm(&transa, &m, &n, &k, &alpha, matdescra, val, &lval, idiag, &ndiag, b, &ldb, &beta, c, &ldc);
mkl_zdiamm(&transa, &m, &n, &k, &alpha, matdescra, val, &lval, idiag, &ndiag, b, &ldb, &beta, c, &ldc);

Include Files
• FORTRAN 77: mkl_spblas.fi
• C: mkl_spblas.h

Description
The mkl_?diamm routine performs a matrix-matrix operation defined as
C := alpha*A*B + beta*C
or
C := alpha*A'*B + beta*C,
where:
alpha and beta are scalars,
B and C are dense matrices,
A is an m-by-k sparse matrix in the diagonal format, A' is the transpose of A.

NOTE This routine supports only one-based indexing of the input arrays.

Input Parameters
Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below.

transa CHARACTER*1. Specifies the operation.
If transa = 'N' or 'n', then C := alpha*A*B + beta*C.
If transa = 'T' or 't' or 'C' or 'c', then C := alpha*A'*B + beta*C.
m INTEGER. Number of rows of the matrix A.
n INTEGER. Number of columns of the matrix C.
k INTEGER. Number of columns of the matrix A.
alpha REAL for mkl_sdiamm. DOUBLE PRECISION for mkl_ddiamm. COMPLEX for mkl_cdiamm. DOUBLE COMPLEX for mkl_zdiamm. Specifies the scalar alpha.
matdescra CHARACTER. Array of six elements that specifies properties of the matrix used for the operation. Only the first four array elements are used; their possible values are given in Table “Possible Values of the Parameter matdescra (descra)”. Possible combinations of element values of this parameter are given in Table “Possible Combinations of Element Values of the Parameter matdescra”.
val REAL for mkl_sdiamm. DOUBLE PRECISION for mkl_ddiamm. COMPLEX for mkl_cdiamm. DOUBLE COMPLEX for mkl_zdiamm. Two-dimensional array of size lval by ndiag, contains non-zero diagonals of the matrix A. Refer to values array description in Diagonal Storage Scheme for more details.
lval INTEGER. Leading dimension of val, lval=m. Refer to lval description in Diagonal Storage Scheme for more details.
idiag INTEGER. Array of length ndiag, contains the distances between the main diagonal and each non-zero diagonal in the matrix A. Refer to distance array description in Diagonal Storage Scheme for more details.
ndiag INTEGER. Specifies the number of non-zero diagonals of the matrix A.
b REAL for mkl_sdiamm. DOUBLE PRECISION for mkl_ddiamm. COMPLEX for mkl_cdiamm. DOUBLE COMPLEX for mkl_zdiamm. Array, DIMENSION (ldb, n). On entry with transa = 'N' or 'n', the leading k-by-n part of the array b must contain the matrix B, otherwise the leading m-by-n part of the array b must contain the matrix B.
ldb INTEGER. Specifies the leading dimension of b as declared in the calling (sub)program.
beta REAL for mkl_sdiamm. DOUBLE PRECISION for mkl_ddiamm. COMPLEX for mkl_cdiamm. DOUBLE COMPLEX for mkl_zdiamm. Specifies the scalar beta.
c REAL for mkl_sdiamm. DOUBLE PRECISION for mkl_ddiamm. COMPLEX for mkl_cdiamm. DOUBLE COMPLEX for mkl_zdiamm. Array, DIMENSION (ldc, n). On entry with transa = 'N' or 'n', the leading m-by-n part of the array c must contain the matrix C, otherwise the leading k-by-n part of the array c must contain the matrix C.
ldc INTEGER.
Specifies the leading dimension of c as declared in the calling (sub)program.

Output Parameters
c Overwritten by the matrix (alpha*A*B + beta*C) or (alpha*A'*B + beta*C).

Interfaces

FORTRAN 77:
SUBROUTINE mkl_sdiamm(transa, m, n, k, alpha, matdescra, val, lval, idiag, ndiag, b, ldb, beta, c, ldc)
  CHARACTER*1 transa
  CHARACTER matdescra(*)
  INTEGER m, n, k, ldb, ldc, lval, ndiag
  INTEGER idiag(*)
  REAL alpha, beta
  REAL val(lval,*), b(ldb,*), c(ldc,*)
SUBROUTINE mkl_ddiamm(transa, m, n, k, alpha, matdescra, val, lval, idiag, ndiag, b, ldb, beta, c, ldc)
  CHARACTER*1 transa
  CHARACTER matdescra(*)
  INTEGER m, n, k, ldb, ldc, lval, ndiag
  INTEGER idiag(*)
  DOUBLE PRECISION alpha, beta
  DOUBLE PRECISION val(lval,*), b(ldb,*), c(ldc,*)
SUBROUTINE mkl_cdiamm(transa, m, n, k, alpha, matdescra, val, lval, idiag, ndiag, b, ldb, beta, c, ldc)
  CHARACTER*1 transa
  CHARACTER matdescra(*)
  INTEGER m, n, k, ldb, ldc, lval, ndiag
  INTEGER idiag(*)
  COMPLEX alpha, beta
  COMPLEX val(lval,*), b(ldb,*), c(ldc,*)
SUBROUTINE mkl_zdiamm(transa, m, n, k, alpha, matdescra, val, lval, idiag, ndiag, b, ldb, beta, c, ldc)
  CHARACTER*1 transa
  CHARACTER matdescra(*)
  INTEGER m, n, k, ldb, ldc, lval, ndiag
  INTEGER idiag(*)
  DOUBLE COMPLEX alpha, beta
  DOUBLE COMPLEX val(lval,*), b(ldb,*), c(ldc,*)

C:
void mkl_sdiamm(char *transa, int *m, int *n, int *k, float *alpha, char *matdescra, float *val, int *lval, int *idiag, int *ndiag, float *b, int *ldb, float *beta, float *c, int *ldc);
void mkl_ddiamm(char *transa, int *m, int *n, int *k, double *alpha, char *matdescra, double *val, int *lval, int *idiag, int *ndiag, double *b, int *ldb, double *beta, double *c, int *ldc);
void mkl_cdiamm(char *transa, int *m, int *n, int *k, MKL_Complex8 *alpha, char *matdescra, MKL_Complex8 *val, int *lval, int *idiag, int *ndiag, MKL_Complex8 *b, int *ldb, MKL_Complex8 *beta, MKL_Complex8 *c, int *ldc);
void mkl_zdiamm(char *transa, int *m, int *n, int *k, MKL_Complex16 *alpha, char *matdescra, MKL_Complex16 *val, int *lval, int *idiag, int *ndiag, MKL_Complex16 *b, int *ldb, MKL_Complex16 *beta, MKL_Complex16 *c, int *ldc);

mkl_?skymm
Computes matrix-matrix product of a sparse matrix stored using the skyline storage scheme with one-based indexing.

Syntax

Fortran:
call mkl_sskymm(transa, m, n, k, alpha, matdescra, val, pntr, b, ldb, beta, c, ldc)
call mkl_dskymm(transa, m, n, k, alpha, matdescra, val, pntr, b, ldb, beta, c, ldc)
call mkl_cskymm(transa, m, n, k, alpha, matdescra, val, pntr, b, ldb, beta, c, ldc)
call mkl_zskymm(transa, m, n, k, alpha, matdescra, val, pntr, b, ldb, beta, c, ldc)

C:
mkl_sskymm(&transa, &m, &n, &k, &alpha, matdescra, val, pntr, b, &ldb, &beta, c, &ldc);
mkl_dskymm(&transa, &m, &n, &k, &alpha, matdescra, val, pntr, b, &ldb, &beta, c, &ldc);
mkl_cskymm(&transa, &m, &n, &k, &alpha, matdescra, val, pntr, b, &ldb, &beta, c, &ldc);
mkl_zskymm(&transa, &m, &n, &k, &alpha, matdescra, val, pntr, b, &ldb, &beta, c, &ldc);

Include Files
• FORTRAN 77: mkl_spblas.fi
• C: mkl_spblas.h

Description
The mkl_?skymm routine performs a matrix-matrix operation defined as
C := alpha*A*B + beta*C
or
C := alpha*A'*B + beta*C,
where:
alpha and beta are scalars,
B and C are dense matrices,
A is an m-by-k sparse matrix in the skyline storage format, A' is the transpose of A.

NOTE This routine supports only one-based indexing of the input arrays.

Input Parameters
Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below.

transa CHARACTER*1. Specifies the operation.
If transa = 'N' or 'n', then C := alpha*A*B + beta*C.
If transa = 'T' or 't' or 'C' or 'c', then C := alpha*A'*B + beta*C.
m INTEGER. Number of rows of the matrix A.
n INTEGER. Number of columns of the matrix C.
k INTEGER. Number of columns of the matrix A.
alpha REAL for mkl_sskymm. DOUBLE PRECISION for mkl_dskymm. COMPLEX for mkl_cskymm. DOUBLE COMPLEX for mkl_zskymm. Specifies the scalar alpha.
matdescra CHARACTER. Array of six elements that specifies properties of the matrix used for the operation. Only the first four array elements are used; their possible values are given in Table “Possible Values of the Parameter matdescra (descra)”. Possible combinations of element values of this parameter are given in Table “Possible Combinations of Element Values of the Parameter matdescra”.
NOTE General matrices (matdescra(1)='G') are not supported.
val REAL for mkl_sskymm. DOUBLE PRECISION for mkl_dskymm. COMPLEX for mkl_cskymm. DOUBLE COMPLEX for mkl_zskymm. Array containing the set of elements of the matrix A in the skyline profile form.
If matdescra(2)= 'L', then val contains elements from the lower triangle of the matrix A.
If matdescra(2)= 'U', then val contains elements from the upper triangle of the matrix A.
Refer to values array description in Skyline Storage Scheme for more details.
pntr INTEGER. Array of length (m+m) for lower triangle, and (k+k) for upper triangle. It contains the indices specifying the positions in val of the first element in each row (column) of the matrix A. Refer to pointers array description in Skyline Storage Scheme for more details.
b REAL for mkl_sskymm. DOUBLE PRECISION for mkl_dskymm. COMPLEX for mkl_cskymm. DOUBLE COMPLEX for mkl_zskymm. Array, DIMENSION (ldb, n). On entry with transa = 'N' or 'n', the leading k-by-n part of the array b must contain the matrix B, otherwise the leading m-by-n part of the array b must contain the matrix B.
ldb INTEGER. Specifies the leading dimension of b as declared in the calling (sub)program.
beta REAL for mkl_sskymm. DOUBLE PRECISION for mkl_dskymm. COMPLEX for mkl_cskymm. DOUBLE COMPLEX for mkl_zskymm. Specifies the scalar beta.
c REAL for mkl_sskymm.
DOUBLE PRECISION for mkl_dskymm. COMPLEX for mkl_cskymm. DOUBLE COMPLEX for mkl_zskymm. Array, DIMENSION (ldc, n). On entry with transa = 'N' or 'n', the leading m-by-n part of the array c must contain the matrix C, otherwise the leading k-by-n part of the array c must contain the matrix C.
ldc INTEGER. Specifies the leading dimension of c as declared in the calling (sub)program.

Output Parameters
c Overwritten by the matrix (alpha*A*B + beta*C) or (alpha*A'*B + beta*C).

Interfaces

FORTRAN 77:
SUBROUTINE mkl_sskymm(transa, m, n, k, alpha, matdescra, val, pntr, b, ldb, beta, c, ldc)
  CHARACTER*1 transa
  CHARACTER matdescra(*)
  INTEGER m, n, k, ldb, ldc
  INTEGER pntr(*)
  REAL alpha, beta
  REAL val(*), b(ldb,*), c(ldc,*)
SUBROUTINE mkl_dskymm(transa, m, n, k, alpha, matdescra, val, pntr, b, ldb, beta, c, ldc)
  CHARACTER*1 transa
  CHARACTER matdescra(*)
  INTEGER m, n, k, ldb, ldc
  INTEGER pntr(*)
  DOUBLE PRECISION alpha, beta
  DOUBLE PRECISION val(*), b(ldb,*), c(ldc,*)
SUBROUTINE mkl_cskymm(transa, m, n, k, alpha, matdescra, val, pntr, b, ldb, beta, c, ldc)
  CHARACTER*1 transa
  CHARACTER matdescra(*)
  INTEGER m, n, k, ldb, ldc
  INTEGER pntr(*)
  COMPLEX alpha, beta
  COMPLEX val(*), b(ldb,*), c(ldc,*)
SUBROUTINE mkl_zskymm(transa, m, n, k, alpha, matdescra, val, pntr, b, ldb, beta, c, ldc)
  CHARACTER*1 transa
  CHARACTER matdescra(*)
  INTEGER m, n, k, ldb, ldc
  INTEGER pntr(*)
  DOUBLE COMPLEX alpha, beta
  DOUBLE COMPLEX val(*), b(ldb,*), c(ldc,*)

C:
void mkl_sskymm(char *transa, int *m, int *n, int *k, float *alpha, char *matdescra, float *val, int *pntr, float *b, int *ldb, float *beta, float *c, int *ldc);
void mkl_dskymm(char *transa, int *m, int *n, int *k, double *alpha, char *matdescra, double *val, int *pntr, double *b, int *ldb, double *beta, double *c, int *ldc);
void mkl_cskymm(char *transa, int *m, int *n, int *k, MKL_Complex8 *alpha, char *matdescra, MKL_Complex8 *val, int *pntr, MKL_Complex8 *b, int *ldb, MKL_Complex8 *beta, MKL_Complex8 *c, int *ldc);
void mkl_zskymm(char *transa, int *m, int *n, int *k, MKL_Complex16 *alpha, char *matdescra, MKL_Complex16 *val, int *pntr, MKL_Complex16 *b, int *ldb, MKL_Complex16 *beta, MKL_Complex16 *c, int *ldc);

mkl_?diasm
Solves a system of linear matrix equations for a sparse matrix in the diagonal format with one-based indexing.

Syntax

Fortran:
call mkl_sdiasm(transa, m, n, alpha, matdescra, val, lval, idiag, ndiag, b, ldb, c, ldc)
call mkl_ddiasm(transa, m, n, alpha, matdescra, val, lval, idiag, ndiag, b, ldb, c, ldc)
call mkl_cdiasm(transa, m, n, alpha, matdescra, val, lval, idiag, ndiag, b, ldb, c, ldc)
call mkl_zdiasm(transa, m, n, alpha, matdescra, val, lval, idiag, ndiag, b, ldb, c, ldc)

C:
mkl_sdiasm(&transa, &m, &n, &alpha, matdescra, val, &lval, idiag, &ndiag, b, &ldb, c, &ldc);
mkl_ddiasm(&transa, &m, &n, &alpha, matdescra, val, &lval, idiag, &ndiag, b, &ldb, c, &ldc);
mkl_cdiasm(&transa, &m, &n, &alpha, matdescra, val, &lval, idiag, &ndiag, b, &ldb, c, &ldc);
mkl_zdiasm(&transa, &m, &n, &alpha, matdescra, val, &lval, idiag, &ndiag, b, &ldb, c, &ldc);

Include Files
• FORTRAN 77: mkl_spblas.fi
• C: mkl_spblas.h

Description
The mkl_?diasm routine solves a system of linear equations with matrix-matrix operations for a sparse matrix in the diagonal format:
C := alpha*inv(A)*B
or
C := alpha*inv(A')*B,
where:
alpha is a scalar,
B and C are dense matrices,
A is a sparse upper or lower triangular matrix with unit or non-unit main diagonal, A' is the transpose of A.

NOTE This routine supports only one-based indexing of the input arrays.

Input Parameters
Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below.

transa CHARACTER*1. Specifies the system of linear equations.
If transa = 'N' or 'n', then C := alpha*inv(A)*B.
If transa = 'T' or 't' or 'C' or 'c', then C := alpha*inv(A')*B.
m INTEGER. Number of rows of the matrix A.
n INTEGER. Number of columns of the matrix C.
alpha REAL for mkl_sdiasm. DOUBLE PRECISION for mkl_ddiasm. COMPLEX for mkl_cdiasm. DOUBLE COMPLEX for mkl_zdiasm. Specifies the scalar alpha.
matdescra CHARACTER. Array of six elements that specifies properties of the matrix used for the operation. Only the first four array elements are used; their possible values are given in Table “Possible Values of the Parameter matdescra (descra)”. Possible combinations of element values of this parameter are given in Table “Possible Combinations of Element Values of the Parameter matdescra”.
val REAL for mkl_sdiasm. DOUBLE PRECISION for mkl_ddiasm. COMPLEX for mkl_cdiasm. DOUBLE COMPLEX for mkl_zdiasm. Two-dimensional array of size lval by ndiag, contains non-zero diagonals of the matrix A. Refer to values array description in Diagonal Storage Scheme for more details.
lval INTEGER. Leading dimension of val, lval=m. Refer to lval description in Diagonal Storage Scheme for more details.
idiag INTEGER. Array of length ndiag, contains the distances between the main diagonal and each non-zero diagonal in the matrix A.
NOTE All elements of this array must be sorted in increasing order.
Refer to distance array description in Diagonal Storage Scheme for more details.
ndiag INTEGER. Specifies the number of non-zero diagonals of the matrix A.
b REAL for mkl_sdiasm. DOUBLE PRECISION for mkl_ddiasm. COMPLEX for mkl_cdiasm. DOUBLE COMPLEX for mkl_zdiasm. Array, DIMENSION (ldb, n). On entry the leading m-by-n part of the array b must contain the matrix B.
ldb INTEGER. Specifies the leading dimension of b as declared in the calling (sub)program.
ldc INTEGER. Specifies the leading dimension of c as declared in the calling (sub)program.

Output Parameters
c REAL for mkl_sdiasm.
DOUBLE PRECISION for mkl_ddiasm. COMPLEX for mkl_cdiasm. DOUBLE COMPLEX for mkl_zdiasm. Array, DIMENSION (ldc, n). The leading m-by-n part of the array c contains the matrix C.

Interfaces

FORTRAN 77:
SUBROUTINE mkl_sdiasm(transa, m, n, alpha, matdescra, val, lval, idiag, ndiag, b, ldb, c, ldc)
  CHARACTER*1 transa
  CHARACTER matdescra(*)
  INTEGER m, n, ldb, ldc, lval, ndiag
  INTEGER idiag(*)
  REAL alpha
  REAL val(lval,*), b(ldb,*), c(ldc,*)
SUBROUTINE mkl_ddiasm(transa, m, n, alpha, matdescra, val, lval, idiag, ndiag, b, ldb, c, ldc)
  CHARACTER*1 transa
  CHARACTER matdescra(*)
  INTEGER m, n, ldb, ldc, lval, ndiag
  INTEGER idiag(*)
  DOUBLE PRECISION alpha
  DOUBLE PRECISION val(lval,*), b(ldb,*), c(ldc,*)
SUBROUTINE mkl_cdiasm(transa, m, n, alpha, matdescra, val, lval, idiag, ndiag, b, ldb, c, ldc)
  CHARACTER*1 transa
  CHARACTER matdescra(*)
  INTEGER m, n, ldb, ldc, lval, ndiag
  INTEGER idiag(*)
  COMPLEX alpha
  COMPLEX val(lval,*), b(ldb,*), c(ldc,*)
SUBROUTINE mkl_zdiasm(transa, m, n, alpha, matdescra, val, lval, idiag, ndiag, b, ldb, c, ldc)
  CHARACTER*1 transa
  CHARACTER matdescra(*)
  INTEGER m, n, ldb, ldc, lval, ndiag
  INTEGER idiag(*)
  DOUBLE COMPLEX alpha
  DOUBLE COMPLEX val(lval,*), b(ldb,*), c(ldc,*)

C:
void mkl_sdiasm(char *transa, int *m, int *n, float *alpha, char *matdescra, float *val, int *lval, int *idiag, int *ndiag, float *b, int *ldb, float *c, int *ldc);
void mkl_ddiasm(char *transa, int *m, int *n, double *alpha, char *matdescra, double *val, int *lval, int *idiag, int *ndiag, double *b, int *ldb, double *c, int *ldc);
void mkl_cdiasm(char *transa, int *m, int *n, MKL_Complex8 *alpha, char *matdescra, MKL_Complex8 *val, int *lval, int *idiag, int *ndiag, MKL_Complex8 *b, int *ldb, MKL_Complex8 *c, int *ldc);
void mkl_zdiasm(char *transa, int *m, int *n, MKL_Complex16 *alpha, char *matdescra, MKL_Complex16 *val, int *lval, int *idiag, int *ndiag, MKL_Complex16 *b, int *ldb, MKL_Complex16 *c, int *ldc);

mkl_?skysm
Solves a system of linear matrix equations for a sparse matrix stored using the skyline storage scheme with one-based indexing.

Syntax

Fortran:
call mkl_sskysm(transa, m, n, alpha, matdescra, val, pntr, b, ldb, c, ldc)
call mkl_dskysm(transa, m, n, alpha, matdescra, val, pntr, b, ldb, c, ldc)
call mkl_cskysm(transa, m, n, alpha, matdescra, val, pntr, b, ldb, c, ldc)
call mkl_zskysm(transa, m, n, alpha, matdescra, val, pntr, b, ldb, c, ldc)

C:
mkl_sskysm(&transa, &m, &n, &alpha, matdescra, val, pntr, b, &ldb, c, &ldc);
mkl_dskysm(&transa, &m, &n, &alpha, matdescra, val, pntr, b, &ldb, c, &ldc);
mkl_cskysm(&transa, &m, &n, &alpha, matdescra, val, pntr, b, &ldb, c, &ldc);
mkl_zskysm(&transa, &m, &n, &alpha, matdescra, val, pntr, b, &ldb, c, &ldc);

Include Files
• FORTRAN 77: mkl_spblas.fi
• C: mkl_spblas.h

Description
The mkl_?skysm routine solves a system of linear equations with matrix-matrix operations for a sparse matrix in the skyline storage format:
C := alpha*inv(A)*B
or
C := alpha*inv(A')*B,
where:
alpha is a scalar,
B and C are dense matrices,
A is a sparse upper or lower triangular matrix with unit or non-unit main diagonal, A' is the transpose of A.

NOTE This routine supports only one-based indexing of the input arrays.

Input Parameters
Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below.

transa CHARACTER*1. Specifies the system of linear equations.
If transa = 'N' or 'n', then C := alpha*inv(A)*B.
If transa = 'T' or 't' or 'C' or 'c', then C := alpha*inv(A')*B.
m INTEGER. Number of rows of the matrix A.
n INTEGER. Number of columns of the matrix C.
alpha REAL for mkl_sskysm. DOUBLE PRECISION for mkl_dskysm. COMPLEX for mkl_cskysm. DOUBLE COMPLEX for mkl_zskysm. Specifies the scalar alpha.
matdescra CHARACTER.
Array of six elements that specifies properties of the matrix used for the operation. Only the first four array elements are used; their possible values are given in Table “Possible Values of the Parameter matdescra (descra)”. Possible combinations of element values of this parameter are given in Table “Possible Combinations of Element Values of the Parameter matdescra”.
NOTE General matrices (matdescra(1)='G') are not supported.
val REAL for mkl_sskysm. DOUBLE PRECISION for mkl_dskysm. COMPLEX for mkl_cskysm. DOUBLE COMPLEX for mkl_zskysm. Array containing the set of elements of the matrix A in the skyline profile form.
If matdescra(2)= 'L', then val contains elements from the lower triangle of the matrix A.
If matdescra(2)= 'U', then val contains elements from the upper triangle of the matrix A.
Refer to values array description in Skyline Storage Scheme for more details.
pntr INTEGER. Array of length (m+m). It contains the indices specifying the positions in val of the first non-zero element of each row (column) i of the matrix A: that position is pointers(i) - pointers(1) + 1. Refer to pointers array description in Skyline Storage Scheme for more details.
b REAL for mkl_sskysm. DOUBLE PRECISION for mkl_dskysm. COMPLEX for mkl_cskysm. DOUBLE COMPLEX for mkl_zskysm. Array, DIMENSION (ldb, n). On entry the leading m-by-n part of the array b must contain the matrix B.
ldb INTEGER. Specifies the leading dimension of b as declared in the calling (sub)program.
ldc INTEGER. Specifies the leading dimension of c as declared in the calling (sub)program.

Output Parameters
c REAL for mkl_sskysm. DOUBLE PRECISION for mkl_dskysm. COMPLEX for mkl_cskysm. DOUBLE COMPLEX for mkl_zskysm. Array, DIMENSION (ldc, n). The leading m-by-n part of the array c contains the matrix C.
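The multi-column solve described above can be sketched in plain C for the upper triangular case. This is an illustration under stated assumptions rather than the MKL implementation: sky_upper_solve_mm is a hypothetical helper for the non-transposed, non-unit-diagonal case in double precision. It assumes the Skyline Storage Scheme layout for an upper triangle (each column stored from its first non-zero element down to the diagonal, with m+1 one-based pointers) and column-major b and c with leading dimensions ldb and ldc.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference helper (not an MKL routine):
 * C := alpha*inv(A)*B for an m-by-m upper triangular matrix A in
 * skyline storage, non-unit diagonal, with n right-hand-side columns.
 * Assumed layout: pntr has m+1 one-based entries; column j occupies
 * val[pntr[j]-pntr[0]] .. val[pntr[j+1]-pntr[0]-1] and ends at the
 * diagonal element. b and c are column-major.
 * Returns 0 on success, -1 on a malformed column or a zero diagonal. */
int sky_upper_solve_mm(int m, int n, double alpha, const double *val,
                       const int *pntr, const double *b, int ldb,
                       double *c, int ldc)
{
    assert(m >= 0 && n >= 0 && ldb >= m && ldc >= m);
    for (int col = 0; col < n; ++col) {
        const double *bc = b + (size_t)col * ldb;
        double *cc = c + (size_t)col * ldc;

        for (int i = 0; i < m; ++i)
            cc[i] = alpha * bc[i];

        /* Column-oriented back substitution, last unknown first. */
        for (int j = m - 1; j >= 0; --j) {
            int start = pntr[j] - pntr[0];
            int len = pntr[j + 1] - pntr[j];
            int first_row = j - len + 1;
            if (len < 1 || first_row < 0)
                return -1;

            double diag = val[start + len - 1];
            if (diag == 0.0)
                return -1;
            cc[j] /= diag;

            /* Eliminate unknown j from the rows above it. */
            for (int r = 0; r < len - 1; ++r)
                cc[first_row + r] -= val[start + r] * cc[j];
        }
    }
    return 0;
}
```

The column-oriented sweep is chosen here because the upper triangle is stored by columns, so each stored column is read once and contiguously per right-hand side.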
Interfaces FORTRAN 77: SUBROUTINE mkl_sskysm(transa, m, n, alpha, matdescra, val, pntr, b, ldb, c, ldc) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m, n, ldb, ldc INTEGER pntr(*) REAL alpha REAL val(*), b(ldb,*), c(ldc,*) SUBROUTINE mkl_dskysm(transa, m, n, alpha, matdescra, val, pntr, b, ldb, c, ldc) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m, n, ldb, ldc INTEGER pntr(*) DOUBLE PRECISION alpha DOUBLE PRECISION val(*), b(ldb,*), c(ldc,*) SUBROUTINE mkl_cskysm(transa, m, n, alpha, matdescra, val, pntr, b, ldb, c, ldc) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m, n, ldb, ldc INTEGER pntr(*) COMPLEX alpha COMPLEX val(*), b(ldb,*), c(ldc,*) BLAS and Sparse BLAS Routines 2 297 SUBROUTINE mkl_zskysm(transa, m, n, alpha, matdescra, val, pntr, b, ldb, c, ldc) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m, n, ldb, ldc INTEGER pntr(*) DOUBLE COMPLEX alpha DOUBLE COMPLEX val(*), b(ldb,*), c(ldc,*) C: void mkl_sskysm(char *transa, int *m, int *n, float *alpha, char *matdescra, float *val, int *pntr, float *b, int *ldb, float *c, int *ldc); void mkl_dskysm(char *transa, int *m, int *n, double *alpha, char *matdescra, double *val, int *pntr, double *b, int *ldb, double *c, int *ldc); void mkl_cskysm(char *transa, int *m, int *n, MKL_Complex8 *alpha, char *matdescra, MKL_Complex8 *val, int *pntr, MKL_Complex8 *b, int *ldb, MKL_Complex8 *c, int *ldc); void mkl_zskysm(char *transa, int *m, int *n, MKL_Complex16 *alpha, char *matdescra, MKL_Complex16 *val, int *pntr, MKL_Complex16 *b, int *ldb, MKL_Complex16 *c, int *ldc); mkl_?dnscsr Convert a sparse matrix in dense representation to the CSR format and vice versa. 
Syntax Fortran: call mkl_sdnscsr(job, m, n, adns, lda, acsr, ja, ia, info) call mkl_ddnscsr(job, m, n, adns, lda, acsr, ja, ia, info) call mkl_cdnscsr(job, m, n, adns, lda, acsr, ja, ia, info) call mkl_zdnscsr(job, m, n, adns, lda, acsr, ja, ia, info) C: mkl_sdnscsr(job, &m, &n, adns, &lda, acsr, ja, ia, &info); mkl_ddnscsr(job, &m, &n, adns, &lda, acsr, ja, ia, &info); mkl_cdnscsr(job, &m, &n, adns, &lda, acsr, ja, ia, &info); mkl_zdnscsr(job, &m, &n, adns, &lda, acsr, ja, ia, &info); Include Files • FORTRAN 77: mkl_spblas.fi • C: mkl_spblas.h Description This routine converts a sparse matrix stored as a rectangular m-by-n matrix A (dense representation) to the compressed sparse row (CSR) format (3-array variation) and vice versa. Input Parameters Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below. job INTEGER Array, contains the following conversion parameters: job(1) If job(1)=0, the rectangular matrix A is converted to the CSR format; if job(1)=1, the rectangular matrix A is restored from the CSR format. job(2) If job(2)=0, zero-based indexing for the rectangular matrix A is used; if job(2)=1, one-based indexing for the rectangular matrix A is used. job(3) If job(3)=0, zero-based indexing for the matrix in CSR format is used; if job(3)=1, one-based indexing for the matrix in CSR format is used. job(4) If job(4)=0, adns is a lower triangular part of matrix A; if job(4)=1, adns is an upper triangular part of matrix A; if job(4)=2, adns is a whole matrix A. job(5) job(5)=nzmax - maximum number of the non-zero elements allowed if job(1)=0. job(6) - job indicator for conversion to CSR format. If job(6)=0, only array ia is generated for the output storage. If job(6)>0, arrays acsr, ia, ja are generated for the output storage.
m INTEGER. Number of rows of the matrix A. n INTEGER. Number of columns of the matrix A. adns (input/output) REAL for mkl_sdnscsr. DOUBLE PRECISION for mkl_ddnscsr. COMPLEX for mkl_cdnscsr. DOUBLE COMPLEX for mkl_zdnscsr. Array containing non-zero elements of the matrix A. lda (input/output) INTEGER. Specifies the leading dimension of adns as declared in the calling (sub)program; must be at least max(1, m). acsr (input/output) REAL for mkl_sdnscsr. DOUBLE PRECISION for mkl_ddnscsr. COMPLEX for mkl_cdnscsr. DOUBLE COMPLEX for mkl_zdnscsr. Array containing non-zero elements of the matrix A. Its length is equal to the number of non-zero elements in the matrix A. Refer to values array description in Sparse Matrix Storage Formats for more details. ja (input/output) INTEGER. Array containing the column indices for each non-zero element of the matrix A. Its length is equal to the length of the array acsr. Refer to columns array description in Sparse Matrix Storage Formats for more details. ia (input/output) INTEGER. Array of length m + 1, containing indices of elements in the array acsr, such that ia(I) is the index in the array acsr of the first non-zero element from the row I. The value of the last element ia(m + 1) is equal to the number of non-zeros plus one. Refer to rowIndex array description in Sparse Matrix Storage Formats for more details. Output Parameters info INTEGER. Integer info indicator only for restoring the matrix A from the CSR format. If info=0, the execution is successful. If info=i, the routine is interrupted while processing the i-th row because there is no space in the arrays adns and ja according to the value nzmax.
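To make the three CSR arrays and the role of nzmax concrete, here is a minimal plain-C sketch of a dense-to-CSR conversion. It assumes a column-major dense array (as in Fortran), one-based output indices, and a whole matrix (the case job(4)=2). The function name dense_to_csr and its return convention are illustrative assumptions modeled on the info description above; this is not the MKL implementation.

```c
#include <assert.h>

/*
 * Illustrative only (not an MKL routine): converts an m-by-n dense matrix,
 * stored column by column with leading dimension lda, to CSR with one-based
 * indices. Returns 0 on success, or the one-based number of the row being
 * processed when more than nzmax non-zero elements would be needed.
 */
int dense_to_csr(int m, int n, const double *adns, int lda, int nzmax,
                 double *acsr, int *ja, int *ia)
{
    assert(lda >= (m > 1 ? m : 1));
    int nnz = 0;
    for (int i = 0; i < m; i++) {
        ia[i] = nnz + 1;                   /* one-based start of row i in acsr */
        for (int j = 0; j < n; j++) {
            double v = adns[i + j * lda];  /* element (i, j), column-major */
            if (v != 0.0) {
                if (nnz == nzmax)
                    return i + 1;          /* no space left: report the row */
                acsr[nnz] = v;
                ja[nnz] = j + 1;           /* one-based column index */
                nnz++;
            }
        }
    }
    ia[m] = nnz + 1;                       /* number of non-zeros plus one */
    return 0;
}
```

For the 2-by-3 matrix with rows (1 0 2) and (0 3 0), the sketch produces acsr = {1, 2, 3}, ja = {1, 3, 2}, and ia = {1, 3, 4}.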
Interfaces FORTRAN 77: SUBROUTINE mkl_sdnscsr(job, m, n, adns, lda, acsr, ja, ia, info) INTEGER job(8) INTEGER m, n, lda, info INTEGER ja(*), ia(m+1) REAL adns(*), acsr(*) SUBROUTINE mkl_ddnscsr(job, m, n, adns, lda, acsr, ja, ia, info) INTEGER job(8) INTEGER m, n, lda, info INTEGER ja(*), ia(m+1) DOUBLE PRECISION adns(*), acsr(*) SUBROUTINE mkl_cdnscsr(job, m, n, adns, lda, acsr, ja, ia, info) INTEGER job(8) INTEGER m, n, lda, info INTEGER ja(*), ia(m+1) COMPLEX adns(*), acsr(*) SUBROUTINE mkl_zdnscsr(job, m, n, adns, lda, acsr, ja, ia, info) INTEGER job(8) INTEGER m, n, lda, info INTEGER ja(*), ia(m+1) DOUBLE COMPLEX adns(*), acsr(*) C: void mkl_sdnscsr(int *job, int *m, int *n, float *adns, int *lda, float *acsr, int *ja, int *ia, int *info); void mkl_ddnscsr(int *job, int *m, int *n, double *adns, int *lda, double *acsr, int *ja, int *ia, int *info); void mkl_cdnscsr(int *job, int *m, int *n, MKL_Complex8 *adns, int *lda, MKL_Complex8 *acsr, int *ja, int *ia, int *info); 2 Intel® Math Kernel Library Reference Manual 300 void mkl_zdnscsr(int *job, int *m, int *n, MKL_Complex16 *adns, int *lda, MKL_Complex16 *acsr, int *ja, int *ia, int *info); mkl_?csrcoo Converts a sparse matrix in the CSR format to the coordinate format and vice versa. 
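As an illustration of what the CSR-to-coordinate conversion named above amounts to, the following plain-C sketch expands the row pointer array into one explicit row index per non-zero element. It assumes one-based indexing in both formats; the function name csr_to_coo is hypothetical and the code is not the MKL implementation.

```c
#include <assert.h>

/*
 * Illustrative only (not an MKL routine): expands a one-based CSR matrix
 * of dimension n into coordinate arrays. Values and column indices are
 * copied unchanged; the row index of each element is recovered from ia.
 * Returns the number of non-zero elements.
 */
int csr_to_coo(int n, const double *acsr, const int *ja, const int *ia,
               double *acoo, int *rowind, int *colind)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        /* ia[i] and ia[i + 1] are one-based bounds of row i in acsr */
        for (int k = ia[i] - 1; k < ia[i + 1] - 1; k++) {
            rowind[k] = i + 1;
            colind[k] = ja[k];
            acoo[k] = acsr[k];
        }
    }
    return ia[n] - 1; /* the last ia entry is the number of non-zeros plus one */
}
```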
Syntax Fortran: call mkl_scsrcoo(job, n, acsr, ja, ia, nnz, acoo, rowind, colind, info) call mkl_dcsrcoo(job, n, acsr, ja, ia, nnz, acoo, rowind, colind, info) call mkl_ccsrcoo(job, n, acsr, ja, ia, nnz, acoo, rowind, colind, info) call mkl_zcsrcoo(job, n, acsr, ja, ia, nnz, acoo, rowind, colind, info) C: mkl_scsrcoo(job, &n, acsr, ja, ia, &nnz, acoo, rowind, colind, &info); mkl_dcsrcoo(job, &n, acsr, ja, ia, &nnz, acoo, rowind, colind, &info); mkl_ccsrcoo(job, &n, acsr, ja, ia, &nnz, acoo, rowind, colind, &info); mkl_zcsrcoo(job, &n, acsr, ja, ia, &nnz, acoo, rowind, colind, &info); Include Files • FORTRAN 77: mkl_spblas.fi • C: mkl_spblas.h Description This routine converts a sparse matrix A stored in the compressed sparse row (CSR) format (3-array variation) to coordinate format and vice versa. Input Parameters Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below. job INTEGER Array, contains the following conversion parameters: job(1) If job(1)=0, the matrix in the CSR format is converted to the coordinate format; if job(1)=1, the matrix in the coordinate format is converted to the CSR format. if job(1)=2, the matrix in the coordinate format is converted to the CSR format, and the column indices in CSR representation are sorted in the increasing order within each row. job(2) If job(2)=0, zero-based indexing for the matrix in CSR format is used; if job(2)=1, one-based indexing for the matrix in CSR format is used. job(3) BLAS and Sparse BLAS Routines 2 301 If job(3)=0, zero-based indexing for the matrix in coordinate format is used; if job(3)=1, one-based indexing for the matrix in coordinate format is used. job(5) job(5)=nzmax - maximum number of the non-zero elements allowed if job(1)=0. job(5)=nnz - sets number of the non-zero elements of the matrix A if job(1)=1. 
job(6) - job indicator. For conversion to the coordinate format: If job(6)=1, only array rowind is filled in for the output storage. If job(6)=2, arrays rowind, colind are filled in for the output storage. If job(6)=3, all arrays rowind, colind, acoo are filled in for the output storage. For conversion to the CSR format: If job(6)=0, all arrays acsr, ja, ia are filled in for the output storage. If job(6)=1, only array ia is filled in for the output storage. If job(6)=2, then it is assumed that the routine already has been called with the job(6)=1, and the user allocated the required space for storing the output arrays acsr and ja. n INTEGER. Dimension of the matrix A. acsr (input/output) REAL for mkl_scsrcoo. DOUBLE PRECISION for mkl_dcsrcoo. COMPLEX for mkl_ccsrcoo. DOUBLE COMPLEX for mkl_zcsrcoo. Array containing non-zero elements of the matrix A. Its length is equal to the number of non-zero elements in the matrix A. Refer to values array description in Sparse Matrix Storage Formats for more details. ja (input/output) INTEGER. Array containing the column indices for each nonzero element of the matrix A. Its length is equal to the length of the array acsr. Refer to columns array description in Sparse Matrix Storage Formats for more details. ia (input/output) INTEGER. Array of length n + 1, containing indices of elements in the array acsr, such that ia(I) is the index in the array acsr of the first non-zero element from the row I. The value of the last element ia(n + 1) is equal to the number of non-zeros plus one. Refer to rowIndex array description in Sparse Matrix Storage Formats for more details. acoo (input/output) REAL for mkl_scsrcoo. DOUBLE PRECISION for mkl_dcsrcoo. COMPLEX for mkl_ccsrcoo. DOUBLE COMPLEX for mkl_zcsrcoo. Array containing non-zero elements of the matrix A. Its length is equal to the number of non-zero elements in the matrix A. Refer to values array description in Sparse Matrix Storage Formats for more details. 
rowind (input/output) INTEGER. Array of length nnz, contains the row indices for each non-zero element of the matrix A. Refer to rows array description in Coordinate Format for more details. colind (input/output) INTEGER. Array of length nnz, contains the column indices for each non-zero element of the matrix A. Refer to columns array description in Coordinate Format for more details. Output Parameters nnz INTEGER. Specifies the number of non-zero elements of the matrix A. Refer to nnz description in Coordinate Format for more details. info INTEGER. Integer info indicator only for converting the matrix A from the CSR format. If info=0, the execution is successful. If info=1, the routine is interrupted because there is no space in the arrays acoo, rowind, colind according to the value nzmax. Interfaces FORTRAN 77: SUBROUTINE mkl_scsrcoo(job, n, acsr, ja, ia, nnz, acoo, rowind, colind, info) INTEGER job(8) INTEGER n, nnz, info INTEGER ja(*), ia(n+1), rowind(*), colind(*) REAL acsr(*), acoo(*) SUBROUTINE mkl_dcsrcoo(job, n, acsr, ja, ia, nnz, acoo, rowind, colind, info) INTEGER job(8) INTEGER n, nnz, info INTEGER ja(*), ia(n+1), rowind(*), colind(*) DOUBLE PRECISION acsr(*), acoo(*) SUBROUTINE mkl_ccsrcoo(job, n, acsr, ja, ia, nnz, acoo, rowind, colind, info) INTEGER job(8) INTEGER n, nnz, info INTEGER ja(*), ia(n+1), rowind(*), colind(*) COMPLEX acsr(*), acoo(*) SUBROUTINE mkl_zcsrcoo(job, n, acsr, ja, ia, nnz, acoo, rowind, colind, info) INTEGER job(8) INTEGER n, nnz, info INTEGER ja(*), ia(n+1), rowind(*), colind(*) DOUBLE COMPLEX acsr(*), acoo(*) C: void mkl_scsrcoo(int *job, int *n, float *acsr, int *ja, int *ia, int *nnz, float *acoo, int *rowind, int *colind, int *info); void mkl_dcsrcoo(int *job, int *n, double *acsr, int *ja, int *ia, int *nnz, double *acoo, int *rowind, int *colind, int *info); void mkl_ccsrcoo(int *job, int *n, MKL_Complex8 *acsr, int *ja, int *ia, int *nnz,
MKL_Complex8 *acoo, int *rowind, int *colind, int *info); void mkl_zcsrcoo(int *job, int *n, MKL_Complex16 *acsr, int *ja, int *ia, int *nnz, MKL_Complex16 *acoo, int *rowind, int *colind, int *info); mkl_?csrbsr Converts a sparse matrix in the CSR format to the BSR format and vice versa. Syntax Fortran: call mkl_scsrbsr(job, m, mblk, ldabsr, acsr, ja, ia, absr, jab, iab, info) call mkl_dcsrbsr(job, m, mblk, ldabsr, acsr, ja, ia, absr, jab, iab, info) call mkl_ccsrbsr(job, m, mblk, ldabsr, acsr, ja, ia, absr, jab, iab, info) call mkl_zcsrbsr(job, m, mblk, ldabsr, acsr, ja, ia, absr, jab, iab, info) C: mkl_scsrbsr(job, &m, &mblk, &ldabsr, acsr, ja, ia, absr, jab, iab, &info); mkl_dcsrbsr(job, &m, &mblk, &ldabsr, acsr, ja, ia, absr, jab, iab, &info); mkl_ccsrbsr(job, &m, &mblk, &ldabsr, acsr, ja, ia, absr, jab, iab, &info); mkl_zcsrbsr(job, &m, &mblk, &ldabsr, acsr, ja, ia, absr, jab, iab, &info); Include Files • FORTRAN 77: mkl_spblas.fi • C: mkl_spblas.h Description This routine converts a sparse matrix A stored in the compressed sparse row (CSR) format (3-array variation) to the block sparse row (BSR) format and vice versa. Input Parameters Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below. job INTEGER Array, contains the following conversion parameters: job(1) If job(1)=0, the matrix in the CSR format is converted to the BSR format; if job(1)=1, the matrix in the BSR format is converted to the CSR format. job(2) If job(2)=0, zero-based indexing for the matrix in CSR format is used; if job(2)=1, one-based indexing for the matrix in CSR format is used. job(3) If job(3)=0, zero-based indexing for the matrix in the BSR format is used; if job(3)=1, one-based indexing for the matrix in the BSR format is used. 
job(4) is only used for conversion to CSR format. By default, the converter saves the blocks without checking whether an element is zero or not. If job(4)=1, then the converter only saves non-zero elements in blocks. job(6) - job indicator. For conversion to the BSR format: If job(6)=0, only arrays jab, iab are generated for the output storage. If job(6)>0, all output arrays absr, jab, and iab are filled in for the output storage. If job(6)=-1, iab(1) returns the number of non-zero blocks. For conversion to the CSR format: If job(6)=0, only arrays ja, ia are generated for the output storage. m INTEGER. Actual row dimension of the matrix A for conversion to the BSR format; block row dimension of the matrix A for conversion to the CSR format. mblk INTEGER. Size of the block in the matrix A. ldabsr INTEGER. Leading dimension of the array absr as declared in the calling program. ldabsr must be greater than or equal to mblk*mblk. acsr (input/output) REAL for mkl_scsrbsr. DOUBLE PRECISION for mkl_dcsrbsr. COMPLEX for mkl_ccsrbsr. DOUBLE COMPLEX for mkl_zcsrbsr. Array containing non-zero elements of the matrix A. Its length is equal to the number of non-zero elements in the matrix A. Refer to values array description in Sparse Matrix Storage Formats for more details. ja (input/output) INTEGER. Array containing the column indices for each non-zero element of the matrix A. Its length is equal to the length of the array acsr. Refer to columns array description in Sparse Matrix Storage Formats for more details. ia (input/output) INTEGER. Array of length m + 1, containing indices of elements in the array acsr, such that ia(I) is the index in the array acsr of the first non-zero element from the row I. The value of the last element ia(m + 1) is equal to the number of non-zeros plus one. Refer to rowIndex array description in Sparse Matrix Storage Formats for more details. absr (input/output) REAL for mkl_scsrbsr.
DOUBLE PRECISION for mkl_dcsrbsr. COMPLEX for mkl_ccsrbsr. DOUBLE COMPLEX for mkl_zcsrbsr. Array containing elements of non-zero blocks of the matrix A. Its length is equal to the number of non-zero blocks in the matrix A multiplied by mblk*mblk. Refer to values array description in BSR Format for more details. jab (input/output) INTEGER. Array containing the column indices for each non-zero block of the matrix A. Its length is equal to the number of non-zero blocks of the matrix A. Refer to columns array description in BSR Format for more details. iab (input/output) INTEGER. Array of length (m + 1), containing indices of blocks in the array absr, such that iab(i) is the index in the array absr of the first non-zero element from the i-th row. The value of the last element iab(m + 1) is equal to the number of non-zero blocks plus one. Refer to rowIndex array description in BSR Format for more details. Output Parameters info INTEGER. Integer info indicator only for converting the matrix A from the CSR format. If info=0, the execution is successful. If info=1, it means that mblk is equal to 0. If info=2, it means that ldabsr is less than mblk*mblk and there is no space for all blocks.
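The block structure arrays jab and iab can be understood without the block values. The sketch below, in plain C, derives them from a one-based CSR pattern, roughly corresponding to the case where only jab and iab are generated. It assumes a square matrix whose dimension m is a multiple of mblk, and it sizes iab by the number of block rows plus one. The function name csr_block_structure is hypothetical, and the quadratic scan over block columns is chosen for brevity, not speed; this is not the MKL implementation.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Illustrative only (not an MKL routine): computes one-based BSR structure
 * arrays jab and iab from the CSR pattern (ja, ia) of a square m-by-m
 * matrix, where m is a multiple of the block size mblk.
 * Returns the number of non-zero blocks, or -1 if allocation fails.
 */
int csr_block_structure(int m, int mblk, const int *ja, const int *ia,
                        int *jab, int *iab)
{
    assert(mblk > 0 && m >= 0 && m % mblk == 0);
    int mb = m / mblk;                    /* number of block rows and columns */
    if (mb == 0) {
        iab[0] = 1;
        return 0;
    }
    int *mark = calloc((size_t)mb, sizeof *mark);
    if (mark == NULL)
        return -1;
    int nblk = 0;
    for (int bi = 0; bi < mb; bi++) {
        iab[bi] = nblk + 1;               /* one-based start of block row bi */
        /* mark every block column touched by a scalar row of this block row */
        for (int r = bi * mblk; r < (bi + 1) * mblk; r++)
            for (int k = ia[r] - 1; k < ia[r + 1] - 1; k++)
                mark[(ja[k] - 1) / mblk] = bi + 1;
        /* emit the marked block columns in increasing order */
        for (int bc = 0; bc < mb; bc++)
            if (mark[bc] == bi + 1)
                jab[nblk++] = bc + 1;
    }
    iab[mb] = nblk + 1;                   /* number of blocks plus one */
    free(mark);
    return nblk;
}
```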
Interfaces FORTRAN 77: SUBROUTINE mkl_scsrbsr(job, m, mblk, ldabsr, acsr, ja, ia, absr, jab, iab, info) INTEGER job(8) INTEGER m, mblk, ldabsr, info INTEGER ja(*), ia(m+1), jab(*), iab(*) REAL acsr(*), absr(ldabsr,*) SUBROUTINE mkl_dcsrbsr(job, m, mblk, ldabsr, acsr, ja, ia, absr, jab, iab, info) INTEGER job(8) INTEGER m, mblk, ldabsr, info INTEGER ja(*), ia(m+1), jab(*), iab(*) DOUBLE PRECISION acsr(*), absr(ldabsr,*) SUBROUTINE mkl_ccsrbsr(job, m, mblk, ldabsr, acsr, ja, ia, absr, jab, iab, info) INTEGER job(8) INTEGER m, mblk, ldabsr, info INTEGER ja(*), ia(m+1), jab(*), iab(*) COMPLEX acsr(*), absr(ldabsr,*) SUBROUTINE mkl_zcsrbsr(job, m, mblk, ldabsr, acsr, ja, ia, absr, jab, iab, info) INTEGER job(8) INTEGER m, mblk, ldabsr, info INTEGER ja(*), ia(m+1), jab(*), iab(*) DOUBLE COMPLEX acsr(*), absr(ldabsr,*) C: void mkl_scsrbsr(int *job, int *m, int *mblk, int *ldabsr, float *acsr, int *ja, int *ia, float *absr, int *jab, int *iab, int *info); void mkl_dcsrbsr(int *job, int *m, int *mblk, int *ldabsr, double *acsr, int *ja, int *ia, double *absr, int *jab, int *iab, int *info); 2 Intel® Math Kernel Library Reference Manual 306 void mkl_ccsrbsr(int *job, int *m, int *mblk, int *ldabsr, MKL_Complex8 *acsr, int *ja, int *ia, MKL_Complex8 *absr, int *jab, int *iab, int *info); void mkl_zcsrbsr(int *job, int *m, int *mblk, int *ldabsr, MKL_Complex16 *acsr, int *ja, int *ia, MKL_Complex16 *absr, int *jab, int *iab, int *info); mkl_?csrcsc Converts a square sparse matrix in the CSR format to the CSC format and vice versa. 
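Conceptually, the CSR-to-CSC conversion named above is a transposition of the storage order. The following plain-C sketch uses a counting pass followed by a scatter pass, a common textbook approach that is not necessarily what the library does. It assumes a square matrix and one-based indexing in both formats; the function name csr_to_csc is hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Illustrative only (not an MKL routine): converts a square m-by-m matrix
 * from one-based CSR (acsr, ja, ia) to one-based CSC (acsc, ja1, ia1).
 * ja1 receives row indices and ia1 receives column start positions.
 * Returns 0 on success, or -1 if the work array cannot be allocated.
 */
int csr_to_csc(int m, const double *acsr, const int *ja, const int *ia,
               double *acsc, int *ja1, int *ia1)
{
    assert(m >= 0);
    int nnz = ia[m] - 1;
    for (int j = 0; j <= m; j++)
        ia1[j] = 0;
    for (int k = 0; k < nnz; k++)
        ia1[ja[k]]++;                     /* count entries per column, shifted by one */
    ia1[0] = 1;
    for (int j = 1; j <= m; j++)
        ia1[j] += ia1[j - 1];             /* prefix sum gives one-based column starts */

    int *next = malloc((size_t)(m > 0 ? m : 1) * sizeof *next);
    if (next == NULL)
        return -1;
    for (int j = 0; j < m; j++)
        next[j] = ia1[j] - 1;             /* zero-based insertion point per column */
    for (int i = 0; i < m; i++) {
        for (int k = ia[i] - 1; k < ia[i + 1] - 1; k++) {
            int p = next[ja[k] - 1]++;
            acsc[p] = acsr[k];
            ja1[p] = i + 1;               /* one-based row index */
        }
    }
    free(next);
    return 0;
}
```

Because rows are visited in increasing order, the row indices within each column come out sorted.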
Syntax Fortran: call mkl_scsrcsc(job, m, acsr, ja, ia, acsc, ja1, ia1, info) call mkl_dcsrcsc(job, m, acsr, ja, ia, acsc, ja1, ia1, info) call mkl_ccsrcsc(job, m, acsr, ja, ia, acsc, ja1, ia1, info) call mkl_zcsrcsc(job, m, acsr, ja, ia, acsc, ja1, ia1, info) C: mkl_scsrcsc(job, &m, acsr, ja, ia, acsc, ja1, ia1, &info); mkl_dcsrcsc(job, &m, acsr, ja, ia, acsc, ja1, ia1, &info); mkl_ccsrcsc(job, &m, acsr, ja, ia, acsc, ja1, ia1, &info); mkl_zcsrcsc(job, &m, acsr, ja, ia, acsc, ja1, ia1, &info); Include Files • FORTRAN 77: mkl_spblas.fi • C: mkl_spblas.h Description This routine converts a square sparse matrix A stored in the compressed sparse row (CSR) format (3-array variation) to the compressed sparse column (CSC) format and vice versa. Input Parameters Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below. job INTEGER Array, contains the following conversion parameters: job(1) If job(1)=0, the matrix in the CSR format is converted to the CSC format; if job(1)=1, the matrix in the CSC format is converted to the CSR format. job(2) If job(2)=0, zero-based indexing for the matrix in CSR format is used; if job(2)=1, one-based indexing for the matrix in CSR format is used. job(3) If job(3)=0, zero-based indexing for the matrix in the CSC format is used; if job(3)=1, one-based indexing for the matrix in the CSC format is used. job(6) - job indicator. For conversion to the CSC format: If job(6)=0, only arrays ja1, ia1 are filled in for the output storage. If job(6)≠0, all output arrays acsc, ja1, and ia1 are filled in for the output storage. For conversion to the CSR format: If job(6)=0, only arrays ja, ia are filled in for the output storage. If job(6)≠0, all output arrays acsr, ja, and ia are filled in for the output storage. m INTEGER.
Dimension of the square matrix A. acsr (input/output) REAL for mkl_scsrcsc. DOUBLE PRECISION for mkl_dcsrcsc. COMPLEX for mkl_ccsrcsc. DOUBLE COMPLEX for mkl_zcsrcsc. Array containing non-zero elements of the square matrix A. Its length is equal to the number of non-zero elements in the matrix A. Refer to values array description in Sparse Matrix Storage Formats for more details. ja (input/output) INTEGER. Array containing the column indices for each nonzero element of the matrix A. Its length is equal to the length of the array acsr. Refer to columns array description in Sparse Matrix Storage Formats for more details. ia (input/output) INTEGER. Array of length m + 1, containing indices of elements in the array acsr, such that ia(I) is the index in the array acsr of the first non-zero element from the row I. The value of the last element ia(m + 1) is equal to the number of non-zeros plus one. Refer to rowIndex array description in Sparse Matrix Storage Formats for more details. acsc (input/output) REAL for mkl_scsrcsc. DOUBLE PRECISION for mkl_dcsrcsc. COMPLEX for mkl_ccsrcsc. DOUBLE COMPLEX for mkl_zcsrcsc. Array containing non-zero elements of the square matrix A. Its length is equal to the number of non-zero elements in the matrix A. Refer to values array description in Sparse Matrix Storage Formats for more details. ja1 (input/output) INTEGER. Array containing the row indices for each non-zero element of the matrix A. Its length is equal to the length of the array acsc. Refer to columns array description in Sparse Matrix Storage Formats for more details. ia1 (input/output) INTEGER. Array of length m + 1, containing indices of elements in the array acsc, such that ia1(I) is the index in the array acsc of the first non-zero element from the column I. The value of the last element ia1(m + 1) is equal to the number of non-zeros plus one. Refer to rowIndex array description in Sparse Matrix Storage Formats for more details. Output Parameters info INTEGER. 
This parameter is not used now. 2 Intel® Math Kernel Library Reference Manual 308 Interfaces FORTRAN 77: SUBROUTINE mkl_scsrcsc(job, m, acsr, ja, ia, acsc, ja1, ia1, info) INTEGER job(8) INTEGER m, info INTEGER ja(*), ia(m+1), ja1(*), ia1(m+1) REAL acsr(*), acsc(*) SUBROUTINE mkl_dcsrcsc(job, m, acsr, ja, ia, acsc, ja1, ia1, info) INTEGER job(8) INTEGER m, info INTEGER ja(*), ia(m+1), ja1(*), ia1(m+1) DOUBLE PRECISION acsr(*), acsc(*) SUBROUTINE mkl_ccsrcsc(job, m, acsr, ja, ia, acsc, ja1, ia1, info) INTEGER job(8) INTEGER m, info INTEGER ja(*), ia(m+1), ja1(*), ia1(m+1) COMPLEX acsr(*), acsc(*) SUBROUTINE mkl_zcsrcsc(job, m, acsr, ja, ia, acsc, ja1, ia1, info) INTEGER job(8) INTEGER m, info INTEGER ja(*), ia(m+1), ja1(*), ia1(m+1) DOUBLE COMPLEX acsr(*), acsc(*) C: void mkl_scsrcsc(int *job, int *m, float *acsr, int *ja, int *ia, float *acsc, int *ja1, int *ia1, int *info); void mkl_dcsrcsc(int *job, int *m, double *acsr, int *ja, int *ia, double *acsc, int *ja1, int *ia1, int *info); void mkl_ccsrcsc(int *job, int *m, MKL_Complex8 *acsr, int *ja, int *ia, MKL_Complex8 *acsc, int *ja1, int *ia1, int *info); void mkl_zcsrcsc(int *job, int *m, MKL_Complex16 *acsr, int *ja, int *ia, MKL_Complex16 *acsc, int *ja1, int *ia1, int *info); mkl_?csrdia Converts a sparse matrix in the CSR format to the diagonal format and vice versa. 
Syntax Fortran: call mkl_scsrdia(job, m, acsr, ja, ia, adia, ndiag, distance, idiag, acsr_rem, ja_rem, ia_rem, info) call mkl_dcsrdia(job, m, acsr, ja, ia, adia, ndiag, distance, idiag, acsr_rem, ja_rem, ia_rem, info) call mkl_ccsrdia(job, m, acsr, ja, ia, adia, ndiag, distance, idiag, acsr_rem, ja_rem, ia_rem, info) call mkl_zcsrdia(job, m, acsr, ja, ia, adia, ndiag, distance, idiag, acsr_rem, ja_rem, ia_rem, info) C: mkl_scsrdia(job, &m, acsr, ja, ia, adia, &ndiag, distance, &idiag, acsr_rem, ja_rem, ia_rem, &info); mkl_dcsrdia(job, &m, acsr, ja, ia, adia, &ndiag, distance, &idiag, acsr_rem, ja_rem, ia_rem, &info); mkl_ccsrdia(job, &m, acsr, ja, ia, adia, &ndiag, distance, &idiag, acsr_rem, ja_rem, ia_rem, &info); mkl_zcsrdia(job, &m, acsr, ja, ia, adia, &ndiag, distance, &idiag, acsr_rem, ja_rem, ia_rem, &info); Include Files • FORTRAN 77: mkl_spblas.fi • C: mkl_spblas.h Description This routine converts a sparse matrix A stored in the compressed sparse row (CSR) format (3-array variation) to the diagonal format and vice versa. Input Parameters Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below. job INTEGER Array, contains the following conversion parameters: job(1) If job(1)=0, the matrix in the CSR format is converted to the diagonal format; if job(1)=1, the matrix in the diagonal format is converted to the CSR format. job(2) If job(2)=0, zero-based indexing for the matrix in CSR format is used; if job(2)=1, one-based indexing for the matrix in CSR format is used. job(3) If job(3)=0, zero-based indexing for the matrix in the diagonal format is used; if job(3)=1, one-based indexing for the matrix in the diagonal format is used. job(6) - job indicator.
For conversion to the diagonal format: If job(6)=0, diagonals are not selected internally, and acsr_rem, ja_rem, ia_rem are not filled in for the output storage. If job(6)=1, diagonals are not selected internally, and acsr_rem, ja_rem, ia_rem are filled in for the output storage. If job(6)=10, diagonals are selected internally, and acsr_rem, ja_rem, ia_rem are not filled in for the output storage. If job(6)=11, diagonals are selected internally, and acsr_rem, ja_rem, ia_rem are filled in for the output storage. For conversion to the CSR format: If job(6)=0, each entry in the array adia is checked to see whether it is zero. Zero entries are not included in the array acsr. If job(6)≠0, entries in the array adia are not checked for zero. m INTEGER. Dimension of the matrix A. acsr (input/output) REAL for mkl_scsrdia. DOUBLE PRECISION for mkl_dcsrdia. COMPLEX for mkl_ccsrdia. DOUBLE COMPLEX for mkl_zcsrdia. Array containing non-zero elements of the matrix A. Its length is equal to the number of non-zero elements in the matrix A. Refer to values array description in Sparse Matrix Storage Formats for more details. ja (input/output) INTEGER. Array containing the column indices for each non-zero element of the matrix A. Its length is equal to the length of the array acsr. Refer to columns array description in Sparse Matrix Storage Formats for more details. ia (input/output) INTEGER. Array of length m + 1, containing indices of elements in the array acsr, such that ia(I) is the index in the array acsr of the first non-zero element from the row I. The value of the last element ia(m + 1) is equal to the number of non-zeros plus one. Refer to rowIndex array description in Sparse Matrix Storage Formats for more details. adia (input/output) REAL for mkl_scsrdia. DOUBLE PRECISION for mkl_dcsrdia. COMPLEX for mkl_ccsrdia. DOUBLE COMPLEX for mkl_zcsrdia. Array of size (ndiag x idiag) containing diagonals of the matrix A.
The key point of the storage is that each element in the array adia retains the row number of the original matrix. To achieve this, diagonals in the lower triangular part of the matrix are padded from the top, and those in the upper triangular part are padded from the bottom. ndiag INTEGER. Specifies the leading dimension of the array adia as declared in the calling (sub)program; must be at least max(1, m). distance INTEGER. Array of length idiag, containing the distances between the main diagonal and each non-zero diagonal to be extracted. The distance is positive if the diagonal is above the main diagonal, and negative if the diagonal is below the main diagonal. The main diagonal has a distance equal to zero. idiag INTEGER. Number of diagonals to be extracted. For conversion to the diagonal format, this parameter may be modified on return. acsr_rem, ja_rem, ia_rem Remainder of the matrix in the CSR format if it is needed for conversion to the diagonal format. Output Parameters info INTEGER. This parameter is not used now.
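The padding rule for adia is easiest to see in code. The plain-C sketch below copies the elements of a one-based CSR matrix that lie on a caller-supplied list of diagonals into a column-major array with leading dimension ndiag, so that element A(i, i+d) lands in row i of the column assigned to distance d. It corresponds loosely to the case where diagonals are not selected internally and no remainder is produced: elements on other diagonals are simply skipped. The function name csr_to_dia is hypothetical and the code is not the MKL implementation.

```c
#include <assert.h>

/*
 * Illustrative only (not an MKL routine): extracts the diagonals listed in
 * distance[0..idiag-1] from a one-based CSR matrix of dimension m into
 * adia, a column-major ndiag-by-idiag array. Each stored element keeps its
 * original row number, so unused positions stay zero (the padding).
 */
void csr_to_dia(int m, const double *acsr, const int *ja, const int *ia,
                double *adia, int ndiag, const int *distance, int idiag)
{
    assert(ndiag >= (m > 1 ? m : 1));
    for (int q = 0; q < idiag; q++)
        for (int i = 0; i < ndiag; i++)
            adia[i + q * ndiag] = 0.0;          /* start from all padding */
    for (int i = 0; i < m; i++) {
        for (int k = ia[i] - 1; k < ia[i + 1] - 1; k++) {
            int d = (ja[k] - 1) - i;            /* distance from the main diagonal */
            for (int q = 0; q < idiag; q++)
                if (distance[q] == d)
                    adia[i + q * ndiag] = acsr[k];
        }
    }
}
```

For the matrix with rows (1 2 0), (0 3 4), (5 0 6) and distance = {-2, 0, 1}, the three columns of adia are (0 0 5), (1 3 6), and (2 4 0): the lower diagonal is padded from the top and the upper diagonal from the bottom.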
Interfaces FORTRAN 77: SUBROUTINE mkl_scsrdia(job, m, acsr, ja, ia, adia, ndiag, distance, idiag, acsr_rem, ja_rem, ia_rem, info) INTEGER job(8) INTEGER m, info, ndiag, idiag INTEGER ja(*), ia(m+1), distance(*), ja_rem(*), ia_rem(*) REAL acsr(*), adia(*), acsr_rem(*) SUBROUTINE mkl_dcsrdia(job, m, acsr, ja, ia, adia, ndiag, distance, idiag, acsr_rem, ja_rem, ia_rem, info) INTEGER job(8) INTEGER m, info, ndiag, idiag INTEGER ja(*), ia(m+1), distance(*), ja_rem(*), ia_rem(*) DOUBLE PRECISION acsr(*), adia(*), acsr_rem(*) SUBROUTINE mkl_ccsrdia(job, m, acsr, ja, ia, adia, ndiag, distance, idiag, acsr_rem, ja_rem, ia_rem, info) INTEGER job(8) INTEGER m, info, ndiag, idiag INTEGER ja(*), ia(m+1), distance(*), ja_rem(*), ia_rem(*) COMPLEX acsr(*), adia(*), acsr_rem(*) SUBROUTINE mkl_zcsrdia(job, m, acsr, ja, ia, adia, ndiag, distance, idiag, acsr_rem, ja_rem, ia_rem, info) INTEGER job(8) INTEGER m, info, ndiag, idiag INTEGER ja(*), ia(m+1), distance(*), ja_rem(*), ia_rem(*) DOUBLE COMPLEX acsr(*), adia(*), acsr_rem(*) C: void mkl_scsrdia(int *job, int *m, float *acsr, int *ja, int *ia, float *adia, int *ndiag, int *distance, int *idiag, float *acsr_rem, int *ja_rem, int *ia_rem, int *info); void mkl_dcsrdia(int *job, int *m, double *acsr, int *ja, int *ia, double *adia, int *ndiag, int *distance, int *idiag, double *acsr_rem, int *ja_rem, int *ia_rem, int *info); void mkl_ccsrdia(int *job, int *m, MKL_Complex8 *acsr, int *ja, int *ia, MKL_Complex8 *adia, int *ndiag, int *distance, int *idiag, MKL_Complex8 *acsr_rem, int *ja_rem, int *ia_rem, int *info); void mkl_zcsrdia(int *job, int *m, MKL_Complex16 *acsr, int *ja, int *ia, MKL_Complex16 *adia, int *ndiag, int *distance, int *idiag, MKL_Complex16 *acsr_rem, int *ja_rem, int *ia_rem, int *info); mkl_?csrsky Converts a sparse matrix in CSR format to the skyline format and vice versa.
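To show what a CSR-to-skyline conversion produces, the following plain-C sketch builds the skyline arrays for the lower triangle of a one-based CSR matrix. Each row is stored from its first non-zero column through the diagonal, with interior zeros filled in explicitly. The function name csr_to_sky_lower, the nzmax argument, and the return convention are assumptions made for illustration; this is not the MKL implementation.

```c
#include <assert.h>

/*
 * Illustrative only (not an MKL routine): converts the lower triangle of a
 * one-based CSR matrix of dimension m to skyline storage. asky receives the
 * row profiles, and pointers (one-based, length m + 1) the start of each
 * row. Returns 0 on success, or 1 if more than nzmax entries are needed.
 */
int csr_to_sky_lower(int m, const double *acsr, const int *ja, const int *ia,
                     int nzmax, double *asky, int *pointers)
{
    assert(m >= 0);
    int pos = 0;                              /* zero-based write position in asky */
    for (int i = 0; i < m; i++) {
        pointers[i] = pos + 1;
        int first = i;                        /* leftmost stored column; at least the diagonal */
        for (int k = ia[i] - 1; k < ia[i + 1] - 1; k++)
            if (ja[k] - 1 < first)
                first = ja[k] - 1;
        int len = i - first + 1;              /* profile length, including zeros */
        if (pos + len > nzmax)
            return 1;
        for (int q = 0; q < len; q++)
            asky[pos + q] = 0.0;
        for (int k = ia[i] - 1; k < ia[i + 1] - 1; k++) {
            int c = ja[k] - 1;
            if (c <= i)                       /* ignore the strict upper triangle */
                asky[pos + (c - first)] = acsr[k];
        }
        pos += len;
    }
    pointers[m] = pos + 1;                    /* number of stored elements plus one */
    return 0;
}
```

For the matrix with rows (2 7 0), (1 3 0), (4 0 5), the lower triangle yields asky = {2, 1, 3, 4, 0, 5} and pointers = {1, 2, 4, 7}; the zero in the third row is stored because it lies inside the profile.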
Syntax Fortran: call mkl_scsrsky(job, m, acsr, ja, ia, asky, pointers, info) call mkl_dcsrsky(job, m, acsr, ja, ia, asky, pointers, info) call mkl_ccsrsky(job, m, acsr, ja, ia, asky, pointers, info) call mkl_zcsrsky(job, m, acsr, ja, ia, asky, pointers, info) C: mkl_scsrsky(job, &m, acsr, ja, ia, asky, pointers, &info); mkl_dcsrsky(job, &m, acsr, ja, ia, asky, pointers, &info); mkl_ccsrsky(job, &m, acsr, ja, ia, asky, pointers, &info); mkl_zcsrsky(job, &m, acsr, ja, ia, asky, pointers, &info); Include Files • FORTRAN 77: mkl_spblas.fi • C: mkl_spblas.h Description This routine converts a sparse matrix A stored in the compressed sparse row (CSR) format (3-array variation) to the skyline format and vice versa. Input Parameters Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below. job INTEGER Array, contains the following conversion parameters: job(1) If job(1)=0, the matrix in the CSR format is converted to the skyline format; if job(1)=1, the matrix in the skyline format is converted to the CSR format. job(2) If job(2)=0, zero-based indexing for the matrix in CSR format is used; if job(2)=1, one-based indexing for the matrix in CSR format is used. BLAS and Sparse BLAS Routines 2 313 job(3) If job(3)=0, zero-based indexing for the matrix in the skyline format is used; if job(3)=1, one-based indexing for the matrix in the skyline format is used. job(4) For conversion to the skyline format: If job(4)=0, the upper part of the matrix A in the CSR format is converted. If job(4)=1, the lower part of the matrix A in the CSR format is converted. For conversion to the CSR format: If job(4)=0, the matrix is converted to the upper part of the matrix A in the CSR format. If job(4)=1, the matrix is converted to the lower part of the matrix A in the CSR format. 
job(5) job(5)=nzmax - maximum number of the non-zero elements of the matrix A if job(1)=0. job(6) - job indicator. Only for conversion to the skyline format: If job(6)=0, only the array pointers is filled in for the output storage. If job(6)=1, all output arrays asky and pointers are filled in for the output storage. m INTEGER. Dimension of the matrix A. acsr (input/output) REAL for mkl_scsrsky. DOUBLE PRECISION for mkl_dcsrsky. COMPLEX for mkl_ccsrsky. DOUBLE COMPLEX for mkl_zcsrsky. Array containing non-zero elements of the matrix A. Its length is equal to the number of non-zero elements in the matrix A. Refer to values array description in Sparse Matrix Storage Formats for more details. ja (input/output) INTEGER. Array containing the column indices for each non-zero element of the matrix A. Its length is equal to the length of the array acsr. Refer to columns array description in Sparse Matrix Storage Formats for more details. ia (input/output) INTEGER. Array of length m + 1, containing indices of elements in the array acsr, such that ia(I) is the index in the array acsr of the first non-zero element from the row I. The value of the last element ia(m + 1) is equal to the number of non-zeros plus one. Refer to rowIndex array description in Sparse Matrix Storage Formats for more details. asky (input/output) REAL for mkl_scsrsky. DOUBLE PRECISION for mkl_dcsrsky. COMPLEX for mkl_ccsrsky. DOUBLE COMPLEX for mkl_zcsrsky. Array, for a lower triangular part of A it contains the set of elements from each row starting from the first non-zero element to and including the diagonal element. For an upper triangular matrix it contains the set of elements from each column of the matrix starting with the first non-zero element down to and including the diagonal element. Encountered zero elements are included in the sets. Refer to values array description in Skyline Storage Format for more details. pointers (input/output) INTEGER.
Array with dimension (m+1), where m is the number of rows for lower triangle (columns for upper triangle). pointers(I) - pointers(1) + 1 gives the index of the element in the array asky that is the first non-zero element in row (column) I. The value of pointers(m+1) is set to nnz + pointers(1), where nnz is the number of elements in the array asky. Refer to pointers array description in Skyline Storage Format for more details. Output Parameters info INTEGER. Integer info indicator only for converting the matrix A from the CSR format. If info=0, the execution is successful. If info=1, the routine is interrupted because there is no space in the array asky according to the value nzmax. Interfaces FORTRAN 77: SUBROUTINE mkl_scsrsky(job, m, acsr, ja, ia, asky, pointers, info) INTEGER job(8) INTEGER m, info INTEGER ja(*), ia(m+1), pointers(m+1) REAL acsr(*), asky(*) SUBROUTINE mkl_dcsrsky(job, m, acsr, ja, ia, asky, pointers, info) INTEGER job(8) INTEGER m, info INTEGER ja(*), ia(m+1), pointers(m+1) DOUBLE PRECISION acsr(*), asky(*) SUBROUTINE mkl_ccsrsky(job, m, acsr, ja, ia, asky, pointers, info) INTEGER job(8) INTEGER m, info INTEGER ja(*), ia(m+1), pointers(m+1) COMPLEX acsr(*), asky(*) SUBROUTINE mkl_zcsrsky(job, m, acsr, ja, ia, asky, pointers, info) INTEGER job(8) INTEGER m, info INTEGER ja(*), ia(m+1), pointers(m+1) DOUBLE COMPLEX acsr(*), asky(*) C: void mkl_scsrsky(int *job, int *m, float *acsr, int *ja, int *ia, float *asky, int *pointers, int *info); void mkl_dcsrsky(int *job, int *m, double *acsr, int *ja, int *ia, double *asky, int *pointers, int *info); void mkl_ccsrsky(int *job, int *m, MKL_Complex8 *acsr, int *ja, int *ia, MKL_Complex8 *asky, int *pointers, int *info); void mkl_zcsrsky(int *job, int *m, MKL_Complex16 *acsr, int *ja, int *ia, MKL_Complex16 *asky, int *pointers, int *info); mkl_?csradd Computes the sum of two matrices stored in the CSR format (3-array variation) with one-based indexing.
Syntax Fortran: call mkl_scsradd(trans, request, sort, m, n, a, ja, ia, beta, b, jb, ib, c, jc, ic, nzmax, info) call mkl_dcsradd(trans, request, sort, m, n, a, ja, ia, beta, b, jb, ib, c, jc, ic, nzmax, info) call mkl_ccsradd(trans, request, sort, m, n, a, ja, ia, beta, b, jb, ib, c, jc, ic, nzmax, info) call mkl_zcsradd(trans, request, sort, m, n, a, ja, ia, beta, b, jb, ib, c, jc, ic, nzmax, info) C: mkl_scsradd(&trans, &request, &sort, &m, &n, a, ja, ia, &beta, b, jb, ib, c, jc, ic, &nzmax, &info); mkl_dcsradd(&trans, &request, &sort, &m, &n, a, ja, ia, &beta, b, jb, ib, c, jc, ic, &nzmax, &info); mkl_ccsradd(&trans, &request, &sort, &m, &n, a, ja, ia, &beta, b, jb, ib, c, jc, ic, &nzmax, &info); mkl_zcsradd(&trans, &request, &sort, &m, &n, a, ja, ia, &beta, b, jb, ib, c, jc, ic, &nzmax, &info); Include Files • FORTRAN 77: mkl_spblas.fi • C: mkl_spblas.h Description The mkl_?csradd routine performs a matrix-matrix operation defined as C := A+beta*op(B) where: A, B, C are the sparse matrices in the CSR format (3-array variation). op(B) is one of op(B) = B, or op(B) = B', or op(B) = conjg(B') beta is a scalar. The routine works correctly if and only if the column indices in sparse matrix representations of matrices A and B are arranged in the increasing order for each row. If not, use the parameter sort (see below) to reorder column indices and the corresponding elements of the input matrices. NOTE This routine supports only one-based indexing of the input arrays. Input Parameters Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below. trans CHARACTER*1. Specifies the operation. If trans = 'N' or 'n', then C := A+beta*B If trans = 'T' or 't' or 'C' or 'c', then C := A+beta*B'. request INTEGER.
If request=0, the routine performs addition, the memory for the output arrays ic, jc, c must be allocated beforehand. If request=1, the routine computes only values of the array ic of length m + 1, the memory for this array must be allocated beforehand. On exit the value ic(m+1) - 1 is the actual number of the elements in the arrays c and jc. If request=2, the routine has been called previously with the parameter request=1, the output arrays jc and c are allocated in the calling program and they are of the length ic(m+1) - 1 at least. sort INTEGER. Specifies the type of reordering. If this parameter is not set (default), the routine does not perform reordering. If sort=1, the routine arranges the column indices ja for each row in the increasing order and reorders the corresponding values of the matrix A in the array a. If sort=2, the routine arranges the column indices jb for each row in the increasing order and reorders the corresponding values of the matrix B in the array b. If sort=3, the routine performs reordering for both input matrices A and B. m INTEGER. Number of rows of the matrix A. n INTEGER. Number of columns of the matrix A. a REAL for mkl_scsradd. DOUBLE PRECISION for mkl_dcsradd. COMPLEX for mkl_ccsradd. DOUBLE COMPLEX for mkl_zcsradd. Array containing non-zero elements of the matrix A. Its length is equal to the number of non-zero elements in the matrix A. Refer to values array description in Sparse Matrix Storage Formats for more details. ja INTEGER. Array containing the column indices for each non-zero element of the matrix A. For each row the column indices must be arranged in the increasing order. The length of this array is equal to the length of the array a. Refer to columns array description in Sparse Matrix Storage Formats for more details. ia INTEGER. Array of length m + 1, containing indices of elements in the array a, such that ia(I) is the index in the array a of the first non-zero element from the row I.
The value of the last element ia(m + 1) is equal to the number of non-zero elements of the matrix A plus one. Refer to rowIndex array description in Sparse Matrix Storage Formats for more details. beta REAL for mkl_scsradd. DOUBLE PRECISION for mkl_dcsradd. COMPLEX for mkl_ccsradd. DOUBLE COMPLEX for mkl_zcsradd. Specifies the scalar beta. b REAL for mkl_scsradd. DOUBLE PRECISION for mkl_dcsradd. COMPLEX for mkl_ccsradd. DOUBLE COMPLEX for mkl_zcsradd. Array containing non-zero elements of the matrix B. Its length is equal to the number of non-zero elements in the matrix B. Refer to values array description in Sparse Matrix Storage Formats for more details. jb INTEGER. Array containing the column indices for each non-zero element of the matrix B. For each row the column indices must be arranged in the increasing order. The length of this array is equal to the length of the array b. Refer to columns array description in Sparse Matrix Storage Formats for more details. ib INTEGER. Array of length m + 1 when trans = 'N' or 'n', or n + 1 otherwise. This array contains indices of elements in the array b, such that ib(I) is the index in the array b of the first non-zero element from the row I. The value of the last element ib(m + 1) or ib(n + 1) is equal to the number of non-zero elements of the matrix B plus one. Refer to rowIndex array description in Sparse Matrix Storage Formats for more details. nzmax INTEGER. The length of the arrays c and jc. This parameter is used only if request=0. The routine stops calculation if the number of elements in the result matrix C exceeds the specified value of nzmax. Output Parameters c REAL for mkl_scsradd. DOUBLE PRECISION for mkl_dcsradd. COMPLEX for mkl_ccsradd. DOUBLE COMPLEX for mkl_zcsradd. Array containing non-zero elements of the result matrix C. Its length is equal to the number of non-zero elements in the matrix C. Refer to values array description in Sparse Matrix Storage Formats for more details. jc INTEGER.
Array containing the column indices for each non-zero element of the matrix C. The length of this array is equal to the length of the array c. Refer to columns array description in Sparse Matrix Storage Formats for more details. ic INTEGER. Array of length m + 1, containing indices of elements in the array c, such that ic(I) is the index in the array c of the first non-zero element from the row I. The value of the last element ic(m + 1) is equal to the number of non-zero elements of the matrix C plus one. Refer to rowIndex array description in Sparse Matrix Storage Formats for more details. info INTEGER. If info=0, the execution is successful. If info=I>0, the routine stops calculation in the I-th row of the matrix C because number of elements in C exceeds nzmax. If info=-1, the routine calculates only the size of the arrays c and jc and returns this value plus 1 as the last element of the array ic. Interfaces FORTRAN 77: SUBROUTINE mkl_scsradd( trans, request, sort, m, n, a, ja, ia, beta, b, jb, ib, c, jc, ic, nzmax, info) CHARACTER trans INTEGER request, sort, m, n, nzmax, info INTEGER ja(*), jb(*), jc(*), ia(*), ib(*), ic(*) REAL a(*), b(*), c(*), beta SUBROUTINE mkl_dcsradd( trans, request, sort, m, n, a, ja, ia, beta, b, jb, ib, c, jc, ic, nzmax, info) CHARACTER trans INTEGER request, sort, m, n, nzmax, info INTEGER ja(*), jb(*), jc(*), ia(*), ib(*), ic(*) DOUBLE PRECISION a(*), b(*), c(*), beta SUBROUTINE mkl_ccsradd( trans, request, sort, m, n, a, ja, ia, beta, b, jb, ib, c, jc, ic, nzmax, info) CHARACTER trans INTEGER request, sort, m, n, nzmax, info INTEGER ja(*), jb(*), jc(*), ia(*), ib(*), ic(*) COMPLEX a(*), b(*), c(*), beta SUBROUTINE mkl_zcsradd( trans, request, sort, m, n, a, ja, ia, beta, b, jb, ib, c, jc, ic, nzmax, info) CHARACTER trans INTEGER request, sort, m, n, nzmax, info INTEGER ja(*), jb(*), jc(*), ia(*), ib(*), ic(*) DOUBLE COMPLEX a(*), b(*), c(*), beta C: void mkl_scsradd(char *trans, int
*request, int *sort, int *m, int *n, float *a, int *ja, int *ia, float *beta, float *b, int *jb, int *ib, float *c, int *jc, int *ic, int *nzmax, int *info); void mkl_dcsradd(char *trans, int *request, int *sort, int *m, int *n, double *a, int *ja, int *ia, double *beta, double *b, int *jb, int *ib, double *c, int *jc, int *ic, int *nzmax, int *info); void mkl_ccsradd(char *trans, int *request, int *sort, int *m, int *n, MKL_Complex8 *a, int *ja, int *ia, MKL_Complex8 *beta, MKL_Complex8 *b, int *jb, int *ib, MKL_Complex8 *c, int *jc, int *ic, int *nzmax, int *info); void mkl_zcsradd(char *trans, int *request, int *sort, int *m, int *n, MKL_Complex16 *a, int *ja, int *ia, MKL_Complex16 *beta, MKL_Complex16 *b, int *jb, int *ib, MKL_Complex16 *c, int *jc, int *ic, int *nzmax, int *info); mkl_?csrmultcsr Computes product of two sparse matrices stored in the CSR format (3-array variation) with one-based indexing. Syntax Fortran: call mkl_scsrmultcsr(trans, request, sort, m, n, k, a, ja, ia, b, jb, ib, c, jc, ic, nzmax, info) call mkl_dcsrmultcsr(trans, request, sort, m, n, k, a, ja, ia, b, jb, ib, c, jc, ic, nzmax, info) call mkl_ccsrmultcsr(trans, request, sort, m, n, k, a, ja, ia, b, jb, ib, c, jc, ic, nzmax, info) call mkl_zcsrmultcsr(trans, request, sort, m, n, k, a, ja, ia, b, jb, ib, c, jc, ic, nzmax, info) C: mkl_scsrmultcsr(&trans, &request, &sort, &m, &n, &k, a, ja, ia, b, jb, ib, c, jc, ic, &nzmax, &info); mkl_dcsrmultcsr(&trans, &request, &sort, &m, &n, &k, a, ja, ia, b, jb, ib, c, jc, ic, &nzmax, &info); mkl_ccsrmultcsr(&trans, &request, &sort, &m, &n, &k, a, ja, ia, b, jb, ib, c, jc, ic, &nzmax, &info); mkl_zcsrmultcsr(&trans, &request, &sort, &m, &n, &k, a, ja, ia, b, jb, ib, c, jc, ic, &nzmax, &info); Include Files • FORTRAN 77: mkl_spblas.fi • C: mkl_spblas.h Description The mkl_?csrmultcsr routine performs a matrix-matrix operation defined as C := op(A)*B where: A, B, C are the sparse matrices in the CSR format
(3-array variation); op(A) is one of op(A) = A, or op(A) = A', or op(A) = conjg(A'). You can use the parameter sort to perform or not perform reordering of non-zero entries in input and output sparse matrices. The purpose of reordering is to rearrange non-zero entries in compressed sparse row matrix so that column indices in compressed sparse representation are sorted in the increasing order for each row. The following table shows correspondence between the value of the parameter sort and the type of reordering performed by this routine for each sparse matrix involved:

Value of the parameter sort       Reordering of A      Reordering of B      Reordering of C
                                  (arrays a, ja, ia)   (arrays b, jb, ib)   (arrays c, jc, ic)
1                                 yes                  no                   yes
2                                 no                   yes                  yes
3                                 yes                  yes                  yes
4                                 yes                  no                   no
5                                 no                   yes                  no
6                                 yes                  yes                  no
7                                 no                   no                   no
arbitrary value not equal to
1, 2,..., 7                       no                   no                   yes

NOTE This routine supports only one-based indexing of the input arrays. Input Parameters Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below. trans CHARACTER*1. Specifies the operation. If trans = 'N' or 'n', then C := A*B If trans = 'T' or 't' or 'C' or 'c', then C := A'*B. request INTEGER. If request=0, the routine performs multiplication, the memory for the output arrays ic, jc, c must be allocated beforehand. If request=1, the routine computes only values of the array ic of length m + 1, the memory for this array must be allocated beforehand. On exit the value ic(m+1) - 1 is the actual number of the elements in the arrays c and jc. If request=2, the routine has been called previously with the parameter request=1, the output arrays jc and c are allocated in the calling program and they are of the length ic(m+1) - 1 at least. sort INTEGER.
Specifies whether the routine performs reordering of non-zero entries in input and/or output sparse matrices (see table above). m INTEGER. Number of rows of the matrix A. n INTEGER. Number of columns of the matrix A. k INTEGER. Number of columns of the matrix B. a REAL for mkl_scsrmultcsr. DOUBLE PRECISION for mkl_dcsrmultcsr. COMPLEX for mkl_ccsrmultcsr. DOUBLE COMPLEX for mkl_zcsrmultcsr. Array containing non-zero elements of the matrix A. Its length is equal to the number of non-zero elements in the matrix A. Refer to values array description in Sparse Matrix Storage Formats for more details. ja INTEGER. Array containing the column indices for each non-zero element of the matrix A. For each row the column indices must be arranged in the increasing order. The length of this array is equal to the length of the array a. Refer to columns array description in Sparse Matrix Storage Formats for more details. ia INTEGER. Array of length m + 1. This array contains indices of elements in the array a, such that ia(I) is the index in the array a of the first non-zero element from the row I. The value of the last element ia(m + 1) is equal to the number of non-zero elements of the matrix A plus one. Refer to rowIndex array description in Sparse Matrix Storage Formats for more details. b REAL for mkl_scsrmultcsr. DOUBLE PRECISION for mkl_dcsrmultcsr. COMPLEX for mkl_ccsrmultcsr. DOUBLE COMPLEX for mkl_zcsrmultcsr. Array containing non-zero elements of the matrix B. Its length is equal to the number of non-zero elements in the matrix B. Refer to values array description in Sparse Matrix Storage Formats for more details. jb INTEGER. Array containing the column indices for each non-zero element of the matrix B. For each row the column indices must be arranged in the increasing order. The length of this array is equal to the length of the array b. Refer to columns array description in Sparse Matrix Storage Formats for more details. ib INTEGER.
Array of length n + 1 when trans = 'N' or 'n', or m + 1 otherwise. This array contains indices of elements in the array b, such that ib(I) is the index in the array b of the first non-zero element from the row I. The value of the last element ib(n + 1) or ib(m + 1) is equal to the number of non-zero elements of the matrix B plus one. Refer to rowIndex array description in Sparse Matrix Storage Formats for more details. nzmax INTEGER. The length of the arrays c and jc. This parameter is used only if request=0. The routine stops calculation if the number of elements in the result matrix C exceeds the specified value of nzmax. Output Parameters c REAL for mkl_scsrmultcsr. DOUBLE PRECISION for mkl_dcsrmultcsr. COMPLEX for mkl_ccsrmultcsr. DOUBLE COMPLEX for mkl_zcsrmultcsr. Array containing non-zero elements of the result matrix C. Its length is equal to the number of non-zero elements in the matrix C. Refer to values array description in Sparse Matrix Storage Formats for more details. jc INTEGER. Array containing the column indices for each non-zero element of the matrix C. The length of this array is equal to the length of the array c. Refer to columns array description in Sparse Matrix Storage Formats for more details. ic INTEGER. Array of length m + 1 when trans = 'N' or 'n', or n + 1 otherwise. This array contains indices of elements in the array c, such that ic(I) is the index in the array c of the first non-zero element from the row I. The value of the last element ic(m + 1) or ic(n + 1) is equal to the number of non-zero elements of the matrix C plus one. Refer to rowIndex array description in Sparse Matrix Storage Formats for more details. info INTEGER. If info=0, the execution is successful. If info=I>0, the routine stops calculation in the I-th row of the matrix C because number of elements in C exceeds nzmax.
If info=-1, the routine calculates only the size of the arrays c and jc and returns this value plus 1 as the last element of the array ic. Interfaces FORTRAN 77: SUBROUTINE mkl_scsrmultcsr( trans, request, sort, m, n, k, a, ja, ia, b, jb, ib, c, jc, ic, nzmax, info) CHARACTER*1 trans INTEGER request, sort, m, n, k, nzmax, info INTEGER ja(*), jb(*), jc(*), ia(*), ib(*), ic(*) REAL a(*), b(*), c(*) SUBROUTINE mkl_dcsrmultcsr( trans, request, sort, m, n, k, a, ja, ia, b, jb, ib, c, jc, ic, nzmax, info) CHARACTER*1 trans INTEGER request, sort, m, n, k, nzmax, info INTEGER ja(*), jb(*), jc(*), ia(*), ib(*), ic(*) DOUBLE PRECISION a(*), b(*), c(*) SUBROUTINE mkl_ccsrmultcsr( trans, request, sort, m, n, k, a, ja, ia, b, jb, ib, c, jc, ic, nzmax, info) CHARACTER*1 trans INTEGER request, sort, m, n, k, nzmax, info INTEGER ja(*), jb(*), jc(*), ia(*), ib(*), ic(*) COMPLEX a(*), b(*), c(*) SUBROUTINE mkl_zcsrmultcsr( trans, request, sort, m, n, k, a, ja, ia, b, jb, ib, c, jc, ic, nzmax, info) CHARACTER*1 trans INTEGER request, sort, m, n, k, nzmax, info INTEGER ja(*), jb(*), jc(*), ia(*), ib(*), ic(*) DOUBLE COMPLEX a(*), b(*), c(*) C: void mkl_scsrmultcsr(char *trans, int *request, int *sort, int *m, int *n, int *k, float *a, int *ja, int *ia, float *b, int *jb, int *ib, float *c, int *jc, int *ic, int *nzmax, int *info); void mkl_dcsrmultcsr(char *trans, int *request, int *sort, int *m, int *n, int *k, double *a, int *ja, int *ia, double *b, int *jb, int *ib, double *c, int *jc, int *ic, int *nzmax, int *info); void mkl_ccsrmultcsr(char *trans, int *request, int *sort, int *m, int *n, int *k, MKL_Complex8 *a, int *ja, int *ia, MKL_Complex8 *b, int *jb, int *ib, MKL_Complex8 *c, int *jc, int *ic, int *nzmax, int *info); void mkl_zcsrmultcsr(char *trans, int *request, int *sort, int *m, int *n, int *k, MKL_Complex16 *a, int *ja, int *ia, MKL_Complex16 *b, int *jb, int *ib, MKL_Complex16 *c, int *jc, int *ic, int *nzmax, int *info);
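The request values of mkl_?csrmultcsr suggest a two-stage calling pattern: a size-only pass (request=1) that fills ic, followed by a pass (request=2) that fills c and jc once they have been allocated with ic(m+1) - 1 elements. The following C sketch illustrates that pattern for double precision and trans = 'N' only. It is a plain reference loop and not the Intel MKL implementation: the name ref_dcsrmultcsr, the pass-by-value arguments, the dense per-row workspace, the recomputation of ic during the second pass, and the return code -2 for an allocation failure are all assumptions made for illustration. The returned value mimics the documented info values.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical reference helper (not an MKL routine): C := A*B for
   one-based CSR (3-array variation) inputs, A is m-by-n, B is n-by-k.
   request = 1: fill only ic and return -1 (size-only pass);
   request = 0: fill c, jc, ic, return the row number I > 0 if nzmax overflows;
   request = 2: fill c, jc, ic, assuming the caller sized c and jc from ic.
   Returns -2 if the workspace cannot be allocated (sketch-specific code). */
static int ref_dcsrmultcsr(int request, int m, int n, int k,
                           const double *a, const int *ja, const int *ia,
                           const double *b, const int *jb, const int *ib,
                           double *c, int *jc, int *ic, int nzmax)
{
    size_t len = (size_t)(k > 0 ? k : 1);
    double *acc = calloc(len, sizeof *acc);        /* dense accumulator, one row of C */
    unsigned char *hit = calloc(len, sizeof *hit); /* marks structurally non-zero columns */
    if (acc == NULL || hit == NULL) { free(acc); free(hit); return -2; }

    ic[0] = 1;                                     /* one-based row pointer */
    for (int i = 1; i <= m; ++i) {
        int next = ic[i - 1];                      /* one-based position of the next entry of C */
        for (int p = ia[i - 1]; p < ia[i]; ++p) {  /* entries of row i of A */
            int r = ja[p - 1];                     /* column in A = row in B */
            assert(r >= 1 && r <= n);
            for (int q = ib[r - 1]; q < ib[r]; ++q) {
                int col = jb[q - 1];
                acc[col - 1] += a[p - 1] * b[q - 1];
                hit[col - 1] = 1;
            }
        }
        for (int col = 1; col <= k; ++col) {       /* emit row i in increasing column order */
            if (!hit[col - 1]) continue;
            if (request != 1) {
                if (request == 0 && next > nzmax) { free(acc); free(hit); return i; }
                c[next - 1] = acc[col - 1];
                jc[next - 1] = col;
            }
            ++next;
            acc[col - 1] = 0.0;                    /* reset workspace for the next row */
            hit[col - 1] = 0;
        }
        ic[i] = next;
    }
    free(acc);
    free(hit);
    return request == 1 ? -1 : 0;
}
```

For example, with A = [1 2; 0 3] and B = [4 0; 5 6] in one-based CSR form, the size-only pass leaves ic = (1, 3, 5), so c and jc need ic(3) - 1 = 4 elements; the second pass then yields c = (14, 12, 15, 18) and jc = (1, 2, 1, 2).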
mkl_?csrmultd Computes product of two sparse matrices stored in the CSR format (3-array variation) with one-based indexing. The result is stored in the dense matrix. Syntax Fortran: call mkl_scsrmultd(trans, m, n, k, a, ja, ia, b, jb, ib, c, ldc) call mkl_dcsrmultd(trans, m, n, k, a, ja, ia, b, jb, ib, c, ldc) call mkl_ccsrmultd(trans, m, n, k, a, ja, ia, b, jb, ib, c, ldc) call mkl_zcsrmultd(trans, m, n, k, a, ja, ia, b, jb, ib, c, ldc) C: mkl_scsrmultd(&trans, &m, &n, &k, a, ja, ia, b, jb, ib, c, &ldc); mkl_dcsrmultd(&trans, &m, &n, &k, a, ja, ia, b, jb, ib, c, &ldc); mkl_ccsrmultd(&trans, &m, &n, &k, a, ja, ia, b, jb, ib, c, &ldc); mkl_zcsrmultd(&trans, &m, &n, &k, a, ja, ia, b, jb, ib, c, &ldc); Include Files • FORTRAN 77: mkl_spblas.fi • C: mkl_spblas.h Description The mkl_?csrmultd routine performs a matrix-matrix operation defined as C := op(A)*B where: A, B are the sparse matrices in the CSR format (3-array variation), C is a dense matrix; op(A) is one of op(A) = A, or op(A) = A', or op(A) = conjg(A'). The routine works correctly if and only if the column indices in sparse matrix representations of matrices A and B are arranged in the increasing order for each row. If they are not, reorder the column indices and the corresponding elements of the input matrices before calling the routine; the calling sequence of mkl_?csrmultd has no sort parameter. NOTE This routine supports only one-based indexing of the input arrays. Input Parameters Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below. trans CHARACTER*1. Specifies the operation. If trans = 'N' or 'n', then C := A*B If trans = 'T' or 't' or 'C' or 'c', then C := A'*B. m INTEGER. Number of rows of the matrix A. n INTEGER. Number of columns of the matrix A. k INTEGER. Number of columns of the matrix B. a REAL for mkl_scsrmultd.
DOUBLE PRECISION for mkl_dcsrmultd. COMPLEX for mkl_ccsrmultd. DOUBLE COMPLEX for mkl_zcsrmultd. Array containing non-zero elements of the matrix A. Its length is equal to the number of non-zero elements in the matrix A. Refer to values array description in Sparse Matrix Storage Formats for more details. ja INTEGER. Array containing the column indices for each non-zero element of the matrix A. For each row the column indices must be arranged in the increasing order. The length of this array is equal to the length of the array a. Refer to columns array description in Sparse Matrix Storage Formats for more details. ia INTEGER. Array of length m + 1 when trans = 'N' or 'n', or n + 1 otherwise. This array contains indices of elements in the array a, such that ia(I) is the index in the array a of the first non-zero element from the row I. The value of the last element ia(m + 1) or ia(n + 1) is equal to the number of non-zero elements of the matrix A plus one. Refer to rowIndex array description in Sparse Matrix Storage Formats for more details. b REAL for mkl_scsrmultd. DOUBLE PRECISION for mkl_dcsrmultd. COMPLEX for mkl_ccsrmultd. DOUBLE COMPLEX for mkl_zcsrmultd. Array containing non-zero elements of the matrix B. Its length is equal to the number of non-zero elements in the matrix B. Refer to values array description in Sparse Matrix Storage Formats for more details. jb INTEGER. Array containing the column indices for each non-zero element of the matrix B. For each row the column indices must be arranged in the increasing order. The length of this array is equal to the length of the array b. Refer to columns array description in Sparse Matrix Storage Formats for more details. ib INTEGER. Array of length m + 1. This array contains indices of elements in the array b, such that ib(I) is the index in the array b of the first non-zero element from the row I. The value of the last element ib(m + 1) is equal to the number of non-zero elements of the matrix B plus one. 
Refer to rowIndex array description in Sparse Matrix Storage Formats for more details. Output Parameters c REAL for mkl_scsrmultd. DOUBLE PRECISION for mkl_dcsrmultd. COMPLEX for mkl_ccsrmultd. DOUBLE COMPLEX for mkl_zcsrmultd. Array containing non-zero elements of the result matrix C. ldc INTEGER. Specifies the leading dimension of the dense matrix C as declared in the calling (sub)program. Must be at least max(m, 1) when trans = 'N' or 'n', or max(1, n) otherwise. Interfaces FORTRAN 77: SUBROUTINE mkl_scsrmultd( trans, m, n, k, a, ja, ia, b, jb, ib, c, ldc) CHARACTER*1 trans INTEGER m, n, k, ldc INTEGER ja(*), jb(*), ia(*), ib(*) REAL a(*), b(*), c(ldc, *) SUBROUTINE mkl_dcsrmultd( trans, m, n, k, a, ja, ia, b, jb, ib, c, ldc) CHARACTER*1 trans INTEGER m, n, k, ldc INTEGER ja(*), jb(*), ia(*), ib(*) DOUBLE PRECISION a(*), b(*), c(ldc, *) SUBROUTINE mkl_ccsrmultd( trans, m, n, k, a, ja, ia, b, jb, ib, c, ldc) CHARACTER*1 trans INTEGER m, n, k, ldc INTEGER ja(*), jb(*), ia(*), ib(*) COMPLEX a(*), b(*), c(ldc, *) SUBROUTINE mkl_zcsrmultd( trans, m, n, k, a, ja, ia, b, jb, ib, c, ldc) CHARACTER*1 trans INTEGER m, n, k, ldc INTEGER ja(*), jb(*), ia(*), ib(*) DOUBLE COMPLEX a(*), b(*), c(ldc, *) C: void mkl_scsrmultd(char *trans, int *m, int *n, int *k, float *a, int *ja, int *ia, float *b, int *jb, int *ib, float *c, int *ldc); void mkl_dcsrmultd(char *trans, int *m, int *n, int *k, double *a, int *ja, int *ia, double *b, int *jb, int *ib, double *c, int *ldc); void mkl_ccsrmultd(char *trans, int *m, int *n, int *k, MKL_Complex8 *a, int *ja, int *ia, MKL_Complex8 *b, int *jb, int *ib, MKL_Complex8 *c, int *ldc); void mkl_zcsrmultd(char *trans, int *m, int *n, int *k, MKL_Complex16 *a, int *ja, int *ia, MKL_Complex16 *b, int *jb, int *ib, MKL_Complex16 *c, int *ldc); BLAS-like Extensions Intel MKL provides C and Fortran routines to extend the functionality of the BLAS routines.
These include routines to compute vector products, matrix-vector products, and matrix-matrix products. Intel MKL also provides routines to perform certain data manipulation, including matrix in-place and out-of-place transposition operations combined with simple matrix arithmetic operations. Transposition operations are Copy As Is, Conjugate transpose, Transpose, and Conjugate. Each routine adds the possibility of scaling during the transposition operation by giving some alpha and/or beta parameters. Each routine supports both row-major orderings and column-major orderings. Table “BLAS-like Extensions” lists these routines. The ? symbol in the routine short names is a precision prefix that indicates the data type:
s REAL for Fortran interface, or float for C interface.
d DOUBLE PRECISION for Fortran interface, or double for C interface.
c COMPLEX for Fortran interface, or MKL_Complex8 for C interface.
z DOUBLE COMPLEX for Fortran interface, or MKL_Complex16 for C interface.

BLAS-like Extensions

Routine          Data Types   Description
?axpby           s, d, c, z   Scales two vectors, adds them to one another and stores result in the vector (routines)
?gem2vu          s, d         Two matrix-vector products using a general matrix, real data
?gem2vc          c, z         Two matrix-vector products using a general matrix, complex data
?gemm3m          c, z         Computes a scalar-matrix-matrix product using matrix multiplications and adds the result to a scalar-matrix product.
mkl_?imatcopy    s, d, c, z   Performs scaling and in-place transposition/copying of matrices.
mkl_?omatcopy    s, d, c, z   Performs scaling and out-of-place transposition/copying of matrices.
mkl_?omatcopy2   s, d, c, z   Performs two-strided scaling and out-of-place transposition/copying of matrices.
mkl_?omatadd     s, d, c, z   Performs scaling and sum of two matrices including their out-of-place transposition/copying.

?axpby Scales two vectors, adds them to one another and stores result in the vector.
Syntax Fortran 77: call saxpby(n, a, x, incx, b, y, incy) call daxpby(n, a, x, incx, b, y, incy) call caxpby(n, a, x, incx, b, y, incy) call zaxpby(n, a, x, incx, b, y, incy) Fortran 95: call axpby(x, y [,a] [,b]) Include Files • FORTRAN 77: mkl_blas.fi • Fortran 95: blas.f90 • C: mkl_blas.h Description The ?axpby routines perform a vector-vector operation defined as y := a*x + b*y where: a and b are scalars x and y are vectors each with n elements. Input Parameters n INTEGER. Specifies the number of elements in vectors x and y. a REAL for saxpby DOUBLE PRECISION for daxpby COMPLEX for caxpby DOUBLE COMPLEX for zaxpby Specifies the scalar a. x REAL for saxpby DOUBLE PRECISION for daxpby COMPLEX for caxpby DOUBLE COMPLEX for zaxpby Array, DIMENSION at least (1 + (n-1)*abs(incx)). incx INTEGER. Specifies the increment for the elements of x. b REAL for saxpby DOUBLE PRECISION for daxpby COMPLEX for caxpby DOUBLE COMPLEX for zaxpby Specifies the scalar b. y REAL for saxpby DOUBLE PRECISION for daxpby COMPLEX for caxpby DOUBLE COMPLEX for zaxpby Array, DIMENSION at least (1 + (n-1)*abs(incy)). incy INTEGER. Specifies the increment for the elements of y. Output Parameters y Contains the updated vector y. Fortran 95 Interface Notes Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions. Specific details for the routine axpby interface are the following: x Holds the array of size n. y Holds the array of size n. a The default value is 1. b The default value is 1.
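The operation y := a*x + b*y with the increments incx and incy can be illustrated with a short reference loop in C. This is only a sketch of the arithmetic for the double-precision case, not the Intel MKL implementation; the name ref_daxpby and the pass-by-value arguments are assumptions made for illustration, and the handling of negative increments follows the usual BLAS convention of starting from the far end of the array, which this section does not spell out.

```c
#include <assert.h>

/* Hypothetical reference helper (not an MKL routine):
   y := a*x + b*y for n elements with strides incx and incy. */
static void ref_daxpby(int n, double a, const double *x, int incx,
                       double b, double *y, int incy)
{
    assert(incx != 0 && incy != 0);
    if (n <= 0) return;
    /* Assumed BLAS convention: a negative increment walks the array backwards. */
    int ix = (incx > 0) ? 0 : (1 - n) * incx;
    int iy = (incy > 0) ? 0 : (1 - n) * incy;
    for (int i = 0; i < n; ++i) {
        y[iy] = a * x[ix] + b * y[iy];
        ix += incx;
        iy += incy;
    }
}
```

For instance, with x = (1, 2, 3), y = (10, 20, 30), a = 2, b = 0.5 and unit increments, the call leaves y = (7, 14, 21).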
?gem2vu Computes two matrix-vector products using a general matrix (real data) Syntax Fortran 77: call sgem2vu(m, n, alpha, a, lda, x1, incx1, x2, incx2, beta, y1, incy1, y2, incy2) call dgem2vu(m, n, alpha, a, lda, x1, incx1, x2, incx2, beta, y1, incy1, y2, incy2) Fortran 95: call gem2vu(a, x1, x2, y1, y2 [,alpha][,beta] ) Include Files • FORTRAN 77: mkl_blas.fi • Fortran 95: blas.f90 • C: mkl_blas.h Description The ?gem2vu routines perform two matrix-vector operations defined as y1 := alpha*A*x1 + beta*y1, and y2 := alpha*A'*x2 + beta*y2, where: alpha and beta are scalars, x1, x2, y1, and y2 are vectors, A is an m-by-n matrix. Input Parameters m INTEGER. Specifies the number of rows of the matrix A. The value of m must be at least zero. n INTEGER. Specifies the number of columns of the matrix A. The value of n must be at least zero. alpha REAL for sgem2vu DOUBLE PRECISION for dgem2vu Specifies the scalar alpha. a REAL for sgem2vu DOUBLE PRECISION for dgem2vu Array, DIMENSION (lda, n). Before entry, the leading m-by-n part of the array a must contain the matrix of coefficients. lda INTEGER. Specifies the leading dimension of a as declared in the calling (sub)program. The value of lda must be at least max(1, m). x1 REAL for sgem2vu DOUBLE PRECISION for dgem2vu Array, DIMENSION at least (1+(n-1)*abs(incx1)). Before entry, the incremented array x1 must contain the vector x1. incx1 INTEGER. Specifies the increment for the elements of x1. The value of incx1 must not be zero. x2 REAL for sgem2vu DOUBLE PRECISION for dgem2vu Array, DIMENSION at least (1+(m-1)*abs(incx2)). Before entry, the incremented array x2 must contain the vector x2. incx2 INTEGER. Specifies the increment for the elements of x2. The value of incx2 must not be zero. beta REAL for sgem2vu DOUBLE PRECISION for dgem2vu Specifies the scalar beta. When beta is set to zero, then y1 and y2 need not be set on input.
y1 REAL for sgem2vu
DOUBLE PRECISION for dgem2vu
Array, DIMENSION at least (1+(m-1)*abs(incy1)). Before entry with nonzero beta, the incremented array y1 must contain the vector y1.
incy1 INTEGER. Specifies the increment for the elements of y1. The value of incy1 must not be zero.
y2 REAL for sgem2vu
DOUBLE PRECISION for dgem2vu
Array, DIMENSION at least (1+(n-1)*abs(incy2)). Before entry with nonzero beta, the incremented array y2 must contain the vector y2.
incy2 INTEGER. Specifies the increment for the elements of y2. The value of incy2 must not be zero.
Output Parameters
y1 Updated vector y1.
y2 Updated vector y2.
Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine gem2vu interface are the following:
a Holds the matrix A of size (m,n).
x1 Holds the vector with the number of elements rx1 where rx1 = n.
x2 Holds the vector with the number of elements rx2 where rx2 = m.
y1 Holds the vector with the number of elements ry1 where ry1 = m.
y2 Holds the vector with the number of elements ry2 where ry2 = n.
alpha The default value is 1.
beta The default value is 0.
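The two products computed by ?gem2vu share the same matrix, so both can be accumulated in a single pass over A. The following plain-C sketch shows this for double-precision data. It is only an illustration of the defined operation: ref_dgem2vu is a hypothetical name, unit increments are assumed for all four vectors, and a is taken to be stored column by column with leading dimension lda, as in the Fortran interface.

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of y1 := alpha*A*x1 + beta*y1 and y2 := alpha*A'*x2 + beta*y2.
   ref_dgem2vu is a hypothetical name; unit increments are assumed.
   a is column-major: element A(i,j) is a[i + j*lda]. */
void ref_dgem2vu(int m, int n, double alpha, const double *a, int lda,
                 const double *x1, const double *x2, double beta,
                 double *y1, double *y2)
{
    assert(m >= 0 && n >= 0);
    assert(lda >= (m > 1 ? m : 1));

    /* Scale the outputs first. With beta == 0 the inputs y1 and y2
       are never read, matching "need not be set on input". */
    for (int i = 0; i < m; ++i)
        y1[i] = (beta == 0.0) ? 0.0 : beta * y1[i];
    for (int j = 0; j < n; ++j)
        y2[j] = (beta == 0.0) ? 0.0 : beta * y2[j];

    /* One sweep over the columns of A updates both results. */
    for (int j = 0; j < n; ++j) {
        const double *col = a + (size_t)j * (size_t)lda;
        double dot = 0.0;
        for (int i = 0; i < m; ++i) {
            y1[i] += alpha * col[i] * x1[j];   /* contribution to A*x1  */
            dot += col[i] * x2[i];             /* column j dotted with x2 */
        }
        y2[j] += alpha * dot;                  /* contribution to A'*x2 */
    }
}
```

Note the vector lengths: x1 and y2 have n elements, while x2 and y1 have m elements, as stated in the Fortran 95 notes above.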
?gem2vc
Computes two matrix-vector products using a general matrix (complex data)
Syntax
Fortran 77:
call cgem2vc(m, n, alpha, a, lda, x1, incx1, x2, incx2, beta, y1, incy1, y2, incy2)
call zgem2vc(m, n, alpha, a, lda, x1, incx1, x2, incx2, beta, y1, incy1, y2, incy2)
Fortran 95:
call gem2vc(a, x1, x2, y1, y2 [,alpha][,beta] )
Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h
Description
The ?gem2vc routines perform two matrix-vector operations defined as
y1 := alpha*A*x1 + beta*y1,
and
y2 := alpha*conjg(A')*x2 + beta*y2,
where:
alpha and beta are scalars,
x1, x2, y1, and y2 are vectors,
A is an m-by-n matrix.
Input Parameters
m INTEGER. Specifies the number of rows of the matrix A. The value of m must be at least zero.
n INTEGER. Specifies the number of columns of the matrix A. The value of n must be at least zero.
alpha COMPLEX for cgem2vc
DOUBLE COMPLEX for zgem2vc
Specifies the scalar alpha.
a COMPLEX for cgem2vc
DOUBLE COMPLEX for zgem2vc
Array, DIMENSION (lda, n). Before entry, the leading m-by-n part of the array a must contain the matrix of coefficients.
lda INTEGER. Specifies the leading dimension of a as declared in the calling (sub)program. The value of lda must be at least max(1, m).
x1 COMPLEX for cgem2vc
DOUBLE COMPLEX for zgem2vc
Array, DIMENSION at least (1+(n-1)*abs(incx1)). Before entry, the incremented array x1 must contain the vector x1.
incx1 INTEGER. Specifies the increment for the elements of x1. The value of incx1 must not be zero.
x2 COMPLEX for cgem2vc
DOUBLE COMPLEX for zgem2vc
Array, DIMENSION at least (1+(m-1)*abs(incx2)). Before entry, the incremented array x2 must contain the vector x2.
incx2 INTEGER. Specifies the increment for the elements of x2. The value of incx2 must not be zero.
beta COMPLEX for cgem2vc
DOUBLE COMPLEX for zgem2vc
Specifies the scalar beta. When beta is set to zero, then y1 and y2 need not be set on input.
y1 COMPLEX for cgem2vc
DOUBLE COMPLEX for zgem2vc
Array, DIMENSION at least (1+(m-1)*abs(incy1)). Before entry with nonzero beta, the incremented array y1 must contain the vector y1.
incy1 INTEGER. Specifies the increment for the elements of y1. The value of incy1 must not be zero.
y2 COMPLEX for cgem2vc
DOUBLE COMPLEX for zgem2vc
Array, DIMENSION at least (1+(n-1)*abs(incy2)). Before entry with nonzero beta, the incremented array y2 must contain the vector y2.
incy2 INTEGER. Specifies the increment for the elements of y2. The value of incy2 must not be zero.
Output Parameters
y1 Updated vector y1.
y2 Updated vector y2.
Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine gem2vc interface are the following:
a Holds the matrix A of size (m,n).
x1 Holds the vector with the number of elements rx1 where rx1 = n.
x2 Holds the vector with the number of elements rx2 where rx2 = m.
y1 Holds the vector with the number of elements ry1 where ry1 = m.
y2 Holds the vector with the number of elements ry2 where ry2 = n.
alpha The default value is 1.
beta The default value is 0.
?gemm3m
Computes a scalar-matrix-matrix product using matrix multiplications and adds the result to a scalar-matrix product.
Syntax
Fortran 77:
call cgemm3m(transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
call zgemm3m(transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
Fortran 95:
call gemm3m(a, b, c [,transa][,transb] [,alpha][,beta])
Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h
Description
The ?gemm3m routines perform a matrix-matrix operation with general complex matrices.
These routines are similar to the ?gemm routines, but they use fewer matrix multiplication operations (see Application Notes below).
The operation is defined as
C := alpha*op(A)*op(B) + beta*C,
where:
op(x) is one of op(x) = x, or op(x) = x', or op(x) = conjg(x'),
alpha and beta are scalars,
A, B and C are matrices:
op(A) is an m-by-k matrix,
op(B) is a k-by-n matrix,
C is an m-by-n matrix.
Input Parameters
transa CHARACTER*1. Specifies the form of op(A) used in the matrix multiplication:
if transa = 'N' or 'n', then op(A) = A;
if transa = 'T' or 't', then op(A) = A';
if transa = 'C' or 'c', then op(A) = conjg(A').
transb CHARACTER*1. Specifies the form of op(B) used in the matrix multiplication:
if transb = 'N' or 'n', then op(B) = B;
if transb = 'T' or 't', then op(B) = B';
if transb = 'C' or 'c', then op(B) = conjg(B').
m INTEGER. Specifies the number of rows of the matrix op(A) and of the matrix C. The value of m must be at least zero.
n INTEGER. Specifies the number of columns of the matrix op(B) and the number of columns of the matrix C. The value of n must be at least zero.
k INTEGER. Specifies the number of columns of the matrix op(A) and the number of rows of the matrix op(B). The value of k must be at least zero.
alpha COMPLEX for cgemm3m
DOUBLE COMPLEX for zgemm3m
Specifies the scalar alpha.
a COMPLEX for cgemm3m
DOUBLE COMPLEX for zgemm3m
Array, DIMENSION (lda, ka), where ka is k when transa = 'N' or 'n', and is m otherwise. Before entry with transa = 'N' or 'n', the leading m-by-k part of the array a must contain the matrix A, otherwise the leading k-by-m part of the array a must contain the matrix A.
lda INTEGER. Specifies the leading dimension of a as declared in the calling (sub)program. When transa = 'N' or 'n', then lda must be at least max(1, m), otherwise lda must be at least max(1, k).
b COMPLEX for cgemm3m
DOUBLE COMPLEX for zgemm3m
Array, DIMENSION (ldb, kb), where kb is n when transb = 'N' or 'n', and is k otherwise.
Before entry with transb = 'N' or 'n', the leading k-by-n part of the array b must contain the matrix B, otherwise the leading n-by-k part of the array b must contain the matrix B.
ldb INTEGER. Specifies the leading dimension of b as declared in the calling (sub)program. When transb = 'N' or 'n', then ldb must be at least max(1, k), otherwise ldb must be at least max(1, n).
beta COMPLEX for cgemm3m
DOUBLE COMPLEX for zgemm3m
Specifies the scalar beta. When beta is equal to zero, then c need not be set on input.
c COMPLEX for cgemm3m
DOUBLE COMPLEX for zgemm3m
Array, DIMENSION (ldc, n). Before entry, the leading m-by-n part of the array c must contain the matrix C, except when beta is equal to zero, in which case c need not be set on entry.
ldc INTEGER. Specifies the leading dimension of c as declared in the calling (sub)program. The value of ldc must be at least max(1, m).
Output Parameters
c Overwritten by the m-by-n matrix (alpha*op(A)*op(B) + beta*C).
Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine gemm3m interface are the following:
a Holds the matrix A of size (ma,ka) where
ka = k if transa = 'N', ka = m otherwise,
ma = m if transa = 'N', ma = k otherwise.
b Holds the matrix B of size (mb,kb) where
kb = n if transb = 'N', kb = k otherwise,
mb = k if transb = 'N', mb = n otherwise.
c Holds the matrix C of size (m,n).
transa Must be 'N', 'C', or 'T'. The default value is 'N'.
transb Must be 'N', 'C', or 'T'. The default value is 'N'.
alpha The default value is 1.
beta The default value is 1.
Application Notes
These routines perform the complex multiplication by forming the real and imaginary parts of the input matrices.
This approach uses three real matrix multiplications and five real matrix additions instead of the conventional four real matrix multiplications and two real matrix additions. Using only three real matrix multiplications gives a 25% reduction of time in matrix operations, which can result in significant savings in computing time for large matrices.
If the errors in the floating-point calculations satisfy the following conditions:
$fl(x \;op\; y) = (x \;op\; y)(1+\delta),\ |\delta| \le u,\ op = \times, /$,
$fl(x \pm y) = x(1+\alpha) \pm y(1+\beta),\ |\alpha|, |\beta| \le u$,
then for an n-by-n matrix
$\hat{C} = fl(C_1 + iC_2) = fl((A_1 + iA_2)(B_1 + iB_2)) = \hat{C}_1 + i\hat{C}_2$
the following estimates hold:
$\|\hat{C}_1 - C_1\| \le 2(n+1)\,u\,\|A\|_\infty \|B\|_\infty + O(u^2)$,
$\|\hat{C}_2 - C_2\| \le 4(n+4)\,u\,\|A\|_\infty \|B\|_\infty + O(u^2)$,
where $\|A\|_\infty = \max(\|A_1\|_\infty, \|A_2\|_\infty)$ and $\|B\|_\infty = \max(\|B_1\|_\infty, \|B_2\|_\infty)$,
and hence the matrix multiplications are stable.
mkl_?imatcopy
Performs scaling and in-place transposition/copying of matrices.
Syntax
Fortran:
call mkl_simatcopy(ordering, trans, rows, cols, alpha, a, src_lda, dst_lda)
call mkl_dimatcopy(ordering, trans, rows, cols, alpha, a, src_lda, dst_lda)
call mkl_cimatcopy(ordering, trans, rows, cols, alpha, a, src_lda, dst_lda)
call mkl_zimatcopy(ordering, trans, rows, cols, alpha, a, src_lda, dst_lda)
C:
mkl_simatcopy(ordering, trans, rows, cols, alpha, a, src_lda, dst_lda);
mkl_dimatcopy(ordering, trans, rows, cols, alpha, a, src_lda, dst_lda);
mkl_cimatcopy(ordering, trans, rows, cols, alpha, a, src_lda, dst_lda);
mkl_zimatcopy(ordering, trans, rows, cols, alpha, a, src_lda, dst_lda);
Include Files
• FORTRAN 77: mkl_trans.fi
• C: mkl_trans.h
Description
The mkl_?imatcopy routine performs scaling and in-place transposition/copying of matrices. A transposition operation can be a normal matrix copy, a transposition, a conjugate transposition, or just a conjugation. The operation is defined as follows:
A := alpha*op(A).
The routine parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below.
Note that different arrays should not overlap.
Input Parameters
ordering CHARACTER*1. Ordering of the matrix storage.
If ordering = 'R' or 'r', the ordering is row-major.
If ordering = 'C' or 'c', the ordering is column-major.
trans CHARACTER*1. Parameter that specifies the operation type.
If trans = 'N' or 'n', op(A)=A and the matrix A is assumed unchanged on input.
If trans = 'T' or 't', it is assumed that A should be transposed.
If trans = 'C' or 'c', it is assumed that A should be conjugate transposed.
If trans = 'R' or 'r', it is assumed that A should be only conjugated.
If the data is real, then trans = 'R' is the same as trans = 'N', and trans = 'C' is the same as trans = 'T'.
rows INTEGER. The number of matrix rows.
cols INTEGER. The number of matrix columns.
a REAL for mkl_simatcopy.
DOUBLE PRECISION for mkl_dimatcopy.
COMPLEX for mkl_cimatcopy.
DOUBLE COMPLEX for mkl_zimatcopy.
Array, DIMENSION a(src_lda,*).
alpha REAL for mkl_simatcopy.
DOUBLE PRECISION for mkl_dimatcopy.
COMPLEX for mkl_cimatcopy.
DOUBLE COMPLEX for mkl_zimatcopy.
This parameter scales the input matrix by alpha.
src_lda INTEGER. Distance between the first elements in adjacent columns (in the case of the column-major order) or rows (in the case of the row-major order) in the source matrix; measured in the number of elements.
This parameter must be at least max(1,rows) if ordering = 'C' or 'c', and max(1,cols) otherwise.
dst_lda INTEGER. Distance between the first elements in adjacent columns (in the case of the column-major order) or rows (in the case of the row-major order) in the destination matrix; measured in the number of elements.
To determine the minimum value of dst_lda on output, consider the following guideline (the destination must be able to hold op(A), which has cols rows and rows columns when A is transposed):
If ordering = 'C' or 'c', then
• If trans = 'T' or 't' or 'C' or 'c', this parameter must be at least max(1,cols)
• If trans = 'N' or 'n' or 'R' or 'r', this parameter must be at least max(1,rows)
If ordering = 'R' or 'r', then
• If trans = 'T' or 't' or 'C' or 'c', this parameter must be at least max(1,rows)
• If trans = 'N' or 'n' or 'R' or 'r', this parameter must be at least max(1,cols)
Output Parameters
a REAL for mkl_simatcopy.
DOUBLE PRECISION for mkl_dimatcopy.
COMPLEX for mkl_cimatcopy.
DOUBLE COMPLEX for mkl_zimatcopy.
Array, DIMENSION at least m.
Contains the matrix A.
Interfaces
FORTRAN 77:
SUBROUTINE mkl_simatcopy ( ordering, trans, rows, cols, alpha, a, src_lda, dst_lda )
CHARACTER*1 ordering, trans
INTEGER rows, cols, src_lda, dst_lda
REAL a(*), alpha
SUBROUTINE mkl_dimatcopy ( ordering, trans, rows, cols, alpha, a, src_lda, dst_lda )
CHARACTER*1 ordering, trans
INTEGER rows, cols, src_lda, dst_lda
DOUBLE PRECISION a(*), alpha
SUBROUTINE mkl_cimatcopy ( ordering, trans, rows, cols, alpha, a, src_lda, dst_lda )
CHARACTER*1 ordering, trans
INTEGER rows, cols, src_lda, dst_lda
COMPLEX a(*), alpha
SUBROUTINE mkl_zimatcopy ( ordering, trans, rows, cols, alpha, a, src_lda, dst_lda )
CHARACTER*1 ordering, trans
INTEGER rows, cols, src_lda, dst_lda
DOUBLE COMPLEX a(*), alpha
C:
void mkl_simatcopy(char ordering, char trans, size_t rows, size_t cols, float *alpha, float *a, size_t src_lda, size_t dst_lda);
void mkl_dimatcopy(char ordering, char trans, size_t rows, size_t cols, double *alpha, double *a, size_t src_lda, size_t dst_lda);
void mkl_cimatcopy(char ordering, char trans, size_t rows, size_t cols, MKL_Complex8 *alpha, MKL_Complex8 *a, size_t src_lda, size_t dst_lda);
void mkl_zimatcopy(char ordering, char trans, size_t rows, size_t cols, MKL_Complex16 *alpha, MKL_Complex16 *a, size_t src_lda, size_t dst_lda);
mkl_?omatcopy Performs
scaling and out-of-place transposition/copying of matrices.
Syntax
Fortran:
call mkl_somatcopy(ordering, trans, rows, cols, alpha, src, src_ld, dst, dst_ld)
call mkl_domatcopy(ordering, trans, rows, cols, alpha, src, src_ld, dst, dst_ld)
call mkl_comatcopy(ordering, trans, rows, cols, alpha, src, src_ld, dst, dst_ld)
call mkl_zomatcopy(ordering, trans, rows, cols, alpha, src, src_ld, dst, dst_ld)
C:
mkl_somatcopy(ordering, trans, rows, cols, alpha, SRC, src_stride, DST, dst_stride);
mkl_domatcopy(ordering, trans, rows, cols, alpha, SRC, src_stride, DST, dst_stride);
mkl_comatcopy(ordering, trans, rows, cols, alpha, SRC, src_stride, DST, dst_stride);
mkl_zomatcopy(ordering, trans, rows, cols, alpha, SRC, src_stride, DST, dst_stride);
Include Files
• FORTRAN 77: mkl_trans.fi
• C: mkl_trans.h
Description
The mkl_?omatcopy routine performs scaling and out-of-place transposition/copying of matrices. A transposition operation can be a normal matrix copy, a transposition, a conjugate transposition, or just a conjugation. The operation is defined as follows:
B := alpha*op(A)
The routine parameter descriptions are common for all implemented interfaces with the exception of data types that mostly refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below.
Note that different arrays should not overlap.
Input Parameters
ordering CHARACTER*1. Ordering of the matrix storage.
If ordering = 'R' or 'r', the ordering is row-major.
If ordering = 'C' or 'c', the ordering is column-major.
trans CHARACTER*1. Parameter that specifies the operation type.
If trans = 'N' or 'n', op(A)=A and the matrix A is assumed unchanged on input.
If trans = 'T' or 't', it is assumed that A should be transposed.
If trans = 'C' or 'c', it is assumed that A should be conjugate transposed.
If trans = 'R' or 'r', it is assumed that A should be only conjugated.
If the data is real, then trans = 'R' is the same as trans = 'N', and trans = 'C' is the same as trans = 'T'.
rows INTEGER. The number of matrix rows.
cols INTEGER. The number of matrix columns.
alpha REAL for mkl_somatcopy.
DOUBLE PRECISION for mkl_domatcopy.
COMPLEX for mkl_comatcopy.
DOUBLE COMPLEX for mkl_zomatcopy.
This parameter scales the input matrix by alpha.
src REAL for mkl_somatcopy.
DOUBLE PRECISION for mkl_domatcopy.
COMPLEX for mkl_comatcopy.
DOUBLE COMPLEX for mkl_zomatcopy.
Array, DIMENSION src(src_ld,*).
src_ld INTEGER. (Fortran interface). Distance between the first elements in adjacent columns (in the case of the column-major order) or rows (in the case of the row-major order) in the source matrix; measured in the number of elements.
This parameter must be at least max(1,rows) if ordering = 'C' or 'c', and max(1,cols) otherwise.
src_stride INTEGER. (C interface). Distance between the first elements in adjacent columns (in the case of the column-major order) or rows (in the case of the row-major order) in the source matrix; measured in the number of elements.
This parameter must be at least max(1,rows) if ordering = 'C' or 'c', and max(1,cols) otherwise.
dst REAL for mkl_somatcopy.
DOUBLE PRECISION for mkl_domatcopy.
COMPLEX for mkl_comatcopy.
DOUBLE COMPLEX for mkl_zomatcopy.
Array, DIMENSION dst(dst_ld,*).
dst_ld INTEGER. (Fortran interface). Distance between the first elements in adjacent columns (in the case of the column-major order) or rows (in the case of the row-major order) in the destination matrix; measured in the number of elements.
To determine the minimum value of dst_ld on output, consider the following guideline (the destination must be able to hold op(A), which has cols rows and rows columns when A is transposed):
If ordering = 'C' or 'c', then
• If trans = 'T' or 't' or 'C' or 'c', this parameter must be at least max(1,cols)
• If trans = 'N' or 'n' or 'R' or 'r', this parameter must be at least max(1,rows)
If ordering = 'R' or 'r', then
• If trans = 'T' or 't' or 'C' or 'c', this parameter must be at least max(1,rows)
• If trans = 'N' or 'n' or 'R' or 'r', this parameter must be at least max(1,cols)
dst_stride INTEGER. (C interface). Distance between the first elements in adjacent columns (in the case of the column-major order) or rows (in the case of the row-major order) in the destination matrix; measured in the number of elements.
To determine the minimum value of dst_stride on output, consider the following guideline:
If ordering = 'C' or 'c', then
• If trans = 'T' or 't' or 'C' or 'c', this parameter must be at least max(1,cols)
• If trans = 'N' or 'n' or 'R' or 'r', this parameter must be at least max(1,rows)
If ordering = 'R' or 'r', then
• If trans = 'T' or 't' or 'C' or 'c', this parameter must be at least max(1,rows)
• If trans = 'N' or 'n' or 'R' or 'r', this parameter must be at least max(1,cols)
Output Parameters
dst REAL for mkl_somatcopy.
DOUBLE PRECISION for mkl_domatcopy.
COMPLEX for mkl_comatcopy.
DOUBLE COMPLEX for mkl_zomatcopy.
Array, DIMENSION at least m.
Contains the destination matrix.
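To make the roles of ordering, trans, and the two leading dimensions concrete, here is a minimal plain-C sketch of B := alpha*op(A) for real double-precision data. It is not the library implementation: ref_domatcopy is a hypothetical name, alpha is passed by value for simplicity, and the assertions check each leading dimension against the shape the corresponding array has to hold, which is an assumption derived from the storage layout rather than a restatement of the library's own argument checks.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Sketch of B := alpha*op(A) for real data (hypothetical helper).
   A is rows-by-cols; B is cols-by-rows when op is a transposition. */
void ref_domatcopy(char ordering, char trans, size_t rows, size_t cols,
                   double alpha, const double *src, size_t src_ld,
                   double *dst, size_t dst_ld)
{
    char o = (char)toupper((unsigned char)ordering);
    char t = (char)toupper((unsigned char)trans);
    int row_major = (o == 'R');
    /* For real data 'C' acts like 'T' and 'R' acts like 'N'. */
    int transpose = (t == 'T' || t == 'C');

    assert(o == 'R' || o == 'C');
    assert(t == 'N' || t == 'R' || t == 'T' || t == 'C');

    size_t dst_rows = transpose ? cols : rows;
    size_t dst_cols = transpose ? rows : cols;
    /* A leading dimension must span one column (column-major)
       or one row (row-major) of the matrix it describes. */
    assert(src_ld >= (row_major ? cols : rows));
    assert(dst_ld >= (row_major ? dst_cols : dst_rows));

    for (size_t i = 0; i < rows; ++i) {
        for (size_t j = 0; j < cols; ++j) {
            size_t s = row_major ? i * src_ld + j : i + j * src_ld;
            size_t di = transpose ? j : i;   /* destination row    */
            size_t dj = transpose ? i : j;   /* destination column */
            size_t d = row_major ? di * dst_ld + dj : di + dj * dst_ld;
            dst[d] = alpha * src[s];
        }
    }
}
```

For example, transposing a 2-by-3 row-major matrix with src_ld = 3 produces a 3-by-2 row-major result, for which dst_ld = 2 is sufficient.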
Interfaces
FORTRAN 77:
SUBROUTINE mkl_somatcopy ( ordering, trans, rows, cols, alpha, src, src_ld, dst, dst_ld )
CHARACTER*1 ordering, trans
INTEGER rows, cols, src_ld, dst_ld
REAL alpha, dst(dst_ld,*), src(src_ld,*)
SUBROUTINE mkl_domatcopy ( ordering, trans, rows, cols, alpha, src, src_ld, dst, dst_ld )
CHARACTER*1 ordering, trans
INTEGER rows, cols, src_ld, dst_ld
DOUBLE PRECISION alpha, dst(dst_ld,*), src(src_ld,*)
SUBROUTINE mkl_comatcopy ( ordering, trans, rows, cols, alpha, src, src_ld, dst, dst_ld )
CHARACTER*1 ordering, trans
INTEGER rows, cols, src_ld, dst_ld
COMPLEX alpha, dst(dst_ld,*), src(src_ld,*)
SUBROUTINE mkl_zomatcopy ( ordering, trans, rows, cols, alpha, src, src_ld, dst, dst_ld )
CHARACTER*1 ordering, trans
INTEGER rows, cols, src_ld, dst_ld
DOUBLE COMPLEX alpha, dst(dst_ld,*), src(src_ld,*)
C:
void mkl_somatcopy(char ordering, char trans, size_t rows, size_t cols, float alpha, float *SRC, size_t src_stride, float *DST, size_t dst_stride);
void mkl_domatcopy(char ordering, char trans, size_t rows, size_t cols, double alpha, double *SRC, size_t src_stride, double *DST, size_t dst_stride);
void mkl_comatcopy(char ordering, char trans, size_t rows, size_t cols, MKL_Complex8 alpha, MKL_Complex8 *SRC, size_t src_stride, MKL_Complex8 *DST, size_t dst_stride);
void mkl_zomatcopy(char ordering, char trans, size_t rows, size_t cols, MKL_Complex16 alpha, MKL_Complex16 *SRC, size_t src_stride, MKL_Complex16 *DST, size_t dst_stride);
mkl_?omatcopy2
Performs two-strided scaling and out-of-place transposition/copying of matrices.
Syntax
Fortran:
call mkl_somatcopy2(ordering, trans, rows, cols, alpha, src, src_row, src_col, dst, dst_row, dst_col)
call mkl_domatcopy2(ordering, trans, rows, cols, alpha, src, src_row, src_col, dst, dst_row, dst_col)
call mkl_comatcopy2(ordering, trans, rows, cols, alpha, src, src_row, src_col, dst, dst_row, dst_col)
call mkl_zomatcopy2(ordering, trans, rows, cols, alpha, src, src_row, src_col, dst, dst_row, dst_col)
C:
mkl_somatcopy2(ordering, trans, rows, cols, alpha, SRC, src_row, src_col, DST, dst_row, dst_col);
mkl_domatcopy2(ordering, trans, rows, cols, alpha, SRC, src_row, src_col, DST, dst_row, dst_col);
mkl_comatcopy2(ordering, trans, rows, cols, alpha, SRC, src_row, src_col, DST, dst_row, dst_col);
mkl_zomatcopy2(ordering, trans, rows, cols, alpha, SRC, src_row, src_col, DST, dst_row, dst_col);
Include Files
• FORTRAN 77: mkl_trans.fi
• C: mkl_trans.h
Description
The mkl_?omatcopy2 routine performs two-strided scaling and out-of-place transposition/copying of matrices. A transposition operation can be a normal matrix copy, a transposition, a conjugate transposition, or just a conjugation. The operation is defined as follows:
B := alpha*op(A)
Normally, matrices in the BLAS or LAPACK are specified by a single stride index. For instance, in the column-major order, A(2,1) is stored in memory one element away from A(1,1), but A(1,2) is a leading dimension away. The leading dimension in this case is the single stride. If a matrix has two strides, then both A(2,1) and A(1,2) may be an arbitrary distance from A(1,1).
The routine parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below.
Note that different arrays should not overlap.
Input Parameters
ordering CHARACTER*1. Ordering of the matrix storage.
If ordering = 'R' or 'r', the ordering is row-major.
If ordering = 'C' or 'c', the ordering is column-major.
trans CHARACTER*1. Parameter that specifies the operation type.
If trans = 'N' or 'n', op(A)=A and the matrix A is assumed unchanged on input.
If trans = 'T' or 't', it is assumed that A should be transposed.
If trans = 'C' or 'c', it is assumed that A should be conjugate transposed.
If trans = 'R' or 'r', it is assumed that A should be only conjugated.
If the data is real, then trans = 'R' is the same as trans = 'N', and trans = 'C' is the same as trans = 'T'.
rows INTEGER. The number of matrix rows.
cols INTEGER. The number of matrix columns.
alpha REAL for mkl_somatcopy2.
DOUBLE PRECISION for mkl_domatcopy2.
COMPLEX for mkl_comatcopy2.
DOUBLE COMPLEX for mkl_zomatcopy2.
This parameter scales the input matrix by alpha.
src REAL for mkl_somatcopy2.
DOUBLE PRECISION for mkl_domatcopy2.
COMPLEX for mkl_comatcopy2.
DOUBLE COMPLEX for mkl_zomatcopy2.
Array, DIMENSION src(*).
src_row INTEGER. Distance between the first elements in adjacent rows in the source matrix; measured in the number of elements.
This parameter must be at least max(1,rows).
src_col INTEGER. Distance between the first elements in adjacent columns in the source matrix; measured in the number of elements.
This parameter must be at least max(1,cols).
dst REAL for mkl_somatcopy2.
DOUBLE PRECISION for mkl_domatcopy2.
COMPLEX for mkl_comatcopy2.
DOUBLE COMPLEX for mkl_zomatcopy2.
Array, DIMENSION dst(*).
dst_row INTEGER. Distance between the first elements in adjacent rows in the destination matrix; measured in the number of elements.
To determine the minimum value of dst_row on output, consider the following guideline:
• If trans = 'T' or 't' or 'C' or 'c', this parameter must be at least max(1,cols)
• If trans = 'N' or 'n' or 'R' or 'r', this parameter must be at least max(1,rows)
dst_col INTEGER.
Distance between the first elements in adjacent columns in the destination matrix; measured in the number of elements.
To determine the minimum value of dst_col on output, consider the following guideline:
• If trans = 'T' or 't' or 'C' or 'c', this parameter must be at least max(1,rows)
• If trans = 'N' or 'n' or 'R' or 'r', this parameter must be at least max(1,cols)
Output Parameters
dst REAL for mkl_somatcopy2.
DOUBLE PRECISION for mkl_domatcopy2.
COMPLEX for mkl_comatcopy2.
DOUBLE COMPLEX for mkl_zomatcopy2.
Array, DIMENSION at least m.
Contains the destination matrix.
Interfaces
FORTRAN 77:
SUBROUTINE mkl_somatcopy2 ( ordering, trans, rows, cols, alpha, src, src_row, src_col, dst, dst_row, dst_col )
CHARACTER*1 ordering, trans
INTEGER rows, cols, src_row, src_col, dst_row, dst_col
REAL alpha, dst(*), src(*)
SUBROUTINE mkl_domatcopy2 ( ordering, trans, rows, cols, alpha, src, src_row, src_col, dst, dst_row, dst_col )
CHARACTER*1 ordering, trans
INTEGER rows, cols, src_row, src_col, dst_row, dst_col
DOUBLE PRECISION alpha, dst(*), src(*)
SUBROUTINE mkl_comatcopy2 ( ordering, trans, rows, cols, alpha, src, src_row, src_col, dst, dst_row, dst_col )
CHARACTER*1 ordering, trans
INTEGER rows, cols, src_row, src_col, dst_row, dst_col
COMPLEX alpha, dst(*), src(*)
SUBROUTINE mkl_zomatcopy2 ( ordering, trans, rows, cols, alpha, src, src_row, src_col, dst, dst_row, dst_col )
CHARACTER*1 ordering, trans
INTEGER rows, cols, src_row, src_col, dst_row, dst_col
DOUBLE COMPLEX alpha, dst(*), src(*)
C:
void mkl_somatcopy2(char ordering, char trans, size_t rows, size_t cols, float *alpha, float *SRC, size_t src_row, size_t src_col, float *DST, size_t dst_row, size_t dst_col);
void mkl_domatcopy2(char ordering, char trans, size_t rows, size_t cols, double *alpha, double *SRC, size_t src_row, size_t src_col, double *DST, size_t dst_row, size_t dst_col);
void mkl_comatcopy2(char ordering, char trans, size_t rows, size_t cols, MKL_Complex8 *alpha, MKL_Complex8 *SRC, size_t src_row,
size_t src_col, MKL_Complex8 *DST, size_t dst_row, size_t dst_col);
void mkl_zomatcopy2(char ordering, char trans, size_t rows, size_t cols, MKL_Complex16 *alpha, MKL_Complex16 *SRC, size_t src_row, size_t src_col, MKL_Complex16 *DST, size_t dst_row, size_t dst_col);
mkl_?omatadd
Performs scaling and sum of two matrices including their out-of-place transposition/copying.
Syntax
Fortran:
call mkl_somatadd(ordering, transa, transb, m, n, alpha, a, lda, beta, b, ldb, c, ldc)
call mkl_domatadd(ordering, transa, transb, m, n, alpha, a, lda, beta, b, ldb, c, ldc)
call mkl_comatadd(ordering, transa, transb, m, n, alpha, a, lda, beta, b, ldb, c, ldc)
call mkl_zomatadd(ordering, transa, transb, m, n, alpha, a, lda, beta, b, ldb, c, ldc)
C:
mkl_somatadd(ordering, transa, transb, m, n, alpha, A, lda, beta, B, ldb, C, ldc);
mkl_domatadd(ordering, transa, transb, m, n, alpha, A, lda, beta, B, ldb, C, ldc);
mkl_comatadd(ordering, transa, transb, m, n, alpha, A, lda, beta, B, ldb, C, ldc);
mkl_zomatadd(ordering, transa, transb, m, n, alpha, A, lda, beta, B, ldb, C, ldc);
Include Files
• FORTRAN 77: mkl_trans.fi
• C: mkl_trans.h
Description
The mkl_?omatadd routine performs scaling and summation of two matrices, including their out-of-place transposition/copying. A transposition operation can be a normal matrix copy, a transposition, a conjugate transposition, or just a conjugation. The following out-of-place memory movement is done:
C := alpha*op(A) + beta*op(B)
op(A) is either the transpose, the conjugate transpose, or A itself, depending on transa. If no transposition of the source matrices is required, m is the number of rows and n is the number of columns in the source matrices A and B. In this case, the output matrix C is m-by-n.
The routine parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types.
Data types specific to the different interfaces are described in the section "Interfaces" below.
Note that different arrays should not overlap.
Input Parameters
ordering CHARACTER*1. Ordering of the matrix storage.
If ordering = 'R' or 'r', the ordering is row-major.
If ordering = 'C' or 'c', the ordering is column-major.
transa CHARACTER*1. Parameter that specifies the operation type on matrix A.
If transa = 'N' or 'n', op(A)=A and the matrix A is assumed unchanged on input.
If transa = 'T' or 't', it is assumed that A should be transposed.
If transa = 'C' or 'c', it is assumed that A should be conjugate transposed.
If transa = 'R' or 'r', it is assumed that A should be only conjugated.
If the data is real, then transa = 'R' is the same as transa = 'N', and transa = 'C' is the same as transa = 'T'.
transb CHARACTER*1. Parameter that specifies the operation type on matrix B.
If transb = 'N' or 'n', op(B)=B and the matrix B is assumed unchanged on input.
If transb = 'T' or 't', it is assumed that B should be transposed.
If transb = 'C' or 'c', it is assumed that B should be conjugate transposed.
If transb = 'R' or 'r', it is assumed that B should be only conjugated.
If the data is real, then transb = 'R' is the same as transb = 'N', and transb = 'C' is the same as transb = 'T'.
m INTEGER. The number of matrix rows.
n INTEGER. The number of matrix columns.
alpha REAL for mkl_somatadd.
DOUBLE PRECISION for mkl_domatadd.
COMPLEX for mkl_comatadd.
DOUBLE COMPLEX for mkl_zomatadd.
This parameter scales the input matrix by alpha.
a REAL for mkl_somatadd.
DOUBLE PRECISION for mkl_domatadd.
COMPLEX for mkl_comatadd.
DOUBLE COMPLEX for mkl_zomatadd.
Array, DIMENSION a(lda,*).
lda INTEGER. Distance between the first elements in adjacent columns (in the case of the column-major order) or rows (in the case of the row-major order) in the source matrix A; measured in the number of elements.
This parameter must be at least max(1,rows) if ordering = 'C' or 'c', and max(1,cols) otherwise.
beta REAL for mkl_somatadd.
DOUBLE PRECISION for mkl_domatadd.
COMPLEX for mkl_comatadd.
DOUBLE COMPLEX for mkl_zomatadd.
This parameter scales the input matrix by beta.
b REAL for mkl_somatadd.
DOUBLE PRECISION for mkl_domatadd.
COMPLEX for mkl_comatadd.
DOUBLE COMPLEX for mkl_zomatadd.
Array, DIMENSION b(ldb,*).
ldb INTEGER. Distance between the first elements in adjacent columns (in the case of the column-major order) or rows (in the case of the row-major order) in the source matrix B; measured in the number of elements.
This parameter must be at least max(1,rows) if ordering = 'C' or 'c', and max(1,cols) otherwise.
Output Parameters
c REAL for mkl_somatadd.
DOUBLE PRECISION for mkl_domatadd.
COMPLEX for mkl_comatadd.
DOUBLE COMPLEX for mkl_zomatadd.
Array, DIMENSION c(ldc,*).
ldc INTEGER. Distance between the first elements in adjacent columns (in the case of the column-major order) or rows (in the case of the row-major order) in the destination matrix C; measured in the number of elements.
To determine the minimum value of ldc, consider the following guideline:

If ordering = 'C' or 'c', then
• If transa or transb = 'T' or 't' or 'C' or 'c', this parameter must be at least max(1,rows)
• If transa or transb = 'N' or 'n' or 'R' or 'r', this parameter must be at least max(1,cols)

If ordering = 'R' or 'r', then
• If transa or transb = 'T' or 't' or 'C' or 'c', this parameter must be at least max(1,cols)
• If transa or transb = 'N' or 'n' or 'R' or 'r', this parameter must be at least max(1,rows)

Interfaces

FORTRAN 77:

SUBROUTINE mkl_somatadd ( ordering, transa, transb, m, n, alpha, a, lda, beta, b, ldb, c, ldc )
  CHARACTER*1 ordering, transa, transb
  INTEGER m, n, lda, ldb, ldc
  REAL alpha, beta
  REAL a(lda,*), b(ldb,*), c(ldc,*)

SUBROUTINE mkl_domatadd ( ordering, transa, transb, m, n, alpha, a, lda, beta, b, ldb, c, ldc )
  CHARACTER*1 ordering, transa, transb
  INTEGER m, n, lda, ldb, ldc
  DOUBLE PRECISION alpha, beta
  DOUBLE PRECISION a(lda,*), b(ldb,*), c(ldc,*)

SUBROUTINE mkl_comatadd ( ordering, transa, transb, m, n, alpha, a, lda, beta, b, ldb, c, ldc )
  CHARACTER*1 ordering, transa, transb
  INTEGER m, n, lda, ldb, ldc
  COMPLEX alpha, beta
  COMPLEX a(lda,*), b(ldb,*), c(ldc,*)

SUBROUTINE mkl_zomatadd ( ordering, transa, transb, m, n, alpha, a, lda, beta, b, ldb, c, ldc )
  CHARACTER*1 ordering, transa, transb
  INTEGER m, n, lda, ldb, ldc
  DOUBLE COMPLEX alpha, beta
  DOUBLE COMPLEX a(lda,*), b(ldb,*), c(ldc,*)

C:

void mkl_somatadd(char ordering, char transa, char transb, size_t m, size_t n, float *alpha, float *A, size_t lda, float *beta, float *B, size_t ldb, float *C, size_t ldc);

void mkl_domatadd(char ordering, char transa, char transb, size_t m, size_t n, double *alpha, double *A, size_t lda, double *beta, double *B, size_t ldb, double *C, size_t ldc);

void mkl_comatadd(char ordering, char transa, char transb, size_t m, size_t n, MKL_Complex8 *alpha, MKL_Complex8 *A, size_t lda, MKL_Complex8 *beta, MKL_Complex8 *B, size_t ldb, MKL_Complex8 *C, size_t ldc);

void mkl_zomatadd(char ordering, char transa, char transb, size_t m, size_t n, MKL_Complex16 *alpha, MKL_Complex16 *A, size_t lda, MKL_Complex16 *beta, MKL_Complex16 *B, size_t ldb, MKL_Complex16 *C, size_t ldc);

LAPACK Routines: Linear Equations 3

This chapter describes the Intel® Math Kernel Library implementation of routines from the LAPACK package that are used for solving systems of linear equations and performing a number of related computational tasks. The library includes LAPACK routines for both real and complex data. Routines are supported for systems of equations with the following types of matrices:
• general
• banded
• symmetric or Hermitian positive-definite (full, packed, and rectangular full packed (RFP) storage)
• symmetric or Hermitian positive-definite banded
• symmetric or Hermitian indefinite (both full and packed storage)
• symmetric or Hermitian indefinite banded
• triangular (full, packed, and RFP storage)
• triangular banded
• tridiagonal
• diagonally dominant tridiagonal.

For each of the above matrix types, the library includes routines for performing the following computations:
– factoring the matrix (except for triangular matrices)
– equilibrating the matrix (except for RFP matrices)
– solving a system of linear equations
– estimating the condition number of a matrix (except for RFP matrices)
– refining the solution of linear equations and computing its error bounds (except for RFP matrices)
– inverting the matrix.

To solve a particular problem, you can call two or more computational routines or call a corresponding driver routine that combines several tasks in one call. For example, to solve a system of linear equations with a general matrix, call ?getrf (LU factorization) and then ?getrs (computing the solution). Then, call ?gerfs to refine the solution and get the error bounds. Alternatively, use the driver routine ?gesvx that performs all these tasks in one call.
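The two-step flow described above (factor first, then solve with the stored factors) can be sketched in plain C without the library. The functions lu_factor, lu_solve, and abs_d below are hypothetical stand-ins written for this illustration; they are not Intel MKL routines and only mimic the roles that ?getrf and ?getrs play, for a square, column-major, double-precision matrix.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: absolute value without pulling in <math.h>. */
double abs_d(double x) { return x < 0.0 ? -x : x; }

/* Hypothetical stand-in for the factorization step (the role of ?getrf):
 * in-place LU with partial pivoting on an n-by-n column-major matrix.
 * Returns 0 on success, or i > 0 if the i-th pivot (1-based) is exactly 0;
 * in that case the factors must not be used for solving. */
int lu_factor(int n, double *a, int lda, int *ipiv)
{
    assert(n >= 0 && lda >= (n > 1 ? n : 1));
    int info = 0;
    for (int k = 0; k < n; ++k) {
        int p = k;                        /* row with the largest entry in column k */
        for (int i = k + 1; i < n; ++i)
            if (abs_d(a[i + (size_t)k * lda]) > abs_d(a[p + (size_t)k * lda]))
                p = i;
        ipiv[k] = p + 1;                  /* 1-based pivot index */
        if (a[p + (size_t)k * lda] == 0.0) {
            if (info == 0) info = k + 1;  /* remember the first zero pivot */
            continue;
        }
        if (p != k)                       /* interchange full rows k and p */
            for (int j = 0; j < n; ++j) {
                double t = a[k + (size_t)j * lda];
                a[k + (size_t)j * lda] = a[p + (size_t)j * lda];
                a[p + (size_t)j * lda] = t;
            }
        for (int i = k + 1; i < n; ++i) {
            double l = a[i + (size_t)k * lda] / a[k + (size_t)k * lda];
            a[i + (size_t)k * lda] = l;   /* multiplier stored below the diagonal */
            for (int j = k + 1; j < n; ++j)
                a[i + (size_t)j * lda] -= l * a[k + (size_t)j * lda];
        }
    }
    return info;
}

/* Hypothetical stand-in for the solve step (the role of ?getrs):
 * overwrites b with the solution of A*x = b using the factors above. */
void lu_solve(int n, const double *a, int lda, const int *ipiv, double *b)
{
    for (int k = 0; k < n; ++k) {         /* apply the row interchanges to b */
        int p = ipiv[k] - 1;
        double t = b[k]; b[k] = b[p]; b[p] = t;
    }
    for (int i = 1; i < n; ++i)           /* forward substitution, unit-diagonal L */
        for (int j = 0; j < i; ++j)
            b[i] -= a[i + (size_t)j * lda] * b[j];
    for (int i = n - 1; i >= 0; --i) {    /* back substitution with U */
        for (int j = i + 1; j < n; ++j)
            b[i] -= a[i + (size_t)j * lda] * b[j];
        b[i] /= a[i + (size_t)i * lda];
    }
}
```

Keeping the two steps separate means one factorization can be reused for several right-hand sides, which is the main reason to prefer the computational routines over a driver when the same matrix is solved repeatedly.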
WARNING LAPACK routines assume that input matrices do not contain IEEE 754 special values such as INF or NaN values. Using these special values may cause LAPACK to return unexpected results or become unstable.

Starting from release 8.0, Intel MKL supports, along with the FORTRAN 77 interface to LAPACK computational and driver routines, the Fortran 95 interface, which uses simplified routine calls with shorter argument lists. The syntax section of the routine description gives the calling sequence for the Fortran 95 interface, where available, immediately after the FORTRAN 77 calls.

Routine Naming Conventions

To call each routine introduced in this chapter from a FORTRAN 77 program, you can use the LAPACK name.

LAPACK names are listed in Table "Computational Routines for Systems of Equations with Real Matrices" and Table "Computational Routines for Systems of Equations with Complex Matrices", and have the structure ?yyzzz or ?yyzz, which is described below.

The initial symbol ? indicates the data type:
s  real, single precision
c  complex, single precision
d  real, double precision
z  complex, double precision

Some routines can have combined character codes, such as ds or zc.
The second and third letters yy indicate the matrix type and storage scheme:
ge  general
gb  general band
gt  general tridiagonal
dt  diagonally dominant tridiagonal
po  symmetric or Hermitian positive-definite
pp  symmetric or Hermitian positive-definite (packed storage)
pf  symmetric or Hermitian positive-definite (RFP storage)
pb  symmetric or Hermitian positive-definite band
pt  symmetric or Hermitian positive-definite tridiagonal
sy  symmetric indefinite
sp  symmetric indefinite (packed storage)
he  Hermitian indefinite
hp  Hermitian indefinite (packed storage)
tr  triangular
tp  triangular (packed storage)
tf  triangular (RFP storage)
tb  triangular band

The last three letters zzz indicate the computation performed:
trf  perform a triangular matrix factorization
trs  solve the linear system with a factored matrix
con  estimate the matrix condition number
rfs  refine the solution and compute error bounds
rfsx  refine the solution and compute error bounds using extra-precise iterative refinement
tri  compute the inverse matrix using the factorization
equ, equb  equilibrate a matrix.

For example, the sgetrf routine performs the triangular factorization of general real matrices in single precision; the corresponding routine for complex matrices is cgetrf.

Driver routine names can end with -sv (meaning a simple driver), or with -svx (meaning an expert driver) or with -svxx (meaning an extra-precise iterative refinement expert driver).

The names of the Fortran 95 interfaces to the LAPACK computational and driver routines are the same as the FORTRAN 77 names but without the first letter that indicates the data type. For example, the name of the routine that performs a triangular factorization of general real matrices in Fortran 95 is getrf. Different data types are handled through the definition of a specific internal parameter that refers to a module block with named constants for single and double precision.
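As a small illustration of the ?yyzzz scheme, the sketch below assembles a routine name from its three parts. The function lapack_name is a hypothetical helper invented for this example, not something the library provides; it only checks that the precision letter is one of s, d, c, z and that the matrix code has two letters.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write "<prec><yy><zzz>" into out (capacity cap).
 * Returns 0 on success, -1 if the parts are invalid or do not fit. */
int lapack_name(char prec, const char *yy, const char *zzz, char *out, size_t cap)
{
    /* strchr also matches the terminating '\0', so exclude it explicitly. */
    if (prec == '\0' || strchr("sdcz", prec) == NULL || strlen(yy) != 2)
        return -1;
    int len = snprintf(out, cap, "%c%s%s", prec, yy, zzz);
    return (len < 0 || (size_t)len >= cap) ? -1 : 0;
}
```

For instance, lapack_name('s', "ge", "trf", buf, sizeof buf) fills buf with sgetrf, and skipping the first character of that string (buf + 1) gives getrf, the Fortran 95 name described above.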
C Interface Conventions

The C interfaces are implemented for most of the Intel MKL LAPACK driver and computational routines.

The arguments of the C interfaces for the Intel MKL LAPACK functions comply with the following rules:
• Scalar input arguments are passed by value.
• Array arguments are passed by reference.
• Array input arguments are declared with the const modifier.
• Function arguments are passed by pointer.
• An integer return value replaces the info output parameter. The return value equal to 0 means the function operation is completed successfully. See also special error codes below.

Matrix Order

Most of the LAPACK C interfaces have an additional parameter matrix_order of type int as their first argument. This parameter specifies whether the two-dimensional arrays are row-major (LAPACK_ROW_MAJOR) or column-major (LAPACK_COL_MAJOR).

In general the leading dimension lda is equal to the number of elements in the major dimension. It is also equal to the distance in elements between two neighboring elements in a line in the minor dimension. If there are no extra elements in a matrix with m rows and n columns, then
• For row-major ordering: the number of elements in a row is n, and row i is stored in memory right after row i-1. Therefore lda is n.
• For column-major ordering: the number of elements in a column is m, and column i is stored in memory right after column i-1. Therefore lda is m.

To refer to a submatrix with dimensions k by l, use the number of elements in the major dimension of the whole matrix (as above) as the leading dimension and k and l in the subroutine's input parameters to describe the size of the submatrix.

Workspace Arrays

The LAPACK C interface omits workspace parameters because workspace is allocated during runtime and released upon completion of the function operation.

For some functions, work arrays contain valuable information on exit.
In such cases, the interface contains an additional argument or arguments, namely:
• ?gesvx and ?gbsvx contain rpivot
• ?gesvd contains superb
• ?gejsv and ?gesvj contain istat and stat, respectively.

Function Types

The function types are used in non-symmetric eigenproblem functions only.

typedef lapack_logical (*LAPACK_S_SELECT2) (const float*, const float*);
typedef lapack_logical (*LAPACK_S_SELECT3) (const float*, const float*, const float*);
typedef lapack_logical (*LAPACK_D_SELECT2) (const double*, const double*);
typedef lapack_logical (*LAPACK_D_SELECT3) (const double*, const double*, const double*);
typedef lapack_logical (*LAPACK_C_SELECT1) (const lapack_complex_float*);
typedef lapack_logical (*LAPACK_C_SELECT2) (const lapack_complex_float*, const lapack_complex_float*);
typedef lapack_logical (*LAPACK_Z_SELECT1) (const lapack_complex_double*);
typedef lapack_logical (*LAPACK_Z_SELECT2) (const lapack_complex_double*, const lapack_complex_double*);

Mapping FORTRAN Data Types against C Data Types

FORTRAN Data Types vs. C Data Types

FORTRAN | C
INTEGER | lapack_int
LOGICAL | lapack_logical
REAL | float
DOUBLE PRECISION | double
COMPLEX | lapack_complex_float
COMPLEX*16/DOUBLE COMPLEX | lapack_complex_double
CHARACTER | char

C Type Definitions

#ifndef lapack_int
#define lapack_int MKL_INT
#endif

#ifndef lapack_logical
#define lapack_logical lapack_int
#endif

Complex Type Definitions

Complex type for single precision:

#ifndef lapack_complex_float
#define lapack_complex_float MKL_Complex8
#endif

Complex type for double precision:

#ifndef lapack_complex_double
#define lapack_complex_double MKL_Complex16
#endif

Matrix Order Definitions

#define LAPACK_ROW_MAJOR 101
#define LAPACK_COL_MAJOR 102

See Matrix Order for an explanation of row-major order and column-major order storage.
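The relationship between matrix order, leading dimension, and element position can be sketched as follows. The two constants repeat the Matrix Order Definitions given above; elem_offset and min_lda are hypothetical helpers introduced only for this illustration and are not part of the LAPACK C interface. Indices are zero-based, as is usual in C.

```c
#include <assert.h>
#include <stddef.h>

/* Values repeated from the Matrix Order Definitions. */
#define LAPACK_ROW_MAJOR 101
#define LAPACK_COL_MAJOR 102

/* Hypothetical helper: offset of element (i, j), zero-based, in an array
 * whose leading dimension is lda. */
size_t elem_offset(int matrix_order, size_t i, size_t j, size_t lda)
{
    assert(matrix_order == LAPACK_ROW_MAJOR || matrix_order == LAPACK_COL_MAJOR);
    return matrix_order == LAPACK_ROW_MAJOR ? i * lda + j   /* rows are contiguous */
                                            : i + j * lda;  /* columns are contiguous */
}

/* Hypothetical helper: leading dimension of an m-by-n matrix stored
 * without extra elements. */
size_t min_lda(int matrix_order, size_t m, size_t n)
{
    return matrix_order == LAPACK_ROW_MAJOR ? n : m;
}
```

To address a submatrix, keep the leading dimension of the whole matrix and pass a pointer to the submatrix's first element; for a 3-by-4 row-major matrix with lda = 4, the submatrix starting at element (1, 1) begins at offset elem_offset(LAPACK_ROW_MAJOR, 1, 1, 4).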
Error Code Definitions

#define LAPACK_WORK_MEMORY_ERROR -1010 /* Failed to allocate memory for a working array */
#define LAPACK_TRANSPOSE_MEMORY_ERROR -1011 /* Failed to allocate memory for transposed matrix */

If the return value is -i, the i-th parameter has an invalid value.

Function Prototypes

Some Intel MKL functions differ in data types they support and vary in the parameters they take. Each function type has a unique prototype defined. Use this prototype when you call the function from your application program. In most cases, Intel MKL supports four distinct floating-point precisions. Each corresponding prototype looks similar, usually differing only in the data type. To avoid listing all the prototypes in every supported precision, a generic prototype template is provided.

<?> denotes precision and is s, d, c, or z:
• s for real, single precision
• d for real, double precision
• c for complex, single precision
• z for complex, double precision

<datatype> stands for a respective data type: float, double, lapack_complex_float, or lapack_complex_double.

For example, the C prototype template for the ?pptrs function that solves a system of linear equations with a packed Cholesky-factored symmetric (Hermitian) positive-definite matrix looks as follows:

lapack_int LAPACKE_<?>pptrs(int matrix_order, char uplo, lapack_int n, lapack_int nrhs, const <datatype>* ap, <datatype>* b, lapack_int ldb);

To obtain the function name and parameter list that corresponds to a specific precision, replace the <?> symbol with s, d, c, or z and the <datatype> field with the corresponding data type (float, double, lapack_complex_float, or lapack_complex_double respectively).

A specific example follows.
To solve a system of linear equations with a packed Cholesky-factored Hermitian positive-definite matrix in single-precision complex arithmetic, use the following:

lapack_int LAPACKE_cpptrs(int matrix_order, char uplo, lapack_int n, lapack_int nrhs, const lapack_complex_float* ap, lapack_complex_float* b, lapack_int ldb);

NOTE For the select parameter, the respective values of the <datatype> field for s, d, c, or z are as follows: LAPACK_S_SELECT3, LAPACK_D_SELECT3, LAPACK_C_SELECT2, and LAPACK_Z_SELECT2.

Fortran 95 Interface Conventions

Intel® MKL implements the Fortran 95 interface to LAPACK through wrappers that call respective FORTRAN 77 routines. This interface uses such Fortran 95 features as assumed-shape arrays and optional arguments to provide simplified calls to LAPACK routines with fewer arguments.

NOTE For LAPACK, Intel MKL offers two types of the Fortran 95 interfaces:
• using mkl_lapack.fi only through the include 'mkl_lapack.fi' statement. Such interfaces allow you to make use of the original LAPACK routines with all their arguments
• using lapack.f90 that includes improved interfaces. This file is used to generate the module files lapack95.mod and f95_precision.mod. The module files mkl95_lapack.mod and mkl95_precision.mod are also generated. See also the section "Fortran 95 interfaces and wrappers to LAPACK and BLAS" of the Intel® MKL User's Guide for details. The module files are used to process the FORTRAN use clauses referencing the LAPACK interface: use lapack95 (or an equivalent use mkl95_lapack) and use f95_precision (or an equivalent use mkl95_precision).

The main conventions for the Fortran 95 interface are as follows:
• The names of arguments used in a Fortran 95 call are typically the same as for the respective generic (FORTRAN 77) interface. In rare cases, formal argument names may be different. For instance, select instead of selctg.
• Input arguments such as array dimensions are not required in Fortran 95 and are skipped from the calling sequence. Array dimensions are reconstructed from the user data that must exactly follow the required array shape. Another type of generic arguments that are skipped in the Fortran 95 interface are arguments that represent workspace arrays (such as work, rwork, and so on). The only exceptions are cases when workspace arrays return significant information on output. An argument can also be skipped if its value is completely defined by the presence or absence of another argument in the calling sequence, and the restored value is the only meaningful value for the skipped argument.
• Some generic arguments are declared as optional in the Fortran 95 interface and may or may not be present in the calling sequence. An argument can be declared optional if it meets one of the following conditions:
  – If an argument value is completely defined by the presence or absence of another argument in the calling sequence, it can be declared optional. The difference from the skipped argument in this case is that the optional argument can have some meaningful values that are distinct from the value reconstructed by default. For example, if some argument (like jobz) can take only two values and one of these values directly implies the use of another argument, then the value of jobz can be uniquely reconstructed from the actual presence or absence of this second argument, and jobz can be omitted.
  – If an input argument can take only a few possible values, it can be declared as optional. The default value of such an argument is typically set as the first value in the list, and all exceptions to this rule are explicitly stated in the routine description.
  – If an input argument has a natural default value, it can be declared as optional. The default value of such an optional argument is set to its natural default value.
• Argument info is declared as optional in the Fortran 95 interface.
If it is present in the calling sequence, the value assigned to info is interpreted as follows:
  – If this value is more than -1000, its meaning is the same as in the FORTRAN 77 routine.
  – If this value is equal to -1000, it means that there is not enough work memory.
  – If this value is equal to -1001, incompatible arguments are present in the calling sequence.
  – If this value is equal to -i, the i-th parameter (counting parameters in the FORTRAN 77 interface, not the Fortran 95 interface) had an illegal value.
• Optional arguments are given in square brackets in the Fortran 95 call syntax.

The "Fortran 95 Notes" subsection at the end of the topic describing each routine details concrete rules for reconstructing the values of the omitted optional parameters.

Intel® MKL Fortran 95 Interfaces for LAPACK Routines vs. Netlib Implementation

The following list presents the general differences between the Intel MKL LAPACK95 implementation and its Netlib analog:
• The Intel MKL Fortran 95 interfaces are provided for pure procedures.
• Names of interfaces do not contain the LA_ prefix.
• An optional array argument always has the target attribute.
• Functionality of the Intel MKL LAPACK95 wrapper is close to the FORTRAN 77 original implementation in the getrf, gbtrf, and potrf interfaces.
• If the jobz argument value specifies presence or absence of the z argument, then z is always declared as optional and jobz is restored depending on whether z is present or not. It is not always so in the Netlib version (see "Modified Netlib Interfaces" in Appendix E).
• To avoid double error checking, processing of the info argument is limited to checking the allocated memory and detecting incompatible optional arguments.
• If an argument that is present in the list of arguments completely defines another argument, the latter is always declared as optional.
You can transform an application that uses the Netlib LAPACK interfaces so that it works with the Intel MKL interfaces, provided that:
a. The application is correct, that is, unambiguous, compiler-independent, and contains no errors.
b. Each routine name denotes only one specific routine. If any routine name in the application coincides with a name of the original Netlib routine (for example, after removing the LA_ prefix) but denotes a routine different from the Netlib original routine, this name should be modified through context name replacement.

You should transform your application in the following cases (see Appendix E for specific differences of individual interfaces):
• When using the Netlib routines that differ from the Intel MKL routines only by the LA_ prefix or in the array attribute target. The only transformation required in this case is context name replacement. See "Interfaces Identical to Netlib" in Appendix E for details.
• When using Netlib routines that differ from the Intel MKL routines by the LA_ prefix, the target array attribute, and the names of formal arguments. In the case of positional passing of arguments, no additional transformation except context name replacement is required. In the case of keyword passing of arguments, in addition to the context name replacement the names of mismatching keywords should also be modified. See "Interfaces with Replaced Argument Names" in Appendix E for details.
• When using the Netlib routines that differ from the respective Intel MKL routines by the LA_ prefix, the target array attribute, sequence of the arguments, arguments missing in Intel MKL but present in Netlib and, vice versa, present in Intel MKL but missing in Netlib. Remove the differences in the sequence and range of the arguments in the course of all the transformations when you use the Netlib routines specified by this bullet and the preceding bullet.
See "Modified Netlib Interfaces" in Appendix E for details.
• When using the getrf, gbtrf, and potrf interfaces, that is, new functionality implemented in Intel MKL but unavailable in the Netlib source. To override the differences, build the desired functionality explicitly with the Intel MKL means or create a new subroutine with the new functionality, using specific MKL interfaces corresponding to LAPACK 77 routines. You can call the LAPACK 77 routines directly but using the new Intel MKL interfaces is preferable. See "Interfaces Absent From Netlib" and "Interfaces of New Functionality" in Appendix E for details. Note that if the transformed application calls getrf, gbtrf or potrf without controlling arguments rcond and norm, just context name replacement is enough in modifying the calls into the Intel MKL interfaces, as described in the first bullet above. The Netlib functionality is preserved in such cases.
• When using the Netlib auxiliary routines. In this case, call a corresponding subroutine directly, using the Intel MKL LAPACK 77 interfaces.

Transform your application as follows:
1. Make sure conditions a. and b. are met.
2. Select Netlib LAPACK 95 calls. For each call, do the following:
  • Select the type of difference and do the required transformations.
  • Revise results to eliminate unneeded code or data, which may appear after several identical calls.
3. Make sure the transformations are correct and complete.

Matrix Storage Schemes

LAPACK routines use the following matrix storage schemes:
• Full storage: a matrix A is stored in a two-dimensional array a, with the matrix element aij stored in the array element a(i,j).
• Packed storage scheme allows you to store symmetric, Hermitian, or triangular matrices more compactly: the upper or lower triangle of the matrix is packed by columns in a one-dimensional array.
• Band storage: an m-by-n band matrix with kl sub-diagonals and ku superdiagonals is stored compactly in a two-dimensional array ab with kl+ku+1 rows and n columns. Columns of the matrix are stored in the corresponding columns of the array, and diagonals of the matrix are stored in rows of the array.
• Rectangular Full Packed (RFP) storage: the upper or lower triangle of the matrix is packed combining the full and packed storage schemes. This combination enables using half of the full storage as packed storage while maintaining efficiency by using Level 3 BLAS/LAPACK kernels as the full storage.

In Chapters 4 and 5, arrays that hold matrices in packed storage have names ending in p; arrays with matrices in band storage have names ending in b; arrays with matrices in the RFP storage have names ending in fp.

For more information on matrix storage schemes, see "Matrix Arguments" in Appendix B.

Mathematical Notation

Descriptions of LAPACK routines use the following notation:

$Ax = b$  A system of linear equations with an n-by-n matrix $A = \{a_{ij}\}$, a right-hand side vector $b = \{b_i\}$, and an unknown vector $x = \{x_i\}$.
$AX = B$  A set of systems with a common matrix A and multiple right-hand sides. The columns of B are individual right-hand sides, and the columns of X are the corresponding solutions.
$|x|$  The vector with elements $|x_i|$ (absolute values of $x_i$).
$|A|$  The matrix with elements $|a_{ij}|$ (absolute values of $a_{ij}$).
$\|x\|_\infty = \max_i |x_i|$  The infinity-norm of the vector x.
$\|A\|_\infty = \max_i \sum_j |a_{ij}|$  The infinity-norm of the matrix A.
$\|A\|_1 = \max_j \sum_i |a_{ij}|$  The one-norm of the matrix A. $\|A\|_1 = \|A^T\|_\infty = \|A^H\|_\infty$
$\kappa(A) = \|A\|\,\|A^{-1}\|$  The condition number of the matrix A.

Error Analysis

In practice, most computations are performed with rounding errors. Besides, you often need to solve a system Ax = b, where the data (the elements of A and b) are not known exactly.
Therefore, it is important to understand how the data errors and rounding errors can affect the solution x.

Data perturbations. If $x$ is the exact solution of $Ax = b$, and $x + \delta x$ is the exact solution of a perturbed problem $(A + \delta A)(x + \delta x) = b + \delta b$, then, to first order in the perturbations,

$$\frac{\|\delta x\|}{\|x\|} \le \kappa(A)\left(\frac{\|\delta A\|}{\|A\|} + \frac{\|\delta b\|}{\|b\|}\right), \quad\text{where } \kappa(A) = \|A\|\,\|A^{-1}\|.$$

In other words, relative errors in A or b may be amplified in the solution vector x by a factor $\kappa(A) = \|A\|\,\|A^{-1}\|$ called the condition number of A.

Rounding errors have the same effect as relative perturbations $c(n)\varepsilon$ in the original data. Here $\varepsilon$ is the machine precision, and $c(n)$ is a modest function of the matrix order n. The corresponding solution error is $\|\delta x\|/\|x\| \le c(n)\,\kappa(A)\,\varepsilon$. (The value of $c(n)$ is seldom greater than $10n$.)

Thus, if your matrix A is ill-conditioned (that is, its condition number $\kappa(A)$ is very large), then the error in the solution x is also large; you may even encounter a complete loss of precision. LAPACK provides routines that allow you to estimate $\kappa(A)$ (see Routines for Estimating the Condition Number) and also give you a more precise estimate for the actual solution error (see Refining the Solution and Estimating Its Error).

Computational Routines

Table "Computational Routines for Systems of Equations with Real Matrices" lists the LAPACK computational routines (FORTRAN 77 and Fortran 95 interfaces) for factorizing, equilibrating, and inverting real matrices, estimating their condition numbers, solving systems of equations with real matrices, refining the solution, and estimating its error. Table "Computational Routines for Systems of Equations with Complex Matrices" lists similar routines for complex matrices. Respective routine names in the Fortran 95 interface are without the first symbol (see Routine Naming Conventions).
Computational Routines for Systems of Equations with Real Matrices

Matrix type, storage scheme | Factorize matrix | Equilibrate matrix | Solve system | Condition number | Estimate error | Invert matrix
general | ?getrf | ?geequ, ?geequb | ?getrs | ?gecon | ?gerfs, ?gerfsx | ?getri
general band | ?gbtrf | ?gbequ, ?gbequb | ?gbtrs | ?gbcon | ?gbrfs, ?gbrfsx |
general tridiagonal | ?gttrf | | ?gttrs | ?gtcon | ?gtrfs |
diagonally dominant tridiagonal | ?dttrfb | | ?dttrsb | | |
symmetric positive-definite | ?potrf | ?poequ, ?poequb | ?potrs | ?pocon | ?porfs, ?porfsx | ?potri
symmetric positive-definite, packed storage | ?pptrf | ?ppequ | ?pptrs | ?ppcon | ?pprfs | ?pptri
symmetric positive-definite, RFP storage | ?pftrf | | ?pftrs | | | ?pftri
symmetric positive-definite, band | ?pbtrf | ?pbequ | ?pbtrs | ?pbcon | ?pbrfs |
symmetric positive-definite, tridiagonal | ?pttrf | | ?pttrs | ?ptcon | ?ptrfs |
symmetric indefinite | ?sytrf | ?syequb | ?sytrs, ?sytrs2 | ?sycon, ?syconv | ?syrfs, ?syrfsx | ?sytri, ?sytri2, ?sytri2x
symmetric indefinite, packed storage | ?sptrf | | ?sptrs | ?spcon | ?sprfs | ?sptri
triangular | | | ?trtrs | ?trcon | ?trrfs | ?trtri
triangular, packed storage | | | ?tptrs | ?tpcon | ?tprfs | ?tptri
triangular, RFP storage | | | | | | ?tftri
triangular band | | | ?tbtrs | ?tbcon | ?tbrfs |

In the table above, ? denotes s (single precision) or d (double precision) for the FORTRAN 77 interface.
Computational Routines for Systems of Equations with Complex Matrices

Matrix type, storage scheme | Factorize matrix | Equilibrate matrix | Solve system | Condition number | Estimate error | Invert matrix
general | ?getrf | ?geequ, ?geequb | ?getrs | ?gecon | ?gerfs, ?gerfsx | ?getri
general band | ?gbtrf | ?gbequ, ?gbequb | ?gbtrs | ?gbcon | ?gbrfs, ?gbrfsx |
general tridiagonal | ?gttrf | | ?gttrs | ?gtcon | ?gtrfs |
Hermitian positive-definite | ?potrf | ?poequ, ?poequb | ?potrs | ?pocon | ?porfs, ?porfsx | ?potri
Hermitian positive-definite, packed storage | ?pptrf | ?ppequ | ?pptrs | ?ppcon | ?pprfs | ?pptri
Hermitian positive-definite, RFP storage | ?pftrf | | ?pftrs | | | ?pftri
Hermitian positive-definite, band | ?pbtrf | ?pbequ | ?pbtrs | ?pbcon | ?pbrfs |
Hermitian positive-definite, tridiagonal | ?pttrf | | ?pttrs | ?ptcon | ?ptrfs |
Hermitian indefinite | ?hetrf | ?heequb | ?hetrs, ?hetrs2 | ?hecon | ?herfs, ?herfsx | ?hetri, ?hetri2, ?hetri2x
symmetric indefinite | ?sytrf | ?syequb | ?sytrs, ?sytrs2 | ?sycon, ?syconv | ?syrfs, ?syrfsx | ?sytri, ?sytri2, ?sytri2x
Hermitian indefinite, packed storage | ?hptrf | | ?hptrs | ?hpcon | ?hprfs | ?hptri
symmetric indefinite, packed storage | ?sptrf | | ?sptrs | ?spcon | ?sprfs | ?sptri
triangular | | | ?trtrs | ?trcon | ?trrfs | ?trtri
triangular, packed storage | | | ?tptrs | ?tpcon | ?tprfs | ?tptri
triangular, RFP storage | | | | | | ?tftri
triangular band | | | ?tbtrs | ?tbcon | ?tbrfs |

In the table above, ? stands for c (single precision complex) or z (double precision complex) for the FORTRAN 77 interface.

Routines for Matrix Factorization

This section describes the LAPACK routines for matrix factorization.
The following factorizations are supported:
• LU factorization
• Cholesky factorization of real symmetric positive-definite matrices
• Cholesky factorization of real symmetric positive-definite matrices with pivoting
• Cholesky factorization of Hermitian positive-definite matrices
• Cholesky factorization of Hermitian positive-definite matrices with pivoting
• Bunch-Kaufman factorization of real and complex symmetric matrices
• Bunch-Kaufman factorization of Hermitian matrices.

You can compute:
• the LU factorization using full and band storage of matrices
• the Cholesky factorization using full, packed, RFP, and band storage
• the Bunch-Kaufman factorization using full and packed storage.

?getrf

Computes the LU factorization of a general m-by-n matrix.

Syntax

Fortran 77:
call sgetrf( m, n, a, lda, ipiv, info )
call dgetrf( m, n, a, lda, ipiv, info )
call cgetrf( m, n, a, lda, ipiv, info )
call zgetrf( m, n, a, lda, ipiv, info )

Fortran 95:
call getrf( a [,ipiv] [,info] )

C:
lapack_int LAPACKE_<?>getrf( int matrix_order, lapack_int m, lapack_int n, <datatype>* a, lapack_int lda, lapack_int* ipiv );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description

The routine computes the LU factorization of a general m-by-n matrix A as A = P*L*U, where P is a permutation matrix, L is lower triangular with unit diagonal elements (lower trapezoidal if m > n) and U is upper triangular (upper trapezoidal if m < n). The routine uses partial pivoting, with row interchanges.

NOTE This routine supports the Progress Routine feature. See Progress Function section for details.

Input Parameters

The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

m  INTEGER. The number of rows in the matrix A (m ≥ 0).
n  INTEGER. The number of columns in A; n ≥ 0.

a  REAL for sgetrf. DOUBLE PRECISION for dgetrf. COMPLEX for cgetrf. DOUBLE COMPLEX for zgetrf. Array, DIMENSION (lda,*). Contains the matrix A. The second dimension of a must be at least max(1, n).

lda  INTEGER. The leading dimension of array a.

Output Parameters

a  Overwritten by L and U. The unit diagonal elements of L are not stored.

ipiv  INTEGER. Array, DIMENSION at least max(1,min(m, n)). The pivot indices; for 1 ≤ i ≤ min(m, n), row i was interchanged with row ipiv(i).

info  INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value. If info = i, u(i,i) is 0. The factorization has been completed, but U is exactly singular. Division by 0 will occur if you use the factor U for solving a system of linear equations.

Fortran 95 Interface Notes

Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.

Specific details for the routine getrf interface are as follows:

a  Holds the matrix A of size (m,n).
ipiv  Holds the vector of length min(m,n).

Application Notes

The computed L and U are the exact factors of a perturbed matrix A + E, where

$$|E| \le c(\min(m,n))\,\varepsilon\,P\,|L|\,|U|,$$

$c(n)$ is a modest linear function of n, and $\varepsilon$ is the machine precision.

The approximate number of floating-point operations for real flavors is
$(2/3)n^3$  if m = n,
$(1/3)n^2(3m-n)$  if m > n,
$(1/3)m^2(3n-m)$  if m < n.

The number of operations for complex flavors is four times greater.

After calling this routine with m = n, you can call the following:
?getrs  to solve A*X = B or A^T*X = B or A^H*X = B
?gecon  to estimate the condition number of A
?getri  to compute the inverse of A.

See Also
mkl_progress

?gbtrf

Computes the LU factorization of a general m-by-n band matrix.
Syntax
Fortran 77:
call sgbtrf( m, n, kl, ku, ab, ldab, ipiv, info )
call dgbtrf( m, n, kl, ku, ab, ldab, ipiv, info )
call cgbtrf( m, n, kl, ku, ab, ldab, ipiv, info )
call zgbtrf( m, n, kl, ku, ab, ldab, ipiv, info )
Fortran 95:
call gbtrf( ab [,kl] [,m] [,ipiv] [,info] )
C:
lapack_int LAPACKE_<?>gbtrf( int matrix_order, lapack_int m, lapack_int n, lapack_int kl, lapack_int ku, <datatype>* ab, lapack_int ldab, lapack_int* ipiv );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine forms the LU factorization of a general m-by-n band matrix A with kl non-zero subdiagonals and ku non-zero superdiagonals, that is,
A = P*L*U,
where P is a permutation matrix; L is lower triangular with unit diagonal elements and at most kl non-zero elements in each column; U is an upper triangular band matrix with kl + ku superdiagonals. The routine uses partial pivoting, with row interchanges (which creates the additional kl superdiagonals in U).
NOTE This routine supports the Progress Routine feature. See Progress Function section for details.
Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
m INTEGER. The number of rows in matrix A; m ≥ 0.
n INTEGER. The number of columns in matrix A; n ≥ 0.
kl INTEGER. The number of subdiagonals within the band of A; kl ≥ 0.
ku INTEGER. The number of superdiagonals within the band of A; ku ≥ 0.
ab REAL for sgbtrf
DOUBLE PRECISION for dgbtrf
COMPLEX for cgbtrf
DOUBLE COMPLEX for zgbtrf.
Array, DIMENSION (ldab,*). The array ab contains the matrix A in band storage, in rows kl + 1 to 2*kl + ku + 1; rows 1 to kl of the array need not be set.
The j-th column of A is stored in the j-th column of the array ab as follows:
ab(kl + ku + 1 + i - j, j) = a(i,j) for max(1,j-ku) ≤ i ≤ min(m,j+kl).
ldab INTEGER. The leading dimension of the array ab. (ldab ≥ 2*kl + ku + 1)
Output Parameters
ab Overwritten by L and U. U is stored as an upper triangular band matrix with kl + ku superdiagonals in rows 1 to kl + ku + 1, and the multipliers used during the factorization are stored in rows kl + ku + 2 to 2*kl + ku + 1. See Application Notes below for further details.
ipiv INTEGER. Array, DIMENSION at least max(1,min(m, n)). The pivot indices; for 1 ≤ i ≤ min(m, n), row i was interchanged with row ipiv(i).
info INTEGER. If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
If info = i, u_ii is 0. The factorization has been completed, but U is exactly singular. Division by 0 will occur if you use the factor U for solving a system of linear equations.
Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine gbtrf interface are as follows:
ab Holds the array A of size (2*kl+ku+1,n).
ipiv Holds the vector of length min(m,n).
kl If omitted, assumed kl = ku.
ku Restored as ku = lda-2*kl-1.
m If omitted, assumed m = n.
Application Notes
The computed L and U are the exact factors of a perturbed matrix A + E, where
|E| ≤ c(kl+ku+1) ε P|L||U|,
c(k) is a modest linear function of k, and ε is the machine precision.
The total number of floating-point operations for real flavors varies between approximately 2n(ku+1)kl and 2n(kl+ku+1)kl. The number of operations for complex flavors is four times greater. All these estimates assume that kl and ku are much less than min(m,n).
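The band storage rule ab(kl + ku + 1 + i - j, j) = a(i,j) can be sketched in C as follows. This is only an illustration, assuming column-major arrays and real double-precision data; the helper names band_offset, in_band, and dense_to_band are hypothetical and are not part of Intel MKL.

```c
#include <stddef.h>

/* Hypothetical helper (not an MKL function): zero-based offset into the
   column-major band array ab (leading dimension ldab) of the element a(i,j),
   with one-based i and j, following
   ab(kl + ku + 1 + i - j, j) = a(i,j). */
size_t band_offset(int i, int j, int kl, int ku, int ldab)
{
    int row = kl + ku + 1 + i - j;              /* one-based row in ab */
    return (size_t)(j - 1) * (size_t)ldab + (size_t)(row - 1);
}

/* Returns 1 if a(i,j) lies inside the band of an m-by-n matrix with
   kl subdiagonals and ku superdiagonals, and 0 otherwise. */
int in_band(int i, int j, int m, int n, int kl, int ku)
{
    return i >= 1 && i <= m && j >= 1 && j <= n &&
           i - j <= kl && j - i <= ku;
}

/* Copies the band of a dense column-major m-by-n matrix a (leading
   dimension lda) into ab. The caller must provide ldab >= 2*kl + ku + 1.
   Rows 1 to kl of ab are left untouched because they need not be set. */
void dense_to_band(int m, int n, int kl, int ku,
                   const double *a, int lda, double *ab, int ldab)
{
    for (int j = 1; j <= n; ++j)
        for (int i = 1; i <= m; ++i)
            if (in_band(i, j, m, n, kl, ku))
                ab[band_offset(i, j, kl, ku, ldab)] =
                    a[(size_t)(j - 1) * (size_t)lda + (size_t)(i - 1)];
}
```

With kl = 2, ku = 1, and ldab = 6, the element a(1,1) lands in row 4 of column 1 of ab, which is zero-based offset 3.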
The band storage scheme is illustrated by the following example, when m = n = 6, kl = 2, ku = 1:

On entry:                         On exit:
 *    *    *    +    +    +        *    *    *   u14  u25  u36
 *    *    +    +    +    +        *    *   u13  u24  u35  u46
 *   a12  a23  a34  a45  a56       *   u12  u23  u34  u45  u56
a11  a22  a33  a44  a55  a66      u11  u22  u33  u44  u55  u66
a21  a32  a43  a54  a65   *       m21  m32  m43  m54  m65   *
a31  a42  a53  a64   *    *       m31  m42  m53  m64   *    *

Array elements marked * are not used by the routine; elements marked + need not be set on entry, but are required by the routine to store elements of U because of fill-in resulting from the row interchanges.
After calling this routine with m = n, you can call the following routines:
?gbtrs to solve A*X = B or A^T*X = B or A^H*X = B
?gbcon to estimate the condition number of A.
See Also
mkl_progress

?gttrf
Computes the LU factorization of a tridiagonal matrix.
Syntax
Fortran 77:
call sgttrf( n, dl, d, du, du2, ipiv, info )
call dgttrf( n, dl, d, du, du2, ipiv, info )
call cgttrf( n, dl, d, du, du2, ipiv, info )
call zgttrf( n, dl, d, du, du2, ipiv, info )
Fortran 95:
call gttrf( dl, d, du, du2 [, ipiv] [,info] )
C:
lapack_int LAPACKE_<?>gttrf( lapack_int n, <datatype>* dl, <datatype>* d, <datatype>* du, <datatype>* du2, lapack_int* ipiv );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine computes the LU factorization of a real or complex tridiagonal matrix A in the form
A = P*L*U,
where P is a permutation matrix; L is lower bidiagonal with unit diagonal elements; and U is an upper triangular matrix with nonzero elements in only the main diagonal and first two superdiagonals. The routine uses elimination with partial pivoting and row interchanges.
Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
n INTEGER. The order of the matrix A; n ≥ 0.
dl, d, du REAL for sgttrf
DOUBLE PRECISION for dgttrf
COMPLEX for cgttrf
DOUBLE COMPLEX for zgttrf.
Arrays containing elements of A.
The array dl of dimension (n - 1) contains the subdiagonal elements of A.
The array d of dimension n contains the diagonal elements of A.
The array du of dimension (n - 1) contains the superdiagonal elements of A.
Output Parameters
dl Overwritten by the (n-1) multipliers that define the matrix L from the LU factorization of A. The matrix L has unit diagonal elements, and the (n-1) elements of dl form the subdiagonal. All other elements of L are zero.
d Overwritten by the n diagonal elements of the upper triangular matrix U from the LU factorization of A.
du Overwritten by the (n-1) elements of the first superdiagonal of U.
du2 REAL for sgttrf
DOUBLE PRECISION for dgttrf
COMPLEX for cgttrf
DOUBLE COMPLEX for zgttrf.
Array, dimension (n-2). On exit, du2 contains the (n-2) elements of the second superdiagonal of U.
ipiv INTEGER. Array, dimension (n). The pivot indices: for 1 ≤ i ≤ n, row i was interchanged with row ipiv(i). ipiv(i) is always i or i+1; ipiv(i) = i indicates that a row interchange was not required.
info INTEGER. If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
If info = i, u_ii is 0. The factorization has been completed, but U is exactly singular. Division by zero will occur if you use the factor U for solving a system of linear equations.
Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine gttrf interface are as follows:
dl Holds the vector of length (n-1).
d Holds the vector of length n.
du Holds the vector of length (n-1).
du2 Holds the vector of length (n-2).
ipiv Holds the vector of length n.
Application Notes
After calling this routine, you can call the following routines:
?gttrs to solve A*X = B or A^T*X = B or A^H*X = B
?gtcon to estimate the condition number of A.

?dttrfb
Computes the factorization of a diagonally dominant tridiagonal matrix.
Syntax
Fortran 77:
call sdttrfb( n, dl, d, du, info )
call ddttrfb( n, dl, d, du, info )
call cdttrfb( n, dl, d, du, info )
call zdttrfb( n, dl, d, du, info )
Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h
Description
The ?dttrfb routine computes the factorization of a real or complex diagonally dominant tridiagonal matrix A with the BABE (Burning At Both Ends) algorithm in the form
A = L1*U*L2
where
• L1, L2 are lower bidiagonal with unit diagonal elements corresponding to the Gaussian elimination taken from both ends of the matrix.
• U is an upper triangular matrix with nonzero elements in only the main diagonal and first two superdiagonals.
Input Parameters
n INTEGER. The order of the matrix A; n ≥ 0.
dl, d, du REAL for sdttrfb
DOUBLE PRECISION for ddttrfb
COMPLEX for cdttrfb
DOUBLE COMPLEX for zdttrfb.
Arrays containing elements of A.
The array dl of dimension (n - 1) contains the subdiagonal elements of A.
The array d of dimension n contains the diagonal elements of A.
The array du of dimension (n - 1) contains the superdiagonal elements of A.
Output Parameters
dl Overwritten by the (n-1) multipliers that define the matrix L from the LU factorization of A.
d Overwritten by the n diagonal element reciprocals of the upper triangular matrix U from the factorization of A.
info INTEGER. If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
If info = i, u_ii is 0. The factorization has been completed, but U is exactly singular. Division by zero will occur if you use the factor U for solving a system of linear equations.
Application Notes
A diagonally dominant tridiagonal system is defined such that
|d_i| > |dl_(i-1)| + |du_i| for any i: 1 < i < n,
and |d_1| > |du_1|, |d_n| > |dl_(n-1)|.
The underlying BABE algorithm is designed for diagonally dominant systems.
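The diagonal dominance condition stated above can be checked before calling ?dttrfb. The following C sketch does this for real double-precision data. It is an illustration only: the function name is_diag_dominant is hypothetical and not part of Intel MKL, and the treatment of n ≤ 1 is an assumption, since the definition above covers only n ≥ 2.

```c
/* Absolute value without relying on the math library. */
static double abs_val(double x) { return x < 0.0 ? -x : x; }

/* Hypothetical helper (not an MKL routine). Returns 1 if the tridiagonal
   matrix with subdiagonal dl (n-1 elements), diagonal d (n elements) and
   superdiagonal du (n-1 elements) satisfies
     |d_i| > |dl_(i-1)| + |du_i|  for 1 < i < n,
     |d_1| > |du_1|,  |d_n| > |dl_(n-1)|,
   and 0 otherwise. C arrays are zero-based, so d[k] holds d_(k+1). */
int is_diag_dominant(int n, const double *dl, const double *d,
                     const double *du)
{
    if (n <= 0)
        return 1;                       /* assumption: empty matrix passes */
    if (n == 1)
        return abs_val(d[0]) > 0.0;     /* assumption: nonzero 1-by-1 passes */
    if (!(abs_val(d[0]) > abs_val(du[0])))
        return 0;                       /* first row */
    if (!(abs_val(d[n - 1]) > abs_val(dl[n - 2])))
        return 0;                       /* last row */
    for (int k = 1; k < n - 1; ++k)     /* interior rows */
        if (!(abs_val(d[k]) > abs_val(dl[k - 1]) + abs_val(du[k])))
            return 0;
    return 1;
}
```

If the check fails, the general routine ?gttrf, which uses partial pivoting, is the safer choice.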
Such systems are free from the numerical stability issue, unlike general systems, which require elimination with partial pivoting (see ?gttrf). Diagonally dominant systems can therefore be factored much faster than general ones.
NOTE
• The current implementation of BABE has a potential accuracy issue on very small or large data close to the underflow or overflow threshold respectively. Scale the matrix before applying the solver in the case of such input data.
• Applying the ?dttrfb factorization to non-diagonally dominant systems may lead to an accuracy loss, or to a false singularity being detected, because no pivoting is performed.

?potrf
Computes the Cholesky factorization of a symmetric (Hermitian) positive-definite matrix.
Syntax
Fortran 77:
call spotrf( uplo, n, a, lda, info )
call dpotrf( uplo, n, a, lda, info )
call cpotrf( uplo, n, a, lda, info )
call zpotrf( uplo, n, a, lda, info )
Fortran 95:
call potrf( a [, uplo] [,info] )
C:
lapack_int LAPACKE_<?>potrf( int matrix_order, char uplo, lapack_int n, <datatype>* a, lapack_int lda );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine forms the Cholesky factorization of a symmetric positive-definite or, for complex data, Hermitian positive-definite matrix A:
A = U^T*U for real data, A = U^H*U for complex data, if uplo='U'
A = L*L^T for real data, A = L*L^H for complex data, if uplo='L'
where L is a lower triangular matrix and U is upper triangular.
NOTE This routine supports the Progress Routine feature. See Progress Function section for details.
Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'.
Indicates whether the upper or lower triangular part of A is stored and how A is factored:
If uplo = 'U', the array a stores the upper triangular part of the matrix A.
If uplo = 'L', the array a stores the lower triangular part of the matrix A.
n INTEGER. The order of matrix A; n ≥ 0.
a REAL for spotrf
DOUBLE PRECISION for dpotrf
COMPLEX for cpotrf
DOUBLE COMPLEX for zpotrf.
Array, DIMENSION (lda,*). The array a contains either the upper or the lower triangular part of the matrix A (see uplo). The second dimension of a must be at least max(1, n).
lda INTEGER. The leading dimension of a.
Output Parameters
a The upper or lower triangular part of a is overwritten by the Cholesky factor U or L, as specified by uplo.
info INTEGER. If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
If info = i, the leading minor of order i (and therefore the matrix A itself) is not positive-definite, and the factorization could not be completed. This may indicate an error in forming the matrix A.
Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine potrf interface are as follows:
a Holds the matrix A of size (n, n).
uplo Must be 'U' or 'L'. The default value is 'U'.
Application Notes
If uplo = 'U', the computed factor U is the exact factor of a perturbed matrix A + E, where
|E| ≤ c(n) ε |U^H||U|,
c(n) is a modest linear function of n, and ε is the machine precision.
A similar estimate holds for uplo = 'L'.
The total number of floating-point operations is approximately (1/3)n^3 for real flavors or (4/3)n^3 for complex flavors.
After calling this routine, you can call the following routines:
?potrs to solve A*X = B
?pocon to estimate the condition number of A
?potri to compute the inverse of A.
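One way to sanity-check a factor obtained with uplo = 'L' is to rebuild L*L^T and compare it with the original matrix. The C sketch below does this for real double-precision data in column-major storage. The function name cholesky_residual is hypothetical and not part of Intel MKL; the code only reads the lower triangles, mirroring how ?potrf leaves the strictly upper part unreferenced.

```c
#include <stddef.h>

/* Hypothetical helper (not an MKL routine). l holds the Cholesky factor L
   in its lower triangle (column-major, leading dimension ldl), as left by
   ?potrf with uplo = 'L'. a holds the original matrix A in its lower
   triangle (column-major, leading dimension lda). Returns the largest
   absolute difference between a(i,j) and (L*L^T)(i,j) for i >= j. */
double cholesky_residual(int n, const double *l, int ldl,
                         const double *a, int lda)
{
    double worst = 0.0;
    for (int j = 0; j < n; ++j) {
        for (int i = j; i < n; ++i) {
            double sum = 0.0;
            /* L is lower triangular, so only k <= j contributes when i >= j */
            for (int k = 0; k <= j; ++k)
                sum += l[(size_t)k * ldl + i] * l[(size_t)k * ldl + j];
            double diff = a[(size_t)j * lda + i] - sum;
            if (diff < 0.0)
                diff = -diff;
            if (diff > worst)
                worst = diff;
        }
    }
    return worst;
}
```

For a well-conditioned matrix the returned value is expected to be a small multiple of the machine precision times the size of the entries of A, in line with the perturbation estimate given in the Application Notes.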
See Also
mkl_progress

?pstrf
Computes the Cholesky factorization with complete pivoting of a real symmetric (complex Hermitian) positive semidefinite matrix.
Syntax
Fortran 77:
call spstrf( uplo, n, a, lda, piv, rank, tol, work, info )
call dpstrf( uplo, n, a, lda, piv, rank, tol, work, info )
call cpstrf( uplo, n, a, lda, piv, rank, tol, work, info )
call zpstrf( uplo, n, a, lda, piv, rank, tol, work, info )
C:
lapack_int LAPACKE_spstrf( int matrix_order, char uplo, lapack_int n, float* a, lapack_int lda, lapack_int* piv, lapack_int* rank, float tol );
lapack_int LAPACKE_dpstrf( int matrix_order, char uplo, lapack_int n, double* a, lapack_int lda, lapack_int* piv, lapack_int* rank, double tol );
lapack_int LAPACKE_cpstrf( int matrix_order, char uplo, lapack_int n, lapack_complex_float* a, lapack_int lda, lapack_int* piv, lapack_int* rank, float tol );
lapack_int LAPACKE_zpstrf( int matrix_order, char uplo, lapack_int n, lapack_complex_double* a, lapack_int lda, lapack_int* piv, lapack_int* rank, double tol );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine computes the Cholesky factorization with complete pivoting of a real symmetric (complex Hermitian) positive semidefinite matrix. The form of the factorization is:
P^T * A * P = U^T * U, if uplo = 'U' for real flavors,
P^H * A * P = U^H * U, if uplo = 'U' for complex flavors,
P^T * A * P = L * L^T, if uplo = 'L' for real flavors,
P^H * A * P = L * L^H, if uplo = 'L' for complex flavors,
where P is stored as the vector piv, and U and L are upper and lower triangular matrices respectively.
This algorithm does not attempt to check that A is positive semidefinite. This version of the algorithm calls level 3 BLAS.
Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above.
See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'.
Indicates whether the upper or lower triangular part of A is stored:
If uplo = 'U', the array a stores the upper triangular part of the matrix A, and the strictly lower triangular part of the matrix is not referenced.
If uplo = 'L', the array a stores the lower triangular part of the matrix A, and the strictly upper triangular part of the matrix is not referenced.
n INTEGER. The order of matrix A; n ≥ 0.
a, work REAL for spstrf
DOUBLE PRECISION for dpstrf
COMPLEX for cpstrf
DOUBLE COMPLEX for zpstrf.
Array a, DIMENSION (lda,*). The array a contains either the upper or the lower triangular part of the matrix A (see uplo). The second dimension of a must be at least max(1, n).
work(*) is a workspace array. The dimension of work is at least max(1,2*n).
tol REAL for single precision flavors
DOUBLE PRECISION for double precision flavors.
User-defined tolerance. If tol < 0, then n*U*max(a(k,k)) will be used. The algorithm terminates at the (k-1)-th step if the pivot ≤ tol.
lda INTEGER. The leading dimension of a; at least max(1, n).
Output Parameters
a If info = 0, the factor U or L from the Cholesky factorization is as described in Description.
piv INTEGER. Array, DIMENSION at least max(1, n). The array piv is such that the nonzero entries of P are P(piv(k),k) = 1.
rank INTEGER. The rank of A, given by the number of steps the algorithm completed.
info INTEGER. If info = 0, the execution is successful.
If info = -k, the k-th argument had an illegal value.
If info > 0, the matrix A is either rank deficient with a computed rank as returned in rank, or is indefinite.

?pftrf
Computes the Cholesky factorization of a symmetric (Hermitian) positive-definite matrix using the Rectangular Full Packed (RFP) format.
Syntax
Fortran 77:
call spftrf( transr, uplo, n, a, info )
call dpftrf( transr, uplo, n, a, info )
call cpftrf( transr, uplo, n, a, info )
call zpftrf( transr, uplo, n, a, info )
C:
lapack_int LAPACKE_<?>pftrf( int matrix_order, char transr, char uplo, lapack_int n, <datatype>* a );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine forms the Cholesky factorization of a symmetric positive-definite or, for complex data, a Hermitian positive-definite matrix A:
A = U^T*U for real data, A = U^H*U for complex data, if uplo='U'
A = L*L^T for real data, A = L*L^H for complex data, if uplo='L'
where L is a lower triangular matrix and U is upper triangular.
The matrix A is in the Rectangular Full Packed (RFP) format. For the description of the RFP format, see Matrix Storage Schemes.
This is the block version of the algorithm, calling Level 3 BLAS.
Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
transr CHARACTER*1. Must be 'N', 'T' (for real data) or 'C' (for complex data).
If transr = 'N', the Normal form of RFP A is stored.
If transr = 'T', the Transpose form of RFP A is stored.
If transr = 'C', the Conjugate-Transpose form of RFP A is stored.
uplo CHARACTER*1. Must be 'U' or 'L'.
Indicates whether the upper or lower triangular part of A is stored:
If uplo = 'U', the array a stores the upper triangular part of the matrix A.
If uplo = 'L', the array a stores the lower triangular part of the matrix A.
n INTEGER. The order of the matrix A; n ≥ 0.
a REAL for spftrf
DOUBLE PRECISION for dpftrf
COMPLEX for cpftrf
DOUBLE COMPLEX for zpftrf.
Array, DIMENSION (n*(n+1)/2). The array a contains the matrix A in the RFP format.
Output Parameters
a The upper or lower triangular part of a is overwritten by the Cholesky factor U or L, as specified by uplo.
info INTEGER. If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
If info = i, the leading minor of order i (and therefore the matrix A itself) is not positive-definite, and the factorization could not be completed. This may indicate an error in forming the matrix A.

?pptrf
Computes the Cholesky factorization of a symmetric (Hermitian) positive-definite matrix using packed storage.
Syntax
Fortran 77:
call spptrf( uplo, n, ap, info )
call dpptrf( uplo, n, ap, info )
call cpptrf( uplo, n, ap, info )
call zpptrf( uplo, n, ap, info )
Fortran 95:
call pptrf( ap [, uplo] [,info] )
C:
lapack_int LAPACKE_<?>pptrf( int matrix_order, char uplo, lapack_int n, <datatype>* ap );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine forms the Cholesky factorization of a symmetric positive-definite or, for complex data, Hermitian positive-definite packed matrix A:
A = U^T*U for real data, A = U^H*U for complex data, if uplo='U'
A = L*L^T for real data, A = L*L^H for complex data, if uplo='L'
where L is a lower triangular matrix and U is upper triangular.
NOTE This routine supports the Progress Routine feature. See Progress Function section for details.
Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'.
Indicates whether the upper or lower triangular part of A is packed in the array ap, and how A is factored:
If uplo = 'U', the array ap stores the upper triangular part of the matrix A, and A is factored as U^H*U.
If uplo = 'L', the array ap stores the lower triangular part of the matrix A; A is factored as L*L^H.
n INTEGER. The order of matrix A; n ≥ 0.
ap REAL for spptrf
DOUBLE PRECISION for dpptrf
COMPLEX for cpptrf
DOUBLE COMPLEX for zpptrf.
Array, DIMENSION at least max(1, n(n+1)/2). The array ap contains either the upper or the lower triangular part of the matrix A (as specified by uplo) in packed storage (see Matrix Storage Schemes).
Output Parameters
ap The upper or lower triangular part of A in packed storage is overwritten by the Cholesky factor U or L, as specified by uplo.
info INTEGER. If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
If info = i, the leading minor of order i (and therefore the matrix A itself) is not positive-definite, and the factorization could not be completed. This may indicate an error in forming the matrix A.
Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine pptrf interface are as follows:
ap Holds the array A of size (n*(n+1)/2).
uplo Must be 'U' or 'L'. The default value is 'U'.
Application Notes
If uplo = 'U', the computed factor U is the exact factor of a perturbed matrix A + E, where
|E| ≤ c(n) ε |U^H||U|,
c(n) is a modest linear function of n, and ε is the machine precision.
A similar estimate holds for uplo = 'L'.
The total number of floating-point operations is approximately (1/3)n^3 for real flavors and (4/3)n^3 for complex flavors.
After calling this routine, you can call the following routines:
?pptrs to solve A*X = B
?ppcon to estimate the condition number of A
?pptri to compute the inverse of A.
See Also
mkl_progress

?pbtrf
Computes the Cholesky factorization of a symmetric (Hermitian) positive-definite band matrix.
Syntax
Fortran 77:
call spbtrf( uplo, n, kd, ab, ldab, info )
call dpbtrf( uplo, n, kd, ab, ldab, info )
call cpbtrf( uplo, n, kd, ab, ldab, info )
call zpbtrf( uplo, n, kd, ab, ldab, info )
Fortran 95:
call pbtrf( ab [, uplo] [,info] )
C:
lapack_int LAPACKE_<?>pbtrf( int matrix_order, char uplo, lapack_int n, lapack_int kd, <datatype>* ab, lapack_int ldab );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine forms the Cholesky factorization of a symmetric positive-definite or, for complex data, Hermitian positive-definite band matrix A:
A = U^T*U for real data, A = U^H*U for complex data, if uplo='U'
A = L*L^T for real data, A = L*L^H for complex data, if uplo='L'
where L is a lower triangular matrix and U is upper triangular.
NOTE This routine supports the Progress Routine feature. See Progress Function section for details.
Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'.
Indicates whether the upper or lower triangular part of A is stored in the array ab, and how A is factored:
If uplo = 'U', the upper triangle of A is stored.
If uplo = 'L', the lower triangle of A is stored.
n INTEGER. The order of matrix A; n ≥ 0.
kd INTEGER. The number of superdiagonals or subdiagonals in the matrix A; kd ≥ 0.
ab REAL for spbtrf
DOUBLE PRECISION for dpbtrf
COMPLEX for cpbtrf
DOUBLE COMPLEX for zpbtrf.
Array, DIMENSION (ldab,*). The array ab contains either the upper or the lower triangular part of the matrix A (as specified by uplo) in band storage (see Matrix Storage Schemes). The second dimension of ab must be at least max(1, n).
ldab INTEGER. The leading dimension of the array ab.
(ldab ≥ kd + 1)
Output Parameters
ab The upper or lower triangular part of A (in band storage) is overwritten by the Cholesky factor U or L, as specified by uplo.
info INTEGER. If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
If info = i, the leading minor of order i (and therefore the matrix A itself) is not positive-definite, and the factorization could not be completed. This may indicate an error in forming the matrix A.
Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine pbtrf interface are as follows:
ab Holds the array A of size (kd+1,n).
uplo Must be 'U' or 'L'. The default value is 'U'.
Application Notes
If uplo = 'U', the computed factor U is the exact factor of a perturbed matrix A + E, where
|E| ≤ c(kd + 1) ε |U^H||U|,
c(n) is a modest linear function of n, and ε is the machine precision.
A similar estimate holds for uplo = 'L'.
The total number of floating-point operations for real flavors is approximately n(kd+1)^2. The number of operations for complex flavors is 4 times greater. All these estimates assume that kd is much less than n.
After calling this routine, you can call the following routines:
?pbtrs to solve A*X = B
?pbcon to estimate the condition number of A.
See Also
mkl_progress

?pttrf
Computes the factorization of a symmetric (Hermitian) positive-definite tridiagonal matrix.
Syntax
Fortran 77:
call spttrf( n, d, e, info )
call dpttrf( n, d, e, info )
call cpttrf( n, d, e, info )
call zpttrf( n, d, e, info )
Fortran 95:
call pttrf( d, e [,info] )
C:
lapack_int LAPACKE_spttrf( lapack_int n, float* d, float* e );
lapack_int LAPACKE_dpttrf( lapack_int n, double* d, double* e );
lapack_int LAPACKE_cpttrf( lapack_int n, float* d, lapack_complex_float* e );
lapack_int LAPACKE_zpttrf( lapack_int n, double* d, lapack_complex_double* e );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine forms the factorization of a symmetric positive-definite or, for complex data, Hermitian positive-definite tridiagonal matrix A:
A = L*D*L^T for real flavors, or
A = L*D*L^H for complex flavors,
where D is diagonal and L is unit lower bidiagonal. The factorization may also be regarded as having the form
A = U^T*D*U for real flavors, or
A = U^H*D*U for complex flavors,
where U is unit upper bidiagonal.
Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
n INTEGER. The order of the matrix A; n ≥ 0.
d REAL for spttrf, cpttrf
DOUBLE PRECISION for dpttrf, zpttrf.
Array, dimension (n). Contains the diagonal elements of A.
e REAL for spttrf
DOUBLE PRECISION for dpttrf
COMPLEX for cpttrf
DOUBLE COMPLEX for zpttrf.
Array, dimension (n-1). Contains the subdiagonal elements of A.
Output Parameters
d Overwritten by the n diagonal elements of the diagonal matrix D from the L*D*L^T (for real flavors) or L*D*L^H (for complex flavors) factorization of A.
e Overwritten by the (n - 1) off-diagonal elements of the unit bidiagonal factor L or U from the factorization of A.
info INTEGER. If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
If info = i, the leading minor of order i (and therefore the matrix A itself) is not positive-definite; if i < n, the factorization could not be completed, while if i = n, the factorization was completed, but d(n) = 0.
Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine pttrf interface are as follows:
d Holds the vector of length n.
e Holds the vector of length (n-1).

?sytrf
Computes the Bunch-Kaufman factorization of a symmetric matrix.
Syntax
Fortran 77:
call ssytrf( uplo, n, a, lda, ipiv, work, lwork, info )
call dsytrf( uplo, n, a, lda, ipiv, work, lwork, info )
call csytrf( uplo, n, a, lda, ipiv, work, lwork, info )
call zsytrf( uplo, n, a, lda, ipiv, work, lwork, info )
Fortran 95:
call sytrf( a [, uplo] [,ipiv] [,info] )
C:
lapack_int LAPACKE_<?>sytrf( int matrix_order, char uplo, lapack_int n, <datatype>* a, lapack_int lda, lapack_int* ipiv );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine computes the factorization of a real/complex symmetric matrix A using the Bunch-Kaufman diagonal pivoting method. The form of the factorization is:
if uplo='U', A = P*U*D*U^T*P^T
if uplo='L', A = P*L*D*L^T*P^T,
where A is the input matrix, P is a permutation matrix, U and L are upper and lower triangular matrices with unit diagonal, and D is a symmetric block-diagonal matrix with 1-by-1 and 2-by-2 diagonal blocks. U and L have 2-by-2 unit diagonal blocks corresponding to the 2-by-2 blocks of D.
NOTE This routine supports the Progress Routine feature. See Progress Routine section for details.
Input Parameters
The data types are given for the Fortran interface.
A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'.
Indicates whether the upper or lower triangular part of A is stored and how A is factored:
If uplo = 'U', the array a stores the upper triangular part of the matrix A, and A is factored as P*U*D*U^T*P^T.
If uplo = 'L', the array a stores the lower triangular part of the matrix A, and A is factored as P*L*D*L^T*P^T.
n INTEGER. The order of matrix A; n ≥ 0.
a REAL for ssytrf
DOUBLE PRECISION for dsytrf
COMPLEX for csytrf
DOUBLE COMPLEX for zsytrf.
Array, DIMENSION (lda,*). The array a contains either the upper or the lower triangular part of the matrix A (see uplo). The second dimension of a must be at least max(1, n).
lda INTEGER. The leading dimension of a; at least max(1, n).
work Same type as a. A workspace array, dimension at least max(1,lwork).
lwork INTEGER. The size of the work array (lwork ≥ n).
If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla.
See Application Notes for the suggested value of lwork.
Output Parameters
a The upper or lower triangular part of a is overwritten by details of the block-diagonal matrix D and the multipliers used to obtain the factor U (or L).
work(1) If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs.
ipiv INTEGER. Array, DIMENSION at least max(1, n). Contains details of the interchanges and the block structure of D.
If ipiv(i) = k > 0, then d_ii is a 1-by-1 block, and the i-th row and column of A was interchanged with the k-th row and column.
If uplo = 'U' and ipiv(i) = ipiv(i-1) = -m < 0, then D has a 2-by-2 block in rows/columns i and i-1, and the (i-1)-th row and column of A was interchanged with the m-th row and column.
If uplo = 'L' and ipiv(i) = ipiv(i+1) = -m < 0, then D has a 2-by-2 block in rows/columns i and i+1, and the (i+1)-th row and column of A was interchanged with the m-th row and column.

info    INTEGER. If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
If info = i, d_ii is 0. The factorization has been completed, but D is exactly singular. Division by 0 will occur if you use D for solving a system of linear equations.

Fortran 95 Interface Notes

Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.

Specific details for the routine sytrf interface are as follows:

a    Holds the matrix A of size (n, n).
ipiv    Holds the vector of length n.
uplo    Must be 'U' or 'L'. The default value is 'U'.

Application Notes

For better performance, try using lwork = n*blocksize, where blocksize is a machine-dependent value (typically, 16 to 64) required for optimum performance of the blocked algorithm.

If you are in doubt how much workspace to supply, use a generous value of lwork for the first run or set lwork = -1.

If you choose the first option and supply any admissible lwork that is no less than the minimal value described, the routine completes the task, though possibly more slowly than with the recommended workspace, and returns the recommended workspace size in the first element of the array work on exit. Use this value (work(1)) for subsequent runs.

If you set lwork = -1, the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work).
This operation is called a workspace query.

Note that if you set lwork to less than the minimal required value and not -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.

The 2-by-2 unit diagonal blocks and the unit diagonal elements of U and L are not stored. The remaining elements of U and L are stored in the corresponding columns of the array a, but additional row interchanges are required to recover U or L explicitly (which is seldom necessary).

If ipiv(i) = i for all i = 1...n, then all off-diagonal elements of U (L) are stored explicitly in the corresponding elements of the array a.

If uplo = 'U', the computed factors U and D are the exact factors of a perturbed matrix A + E, where

|E| ≤ c(n)ε P|U||D||U^T|P^T

c(n) is a modest linear function of n, and ε is the machine precision. A similar estimate holds for the computed L and D when uplo = 'L'.

The total number of floating-point operations is approximately (1/3)n^3 for real flavors or (4/3)n^3 for complex flavors.

After calling this routine, you can call the following routines:

?sytrs    to solve A*X = B
?sycon    to estimate the condition number of A
?sytri    to compute the inverse of A.

If uplo = 'U', then A = U*D*U', where

U = P(n)*U(n)* ... *P(k)*U(k)*...,

that is, U is a product of terms P(k)*U(k), where
• k decreases from n to 1 in steps of 1 or 2.
• D is a block diagonal matrix with 1-by-1 and 2-by-2 diagonal blocks D(k).
• P(k) is a permutation matrix as defined by ipiv(k).
• U(k) is a unit upper triangular matrix, such that if the diagonal block D(k) is of order s (s = 1 or 2), then

         (  I    v    0  )   k-s
U(k) =   (  0    I    0  )   s
         (  0    0    I  )   n-k
           k-s   s   n-k

If s = 1, D(k) overwrites A(k,k), and v overwrites A(1:k-1,k).
If s = 2, the upper triangle of D(k) overwrites A(k-1,k-1), A(k-1,k) and A(k,k), and v overwrites A(1:k-2,k-1:k).

If uplo = 'L', then A = L*D*L', where

L = P(1)*L(1)* ... *P(k)*L(k)*...,

that is, L is a product of terms P(k)*L(k), where
• k increases from 1 to n in steps of 1 or 2.
• D is a block diagonal matrix with 1-by-1 and 2-by-2 diagonal blocks D(k).
• P(k) is a permutation matrix as defined by ipiv(k).
• L(k) is a unit lower triangular matrix, such that if the diagonal block D(k) is of order s (s = 1 or 2), then

         (  I    0     0  )   k-1
L(k) =   (  0    I     0  )   s
         (  0    v     I  )   n-k-s+1
           k-1   s  n-k-s+1

If s = 1, D(k) overwrites A(k,k), and v overwrites A(k+1:n,k).
If s = 2, the lower triangle of D(k) overwrites A(k,k), A(k+1,k), and A(k+1,k+1), and v overwrites A(k+2:n,k:k+1).

See Also
mkl_progress

?hetrf
Computes the Bunch-Kaufman factorization of a complex Hermitian matrix.

Syntax

Fortran 77:
call chetrf( uplo, n, a, lda, ipiv, work, lwork, info )
call zhetrf( uplo, n, a, lda, ipiv, work, lwork, info )

Fortran 95:
call hetrf( a [, uplo] [,ipiv] [,info] )

C:
lapack_int LAPACKE_<?>hetrf( int matrix_order, char uplo, lapack_int n, <datatype>* a, lapack_int lda, lapack_int* ipiv );

Include Files

• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description

The routine computes the factorization of a complex Hermitian matrix A using the Bunch-Kaufman diagonal pivoting method:

if uplo='U', A = P*U*D*U^H*P^T
if uplo='L', A = P*L*D*L^H*P^T,

where A is the input matrix, P is a permutation matrix, U and L are upper and lower triangular matrices with unit diagonal, and D is a Hermitian block-diagonal matrix with 1-by-1 and 2-by-2 diagonal blocks. U and L have 2-by-2 unit diagonal blocks corresponding to the 2-by-2 blocks of D.

NOTE This routine supports the Progress Routine feature. See Progress Routine section for details.

Input Parameters

The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

uplo    CHARACTER*1. Must be 'U' or 'L'.
Indicates whether the upper or lower triangular part of A is stored and how A is factored:
If uplo = 'U', the array a stores the upper triangular part of the matrix A, and A is factored as P*U*D*U^H*P^T.
If uplo = 'L', the array a stores the lower triangular part of the matrix A, and A is factored as P*L*D*L^H*P^T.

n    INTEGER. The order of matrix A; n ≥ 0.

a, work    COMPLEX for chetrf
DOUBLE COMPLEX for zhetrf.
Arrays, DIMENSION a(lda,*), work(*).
The array a contains the upper or the lower triangular part of the matrix A (see uplo). The second dimension of a must be at least max(1, n).
work(*) is a workspace array of dimension at least max(1, lwork).

lda    INTEGER. The leading dimension of a; at least max(1, n).

lwork    INTEGER. The size of the work array (lwork ≥ n).
If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla.
See Application Notes for the suggested value of lwork.

Output Parameters

a    The upper or lower triangular part of a is overwritten by details of the block-diagonal matrix D and the multipliers used to obtain the factor U (or L).

work(1)    If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs.

ipiv    INTEGER.
Array, DIMENSION at least max(1, n). Contains details of the interchanges and the block structure of D.
If ipiv(i) = k > 0, then d_ii is a 1-by-1 block, and the i-th row and column of A was interchanged with the k-th row and column.
If uplo = 'U' and ipiv(i) = ipiv(i-1) = -m < 0, then D has a 2-by-2 block in rows/columns i and i-1, and the (i-1)-th row and column of A was interchanged with the m-th row and column.
If uplo = 'L' and ipiv(i) = ipiv(i+1) = -m < 0, then D has a 2-by-2 block in rows/columns i and i+1, and the (i+1)-th row and column of A was interchanged with the m-th row and column.

info    INTEGER. If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
If info = i, d_ii is 0. The factorization has been completed, but D is exactly singular. Division by 0 will occur if you use D for solving a system of linear equations.

Fortran 95 Interface Notes

Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.

Specific details for the routine hetrf interface are as follows:

a    Holds the matrix A of size (n, n).
ipiv    Holds the vector of length n.
uplo    Must be 'U' or 'L'. The default value is 'U'.

Application Notes

This routine is suitable for Hermitian matrices that are not known to be positive-definite. If A is in fact positive-definite, the routine does not perform interchanges, and no 2-by-2 diagonal blocks occur in D.

For better performance, try using lwork = n*blocksize, where blocksize is a machine-dependent value (typically, 16 to 64) required for optimum performance of the blocked algorithm.

If you are in doubt how much workspace to supply, use a generous value of lwork for the first run or set lwork = -1.

If you choose the first option and supply any admissible lwork that is no less than the minimal value described, the routine completes the task, though possibly more slowly than with the recommended workspace, and returns the recommended workspace size in the first element of the array work on exit. Use this value (work(1)) for subsequent runs.

If you set lwork = -1, the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work). This operation is called a workspace query.
Note that if you set lwork to less than the minimal required value and not -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.

The 2-by-2 unit diagonal blocks and the unit diagonal elements of U and L are not stored. The remaining elements of U and L are stored in the corresponding columns of the array a, but additional row interchanges are required to recover U or L explicitly (which is seldom necessary).

If ipiv(i) = i for all i = 1...n, then all off-diagonal elements of U (L) are stored explicitly in the corresponding elements of the array a.

If uplo = 'U', the computed factors U and D are the exact factors of a perturbed matrix A + E, where

|E| ≤ c(n)ε P|U||D||U^T|P^T

c(n) is a modest linear function of n, and ε is the machine precision. A similar estimate holds for the computed L and D when uplo = 'L'.

The total number of floating-point operations is approximately (4/3)n^3.

After calling this routine, you can call the following routines:

?hetrs    to solve A*X = B
?hecon    to estimate the condition number of A
?hetri    to compute the inverse of A.

See Also
mkl_progress

?sptrf
Computes the Bunch-Kaufman factorization of a symmetric matrix using packed storage.

Syntax

Fortran 77:
call ssptrf( uplo, n, ap, ipiv, info )
call dsptrf( uplo, n, ap, ipiv, info )
call csptrf( uplo, n, ap, ipiv, info )
call zsptrf( uplo, n, ap, ipiv, info )

Fortran 95:
call sptrf( ap [,uplo] [,ipiv] [,info] )

C:
lapack_int LAPACKE_<?>sptrf( int matrix_order, char uplo, lapack_int n, <datatype>* ap, lapack_int* ipiv );

Include Files

• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description

The routine computes the factorization of a real/complex symmetric matrix A stored in the packed format using the Bunch-Kaufman diagonal pivoting method.
The form of the factorization is:

if uplo='U', A = P*U*D*U^T*P^T
if uplo='L', A = P*L*D*L^T*P^T,

where P is a permutation matrix, U and L are upper and lower triangular matrices with unit diagonal, and D is a symmetric block-diagonal matrix with 1-by-1 and 2-by-2 diagonal blocks. U and L have 2-by-2 unit diagonal blocks corresponding to the 2-by-2 blocks of D.

NOTE This routine supports the Progress Routine feature. See Progress Function section for details.

Input Parameters

The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

uplo    CHARACTER*1. Must be 'U' or 'L'.
Indicates whether the upper or lower triangular part of A is packed in the array ap and how A is factored:
If uplo = 'U', the array ap stores the upper triangular part of the matrix A, and A is factored as P*U*D*U^T*P^T.
If uplo = 'L', the array ap stores the lower triangular part of the matrix A, and A is factored as P*L*D*L^T*P^T.

n    INTEGER. The order of matrix A; n ≥ 0.

ap    REAL for ssptrf
DOUBLE PRECISION for dsptrf
COMPLEX for csptrf
DOUBLE COMPLEX for zsptrf.
Array, DIMENSION at least max(1, n(n+1)/2). The array ap contains the upper or the lower triangular part of the matrix A (as specified by uplo) in packed storage (see Matrix Storage Schemes).

Output Parameters

ap    The upper or lower triangle of A (as specified by uplo) is overwritten by details of the block-diagonal matrix D and the multipliers used to obtain the factor U (or L).

ipiv    INTEGER.
Array, DIMENSION at least max(1, n). Contains details of the interchanges and the block structure of D.
If ipiv(i) = k > 0, then d_ii is a 1-by-1 block, and the i-th row and column of A was interchanged with the k-th row and column.
If uplo = 'U' and ipiv(i) = ipiv(i-1) = -m < 0, then D has a 2-by-2 block in rows/columns i and i-1, and the (i-1)-th row and column of A was interchanged with the m-th row and column.
If uplo = 'L' and ipiv(i) = ipiv(i+1) = -m < 0, then D has a 2-by-2 block in rows/columns i and i+1, and the (i+1)-th row and column of A was interchanged with the m-th row and column.

info    INTEGER. If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
If info = i, d_ii is 0. The factorization has been completed, but D is exactly singular. Division by 0 will occur if you use D for solving a system of linear equations.

Fortran 95 Interface Notes

Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.

Specific details for the routine sptrf interface are as follows:

ap    Holds the array A of size (n*(n+1)/2).
ipiv    Holds the vector of length n.
uplo    Must be 'U' or 'L'. The default value is 'U'.

Application Notes

The 2-by-2 unit diagonal blocks and the unit diagonal elements of U and L are not stored. The remaining elements of U and L overwrite elements of the corresponding columns of the matrix A, but additional row interchanges are required to recover U or L explicitly (which is seldom necessary).

If ipiv(i) = i for all i = 1...n, then all off-diagonal elements of U (L) are stored explicitly in packed form.

If uplo = 'U', the computed factors U and D are the exact factors of a perturbed matrix A + E, where

|E| ≤ c(n)ε P|U||D||U^T|P^T

c(n) is a modest linear function of n, and ε is the machine precision. A similar estimate holds for the computed L and D when uplo = 'L'.

The total number of floating-point operations is approximately (1/3)n^3 for real flavors or (4/3)n^3 for complex flavors.
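
The ipiv encoding described under Output Parameters can be decoded with a short scan. The C sketch below is illustrative only: the helper name count_d_blocks is hypothetical and is not part of Intel MKL or LAPACKE, and it assumes that ipiv holds 1-based values as in the Fortran interface. Because a 2-by-2 block of D is marked by two consecutive equal negative entries for either value of uplo, a single top-down scan is assumed to be sufficient.

```c
#include <assert.h>

/* Hypothetical helper (not an MKL routine): classify the diagonal blocks of D
   from a pivot vector produced by ?sytrf, ?hetrf, ?sptrf, or ?hptrf.
   ipiv[0..n-1] holds ipiv(1..n) with 1-based values.
   On success, stores the number of 1-by-1 blocks in *n1x1 and returns the
   number of 2-by-2 blocks; returns -1 if ipiv is not a valid encoding. */
int count_d_blocks(int n, const int *ipiv, int *n1x1)
{
    assert(n >= 0 && n1x1 != 0);
    int n2x2 = 0;
    int i = 0;
    *n1x1 = 0;
    while (i < n) {
        if (ipiv[i] > 0) {
            /* Positive entry: 1-by-1 block at position i. */
            ++*n1x1;
            ++i;
        } else if (ipiv[i] < 0 && i + 1 < n && ipiv[i + 1] == ipiv[i]) {
            /* Two equal negative entries: one 2-by-2 block. */
            ++n2x2;
            i += 2;
        } else {
            /* Zero entry or unpaired negative entry. */
            return -1;
        }
    }
    return n2x2;
}
```

A result of 0 with *n1x1 equal to n means that D is purely diagonal, which is the situation the Application Notes describe for matrices that need no 2-by-2 pivots.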
After calling this routine, you can call the following routines:

?sptrs    to solve A*X = B
?spcon    to estimate the condition number of A
?sptri    to compute the inverse of A.

See Also
mkl_progress

?hptrf
Computes the Bunch-Kaufman factorization of a complex Hermitian matrix using packed storage.

Syntax

Fortran 77:
call chptrf( uplo, n, ap, ipiv, info )
call zhptrf( uplo, n, ap, ipiv, info )

Fortran 95:
call hptrf( ap [,uplo] [,ipiv] [,info] )

C:
lapack_int LAPACKE_<?>hptrf( int matrix_order, char uplo, lapack_int n, <datatype>* ap, lapack_int* ipiv );

Include Files

• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description

The routine computes the factorization of a complex Hermitian packed matrix A using the Bunch-Kaufman diagonal pivoting method:

if uplo='U', A = P*U*D*U^H*P^T
if uplo='L', A = P*L*D*L^H*P^T,

where A is the input matrix, P is a permutation matrix, U and L are upper and lower triangular matrices with unit diagonal, and D is a Hermitian block-diagonal matrix with 1-by-1 and 2-by-2 diagonal blocks. U and L have 2-by-2 unit diagonal blocks corresponding to the 2-by-2 blocks of D.

NOTE This routine supports the Progress Routine feature. See Progress Function section for details.

Input Parameters

The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

uplo    CHARACTER*1. Must be 'U' or 'L'.
Indicates whether the upper or lower triangular part of A is packed and how A is factored:
If uplo = 'U', the array ap stores the upper triangular part of the matrix A, and A is factored as P*U*D*U^H*P^T.
If uplo = 'L', the array ap stores the lower triangular part of the matrix A, and A is factored as P*L*D*L^H*P^T.

n    INTEGER. The order of matrix A; n ≥ 0.

ap    COMPLEX for chptrf
DOUBLE COMPLEX for zhptrf.
Array, DIMENSION at least max(1, n(n+1)/2). The array ap contains the upper or the lower triangular part of the matrix A (as specified by uplo) in packed storage (see Matrix Storage Schemes).

Output Parameters

ap    The upper or lower triangle of A (as specified by uplo) is overwritten by details of the block-diagonal matrix D and the multipliers used to obtain the factor U (or L).

ipiv    INTEGER.
Array, DIMENSION at least max(1, n). Contains details of the interchanges and the block structure of D.
If ipiv(i) = k > 0, then d_ii is a 1-by-1 block, and the i-th row and column of A was interchanged with the k-th row and column.
If uplo = 'U' and ipiv(i) = ipiv(i-1) = -m < 0, then D has a 2-by-2 block in rows/columns i and i-1, and the (i-1)-th row and column of A was interchanged with the m-th row and column.
If uplo = 'L' and ipiv(i) = ipiv(i+1) = -m < 0, then D has a 2-by-2 block in rows/columns i and i+1, and the (i+1)-th row and column of A was interchanged with the m-th row and column.

info    INTEGER. If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
If info = i, d_ii is 0. The factorization has been completed, but D is exactly singular. Division by 0 will occur if you use D for solving a system of linear equations.

Fortran 95 Interface Notes

Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.

Specific details for the routine hptrf interface are as follows:

ap    Holds the array A of size (n*(n+1)/2).
ipiv    Holds the vector of length n.
uplo    Must be 'U' or 'L'. The default value is 'U'.

Application Notes

The 2-by-2 unit diagonal blocks and the unit diagonal elements of U and L are not stored.
The remaining elements of U and L are stored in the corresponding columns of the packed array ap, but additional row interchanges are required to recover U or L explicitly (which is seldom necessary).

If ipiv(i) = i for all i = 1...n, then all off-diagonal elements of U (L) are stored explicitly in the corresponding elements of the array ap.

If uplo = 'U', the computed factors U and D are the exact factors of a perturbed matrix A + E, where

|E| ≤ c(n)ε P|U||D||U^T|P^T

c(n) is a modest linear function of n, and ε is the machine precision. A similar estimate holds for the computed L and D when uplo = 'L'.

The total number of floating-point operations is approximately (4/3)n^3.

After calling this routine, you can call the following routines:

?hptrs    to solve A*X = B
?hpcon    to estimate the condition number of A
?hptri    to compute the inverse of A.

See Also
mkl_progress

Routines for Solving Systems of Linear Equations

This section describes the LAPACK routines for solving systems of linear equations. Before calling most of these routines, you need to factorize the matrix of your system of equations (see Routines for Matrix Factorization in this chapter). However, the factorization is not necessary if your system of equations has a triangular matrix.

?getrs
Solves a system of linear equations with an LU-factored square matrix, with multiple right-hand sides.
Syntax

Fortran 77:
call sgetrs( trans, n, nrhs, a, lda, ipiv, b, ldb, info )
call dgetrs( trans, n, nrhs, a, lda, ipiv, b, ldb, info )
call cgetrs( trans, n, nrhs, a, lda, ipiv, b, ldb, info )
call zgetrs( trans, n, nrhs, a, lda, ipiv, b, ldb, info )

Fortran 95:
call getrs( a, ipiv, b [, trans] [,info] )

C:
lapack_int LAPACKE_<?>getrs( int matrix_order, char trans, lapack_int n, lapack_int nrhs, const <datatype>* a, lapack_int lda, const lapack_int* ipiv, <datatype>* b, lapack_int ldb );

Include Files

• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description

The routine solves for X the following systems of linear equations:

A*X = B      if trans='N',
A^T*X = B    if trans='T',
A^H*X = B    if trans='C' (for complex matrices only).

Before calling this routine, you must call ?getrf to compute the LU factorization of A.

Input Parameters

The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

trans    CHARACTER*1. Must be 'N' or 'T' or 'C'.
Indicates the form of the equations:
If trans = 'N', then A*X = B is solved for X.
If trans = 'T', then A^T*X = B is solved for X.
If trans = 'C', then A^H*X = B is solved for X.

n    INTEGER. The order of A; the number of rows in B (n ≥ 0).

nrhs    INTEGER. The number of right-hand sides; nrhs ≥ 0.

a, b    REAL for sgetrs
DOUBLE PRECISION for dgetrs
COMPLEX for cgetrs
DOUBLE COMPLEX for zgetrs.
Arrays: a(lda,*), b(ldb,*).
The array a contains the LU factorization of matrix A resulting from the call of ?getrf.
The array b contains the matrix B whose columns are the right-hand sides for the systems of equations.
The second dimension of a must be at least max(1,n), the second dimension of b at least max(1,nrhs).

lda    INTEGER. The leading dimension of a; lda ≥ max(1, n).

ldb    INTEGER.
The leading dimension of b; ldb ≥ max(1, n).

ipiv    INTEGER.
Array, DIMENSION at least max(1, n). The ipiv array, as returned by ?getrf.

Output Parameters

b    Overwritten by the solution matrix X.

info    INTEGER. If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes

Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.

Specific details for the routine getrs interface are as follows:

a    Holds the matrix A of size (n, n).
b    Holds the matrix B of size (n, nrhs).
ipiv    Holds the vector of length n.
trans    Must be 'N', 'C', or 'T'. The default value is 'N'.

Application Notes

For each right-hand side b, the computed solution is the exact solution of a perturbed system of equations (A + E)x = b, where

|E| ≤ c(n)ε P|L||U|

c(n) is a modest linear function of n, and ε is the machine precision.

If x0 is the true solution, the computed solution x satisfies this error bound:

||x - x0||∞ / ||x||∞ ≤ c(n) cond(A,x) ε

where cond(A,x) = || |A^-1||A| |x| ||∞ / ||x||∞ ≤ ||A^-1||∞ ||A||∞ = κ∞(A).

Note that cond(A,x) can be much smaller than κ∞(A); the condition number of A^T and A^H might or might not be equal to κ∞(A).

The approximate number of floating-point operations for one right-hand side vector b is 2n^2 for real flavors and 8n^2 for complex flavors.

To estimate the condition number κ∞(A), call ?gecon.
To refine the solution and estimate the error, call ?gerfs.

?gbtrs
Solves a system of linear equations with an LU-factored band matrix, with multiple right-hand sides.
Syntax

Fortran 77:
call sgbtrs( trans, n, kl, ku, nrhs, ab, ldab, ipiv, b, ldb, info )
call dgbtrs( trans, n, kl, ku, nrhs, ab, ldab, ipiv, b, ldb, info )
call cgbtrs( trans, n, kl, ku, nrhs, ab, ldab, ipiv, b, ldb, info )
call zgbtrs( trans, n, kl, ku, nrhs, ab, ldab, ipiv, b, ldb, info )

Fortran 95:
call gbtrs( ab, b, ipiv [, kl] [, trans] [, info] )

C:
lapack_int LAPACKE_<?>gbtrs( int matrix_order, char trans, lapack_int n, lapack_int kl, lapack_int ku, lapack_int nrhs, const <datatype>* ab, lapack_int ldab, const lapack_int* ipiv, <datatype>* b, lapack_int ldb );

Include Files

• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description

The routine solves for X the following systems of linear equations:

A*X = B      if trans='N',
A^T*X = B    if trans='T',
A^H*X = B    if trans='C' (for complex matrices only).

Here A is an LU-factored general band matrix of order n with kl non-zero subdiagonals and ku non-zero superdiagonals. Before calling this routine, call ?gbtrf to compute the LU factorization of A.

Input Parameters

The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

trans    CHARACTER*1. Must be 'N' or 'T' or 'C'.

n    INTEGER. The order of A; the number of rows in B; n ≥ 0.

kl    INTEGER. The number of subdiagonals within the band of A; kl ≥ 0.

ku    INTEGER. The number of superdiagonals within the band of A; ku ≥ 0.

nrhs    INTEGER. The number of right-hand sides; nrhs ≥ 0.

ab, b    REAL for sgbtrs
DOUBLE PRECISION for dgbtrs
COMPLEX for cgbtrs
DOUBLE COMPLEX for zgbtrs.
Arrays: ab(ldab,*), b(ldb,*).
The array ab contains the matrix A in band storage (see Matrix Storage Schemes).
The array b contains the matrix B whose columns are the right-hand sides for the systems of equations.
The second dimension of ab must be at least max(1, n), and the second dimension of b at least max(1,nrhs).

ldab    INTEGER. The leading dimension of the array ab; ldab ≥ 2*kl + ku + 1.

ldb    INTEGER. The leading dimension of b; ldb ≥ max(1, n).

ipiv    INTEGER.
Array, DIMENSION at least max(1, n). The ipiv array, as returned by ?gbtrf.

Output Parameters

b    Overwritten by the solution matrix X.

info    INTEGER. If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes

Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.

Specific details for the routine gbtrs interface are as follows:

ab    Holds the array A of size (2*kl+ku+1,n).
b    Holds the matrix B of size (n, nrhs).
ipiv    Holds the vector of length n.
kl    If omitted, assumed kl = ku.
ku    Restored as lda-2*kl-1.
trans    Must be 'N', 'C', or 'T'. The default value is 'N'.

Application Notes

For each right-hand side b, the computed solution is the exact solution of a perturbed system of equations (A + E)x = b, where

|E| ≤ c(kl + ku + 1)ε P|L||U|

c(k) is a modest linear function of k, and ε is the machine precision.

If x0 is the true solution, the computed solution x satisfies this error bound:

||x - x0||∞ / ||x||∞ ≤ c(kl + ku + 1) cond(A,x) ε

where cond(A,x) = || |A^-1||A| |x| ||∞ / ||x||∞ ≤ ||A^-1||∞ ||A||∞ = κ∞(A).

Note that cond(A,x) can be much smaller than κ∞(A); the condition number of A^T and A^H might or might not be equal to κ∞(A).

The approximate number of floating-point operations for one right-hand side vector is 2n(ku + 2kl) for real flavors. The number of operations for complex flavors is 4 times greater. All these estimates assume that kl and ku are much less than n.

To estimate the condition number κ∞(A), call ?gbcon.
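
The band layout behind the ldab requirement above can be sketched as an index mapping. The C fragment below is an illustration, not MKL code: the helper names gb_min_ldab and gb_offset are hypothetical, and the mapping assumes the usual LAPACK convention for ?gbtrf input, in which element A(i,j) inside the band is kept in ab(kl+ku+1+i-j, j) and the first kl rows of ab are left free for fill-in.

```c
#include <assert.h>

/* Minimum leading dimension of ab, as stated for ldab above. */
int gb_min_ldab(int kl, int ku)
{
    return 2 * kl + ku + 1;
}

/* Hypothetical helper: zero-based offset of A(i,j) inside a column-major
   array ab(ldab,*), with i and j given 1-based as in the Fortran interface.
   Returns -1 when (i,j) lies outside the matrix or outside the band. */
long gb_offset(int n, int kl, int ku, int ldab, int i, int j)
{
    assert(ldab >= gb_min_ldab(kl, ku));
    if (i < 1 || i > n || j < 1 || j > n) {
        return -1;
    }
    if (i - j > kl || j - i > ku) {
        return -1;                      /* element is outside the band */
    }
    int row = kl + ku + 1 + i - j;      /* 1-based row inside ab */
    return (long)(j - 1) * ldab + (row - 1);
}
```

For example, with n = 5, kl = 1, ku = 2, and ldab = 5, the diagonal element A(1,1) lands in row kl+ku+1 = 4 of the first column of ab.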
To refine the solution and estimate the error, call ?gbrfs.

?gttrs
Solves a system of linear equations with a tridiagonal matrix using the LU factorization computed by ?gttrf.

Syntax

Fortran 77:
call sgttrs( trans, n, nrhs, dl, d, du, du2, ipiv, b, ldb, info )
call dgttrs( trans, n, nrhs, dl, d, du, du2, ipiv, b, ldb, info )
call cgttrs( trans, n, nrhs, dl, d, du, du2, ipiv, b, ldb, info )
call zgttrs( trans, n, nrhs, dl, d, du, du2, ipiv, b, ldb, info )

Fortran 95:
call gttrs( dl, d, du, du2, b, ipiv [, trans] [,info] )

C:
lapack_int LAPACKE_<?>gttrs( int matrix_order, char trans, lapack_int n, lapack_int nrhs, const <datatype>* dl, const <datatype>* d, const <datatype>* du, const <datatype>* du2, const lapack_int* ipiv, <datatype>* b, lapack_int ldb );

Include Files

• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description

The routine solves for X the following systems of linear equations with multiple right-hand sides:

A*X = B      if trans='N',
A^T*X = B    if trans='T',
A^H*X = B    if trans='C' (for complex matrices only).

Before calling this routine, you must call ?gttrf to compute the LU factorization of A.

Input Parameters

The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

trans    CHARACTER*1. Must be 'N' or 'T' or 'C'.
Indicates the form of the equations:
If trans = 'N', then A*X = B is solved for X.
If trans = 'T', then A^T*X = B is solved for X.
If trans = 'C', then A^H*X = B is solved for X.

n    INTEGER. The order of A; n ≥ 0.

nrhs    INTEGER. The number of right-hand sides, that is, the number of columns in B; nrhs ≥ 0.

dl, d, du, du2, b    REAL for sgttrs
DOUBLE PRECISION for dgttrs
COMPLEX for cgttrs
DOUBLE COMPLEX for zgttrs.
Arrays: dl(n-1), d(n), du(n-1), du2(n-2), b(ldb,nrhs).
The array dl contains the (n - 1) multipliers that define the matrix L from the LU factorization of A.
The array d contains the n diagonal elements of the upper triangular matrix U from the LU factorization of A.
The array du contains the (n - 1) elements of the first superdiagonal of U.
The array du2 contains the (n - 2) elements of the second superdiagonal of U.
The array b contains the matrix B whose columns are the right-hand sides for the systems of equations.

ldb    INTEGER. The leading dimension of b; ldb ≥ max(1, n).

ipiv    INTEGER.
Array, DIMENSION (n). The ipiv array, as returned by ?gttrf.

Output Parameters

b    Overwritten by the solution matrix X.

info    INTEGER. If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes

Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.

Specific details for the routine gttrs interface are as follows:

dl    Holds the vector of length (n-1).
d    Holds the vector of length n.
du    Holds the vector of length (n-1).
du2    Holds the vector of length (n-2).
b    Holds the matrix B of size (n, nrhs).
ipiv    Holds the vector of length n.
trans    Must be 'N', 'C', or 'T'. The default value is 'N'.

Application Notes

For each right-hand side b, the computed solution is the exact solution of a perturbed system of equations (A + E)x = b, where

|E| ≤ c(n)ε P|L||U|

c(n) is a modest linear function of n, and ε is the machine precision.

If x0 is the true solution, the computed solution x satisfies this error bound:

||x - x0||∞ / ||x||∞ ≤ c(n) cond(A,x) ε

where cond(A,x) = || |A^-1||A| |x| ||∞ / ||x||∞ ≤ ||A^-1||∞ ||A||∞ = κ∞(A).

Note that cond(A,x) can be much smaller than κ∞(A); the condition number of A^T and A^H might or might not be equal to κ∞(A).
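
To make the roles of dl, d, du, du2, and ipiv concrete, the following C sketch solves A*x = b for a single real right-hand side with trans = 'N'. It is an assumption-laden illustration rather than the MKL implementation: the function name gttrs_notrans_sketch is hypothetical, the loop structure follows the reference LAPACK approach (forward substitution with the multipliers and row interchanges, then back substitution with the three diagonals of U), and ipiv is assumed to hold 1-based values where ipiv(i) is either i or i+1.

```c
#include <assert.h>

/* Hypothetical sketch (not an MKL routine): solve A*x = b in place for one
   right-hand side, given the factors produced by ?gttrf.
   Arrays are 0-based; ipiv holds 1-based row indices. */
void gttrs_notrans_sketch(int n, const double *dl, const double *d,
                          const double *du, const double *du2,
                          const int *ipiv, double *b)
{
    assert(n >= 1);

    /* Forward substitution with L, applying the row interchanges. */
    for (int i = 0; i < n - 1; ++i) {
        int ip = ipiv[i] - 1;                 /* pivot row: i or i + 1 */
        int other = (ip == i) ? i + 1 : i;    /* the row that was not chosen */
        double temp = b[other] - dl[i] * b[ip];
        b[i] = b[ip];
        b[i + 1] = temp;
    }

    /* Back substitution with U, which has diagonals d, du, and du2. */
    b[n - 1] /= d[n - 1];
    if (n > 1) {
        b[n - 2] = (b[n - 2] - du[n - 2] * b[n - 1]) / d[n - 2];
    }
    for (int i = n - 3; i >= 0; --i) {
        b[i] = (b[i] - du[i] * b[i + 1] - du2[i] * b[i + 2]) / d[i];
    }
}
```

The second superdiagonal du2 is needed only because row interchanges during factorization can introduce fill-in above the first superdiagonal of U; when no interchanges occur (ipiv(i) = i for all i), the du2 entries are zero.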
The approximate number of floating-point operations for one right-hand side vector b is 7n (including n divisions) for real flavors and 34n (including 2n divisions) for complex flavors.
To estimate the condition number $\kappa_\infty(A)$, call ?gtcon.
To refine the solution and estimate the error, call ?gtrfs.

?dttrsb
Solves a system of linear equations with a diagonally dominant tridiagonal matrix using the LU factorization computed by ?dttrfb.

Syntax
Fortran 77:
call sdttrsb( trans, n, nrhs, dl, d, du, b, ldb, info )
call ddttrsb( trans, n, nrhs, dl, d, du, b, ldb, info )
call cdttrsb( trans, n, nrhs, dl, d, du, b, ldb, info )
call zdttrsb( trans, n, nrhs, dl, d, du, b, ldb, info )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The ?dttrsb routine solves the following systems of linear equations with multiple right-hand sides for X:
A*X = B if trans='N',
A^T*X = B if trans='T',
A^H*X = B if trans='C' (for complex matrices only).
Before calling this routine, call ?dttrfb to compute the factorization of A.

Input Parameters
trans CHARACTER*1. Must be 'N' or 'T' or 'C'. Indicates the form of the equations solved for X: If trans = 'N', then A*X = B. If trans = 'T', then A^T*X = B. If trans = 'C', then A^H*X = B.
n INTEGER. The order of A; n ≥ 0.
nrhs INTEGER. The number of right-hand sides, that is, the number of columns in B; nrhs ≥ 0.
dl, d, du, b REAL for sdttrsb, DOUBLE PRECISION for ddttrsb, COMPLEX for cdttrsb, DOUBLE COMPLEX for zdttrsb. Arrays: dl(n-1), d(n), du(n-1), b(ldb,nrhs).
The array dl contains the (n - 1) multipliers that define the matrices L1, L2 from the factorization of A.
The array d contains the n diagonal elements of the upper triangular matrix U from the factorization of A.
The array du contains the (n - 1) elements of the superdiagonal of U.
The array b contains the matrix B whose columns are the right-hand sides for the systems of equations.
ldb INTEGER. The leading dimension of b; ldb ≥ max(1, n).

Output Parameters
b Overwritten by the solution matrix X.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

?potrs
Solves a system of linear equations with a Cholesky-factored symmetric (Hermitian) positive-definite matrix.

Syntax
Fortran 77:
call spotrs( uplo, n, nrhs, a, lda, b, ldb, info )
call dpotrs( uplo, n, nrhs, a, lda, b, ldb, info )
call cpotrs( uplo, n, nrhs, a, lda, b, ldb, info )
call zpotrs( uplo, n, nrhs, a, lda, b, ldb, info )
Fortran 95:
call potrs( a, b [,uplo] [, info] )
C:
lapack_int LAPACKE_<?>potrs( int matrix_order, char uplo, lapack_int n, lapack_int nrhs, const <datatype>* a, lapack_int lda, <datatype>* b, lapack_int ldb );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine solves for X the system of linear equations A*X = B with a symmetric positive-definite or, for complex data, Hermitian positive-definite matrix A, given the Cholesky factorization of A:
A = U^T*U for real data, A = U^H*U for complex data if uplo='U'
A = L*L^T for real data, A = L*L^H for complex data if uplo='L'
where L is a lower triangular matrix and U is upper triangular. The system is solved with multiple right-hand sides stored in the columns of the matrix B.
Before calling this routine, you must call ?potrf to compute the Cholesky factorization of A.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'. Indicates how the input matrix A has been factored: If uplo = 'U', the upper triangle of A is stored. If uplo = 'L', the lower triangle of A is stored.
n INTEGER. The order of matrix A; n ≥ 0.
nrhs INTEGER.
The number of right-hand sides (nrhs ≥ 0).
a, b REAL for spotrs, DOUBLE PRECISION for dpotrs, COMPLEX for cpotrs, DOUBLE COMPLEX for zpotrs. Arrays: a(lda,*), b(ldb,*).
The array a contains the factor U or L (see uplo).
The array b contains the matrix B whose columns are the right-hand sides for the systems of equations.
The second dimension of a must be at least max(1,n), the second dimension of b at least max(1,nrhs).
lda INTEGER. The leading dimension of a; lda ≥ max(1, n).
ldb INTEGER. The leading dimension of b; ldb ≥ max(1, n).

Output Parameters
b Overwritten by the solution matrix X.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine potrs interface are as follows:
a Holds the matrix A of size (n, n).
b Holds the matrix B of size (n, nrhs).
uplo Must be 'U' or 'L'. The default value is 'U'.

Application Notes
If uplo = 'U', the computed solution for each right-hand side b is the exact solution of a perturbed system of equations (A + E)x = b, where
$|E| \le c(n)\,\varepsilon\,|U^H||U|$,
c(n) is a modest linear function of n, and $\varepsilon$ is the machine precision. A similar estimate holds for uplo = 'L'.
If $x_0$ is the true solution, the computed solution x satisfies this error bound:
$\frac{\|x - x_0\|_\infty}{\|x\|_\infty} \le c(n)\,\mathrm{cond}(A,x)\,\varepsilon$,
where $\mathrm{cond}(A,x) = \|\,|A^{-1}||A||x|\,\|_\infty / \|x\|_\infty \le \|A^{-1}\|_\infty \|A\|_\infty = \kappa_\infty(A)$.
Note that cond(A,x) can be much smaller than $\kappa_\infty(A)$.
The approximate number of floating-point operations for one right-hand side vector b is 2n^2 for real flavors and 8n^2 for complex flavors.
To estimate the condition number $\kappa_\infty(A)$, call ?pocon.
To refine the solution and estimate the error, call ?porfs.
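As a rough illustration of what the solve step does once the Cholesky factor is available, the plain-C sketch below handles the real uplo = 'U' case for one right-hand side: it solves U^T*y = b by forward substitution and then U*x = y by back substitution. It assumes column-major (Fortran-style) storage with leading dimension lda. The function name cholesky_upper_solve is hypothetical, and the code is not the Intel MKL implementation.

```c
#include <assert.h>

/* Hypothetical helper (not an MKL routine): given the upper Cholesky factor
   U (A = U^T*U) stored column-major in a with leading dimension lda,
   overwrites the n-vector b with the solution of A*x = b. */
void cholesky_upper_solve(int n, const double *a, int lda, double *b)
{
    assert(n >= 0 && lda >= (n > 1 ? n : 1));

    /* Forward substitution: solve U^T*y = b.
       Row i of U^T is column i of U, so the access is a[k + i*lda] = U(k,i). */
    for (int i = 0; i < n; ++i) {
        double s = b[i];
        for (int k = 0; k < i; ++k)
            s -= a[k + i * lda] * b[k];
        b[i] = s / a[i + i * lda];
    }

    /* Back substitution: solve U*x = y, using a[i + k*lda] = U(i,k). */
    for (int i = n - 1; i >= 0; --i) {
        double s = b[i];
        for (int k = i + 1; k < n; ++k)
            s -= a[i + k * lda] * b[k];
        b[i] = s / a[i + i * lda];
    }
}
```

Each of the two triangular solves costs about n^2 floating-point operations, which agrees with the 2n^2 estimate quoted above for real flavors.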
?pftrs
Solves a system of linear equations with a Cholesky-factored symmetric (Hermitian) positive-definite matrix using the Rectangular Full Packed (RFP) format.

Syntax
Fortran 77:
call spftrs( transr, uplo, n, nrhs, a, b, ldb, info )
call dpftrs( transr, uplo, n, nrhs, a, b, ldb, info )
call cpftrs( transr, uplo, n, nrhs, a, b, ldb, info )
call zpftrs( transr, uplo, n, nrhs, a, b, ldb, info )
C:
lapack_int LAPACKE_<?>pftrs( int matrix_order, char transr, char uplo, lapack_int n, lapack_int nrhs, const <datatype>* a, <datatype>* b, lapack_int ldb );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine solves a system of linear equations A*X = B with a symmetric positive-definite or, for complex data, Hermitian positive-definite matrix A using the Cholesky factorization of A computed by ?pftrf:
A = U^T*U for real data, A = U^H*U for complex data if uplo='U'
A = L*L^T for real data, A = L*L^H for complex data if uplo='L'
where L is a lower triangular matrix and U is an upper triangular matrix.
The matrix A is in the Rectangular Full Packed (RFP) format. For the description of the RFP format, see Matrix Storage Schemes.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
transr CHARACTER*1. Must be 'N', 'T' (for real data) or 'C' (for complex data). If transr = 'N', the normal form of RFP A is stored. If transr = 'T', the transpose form of RFP A is stored. If transr = 'C', the conjugate-transpose form of RFP A is stored.
uplo CHARACTER*1. Must be 'U' or 'L'. Indicates whether the upper or lower triangular part of the RFP matrix A is stored: If uplo = 'U', the array a stores the upper triangular part of the matrix A.
If uplo = 'L', the array a stores the lower triangular part of the matrix A.
n INTEGER. The order of the matrix A; n ≥ 0.
nrhs INTEGER. The number of right-hand sides, that is, the number of columns of the matrix B; nrhs ≥ 0.
a, b REAL for spftrs, DOUBLE PRECISION for dpftrs, COMPLEX for cpftrs, DOUBLE COMPLEX for zpftrs. Arrays: a(n*(n+1)/2), b(ldb,nrhs).
The array a contains the matrix A in the RFP format.
The array b contains the matrix B whose columns are the right-hand sides for the systems of equations.
ldb INTEGER. The leading dimension of b; ldb ≥ max(1, n).

Output Parameters
b The solution matrix X.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

?pptrs
Solves a system of linear equations with a packed Cholesky-factored symmetric (Hermitian) positive-definite matrix.

Syntax
Fortran 77:
call spptrs( uplo, n, nrhs, ap, b, ldb, info )
call dpptrs( uplo, n, nrhs, ap, b, ldb, info )
call cpptrs( uplo, n, nrhs, ap, b, ldb, info )
call zpptrs( uplo, n, nrhs, ap, b, ldb, info )
Fortran 95:
call pptrs( ap, b [,uplo] [,info] )
C:
lapack_int LAPACKE_<?>pptrs( int matrix_order, char uplo, lapack_int n, lapack_int nrhs, const <datatype>* ap, <datatype>* b, lapack_int ldb );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine solves for X the system of linear equations A*X = B with a packed symmetric positive-definite or, for complex data, Hermitian positive-definite matrix A, given the Cholesky factorization of A:
A = U^T*U for real data, A = U^H*U for complex data if uplo='U'
A = L*L^T for real data, A = L*L^H for complex data if uplo='L'
where L is a lower triangular matrix and U is upper triangular. The system is solved with multiple right-hand sides stored in the columns of the matrix B.
Before calling this routine, you must call ?pptrf to compute the Cholesky factorization of A.
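The packed variant performs the same two triangular solves as the full-storage case; only the addressing of the factor changes. The sketch below assumes the real uplo = 'U' case with the upper triangle packed by columns, so that, with 0-based indices, U(i,j) for i ≤ j sits at position i + j*(j+1)/2. The names packed_upper_index and packed_cholesky_upper_solve are hypothetical, and the code is an illustration rather than the Intel MKL implementation.

```c
#include <assert.h>

/* Position of U(i,j), i <= j (0-based), when the upper triangle is
   packed column by column into a one-dimensional array. */
static int packed_upper_index(int i, int j)
{
    return i + j * (j + 1) / 2;
}

/* Hypothetical helper (not an MKL routine): given the packed upper Cholesky
   factor U (A = U^T*U) in ap, overwrites b with the solution of A*x = b. */
void packed_cholesky_upper_solve(int n, const double *ap, double *b)
{
    assert(n >= 0);

    /* Forward substitution: solve U^T*y = b. */
    for (int i = 0; i < n; ++i) {
        double s = b[i];
        for (int k = 0; k < i; ++k)
            s -= ap[packed_upper_index(k, i)] * b[k];
        b[i] = s / ap[packed_upper_index(i, i)];
    }

    /* Back substitution: solve U*x = y. */
    for (int i = n - 1; i >= 0; --i) {
        double s = b[i];
        for (int k = i + 1; k < n; ++k)
            s -= ap[packed_upper_index(i, k)] * b[k];
        b[i] = s / ap[packed_upper_index(i, i)];
    }
}
```

The packed array holds n*(n+1)/2 entries, roughly half of what the full two-dimensional array needs for the same factor.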
Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'. Indicates how the input matrix A has been factored: If uplo = 'U', the upper triangle of A is stored. If uplo = 'L', the lower triangle of A is stored.
n INTEGER. The order of matrix A; n ≥ 0.
nrhs INTEGER. The number of right-hand sides (nrhs ≥ 0).
ap, b REAL for spptrs, DOUBLE PRECISION for dpptrs, COMPLEX for cpptrs, DOUBLE COMPLEX for zpptrs. Arrays: ap(*), b(ldb,*).
The dimension of ap must be at least max(1,n(n+1)/2).
The array ap contains the factor U or L, as specified by uplo, in packed storage (see Matrix Storage Schemes).
The array b contains the matrix B whose columns are the right-hand sides for the systems of equations. The second dimension of b must be at least max(1, nrhs).
ldb INTEGER. The leading dimension of b; ldb ≥ max(1, n).

Output Parameters
b Overwritten by the solution matrix X.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine pptrs interface are as follows:
ap Holds the array A of size (n*(n+1)/2).
b Holds the matrix B of size (n, nrhs).
uplo Must be 'U' or 'L'. The default value is 'U'.

Application Notes
If uplo = 'U', the computed solution for each right-hand side b is the exact solution of a perturbed system of equations (A + E)x = b, where
$|E| \le c(n)\,\varepsilon\,|U^H||U|$,
c(n) is a modest linear function of n, and $\varepsilon$ is the machine precision.
A similar estimate holds for uplo = 'L'.
If $x_0$ is the true solution, the computed solution x satisfies this error bound:
$\frac{\|x - x_0\|_\infty}{\|x\|_\infty} \le c(n)\,\mathrm{cond}(A,x)\,\varepsilon$,
where $\mathrm{cond}(A,x) = \|\,|A^{-1}||A||x|\,\|_\infty / \|x\|_\infty \le \|A^{-1}\|_\infty \|A\|_\infty = \kappa_\infty(A)$.
Note that cond(A,x) can be much smaller than $\kappa_\infty(A)$.
The approximate number of floating-point operations for one right-hand side vector b is 2n^2 for real flavors and 8n^2 for complex flavors.
To estimate the condition number $\kappa_\infty(A)$, call ?ppcon.
To refine the solution and estimate the error, call ?pprfs.

?pbtrs
Solves a system of linear equations with a Cholesky-factored symmetric (Hermitian) positive-definite band matrix.

Syntax
Fortran 77:
call spbtrs( uplo, n, kd, nrhs, ab, ldab, b, ldb, info )
call dpbtrs( uplo, n, kd, nrhs, ab, ldab, b, ldb, info )
call cpbtrs( uplo, n, kd, nrhs, ab, ldab, b, ldb, info )
call zpbtrs( uplo, n, kd, nrhs, ab, ldab, b, ldb, info )
Fortran 95:
call pbtrs( ab, b [,uplo] [,info] )
C:
lapack_int LAPACKE_<?>pbtrs( int matrix_order, char uplo, lapack_int n, lapack_int kd, lapack_int nrhs, const <datatype>* ab, lapack_int ldab, <datatype>* b, lapack_int ldb );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine solves for X a system of linear equations A*X = B with a symmetric positive-definite or, for complex data, Hermitian positive-definite band matrix A, given the Cholesky factorization of A:
A = U^T*U for real data, A = U^H*U for complex data if uplo='U'
A = L*L^T for real data, A = L*L^H for complex data if uplo='L'
where L is a lower triangular matrix and U is upper triangular. The system is solved with multiple right-hand sides stored in the columns of the matrix B.
Before calling this routine, you must call ?pbtrf to compute the Cholesky factorization of A in the band storage form.

Input Parameters
The data types are given for the Fortran interface.
A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'. Indicates how the input matrix A has been factored: If uplo = 'U', the upper triangular factor is stored in ab. If uplo = 'L', the lower triangular factor is stored in ab.
n INTEGER. The order of matrix A; n ≥ 0.
kd INTEGER. The number of superdiagonals or subdiagonals in the matrix A; kd ≥ 0.
nrhs INTEGER. The number of right-hand sides; nrhs ≥ 0.
ab, b REAL for spbtrs, DOUBLE PRECISION for dpbtrs, COMPLEX for cpbtrs, DOUBLE COMPLEX for zpbtrs. Arrays: ab(ldab,*), b(ldb,*).
The array ab contains the Cholesky factor, as returned by the factorization routine, in band storage form.
The array b contains the matrix B whose columns are the right-hand sides for the systems of equations.
The second dimension of ab must be at least max(1, n), and the second dimension of b at least max(1,nrhs).
ldab INTEGER. The leading dimension of the array ab; ldab ≥ kd + 1.
ldb INTEGER. The leading dimension of b; ldb ≥ max(1, n).

Output Parameters
b Overwritten by the solution matrix X.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine pbtrs interface are as follows:
ab Holds the array A of size (kd+1,n).
b Holds the matrix B of size (n, nrhs).
uplo Must be 'U' or 'L'. The default value is 'U'.
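The following plain-C sketch shows one way the band-stored factor can be used for a single real right-hand side with uplo = 'U'. It assumes the usual LAPACK upper band layout, in which (with 0-based indices) U(i,j) for max(0, j-kd) ≤ i ≤ j is stored at ab[(kd + i - j) + j*ldab]; this layout is an assumption taken from the standard LAPACK band storage scheme, not from the text above. The function name band_cholesky_upper_solve is hypothetical, and the code is not the Intel MKL implementation.

```c
#include <assert.h>

/* Hypothetical helper (not an MKL routine): given the upper Cholesky factor
   U (A = U^T*U) of a band matrix with kd superdiagonals, stored in band form
   in ab with leading dimension ldab, overwrites b with the solution of
   A*x = b. Assumed layout: U(i,j) is at ab[(kd + i - j) + j*ldab]. */
void band_cholesky_upper_solve(int n, int kd, const double *ab, int ldab,
                               double *b)
{
    assert(n >= 0 && kd >= 0 && ldab >= kd + 1);

    /* Forward substitution: solve U^T*y = b.
       Only rows j-kd .. j-1 of column j of U can be nonzero. */
    for (int j = 0; j < n; ++j) {
        int lo = (j - kd > 0) ? j - kd : 0;
        double s = b[j];
        for (int i = lo; i < j; ++i)
            s -= ab[(kd + i - j) + j * ldab] * b[i];
        b[j] = s / ab[kd + j * ldab];
    }

    /* Back substitution: solve U*x = y.
       Only columns i+1 .. i+kd of row i of U can be nonzero. */
    for (int i = n - 1; i >= 0; --i) {
        int hi = (i + kd < n - 1) ? i + kd : n - 1;
        double s = b[i];
        for (int j = i + 1; j <= hi; ++j)
            s -= ab[(kd + i - j) + j * ldab] * b[j];
        b[i] = s / ab[kd + i * ldab];
    }
}
```

Because each inner loop runs over at most kd entries, the work per right-hand side grows with n*kd rather than n^2.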

Application Notes
For each right-hand side b, the computed solution is the exact solution of a perturbed system of equations (A + E)x = b, where
$|E| \le c(kd + 1)\,\varepsilon\,P|U^H||U|$ or $|E| \le c(kd + 1)\,\varepsilon\,P|L^H||L|$,
c(k) is a modest linear function of k, and $\varepsilon$ is the machine precision.
If $x_0$ is the true solution, the computed solution x satisfies this error bound:
$\frac{\|x - x_0\|_\infty}{\|x\|_\infty} \le c(kd + 1)\,\mathrm{cond}(A,x)\,\varepsilon$,
where $\mathrm{cond}(A,x) = \|\,|A^{-1}||A||x|\,\|_\infty / \|x\|_\infty \le \|A^{-1}\|_\infty \|A\|_\infty = \kappa_\infty(A)$.
Note that cond(A,x) can be much smaller than $\kappa_\infty(A)$.
The approximate number of floating-point operations for one right-hand side vector is 4n*kd for real flavors and 16n*kd for complex flavors.
To estimate the condition number $\kappa_\infty(A)$, call ?pbcon.
To refine the solution and estimate the error, call ?pbrfs.

?pttrs
Solves a system of linear equations with a symmetric (Hermitian) positive-definite tridiagonal matrix using the factorization computed by ?pttrf.

Syntax
Fortran 77:
call spttrs( n, nrhs, d, e, b, ldb, info )
call dpttrs( n, nrhs, d, e, b, ldb, info )
call cpttrs( uplo, n, nrhs, d, e, b, ldb, info )
call zpttrs( uplo, n, nrhs, d, e, b, ldb, info )
Fortran 95:
call pttrs( d, e, b [,info] )
call pttrs( d, e, b [,uplo] [,info] )
C:
lapack_int LAPACKE_spttrs( int matrix_order, lapack_int n, lapack_int nrhs, const float* d, const float* e, float* b, lapack_int ldb );
lapack_int LAPACKE_dpttrs( int matrix_order, lapack_int n, lapack_int nrhs, const double* d, const double* e, double* b, lapack_int ldb );
lapack_int LAPACKE_cpttrs( int matrix_order, char uplo, lapack_int n, lapack_int nrhs, const float* d, const lapack_complex_float* e, lapack_complex_float* b, lapack_int ldb );
lapack_int LAPACKE_zpttrs( int matrix_order, char uplo, lapack_int n, lapack_int nrhs, const double* d, const lapack_complex_double* e, lapack_complex_double* b, lapack_int ldb );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine solves for X a system of
linear equations A*X = B with a symmetric (Hermitian) positive-definite tridiagonal matrix A. Before calling this routine, call ?pttrf to compute the L*D*L' factorization of A for real data, or the L*D*L' or U'*D*U factorization of A for complex data.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Used for cpttrs/zpttrs only. Must be 'U' or 'L'. Specifies whether the superdiagonal or the subdiagonal of the tridiagonal matrix A is stored and how A is factored: If uplo = 'U', the array e stores the superdiagonal of A, and A is factored as U'*D*U. If uplo = 'L', the array e stores the subdiagonal of A, and A is factored as L*D*L'.
n INTEGER. The order of A; n ≥ 0.
nrhs INTEGER. The number of right-hand sides, that is, the number of columns of the matrix B; nrhs ≥ 0.
d REAL for spttrs, cpttrs; DOUBLE PRECISION for dpttrs, zpttrs. Array, dimension (n). Contains the diagonal elements of the diagonal matrix D from the factorization computed by ?pttrf.
e, b REAL for spttrs, DOUBLE PRECISION for dpttrs, COMPLEX for cpttrs, DOUBLE COMPLEX for zpttrs. Arrays: e(n-1), b(ldb, nrhs).
The array e contains the (n - 1) off-diagonal elements of the unit bidiagonal factor U or L from the factorization computed by ?pttrf (see uplo).
The array b contains the matrix B whose columns are the right-hand sides for the systems of equations.
ldb INTEGER. The leading dimension of b; ldb ≥ max(1, n).

Output Parameters
b Overwritten by the solution matrix X.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts.
For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine pttrs interface are as follows:
d Holds the vector of length n.
e Holds the vector of length (n-1).
b Holds the matrix B of size (n, nrhs).
uplo Used in complex flavors only. Must be 'U' or 'L'. The default value is 'U'.

?sytrs
Solves a system of linear equations with a UDU- or LDL-factored symmetric matrix.

Syntax
Fortran 77:
call ssytrs( uplo, n, nrhs, a, lda, ipiv, b, ldb, info )
call dsytrs( uplo, n, nrhs, a, lda, ipiv, b, ldb, info )
call csytrs( uplo, n, nrhs, a, lda, ipiv, b, ldb, info )
call zsytrs( uplo, n, nrhs, a, lda, ipiv, b, ldb, info )
Fortran 95:
call sytrs( a, b, ipiv [,uplo] [,info] )
C:
lapack_int LAPACKE_<?>sytrs( int matrix_order, char uplo, lapack_int n, lapack_int nrhs, const <datatype>* a, lapack_int lda, const lapack_int* ipiv, <datatype>* b, lapack_int ldb );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine solves for X the system of linear equations A*X = B with a symmetric matrix A, given the Bunch-Kaufman factorization of A:
if uplo='U', A = P*U*D*U^T*P^T
if uplo='L', A = P*L*D*L^T*P^T,
where P is a permutation matrix, U and L are upper and lower triangular matrices with unit diagonal, and D is a symmetric block-diagonal matrix. The system is solved with multiple right-hand sides stored in the columns of the matrix B. You must supply to this routine the factor U (or L) and the array ipiv returned by the factorization routine ?sytrf.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'.
Indicates how the input matrix A has been factored: If uplo = 'U', the array a stores the upper triangular factor U of the factorization A = P*U*D*U^T*P^T. If uplo = 'L', the array a stores the lower triangular factor L of the factorization A = P*L*D*L^T*P^T.
n INTEGER. The order of matrix A; n ≥ 0.
nrhs INTEGER. The number of right-hand sides; nrhs ≥ 0.
ipiv INTEGER. Array, DIMENSION at least max(1, n). The ipiv array, as returned by ?sytrf.
a, b REAL for ssytrs, DOUBLE PRECISION for dsytrs, COMPLEX for csytrs, DOUBLE COMPLEX for zsytrs. Arrays: a(lda,*), b(ldb,*).
The array a contains the factor U or L (see uplo).
The array b contains the matrix B whose columns are the right-hand sides for the system of equations.
The second dimension of a must be at least max(1,n), and the second dimension of b at least max(1,nrhs).
lda INTEGER. The leading dimension of a; lda ≥ max(1, n).
ldb INTEGER. The leading dimension of b; ldb ≥ max(1, n).

Output Parameters
b Overwritten by the solution matrix X.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine sytrs interface are as follows:
a Holds the matrix A of size (n, n).
b Holds the matrix B of size (n, nrhs).
ipiv Holds the vector of length n.
uplo Must be 'U' or 'L'. The default value is 'U'.

Application Notes
For each right-hand side b, the computed solution is the exact solution of a perturbed system of equations (A + E)x = b, where
$|E| \le c(n)\,\varepsilon\,P|U||D||U^T|P^T$ or $|E| \le c(n)\,\varepsilon\,P|L||D||L^T|P^T$,
c(n) is a modest linear function of n, and $\varepsilon$ is the machine precision.
If $x_0$ is the true solution, the computed solution x satisfies this error bound:
$\frac{\|x - x_0\|_\infty}{\|x\|_\infty} \le c(n)\,\mathrm{cond}(A,x)\,\varepsilon$,
where $\mathrm{cond}(A,x) = \|\,|A^{-1}||A||x|\,\|_\infty / \|x\|_\infty \le \|A^{-1}\|_\infty \|A\|_\infty = \kappa_\infty(A)$.
Note that cond(A,x) can be much smaller than $\kappa_\infty(A)$.
The total number of floating-point operations for one right-hand side vector is approximately 2n^2 for real flavors or 8n^2 for complex flavors.
To estimate the condition number $\kappa_\infty(A)$, call ?sycon.
To refine the solution and estimate the error, call ?syrfs.

?hetrs
Solves a system of linear equations with a UDU- or LDL-factored Hermitian matrix.

Syntax
Fortran 77:
call chetrs( uplo, n, nrhs, a, lda, ipiv, b, ldb, info )
call zhetrs( uplo, n, nrhs, a, lda, ipiv, b, ldb, info )
Fortran 95:
call hetrs( a, b, ipiv [, uplo] [,info] )
C:
lapack_int LAPACKE_<?>hetrs( int matrix_order, char uplo, lapack_int n, lapack_int nrhs, const <datatype>* a, lapack_int lda, const lapack_int* ipiv, <datatype>* b, lapack_int ldb );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine solves for X the system of linear equations A*X = B with a Hermitian matrix A, given the Bunch-Kaufman factorization of A:
if uplo = 'U', A = P*U*D*U^H*P^T
if uplo = 'L', A = P*L*D*L^H*P^T,
where P is a permutation matrix, U and L are upper and lower triangular matrices with unit diagonal, and D is a Hermitian block-diagonal matrix. The system is solved with multiple right-hand sides stored in the columns of the matrix B. You must supply to this routine the factor U (or L) and the array ipiv returned by the factorization routine ?hetrf.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'.
Indicates how the input matrix A has been factored: If uplo = 'U', the array a stores the upper triangular factor U of the factorization A = P*U*D*U^H*P^T. If uplo = 'L', the array a stores the lower triangular factor L of the factorization A = P*L*D*L^H*P^T.
n INTEGER. The order of matrix A; n ≥ 0.
nrhs INTEGER. The number of right-hand sides; nrhs ≥ 0.
ipiv INTEGER. Array, DIMENSION at least max(1, n). The ipiv array, as returned by ?hetrf.
a, b COMPLEX for chetrs, DOUBLE COMPLEX for zhetrs. Arrays: a(lda,*), b(ldb,*).
The array a contains the factor U or L (see uplo).
The array b contains the matrix B whose columns are the right-hand sides for the system of equations.
The second dimension of a must be at least max(1,n), the second dimension of b at least max(1,nrhs).
lda INTEGER. The leading dimension of a; lda ≥ max(1, n).
ldb INTEGER. The leading dimension of b; ldb ≥ max(1, n).

Output Parameters
b Overwritten by the solution matrix X.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine hetrs interface are as follows:
a Holds the matrix A of size (n, n).
b Holds the matrix B of size (n, nrhs).
ipiv Holds the vector of length n.
uplo Must be 'U' or 'L'. The default value is 'U'.

Application Notes
For each right-hand side b, the computed solution is the exact solution of a perturbed system of equations (A + E)x = b, where
$|E| \le c(n)\,\varepsilon\,P|U||D||U^H|P^T$ or $|E| \le c(n)\,\varepsilon\,P|L||D||L^H|P^T$,
c(n) is a modest linear function of n, and $\varepsilon$ is the machine precision.
If $x_0$ is the true solution, the computed solution x satisfies this error bound:
$\frac{\|x - x_0\|_\infty}{\|x\|_\infty} \le c(n)\,\mathrm{cond}(A,x)\,\varepsilon$,
where $\mathrm{cond}(A,x) = \|\,|A^{-1}||A||x|\,\|_\infty / \|x\|_\infty \le \|A^{-1}\|_\infty \|A\|_\infty = \kappa_\infty(A)$.
Note that cond(A,x) can be much smaller than $\kappa_\infty(A)$.
The total number of floating-point operations for one right-hand side vector is approximately 8n^2.
To estimate the condition number $\kappa_\infty(A)$, call ?hecon.
To refine the solution and estimate the error, call ?herfs.

?sytrs2
Solves a system of linear equations with a UDU- or LDL-factored symmetric matrix computed by ?sytrf and converted by ?syconv.

Syntax
Fortran 77:
call ssytrs2( uplo, n, nrhs, a, lda, ipiv, b, ldb, work, info )
call dsytrs2( uplo, n, nrhs, a, lda, ipiv, b, ldb, work, info )
call csytrs2( uplo, n, nrhs, a, lda, ipiv, b, ldb, work, info )
call zsytrs2( uplo, n, nrhs, a, lda, ipiv, b, ldb, work, info )
Fortran 95:
call sytrs2( a, b, ipiv [,uplo] [,info] )
C:
lapack_int LAPACKE_<?>sytrs2( int matrix_order, char uplo, lapack_int n, lapack_int nrhs, const <datatype>* a, lapack_int lda, const lapack_int* ipiv, <datatype>* b, lapack_int ldb );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine solves a system of linear equations A*X = B with a symmetric matrix A using the factorization of A:
if uplo='U', A = U*D*U^T
if uplo='L', A = L*D*L^T
where
• U and L are upper and lower triangular matrices with unit diagonal
• D is a symmetric block-diagonal matrix.
The factorization is computed by ?sytrf and converted by ?syconv.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'.
Indicates how the input matrix A has been factored: If uplo = 'U', the array a stores the upper triangular factor U of the factorization A = U*D*U^T. If uplo = 'L', the array a stores the lower triangular factor L of the factorization A = L*D*L^T.
n INTEGER. The order of matrix A; n ≥ 0.
nrhs INTEGER. The number of right-hand sides; nrhs ≥ 0.
a, b REAL for ssytrs2, DOUBLE PRECISION for dsytrs2, COMPLEX for csytrs2, DOUBLE COMPLEX for zsytrs2. Arrays: a(lda,*), b(ldb,*).
The array a contains the block diagonal matrix D and the multipliers used to obtain the factor U or L as computed by ?sytrf.
The array b contains the right-hand side matrix B.
The second dimension of a must be at least max(1,n), and the second dimension of b at least max(1,nrhs).
lda INTEGER. The leading dimension of a; lda ≥ max(1, n).
ldb INTEGER. The leading dimension of b; ldb ≥ max(1, n).
ipiv INTEGER. Array of DIMENSION n. The ipiv array contains details of the interchanges and the block structure of D as determined by ?sytrf.
work REAL for ssytrs2, DOUBLE PRECISION for dsytrs2, COMPLEX for csytrs2, DOUBLE COMPLEX for zsytrs2. Workspace array, DIMENSION n.

Output Parameters
b Overwritten by the solution matrix X.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine sytrs2 interface are as follows:
a Holds the matrix A of size (n, n).
b Holds the matrix B of size (n, nrhs).
ipiv Holds the vector of length n.
uplo Indicates how the input matrix A has been factored. Must be 'U' or 'L'.

See Also
?sytrf
?syconv

?hetrs2
Solves a system of linear equations with a UDU- or LDL-factored Hermitian matrix computed by ?hetrf and converted by ?syconv.

Syntax
Fortran 77:
call chetrs2( uplo, n, nrhs, a, lda, ipiv, b, ldb, work, info )
call zhetrs2( uplo, n, nrhs, a, lda, ipiv, b, ldb, work, info )
Fortran 95:
call hetrs2( a, b, ipiv [,uplo] [,info] )
C:
lapack_int LAPACKE_<?>hetrs2( int matrix_order, char uplo, lapack_int n, lapack_int nrhs, const <datatype>* a, lapack_int lda, const lapack_int* ipiv, <datatype>* b, lapack_int ldb );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine solves a system of linear equations A*X = B with a complex Hermitian matrix A using the factorization of A:
if uplo='U', A = U*D*U^H
if uplo='L', A = L*D*L^H
where
• U and L are upper and lower triangular matrices with unit diagonal
• D is a Hermitian block-diagonal matrix.
The factorization is computed by ?hetrf and converted by ?syconv.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'. Indicates how the input matrix A has been factored: If uplo = 'U', the array a stores the upper triangular factor U of the factorization A = U*D*U^H. If uplo = 'L', the array a stores the lower triangular factor L of the factorization A = L*D*L^H.
n INTEGER. The order of matrix A; n ≥ 0.
nrhs INTEGER. The number of right-hand sides; nrhs ≥ 0.
a, b COMPLEX for chetrs2, DOUBLE COMPLEX for zhetrs2. Arrays: a(lda,*), b(ldb,*).
The array a contains the block diagonal matrix D and the multipliers used to obtain the factor U or L as computed by ?hetrf.
The array b contains the right-hand side matrix B.
The second dimension of a must be at least max(1,n), and the second dimension of b at least max(1,nrhs).
lda INTEGER. The leading dimension of a; lda ≥ max(1, n).
ldb INTEGER. The leading dimension of b; ldb ≥ max(1, n).
ipiv INTEGER. Array of DIMENSION n. The ipiv array contains details of the interchanges and the block structure of D as determined by ?hetrf.
work COMPLEX for chetrs2
DOUBLE COMPLEX for zhetrs2
Workspace array, DIMENSION n.

Output Parameters
b Overwritten by the solution matrix X.
info INTEGER. If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine hetrs2 interface are as follows:
a Holds the matrix A of size (n, n).
b Holds the matrix B of size (n, nrhs).
ipiv Holds the vector of length n.
uplo Must be 'U' or 'L'. The default value is 'U'.

See Also
?hetrf
?syconv

?sptrs
Solves a system of linear equations with a UDU- or LDL-factored symmetric matrix using packed storage.
Syntax
Fortran 77:
call ssptrs( uplo, n, nrhs, ap, ipiv, b, ldb, info )
call dsptrs( uplo, n, nrhs, ap, ipiv, b, ldb, info )
call csptrs( uplo, n, nrhs, ap, ipiv, b, ldb, info )
call zsptrs( uplo, n, nrhs, ap, ipiv, b, ldb, info )
Fortran 95:
call sptrs( ap, b, ipiv [,uplo] [,info] )
C:
lapack_int LAPACKE_<?>sptrs( int matrix_order, char uplo, lapack_int n, lapack_int nrhs, const <datatype>* ap, const lapack_int* ipiv, <datatype>* b, lapack_int ldb );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine solves for X the system of linear equations A*X = B with a symmetric matrix A, given the Bunch-Kaufman factorization of A:
if uplo='U', A = P*U*D*U^T*P^T
if uplo='L', A = P*L*D*L^T*P^T,
where P is a permutation matrix, U and L are upper and lower packed triangular matrices with unit diagonal, and D is a symmetric block-diagonal matrix. The system is solved with multiple right-hand sides stored in the columns of the matrix B. You must supply the factor U (or L) and the array ipiv returned by the factorization routine ?sptrf.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'.
Indicates how the input matrix A has been factored:
If uplo = 'U', the array ap stores the packed factor U of the factorization A = P*U*D*U^T*P^T.
If uplo = 'L', the array ap stores the packed factor L of the factorization A = P*L*D*L^T*P^T.
n INTEGER. The order of matrix A; n ≥ 0.
nrhs INTEGER. The number of right-hand sides; nrhs ≥ 0.
ipiv INTEGER. Array, DIMENSION at least max(1, n). The ipiv array, as returned by ?sptrf.
ap, b REAL for ssptrs
DOUBLE PRECISION for dsptrs
COMPLEX for csptrs
DOUBLE COMPLEX for zsptrs.
Arrays: ap(*), b(ldb,*).
The dimension of ap must be at least max(1,n(n+1)/2). The array ap contains the factor U or L, as specified by uplo, in packed storage (see Matrix Storage Schemes).
The array b contains the matrix B whose columns are the right-hand sides for the system of equations. The second dimension of b must be at least max(1, nrhs).
ldb INTEGER. The leading dimension of b; ldb ≥ max(1, n).

Output Parameters
b Overwritten by the solution matrix X.
info INTEGER. If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine sptrs interface are as follows:
ap Holds the array A of size (n*(n+1)/2).
b Holds the matrix B of size (n, nrhs).
ipiv Holds the vector of length n.
uplo Must be 'U' or 'L'. The default value is 'U'.

Application Notes
For each right-hand side b, the computed solution is the exact solution of a perturbed system of equations (A + E)x = b, where
$|E| \le c(n)\varepsilon\, P|U||D||U^T|P^T$ or $|E| \le c(n)\varepsilon\, P|L||D||L^T|P^T$,
c(n) is a modest linear function of n, and $\varepsilon$ is the machine precision.
If $x_0$ is the true solution, the computed solution x satisfies this error bound:
$\frac{\|x - x_0\|_\infty}{\|x\|_\infty} \le c(n)\,\mathrm{cond}(A,x)\,\varepsilon$
where $\mathrm{cond}(A,x) = \|\,|A^{-1}||A||x|\,\|_\infty / \|x\|_\infty \le \|A^{-1}\|_\infty \|A\|_\infty = \kappa_\infty(A)$.
Note that cond(A,x) can be much smaller than $\kappa_\infty(A)$.
The total number of floating-point operations for one right-hand side vector is approximately $2n^2$ for real flavors or $8n^2$ for complex flavors.
To estimate the condition number $\kappa_\infty(A)$, call ?spcon.
To refine the solution and estimate the error, call ?sprfs.

?hptrs
Solves a system of linear equations with a UDU- or LDL-factored Hermitian matrix using packed storage.
Syntax
Fortran 77:
call chptrs( uplo, n, nrhs, ap, ipiv, b, ldb, info )
call zhptrs( uplo, n, nrhs, ap, ipiv, b, ldb, info )
Fortran 95:
call hptrs( ap, b, ipiv [,uplo] [,info] )
C:
lapack_int LAPACKE_<?>hptrs( int matrix_order, char uplo, lapack_int n, lapack_int nrhs, const <datatype>* ap, const lapack_int* ipiv, <datatype>* b, lapack_int ldb );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine solves for X the system of linear equations A*X = B with a Hermitian matrix A, given the Bunch-Kaufman factorization of A:
if uplo='U', A = P*U*D*U^H*P^T
if uplo='L', A = P*L*D*L^H*P^T,
where P is a permutation matrix, U and L are upper and lower packed triangular matrices with unit diagonal, and D is a Hermitian block-diagonal matrix. The system is solved with multiple right-hand sides stored in the columns of the matrix B.
You must supply to this routine the arrays ap (containing U or L) and ipiv in the form returned by the factorization routine ?hptrf.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'.
Indicates how the input matrix A has been factored:
If uplo = 'U', the array ap stores the packed factor U of the factorization A = P*U*D*U^H*P^T.
If uplo = 'L', the array ap stores the packed factor L of the factorization A = P*L*D*L^H*P^T.
n INTEGER. The order of matrix A; n ≥ 0.
nrhs INTEGER. The number of right-hand sides; nrhs ≥ 0.
ipiv INTEGER. Array, DIMENSION at least max(1, n). The ipiv array, as returned by ?hptrf.
ap, b COMPLEX for chptrs
DOUBLE COMPLEX for zhptrs.
Arrays: ap(*), b(ldb,*).
The dimension of ap must be at least max(1,n(n+1)/2).
The array ap contains the factor U or L, as specified by uplo, in packed storage (see Matrix Storage Schemes).
The array b contains the matrix B whose columns are the right-hand sides for the system of equations. The second dimension of b must be at least max(1, nrhs).
ldb INTEGER. The leading dimension of b; ldb ≥ max(1, n).

Output Parameters
b Overwritten by the solution matrix X.
info INTEGER. If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine hptrs interface are as follows:
ap Holds the array A of size (n*(n+1)/2).
b Holds the matrix B of size (n, nrhs).
ipiv Holds the vector of length n.
uplo Must be 'U' or 'L'. The default value is 'U'.

Application Notes
For each right-hand side b, the computed solution is the exact solution of a perturbed system of equations (A + E)x = b, where
$|E| \le c(n)\varepsilon\, P|U||D||U^H|P^T$ or $|E| \le c(n)\varepsilon\, P|L||D||L^H|P^T$,
c(n) is a modest linear function of n, and $\varepsilon$ is the machine precision.
If $x_0$ is the true solution, the computed solution x satisfies this error bound:
$\frac{\|x - x_0\|_\infty}{\|x\|_\infty} \le c(n)\,\mathrm{cond}(A,x)\,\varepsilon$
where $\mathrm{cond}(A,x) = \|\,|A^{-1}||A||x|\,\|_\infty / \|x\|_\infty \le \|A^{-1}\|_\infty \|A\|_\infty = \kappa_\infty(A)$.
Note that cond(A,x) can be much smaller than $\kappa_\infty(A)$.
The total number of floating-point operations for one right-hand side vector is approximately $8n^2$ for complex flavors.
To estimate the condition number $\kappa_\infty(A)$, call ?hpcon.
To refine the solution and estimate the error, call ?hprfs.

?trtrs
Solves a system of linear equations with a triangular matrix, with multiple right-hand sides.
Syntax
Fortran 77:
call strtrs( uplo, trans, diag, n, nrhs, a, lda, b, ldb, info )
call dtrtrs( uplo, trans, diag, n, nrhs, a, lda, b, ldb, info )
call ctrtrs( uplo, trans, diag, n, nrhs, a, lda, b, ldb, info )
call ztrtrs( uplo, trans, diag, n, nrhs, a, lda, b, ldb, info )
Fortran 95:
call trtrs( a, b [,uplo] [,trans] [,diag] [,info] )
C:
lapack_int LAPACKE_<?>trtrs( int matrix_order, char uplo, char trans, char diag, lapack_int n, lapack_int nrhs, const <datatype>* a, lapack_int lda, <datatype>* b, lapack_int ldb );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine solves for X the following systems of linear equations with a triangular matrix A, with multiple right-hand sides stored in B:
A*X = B if trans='N',
A^T*X = B if trans='T',
A^H*X = B if trans='C' (for complex matrices only).

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'.
Indicates whether A is upper or lower triangular:
If uplo = 'U', then A is upper triangular.
If uplo = 'L', then A is lower triangular.
trans CHARACTER*1. Must be 'N' or 'T' or 'C'.
If trans = 'N', then A*X = B is solved for X.
If trans = 'T', then A^T*X = B is solved for X.
If trans = 'C', then A^H*X = B is solved for X.
diag CHARACTER*1. Must be 'N' or 'U'.
If diag = 'N', then A is not a unit triangular matrix.
If diag = 'U', then A is unit triangular: diagonal elements of A are assumed to be 1 and not referenced in the array a.
n INTEGER. The order of A; the number of rows in B; n ≥ 0.
nrhs INTEGER. The number of right-hand sides; nrhs ≥ 0.
a, b REAL for strtrs
DOUBLE PRECISION for dtrtrs
COMPLEX for ctrtrs
DOUBLE COMPLEX for ztrtrs.
Arrays: a(lda,*), b(ldb,*).
The array a contains the matrix A.
The array b contains the matrix B whose columns are the right-hand sides for the systems of equations.
The second dimension of a must be at least max(1,n), the second dimension of b at least max(1,nrhs).
lda INTEGER. The leading dimension of a; lda ≥ max(1, n).
ldb INTEGER. The leading dimension of b; ldb ≥ max(1, n).

Output Parameters
b Overwritten by the solution matrix X.
info INTEGER. If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine trtrs interface are as follows:
a Holds the matrix A of size (n, n).
b Holds the matrix B of size (n, nrhs).
uplo Must be 'U' or 'L'. The default value is 'U'.
trans Must be 'N', 'C', or 'T'. The default value is 'N'.
diag Must be 'N' or 'U'. The default value is 'N'.

Application Notes
For each right-hand side b, the computed solution is the exact solution of a perturbed system of equations (A + E)x = b, where
$|E| \le c(n)\varepsilon\,|A|$,
c(n) is a modest linear function of n, and $\varepsilon$ is the machine precision.
If $x_0$ is the true solution, the computed solution x satisfies this error bound:
$\frac{\|x - x_0\|_\infty}{\|x\|_\infty} \le c(n)\,\mathrm{cond}(A,x)\,\varepsilon$
where $\mathrm{cond}(A,x) = \|\,|A^{-1}||A||x|\,\|_\infty / \|x\|_\infty \le \|A^{-1}\|_\infty \|A\|_\infty = \kappa_\infty(A)$.
Note that cond(A,x) can be much smaller than $\kappa_\infty(A)$; the condition number of $A^T$ and $A^H$ might or might not be equal to $\kappa_\infty(A)$.
The approximate number of floating-point operations for one right-hand side vector b is $n^2$ for real flavors and $4n^2$ for complex flavors.
To estimate the condition number $\kappa_\infty(A)$, call ?trcon.
To estimate the error in the solution, call ?trrfs.
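The roles of uplo and diag in ?trtrs can be illustrated with a minimal C sketch of the trans = 'N' case for a single right-hand side. The function name tri_solve and its return convention are assumptions made for this illustration; it is not an MKL routine and does not reproduce how MKL implements ?trtrs. It only shows column-major indexing into a(lda,*) and why the diagonal is never read when diag = 'U'.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not an MKL routine): solves A*x = b in place for a
   single right-hand side. A is n-by-n triangular, stored column-major in
   a(lda,*), so element A(i,j) is a[i + j*lda] with 0-based i and j.
   Returns 0 on success, or the 1-based index of a zero diagonal element
   (a convention chosen for this sketch). */
static int tri_solve(char uplo, char diag, int n,
                     const double *a, int lda, double *b)
{
    assert(n >= 0 && lda >= (n > 1 ? n : 1));
    const int unit = (diag == 'U');

    /* Reject a singular matrix before modifying b. */
    if (!unit)
        for (int j = 0; j < n; ++j)
            if (a[j + (size_t)j * lda] == 0.0)
                return j + 1;

    if (uplo == 'U') {
        /* Back substitution: resolve the last unknown first. */
        for (int j = n - 1; j >= 0; --j) {
            if (!unit)
                b[j] /= a[j + (size_t)j * lda];
            for (int i = 0; i < j; ++i)
                b[i] -= b[j] * a[i + (size_t)j * lda];
        }
    } else {
        /* Forward substitution: resolve the first unknown first. */
        for (int j = 0; j < n; ++j) {
            if (!unit)
                b[j] /= a[j + (size_t)j * lda];
            for (int i = j + 1; i < n; ++i)
                b[i] -= b[j] * a[i + (size_t)j * lda];
        }
    }
    return 0;
}
```

Each column update costs about 2j operations, which is consistent with the operation count of roughly $n^2$ quoted above for real flavors. With diag = 'U' the diagonal entries of a may hold arbitrary values because the sketch never reads them.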
?tptrs
Solves a system of linear equations with a packed triangular matrix, with multiple right-hand sides.

Syntax
Fortran 77:
call stptrs( uplo, trans, diag, n, nrhs, ap, b, ldb, info )
call dtptrs( uplo, trans, diag, n, nrhs, ap, b, ldb, info )
call ctptrs( uplo, trans, diag, n, nrhs, ap, b, ldb, info )
call ztptrs( uplo, trans, diag, n, nrhs, ap, b, ldb, info )
Fortran 95:
call tptrs( ap, b [,uplo] [,trans] [,diag] [,info] )
C:
lapack_int LAPACKE_<?>tptrs( int matrix_order, char uplo, char trans, char diag, lapack_int n, lapack_int nrhs, const <datatype>* ap, <datatype>* b, lapack_int ldb );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine solves for X the following systems of linear equations with a packed triangular matrix A, with multiple right-hand sides stored in B:
A*X = B if trans='N',
A^T*X = B if trans='T',
A^H*X = B if trans='C' (for complex matrices only).

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'.
Indicates whether A is upper or lower triangular:
If uplo = 'U', then A is upper triangular.
If uplo = 'L', then A is lower triangular.
trans CHARACTER*1. Must be 'N' or 'T' or 'C'.
If trans = 'N', then A*X = B is solved for X.
If trans = 'T', then A^T*X = B is solved for X.
If trans = 'C', then A^H*X = B is solved for X.
diag CHARACTER*1. Must be 'N' or 'U'.
If diag = 'N', then A is not a unit triangular matrix.
If diag = 'U', then A is unit triangular: diagonal elements are assumed to be 1 and not referenced in the array ap.
n INTEGER. The order of A; the number of rows in B; n ≥ 0.
nrhs INTEGER. The number of right-hand sides; nrhs ≥ 0.
ap, b REAL for stptrs
DOUBLE PRECISION for dtptrs
COMPLEX for ctptrs
DOUBLE COMPLEX for ztptrs.
Arrays: ap(*), b(ldb,*).
The dimension of ap must be at least max(1,n(n+1)/2).
The array ap contains the matrix A in packed storage (see Matrix Storage Schemes).
The array b contains the matrix B whose columns are the right-hand sides for the system of equations. The second dimension of b must be at least max(1, nrhs).
ldb INTEGER. The leading dimension of b; ldb ≥ max(1, n).

Output Parameters
b Overwritten by the solution matrix X.
info INTEGER. If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine tptrs interface are as follows:
ap Holds the array A of size (n*(n+1)/2).
b Holds the matrix B of size (n, nrhs).
uplo Must be 'U' or 'L'. The default value is 'U'.
trans Must be 'N', 'C', or 'T'. The default value is 'N'.
diag Must be 'N' or 'U'. The default value is 'N'.

Application Notes
For each right-hand side b, the computed solution is the exact solution of a perturbed system of equations (A + E)x = b, where
$|E| \le c(n)\varepsilon\,|A|$,
c(n) is a modest linear function of n, and $\varepsilon$ is the machine precision.
If $x_0$ is the true solution, the computed solution x satisfies this error bound:
$\frac{\|x - x_0\|_\infty}{\|x\|_\infty} \le c(n)\,\mathrm{cond}(A,x)\,\varepsilon$
where $\mathrm{cond}(A,x) = \|\,|A^{-1}||A||x|\,\|_\infty / \|x\|_\infty \le \|A^{-1}\|_\infty \|A\|_\infty = \kappa_\infty(A)$.
Note that cond(A,x) can be much smaller than $\kappa_\infty(A)$; the condition number of $A^T$ and $A^H$ might or might not be equal to $\kappa_\infty(A)$.
The approximate number of floating-point operations for one right-hand side vector b is $n^2$ for real flavors and $4n^2$ for complex flavors.
To estimate the condition number $\kappa_\infty(A)$, call ?tpcon.
To estimate the error in the solution, call ?tprfs.
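The packed layout used by ?tptrs, in which the stored triangle is packed by columns into the one-dimensional array ap, can be made concrete with a short C sketch. The names packed_index and packed_tri_solve are assumptions introduced for illustration only; they are not MKL functions. The sketch uses 0-based indices, handles only trans = 'N' with one right-hand side, and assumes a nonzero diagonal when diag = 'N'.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: 0-based position of A(i,j) inside ap when the
   triangle selected by uplo is packed column by column.
   Upper: column j holds rows 0..j, and j*(j+1)/2 entries precede it.
   Lower: column j holds rows j..n-1, and j*(2n-j-1)/2 + j entries
   precede row 0 of that column's nominal start, hence the formula. */
static size_t packed_index(char uplo, int n, int i, int j)
{
    size_t si = (size_t)i, sj = (size_t)j, sn = (size_t)n;
    if (uplo == 'U') {
        assert(0 <= i && i <= j && j < n);
        return si + sj * (sj + 1) / 2;
    }
    assert(0 <= j && j <= i && i < n);
    return si + sj * (2 * sn - sj - 1) / 2;
}

/* Hypothetical helper: solves A*x = b in place with A held in ap. */
static void packed_tri_solve(char uplo, char diag, int n,
                             const double *ap, double *b)
{
    const int unit = (diag == 'U');
    if (uplo == 'U') {
        for (int j = n - 1; j >= 0; --j) {      /* back substitution */
            if (!unit)
                b[j] /= ap[packed_index('U', n, j, j)];
            for (int i = 0; i < j; ++i)
                b[i] -= b[j] * ap[packed_index('U', n, i, j)];
        }
    } else {
        for (int j = 0; j < n; ++j) {           /* forward substitution */
            if (!unit)
                b[j] /= ap[packed_index('L', n, j, j)];
            for (int i = j + 1; i < n; ++i)
                b[i] -= b[j] * ap[packed_index('L', n, i, j)];
        }
    }
}
```

For n = 3 and uplo = 'U', the array ap holds the elements in the order A(0,0), A(0,1), A(1,1), A(0,2), A(1,2), A(2,2), which occupies n(n+1)/2 = 6 entries as required by the dimension rule for ap.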
?tbtrs
Solves a system of linear equations with a band triangular matrix, with multiple right-hand sides.

Syntax
Fortran 77:
call stbtrs( uplo, trans, diag, n, kd, nrhs, ab, ldab, b, ldb, info )
call dtbtrs( uplo, trans, diag, n, kd, nrhs, ab, ldab, b, ldb, info )
call ctbtrs( uplo, trans, diag, n, kd, nrhs, ab, ldab, b, ldb, info )
call ztbtrs( uplo, trans, diag, n, kd, nrhs, ab, ldab, b, ldb, info )
Fortran 95:
call tbtrs( ab, b [,uplo] [,trans] [,diag] [,info] )
C:
lapack_int LAPACKE_<?>tbtrs( int matrix_order, char uplo, char trans, char diag, lapack_int n, lapack_int kd, lapack_int nrhs, const <datatype>* ab, lapack_int ldab, <datatype>* b, lapack_int ldb );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine solves for X the following systems of linear equations with a band triangular matrix A, with multiple right-hand sides stored in B:
A*X = B if trans='N',
A^T*X = B if trans='T',
A^H*X = B if trans='C' (for complex matrices only).

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'.
Indicates whether A is upper or lower triangular:
If uplo = 'U', then A is upper triangular.
If uplo = 'L', then A is lower triangular.
trans CHARACTER*1. Must be 'N' or 'T' or 'C'.
If trans = 'N', then A*X = B is solved for X.
If trans = 'T', then A^T*X = B is solved for X.
If trans = 'C', then A^H*X = B is solved for X.
diag CHARACTER*1. Must be 'N' or 'U'.
If diag = 'N', then A is not a unit triangular matrix.
If diag = 'U', then A is unit triangular: diagonal elements are assumed to be 1 and not referenced in the array ab.
n INTEGER. The order of A; the number of rows in B; n ≥ 0.
kd INTEGER.
The number of superdiagonals or subdiagonals in the matrix A; kd ≥ 0.
nrhs INTEGER. The number of right-hand sides; nrhs ≥ 0.
ab, b REAL for stbtrs
DOUBLE PRECISION for dtbtrs
COMPLEX for ctbtrs
DOUBLE COMPLEX for ztbtrs.
Arrays: ab(ldab,*), b(ldb,*).
The array ab contains the matrix A in band storage form.
The array b contains the matrix B whose columns are the right-hand sides for the systems of equations.
The second dimension of ab must be at least max(1, n), the second dimension of b at least max(1,nrhs).
ldab INTEGER. The leading dimension of ab; ldab ≥ kd + 1.
ldb INTEGER. The leading dimension of b; ldb ≥ max(1, n).

Output Parameters
b Overwritten by the solution matrix X.
info INTEGER. If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine tbtrs interface are as follows:
ab Holds the array A of size (kd+1,n).
b Holds the matrix B of size (n, nrhs).
uplo Must be 'U' or 'L'. The default value is 'U'.
trans Must be 'N', 'C', or 'T'. The default value is 'N'.
diag Must be 'N' or 'U'. The default value is 'N'.

Application Notes
For each right-hand side b, the computed solution is the exact solution of a perturbed system of equations (A + E)x = b, where
$|E| \le c(n)\varepsilon\,|A|$,
c(n) is a modest linear function of n, and $\varepsilon$ is the machine precision.
If $x_0$ is the true solution, the computed solution x satisfies this error bound:
$\frac{\|x - x_0\|_\infty}{\|x\|_\infty} \le c(n)\,\mathrm{cond}(A,x)\,\varepsilon$
where $\mathrm{cond}(A,x) = \|\,|A^{-1}||A||x|\,\|_\infty / \|x\|_\infty \le \|A^{-1}\|_\infty \|A\|_\infty = \kappa_\infty(A)$.
Note that cond(A,x) can be much smaller than $\kappa_\infty(A)$; the condition number of $A^T$ and $A^H$ might or might not be equal to $\kappa_\infty(A)$.
The approximate number of floating-point operations for one right-hand side vector b is 2n*kd for real flavors and 8n*kd for complex flavors.
To estimate the condition number $\kappa_\infty(A)$, call ?tbcon.
To estimate the error in the solution, call ?tbrfs.

Routines for Estimating the Condition Number
This section describes the LAPACK routines for estimating the condition number of a matrix. The condition number is used for analyzing the errors in the solution of a system of linear equations (see Error Analysis). Since the condition number may be arbitrarily large when the matrix is nearly singular, the routines actually compute the reciprocal condition number.

?gecon
Estimates the reciprocal of the condition number of a general matrix in the 1-norm or the infinity-norm.

Syntax
Fortran 77:
call sgecon( norm, n, a, lda, anorm, rcond, work, iwork, info )
call dgecon( norm, n, a, lda, anorm, rcond, work, iwork, info )
call cgecon( norm, n, a, lda, anorm, rcond, work, rwork, info )
call zgecon( norm, n, a, lda, anorm, rcond, work, rwork, info )
Fortran 95:
call gecon( a, anorm, rcond [,norm] [,info] )
C:
lapack_int LAPACKE_sgecon( int matrix_order, char norm, lapack_int n, const float* a, lapack_int lda, float anorm, float* rcond );
lapack_int LAPACKE_dgecon( int matrix_order, char norm, lapack_int n, const double* a, lapack_int lda, double anorm, double* rcond );
lapack_int LAPACKE_cgecon( int matrix_order, char norm, lapack_int n, const lapack_complex_float* a, lapack_int lda, float anorm, float* rcond );
lapack_int LAPACKE_zgecon( int matrix_order, char norm, lapack_int n, const lapack_complex_double* a, lapack_int lda, double anorm, double* rcond );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine estimates the reciprocal of the condition number of a general matrix A in the 1-norm or infinity-norm:
$\kappa_1(A) = \|A\|_1 \|A^{-1}\|_1 = \kappa_\infty(A^T) = \kappa_\infty(A^H)$
$\kappa_\infty(A) = \|A\|_\infty \|A^{-1}\|_\infty = \kappa_1(A^T) = \kappa_1(A^H)$.
Before calling this routine:
• compute anorm (either $\|A\|_1 = \max_j \sum_i |a_{ij}|$ or $\|A\|_\infty = \max_i \sum_j |a_{ij}|$)
• call ?getrf to compute the LU factorization of A.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
norm CHARACTER*1. Must be '1' or 'O' or 'I'.
If norm = '1' or 'O', then the routine estimates the condition number of matrix A in 1-norm.
If norm = 'I', then the routine estimates the condition number of matrix A in infinity-norm.
n INTEGER. The order of the matrix A; n ≥ 0.
a, work REAL for sgecon
DOUBLE PRECISION for dgecon
COMPLEX for cgecon
DOUBLE COMPLEX for zgecon.
Arrays: a(lda,*), work(*).
The array a contains the LU-factored matrix A, as returned by ?getrf. The second dimension of a must be at least max(1,n).
The array work is a workspace for the routine. The dimension of work must be at least max(1, 4*n) for real flavors and max(1, 2*n) for complex flavors.
anorm REAL for single precision flavors.
DOUBLE PRECISION for double precision flavors.
The norm of the original matrix A (see Description).
lda INTEGER. The leading dimension of a; lda ≥ max(1, n).
iwork INTEGER. Workspace array, DIMENSION at least max(1, n).
rwork REAL for cgecon
DOUBLE PRECISION for zgecon.
Workspace array, DIMENSION at least max(1, 2*n).

Output Parameters
rcond REAL for single precision flavors.
DOUBLE PRECISION for double precision flavors.
An estimate of the reciprocal of the condition number. The routine sets rcond = 0 if the estimate underflows; in this case the matrix is singular (to working precision). However, anytime rcond is small compared to 1.0, for the working precision, the matrix may be poorly conditioned or even singular.
info INTEGER.
If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine gecon interface are as follows:
a Holds the matrix A of size (n, n).
norm Must be '1', 'O', or 'I'. The default value is '1'.

Application Notes
The computed rcond is never less than r (the reciprocal of the true condition number) and in practice is nearly always less than 10r. A call to this routine involves solving a number of systems of linear equations A*x = b or A^H*x = b; the number is usually 4 or 5 and never more than 11. Each solution requires approximately $2n^2$ floating-point operations for real flavors and $8n^2$ for complex flavors.

?gbcon
Estimates the reciprocal of the condition number of a band matrix in the 1-norm or the infinity-norm.
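The condition-number routines in this section take anorm, the norm of the original (unfactored) matrix, as an input. The following C sketch shows one way that value could be computed for a matrix held in full column-major storage, as ?gecon uses, together with the definition rcond = 1 / (||A|| ||A^-1||). The names matrix_norm and reciprocal_condition are assumptions made for this illustration and are not MKL functions; the library routines estimate the norm of the inverse without forming it, which this sketch does not attempt. A band matrix would need the same sums taken over its band storage instead.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns the 1-norm (norm = '1' or 'O', maximum
   absolute column sum) or the infinity-norm (norm = 'I', maximum absolute
   row sum) of an n-by-n matrix stored column-major in a(lda,*). */
static double matrix_norm(char norm, int n, const double *a, int lda)
{
    assert(n >= 0 && lda >= (n > 1 ? n : 1));
    double best = 0.0;
    for (int p = 0; p < n; ++p) {
        double sum = 0.0;
        for (int q = 0; q < n; ++q) {
            /* For 'I', p is a row and q a column; otherwise p is a column. */
            double v = (norm == 'I') ? a[p + (size_t)q * lda]
                                     : a[q + (size_t)p * lda];
            sum += (v < 0.0) ? -v : v;
        }
        if (sum > best)
            best = sum;
    }
    return best;
}

/* Hypothetical helper: reciprocal condition number from the two norms.
   Returns 0 when the product is not positive, mirroring the convention
   that rcond = 0 signals a singular matrix. */
static double reciprocal_condition(double anorm, double ainv_norm)
{
    double prod = anorm * ainv_norm;
    return (prod > 0.0) ? 1.0 / prod : 0.0;
}
```

Because the factorization routine overwrites the matrix with its LU factors, a caller would need to evaluate matrix_norm before factoring, while the array still holds the original entries.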
Syntax
Fortran 77:
call sgbcon( norm, n, kl, ku, ab, ldab, ipiv, anorm, rcond, work, iwork, info )
call dgbcon( norm, n, kl, ku, ab, ldab, ipiv, anorm, rcond, work, iwork, info )
call cgbcon( norm, n, kl, ku, ab, ldab, ipiv, anorm, rcond, work, rwork, info )
call zgbcon( norm, n, kl, ku, ab, ldab, ipiv, anorm, rcond, work, rwork, info )
Fortran 95:
call gbcon( ab, ipiv, anorm, rcond [,kl] [,norm] [,info] )
C:
lapack_int LAPACKE_sgbcon( int matrix_order, char norm, lapack_int n, lapack_int kl, lapack_int ku, const float* ab, lapack_int ldab, const lapack_int* ipiv, float anorm, float* rcond );
lapack_int LAPACKE_dgbcon( int matrix_order, char norm, lapack_int n, lapack_int kl, lapack_int ku, const double* ab, lapack_int ldab, const lapack_int* ipiv, double anorm, double* rcond );
lapack_int LAPACKE_cgbcon( int matrix_order, char norm, lapack_int n, lapack_int kl, lapack_int ku, const lapack_complex_float* ab, lapack_int ldab, const lapack_int* ipiv, float anorm, float* rcond );
lapack_int LAPACKE_zgbcon( int matrix_order, char norm, lapack_int n, lapack_int kl, lapack_int ku, const lapack_complex_double* ab, lapack_int ldab, const lapack_int* ipiv, double anorm, double* rcond );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine estimates the reciprocal of the condition number of a general band matrix A in the 1-norm or infinity-norm:
$\kappa_1(A) = \|A\|_1 \|A^{-1}\|_1 = \kappa_\infty(A^T) = \kappa_\infty(A^H)$
$\kappa_\infty(A) = \|A\|_\infty \|A^{-1}\|_\infty = \kappa_1(A^T) = \kappa_1(A^H)$.
Before calling this routine:
• compute anorm (either $\|A\|_1 = \max_j \sum_i |a_{ij}|$ or $\|A\|_\infty = \max_i \sum_j |a_{ij}|$)
• call ?gbtrf to compute the LU factorization of A.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
norm CHARACTER*1. Must be '1' or 'O' or 'I'.
If norm = '1' or 'O', then the routine estimates the condition number of matrix A in 1-norm.
If norm = 'I', then the routine estimates the condition number of matrix A in infinity-norm.
n INTEGER. The order of the matrix A; n ≥ 0.
kl INTEGER. The number of subdiagonals within the band of A; kl ≥ 0.
ku INTEGER. The number of superdiagonals within the band of A; ku ≥ 0.
ldab INTEGER. The leading dimension of the array ab; ldab ≥ 2*kl + ku + 1.
ipiv INTEGER. Array, DIMENSION at least max(1, n). The ipiv array, as returned by ?gbtrf.
ab, work REAL for sgbcon
DOUBLE PRECISION for dgbcon
COMPLEX for cgbcon
DOUBLE COMPLEX for zgbcon.
Arrays: ab(ldab,*), work(*).
The array ab contains the factored band matrix A, as returned by ?gbtrf. The second dimension of ab must be at least max(1,n).
The array work is a workspace for the routine. The dimension of work must be at least max(1, 3*n) for real flavors and max(1, 2*n) for complex flavors.
anorm REAL for single precision flavors.
DOUBLE PRECISION for double precision flavors.
The norm of the original matrix A (see Description).
iwork INTEGER. Workspace array, DIMENSION at least max(1, n).
rwork REAL for cgbcon
DOUBLE PRECISION for zgbcon.
Workspace array, DIMENSION at least max(1, 2*n).

Output Parameters
rcond REAL for single precision flavors.
DOUBLE PRECISION for double precision flavors.
An estimate of the reciprocal of the condition number. The routine sets rcond = 0 if the estimate underflows; in this case the matrix is singular (to working precision). However, anytime rcond is small compared to 1.0, for the working precision, the matrix may be poorly conditioned or even singular.
info INTEGER. If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts.
For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine gbcon interface are as follows:
ab Holds the array A of size (2*kl+ku+1,n).
ipiv Holds the vector of length n.
norm Must be '1', 'O', or 'I'. The default value is '1'.
kl If omitted, assumed kl = ku.
ku Restored as ku = ldab-2*kl-1, where ldab is the leading dimension of ab.

Application Notes
The computed rcond is never less than r (the reciprocal of the true condition number) and in practice is nearly always less than 10r. A call to this routine involves solving a number of systems of linear equations A*x = b or A^H*x = b; the number is usually 4 or 5 and never more than 11. Each solution requires approximately 2n(ku + 2kl) floating-point operations for real flavors and 8n(ku + 2kl) for complex flavors.

?gtcon
Estimates the reciprocal of the condition number of a tridiagonal matrix using the factorization computed by ?gttrf.

Syntax
Fortran 77:
call sgtcon( norm, n, dl, d, du, du2, ipiv, anorm, rcond, work, iwork, info )
call dgtcon( norm, n, dl, d, du, du2, ipiv, anorm, rcond, work, iwork, info )
call cgtcon( norm, n, dl, d, du, du2, ipiv, anorm, rcond, work, info )
call zgtcon( norm, n, dl, d, du, du2, ipiv, anorm, rcond, work, info )
Fortran 95:
call gtcon( dl, d, du, du2, ipiv, anorm, rcond [,norm] [,info] )
C:
lapack_int LAPACKE_sgtcon( char norm, lapack_int n, const float* dl, const float* d, const float* du, const float* du2, const lapack_int* ipiv, float anorm, float* rcond );
lapack_int LAPACKE_dgtcon( char norm, lapack_int n, const double* dl, const double* d, const double* du, const double* du2, const lapack_int* ipiv, double anorm, double* rcond );
lapack_int LAPACKE_cgtcon( char norm, lapack_int n, const lapack_complex_float* dl, const lapack_complex_float* d, const lapack_complex_float* du, const lapack_complex_float* du2, const lapack_int* ipiv, float anorm, float* rcond );
lapack_int LAPACKE_zgtcon( char norm, lapack_int n, const lapack_complex_double* dl, const lapack_complex_double* d, const lapack_complex_double* du, const lapack_complex_double* du2, const lapack_int* ipiv, double anorm, double* rcond );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine estimates the reciprocal of the condition number of a real or complex tridiagonal matrix A in the 1-norm or infinity-norm:
$\kappa_1(A) = \|A\|_1 \|A^{-1}\|_1$
$\kappa_\infty(A) = \|A\|_\infty \|A^{-1}\|_\infty$
An estimate is obtained for $\|A^{-1}\|$, and the reciprocal of the condition number is computed as rcond = 1 / ($\|A\|\,\|A^{-1}\|$).
Before calling this routine:
• compute anorm (either $\|A\|_1 = \max_j \sum_i |a_{ij}|$ or $\|A\|_\infty = \max_i \sum_j |a_{ij}|$)
• call ?gttrf to compute the LU factorization of A.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
norm CHARACTER*1. Must be '1' or 'O' or 'I'.
If norm = '1' or 'O', then the routine estimates the condition number of matrix A in 1-norm.
If norm = 'I', then the routine estimates the condition number of matrix A in infinity-norm.
n INTEGER. The order of the matrix A; n ≥ 0.
dl,d,du,du2 REAL for sgtcon
DOUBLE PRECISION for dgtcon
COMPLEX for cgtcon
DOUBLE COMPLEX for zgtcon.
Arrays: dl(n -1), d(n), du(n -1), du2(n -2).
The array dl contains the (n - 1) multipliers that define the matrix L from the LU factorization of A as computed by ?gttrf.
The array d contains the n diagonal elements of the upper triangular matrix U from the LU factorization of A.
The array du contains the (n - 1) elements of the first superdiagonal of U.
The array du2 contains the (n - 2) elements of the second superdiagonal of U.
ipiv INTEGER. Array, DIMENSION (n).
The array of pivot indices, as returned by ?gttrf.
anorm REAL for single precision flavors. DOUBLE PRECISION for double precision flavors. The norm of the original matrix A (see Description).
work REAL for sgtcon, DOUBLE PRECISION for dgtcon, COMPLEX for cgtcon, DOUBLE COMPLEX for zgtcon. Workspace array, DIMENSION (2*n).
iwork INTEGER. Workspace array, DIMENSION (n). Used for real flavors only.

Output Parameters
rcond REAL for single precision flavors. DOUBLE PRECISION for double precision flavors. An estimate of the reciprocal of the condition number. The routine sets rcond = 0 if the estimate underflows; in this case the matrix is singular (to working precision). However, anytime rcond is small compared to 1.0, for the working precision, the matrix may be poorly conditioned or even singular.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions. Specific details for the routine gtcon interface are as follows:
dl Holds the vector of length (n-1).
d Holds the vector of length n.
du Holds the vector of length (n-1).
du2 Holds the vector of length (n-2).
ipiv Holds the vector of length n.
norm Must be '1', 'O', or 'I'. The default value is '1'.

Application Notes
The computed rcond is never less than r (the reciprocal of the true condition number) and in practice is nearly always less than 10r. A call to this routine involves solving a number of systems of linear equations A*x = b; the number is usually 4 or 5 and never more than 11. Each solution requires approximately 2n^2 floating-point operations for real flavors and 8n^2 for complex flavors.

?pocon
Estimates the reciprocal of the condition number of a symmetric (Hermitian) positive-definite matrix.
Syntax
Fortran 77:
call spocon( uplo, n, a, lda, anorm, rcond, work, iwork, info )
call dpocon( uplo, n, a, lda, anorm, rcond, work, iwork, info )
call cpocon( uplo, n, a, lda, anorm, rcond, work, rwork, info )
call zpocon( uplo, n, a, lda, anorm, rcond, work, rwork, info )
Fortran 95:
call pocon( a, anorm, rcond [,uplo] [,info] )
C:
lapack_int LAPACKE_spocon( int matrix_order, char uplo, lapack_int n, const float* a, lapack_int lda, float anorm, float* rcond );
lapack_int LAPACKE_dpocon( int matrix_order, char uplo, lapack_int n, const double* a, lapack_int lda, double anorm, double* rcond );
lapack_int LAPACKE_cpocon( int matrix_order, char uplo, lapack_int n, const lapack_complex_float* a, lapack_int lda, float anorm, float* rcond );
lapack_int LAPACKE_zpocon( int matrix_order, char uplo, lapack_int n, const lapack_complex_double* a, lapack_int lda, double anorm, double* rcond );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine estimates the reciprocal of the condition number of a symmetric (Hermitian) positive-definite matrix A:
κ_1(A) = ||A||_1 ||A^-1||_1 (since A is symmetric or Hermitian, κ_∞(A) = κ_1(A)).
Before calling this routine:
• compute anorm (either ||A||_1 = max_j Σ_i |a_ij| or ||A||_∞ = max_i Σ_j |a_ij|)
• call ?potrf to compute the Cholesky factorization of A.

Input Parameters
The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'. Indicates how the input matrix A has been factored: If uplo = 'U', the upper triangle of A is stored. If uplo = 'L', the lower triangle of A is stored.
n INTEGER. The order of the matrix A; n ≥ 0.
a, work REAL for spocon, DOUBLE PRECISION for dpocon, COMPLEX for cpocon, DOUBLE COMPLEX for zpocon. Arrays: a(lda,*), work(*). The array a contains the factored matrix A, as returned by ?potrf. The second dimension of a must be at least max(1,n). The array work is a workspace for the routine. The dimension of work must be at least max(1, 3*n) for real flavors and max(1, 2*n) for complex flavors.
lda INTEGER. The leading dimension of a; lda ≥ max(1, n).
anorm REAL for single precision flavors. DOUBLE PRECISION for double precision flavors. The norm of the original matrix A (see Description).
iwork INTEGER. Workspace array, DIMENSION at least max(1, n).
rwork REAL for cpocon, DOUBLE PRECISION for zpocon. Workspace array, DIMENSION at least max(1, n).

Output Parameters
rcond REAL for single precision flavors. DOUBLE PRECISION for double precision flavors. An estimate of the reciprocal of the condition number. The routine sets rcond = 0 if the estimate underflows; in this case the matrix is singular (to working precision). However, anytime rcond is small compared to 1.0, for the working precision, the matrix may be poorly conditioned or even singular.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions. Specific details for the routine pocon interface are as follows:
a Holds the matrix A of size (n, n).
uplo Must be 'U' or 'L'. The default value is 'U'.

Application Notes
The computed rcond is never less than r (the reciprocal of the true condition number) and in practice is nearly always less than 10r.
A call to this routine involves solving a number of systems of linear equations A*x = b; the number is usually 4 or 5 and never more than 11. Each solution requires approximately 2n^2 floating-point operations for real flavors and 8n^2 for complex flavors.

?ppcon
Estimates the reciprocal of the condition number of a packed symmetric (Hermitian) positive-definite matrix.

Syntax
Fortran 77:
call sppcon( uplo, n, ap, anorm, rcond, work, iwork, info )
call dppcon( uplo, n, ap, anorm, rcond, work, iwork, info )
call cppcon( uplo, n, ap, anorm, rcond, work, rwork, info )
call zppcon( uplo, n, ap, anorm, rcond, work, rwork, info )
Fortran 95:
call ppcon( ap, anorm, rcond [,uplo] [,info] )
C:
lapack_int LAPACKE_sppcon( int matrix_order, char uplo, lapack_int n, const float* ap, float anorm, float* rcond );
lapack_int LAPACKE_dppcon( int matrix_order, char uplo, lapack_int n, const double* ap, double anorm, double* rcond );
lapack_int LAPACKE_cppcon( int matrix_order, char uplo, lapack_int n, const lapack_complex_float* ap, float anorm, float* rcond );
lapack_int LAPACKE_zppcon( int matrix_order, char uplo, lapack_int n, const lapack_complex_double* ap, double anorm, double* rcond );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine estimates the reciprocal of the condition number of a packed symmetric (Hermitian) positive-definite matrix A:
κ_1(A) = ||A||_1 ||A^-1||_1 (since A is symmetric or Hermitian, κ_∞(A) = κ_1(A)).
Before calling this routine:
• compute anorm (either ||A||_1 = max_j Σ_i |a_ij| or ||A||_∞ = max_i Σ_j |a_ij|)
• call ?pptrf to compute the Cholesky factorization of A.

Input Parameters
The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'. Indicates how the input matrix A has been factored: If uplo = 'U', the upper triangle of A is stored. If uplo = 'L', the lower triangle of A is stored.
n INTEGER. The order of the matrix A; n ≥ 0.
ap, work REAL for sppcon, DOUBLE PRECISION for dppcon, COMPLEX for cppcon, DOUBLE COMPLEX for zppcon. Arrays: ap(*), work(*). The array ap contains the packed factored matrix A, as returned by ?pptrf. The dimension of ap must be at least max(1,n(n+1)/2). The array work is a workspace for the routine. The dimension of work must be at least max(1, 3*n) for real flavors and max(1, 2*n) for complex flavors.
anorm REAL for single precision flavors. DOUBLE PRECISION for double precision flavors. The norm of the original matrix A (see Description).
iwork INTEGER. Workspace array, DIMENSION at least max(1, n).
rwork REAL for cppcon, DOUBLE PRECISION for zppcon. Workspace array, DIMENSION at least max(1, n).

Output Parameters
rcond REAL for single precision flavors. DOUBLE PRECISION for double precision flavors. An estimate of the reciprocal of the condition number. The routine sets rcond = 0 if the estimate underflows; in this case the matrix is singular (to working precision). However, anytime rcond is small compared to 1.0, for the working precision, the matrix may be poorly conditioned or even singular.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions. Specific details for the routine ppcon interface are as follows:
ap Holds the array A of size (n*(n+1)/2).
uplo Must be 'U' or 'L'. The default value is 'U'.
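
The anorm argument of ?ppcon has to be computed by the caller from the original, unfactored matrix. The following is a minimal sketch of that step for a real symmetric matrix in packed storage. The helper names packed_elem and packed_sym_norm1 are hypothetical and are not Intel MKL routines, and the index formulas assume the column-packed layout with zero-based C indexing. In production code, the LAPACK auxiliary routine ?lansp can compute the same norm.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not an MKL routine): returns element (i, j), zero-based,
 * of a real symmetric n-by-n matrix whose 'U' or 'L' triangle is packed by
 * columns in ap. Indices outside the stored triangle are mirrored into it. */
static double packed_elem(char uplo, int n, const double *ap, int i, int j)
{
    if (uplo == 'U') {
        if (i > j) { int t = i; i = j; j = t; }   /* mirror into upper triangle */
        return ap[(size_t)i + (size_t)j * (size_t)(j + 1) / 2];
    }
    if (i < j) { int t = i; i = j; j = t; }       /* mirror into lower triangle */
    return ap[(size_t)i + (size_t)j * (size_t)(2 * n - j - 1) / 2];
}

/* Hypothetical helper: ||A||_1 = max over columns j of the sum over rows i
 * of |a(i,j)|. For a symmetric matrix this equals the infinity-norm. */
double packed_sym_norm1(char uplo, int n, const double *ap)
{
    assert(uplo == 'U' || uplo == 'L');
    assert(n >= 0);
    double norm = 0.0;
    for (int j = 0; j < n; ++j) {
        double colsum = 0.0;
        for (int i = 0; i < n; ++i) {
            double v = packed_elem(uplo, n, ap, i, j);
            colsum += (v < 0.0) ? -v : v;          /* absolute value */
        }
        if (colsum > norm)
            norm = colsum;
    }
    return norm;
}
```

Because ?pptrf overwrites ap with the Cholesky factor, a helper like this would be called before the factorization, and the saved value then passed as anorm.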
Application Notes
The computed rcond is never less than r (the reciprocal of the true condition number) and in practice is nearly always less than 10r. A call to this routine involves solving a number of systems of linear equations A*x = b; the number is usually 4 or 5 and never more than 11. Each solution requires approximately 2n^2 floating-point operations for real flavors and 8n^2 for complex flavors.

?pbcon
Estimates the reciprocal of the condition number of a symmetric (Hermitian) positive-definite band matrix.

Syntax
Fortran 77:
call spbcon( uplo, n, kd, ab, ldab, anorm, rcond, work, iwork, info )
call dpbcon( uplo, n, kd, ab, ldab, anorm, rcond, work, iwork, info )
call cpbcon( uplo, n, kd, ab, ldab, anorm, rcond, work, rwork, info )
call zpbcon( uplo, n, kd, ab, ldab, anorm, rcond, work, rwork, info )
Fortran 95:
call pbcon( ab, anorm, rcond [,uplo] [,info] )
C:
lapack_int LAPACKE_spbcon( int matrix_order, char uplo, lapack_int n, lapack_int kd, const float* ab, lapack_int ldab, float anorm, float* rcond );
lapack_int LAPACKE_dpbcon( int matrix_order, char uplo, lapack_int n, lapack_int kd, const double* ab, lapack_int ldab, double anorm, double* rcond );
lapack_int LAPACKE_cpbcon( int matrix_order, char uplo, lapack_int n, lapack_int kd, const lapack_complex_float* ab, lapack_int ldab, float anorm, float* rcond );
lapack_int LAPACKE_zpbcon( int matrix_order, char uplo, lapack_int n, lapack_int kd, const lapack_complex_double* ab, lapack_int ldab, double anorm, double* rcond );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine estimates the reciprocal of the condition number of a symmetric (Hermitian) positive-definite band matrix A:
κ_1(A) = ||A||_1 ||A^-1||_1 (since A is symmetric or Hermitian, κ_∞(A) = κ_1(A)).
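
For reference, the quantities used in the description above can be written out as follows. This restates the manual's own definitions, with r denoting the reciprocal of the true condition number as in the Application Notes. The comment about the inverse being symmetric or Hermitian is a standard fact added here for completeness.

```latex
\kappa_1(A) = \|A\|_1 \,\|A^{-1}\|_1, \qquad
\|A\|_1 = \max_j \sum_i |a_{ij}|, \qquad
\|A\|_\infty = \max_i \sum_j |a_{ij}|.

% For symmetric or Hermitian A, |a_{ij}| = |a_{ji}|, so column sums equal row sums;
% the same holds for A^{-1}, which is again symmetric or Hermitian.
\|A\|_1 = \|A\|_\infty
\quad\Longrightarrow\quad
\kappa_\infty(A) = \kappa_1(A).

% The routine returns an estimate of r = 1 / \kappa_1(A):
r = \frac{1}{\|A\|_1 \,\|A^{-1}\|_1}, \qquad r \le \mathtt{rcond}.
```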
Before calling this routine:
• compute anorm (either ||A||_1 = max_j Σ_i |a_ij| or ||A||_∞ = max_i Σ_j |a_ij|)
• call ?pbtrf to compute the Cholesky factorization of A.

Input Parameters
The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'. Indicates how the input matrix A has been factored: If uplo = 'U', the upper triangular factor is stored in ab. If uplo = 'L', the lower triangular factor is stored in ab.
n INTEGER. The order of the matrix A; n ≥ 0.
kd INTEGER. The number of superdiagonals or subdiagonals in the matrix A; kd ≥ 0.
ldab INTEGER. The leading dimension of the array ab; ldab ≥ kd + 1.
ab, work REAL for spbcon, DOUBLE PRECISION for dpbcon, COMPLEX for cpbcon, DOUBLE COMPLEX for zpbcon. Arrays: ab(ldab,*), work(*). The array ab contains the factored matrix A in band form, as returned by ?pbtrf. The second dimension of ab must be at least max(1, n). The array work is a workspace for the routine. The dimension of work must be at least max(1, 3*n) for real flavors and max(1, 2*n) for complex flavors.
anorm REAL for single precision flavors. DOUBLE PRECISION for double precision flavors. The norm of the original matrix A (see Description).
iwork INTEGER. Workspace array, DIMENSION at least max(1, n).
rwork REAL for cpbcon, DOUBLE PRECISION for zpbcon. Workspace array, DIMENSION at least max(1, n).

Output Parameters
rcond REAL for single precision flavors. DOUBLE PRECISION for double precision flavors. An estimate of the reciprocal of the condition number. The routine sets rcond = 0 if the estimate underflows; in this case the matrix is singular (to working precision).
However, anytime rcond is small compared to 1.0, for the working precision, the matrix may be poorly conditioned or even singular.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions. Specific details for the routine pbcon interface are as follows:
ab Holds the array A of size (kd+1,n).
uplo Must be 'U' or 'L'. The default value is 'U'.

Application Notes
The computed rcond is never less than r (the reciprocal of the true condition number) and in practice is nearly always less than 10r. A call to this routine involves solving a number of systems of linear equations A*x = b; the number is usually 4 or 5 and never more than 11. Each solution requires approximately 4*n(kd + 1) floating-point operations for real flavors and 16*n(kd + 1) for complex flavors.

?ptcon
Estimates the reciprocal of the condition number of a symmetric (Hermitian) positive-definite tridiagonal matrix.
Syntax
Fortran 77:
call sptcon( n, d, e, anorm, rcond, work, info )
call dptcon( n, d, e, anorm, rcond, work, info )
call cptcon( n, d, e, anorm, rcond, work, info )
call zptcon( n, d, e, anorm, rcond, work, info )
Fortran 95:
call ptcon( d, e, anorm, rcond [,info] )
C:
lapack_int LAPACKE_sptcon( lapack_int n, const float* d, const float* e, float anorm, float* rcond );
lapack_int LAPACKE_dptcon( lapack_int n, const double* d, const double* e, double anorm, double* rcond );
lapack_int LAPACKE_cptcon( lapack_int n, const float* d, const lapack_complex_float* e, float anorm, float* rcond );
lapack_int LAPACKE_zptcon( lapack_int n, const double* d, const lapack_complex_double* e, double anorm, double* rcond );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine computes the reciprocal of the condition number (in the 1-norm) of a real symmetric or complex Hermitian positive-definite tridiagonal matrix using the factorization A = L*D*L^T for real flavors and A = L*D*L^H for complex flavors, or A = U^T*D*U for real flavors and A = U^H*D*U for complex flavors, computed by ?pttrf:
κ_1(A) = ||A||_1 ||A^-1||_1 (since A is symmetric or Hermitian, κ_∞(A) = κ_1(A)).
The norm ||A^-1|| is computed by a direct method, and the reciprocal of the condition number is computed as rcond = 1 / (||A|| ||A^-1||).
Before calling this routine:
• compute anorm as ||A||_1 = max_j Σ_i |a_ij|
• call ?pttrf to compute the factorization of A.

Input Parameters
The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
n INTEGER. The order of the matrix A; n ≥ 0.
d, work REAL for single precision flavors. DOUBLE PRECISION for double precision flavors. Arrays, dimension (n).
The array d contains the n diagonal elements of the diagonal matrix D from the factorization of A, as computed by ?pttrf; work is a workspace array.
e REAL for sptcon, DOUBLE PRECISION for dptcon, COMPLEX for cptcon, DOUBLE COMPLEX for zptcon. Array, DIMENSION (n-1). Contains off-diagonal elements of the unit bidiagonal factor U or L from the factorization computed by ?pttrf.
anorm REAL for single precision flavors. DOUBLE PRECISION for double precision flavors. The 1-norm of the original matrix A (see Description).

Output Parameters
rcond REAL for single precision flavors. DOUBLE PRECISION for double precision flavors. An estimate of the reciprocal of the condition number. The routine sets rcond = 0 if the estimate underflows; in this case the matrix is singular (to working precision). However, anytime rcond is small compared to 1.0, for the working precision, the matrix may be poorly conditioned or even singular.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions. Specific details for the routine ptcon interface are as follows:
d Holds the vector of length n.
e Holds the vector of length (n-1).

Application Notes
The computed rcond is never less than r (the reciprocal of the true condition number) and in practice is nearly always less than 10r. A call to this routine involves solving a number of systems of linear equations A*x = b; the number is usually 4 or 5 and never more than 11. Each solution requires approximately 4*n(kd + 1) floating-point operations for real flavors and 16*n(kd + 1) for complex flavors.

?sycon
Estimates the reciprocal of the condition number of a symmetric matrix.
Syntax
Fortran 77:
call ssycon( uplo, n, a, lda, ipiv, anorm, rcond, work, iwork, info )
call dsycon( uplo, n, a, lda, ipiv, anorm, rcond, work, iwork, info )
call csycon( uplo, n, a, lda, ipiv, anorm, rcond, work, info )
call zsycon( uplo, n, a, lda, ipiv, anorm, rcond, work, info )
Fortran 95:
call sycon( a, ipiv, anorm, rcond [,uplo] [,info] )
C:
lapack_int LAPACKE_ssycon( int matrix_order, char uplo, lapack_int n, const float* a, lapack_int lda, const lapack_int* ipiv, float anorm, float* rcond );
lapack_int LAPACKE_dsycon( int matrix_order, char uplo, lapack_int n, const double* a, lapack_int lda, const lapack_int* ipiv, double anorm, double* rcond );
lapack_int LAPACKE_csycon( int matrix_order, char uplo, lapack_int n, const lapack_complex_float* a, lapack_int lda, const lapack_int* ipiv, float anorm, float* rcond );
lapack_int LAPACKE_zsycon( int matrix_order, char uplo, lapack_int n, const lapack_complex_double* a, lapack_int lda, const lapack_int* ipiv, double anorm, double* rcond );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine estimates the reciprocal of the condition number of a symmetric matrix A:
κ_1(A) = ||A||_1 ||A^-1||_1 (since A is symmetric, κ_∞(A) = κ_1(A)).
Before calling this routine:
• compute anorm (either ||A||_1 = max_j Σ_i |a_ij| or ||A||_∞ = max_i Σ_j |a_ij|)
• call ?sytrf to compute the factorization of A.

Input Parameters
The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'. Indicates how the input matrix A has been factored: If uplo = 'U', the array a stores the upper triangular factor U of the factorization A = P*U*D*U^T*P^T.
If uplo = 'L', the array a stores the lower triangular factor L of the factorization A = P*L*D*L^T*P^T.
n INTEGER. The order of matrix A; n ≥ 0.
a, work REAL for ssycon, DOUBLE PRECISION for dsycon, COMPLEX for csycon, DOUBLE COMPLEX for zsycon. Arrays: a(lda,*), work(*). The array a contains the factored matrix A, as returned by ?sytrf. The second dimension of a must be at least max(1,n). The array work is a workspace for the routine. The dimension of work must be at least max(1, 2*n).
lda INTEGER. The leading dimension of a; lda ≥ max(1, n).
ipiv INTEGER. Array, DIMENSION at least max(1, n). The array ipiv, as returned by ?sytrf.
anorm REAL for single precision flavors. DOUBLE PRECISION for double precision flavors. The norm of the original matrix A (see Description).
iwork INTEGER. Workspace array, DIMENSION at least max(1, n).

Output Parameters
rcond REAL for single precision flavors. DOUBLE PRECISION for double precision flavors. An estimate of the reciprocal of the condition number. The routine sets rcond = 0 if the estimate underflows; in this case the matrix is singular (to working precision). However, anytime rcond is small compared to 1.0, for the working precision, the matrix may be poorly conditioned or even singular.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions. Specific details for the routine sycon interface are as follows:
a Holds the matrix A of size (n, n).
ipiv Holds the vector of length n.
uplo Must be 'U' or 'L'. The default value is 'U'.

Application Notes
The computed rcond is never less than r (the reciprocal of the true condition number) and in practice is nearly always less than 10r.
A call to this routine involves solving a number of systems of linear equations A*x = b; the number is usually 4 or 5 and never more than 11. Each solution requires approximately 2n^2 floating-point operations for real flavors and 8n^2 for complex flavors.

?syconv
Converts a symmetric matrix given by a triangular matrix factorization into two matrices and vice versa.

Syntax
Fortran 77:
call ssyconv( uplo, way, n, a, lda, ipiv, work, info )
call dsyconv( uplo, way, n, a, lda, ipiv, work, info )
call csyconv( uplo, way, n, a, lda, ipiv, work, info )
call zsyconv( uplo, way, n, a, lda, ipiv, work, info )
Fortran 95:
call syconv( a[,uplo][,way][,ipiv][,info] )

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine converts matrix A, which results from a triangular matrix factorization, into matrices L and D and vice versa. The routine gets non-diagonalized elements of D returned in the workspace and applies or reverses permutation done with the triangular matrix factorization.

Input Parameters
uplo CHARACTER*1. Must be 'U' or 'L'. Indicates whether the details of the factorization are stored as an upper or lower triangular matrix: If uplo = 'U': the upper triangular, A = U*D*U^T. If uplo = 'L': the lower triangular, A = L*D*L^T.
way CHARACTER*1. Must be 'C' or 'R'. Indicates whether the routine converts or reverts the matrix: way = 'C' means conversion. way = 'R' means reversion.
n INTEGER. The order of matrix A; n ≥ 0.
a REAL for ssyconv, DOUBLE PRECISION for dsyconv, COMPLEX for csyconv, DOUBLE COMPLEX for zsyconv. Array of DIMENSION (lda,n). The block diagonal matrix D and the multipliers used to obtain the factor U or L as computed by ?sytrf.
lda INTEGER. The leading dimension of a; lda ≥ max(1, n).
ipiv INTEGER. Array, DIMENSION at least max(1, n). Details of the interchanges and the block structure of D, as returned by ?sytrf.
work REAL for ssyconv, DOUBLE PRECISION for dsyconv, COMPLEX for csyconv, DOUBLE COMPLEX for zsyconv.
Workspace array, DIMENSION at least max(1, n).

Output Parameters
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions. Specific details for the routine syconv interface are as follows:
a Holds the matrix A of size (n, n).
uplo Must be 'U' or 'L'.
way Must be 'C' or 'R'.
ipiv Holds the vector of length n.

See Also
?sytrf

?hecon
Estimates the reciprocal of the condition number of a Hermitian matrix.

Syntax
Fortran 77:
call checon( uplo, n, a, lda, ipiv, anorm, rcond, work, info )
call zhecon( uplo, n, a, lda, ipiv, anorm, rcond, work, info )
Fortran 95:
call hecon( a, ipiv, anorm, rcond [,uplo] [,info] )
C:
lapack_int LAPACKE_checon( int matrix_order, char uplo, lapack_int n, const lapack_complex_float* a, lapack_int lda, const lapack_int* ipiv, float anorm, float* rcond );
lapack_int LAPACKE_zhecon( int matrix_order, char uplo, lapack_int n, const lapack_complex_double* a, lapack_int lda, const lapack_int* ipiv, double anorm, double* rcond );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine estimates the reciprocal of the condition number of a Hermitian matrix A:
κ_1(A) = ||A||_1 ||A^-1||_1 (since A is Hermitian, κ_∞(A) = κ_1(A)).
Before calling this routine:
• compute anorm (either ||A||_1 = max_j Σ_i |a_ij| or ||A||_∞ = max_i Σ_j |a_ij|)
• call ?hetrf to compute the factorization of A.

Input Parameters
The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'. Indicates how the input matrix A has been factored: If uplo = 'U', the array a stores the upper triangular factor U of the factorization A = P*U*D*U^H*P^T. If uplo = 'L', the array a stores the lower triangular factor L of the factorization A = P*L*D*L^H*P^T.
n INTEGER. The order of matrix A; n ≥ 0.
a, work COMPLEX for checon, DOUBLE COMPLEX for zhecon. Arrays: a(lda,*), work(*). The array a contains the factored matrix A, as returned by ?hetrf. The second dimension of a must be at least max(1,n). The array work is a workspace for the routine. The dimension of work must be at least max(1, 2*n).
lda INTEGER. The leading dimension of a; lda ≥ max(1, n).
ipiv INTEGER. Array, DIMENSION at least max(1, n). The array ipiv, as returned by ?hetrf.
anorm REAL for single precision flavors. DOUBLE PRECISION for double precision flavors. The norm of the original matrix A (see Description).

Output Parameters
rcond REAL for single precision flavors. DOUBLE PRECISION for double precision flavors. An estimate of the reciprocal of the condition number. The routine sets rcond = 0 if the estimate underflows; in this case the matrix is singular (to working precision). However, anytime rcond is small compared to 1.0, for the working precision, the matrix may be poorly conditioned or even singular.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions. Specific details for the routine hecon interface are as follows:
a Holds the matrix A of size (n, n).
ipiv Holds the vector of length n.
uplo Must be 'U' or 'L'. The default value is 'U'.
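
The description of rcond above distinguishes an exact zero (the estimate underflowed) from a value that is merely small relative to the working precision. The following is a minimal sketch of how a caller might act on that distinction for double precision results. The names cond_status and classify_rcond are hypothetical, and the DBL_EPSILON threshold is an assumption (a common heuristic), not a rule stated in this manual.

```c
#include <assert.h>
#include <float.h>

/* Hypothetical classification of a double precision rcond value. */
typedef enum {
    COND_OK,        /* rcond is not small relative to working precision */
    COND_ILL,       /* rcond is below the assumed threshold             */
    COND_SINGULAR   /* rcond == 0: the estimate underflowed             */
} cond_status;

/* Hypothetical helper (not an MKL routine). The threshold DBL_EPSILON is an
 * assumption chosen for illustration; applications may need a stricter one. */
cond_status classify_rcond(double rcond)
{
    assert(rcond >= 0.0);              /* rcond is a reciprocal of a norm product */
    if (rcond == 0.0)
        return COND_SINGULAR;          /* singular to working precision */
    if (rcond < DBL_EPSILON)
        return COND_ILL;               /* poorly conditioned or nearly singular */
    return COND_OK;
}
```

A caller would typically check info first and classify rcond only when info = 0. For single precision flavors the analogous threshold would be FLT_EPSILON.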
Application Notes
The computed rcond is never less than r (the reciprocal of the true condition number) and in practice is nearly always less than 10r. A call to this routine involves solving a number of systems of linear equations A*x = b; the number is usually 5 and never more than 11. Each solution requires approximately 8n^2 floating-point operations.

?spcon
Estimates the reciprocal of the condition number of a packed symmetric matrix.

Syntax
Fortran 77:
call sspcon( uplo, n, ap, ipiv, anorm, rcond, work, iwork, info )
call dspcon( uplo, n, ap, ipiv, anorm, rcond, work, iwork, info )
call cspcon( uplo, n, ap, ipiv, anorm, rcond, work, info )
call zspcon( uplo, n, ap, ipiv, anorm, rcond, work, info )
Fortran 95:
call spcon( ap, ipiv, anorm, rcond [,uplo] [,info] )
C:
lapack_int LAPACKE_sspcon( int matrix_order, char uplo, lapack_int n, const float* ap, const lapack_int* ipiv, float anorm, float* rcond );
lapack_int LAPACKE_dspcon( int matrix_order, char uplo, lapack_int n, const double* ap, const lapack_int* ipiv, double anorm, double* rcond );
lapack_int LAPACKE_cspcon( int matrix_order, char uplo, lapack_int n, const lapack_complex_float* ap, const lapack_int* ipiv, float anorm, float* rcond );
lapack_int LAPACKE_zspcon( int matrix_order, char uplo, lapack_int n, const lapack_complex_double* ap, const lapack_int* ipiv, double anorm, double* rcond );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine estimates the reciprocal of the condition number of a packed symmetric matrix A:
κ_1(A) = ||A||_1 ||A^-1||_1 (since A is symmetric, κ_∞(A) = κ_1(A)).
Before calling this routine:
• compute anorm (either ||A||_1 = max_j Σ_i |a_ij| or ||A||_∞ = max_i Σ_j |a_ij|)
• call ?sptrf to compute the factorization of A.

Input Parameters
The data types are given for the Fortran interface.
A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'. Indicates how the input matrix A has been factored: If uplo = 'U', the array ap stores the packed upper triangular factor U of the factorization A = P*U*D*U^T*P^T. If uplo = 'L', the array ap stores the packed lower triangular factor L of the factorization A = P*L*D*L^T*P^T.
n INTEGER. The order of matrix A; n ≥ 0.
ap, work REAL for sspcon, DOUBLE PRECISION for dspcon, COMPLEX for cspcon, DOUBLE COMPLEX for zspcon. Arrays: ap(*), work(*). The array ap contains the packed factored matrix A, as returned by ?sptrf. The dimension of ap must be at least max(1,n(n+1)/2). The array work is a workspace for the routine. The dimension of work must be at least max(1, 2*n).
ipiv INTEGER. Array, DIMENSION at least max(1, n). The array ipiv, as returned by ?sptrf.
anorm REAL for single precision flavors. DOUBLE PRECISION for double precision flavors. The norm of the original matrix A (see Description).
iwork INTEGER. Workspace array, DIMENSION at least max(1, n).

Output Parameters
rcond REAL for single precision flavors. DOUBLE PRECISION for double precision flavors. An estimate of the reciprocal of the condition number. The routine sets rcond = 0 if the estimate underflows; in this case the matrix is singular (to working precision). However, anytime rcond is small compared to 1.0, for the working precision, the matrix may be poorly conditioned or even singular.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts.
For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions. Specific details for the routine spcon interface are as follows:
ap Holds the array A of size (n*(n+1)/2).
ipiv Holds the vector of length n.
uplo Must be 'U' or 'L'. The default value is 'U'.

Application Notes
The computed rcond is never less than ρ (the reciprocal of the true condition number) and in practice is nearly always less than 10ρ. A call to this routine involves solving a number of systems of linear equations A*x = b; the number is usually 4 or 5 and never more than 11. Each solution requires approximately $2n^2$ floating-point operations for real flavors and $8n^2$ for complex flavors.

?hpcon
Estimates the reciprocal of the condition number of a packed Hermitian matrix.

Syntax
Fortran 77:
call chpcon( uplo, n, ap, ipiv, anorm, rcond, work, info )
call zhpcon( uplo, n, ap, ipiv, anorm, rcond, work, info )
Fortran 95:
call hpcon( ap, ipiv, anorm, rcond [,uplo] [,info] )
C:
lapack_int LAPACKE_chpcon( int matrix_order, char uplo, lapack_int n, const lapack_complex_float* ap, const lapack_int* ipiv, float anorm, float* rcond );
lapack_int LAPACKE_zhpcon( int matrix_order, char uplo, lapack_int n, const lapack_complex_double* ap, const lapack_int* ipiv, double anorm, double* rcond );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine estimates the reciprocal of the condition number of a Hermitian matrix A:
$\kappa_1(A) = \|A\|_1 \|A^{-1}\|_1$ (since A is Hermitian, $\kappa_\infty(A) = \kappa_1(A)$).
Before calling this routine:
• compute anorm (either $\|A\|_1 = \max_j \sum_i |a_{ij}|$ or $\|A\|_\infty = \max_i \sum_j |a_{ij}|$)
• call ?hptrf to compute the factorization of A.

Input Parameters
The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above.
See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'. Indicates how the input matrix A has been factored:
If uplo = 'U', the array ap stores the packed upper triangular factor U of the factorization $A = P U D U^H P^T$.
If uplo = 'L', the array ap stores the packed lower triangular factor L of the factorization $A = P L D L^H P^T$.
n INTEGER. The order of matrix A; n ≥ 0.
ap, work COMPLEX for chpcon
DOUBLE COMPLEX for zhpcon.
Arrays: ap(*), work(*). The array ap contains the packed factored matrix A, as returned by ?hptrf. The dimension of ap must be at least max(1, n(n+1)/2). The array work is a workspace for the routine. The dimension of work must be at least max(1, 2*n).
ipiv INTEGER. Array, DIMENSION at least max(1, n). The array ipiv, as returned by ?hptrf.
anorm REAL for single precision flavors.
DOUBLE PRECISION for double precision flavors.
The norm of the original matrix A (see Description).

Output Parameters
rcond REAL for single precision flavors.
DOUBLE PRECISION for double precision flavors.
An estimate of the reciprocal of the condition number. The routine sets rcond = 0 if the estimate underflows; in this case the matrix is singular (to working precision). However, anytime rcond is small compared to 1.0, for the working precision, the matrix may be poorly conditioned or even singular.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions. Specific details for the routine hpcon interface are as follows:
ap Holds the array A of size (n*(n+1)/2).
ipiv Holds the vector of length n.
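The anorm argument of ?spcon and ?hpcon must be computed from the original matrix before it is factored. As a minimal sketch, assuming a real symmetric matrix whose upper triangle is packed by columns with 0-based indexing a(i,j) = ap[i + j*(j+1)/2] for i ≤ j, the 1-norm can be accumulated as follows. The function name sym_packed_norm1 is hypothetical and is not part of Intel MKL; for a Hermitian matrix the same loop would apply to the moduli of the complex entries.

```c
#include <assert.h>
#include <stddef.h>

/* Absolute value helper, so the sketch does not depend on libm. */
static double abs_d(double v) { return v < 0.0 ? -v : v; }

/* 1-norm (maximum absolute column sum) of an n-by-n real symmetric matrix.
   Assumed layout: upper triangle packed by columns, 0-based,
   a(i,j) = ap[i + j*(j+1)/2] for i <= j. */
double sym_packed_norm1(int n, const double *ap)
{
    assert(n >= 0 && (n == 0 || ap != NULL));
    double best = 0.0;
    for (int j = 0; j < n; ++j) {
        double colsum = 0.0;
        for (int i = 0; i < n; ++i) {
            /* Entries below the diagonal are read from the mirrored
               position in the stored upper triangle. */
            int r = i <= j ? i : j;
            int c = i <= j ? j : i;
            colsum += abs_d(ap[r + (size_t)c * (c + 1) / 2]);
        }
        if (colsum > best)
            best = colsum;
    }
    return best;
}
```

Because the matrix is symmetric, the same value is also its infinity-norm, which is why either norm may be passed as anorm.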
Application Notes
The computed rcond is never less than ρ (the reciprocal of the true condition number) and in practice is nearly always less than 10ρ. A call to this routine involves solving a number of systems of linear equations A*x = b; the number is usually 5 and never more than 11. Each solution requires approximately $8n^2$ floating-point operations.

?trcon
Estimates the reciprocal of the condition number of a triangular matrix.

Syntax
Fortran 77:
call strcon( norm, uplo, diag, n, a, lda, rcond, work, iwork, info )
call dtrcon( norm, uplo, diag, n, a, lda, rcond, work, iwork, info )
call ctrcon( norm, uplo, diag, n, a, lda, rcond, work, rwork, info )
call ztrcon( norm, uplo, diag, n, a, lda, rcond, work, rwork, info )
Fortran 95:
call trcon( a, rcond [,uplo] [,diag] [,norm] [,info] )
C:
lapack_int LAPACKE_strcon( int matrix_order, char norm, char uplo, char diag, lapack_int n, const float* a, lapack_int lda, float* rcond );
lapack_int LAPACKE_dtrcon( int matrix_order, char norm, char uplo, char diag, lapack_int n, const double* a, lapack_int lda, double* rcond );
lapack_int LAPACKE_ctrcon( int matrix_order, char norm, char uplo, char diag, lapack_int n, const lapack_complex_float* a, lapack_int lda, float* rcond );
lapack_int LAPACKE_ztrcon( int matrix_order, char norm, char uplo, char diag, lapack_int n, const lapack_complex_double* a, lapack_int lda, double* rcond );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine estimates the reciprocal of the condition number of a triangular matrix A in either the 1-norm or infinity-norm:
$\kappa_1(A) = \|A\|_1 \|A^{-1}\|_1 = \kappa_\infty(A^T) = \kappa_\infty(A^H)$
$\kappa_\infty(A) = \|A\|_\infty \|A^{-1}\|_\infty = \kappa_1(A^T) = \kappa_1(A^H)$.

Input Parameters
The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above.
See the C Interface Conventions section for the C interface principal conventions and type definitions.
norm CHARACTER*1. Must be '1' or 'O' or 'I'.
If norm = '1' or 'O', then the routine estimates the condition number of matrix A in 1-norm.
If norm = 'I', then the routine estimates the condition number of matrix A in infinity-norm.
uplo CHARACTER*1. Must be 'U' or 'L'. Indicates whether A is upper or lower triangular:
If uplo = 'U', the array a stores the upper triangle of A, other array elements are not referenced.
If uplo = 'L', the array a stores the lower triangle of A, other array elements are not referenced.
diag CHARACTER*1. Must be 'N' or 'U'.
If diag = 'N', then A is not a unit triangular matrix.
If diag = 'U', then A is unit triangular: diagonal elements are assumed to be 1 and not referenced in the array a.
n INTEGER. The order of the matrix A; n ≥ 0.
a, work REAL for strcon
DOUBLE PRECISION for dtrcon
COMPLEX for ctrcon
DOUBLE COMPLEX for ztrcon.
Arrays: a(lda,*), work(*). The array a contains the matrix A. The second dimension of a must be at least max(1, n). The array work is a workspace for the routine. The dimension of work must be at least max(1, 3*n) for real flavors and max(1, 2*n) for complex flavors.
lda INTEGER. The leading dimension of a; lda ≥ max(1, n).
iwork INTEGER. Workspace array, DIMENSION at least max(1, n).
rwork REAL for ctrcon
DOUBLE PRECISION for ztrcon.
Workspace array, DIMENSION at least max(1, n).

Output Parameters
rcond REAL for single precision flavors.
DOUBLE PRECISION for double precision flavors.
An estimate of the reciprocal of the condition number. The routine sets rcond = 0 if the estimate underflows; in this case the matrix is singular (to working precision). However, anytime rcond is small compared to 1.0, for the working precision, the matrix may be poorly conditioned or even singular.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.
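The value returned in rcond is an estimate of $1/(\|A\|_1 \|A^{-1}\|_1)$. For very small matrices the exact quantity can be formed directly, which is a convenient way to sanity-check an estimate. The sketch below is not how ?trcon works internally (the routine never forms the inverse); it assumes a non-unit upper triangular matrix in column-major storage, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Absolute value helper, so the sketch does not depend on libm. */
static double abs_d(double v) { return v < 0.0 ? -v : v; }

/* Maximum absolute column sum over the upper triangle of a
   column-major n-by-n array with leading dimension lda. */
static double upper_norm1(int n, const double *a, int lda)
{
    double best = 0.0;
    for (int j = 0; j < n; ++j) {
        double s = 0.0;
        for (int i = 0; i <= j; ++i)
            s += abs_d(a[i + (size_t)j * lda]);
        if (s > best)
            best = s;
    }
    return best;
}

/* Exact reciprocal 1-norm condition number of a non-unit upper triangular
   matrix, computed by forming the inverse one column at a time.
   Returns 0.0 if a diagonal entry is exactly zero (singular matrix),
   1.0 for n == 0, and -1.0 if the scratch allocation fails. */
double upper_rcond1_exact(int n, const double *a, int lda)
{
    assert(n >= 0 && lda >= (n > 1 ? n : 1));
    if (n == 0)
        return 1.0;
    double *inv = calloc((size_t)n * n, sizeof *inv);
    if (inv == NULL)
        return -1.0;
    for (int j = 0; j < n; ++j) {
        /* Solve A * x = e_j by back substitution; x is zero below row j. */
        for (int i = j; i >= 0; --i) {
            double d = a[i + (size_t)i * lda];
            if (d == 0.0) {
                free(inv);
                return 0.0;
            }
            double s = (i == j) ? 1.0 : 0.0;
            for (int k = i + 1; k <= j; ++k)
                s -= a[i + (size_t)k * lda] * inv[k + (size_t)j * n];
            inv[i + (size_t)j * n] = s / d;
        }
    }
    double anorm = upper_norm1(n, a, lda);
    double ainvnorm = upper_norm1(n, inv, n);
    free(inv);
    return 1.0 / (anorm * ainvnorm);
}
```

Forming the inverse costs on the order of $n^3$ operations, so this is only reasonable for test-sized inputs.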
Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions. Specific details for the routine trcon interface are as follows:
a Holds the matrix A of size (n, n).
norm Must be '1', 'O', or 'I'. The default value is '1'.
uplo Must be 'U' or 'L'. The default value is 'U'.
diag Must be 'N' or 'U'. The default value is 'N'.

Application Notes
The computed rcond is never less than ρ (the reciprocal of the true condition number) and in practice is nearly always less than 10ρ. A call to this routine involves solving a number of systems of linear equations A*x = b; the number is usually 4 or 5 and never more than 11. Each solution requires approximately $n^2$ floating-point operations for real flavors and $4n^2$ operations for complex flavors.

?tpcon
Estimates the reciprocal of the condition number of a packed triangular matrix.
Syntax
Fortran 77:
call stpcon( norm, uplo, diag, n, ap, rcond, work, iwork, info )
call dtpcon( norm, uplo, diag, n, ap, rcond, work, iwork, info )
call ctpcon( norm, uplo, diag, n, ap, rcond, work, rwork, info )
call ztpcon( norm, uplo, diag, n, ap, rcond, work, rwork, info )
Fortran 95:
call tpcon( ap, rcond [,uplo] [,diag] [,norm] [,info] )
C:
lapack_int LAPACKE_stpcon( int matrix_order, char norm, char uplo, char diag, lapack_int n, const float* ap, float* rcond );
lapack_int LAPACKE_dtpcon( int matrix_order, char norm, char uplo, char diag, lapack_int n, const double* ap, double* rcond );
lapack_int LAPACKE_ctpcon( int matrix_order, char norm, char uplo, char diag, lapack_int n, const lapack_complex_float* ap, float* rcond );
lapack_int LAPACKE_ztpcon( int matrix_order, char norm, char uplo, char diag, lapack_int n, const lapack_complex_double* ap, double* rcond );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine estimates the reciprocal of the condition number of a packed triangular matrix A in either the 1-norm or infinity-norm:
$\kappa_1(A) = \|A\|_1 \|A^{-1}\|_1 = \kappa_\infty(A^T) = \kappa_\infty(A^H)$
$\kappa_\infty(A) = \|A\|_\infty \|A^{-1}\|_\infty = \kappa_1(A^T) = \kappa_1(A^H)$.

Input Parameters
The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
norm CHARACTER*1. Must be '1' or 'O' or 'I'.
If norm = '1' or 'O', then the routine estimates the condition number of matrix A in 1-norm.
If norm = 'I', then the routine estimates the condition number of matrix A in infinity-norm.
uplo CHARACTER*1. Must be 'U' or 'L'. Indicates whether A is upper or lower triangular:
If uplo = 'U', the array ap stores the upper triangle of A in packed form.
If uplo = 'L', the array ap stores the lower triangle of A in packed form.
diag CHARACTER*1. Must be 'N' or 'U'.
If diag = 'N', then A is not a unit triangular matrix.
If diag = 'U', then A is unit triangular: diagonal elements are assumed to be 1 and not referenced in the array ap.
n INTEGER. The order of the matrix A; n ≥ 0.
ap, work REAL for stpcon
DOUBLE PRECISION for dtpcon
COMPLEX for ctpcon
DOUBLE COMPLEX for ztpcon.
Arrays: ap(*), work(*). The array ap contains the packed matrix A. The dimension of ap must be at least max(1, n(n+1)/2). The array work is a workspace for the routine. The dimension of work must be at least max(1, 3*n) for real flavors and max(1, 2*n) for complex flavors.
iwork INTEGER. Workspace array, DIMENSION at least max(1, n).
rwork REAL for ctpcon
DOUBLE PRECISION for ztpcon.
Workspace array, DIMENSION at least max(1, n).

Output Parameters
rcond REAL for single precision flavors.
DOUBLE PRECISION for double precision flavors.
An estimate of the reciprocal of the condition number. The routine sets rcond = 0 if the estimate underflows; in this case the matrix is singular (to working precision). However, anytime rcond is small compared to 1.0, for the working precision, the matrix may be poorly conditioned or even singular.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions. Specific details for the routine tpcon interface are as follows:
ap Holds the array A of size (n*(n+1)/2).
norm Must be '1', 'O', or 'I'. The default value is '1'.
uplo Must be 'U' or 'L'. The default value is 'U'.
diag Must be 'N' or 'U'. The default value is 'N'.
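To make the meaning of the norm and diag selectors concrete, the following sketch computes the matrix norm that corresponds to each choice for an upper triangular matrix in packed storage. It assumes 0-based packing by columns, a(i,j) = ap[i + j*(j+1)/2] for i ≤ j, and real data; tp_upper_norm is a hypothetical helper, not an Intel MKL routine, and it computes only $\|A\|$, not the condition number.

```c
#include <assert.h>
#include <stddef.h>

/* Absolute value helper, so the sketch does not depend on libm. */
static double abs_d(double v) { return v < 0.0 ? -v : v; }

/* Norm of an n-by-n upper triangular matrix packed by columns.
   norm: '1' or 'O' selects the 1-norm (maximum column sum),
         'I' selects the infinity-norm (maximum row sum).
   diag: 'N' reads the stored diagonal; 'U' treats every diagonal
         entry as 1 and never reads it from ap.
   Returns -1.0 for an unrecognized norm or diag character. */
double tp_upper_norm(char norm, char diag, int n, const double *ap)
{
    assert(n >= 0);
    int by_cols = (norm == '1' || norm == 'O');
    if (!by_cols && norm != 'I')
        return -1.0;
    if (diag != 'N' && diag != 'U')
        return -1.0;

    double best = 0.0;
    for (int k = 0; k < n; ++k) {
        /* k is a column index for the 1-norm and a row index otherwise. */
        int lo = by_cols ? 0 : k;
        int hi = by_cols ? k : n - 1;
        double s = 0.0;
        for (int m = lo; m <= hi; ++m) {
            int i = by_cols ? m : k;
            int j = by_cols ? k : m;
            if (i == j && diag == 'U')
                s += 1.0;
            else
                s += abs_d(ap[i + (size_t)j * (j + 1) / 2]);
        }
        if (s > best)
            best = s;
    }
    return best;
}
```

Swapping the roles of rows and columns is exactly what relates the two condition numbers in the Description: the 1-norm of A is the infinity-norm of its transpose.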
Application Notes
The computed rcond is never less than ρ (the reciprocal of the true condition number) and in practice is nearly always less than 10ρ. A call to this routine involves solving a number of systems of linear equations A*x = b; the number is usually 4 or 5 and never more than 11. Each solution requires approximately $n^2$ floating-point operations for real flavors and $4n^2$ operations for complex flavors.

?tbcon
Estimates the reciprocal of the condition number of a triangular band matrix.

Syntax
Fortran 77:
call stbcon( norm, uplo, diag, n, kd, ab, ldab, rcond, work, iwork, info )
call dtbcon( norm, uplo, diag, n, kd, ab, ldab, rcond, work, iwork, info )
call ctbcon( norm, uplo, diag, n, kd, ab, ldab, rcond, work, rwork, info )
call ztbcon( norm, uplo, diag, n, kd, ab, ldab, rcond, work, rwork, info )
Fortran 95:
call tbcon( ab, rcond [,uplo] [,diag] [,norm] [,info] )
C:
lapack_int LAPACKE_stbcon( int matrix_order, char norm, char uplo, char diag, lapack_int n, lapack_int kd, const float* ab, lapack_int ldab, float* rcond );
lapack_int LAPACKE_dtbcon( int matrix_order, char norm, char uplo, char diag, lapack_int n, lapack_int kd, const double* ab, lapack_int ldab, double* rcond );
lapack_int LAPACKE_ctbcon( int matrix_order, char norm, char uplo, char diag, lapack_int n, lapack_int kd, const lapack_complex_float* ab, lapack_int ldab, float* rcond );
lapack_int LAPACKE_ztbcon( int matrix_order, char norm, char uplo, char diag, lapack_int n, lapack_int kd, const lapack_complex_double* ab, lapack_int ldab, double* rcond );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine estimates the reciprocal of the condition number of a triangular band matrix A in either the 1-norm or infinity-norm:
$\kappa_1(A) = \|A\|_1 \|A^{-1}\|_1 = \kappa_\infty(A^T) = \kappa_\infty(A^H)$
$\kappa_\infty(A) = \|A\|_\infty \|A^{-1}\|_\infty = \kappa_1(A^T) = \kappa_1(A^H)$.

Input Parameters
The data types are given for the Fortran interface.
A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
norm CHARACTER*1. Must be '1' or 'O' or 'I'.
If norm = '1' or 'O', then the routine estimates the condition number of matrix A in 1-norm.
If norm = 'I', then the routine estimates the condition number of matrix A in infinity-norm.
uplo CHARACTER*1. Must be 'U' or 'L'. Indicates whether A is upper or lower triangular:
If uplo = 'U', the array ab stores the upper triangle of A in band form.
If uplo = 'L', the array ab stores the lower triangle of A in band form.
diag CHARACTER*1. Must be 'N' or 'U'.
If diag = 'N', then A is not a unit triangular matrix.
If diag = 'U', then A is unit triangular: diagonal elements are assumed to be 1 and not referenced in the array ab.
n INTEGER. The order of the matrix A; n ≥ 0.
kd INTEGER. The number of superdiagonals or subdiagonals in the matrix A; kd ≥ 0.
ab, work REAL for stbcon
DOUBLE PRECISION for dtbcon
COMPLEX for ctbcon
DOUBLE COMPLEX for ztbcon.
Arrays: ab(ldab,*), work(*). The array ab contains the band matrix A. The second dimension of ab must be at least max(1, n). The array work is a workspace for the routine. The dimension of work must be at least max(1, 3*n) for real flavors and max(1, 2*n) for complex flavors.
ldab INTEGER. The leading dimension of the array ab; ldab ≥ kd + 1.
iwork INTEGER. Workspace array, DIMENSION at least max(1, n).
rwork REAL for ctbcon
DOUBLE PRECISION for ztbcon.
Workspace array, DIMENSION at least max(1, n).

Output Parameters
rcond REAL for single precision flavors.
DOUBLE PRECISION for double precision flavors.
An estimate of the reciprocal of the condition number. The routine sets rcond = 0 if the estimate underflows; in this case the matrix is singular (to working precision).
However, anytime rcond is small compared to 1.0, for the working precision, the matrix may be poorly conditioned or even singular.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions. Specific details for the routine tbcon interface are as follows:
ab Holds the array A of size (kd+1,n).
norm Must be '1', 'O', or 'I'. The default value is '1'.
uplo Must be 'U' or 'L'. The default value is 'U'.
diag Must be 'N' or 'U'. The default value is 'N'.

Application Notes
The computed rcond is never less than ρ (the reciprocal of the true condition number) and in practice is nearly always less than 10ρ. A call to this routine involves solving a number of systems of linear equations A*x = b; the number is usually 4 or 5 and never more than 11. Each solution requires approximately 2n(kd + 1) floating-point operations for real flavors and 8n(kd + 1) operations for complex flavors.

Refining the Solution and Estimating Its Error
This section describes the LAPACK routines for refining the computed solution of a system of linear equations and estimating the solution error. You can call these routines after factorizing the matrix of the system of equations and computing the solution (see Routines for Matrix Factorization and Routines for Solving Systems of Linear Equations).

?gerfs
Refines the solution of a system of linear equations with a general matrix and estimates its error.
Syntax
Fortran 77:
call sgerfs( trans, n, nrhs, a, lda, af, ldaf, ipiv, b, ldb, x, ldx, ferr, berr, work, iwork, info )
call dgerfs( trans, n, nrhs, a, lda, af, ldaf, ipiv, b, ldb, x, ldx, ferr, berr, work, iwork, info )
call cgerfs( trans, n, nrhs, a, lda, af, ldaf, ipiv, b, ldb, x, ldx, ferr, berr, work, rwork, info )
call zgerfs( trans, n, nrhs, a, lda, af, ldaf, ipiv, b, ldb, x, ldx, ferr, berr, work, rwork, info )
Fortran 95:
call gerfs( a, af, ipiv, b, x [,trans] [,ferr] [,berr] [,info] )
C:
lapack_int LAPACKE_sgerfs( int matrix_order, char trans, lapack_int n, lapack_int nrhs, const float* a, lapack_int lda, const float* af, lapack_int ldaf, const lapack_int* ipiv, const float* b, lapack_int ldb, float* x, lapack_int ldx, float* ferr, float* berr );
lapack_int LAPACKE_dgerfs( int matrix_order, char trans, lapack_int n, lapack_int nrhs, const double* a, lapack_int lda, const double* af, lapack_int ldaf, const lapack_int* ipiv, const double* b, lapack_int ldb, double* x, lapack_int ldx, double* ferr, double* berr );
lapack_int LAPACKE_cgerfs( int matrix_order, char trans, lapack_int n, lapack_int nrhs, const lapack_complex_float* a, lapack_int lda, const lapack_complex_float* af, lapack_int ldaf, const lapack_int* ipiv, const lapack_complex_float* b, lapack_int ldb, lapack_complex_float* x, lapack_int ldx, float* ferr, float* berr );
lapack_int LAPACKE_zgerfs( int matrix_order, char trans, lapack_int n, lapack_int nrhs, const lapack_complex_double* a, lapack_int lda, const lapack_complex_double* af, lapack_int ldaf, const lapack_int* ipiv, const lapack_complex_double* b, lapack_int ldb, lapack_complex_double* x, lapack_int ldx, double* ferr, double* berr );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine performs an iterative refinement of the solution to a system of linear equations $A X = B$ or $A^T X = B$ or $A^H X = B$ with a general matrix A, with
multiple right-hand sides. For each computed solution vector x, the routine computes the component-wise backward error β. This error is the smallest relative perturbation in elements of A and b such that x is the exact solution of the perturbed system:
$|\delta a_{ij}| \le \beta |a_{ij}|$, $|\delta b_i| \le \beta |b_i|$ such that $(A + \delta A)x = (b + \delta b)$.
Finally, the routine estimates the component-wise forward error in the computed solution $\|x - x_e\|_\infty / \|x\|_\infty$ (here $x_e$ is the exact solution).
Before calling this routine:
• call the factorization routine ?getrf
• call the solver routine ?getrs.

Input Parameters
The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
trans CHARACTER*1. Must be 'N' or 'T' or 'C'. Indicates the form of the equations:
If trans = 'N', the system has the form $A X = B$.
If trans = 'T', the system has the form $A^T X = B$.
If trans = 'C', the system has the form $A^H X = B$.
n INTEGER. The order of the matrix A; n ≥ 0.
nrhs INTEGER. The number of right-hand sides; nrhs ≥ 0.
a, af, b, x, work REAL for sgerfs
DOUBLE PRECISION for dgerfs
COMPLEX for cgerfs
DOUBLE COMPLEX for zgerfs.
Arrays:
a(lda,*) contains the original matrix A, as supplied to ?getrf.
af(ldaf,*) contains the factored matrix A, as returned by ?getrf.
b(ldb,*) contains the right-hand side matrix B.
x(ldx,*) contains the solution matrix X.
work(*) is a workspace array.
The second dimension of a and af must be at least max(1, n); the second dimension of b and x must be at least max(1, nrhs); the dimension of work must be at least max(1, 3*n) for real flavors and max(1, 2*n) for complex flavors.
lda INTEGER. The leading dimension of a; lda ≥ max(1, n).
ldaf INTEGER. The leading dimension of af; ldaf ≥ max(1, n).
ldb INTEGER. The leading dimension of b; ldb ≥ max(1, n).
ldx INTEGER.
The leading dimension of x; ldx ≥ max(1, n).
ipiv INTEGER. Array, DIMENSION at least max(1, n). The ipiv array, as returned by ?getrf.
iwork INTEGER. Workspace array, DIMENSION at least max(1, n).
rwork REAL for cgerfs
DOUBLE PRECISION for zgerfs.
Workspace array, DIMENSION at least max(1, n).

Output Parameters
x The refined solution matrix X.
ferr, berr REAL for single precision flavors.
DOUBLE PRECISION for double precision flavors.
Arrays, DIMENSION at least max(1, nrhs). Contain the component-wise forward and backward errors, respectively, for each solution vector.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions. Specific details for the routine gerfs interface are as follows:
a Holds the matrix A of size (n, n).
af Holds the matrix AF of size (n, n).
ipiv Holds the vector of length n.
b Holds the matrix B of size (n, nrhs).
x Holds the matrix X of size (n, nrhs).
ferr Holds the vector of length (nrhs).
berr Holds the vector of length (nrhs).
trans Must be 'N', 'C', or 'T'. The default value is 'N'.

Application Notes
The bounds returned in ferr are not rigorous, but in practice they almost always overestimate the actual error. For each right-hand side, computation of the backward error involves a minimum of $4n^2$ floating-point operations (for real flavors) or $16n^2$ operations (for complex flavors). In addition, each step of iterative refinement involves $6n^2$ operations (for real flavors) or $24n^2$ operations (for complex flavors); the number of iterations may range from 1 to 5.
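The component-wise backward error reported in berr can be illustrated with the standard closed form $\beta = \max_i |b - Ax|_i / (|A||x| + |b|)_i$, which is the smallest β satisfying the perturbation bounds given in the Description. The sketch below evaluates it for one real right-hand side with column-major storage. It is an illustration only: the function name is hypothetical, and the safeguards that a production routine applies when a denominator is tiny are omitted.

```c
#include <assert.h>
#include <stddef.h>

/* Absolute value helper, so the sketch does not depend on libm. */
static double abs_d(double v) { return v < 0.0 ? -v : v; }

/* Component-wise relative backward error of x as a solution of A*x = b:
       beta = max_i |b - A*x|_i / (|A|*|x| + |b|)_i
   a is n-by-n, column-major, with leading dimension lda.
   A row whose denominator is zero necessarily has a zero residual
   and is skipped. */
double componentwise_backward_error(int n, const double *a, int lda,
                                    const double *x, const double *b)
{
    assert(n >= 0 && lda >= (n > 1 ? n : 1));
    double beta = 0.0;
    for (int i = 0; i < n; ++i) {
        double resid = b[i];          /* accumulates (b - A*x)_i      */
        double scale = abs_d(b[i]);   /* accumulates (|A||x| + |b|)_i */
        for (int j = 0; j < n; ++j) {
            double aij = a[i + (size_t)j * lda];
            resid -= aij * x[j];
            scale += abs_d(aij) * abs_d(x[j]);
        }
        if (scale > 0.0) {
            double ratio = abs_d(resid) / scale;
            if (ratio > beta)
                beta = ratio;
        }
    }
    return beta;
}
```

Each row costs one pass over n entries for the residual and one for the scale, which is where the operation count quoted for the backward error comes from.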
Estimating the forward error involves solving a number of systems of linear equations A*x = b; the number is usually 4 or 5 and never more than 11. Each solution requires approximately $2n^2$ floating-point operations for real flavors or $8n^2$ for complex flavors.

?gerfsx
Uses extra precise iterative refinement to improve the solution to the system of linear equations with a general matrix A and provides error bounds and backward error estimates.

Syntax
Fortran 77:
call sgerfsx( trans, equed, n, nrhs, a, lda, af, ldaf, ipiv, r, c, b, ldb, x, ldx, rcond, berr, n_err_bnds, err_bnds_norm, err_bnds_comp, nparams, params, work, iwork, info )
call dgerfsx( trans, equed, n, nrhs, a, lda, af, ldaf, ipiv, r, c, b, ldb, x, ldx, rcond, berr, n_err_bnds, err_bnds_norm, err_bnds_comp, nparams, params, work, iwork, info )
call cgerfsx( trans, equed, n, nrhs, a, lda, af, ldaf, ipiv, r, c, b, ldb, x, ldx, rcond, berr, n_err_bnds, err_bnds_norm, err_bnds_comp, nparams, params, work, rwork, info )
call zgerfsx( trans, equed, n, nrhs, a, lda, af, ldaf, ipiv, r, c, b, ldb, x, ldx, rcond, berr, n_err_bnds, err_bnds_norm, err_bnds_comp, nparams, params, work, rwork, info )
C:
lapack_int LAPACKE_sgerfsx( int matrix_order, char trans, char equed, lapack_int n, lapack_int nrhs, const float* a, lapack_int lda, const float* af, lapack_int ldaf, const lapack_int* ipiv, const float* r, const float* c, const float* b, lapack_int ldb, float* x, lapack_int ldx, float* rcond, float* berr, lapack_int n_err_bnds, float* err_bnds_norm, float* err_bnds_comp, lapack_int nparams, float* params );
lapack_int LAPACKE_dgerfsx( int matrix_order, char trans, char equed, lapack_int n, lapack_int nrhs, const double* a, lapack_int lda, const double* af, lapack_int ldaf, const lapack_int* ipiv, const double* r, const double* c, const double* b, lapack_int ldb, double* x, lapack_int ldx, double* rcond, double* berr, lapack_int n_err_bnds, double* err_bnds_norm, double* err_bnds_comp, lapack_int nparams, double* params
);
lapack_int LAPACKE_cgerfsx( int matrix_order, char trans, char equed, lapack_int n, lapack_int nrhs, const lapack_complex_float* a, lapack_int lda, const lapack_complex_float* af, lapack_int ldaf, const lapack_int* ipiv, const float* r, const float* c, const lapack_complex_float* b, lapack_int ldb, lapack_complex_float* x, lapack_int ldx, float* rcond, float* berr, lapack_int n_err_bnds, float* err_bnds_norm, float* err_bnds_comp, lapack_int nparams, float* params );
lapack_int LAPACKE_zgerfsx( int matrix_order, char trans, char equed, lapack_int n, lapack_int nrhs, const lapack_complex_double* a, lapack_int lda, const lapack_complex_double* af, lapack_int ldaf, const lapack_int* ipiv, const double* r, const double* c, const lapack_complex_double* b, lapack_int ldb, lapack_complex_double* x, lapack_int ldx, double* rcond, double* berr, lapack_int n_err_bnds, double* err_bnds_norm, double* err_bnds_comp, lapack_int nparams, double* params );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine improves the computed solution to a system of linear equations and provides error bounds and backward error estimates for the solution. In addition to a normwise error bound, the code provides a maximum componentwise error bound, if possible. See comments for err_bnds_norm and err_bnds_comp for details of the error bounds.
The original system of linear equations may have been equilibrated before calling this routine, as described by the parameters equed, r, and c below. In this case, the solution and error bounds returned are for the original unequilibrated system.

Input Parameters
The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
trans CHARACTER*1.
Must be 'N', 'T', or 'C'. Specifies the form of the system of equations:
If trans = 'N', the system has the form A*X = B (No transpose).
If trans = 'T', the system has the form A**T*X = B (Transpose).
If trans = 'C', the system has the form A**H*X = B (Conjugate transpose; the same as Transpose for real flavors).
equed CHARACTER*1. Must be 'N', 'R', 'C', or 'B'. Specifies the form of equilibration that was done to A before calling this routine.
If equed = 'N', no equilibration was done.
If equed = 'R', row equilibration was done, that is, A has been premultiplied by diag(r).
If equed = 'C', column equilibration was done, that is, A has been postmultiplied by diag(c).
If equed = 'B', both row and column equilibration was done, that is, A has been replaced by diag(r)*A*diag(c). The right-hand side B has been changed accordingly.
n INTEGER. The number of linear equations; the order of the matrix A; n ≥ 0.
nrhs INTEGER. The number of right-hand sides; the number of columns of the matrices B and X; nrhs ≥ 0.
a, af, b, work REAL for sgerfsx
DOUBLE PRECISION for dgerfsx
COMPLEX for cgerfsx
DOUBLE COMPLEX for zgerfsx.
Arrays: a(lda,*), af(ldaf,*), b(ldb,*), work(*).
The array a contains the original n-by-n matrix A.
The array af contains the factored form of the matrix A, that is, the factors L and U from the factorization A = P*L*U as computed by ?getrf.
The array b contains the matrix B whose columns are the right-hand sides for the systems of equations. The second dimension of b must be at least max(1, nrhs).
work(*) is a workspace array. The dimension of work must be at least max(1, 4*n) for real flavors, and at least max(1, 2*n) for complex flavors.
lda INTEGER. The leading dimension of a; lda ≥ max(1, n).
ldaf INTEGER. The leading dimension of af; ldaf ≥ max(1, n).
ipiv INTEGER. Array, DIMENSION at least max(1, n). Contains the pivot indices as computed by ?getrf; for 1 ≤ i ≤ n, row i of the matrix was interchanged with row ipiv(i).
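The pivot convention for ipiv ("row i was interchanged with row ipiv(i)") describes a sequence of swaps rather than a permutation vector, so the order in which the swaps are applied matters. A minimal sketch, assuming the 1-based row numbers of the Fortran interface are stored in a C array and applied for i = 1, ..., n in that order; apply_row_interchanges is a hypothetical helper, not an Intel MKL routine.

```c
#include <assert.h>

/* Apply the row interchanges recorded in ipiv to a vector b of length n.
   ipiv holds 1-based row numbers: for i = 1..n (in that order),
   element i of b is swapped with element ipiv[i-1]. */
void apply_row_interchanges(int n, const int *ipiv, double *b)
{
    assert(n >= 0);
    for (int i = 0; i < n; ++i) {
        int p = ipiv[i] - 1;          /* convert to a 0-based index */
        assert(p >= 0 && p < n);
        if (p != i) {
            double t = b[i];
            b[i] = b[p];
            b[p] = t;
        }
    }
}
```

For example, with ipiv = {3, 3, 3} the vector (10, 20, 30) becomes (30, 10, 20): the first swap exchanges entries 1 and 3, and the second exchanges entries 2 and 3 of the already modified vector.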
r, c REAL for single precision flavors.
DOUBLE PRECISION for double precision flavors.
Arrays: r(n), c(n). The array r contains the row scale factors for A, and the array c contains the column scale factors for A.
If equed = 'R' or 'B', A is multiplied on the left by diag(r); if equed = 'N' or 'C', r is not accessed. If equed = 'R' or 'B', each element of r must be positive.
If equed = 'C' or 'B', A is multiplied on the right by diag(c); if equed = 'N' or 'R', c is not accessed. If equed = 'C' or 'B', each element of c must be positive.
Each element of r or c should be a power of the radix to ensure a reliable solution and error estimates. Scaling by powers of the radix does not cause rounding errors unless the result underflows or overflows. Rounding errors during scaling lead to refining with a matrix that is not equivalent to the input matrix, producing error estimates that may not be reliable.
ldb INTEGER. The leading dimension of the array b; ldb ≥ max(1, n).
x REAL for sgerfsx
DOUBLE PRECISION for dgerfsx
COMPLEX for cgerfsx
DOUBLE COMPLEX for zgerfsx.
Array, DIMENSION (ldx,*). The solution matrix X as computed by ?getrs.
ldx INTEGER. The leading dimension of the output array x; ldx ≥ max(1, n).
n_err_bnds INTEGER. Number of error bounds to return for each right-hand side and each type (normwise or componentwise). See err_bnds_norm and err_bnds_comp descriptions in Output Arguments section below.
nparams INTEGER. Specifies the number of parameters set in params. If nparams ≤ 0, the params array is never referenced and default values are used.
params REAL for single precision flavors.
DOUBLE PRECISION for double precision flavors.
Array, DIMENSION nparams. Specifies algorithm parameters. If an entry is less than 0.0, that entry is filled with the default value used for that parameter. Only positions up to nparams are accessed; defaults are used for higher-numbered parameters.
If defaults are acceptable, you can pass nparams = 0, which prevents the source code from accessing the params argument.
params(la_linrx_itref_i = 1) : Whether to perform iterative refinement or not. Default: 1.0
=0.0 No refinement is performed and no error bounds are computed.
=1.0 Use the double-precision refinement algorithm, possibly with doubled-single computations if the compilation environment does not support DOUBLE PRECISION.
(Other values are reserved for future use.)
params(la_linrx_ithresh_i = 2) : Maximum number of residual computations allowed for refinement.
Default: 10.
Aggressive: set to 100 to permit convergence using approximate factorizations or factorizations other than LU. If the factorization uses a technique other than Gaussian elimination, the guarantees in err_bnds_norm and err_bnds_comp may no longer be trustworthy.
params(la_linrx_cwise_i = 3) : Flag determining if the code will attempt to find a solution with a small componentwise relative error in the double-precision algorithm. Positive is true, 0.0 is false. Default: 1.0 (attempt componentwise convergence).
iwork INTEGER. Workspace array, DIMENSION at least max(1, n); used in real flavors only.
rwork REAL for single precision flavors, DOUBLE PRECISION for double precision flavors.
Workspace array, DIMENSION at least max(1, 3*n); used in complex flavors only.
Output Parameters
x REAL for sgerfsx, DOUBLE PRECISION for dgerfsx, COMPLEX for cgerfsx, DOUBLE COMPLEX for zgerfsx.
The improved solution matrix X.
rcond REAL for single precision flavors, DOUBLE PRECISION for double precision flavors.
Reciprocal scaled condition number. An estimate of the reciprocal Skeel condition number of the matrix A after equilibration (if done). If rcond is less than the machine precision, in particular, if rcond = 0, the matrix is singular to working precision. Note that the error may still be small even if this number is very small and the matrix appears ill-conditioned.
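The default-handling rules stated for the params input array can be summarized in a few lines of C. This is only a sketch of the documented behavior, not the library's implementation: the function name effective_param and the enum constants are hypothetical, and the defaults (1.0, 10, 1.0) are the ones listed in the params description.

```c
#include <assert.h>

/* Positions in params, 1-based as in the parameter description. */
enum { LA_LINRX_ITREF_I = 1, LA_LINRX_ITHRESH_I = 2, LA_LINRX_CWISE_I = 3 };

/* Hypothetical helper: value the routine would use for parameter
 * `which`, following the documented rules:
 *   - positions beyond nparams are never accessed; the default is used;
 *   - an entry less than 0.0 is replaced by the default. */
double effective_param(int which, int nparams, const double *params)
{
    static const double defaults[3] = { 1.0, 10.0, 1.0 };
    assert(which >= 1 && which <= 3);
    if (which > nparams)
        return defaults[which - 1];     /* params is not read here */
    if (params[which - 1] < 0.0)
        return defaults[which - 1];
    return params[which - 1];
}
```

For example, passing nparams = 2 with params = (-1.0, 100.0) keeps refinement enabled by default, raises the residual-computation limit to 100, and leaves the componentwise flag at its default.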
berr REAL for single precision flavors, DOUBLE PRECISION for double precision flavors.
Array, DIMENSION at least max(1, nrhs). Contains the componentwise relative backward error for each solution vector x(j), that is, the smallest relative change in any element of A or B that makes x(j) an exact solution.
err_bnds_norm REAL for single precision flavors, DOUBLE PRECISION for double precision flavors.
Array, DIMENSION (nrhs,n_err_bnds). For each right-hand side, contains information about various error bounds and condition numbers corresponding to the normwise relative error, which is defined as follows:
Normwise relative error in the i-th solution vector: max_j |Xtrue(j,i) - X(j,i)| / max_j |X(j,i)|
The array is indexed by the type of error information as described below. There are currently up to three pieces of information returned.
The first index in err_bnds_norm(i,:) corresponds to the i-th right-hand side.
The second index in err_bnds_norm(:,err) contains the following three fields:
err=1 "Trust/don't trust" boolean. Trust the answer if the reciprocal condition number is greater than the threshold sqrt(n)*slamch(ε) for single precision flavors and sqrt(n)*dlamch(ε) for double precision flavors.
err=2 "Guaranteed" error bound. The estimated forward error, almost certainly within a factor of 10 of the true error so long as the next entry is greater than the threshold sqrt(n)*slamch(ε) for single precision flavors and sqrt(n)*dlamch(ε) for double precision flavors. This error bound should only be trusted if the previous boolean is true.
err=3 Reciprocal condition number. Estimated normwise reciprocal condition number. Compared with the threshold sqrt(n)*slamch(ε) for single precision flavors and sqrt(n)*dlamch(ε) for double precision flavors to determine if the error estimate is "guaranteed". These reciprocal condition numbers are 1/(norm(Z**(-1),inf)*norm(Z,inf)) for some appropriately scaled matrix Z.
Let Z = S*A, where S scales each row by a power of the radix so all absolute row sums of Z are approximately 1.
err_bnds_comp REAL for single precision flavors, DOUBLE PRECISION for double precision flavors.
Array, DIMENSION (nrhs,n_err_bnds). For each right-hand side, contains information about various error bounds and condition numbers corresponding to the componentwise relative error, which is defined as follows:
Componentwise relative error in the i-th solution vector: max_j ( |Xtrue(j,i) - X(j,i)| / |X(j,i)| )
The array is indexed by the right-hand side i, on which the componentwise relative error depends, and by the type of error information as described below. There are currently up to three pieces of information returned for each right-hand side. If componentwise accuracy is not requested (params(3) = 0.0), then err_bnds_comp is not accessed. If n_err_bnds < 3, then at most the first (:,n_err_bnds) entries are returned.
The first index in err_bnds_comp(i,:) corresponds to the i-th right-hand side.
The second index in err_bnds_comp(:,err) contains the following three fields:
err=1 "Trust/don't trust" boolean. Trust the answer if the reciprocal condition number is greater than the threshold sqrt(n)*slamch(ε) for single precision flavors and sqrt(n)*dlamch(ε) for double precision flavors.
err=2 "Guaranteed" error bound. The estimated forward error, almost certainly within a factor of 10 of the true error so long as the next entry is greater than the threshold sqrt(n)*slamch(ε) for single precision flavors and sqrt(n)*dlamch(ε) for double precision flavors. This error bound should only be trusted if the previous boolean is true.
err=3 Reciprocal condition number. Estimated componentwise reciprocal condition number. Compared with the threshold sqrt(n)*slamch(ε) for single precision flavors and sqrt(n)*dlamch(ε) for double precision flavors to determine if the error estimate is "guaranteed".
These reciprocal condition numbers are 1/(norm(Z**(-1),inf)*norm(Z,inf)) for some appropriately scaled matrix Z. Let Z = S*(A*diag(x)), where x is the solution for the current right-hand side and S scales each row of A*diag(x) by a power of the radix so all absolute row sums of Z are approximately 1.
params REAL for single precision flavors, DOUBLE PRECISION for double precision flavors.
Output parameter only if the input contains erroneous values, namely, in params(1), params(2), params(3). In such a case, the corresponding elements of params are filled with default values on output.
info INTEGER.
If info = 0, the execution is successful. The solution to every right-hand side is guaranteed.
If info = -i, the i-th parameter had an illegal value.
If 0 < info ≤ n: U(info,info) is exactly zero. The factorization has been completed, but the factor U is exactly singular, so the solution and error bounds could not be computed; rcond = 0 is returned.
If info = n+j: The solution corresponding to the j-th right-hand side is not guaranteed. The solutions corresponding to other right-hand sides k with k > j may not be guaranteed as well, but only the first such right-hand side is reported. If a small componentwise error is not requested (params(3) = 0.0), then the j-th right-hand side is the first with a normwise error bound that is not guaranteed (the smallest j such that err_bnds_norm(j,1) = 0.0). By default (params(3) = 1.0), this is the first right-hand side for which either bound is not guaranteed (the smallest j such that either err_bnds_norm(j,1) = 0.0 or err_bnds_comp(j,1) = 0.0). See the definition of err_bnds_norm(:,1) and err_bnds_comp(:,1). To get information about all of the right-hand sides, check err_bnds_norm or err_bnds_comp.
?gbrfs
Refines the solution of a system of linear equations with a general band matrix and estimates its error.
Syntax
Fortran 77:
call sgbrfs( trans, n, kl, ku, nrhs, ab, ldab, afb, ldafb, ipiv, b, ldb, x, ldx, ferr, berr, work, iwork, info )
call dgbrfs( trans, n, kl, ku, nrhs, ab, ldab, afb, ldafb, ipiv, b, ldb, x, ldx, ferr, berr, work, iwork, info )
call cgbrfs( trans, n, kl, ku, nrhs, ab, ldab, afb, ldafb, ipiv, b, ldb, x, ldx, ferr, berr, work, rwork, info )
call zgbrfs( trans, n, kl, ku, nrhs, ab, ldab, afb, ldafb, ipiv, b, ldb, x, ldx, ferr, berr, work, rwork, info )
Fortran 95:
call gbrfs( ab, afb, ipiv, b, x [,kl] [,trans] [,ferr] [,berr] [,info] )
C:
lapack_int LAPACKE_sgbrfs( int matrix_order, char trans, lapack_int n, lapack_int kl, lapack_int ku, lapack_int nrhs, const float* ab, lapack_int ldab, const float* afb, lapack_int ldafb, const lapack_int* ipiv, const float* b, lapack_int ldb, float* x, lapack_int ldx, float* ferr, float* berr );
lapack_int LAPACKE_dgbrfs( int matrix_order, char trans, lapack_int n, lapack_int kl, lapack_int ku, lapack_int nrhs, const double* ab, lapack_int ldab, const double* afb, lapack_int ldafb, const lapack_int* ipiv, const double* b, lapack_int ldb, double* x, lapack_int ldx, double* ferr, double* berr );
lapack_int LAPACKE_cgbrfs( int matrix_order, char trans, lapack_int n, lapack_int kl, lapack_int ku, lapack_int nrhs, const lapack_complex_float* ab, lapack_int ldab, const lapack_complex_float* afb, lapack_int ldafb, const lapack_int* ipiv, const lapack_complex_float* b, lapack_int ldb, lapack_complex_float* x, lapack_int ldx, float* ferr, float* berr );
lapack_int LAPACKE_zgbrfs( int matrix_order, char trans, lapack_int n, lapack_int kl, lapack_int ku, lapack_int nrhs, const lapack_complex_double* ab, lapack_int ldab, const lapack_complex_double* afb, lapack_int ldafb, const lapack_int* ipiv, const lapack_complex_double* b, lapack_int ldb, lapack_complex_double* x, lapack_int ldx, double* ferr, double* berr );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine performs an iterative refinement of the solution to a system of linear equations A*X = B or A^T*X = B or A^H*X = B with a band matrix A, with multiple right-hand sides. For each computed solution vector x, the routine computes the component-wise backward error β. This error is the smallest relative perturbation in elements of A and b such that x is the exact solution of the perturbed system:
|δa_ij| ≤ β|a_ij|, |δb_i| ≤ β|b_i| such that (A + δA)x = (b + δb).
Finally, the routine estimates the component-wise forward error in the computed solution ||x - xe||∞/||x||∞ (here xe is the exact solution).
Before calling this routine:
• call the factorization routine ?gbtrf
• call the solver routine ?gbtrs.
Input Parameters
The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
trans CHARACTER*1. Must be 'N' or 'T' or 'C'. Indicates the form of the equations:
If trans = 'N', the system has the form A*X = B.
If trans = 'T', the system has the form A^T*X = B.
If trans = 'C', the system has the form A^H*X = B.
n INTEGER. The order of the matrix A; n ≥ 0.
kl INTEGER. The number of sub-diagonals within the band of A; kl ≥ 0.
ku INTEGER. The number of super-diagonals within the band of A; ku ≥ 0.
nrhs INTEGER. The number of right-hand sides; nrhs ≥ 0.
ab, afb, b, x, work REAL for sgbrfs, DOUBLE PRECISION for dgbrfs, COMPLEX for cgbrfs, DOUBLE COMPLEX for zgbrfs.
Arrays:
ab(ldab,*) contains the original band matrix A, as supplied to ?gbtrf, but stored in rows from 1 to kl + ku + 1.
afb(ldafb,*) contains the factored band matrix A, as returned by ?gbtrf.
b(ldb,*) contains the right-hand side matrix B.
x(ldx,*) contains the solution matrix X.
work(*) is a workspace array.
The second dimension of ab and afb must be at least max(1, n); the second dimension of b and x must be at least max(1, nrhs); the dimension of work must be at least max(1, 3*n) for real flavors and max(1, 2*n) for complex flavors.
ldab INTEGER. The leading dimension of ab.
ldafb INTEGER. The leading dimension of afb.
ldb INTEGER. The leading dimension of b; ldb ≥ max(1, n).
ldx INTEGER. The leading dimension of x; ldx ≥ max(1, n).
ipiv INTEGER. Array, DIMENSION at least max(1, n). The ipiv array, as returned by ?gbtrf.
iwork INTEGER. Workspace array, DIMENSION at least max(1, n).
rwork REAL for cgbrfs, DOUBLE PRECISION for zgbrfs. Workspace array, DIMENSION at least max(1, n).
Output Parameters
x The refined solution matrix X.
ferr, berr REAL for single precision flavors, DOUBLE PRECISION for double precision flavors.
Arrays, DIMENSION at least max(1, nrhs). Contain the componentwise forward and backward errors, respectively, for each solution vector.
info INTEGER.
If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine gbrfs interface are as follows:
ab Holds the array A of size (kl+ku+1,n).
afb Holds the array AF of size (2*kl+ku+1,n).
ipiv Holds the vector of length n.
b Holds the matrix B of size (n, nrhs).
x Holds the matrix X of size (n, nrhs).
ferr Holds the vector of length (nrhs).
berr Holds the vector of length (nrhs).
trans Must be 'N', 'C', or 'T'. The default value is 'N'.
kl If omitted, assumed kl = ku.
ku Restored as ku = lda-kl-1.
Application Notes
The bounds returned in ferr are not rigorous, but in practice they almost always overestimate the actual error.
For each right-hand side, computation of the backward error involves a minimum of 4n(kl + ku) floating-point operations (for real flavors) or 16n(kl + ku) operations (for complex flavors). In addition, each step of iterative refinement involves 2n(4kl + 3ku) operations (for real flavors) or 8n(4kl + 3ku) operations (for complex flavors); the number of iterations may range from 1 to 5.
Estimating the forward error involves solving a number of systems of linear equations A*x = b; the number is usually 4 or 5 and never more than 11. Each solution requires approximately 2n² floating-point operations for real flavors or 8n² for complex flavors.
?gbrfsx
Uses extra precise iterative refinement to improve the solution to the system of linear equations with a banded matrix A and provides error bounds and backward error estimates.
Syntax
Fortran 77:
call sgbrfsx( trans, equed, n, kl, ku, nrhs, ab, ldab, afb, ldafb, ipiv, r, c, b, ldb, x, ldx, rcond, berr, n_err_bnds, err_bnds_norm, err_bnds_comp, nparams, params, work, iwork, info )
call dgbrfsx( trans, equed, n, kl, ku, nrhs, ab, ldab, afb, ldafb, ipiv, r, c, b, ldb, x, ldx, rcond, berr, n_err_bnds, err_bnds_norm, err_bnds_comp, nparams, params, work, iwork, info )
call cgbrfsx( trans, equed, n, kl, ku, nrhs, ab, ldab, afb, ldafb, ipiv, r, c, b, ldb, x, ldx, rcond, berr, n_err_bnds, err_bnds_norm, err_bnds_comp, nparams, params, work, rwork, info )
call zgbrfsx( trans, equed, n, kl, ku, nrhs, ab, ldab, afb, ldafb, ipiv, r, c, b, ldb, x, ldx, rcond, berr, n_err_bnds, err_bnds_norm, err_bnds_comp, nparams, params, work, rwork, info )
C:
lapack_int LAPACKE_sgbrfsx( int matrix_order, char trans, char equed, lapack_int n, lapack_int kl, lapack_int ku, lapack_int nrhs, const float* ab, lapack_int ldab, const float* afb, lapack_int ldafb, const lapack_int* ipiv, const float* r, const float* c, const float* b, lapack_int ldb, float* x, lapack_int ldx, float* rcond, float* berr,
lapack_int n_err_bnds, float* err_bnds_norm, float* err_bnds_comp, lapack_int nparams, float* params );
lapack_int LAPACKE_dgbrfsx( int matrix_order, char trans, char equed, lapack_int n, lapack_int kl, lapack_int ku, lapack_int nrhs, const double* ab, lapack_int ldab, const double* afb, lapack_int ldafb, const lapack_int* ipiv, const double* r, const double* c, const double* b, lapack_int ldb, double* x, lapack_int ldx, double* rcond, double* berr, lapack_int n_err_bnds, double* err_bnds_norm, double* err_bnds_comp, lapack_int nparams, double* params );
lapack_int LAPACKE_cgbrfsx( int matrix_order, char trans, char equed, lapack_int n, lapack_int kl, lapack_int ku, lapack_int nrhs, const lapack_complex_float* ab, lapack_int ldab, const lapack_complex_float* afb, lapack_int ldafb, const lapack_int* ipiv, const float* r, const float* c, const lapack_complex_float* b, lapack_int ldb, lapack_complex_float* x, lapack_int ldx, float* rcond, float* berr, lapack_int n_err_bnds, float* err_bnds_norm, float* err_bnds_comp, lapack_int nparams, float* params );
lapack_int LAPACKE_zgbrfsx( int matrix_order, char trans, char equed, lapack_int n, lapack_int kl, lapack_int ku, lapack_int nrhs, const lapack_complex_double* ab, lapack_int ldab, const lapack_complex_double* afb, lapack_int ldafb, const lapack_int* ipiv, const double* r, const double* c, const lapack_complex_double* b, lapack_int ldb, lapack_complex_double* x, lapack_int ldx, double* rcond, double* berr, lapack_int n_err_bnds, double* err_bnds_norm, double* err_bnds_comp, lapack_int nparams, double* params );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine improves the computed solution to a system of linear equations and provides error bounds and backward error estimates for the solution. In addition to a normwise error bound, the code provides a maximum componentwise error bound, if possible.
See comments for err_bnds_norm and err_bnds_comp for details of the error bounds.
The original system of linear equations may have been equilibrated before calling this routine, as described by the parameters equed, r, and c below. In this case, the solution and error bounds returned are for the original unequilibrated system.
Input Parameters
The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
trans CHARACTER*1. Must be 'N', 'T', or 'C'. Specifies the form of the system of equations:
If trans = 'N', the system has the form A*X = B (No transpose).
If trans = 'T', the system has the form A**T*X = B (Transpose).
If trans = 'C', the system has the form A**H*X = B (Conjugate transpose; for real flavors this is the same as Transpose).
equed CHARACTER*1. Must be 'N', 'R', 'C', or 'B'. Specifies the form of equilibration that was done to A before calling this routine.
If equed = 'N', no equilibration was done.
If equed = 'R', row equilibration was done, that is, A has been premultiplied by diag(r).
If equed = 'C', column equilibration was done, that is, A has been postmultiplied by diag(c).
If equed = 'B', both row and column equilibration were done, that is, A has been replaced by diag(r)*A*diag(c).
The right-hand side B has been changed accordingly.
n INTEGER. The number of linear equations; the order of the matrix A; n ≥ 0.
kl INTEGER. The number of subdiagonals within the band of A; kl ≥ 0.
ku INTEGER. The number of superdiagonals within the band of A; ku ≥ 0.
nrhs INTEGER. The number of right-hand sides; the number of columns of the matrices B and X; nrhs ≥ 0.
ab, afb, b, work REAL for sgbrfsx, DOUBLE PRECISION for dgbrfsx, COMPLEX for cgbrfsx, DOUBLE COMPLEX for zgbrfsx.
Arrays: ab(ldab,*), afb(ldafb,*), b(ldb,*), work(*).
The array ab contains the original matrix A in band storage, in rows 1 to kl+ku+1.
The j-th column of A is stored in the j-th column of the array ab as follows: ab(ku+1+i-j,j) = A(i,j) for max(1,j-ku) ≤ i ≤ min(n,j+kl).
The array afb contains details of the LU factorization of the banded matrix A as computed by ?gbtrf. U is stored as an upper triangular banded matrix with kl + ku superdiagonals in rows 1 to kl + ku + 1. The multipliers used during the factorization are stored in rows kl + ku + 2 to 2*kl + ku + 1.
The array b contains the matrix B whose columns are the right-hand sides for the systems of equations. The second dimension of b must be at least max(1,nrhs).
work(*) is a workspace array. The dimension of work must be at least max(1,4*n) for real flavors, and at least max(1,2*n) for complex flavors.
ldab INTEGER. The leading dimension of the array ab; ldab ≥ kl+ku+1.
ldafb INTEGER. The leading dimension of the array afb; ldafb ≥ 2*kl+ku+1.
ipiv INTEGER. Array, DIMENSION at least max(1, n). Contains the pivot indices as computed by ?gbtrf; for 1 ≤ i ≤ n, row i of the matrix was interchanged with row ipiv(i).
r, c REAL for single precision flavors, DOUBLE PRECISION for double precision flavors.
Arrays: r(n), c(n). The array r contains the row scale factors for A, and the array c contains the column scale factors for A.
If equed = 'R' or 'B', A is multiplied on the left by diag(r); if equed = 'N' or 'C', r is not accessed. If equed = 'R' or 'B', each element of r must be positive.
If equed = 'C' or 'B', A is multiplied on the right by diag(c); if equed = 'N' or 'R', c is not accessed. If equed = 'C' or 'B', each element of c must be positive.
Each element of r or c should be a power of the radix to ensure a reliable solution and error estimates. Scaling by powers of the radix does not cause rounding errors unless the result underflows or overflows.
Rounding errors during scaling lead to refining with a matrix that is not equivalent to the input matrix, producing error estimates that may not be reliable.
ldb INTEGER. The leading dimension of the array b; ldb ≥ max(1, n).
x REAL for sgbrfsx, DOUBLE PRECISION for dgbrfsx, COMPLEX for cgbrfsx, DOUBLE COMPLEX for zgbrfsx.
Array, DIMENSION (ldx,*). The solution matrix X as computed by sgbtrs/dgbtrs for real flavors or cgbtrs/zgbtrs for complex flavors.
ldx INTEGER. The leading dimension of the output array x; ldx ≥ max(1, n).
n_err_bnds INTEGER. Number of error bounds to return for each right-hand side and each type (normwise or componentwise). See the err_bnds_norm and err_bnds_comp descriptions in the Output Parameters section below.
nparams INTEGER. Specifies the number of parameters set in params. If nparams = 0, the params array is never referenced and default values are used.
params REAL for single precision flavors, DOUBLE PRECISION for double precision flavors.
Array, DIMENSION nparams. Specifies algorithm parameters. If an entry is less than 0.0, that entry is filled with the default value used for that parameter. Only positions up to nparams are accessed; defaults are used for higher-numbered parameters. If defaults are acceptable, you can pass nparams = 0, which prevents the source code from accessing the params argument.
params(la_linrx_itref_i = 1) : Whether to perform iterative refinement or not. Default: 1.0 (for single precision flavors), 1.0D+0 (for double precision flavors).
=0.0 No refinement is performed and no error bounds are computed.
=1.0 Use the double-precision refinement algorithm, possibly with doubled-single computations if the compilation environment does not support DOUBLE PRECISION.
(Other values are reserved for future use.)
params(la_linrx_ithresh_i = 2) : Maximum number of residual computations allowed for refinement.
Default: 10.
Aggressive: set to 100 to permit convergence using approximate factorizations or factorizations other than LU. If the factorization uses a technique other than Gaussian elimination, the guarantees in err_bnds_norm and err_bnds_comp may no longer be trustworthy.
params(la_linrx_cwise_i = 3) : Flag determining if the code will attempt to find a solution with a small componentwise relative error in the double-precision algorithm. Positive is true, 0.0 is false. Default: 1.0 (attempt componentwise convergence).
iwork INTEGER. Workspace array, DIMENSION at least max(1, n); used in real flavors only.
rwork REAL for single precision flavors, DOUBLE PRECISION for double precision flavors.
Workspace array, DIMENSION at least max(1, 3*n); used in complex flavors only.
Output Parameters
x REAL for sgbrfsx, DOUBLE PRECISION for dgbrfsx, COMPLEX for cgbrfsx, DOUBLE COMPLEX for zgbrfsx.
The improved solution matrix X.
rcond REAL for single precision flavors, DOUBLE PRECISION for double precision flavors.
Reciprocal scaled condition number. An estimate of the reciprocal Skeel condition number of the matrix A after equilibration (if done). If rcond is less than the machine precision, in particular, if rcond = 0, the matrix is singular to working precision. Note that the error may still be small even if this number is very small and the matrix appears ill-conditioned.
berr REAL for single precision flavors, DOUBLE PRECISION for double precision flavors.
Array, DIMENSION at least max(1, nrhs). Contains the componentwise relative backward error for each solution vector x(j), that is, the smallest relative change in any element of A or B that makes x(j) an exact solution.
err_bnds_norm REAL for single precision flavors, DOUBLE PRECISION for double precision flavors.
Array, DIMENSION (nrhs,n_err_bnds).
For each right-hand side, contains information about various error bounds and condition numbers corresponding to the normwise relative error, which is defined as follows:
Normwise relative error in the i-th solution vector: max_j |Xtrue(j,i) - X(j,i)| / max_j |X(j,i)|
The array is indexed by the type of error information as described below. There are currently up to three pieces of information returned.
The first index in err_bnds_norm(i,:) corresponds to the i-th right-hand side.
The second index in err_bnds_norm(:,err) contains the following three fields:
err=1 "Trust/don't trust" boolean. Trust the answer if the reciprocal condition number is greater than the threshold sqrt(n)*slamch(ε) for single precision flavors and sqrt(n)*dlamch(ε) for double precision flavors.
err=2 "Guaranteed" error bound. The estimated forward error, almost certainly within a factor of 10 of the true error so long as the next entry is greater than the threshold sqrt(n)*slamch(ε) for single precision flavors and sqrt(n)*dlamch(ε) for double precision flavors. This error bound should only be trusted if the previous boolean is true.
err=3 Reciprocal condition number. Estimated normwise reciprocal condition number. Compared with the threshold sqrt(n)*slamch(ε) for single precision flavors and sqrt(n)*dlamch(ε) for double precision flavors to determine if the error estimate is "guaranteed". These reciprocal condition numbers are 1/(norm(Z**(-1),inf)*norm(Z,inf)) for some appropriately scaled matrix Z. Let Z = S*A, where S scales each row by a power of the radix so all absolute row sums of Z are approximately 1.
err_bnds_comp REAL for single precision flavors, DOUBLE PRECISION for double precision flavors.
Array, DIMENSION (nrhs,n_err_bnds).
For each right-hand side, contains information about various error bounds and condition numbers corresponding to the componentwise relative error, which is defined as follows:
Componentwise relative error in the i-th solution vector: max_j ( |Xtrue(j,i) - X(j,i)| / |X(j,i)| )
The array is indexed by the right-hand side i, on which the componentwise relative error depends, and by the type of error information as described below. There are currently up to three pieces of information returned for each right-hand side. If componentwise accuracy is not requested (params(3) = 0.0), then err_bnds_comp is not accessed. If n_err_bnds < 3, then at most the first (:,n_err_bnds) entries are returned.
The first index in err_bnds_comp(i,:) corresponds to the i-th right-hand side.
The second index in err_bnds_comp(:,err) contains the following three fields:
err=1 "Trust/don't trust" boolean. Trust the answer if the reciprocal condition number is greater than the threshold sqrt(n)*slamch(ε) for single precision flavors and sqrt(n)*dlamch(ε) for double precision flavors.
err=2 "Guaranteed" error bound. The estimated forward error, almost certainly within a factor of 10 of the true error so long as the next entry is greater than the threshold sqrt(n)*slamch(ε) for single precision flavors and sqrt(n)*dlamch(ε) for double precision flavors. This error bound should only be trusted if the previous boolean is true.
err=3 Reciprocal condition number. Estimated componentwise reciprocal condition number. Compared with the threshold sqrt(n)*slamch(ε) for single precision flavors and sqrt(n)*dlamch(ε) for double precision flavors to determine if the error estimate is "guaranteed". These reciprocal condition numbers are 1/(norm(Z**(-1),inf)*norm(Z,inf)) for some appropriately scaled matrix Z. Let Z = S*(A*diag(x)), where x is the solution for the current right-hand side and S scales each row of A*diag(x) by a power of the radix so all absolute row sums of Z are approximately 1.
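To show how the three fields of err_bnds_norm or err_bnds_comp might be read, here is a short C sketch. The function names are hypothetical and not part of Intel MKL. The sketch assumes the array is laid out as the Fortran interface stores it, that is, column-major with dimensions (nrhs, n_err_bnds); a row-major C call may use a different layout, so check the C Interface Conventions before reusing the indexing.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical accessor: element (i, err), both 1-based, of an
 * nrhs-by-n_err_bnds array stored column-major (Fortran layout). */
double err_bnd_entry(const double *err_bnds, int nrhs, int i, int err)
{
    return err_bnds[(size_t)(err - 1) * (size_t)nrhs + (size_t)(i - 1)];
}

/* Hypothetical helper: if the "trust" field (err=1) for right-hand
 * side i is nonzero and the bound (err=2) was returned, store the
 * bound in *bound and return 1; otherwise return 0. */
int trusted_error_bound(const double *err_bnds, int nrhs, int n_err_bnds,
                        int i, double *bound)
{
    assert(i >= 1 && i <= nrhs);
    if (n_err_bnds < 2)
        return 0;                        /* err=2 was not requested */
    if (err_bnd_entry(err_bnds, nrhs, i, 1) == 0.0)
        return 0;                        /* bound is not guaranteed */
    *bound = err_bnd_entry(err_bnds, nrhs, i, 2);
    return 1;
}
```

Checking the err=1 field first follows the rule above that the err=2 bound should only be trusted when the boolean is true.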
params REAL for single precision flavors, DOUBLE PRECISION for double precision flavors.
Output parameter only if the input contains erroneous values, namely, in params(1), params(2), params(3). In such a case, the corresponding elements of params are filled with default values on output.
info INTEGER.
If info = 0, the execution is successful. The solution to every right-hand side is guaranteed.
If info = -i, the i-th parameter had an illegal value.
If 0 < info ≤ n: U(info,info) is exactly zero. The factorization has been completed, but the factor U is exactly singular, so the solution and error bounds could not be computed; rcond = 0 is returned.
If info = n+j: The solution corresponding to the j-th right-hand side is not guaranteed. The solutions corresponding to other right-hand sides k with k > j may not be guaranteed as well, but only the first such right-hand side is reported. If a small componentwise error is not requested (params(3) = 0.0), then the j-th right-hand side is the first with a normwise error bound that is not guaranteed (the smallest j such that err_bnds_norm(j,1) = 0.0). By default (params(3) = 1.0), this is the first right-hand side for which either bound is not guaranteed (the smallest j such that either err_bnds_norm(j,1) = 0.0 or err_bnds_comp(j,1) = 0.0). See the definition of err_bnds_norm(:,1) and err_bnds_comp(:,1). To get information about all of the right-hand sides, check err_bnds_norm or err_bnds_comp.
?gtrfs
Refines the solution of a system of linear equations with a tridiagonal matrix and estimates its error.
Syntax Fortran 77: call sgtrfs( trans, n, nrhs, dl, d, du, dlf, df, duf, du2, ipiv, b, ldb, x, ldx, ferr, berr, work, iwork, info ) call dgtrfs( trans, n, nrhs, dl, d, du, dlf, df, duf, du2, ipiv, b, ldb, x, ldx, ferr, berr, work, iwork, info ) call cgtrfs( trans, n, nrhs, dl, d, du, dlf, df, duf, du2, ipiv, b, ldb, x, ldx, ferr, berr, work, rwork, info ) call zgtrfs( trans, n, nrhs, dl, d, du, dlf, df, duf, du2, ipiv, b, ldb, x, ldx, ferr, berr, work, rwork, info ) Fortran 95: call gtrfs( dl, d, du, dlf, df, duf, du2, ipiv, b, x [,trans] [,ferr] [,berr] [,info] ) C: lapack_int LAPACKE_sgtrfs( int matrix_order, char trans, lapack_int n, lapack_int nrhs, const float* dl, const float* d, const float* du, const float* dlf, const float* df, const float* duf, const float* du2, const lapack_int* ipiv, const float* b, lapack_int ldb, float* x, lapack_int ldx, float* ferr, float* berr ); lapack_int LAPACKE_dgtrfs( int matrix_order, char trans, lapack_int n, lapack_int nrhs, const double* dl, const double* d, const double* du, const double* dlf, const double* df, const double* duf, const double* du2, const lapack_int* ipiv, const double* b, lapack_int ldb, double* x, lapack_int ldx, double* ferr, double* berr ); lapack_int LAPACKE_cgtrfs( int matrix_order, char trans, lapack_int n, lapack_int nrhs, const lapack_complex_float* dl, const lapack_complex_float* d, const lapack_complex_float* du, const lapack_complex_float* dlf, const lapack_complex_float* df, const lapack_complex_float* duf, const lapack_complex_float* du2, const lapack_int* ipiv, const lapack_complex_float* b, lapack_int ldb, lapack_complex_float* x, lapack_int ldx, float* ferr, float* berr ); lapack_int LAPACKE_zgtrfs( int matrix_order, char trans, lapack_int n, lapack_int nrhs, const lapack_complex_double* dl, const lapack_complex_double* d, const lapack_complex_double* du, const lapack_complex_double* dlf, const lapack_complex_double* df, const lapack_complex_double* duf, const lapack_complex_double* du2, 
const lapack_int* ipiv, const lapack_complex_double* b, lapack_int ldb, lapack_complex_double* x, lapack_int ldx, double* ferr, double* berr );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine performs an iterative refinement of the solution to a system of linear equations A*X = B or A^T*X = B or A^H*X = B with a tridiagonal matrix A, with multiple right-hand sides. For each computed solution vector x, the routine computes the component-wise backward error β. This error is the smallest relative perturbation in elements of A and b such that x is the exact solution of the perturbed system:
$|\delta a_{ij}| \le \beta |a_{ij}|$, $|\delta b_i| \le \beta |b_i|$ such that $(A + \delta A)x = (b + \delta b)$.
Finally, the routine estimates the component-wise forward error in the computed solution $\|x - x_e\|_\infty / \|x\|_\infty$ (here $x_e$ is the exact solution).
Before calling this routine:
• call the factorization routine ?gttrf
• call the solver routine ?gttrs.

Input Parameters
The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
trans CHARACTER*1. Must be 'N' or 'T' or 'C'. Indicates the form of the equations: If trans = 'N', the system has the form A*X = B. If trans = 'T', the system has the form A^T*X = B. If trans = 'C', the system has the form A^H*X = B.
n INTEGER. The order of the matrix A; n ≥ 0.
nrhs INTEGER. The number of right-hand sides, that is, the number of columns of the matrix B; nrhs ≥ 0.
dl, d, du, dlf, df, duf, du2, b, x, work REAL for sgtrfs DOUBLE PRECISION for dgtrfs COMPLEX for cgtrfs DOUBLE COMPLEX for zgtrfs. Arrays:
dl, dimension (n - 1), contains the subdiagonal elements of A.
d, dimension (n), contains the diagonal elements of A.
du, dimension (n - 1), contains the superdiagonal elements of A.
dlf, dimension (n - 1), contains the (n - 1) multipliers that define the matrix L from the LU factorization of A as computed by ?gttrf.
df, dimension (n), contains the n diagonal elements of the upper triangular matrix U from the LU factorization of A.
duf, dimension (n - 1), contains the (n - 1) elements of the first superdiagonal of U.
du2, dimension (n - 2), contains the (n - 2) elements of the second superdiagonal of U.
b(ldb,nrhs) contains the right-hand side matrix B.
x(ldx,nrhs) contains the solution matrix X, as computed by ?gttrs.
work(*) is a workspace array; the dimension of work must be at least max(1, 3*n) for real flavors and max(1, 2*n) for complex flavors.
ldb INTEGER. The leading dimension of b; ldb ≥ max(1, n).
ldx INTEGER. The leading dimension of x; ldx ≥ max(1, n).
ipiv INTEGER. Array, DIMENSION at least max(1, n). The ipiv array, as returned by ?gttrf.
iwork INTEGER. Workspace array, DIMENSION (n). Used for real flavors only.
rwork REAL for cgtrfs DOUBLE PRECISION for zgtrfs. Workspace array, DIMENSION (n). Used for complex flavors only.

Output Parameters
x The refined solution matrix X.
ferr, berr REAL for single precision flavors DOUBLE PRECISION for double precision flavors. Arrays, DIMENSION at least max(1, nrhs). Contain the componentwise forward and backward errors, respectively, for each solution vector.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions. Specific details for the routine gtrfs interface are as follows:
dl Holds the vector of length (n-1).
d Holds the vector of length n.
du Holds the vector of length (n-1).
dlf Holds the vector of length (n-1).
df Holds the vector of length n.
duf Holds the vector of length (n-1).
du2 Holds the vector of length (n-2).
ipiv Holds the vector of length n.
b Holds the matrix B of size (n,nrhs).
x Holds the matrix X of size (n,nrhs).
ferr Holds the vector of length (nrhs).
berr Holds the vector of length (nrhs).
trans Must be 'N', 'C', or 'T'. The default value is 'N'.

?porfs
Refines the solution of a system of linear equations with a symmetric (Hermitian) positive-definite matrix and estimates its error.

Syntax
Fortran 77:
call sporfs( uplo, n, nrhs, a, lda, af, ldaf, b, ldb, x, ldx, ferr, berr, work, iwork, info )
call dporfs( uplo, n, nrhs, a, lda, af, ldaf, b, ldb, x, ldx, ferr, berr, work, iwork, info )
call cporfs( uplo, n, nrhs, a, lda, af, ldaf, b, ldb, x, ldx, ferr, berr, work, rwork, info )
call zporfs( uplo, n, nrhs, a, lda, af, ldaf, b, ldb, x, ldx, ferr, berr, work, rwork, info )
Fortran 95:
call porfs( a, af, b, x [,uplo] [,ferr] [,berr] [,info] )
C:
lapack_int LAPACKE_sporfs( int matrix_order, char uplo, lapack_int n, lapack_int nrhs, const float* a, lapack_int lda, const float* af, lapack_int ldaf, const float* b, lapack_int ldb, float* x, lapack_int ldx, float* ferr, float* berr );
lapack_int LAPACKE_dporfs( int matrix_order, char uplo, lapack_int n, lapack_int nrhs, const double* a, lapack_int lda, const double* af, lapack_int ldaf, const double* b, lapack_int ldb, double* x, lapack_int ldx, double* ferr, double* berr );
lapack_int LAPACKE_cporfs( int matrix_order, char uplo, lapack_int n, lapack_int nrhs, const lapack_complex_float* a, lapack_int lda, const lapack_complex_float* af, lapack_int ldaf, const lapack_complex_float* b, lapack_int ldb, lapack_complex_float* x, lapack_int ldx, float* ferr, float* berr );
lapack_int LAPACKE_zporfs( int matrix_order, char uplo, lapack_int n, lapack_int nrhs, const lapack_complex_double* a, lapack_int lda, const lapack_complex_double* af, lapack_int ldaf, const
lapack_complex_double* b, lapack_int ldb, lapack_complex_double* x, lapack_int ldx, double* ferr, double* berr );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine performs an iterative refinement of the solution to a system of linear equations A*X = B with a symmetric (Hermitian) positive definite matrix A, with multiple right-hand sides. For each computed solution vector x, the routine computes the component-wise backward error β. This error is the smallest relative perturbation in elements of A and b such that x is the exact solution of the perturbed system:
$|\delta a_{ij}| \le \beta |a_{ij}|$, $|\delta b_i| \le \beta |b_i|$ such that $(A + \delta A)x = (b + \delta b)$.
Finally, the routine estimates the component-wise forward error in the computed solution $\|x - x_e\|_\infty / \|x\|_\infty$ (here $x_e$ is the exact solution).
Before calling this routine:
• call the factorization routine ?potrf
• call the solver routine ?potrs.

Input Parameters
The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'. If uplo = 'U', the upper triangle of A is stored. If uplo = 'L', the lower triangle of A is stored.
n INTEGER. The order of the matrix A; n ≥ 0.
nrhs INTEGER. The number of right-hand sides; nrhs ≥ 0.
a, af, b, x, work REAL for sporfs DOUBLE PRECISION for dporfs COMPLEX for cporfs DOUBLE COMPLEX for zporfs. Arrays:
a(lda,*) contains the original matrix A, as supplied to ?potrf.
af(ldaf,*) contains the factored matrix A, as returned by ?potrf.
b(ldb,*) contains the right-hand side matrix B.
x(ldx,*) contains the solution matrix X.
work(*) is a workspace array.
The second dimension of a and af must be at least max(1, n); the second dimension of b and x must be at least max(1, nrhs); the dimension of work must be at least max(1, 3*n) for real flavors and max(1, 2*n) for complex flavors.
lda INTEGER. The leading dimension of a; lda ≥ max(1, n).
ldaf INTEGER. The leading dimension of af; ldaf ≥ max(1, n).
ldb INTEGER. The leading dimension of b; ldb ≥ max(1, n).
ldx INTEGER. The leading dimension of x; ldx ≥ max(1, n).
iwork INTEGER. Workspace array, DIMENSION at least max(1, n).
rwork REAL for cporfs DOUBLE PRECISION for zporfs. Workspace array, DIMENSION at least max(1, n).

Output Parameters
x The refined solution matrix X.
ferr, berr REAL for single precision flavors DOUBLE PRECISION for double precision flavors. Arrays, DIMENSION at least max(1, nrhs). Contain the componentwise forward and backward errors, respectively, for each solution vector.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions. Specific details for the routine porfs interface are as follows:
a Holds the matrix A of size (n,n).
af Holds the matrix AF of size (n,n).
b Holds the matrix B of size (n,nrhs).
x Holds the matrix X of size (n,nrhs).
ferr Holds the vector of length (nrhs).
berr Holds the vector of length (nrhs).
uplo Must be 'U' or 'L'. The default value is 'U'.

Application Notes
The bounds returned in ferr are not rigorous, but in practice they almost always overestimate the actual error. For each right-hand side, computation of the backward error involves a minimum of $4n^2$ floating-point operations (for real flavors) or $16n^2$ operations (for complex flavors).
In addition, each step of iterative refinement involves $6n^2$ operations (for real flavors) or $24n^2$ operations (for complex flavors); the number of iterations may range from 1 to 5. Estimating the forward error involves solving a number of systems of linear equations A*x = b; the number is usually 4 or 5 and never more than 11. Each solution requires approximately $2n^2$ floating-point operations for real flavors or $8n^2$ for complex flavors.

?porfsx
Uses extra precise iterative refinement to improve the solution to the system of linear equations with a symmetric/Hermitian positive-definite matrix A and provides error bounds and backward error estimates.

Syntax
Fortran 77:
call sporfsx( uplo, equed, n, nrhs, a, lda, af, ldaf, s, b, ldb, x, ldx, rcond, berr, n_err_bnds, err_bnds_norm, err_bnds_comp, nparams, params, work, iwork, info )
call dporfsx( uplo, equed, n, nrhs, a, lda, af, ldaf, s, b, ldb, x, ldx, rcond, berr, n_err_bnds, err_bnds_norm, err_bnds_comp, nparams, params, work, iwork, info )
call cporfsx( uplo, equed, n, nrhs, a, lda, af, ldaf, s, b, ldb, x, ldx, rcond, berr, n_err_bnds, err_bnds_norm, err_bnds_comp, nparams, params, work, rwork, info )
call zporfsx( uplo, equed, n, nrhs, a, lda, af, ldaf, s, b, ldb, x, ldx, rcond, berr, n_err_bnds, err_bnds_norm, err_bnds_comp, nparams, params, work, rwork, info )
C:
lapack_int LAPACKE_sporfsx( int matrix_order, char uplo, char equed, lapack_int n, lapack_int nrhs, const float* a, lapack_int lda, const float* af, lapack_int ldaf, const float* s, const float* b, lapack_int ldb, float* x, lapack_int ldx, float* rcond, float* berr, lapack_int n_err_bnds, float* err_bnds_norm, float* err_bnds_comp, lapack_int nparams, float* params );
lapack_int LAPACKE_dporfsx( int matrix_order, char uplo, char equed, lapack_int n, lapack_int nrhs, const double* a, lapack_int lda, const double* af, lapack_int ldaf, const double* s, const double* b, lapack_int ldb, double* x, lapack_int ldx, double* rcond, double* berr, lapack_int
n_err_bnds, double* err_bnds_norm, double* err_bnds_comp, lapack_int nparams, double* params );
lapack_int LAPACKE_cporfsx( int matrix_order, char uplo, char equed, lapack_int n, lapack_int nrhs, const lapack_complex_float* a, lapack_int lda, const lapack_complex_float* af, lapack_int ldaf, const float* s, const lapack_complex_float* b, lapack_int ldb, lapack_complex_float* x, lapack_int ldx, float* rcond, float* berr, lapack_int n_err_bnds, float* err_bnds_norm, float* err_bnds_comp, lapack_int nparams, float* params );
lapack_int LAPACKE_zporfsx( int matrix_order, char uplo, char equed, lapack_int n, lapack_int nrhs, const lapack_complex_double* a, lapack_int lda, const lapack_complex_double* af, lapack_int ldaf, const double* s, const lapack_complex_double* b, lapack_int ldb, lapack_complex_double* x, lapack_int ldx, double* rcond, double* berr, lapack_int n_err_bnds, double* err_bnds_norm, double* err_bnds_comp, lapack_int nparams, double* params );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine improves the computed solution to a system of linear equations and provides error bounds and backward error estimates for the solution. In addition to a normwise error bound, the code provides a maximum componentwise error bound, if possible. See comments for err_bnds_norm and err_bnds_comp for details of the error bounds.
The original system of linear equations may have been equilibrated before calling this routine, as described by the parameters equed and s below. In this case, the solution and error bounds returned are for the original unequilibrated system.

Input Parameters
The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
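As a rough illustration of the equilibration mentioned in the Description, the sketch below applies diag(s)*A*diag(s) in place to a small matrix and checks that each scale factor is a power of the radix. The helper names is_power_of_two and scale_symmetric are hypothetical and are not part of Intel MKL or LAPACKE; the code assumes radix 2, real double-precision data, and full column-major storage.

```c
#include <assert.h>
#include <float.h>
#include <stddef.h>

/* Hypothetical helper, not an MKL routine: returns 1 if v is a finite
   positive power of two (radix 2 is assumed), otherwise 0. */
int is_power_of_two(double v)
{
    if (!(v > 0.0) || v > DBL_MAX)
        return 0;
    while (v >= 2.0) v *= 0.5;   /* exact: only the exponent changes */
    while (v < 1.0)  v *= 2.0;
    return v == 1.0;
}

/* Hypothetical helper: overwrites the n-by-n column-major array a
   (leading dimension lda) with diag(s)*A*diag(s). */
void scale_symmetric(size_t n, double *a, size_t lda, const double *s)
{
    assert(lda >= (n > 0 ? n : 1));
    for (size_t j = 0; j < n; ++j)
        for (size_t i = 0; i < n; ++i)
            a[i + j * lda] *= s[i] * s[j];
}
```

With power-of-two factors the scaled entries are exact as long as nothing overflows or underflows, which is why such factors are preferred when preparing input for the refinement routine.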
uplo CHARACTER*1. Must be 'U' or 'L'. Indicates whether the upper or lower triangular part of A is stored: If uplo = 'U', the upper triangle of A is stored. If uplo = 'L', the lower triangle of A is stored.
equed CHARACTER*1. Must be 'N' or 'Y'. Specifies the form of equilibration that was done to A before calling this routine. If equed = 'N', no equilibration was done. If equed = 'Y', both row and column equilibration was done, that is, A has been replaced by diag(s)*A*diag(s). The right-hand side B has been changed accordingly.
n INTEGER. The number of linear equations; the order of the matrix A; n ≥ 0.
nrhs INTEGER. The number of right-hand sides; the number of columns of the matrices B and X; nrhs ≥ 0.
a, af, b, work REAL for sporfsx DOUBLE PRECISION for dporfsx COMPLEX for cporfsx DOUBLE COMPLEX for zporfsx. Arrays: a(lda,*), af(ldaf,*), b(ldb,*), work(*).
The array a contains the symmetric/Hermitian matrix A as specified by uplo. If uplo = 'U', the leading n-by-n upper triangular part of a contains the upper triangular part of the matrix A and the strictly lower triangular part of a is not referenced. If uplo = 'L', the leading n-by-n lower triangular part of a contains the lower triangular part of the matrix A and the strictly upper triangular part of a is not referenced. The second dimension of a must be at least max(1,n).
The array af contains the triangular factor L or U from the Cholesky factorization A = U**T*U or A = L*L**T as computed by ?potrf.
The array b contains the matrix B whose columns are the right-hand sides for the systems of equations. The second dimension of b must be at least max(1,nrhs).
work(*) is a workspace array. The dimension of work must be at least max(1,4*n) for real flavors, and at least max(1,2*n) for complex flavors.
lda INTEGER. The leading dimension of a; lda ≥ max(1, n).
ldaf INTEGER. The leading dimension of af; ldaf ≥ max(1, n).
s REAL for single precision flavors DOUBLE PRECISION for double precision flavors. Array, DIMENSION (n). The array s contains the scale factors for A. If equed = 'N', s is not accessed. If equed = 'Y', each element of s must be positive. Each element of s should be a power of the radix to ensure a reliable solution and error estimates. Scaling by powers of the radix does not cause rounding errors unless the result underflows or overflows. Rounding errors during scaling lead to refining with a matrix that is not equivalent to the input matrix, producing error estimates that may not be reliable.
ldb INTEGER. The leading dimension of the array b; ldb ≥ max(1, n).
x REAL for sporfsx DOUBLE PRECISION for dporfsx COMPLEX for cporfsx DOUBLE COMPLEX for zporfsx. Array, DIMENSION (ldx,*). The solution matrix X as computed by ?potrs.
ldx INTEGER. The leading dimension of the output array x; ldx ≥ max(1, n).
n_err_bnds INTEGER. Number of error bounds to return for each right-hand side and each type (normwise or componentwise). See err_bnds_norm and err_bnds_comp descriptions in Output Arguments section below.
nparams INTEGER. Specifies the number of parameters set in params. If nparams ≤ 0, the params array is never referenced and default values are used.
params REAL for single precision flavors DOUBLE PRECISION for double precision flavors. Array, DIMENSION nparams. Specifies algorithm parameters. If an entry is less than 0.0, that entry will be filled with the default value used for that parameter. Only positions up to nparams are accessed; defaults are used for higher-numbered parameters. If defaults are acceptable, you can pass nparams = 0, which prevents the source code from accessing the params argument.
params(la_linrx_itref_i = 1) : Whether to perform iterative refinement or not. Default: 1.0 (for single precision flavors), 1.0D+0 (for double precision flavors).
=0.0 No refinement is performed and no error bounds are computed.
=1.0 Use the double-precision refinement algorithm, possibly with doubled-single computations if the compilation environment does not support DOUBLE PRECISION.
(Other values are reserved for future use.)
params(la_linrx_ithresh_i = 2) : Maximum number of residual computations allowed for refinement.
Default: 10.
Aggressive: Set to 100 to permit convergence using approximate factorizations or factorizations other than LU. If the factorization uses a technique other than Gaussian elimination, the guarantees in err_bnds_norm and err_bnds_comp may no longer be trustworthy.
params(la_linrx_cwise_i = 3) : Flag determining if the code will attempt to find a solution with a small componentwise relative error in the double-precision algorithm. Positive is true, 0.0 is false. Default: 1.0 (attempt componentwise convergence).
iwork INTEGER. Workspace array, DIMENSION at least max(1, n); used in real flavors only.
rwork REAL for single precision flavors DOUBLE PRECISION for double precision flavors. Workspace array, DIMENSION at least max(1, 3*n); used in complex flavors only.

Output Parameters
x REAL for sporfsx DOUBLE PRECISION for dporfsx COMPLEX for cporfsx DOUBLE COMPLEX for zporfsx. The improved solution matrix X.
rcond REAL for single precision flavors DOUBLE PRECISION for double precision flavors. Reciprocal scaled condition number. An estimate of the reciprocal Skeel condition number of the matrix A after equilibration (if done). If rcond is less than the machine precision, in particular, if rcond = 0, the matrix is singular to working precision. Note that the error may still be small even if this number is very small and the matrix appears ill-conditioned.
berr REAL for single precision flavors DOUBLE PRECISION for double precision flavors. Array, DIMENSION at least max(1, nrhs).
Contains the componentwise relative backward error for each solution vector x(j), that is, the smallest relative change in any element of A or B that makes x(j) an exact solution.
err_bnds_norm REAL for single precision flavors DOUBLE PRECISION for double precision flavors. Array, DIMENSION (nrhs,n_err_bnds). For each right-hand side, contains information about various error bounds and condition numbers corresponding to the normwise relative error, which is defined as follows:
Normwise relative error in the i-th solution vector: $\max_j |X_{\mathrm{true}}(j,i) - X(j,i)| \,/\, \max_j |X(j,i)|$
The array is indexed by the type of error information as described below. There are currently up to three pieces of information returned.
The first index in err_bnds_norm(i,:) corresponds to the i-th right-hand side. The second index in err_bnds_norm(:,err) contains the following three fields:
err=1 "Trust/don't trust" boolean. Trust the answer if the reciprocal condition number is greater than the threshold sqrt(n)*slamch(ε) for single precision flavors and sqrt(n)*dlamch(ε) for double precision flavors.
err=2 "Guaranteed" error bound. The estimated forward error, almost certainly within a factor of 10 of the true error so long as the next entry is greater than the threshold sqrt(n)*slamch(ε) for single precision flavors and sqrt(n)*dlamch(ε) for double precision flavors. This error bound should only be trusted if the previous boolean is true.
err=3 Reciprocal condition number. Estimated normwise reciprocal condition number. Compared with the threshold sqrt(n)*slamch(ε) for single precision flavors and sqrt(n)*dlamch(ε) for double precision flavors to determine if the error estimate is "guaranteed". These reciprocal condition numbers are 1/(norm(z^{-1},inf)*norm(z,inf)) for some appropriately scaled matrix Z. Let z=s*a, where s scales each row by a power of the radix so all absolute row sums of z are approximately 1.
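The three fields of err_bnds_norm described above can be read back as in the following sketch. It is only an illustration: the type err_bnd_fields and both helper functions are hypothetical names, the array is assumed to be stored column by column as in the Fortran interface, and DBL_EPSILON/2 is used as a stand-in for dlamch(ε). The threshold check treats a bound as guaranteed when the reciprocal condition number is at least sqrt(n) times that value, following the err=2 and err=3 descriptions.

```c
#include <assert.h>
#include <float.h>

/* Assumed layout: err_bnds is an nrhs-by-n_err_bnds array stored
   column by column, as in the Fortran interface. */
typedef struct {
    int    trusted;  /* err = 1: nonzero if the bound can be trusted */
    double bound;    /* err = 2: estimated forward error bound       */
    double rcond;    /* err = 3: reciprocal condition number         */
} err_bnd_fields;

/* Hypothetical helper: unpacks the three fields for right-hand side i
   (0-based). Requires n_err_bnds >= 3. */
err_bnd_fields unpack_err_bnds(const double *err_bnds, int nrhs,
                               int n_err_bnds, int i)
{
    assert(n_err_bnds >= 3 && i >= 0 && i < nrhs);
    err_bnd_fields f;
    f.trusted = err_bnds[i + 0 * nrhs] != 0.0;
    f.bound   = err_bnds[i + 1 * nrhs];
    f.rcond   = err_bnds[i + 2 * nrhs];
    return f;
}

/* Returns 1 if rcond >= sqrt(n)*eps. Both sides are nonnegative, so the
   squares are compared instead, which avoids calling sqrt. Using
   DBL_EPSILON/2 for eps is an assumption about dlamch. */
int rcond_meets_threshold(double rcond, int n)
{
    const double eps = 0.5 * DBL_EPSILON;
    return rcond >= 0.0 && rcond * rcond >= (double)n * eps * eps;
}
```

A caller would typically inspect the trusted field first and only then report bound as an error estimate.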
err_bnds_comp REAL for single precision flavors DOUBLE PRECISION for double precision flavors. Array, DIMENSION (nrhs,n_err_bnds). For each right-hand side, contains information about various error bounds and condition numbers corresponding to the componentwise relative error, which is defined as follows:
Componentwise relative error in the i-th solution vector: $\max_j \, |X_{\mathrm{true}}(j,i) - X(j,i)| \,/\, |X(j,i)|$
The array is indexed by the right-hand side i, on which the componentwise relative error depends, and by the type of error information as described below. There are currently up to three pieces of information returned for each right-hand side. If componentwise accuracy is not requested (params(3) = 0.0), then err_bnds_comp is not accessed. If n_err_bnds < 3, then at most the first (:,n_err_bnds) entries are returned.
The first index in err_bnds_comp(i,:) corresponds to the i-th right-hand side. The second index in err_bnds_comp(:,err) contains the following three fields:
err=1 "Trust/don't trust" boolean. Trust the answer if the reciprocal condition number is greater than the threshold sqrt(n)*slamch(ε) for single precision flavors and sqrt(n)*dlamch(ε) for double precision flavors.
err=2 "Guaranteed" error bound. The estimated forward error, almost certainly within a factor of 10 of the true error so long as the next entry is greater than the threshold sqrt(n)*slamch(ε) for single precision flavors and sqrt(n)*dlamch(ε) for double precision flavors. This error bound should only be trusted if the previous boolean is true.
err=3 Reciprocal condition number. Estimated componentwise reciprocal condition number. Compared with the threshold sqrt(n)*slamch(ε) for single precision flavors and sqrt(n)*dlamch(ε) for double precision flavors to determine if the error estimate is "guaranteed". These reciprocal condition numbers are 1/(norm(z^{-1},inf)*norm(z,inf)) for some appropriately scaled matrix Z.
Let z=s*(a*diag(x)), where x is the solution for the current right-hand side and s scales each row of a*diag(x) by a power of the radix so all absolute row sums of z are approximately 1.
params REAL for single precision flavors DOUBLE PRECISION for double precision flavors. Output parameter only if the input contains erroneous values, namely, in params(1), params(2), params(3). In such a case, the corresponding elements of params are filled with default values on output.
info INTEGER. If info = 0, the execution is successful. The solution to every right-hand side is guaranteed.
If info = -i, the i-th parameter had an illegal value.
If 0 < info ≤ n: U(info,info) is exactly zero. The factorization has been completed, but the factor U is exactly singular, so the solution and error bounds could not be computed; rcond = 0 is returned.
If info = n+j: The solution corresponding to the j-th right-hand side is not guaranteed. The solutions corresponding to other right-hand sides k with k > j may not be guaranteed as well, but only the first such right-hand side is reported. If a small componentwise error is not requested (params(3) = 0.0), then the j-th right-hand side is the first with a normwise error bound that is not guaranteed (the smallest j such that err_bnds_norm(j,1) = 0.0 or err_bnds_comp(j,1) = 0.0). See the definition of err_bnds_norm(:,1) and err_bnds_comp(:,1). To get information about all of the right-hand sides, check err_bnds_norm or err_bnds_comp.

?pprfs
Refines the solution of a system of linear equations with a packed symmetric (Hermitian) positive-definite matrix and estimates its error.
Syntax
Fortran 77:
call spprfs( uplo, n, nrhs, ap, afp, b, ldb, x, ldx, ferr, berr, work, iwork, info )
call dpprfs( uplo, n, nrhs, ap, afp, b, ldb, x, ldx, ferr, berr, work, iwork, info )
call cpprfs( uplo, n, nrhs, ap, afp, b, ldb, x, ldx, ferr, berr, work, rwork, info )
call zpprfs( uplo, n, nrhs, ap, afp, b, ldb, x, ldx, ferr, berr, work, rwork, info )
Fortran 95:
call pprfs( ap, afp, b, x [,uplo] [,ferr] [,berr] [,info] )
C:
lapack_int LAPACKE_spprfs( int matrix_order, char uplo, lapack_int n, lapack_int nrhs, const float* ap, const float* afp, const float* b, lapack_int ldb, float* x, lapack_int ldx, float* ferr, float* berr );
lapack_int LAPACKE_dpprfs( int matrix_order, char uplo, lapack_int n, lapack_int nrhs, const double* ap, const double* afp, const double* b, lapack_int ldb, double* x, lapack_int ldx, double* ferr, double* berr );
lapack_int LAPACKE_cpprfs( int matrix_order, char uplo, lapack_int n, lapack_int nrhs, const lapack_complex_float* ap, const lapack_complex_float* afp, const lapack_complex_float* b, lapack_int ldb, lapack_complex_float* x, lapack_int ldx, float* ferr, float* berr );
lapack_int LAPACKE_zpprfs( int matrix_order, char uplo, lapack_int n, lapack_int nrhs, const lapack_complex_double* ap, const lapack_complex_double* afp, const lapack_complex_double* b, lapack_int ldb, lapack_complex_double* x, lapack_int ldx, double* ferr, double* berr );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine performs an iterative refinement of the solution to a system of linear equations A*X = B with a packed symmetric (Hermitian) positive definite matrix A, with multiple right-hand sides. For each computed solution vector x, the routine computes the component-wise backward error β. This error is the smallest relative perturbation in elements of A and b such that x is the exact solution of the perturbed system:
$|\delta a_{ij}| \le \beta |a_{ij}|$, $|\delta b_i| \le \beta |b_i|$ such that $(A + \delta A)x = (b + \delta b)$.
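The backward error defined above can be evaluated directly for a small real symmetric system. The sketch below uses the standard componentwise formula max_i |b - A*x|_i / (|A|*|x| + |b|)_i on a matrix whose upper triangle is packed by columns. It is not the code used inside ?pprfs, which may add safeguards against tiny denominators, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

static double abs_d(double v) { return v < 0.0 ? -v : v; }

/* Element (i,j), 0-based, of a real symmetric matrix whose upper
   triangle is packed by columns: A(i,j) with i <= j is stored at
   ap[i + j*(j+1)/2]. */
static double packed_upper_get(const double *ap, size_t i, size_t j)
{
    if (i > j) { size_t t = i; i = j; j = t; }
    return ap[i + j * (j + 1) / 2];
}

/* Hypothetical helper: componentwise backward error of x for A*x = b,
   computed as max_i |b - A*x|_i / (|A|*|x| + |b|)_i. Rows whose
   denominator is zero contribute nothing. */
double componentwise_backward_error(size_t n, const double *ap,
                                    const double *b, const double *x)
{
    double beta = 0.0;
    for (size_t i = 0; i < n; ++i) {
        double r = b[i];               /* residual component       */
        double denom = abs_d(b[i]);    /* (|A|*|x| + |b|)_i         */
        for (size_t j = 0; j < n; ++j) {
            double aij = packed_upper_get(ap, i, j);
            r -= aij * x[j];
            denom += abs_d(aij) * abs_d(x[j]);
        }
        if (denom > 0.0) {
            double ratio = abs_d(r) / denom;
            if (ratio > beta) beta = ratio;
        }
    }
    return beta;
}
```

For an exact solution the function returns 0; values near the machine precision indicate that the computed x solves a nearby system.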
Finally, the routine estimates the component-wise forward error in the computed solution $\|x - x_e\|_\infty / \|x\|_\infty$ where $x_e$ is the exact solution.
Before calling this routine:
• call the factorization routine ?pptrf
• call the solver routine ?pptrs.

Input Parameters
The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'. Indicates how the input matrix A has been factored: If uplo = 'U', the upper triangle of A is stored. If uplo = 'L', the lower triangle of A is stored.
n INTEGER. The order of the matrix A; n ≥ 0.
nrhs INTEGER. The number of right-hand sides; nrhs ≥ 0.
ap, afp, b, x, work REAL for spprfs DOUBLE PRECISION for dpprfs COMPLEX for cpprfs DOUBLE COMPLEX for zpprfs. Arrays:
ap(*) contains the original packed matrix A, as supplied to ?pptrf.
afp(*) contains the factored packed matrix A, as returned by ?pptrf.
b(ldb,*) contains the right-hand side matrix B.
x(ldx,*) contains the solution matrix X.
work(*) is a workspace array.
The dimension of arrays ap and afp must be at least max(1, n(n+1)/2); the second dimension of b and x must be at least max(1, nrhs); the dimension of work must be at least max(1, 3*n) for real flavors and max(1, 2*n) for complex flavors.
ldb INTEGER. The leading dimension of b; ldb ≥ max(1, n).
ldx INTEGER. The leading dimension of x; ldx ≥ max(1, n).
iwork INTEGER. Workspace array, DIMENSION at least max(1, n).
rwork REAL for cpprfs DOUBLE PRECISION for zpprfs. Workspace array, DIMENSION at least max(1, n).

Output Parameters
x The refined solution matrix X.
ferr, berr REAL for single precision flavors. DOUBLE PRECISION for double precision flavors. Arrays, DIMENSION at least max(1, nrhs).
Contain the componentwise forward and backward errors, respectively, for each solution vector.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions. Specific details for the routine pprfs interface are as follows:
ap Holds the array A of size (n*(n+1)/2).
afp Holds the array AF of size (n*(n+1)/2).
b Holds the matrix B of size (n, nrhs).
x Holds the matrix X of size (n, nrhs).
ferr Holds the vector of length (nrhs).
berr Holds the vector of length (nrhs).
uplo Must be 'U' or 'L'. The default value is 'U'.

Application Notes
The bounds returned in ferr are not rigorous, but in practice they almost always overestimate the actual error. For each right-hand side, computation of the backward error involves a minimum of $4n^2$ floating-point operations (for real flavors) or $16n^2$ operations (for complex flavors). In addition, each step of iterative refinement involves $6n^2$ operations (for real flavors) or $24n^2$ operations (for complex flavors); the number of iterations may range from 1 to 5. Estimating the forward error involves solving a number of systems of linear equations A*x = b; the number of systems is usually 4 or 5 and never more than 11. Each solution requires approximately $2n^2$ floating-point operations for real flavors or $8n^2$ for complex flavors.

?pbrfs
Refines the solution of a system of linear equations with a band symmetric (Hermitian) positive-definite matrix and estimates its error.
Syntax
Fortran 77:
call spbrfs( uplo, n, kd, nrhs, ab, ldab, afb, ldafb, b, ldb, x, ldx, ferr, berr, work, iwork, info )
call dpbrfs( uplo, n, kd, nrhs, ab, ldab, afb, ldafb, b, ldb, x, ldx, ferr, berr, work, iwork, info )
call cpbrfs( uplo, n, kd, nrhs, ab, ldab, afb, ldafb, b, ldb, x, ldx, ferr, berr, work, rwork, info )
call zpbrfs( uplo, n, kd, nrhs, ab, ldab, afb, ldafb, b, ldb, x, ldx, ferr, berr, work, rwork, info )
Fortran 95:
call pbrfs( ab, afb, b, x [,uplo] [,ferr] [,berr] [,info] )
C:
lapack_int LAPACKE_spbrfs( int matrix_order, char uplo, lapack_int n, lapack_int kd, lapack_int nrhs, const float* ab, lapack_int ldab, const float* afb, lapack_int ldafb, const float* b, lapack_int ldb, float* x, lapack_int ldx, float* ferr, float* berr );
lapack_int LAPACKE_dpbrfs( int matrix_order, char uplo, lapack_int n, lapack_int kd, lapack_int nrhs, const double* ab, lapack_int ldab, const double* afb, lapack_int ldafb, const double* b, lapack_int ldb, double* x, lapack_int ldx, double* ferr, double* berr );
lapack_int LAPACKE_cpbrfs( int matrix_order, char uplo, lapack_int n, lapack_int kd, lapack_int nrhs, const lapack_complex_float* ab, lapack_int ldab, const lapack_complex_float* afb, lapack_int ldafb, const lapack_complex_float* b, lapack_int ldb, lapack_complex_float* x, lapack_int ldx, float* ferr, float* berr );
lapack_int LAPACKE_zpbrfs( int matrix_order, char uplo, lapack_int n, lapack_int kd, lapack_int nrhs, const lapack_complex_double* ab, lapack_int ldab, const lapack_complex_double* afb, lapack_int ldafb, const lapack_complex_double* b, lapack_int ldb, lapack_complex_double* x, lapack_int ldx, double* ferr, double* berr );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine performs an iterative refinement of the solution to a system of linear equations A*X = B with a symmetric (Hermitian) positive definite band matrix A, with
multiple right-hand sides. For each computed solution vector x, the routine computes the component-wise backward error β. This error is the smallest relative perturbation in elements of A and b such that x is the exact solution of the perturbed system:
|δa_ij| ≤ β|a_ij|, |δb_i| ≤ β|b_i| such that (A + δA)x = (b + δb).
Finally, the routine estimates the component-wise forward error in the computed solution ||x - x_e||∞ / ||x||∞ (here x_e is the exact solution).
Before calling this routine:
• call the factorization routine ?pbtrf
• call the solver routine ?pbtrs.

Input Parameters
The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo  CHARACTER*1. Must be 'U' or 'L'. Indicates how the input matrix A has been factored:
  If uplo = 'U', the upper triangle of A is stored.
  If uplo = 'L', the lower triangle of A is stored.
n     INTEGER. The order of the matrix A; n ≥ 0.
kd    INTEGER. The number of superdiagonals or subdiagonals in the matrix A; kd ≥ 0.
nrhs  INTEGER. The number of right-hand sides; nrhs ≥ 0.
ab, afb, b, x, work  REAL for spbrfs; DOUBLE PRECISION for dpbrfs; COMPLEX for cpbrfs; DOUBLE COMPLEX for zpbrfs.
  Arrays:
  ab(ldab,*) contains the original band matrix A, as supplied to ?pbtrf.
  afb(ldafb,*) contains the factored band matrix A, as returned by ?pbtrf.
  b(ldb,*) contains the right-hand side matrix B.
  x(ldx,*) contains the solution matrix X.
  work(*) is a workspace array.
  The second dimension of ab and afb must be at least max(1, n); the second dimension of b and x must be at least max(1, nrhs); the dimension of work must be at least max(1, 3*n) for real flavors and max(1, 2*n) for complex flavors.
ldab  INTEGER. The leading dimension of ab; ldab ≥ kd + 1.
ldafb INTEGER. The leading dimension of afb; ldafb ≥ kd + 1.
ldb   INTEGER.
The leading dimension of b; ldb ≥ max(1, n).
ldx   INTEGER. The leading dimension of x; ldx ≥ max(1, n).
iwork INTEGER. Workspace array, DIMENSION at least max(1, n).
rwork REAL for cpbrfs; DOUBLE PRECISION for zpbrfs. Workspace array, DIMENSION at least max(1, n).

Output Parameters
x     The refined solution matrix X.
ferr, berr  REAL for single precision flavors; DOUBLE PRECISION for double precision flavors. Arrays, DIMENSION at least max(1, nrhs). Contain the componentwise forward and backward errors, respectively, for each solution vector.
info  INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine pbrfs interface are as follows:
ab    Holds the array A of size (kd+1, n).
afb   Holds the array AF of size (kd+1, n).
b     Holds the matrix B of size (n, nrhs).
x     Holds the matrix X of size (n, nrhs).
ferr  Holds the vector of length (nrhs).
berr  Holds the vector of length (nrhs).
uplo  Must be 'U' or 'L'. The default value is 'U'.

Application Notes
The bounds returned in ferr are not rigorous, but in practice they almost always overestimate the actual error.
For each right-hand side, computation of the backward error involves a minimum of 8n*kd floating-point operations (for real flavors) or 32n*kd operations (for complex flavors). In addition, each step of iterative refinement involves 12n*kd operations (for real flavors) or 48n*kd operations (for complex flavors); the number of iterations may range from 1 to 5.
Estimating the forward error involves solving a number of systems of linear equations A*x = b; the number is usually 4 or 5 and never more than 11.
Each solution requires approximately 4n*kd floating-point operations for real flavors or 16n*kd for complex flavors.

?ptrfs
Refines the solution of a system of linear equations with a symmetric (Hermitian) positive-definite tridiagonal matrix and estimates its error.

Syntax

Fortran 77:
call sptrfs( n, nrhs, d, e, df, ef, b, ldb, x, ldx, ferr, berr, work, info )
call dptrfs( n, nrhs, d, e, df, ef, b, ldb, x, ldx, ferr, berr, work, info )
call cptrfs( uplo, n, nrhs, d, e, df, ef, b, ldb, x, ldx, ferr, berr, work, rwork, info )
call zptrfs( uplo, n, nrhs, d, e, df, ef, b, ldb, x, ldx, ferr, berr, work, rwork, info )

Fortran 95:
call ptrfs( d, df, e, ef, b, x [,ferr] [,berr] [,info] )
call ptrfs( d, df, e, ef, b, x [,uplo] [,ferr] [,berr] [,info] )

C:
lapack_int LAPACKE_sptrfs( int matrix_order, lapack_int n, lapack_int nrhs, const float* d, const float* e, const float* df, const float* ef, const float* b, lapack_int ldb, float* x, lapack_int ldx, float* ferr, float* berr );
lapack_int LAPACKE_dptrfs( int matrix_order, lapack_int n, lapack_int nrhs, const double* d, const double* e, const double* df, const double* ef, const double* b, lapack_int ldb, double* x, lapack_int ldx, double* ferr, double* berr );
lapack_int LAPACKE_cptrfs( int matrix_order, char uplo, lapack_int n, lapack_int nrhs, const float* d, const lapack_complex_float* e, const float* df, const lapack_complex_float* ef, const lapack_complex_float* b, lapack_int ldb, lapack_complex_float* x, lapack_int ldx, float* ferr, float* berr );
lapack_int LAPACKE_zptrfs( int matrix_order, char uplo, lapack_int n, lapack_int nrhs, const double* d, const lapack_complex_double* e, const double* df, const lapack_complex_double* ef, const lapack_complex_double* b, lapack_int ldb, lapack_complex_double* x, lapack_int ldx, double* ferr, double* berr );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine
performs an iterative refinement of the solution to a system of linear equations A*X = B with a symmetric (Hermitian) positive definite tridiagonal matrix A, with multiple right-hand sides. For each computed solution vector x, the routine computes the component-wise backward error β. This error is the smallest relative perturbation in elements of A and b such that x is the exact solution of the perturbed system:
|δa_ij| ≤ β|a_ij|, |δb_i| ≤ β|b_i| such that (A + δA)x = (b + δb).
Finally, the routine estimates the component-wise forward error in the computed solution ||x - x_e||∞ / ||x||∞ (here x_e is the exact solution).
Before calling this routine:
• call the factorization routine ?pttrf
• call the solver routine ?pttrs.

Input Parameters
The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo  CHARACTER*1. Used for complex flavors only. Must be 'U' or 'L'. Specifies whether the superdiagonal or the subdiagonal of the tridiagonal matrix A is stored and how A is factored:
  If uplo = 'U', the array e stores the superdiagonal of A, and A is factored as U^H*D*U.
  If uplo = 'L', the array e stores the subdiagonal of A, and A is factored as L*D*L^H.
n     INTEGER. The order of the matrix A; n ≥ 0.
nrhs  INTEGER. The number of right-hand sides; nrhs ≥ 0.
d, df, rwork  REAL for single precision flavors; DOUBLE PRECISION for double precision flavors.
  Arrays: d(n), df(n), rwork(n).
  The array d contains the n diagonal elements of the tridiagonal matrix A.
  The array df contains the n diagonal elements of the diagonal matrix D from the factorization of A as computed by ?pttrf.
  The array rwork is a workspace array used for complex flavors only.
e, ef, b, x, work  REAL for sptrfs; DOUBLE PRECISION for dptrfs; COMPLEX for cptrfs; DOUBLE COMPLEX for zptrfs.
  Arrays: e(n-1), ef(n-1), b(ldb,nrhs), x(ldx,nrhs), work(*).
  The array e contains the (n - 1) off-diagonal elements of the tridiagonal matrix A (see uplo).
  The array ef contains the (n - 1) off-diagonal elements of the unit bidiagonal factor U or L from the factorization computed by ?pttrf (see uplo).
  The array b contains the matrix B whose columns are the right-hand sides for the systems of equations.
  The array x contains the solution matrix X as computed by ?pttrs.
  The array work is a workspace array. The dimension of work must be at least 2*n for real flavors, and at least n for complex flavors.
ldb   INTEGER. The leading dimension of b; ldb ≥ max(1, n).
ldx   INTEGER. The leading dimension of x; ldx ≥ max(1, n).

Output Parameters
x     The refined solution matrix X.
ferr, berr  REAL for single precision flavors; DOUBLE PRECISION for double precision flavors. Arrays, DIMENSION at least max(1, nrhs). Contain the componentwise forward and backward errors, respectively, for each solution vector.
info  INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine ptrfs interface are as follows:
d     Holds the vector of length n.
df    Holds the vector of length n.
e     Holds the vector of length (n-1).
ef    Holds the vector of length (n-1).
b     Holds the matrix B of size (n,nrhs).
x     Holds the matrix X of size (n,nrhs).
ferr  Holds the vector of length (nrhs).
berr  Holds the vector of length (nrhs).
uplo  Used in complex flavors only. Must be 'U' or 'L'. The default value is 'U'.

?syrfs
Refines the solution of a system of linear equations with a symmetric matrix and estimates its error.
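As a rough illustration of the component-wise backward error that the refinement routines in this chapter report in berr, the sketch below evaluates max_i |b - A*x|_i / (|A|*|x| + |b|)_i for one right-hand side. It assumes a real symmetric matrix held in a dense column-major array with both triangles filled. The function name componentwise_berr is hypothetical, and the handling of zero denominators (such rows are skipped) is a simplification that may differ from what the library does internally.

```c
#include <assert.h>
#include <stddef.h>

/* Absolute value without pulling in the math library. */
static double abs_d(double v) { return v < 0.0 ? -v : v; }

/* Hypothetical helper (not an MKL routine): component-wise backward error
       max_i |b - A*x|_i / (|A|*|x| + |b|)_i
   for a single right-hand side. a is n-by-n, column-major, leading
   dimension lda, with both triangles of the symmetric matrix stored.
   Rows with a zero denominator are skipped (a simplifying assumption). */
double componentwise_berr(int n, const double *a, int lda,
                          const double *x, const double *b)
{
    assert(n >= 0 && lda >= (n > 1 ? n : 1));

    double berr = 0.0;
    for (int i = 0; i < n; ++i) {
        double r = b[i];             /* residual component (b - A*x)_i  */
        double denom = abs_d(b[i]);  /* (|A|*|x| + |b|)_i               */
        for (int j = 0; j < n; ++j) {
            double aij = a[(size_t)i + (size_t)j * (size_t)lda];
            r -= aij * x[j];
            denom += abs_d(aij) * abs_d(x[j]);
        }
        if (denom > 0.0) {
            double ratio = abs_d(r) / denom;
            if (ratio > berr)
                berr = ratio;
        }
    }
    return berr;
}
```

A value near the machine precision suggests that x solves a system very close to the one supplied; a much larger value suggests that further refinement could help.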
Syntax

Fortran 77:
call ssyrfs( uplo, n, nrhs, a, lda, af, ldaf, ipiv, b, ldb, x, ldx, ferr, berr, work, iwork, info )
call dsyrfs( uplo, n, nrhs, a, lda, af, ldaf, ipiv, b, ldb, x, ldx, ferr, berr, work, iwork, info )
call csyrfs( uplo, n, nrhs, a, lda, af, ldaf, ipiv, b, ldb, x, ldx, ferr, berr, work, rwork, info )
call zsyrfs( uplo, n, nrhs, a, lda, af, ldaf, ipiv, b, ldb, x, ldx, ferr, berr, work, rwork, info )

Fortran 95:
call syrfs( a, af, ipiv, b, x [,uplo] [,ferr] [,berr] [,info] )

C:
lapack_int LAPACKE_ssyrfs( int matrix_order, char uplo, lapack_int n, lapack_int nrhs, const float* a, lapack_int lda, const float* af, lapack_int ldaf, const lapack_int* ipiv, const float* b, lapack_int ldb, float* x, lapack_int ldx, float* ferr, float* berr );
lapack_int LAPACKE_dsyrfs( int matrix_order, char uplo, lapack_int n, lapack_int nrhs, const double* a, lapack_int lda, const double* af, lapack_int ldaf, const lapack_int* ipiv, const double* b, lapack_int ldb, double* x, lapack_int ldx, double* ferr, double* berr );
lapack_int LAPACKE_csyrfs( int matrix_order, char uplo, lapack_int n, lapack_int nrhs, const lapack_complex_float* a, lapack_int lda, const lapack_complex_float* af, lapack_int ldaf, const lapack_int* ipiv, const lapack_complex_float* b, lapack_int ldb, lapack_complex_float* x, lapack_int ldx, float* ferr, float* berr );
lapack_int LAPACKE_zsyrfs( int matrix_order, char uplo, lapack_int n, lapack_int nrhs, const lapack_complex_double* a, lapack_int lda, const lapack_complex_double* af, lapack_int ldaf, const lapack_int* ipiv, const lapack_complex_double* b, lapack_int ldb, lapack_complex_double* x, lapack_int ldx, double* ferr, double* berr );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine performs an iterative refinement of the solution to a system of linear equations A*X = B with a symmetric full-storage matrix A, with multiple right-hand
sides. For each computed solution vector x, the routine computes the component-wise backward error β. This error is the smallest relative perturbation in elements of A and b such that x is the exact solution of the perturbed system:
|δa_ij| ≤ β|a_ij|, |δb_i| ≤ β|b_i| such that (A + δA)x = (b + δb).
Finally, the routine estimates the component-wise forward error in the computed solution ||x - x_e||∞ / ||x||∞ (here x_e is the exact solution).
Before calling this routine:
• call the factorization routine ?sytrf
• call the solver routine ?sytrs.

Input Parameters
The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo  CHARACTER*1. Must be 'U' or 'L'.
  If uplo = 'U', the upper triangle of A is stored.
  If uplo = 'L', the lower triangle of A is stored.
n     INTEGER. The order of the matrix A; n ≥ 0.
nrhs  INTEGER. The number of right-hand sides; nrhs ≥ 0.
a, af, b, x, work  REAL for ssyrfs; DOUBLE PRECISION for dsyrfs; COMPLEX for csyrfs; DOUBLE COMPLEX for zsyrfs.
  Arrays:
  a(lda,*) contains the original matrix A, as supplied to ?sytrf.
  af(ldaf,*) contains the factored matrix A, as returned by ?sytrf.
  b(ldb,*) contains the right-hand side matrix B.
  x(ldx,*) contains the solution matrix X.
  work(*) is a workspace array.
  The second dimension of a and af must be at least max(1, n); the second dimension of b and x must be at least max(1, nrhs); the dimension of work must be at least max(1, 3*n) for real flavors and max(1, 2*n) for complex flavors.
lda   INTEGER. The leading dimension of a; lda ≥ max(1, n).
ldaf  INTEGER. The leading dimension of af; ldaf ≥ max(1, n).
ldb   INTEGER. The leading dimension of b; ldb ≥ max(1, n).
ldx   INTEGER. The leading dimension of x; ldx ≥ max(1, n).
ipiv  INTEGER. Array, DIMENSION at least max(1, n).
The ipiv array, as returned by ?sytrf.
iwork INTEGER. Workspace array, DIMENSION at least max(1, n).
rwork REAL for csyrfs; DOUBLE PRECISION for zsyrfs. Workspace array, DIMENSION at least max(1, n).

Output Parameters
x     The refined solution matrix X.
ferr, berr  REAL for single precision flavors; DOUBLE PRECISION for double precision flavors. Arrays, DIMENSION at least max(1, nrhs). Contain the componentwise forward and backward errors, respectively, for each solution vector.
info  INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine syrfs interface are as follows:
a     Holds the matrix A of size (n,n).
af    Holds the matrix AF of size (n,n).
ipiv  Holds the vector of length n.
b     Holds the matrix B of size (n,nrhs).
x     Holds the matrix X of size (n,nrhs).
ferr  Holds the vector of length (nrhs).
berr  Holds the vector of length (nrhs).
uplo  Must be 'U' or 'L'. The default value is 'U'.

Application Notes
The bounds returned in ferr are not rigorous, but in practice they almost always overestimate the actual error.
For each right-hand side, computation of the backward error involves a minimum of 4n^2 floating-point operations (for real flavors) or 16n^2 operations (for complex flavors). In addition, each step of iterative refinement involves 6n^2 operations (for real flavors) or 24n^2 operations (for complex flavors); the number of iterations may range from 1 to 5.
Estimating the forward error involves solving a number of systems of linear equations A*x = b; the number is usually 4 or 5 and never more than 11.
Each solution requires approximately 2n^2 floating-point operations for real flavors or 8n^2 for complex flavors.

?syrfsx
Uses extra precise iterative refinement to improve the solution to the system of linear equations with a symmetric indefinite matrix A and provides error bounds and backward error estimates.

Syntax

Fortran 77:
call ssyrfsx( uplo, equed, n, nrhs, a, lda, af, ldaf, ipiv, s, b, ldb, x, ldx, rcond, berr, n_err_bnds, err_bnds_norm, err_bnds_comp, nparams, params, work, iwork, info )
call dsyrfsx( uplo, equed, n, nrhs, a, lda, af, ldaf, ipiv, s, b, ldb, x, ldx, rcond, berr, n_err_bnds, err_bnds_norm, err_bnds_comp, nparams, params, work, iwork, info )
call csyrfsx( uplo, equed, n, nrhs, a, lda, af, ldaf, ipiv, s, b, ldb, x, ldx, rcond, berr, n_err_bnds, err_bnds_norm, err_bnds_comp, nparams, params, work, rwork, info )
call zsyrfsx( uplo, equed, n, nrhs, a, lda, af, ldaf, ipiv, s, b, ldb, x, ldx, rcond, berr, n_err_bnds, err_bnds_norm, err_bnds_comp, nparams, params, work, rwork, info )

C:
lapack_int LAPACKE_ssyrfsx( int matrix_order, char uplo, char equed, lapack_int n, lapack_int nrhs, const float* a, lapack_int lda, const float* af, lapack_int ldaf, const lapack_int* ipiv, const float* s, const float* b, lapack_int ldb, float* x, lapack_int ldx, float* rcond, float* berr, lapack_int n_err_bnds, float* err_bnds_norm, float* err_bnds_comp, lapack_int nparams, float* params );
lapack_int LAPACKE_dsyrfsx( int matrix_order, char uplo, char equed, lapack_int n, lapack_int nrhs, const double* a, lapack_int lda, const double* af, lapack_int ldaf, const lapack_int* ipiv, const double* s, const double* b, lapack_int ldb, double* x, lapack_int ldx, double* rcond, double* berr, lapack_int n_err_bnds, double* err_bnds_norm, double* err_bnds_comp, lapack_int nparams, double* params );
lapack_int LAPACKE_csyrfsx( int matrix_order, char uplo, char equed, lapack_int n, lapack_int nrhs, const lapack_complex_float* a, lapack_int lda, const lapack_complex_float* af,
lapack_int ldaf, const lapack_int* ipiv, const float* s, const lapack_complex_float* b, lapack_int ldb, lapack_complex_float* x, lapack_int ldx, float* rcond, float* berr, lapack_int n_err_bnds, float* err_bnds_norm, float* err_bnds_comp, lapack_int nparams, float* params );
lapack_int LAPACKE_zsyrfsx( int matrix_order, char uplo, char equed, lapack_int n, lapack_int nrhs, const lapack_complex_double* a, lapack_int lda, const lapack_complex_double* af, lapack_int ldaf, const lapack_int* ipiv, const double* s, const lapack_complex_double* b, lapack_int ldb, lapack_complex_double* x, lapack_int ldx, double* rcond, double* berr, lapack_int n_err_bnds, double* err_bnds_norm, double* err_bnds_comp, lapack_int nparams, double* params );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine improves the computed solution to a system of linear equations when the coefficient matrix is symmetric indefinite, and provides error bounds and backward error estimates for the solution. In addition to a normwise error bound, the code provides a maximum componentwise error bound, if possible. See comments for err_bnds_norm and err_bnds_comp for details of the error bounds.
The original system of linear equations may have been equilibrated before calling this routine, as described by the parameters equed and s below. In this case, the solution and error bounds returned are for the original unequilibrated system.

Input Parameters
The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo  CHARACTER*1. Must be 'U' or 'L'. Indicates whether the upper or lower triangular part of A is stored:
  If uplo = 'U', the upper triangle of A is stored.
  If uplo = 'L', the lower triangle of A is stored.
equed CHARACTER*1. Must be 'N' or 'Y'. Specifies the form of equilibration that was done to A before calling this routine.
  If equed = 'N', no equilibration was done.
  If equed = 'Y', both row and column equilibration was done, that is, A has been replaced by diag(s)*A*diag(s). The right-hand side B has been changed accordingly.
n     INTEGER. The number of linear equations; the order of the matrix A; n ≥ 0.
nrhs  INTEGER. The number of right-hand sides; the number of columns of the matrices B and X; nrhs ≥ 0.
a, af, b, work  REAL for ssyrfsx; DOUBLE PRECISION for dsyrfsx; COMPLEX for csyrfsx; DOUBLE COMPLEX for zsyrfsx.
  Arrays: a(lda,*), af(ldaf,*), b(ldb,*), work(*).
  The array a contains the symmetric matrix A as specified by uplo. If uplo = 'U', the leading n-by-n upper triangular part of a contains the upper triangular part of the matrix A and the strictly lower triangular part of a is not referenced. If uplo = 'L', the leading n-by-n lower triangular part of a contains the lower triangular part of the matrix A and the strictly upper triangular part of a is not referenced. The second dimension of a must be at least max(1,n).
  The array af contains the block diagonal matrix D and the multipliers used to obtain the factor U or L from the factorization A = U*D*U**T or A = L*D*L**T as computed by ?sytrf.
  The array b contains the matrix B whose columns are the right-hand sides for the systems of equations. The second dimension of b must be at least max(1,nrhs).
  work(*) is a workspace array. The dimension of work must be at least max(1,4*n) for real flavors, and at least max(1,2*n) for complex flavors.
lda   INTEGER. The leading dimension of a; lda ≥ max(1, n).
ldaf  INTEGER. The leading dimension of af; ldaf ≥ max(1, n).
ipiv  INTEGER. Array, DIMENSION at least max(1, n).
Contains details of the interchanges and the block structure of D as determined by ?sytrf.
s     REAL for single precision flavors; DOUBLE PRECISION for double precision flavors. Array, DIMENSION (n). The array s contains the scale factors for A. If equed = 'N', s is not accessed. If equed = 'Y', each element of s must be positive.
  Each element of s should be a power of the radix to ensure a reliable solution and error estimates. Scaling by powers of the radix does not cause rounding errors unless the result underflows or overflows. Rounding errors during scaling lead to refining with a matrix that is not equivalent to the input matrix, producing error estimates that may not be reliable.
ldb   INTEGER. The leading dimension of the array b; ldb ≥ max(1, n).
x     REAL for ssyrfsx; DOUBLE PRECISION for dsyrfsx; COMPLEX for csyrfsx; DOUBLE COMPLEX for zsyrfsx. Array, DIMENSION (ldx,*). The solution matrix X as computed by ?sytrs.
ldx   INTEGER. The leading dimension of the output array x; ldx ≥ max(1, n).
n_err_bnds  INTEGER. Number of error bounds to return for each right-hand side and each type (normwise or componentwise). See the err_bnds_norm and err_bnds_comp descriptions in the Output Parameters section below.
nparams  INTEGER. Specifies the number of parameters set in params. If nparams ≤ 0, the params array is never referenced and default values are used.
params  REAL for single precision flavors; DOUBLE PRECISION for double precision flavors. Array, DIMENSION nparams. Specifies algorithm parameters. If an entry is less than 0.0, that entry will be filled with the default value used for that parameter. Only positions up to nparams are accessed; defaults are used for higher-numbered parameters. If defaults are acceptable, you can pass nparams = 0, which prevents the routine from accessing the params argument.
  params(la_linrx_itref_i = 1) : Whether to perform iterative refinement or not.
  Default: 1.0 (for single precision flavors), 1.0D+0 (for double precision flavors).
    = 0.0  No refinement is performed and no error bounds are computed.
    = 1.0  Use the double-precision refinement algorithm, possibly with doubled-single computations if the compilation environment does not support DOUBLE PRECISION.
    (Other values are reserved for future use.)
  params(la_linrx_ithresh_i = 2) : Maximum number of residual computations allowed for refinement.
    Default     10
    Aggressive  Set to 100 to permit convergence using approximate factorizations or factorizations other than LU. If the factorization uses a technique other than Gaussian elimination, the guarantees in err_bnds_norm and err_bnds_comp may no longer be trustworthy.
  params(la_linrx_cwise_i = 3) : Flag determining if the code will attempt to find a solution with a small componentwise relative error in the double-precision algorithm. Positive is true, 0.0 is false. Default: 1.0 (attempt componentwise convergence).
iwork INTEGER. Workspace array, DIMENSION at least max(1, n); used in real flavors only.
rwork REAL for single precision flavors; DOUBLE PRECISION for double precision flavors. Workspace array, DIMENSION at least max(1, 3*n); used in complex flavors only.

Output Parameters
x     REAL for ssyrfsx; DOUBLE PRECISION for dsyrfsx; COMPLEX for csyrfsx; DOUBLE COMPLEX for zsyrfsx. The improved solution matrix X.
rcond REAL for single precision flavors; DOUBLE PRECISION for double precision flavors. Reciprocal scaled condition number. An estimate of the reciprocal Skeel condition number of the matrix A after equilibration (if done). If rcond is less than the machine precision, in particular, if rcond = 0, the matrix is singular to working precision. Note that the error may still be small even if this number is very small and the matrix appears ill-conditioned.
berr  REAL for single precision flavors; DOUBLE PRECISION for double precision flavors.
Array, DIMENSION at least max(1, nrhs). Contains the componentwise relative backward error for each solution vector x(j), that is, the smallest relative change in any element of A or B that makes x(j) an exact solution.
err_bnds_norm  REAL for single precision flavors; DOUBLE PRECISION for double precision flavors. Array, DIMENSION (nrhs,n_err_bnds). For each right-hand side, contains information about various error bounds and condition numbers corresponding to the normwise relative error, which is defined as follows:
  Normwise relative error in the i-th solution vector:
    max_j |Xtrue(j,i) - X(j,i)| / max_j |X(j,i)|
  (here Xtrue denotes the exact solution).
  The array is indexed by the type of error information as described below. There are currently up to three pieces of information returned.
  The first index in err_bnds_norm(i,:) corresponds to the i-th right-hand side.
  The second index in err_bnds_norm(:,err) contains the following three fields:
  err=1  "Trust/don't trust" boolean. Trust the answer if the reciprocal condition number is not less than the threshold sqrt(n)*slamch('e') for single precision flavors and sqrt(n)*dlamch('e') for double precision flavors.
  err=2  "Guaranteed" error bound. The estimated forward error, almost certainly within a factor of 10 of the true error so long as the next entry is greater than the threshold sqrt(n)*slamch('e') for single precision flavors and sqrt(n)*dlamch('e') for double precision flavors. This error bound should only be trusted if the previous boolean is true.
  err=3  Reciprocal condition number. Estimated normwise reciprocal condition number. Compared with the threshold sqrt(n)*slamch('e') for single precision flavors and sqrt(n)*dlamch('e') for double precision flavors to determine if the error estimate is "guaranteed". These reciprocal condition numbers are 1/(norm(z^(-1),inf)*norm(z,inf)) for some appropriately scaled matrix Z. Let z=s*a, where s scales each row by a power of the radix so all absolute row sums of z are approximately 1.
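The three fields described above can be read back from the flat output array as in the following minimal sketch. It assumes the Fortran (column-major) layout of a (nrhs, n_err_bnds) array of doubles; the struct err_bnd_info and the function read_err_bnds are hypothetical names introduced only for this illustration and are not part of Intel MKL. Fields that were not requested through n_err_bnds are reported as -1.0.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical container for one row of err_bnds_norm or err_bnds_comp. */
typedef struct {
    int    trusted;  /* err = 1: nonzero if the bound may be trusted      */
    double bound;    /* err = 2: estimated forward error, -1.0 if absent  */
    double rcond;    /* err = 3: reciprocal condition number, -1.0 if absent */
} err_bnd_info;

/* Hypothetical helper (not an MKL routine): read the entries for the
   i-th right-hand side (1-based, as in the text) from an array laid out
   column-major with dimensions (nrhs, n_err_bnds), so that element
   (i, err) sits at offset (i - 1) + (err - 1) * nrhs. */
err_bnd_info read_err_bnds(const double *err_bnds, int nrhs,
                           int n_err_bnds, int i)
{
    assert(nrhs >= 1 && i >= 1 && i <= nrhs && n_err_bnds >= 0);

    err_bnd_info out = { 0, -1.0, -1.0 };
    const double *row = err_bnds + (i - 1);

    if (n_err_bnds >= 1)
        out.trusted = (row[0] != 0.0);
    if (n_err_bnds >= 2)
        out.bound = row[(size_t)nrhs];
    if (n_err_bnds >= 3)
        out.rcond = row[2 * (size_t)nrhs];
    return out;
}
```

A caller would typically check the trusted field first and use bound only when it is nonzero. With the row-major option of the C interface the physical layout of the array may differ, so this indexing should be verified before relying on it.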
err_bnds_comp  REAL for single precision flavors; DOUBLE PRECISION for double precision flavors. Array, DIMENSION (nrhs,n_err_bnds). For each right-hand side, contains information about various error bounds and condition numbers corresponding to the componentwise relative error, which is defined as follows:
  Componentwise relative error in the i-th solution vector:
    max_j ( |Xtrue(j,i) - X(j,i)| / |X(j,i)| )
  (here Xtrue denotes the exact solution).
  The array is indexed by the right-hand side i, on which the componentwise relative error depends, and by the type of error information as described below. There are currently up to three pieces of information returned for each right-hand side. If componentwise accuracy is not requested (params(3) = 0.0), then err_bnds_comp is not accessed. If n_err_bnds < 3, then at most the first (:,n_err_bnds) entries are returned.
  The first index in err_bnds_comp(i,:) corresponds to the i-th right-hand side.
  The second index in err_bnds_comp(:,err) contains the following three fields:
  err=1  "Trust/don't trust" boolean. Trust the answer if the reciprocal condition number is not less than the threshold sqrt(n)*slamch('e') for single precision flavors and sqrt(n)*dlamch('e') for double precision flavors.
  err=2  "Guaranteed" error bound. The estimated forward error, almost certainly within a factor of 10 of the true error so long as the next entry is greater than the threshold sqrt(n)*slamch('e') for single precision flavors and sqrt(n)*dlamch('e') for double precision flavors. This error bound should only be trusted if the previous boolean is true.
  err=3  Reciprocal condition number. Estimated componentwise reciprocal condition number. Compared with the threshold sqrt(n)*slamch('e') for single precision flavors and sqrt(n)*dlamch('e') for double precision flavors to determine if the error estimate is "guaranteed". These reciprocal condition numbers are 1/(norm(z^(-1),inf)*norm(z,inf)) for some appropriately scaled matrix Z.
  Let z=s*(a*diag(x)), where x is the solution for the current right-hand side and s scales each row of a*diag(x) by a power of the radix so all absolute row sums of z are approximately 1.
params  REAL for single precision flavors; DOUBLE PRECISION for double precision flavors. Output parameter only if the input contains erroneous values, namely, in params(1), params(2), params(3). In such a case, the corresponding elements of params are filled with default values on output.
info  INTEGER.
  If info = 0, the execution is successful. The solution to every right-hand side is guaranteed.
  If info = -i, the i-th parameter had an illegal value.
  If 0 < info ≤ n: U(info,info) is exactly zero. The factorization has been completed, but the factor U is exactly singular, so the solution and error bounds could not be computed; rcond = 0 is returned.
  If info = n+j: The solution corresponding to the j-th right-hand side is not guaranteed. The solutions corresponding to other right-hand sides k with k > j may not be guaranteed as well, but only the first such right-hand side is reported. If a small componentwise error is not requested (params(3) = 0.0), then the j-th right-hand side is the first with a normwise error bound that is not guaranteed (the smallest j such that err_bnds_norm(j,1) = 0.0 or err_bnds_comp(j,1) = 0.0). See the definition of err_bnds_norm(:,1) and err_bnds_comp(:,1). To get information about all of the right-hand sides, check err_bnds_norm or err_bnds_comp.

?herfs
Refines the solution of a system of linear equations with a complex Hermitian matrix and estimates its error.
Syntax

Fortran 77:
call cherfs( uplo, n, nrhs, a, lda, af, ldaf, ipiv, b, ldb, x, ldx, ferr, berr, work, rwork, info )
call zherfs( uplo, n, nrhs, a, lda, af, ldaf, ipiv, b, ldb, x, ldx, ferr, berr, work, rwork, info )

Fortran 95:
call herfs( a, af, ipiv, b, x [,uplo] [,ferr] [,berr] [,info] )

C:
lapack_int LAPACKE_cherfs( int matrix_order, char uplo, lapack_int n, lapack_int nrhs, const lapack_complex_float* a, lapack_int lda, const lapack_complex_float* af, lapack_int ldaf, const lapack_int* ipiv, const lapack_complex_float* b, lapack_int ldb, lapack_complex_float* x, lapack_int ldx, float* ferr, float* berr );
lapack_int LAPACKE_zherfs( int matrix_order, char uplo, lapack_int n, lapack_int nrhs, const lapack_complex_double* a, lapack_int lda, const lapack_complex_double* af, lapack_int ldaf, const lapack_int* ipiv, const lapack_complex_double* b, lapack_int ldb, lapack_complex_double* x, lapack_int ldx, double* ferr, double* berr );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine performs an iterative refinement of the solution to a system of linear equations A*X = B with a complex Hermitian full-storage matrix A, with multiple right-hand sides. For each computed solution vector x, the routine computes the component-wise backward error β. This error is the smallest relative perturbation in elements of A and b such that x is the exact solution of the perturbed system:
|δa_ij| ≤ β|a_ij|, |δb_i| ≤ β|b_i| such that (A + δA)x = (b + δb).
Finally, the routine estimates the component-wise forward error in the computed solution ||x - x_e||∞ / ||x||∞ (here x_e is the exact solution).
Before calling this routine:
• call the factorization routine ?hetrf
• call the solver routine ?hetrs.

Input Parameters
The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above.
See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo  CHARACTER*1. Must be 'U' or 'L'.
If uplo = 'U', the upper triangle of A is stored.
If uplo = 'L', the lower triangle of A is stored.
n  INTEGER. The order of the matrix A; n ≥ 0.
nrhs  INTEGER. The number of right-hand sides; nrhs ≥ 0.
a,af,b,x,work  COMPLEX for cherfs, DOUBLE COMPLEX for zherfs.
Arrays:
a(lda,*) contains the original matrix A, as supplied to ?hetrf.
af(ldaf,*) contains the factored matrix A, as returned by ?hetrf.
b(ldb,*) contains the right-hand side matrix B.
x(ldx,*) contains the solution matrix X.
work(*) is a workspace array.
The second dimension of a and af must be at least max(1, n); the second dimension of b and x must be at least max(1, nrhs); the dimension of work must be at least max(1, 2*n).
lda  INTEGER. The leading dimension of a; lda ≥ max(1, n).
ldaf  INTEGER. The leading dimension of af; ldaf ≥ max(1, n).
ldb  INTEGER. The leading dimension of b; ldb ≥ max(1, n).
ldx  INTEGER. The leading dimension of x; ldx ≥ max(1, n).
ipiv  INTEGER. Array, DIMENSION at least max(1, n). The ipiv array, as returned by ?hetrf.
rwork  REAL for cherfs, DOUBLE PRECISION for zherfs. Workspace array, DIMENSION at least max(1, n).

Output Parameters
x  The refined solution matrix X.
ferr, berr  REAL for cherfs, DOUBLE PRECISION for zherfs. Arrays, DIMENSION at least max(1, nrhs). Contain the componentwise forward and backward errors, respectively, for each solution vector.
info  INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine herfs interface are as follows:
a  Holds the matrix A of size (n,n).
af  Holds the matrix AF of size (n,n).
ipiv  Holds the vector of length n.
b  Holds the matrix B of size (n,nrhs).
x  Holds the matrix X of size (n,nrhs).
ferr  Holds the vector of length (nrhs).
berr  Holds the vector of length (nrhs).
uplo  Must be 'U' or 'L'. The default value is 'U'.

Application Notes
The bounds returned in ferr are not rigorous, but in practice they almost always overestimate the actual error.
For each right-hand side, computation of the backward error involves a minimum of $16n^2$ operations. In addition, each step of iterative refinement involves $24n^2$ operations; the number of iterations may range from 1 to 5.
Estimating the forward error involves solving a number of systems of linear equations A*x = b; the number is usually 4 or 5 and never more than 11. Each solution requires approximately $8n^2$ floating-point operations.
The real counterpart of this routine is ssyrfs/dsyrfs.

?herfsx
Uses extra precise iterative refinement to improve the solution to the system of linear equations with a Hermitian indefinite matrix A and provides error bounds and backward error estimates.
Syntax
Fortran 77:
call cherfsx( uplo, equed, n, nrhs, a, lda, af, ldaf, ipiv, s, b, ldb, x, ldx, rcond, berr, n_err_bnds, err_bnds_norm, err_bnds_comp, nparams, params, work, rwork, info )
call zherfsx( uplo, equed, n, nrhs, a, lda, af, ldaf, ipiv, s, b, ldb, x, ldx, rcond, berr, n_err_bnds, err_bnds_norm, err_bnds_comp, nparams, params, work, rwork, info )
C:
lapack_int LAPACKE_cherfsx( int matrix_order, char uplo, char equed, lapack_int n, lapack_int nrhs, const lapack_complex_float* a, lapack_int lda, const lapack_complex_float* af, lapack_int ldaf, const lapack_int* ipiv, const float* s, const lapack_complex_float* b, lapack_int ldb, lapack_complex_float* x, lapack_int ldx, float* rcond, float* berr, lapack_int n_err_bnds, float* err_bnds_norm, float* err_bnds_comp, lapack_int nparams, float* params );
lapack_int LAPACKE_zherfsx( int matrix_order, char uplo, char equed, lapack_int n, lapack_int nrhs, const lapack_complex_double* a, lapack_int lda, const lapack_complex_double* af, lapack_int ldaf, const lapack_int* ipiv, const double* s, const lapack_complex_double* b, lapack_int ldb, lapack_complex_double* x, lapack_int ldx, double* rcond, double* berr, lapack_int n_err_bnds, double* err_bnds_norm, double* err_bnds_comp, lapack_int nparams, double* params );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine improves the computed solution to a system of linear equations when the coefficient matrix is Hermitian indefinite, and provides error bounds and backward error estimates for the solution. In addition to a normwise error bound, the code provides a maximum componentwise error bound, if possible. See comments for err_bnds_norm and err_bnds_comp for details of the error bounds.
The original system of linear equations may have been equilibrated before calling this routine, as described by the parameters equed and s below.
In this case, the solution and error bounds returned are for the original unequilibrated system.

Input Parameters
The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo  CHARACTER*1. Must be 'U' or 'L'. Indicates whether the upper or lower triangular part of A is stored:
If uplo = 'U', the upper triangle of A is stored.
If uplo = 'L', the lower triangle of A is stored.
equed  CHARACTER*1. Must be 'N' or 'Y'. Specifies the form of equilibration that was done to A before calling this routine.
If equed = 'N', no equilibration was done.
If equed = 'Y', both row and column equilibration was done, that is, A has been replaced by diag(s)*A*diag(s). The right-hand side B has been changed accordingly.
n  INTEGER. The number of linear equations; the order of the matrix A; n ≥ 0.
nrhs  INTEGER. The number of right-hand sides; the number of columns of the matrices B and X; nrhs ≥ 0.
a, af, b, work  COMPLEX for cherfsx, DOUBLE COMPLEX for zherfsx.
Arrays: a(lda,*), af(ldaf,*), b(ldb,*), work(*).
The array a contains the Hermitian matrix A as specified by uplo. If uplo = 'U', the leading n-by-n upper triangular part of a contains the upper triangular part of the matrix A and the strictly lower triangular part of a is not referenced. If uplo = 'L', the leading n-by-n lower triangular part of a contains the lower triangular part of the matrix A and the strictly upper triangular part of a is not referenced. The second dimension of a must be at least max(1,n).
The array af contains the factored form of the matrix A, that is, the block diagonal matrix D and the multipliers used to obtain the factor U or L from the factorization A = U*D*U**H or A = L*D*L**H as computed by chetrf for cherfsx or zhetrf for zherfsx.
The array b contains the matrix B whose columns are the right-hand sides for the systems of equations. The second dimension of b must be at least max(1,nrhs).
work(*) is a workspace array. The dimension of work must be at least max(1,2*n).
lda  INTEGER. The leading dimension of a; lda ≥ max(1, n).
ldaf  INTEGER. The leading dimension of af; ldaf ≥ max(1, n).
ipiv  INTEGER. Array, DIMENSION at least max(1, n). Contains details of the interchanges and the block structure of D as determined by chetrf for cherfsx or zhetrf for zherfsx.
s  REAL for single precision flavors, DOUBLE PRECISION for double precision flavors. Array, DIMENSION (n). The array s contains the scale factors for A. If equed = 'N', s is not accessed. If equed = 'Y', each element of s must be positive. Each element of s should be a power of the radix to ensure a reliable solution and error estimates. Scaling by powers of the radix does not cause rounding errors unless the result underflows or overflows. Rounding errors during scaling lead to refining with a matrix that is not equivalent to the input matrix, producing error estimates that may not be reliable.
ldb  INTEGER. The leading dimension of the array b; ldb ≥ max(1, n).
x  COMPLEX for cherfsx, DOUBLE COMPLEX for zherfsx. Array, DIMENSION (ldx,*). The solution matrix X as computed by ?hetrs.
ldx  INTEGER. The leading dimension of the output array x; ldx ≥ max(1, n).
n_err_bnds  INTEGER. Number of error bounds to return for each right-hand side and each type (normwise or componentwise). See err_bnds_norm and err_bnds_comp descriptions in the Output Parameters section below.
nparams  INTEGER. Specifies the number of parameters set in params. If nparams ≤ 0, the params array is never referenced and default values are used.
params  REAL for single precision flavors, DOUBLE PRECISION for double precision flavors. Array, DIMENSION nparams. Specifies algorithm parameters.
If an entry is less than 0.0, that entry will be filled with the default value used for that parameter. Only positions up to nparams are accessed; defaults are used for higher-numbered parameters. If defaults are acceptable, you can pass nparams = 0, which prevents the source code from accessing the params argument.
params(la_linrx_itref_i = 1): Whether to perform iterative refinement or not. Default: 1.0 (for cherfsx), 1.0D+0 (for zherfsx).
=0.0  No refinement is performed and no error bounds are computed.
=1.0  Use the double-precision refinement algorithm, possibly with doubled-single computations if the compilation environment does not support DOUBLE PRECISION.
(Other values are reserved for future use.)
params(la_linrx_ithresh_i = 2): Maximum number of residual computations allowed for refinement.
Default: 10.
Aggressive: Set to 100 to permit convergence using approximate factorizations or factorizations other than LU. If the factorization uses a technique other than Gaussian elimination, the guarantees in err_bnds_norm and err_bnds_comp may no longer be trustworthy.
params(la_linrx_cwise_i = 3): Flag determining if the code will attempt to find a solution with a small componentwise relative error in the double-precision algorithm. Positive is true, 0.0 is false. Default: 1.0 (attempt componentwise convergence).
rwork  REAL for cherfsx, DOUBLE PRECISION for zherfsx. Workspace array, DIMENSION at least max(1, 3*n).

Output Parameters
x  COMPLEX for cherfsx, DOUBLE COMPLEX for zherfsx. The improved solution matrix X.
rcond  REAL for cherfsx, DOUBLE PRECISION for zherfsx. Reciprocal scaled condition number. An estimate of the reciprocal Skeel condition number of the matrix A after equilibration (if done). If rcond is less than the machine precision, in particular, if rcond = 0, the matrix is singular to working precision.
Note that the error may still be small even if this number is very small and the matrix appears ill-conditioned.
berr  REAL for cherfsx, DOUBLE PRECISION for zherfsx. Array, DIMENSION at least max(1, nrhs). Contains the componentwise relative backward error for each solution vector x(j), that is, the smallest relative change in any element of A or B that makes x(j) an exact solution.
err_bnds_norm  REAL for cherfsx, DOUBLE PRECISION for zherfsx. Array, DIMENSION (nrhs,n_err_bnds). For each right-hand side, contains information about various error bounds and condition numbers corresponding to the normwise relative error, which is defined as follows:
Normwise relative error in the i-th solution vector: $\max_j |x_{\mathrm{true}}(j,i) - x(j,i)| \,/\, \max_j |x(j,i)|$
The array is indexed by the type of error information as described below. There are currently up to three pieces of information returned.
The first index in err_bnds_norm(i,:) corresponds to the i-th right-hand side.
The second index in err_bnds_norm(:,err) contains the following three fields:
err=1  "Trust/don't trust" boolean. Trust the answer if the reciprocal condition number is not less than the threshold sqrt(n)*slamch('E') for cherfsx and sqrt(n)*dlamch('E') for zherfsx.
err=2  "Guaranteed" error bound. The estimated forward error, almost certainly within a factor of 10 of the true error so long as the next entry is greater than the threshold sqrt(n)*slamch('E') for cherfsx and sqrt(n)*dlamch('E') for zherfsx. This error bound should only be trusted if the previous boolean is true.
err=3  Reciprocal condition number. Estimated normwise reciprocal condition number. Compared with the threshold sqrt(n)*slamch('E') for cherfsx and sqrt(n)*dlamch('E') for zherfsx to determine if the error estimate is "guaranteed". These reciprocal condition numbers are 1/(norm(z^(-1),inf)*norm(z,inf)) for some appropriately scaled matrix Z. Let z=s*a, where s scales each row by a power of the radix so all absolute row sums of z are approximately 1.
err_bnds_comp  REAL for cherfsx, DOUBLE PRECISION for zherfsx. Array, DIMENSION (nrhs,n_err_bnds). For each right-hand side, contains information about various error bounds and condition numbers corresponding to the componentwise relative error, which is defined as follows:
Componentwise relative error in the i-th solution vector: $\max_j \left( |x_{\mathrm{true}}(j,i) - x(j,i)| \,/\, |x(j,i)| \right)$
The array is indexed by the right-hand side i, on which the componentwise relative error depends, and by the type of error information as described below. There are currently up to three pieces of information returned for each right-hand side. If componentwise accuracy is not requested (params(3) = 0.0), then err_bnds_comp is not accessed. If n_err_bnds < 3, then at most the first (:,n_err_bnds) entries are returned.
The first index in err_bnds_comp(i,:) corresponds to the i-th right-hand side.
The second index in err_bnds_comp(:,err) contains the following three fields:
err=1  "Trust/don't trust" boolean. Trust the answer if the reciprocal condition number is not less than the threshold sqrt(n)*slamch('E') for cherfsx and sqrt(n)*dlamch('E') for zherfsx.
err=2  "Guaranteed" error bound. The estimated forward error, almost certainly within a factor of 10 of the true error so long as the next entry is greater than the threshold sqrt(n)*slamch('E') for cherfsx and sqrt(n)*dlamch('E') for zherfsx. This error bound should only be trusted if the previous boolean is true.
err=3  Reciprocal condition number. Estimated componentwise reciprocal condition number. Compared with the threshold sqrt(n)*slamch('E') for cherfsx and sqrt(n)*dlamch('E') for zherfsx to determine if the error estimate is "guaranteed". These reciprocal condition numbers are 1/(norm(z^(-1),inf)*norm(z,inf)) for some appropriately scaled matrix Z. Let z=s*(a*diag(x)), where x is the solution for the current right-hand side and s scales each row of a*diag(x) by a power of the radix so all absolute row sums of z are approximately 1.
params  REAL for single precision flavors, DOUBLE PRECISION for double precision flavors. Output parameter only if the input contains erroneous values, namely, in params(1), params(2), params(3). In such a case, the corresponding elements of params are filled with default values on output.
info  INTEGER.
If info = 0, the execution is successful. The solution to every right-hand side is guaranteed.
If info = -i, the i-th parameter had an illegal value.
If 0 < info ≤ n: U(info,info) is exactly zero. The factorization has been completed, but the factor U is exactly singular, so the solution and error bounds could not be computed; rcond = 0 is returned.
If info = n+j: The solution corresponding to the j-th right-hand side is not guaranteed. The solutions corresponding to other right-hand sides k with k > j may not be guaranteed as well, but only the first such right-hand side is reported. If a small componentwise error is not requested (params(3) = 0.0), then the j-th right-hand side is the first with a normwise error bound that is not guaranteed (the smallest j such that err_bnds_norm(j,1) = 0.0). Otherwise, it is the first with either a normwise or a componentwise error bound that is not guaranteed (the smallest j such that err_bnds_norm(j,1) = 0.0 or err_bnds_comp(j,1) = 0.0). See the definition of err_bnds_norm(:,1) and err_bnds_comp(:,1). To get information about all of the right-hand sides, check err_bnds_norm or err_bnds_comp.

?sprfs
Refines the solution of a system of linear equations with a packed symmetric matrix and estimates the solution error.
Syntax
Fortran 77:
call ssprfs( uplo, n, nrhs, ap, afp, ipiv, b, ldb, x, ldx, ferr, berr, work, iwork, info )
call dsprfs( uplo, n, nrhs, ap, afp, ipiv, b, ldb, x, ldx, ferr, berr, work, iwork, info )
call csprfs( uplo, n, nrhs, ap, afp, ipiv, b, ldb, x, ldx, ferr, berr, work, rwork, info )
call zsprfs( uplo, n, nrhs, ap, afp, ipiv, b, ldb, x, ldx, ferr, berr, work, rwork, info )
Fortran 95:
call sprfs( ap, afp, ipiv, b, x [,uplo] [,ferr] [,berr] [,info] )
C:
lapack_int LAPACKE_ssprfs( int matrix_order, char uplo, lapack_int n, lapack_int nrhs, const float* ap, const float* afp, const lapack_int* ipiv, const float* b, lapack_int ldb, float* x, lapack_int ldx, float* ferr, float* berr );
lapack_int LAPACKE_dsprfs( int matrix_order, char uplo, lapack_int n, lapack_int nrhs, const double* ap, const double* afp, const lapack_int* ipiv, const double* b, lapack_int ldb, double* x, lapack_int ldx, double* ferr, double* berr );
lapack_int LAPACKE_csprfs( int matrix_order, char uplo, lapack_int n, lapack_int nrhs, const lapack_complex_float* ap, const lapack_complex_float* afp, const lapack_int* ipiv, const lapack_complex_float* b, lapack_int ldb, lapack_complex_float* x, lapack_int ldx, float* ferr, float* berr );
lapack_int LAPACKE_zsprfs( int matrix_order, char uplo, lapack_int n, lapack_int nrhs, const lapack_complex_double* ap, const lapack_complex_double* afp, const lapack_int* ipiv, const lapack_complex_double* b, lapack_int ldb, lapack_complex_double* x, lapack_int ldx, double* ferr, double* berr );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine performs an iterative refinement of the solution to a system of linear equations A*X = B with a packed symmetric matrix A, with multiple right-hand sides. For each computed solution vector x, the routine computes the component-wise backward error $\beta$.
This error is the smallest relative perturbation in elements of A and b such that x is the exact solution of the perturbed system: $|\delta a_{ij}| \le \beta |a_{ij}|$, $|\delta b_i| \le \beta |b_i|$ such that $(A + \delta A)x = (b + \delta b)$.
Finally, the routine estimates the component-wise forward error in the computed solution $\|x - x_e\|_\infty / \|x\|_\infty$ (here $x_e$ is the exact solution).
Before calling this routine:
• call the factorization routine ?sptrf
• call the solver routine ?sptrs.

Input Parameters
The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo  CHARACTER*1. Must be 'U' or 'L'.
If uplo = 'U', the upper triangle of A is stored.
If uplo = 'L', the lower triangle of A is stored.
n  INTEGER. The order of the matrix A; n ≥ 0.
nrhs  INTEGER. The number of right-hand sides; nrhs ≥ 0.
ap,afp,b,x,work  REAL for ssprfs, DOUBLE PRECISION for dsprfs, COMPLEX for csprfs, DOUBLE COMPLEX for zsprfs.
Arrays:
ap(*) contains the original packed matrix A, as supplied to ?sptrf.
afp(*) contains the factored packed matrix A, as returned by ?sptrf.
b(ldb,*) contains the right-hand side matrix B.
x(ldx,*) contains the solution matrix X.
work(*) is a workspace array.
The dimension of arrays ap and afp must be at least max(1, n(n+1)/2); the second dimension of b and x must be at least max(1, nrhs); the dimension of work must be at least max(1, 3*n) for real flavors and max(1, 2*n) for complex flavors.
ldb  INTEGER. The leading dimension of b; ldb ≥ max(1, n).
ldx  INTEGER. The leading dimension of x; ldx ≥ max(1, n).
ipiv  INTEGER. Array, DIMENSION at least max(1, n). The ipiv array, as returned by ?sptrf.
iwork  INTEGER. Workspace array, DIMENSION at least max(1, n).
rwork  REAL for csprfs, DOUBLE PRECISION for zsprfs. Workspace array, DIMENSION at least max(1, n).
Output Parameters
x  The refined solution matrix X.
ferr, berr  REAL for single precision flavors, DOUBLE PRECISION for double precision flavors. Arrays, DIMENSION at least max(1, nrhs). Contain the componentwise forward and backward errors, respectively, for each solution vector.
info  INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine sprfs interface are as follows:
ap  Holds the array A of size (n*(n+1)/2).
afp  Holds the array AF of size (n*(n+1)/2).
ipiv  Holds the vector of length n.
b  Holds the matrix B of size (n,nrhs).
x  Holds the matrix X of size (n,nrhs).
ferr  Holds the vector of length (nrhs).
berr  Holds the vector of length (nrhs).
uplo  Must be 'U' or 'L'. The default value is 'U'.

Application Notes
The bounds returned in ferr are not rigorous, but in practice they almost always overestimate the actual error.
For each right-hand side, computation of the backward error involves a minimum of $4n^2$ floating-point operations (for real flavors) or $16n^2$ operations (for complex flavors). In addition, each step of iterative refinement involves $6n^2$ operations (for real flavors) or $24n^2$ operations (for complex flavors); the number of iterations may range from 1 to 5.
Estimating the forward error involves solving a number of systems of linear equations A*x = b; the number of systems is usually 4 or 5 and never more than 11. Each solution requires approximately $2n^2$ floating-point operations for real flavors or $8n^2$ for complex flavors.

?hprfs
Refines the solution of a system of linear equations with a packed complex Hermitian matrix and estimates the solution error.
Syntax
Fortran 77:
call chprfs( uplo, n, nrhs, ap, afp, ipiv, b, ldb, x, ldx, ferr, berr, work, rwork, info )
call zhprfs( uplo, n, nrhs, ap, afp, ipiv, b, ldb, x, ldx, ferr, berr, work, rwork, info )
Fortran 95:
call hprfs( ap, afp, ipiv, b, x [,uplo] [,ferr] [,berr] [,info] )
C:
lapack_int LAPACKE_chprfs( int matrix_order, char uplo, lapack_int n, lapack_int nrhs, const lapack_complex_float* ap, const lapack_complex_float* afp, const lapack_int* ipiv, const lapack_complex_float* b, lapack_int ldb, lapack_complex_float* x, lapack_int ldx, float* ferr, float* berr );
lapack_int LAPACKE_zhprfs( int matrix_order, char uplo, lapack_int n, lapack_int nrhs, const lapack_complex_double* ap, const lapack_complex_double* afp, const lapack_int* ipiv, const lapack_complex_double* b, lapack_int ldb, lapack_complex_double* x, lapack_int ldx, double* ferr, double* berr );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine performs an iterative refinement of the solution to a system of linear equations A*X = B with a packed complex Hermitian matrix A, with multiple right-hand sides. For each computed solution vector x, the routine computes the component-wise backward error $\beta$. This error is the smallest relative perturbation in elements of A and b such that x is the exact solution of the perturbed system: $|\delta a_{ij}| \le \beta |a_{ij}|$, $|\delta b_i| \le \beta |b_i|$ such that $(A + \delta A)x = (b + \delta b)$.
Finally, the routine estimates the component-wise forward error in the computed solution $\|x - x_e\|_\infty / \|x\|_\infty$ (here $x_e$ is the exact solution).
Before calling this routine:
• call the factorization routine ?hptrf
• call the solver routine ?hptrs.

Input Parameters
The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above.
See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo  CHARACTER*1. Must be 'U' or 'L'.
If uplo = 'U', the upper triangle of A is stored.
If uplo = 'L', the lower triangle of A is stored.
n  INTEGER. The order of the matrix A; n ≥ 0.
nrhs  INTEGER. The number of right-hand sides; nrhs ≥ 0.
ap,afp,b,x,work  COMPLEX for chprfs, DOUBLE COMPLEX for zhprfs.
Arrays:
ap(*) contains the original packed matrix A, as supplied to ?hptrf.
afp(*) contains the factored packed matrix A, as returned by ?hptrf.
b(ldb,*) contains the right-hand side matrix B.
x(ldx,*) contains the solution matrix X.
work(*) is a workspace array.
The dimension of arrays ap and afp must be at least max(1, n(n+1)/2); the second dimension of b and x must be at least max(1, nrhs); the dimension of work must be at least max(1, 2*n).
ldb  INTEGER. The leading dimension of b; ldb ≥ max(1, n).
ldx  INTEGER. The leading dimension of x; ldx ≥ max(1, n).
ipiv  INTEGER. Array, DIMENSION at least max(1, n). The ipiv array, as returned by ?hptrf.
rwork  REAL for chprfs, DOUBLE PRECISION for zhprfs. Workspace array, DIMENSION at least max(1, n).

Output Parameters
x  The refined solution matrix X.
ferr, berr  REAL for chprfs, DOUBLE PRECISION for zhprfs. Arrays, DIMENSION at least max(1, nrhs). Contain the componentwise forward and backward errors, respectively, for each solution vector.
info  INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine hprfs interface are as follows:
ap  Holds the array A of size (n*(n+1)/2).
afp  Holds the array AF of size (n*(n+1)/2).
ipiv  Holds the vector of length n.
b  Holds the matrix B of size (n,nrhs).
x  Holds the matrix X of size (n,nrhs).
ferr  Holds the vector of length (nrhs).
berr  Holds the vector of length (nrhs).
uplo  Must be 'U' or 'L'. The default value is 'U'.

Application Notes
The bounds returned in ferr are not rigorous, but in practice they almost always overestimate the actual error.
For each right-hand side, computation of the backward error involves a minimum of $16n^2$ operations. In addition, each step of iterative refinement involves $24n^2$ operations; the number of iterations may range from 1 to 5.
Estimating the forward error involves solving a number of systems of linear equations A*x = b; the number is usually 4 or 5 and never more than 11. Each solution requires approximately $8n^2$ floating-point operations.
The real counterpart of this routine is ssprfs/dsprfs.

?trrfs
Estimates the error in the solution of a system of linear equations with a triangular matrix.

Syntax
Fortran 77:
call strrfs( uplo, trans, diag, n, nrhs, a, lda, b, ldb, x, ldx, ferr, berr, work, iwork, info )
call dtrrfs( uplo, trans, diag, n, nrhs, a, lda, b, ldb, x, ldx, ferr, berr, work, iwork, info )
call ctrrfs( uplo, trans, diag, n, nrhs, a, lda, b, ldb, x, ldx, ferr, berr, work, rwork, info )
call ztrrfs( uplo, trans, diag, n, nrhs, a, lda, b, ldb, x, ldx, ferr, berr, work, rwork, info )
Fortran 95:
call trrfs( a, b, x [,uplo] [,trans] [,diag] [,ferr] [,berr] [,info] )
C:
lapack_int LAPACKE_strrfs( int matrix_order, char uplo, char trans, char diag, lapack_int n, lapack_int nrhs, const float* a, lapack_int lda, const float* b, lapack_int ldb, const float* x, lapack_int ldx, float* ferr, float* berr );
lapack_int LAPACKE_dtrrfs( int matrix_order, char uplo, char trans, char diag, lapack_int n, lapack_int nrhs, const double* a, lapack_int lda, const double* b, lapack_int ldb, const double* x, lapack_int ldx, double* ferr, double* berr );
lapack_int LAPACKE_ctrrfs( int matrix_order, char uplo, char trans, char diag, lapack_int n, lapack_int nrhs, const lapack_complex_float* a, lapack_int lda, const lapack_complex_float* b, lapack_int ldb, const lapack_complex_float* x, lapack_int ldx, float* ferr, float* berr );
lapack_int LAPACKE_ztrrfs( int matrix_order, char uplo, char trans, char diag, lapack_int n, lapack_int nrhs, const lapack_complex_double* a, lapack_int lda, const lapack_complex_double* b, lapack_int ldb, const lapack_complex_double* x, lapack_int ldx, double* ferr, double* berr );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine estimates the errors in the solution to a system of linear equations A*X = B or A^T*X = B or A^H*X = B with a triangular matrix A, with multiple right-hand sides. For each computed solution vector x, the routine computes the component-wise backward error $\beta$. This error is the smallest relative perturbation in elements of A and b such that x is the exact solution of the perturbed system: $|\delta a_{ij}| \le \beta |a_{ij}|$, $|\delta b_i| \le \beta |b_i|$ such that $(A + \delta A)x = (b + \delta b)$.
The routine also estimates the component-wise forward error in the computed solution $\|x - x_e\|_\infty / \|x\|_\infty$ (here $x_e$ is the exact solution).
Before calling this routine, call the solver routine ?trtrs.

Input Parameters
The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo  CHARACTER*1. Must be 'U' or 'L'. Indicates whether A is upper or lower triangular:
If uplo = 'U', then A is upper triangular.
If uplo = 'L', then A is lower triangular.
trans  CHARACTER*1. Must be 'N' or 'T' or 'C'. Indicates the form of the equations:
If trans = 'N', the system has the form A*X = B.
If trans = 'T', the system has the form A^T*X = B.
If trans = 'C', the system has the form A^H*X = B.
diag  CHARACTER*1.
Must be 'N' or 'U'.
If diag = 'N', then A is not a unit triangular matrix.
If diag = 'U', then A is unit triangular: diagonal elements of A are assumed to be 1 and not referenced in the array a.
n  INTEGER. The order of the matrix A; n ≥ 0.
nrhs  INTEGER. The number of right-hand sides; nrhs ≥ 0.
a, b, x, work  REAL for strrfs, DOUBLE PRECISION for dtrrfs, COMPLEX for ctrrfs, DOUBLE COMPLEX for ztrrfs.
Arrays:
a(lda,*) contains the upper or lower triangular matrix A, as specified by uplo.
b(ldb,*) contains the right-hand side matrix B.
x(ldx,*) contains the solution matrix X.
work(*) is a workspace array.
The second dimension of a must be at least max(1,n); the second dimension of b and x must be at least max(1,nrhs); the dimension of work must be at least max(1,3*n) for real flavors and max(1,2*n) for complex flavors.
lda  INTEGER. The leading dimension of a; lda ≥ max(1, n).
ldb  INTEGER. The leading dimension of b; ldb ≥ max(1, n).
ldx  INTEGER. The leading dimension of x; ldx ≥ max(1, n).
iwork  INTEGER. Workspace array, DIMENSION at least max(1, n).
rwork  REAL for ctrrfs, DOUBLE PRECISION for ztrrfs. Workspace array, DIMENSION at least max(1, n).

Output Parameters
ferr, berr  REAL for single precision flavors, DOUBLE PRECISION for double precision flavors. Arrays, DIMENSION at least max(1, nrhs). Contain the componentwise forward and backward errors, respectively, for each solution vector.
info  INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine trrfs interface are as follows:
a  Holds the matrix A of size (n,n).
b  Holds the matrix B of size (n,nrhs).
x  Holds the matrix X of size (n,nrhs).
ferr Holds the vector of length (nrhs).
berr Holds the vector of length (nrhs).
uplo Must be 'U' or 'L'. The default value is 'U'.
trans Must be 'N', 'C', or 'T'. The default value is 'N'.
diag Must be 'N' or 'U'. The default value is 'N'.

Application Notes
The bounds returned in ferr are not rigorous, but in practice they almost always overestimate the actual error.
A call to this routine involves, for each right-hand side, solving a number of systems of linear equations A*x = b; the number of systems is usually 4 or 5 and never more than 11. Each solution requires approximately n^2 floating-point operations for real flavors or 4n^2 for complex flavors.

?tprfs
Estimates the error in the solution of a system of linear equations with a packed triangular matrix.

Syntax
Fortran 77:
call stprfs( uplo, trans, diag, n, nrhs, ap, b, ldb, x, ldx, ferr, berr, work, iwork, info )
call dtprfs( uplo, trans, diag, n, nrhs, ap, b, ldb, x, ldx, ferr, berr, work, iwork, info )
call ctprfs( uplo, trans, diag, n, nrhs, ap, b, ldb, x, ldx, ferr, berr, work, rwork, info )
call ztprfs( uplo, trans, diag, n, nrhs, ap, b, ldb, x, ldx, ferr, berr, work, rwork, info )
Fortran 95:
call tprfs( ap, b, x [,uplo] [,trans] [,diag] [,ferr] [,berr] [,info] )
C:
lapack_int LAPACKE_stprfs( int matrix_order, char uplo, char trans, char diag, lapack_int n, lapack_int nrhs, const float* ap, const float* b, lapack_int ldb, const float* x, lapack_int ldx, float* ferr, float* berr );
lapack_int LAPACKE_dtprfs( int matrix_order, char uplo, char trans, char diag, lapack_int n, lapack_int nrhs, const double* ap, const double* b, lapack_int ldb, const double* x, lapack_int ldx, double* ferr, double* berr );
lapack_int LAPACKE_ctprfs( int matrix_order, char uplo, char trans, char diag, lapack_int n, lapack_int nrhs, const lapack_complex_float* ap, const lapack_complex_float* b, lapack_int ldb, const lapack_complex_float* x, lapack_int ldx, float* ferr, float*
berr );
lapack_int LAPACKE_ztprfs( int matrix_order, char uplo, char trans, char diag, lapack_int n, lapack_int nrhs, const lapack_complex_double* ap, const lapack_complex_double* b, lapack_int ldb, const lapack_complex_double* x, lapack_int ldx, double* ferr, double* berr );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine estimates the errors in the solution to a system of linear equations A*X = B or A^T*X = B or A^H*X = B with a packed triangular matrix A, with multiple right-hand sides. For each computed solution vector x, the routine computes the component-wise backward error β. This error is the smallest relative perturbation in elements of A and b such that x is the exact solution of the perturbed system:
|δa_ij| ≤ β|a_ij|, |δb_i| ≤ β|b_i| such that (A + δA)x = (b + δb).
The routine also estimates the component-wise forward error in the computed solution ||x - x_e||∞/||x||∞ (here x_e is the exact solution).
Before calling this routine, call the solver routine ?tptrs.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'. Indicates whether A is upper or lower triangular: If uplo = 'U', then A is upper triangular. If uplo = 'L', then A is lower triangular.
trans CHARACTER*1. Must be 'N' or 'T' or 'C'. Indicates the form of the equations: If trans = 'N', the system has the form A*X = B. If trans = 'T', the system has the form A^T*X = B. If trans = 'C', the system has the form A^H*X = B.
diag CHARACTER*1. Must be 'N' or 'U'. If diag = 'N', A is not a unit triangular matrix. If diag = 'U', A is unit triangular: diagonal elements of A are assumed to be 1 and not referenced in the array ap.
n INTEGER.
The order of the matrix A; n ≥ 0.
nrhs INTEGER. The number of right-hand sides; nrhs ≥ 0.
ap, b, x, work REAL for stprfs; DOUBLE PRECISION for dtprfs; COMPLEX for ctprfs; DOUBLE COMPLEX for ztprfs. Arrays: ap(*) contains the upper or lower triangular matrix A in packed storage, as specified by uplo. b(ldb,*) contains the right-hand side matrix B. x(ldx,*) contains the solution matrix X. work(*) is a workspace array. The dimension of ap must be at least max(1,n(n+1)/2); the second dimension of b and x must be at least max(1,nrhs); the dimension of work must be at least max(1,3*n) for real flavors and max(1,2*n) for complex flavors.
ldb INTEGER. The leading dimension of b; ldb ≥ max(1, n).
ldx INTEGER. The leading dimension of x; ldx ≥ max(1, n).
iwork INTEGER. Workspace array, DIMENSION at least max(1, n).
rwork REAL for ctprfs; DOUBLE PRECISION for ztprfs. Workspace array, DIMENSION at least max(1, n).

Output Parameters
ferr, berr REAL for single precision flavors; DOUBLE PRECISION for double precision flavors. Arrays, DIMENSION at least max(1, nrhs). Contain the component-wise forward and backward errors, respectively, for each solution vector.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions. Specific details for the routine tprfs interface are as follows:
ap Holds the array A of size (n*(n+1)/2).
b Holds the matrix B of size (n,nrhs).
x Holds the matrix X of size (n,nrhs).
ferr Holds the vector of length (nrhs).
berr Holds the vector of length (nrhs).
uplo Must be 'U' or 'L'. The default value is 'U'.
trans Must be 'N', 'C', or 'T'. The default value is 'N'.
diag Must be 'N' or 'U'. The default value is 'N'.
Application Notes
The bounds returned in ferr are not rigorous, but in practice they almost always overestimate the actual error.
A call to this routine involves, for each right-hand side, solving a number of systems of linear equations A*x = b; the number of systems is usually 4 or 5 and never more than 11. Each solution requires approximately n^2 floating-point operations for real flavors or 4n^2 for complex flavors.

?tbrfs
Estimates the error in the solution of a system of linear equations with a triangular band matrix.

Syntax
Fortran 77:
call stbrfs( uplo, trans, diag, n, kd, nrhs, ab, ldab, b, ldb, x, ldx, ferr, berr, work, iwork, info )
call dtbrfs( uplo, trans, diag, n, kd, nrhs, ab, ldab, b, ldb, x, ldx, ferr, berr, work, iwork, info )
call ctbrfs( uplo, trans, diag, n, kd, nrhs, ab, ldab, b, ldb, x, ldx, ferr, berr, work, rwork, info )
call ztbrfs( uplo, trans, diag, n, kd, nrhs, ab, ldab, b, ldb, x, ldx, ferr, berr, work, rwork, info )
Fortran 95:
call tbrfs( ab, b, x [,uplo] [,trans] [,diag] [,ferr] [,berr] [,info] )
C:
lapack_int LAPACKE_stbrfs( int matrix_order, char uplo, char trans, char diag, lapack_int n, lapack_int kd, lapack_int nrhs, const float* ab, lapack_int ldab, const float* b, lapack_int ldb, const float* x, lapack_int ldx, float* ferr, float* berr );
lapack_int LAPACKE_dtbrfs( int matrix_order, char uplo, char trans, char diag, lapack_int n, lapack_int kd, lapack_int nrhs, const double* ab, lapack_int ldab, const double* b, lapack_int ldb, const double* x, lapack_int ldx, double* ferr, double* berr );
lapack_int LAPACKE_ctbrfs( int matrix_order, char uplo, char trans, char diag, lapack_int n, lapack_int kd, lapack_int nrhs, const lapack_complex_float* ab, lapack_int ldab, const lapack_complex_float* b, lapack_int ldb, const lapack_complex_float* x, lapack_int ldx, float* ferr, float* berr );
lapack_int LAPACKE_ztbrfs( int matrix_order, char uplo, char trans, char diag, lapack_int n, lapack_int kd,
lapack_int nrhs, const lapack_complex_double* ab, lapack_int ldab, const lapack_complex_double* b, lapack_int ldb, const lapack_complex_double* x, lapack_int ldx, double* ferr, double* berr );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine estimates the errors in the solution to a system of linear equations A*X = B or A^T*X = B or A^H*X = B with a triangular band matrix A, with multiple right-hand sides. For each computed solution vector x, the routine computes the component-wise backward error β. This error is the smallest relative perturbation in elements of A and b such that x is the exact solution of the perturbed system:
|δa_ij| ≤ β|a_ij|, |δb_i| ≤ β|b_i| such that (A + δA)x = (b + δb).
The routine also estimates the component-wise forward error in the computed solution ||x - x_e||∞/||x||∞ (here x_e is the exact solution).
Before calling this routine, call the solver routine ?tbtrs.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'. Indicates whether A is upper or lower triangular: If uplo = 'U', then A is upper triangular. If uplo = 'L', then A is lower triangular.
trans CHARACTER*1. Must be 'N' or 'T' or 'C'. Indicates the form of the equations: If trans = 'N', the system has the form A*X = B. If trans = 'T', the system has the form A^T*X = B. If trans = 'C', the system has the form A^H*X = B.
diag CHARACTER*1. Must be 'N' or 'U'. If diag = 'N', A is not a unit triangular matrix. If diag = 'U', A is unit triangular: diagonal elements of A are assumed to be 1 and not referenced in the array ab.
n INTEGER. The order of the matrix A; n ≥ 0.
kd INTEGER. The number of super-diagonals or sub-diagonals in the matrix A; kd ≥ 0.
nrhs INTEGER.
The number of right-hand sides; nrhs ≥ 0.
ab, b, x, work REAL for stbrfs; DOUBLE PRECISION for dtbrfs; COMPLEX for ctbrfs; DOUBLE COMPLEX for ztbrfs. Arrays: ab(ldab,*) contains the upper or lower triangular matrix A, as specified by uplo, in band storage format. b(ldb,*) contains the right-hand side matrix B. x(ldx,*) contains the solution matrix X. work(*) is a workspace array. The second dimension of ab must be at least max(1,n); the second dimension of b and x must be at least max(1,nrhs). The dimension of work must be at least max(1,3*n) for real flavors and max(1,2*n) for complex flavors.
ldab INTEGER. The leading dimension of the array ab; ldab ≥ kd + 1.
ldb INTEGER. The leading dimension of b; ldb ≥ max(1, n).
ldx INTEGER. The leading dimension of x; ldx ≥ max(1, n).
iwork INTEGER. Workspace array, DIMENSION at least max(1, n).
rwork REAL for ctbrfs; DOUBLE PRECISION for ztbrfs. Workspace array, DIMENSION at least max(1, n).

Output Parameters
ferr, berr REAL for single precision flavors; DOUBLE PRECISION for double precision flavors. Arrays, DIMENSION at least max(1, nrhs). Contain the component-wise forward and backward errors, respectively, for each solution vector.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions. Specific details for the routine tbrfs interface are as follows:
ab Holds the array A of size (kd+1,n).
b Holds the matrix B of size (n,nrhs).
x Holds the matrix X of size (n,nrhs).
ferr Holds the vector of length (nrhs).
berr Holds the vector of length (nrhs).
uplo Must be 'U' or 'L'. The default value is 'U'.
trans Must be 'N', 'C', or 'T'. The default value is 'N'.
diag Must be 'N' or 'U'. The default value is 'N'.

Application Notes
The bounds returned in ferr are not rigorous, but in practice they almost always overestimate the actual error.
A call to this routine involves, for each right-hand side, solving a number of systems of linear equations A*x = b; the number of systems is usually 4 or 5 and never more than 11. Each solution requires approximately 2n*kd floating-point operations for real flavors or 8n*kd operations for complex flavors.

Routines for Matrix Inversion
It is seldom necessary to compute an explicit inverse of a matrix. In particular, do not attempt to solve a system of equations Ax = b by first computing A^-1 and then forming the matrix-vector product x = A^-1*b. Call a solver routine instead (see Routines for Solving Systems of Linear Equations); this is more efficient and more accurate.
However, matrix inversion routines are provided for the rare occasions when an explicit inverse matrix is needed.

?getri
Computes the inverse of an LU-factored general matrix.

Syntax
Fortran 77:
call sgetri( n, a, lda, ipiv, work, lwork, info )
call dgetri( n, a, lda, ipiv, work, lwork, info )
call cgetri( n, a, lda, ipiv, work, lwork, info )
call zgetri( n, a, lda, ipiv, work, lwork, info )
Fortran 95:
call getri( a, ipiv [,info] )
C:
lapack_int LAPACKE_<?>getri( int matrix_order, lapack_int n, <datatype>* a, lapack_int lda, const lapack_int* ipiv );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine computes the inverse inv(A) of a general matrix A. Before calling this routine, call ?getrf to factorize A.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
n INTEGER. The order of the matrix A; n ≥ 0.
a, work REAL for sgetri; DOUBLE PRECISION for dgetri; COMPLEX for cgetri; DOUBLE COMPLEX for zgetri. Arrays: a(lda,*), work(*). a(lda,*) contains the factorization of the matrix A, as returned by ?getrf: A = P*L*U. The second dimension of a must be at least max(1,n). work(*) is a workspace array of dimension at least max(1,lwork).
lda INTEGER. The leading dimension of a; lda ≥ max(1, n).
ipiv INTEGER. Array, DIMENSION at least max(1, n). The ipiv array, as returned by ?getrf.
lwork INTEGER. The size of the work array; lwork ≥ n. If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla. See Application Notes below for the suggested value of lwork.

Output Parameters
a Overwritten by the n-by-n matrix inv(A).
work(1) If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value. If info = i, the i-th diagonal element of the factor U is zero, U is singular, and the inversion could not be completed.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions. Specific details for the routine getri interface are as follows:
a Holds the matrix A of size (n,n).
ipiv Holds the vector of length n.

Application Notes
For better performance, try using lwork = n*blocksize, where blocksize is a machine-dependent value (typically, 16 to 64) required for optimum performance of the blocked algorithm.
If you are in doubt how much workspace to supply, use a generous value of lwork for the first run or set lwork = -1.
If you choose the first option and set lwork to any admissible size (no less than the minimal value described), the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the corresponding array work on exit. Use this value (work(1)) for subsequent runs.
If you set lwork = -1, the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work). This operation is called a workspace query.
Note that if you set lwork to less than the minimal required value and not -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.
The computed inverse X satisfies the following error bound:
|XA - I| ≤ c(n)ε|X|P|L||U|,
where c(n) is a modest linear function of n; ε is the machine precision; I denotes the identity matrix; P, L, and U are the factors of the matrix factorization A = P*L*U.
The total number of floating-point operations is approximately (4/3)n^3 for real flavors and (16/3)n^3 for complex flavors.

?potri
Computes the inverse of a symmetric (Hermitian) positive-definite matrix.

Syntax
Fortran 77:
call spotri( uplo, n, a, lda, info )
call dpotri( uplo, n, a, lda, info )
call cpotri( uplo, n, a, lda, info )
call zpotri( uplo, n, a, lda, info )
Fortran 95:
call potri( a [,uplo] [,info] )
C:
lapack_int LAPACKE_<?>potri( int matrix_order, char uplo, lapack_int n, <datatype>* a, lapack_int lda );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine computes the inverse inv(A) of a symmetric positive definite or, for complex flavors, Hermitian positive-definite matrix A. Before calling this routine, call ?potrf to factorize A.
Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'. Indicates whether A is upper or lower triangular: If uplo = 'U', then A is upper triangular. If uplo = 'L', then A is lower triangular.
n INTEGER. The order of the matrix A; n ≥ 0.
a REAL for spotri; DOUBLE PRECISION for dpotri; COMPLEX for cpotri; DOUBLE COMPLEX for zpotri. Array a(lda,*). Contains the factorization of the matrix A, as returned by ?potrf. The second dimension of a must be at least max(1, n).
lda INTEGER. The leading dimension of a; lda ≥ max(1, n).

Output Parameters
a Overwritten by the n-by-n matrix inv(A).
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value. If info = i, the i-th diagonal element of the Cholesky factor (and therefore the factor itself) is zero, and the inversion could not be completed.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions. Specific details for the routine potri interface are as follows:
a Holds the matrix A of size (n,n).
uplo Must be 'U' or 'L'. The default value is 'U'.

Application Notes
The computed inverse X satisfies the following error bounds:
||XA - I||_2 ≤ c(n)εκ_2(A), ||AX - I||_2 ≤ c(n)εκ_2(A),
where c(n) is a modest linear function of n, and ε is the machine precision; I denotes the identity matrix.
The 2-norm ||A||_2 of a matrix A is defined by ||A||_2 = max_{x·x=1} (Ax·Ax)^(1/2), and the condition number κ_2(A) is defined by κ_2(A) = ||A||_2 ||A^-1||_2.
The total number of floating-point operations is approximately (2/3)n^3 for real flavors and (8/3)n^3 for complex flavors.

?pftri
Computes the inverse of a symmetric (Hermitian) positive-definite matrix in RFP format using the Cholesky factorization.

Syntax
Fortran 77:
call spftri( transr, uplo, n, a, info )
call dpftri( transr, uplo, n, a, info )
call cpftri( transr, uplo, n, a, info )
call zpftri( transr, uplo, n, a, info )
C:
lapack_int LAPACKE_<?>pftri( int matrix_order, char transr, char uplo, lapack_int n, <datatype>* a );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine computes the inverse inv(A) of a symmetric positive definite or, for complex data, Hermitian positive-definite matrix A using the Cholesky factorization:
A = U^T*U for real data, A = U^H*U for complex data if uplo = 'U'
A = L*L^T for real data, A = L*L^H for complex data if uplo = 'L'
Before calling this routine, call ?pftrf to factorize A.
The matrix A is in the Rectangular Full Packed (RFP) format. For the description of the RFP format, see Matrix Storage Schemes.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
transr CHARACTER*1. Must be 'N', 'T' (for real data) or 'C' (for complex data). If transr = 'N', the Normal transr of RFP A is stored. If transr = 'T', the Transpose transr of RFP A is stored. If transr = 'C', the Conjugate-Transpose transr of RFP A is stored.
uplo CHARACTER*1. Must be 'U' or 'L'. Indicates whether the upper or lower triangular part of the RFP matrix A is stored: If uplo = 'U', the array a stores the upper triangular part of the matrix A. If uplo = 'L', the array a stores the lower triangular part of the matrix A.
n INTEGER.
The order of the matrix A; n ≥ 0.
a REAL for spftri; DOUBLE PRECISION for dpftri; COMPLEX for cpftri; DOUBLE COMPLEX for zpftri. Array, DIMENSION (n*(n+1)/2). The array a contains the matrix A in the RFP format.

Output Parameters
a The symmetric/Hermitian inverse of the original matrix in the same storage format.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value. If info = i, the (i,i) element of the factor U or L is zero, and the inverse could not be computed.

?pptri
Computes the inverse of a packed symmetric (Hermitian) positive-definite matrix.

Syntax
Fortran 77:
call spptri( uplo, n, ap, info )
call dpptri( uplo, n, ap, info )
call cpptri( uplo, n, ap, info )
call zpptri( uplo, n, ap, info )
Fortran 95:
call pptri( ap [,uplo] [,info] )
C:
lapack_int LAPACKE_<?>pptri( int matrix_order, char uplo, lapack_int n, <datatype>* ap );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine computes the inverse inv(A) of a symmetric positive definite or, for complex flavors, Hermitian positive-definite matrix A in packed form. Before calling this routine, call ?pptrf to factorize A.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'. Indicates whether the upper or lower triangular factor is stored in ap: If uplo = 'U', then the upper triangular factor is stored. If uplo = 'L', then the lower triangular factor is stored.
n INTEGER. The order of the matrix A; n ≥ 0.
ap REAL for spptri; DOUBLE PRECISION for dpptri; COMPLEX for cpptri; DOUBLE COMPLEX for zpptri. Array, DIMENSION at least max(1, n(n+1)/2).
Contains the factorization of the packed matrix A, as returned by ?pptrf. The dimension of ap must be at least max(1,n(n+1)/2).

Output Parameters
ap Overwritten by the packed n-by-n matrix inv(A).
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value. If info = i, the i-th diagonal element of the Cholesky factor (and therefore the factor itself) is zero, and the inversion could not be completed.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions. Specific details for the routine pptri interface are as follows:
ap Holds the array A of size (n*(n+1)/2).
uplo Must be 'U' or 'L'. The default value is 'U'.

Application Notes
The computed inverse X satisfies the following error bounds:
||XA - I||_2 ≤ c(n)εκ_2(A), ||AX - I||_2 ≤ c(n)εκ_2(A),
where c(n) is a modest linear function of n, and ε is the machine precision; I denotes the identity matrix.
The 2-norm ||A||_2 of a matrix A is defined by ||A||_2 = max_{x·x=1} (Ax·Ax)^(1/2), and the condition number κ_2(A) is defined by κ_2(A) = ||A||_2 ||A^-1||_2.
The total number of floating-point operations is approximately (2/3)n^3 for real flavors and (8/3)n^3 for complex flavors.

?sytri
Computes the inverse of a symmetric matrix.
Syntax
Fortran 77:
call ssytri( uplo, n, a, lda, ipiv, work, info )
call dsytri( uplo, n, a, lda, ipiv, work, info )
call csytri( uplo, n, a, lda, ipiv, work, info )
call zsytri( uplo, n, a, lda, ipiv, work, info )
Fortran 95:
call sytri( a, ipiv [,uplo] [,info] )
C:
lapack_int LAPACKE_<?>sytri( int matrix_order, char uplo, lapack_int n, <datatype>* a, lapack_int lda, const lapack_int* ipiv );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine computes the inverse inv(A) of a symmetric matrix A. Before calling this routine, call ?sytrf to factorize A.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'. Indicates how the input matrix A has been factored: If uplo = 'U', the array a stores the Bunch-Kaufman factorization A = P*U*D*U^T*P^T. If uplo = 'L', the array a stores the Bunch-Kaufman factorization A = P*L*D*L^T*P^T.
n INTEGER. The order of the matrix A; n ≥ 0.
a, work REAL for ssytri; DOUBLE PRECISION for dsytri; COMPLEX for csytri; DOUBLE COMPLEX for zsytri. Arrays: a(lda,*) contains the factorization of the matrix A, as returned by ?sytrf. The second dimension of a must be at least max(1,n). work(*) is a workspace array. The dimension of work must be at least max(1,2*n).
lda INTEGER. The leading dimension of a; lda ≥ max(1, n).
ipiv INTEGER. Array, DIMENSION at least max(1, n). The ipiv array, as returned by ?sytrf.

Output Parameters
a Overwritten by the n-by-n matrix inv(A).
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value. If info = i, the i-th diagonal element of D is zero, D is singular, and the inversion could not be completed.
Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions. Specific details for the routine sytri interface are as follows:
a Holds the matrix A of size (n,n).
ipiv Holds the vector of length n.
uplo Must be 'U' or 'L'. The default value is 'U'.

Application Notes
The computed inverse X satisfies the following error bounds:
|D*U^T*P^T*X*P*U - I| ≤ c(n)ε(|D||U^T|P^T|X|P|U| + |D||D^-1|)
for uplo = 'U', and
|D*L^T*P^T*X*P*L - I| ≤ c(n)ε(|D||L^T|P^T|X|P|L| + |D||D^-1|)
for uplo = 'L'. Here c(n) is a modest linear function of n, and ε is the machine precision; I denotes the identity matrix.
The total number of floating-point operations is approximately (2/3)n^3 for real flavors and (8/3)n^3 for complex flavors.

?hetri
Computes the inverse of a complex Hermitian matrix.

Syntax
Fortran 77:
call chetri( uplo, n, a, lda, ipiv, work, info )
call zhetri( uplo, n, a, lda, ipiv, work, info )
Fortran 95:
call hetri( a, ipiv [,uplo] [,info] )
C:
lapack_int LAPACKE_<?>hetri( int matrix_order, char uplo, lapack_int n, <datatype>* a, lapack_int lda, const lapack_int* ipiv );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine computes the inverse inv(A) of a complex Hermitian matrix A. Before calling this routine, call ?hetrf to factorize A.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'. Indicates how the input matrix A has been factored: If uplo = 'U', the array a stores the Bunch-Kaufman factorization A = P*U*D*U^H*P^T.
If uplo = 'L', the array a stores the Bunch-Kaufman factorization A = P*L*D*L^H*P^T.
n INTEGER. The order of the matrix A; n ≥ 0.
a, work COMPLEX for chetri; DOUBLE COMPLEX for zhetri. Arrays: a(lda,*) contains the factorization of the matrix A, as returned by ?hetrf. The second dimension of a must be at least max(1,n). work(*) is a workspace array. The dimension of work must be at least max(1,n).
lda INTEGER. The leading dimension of a; lda ≥ max(1, n).
ipiv INTEGER. Array, DIMENSION at least max(1, n). The ipiv array, as returned by ?hetrf.

Output Parameters
a Overwritten by the n-by-n matrix inv(A).
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value. If info = i, the i-th diagonal element of D is zero, D is singular, and the inversion could not be completed.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions. Specific details for the routine hetri interface are as follows:
a Holds the matrix A of size (n,n).
ipiv Holds the vector of length n.
uplo Must be 'U' or 'L'. The default value is 'U'.

Application Notes
The computed inverse X satisfies the following error bounds:
|D*U^H*P^T*X*P*U - I| ≤ c(n)ε(|D||U^H|P^T|X|P|U| + |D||D^-1|)
for uplo = 'U', and
|D*L^H*P^T*X*P*L - I| ≤ c(n)ε(|D||L^H|P^T|X|P|L| + |D||D^-1|)
for uplo = 'L'. Here c(n) is a modest linear function of n, and ε is the machine precision; I denotes the identity matrix.
The total number of floating-point operations is approximately (2/3)n^3 for real flavors and (8/3)n^3 for complex flavors.
The real counterpart of this routine is ?sytri.

?sytri2
Computes the inverse of a symmetric indefinite matrix through setting the leading dimension of the workspace and calling ?sytri2x.
Syntax
Fortran 77:
call ssytri2( uplo, n, a, lda, ipiv, work, lwork, info )
call dsytri2( uplo, n, a, lda, ipiv, work, lwork, info )
call csytri2( uplo, n, a, lda, ipiv, work, lwork, info )
call zsytri2( uplo, n, a, lda, ipiv, work, lwork, info )
Fortran 95:
call sytri2( a, ipiv [,uplo] [,info] )
C:
lapack_int LAPACKE_<?>sytri2( int matrix_order, char uplo, lapack_int n, <datatype>* a, lapack_int lda, const lapack_int* ipiv );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine computes the inverse inv(A) of a symmetric indefinite matrix A using the factorization A = U*D*U^T or A = L*D*L^T computed by ?sytrf.
The ?sytri2 routine sets the leading dimension of the workspace before calling ?sytri2x that actually computes the inverse.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'. Indicates how the input matrix A has been factored: If uplo = 'U', the array a stores the factorization A = U*D*U^T. If uplo = 'L', the array a stores the factorization A = L*D*L^T.
n INTEGER. The order of the matrix A; n ≥ 0.
a, work REAL for ssytri2; DOUBLE PRECISION for dsytri2; COMPLEX for csytri2; DOUBLE COMPLEX for zsytri2. Arrays: a(lda,*) contains the block diagonal matrix D and the multipliers used to obtain the factor U or L as returned by ?sytrf. The second dimension of a must be at least max(1,n). work is a workspace array of dimension (n+nb+1)*(nb+3).
lda INTEGER. The leading dimension of a; lda ≥ max(1, n).
ipiv INTEGER. Array, DIMENSION at least max(1, n). Details of the interchanges and the block structure of D as returned by ?sytrf.
lwork INTEGER. The dimension of the work array.
lwork ≥ (n+nb+1)*(nb+3), where nb is the block size parameter as returned by ?sytrf. If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla. Output Parameters a If info = 0, the symmetric inverse of the original matrix. If uplo = 'U', the upper triangular part of the inverse is formed and the part of A below the diagonal is not referenced. If uplo = 'L', the lower triangular part of the inverse is formed and the part of A above the diagonal is not referenced. info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value. If info = i, D(i,i) = 0; D is singular and its inversion could not be computed. Fortran 95 Interface Notes Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions. Specific details for the routine sytri2 interface are as follows: a Holds the matrix A of size (n,n). ipiv Holds the vector of length n. uplo Indicates how the matrix A has been factored. Must be 'U' or 'L'. See Also ?sytrf ?sytri2x ?hetri2 Computes the inverse of a Hermitian indefinite matrix through setting the leading dimension of the workspace and calling ?hetri2x.
Syntax Fortran 77: call chetri2( uplo, n, a, lda, ipiv, work, lwork, info ) call zhetri2( uplo, n, a, lda, ipiv, work, lwork, info ) Fortran 95: call hetri2( a,ipiv[,uplo][,info] ) C: lapack_int LAPACKE_<?>hetri2( int matrix_order, char uplo, lapack_int n, <datatype>* a, lapack_int lda, const lapack_int* ipiv ); Include Files • Fortran: mkl_lapack.fi and mkl_lapack.h • Fortran 95: lapack.f90 • C: mkl_lapacke.h Description The routine computes the inverse inv(A) of a Hermitian indefinite matrix A using the factorization A = U*D*U^H or A = L*D*L^H computed by ?hetrf. The ?hetri2 routine sets the leading dimension of the workspace before calling ?hetri2x, which actually computes the inverse. Input Parameters The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions. uplo CHARACTER*1. Must be 'U' or 'L'. Indicates how the input matrix A has been factored: If uplo = 'U', the array a stores the factorization A = U*D*U^H. If uplo = 'L', the array a stores the factorization A = L*D*L^H. n INTEGER. The order of the matrix A; n ≥ 0. a, work COMPLEX for chetri2 DOUBLE COMPLEX for zhetri2 Arrays: a(lda,*) contains the block diagonal matrix D and the multipliers used to obtain the factor U or L as returned by ?hetrf. The second dimension of a must be at least max(1,n). work is a workspace array of dimension (n+nb+1)*(nb+3). lda INTEGER. The leading dimension of a; lda ≥ max(1, n). ipiv INTEGER. Array, DIMENSION at least max(1, n). Details of the interchanges and the block structure of D as returned by ?hetrf. lwork INTEGER. The dimension of the work array. lwork ≥ (n+nb+1)*(nb+3), where nb is the block size parameter as returned by ?hetrf.
If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla. Output Parameters a If info = 0, the inverse of the original matrix. If uplo = 'U', the upper triangular part of the inverse is formed and the part of A below the diagonal is not referenced. If uplo = 'L', the lower triangular part of the inverse is formed and the part of A above the diagonal is not referenced. info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value. If info = i, D(i,i) = 0; D is singular and its inversion could not be computed. Fortran 95 Interface Notes Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions. Specific details for the routine hetri2 interface are as follows: a Holds the matrix A of size (n,n). ipiv Holds the vector of length n. uplo Indicates how the input matrix A has been factored. Must be 'U' or 'L'. See Also ?hetrf ?hetri2x ?sytri2x Computes the inverse of a symmetric indefinite matrix after ?sytri2 sets the leading dimension of the workspace.
Syntax Fortran 77: call ssytri2x( uplo, n, a, lda, ipiv, work, nb, info ) call dsytri2x( uplo, n, a, lda, ipiv, work, nb, info ) call csytri2x( uplo, n, a, lda, ipiv, work, nb, info ) call zsytri2x( uplo, n, a, lda, ipiv, work, nb, info ) Fortran 95: call sytri2x( a,ipiv,nb[,uplo][,info] ) C: lapack_int LAPACKE_<?>sytri2x( int matrix_order, char uplo, lapack_int n, <datatype>* a, lapack_int lda, const lapack_int* ipiv, lapack_int nb ); Include Files • Fortran: mkl_lapack.fi and mkl_lapack.h • Fortran 95: lapack.f90 • C: mkl_lapacke.h Description The routine computes the inverse inv(A) of a symmetric indefinite matrix A using the factorization A = U*D*U^T or A = L*D*L^T computed by ?sytrf. The ?sytri2x routine actually computes the inverse; the ?sytri2 routine sets the leading dimension of the workspace and then calls ?sytri2x. Input Parameters The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions. uplo CHARACTER*1. Must be 'U' or 'L'. Indicates how the input matrix A has been factored: If uplo = 'U', the array a stores the factorization A = U*D*U^T. If uplo = 'L', the array a stores the factorization A = L*D*L^T. n INTEGER. The order of the matrix A; n ≥ 0. a, work REAL for ssytri2x DOUBLE PRECISION for dsytri2x COMPLEX for csytri2x DOUBLE COMPLEX for zsytri2x Arrays: a(lda,*) contains the nb (block size) diagonal matrix D and the multipliers used to obtain the factor U or L as returned by ?sytrf. The second dimension of a must be at least max(1,n). work is a workspace array of dimension (n+nb+1)*(nb+3), where nb is the block size as set by ?sytrf. lda INTEGER. The leading dimension of a; lda ≥ max(1, n). ipiv INTEGER. Array, DIMENSION at least max(1, n). Details of the interchanges and the nb structure of D as returned by ?sytrf. nb INTEGER.
Block size. Output Parameters a If info = 0, the symmetric inverse of the original matrix. If uplo = 'U', the upper triangular part of the inverse is formed and the part of A below the diagonal is not referenced. If uplo = 'L', the lower triangular part of the inverse is formed and the part of A above the diagonal is not referenced. info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value. If info = i, D(i,i) = 0; D is singular and its inversion could not be computed. Fortran 95 Interface Notes Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions. Specific details for the routine sytri2x interface are as follows: a Holds the matrix A of size (n,n). ipiv Holds the vector of length n. nb Holds the block size. uplo Indicates how the input matrix A has been factored. Must be 'U' or 'L'. See Also ?sytrf ?sytri2 ?hetri2x Computes the inverse of a Hermitian indefinite matrix after ?hetri2 sets the leading dimension of the workspace. Syntax Fortran 77: call chetri2x( uplo, n, a, lda, ipiv, work, nb, info ) call zhetri2x( uplo, n, a, lda, ipiv, work, nb, info ) Fortran 95: call hetri2x( a,ipiv,nb[,uplo][,info] ) C: lapack_int LAPACKE_<?>hetri2x( int matrix_order, char uplo, lapack_int n, <datatype>* a, lapack_int lda, const lapack_int* ipiv, lapack_int nb ); Include Files • Fortran: mkl_lapack.fi and mkl_lapack.h • Fortran 95: lapack.f90 • C: mkl_lapacke.h Description The routine computes the inverse inv(A) of a Hermitian indefinite matrix A using the factorization A = U*D*U^H or A = L*D*L^H computed by ?hetrf. The ?hetri2x routine actually computes the inverse; the ?hetri2 routine sets the leading dimension of the workspace and then calls ?hetri2x. Input Parameters The data types are given for the Fortran interface.
A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions. uplo CHARACTER*1. Must be 'U' or 'L'. Indicates how the input matrix A has been factored: If uplo = 'U', the array a stores the factorization A = U*D*U^H. If uplo = 'L', the array a stores the factorization A = L*D*L^H. n INTEGER. The order of the matrix A; n ≥ 0. a, work COMPLEX for chetri2x DOUBLE COMPLEX for zhetri2x Arrays: a(lda,*) contains the nb (block size) diagonal matrix D and the multipliers used to obtain the factor U or L as returned by ?hetrf. The second dimension of a must be at least max(1,n). work is a workspace array of dimension (n+nb+1)*(nb+3), where nb is the block size as set by ?hetrf. lda INTEGER. The leading dimension of a; lda ≥ max(1, n). ipiv INTEGER. Array, DIMENSION at least max(1, n). Details of the interchanges and the nb structure of D as returned by ?hetrf. nb INTEGER. Block size. Output Parameters a If info = 0, the inverse of the original matrix. If uplo = 'U', the upper triangular part of the inverse is formed and the part of A below the diagonal is not referenced. If uplo = 'L', the lower triangular part of the inverse is formed and the part of A above the diagonal is not referenced. info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value. If info = i, D(i,i) = 0; D is singular and its inversion could not be computed. Fortran 95 Interface Notes Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions. Specific details for the routine hetri2x interface are as follows: a Holds the matrix A of size (n,n). ipiv Holds the vector of length n.
nb Holds the block size. uplo Indicates how the input matrix A has been factored. Must be 'U' or 'L'. See Also ?hetrf ?hetri2 ?sptri Computes the inverse of a symmetric matrix using packed storage. Syntax Fortran 77: call ssptri( uplo, n, ap, ipiv, work, info ) call dsptri( uplo, n, ap, ipiv, work, info ) call csptri( uplo, n, ap, ipiv, work, info ) call zsptri( uplo, n, ap, ipiv, work, info ) Fortran 95: call sptri( ap, ipiv [,uplo] [,info] ) C: lapack_int LAPACKE_<?>sptri( int matrix_order, char uplo, lapack_int n, <datatype>* ap, const lapack_int* ipiv ); Include Files • Fortran: mkl_lapack.fi and mkl_lapack.h • Fortran 95: lapack.f90 • C: mkl_lapacke.h Description The routine computes the inverse inv(A) of a packed symmetric matrix A. Before calling this routine, call ?sptrf to factorize A. Input Parameters The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions. uplo CHARACTER*1. Must be 'U' or 'L'. Indicates how the input matrix A has been factored: If uplo = 'U', the array ap stores the Bunch-Kaufman factorization A = P*U*D*U^T*P^T. If uplo = 'L', the array ap stores the Bunch-Kaufman factorization A = P*L*D*L^T*P^T. n INTEGER. The order of the matrix A; n ≥ 0. ap, work REAL for ssptri DOUBLE PRECISION for dsptri COMPLEX for csptri DOUBLE COMPLEX for zsptri. Arrays: ap(*) contains the factorization of the matrix A, as returned by ?sptrf. The dimension of ap must be at least max(1,n(n+1)/2). work(*) is a workspace array. The dimension of work must be at least max(1,n). ipiv INTEGER. Array, DIMENSION at least max(1, n). The ipiv array, as returned by ?sptrf. Output Parameters ap Overwritten by the n-by-n matrix inv(A) in packed form. info INTEGER. If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value. If info = i, the i-th diagonal element of D is zero, D is singular, and the inversion could not be completed. Fortran 95 Interface Notes Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions. Specific details for the routine sptri interface are as follows: ap Holds the array A of size (n*(n+1)/2). ipiv Holds the vector of length n. uplo Must be 'U' or 'L'. The default value is 'U'. Application Notes The computed inverse X satisfies the following error bounds: $|D U^T P^T X P U - I| \le c(n)\varepsilon\,(|D||U^T|P^T|X|P|U| + |D||D^{-1}|)$ for uplo = 'U', and $|D L^T P^T X P L - I| \le c(n)\varepsilon\,(|D||L^T|P^T|X|P|L| + |D||D^{-1}|)$ for uplo = 'L'. Here c(n) is a modest linear function of n, and ε is the machine precision; I denotes the identity matrix. The total number of floating-point operations is approximately $(2/3)n^3$ for real flavors and $(8/3)n^3$ for complex flavors. ?hptri Computes the inverse of a complex Hermitian matrix using packed storage. Syntax Fortran 77: call chptri( uplo, n, ap, ipiv, work, info ) call zhptri( uplo, n, ap, ipiv, work, info ) Fortran 95: call hptri( ap, ipiv [,uplo] [,info] ) C: lapack_int LAPACKE_<?>hptri( int matrix_order, char uplo, lapack_int n, <datatype>* ap, const lapack_int* ipiv ); Include Files • Fortran: mkl_lapack.fi and mkl_lapack.h • Fortran 95: lapack.f90 • C: mkl_lapacke.h Description The routine computes the inverse inv(A) of a complex Hermitian matrix A using packed storage. Before calling this routine, call ?hptrf to factorize A. Input Parameters The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above.
See the C Interface Conventions section for the C interface principal conventions and type definitions. uplo CHARACTER*1. Must be 'U' or 'L'. Indicates how the input matrix A has been factored: If uplo = 'U', the array ap stores the packed Bunch-Kaufman factorization A = P*U*D*U^H*P^T. If uplo = 'L', the array ap stores the packed Bunch-Kaufman factorization A = P*L*D*L^H*P^T. n INTEGER. The order of the matrix A; n ≥ 0. ap, work COMPLEX for chptri DOUBLE COMPLEX for zhptri. Arrays: ap(*) contains the factorization of the matrix A, as returned by ?hptrf. The dimension of ap must be at least max(1,n(n+1)/2). work(*) is a workspace array. The dimension of work must be at least max(1,n). ipiv INTEGER. Array, DIMENSION at least max(1, n). The ipiv array, as returned by ?hptrf. Output Parameters ap Overwritten by the n-by-n matrix inv(A). info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value. If info = i, the i-th diagonal element of D is zero, D is singular, and the inversion could not be completed. Fortran 95 Interface Notes Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions. Specific details for the routine hptri interface are as follows: ap Holds the array A of size (n*(n+1)/2). ipiv Holds the vector of length n. uplo Must be 'U' or 'L'. The default value is 'U'. Application Notes The computed inverse X satisfies the following error bounds: $|D U^H P^T X P U - I| \le c(n)\varepsilon\,(|D||U^H|P^T|X|P|U| + |D||D^{-1}|)$ for uplo = 'U', and $|D L^H P^T X P L - I| \le c(n)\varepsilon\,(|D||L^H|P^T|X|P|L| + |D||D^{-1}|)$ for uplo = 'L'. Here c(n) is a modest linear function of n, and ε is the machine precision; I denotes the identity matrix.
The total number of floating-point operations is approximately $(2/3)n^3$ for real flavors and $(8/3)n^3$ for complex flavors. The real counterpart of this routine is ?sptri. ?trtri Computes the inverse of a triangular matrix. Syntax Fortran 77: call strtri( uplo, diag, n, a, lda, info ) call dtrtri( uplo, diag, n, a, lda, info ) call ctrtri( uplo, diag, n, a, lda, info ) call ztrtri( uplo, diag, n, a, lda, info ) Fortran 95: call trtri( a [,uplo] [,diag] [,info] ) C: lapack_int LAPACKE_<?>trtri( int matrix_order, char uplo, char diag, lapack_int n, <datatype>* a, lapack_int lda ); Include Files • Fortran: mkl_lapack.fi and mkl_lapack.h • Fortran 95: lapack.f90 • C: mkl_lapacke.h Description The routine computes the inverse inv(A) of a triangular matrix A. Input Parameters The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions. uplo CHARACTER*1. Must be 'U' or 'L'. Indicates whether A is upper or lower triangular: If uplo = 'U', then A is upper triangular. If uplo = 'L', then A is lower triangular. diag CHARACTER*1. Must be 'N' or 'U'. If diag = 'N', then A is not a unit triangular matrix. If diag = 'U', A is unit triangular: diagonal elements of A are assumed to be 1 and not referenced in the array a. n INTEGER. The order of the matrix A; n ≥ 0. a REAL for strtri DOUBLE PRECISION for dtrtri COMPLEX for ctrtri DOUBLE COMPLEX for ztrtri. Array: DIMENSION (lda,*). Contains the matrix A. The second dimension of a must be at least max(1,n). lda INTEGER. The first dimension of a; lda ≥ max(1, n). Output Parameters a Overwritten by the n-by-n matrix inv(A). info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.
If info = i, the i-th diagonal element of A is zero, A is singular, and the inversion could not be completed. Fortran 95 Interface Notes Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions. Specific details for the routine trtri interface are as follows: a Holds the matrix A of size (n,n). uplo Must be 'U' or 'L'. The default value is 'U'. diag Must be 'N' or 'U'. The default value is 'N'. Application Notes The computed inverse X satisfies the following error bounds: $|XA - I| \le c(n)\varepsilon\,|X||A|$ and $|X - A^{-1}| \le c(n)\varepsilon\,|A^{-1}||A||X|$, where c(n) is a modest linear function of n; ε is the machine precision; I denotes the identity matrix. The total number of floating-point operations is approximately $(1/3)n^3$ for real flavors and $(4/3)n^3$ for complex flavors. ?tftri Computes the inverse of a triangular matrix stored in the Rectangular Full Packed (RFP) format. Syntax Fortran 77: call stftri( transr, uplo, diag, n, a, info ) call dtftri( transr, uplo, diag, n, a, info ) call ctftri( transr, uplo, diag, n, a, info ) call ztftri( transr, uplo, diag, n, a, info ) C: lapack_int LAPACKE_<?>tftri( int matrix_order, char transr, char uplo, char diag, lapack_int n, <datatype>* a ); Include Files • Fortran: mkl_lapack.fi and mkl_lapack.h • Fortran 95: lapack.f90 • C: mkl_lapacke.h Description Computes the inverse of a triangular matrix A stored in the Rectangular Full Packed (RFP) format. For the description of the RFP format, see Matrix Storage Schemes. This is the block version of the algorithm, calling Level 3 BLAS. Input Parameters The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
transr CHARACTER*1. Must be 'N', 'T' (for real data) or 'C' (for complex data). If transr = 'N', the normal form of RFP A is stored. If transr = 'T', the transpose form of RFP A is stored. If transr = 'C', the conjugate-transpose form of RFP A is stored. uplo CHARACTER*1. Must be 'U' or 'L'. Indicates whether the upper or lower triangular part of RFP A is stored: If uplo = 'U', the array a stores the upper triangular part of the matrix A. If uplo = 'L', the array a stores the lower triangular part of the matrix A. diag CHARACTER*1. Must be 'N' or 'U'. If diag = 'N', then A is not a unit triangular matrix. If diag = 'U', A is unit triangular: diagonal elements of A are assumed to be 1 and not referenced in the array a. n INTEGER. The order of the matrix A; n ≥ 0. a REAL for stftri DOUBLE PRECISION for dtftri COMPLEX for ctftri DOUBLE COMPLEX for ztftri. Array, DIMENSION (n*(n+1)/2). The array a contains the matrix A in the RFP format. Output Parameters a The (triangular) inverse of the original matrix in the same storage format. info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value. If info = i, A(i,i) is exactly zero. The triangular matrix is singular and its inverse cannot be computed. ?tptri Computes the inverse of a triangular matrix using packed storage. Syntax Fortran 77: call stptri( uplo, diag, n, ap, info ) call dtptri( uplo, diag, n, ap, info ) call ctptri( uplo, diag, n, ap, info ) call ztptri( uplo, diag, n, ap, info ) Fortran 95: call tptri( ap [,uplo] [,diag] [,info] ) C: lapack_int LAPACKE_<?>tptri( int matrix_order, char uplo, char diag, lapack_int n, <datatype>* ap ); Include Files • Fortran: mkl_lapack.fi and mkl_lapack.h • Fortran 95: lapack.f90 • C: mkl_lapacke.h Description The routine computes the inverse inv(A) of a packed triangular matrix A. Input Parameters The data types are given for the Fortran interface.
A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions. uplo CHARACTER*1. Must be 'U' or 'L'. Indicates whether A is upper or lower triangular: If uplo = 'U', then A is upper triangular. If uplo = 'L', then A is lower triangular. diag CHARACTER*1. Must be 'N' or 'U'. If diag = 'N', then A is not a unit triangular matrix. If diag = 'U', A is unit triangular: diagonal elements of A are assumed to be 1 and not referenced in the array ap. n INTEGER. The order of the matrix A; n ≥ 0. ap REAL for stptri DOUBLE PRECISION for dtptri COMPLEX for ctptri DOUBLE COMPLEX for ztptri. Array, DIMENSION at least max(1,n(n+1)/2). Contains the packed triangular matrix A. Output Parameters ap Overwritten by the packed n-by-n matrix inv(A). info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value. If info = i, the i-th diagonal element of A is zero, A is singular, and the inversion could not be completed. Fortran 95 Interface Notes Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions. Specific details for the routine tptri interface are as follows: ap Holds the array A of size (n*(n+1)/2). uplo Must be 'U' or 'L'. The default value is 'U'. diag Must be 'N' or 'U'. The default value is 'N'. Application Notes The computed inverse X satisfies the following error bounds: $|XA - I| \le c(n)\varepsilon\,|X||A|$ and $|X - A^{-1}| \le c(n)\varepsilon\,|A^{-1}||A||X|$, where c(n) is a modest linear function of n; ε is the machine precision; I denotes the identity matrix. The total number of floating-point operations is approximately $(1/3)n^3$ for real flavors and $(4/3)n^3$ for complex flavors.
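To make the packed layout and the meaning of info for ?tptri concrete, here is a minimal plain-C sketch of an unblocked, in-place inverse of a non-unit upper triangular matrix packed by columns (the uplo = 'U', diag = 'N' case). It does not call MKL and is not how the library is implemented; the function names packed_upper_index and packed_upper_inverse_sketch are hypothetical, and real double data is assumed.

```c
#include <assert.h>
#include <stddef.h>

/* 0-based offset of element (i, j), i <= j, of an upper triangular matrix
   packed by columns (the uplo = 'U' layout). */
static size_t packed_upper_index(size_t i, size_t j)
{
    return i + j * (j + 1) / 2;
}

/* Hypothetical helper, not an MKL routine: in-place, unblocked inversion of
   a non-unit upper triangular matrix held in packed storage (n*(n+1)/2
   elements). Returns 0 on success, or j (1-based) if the j-th diagonal
   element is exactly zero, mirroring the info convention described above. */
int packed_upper_inverse_sketch(size_t n, double *ap)
{
    assert(ap != NULL || n == 0);

    /* Check for singularity first so that ap is left untouched on failure. */
    for (size_t j = 0; j < n; ++j)
        if (ap[packed_upper_index(j, j)] == 0.0)
            return (int)(j + 1);

    for (size_t j = 0; j < n; ++j) {
        double *col = ap + packed_upper_index(0, j); /* rows 0..j of column j */
        col[j] = 1.0 / col[j];
        /* Rows 0..j-1: X(i,j) = -X(j,j) * sum_{k=i}^{j-1} X(i,k) * A(k,j),
           where columns 0..j-1 of ap already hold the inverse X. Going
           upward in i only reads entries col[k], k >= i, that still hold A. */
        for (size_t i = 0; i < j; ++i) {
            double s = 0.0;
            for (size_t k = i; k < j; ++k)
                s += ap[packed_upper_index(i, k)] * col[k];
            col[i] = -col[j] * s;
        }
    }
    return 0;
}
```

For the 2-by-2 matrix with rows (2, 4) and (0, 8), the packed input is {2, 4, 8}; a call with n = 2 should return 0 and leave the packed inverse in the same array, while a zero on the diagonal is reported through the return value without modifying the input.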
Routines for Matrix Equilibration Routines described in this section are used to compute scaling factors needed to equilibrate a matrix. Note that these routines do not actually scale the matrices. ?geequ Computes row and column scaling factors intended to equilibrate a general matrix and reduce its condition number. Syntax Fortran 77: call sgeequ( m, n, a, lda, r, c, rowcnd, colcnd, amax, info ) call dgeequ( m, n, a, lda, r, c, rowcnd, colcnd, amax, info ) call cgeequ( m, n, a, lda, r, c, rowcnd, colcnd, amax, info ) call zgeequ( m, n, a, lda, r, c, rowcnd, colcnd, amax, info ) Fortran 95: call geequ( a, r, c [,rowcnd] [,colcnd] [,amax] [,info] ) C: lapack_int LAPACKE_sgeequ( int matrix_order, lapack_int m, lapack_int n, const float* a, lapack_int lda, float* r, float* c, float* rowcnd, float* colcnd, float* amax ); lapack_int LAPACKE_dgeequ( int matrix_order, lapack_int m, lapack_int n, const double* a, lapack_int lda, double* r, double* c, double* rowcnd, double* colcnd, double* amax ); lapack_int LAPACKE_cgeequ( int matrix_order, lapack_int m, lapack_int n, const lapack_complex_float* a, lapack_int lda, float* r, float* c, float* rowcnd, float* colcnd, float* amax ); lapack_int LAPACKE_zgeequ( int matrix_order, lapack_int m, lapack_int n, const lapack_complex_double* a, lapack_int lda, double* r, double* c, double* rowcnd, double* colcnd, double* amax ); Include Files • Fortran: mkl_lapack.fi and mkl_lapack.h • Fortran 95: lapack.f90 • C: mkl_lapacke.h Description The routine computes row and column scalings intended to equilibrate an m-by-n matrix A and reduce its condition number. The output array r returns the row scale factors and the array c the column scale factors. These factors are chosen to try to make the largest element in each row and column of the matrix B with elements b(i,j) = r(i)*a(i,j)*c(j) have absolute value 1. See the ?laqge auxiliary function, which uses the scaling factors computed by ?geequ.
Input Parameters The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions. m INTEGER. The number of rows of the matrix A; m ≥ 0. n INTEGER. The number of columns of the matrix A; n ≥ 0. a REAL for sgeequ DOUBLE PRECISION for dgeequ COMPLEX for cgeequ DOUBLE COMPLEX for zgeequ. Array: DIMENSION (lda,*). Contains the m-by-n matrix A whose equilibration factors are to be computed. The second dimension of a must be at least max(1,n). lda INTEGER. The leading dimension of a; lda ≥ max(1, m). Output Parameters r, c REAL for single precision flavors DOUBLE PRECISION for double precision flavors. Arrays: r(m), c(n). If info = 0, or info > m, the array r contains the row scale factors of the matrix A. If info = 0, the array c contains the column scale factors of the matrix A. rowcnd REAL for single precision flavors DOUBLE PRECISION for double precision flavors. If info = 0 or info > m, rowcnd contains the ratio of the smallest r(i) to the largest r(i). colcnd REAL for single precision flavors DOUBLE PRECISION for double precision flavors. If info = 0, colcnd contains the ratio of the smallest c(i) to the largest c(i). amax REAL for single precision flavors DOUBLE PRECISION for double precision flavors. Absolute value of the largest element of the matrix A. info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value. If info = i and i ≤ m, the i-th row of A is exactly zero; if i > m, the (i-m)-th column of A is exactly zero. Fortran 95 Interface Notes Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine geequ interface are as follows: a Holds the matrix A of size (m, n). r Holds the vector of length m. c Holds the vector of length n. Application Notes All the components of r and c are restricted to be between SMLNUM = smallest safe number and BIGNUM = largest safe number. Use of these scaling factors is not guaranteed to reduce the condition number of A but works well in practice. SMLNUM and BIGNUM are parameters representing machine precision. You can use the ?lamch routines to compute them. For example, compute single precision (real and complex) values of SMLNUM and BIGNUM as follows: SMLNUM = slamch ('s') BIGNUM = 1 / SMLNUM If rowcnd ≥ 0.1 and amax is neither too large nor too small, it is not worth scaling by r. If colcnd ≥ 0.1, it is not worth scaling by c. If amax is very close to overflow or very close to underflow, the matrix A should be scaled. ?geequb Computes row and column scaling factors restricted to a power of radix to equilibrate a general matrix and reduce its condition number.
Syntax Fortran 77: call sgeequb( m, n, a, lda, r, c, rowcnd, colcnd, amax, info ) call dgeequb( m, n, a, lda, r, c, rowcnd, colcnd, amax, info ) call cgeequb( m, n, a, lda, r, c, rowcnd, colcnd, amax, info ) call zgeequb( m, n, a, lda, r, c, rowcnd, colcnd, amax, info ) C: lapack_int LAPACKE_sgeequb( int matrix_order, lapack_int m, lapack_int n, const float* a, lapack_int lda, float* r, float* c, float* rowcnd, float* colcnd, float* amax ); lapack_int LAPACKE_dgeequb( int matrix_order, lapack_int m, lapack_int n, const double* a, lapack_int lda, double* r, double* c, double* rowcnd, double* colcnd, double* amax ); lapack_int LAPACKE_cgeequb( int matrix_order, lapack_int m, lapack_int n, const lapack_complex_float* a, lapack_int lda, float* r, float* c, float* rowcnd, float* colcnd, float* amax ); lapack_int LAPACKE_zgeequb( int matrix_order, lapack_int m, lapack_int n, const lapack_complex_double* a, lapack_int lda, double* r, double* c, double* rowcnd, double* colcnd, double* amax ); Include Files • Fortran: mkl_lapack.fi and mkl_lapack.h • Fortran 95: lapack.f90 • C: mkl_lapacke.h Description The routine computes row and column scalings intended to equilibrate an m-by-n general matrix A and reduce its condition number. The output array r returns the row scale factors and the array c the column scale factors. These factors are chosen to try to make the largest element in each row and column of the matrix B with elements b(i,j) = r(i)*a(i,j)*c(j) have an absolute value of at most the radix. r(i) and c(j) are restricted to be a power of the radix between SMLNUM = smallest safe number and BIGNUM = largest safe number. Use of these scaling factors is not guaranteed to reduce the condition number of A but works well in practice. SMLNUM and BIGNUM are parameters representing machine precision. You can use the ?lamch routines to compute them.
For example, compute single precision (real and complex) values of SMLNUM and BIGNUM as follows: SMLNUM = slamch ('s') BIGNUM = 1 / SMLNUM This routine differs from ?geequ by restricting the scaling factors to a power of the radix. Except for overflow and underflow, scaling by these factors introduces no additional rounding errors. However, the scaled entries' magnitudes are no longer equal to approximately 1 but lie between sqrt(radix) and 1/sqrt(radix). Input Parameters The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions. m INTEGER. The number of rows of the matrix A; m ≥ 0. n INTEGER. The number of columns of the matrix A; n ≥ 0. a REAL for sgeequb DOUBLE PRECISION for dgeequb COMPLEX for cgeequb DOUBLE COMPLEX for zgeequb. Array: DIMENSION (lda,*). Contains the m-by-n matrix A whose equilibration factors are to be computed. The second dimension of a must be at least max(1,n). lda INTEGER. The leading dimension of a; lda ≥ max(1, m). Output Parameters r, c REAL for single precision flavors DOUBLE PRECISION for double precision flavors. Arrays: r(m), c(n). If info = 0, or info > m, the array r contains the row scale factors for the matrix A. If info = 0, the array c contains the column scale factors for the matrix A. rowcnd REAL for single precision flavors DOUBLE PRECISION for double precision flavors. If info = 0 or info > m, rowcnd contains the ratio of the smallest r(i) to the largest r(i). If rowcnd ≥ 0.1, and amax is neither too large nor too small, it is not worth scaling by r. colcnd REAL for single precision flavors DOUBLE PRECISION for double precision flavors. If info = 0, colcnd contains the ratio of the smallest c(i) to the largest c(i). If colcnd ≥ 0.1, it is not worth scaling by c.
amax REAL for single precision flavors; DOUBLE PRECISION for double precision flavors.
Absolute value of the largest element of the matrix A. If amax is very close to overflow or very close to underflow, the matrix should be scaled.
info INTEGER.
If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
If info = i and i ≤ m, the i-th row of A is exactly zero; if i > m, the (i-m)-th column of A is exactly zero.

?gbequ
Computes row and column scaling factors intended to equilibrate a banded matrix and reduce its condition number.
Syntax
Fortran 77:
call sgbequ( m, n, kl, ku, ab, ldab, r, c, rowcnd, colcnd, amax, info )
call dgbequ( m, n, kl, ku, ab, ldab, r, c, rowcnd, colcnd, amax, info )
call cgbequ( m, n, kl, ku, ab, ldab, r, c, rowcnd, colcnd, amax, info )
call zgbequ( m, n, kl, ku, ab, ldab, r, c, rowcnd, colcnd, amax, info )
Fortran 95:
call gbequ( ab, r, c [,kl] [,rowcnd] [,colcnd] [,amax] [,info] )
C:
lapack_int LAPACKE_sgbequ( int matrix_order, lapack_int m, lapack_int n, lapack_int kl, lapack_int ku, const float* ab, lapack_int ldab, float* r, float* c, float* rowcnd, float* colcnd, float* amax );
lapack_int LAPACKE_dgbequ( int matrix_order, lapack_int m, lapack_int n, lapack_int kl, lapack_int ku, const double* ab, lapack_int ldab, double* r, double* c, double* rowcnd, double* colcnd, double* amax );
lapack_int LAPACKE_cgbequ( int matrix_order, lapack_int m, lapack_int n, lapack_int kl, lapack_int ku, const lapack_complex_float* ab, lapack_int ldab, float* r, float* c, float* rowcnd, float* colcnd, float* amax );
lapack_int LAPACKE_zgbequ( int matrix_order, lapack_int m, lapack_int n, lapack_int kl, lapack_int ku, const lapack_complex_double* ab, lapack_int ldab, double* r, double* c, double* rowcnd, double* colcnd, double* amax );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine computes row and
column scalings intended to equilibrate an m-by-n band matrix A and reduce its condition number. The output array r returns the row scale factors, and the array c returns the column scale factors. These factors are chosen to try to make the largest element in each row and column of the matrix B with elements b(i,j) = r(i)*a(i,j)*c(j) have absolute value 1.
See the ?laqgb auxiliary function, which uses the scaling factors computed by ?gbequ.
Input Parameters
The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
m INTEGER. The number of rows of the matrix A; m ≥ 0.
n INTEGER. The number of columns of the matrix A; n ≥ 0.
kl INTEGER. The number of subdiagonals within the band of A; kl ≥ 0.
ku INTEGER. The number of superdiagonals within the band of A; ku ≥ 0.
ab REAL for sgbequ; DOUBLE PRECISION for dgbequ; COMPLEX for cgbequ; DOUBLE COMPLEX for zgbequ.
Array, DIMENSION (ldab,*).
Contains the original band matrix A stored in rows from 1 to kl + ku + 1. The second dimension of ab must be at least max(1,n).
ldab INTEGER. The leading dimension of ab; ldab ≥ kl+ku+1.
Output Parameters
r, c REAL for single precision flavors; DOUBLE PRECISION for double precision flavors.
Arrays: r(m), c(n).
If info = 0, or info > m, the array r contains the row scale factors of the matrix A.
If info = 0, the array c contains the column scale factors of the matrix A.
rowcnd REAL for single precision flavors; DOUBLE PRECISION for double precision flavors.
If info = 0 or info > m, rowcnd contains the ratio of the smallest r(i) to the largest r(i).
colcnd REAL for single precision flavors; DOUBLE PRECISION for double precision flavors.
If info = 0, colcnd contains the ratio of the smallest c(i) to the largest c(i).
amax REAL for single precision flavors; DOUBLE PRECISION for double precision flavors.
Absolute value of the largest element of the matrix A.
info INTEGER.
If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
If info = i and i ≤ m, the i-th row of A is exactly zero; if i > m, the (i-m)-th column of A is exactly zero.
Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine gbequ interface are as follows:
ab Holds the array A of size (kl+ku+1,n).
r Holds the vector of length m.
c Holds the vector of length n.
kl If omitted, assumed kl = ku.
ku Restored as ku = ldab-kl-1, where ldab is the leading dimension of ab.
Application Notes
All the components of r and c are restricted to be between SMLNUM = smallest safe number and BIGNUM = largest safe number. Use of these scaling factors is not guaranteed to reduce the condition number of A but works well in practice.
SMLNUM and BIGNUM are parameters representing machine precision. You can use the ?lamch routines to compute them. For example, compute single precision (real and complex) values of SMLNUM and BIGNUM as follows:
SMLNUM = slamch ('s')
BIGNUM = 1 / SMLNUM
If rowcnd ≥ 0.1 and amax is neither too large nor too small, it is not worth scaling by r.
If colcnd ≥ 0.1, it is not worth scaling by c.
If amax is very close to overflow or very close to underflow, the matrix A should be scaled.

?gbequb
Computes row and column scaling factors restricted to a power of radix to equilibrate a banded matrix and reduce its condition number.
Syntax
Fortran 77:
call sgbequb( m, n, kl, ku, ab, ldab, r, c, rowcnd, colcnd, amax, info )
call dgbequb( m, n, kl, ku, ab, ldab, r, c, rowcnd, colcnd, amax, info )
call cgbequb( m, n, kl, ku, ab, ldab, r, c, rowcnd, colcnd, amax, info )
call zgbequb( m, n, kl, ku, ab, ldab, r, c, rowcnd, colcnd, amax, info )
C:
lapack_int LAPACKE_sgbequb( int matrix_order, lapack_int m, lapack_int n, lapack_int kl, lapack_int ku, const float* ab, lapack_int ldab, float* r, float* c, float* rowcnd, float* colcnd, float* amax );
lapack_int LAPACKE_dgbequb( int matrix_order, lapack_int m, lapack_int n, lapack_int kl, lapack_int ku, const double* ab, lapack_int ldab, double* r, double* c, double* rowcnd, double* colcnd, double* amax );
lapack_int LAPACKE_cgbequb( int matrix_order, lapack_int m, lapack_int n, lapack_int kl, lapack_int ku, const lapack_complex_float* ab, lapack_int ldab, float* r, float* c, float* rowcnd, float* colcnd, float* amax );
lapack_int LAPACKE_zgbequb( int matrix_order, lapack_int m, lapack_int n, lapack_int kl, lapack_int ku, const lapack_complex_double* ab, lapack_int ldab, double* r, double* c, double* rowcnd, double* colcnd, double* amax );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine computes row and column scalings intended to equilibrate an m-by-n banded matrix A and reduce its condition number. The output array r returns the row scale factors, and the array c returns the column scale factors. These factors are chosen to try to make the largest element in each row and column of the matrix B with elements b(i,j) = r(i)*a(i,j)*c(j) have an absolute value of at most the radix.
r(i) and c(j) are restricted to be a power of the radix between SMLNUM = smallest safe number and BIGNUM = largest safe number. Use of these scaling factors is not guaranteed to reduce the condition number of A but works well in practice.
SMLNUM and BIGNUM are parameters representing machine precision.
You can use the ?lamch routines to compute them. For example, compute single precision (real and complex) values of SMLNUM and BIGNUM as follows:
SMLNUM = slamch ('s')
BIGNUM = 1 / SMLNUM
This routine differs from ?gbequ by restricting the scaling factors to a power of the radix. Except for over- and underflow, scaling by these factors introduces no additional rounding errors. However, the scaled entries' magnitudes are no longer equal to approximately 1 but lie between sqrt(radix) and 1/sqrt(radix).
Input Parameters
The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
m INTEGER. The number of rows of the matrix A; m ≥ 0.
n INTEGER. The number of columns of the matrix A; n ≥ 0.
kl INTEGER. The number of subdiagonals within the band of A; kl ≥ 0.
ku INTEGER. The number of superdiagonals within the band of A; ku ≥ 0.
ab REAL for sgbequb; DOUBLE PRECISION for dgbequb; COMPLEX for cgbequb; DOUBLE COMPLEX for zgbequb.
Array: DIMENSION (ldab,*).
Contains the original banded matrix A stored in rows from 1 to kl + ku + 1. The j-th column of A is stored in the j-th column of the array ab as follows:
ab(ku+1+i-j,j) = a(i,j) for max(1,j-ku) ≤ i ≤ min(n,j+kl).
The second dimension of ab must be at least max(1,n).
ldab INTEGER. The leading dimension of ab; ldab ≥ kl+ku+1.
Output Parameters
r, c REAL for single precision flavors; DOUBLE PRECISION for double precision flavors.
Arrays: r(m), c(n).
If info = 0, or info > m, the array r contains the row scale factors for the matrix A.
If info = 0, the array c contains the column scale factors for the matrix A.
rowcnd REAL for single precision flavors; DOUBLE PRECISION for double precision flavors.
If info = 0 or info > m, rowcnd contains the ratio of the smallest r(i) to the largest r(i).
If rowcnd ≥ 0.1, and amax is neither too large nor too small, it is not worth scaling by r.
colcnd REAL for single precision flavors; DOUBLE PRECISION for double precision flavors.
If info = 0, colcnd contains the ratio of the smallest c(i) to the largest c(i). If colcnd ≥ 0.1, it is not worth scaling by c.
amax REAL for single precision flavors; DOUBLE PRECISION for double precision flavors.
Absolute value of the largest element of the matrix A. If amax is very close to overflow or underflow, the matrix should be scaled.
info INTEGER.
If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
If info = i and i ≤ m, the i-th row of A is exactly zero; if i > m, the (i-m)-th column of A is exactly zero.

?poequ
Computes row and column scaling factors intended to equilibrate a symmetric (Hermitian) positive definite matrix and reduce its condition number.
Syntax
Fortran 77:
call spoequ( n, a, lda, s, scond, amax, info )
call dpoequ( n, a, lda, s, scond, amax, info )
call cpoequ( n, a, lda, s, scond, amax, info )
call zpoequ( n, a, lda, s, scond, amax, info )
Fortran 95:
call poequ( a, s [,scond] [,amax] [,info] )
C:
lapack_int LAPACKE_spoequ( int matrix_order, lapack_int n, const float* a, lapack_int lda, float* s, float* scond, float* amax );
lapack_int LAPACKE_dpoequ( int matrix_order, lapack_int n, const double* a, lapack_int lda, double* s, double* scond, double* amax );
lapack_int LAPACKE_cpoequ( int matrix_order, lapack_int n, const lapack_complex_float* a, lapack_int lda, float* s, float* scond, float* amax );
lapack_int LAPACKE_zpoequ( int matrix_order, lapack_int n, const lapack_complex_double* a, lapack_int lda, double* s, double* scond, double* amax );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine computes row and column scalings intended to equilibrate a
symmetric (Hermitian) positive-definite matrix A and reduce its condition number (with respect to the two-norm). The output array s returns scale factors computed as
s(i) = 1/sqrt(a(i,i)).
These factors are chosen so that the scaled matrix B with elements b(i,j) = s(i)*a(i,j)*s(j) has diagonal elements equal to 1.
This choice of s puts the condition number of B within a factor n of the smallest possible condition number over all possible diagonal scalings.
See the ?laqsy auxiliary function, which uses the scaling factors computed by ?poequ.
Input Parameters
The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
n INTEGER. The order of the matrix A; n ≥ 0.
a REAL for spoequ; DOUBLE PRECISION for dpoequ; COMPLEX for cpoequ; DOUBLE COMPLEX for zpoequ.
Array: DIMENSION (lda,*).
Contains the n-by-n symmetric or Hermitian positive definite matrix A whose scaling factors are to be computed. Only the diagonal elements of A are referenced. The second dimension of a must be at least max(1,n).
lda INTEGER. The leading dimension of a; lda ≥ max(1,n).
Output Parameters
s REAL for single precision flavors; DOUBLE PRECISION for double precision flavors.
Array, DIMENSION (n).
If info = 0, the array s contains the scale factors for A.
scond REAL for single precision flavors; DOUBLE PRECISION for double precision flavors.
If info = 0, scond contains the ratio of the smallest s(i) to the largest s(i).
amax REAL for single precision flavors; DOUBLE PRECISION for double precision flavors.
Absolute value of the largest element of the matrix A.
info INTEGER.
If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
If info = i, the i-th diagonal element of A is nonpositive.
Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine poequ interface are as follows:
a Holds the matrix A of size (n,n).
s Holds the vector of length n.
Application Notes
If scond ≥ 0.1 and amax is neither too large nor too small, it is not worth scaling by s.
If amax is very close to overflow or very close to underflow, the matrix A should be scaled.

?poequb
Computes row and column scaling factors intended to equilibrate a symmetric (Hermitian) positive definite matrix and reduce its condition number.
Syntax
Fortran 77:
call spoequb( n, a, lda, s, scond, amax, info )
call dpoequb( n, a, lda, s, scond, amax, info )
call cpoequb( n, a, lda, s, scond, amax, info )
call zpoequb( n, a, lda, s, scond, amax, info )
C:
lapack_int LAPACKE_spoequb( int matrix_order, lapack_int n, const float* a, lapack_int lda, float* s, float* scond, float* amax );
lapack_int LAPACKE_dpoequb( int matrix_order, lapack_int n, const double* a, lapack_int lda, double* s, double* scond, double* amax );
lapack_int LAPACKE_cpoequb( int matrix_order, lapack_int n, const lapack_complex_float* a, lapack_int lda, float* s, float* scond, float* amax );
lapack_int LAPACKE_zpoequb( int matrix_order, lapack_int n, const lapack_complex_double* a, lapack_int lda, double* s, double* scond, double* amax );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine computes row and column scalings intended to equilibrate a symmetric (Hermitian) positive-definite matrix A and reduce its condition number (with respect to the two-norm).
These factors are chosen so that the scaled matrix B with elements b(i,j) = s(i)*a(i,j)*s(j) has diagonal elements equal to 1.
s(i) is a power of two nearest to, but not exceeding, 1/sqrt(A(i,i)).
This choice of s puts the condition number of B within a factor n of the smallest possible condition number over all possible diagonal scalings.
Input Parameters
The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
n INTEGER. The order of the matrix A; n ≥ 0.
a REAL for spoequb; DOUBLE PRECISION for dpoequb; COMPLEX for cpoequb; DOUBLE COMPLEX for zpoequb.
Array: DIMENSION (lda,*).
Contains the n-by-n symmetric or Hermitian positive definite matrix A whose scaling factors are to be computed. Only the diagonal elements of A are referenced. The second dimension of a must be at least max(1,n).
lda INTEGER. The leading dimension of a; lda ≥ max(1,n).
Output Parameters
s REAL for single precision flavors; DOUBLE PRECISION for double precision flavors.
Array, DIMENSION (n).
If info = 0, the array s contains the scale factors for A.
scond REAL for single precision flavors; DOUBLE PRECISION for double precision flavors.
If info = 0, scond contains the ratio of the smallest s(i) to the largest s(i). If scond ≥ 0.1, and amax is neither too large nor too small, it is not worth scaling by s.
amax REAL for single precision flavors; DOUBLE PRECISION for double precision flavors.
Absolute value of the largest element of the matrix A. If amax is very close to overflow or underflow, the matrix should be scaled.
info INTEGER.
If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
If info = i, the i-th diagonal element of A is nonpositive.

?ppequ
Computes row and column scaling factors intended to equilibrate a symmetric (Hermitian) positive definite matrix in packed storage and reduce its condition number.
Syntax
Fortran 77:
call sppequ( uplo, n, ap, s, scond, amax, info )
call dppequ( uplo, n, ap, s, scond, amax, info )
call cppequ( uplo, n, ap, s, scond, amax, info )
call zppequ( uplo, n, ap, s, scond, amax, info )
Fortran 95:
call ppequ( ap, s [,scond] [,amax] [,uplo] [,info] )
C:
lapack_int LAPACKE_sppequ( int matrix_order, char uplo, lapack_int n, const float* ap, float* s, float* scond, float* amax );
lapack_int LAPACKE_dppequ( int matrix_order, char uplo, lapack_int n, const double* ap, double* s, double* scond, double* amax );
lapack_int LAPACKE_cppequ( int matrix_order, char uplo, lapack_int n, const lapack_complex_float* ap, float* s, float* scond, float* amax );
lapack_int LAPACKE_zppequ( int matrix_order, char uplo, lapack_int n, const lapack_complex_double* ap, double* s, double* scond, double* amax );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine computes row and column scalings intended to equilibrate a symmetric (Hermitian) positive definite matrix A in packed storage and reduce its condition number (with respect to the two-norm). The output array s returns scale factors computed as
s(i) = 1/sqrt(a(i,i)).
These factors are chosen so that the scaled matrix B with elements b(i,j) = s(i)*a(i,j)*s(j) has diagonal elements equal to 1.
This choice of s puts the condition number of B within a factor n of the smallest possible condition number over all possible diagonal scalings.
See the ?laqsp auxiliary function, which uses the scaling factors computed by ?ppequ.
Input Parameters
The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'.
Indicates whether the upper or lower triangular part of A is packed in the array ap:
If uplo = 'U', the array ap stores the upper triangular part of the matrix A.
If uplo = 'L', the array ap stores the lower triangular part of the matrix A.
n INTEGER. The order of matrix A; n ≥ 0.
ap REAL for sppequ; DOUBLE PRECISION for dppequ; COMPLEX for cppequ; DOUBLE COMPLEX for zppequ.
Array, DIMENSION at least max(1,n(n+1)/2).
The array ap contains the upper or the lower triangular part of the matrix A (as specified by uplo) in packed storage (see Matrix Storage Schemes).
Output Parameters
s REAL for single precision flavors; DOUBLE PRECISION for double precision flavors.
Array, DIMENSION (n).
If info = 0, the array s contains the scale factors for A.
scond REAL for single precision flavors; DOUBLE PRECISION for double precision flavors.
If info = 0, scond contains the ratio of the smallest s(i) to the largest s(i).
amax REAL for single precision flavors; DOUBLE PRECISION for double precision flavors.
Absolute value of the largest element of the matrix A.
info INTEGER.
If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
If info = i, the i-th diagonal element of A is nonpositive.
Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine ppequ interface are as follows:
ap Holds the array A of size (n*(n+1)/2).
s Holds the vector of length n.
uplo Must be 'U' or 'L'. The default value is 'U'.
Application Notes
If scond ≥ 0.1 and amax is neither too large nor too small, it is not worth scaling by s.
If amax is very close to overflow or very close to underflow, the matrix A should be scaled.

?pbequ
Computes row and column scaling factors intended to equilibrate a symmetric (Hermitian) positive-definite band matrix and reduce its condition number.
Syntax
Fortran 77:
call spbequ( uplo, n, kd, ab, ldab, s, scond, amax, info )
call dpbequ( uplo, n, kd, ab, ldab, s, scond, amax, info )
call cpbequ( uplo, n, kd, ab, ldab, s, scond, amax, info )
call zpbequ( uplo, n, kd, ab, ldab, s, scond, amax, info )
Fortran 95:
call pbequ( ab, s [,scond] [,amax] [,uplo] [,info] )
C:
lapack_int LAPACKE_spbequ( int matrix_order, char uplo, lapack_int n, lapack_int kd, const float* ab, lapack_int ldab, float* s, float* scond, float* amax );
lapack_int LAPACKE_dpbequ( int matrix_order, char uplo, lapack_int n, lapack_int kd, const double* ab, lapack_int ldab, double* s, double* scond, double* amax );
lapack_int LAPACKE_cpbequ( int matrix_order, char uplo, lapack_int n, lapack_int kd, const lapack_complex_float* ab, lapack_int ldab, float* s, float* scond, float* amax );
lapack_int LAPACKE_zpbequ( int matrix_order, char uplo, lapack_int n, lapack_int kd, const lapack_complex_double* ab, lapack_int ldab, double* s, double* scond, double* amax );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine computes row and column scalings intended to equilibrate a symmetric (Hermitian) positive-definite band matrix A and reduce its condition number (with respect to the two-norm). The output array s returns scale factors computed as
s(i) = 1/sqrt(a(i,i)).
These factors are chosen so that the scaled matrix B with elements b(i,j) = s(i)*a(i,j)*s(j) has diagonal elements equal to 1.
This choice of s puts the condition number of B within a factor n of the smallest possible condition number over all possible diagonal scalings.
See the ?laqsb auxiliary function, which uses the scaling factors computed by ?pbequ.
Input Parameters
The data types are given for the Fortran interface.
A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'.
Indicates whether the upper or lower triangular part of A is stored in the array ab:
If uplo = 'U', the array ab stores the upper triangular part of the matrix A.
If uplo = 'L', the array ab stores the lower triangular part of the matrix A.
n INTEGER. The order of matrix A; n ≥ 0.
kd INTEGER. The number of superdiagonals or subdiagonals in the matrix A; kd ≥ 0.
ab REAL for spbequ; DOUBLE PRECISION for dpbequ; COMPLEX for cpbequ; DOUBLE COMPLEX for zpbequ.
Array, DIMENSION (ldab,*).
The array ab contains either the upper or the lower triangular part of the matrix A (as specified by uplo) in band storage (see Matrix Storage Schemes). The second dimension of ab must be at least max(1, n).
ldab INTEGER. The leading dimension of the array ab; ldab ≥ kd+1.
Output Parameters
s REAL for single precision flavors; DOUBLE PRECISION for double precision flavors.
Array, DIMENSION (n).
If info = 0, the array s contains the scale factors for A.
scond REAL for single precision flavors; DOUBLE PRECISION for double precision flavors.
If info = 0, scond contains the ratio of the smallest s(i) to the largest s(i).
amax REAL for single precision flavors; DOUBLE PRECISION for double precision flavors.
Absolute value of the largest element of the matrix A.
info INTEGER.
If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
If info = i, the i-th diagonal element of A is nonpositive.
Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine pbequ interface are as follows:
ab Holds the array A of size (kd+1,n).
s Holds the vector of length n.
uplo Must be 'U' or 'L'. The default value is 'U'.
Application Notes
If scond ≥ 0.1 and amax is neither too large nor too small, it is not worth scaling by s.
If amax is very close to overflow or very close to underflow, the matrix A should be scaled.

?syequb
Computes row and column scaling factors intended to equilibrate a symmetric indefinite matrix and reduce its condition number.
Syntax
Fortran 77:
call ssyequb( uplo, n, a, lda, s, scond, amax, work, info )
call dsyequb( uplo, n, a, lda, s, scond, amax, work, info )
call csyequb( uplo, n, a, lda, s, scond, amax, work, info )
call zsyequb( uplo, n, a, lda, s, scond, amax, work, info )
C:
lapack_int LAPACKE_ssyequb( int matrix_order, char uplo, lapack_int n, const float* a, lapack_int lda, float* s, float* scond, float* amax );
lapack_int LAPACKE_dsyequb( int matrix_order, char uplo, lapack_int n, const double* a, lapack_int lda, double* s, double* scond, double* amax );
lapack_int LAPACKE_csyequb( int matrix_order, char uplo, lapack_int n, const lapack_complex_float* a, lapack_int lda, float* s, float* scond, float* amax );
lapack_int LAPACKE_zsyequb( int matrix_order, char uplo, lapack_int n, const lapack_complex_double* a, lapack_int lda, double* s, double* scond, double* amax );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine computes row and column scalings intended to equilibrate a symmetric indefinite matrix A and reduce its condition number (with respect to the two-norm).
The array s contains the scale factors, s(i) = 1/sqrt(A(i,i)). These factors are chosen so that the scaled matrix B with elements b(i,j) = s(i)*a(i,j)*s(j) has ones on the diagonal.
This choice of s puts the condition number of B within a factor n of the smallest possible condition number over all possible diagonal scalings.
Input Parameters
The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'.
Indicates whether the upper or lower triangular part of A is stored:
If uplo = 'U', the array a stores the upper triangular part of the matrix A.
If uplo = 'L', the array a stores the lower triangular part of the matrix A.
n INTEGER. The order of the matrix A; n ≥ 0.
a, work REAL for ssyequb; DOUBLE PRECISION for dsyequb; COMPLEX for csyequb; DOUBLE COMPLEX for zsyequb.
Array a: DIMENSION (lda,*).
Contains the n-by-n symmetric indefinite matrix A whose scaling factors are to be computed. Only the diagonal elements of A are referenced. The second dimension of a must be at least max(1,n).
work(*) is a workspace array. The dimension of work is at least max(1,3*n).
lda INTEGER. The leading dimension of a; lda ≥ max(1,n).
Output Parameters
s REAL for single precision flavors; DOUBLE PRECISION for double precision flavors.
Array, DIMENSION (n).
If info = 0, the array s contains the scale factors for A.
scond REAL for single precision flavors; DOUBLE PRECISION for double precision flavors.
If info = 0, scond contains the ratio of the smallest s(i) to the largest s(i). If scond ≥ 0.1, and amax is neither too large nor too small, it is not worth scaling by s.
amax REAL for single precision flavors; DOUBLE PRECISION for double precision flavors.
Absolute value of the largest element of the matrix A. If amax is very close to overflow or underflow, the matrix should be scaled.
info INTEGER.
If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
If info = i, the i-th diagonal element of A is nonpositive.

?heequb
Computes row and column scaling factors intended to equilibrate a Hermitian indefinite matrix and reduce its condition number.
Syntax
Fortran 77:
call cheequb( uplo, n, a, lda, s, scond, amax, work, info )
call zheequb( uplo, n, a, lda, s, scond, amax, work, info )
C:
lapack_int LAPACKE_cheequb( int matrix_order, char uplo, lapack_int n, const lapack_complex_float* a, lapack_int lda, float* s, float* scond, float* amax );
lapack_int LAPACKE_zheequb( int matrix_order, char uplo, lapack_int n, const lapack_complex_double* a, lapack_int lda, double* s, double* scond, double* amax );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine computes row and column scalings intended to equilibrate a Hermitian indefinite matrix A and reduce its condition number (with respect to the two-norm).
The array s contains the scale factors, s(i) = 1/sqrt(A(i,i)). These factors are chosen so that the scaled matrix B with elements b(i,j) = s(i)*a(i,j)*s(j) has ones on the diagonal.
This choice of s puts the condition number of B within a factor n of the smallest possible condition number over all possible diagonal scalings.
Input Parameters
The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'.
Indicates whether the upper or lower triangular part of A is stored:
If uplo = 'U', the array a stores the upper triangular part of the matrix A.
If uplo = 'L', the array a stores the lower triangular part of the matrix A.
n INTEGER. The order of the matrix A; n ≥ 0.
a, work COMPLEX for cheequb; DOUBLE COMPLEX for zheequb.
Array a: DIMENSION (lda,*).
Contains the n-by-n Hermitian indefinite matrix A whose scaling factors are to be computed. Only the diagonal elements of A are referenced. The second dimension of a must be at least max(1,n).
work(*) is a workspace array. The dimension of work is at least max(1,3*n).
lda INTEGER. The leading dimension of a; lda ≥ max(1,n).
Output Parameters
s REAL for cheequb; DOUBLE PRECISION for zheequb.
Array, DIMENSION (n).
If info = 0, the array s contains the scale factors for A.
scond REAL for cheequb; DOUBLE PRECISION for zheequb.
If info = 0, scond contains the ratio of the smallest s(i) to the largest s(i). If scond ≥ 0.1, and amax is neither too large nor too small, it is not worth scaling by s.
amax REAL for cheequb; DOUBLE PRECISION for zheequb.
Absolute value of the largest element of the matrix A. If amax is very close to overflow or underflow, the matrix should be scaled.
info INTEGER.
If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
If info = i, the i-th diagonal element of A is nonpositive.

Driver Routines
Table "Driver Routines for Solving Systems of Linear Equations" lists the LAPACK driver routines for solving systems of linear equations with real or complex matrices.
Driver Routines for Solving Systems of Linear Equations
Matrix type, storage scheme | Simple Driver | Expert Driver | Expert Driver using Extra-Precise Iterative Refinement
general | ?gesv | ?gesvx | ?gesvxx
general band | ?gbsv | ?gbsvx | ?gbsvxx
general tridiagonal | ?gtsv | ?gtsvx |
diagonally dominant tridiagonal | ?dtsvb | |
symmetric/Hermitian positive-definite | ?posv | ?posvx | ?posvxx
symmetric/Hermitian positive-definite, packed storage | ?ppsv | ?ppsvx |
symmetric/Hermitian positive-definite, band | ?pbsv | ?pbsvx |
symmetric/Hermitian positive-definite, tridiagonal | ?ptsv | ?ptsvx |
symmetric/Hermitian indefinite | ?sysv/?hesv | ?sysvx/?hesvx | ?sysvxx/?hesvxx
symmetric/Hermitian indefinite, packed storage | ?spsv/?hpsv | ?spsvx/?hpsvx |
complex symmetric | ?sysv | ?sysvx |
complex symmetric, packed storage | ?spsv | ?spsvx |

In this table ? stands for s (single precision real), d (double precision real), c (single precision complex), or z (double precision complex). In the descriptions of the ?gesv and ?posv routines, the ? sign also stands for the combined character codes ds and zc of the mixed precision subroutines.

?gesv
Computes the solution to the system of linear equations with a square matrix A and multiple right-hand sides.
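As a rough illustration of what a simple driver such as ?gesv computes (an LU factorization with partial pivoting followed by triangular solves), here is a minimal sketch in plain C. It is not the MKL implementation and does not call MKL: the function name lu_solve is hypothetical, it handles real double precision only, and it returns as soon as a zero pivot is found instead of completing the factorization. The argument order, column-major storage, 1-based ipiv values, and the meaning of the return value are assumptions chosen to resemble the Fortran 77 interface.

```c
#include <assert.h>
#include <math.h>

/* Hypothetical helper, not an MKL routine. Solves A*X = B in place.
 * a: n-by-n matrix, column-major, element (i,j) at a[i + j*lda].
 * b: n-by-nrhs right-hand sides, column-major, leading dimension ldb.
 * On exit a holds L (unit diagonal not stored) and U, b holds X, and
 * ipiv[k] is the 1-based index of the row swapped with row k+1.
 * Returns 0 on success, or i (1-based) if U(i,i) is exactly zero. */
int lu_solve(int n, int nrhs, double *a, int lda, int *ipiv,
             double *b, int ldb)
{
    assert(n >= 0 && nrhs >= 0);
    assert(lda >= (n > 1 ? n : 1) && ldb >= (n > 1 ? n : 1));

    for (int k = 0; k < n; ++k) {
        /* Partial pivoting: find the largest |a(i,k)| for i >= k. */
        int p = k;
        for (int i = k + 1; i < n; ++i)
            if (fabs(a[i + k * lda]) > fabs(a[p + k * lda]))
                p = i;
        ipiv[k] = p + 1;
        if (a[p + k * lda] == 0.0)
            return k + 1;                 /* exactly singular U */

        if (p != k) {                     /* swap rows k and p of A and B */
            for (int j = 0; j < n; ++j) {
                double t = a[k + j * lda];
                a[k + j * lda] = a[p + j * lda];
                a[p + j * lda] = t;
            }
            for (int j = 0; j < nrhs; ++j) {
                double t = b[k + j * ldb];
                b[k + j * ldb] = b[p + j * ldb];
                b[p + j * ldb] = t;
            }
        }

        /* Eliminate below the pivot; store multipliers (entries of L). */
        for (int i = k + 1; i < n; ++i) {
            double m = a[i + k * lda] / a[k + k * lda];
            a[i + k * lda] = m;
            for (int j = k + 1; j < n; ++j)
                a[i + j * lda] -= m * a[k + j * lda];
            for (int j = 0; j < nrhs; ++j)
                b[i + j * ldb] -= m * b[k + j * ldb];
        }
    }

    /* Back substitution with the upper triangular factor U. */
    for (int j = 0; j < nrhs; ++j)
        for (int i = n - 1; i >= 0; --i) {
            double s = b[i + j * ldb];
            for (int c = i + 1; c < n; ++c)
                s -= a[i + c * lda] * b[c + j * ldb];
            b[i + j * ldb] = s / a[i + i * lda];
        }
    return 0;
}
```

For the system 2x + y = 3, x + 3y = 5, stored column-major as a = {2, 1, 1, 3} and b = {3, 5}, the sketch overwrites b with the solution (0.8, 1.4). A production code would call the library driver instead, which also handles complex types and blocked factorization.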
Syntax
Fortran 77:
call sgesv( n, nrhs, a, lda, ipiv, b, ldb, info )
call dgesv( n, nrhs, a, lda, ipiv, b, ldb, info )
call cgesv( n, nrhs, a, lda, ipiv, b, ldb, info )
call zgesv( n, nrhs, a, lda, ipiv, b, ldb, info )
call dsgesv( n, nrhs, a, lda, ipiv, b, ldb, x, ldx, work, swork, iter, info )
call zcgesv( n, nrhs, a, lda, ipiv, b, ldb, x, ldx, work, swork, rwork, iter, info )
Fortran 95:
call gesv( a, b [,ipiv] [,info] )
C:
lapack_int LAPACKE_<?>gesv( int matrix_order, lapack_int n, lapack_int nrhs, <datatype>* a, lapack_int lda, lapack_int* ipiv, <datatype>* b, lapack_int ldb );
lapack_int LAPACKE_dsgesv( int matrix_order, lapack_int n, lapack_int nrhs, double* a, lapack_int lda, lapack_int* ipiv, double* b, lapack_int ldb, double* x, lapack_int ldx, lapack_int* iter );
lapack_int LAPACKE_zcgesv( int matrix_order, lapack_int n, lapack_int nrhs, lapack_complex_double* a, lapack_int lda, lapack_int* ipiv, lapack_complex_double* b, lapack_int ldb, lapack_complex_double* x, lapack_int ldx, lapack_int* iter );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine solves for X the system of linear equations A*X = B, where A is an n-by-n matrix, the columns of matrix B are individual right-hand sides, and the columns of X are the corresponding solutions. The LU decomposition with partial pivoting and row interchanges is used to factor A as A = P*L*U, where P is a permutation matrix, L is unit lower triangular, and U is upper triangular. The factored form of A is then used to solve the system of equations A*X = B.
The dsgesv and zcgesv are mixed precision iterative refinement subroutines for exploiting fast single precision hardware.
They first attempt to factorize the matrix in single precision (dsgesv) or single complex precision (zcgesv) and use this factorization within an iterative refinement procedure to produce a solution with double precision (dsgesv) / double complex precision (zcgesv) normwise backward error quality (see below). If the approach fails, the method switches to a double precision or double complex precision factorization respectively and computes the solution.
Iterative refinement is not a winning strategy if the ratio of single precision performance to double precision performance is too small. A reasonable strategy should take the number of right-hand sides and the size of the matrix into account. This might be done with a call to ilaenv in the future. At present, iterative refinement is always attempted.
The iterative refinement process is stopped if iter > itermax or, for all the right-hand sides:
rnrm < sqrt(n)*xnrm*anrm*eps*bwdmax
where
• iter is the number of the current iteration in the iterative refinement process
• rnrm is the infinity-norm of the residual
• xnrm is the infinity-norm of the solution
• anrm is the infinity-operator-norm of the matrix A
• eps is the machine epsilon returned by dlamch('Epsilon').
The values itermax and bwdmax are fixed to 30 and 1.0d+00 respectively.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
n INTEGER. The number of linear equations, that is, the order of the matrix A; n ≥ 0.
nrhs INTEGER. The number of right-hand sides, that is, the number of columns of the matrix B; nrhs ≥ 0.
a, b REAL for sgesv DOUBLE PRECISION for dgesv and dsgesv COMPLEX for cgesv DOUBLE COMPLEX for zgesv and zcgesv. Arrays: a(lda,*), b(ldb,*).
The array a contains the n-by-n coefficient matrix A. The array b contains the n-by-nrhs right-hand side matrix B. The second dimension of a must be at least max(1, n), the second dimension of b at least max(1,nrhs).
lda INTEGER. The leading dimension of the array a; lda ≥ max(1, n).
ldb INTEGER. The leading dimension of the array b; ldb ≥ max(1, n).
ldx INTEGER. The leading dimension of the array x; ldx ≥ max(1, n).
work DOUBLE PRECISION for dsgesv DOUBLE COMPLEX for zcgesv. Workspace array, DIMENSION at least max(1,n*nrhs). This array is used to hold the residual vectors.
swork REAL for dsgesv COMPLEX for zcgesv. Workspace array, DIMENSION at least max(1,n*(n+nrhs)). This array is used to hold the single precision matrix and the right-hand sides or solutions in single precision.
rwork DOUBLE PRECISION. Workspace array, DIMENSION at least max(1,n).

Output Parameters
a Overwritten by the factors L and U from the factorization of A = P*L*U; the unit diagonal elements of L are not stored. For the mixed precision routines: if iterative refinement has been successfully used (info = 0 and iter ≥ 0), then A is unchanged. If double precision factorization has been used (info = 0 and iter < 0), then the array A contains the factors L and U from the factorization A = P*L*U; the unit diagonal elements of L are not stored.
b Overwritten by the solution matrix X for sgesv, dgesv, cgesv, zgesv. Unchanged for dsgesv and zcgesv.
ipiv INTEGER. Array, DIMENSION at least max(1, n). The pivot indices that define the permutation matrix P; row i of the matrix was interchanged with row ipiv(i). Corresponds to the single precision factorization (if info = 0 and iter ≥ 0) or the double precision factorization (if info = 0 and iter < 0).
x DOUBLE PRECISION for dsgesv DOUBLE COMPLEX for zcgesv. Array, DIMENSION (ldx, nrhs). If info = 0, contains the n-by-nrhs solution matrix X.
iter INTEGER.
If iter < 0: iterative refinement has failed, double precision factorization has been performed
• If iter = -1: the routine fell back to full precision for an implementation- or machine-specific reason
• If iter = -2: narrowing the precision induced an overflow, the routine fell back to full precision
• If iter = -3: failure of sgetrf for dsgesv, or cgetrf for zcgesv
• If iter = -31: the iterative refinement was stopped after the 30th iteration.
If iter > 0: iterative refinement has been successfully used. Returns the number of iterations.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value. If info = i, U(i, i) (computed in double precision for mixed precision subroutines) is exactly zero. The factorization has been completed, but the factor U is exactly singular, so the solution could not be computed.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions. Specific details for the routine gesv interface are as follows:
a Holds the matrix A of size (n,n).
b Holds the matrix B of size (n,nrhs).
ipiv Holds the vector of length n.
NOTE Fortran 95 Interface is so far not available for the mixed precision subroutines dsgesv/zcgesv.

See Also
ilaenv ?lamch ?getrf

?gesvx
Computes the solution to the system of linear equations with a square matrix A and multiple right-hand sides, and provides error bounds on the solution.
Syntax
Fortran 77:
call sgesvx( fact, trans, n, nrhs, a, lda, af, ldaf, ipiv, equed, r, c, b, ldb, x, ldx, rcond, ferr, berr, work, iwork, info )
call dgesvx( fact, trans, n, nrhs, a, lda, af, ldaf, ipiv, equed, r, c, b, ldb, x, ldx, rcond, ferr, berr, work, iwork, info )
call cgesvx( fact, trans, n, nrhs, a, lda, af, ldaf, ipiv, equed, r, c, b, ldb, x, ldx, rcond, ferr, berr, work, rwork, info )
call zgesvx( fact, trans, n, nrhs, a, lda, af, ldaf, ipiv, equed, r, c, b, ldb, x, ldx, rcond, ferr, berr, work, rwork, info )
Fortran 95:
call gesvx( a, b, x [,af] [,ipiv] [,fact] [,trans] [,equed] [,r] [,c] [,ferr] [,berr] [,rcond] [,rpvgrw] [,info] )
C:
lapack_int LAPACKE_sgesvx( int matrix_order, char fact, char trans, lapack_int n, lapack_int nrhs, float* a, lapack_int lda, float* af, lapack_int ldaf, lapack_int* ipiv, char* equed, float* r, float* c, float* b, lapack_int ldb, float* x, lapack_int ldx, float* rcond, float* ferr, float* berr, float* rpivot );
lapack_int LAPACKE_dgesvx( int matrix_order, char fact, char trans, lapack_int n, lapack_int nrhs, double* a, lapack_int lda, double* af, lapack_int ldaf, lapack_int* ipiv, char* equed, double* r, double* c, double* b, lapack_int ldb, double* x, lapack_int ldx, double* rcond, double* ferr, double* berr, double* rpivot );
lapack_int LAPACKE_cgesvx( int matrix_order, char fact, char trans, lapack_int n, lapack_int nrhs, lapack_complex_float* a, lapack_int lda, lapack_complex_float* af, lapack_int ldaf, lapack_int* ipiv, char* equed, float* r, float* c, lapack_complex_float* b, lapack_int ldb, lapack_complex_float* x, lapack_int ldx, float* rcond, float* ferr, float* berr, float* rpivot );
lapack_int LAPACKE_zgesvx( int matrix_order, char fact, char trans, lapack_int n, lapack_int nrhs, lapack_complex_double* a, lapack_int lda, lapack_complex_double* af, lapack_int ldaf, lapack_int* ipiv, char* equed, double* r, double* c, lapack_complex_double* b, lapack_int ldb,
lapack_complex_double* x, lapack_int ldx, double* rcond, double* ferr, double* berr, double* rpivot );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine uses the LU factorization to compute the solution to a real or complex system of linear equations A*X = B, where A is an n-by-n matrix, the columns of matrix B are individual right-hand sides, and the columns of X are the corresponding solutions. Error bounds on the solution and a condition estimate are also provided.
The routine ?gesvx performs the following steps:
1. If fact = 'E', real scaling factors r and c are computed to equilibrate the system:
trans = 'N': diag(r)*A*diag(c)*inv(diag(c))*X = diag(r)*B
trans = 'T': (diag(r)*A*diag(c))**T*inv(diag(r))*X = diag(c)*B
trans = 'C': (diag(r)*A*diag(c))**H*inv(diag(r))*X = diag(c)*B
Whether or not the system will be equilibrated depends on the scaling of the matrix A, but if equilibration is used, A is overwritten by diag(r)*A*diag(c) and B by diag(r)*B (if trans = 'N') or diag(c)*B (if trans = 'T' or 'C').
2. If fact = 'N' or 'E', the LU decomposition is used to factor the matrix A (after equilibration if fact = 'E') as A = P*L*U, where P is a permutation matrix, L is a unit lower triangular matrix, and U is upper triangular.
3. If some U(i,i) = 0, so that U is exactly singular, then the routine returns with info = i. Otherwise, the factored form of A is used to estimate the condition number of the matrix A. If the reciprocal of the condition number is less than machine precision, info = n + 1 is returned as a warning, but the routine still goes on to solve for X and compute error bounds as described below.
4. The system of equations is solved for X using the factored form of A.
5. Iterative refinement is applied to improve the computed solution matrix and calculate error bounds and backward error estimates for it.
6.
If equilibration was used, the matrix X is premultiplied by diag(c) (if trans = 'N') or diag(r) (if trans = 'T' or 'C') so that it solves the original system before equilibration.

Input Parameters
The data types are given for the Fortran interface, except for rpivot. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
fact CHARACTER*1. Must be 'F', 'N', or 'E'. Specifies whether or not the factored form of the matrix A is supplied on entry, and if not, whether the matrix A should be equilibrated before it is factored. If fact = 'F': on entry, af and ipiv contain the factored form of A. If equed is not 'N', the matrix A has been equilibrated with scaling factors given by r and c. a, af, and ipiv are not modified. If fact = 'N', the matrix A will be copied to af and factored. If fact = 'E', the matrix A will be equilibrated if necessary, then copied to af and factored.
trans CHARACTER*1. Must be 'N', 'T', or 'C'. Specifies the form of the system of equations: If trans = 'N', the system has the form A*X = B (No transpose). If trans = 'T', the system has the form A**T*X = B (Transpose). If trans = 'C', the system has the form A**H*X = B (Transpose for real flavors, conjugate transpose for complex flavors).
n INTEGER. The number of linear equations; the order of the matrix A; n ≥ 0.
nrhs INTEGER. The number of right-hand sides; the number of columns of the matrices B and X; nrhs ≥ 0.
a, af, b, work REAL for sgesvx DOUBLE PRECISION for dgesvx COMPLEX for cgesvx DOUBLE COMPLEX for zgesvx. Arrays: a(lda,*), af(ldaf,*), b(ldb,*), work(*). The array a contains the matrix A. If fact = 'F' and equed is not 'N', then A must have been equilibrated by the scaling factors in r and/or c. The second dimension of a must be at least max(1,n). The array af is an input argument if fact = 'F'.
It contains the factored form of the matrix A, that is, the factors L and U from the factorization A = P*L*U as computed by ?getrf. If equed is not 'N', then af is the factored form of the equilibrated matrix A. The second dimension of af must be at least max(1,n). The array b contains the matrix B whose columns are the right-hand sides for the systems of equations. The second dimension of b must be at least max(1, nrhs). work(*) is a workspace array. The dimension of work must be at least max(1,4*n) for real flavors, and at least max(1,2*n) for complex flavors.
lda INTEGER. The leading dimension of a; lda ≥ max(1, n).
ldaf INTEGER. The leading dimension of af; ldaf ≥ max(1, n).
ldb INTEGER. The leading dimension of b; ldb ≥ max(1, n).
ipiv INTEGER. Array, DIMENSION at least max(1, n). The array ipiv is an input argument if fact = 'F'. It contains the pivot indices from the factorization A = P*L*U as computed by ?getrf; row i of the matrix was interchanged with row ipiv(i).
equed CHARACTER*1. Must be 'N', 'R', 'C', or 'B'. equed is an input argument if fact = 'F'. It specifies the form of equilibration that was done: If equed = 'N', no equilibration was done (always true if fact = 'N'). If equed = 'R', row equilibration was done, that is, A has been premultiplied by diag(r). If equed = 'C', column equilibration was done, that is, A has been postmultiplied by diag(c). If equed = 'B', both row and column equilibration was done, that is, A has been replaced by diag(r)*A*diag(c).
r, c REAL for single precision flavors DOUBLE PRECISION for double precision flavors. Arrays: r(n), c(n). The array r contains the row scale factors for A, and the array c contains the column scale factors for A. These arrays are input arguments if fact = 'F' only; otherwise they are output arguments. If equed = 'R' or 'B', A is multiplied on the left by diag(r); if equed = 'N' or 'C', r is not accessed. If fact = 'F' and equed = 'R' or 'B', each element of r must be positive.
If equed = 'C' or 'B', A is multiplied on the right by diag(c); if equed = 'N' or 'R', c is not accessed. If fact = 'F' and equed = 'C' or 'B', each element of c must be positive.
ldx INTEGER. The leading dimension of the output array x; ldx ≥ max(1, n).
iwork INTEGER. Workspace array, DIMENSION at least max(1, n); used in real flavors only.
rwork REAL for single precision flavors DOUBLE PRECISION for double precision flavors. Workspace array, DIMENSION at least max(1, 2*n); used in complex flavors only.

Output Parameters
x REAL for sgesvx DOUBLE PRECISION for dgesvx COMPLEX for cgesvx DOUBLE COMPLEX for zgesvx. Array, DIMENSION (ldx,*). If info = 0 or info = n+1, the array x contains the solution matrix X to the original system of equations. Note that A and B are modified on exit if equed ≠ 'N', and the solution to the equilibrated system is: inv(diag(c))*X, if trans = 'N' and equed = 'C' or 'B'; inv(diag(r))*X, if trans = 'T' or 'C' and equed = 'R' or 'B'. The second dimension of x must be at least max(1,nrhs).
a Array a is not modified on exit if fact = 'F' or 'N', or if fact = 'E' and equed = 'N'. If equed ≠ 'N', A is scaled on exit as follows:
equed = 'R': A = diag(r)*A
equed = 'C': A = A*diag(c)
equed = 'B': A = diag(r)*A*diag(c).
af If fact = 'N' or 'E', then af is an output argument and on exit returns the factors L and U from the factorization A = P*L*U of the original matrix A (if fact = 'N') or of the equilibrated matrix A (if fact = 'E'). See the description of a for the form of the equilibrated matrix.
b Overwritten by diag(r)*B if trans = 'N' and equed = 'R' or 'B'; overwritten by diag(c)*B if trans = 'T' or 'C' and equed = 'C' or 'B'; not changed if equed = 'N'.
r, c These arrays are output arguments if fact ≠ 'F'. See the description of r, c in the Input Parameters section.
rcond REAL for single precision flavors DOUBLE PRECISION for double precision flavors.
An estimate of the reciprocal condition number of the matrix A after equilibration (if done). If rcond is less than the machine precision, in particular, if rcond = 0, the matrix is singular to working precision. This condition is indicated by a return code of info > 0.
ferr REAL for single precision flavors DOUBLE PRECISION for double precision flavors. Array, DIMENSION at least max(1, nrhs). Contains the estimated forward error bound for each solution vector x(j) (the j-th column of the solution matrix X). If xtrue is the true solution corresponding to x(j), ferr(j) is an estimated upper bound for the magnitude of the largest element in (x(j) - xtrue) divided by the magnitude of the largest element in x(j). The estimate is as reliable as the estimate for rcond, and is almost always a slight overestimate of the true error.
berr REAL for single precision flavors DOUBLE PRECISION for double precision flavors. Array, DIMENSION at least max(1, nrhs). Contains the componentwise relative backward error for each solution vector x(j), that is, the smallest relative change in any element of A or B that makes x(j) an exact solution.
ipiv If fact = 'N' or 'E', then ipiv is an output argument and on exit contains the pivot indices from the factorization A = P*L*U of the original matrix A (if fact = 'N') or of the equilibrated matrix A (if fact = 'E').
equed If fact ≠ 'F', then equed is an output argument. It specifies the form of equilibration that was done (see the description of equed in the Input Parameters section).
work, rwork, rpivot On exit, work(1) for real flavors, or rwork(1) for complex flavors (the Fortran interface) and rpivot (the C interface), contains the reciprocal pivot growth factor norm(A)/norm(U). The "max absolute element" norm is used. If work(1) for real flavors, or rwork(1) for complex flavors is much less than 1, then the stability of the LU factorization of the (equilibrated) matrix A could be poor.
This also means that the solution x, condition estimator rcond, and forward error bound ferr could be unreliable. If factorization fails with 0 < info ≤ n, then work(1) for real flavors, or rwork(1) for complex flavors contains the reciprocal pivot growth factor for the leading info columns of A.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value. If info = i, and i ≤ n, then U(i, i) is exactly zero. The factorization has been completed, but the factor U is exactly singular, so the solution and error bounds could not be computed; rcond = 0 is returned. If info = i, and i = n+1, then U is nonsingular, but rcond is less than machine precision, meaning that the matrix is singular to working precision. Nevertheless, the solution and error bounds are computed because there are a number of situations where the computed solution can be more accurate than the value of rcond would suggest.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions. Specific details for the routine gesvx interface are as follows:
a Holds the matrix A of size (n,n).
b Holds the matrix B of size (n,nrhs).
x Holds the matrix X of size (n,nrhs).
af Holds the matrix AF of size (n,n).
ipiv Holds the vector of length n.
r Holds the vector of length n. Default value for each element is r(i) = 1.0_WP.
c Holds the vector of length n. Default value for each element is c(i) = 1.0_WP.
ferr Holds the vector of length (nrhs).
berr Holds the vector of length (nrhs).
fact Must be 'N', 'E', or 'F'. The default value is 'N'. If fact = 'F', then both arguments af and ipiv must be present; otherwise, an error is returned.
trans Must be 'N', 'C', or 'T'. The default value is 'N'.
equed Must be 'N', 'B', 'C', or 'R'. The default value is 'N'.
rpvgrw Real value that contains the reciprocal pivot growth factor norm(A)/norm(U).

?gesvxx
Uses extra precise iterative refinement to compute the solution to the system of linear equations with a square matrix A and multiple right-hand sides.

Syntax
Fortran 77:
call sgesvxx( fact, trans, n, nrhs, a, lda, af, ldaf, ipiv, equed, r, c, b, ldb, x, ldx, rcond, rpvgrw, berr, n_err_bnds, err_bnds_norm, err_bnds_comp, nparams, params, work, iwork, info )
call dgesvxx( fact, trans, n, nrhs, a, lda, af, ldaf, ipiv, equed, r, c, b, ldb, x, ldx, rcond, rpvgrw, berr, n_err_bnds, err_bnds_norm, err_bnds_comp, nparams, params, work, iwork, info )
call cgesvxx( fact, trans, n, nrhs, a, lda, af, ldaf, ipiv, equed, r, c, b, ldb, x, ldx, rcond, rpvgrw, berr, n_err_bnds, err_bnds_norm, err_bnds_comp, nparams, params, work, rwork, info )
call zgesvxx( fact, trans, n, nrhs, a, lda, af, ldaf, ipiv, equed, r, c, b, ldb, x, ldx, rcond, rpvgrw, berr, n_err_bnds, err_bnds_norm, err_bnds_comp, nparams, params, work, rwork, info )
C:
lapack_int LAPACKE_sgesvxx( int matrix_order, char fact, char trans, lapack_int n, lapack_int nrhs, float* a, lapack_int lda, float* af, lapack_int ldaf, lapack_int* ipiv, char* equed, float* r, float* c, float* b, lapack_int ldb, float* x, lapack_int ldx, float* rcond, float* rpvgrw, float* berr, lapack_int n_err_bnds, float* err_bnds_norm, float* err_bnds_comp, lapack_int nparams, const float* params );
lapack_int LAPACKE_dgesvxx( int matrix_order, char fact, char trans, lapack_int n, lapack_int nrhs, double* a, lapack_int lda, double* af, lapack_int ldaf, lapack_int* ipiv, char* equed, double* r, double* c, double* b, lapack_int ldb, double* x, lapack_int ldx, double* rcond, double* rpvgrw, double* berr, lapack_int n_err_bnds, double* err_bnds_norm, double* err_bnds_comp, lapack_int nparams, const double* params );
lapack_int LAPACKE_cgesvxx( int matrix_order, char fact, char trans, lapack_int n,
lapack_int nrhs, lapack_complex_float* a, lapack_int lda, lapack_complex_float* af, lapack_int ldaf, lapack_int* ipiv, char* equed, float* r, float* c, lapack_complex_float* b, lapack_int ldb, lapack_complex_float* x, lapack_int ldx, float* rcond, float* rpvgrw, float* berr, lapack_int n_err_bnds, float* err_bnds_norm, float* err_bnds_comp, lapack_int nparams, const float* params );
lapack_int LAPACKE_zgesvxx( int matrix_order, char fact, char trans, lapack_int n, lapack_int nrhs, lapack_complex_double* a, lapack_int lda, lapack_complex_double* af, lapack_int ldaf, lapack_int* ipiv, char* equed, double* r, double* c, lapack_complex_double* b, lapack_int ldb, lapack_complex_double* x, lapack_int ldx, double* rcond, double* rpvgrw, double* berr, lapack_int n_err_bnds, double* err_bnds_norm, double* err_bnds_comp, lapack_int nparams, const double* params );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine uses the LU factorization to compute the solution to a real or complex system of linear equations A*X = B, where A is an n-by-n matrix, the columns of the matrix B are individual right-hand sides, and the columns of X are the corresponding solutions.
Both normwise and maximum componentwise error bounds are also provided on request. The routine returns a solution with a small guaranteed error (O(eps), where eps is the working machine precision) unless the matrix is very ill-conditioned, in which case a warning is returned. Relevant condition numbers are also calculated and returned.
The routine accepts user-provided factorizations and equilibration factors; see definitions of the fact and equed options.
Solving with refinement and using a factorization from a previous call of the routine also produces a solution with O(eps) errors or warnings, but that may not be true for general user-provided factorizations and equilibration factors if they differ from what the routine would itself produce.
The routine ?gesvxx performs the following steps:
1. If fact = 'E', scaling factors r and c are computed to equilibrate the system:
trans = 'N': diag(r)*A*diag(c)*inv(diag(c))*X = diag(r)*B
trans = 'T': (diag(r)*A*diag(c))**T*inv(diag(r))*X = diag(c)*B
trans = 'C': (diag(r)*A*diag(c))**H*inv(diag(r))*X = diag(c)*B
Whether or not the system will be equilibrated depends on the scaling of the matrix A, but if equilibration is used, A is overwritten by diag(r)*A*diag(c) and B by diag(r)*B (if trans = 'N') or diag(c)*B (if trans = 'T' or 'C').
2. If fact = 'N' or 'E', the LU decomposition is used to factor the matrix A (after equilibration if fact = 'E') as A = P*L*U, where P is a permutation matrix, L is a unit lower triangular matrix, and U is upper triangular.
3. If some U(i,i) = 0, so that U is exactly singular, then the routine returns with info = i. Otherwise, the factored form of A is used to estimate the condition number of the matrix A (see the rcond parameter). If the reciprocal of the condition number is less than machine precision, the routine still goes on to solve for X and compute error bounds.
4. The system of equations is solved for X using the factored form of A.
5. By default, unless params(la_linrx_itref_i) is set to zero, the routine applies iterative refinement to improve the computed solution matrix and calculate error bounds. Refinement calculates the residual to at least twice the working precision.
6. If equilibration was used, the matrix X is premultiplied by diag(c) (if trans = 'N') or diag(r) (if trans = 'T' or 'C') so that it solves the original system before equilibration.
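Steps 1 and 6 for the trans = 'N' case can be sketched in plain C as follows. This is only an illustration under stated assumptions: the helper names scale_factors, equilibrate, and unscale_solution are hypothetical and are not MKL routines, the rule used to pick r and c (reciprocal of the largest absolute entry in each row, then in each scaled column) is an assumed simple choice rather than the rule the library applies, and the factorization, solve, and refinement of steps 2 to 5 are omitted.

```c
#include <assert.h>
#include <math.h>

/* Hypothetical helpers for illustration; none of these are MKL routines.
 * Matrices are column-major: element (i,j) of a is a[i + j*lda]. */

/* Step 1 (assumed rule): r[i] = 1 / max_j |a(i,j)|, then
 * c[j] = 1 / max_i |r[i]*a(i,j)|. Returns 0 on success, or a positive
 * value if some row or column of A is entirely zero. */
int scale_factors(int n, const double *a, int lda, double *r, double *c)
{
    assert(n >= 0 && lda >= (n > 1 ? n : 1));
    for (int i = 0; i < n; ++i) {
        double big = 0.0;
        for (int j = 0; j < n; ++j) {
            double v = fabs(a[i + j * lda]);
            if (v > big) big = v;
        }
        if (big == 0.0) return i + 1;       /* zero row */
        r[i] = 1.0 / big;
    }
    for (int j = 0; j < n; ++j) {
        double big = 0.0;
        for (int i = 0; i < n; ++i) {
            double v = r[i] * fabs(a[i + j * lda]);
            if (v > big) big = v;
        }
        if (big == 0.0) return n + j + 1;   /* zero column */
        c[j] = 1.0 / big;
    }
    return 0;
}

/* Step 1, continued (trans = 'N'):
 * A := diag(r)*A*diag(c) and B := diag(r)*B. */
void equilibrate(int n, int nrhs, double *a, int lda, double *b, int ldb,
                 const double *r, const double *c)
{
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < n; ++i)
            a[i + j * lda] *= r[i] * c[j];
    for (int j = 0; j < nrhs; ++j)
        for (int i = 0; i < n; ++i)
            b[i + j * ldb] *= r[i];
}

/* Step 6 (trans = 'N'): the scaled system was solved for
 * Y = inv(diag(c))*X, so X := diag(c)*Y recovers the solution
 * of the original system. */
void unscale_solution(int n, int nrhs, double *x, int ldx, const double *c)
{
    for (int j = 0; j < nrhs; ++j)
        for (int i = 0; i < n; ++i)
            x[i + j * ldx] *= c[i];
}
```

With A = [8 2; 1 0.5] and B = (12, 2), the assumed rule gives r = (0.125, 1) and c = (1, 2), so the scaled matrix is [1 0.5; 1 1] and the scaled right-hand side is (1.5, 2). Solving that system yields Y = (1, 1), and premultiplying by diag(c) returns X = (1, 2), which satisfies the original equations. The scale factors in this example are powers of two, so the scaling itself introduces no rounding error.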
Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
fact CHARACTER*1. Must be 'F', 'N', or 'E'. Specifies whether or not the factored form of the matrix A is supplied on entry, and if not, whether the matrix A should be equilibrated before it is factored. If fact = 'F', on entry, af and ipiv contain the factored form of A. If equed is not 'N', the matrix A has been equilibrated with scaling factors given by r and c. Parameters a, af, and ipiv are not modified. If fact = 'N', the matrix A will be copied to af and factored. If fact = 'E', the matrix A will be equilibrated, if necessary, then copied to af and factored.
trans CHARACTER*1. Must be 'N', 'T', or 'C'. Specifies the form of the system of equations: If trans = 'N', the system has the form A*X = B (No transpose). If trans = 'T', the system has the form A**T*X = B (Transpose). If trans = 'C', the system has the form A**H*X = B (Transpose for real flavors, conjugate transpose for complex flavors).
n INTEGER. The number of linear equations; the order of the matrix A; n ≥ 0.
nrhs INTEGER. The number of right-hand sides; the number of columns of the matrices B and X; nrhs ≥ 0.
a, af, b, work REAL for sgesvxx DOUBLE PRECISION for dgesvxx COMPLEX for cgesvxx DOUBLE COMPLEX for zgesvxx. Arrays: a(lda,*), af(ldaf,*), b(ldb,*), work(*). The array a contains the matrix A. If fact = 'F' and equed is not 'N', then A must have been equilibrated by the scaling factors in r and/or c. The second dimension of a must be at least max(1,n). The array af is an input argument if fact = 'F'. It contains the factored form of the matrix A, that is, the factors L and U from the factorization A = P*L*U as computed by ?getrf.
If equed is not 'N', then af is the factored form of the equilibrated matrix A. The second dimension of af must be at least max(1,n). The array b contains the matrix B whose columns are the right-hand sides for the systems of equations. The second dimension of b must be at least max(1,nrhs). work(*) is a workspace array. The dimension of work must be at least max(1,4*n) for real flavors, and at least max(1,2*n) for complex flavors.
lda INTEGER. The leading dimension of a; lda ≥ max(1, n).
ldaf INTEGER. The leading dimension of af; ldaf ≥ max(1, n).
ipiv INTEGER. Array, DIMENSION at least max(1, n). The array ipiv is an input argument if fact = 'F'. It contains the pivot indices from the factorization A = P*L*U as computed by ?getrf; row i of the matrix was interchanged with row ipiv(i).
equed CHARACTER*1. Must be 'N', 'R', 'C', or 'B'. equed is an input argument if fact = 'F'. It specifies the form of equilibration that was done: If equed = 'N', no equilibration was done (always true if fact = 'N'). If equed = 'R', row equilibration was done, that is, A has been premultiplied by diag(r). If equed = 'C', column equilibration was done, that is, A has been postmultiplied by diag(c). If equed = 'B', both row and column equilibration was done, that is, A has been replaced by diag(r)*A*diag(c).
r, c REAL for single precision flavors DOUBLE PRECISION for double precision flavors. Arrays: r(n), c(n). The array r contains the row scale factors for A, and the array c contains the column scale factors for A. These arrays are input arguments if fact = 'F' only; otherwise they are output arguments. If equed = 'R' or 'B', A is multiplied on the left by diag(r); if equed = 'N' or 'C', r is not accessed. If fact = 'F' and equed = 'R' or 'B', each element of r must be positive. If equed = 'C' or 'B', A is multiplied on the right by diag(c); if equed = 'N' or 'R', c is not accessed.
If fact = 'F' and equed = 'C' or 'B', each element of c must be positive. Each element of r or c should be a power of the radix to ensure a reliable solution and error estimates. Scaling by powers of the radix does not cause rounding errors unless the result underflows or overflows. Rounding errors during scaling lead to refining with a matrix that is not equivalent to the input matrix, producing error estimates that may not be reliable.
ldb INTEGER. The leading dimension of the array b; ldb ≥ max(1, n).
ldx INTEGER. The leading dimension of the output array x; ldx ≥ max(1, n).
n_err_bnds INTEGER. Number of error bounds to return for each right-hand side and each type (normwise or componentwise). See err_bnds_norm and err_bnds_comp descriptions in the Output Parameters section below.
nparams INTEGER. Specifies the number of parameters set in params. If nparams ≤ 0, the params array is never referenced and default values are used.
params REAL for single precision flavors DOUBLE PRECISION for double precision flavors. Array, DIMENSION nparams. Specifies algorithm parameters. If an entry is less than 0.0, that entry is filled with the default value used for that parameter. Only positions up to nparams are accessed; defaults are used for higher-numbered parameters. If defaults are acceptable, you can pass nparams = 0, which prevents the source code from accessing the params argument.
params(la_linrx_itref_i = 1): Whether to perform iterative refinement or not. Default: 1.0
= 0.0: No refinement is performed and no error bounds are computed.
= 1.0: Use the double-precision refinement algorithm, possibly with doubled-single computations if the compilation environment does not support DOUBLE PRECISION.
(Other values are reserved for future use.)
params(la_linrx_ithresh_i = 2): Maximum number of residual computations allowed for refinement.
Default: 10. Aggressive: set to 100 to permit convergence using approximate factorizations or factorizations other than LU. If the factorization uses a technique other than Gaussian elimination, the guarantees in err_bnds_norm and err_bnds_comp may no longer be trustworthy.
params(la_linrx_cwise_i = 3): Flag determining if the code will attempt to find a solution with a small componentwise relative error in the double-precision algorithm. Positive is true, 0.0 is false. Default: 1.0 (attempt componentwise convergence).
iwork INTEGER. Workspace array, DIMENSION at least max(1, n); used in real flavors only.
rwork REAL for single precision flavors; DOUBLE PRECISION for double precision flavors. Workspace array, DIMENSION at least max(1, 3*n); used in complex flavors only.
Output Parameters
x REAL for sgesvxx; DOUBLE PRECISION for dgesvxx; COMPLEX for cgesvxx; DOUBLE COMPLEX for zgesvxx. Array, DIMENSION (ldx,*). If info = 0, the array x contains the n-by-nrhs solution matrix X to the original system of equations. Note that A and B are modified on exit if equed ≠ 'N', and the solution to the equilibrated system is: inv(diag(c))*X, if trans = 'N' and equed = 'C' or 'B'; or inv(diag(r))*X, if trans = 'T' or 'C' and equed = 'R' or 'B'. The second dimension of x must be at least max(1,nrhs).
a Array a is not modified on exit if fact = 'F' or 'N', or if fact = 'E' and equed = 'N'. If equed ≠ 'N', A is scaled on exit as follows:
equed = 'R': A = diag(r)*A
equed = 'C': A = A*diag(c)
equed = 'B': A = diag(r)*A*diag(c).
af If fact = 'N' or 'E', then af is an output argument and on exit returns the factors L and U from the factorization A = P*L*U of the original matrix A (if fact = 'N') or of the equilibrated matrix A (if fact = 'E'). See the description of a for the form of the equilibrated matrix.
b Overwritten by diag(r)*B if trans = 'N' and equed = 'R' or 'B'; overwritten by diag(c)*B if trans = 'T' or 'C' and equed = 'C' or 'B'; not changed if equed = 'N'.
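The rule for resolving params entries (negative entries and positions beyond nparams take the default) can be sketched in C as follows. This is only an illustration of the documented rule, not library code: the helper name resolve_params, the enum names, and the 0-based indexing are assumptions made for the example, and the defaults are the ones stated above (1.0, 10, 1.0).

```c
#include <assert.h>
#include <stddef.h>

/* 0-based positions of the three documented parameters (hypothetical names). */
enum { ITREF = 0, ITHRESH = 1, CWISE = 2, NUM_PARAMS = 3 };

/* Documented defaults: refinement on, 10 residual computations, componentwise on. */
static const double default_params[NUM_PARAMS] = { 1.0, 10.0, 1.0 };

/*
 * Hypothetical helper: writes the effective value of every parameter to out.
 * Positions at or beyond nparams, and entries that are negative, fall back to
 * the documented default. params may be NULL when nparams == 0.
 */
void resolve_params(int nparams, const double *params, double out[NUM_PARAMS])
{
    assert(nparams >= 0);
    assert(nparams == 0 || params != NULL);
    for (int k = 0; k < NUM_PARAMS; ++k) {
        if (k < nparams && params[k] >= 0.0)
            out[k] = params[k];          /* caller supplied a usable value */
        else
            out[k] = default_params[k];  /* not supplied, or negative */
    }
}
```

For instance, calling resolve_params with nparams = 2 and params = {0.0, -1.0} yields refinement switched off, the default of 10 residual computations, and componentwise convergence left at its default.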
r, c These arrays are output arguments if fact ≠ 'F'. Each element of these arrays is a power of the radix. See the description of r, c in the Input Parameters section.
rcond REAL for single precision flavors; DOUBLE PRECISION for double precision flavors. Reciprocal scaled condition number. An estimate of the reciprocal Skeel condition number of the matrix A after equilibration (if done). If rcond is less than the machine precision, in particular, if rcond = 0, the matrix is singular to working precision. Note that the error may still be small even if this number is very small and the matrix appears ill-conditioned.
rpvgrw REAL for single precision flavors; DOUBLE PRECISION for double precision flavors. Contains the reciprocal pivot growth factor norm(A)/norm(U). The max absolute element norm is used. If this is much less than 1, the stability of the LU factorization of the (equilibrated) matrix A could be poor. This also means that the solution X, estimated condition numbers, and error bounds could be unreliable. If factorization fails with 0 < info ≤ n, this parameter contains the reciprocal pivot growth factor for the leading info columns of A. In ?gesvx, this quantity is returned in work(1).
berr REAL for single precision flavors; DOUBLE PRECISION for double precision flavors. Array, DIMENSION at least max(1, nrhs). Contains the componentwise relative backward error for each solution vector x(j), that is, the smallest relative change in any element of A or B that makes x(j) an exact solution.
err_bnds_norm REAL for single precision flavors; DOUBLE PRECISION for double precision flavors. Array, DIMENSION (nrhs,n_err_bnds). For each right-hand side, contains information about various error bounds and condition numbers corresponding to the normwise relative error, which is defined as follows:
Normwise relative error in the i-th solution vector: max_j abs(xtrue(j,i) - x(j,i)) / max_j abs(x(j,i)).
The array is indexed by the type of error information as described below.
There are currently up to three pieces of information returned. The first index in err_bnds_norm(i,:) corresponds to the i-th right-hand side. The second index in err_bnds_norm(:,err) contains the following three fields:
err=1 "Trust/don't trust" boolean. Trust the answer if the reciprocal condition number is greater than the threshold sqrt(n)*slamch('E') for single precision flavors and sqrt(n)*dlamch('E') for double precision flavors.
err=2 "Guaranteed" error bound. The estimated forward error, almost certainly within a factor of 10 of the true error so long as the next entry is greater than the threshold sqrt(n)*slamch('E') for single precision flavors and sqrt(n)*dlamch('E') for double precision flavors. This error bound should only be trusted if the previous boolean is true.
err=3 Reciprocal condition number. Estimated normwise reciprocal condition number. Compared with the threshold sqrt(n)*slamch('E') for single precision flavors and sqrt(n)*dlamch('E') for double precision flavors to determine if the error estimate is "guaranteed". These reciprocal condition numbers are 1/(norm(inv(z),inf)*norm(z,inf)) for some appropriately scaled matrix Z. Let z=s*a, where s scales each row by a power of the radix so all absolute row sums of z are approximately 1.
err_bnds_comp REAL for single precision flavors; DOUBLE PRECISION for double precision flavors. Array, DIMENSION (nrhs,n_err_bnds). For each right-hand side, contains information about various error bounds and condition numbers corresponding to the componentwise relative error, which is defined as follows:
Componentwise relative error in the i-th solution vector: max_j ( abs(xtrue(j,i) - x(j,i)) / abs(x(j,i)) ).
The array is indexed by the right-hand side i, on which the componentwise relative error depends, and by the type of error information as described below. There are currently up to three pieces of information returned for each right-hand side.
If componentwise accuracy is not requested (params(3) = 0.0), then err_bnds_comp is not accessed. If n_err_bnds < 3, then at most the first (:,n_err_bnds) entries are returned. The first index in err_bnds_comp(i,:) corresponds to the i-th right-hand side. The second index in err_bnds_comp(:,err) contains the following three fields:
err=1 "Trust/don't trust" boolean. Trust the answer if the reciprocal condition number is greater than the threshold sqrt(n)*slamch('E') for single precision flavors and sqrt(n)*dlamch('E') for double precision flavors.
err=2 "Guaranteed" error bound. The estimated forward error, almost certainly within a factor of 10 of the true error so long as the next entry is greater than the threshold sqrt(n)*slamch('E') for single precision flavors and sqrt(n)*dlamch('E') for double precision flavors. This error bound should only be trusted if the previous boolean is true.
err=3 Reciprocal condition number. Estimated componentwise reciprocal condition number. Compared with the threshold sqrt(n)*slamch('E') for single precision flavors and sqrt(n)*dlamch('E') for double precision flavors to determine if the error estimate is "guaranteed". These reciprocal condition numbers are 1/(norm(inv(z),inf)*norm(z,inf)) for some appropriately scaled matrix Z. Let z=s*(a*diag(x)), where x is the solution for the current right-hand side and s scales each row of a*diag(x) by a power of the radix so all absolute row sums of z are approximately 1.
ipiv If fact = 'N' or 'E', then ipiv is an output argument and on exit contains the pivot indices from the factorization A = P*L*U of the original matrix A (if fact = 'N') or of the equilibrated matrix A (if fact = 'E').
equed If fact ≠ 'F', then equed is an output argument. It specifies the form of equilibration that was done (see the description of equed in the Input Parameters section).
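Reading the err_bnds_norm and err_bnds_comp arrays from C requires knowing how the (nrhs, n_err_bnds) array is laid out. The sketch below assumes the Fortran (column-by-column) layout and 1-based indices i and err; the function and enum names are hypothetical, and eps stands in for the value that slamch or dlamch would return for the machine epsilon. The threshold test compares squares only to avoid depending on a square-root routine.

```c
#include <assert.h>

/* Field numbers used as the second index of err_bnds_norm / err_bnds_comp. */
enum { ERR_TRUST = 1, ERR_BOUND = 2, ERR_RCOND = 3 };

/*
 * Hypothetical accessor: element (i, err) of an nrhs-by-n_err_bnds array that
 * is stored column by column; i and err are 1-based as in the text.
 */
double err_bnds_get(const double *err_bnds, int nrhs, int i, int err)
{
    assert(i >= 1 && i <= nrhs && err >= 1);
    return err_bnds[(err - 1) * nrhs + (i - 1)];
}

/* Nonzero if the "trust/don't trust" field for right-hand side i is set. */
int bound_is_trusted(const double *err_bnds, int nrhs, int i)
{
    return err_bnds_get(err_bnds, nrhs, i, ERR_TRUST) != 0.0;
}

/*
 * Nonzero if rcond > sqrt(n)*eps. Both sides are nonnegative, so the test is
 * written as rcond^2 > n*eps^2.
 */
int rcond_above_threshold(int n, double eps, double rcond)
{
    assert(n >= 0 && eps > 0.0 && rcond >= 0.0);
    return rcond * rcond > (double)n * eps * eps;
}
```

A caller would typically check bound_is_trusted for each right-hand side before using the value in the ERR_BOUND field.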
params If an entry is less than 0.0, that entry is filled with the default value used for that parameter; otherwise the entry is not modified.
info INTEGER. If info = 0, the execution is successful. The solution to every right-hand side is guaranteed.
If info = -i, the i-th parameter had an illegal value.
If 0 < info ≤ n: U(info,info) is exactly zero. The factorization has been completed, but the factor U is exactly singular, so the solution and error bounds could not be computed; rcond = 0 is returned.
If info = n+j: The solution corresponding to the j-th right-hand side is not guaranteed. The solutions corresponding to other right-hand sides k with k > j may not be guaranteed as well, but only the first such right-hand side is reported. If a small componentwise error is not requested (params(3) = 0.0), then the j-th right-hand side is the first with a normwise error bound that is not guaranteed (the smallest j such that err_bnds_norm(j,1) = 0.0 or err_bnds_comp(j,1) = 0.0). See the definition of err_bnds_norm(:,1) and err_bnds_comp(:,1). To get information about all of the right-hand sides, check err_bnds_norm or err_bnds_comp.
?gbsv
Computes the solution to the system of linear equations with a band matrix A and multiple right-hand sides.
Syntax
Fortran 77:
call sgbsv( n, kl, ku, nrhs, ab, ldab, ipiv, b, ldb, info )
call dgbsv( n, kl, ku, nrhs, ab, ldab, ipiv, b, ldb, info )
call cgbsv( n, kl, ku, nrhs, ab, ldab, ipiv, b, ldb, info )
call zgbsv( n, kl, ku, nrhs, ab, ldab, ipiv, b, ldb, info )
Fortran 95:
call gbsv( ab, b [,kl] [,ipiv] [,info] )
C:
lapack_int LAPACKE_<?>gbsv( int matrix_order, lapack_int n, lapack_int kl, lapack_int ku, lapack_int nrhs, <datatype>* ab, lapack_int ldab, lapack_int* ipiv, <datatype>* b, lapack_int ldb );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine solves for X the real or complex system of linear equations A*X = B, where A is an n-by-n band matrix with kl subdiagonals and ku superdiagonals, the columns of matrix B are individual right-hand sides, and the columns of X are the corresponding solutions. The LU decomposition with partial pivoting and row interchanges is used to factor A as A = L*U, where L is a product of permutation and unit lower triangular matrices with kl subdiagonals, and U is upper triangular with kl+ku superdiagonals. The factored form of A is then used to solve the system of equations A*X = B.
Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
n INTEGER. The order of A. The number of rows in B; n ≥ 0.
kl INTEGER. The number of subdiagonals within the band of A; kl ≥ 0.
ku INTEGER. The number of superdiagonals within the band of A; ku ≥ 0.
nrhs INTEGER. The number of right-hand sides. The number of columns in B; nrhs ≥ 0.
ab, b REAL for sgbsv; DOUBLE PRECISION for dgbsv; COMPLEX for cgbsv; DOUBLE COMPLEX for zgbsv. Arrays: ab(ldab,*), b(ldb,*). The array ab contains the matrix A in band storage (see Matrix Storage Schemes).
The second dimension of ab must be at least max(1, n). The array b contains the matrix B whose columns are the right-hand sides for the systems of equations. The second dimension of b must be at least max(1,nrhs).
ldab INTEGER. The leading dimension of the array ab; ldab ≥ 2*kl + ku + 1.
ldb INTEGER. The leading dimension of b; ldb ≥ max(1, n).
Output Parameters
ab Overwritten by L and U. The diagonal and kl + ku superdiagonals of U are stored in the first 1 + kl + ku rows of ab. The multipliers used to form L are stored in the next kl rows.
b Overwritten by the solution matrix X.
ipiv INTEGER. Array, DIMENSION at least max(1, n). The pivot indices: row i was interchanged with row ipiv(i).
info INTEGER. If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
If info = i, U(i, i) is exactly zero. The factorization has been completed, but the factor U is exactly singular, so the solution could not be computed.
Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions. Specific details for the routine gbsv interface are as follows:
ab Holds the array A of size (2*kl+ku+1,n).
b Holds the matrix B of size (n,nrhs).
ipiv Holds the vector of length n.
kl If omitted, assumed kl = ku.
ku Restored as ku = lda-2*kl-1.
?gbsvx
Computes the solution to the real or complex system of linear equations with a band matrix A and multiple right-hand sides, and provides error bounds on the solution.
Syntax
Fortran 77:
call sgbsvx( fact, trans, n, kl, ku, nrhs, ab, ldab, afb, ldafb, ipiv, equed, r, c, b, ldb, x, ldx, rcond, ferr, berr, work, iwork, info )
call dgbsvx( fact, trans, n, kl, ku, nrhs, ab, ldab, afb, ldafb, ipiv, equed, r, c, b, ldb, x, ldx, rcond, ferr, berr, work, iwork, info )
call cgbsvx( fact, trans, n, kl, ku, nrhs, ab, ldab, afb, ldafb, ipiv, equed, r, c, b, ldb, x, ldx, rcond, ferr, berr, work, rwork, info )
call zgbsvx( fact, trans, n, kl, ku, nrhs, ab, ldab, afb, ldafb, ipiv, equed, r, c, b, ldb, x, ldx, rcond, ferr, berr, work, rwork, info )
Fortran 95:
call gbsvx( ab, b, x [,kl] [,afb] [,ipiv] [,fact] [,trans] [,equed] [,r] [,c] [,ferr] [,berr] [,rcond] [,rpvgrw] [,info] )
C:
lapack_int LAPACKE_sgbsvx( int matrix_order, char fact, char trans, lapack_int n, lapack_int kl, lapack_int ku, lapack_int nrhs, float* ab, lapack_int ldab, float* afb, lapack_int ldafb, lapack_int* ipiv, char* equed, float* r, float* c, float* b, lapack_int ldb, float* x, lapack_int ldx, float* rcond, float* ferr, float* berr, float* rpivot );
lapack_int LAPACKE_dgbsvx( int matrix_order, char fact, char trans, lapack_int n, lapack_int kl, lapack_int ku, lapack_int nrhs, double* ab, lapack_int ldab, double* afb, lapack_int ldafb, lapack_int* ipiv, char* equed, double* r, double* c, double* b, lapack_int ldb, double* x, lapack_int ldx, double* rcond, double* ferr, double* berr, double* rpivot );
lapack_int LAPACKE_cgbsvx( int matrix_order, char fact, char trans, lapack_int n, lapack_int kl, lapack_int ku, lapack_int nrhs, lapack_complex_float* ab, lapack_int ldab, lapack_complex_float* afb, lapack_int ldafb, lapack_int* ipiv, char* equed, float* r, float* c, lapack_complex_float* b, lapack_int ldb, lapack_complex_float* x, lapack_int ldx, float* rcond, float* ferr, float* berr, float* rpivot );
lapack_int LAPACKE_zgbsvx( int matrix_order, char fact, char trans, lapack_int n, lapack_int kl, lapack_int ku, lapack_int
nrhs, lapack_complex_double* ab, lapack_int ldab, lapack_complex_double* afb, lapack_int ldafb, lapack_int* ipiv, char* equed, double* r, double* c, lapack_complex_double* b, lapack_int ldb, lapack_complex_double* x, lapack_int ldx, double* rcond, double* ferr, double* berr, double* rpivot );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine uses the LU factorization to compute the solution to a real or complex system of linear equations A*X = B, A^T*X = B, or A^H*X = B, where A is a band matrix of order n with kl subdiagonals and ku superdiagonals, the columns of matrix B are individual right-hand sides, and the columns of X are the corresponding solutions. Error bounds on the solution and a condition estimate are also provided.
The routine ?gbsvx performs the following steps:
1. If fact = 'E', real scaling factors r and c are computed to equilibrate the system:
trans = 'N': diag(r)*A*diag(c)*inv(diag(c))*X = diag(r)*B
trans = 'T': (diag(r)*A*diag(c))^T*inv(diag(r))*X = diag(c)*B
trans = 'C': (diag(r)*A*diag(c))^H*inv(diag(r))*X = diag(c)*B
Whether the system will be equilibrated depends on the scaling of the matrix A, but if equilibration is used, A is overwritten by diag(r)*A*diag(c) and B by diag(r)*B (if trans = 'N') or diag(c)*B (if trans = 'T' or 'C').
2. If fact = 'N' or 'E', the LU decomposition is used to factor the matrix A (after equilibration if fact = 'E') as A = L*U, where L is a product of permutation and unit lower triangular matrices with kl subdiagonals, and U is upper triangular with kl+ku superdiagonals.
3. If some U(i,i) = 0, so that U is exactly singular, then the routine returns with info = i. Otherwise, the factored form of A is used to estimate the condition number of the matrix A.
If the reciprocal of the condition number is less than machine precision, info = n + 1 is returned as a warning, but the routine still goes on to solve for X and compute error bounds as described below.
4. The system of equations is solved for X using the factored form of A.
5. Iterative refinement is applied to improve the computed solution matrix and calculate error bounds and backward error estimates for it.
6. If equilibration was used, the matrix X is premultiplied by diag(c) (if trans = 'N') or diag(r) (if trans = 'T' or 'C') so that it solves the original system before equilibration.
Input Parameters
The data types are given for the Fortran interface, except for rpivot. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
fact CHARACTER*1. Must be 'F', 'N', or 'E'. Specifies whether the factored form of the matrix A is supplied on entry, and if not, whether the matrix A should be equilibrated before it is factored.
If fact = 'F': on entry, afb and ipiv contain the factored form of A. If equed is not 'N', the matrix A has been equilibrated with scaling factors given by r and c. ab, afb, and ipiv are not modified.
If fact = 'N', the matrix A will be copied to afb and factored.
If fact = 'E', the matrix A will be equilibrated if necessary, then copied to afb and factored.
trans CHARACTER*1. Must be 'N', 'T', or 'C'. Specifies the form of the system of equations:
If trans = 'N', the system has the form A*X = B (No transpose).
If trans = 'T', the system has the form A^T*X = B (Transpose).
If trans = 'C', the system has the form A^H*X = B (Transpose for real flavors, conjugate transpose for complex flavors).
n INTEGER. The number of linear equations, the order of the matrix A; n ≥ 0.
kl INTEGER. The number of subdiagonals within the band of A; kl ≥ 0.
ku INTEGER.
The number of superdiagonals within the band of A; ku ≥ 0.
nrhs INTEGER. The number of right-hand sides, the number of columns of the matrices B and X; nrhs ≥ 0.
ab, afb, b, work REAL for sgbsvx; DOUBLE PRECISION for dgbsvx; COMPLEX for cgbsvx; DOUBLE COMPLEX for zgbsvx. Arrays: ab(ldab,*), afb(ldafb,*), b(ldb,*), work(*).
The array ab contains the matrix A in band storage (see Matrix Storage Schemes). The second dimension of ab must be at least max(1, n). If fact = 'F' and equed is not 'N', then A must have been equilibrated by the scaling factors in r and/or c.
The array afb is an input argument if fact = 'F'. The second dimension of afb must be at least max(1,n). It contains the factored form of the matrix A, that is, the factors L and U from the factorization A = L*U as computed by ?gbtrf. U is stored as an upper triangular band matrix with kl + ku superdiagonals in the first 1 + kl + ku rows of afb. The multipliers used during the factorization are stored in the next kl rows. If equed is not 'N', then afb is the factored form of the equilibrated matrix A.
The array b contains the matrix B whose columns are the right-hand sides for the systems of equations. The second dimension of b must be at least max(1, nrhs).
work(*) is a workspace array. The dimension of work must be at least max(1,3*n) for real flavors, and at least max(1,2*n) for complex flavors.
ldab INTEGER. The leading dimension of ab; ldab ≥ kl+ku+1.
ldafb INTEGER. The leading dimension of afb; ldafb ≥ 2*kl+ku+1.
ldb INTEGER. The leading dimension of b; ldb ≥ max(1, n).
ipiv INTEGER. Array, DIMENSION at least max(1, n). The array ipiv is an input argument if fact = 'F'. It contains the pivot indices from the factorization A = L*U as computed by ?gbtrf; row i of the matrix was interchanged with row ipiv(i).
equed CHARACTER*1. Must be 'N', 'R', 'C', or 'B'. equed is an input argument if fact = 'F'.
It specifies the form of equilibration that was done:
If equed = 'N', no equilibration was done (always true if fact = 'N').
If equed = 'R', row equilibration was done, that is, A has been premultiplied by diag(r).
If equed = 'C', column equilibration was done, that is, A has been postmultiplied by diag(c).
If equed = 'B', both row and column equilibration was done, that is, A has been replaced by diag(r)*A*diag(c).
r, c REAL for single precision flavors; DOUBLE PRECISION for double precision flavors. Arrays: r(n), c(n). The array r contains the row scale factors for A, and the array c contains the column scale factors for A. These arrays are input arguments if fact = 'F' only; otherwise they are output arguments. If equed = 'R' or 'B', A is multiplied on the left by diag(r); if equed = 'N' or 'C', r is not accessed. If fact = 'F' and equed = 'R' or 'B', each element of r must be positive. If equed = 'C' or 'B', A is multiplied on the right by diag(c); if equed = 'N' or 'R', c is not accessed. If fact = 'F' and equed = 'C' or 'B', each element of c must be positive.
ldx INTEGER. The leading dimension of the output array x; ldx ≥ max(1, n).
iwork INTEGER. Workspace array, DIMENSION at least max(1, n); used in real flavors only.
rwork REAL for single precision flavors; DOUBLE PRECISION for double precision flavors. Workspace array, DIMENSION at least max(1, n); used in complex flavors only.
Output Parameters
x REAL for sgbsvx; DOUBLE PRECISION for dgbsvx; COMPLEX for cgbsvx; DOUBLE COMPLEX for zgbsvx. Array, DIMENSION (ldx,*). If info = 0 or info = n+1, the array x contains the solution matrix X to the original system of equations. Note that A and B are modified on exit if equed ≠ 'N', and the solution to the equilibrated system is: inv(diag(c))*X, if trans = 'N' and equed = 'C' or 'B'; inv(diag(r))*X, if trans = 'T' or 'C' and equed = 'R' or 'B'. The second dimension of x must be at least max(1,nrhs).
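The relation between the solution Y of the equilibrated system and the returned X (X = diag(c)*Y for trans = 'N', X = diag(r)*Y for trans = 'T' or 'C') amounts to a row scaling. The routine performs this step itself, so the following C sketch is only meant to make the relation concrete; scale_rows is a hypothetical helper and the matrix is assumed to be stored column by column with leading dimension ldx.

```c
#include <assert.h>

/*
 * Hypothetical helper: multiplies row i of the n-by-nrhs matrix x
 * (column-major, leading dimension ldx) by scale[i], that is, forms
 * diag(scale)*x in place.
 */
void scale_rows(int n, int nrhs, const double *scale, double *x, int ldx)
{
    assert(n >= 0 && nrhs >= 0 && ldx >= (n > 1 ? n : 1));
    for (int j = 0; j < nrhs; ++j)        /* each right-hand side */
        for (int i = 0; i < n; ++i)       /* each component */
            x[i + j * ldx] *= scale[i];
}
```

Passing c as scale corresponds to the trans = 'N' case; passing r corresponds to trans = 'T' or 'C'.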
ab Array ab is not modified on exit if fact = 'F' or 'N', or if fact = 'E' and equed = 'N'. If equed ≠ 'N', A is scaled on exit as follows:
equed = 'R': A = diag(r)*A
equed = 'C': A = A*diag(c)
equed = 'B': A = diag(r)*A*diag(c).
afb If fact = 'N' or 'E', then afb is an output argument and on exit returns details of the LU factorization of the original matrix A (if fact = 'N') or of the equilibrated matrix A (if fact = 'E'). See the description of ab for the form of the equilibrated matrix.
b Overwritten by diag(r)*b if trans = 'N' and equed = 'R' or 'B'; overwritten by diag(c)*b if trans = 'T' or 'C' and equed = 'C' or 'B'; not changed if equed = 'N'.
r, c These arrays are output arguments if fact ≠ 'F'. See the description of r, c in the Input Parameters section.
rcond REAL for single precision flavors; DOUBLE PRECISION for double precision flavors. An estimate of the reciprocal condition number of the matrix A after equilibration (if done). If rcond is less than the machine precision (in particular, if rcond = 0), the matrix is singular to working precision. This condition is indicated by a return code of info > 0.
ferr REAL for single precision flavors; DOUBLE PRECISION for double precision flavors. Array, DIMENSION at least max(1, nrhs). Contains the estimated forward error bound for each solution vector x(j) (the j-th column of the solution matrix X). If xtrue is the true solution corresponding to x(j), ferr(j) is an estimated upper bound for the magnitude of the largest element in (x(j) - xtrue) divided by the magnitude of the largest element in x(j). The estimate is as reliable as the estimate for rcond, and is almost always a slight overestimate of the true error.
berr REAL for single precision flavors; DOUBLE PRECISION for double precision flavors. Array, DIMENSION at least max(1, nrhs).
Contains the componentwise relative backward error for each solution vector x(j), that is, the smallest relative change in any element of A or B that makes x(j) an exact solution.
ipiv If fact = 'N' or 'E', then ipiv is an output argument and on exit contains the pivot indices from the factorization A = L*U of the original matrix A (if fact = 'N') or of the equilibrated matrix A (if fact = 'E').
equed If fact ≠ 'F', then equed is an output argument. It specifies the form of equilibration that was done (see the description of equed in the Input Parameters section).
work, rwork, rpivot On exit, work(1) for real flavors, or rwork(1) for complex flavors (the Fortran interface) and rpivot (the C interface), contains the reciprocal pivot growth factor norm(A)/norm(U). The "max absolute element" norm is used. If work(1) for real flavors, or rwork(1) for complex flavors is much less than 1, then the stability of the LU factorization of the (equilibrated) matrix A could be poor. This also means that the solution x, condition estimator rcond, and forward error bound ferr could be unreliable. If factorization fails with 0 < info ≤ n, then work(1) for real flavors, or rwork(1) for complex flavors contains the reciprocal pivot growth factor for the leading info columns of A.
info INTEGER. If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
If info = i, and i ≤ n, then U(i, i) is exactly zero. The factorization has been completed, but the factor U is exactly singular, so the solution and error bounds could not be computed; rcond = 0 is returned.
If info = i, and i = n+1, then U is nonsingular, but rcond is less than machine precision, meaning that the matrix is singular to working precision. Nevertheless, the solution and error bounds are computed because there are a number of situations where the computed solution can be more accurate than the value of rcond would suggest.
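The four ranges of info described above can be told apart with a few comparisons. The following C sketch is one possible way to do so; the enum and the function name classify_gbsvx_info are hypothetical and are not part of the library interface.

```c
#include <assert.h>

/* Hypothetical status codes mirroring the documented meanings of info. */
typedef enum {
    GBSVX_OK,              /* info = 0: successful execution */
    GBSVX_BAD_ARGUMENT,    /* info = -i: the i-th parameter is illegal */
    GBSVX_SINGULAR,        /* 1 <= info <= n: U(info,info) is exactly zero */
    GBSVX_ILL_CONDITIONED  /* info = n+1: rcond below machine precision */
} gbsvx_status;

/* Maps the value of info returned for a system of order n to a status. */
gbsvx_status classify_gbsvx_info(int info, int n)
{
    assert(n >= 0 && info <= n + 1);
    if (info == 0)
        return GBSVX_OK;
    if (info < 0)
        return GBSVX_BAD_ARGUMENT;
    if (info <= n)
        return GBSVX_SINGULAR;        /* no solution or error bounds computed */
    return GBSVX_ILL_CONDITIONED;     /* x, ferr, berr are still computed */
}
```

Only GBSVX_OK and GBSVX_ILL_CONDITIONED leave a usable solution in x; in the latter case the result should be treated as a warning rather than a failure.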
Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions. Specific details for the routine gbsvx interface are as follows:
ab Holds the array A of size (kl+ku+1,n).
b Holds the matrix B of size (n,nrhs).
x Holds the matrix X of size (n,nrhs).
afb Holds the array AF of size (2*kl+ku+1,n).
ipiv Holds the vector of length n.
r Holds the vector of length n. Default value for each element is r(i) = 1.0_WP.
c Holds the vector of length n. Default value for each element is c(i) = 1.0_WP.
ferr Holds the vector of length (nrhs).
berr Holds the vector of length (nrhs).
trans Must be 'N', 'C', or 'T'. The default value is 'N'.
equed Must be 'N', 'B', 'C', or 'R'. The default value is 'N'.
fact Must be 'N', 'E', or 'F'. The default value is 'N'. If fact = 'F', then both arguments afb and ipiv must be present; otherwise, an error is returned.
rpvgrw Real value that contains the reciprocal pivot growth factor norm(A)/norm(U).
kl If omitted, assumed kl = ku.
ku Restored as ku = lda-kl-1.
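To make the (kl+ku+1,n) shape of ab concrete, the C sketch below packs the band of a dense matrix into band storage and recovers ku from the number of rows, as the notes above describe. It assumes the usual LAPACK band layout, in which (with 0-based i and j) element A(i,j) is stored at row ku+i-j of column j; the helper names are hypothetical and both arrays are assumed to be column-major.

```c
#include <assert.h>

static int imax(int a, int b) { return a > b ? a : b; }
static int imin(int a, int b) { return a < b ? a : b; }

/*
 * Hypothetical helper: copies the band of the dense n-by-n matrix a
 * (column-major, leading dimension n) into ab (column-major, leading
 * dimension ldab >= kl+ku+1). Entries of ab outside the band are untouched.
 */
void pack_band(int n, int kl, int ku, const double *a, double *ab, int ldab)
{
    assert(n >= 0 && kl >= 0 && ku >= 0 && ldab >= kl + ku + 1);
    for (int j = 0; j < n; ++j)
        for (int i = imax(0, j - ku); i <= imin(n - 1, j + kl); ++i)
            ab[(ku + i - j) + j * ldab] = a[i + j * n];
}

/* Recovers ku from the first dimension of ab and kl: ku = rows - kl - 1. */
int ku_from_rows(int rows, int kl)
{
    return rows - kl - 1;
}
```

For a tridiagonal matrix (kl = ku = 1) this places the superdiagonal in row 0, the diagonal in row 1, and the subdiagonal in row 2 of ab. Note that ?gbsv expects a taller array (2*kl+ku+1 rows), so this layout applies to the ab argument of ?gbsvx as described here.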
?gbsvxx
Uses extra precise iterative refinement to compute the solution to the system of linear equations with a banded matrix A and multiple right-hand sides.
Syntax
Fortran 77:
call sgbsvxx( fact, trans, n, kl, ku, nrhs, ab, ldab, afb, ldafb, ipiv, equed, r, c, b, ldb, x, ldx, rcond, rpvgrw, berr, n_err_bnds, err_bnds_norm, err_bnds_comp, nparams, params, work, iwork, info )
call dgbsvxx( fact, trans, n, kl, ku, nrhs, ab, ldab, afb, ldafb, ipiv, equed, r, c, b, ldb, x, ldx, rcond, rpvgrw, berr, n_err_bnds, err_bnds_norm, err_bnds_comp, nparams, params, work, iwork, info )
call cgbsvxx( fact, trans, n, kl, ku, nrhs, ab, ldab, afb, ldafb, ipiv, equed, r, c, b, ldb, x, ldx, rcond, rpvgrw, berr, n_err_bnds, err_bnds_norm, err_bnds_comp, nparams, params, work, rwork, info )
call zgbsvxx( fact, trans, n, kl, ku, nrhs, ab, ldab, afb, ldafb, ipiv, equed, r, c, b, ldb, x, ldx, rcond, rpvgrw, berr, n_err_bnds, err_bnds_norm, err_bnds_comp, nparams, params, work, rwork, info )
C:
lapack_int LAPACKE_sgbsvxx( int matrix_order, char fact, char trans, lapack_int n, lapack_int kl, lapack_int ku, lapack_int nrhs, float* ab, lapack_int ldab, float* afb, lapack_int ldafb, lapack_int* ipiv, char* equed, float* r, float* c, float* b, lapack_int ldb, float* x, lapack_int ldx, float* rcond, float* rpvgrw, float* berr, lapack_int n_err_bnds, float* err_bnds_norm, float* err_bnds_comp, lapack_int nparams, const float* params );
lapack_int LAPACKE_dgbsvxx( int matrix_order, char fact, char trans, lapack_int n, lapack_int kl, lapack_int ku, lapack_int nrhs, double* ab, lapack_int ldab, double* afb, lapack_int ldafb, lapack_int* ipiv, char* equed, double* r, double* c, double* b, lapack_int ldb, double* x, lapack_int ldx, double* rcond, double* rpvgrw, double* berr, lapack_int n_err_bnds, double* err_bnds_norm, double* err_bnds_comp, lapack_int nparams, const double* params );
lapack_int LAPACKE_cgbsvxx( int matrix_order, char fact, char trans,
lapack_int n, lapack_int kl, lapack_int ku, lapack_int nrhs, lapack_complex_float* ab, lapack_int ldab, lapack_complex_float* afb, lapack_int ldafb, lapack_int* ipiv, char* equed, float* r, float* c, lapack_complex_float* b, lapack_int ldb, lapack_complex_float* x, lapack_int ldx, float* rcond, float* rpvgrw, float* berr, lapack_int n_err_bnds, float* err_bnds_norm, float* err_bnds_comp, lapack_int nparams, const float* params );
lapack_int LAPACKE_zgbsvxx( int matrix_order, char fact, char trans, lapack_int n, lapack_int kl, lapack_int ku, lapack_int nrhs, lapack_complex_double* ab, lapack_int ldab, lapack_complex_double* afb, lapack_int ldafb, lapack_int* ipiv, char* equed, double* r, double* c, lapack_complex_double* b, lapack_int ldb, lapack_complex_double* x, lapack_int ldx, double* rcond, double* rpvgrw, double* berr, lapack_int n_err_bnds, double* err_bnds_norm, double* err_bnds_comp, lapack_int nparams, const double* params );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine uses the LU factorization to compute the solution to a real or complex system of linear equations A*X = B, where A is an n-by-n banded matrix, the columns of the matrix B are individual right-hand sides, and the columns of X are the corresponding solutions. Both normwise and maximum componentwise error bounds are also provided on request. The routine returns a solution with a small guaranteed error (O(eps), where eps is the working machine precision) unless the matrix is very ill-conditioned, in which case a warning is returned. Relevant condition numbers are also calculated and returned. The routine accepts user-provided factorizations and equilibration factors; see definitions of the fact and equed options.
Solving with refinement and using a factorization from a previous call of the routine also produces a solution with O(eps) errors or warnings, but that may not be true for general user-provided factorizations and equilibration factors if they differ from what the routine would itself produce.
The routine ?gbsvxx performs the following steps:
1. If fact = 'E', scaling factors r and c are computed to equilibrate the system:
trans = 'N': diag(r)*A*diag(c)*inv(diag(c))*X = diag(r)*B
trans = 'T': (diag(r)*A*diag(c))^T*inv(diag(r))*X = diag(c)*B
trans = 'C': (diag(r)*A*diag(c))^H*inv(diag(r))*X = diag(c)*B
Whether or not the system will be equilibrated depends on the scaling of the matrix A, but if equilibration is used, A is overwritten by diag(r)*A*diag(c) and B by diag(r)*B (if trans = 'N') or diag(c)*B (if trans = 'T' or 'C').
2. If fact = 'N' or 'E', the LU decomposition is used to factor the matrix A (after equilibration if fact = 'E') as A = P*L*U, where P is a permutation matrix, L is a unit lower triangular matrix, and U is upper triangular.
3. If some U(i,i) = 0, so that U is exactly singular, then the routine returns with info = i. Otherwise, the factored form of A is used to estimate the condition number of the matrix A (see the rcond parameter). If the reciprocal of the condition number is less than machine precision, the routine still goes on to solve for X and compute error bounds.
4. The system of equations is solved for X using the factored form of A.
5. By default, unless params(la_linrx_itref_i) is set to zero, the routine applies iterative refinement to improve the computed solution matrix and calculate error bounds. Refinement calculates the residual to at least twice the working precision.
6. If equilibration was used, the matrix X is premultiplied by diag(c) (if trans = 'N') or diag(r) (if trans = 'T' or 'C') so that it solves the original system before equilibration.
Input Parameters
The data types are given for the Fortran interface.
A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
fact CHARACTER*1. Must be 'F', 'N', or 'E'. Specifies whether or not the factored form of the matrix A is supplied on entry, and if not, whether the matrix A should be equilibrated before it is factored. If fact = 'F', on entry, afb and ipiv contain the factored form of A. If equed is not 'N', the matrix A has been equilibrated with scaling factors given by r and c. Parameters ab, afb, and ipiv are not modified. If fact = 'N', the matrix A will be copied to afb and factored. If fact = 'E', the matrix A will be equilibrated, if necessary, copied to afb and factored.
trans CHARACTER*1. Must be 'N', 'T', or 'C'. Specifies the form of the system of equations: If trans = 'N', the system has the form A*X = B (No transpose). If trans = 'T', the system has the form A**T*X = B (Transpose). If trans = 'C', the system has the form A**H*X = B (Conjugate Transpose = Transpose for real flavors, Conjugate Transpose for complex flavors).
n INTEGER. The number of linear equations; the order of the matrix A; n ≥ 0.
kl INTEGER. The number of subdiagonals within the band of A; kl ≥ 0.
ku INTEGER. The number of superdiagonals within the band of A; ku ≥ 0.
nrhs INTEGER. The number of right-hand sides; the number of columns of the matrices B and X; nrhs ≥ 0.
ab, afb, b, work REAL for sgbsvxx DOUBLE PRECISION for dgbsvxx COMPLEX for cgbsvxx DOUBLE COMPLEX for zgbsvxx. Arrays: ab(ldab,*), afb(ldafb,*), b(ldb,*), work(*).
The array ab contains the matrix A in band storage, in rows 1 to kl+ku+1. The j-th column of A is stored in the j-th column of the array ab as follows: ab(ku+1+i-j,j) = A(i,j) for max(1,j-ku) ≤ i ≤ min(n,j+kl). If fact = 'F' and equed is not 'N', then ab must have been equilibrated by the scaling factors in r and/or c.
The second dimension of ab must be at least max(1,n).
The array afb is an input argument if fact = 'F'. It contains the factored form of the banded matrix A, that is, the factors L and U from the factorization A = P*L*U as computed by ?gbtrf. U is stored as an upper triangular banded matrix with kl + ku superdiagonals in rows 1 to kl + ku + 1. The multipliers used during the factorization are stored in rows kl + ku + 2 to 2*kl + ku + 1. If equed is not 'N', then afb is the factored form of the equilibrated matrix A.
The array b contains the matrix B whose columns are the right-hand sides for the systems of equations. The second dimension of b must be at least max(1,nrhs).
work(*) is a workspace array. The dimension of work must be at least max(1,4*n) for real flavors, and at least max(1,2*n) for complex flavors.
ldab INTEGER. The leading dimension of the array ab; ldab ≥ kl+ku+1.
ldafb INTEGER. The leading dimension of the array afb; ldafb ≥ 2*kl+ku+1.
ipiv INTEGER. Array, DIMENSION at least max(1, n). The array ipiv is an input argument if fact = 'F'. It contains the pivot indices from the factorization A = P*L*U as computed by ?gbtrf; row i of the matrix was interchanged with row ipiv(i).
equed CHARACTER*1. Must be 'N', 'R', 'C', or 'B'. equed is an input argument if fact = 'F'. It specifies the form of equilibration that was done: If equed = 'N', no equilibration was done (always true if fact = 'N'). If equed = 'R', row equilibration was done, that is, A has been premultiplied by diag(r). If equed = 'C', column equilibration was done, that is, A has been postmultiplied by diag(c). If equed = 'B', both row and column equilibration was done, that is, A has been replaced by diag(r)*A*diag(c).
r, c REAL for single precision flavors DOUBLE PRECISION for double precision flavors. Arrays: r(n), c(n). The array r contains the row scale factors for A, and the array c contains the column scale factors for A.
These arrays are input arguments if fact = 'F' only; otherwise they are output arguments. If equed = 'R' or 'B', A is multiplied on the left by diag(r); if equed = 'N' or 'C', r is not accessed. If fact = 'F' and equed = 'R' or 'B', each element of r must be positive. If equed = 'C' or 'B', A is multiplied on the right by diag(c); if equed = 'N' or 'R', c is not accessed. If fact = 'F' and equed = 'C' or 'B', each element of c must be positive. Each element of r or c should be a power of the radix to ensure a reliable solution and error estimates. Scaling by powers of the radix does not cause rounding errors unless the result underflows or overflows. Rounding errors during scaling lead to refining with a matrix that is not equivalent to the input matrix, producing error estimates that may not be reliable.
ldb INTEGER. The leading dimension of the array b; ldb ≥ max(1,n).
ldx INTEGER. The leading dimension of the output array x; ldx ≥ max(1,n).
n_err_bnds INTEGER. Number of error bounds to return for each right-hand side and each type (normwise or componentwise). See err_bnds_norm and err_bnds_comp descriptions in Output Arguments section below.
nparams INTEGER. Specifies the number of parameters set in params. If nparams = 0, the params array is never referenced and default values are used.
params REAL for single precision flavors DOUBLE PRECISION for double precision flavors. Array, DIMENSION nparams. Specifies algorithm parameters. If an entry is less than 0.0, that entry is filled with the default value used for that parameter. Only positions up to nparams are accessed; defaults are used for higher-numbered parameters. If defaults are acceptable, you can pass nparams = 0, which prevents the routine from accessing the params argument.
params(la_linrx_itref_i = 1) : Whether to perform iterative refinement or not. Default: 1.0 (for single precision flavors), 1.0D+0 (for double precision flavors).
=0.0 No refinement is performed and no error bounds are computed.
=1.0 Use the extra-precise refinement algorithm. (Other values are reserved for future use.)
params(la_linrx_ithresh_i = 2) : Maximum number of residual computations allowed for refinement. Default: 10. Aggressive: set to 100 to permit convergence using approximate factorizations or factorizations other than LU. If the factorization uses a technique other than Gaussian elimination, the guarantees in err_bnds_norm and err_bnds_comp may no longer be trustworthy.
params(la_linrx_cwise_i = 3) : Flag determining if the code will attempt to find a solution with a small componentwise relative error in the double-precision algorithm. Positive is true, 0.0 is false. Default: 1.0 (attempt componentwise convergence).
iwork INTEGER. Workspace array, DIMENSION at least max(1, n); used in real flavors only.
rwork REAL for single precision flavors DOUBLE PRECISION for double precision flavors. Workspace array, DIMENSION at least max(1, 3*n); used in complex flavors only.
Output Parameters
x REAL for sgbsvxx DOUBLE PRECISION for dgbsvxx COMPLEX for cgbsvxx DOUBLE COMPLEX for zgbsvxx. Array, DIMENSION (ldx,*). If info = 0, the array x contains the solution n-by-nrhs matrix X to the original system of equations. Note that A and B are modified on exit if equed ≠ 'N', and the solution to the equilibrated system is: inv(diag(c))*X, if trans = 'N' and equed = 'C' or 'B'; or inv(diag(r))*X, if trans = 'T' or 'C' and equed = 'R' or 'B'. The second dimension of x must be at least max(1,nrhs).
ab Array ab is not modified on exit if fact = 'F' or 'N', or if fact = 'E' and equed = 'N'. If equed ≠ 'N', A is scaled on exit as follows:
equed = 'R': A = diag(r)*A
equed = 'C': A = A*diag(c)
equed = 'B': A = diag(r)*A*diag(c).
afb If fact = 'N' or 'E', then afb is an output argument and on exit returns the factors L and U from the factorization A = P*L*U of the original matrix A (if fact = 'N') or of the equilibrated matrix A (if fact = 'E').
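As an illustration of the exit scaling listed for ab above, the following minimal Python sketch applies diag(r) and diag(c) to a small matrix. It is not MKL code: the helper name scale_matrix and the sample values are assumptions made for this example, and a dense list of rows stands in for band storage.

```python
def scale_matrix(a, r, c, equed):
    """Return diag(r)*A, A*diag(c), or diag(r)*A*diag(c) according to equed.

    a is a dense square matrix given as a list of rows (an assumption of this
    sketch; the routine itself works on band storage). equed = 'N' leaves the
    matrix unscaled.
    """
    n = len(a)
    row = r if equed in ("R", "B") else [1.0] * n   # row factors used for 'R' and 'B'
    col = c if equed in ("C", "B") else [1.0] * n   # column factors used for 'C' and 'B'
    return [[row[i] * a[i][j] * col[j] for j in range(n)] for i in range(n)]


a = [[4.0, 1.0], [0.5, 8.0]]
r = [0.25, 0.125]   # powers of the radix 2, so the scaling introduces no rounding
c = [1.0, 2.0]

scaled_both = scale_matrix(a, r, c, "B")
scaled_rows = scale_matrix(a, r, c, "R")
unscaled = scale_matrix(a, r, c, "N")
```

Because every factor here is a power of 2, each product is exact in binary floating point, which is the property the description of r and c relies on.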
b Overwritten by diag(r)*B if trans = 'N' and equed = 'R' or 'B'; overwritten by diag(c)*B if trans = 'T' or 'C' and equed = 'C' or 'B'; not changed if equed = 'N'.
r, c These arrays are output arguments if fact ≠ 'F'. Each element of these arrays is a power of the radix. See the description of r, c in Input Arguments section.
rcond REAL for single precision flavors DOUBLE PRECISION for double precision flavors. Reciprocal scaled condition number. An estimate of the reciprocal Skeel condition number of the matrix A after equilibration (if done). If rcond is less than the machine precision, in particular, if rcond = 0, the matrix is singular to working precision. Note that the error may still be small even if this number is very small and the matrix appears ill-conditioned.
rpvgrw REAL for single precision flavors DOUBLE PRECISION for double precision flavors. Contains the reciprocal pivot growth factor norm(A)/norm(U). The max absolute element norm is used. If this is much less than 1, the stability of the LU factorization of the (equilibrated) matrix A could be poor. This also means that the solution X, estimated condition numbers, and error bounds could be unreliable. If factorization fails with 0 < info ≤ n, this parameter contains the reciprocal pivot growth factor for the leading info columns of A. In ?gbsvx, this quantity is returned in work(1).
berr REAL for single precision flavors DOUBLE PRECISION for double precision flavors. Array, DIMENSION at least max(1, nrhs). Contains the componentwise relative backward error for each solution vector x(j), that is, the smallest relative change in any element of A or B that makes x(j) an exact solution.
err_bnds_norm REAL for single precision flavors DOUBLE PRECISION for double precision flavors. Array, DIMENSION (nrhs,n_err_bnds).
For each right-hand side, contains information about various error bounds and condition numbers corresponding to the normwise relative error, which is defined as follows:
Normwise relative error in the i-th solution vector: max_j |xtrue(j,i) - x(j,i)| / max_j |x(j,i)|
The array is indexed by the type of error information as described below. There are currently up to three pieces of information returned. The first index in err_bnds_norm(i,:) corresponds to the i-th right-hand side. The second index in err_bnds_norm(:,err) contains the following three fields:
err=1 "Trust/don't trust" boolean. Trust the answer if the reciprocal condition number is not less than the threshold sqrt(n)*slamch(ε) for single precision flavors and sqrt(n)*dlamch(ε) for double precision flavors.
err=2 "Guaranteed" error bound. The estimated forward error, almost certainly within a factor of 10 of the true error so long as the next entry is greater than the threshold sqrt(n)*slamch(ε) for single precision flavors and sqrt(n)*dlamch(ε) for double precision flavors. This error bound should only be trusted if the previous boolean is true.
err=3 Reciprocal condition number. Estimated normwise reciprocal condition number. Compared with the threshold sqrt(n)*slamch(ε) for single precision flavors and sqrt(n)*dlamch(ε) for double precision flavors to determine if the error estimate is "guaranteed". These reciprocal condition numbers are 1/(norm(inv(z),inf)*norm(z,inf)) for some appropriately scaled matrix Z. Let z=s*a, where s scales each row by a power of the radix so all absolute row sums of z are approximately 1.
err_bnds_comp REAL for single precision flavors DOUBLE PRECISION for double precision flavors. Array, DIMENSION (nrhs,n_err_bnds).
For each right-hand side, contains information about various error bounds and condition numbers corresponding to the componentwise relative error, which is defined as follows:
Componentwise relative error in the i-th solution vector: max_j ( |xtrue(j,i) - x(j,i)| / |x(j,i)| )
The array is indexed by the right-hand side i, on which the componentwise relative error depends, and by the type of error information as described below. There are currently up to three pieces of information returned for each right-hand side. If componentwise accuracy is not requested (params(3) = 0.0), then err_bnds_comp is not accessed. If n_err_bnds < 3, then at most the first (:,n_err_bnds) entries are returned. The first index in err_bnds_comp(i,:) corresponds to the i-th right-hand side. The second index in err_bnds_comp(:,err) contains the following three fields:
err=1 "Trust/don't trust" boolean. Trust the answer if the reciprocal condition number is not less than the threshold sqrt(n)*slamch(ε) for single precision flavors and sqrt(n)*dlamch(ε) for double precision flavors.
err=2 "Guaranteed" error bound. The estimated forward error, almost certainly within a factor of 10 of the true error so long as the next entry is greater than the threshold sqrt(n)*slamch(ε) for single precision flavors and sqrt(n)*dlamch(ε) for double precision flavors. This error bound should only be trusted if the previous boolean is true.
err=3 Reciprocal condition number. Estimated componentwise reciprocal condition number. Compared with the threshold sqrt(n)*slamch(ε) for single precision flavors and sqrt(n)*dlamch(ε) for double precision flavors to determine if the error estimate is "guaranteed". These reciprocal condition numbers are 1/(norm(inv(z),inf)*norm(z,inf)) for some appropriately scaled matrix Z. Let z=s*(a*diag(x)), where x is the solution for the current right-hand side and s scales each row of a*diag(x) by a power of the radix so all absolute row sums of z are approximately 1.
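The relationship between the three err fields can be sketched in a few lines of Python. This is only an illustration of the threshold comparison described for err=2 and err=3, not the routine's implementation: the function names, the choice eps = 2**-53 for double precision, and the treatment of a reciprocal condition number exactly equal to the threshold as trusted are all assumptions of this sketch.

```python
import math


def trust_flag(rcond, n, eps):
    """Return 1.0 if rcond reaches the threshold sqrt(n)*eps, otherwise 0.0.

    Counting equality as trusted is an assumption made for this sketch.
    """
    threshold = math.sqrt(n) * eps
    return 1.0 if rcond >= threshold else 0.0


def usable_bound(err_row):
    """err_row holds [trust, bound, rcond] for one right-hand side.

    The bound is returned only when the trust field is nonzero; otherwise
    None signals that the bound should not be relied on.
    """
    trust, bound, _rcond = err_row
    return bound if trust != 0.0 else None


eps = 2.0 ** -53                       # assumed double precision relative precision
n = 4                                  # threshold is sqrt(4)*eps = 2**-52

well_conditioned = [trust_flag(1e-3, n, eps), 1e-13, 1e-3]
ill_conditioned = [trust_flag(1e-17, n, eps), 1e-2, 1e-17]
```

With these hypothetical values, usable_bound returns the stored bound for the first row and None for the second.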
ipiv If fact = 'N' or 'E', then ipiv is an output argument and on exit contains the pivot indices from the factorization A = P*L*U of the original matrix A (if fact = 'N') or of the equilibrated matrix A (if fact = 'E').
equed If fact ≠ 'F', then equed is an output argument. It specifies the form of equilibration that was done (see the description of equed in Input Arguments section).
params If an entry is less than 0.0, that entry is filled with the default value used for that parameter, otherwise the entry is not modified.
info INTEGER. If info = 0, the execution is successful. The solution to every right-hand side is guaranteed. If info = -i, the i-th parameter had an illegal value. If 0 < info ≤ n: U(info,info) is exactly zero. The factorization has been completed, but the factor U is exactly singular, so the solution and error bounds could not be computed; rcond = 0 is returned. If info = n+j: The solution corresponding to the j-th right-hand side is not guaranteed. The solutions corresponding to other right-hand sides k with k > j may not be guaranteed as well, but only the first such right-hand side is reported. If a small componentwise error is not requested (params(3) = 0.0), then the j-th right-hand side is the first with a normwise error bound that is not guaranteed (the smallest j such that err_bnds_norm(j,1) = 0.0). By default (params(3) = 1.0), this is the smallest j such that either err_bnds_norm(j,1) = 0.0 or err_bnds_comp(j,1) = 0.0. See the definition of err_bnds_norm(:,1) and err_bnds_comp(:,1). To get information about all of the right-hand sides, check err_bnds_norm or err_bnds_comp.
?gtsv Computes the solution to the system of linear equations with a tridiagonal matrix A and multiple right-hand sides.
Syntax
Fortran 77:
call sgtsv( n, nrhs, dl, d, du, b, ldb, info )
call dgtsv( n, nrhs, dl, d, du, b, ldb, info )
call cgtsv( n, nrhs, dl, d, du, b, ldb, info )
call zgtsv( n, nrhs, dl, d, du, b, ldb, info )
Fortran 95:
call gtsv( dl, d, du, b [,info] )
C:
lapack_int LAPACKE_<?>gtsv( int matrix_order, lapack_int n, lapack_int nrhs, <datatype>* dl, <datatype>* d, <datatype>* du, <datatype>* b, lapack_int ldb );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine solves for X the system of linear equations A*X = B, where A is an n-by-n tridiagonal matrix, the columns of matrix B are individual right-hand sides, and the columns of X are the corresponding solutions. The routine uses Gaussian elimination with partial pivoting. Note that the equation A**T*X = B may be solved by interchanging the order of the arguments du and dl.
Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
n INTEGER. The order of A, the number of rows in B; n ≥ 0.
nrhs INTEGER. The number of right-hand sides, the number of columns in B; nrhs ≥ 0.
dl, d, du, b REAL for sgtsv DOUBLE PRECISION for dgtsv COMPLEX for cgtsv DOUBLE COMPLEX for zgtsv. Arrays: dl(n - 1), d(n), du(n - 1), b(ldb,*). The array dl contains the (n - 1) subdiagonal elements of A. The array d contains the diagonal elements of A. The array du contains the (n - 1) superdiagonal elements of A. The array b contains the matrix B whose columns are the right-hand sides for the systems of equations. The second dimension of b must be at least max(1,nrhs).
ldb INTEGER. The leading dimension of b; ldb ≥ max(1, n).
Output Parameters
dl Overwritten by the (n-2) elements of the second superdiagonal of the upper triangular matrix U from the LU factorization of A.
These elements are stored in dl(1), ..., dl(n-2).
d Overwritten by the n diagonal elements of U.
du Overwritten by the (n-1) elements of the first superdiagonal of U.
b Overwritten by the solution matrix X.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value. If info = i, U(i, i) is exactly zero, and the solution has not been computed. The factorization has not been completed unless i = n.
Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions. Specific details for the routine gtsv interface are as follows:
dl Holds the vector of length (n-1).
d Holds the vector of length n.
du Holds the vector of length (n-1).
b Holds the matrix B of size (n,nrhs).
?gtsvx Computes the solution to the real or complex system of linear equations with a tridiagonal matrix A and multiple right-hand sides, and provides error bounds on the solution.
Syntax Fortran 77: call sgtsvx( fact, trans, n, nrhs, dl, d, du, dlf, df, duf, du2, ipiv, b, ldb, x, ldx, rcond, ferr, berr, work, iwork, info ) call dgtsvx( fact, trans, n, nrhs, dl, d, du, dlf, df, duf, du2, ipiv, b, ldb, x, ldx, rcond, ferr, berr, work, iwork, info ) call cgtsvx( fact, trans, n, nrhs, dl, d, du, dlf, df, duf, du2, ipiv, b, ldb, x, ldx, rcond, ferr, berr, work, rwork, info ) call zgtsvx( fact, trans, n, nrhs, dl, d, du, dlf, df, duf, du2, ipiv, b, ldb, x, ldx, rcond, ferr, berr, work, rwork, info ) Fortran 95: call gtsvx( dl, d, du, b, x [,dlf] [,df] [,duf] [,du2] [,ipiv] [,fact] [,trans] [,ferr] [,berr] [,rcond] [,info] ) C: lapack_int LAPACKE_sgtsvx( int matrix_order, char fact, char trans, lapack_int n, lapack_int nrhs, const float* dl, const float* d, const float* du, float* dlf, float* df, float* duf, float* du2, lapack_int* ipiv, const float* b, lapack_int ldb, float* x, lapack_int ldx, float* rcond, float* ferr, float* berr ); lapack_int LAPACKE_dgtsvx( int matrix_order, char fact, char trans, lapack_int n, lapack_int nrhs, const double* dl, const double* d, const double* du, double* dlf, double* df, double* duf, double* du2, lapack_int* ipiv, const double* b, lapack_int ldb, double* x, lapack_int ldx, double* rcond, double* ferr, double* berr ); lapack_int LAPACKE_cgtsvx( int matrix_order, char fact, char trans, lapack_int n, lapack_int nrhs, const lapack_complex_float* dl, const lapack_complex_float* d, const lapack_complex_float* du, lapack_complex_float* dlf, lapack_complex_float* df, lapack_complex_float* duf, lapack_complex_float* du2, lapack_int* ipiv, const lapack_complex_float* b, lapack_int ldb, lapack_complex_float* x, lapack_int ldx, float* rcond, float* ferr, float* berr ); lapack_int LAPACKE_zgtsvx( int matrix_order, char fact, char trans, lapack_int n, lapack_int nrhs, const lapack_complex_double* dl, const lapack_complex_double* d, const lapack_complex_double* du, lapack_complex_double* dlf, lapack_complex_double* df, 
lapack_complex_double* duf, lapack_complex_double* du2, lapack_int* ipiv, const lapack_complex_double* b, lapack_int ldb, lapack_complex_double* x, lapack_int ldx, double* rcond, double* ferr, double* berr );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine uses the LU factorization to compute the solution to a real or complex system of linear equations A*X = B, A**T*X = B, or A**H*X = B, where A is a tridiagonal matrix of order n, the columns of matrix B are individual right-hand sides, and the columns of X are the corresponding solutions. Error bounds on the solution and a condition estimate are also provided.
The routine ?gtsvx performs the following steps:
1. If fact = 'N', the LU decomposition is used to factor the matrix A as A = L*U, where L is a product of permutation and unit lower bidiagonal matrices and U is an upper triangular matrix with nonzeros in only the main diagonal and first two superdiagonals.
2. If some U(i,i) = 0, so that U is exactly singular, then the routine returns with info = i. Otherwise, the factored form of A is used to estimate the condition number of the matrix A. If the reciprocal of the condition number is less than machine precision, info = n + 1 is returned as a warning, but the routine still goes on to solve for X and compute error bounds as described below.
3. The system of equations is solved for X using the factored form of A.
4. Iterative refinement is applied to improve the computed solution matrix and calculate error bounds and backward error estimates for it.
Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
fact CHARACTER*1. Must be 'F' or 'N'.
Specifies whether or not the factored form of the matrix A has been supplied on entry. If fact = 'F': on entry, dlf, df, duf, du2, and ipiv contain the factored form of A; arrays dl, d, du, dlf, df, duf, du2, and ipiv will not be modified. If fact = 'N', the matrix A will be copied to dlf, df, and duf and factored.
trans CHARACTER*1. Must be 'N', 'T', or 'C'. Specifies the form of the system of equations: If trans = 'N', the system has the form A*X = B (No transpose). If trans = 'T', the system has the form A**T*X = B (Transpose). If trans = 'C', the system has the form A**H*X = B (Conjugate transpose).
n INTEGER. The number of linear equations, the order of the matrix A; n ≥ 0.
nrhs INTEGER. The number of right-hand sides, the number of columns of the matrices B and X; nrhs ≥ 0.
dl, d, du, dlf, df, duf, du2, b, x, work REAL for sgtsvx DOUBLE PRECISION for dgtsvx COMPLEX for cgtsvx DOUBLE COMPLEX for zgtsvx. Arrays:
dl, DIMENSION (n-1), contains the subdiagonal elements of A.
d, DIMENSION (n), contains the diagonal elements of A.
du, DIMENSION (n-1), contains the superdiagonal elements of A.
dlf, DIMENSION (n-1). If fact = 'F', then dlf is an input argument and on entry contains the (n-1) multipliers that define the matrix L from the LU factorization of A as computed by ?gttrf.
df, DIMENSION (n). If fact = 'F', then df is an input argument and on entry contains the n diagonal elements of the upper triangular matrix U from the LU factorization of A.
duf, DIMENSION (n-1). If fact = 'F', then duf is an input argument and on entry contains the (n-1) elements of the first superdiagonal of U.
du2, DIMENSION (n-2). If fact = 'F', then du2 is an input argument and on entry contains the (n-2) elements of the second superdiagonal of U.
b(ldb,*) contains the right-hand side matrix B. The second dimension of b must be at least max(1, nrhs).
x(ldx,*) contains the solution matrix X.
The second dimension of x must be at least max(1, nrhs).
work(*) is a workspace array. DIMENSION of work must be at least max(1, 3*n) for real flavors and max(1, 2*n) for complex flavors.
ldb INTEGER. The leading dimension of b; ldb ≥ max(1, n).
ldx INTEGER. The leading dimension of x; ldx ≥ max(1, n).
ipiv INTEGER. Array, DIMENSION at least max(1, n). If fact = 'F', then ipiv is an input argument and on entry contains the pivot indices, as returned by ?gttrf.
iwork INTEGER. Workspace array, DIMENSION (n). Used for real flavors only.
rwork REAL for cgtsvx DOUBLE PRECISION for zgtsvx. Workspace array, DIMENSION (n). Used for complex flavors only.
Output Parameters
x REAL for sgtsvx DOUBLE PRECISION for dgtsvx COMPLEX for cgtsvx DOUBLE COMPLEX for zgtsvx. Array, DIMENSION (ldx,*). If info = 0 or info = n+1, the array x contains the solution matrix X. The second dimension of x must be at least max(1, nrhs).
dlf If fact = 'N', then dlf is an output argument and on exit contains the (n-1) multipliers that define the matrix L from the LU factorization of A.
df If fact = 'N', then df is an output argument and on exit contains the n diagonal elements of the upper triangular matrix U from the LU factorization of A.
duf If fact = 'N', then duf is an output argument and on exit contains the (n-1) elements of the first superdiagonal of U.
du2 If fact = 'N', then du2 is an output argument and on exit contains the (n-2) elements of the second superdiagonal of U.
ipiv The array ipiv is an output argument if fact = 'N' and, on exit, contains the pivot indices from the factorization A = L*U; row i of the matrix was interchanged with row ipiv(i). The value of ipiv(i) will always be i or i+1; ipiv(i) = i indicates a row interchange was not required.
rcond REAL for single precision flavors DOUBLE PRECISION for double precision flavors. An estimate of the reciprocal condition number of the matrix A.
If rcond is less than the machine precision (in particular, if rcond = 0), the matrix is singular to working precision. This condition is indicated by a return code of info > 0.
ferr REAL for single precision flavors DOUBLE PRECISION for double precision flavors. Array, DIMENSION at least max(1, nrhs). Contains the estimated forward error bound for each solution vector x(j) (the j-th column of the solution matrix X). If xtrue is the true solution corresponding to x(j), ferr(j) is an estimated upper bound for the magnitude of the largest element in (x(j) - xtrue) divided by the magnitude of the largest element in x(j). The estimate is as reliable as the estimate for rcond, and is almost always a slight overestimate of the true error.
berr REAL for single precision flavors DOUBLE PRECISION for double precision flavors. Array, DIMENSION at least max(1, nrhs). Contains the componentwise relative backward error for each solution vector x(j), that is, the smallest relative change in any element of A or B that makes x(j) an exact solution.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value. If info = i, and i ≤ n, then U(i, i) is exactly zero. The factorization has not been completed unless i = n, but the factor U is exactly singular, so the solution and error bounds could not be computed; rcond = 0 is returned. If info = i, and i = n + 1, then U is nonsingular, but rcond is less than machine precision, meaning that the matrix is singular to working precision. Nevertheless, the solution and error bounds are computed because there are a number of situations where the computed solution can be more accurate than the value of rcond would suggest.
Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine gtsvx interface are as follows:
dl Holds the vector of length (n-1).
d Holds the vector of length n.
du Holds the vector of length (n-1).
b Holds the matrix B of size (n,nrhs).
x Holds the matrix X of size (n,nrhs).
dlf Holds the vector of length (n-1).
df Holds the vector of length n.
duf Holds the vector of length (n-1).
du2 Holds the vector of length (n-2).
ipiv Holds the vector of length n.
ferr Holds the vector of length (nrhs).
berr Holds the vector of length (nrhs).
fact Must be 'N' or 'F'. The default value is 'N'. If fact = 'F', then the arguments dlf, df, duf, du2, and ipiv must be present; otherwise, an error is returned.
trans Must be 'N', 'C', or 'T'. The default value is 'N'.
?dtsvb Computes the solution to the system of linear equations with a diagonally dominant tridiagonal matrix A and multiple right-hand sides.
Syntax
Fortran 77:
call sdtsvb( n, nrhs, dl, d, du, b, ldb, info )
call ddtsvb( n, nrhs, dl, d, du, b, ldb, info )
call cdtsvb( n, nrhs, dl, d, du, b, ldb, info )
call zdtsvb( n, nrhs, dl, d, du, b, ldb, info )
Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h
Description
The ?dtsvb routine solves a system of linear equations A*X = B for X, where A is an n-by-n diagonally dominant tridiagonal matrix, the columns of matrix B are individual right-hand sides, and the columns of X are the corresponding solutions. The routine uses the BABE (Burning At Both Ends) algorithm. Note that the equation A**T*X = B may be solved by interchanging the order of the arguments du and dl.
Input Parameters
n INTEGER. The order of A, the number of rows in B; n ≥ 0.
nrhs INTEGER. The number of right-hand sides, the number of columns in B; nrhs ≥ 0.
dl, d, du, b REAL for sdtsvb DOUBLE PRECISION for ddtsvb COMPLEX for cdtsvb DOUBLE COMPLEX for zdtsvb. Arrays: dl(n - 1), d(n), du(n - 1), b(ldb,*). The array dl contains the (n - 1) subdiagonal elements of A.
The array d contains the diagonal elements of A. The array du contains the (n - 1) superdiagonal elements of A. The array b contains the matrix B whose columns are the right-hand sides for the systems of equations. The second dimension of b must be at least max(1,nrhs).
ldb INTEGER. The leading dimension of b; ldb ≥ max(1, n).
Output Parameters
dl Overwritten by the (n-1) elements of the subdiagonal of the lower triangular matrices L1, L2 from the factorization of A.
d Overwritten by the n diagonal element reciprocals of U.
b Overwritten by the solution matrix X.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value. If info = i, u(i,i) is exactly zero, and the solution has not been computed. The factorization has not been completed unless i = n.
Application Notes
A diagonally dominant tridiagonal system is defined such that |d(i)| > |dl(i-1)| + |du(i)| for any i: 1 < i < n, and |d(1)| > |du(1)|, |d(n)| > |dl(n-1)|.
The underlying BABE algorithm is designed for diagonally dominant systems. Such systems have no numerical stability issue, unlike the canonical systems that use elimination with partial pivoting (see ?gtsv). Diagonally dominant systems can also be solved much faster than the canonical systems.
NOTE
• The current implementation of BABE has a potential accuracy issue on very small or very large data close to the underflow or overflow threshold, respectively. Scale the matrix before applying the solver in the case of such input data.
• Applying the ?dtsvb factorization to non-diagonally dominant systems may lead to an accuracy loss or to a false singularity being detected, because no pivoting is performed.
?posv Computes the solution to the system of linear equations with a symmetric or Hermitian positive-definite matrix A and multiple right-hand sides.
Syntax
Fortran 77:
call sposv( uplo, n, nrhs, a, lda, b, ldb, info )
call dposv( uplo, n, nrhs, a, lda, b, ldb, info )
call cposv( uplo, n, nrhs, a, lda, b, ldb, info )
call zposv( uplo, n, nrhs, a, lda, b, ldb, info )
call dsposv( uplo, n, nrhs, a, lda, b, ldb, x, ldx, work, swork, iter, info )
call zcposv( uplo, n, nrhs, a, lda, b, ldb, x, ldx, work, swork, rwork, iter, info )
Fortran 95:
call posv( a, b [,uplo] [,info] )
C:
lapack_int LAPACKE_<?>posv( int matrix_order, char uplo, lapack_int n, lapack_int nrhs, <datatype>* a, lapack_int lda, <datatype>* b, lapack_int ldb );
lapack_int LAPACKE_dsposv( int matrix_order, char uplo, lapack_int n, lapack_int nrhs, double* a, lapack_int lda, double* b, lapack_int ldb, double* x, lapack_int ldx, lapack_int* iter );
lapack_int LAPACKE_zcposv( int matrix_order, char uplo, lapack_int n, lapack_int nrhs, lapack_complex_double* a, lapack_int lda, lapack_complex_double* b, lapack_int ldb, lapack_complex_double* x, lapack_int ldx, lapack_int* iter );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine solves for X the real or complex system of linear equations A*X = B, where A is an n-by-n symmetric/Hermitian positive-definite matrix, the columns of matrix B are individual right-hand sides, and the columns of X are the corresponding solutions. The Cholesky decomposition is used to factor A as A = U**T*U (real flavors) and A = U**H*U (complex flavors), if uplo = 'U', or A = L*L**T (real flavors) and A = L*L**H (complex flavors), if uplo = 'L', where U is an upper triangular matrix and L is a lower triangular matrix. The factored form of A is then used to solve the system of equations A*X = B. The dsposv and zcposv are mixed precision iterative refinement subroutines for exploiting fast single precision hardware.
They first attempt to factorize the matrix in single precision (dsposv) or single complex precision (zcposv) and use this factorization within an iterative refinement procedure to produce a solution with double precision (dsposv) / double complex precision (zcposv) normwise backward error quality (see below). If the approach fails, the method switches to a double precision or double complex precision factorization respectively and computes the solution.
The iterative refinement is not going to be a winning strategy if the ratio of single precision (or COMPLEX) performance to double precision (or DOUBLE COMPLEX) performance is too small. A reasonable strategy should take the number of right-hand sides and the size of the matrix into account. This might be done with a call to ilaenv in the future. At present, iterative refinement is always attempted.
The iterative refinement process is stopped if iter > itermax or, for all the right-hand sides:
rnmr < sqrt(n)*xnrm*anrm*eps*bwdmax,
where
• iter is the number of the current iteration in the iterative refinement process
• rnmr is the infinity-norm of the residual
• xnrm is the infinity-norm of the solution
• anrm is the infinity-operator-norm of the matrix A
• eps is the machine epsilon returned by dlamch('Epsilon').
The values itermax and bwdmax are fixed to 30 and 1.0d+00 respectively.
Input Parameters
The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'. Indicates whether the upper or lower triangular part of A is stored: If uplo = 'U', the upper triangle of A is stored. If uplo = 'L', the lower triangle of A is stored.
n INTEGER. The order of matrix A; n ≥ 0.
nrhs INTEGER. The number of right-hand sides, the number of columns in B; nrhs ≥ 0.
a, b REAL for sposv
DOUBLE PRECISION for dposv and dsposv.
COMPLEX for cposv
DOUBLE COMPLEX for zposv and zcposv.
Arrays: a(lda,*), b(ldb,*).
The array a contains the upper or the lower triangular part of the matrix A (see uplo). The second dimension of a must be at least max(1, n).
Note that in the case of zcposv the imaginary parts of the diagonal elements need not be set and are assumed to be zero.
The array b contains the matrix B whose columns are the right-hand sides for the systems of equations. The second dimension of b must be at least max(1,nrhs).
lda INTEGER. The leading dimension of a; lda ≥ max(1, n).
ldb INTEGER. The leading dimension of b; ldb ≥ max(1, n).
ldx INTEGER. The leading dimension of the array x; ldx ≥ max(1, n).
work DOUBLE PRECISION for dsposv
DOUBLE COMPLEX for zcposv.
Workspace array, DIMENSION (n*nrhs). This array is used to hold the residual vectors.
swork REAL for dsposv
COMPLEX for zcposv.
Workspace array, DIMENSION (n*(n+nrhs)). This array is used to hold the single precision matrix and the right-hand sides or solutions in single precision.
rwork DOUBLE PRECISION. Workspace array, DIMENSION (n).
Output Parameters
a If info = 0, the upper or lower triangular part of a is overwritten by the Cholesky factor U or L, as specified by uplo.
For dsposv and zcposv: if iterative refinement has been successfully used (info = 0 and iter ≥ 0), then A is unchanged. If double precision factorization has been used (info = 0 and iter < 0), then the array a contains the factor U or L from the Cholesky factorization.
b Overwritten by the solution matrix X (sposv, dposv, cposv, and zposv). For dsposv and zcposv the solution is returned in x.
x DOUBLE PRECISION for dsposv
DOUBLE COMPLEX for zcposv.
Array, DIMENSION (ldx, nrhs). If info = 0, contains the n-by-nrhs solution matrix X.
iter INTEGER.
If iter < 0: iterative refinement has failed, double precision factorization has been performed
• If iter = -1: the routine fell back to full precision for implementation- or machine-specific reason
• If iter = -2: narrowing the precision induced an overflow, the routine fell back to full precision
• If iter = -3: failure of spotrf for dsposv, or cpotrf for zcposv
• If iter = -31: stop the iterative refinement after the 30th iteration.
If iter > 0: iterative refinement has been successfully used. Returns the number of iterations.
info INTEGER. If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
If info = i, the leading minor of order i (and therefore the matrix A itself) is not positive definite, so the factorization could not be completed, and the solution has not been computed.
Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine posv interface are as follows:
a Holds the matrix A of size (n,n).
b Holds the matrix B of size (n,nrhs).
uplo Must be 'U' or 'L'. The default value is 'U'.
?posvx
Uses the Cholesky factorization to compute the solution to the system of linear equations with a symmetric or Hermitian positive-definite matrix A, and provides error bounds on the solution.
Syntax
Fortran 77:
call sposvx( fact, uplo, n, nrhs, a, lda, af, ldaf, equed, s, b, ldb, x, ldx, rcond, ferr, berr, work, iwork, info )
call dposvx( fact, uplo, n, nrhs, a, lda, af, ldaf, equed, s, b, ldb, x, ldx, rcond, ferr, berr, work, iwork, info )
call cposvx( fact, uplo, n, nrhs, a, lda, af, ldaf, equed, s, b, ldb, x, ldx, rcond, ferr, berr, work, rwork, info )
call zposvx( fact, uplo, n, nrhs, a, lda, af, ldaf, equed, s, b, ldb, x, ldx, rcond, ferr, berr, work, rwork, info )
Fortran 95:
call posvx( a, b, x [,uplo] [,af] [,fact] [,equed] [,s] [,ferr] [,berr] [,rcond] [,info] )
C:
lapack_int LAPACKE_sposvx( int matrix_order, char fact, char uplo, lapack_int n, lapack_int nrhs, float* a, lapack_int lda, float* af, lapack_int ldaf, char* equed, float* s, float* b, lapack_int ldb, float* x, lapack_int ldx, float* rcond, float* ferr, float* berr );
lapack_int LAPACKE_dposvx( int matrix_order, char fact, char uplo, lapack_int n, lapack_int nrhs, double* a, lapack_int lda, double* af, lapack_int ldaf, char* equed, double* s, double* b, lapack_int ldb, double* x, lapack_int ldx, double* rcond, double* ferr, double* berr );
lapack_int LAPACKE_cposvx( int matrix_order, char fact, char uplo, lapack_int n, lapack_int nrhs, lapack_complex_float* a, lapack_int lda, lapack_complex_float* af, lapack_int ldaf, char* equed, float* s, lapack_complex_float* b, lapack_int ldb, lapack_complex_float* x, lapack_int ldx, float* rcond, float* ferr, float* berr );
lapack_int LAPACKE_zposvx( int matrix_order, char fact, char uplo, lapack_int n, lapack_int nrhs, lapack_complex_double* a, lapack_int lda, lapack_complex_double* af, lapack_int ldaf, char* equed, double* s, lapack_complex_double* b, lapack_int ldb, lapack_complex_double* x, lapack_int ldx, double* rcond, double* ferr, double* berr );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine uses the Cholesky
factorization A = U^T*U (real flavors) / A = U^H*U (complex flavors) or A = L*L^T (real flavors) / A = L*L^H (complex flavors) to compute the solution to a real or complex system of linear equations A*X = B, where A is an n-by-n real symmetric/Hermitian positive definite matrix, the columns of matrix B are individual right-hand sides, and the columns of X are the corresponding solutions.
Error bounds on the solution and a condition estimate are also provided.
The routine ?posvx performs the following steps:
1. If fact = 'E', real scaling factors s are computed to equilibrate the system:
diag(s)*A*diag(s)*inv(diag(s))*X = diag(s)*B.
Whether or not the system will be equilibrated depends on the scaling of the matrix A, but if equilibration is used, A is overwritten by diag(s)*A*diag(s) and B by diag(s)*B.
2. If fact = 'N' or 'E', the Cholesky decomposition is used to factor the matrix A (after equilibration if fact = 'E') as
A = U^T*U (real), A = U^H*U (complex), if uplo = 'U',
or A = L*L^T (real), A = L*L^H (complex), if uplo = 'L',
where U is an upper triangular matrix and L is a lower triangular matrix.
3. If the leading i-by-i principal minor is not positive-definite, then the routine returns with info = i. Otherwise, the factored form of A is used to estimate the condition number of the matrix A. If the reciprocal of the condition number is less than machine precision, info = n + 1 is returned as a warning, but the routine still goes on to solve for X and compute error bounds as described below.
4. The system of equations is solved for X using the factored form of A.
5. Iterative refinement is applied to improve the computed solution matrix and calculate error bounds and backward error estimates for it.
6. If equilibration was used, the matrix X is premultiplied by diag(s) so that it solves the original system before equilibration.
Input Parameters
The data types are given for the Fortran interface.
A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
fact CHARACTER*1. Must be 'F', 'N', or 'E'.
Specifies whether or not the factored form of the matrix A is supplied on entry, and if not, whether the matrix A should be equilibrated before it is factored.
If fact = 'F': on entry, af contains the factored form of A. If equed = 'Y', the matrix A has been equilibrated with scaling factors given by s. a and af will not be modified.
If fact = 'N', the matrix A will be copied to af and factored.
If fact = 'E', the matrix A will be equilibrated if necessary, then copied to af and factored.
uplo CHARACTER*1. Must be 'U' or 'L'.
Indicates whether the upper or lower triangular part of A is stored:
If uplo = 'U', the upper triangle of A is stored.
If uplo = 'L', the lower triangle of A is stored.
n INTEGER. The order of matrix A; n ≥ 0.
nrhs INTEGER. The number of right-hand sides, the number of columns in B; nrhs ≥ 0.
a, af, b, work REAL for sposvx
DOUBLE PRECISION for dposvx
COMPLEX for cposvx
DOUBLE COMPLEX for zposvx.
Arrays: a(lda,*), af(ldaf,*), b(ldb,*), work(*).
The array a contains the matrix A as specified by uplo. If fact = 'F' and equed = 'Y', then A must have been equilibrated by the scaling factors in s, and a must contain the equilibrated matrix diag(s)*A*diag(s). The second dimension of a must be at least max(1,n).
The array af is an input argument if fact = 'F'. It contains the triangular factor U or L from the Cholesky factorization of A in the same storage format as A. If equed is not 'N', then af is the factored form of the equilibrated matrix diag(s)*A*diag(s). The second dimension of af must be at least max(1,n).
The array b contains the matrix B whose columns are the right-hand sides for the systems of equations. The second dimension of b must be at least max(1, nrhs).
work(*) is a workspace array. The dimension of work must be at least max(1,3*n) for real flavors, and at least max(1,2*n) for complex flavors.
lda INTEGER. The leading dimension of a; lda ≥ max(1, n).
ldaf INTEGER. The leading dimension of af; ldaf ≥ max(1, n).
ldb INTEGER. The leading dimension of b; ldb ≥ max(1, n).
equed CHARACTER*1. Must be 'N' or 'Y'.
equed is an input argument if fact = 'F'. It specifies the form of equilibration that was done:
if equed = 'N', no equilibration was done (always true if fact = 'N');
if equed = 'Y', equilibration was done, that is, A has been replaced by diag(s)*A*diag(s).
s REAL for single precision flavors
DOUBLE PRECISION for double precision flavors.
Array, DIMENSION (n). The array s contains the scale factors for A. This array is an input argument if fact = 'F' only; otherwise it is an output argument.
If equed = 'N', s is not accessed.
If fact = 'F' and equed = 'Y', each element of s must be positive.
ldx INTEGER. The leading dimension of the output array x; ldx ≥ max(1, n).
iwork INTEGER. Workspace array, DIMENSION at least max(1, n); used in real flavors only.
rwork REAL for cposvx
DOUBLE PRECISION for zposvx.
Workspace array, DIMENSION at least max(1, n); used in complex flavors only.
Output Parameters
x REAL for sposvx
DOUBLE PRECISION for dposvx
COMPLEX for cposvx
DOUBLE COMPLEX for zposvx.
Array, DIMENSION (ldx,*).
If info = 0 or info = n+1, the array x contains the solution matrix X to the original system of equations. Note that if equed = 'Y', A and B are modified on exit, and the solution to the equilibrated system is inv(diag(s))*X. The second dimension of x must be at least max(1,nrhs).
a Array a is not modified on exit if fact = 'F' or 'N', or if fact = 'E' and equed = 'N'.
If fact = 'E' and equed = 'Y', A is overwritten by diag(s)*A*diag(s).
af If fact = 'N' or 'E', then af is an output argument and on exit returns the triangular factor U or L from the Cholesky factorization A=U**T*U or A=L*L**T (real routines), A=U**H*U or A=L*L**H (complex routines) of the original matrix A (if fact = 'N'), or of the equilibrated matrix A (if fact = 'E'). See the description of a for the form of the equilibrated matrix.
b Overwritten by diag(s)*B, if equed = 'Y'; not changed if equed = 'N'.
s This array is an output argument if fact ≠ 'F'. See the description of s in Input Arguments section.
rcond REAL for single precision flavors
DOUBLE PRECISION for double precision flavors.
An estimate of the reciprocal condition number of the matrix A after equilibration (if done). If rcond is less than the machine precision (in particular, if rcond = 0), the matrix is singular to working precision. This condition is indicated by a return code of info > 0.
ferr REAL for single precision flavors
DOUBLE PRECISION for double precision flavors.
Array, DIMENSION at least max(1, nrhs). Contains the estimated forward error bound for each solution vector x(j) (the j-th column of the solution matrix X). If xtrue is the true solution corresponding to x(j), ferr(j) is an estimated upper bound for the magnitude of the largest element in (x(j) - xtrue) divided by the magnitude of the largest element in x(j). The estimate is as reliable as the estimate for rcond, and is almost always a slight overestimate of the true error.
berr REAL for single precision flavors
DOUBLE PRECISION for double precision flavors.
Array, DIMENSION at least max(1, nrhs). Contains the componentwise relative backward error for each solution vector x(j), that is, the smallest relative change in any element of A or B that makes x(j) an exact solution.
equed If fact ≠ 'F', then equed is an output argument. It specifies the form of equilibration that was done (see the description of equed in Input Arguments section).
info INTEGER. If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
If info = i, and i ≤ n, the leading minor of order i (and therefore the matrix A itself) is not positive-definite, so the factorization could not be completed, and the solution and error bounds could not be computed; rcond = 0 is returned.
If info = i, and i = n + 1, then U is nonsingular, but rcond is less than machine precision, meaning that the matrix is singular to working precision. Nevertheless, the solution and error bounds are computed because there are a number of situations where the computed solution can be more accurate than the value of rcond would suggest.
Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine posvx interface are as follows:
a Holds the matrix A of size (n,n).
b Holds the matrix B of size (n,nrhs).
x Holds the matrix X of size (n,nrhs).
af Holds the matrix AF of size (n,n).
s Holds the vector of length n. Default value for each element is s(i) = 1.0_WP.
ferr Holds the vector of length (nrhs).
berr Holds the vector of length (nrhs).
uplo Must be 'U' or 'L'. The default value is 'U'.
fact Must be 'N', 'E', or 'F'. The default value is 'N'. If fact = 'F', then af must be present; otherwise, an error is returned.
equed Must be 'N' or 'Y'. The default value is 'N'.
?posvxx
Uses extra precise iterative refinement to compute the solution to the system of linear equations with a symmetric or Hermitian positive-definite matrix A applying the Cholesky factorization.
Syntax
Fortran 77:
call sposvxx( fact, uplo, n, nrhs, a, lda, af, ldaf, equed, s, b, ldb, x, ldx, rcond, rpvgrw, berr, n_err_bnds, err_bnds_norm, err_bnds_comp, nparams, params, work, iwork, info )
call dposvxx( fact, uplo, n, nrhs, a, lda, af, ldaf, equed, s, b, ldb, x, ldx, rcond, rpvgrw, berr, n_err_bnds, err_bnds_norm, err_bnds_comp, nparams, params, work, iwork, info )
call cposvxx( fact, uplo, n, nrhs, a, lda, af, ldaf, equed, s, b, ldb, x, ldx, rcond, rpvgrw, berr, n_err_bnds, err_bnds_norm, err_bnds_comp, nparams, params, work, rwork, info )
call zposvxx( fact, uplo, n, nrhs, a, lda, af, ldaf, equed, s, b, ldb, x, ldx, rcond, rpvgrw, berr, n_err_bnds, err_bnds_norm, err_bnds_comp, nparams, params, work, rwork, info )
C:
lapack_int LAPACKE_sposvxx( int matrix_order, char fact, char uplo, lapack_int n, lapack_int nrhs, float* a, lapack_int lda, float* af, lapack_int ldaf, char* equed, float* s, float* b, lapack_int ldb, float* x, lapack_int ldx, float* rcond, float* rpvgrw, float* berr, lapack_int n_err_bnds, float* err_bnds_norm, float* err_bnds_comp, lapack_int nparams, const float* params );
lapack_int LAPACKE_dposvxx( int matrix_order, char fact, char uplo, lapack_int n, lapack_int nrhs, double* a, lapack_int lda, double* af, lapack_int ldaf, char* equed, double* s, double* b, lapack_int ldb, double* x, lapack_int ldx, double* rcond, double* rpvgrw, double* berr, lapack_int n_err_bnds, double* err_bnds_norm, double* err_bnds_comp, lapack_int nparams, const double* params );
lapack_int LAPACKE_cposvxx( int matrix_order, char fact, char uplo, lapack_int n, lapack_int nrhs, lapack_complex_float* a, lapack_int lda, lapack_complex_float* af, lapack_int ldaf, char* equed, float* s, lapack_complex_float* b, lapack_int ldb, lapack_complex_float* x, lapack_int ldx, float* rcond, float* rpvgrw, float* berr, lapack_int n_err_bnds, float* err_bnds_norm, float* err_bnds_comp, lapack_int nparams, const float* params );
lapack_int LAPACKE_zposvxx( int matrix_order, char fact, char uplo, lapack_int n, lapack_int nrhs, lapack_complex_double* a, lapack_int lda, lapack_complex_double* af, lapack_int ldaf, char* equed, double* s, lapack_complex_double* b, lapack_int ldb, lapack_complex_double* x, lapack_int ldx, double* rcond, double* rpvgrw, double* berr, lapack_int n_err_bnds, double* err_bnds_norm, double* err_bnds_comp, lapack_int nparams, const double* params );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine uses the Cholesky factorization A = U^T*U (real flavors) / A = U^H*U (complex flavors) or A = L*L^T (real flavors) / A = L*L^H (complex flavors) to compute the solution to a real or complex system of linear equations A*X = B, where A is an n-by-n real symmetric/Hermitian positive definite matrix, the columns of matrix B are individual right-hand sides, and the columns of X are the corresponding solutions.
Both normwise and maximum componentwise error bounds are also provided on request. The routine returns a solution with a small guaranteed error (O(eps), where eps is the working machine precision) unless the matrix is very ill-conditioned, in which case a warning is returned. Relevant condition numbers are also calculated and returned.
The routine accepts user-provided factorizations and equilibration factors; see definitions of the fact and equed options. Solving with refinement and using a factorization from a previous call of the routine also produces a solution with O(eps) errors or warnings, but that may not be true for general user-provided factorizations and equilibration factors if they differ from what the routine would itself produce.
The routine ?posvxx performs the following steps:
1.
If fact = 'E', scaling factors are computed to equilibrate the system:
diag(s)*A*diag(s)*inv(diag(s))*X = diag(s)*B
Whether or not the system will be equilibrated depends on the scaling of the matrix A, but if equilibration is used, A is overwritten by diag(s)*A*diag(s) and B by diag(s)*B.
2. If fact = 'N' or 'E', the Cholesky decomposition is used to factor the matrix A (after equilibration if fact = 'E') as
A = U^T*U (real), A = U^H*U (complex), if uplo = 'U',
or A = L*L^T (real), A = L*L^H (complex), if uplo = 'L',
where U is an upper triangular matrix and L is a lower triangular matrix.
3. If the leading i-by-i principal minor is not positive-definite, the routine returns with info = i. Otherwise, the factored form of A is used to estimate the condition number of the matrix A (see the rcond parameter). If the reciprocal of the condition number is less than machine precision, the routine still goes on to solve for X and compute error bounds.
4. The system of equations is solved for X using the factored form of A.
5. By default, unless params(la_linrx_itref_i) is set to zero, the routine applies iterative refinement to get a small error and error bounds. Refinement calculates the residual to at least twice the working precision.
6. If equilibration was used, the matrix X is premultiplied by diag(s) so that it solves the original system before equilibration.
Input Parameters
The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
fact CHARACTER*1. Must be 'F', 'N', or 'E'.
Specifies whether or not the factored form of the matrix A is supplied on entry, and if not, whether the matrix A should be equilibrated before it is factored.
If fact = 'F', on entry, af contains the factored form of A.
If equed is not 'N', the matrix A has been equilibrated with scaling factors given by s. Parameters a and af are not modified.
If fact = 'N', the matrix A will be copied to af and factored.
If fact = 'E', the matrix A will be equilibrated, if necessary, copied to af and factored.
uplo CHARACTER*1. Must be 'U' or 'L'.
Indicates whether the upper or lower triangular part of A is stored:
If uplo = 'U', the upper triangle of A is stored.
If uplo = 'L', the lower triangle of A is stored.
n INTEGER. The number of linear equations; the order of the matrix A; n ≥ 0.
nrhs INTEGER. The number of right-hand sides; the number of columns of the matrices B and X; nrhs ≥ 0.
a, af, b, work REAL for sposvxx
DOUBLE PRECISION for dposvxx
COMPLEX for cposvxx
DOUBLE COMPLEX for zposvxx.
Arrays: a(lda,*), af(ldaf,*), b(ldb,*), work(*).
The array a contains the matrix A as specified by uplo. If fact = 'F' and equed = 'Y', then A must have been equilibrated by the scaling factors in s, and a must contain the equilibrated matrix diag(s)*A*diag(s). The second dimension of a must be at least max(1,n).
The array af is an input argument if fact = 'F'. It contains the triangular factor U or L from the Cholesky factorization of A in the same storage format as A. If equed is not 'N', then af is the factored form of the equilibrated matrix diag(s)*A*diag(s). The second dimension of af must be at least max(1,n).
The array b contains the matrix B whose columns are the right-hand sides for the systems of equations. The second dimension of b must be at least max(1,nrhs).
work(*) is a workspace array. The dimension of work must be at least max(1,4*n) for real flavors, and at least max(1,2*n) for complex flavors.
lda INTEGER. The leading dimension of the array a; lda ≥ max(1,n).
ldaf INTEGER. The leading dimension of the array af; ldaf ≥ max(1,n).
equed CHARACTER*1. Must be 'N' or 'Y'.
equed is an input argument if fact = 'F'. It specifies the form of equilibration that was done:
If equed = 'N', no equilibration was done (always true if fact = 'N').
If equed = 'Y', both row and column equilibration was done, that is, A has been replaced by diag(s)*A*diag(s).
s REAL for single precision flavors
DOUBLE PRECISION for double precision flavors.
Array, DIMENSION (n). The array s contains the scale factors for A. This array is an input argument if fact = 'F' only; otherwise it is an output argument.
If equed = 'N', s is not accessed.
If fact = 'F' and equed = 'Y', each element of s must be positive.
Each element of s should be a power of the radix to ensure a reliable solution and error estimates. Scaling by powers of the radix does not cause rounding errors unless the result underflows or overflows. Rounding errors during scaling lead to refining with a matrix that is not equivalent to the input matrix, producing error estimates that may not be reliable.
ldb INTEGER. The leading dimension of the array b; ldb ≥ max(1, n).
ldx INTEGER. The leading dimension of the output array x; ldx ≥ max(1, n).
n_err_bnds INTEGER. Number of error bounds to return for each right-hand side and each type (normwise or componentwise). See err_bnds_norm and err_bnds_comp descriptions in the Output Arguments section below.
nparams INTEGER. Specifies the number of parameters set in params. If nparams ≤ 0, the params array is never referenced and default values are used.
params REAL for single precision flavors
DOUBLE PRECISION for double precision flavors.
Array, DIMENSION nparams. Specifies algorithm parameters. If an entry is less than 0.0, that entry is filled with the default value used for that parameter. Only positions up to nparams are accessed; defaults are used for higher-numbered parameters. If defaults are acceptable, you can pass nparams = 0, which prevents the source code from accessing the params argument.
params(la_linrx_itref_i = 1) : Whether to perform iterative refinement or not. Default: 1.0 (for single precision flavors), 1.0D+0 (for double precision flavors).
=0.0 No refinement is performed and no error bounds are computed.
=1.0 Use the extra-precise refinement algorithm.
(Other values are reserved for future use.)
params(la_linrx_ithresh_i = 2) : Maximum number of residual computations allowed for refinement.
Default: 10.
Aggressive: set to 100 to permit convergence using approximate factorizations or factorizations other than LU. If the factorization uses a technique other than Gaussian elimination, the guarantees in err_bnds_norm and err_bnds_comp may no longer be trustworthy.
params(la_linrx_cwise_i = 3) : Flag determining if the code will attempt to find a solution with a small componentwise relative error in the double-precision algorithm. Positive is true, 0.0 is false. Default: 1.0 (attempt componentwise convergence).
iwork INTEGER. Workspace array, DIMENSION at least max(1, n); used in real flavors only.
rwork REAL for single precision flavors
DOUBLE PRECISION for double precision flavors.
Workspace array, DIMENSION at least max(1, 3*n); used in complex flavors only.
Output Parameters
x REAL for sposvxx
DOUBLE PRECISION for dposvxx
COMPLEX for cposvxx
DOUBLE COMPLEX for zposvxx.
Array, DIMENSION (ldx,*).
If info = 0, the array x contains the n-by-nrhs solution matrix X to the original system of equations. Note that A and B are modified on exit if equed ≠ 'N', and the solution to the equilibrated system is inv(diag(s))*X.
a Array a is not modified on exit if fact = 'F' or 'N', or if fact = 'E' and equed = 'N'.
If fact = 'E' and equed = 'Y', A is overwritten by diag(s)*A*diag(s).
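The defaulting rules for the params array of ?posvxx (negative entries are filled with defaults, and positions beyond nparams fall back to defaults without being read) can be sketched as below. The helper name resolve_params and the output array resolved are hypothetical and exist only for this illustration; the default values 1.0, 10, and 1.0 are the ones listed in the params description, and the sketch uses 0-based C indexing for the 1-based positions la_linrx_itref_i, la_linrx_ithresh_i, and la_linrx_cwise_i.

```c
#define N_LINRX_PARAMS 3  /* itref, ithresh, cwise (positions 1..3 in Fortran) */

/* Hypothetical sketch of the documented defaulting rules.
 * params:   caller's array with nparams entries; entries below 0.0 are
 *           overwritten with the default, as the text describes.
 * resolved: receives the effective value of all three parameters.
 * When nparams is 0, params is never dereferenced. */
void resolve_params(int nparams, double *params,
                    double resolved[N_LINRX_PARAMS])
{
    static const double defaults[N_LINRX_PARAMS] = {1.0, 10.0, 1.0};
    for (int k = 0; k < N_LINRX_PARAMS; ++k) {
        if (k < nparams) {
            if (params[k] < 0.0)
                params[k] = defaults[k];  /* filled with the default value */
            resolved[k] = params[k];
        } else {
            resolved[k] = defaults[k];    /* higher-numbered: default used */
        }
    }
}
```

For example, passing nparams = 2 with params = (-1.0, 100.0) keeps the default refinement setting, raises the residual-computation limit to the aggressive value 100, and leaves the componentwise flag at its default of 1.0.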
af If fact = 'N' or 'E', then af is an output argument and on exit returns the triangular factor U or L from the Cholesky factorization A=U**T*U or A=L*L**T (real routines), A=U**H*U or A=L*L**H (complex routines) of the original matrix A (if fact = 'N'), or of the equilibrated matrix A (if fact = 'E'). See the description of a for the form of the equilibrated matrix.
b If equed = 'N', B is not modified.
If equed = 'Y', B is overwritten by diag(s)*B.
s This array is an output argument if fact ≠ 'F'. Each element of this array is a power of the radix. See the description of s in Input Arguments section.
rcond REAL for single precision flavors
DOUBLE PRECISION for double precision flavors.
Reciprocal scaled condition number. An estimate of the reciprocal Skeel condition number of the matrix A after equilibration (if done). If rcond is less than the machine precision, in particular, if rcond = 0, the matrix is singular to working precision. Note that the error may still be small even if this number is very small and the matrix appears ill-conditioned.
rpvgrw REAL for single precision flavors
DOUBLE PRECISION for double precision flavors.
Contains the reciprocal pivot growth factor norm(A)/norm(U). The max absolute element norm is used. If this is much less than 1, the stability of the LU factorization of the (equilibrated) matrix A could be poor. This also means that the solution X, estimated condition numbers, and error bounds could be unreliable. If factorization fails with 0 < info ≤ n, this parameter contains the reciprocal pivot growth factor for the leading info columns of A.
berr REAL for single precision flavors
DOUBLE PRECISION for double precision flavors.
Array, DIMENSION at least max(1, nrhs). Contains the componentwise relative backward error for each solution vector x(j), that is, the smallest relative change in any element of A or B that makes x(j) an exact solution.
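The componentwise relative backward error described for berr is commonly evaluated as the largest ratio |r_i| / (|A|*|x| + |b|)_i over the rows i, where r = b - A*x. The C sketch below computes that quantity for one right-hand side. It is an illustration under stated assumptions rather than the routine's own code: the function name componentwise_backward_error is hypothetical, the matrix is held in full column-major storage, the residual is formed in working precision, and the safeguards a library would apply to tiny denominators are omitted.

```c
#include <math.h>

/* Hypothetical sketch: componentwise relative backward error of x as a
 * solution of A*x = b, for an n-by-n matrix in full column-major
 * storage with leading dimension lda. Rows whose denominator is zero
 * are skipped (a simplification made for this sketch). */
double componentwise_backward_error(int n, const double *a, int lda,
                                    const double *x, const double *b)
{
    double berr = 0.0;
    for (int i = 0; i < n; ++i) {
        double r = b[i];            /* residual component b_i - (A*x)_i */
        double denom = fabs(b[i]);  /* (|A|*|x| + |b|)_i */
        for (int j = 0; j < n; ++j) {
            r -= a[i + j * lda] * x[j];
            denom += fabs(a[i + j * lda]) * fabs(x[j]);
        }
        if (denom > 0.0) {
            double q = fabs(r) / denom;
            if (q > berr)
                berr = q;
        }
    }
    return berr;
}
```

A value near the machine precision indicates that x solves a system whose entries differ from those of A and b only by rounding-level relative perturbations.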
err_bnds_norm REAL for single precision flavors
DOUBLE PRECISION for double precision flavors.
Array, DIMENSION (nrhs,n_err_bnds). For each right-hand side, contains information about various error bounds and condition numbers corresponding to the normwise relative error, which is defined as follows:
Normwise relative error in the i-th solution vector:
max_j |xtrue(j,i) - x(j,i)| / max_j |x(j,i)|
The array is indexed by the type of error information as described below. There are currently up to three pieces of information returned.
The first index in err_bnds_norm(i,:) corresponds to the i-th right-hand side.
The second index in err_bnds_norm(:,err) contains the following three fields:
err=1 "Trust/don't trust" boolean. Trust the answer if the reciprocal condition number is greater than the threshold sqrt(n)*slamch('Epsilon') for single precision flavors and sqrt(n)*dlamch('Epsilon') for double precision flavors.
err=2 "Guaranteed" error bound. The estimated forward error, almost certainly within a factor of 10 of the true error so long as the next entry is greater than the threshold sqrt(n)*slamch('Epsilon') for single precision flavors and sqrt(n)*dlamch('Epsilon') for double precision flavors. This error bound should only be trusted if the previous boolean is true.
err=3 Reciprocal condition number. Estimated normwise reciprocal condition number. Compared with the threshold sqrt(n)*slamch('Epsilon') for single precision flavors and sqrt(n)*dlamch('Epsilon') for double precision flavors to determine if the error estimate is "guaranteed". These reciprocal condition numbers are 1/(norm(1/z,inf)*norm(z,inf)) for some appropriately scaled matrix Z.
Let z=s*a, where s scales each row by a power of the radix so all absolute row sums of z are approximately 1.
err_bnds_comp REAL for single precision flavors
DOUBLE PRECISION for double precision flavors.
Array, DIMENSION (nrhs,n_err_bnds).
For each right-hand side, contains information about various error bounds and condition numbers corresponding to the componentwise relative error, which is defined as follows:
Componentwise relative error in the i-th solution vector: max_j ( |xtrue(j,i) - x(j,i)| / |x(j,i)| )
The array is indexed by the right-hand side i, on which the componentwise relative error depends, and by the type of error information as described below. There are currently up to three pieces of information returned for each right-hand side. If componentwise accuracy is not requested (params(3) = 0.0), then err_bnds_comp is not accessed. If n_err_bnds < 3, then at most the first (:,n_err_bnds) entries are returned.
The first index in err_bnds_comp(i,:) corresponds to the i-th right-hand side.
The second index in err_bnds_comp(:,err) contains the following three fields:
err=1
"Trust/don't trust" boolean. Trust the answer if the reciprocal condition number is greater than the threshold sqrt(n)*slamch('E') for single precision flavors and sqrt(n)*dlamch('E') for double precision flavors.
err=2
"Guaranteed" error bound. The estimated forward error, almost certainly within a factor of 10 of the true error so long as the next entry is greater than the threshold sqrt(n)*slamch('E') for single precision flavors and sqrt(n)*dlamch('E') for double precision flavors. This error bound should only be trusted if the previous boolean is true.
err=3
Reciprocal condition number. Estimated componentwise reciprocal condition number. Compared with the threshold sqrt(n)*slamch('E') for single precision flavors and sqrt(n)*dlamch('E') for double precision flavors to determine if the error estimate is "guaranteed". These reciprocal condition numbers are 1/(norm(inv(z),inf)*norm(z,inf)) for some appropriately scaled matrix Z. Let z=s*(a*diag(x)), where x is the solution for the current right-hand side and s scales each row of a*diag(x) by a power of the radix so all absolute row sums of z are approximately 1.

equed
If fact ≠
'F', then equed is an output argument. It specifies the form of equilibration that was done (see the description of equed in the Input Parameters section).

params
If an entry is less than 0.0, that entry is filled with the default value used for that parameter, otherwise the entry is not modified.

info
INTEGER.
If info = 0, the execution is successful. The solution to every right-hand side is guaranteed.
If info = -i, the i-th parameter had an illegal value.
If 0 < info ≤ n: U(info,info) is exactly zero. The factorization has been completed, but the factor U is exactly singular, so the solution and error bounds could not be computed; rcond = 0 is returned.
If info = n+j: The solution corresponding to the j-th right-hand side is not guaranteed. The solutions corresponding to other right-hand sides k with k > j may not be guaranteed as well, but only the first such right-hand side is reported. If a small componentwise error is not requested (params(3) = 0.0), then the j-th right-hand side is the first with a normwise error bound that is not guaranteed (the smallest j such that err_bnds_norm(j,1) = 0.0 or err_bnds_comp(j,1) = 0.0). See the definition of err_bnds_norm(:,1) and err_bnds_comp(:,1). To get information about all of the right-hand sides, check err_bnds_norm or err_bnds_comp.

?ppsv
Computes the solution to the system of linear equations with a symmetric (Hermitian) positive definite packed matrix A and multiple right-hand sides.
Syntax
Fortran 77:
call sppsv( uplo, n, nrhs, ap, b, ldb, info )
call dppsv( uplo, n, nrhs, ap, b, ldb, info )
call cppsv( uplo, n, nrhs, ap, b, ldb, info )
call zppsv( uplo, n, nrhs, ap, b, ldb, info )
Fortran 95:
call ppsv( ap, b [,uplo] [,info] )
C:
lapack_int LAPACKE_<?>ppsv( int matrix_order, char uplo, lapack_int n, lapack_int nrhs, <datatype>* ap, <datatype>* b, lapack_int ldb );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine solves for X the real or complex system of linear equations A*X = B, where A is an n-by-n real symmetric/Hermitian positive-definite matrix stored in packed format, the columns of matrix B are individual right-hand sides, and the columns of X are the corresponding solutions.
The Cholesky decomposition is used to factor A as
A = U**T*U (real flavors) and A = U**H*U (complex flavors), if uplo = 'U'
or A = L*L**T (real flavors) and A = L*L**H (complex flavors), if uplo = 'L',
where U is an upper triangular matrix and L is a lower triangular matrix. The factored form of A is then used to solve the system of equations A*X = B.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

uplo
CHARACTER*1. Must be 'U' or 'L'.
Indicates whether the upper or lower triangular part of A is stored:
If uplo = 'U', the upper triangle of A is stored.
If uplo = 'L', the lower triangle of A is stored.

n
INTEGER. The order of matrix A; n ≥ 0.

nrhs
INTEGER. The number of right-hand sides, the number of columns in B; nrhs ≥ 0.

ap, b
REAL for sppsv
DOUBLE PRECISION for dppsv
COMPLEX for cppsv
DOUBLE COMPLEX for zppsv.
Arrays: ap(*), b(ldb,*).
The array ap contains the upper or the lower triangular part of the matrix A (as specified by uplo) in packed storage (see Matrix Storage Schemes). The dimension of ap must be at least max(1, n*(n+1)/2).
The array b contains the matrix B whose columns are the right-hand sides for the systems of equations. The second dimension of b must be at least max(1,nrhs).

ldb
INTEGER. The leading dimension of b; ldb ≥ max(1, n).

Output Parameters

ap
If info = 0, the upper or lower triangular part of A in packed storage is overwritten by the Cholesky factor U or L, as specified by uplo.

b
Overwritten by the solution matrix X.

info
INTEGER.
If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
If info = i, the leading minor of order i (and therefore the matrix A itself) is not positive-definite, so the factorization could not be completed, and the solution has not been computed.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine ppsv interface are as follows:
ap  Holds the array A of size (n*(n+1)/2).
b  Holds the matrix B of size (n,nrhs).
uplo  Must be 'U' or 'L'. The default value is 'U'.

?ppsvx
Uses the Cholesky factorization to compute the solution to the system of linear equations with a symmetric (Hermitian) positive definite packed matrix A, and provides error bounds on the solution.
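The packed format expected in ap by ?ppsv and ?ppsvx can be illustrated with a short, library-independent sketch. It assumes the usual column-by-column packing convention with zero-based C indexing; the helper names packed_index and pack_triangle are hypothetical and are not part of Intel MKL or LAPACKE.

```c
#include <assert.h>
#include <stddef.h>

/* Zero-based position of a(i,j) inside the packed array ap (hypothetical helper).
 * uplo = 'U': columns of the upper triangle are stored one after another (i <= j).
 * uplo = 'L': columns of the lower triangle are stored one after another (i >= j). */
size_t packed_index(char uplo, size_t n, size_t i, size_t j)
{
    assert(i < n && j < n);
    if (uplo == 'U') {
        assert(i <= j);
        return i + j * (j + 1) / 2;
    }
    assert(uplo == 'L' && i >= j);
    return i + j * (2 * n - j - 1) / 2;
}

/* Copy one triangle of a column-major n-by-n matrix a (leading dimension lda)
 * into ap, which must hold at least n*(n+1)/2 elements. */
void pack_triangle(char uplo, size_t n, const double *a, size_t lda, double *ap)
{
    for (size_t j = 0; j < n; ++j) {
        size_t first = (uplo == 'U') ? 0 : j;     /* first stored row of column j */
        size_t last  = (uplo == 'U') ? j : n - 1; /* last stored row of column j  */
        for (size_t i = first; i <= last; ++i)
            ap[packed_index(uplo, n, i, j)] = a[i + j * lda];
    }
}
```

An array filled this way has exactly n*(n+1)/2 elements, which matches the minimum dimension required for ap and afp.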
Syntax
Fortran 77:
call sppsvx( fact, uplo, n, nrhs, ap, afp, equed, s, b, ldb, x, ldx, rcond, ferr, berr, work, iwork, info )
call dppsvx( fact, uplo, n, nrhs, ap, afp, equed, s, b, ldb, x, ldx, rcond, ferr, berr, work, iwork, info )
call cppsvx( fact, uplo, n, nrhs, ap, afp, equed, s, b, ldb, x, ldx, rcond, ferr, berr, work, rwork, info )
call zppsvx( fact, uplo, n, nrhs, ap, afp, equed, s, b, ldb, x, ldx, rcond, ferr, berr, work, rwork, info )
Fortran 95:
call ppsvx( ap, b, x [,uplo] [,af] [,fact] [,equed] [,s] [,ferr] [,berr] [,rcond] [,info] )
C:
lapack_int LAPACKE_sppsvx( int matrix_order, char fact, char uplo, lapack_int n, lapack_int nrhs, float* ap, float* afp, char* equed, float* s, float* b, lapack_int ldb, float* x, lapack_int ldx, float* rcond, float* ferr, float* berr );
lapack_int LAPACKE_dppsvx( int matrix_order, char fact, char uplo, lapack_int n, lapack_int nrhs, double* ap, double* afp, char* equed, double* s, double* b, lapack_int ldb, double* x, lapack_int ldx, double* rcond, double* ferr, double* berr );
lapack_int LAPACKE_cppsvx( int matrix_order, char fact, char uplo, lapack_int n, lapack_int nrhs, lapack_complex_float* ap, lapack_complex_float* afp, char* equed, float* s, lapack_complex_float* b, lapack_int ldb, lapack_complex_float* x, lapack_int ldx, float* rcond, float* ferr, float* berr );
lapack_int LAPACKE_zppsvx( int matrix_order, char fact, char uplo, lapack_int n, lapack_int nrhs, lapack_complex_double* ap, lapack_complex_double* afp, char* equed, double* s, lapack_complex_double* b, lapack_int ldb, lapack_complex_double* x, lapack_int ldx, double* rcond, double* ferr, double* berr );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine uses the Cholesky factorization A=U**T*U (real flavors) / A=U**H*U (complex flavors) or A=L*L**T (real flavors) / A=L*L**H (complex flavors) to compute the solution to a real or complex
system of linear equations A*X = B, where A is an n-by-n symmetric or Hermitian positive-definite matrix stored in packed format, the columns of matrix B are individual right-hand sides, and the columns of X are the corresponding solutions.
Error bounds on the solution and a condition estimate are also provided.
The routine ?ppsvx performs the following steps:
1. If fact = 'E', real scaling factors s are computed to equilibrate the system: diag(s)*A*diag(s)*inv(diag(s))*X = diag(s)*B. Whether or not the system will be equilibrated depends on the scaling of the matrix A, but if equilibration is used, A is overwritten by diag(s)*A*diag(s) and B by diag(s)*B.
2. If fact = 'N' or 'E', the Cholesky decomposition is used to factor the matrix A (after equilibration if fact = 'E') as A = U**T*U (real), A = U**H*U (complex), if uplo = 'U', or A = L*L**T (real), A = L*L**H (complex), if uplo = 'L', where U is an upper triangular matrix and L is a lower triangular matrix.
3. If the leading i-by-i principal minor is not positive-definite, then the routine returns with info = i. Otherwise, the factored form of A is used to estimate the condition number of the matrix A. If the reciprocal of the condition number is less than machine precision, info = n+1 is returned as a warning, but the routine still goes on to solve for X and compute error bounds as described below.
4. The system of equations is solved for X using the factored form of A.
5. Iterative refinement is applied to improve the computed solution matrix and calculate error bounds and backward error estimates for it.
6. If equilibration was used, the matrix X is premultiplied by diag(s) so that it solves the original system before equilibration.

Input Parameters
The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above.
See the C Interface Conventions section for the C interface principal conventions and type definitions.

fact
CHARACTER*1. Must be 'F', 'N', or 'E'.
Specifies whether or not the factored form of the matrix A is supplied on entry, and if not, whether the matrix A should be equilibrated before it is factored.
If fact = 'F': on entry, afp contains the factored form of A. If equed = 'Y', the matrix A has been equilibrated with scaling factors given by s. ap and afp will not be modified.
If fact = 'N', the matrix A will be copied to afp and factored.
If fact = 'E', the matrix A will be equilibrated if necessary, then copied to afp and factored.

uplo
CHARACTER*1. Must be 'U' or 'L'.
Indicates whether the upper or lower triangular part of A is stored:
If uplo = 'U', the upper triangle of A is stored.
If uplo = 'L', the lower triangle of A is stored.

n
INTEGER. The order of matrix A; n ≥ 0.

nrhs
INTEGER. The number of right-hand sides; the number of columns in B; nrhs ≥ 0.

ap, afp, b, work
REAL for sppsvx
DOUBLE PRECISION for dppsvx
COMPLEX for cppsvx
DOUBLE COMPLEX for zppsvx.
Arrays: ap(*), afp(*), b(ldb,*), work(*).
The array ap contains the upper or lower triangle of the original symmetric/Hermitian matrix A in packed storage (see Matrix Storage Schemes). If fact = 'F' and equed = 'Y', ap must contain the equilibrated matrix diag(s)*A*diag(s).
The array afp is an input argument if fact = 'F' and contains the triangular factor U or L from the Cholesky factorization of A in the same storage format as A. If equed is not 'N', then afp is the factored form of the equilibrated matrix A.
The array b contains the matrix B whose columns are the right-hand sides for the systems of equations.
work(*) is a workspace array.
The dimension of arrays ap and afp must be at least max(1, n*(n+1)/2); the second dimension of b must be at least max(1,nrhs); the dimension of work must be at least max(1, 3*n) for real flavors and max(1, 2*n) for complex flavors.

ldb
INTEGER. The leading dimension of b; ldb ≥ max(1, n).

equed
CHARACTER*1. Must be 'N' or 'Y'.
equed is an input argument if fact = 'F'. It specifies the form of equilibration that was done:
if equed = 'N', no equilibration was done (always true if fact = 'N');
if equed = 'Y', equilibration was done, that is, A has been replaced by diag(s)*A*diag(s).

s
REAL for single precision flavors
DOUBLE PRECISION for double precision flavors.
Array, DIMENSION (n). The array s contains the scale factors for A. This array is an input argument if fact = 'F' only; otherwise it is an output argument.
If equed = 'N', s is not accessed.
If fact = 'F' and equed = 'Y', each element of s must be positive.

ldx
INTEGER. The leading dimension of the output array x; ldx ≥ max(1, n).

iwork
INTEGER. Workspace array, DIMENSION at least max(1, n); used in real flavors only.

rwork
REAL for cppsvx; DOUBLE PRECISION for zppsvx.
Workspace array, DIMENSION at least max(1, n); used in complex flavors only.

Output Parameters

x
REAL for sppsvx
DOUBLE PRECISION for dppsvx
COMPLEX for cppsvx
DOUBLE COMPLEX for zppsvx.
Array, DIMENSION (ldx,*).
If info = 0 or info = n+1, the array x contains the solution matrix X to the original system of equations. Note that if equed = 'Y', A and B are modified on exit, and the solution to the equilibrated system is inv(diag(s))*X. The second dimension of x must be at least max(1,nrhs).

ap
Array ap is not modified on exit if fact = 'F' or 'N', or if fact = 'E' and equed = 'N'.
If fact = 'E' and equed = 'Y', A is overwritten by diag(s)*A*diag(s).
afp
If fact = 'N' or 'E', then afp is an output argument and on exit returns the triangular factor U or L from the Cholesky factorization A=U**T*U or A=L*L**T (real routines), A=U**H*U or A=L*L**H (complex routines) of the original matrix A (if fact = 'N'), or of the equilibrated matrix A (if fact = 'E'). See the description of ap for the form of the equilibrated matrix.

b
Overwritten by diag(s)*B, if equed = 'Y'; not changed if equed = 'N'.

s
This array is an output argument if fact ≠ 'F'. See the description of s in the Input Parameters section.

rcond
REAL for single precision flavors
DOUBLE PRECISION for double precision flavors.
An estimate of the reciprocal condition number of the matrix A after equilibration (if done). If rcond is less than the machine precision (in particular, if rcond = 0), the matrix is singular to working precision. This condition is indicated by a return code of info > 0.

ferr
REAL for single precision flavors
DOUBLE PRECISION for double precision flavors.
Array, DIMENSION at least max(1, nrhs). Contains the estimated forward error bound for each solution vector x(j) (the j-th column of the solution matrix X). If xtrue is the true solution corresponding to x(j), ferr(j) is an estimated upper bound for the magnitude of the largest element in (x(j) - xtrue) divided by the magnitude of the largest element in x(j). The estimate is as reliable as the estimate for rcond, and is almost always a slight overestimate of the true error.

berr
REAL for single precision flavors
DOUBLE PRECISION for double precision flavors.
Array, DIMENSION at least max(1, nrhs). Contains the componentwise relative backward error for each solution vector x(j), that is, the smallest relative change in any element of A or B that makes x(j) an exact solution.

equed
If fact ≠ 'F', then equed is an output argument. It specifies the form of equilibration that was done (see the description of equed in the Input Parameters section).

info
INTEGER.
If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
If info = i, and i ≤ n, the leading minor of order i (and therefore the matrix A itself) is not positive-definite, so the factorization could not be completed, and the solution and error bounds could not be computed; rcond = 0 is returned.
If info = i, and i = n + 1, then U is nonsingular, but rcond is less than machine precision, meaning that the matrix is singular to working precision. Nevertheless, the solution and error bounds are computed because there are a number of situations where the computed solution can be more accurate than the value of rcond would suggest.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine ppsvx interface are as follows:
ap  Holds the array A of size (n*(n+1)/2).
b  Holds the matrix B of size (n,nrhs).
x  Holds the matrix X of size (n,nrhs).
afp  Holds the matrix AF of size (n*(n+1)/2).
s  Holds the vector of length n. Default value for each element is s(i) = 1.0_WP.
ferr  Holds the vector of length (nrhs).
berr  Holds the vector of length (nrhs).
uplo  Must be 'U' or 'L'. The default value is 'U'.
fact  Must be 'N', 'E', or 'F'. The default value is 'N'. If fact = 'F', then af must be present; otherwise, an error is returned.
equed  Must be 'N' or 'Y'. The default value is 'N'.

?pbsv
Computes the solution to the system of linear equations with a symmetric or Hermitian positive-definite band matrix A and multiple right-hand sides.
Syntax
Fortran 77:
call spbsv( uplo, n, kd, nrhs, ab, ldab, b, ldb, info )
call dpbsv( uplo, n, kd, nrhs, ab, ldab, b, ldb, info )
call cpbsv( uplo, n, kd, nrhs, ab, ldab, b, ldb, info )
call zpbsv( uplo, n, kd, nrhs, ab, ldab, b, ldb, info )
Fortran 95:
call pbsv( ab, b [,uplo] [,info] )
C:
lapack_int LAPACKE_<?>pbsv( int matrix_order, char uplo, lapack_int n, lapack_int kd, lapack_int nrhs, <datatype>* ab, lapack_int ldab, <datatype>* b, lapack_int ldb );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine solves for X the real or complex system of linear equations A*X = B, where A is an n-by-n symmetric/Hermitian positive definite band matrix, the columns of matrix B are individual right-hand sides, and the columns of X are the corresponding solutions.
The Cholesky decomposition is used to factor A as
A = U**T*U (real flavors) and A = U**H*U (complex flavors), if uplo = 'U'
or A = L*L**T (real flavors) and A = L*L**H (complex flavors), if uplo = 'L',
where U is an upper triangular band matrix and L is a lower triangular band matrix, with the same number of superdiagonals or subdiagonals as A. The factored form of A is then used to solve the system of equations A*X = B.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

uplo
CHARACTER*1. Must be 'U' or 'L'.
Indicates whether the upper or lower triangular part of A is stored:
If uplo = 'U', the upper triangle of A is stored.
If uplo = 'L', the lower triangle of A is stored.

n
INTEGER. The order of matrix A; n ≥ 0.

kd
INTEGER. The number of superdiagonals of the matrix A if uplo = 'U', or the number of subdiagonals if uplo = 'L'; kd ≥ 0.

nrhs
INTEGER.
The number of right-hand sides, the number of columns in B; nrhs ≥ 0.

ab, b
REAL for spbsv
DOUBLE PRECISION for dpbsv
COMPLEX for cpbsv
DOUBLE COMPLEX for zpbsv.
Arrays: ab(ldab, *), b(ldb,*).
The array ab contains the upper or the lower triangular part of the matrix A (as specified by uplo) in band storage (see Matrix Storage Schemes). The second dimension of ab must be at least max(1, n).
The array b contains the matrix B whose columns are the right-hand sides for the systems of equations. The second dimension of b must be at least max(1,nrhs).

ldab
INTEGER. The leading dimension of the array ab; ldab ≥ kd+1.

ldb
INTEGER. The leading dimension of b; ldb ≥ max(1, n).

Output Parameters

ab
The upper or lower triangular part of A (in band storage) is overwritten by the Cholesky factor U or L, as specified by uplo, in the same storage format as A.

b
Overwritten by the solution matrix X.

info
INTEGER.
If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
If info = i, the leading minor of order i (and therefore the matrix A itself) is not positive-definite, so the factorization could not be completed, and the solution has not been computed.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine pbsv interface are as follows:
ab  Holds the array A of size (kd+1,n).
b  Holds the matrix B of size (n,nrhs).
uplo  Must be 'U' or 'L'. The default value is 'U'.

?pbsvx
Uses the Cholesky factorization to compute the solution to the system of linear equations with a symmetric (Hermitian) positive-definite band matrix A, and provides error bounds on the solution.
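The band storage expected in ab by ?pbsv and ?pbsvx can be sketched without any library calls. The sketch assumes the standard LAPACK band layout with zero-based, column-major C indexing: for uplo = 'U' the element a(i,j) goes to row kd+i-j of column j, and for uplo = 'L' to row i-j. The function name pack_band is hypothetical and is not part of Intel MKL or LAPACKE.

```c
#include <assert.h>

/* Copy the upper or lower band of a column-major n-by-n matrix a
 * (leading dimension lda) into band storage ab with ldab >= kd+1 rows.
 * Entries of ab that correspond to no matrix element are set to zero. */
void pack_band(char uplo, int n, int kd, const double *a, int lda,
               double *ab, int ldab)
{
    assert(n >= 0 && kd >= 0 && ldab >= kd + 1);
    assert(uplo == 'U' || uplo == 'L');
    for (int j = 0; j < n; ++j) {
        for (int r = 0; r < ldab; ++r)
            ab[r + j * ldab] = 0.0;                 /* clear column j */
        if (uplo == 'U') {
            int first = (j - kd > 0) ? j - kd : 0;  /* topmost row inside the band */
            for (int i = first; i <= j; ++i)
                ab[(kd + i - j) + j * ldab] = a[i + j * lda];
        } else {
            int last = (j + kd < n - 1) ? j + kd : n - 1; /* lowest row inside the band */
            for (int i = j; i <= last; ++i)
                ab[(i - j) + j * ldab] = a[i + j * lda];
        }
    }
}
```

With this layout the diagonal of A sits in the last row of ab for uplo = 'U' and in the first row for uplo = 'L', which is why ldab must be at least kd+1.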
Syntax
Fortran 77:
call spbsvx( fact, uplo, n, kd, nrhs, ab, ldab, afb, ldafb, equed, s, b, ldb, x, ldx, rcond, ferr, berr, work, iwork, info )
call dpbsvx( fact, uplo, n, kd, nrhs, ab, ldab, afb, ldafb, equed, s, b, ldb, x, ldx, rcond, ferr, berr, work, iwork, info )
call cpbsvx( fact, uplo, n, kd, nrhs, ab, ldab, afb, ldafb, equed, s, b, ldb, x, ldx, rcond, ferr, berr, work, rwork, info )
call zpbsvx( fact, uplo, n, kd, nrhs, ab, ldab, afb, ldafb, equed, s, b, ldb, x, ldx, rcond, ferr, berr, work, rwork, info )
Fortran 95:
call pbsvx( ab, b, x [,uplo] [,afb] [,fact] [,equed] [,s] [,ferr] [,berr] [,rcond] [,info] )
C:
lapack_int LAPACKE_spbsvx( int matrix_order, char fact, char uplo, lapack_int n, lapack_int kd, lapack_int nrhs, float* ab, lapack_int ldab, float* afb, lapack_int ldafb, char* equed, float* s, float* b, lapack_int ldb, float* x, lapack_int ldx, float* rcond, float* ferr, float* berr );
lapack_int LAPACKE_dpbsvx( int matrix_order, char fact, char uplo, lapack_int n, lapack_int kd, lapack_int nrhs, double* ab, lapack_int ldab, double* afb, lapack_int ldafb, char* equed, double* s, double* b, lapack_int ldb, double* x, lapack_int ldx, double* rcond, double* ferr, double* berr );
lapack_int LAPACKE_cpbsvx( int matrix_order, char fact, char uplo, lapack_int n, lapack_int kd, lapack_int nrhs, lapack_complex_float* ab, lapack_int ldab, lapack_complex_float* afb, lapack_int ldafb, char* equed, float* s, lapack_complex_float* b, lapack_int ldb, lapack_complex_float* x, lapack_int ldx, float* rcond, float* ferr, float* berr );
lapack_int LAPACKE_zpbsvx( int matrix_order, char fact, char uplo, lapack_int n, lapack_int kd, lapack_int nrhs, lapack_complex_double* ab, lapack_int ldab, lapack_complex_double* afb, lapack_int ldafb, char* equed, double* s, lapack_complex_double* b, lapack_int ldb, lapack_complex_double* x, lapack_int ldx, double* rcond, double* ferr, double* berr );

Include Files
• Fortran: mkl_lapack.fi and
mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine uses the Cholesky factorization A=U**T*U (real flavors) / A=U**H*U (complex flavors) or A=L*L**T (real flavors) / A=L*L**H (complex flavors) to compute the solution to a real or complex system of linear equations A*X = B, where A is an n-by-n symmetric or Hermitian positive definite band matrix, the columns of matrix B are individual right-hand sides, and the columns of X are the corresponding solutions.
Error bounds on the solution and a condition estimate are also provided.
The routine ?pbsvx performs the following steps:
1. If fact = 'E', real scaling factors s are computed to equilibrate the system: diag(s)*A*diag(s)*inv(diag(s))*X = diag(s)*B. Whether or not the system will be equilibrated depends on the scaling of the matrix A, but if equilibration is used, A is overwritten by diag(s)*A*diag(s) and B by diag(s)*B.
2. If fact = 'N' or 'E', the Cholesky decomposition is used to factor the matrix A (after equilibration if fact = 'E') as A = U**T*U (real), A = U**H*U (complex), if uplo = 'U', or A = L*L**T (real), A = L*L**H (complex), if uplo = 'L', where U is an upper triangular band matrix and L is a lower triangular band matrix.
3. If the leading i-by-i principal minor is not positive definite, then the routine returns with info = i. Otherwise, the factored form of A is used to estimate the condition number of the matrix A. If the reciprocal of the condition number is less than machine precision, info = n+1 is returned as a warning, but the routine still goes on to solve for X and compute error bounds as described below.
4. The system of equations is solved for X using the factored form of A.
5. Iterative refinement is applied to improve the computed solution matrix and calculate error bounds and backward error estimates for it.
6. If equilibration was used, the matrix X is premultiplied by diag(s) so that it solves the original system before equilibration.
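Steps 1, 3, and 6 above can be illustrated with a small, library-independent sketch. It is a simplification under stated assumptions: the matrix is kept in full column-major storage rather than band storage, the scale factors s are supplied by the caller, and the factorization and solve of steps 2 and 4 are left out. The function names equilibrate_system, unscale_solution, and solution_available are hypothetical and are not part of Intel MKL or LAPACKE.

```c
#include <assert.h>

/* Step 1: A := diag(s)*A*diag(s) and B := diag(s)*B
 * (full column-major storage is an assumption made for brevity). */
void equilibrate_system(int n, int nrhs, const double *s,
                        double *a, int lda, double *b, int ldb)
{
    assert(n >= 0 && nrhs >= 0);
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < n; ++i)
            a[i + j * lda] *= s[i] * s[j];
    for (int j = 0; j < nrhs; ++j)
        for (int i = 0; i < n; ++i)
            b[i + j * ldb] *= s[i];
}

/* Step 6: the solution Y of the equilibrated system is premultiplied by
 * diag(s), so that X = diag(s)*Y solves the original system. */
void unscale_solution(int n, int nrhs, const double *s, double *x, int ldx)
{
    assert(n >= 0 && nrhs >= 0);
    for (int j = 0; j < nrhs; ++j)
        for (int i = 0; i < n; ++i)
            x[i + j * ldx] *= s[i];
}

/* Step 3: x holds a solution only for info = 0 and for the warning value
 * info = n+1; for 1 <= info <= n the factorization failed. */
int solution_available(int info, int n)
{
    return info == 0 || info == n + 1;
}
```

For example, with A = [4 2; 2 16], B = [8; 34] and s = (0.5, 0.25), the equilibrated system has unit diagonal and solution Y = (2, 8); premultiplying by diag(s) recovers X = (1, 2), which solves the original system.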
Input Parameters
The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

fact
CHARACTER*1. Must be 'F', 'N', or 'E'.
Specifies whether or not the factored form of the matrix A is supplied on entry, and if not, whether the matrix A should be equilibrated before it is factored.
If fact = 'F': on entry, afb contains the factored form of A. If equed = 'Y', the matrix A has been equilibrated with scaling factors given by s. ab and afb will not be modified.
If fact = 'N', the matrix A will be copied to afb and factored.
If fact = 'E', the matrix A will be equilibrated if necessary, then copied to afb and factored.

uplo
CHARACTER*1. Must be 'U' or 'L'.
Indicates whether the upper or lower triangular part of A is stored:
If uplo = 'U', the upper triangle of A is stored.
If uplo = 'L', the lower triangle of A is stored.

n
INTEGER. The order of matrix A; n ≥ 0.

kd
INTEGER. The number of superdiagonals or subdiagonals in the matrix A; kd ≥ 0.

nrhs
INTEGER. The number of right-hand sides, the number of columns in B; nrhs ≥ 0.

ab, afb, b, work
REAL for spbsvx
DOUBLE PRECISION for dpbsvx
COMPLEX for cpbsvx
DOUBLE COMPLEX for zpbsvx.
Arrays: ab(ldab,*), afb(ldafb,*), b(ldb,*), work(*).
The array ab contains the upper or lower triangle of the matrix A in band storage (see Matrix Storage Schemes). If fact = 'F' and equed = 'Y', then ab must contain the equilibrated matrix diag(s)*A*diag(s). The second dimension of ab must be at least max(1, n).
The array afb is an input argument if fact = 'F'. It contains the triangular factor U or L from the Cholesky factorization of the band matrix A in the same storage format as A. If equed = 'Y', then afb is the factored form of the equilibrated matrix A.
The second dimension of afb must be at least max(1, n).
The array b contains the matrix B whose columns are the right-hand sides for the systems of equations. The second dimension of b must be at least max(1, nrhs).
work(*) is a workspace array. The dimension of work must be at least max(1,3*n) for real flavors, and at least max(1,2*n) for complex flavors.

ldab
INTEGER. The leading dimension of ab; ldab ≥ kd+1.

ldafb
INTEGER. The leading dimension of afb; ldafb ≥ kd+1.

ldb
INTEGER. The leading dimension of b; ldb ≥ max(1, n).

equed
CHARACTER*1. Must be 'N' or 'Y'.
equed is an input argument if fact = 'F'. It specifies the form of equilibration that was done:
if equed = 'N', no equilibration was done (always true if fact = 'N');
if equed = 'Y', equilibration was done, that is, A has been replaced by diag(s)*A*diag(s).

s
REAL for single precision flavors
DOUBLE PRECISION for double precision flavors.
Array, DIMENSION (n). The array s contains the scale factors for A. This array is an input argument if fact = 'F' only; otherwise it is an output argument.
If equed = 'N', s is not accessed.
If fact = 'F' and equed = 'Y', each element of s must be positive.

ldx
INTEGER. The leading dimension of the output array x; ldx ≥ max(1, n).

iwork
INTEGER. Workspace array, DIMENSION at least max(1, n); used in real flavors only.

rwork
REAL for cpbsvx
DOUBLE PRECISION for zpbsvx.
Workspace array, DIMENSION at least max(1, n); used in complex flavors only.

Output Parameters

x
REAL for spbsvx
DOUBLE PRECISION for dpbsvx
COMPLEX for cpbsvx
DOUBLE COMPLEX for zpbsvx.
Array, DIMENSION (ldx,*).
If info = 0 or info = n+1, the array x contains the solution matrix X to the original system of equations. Note that if equed = 'Y', A and B are modified on exit, and the solution to the equilibrated system is inv(diag(s))*X. The second dimension of x must be at least max(1,nrhs).
ab
On exit, if fact = 'E' and equed = 'Y', A is overwritten by diag(s)*A*diag(s).

afb
If fact = 'N' or 'E', then afb is an output argument and on exit returns the triangular factor U or L from the Cholesky factorization A=U**T*U or A=L*L**T (real routines), A=U**H*U or A=L*L**H (complex routines) of the original matrix A (if fact = 'N'), or of the equilibrated matrix A (if fact = 'E'). See the description of ab for the form of the equilibrated matrix.

b
Overwritten by diag(s)*B, if equed = 'Y'; not changed if equed = 'N'.

s
This array is an output argument if fact ≠ 'F'. See the description of s in the Input Parameters section.

rcond
REAL for single precision flavors
DOUBLE PRECISION for double precision flavors.
An estimate of the reciprocal condition number of the matrix A after equilibration (if done). If rcond is less than the machine precision (in particular, if rcond = 0), the matrix is singular to working precision. This condition is indicated by a return code of info > 0.

ferr
REAL for single precision flavors
DOUBLE PRECISION for double precision flavors.
Array, DIMENSION at least max(1, nrhs). Contains the estimated forward error bound for each solution vector x(j) (the j-th column of the solution matrix X). If xtrue is the true solution corresponding to x(j), ferr(j) is an estimated upper bound for the magnitude of the largest element in (x(j) - xtrue) divided by the magnitude of the largest element in x(j). The estimate is as reliable as the estimate for rcond, and is almost always a slight overestimate of the true error.

berr
REAL for single precision flavors
DOUBLE PRECISION for double precision flavors.
Array, DIMENSION at least max(1, nrhs). Contains the componentwise relative backward error for each solution vector x(j), that is, the smallest relative change in any element of A or B that makes x(j) an exact solution.

equed
If fact ≠ 'F', then equed is an output argument.
It specifies the form of equilibration that was done (see the description of equed in the Input Parameters section).

info
INTEGER.
If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
If info = i, and i ≤ n, the leading minor of order i (and therefore the matrix A itself) is not positive definite, so the factorization could not be completed, and the solution and error bounds could not be computed; rcond = 0 is returned.
If info = i, and i = n + 1, then U is nonsingular, but rcond is less than machine precision, meaning that the matrix is singular to working precision. Nevertheless, the solution and error bounds are computed because there are a number of situations where the computed solution can be more accurate than the value of rcond would suggest.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine pbsvx interface are as follows:
ab  Holds the array A of size (kd+1,n).
b  Holds the matrix B of size (n,nrhs).
x  Holds the matrix X of size (n,nrhs).
afb  Holds the array AF of size (kd+1,n).
s  Holds the vector with the number of elements n. Default value for each element is s(i) = 1.0_WP.
ferr  Holds the vector with the number of elements nrhs.
berr  Holds the vector with the number of elements nrhs.
uplo  Must be 'U' or 'L'. The default value is 'U'.
fact  Must be 'N', 'E', or 'F'. The default value is 'N'. If fact = 'F', then afb must be present; otherwise, an error is returned.
equed  Must be 'N' or 'Y'. The default value is 'N'.

?ptsv
Computes the solution to the system of linear equations with a symmetric or Hermitian positive definite tridiagonal matrix A and multiple right-hand sides.
Syntax
Fortran 77:
call sptsv( n, nrhs, d, e, b, ldb, info )
call dptsv( n, nrhs, d, e, b, ldb, info )
call cptsv( n, nrhs, d, e, b, ldb, info )
call zptsv( n, nrhs, d, e, b, ldb, info )
Fortran 95:
call ptsv( d, e, b [,info] )
C:
lapack_int LAPACKE_sptsv( int matrix_order, lapack_int n, lapack_int nrhs, float* d, float* e, float* b, lapack_int ldb );
lapack_int LAPACKE_dptsv( int matrix_order, lapack_int n, lapack_int nrhs, double* d, double* e, double* b, lapack_int ldb );
lapack_int LAPACKE_cptsv( int matrix_order, lapack_int n, lapack_int nrhs, float* d, lapack_complex_float* e, lapack_complex_float* b, lapack_int ldb );
lapack_int LAPACKE_zptsv( int matrix_order, lapack_int n, lapack_int nrhs, double* d, lapack_complex_double* e, lapack_complex_double* b, lapack_int ldb );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine solves for X the real or complex system of linear equations A*X = B, where A is an n-by-n symmetric/Hermitian positive-definite tridiagonal matrix, the columns of matrix B are individual right-hand sides, and the columns of X are the corresponding solutions. A is factored as A = L*D*L^T (real flavors) or A = L*D*L^H (complex flavors), and the factored form of A is then used to solve the system of equations A*X = B.

Input Parameters
The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
n INTEGER. The order of matrix A; n ≥ 0.
nrhs INTEGER. The number of right-hand sides, the number of columns in B; nrhs ≥ 0.
d REAL for single precision flavors; DOUBLE PRECISION for double precision flavors. Array, dimension at least max(1, n). Contains the diagonal elements of the tridiagonal matrix A.
e, b REAL for sptsv; DOUBLE PRECISION for dptsv; COMPLEX for cptsv; DOUBLE COMPLEX for zptsv. Arrays: e(n - 1), b(ldb,*). The array e contains the (n - 1) subdiagonal elements of A. The array b contains the matrix B whose columns are the right-hand sides for the systems of equations. The second dimension of b must be at least max(1,nrhs).
ldb INTEGER. The leading dimension of b; ldb ≥ max(1, n).

Output Parameters
d Overwritten by the n diagonal elements of the diagonal matrix D from the L*D*L^T (real)/L*D*L^H (complex) factorization of A.
e Overwritten by the (n - 1) subdiagonal elements of the unit bidiagonal factor L from the factorization of A.
b Overwritten by the solution matrix X.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value. If info = i, the leading minor of order i (and therefore the matrix A itself) is not positive-definite, and the solution has not been computed. The factorization has not been completed unless i = n.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions. Specific details for the routine ptsv interface are as follows:
d Holds the vector of length n.
e Holds the vector of length (n-1).
b Holds the matrix B of size (n,nrhs).

?ptsvx
Uses factorization to compute the solution to the system of linear equations with a symmetric (Hermitian) positive definite tridiagonal matrix A, and provides error bounds on the solution.
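
Both ?ptsv and ?ptsvx are built on the factorization A = L*D*L^T of the tridiagonal matrix. The following minimal C sketch walks through that factor-and-solve sequence for the real double-precision case using only the standard library. It is an illustration, not the Intel MKL implementation: the function name ptsv_sketch and its return convention are assumptions chosen to mirror the documented meaning of d, e, b, ldb, and info.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not an MKL routine): solves A*X = B for a symmetric
 * positive-definite tridiagonal matrix A via A = L*D*L^T.
 *   d[0..n-1] : diagonal of A; overwritten by the diagonal of D
 *   e[0..n-2] : subdiagonal of A; overwritten by the subdiagonal of L
 *   b         : n-by-nrhs, column-major, leading dimension ldb;
 *               overwritten by the solution X
 * Returns 0 on success, or i > 0 if the leading minor of order i is not
 * positive definite (same meaning as a positive info value). */
int ptsv_sketch(int n, int nrhs, double *d, double *e, double *b, int ldb)
{
    assert(n >= 0 && nrhs >= 0 && ldb >= (n > 1 ? n : 1));
    if (n == 0) return 0;

    /* Factorization: eliminate the subdiagonal one column at a time. */
    for (int i = 0; i < n - 1; ++i) {
        if (d[i] <= 0.0) return i + 1;
        double ei = e[i];
        e[i] = ei / d[i];          /* multiplier l(i+1,i) */
        d[i + 1] -= e[i] * ei;     /* update the next pivot */
    }
    if (d[n - 1] <= 0.0) return n;

    /* Solve L*D*L^T*x = b for every right-hand side. */
    for (int j = 0; j < nrhs; ++j) {
        double *x = b + (size_t)j * (size_t)ldb;
        for (int i = 1; i < n; ++i)          /* forward: L*z = b */
            x[i] -= e[i - 1] * x[i - 1];
        x[n - 1] /= d[n - 1];                /* backward: D*L^T*x = z */
        for (int i = n - 2; i >= 0; --i)
            x[i] = x[i] / d[i] - e[i] * x[i + 1];
    }
    return 0;
}
```

For example, with d = {2, 2, 2}, e = {-1, -1}, and b = {0, 0, 4}, the sketch returns 0 and overwrites b with the solution {1, 2, 3}. The early returns correspond to the info = i case, in which the solution is not computed.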
Syntax
Fortran 77:
call sptsvx( fact, n, nrhs, d, e, df, ef, b, ldb, x, ldx, rcond, ferr, berr, work, info )
call dptsvx( fact, n, nrhs, d, e, df, ef, b, ldb, x, ldx, rcond, ferr, berr, work, info )
call cptsvx( fact, n, nrhs, d, e, df, ef, b, ldb, x, ldx, rcond, ferr, berr, work, rwork, info )
call zptsvx( fact, n, nrhs, d, e, df, ef, b, ldb, x, ldx, rcond, ferr, berr, work, rwork, info )
Fortran 95:
call ptsvx( d, e, b, x [,df] [,ef] [,fact] [,ferr] [,berr] [,rcond] [,info] )
C:
lapack_int LAPACKE_sptsvx( int matrix_order, char fact, lapack_int n, lapack_int nrhs, const float* d, const float* e, float* df, float* ef, const float* b, lapack_int ldb, float* x, lapack_int ldx, float* rcond, float* ferr, float* berr );
lapack_int LAPACKE_dptsvx( int matrix_order, char fact, lapack_int n, lapack_int nrhs, const double* d, const double* e, double* df, double* ef, const double* b, lapack_int ldb, double* x, lapack_int ldx, double* rcond, double* ferr, double* berr );
lapack_int LAPACKE_cptsvx( int matrix_order, char fact, lapack_int n, lapack_int nrhs, const float* d, const lapack_complex_float* e, float* df, lapack_complex_float* ef, const lapack_complex_float* b, lapack_int ldb, lapack_complex_float* x, lapack_int ldx, float* rcond, float* ferr, float* berr );
lapack_int LAPACKE_zptsvx( int matrix_order, char fact, lapack_int n, lapack_int nrhs, const double* d, const lapack_complex_double* e, double* df, lapack_complex_double* ef, const lapack_complex_double* b, lapack_int ldb, lapack_complex_double* x, lapack_int ldx, double* rcond, double* ferr, double* berr );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine uses the Cholesky factorization A = L*D*L^T (real)/A = L*D*L^H (complex) to compute the solution to a real or complex system of linear equations A*X = B, where A is an n-by-n symmetric or Hermitian positive definite tridiagonal matrix, the columns of
matrix B are individual right-hand sides, and the columns of X are the corresponding solutions. Error bounds on the solution and a condition estimate are also provided.
The routine ?ptsvx performs the following steps:
1. If fact = 'N', the matrix A is factored as A = L*D*L^T (real flavors)/A = L*D*L^H (complex flavors), where L is a unit lower bidiagonal matrix and D is diagonal. The factorization can also be regarded as having the form A = U^T*D*U (real flavors)/A = U^H*D*U (complex flavors).
2. If the leading i-by-i principal minor is not positive-definite, then the routine returns with info = i. Otherwise, the factored form of A is used to estimate the condition number of the matrix A. If the reciprocal of the condition number is less than machine precision, info = n+1 is returned as a warning, but the routine still goes on to solve for X and compute error bounds as described below.
3. The system of equations is solved for X using the factored form of A.
4. Iterative refinement is applied to improve the computed solution matrix and calculate error bounds and backward error estimates for it.

Input Parameters
The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
fact CHARACTER*1. Must be 'F' or 'N'. Specifies whether or not the factored form of the matrix A is supplied on entry. If fact = 'F': on entry, df and ef contain the factored form of A. Arrays d, e, df, and ef will not be modified. If fact = 'N', the matrix A will be copied to df and ef, and factored.
n INTEGER. The order of matrix A; n ≥ 0.
nrhs INTEGER. The number of right-hand sides, the number of columns in B; nrhs ≥ 0.
d, df, rwork REAL for single precision flavors; DOUBLE PRECISION for double precision flavors. Arrays: d(n), df(n), rwork(n).
The array d contains the n diagonal elements of the tridiagonal matrix A. The array df is an input argument if fact = 'F' and on entry contains the n diagonal elements of the diagonal matrix D from the L*D*L^T (real)/L*D*L^H (complex) factorization of A. The array rwork is a workspace array used for complex flavors only.
e, ef, b, work REAL for sptsvx; DOUBLE PRECISION for dptsvx; COMPLEX for cptsvx; DOUBLE COMPLEX for zptsvx. Arrays: e(n - 1), ef(n - 1), b(ldb,*), work(*). The array e contains the (n - 1) subdiagonal elements of the tridiagonal matrix A. The array ef is an input argument if fact = 'F' and on entry contains the (n - 1) subdiagonal elements of the unit bidiagonal factor L from the L*D*L^T (real)/L*D*L^H (complex) factorization of A. The array b contains the matrix B whose columns are the right-hand sides for the systems of equations. The array work is a workspace array. The dimension of work must be at least 2*n for real flavors, and at least n for complex flavors.
ldb INTEGER. The leading dimension of b; ldb ≥ max(1, n).
ldx INTEGER. The leading dimension of x; ldx ≥ max(1, n).

Output Parameters
x REAL for sptsvx; DOUBLE PRECISION for dptsvx; COMPLEX for cptsvx; DOUBLE COMPLEX for zptsvx. Array, DIMENSION (ldx,*). If info = 0 or info = n+1, the array x contains the solution matrix X to the system of equations. The second dimension of x must be at least max(1,nrhs).
df, ef These arrays are output arguments if fact = 'N'. See the description of df, ef in Input Arguments section.
rcond REAL for single precision flavors; DOUBLE PRECISION for double precision flavors. An estimate of the reciprocal condition number of the matrix A. If rcond is less than the machine precision (in particular, if rcond = 0), the matrix is singular to working precision. This condition is indicated by a return code of info > 0.
ferr REAL for single precision flavors; DOUBLE PRECISION for double precision flavors.
Array, DIMENSION at least max(1, nrhs). Contains the estimated forward error bound for each solution vector x(j) (the j-th column of the solution matrix X). If xtrue is the true solution corresponding to x(j), ferr(j) is an estimated upper bound for the magnitude of the largest element in (x(j) - xtrue) divided by the magnitude of the largest element in x(j). The estimate is as reliable as the estimate for rcond, and is almost always a slight overestimate of the true error.
berr REAL for single precision flavors; DOUBLE PRECISION for double precision flavors. Array, DIMENSION at least max(1, nrhs). Contains the componentwise relative backward error for each solution vector x(j), that is, the smallest relative change in any element of A or B that makes x(j) an exact solution.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value. If info = i, and i ≤ n, the leading minor of order i (and therefore the matrix A itself) is not positive-definite, so the factorization could not be completed, and the solution and error bounds could not be computed; rcond = 0 is returned. If info = i, and i = n + 1, then U is nonsingular, but rcond is less than machine precision, meaning that the matrix is singular to working precision. Nevertheless, the solution and error bounds are computed because there are a number of situations where the computed solution can be more accurate than the value of rcond would suggest.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions. Specific details for the routine ptsvx interface are as follows:
d Holds the vector of length n.
e Holds the vector of length (n-1).
b Holds the matrix B of size (n,nrhs).
x Holds the matrix X of size (n,nrhs).
df Holds the vector of length n.
ef Holds the vector of length (n-1).
ferr Holds the vector of length (nrhs).
berr Holds the vector of length (nrhs).
fact Must be 'N' or 'F'. The default value is 'N'. If fact = 'F', then both arguments df and ef must be present; otherwise, an error is returned.

?sysv
Computes the solution to the system of linear equations with a real or complex symmetric matrix A and multiple right-hand sides.

Syntax
Fortran 77:
call ssysv( uplo, n, nrhs, a, lda, ipiv, b, ldb, work, lwork, info )
call dsysv( uplo, n, nrhs, a, lda, ipiv, b, ldb, work, lwork, info )
call csysv( uplo, n, nrhs, a, lda, ipiv, b, ldb, work, lwork, info )
call zsysv( uplo, n, nrhs, a, lda, ipiv, b, ldb, work, lwork, info )
Fortran 95:
call sysv( a, b [,uplo] [,ipiv] [,info] )
C:
lapack_int LAPACKE_<?>sysv( int matrix_order, char uplo, lapack_int n, lapack_int nrhs, <datatype>* a, lapack_int lda, lapack_int* ipiv, <datatype>* b, lapack_int ldb );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine solves for X the real or complex system of linear equations A*X = B, where A is an n-by-n symmetric matrix, the columns of matrix B are individual right-hand sides, and the columns of X are the corresponding solutions.
The diagonal pivoting method is used to factor A as A = U*D*U^T or A = L*D*L^T, where U (or L) is a product of permutation and unit upper (lower) triangular matrices, and D is symmetric and block diagonal with 1-by-1 and 2-by-2 diagonal blocks.
The factored form of A is then used to solve the system of equations A*X = B.

Input Parameters
The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'.
Indicates whether the upper or lower triangular part of A is stored: If uplo = 'U', the upper triangle of A is stored. If uplo = 'L', the lower triangle of A is stored.
n INTEGER. The order of matrix A; n ≥ 0.
nrhs INTEGER. The number of right-hand sides; the number of columns in B; nrhs ≥ 0.
a, b, work REAL for ssysv; DOUBLE PRECISION for dsysv; COMPLEX for csysv; DOUBLE COMPLEX for zsysv. Arrays: a(lda,*), b(ldb,*), work(*). The array a contains the upper or the lower triangular part of the symmetric matrix A (see uplo). The second dimension of a must be at least max(1, n). The array b contains the matrix B whose columns are the right-hand sides for the systems of equations. The second dimension of b must be at least max(1,nrhs). work is a workspace array, dimension at least max(1,lwork).
lda INTEGER. The leading dimension of a; lda ≥ max(1, n).
ldb INTEGER. The leading dimension of b; ldb ≥ max(1, n).
lwork INTEGER. The size of the work array; lwork ≥ 1. If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla. See Application Notes below for details and for the suggested value of lwork.

Output Parameters
a If info = 0, a is overwritten by the block-diagonal matrix D and the multipliers used to obtain the factor U (or L) from the factorization of A as computed by ?sytrf.
b If info = 0, b is overwritten by the solution matrix X.
ipiv INTEGER. Array, DIMENSION at least max(1, n). Contains details of the interchanges and the block structure of D, as determined by ?sytrf. If ipiv(i) = k > 0, then D(i,i) is a 1-by-1 diagonal block, and the i-th row and column of A was interchanged with the k-th row and column.
If uplo = 'U' and ipiv(i) = ipiv(i-1) = -m < 0, then D has a 2-by-2 block in rows/columns i and i-1, and (i-1)-th row and column of A was interchanged with the m-th row and column. If uplo = 'L' and ipiv(i) = ipiv(i+1) = -m < 0, then D has a 2-by-2 block in rows/columns i and i+1, and (i+1)-th row and column of A was interchanged with the m-th row and column.
work(1) If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value. If info = i, D(i,i) is 0. The factorization has been completed, but D is exactly singular, so the solution could not be computed.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions. Specific details for the routine sysv interface are as follows:
a Holds the matrix A of size (n,n).
b Holds the matrix B of size (n,nrhs).
ipiv Holds the vector of length n.
uplo Must be 'U' or 'L'. The default value is 'U'.

Application Notes
For better performance, try using lwork = n*blocksize, where blocksize is a machine-dependent value (typically, 16 to 64) required for optimum performance of the blocked algorithm.
If you are in doubt how much workspace to supply, use a generous value of lwork for the first run or set lwork = -1.
If you choose the first option and set any admissible lwork size, that is, one no less than the minimal value described, the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the corresponding array work on exit. Use this value (work(1)) for subsequent runs.
If you set lwork = -1, the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work). This operation is called a workspace query.
Note that if you set lwork to less than the minimal required value and not -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.

?sysvx
Uses the diagonal pivoting factorization to compute the solution to the system of linear equations with a real or complex symmetric matrix A, and provides error bounds on the solution.

Syntax
Fortran 77:
call ssysvx( fact, uplo, n, nrhs, a, lda, af, ldaf, ipiv, b, ldb, x, ldx, rcond, ferr, berr, work, lwork, iwork, info )
call dsysvx( fact, uplo, n, nrhs, a, lda, af, ldaf, ipiv, b, ldb, x, ldx, rcond, ferr, berr, work, lwork, iwork, info )
call csysvx( fact, uplo, n, nrhs, a, lda, af, ldaf, ipiv, b, ldb, x, ldx, rcond, ferr, berr, work, lwork, rwork, info )
call zsysvx( fact, uplo, n, nrhs, a, lda, af, ldaf, ipiv, b, ldb, x, ldx, rcond, ferr, berr, work, lwork, rwork, info )
Fortran 95:
call sysvx( a, b, x [,uplo] [,af] [,ipiv] [,fact] [,ferr] [,berr] [,rcond] [,info] )
C:
lapack_int LAPACKE_ssysvx( int matrix_order, char fact, char uplo, lapack_int n, lapack_int nrhs, const float* a, lapack_int lda, float* af, lapack_int ldaf, lapack_int* ipiv, const float* b, lapack_int ldb, float* x, lapack_int ldx, float* rcond, float* ferr, float* berr );
lapack_int LAPACKE_dsysvx( int matrix_order, char fact, char uplo, lapack_int n, lapack_int nrhs, const double* a, lapack_int lda, double* af, lapack_int ldaf, lapack_int* ipiv, const double* b, lapack_int ldb, double* x, lapack_int ldx, double* rcond, double* ferr, double* berr );
lapack_int LAPACKE_csysvx( int matrix_order, char fact, char uplo, lapack_int n, lapack_int nrhs, const lapack_complex_float* a, lapack_int lda, lapack_complex_float* af, lapack_int ldaf, lapack_int* ipiv, const
lapack_complex_float* b, lapack_int ldb, lapack_complex_float* x, lapack_int ldx, float* rcond, float* ferr, float* berr );
lapack_int LAPACKE_zsysvx( int matrix_order, char fact, char uplo, lapack_int n, lapack_int nrhs, const lapack_complex_double* a, lapack_int lda, lapack_complex_double* af, lapack_int ldaf, lapack_int* ipiv, const lapack_complex_double* b, lapack_int ldb, lapack_complex_double* x, lapack_int ldx, double* rcond, double* ferr, double* berr );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine uses the diagonal pivoting factorization to compute the solution to a real or complex system of linear equations A*X = B, where A is an n-by-n symmetric matrix, the columns of matrix B are individual right-hand sides, and the columns of X are the corresponding solutions. Error bounds on the solution and a condition estimate are also provided.
The routine ?sysvx performs the following steps:
1. If fact = 'N', the diagonal pivoting method is used to factor the matrix A. The form of the factorization is A = U*D*U^T or A = L*D*L^T, where U (or L) is a product of permutation and unit upper (lower) triangular matrices, and D is symmetric and block diagonal with 1-by-1 and 2-by-2 diagonal blocks.
2. If some D(i,i) = 0, so that D is exactly singular, then the routine returns with info = i. Otherwise, the factored form of A is used to estimate the condition number of the matrix A. If the reciprocal of the condition number is less than machine precision, info = n+1 is returned as a warning, but the routine still goes on to solve for X and compute error bounds as described below.
3. The system of equations is solved for X using the factored form of A.
4. Iterative refinement is applied to improve the computed solution matrix and calculate error bounds and backward error estimates for it.

Input Parameters
The data types are given for the Fortran interface.
A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
fact CHARACTER*1. Must be 'F' or 'N'. Specifies whether or not the factored form of the matrix A has been supplied on entry. If fact = 'F': on entry, af and ipiv contain the factored form of A. Arrays a, af, and ipiv will not be modified. If fact = 'N', the matrix A will be copied to af and factored.
uplo CHARACTER*1. Must be 'U' or 'L'. Indicates whether the upper or lower triangular part of A is stored: If uplo = 'U', the upper triangle of A is stored. If uplo = 'L', the lower triangle of A is stored.
n INTEGER. The order of matrix A; n ≥ 0.
nrhs INTEGER. The number of right-hand sides, the number of columns in B; nrhs ≥ 0.
a, af, b, work REAL for ssysvx; DOUBLE PRECISION for dsysvx; COMPLEX for csysvx; DOUBLE COMPLEX for zsysvx. Arrays: a(lda,*), af(ldaf,*), b(ldb,*), work(*). The array a contains the upper or the lower triangular part of the symmetric matrix A (see uplo). The second dimension of a must be at least max(1,n). The array af is an input argument if fact = 'F'. It contains the block diagonal matrix D and the multipliers used to obtain the factor U or L from the factorization A = U*D*U**T or A = L*D*L**T as computed by ?sytrf. The second dimension of af must be at least max(1,n). The array b contains the matrix B whose columns are the right-hand sides for the systems of equations. The second dimension of b must be at least max(1, nrhs). work(*) is a workspace array, dimension at least max(1,lwork).
lda INTEGER. The leading dimension of a; lda ≥ max(1, n).
ldaf INTEGER. The leading dimension of af; ldaf ≥ max(1, n).
ldb INTEGER. The leading dimension of b; ldb ≥ max(1, n).
ipiv INTEGER. Array, DIMENSION at least max(1, n). The array ipiv is an input argument if fact = 'F'.
It contains details of the interchanges and the block structure of D, as determined by ?sytrf. If ipiv(i) = k > 0, then D(i,i) is a 1-by-1 diagonal block, and the i-th row and column of A was interchanged with the k-th row and column. If uplo = 'U' and ipiv(i) = ipiv(i-1) = -m < 0, then D has a 2-by-2 block in rows/columns i and i-1, and (i-1)-th row and column of A was interchanged with the m-th row and column. If uplo = 'L' and ipiv(i) = ipiv(i+1) = -m < 0, then D has a 2-by-2 block in rows/columns i and i+1, and (i+1)-th row and column of A was interchanged with the m-th row and column.
ldx INTEGER. The leading dimension of the output array x; ldx ≥ max(1, n).
lwork INTEGER. The size of the work array. If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla. See Application Notes below for details and for the suggested value of lwork.
iwork INTEGER. Workspace array, DIMENSION at least max(1, n); used in real flavors only.
rwork REAL for csysvx; DOUBLE PRECISION for zsysvx. Workspace array, DIMENSION at least max(1, n); used in complex flavors only.

Output Parameters
x REAL for ssysvx; DOUBLE PRECISION for dsysvx; COMPLEX for csysvx; DOUBLE COMPLEX for zsysvx. Array, DIMENSION (ldx,*). If info = 0 or info = n+1, the array x contains the solution matrix X to the system of equations. The second dimension of x must be at least max(1,nrhs).
af, ipiv These arrays are output arguments if fact = 'N'. See the description of af, ipiv in Input Arguments section.
rcond REAL for single precision flavors; DOUBLE PRECISION for double precision flavors. An estimate of the reciprocal condition number of the matrix A. If rcond is less than the machine precision (in particular, if rcond = 0), the matrix is singular to working precision.
This condition is indicated by a return code of info > 0.
ferr REAL for single precision flavors; DOUBLE PRECISION for double precision flavors. Array, DIMENSION at least max(1, nrhs). Contains the estimated forward error bound for each solution vector x(j) (the j-th column of the solution matrix X). If xtrue is the true solution corresponding to x(j), ferr(j) is an estimated upper bound for the magnitude of the largest element in (x(j) - xtrue) divided by the magnitude of the largest element in x(j). The estimate is as reliable as the estimate for rcond, and is almost always a slight overestimate of the true error.
berr REAL for single precision flavors; DOUBLE PRECISION for double precision flavors. Array, DIMENSION at least max(1, nrhs). Contains the componentwise relative backward error for each solution vector x(j), that is, the smallest relative change in any element of A or B that makes x(j) an exact solution.
work(1) If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value. If info = i, and i ≤ n, then D(i,i) is exactly zero. The factorization has been completed, but the block diagonal matrix D is exactly singular, so the solution and error bounds could not be computed; rcond = 0 is returned. If info = i, and i = n + 1, then D is nonsingular, but rcond is less than machine precision, meaning that the matrix is singular to working precision. Nevertheless, the solution and error bounds are computed because there are a number of situations where the computed solution can be more accurate than the value of rcond would suggest.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine sysvx interface are as follows:
a Holds the matrix A of size (n,n).
b Holds the matrix B of size (n,nrhs).
x Holds the matrix X of size (n,nrhs).
af Holds the matrix AF of size (n,n).
ipiv Holds the vector of length n.
ferr Holds the vector of length (nrhs).
berr Holds the vector of length (nrhs).
uplo Must be 'U' or 'L'. The default value is 'U'.
fact Must be 'N' or 'F'. The default value is 'N'. If fact = 'F', then both arguments af and ipiv must be present; otherwise, an error is returned.

Application Notes
The value of lwork must be at least max(1,m*n), where for real flavors m = 3 and for complex flavors m = 2. For better performance, try using lwork = max(1, m*n, n*blocksize), where blocksize is the optimal block size for ?sytrf.
If you are in doubt how much workspace to supply, use a generous value of lwork for the first run or set lwork = -1.
If you choose the first option and set any admissible lwork size, that is, one no less than the minimal value described, the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the corresponding array work on exit. Use this value (work(1)) for subsequent runs.
If you set lwork = -1, the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work). This operation is called a workspace query.
Note that if you set lwork to less than the minimal required value and not -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.

?sysvxx
Uses extra precise iterative refinement to compute the solution to the system of linear equations with a symmetric indefinite matrix A applying the diagonal pivoting factorization.
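
The berr output of ?sysvx (and of ?sysvxx) is described as the smallest relative change in any element of A or B that makes x(j) an exact solution. For one right-hand side this quantity is commonly evaluated as the maximum over i of |b - A*x|(i) / (|A|*|x| + |b|)(i). The C sketch below computes that ratio for the real double-precision case. It is only an illustration under stated assumptions: berr_sketch is a hypothetical name, the matrix is passed as a full column-major array with both triangles filled in (not in the uplo storage the routines accept), and the safeguards that library refinement routines apply to very small denominators are omitted.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Hypothetical helper (not an MKL routine): componentwise relative backward
 * error of an approximate solution x of A*x = b,
 *   max_i |b - A*x|_i / (|A|*|x| + |b|)_i .
 * a is a full n-by-n matrix, column-major, leading dimension lda. */
double berr_sketch(int n, const double *a, int lda,
                   const double *x, const double *b)
{
    assert(n >= 0 && lda >= (n > 1 ? n : 1));
    double berr = 0.0;
    for (int i = 0; i < n; ++i) {
        double r = b[i];            /* accumulates the residual (b - A*x)_i */
        double denom = fabs(b[i]);  /* accumulates (|A|*|x| + |b|)_i */
        for (int j = 0; j < n; ++j) {
            double aij = a[i + (size_t)j * (size_t)lda];
            r -= aij * x[j];
            denom += fabs(aij) * fabs(x[j]);
        }
        if (denom > 0.0) {          /* rows with a zero denominator are skipped */
            double ratio = fabs(r) / denom;
            if (ratio > berr) berr = ratio;
        }
    }
    return berr;
}
```

For an exact solution the residual vanishes and the sketch returns 0; a value near the machine precision indicates that x solves a system very close to the one supplied.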
Syntax
Fortran 77:
call ssysvxx( fact, uplo, n, nrhs, a, lda, af, ldaf, ipiv, equed, s, b, ldb, x, ldx, rcond, rpvgrw, berr, n_err_bnds, err_bnds_norm, err_bnds_comp, nparams, params, work, iwork, info )
call dsysvxx( fact, uplo, n, nrhs, a, lda, af, ldaf, ipiv, equed, s, b, ldb, x, ldx, rcond, rpvgrw, berr, n_err_bnds, err_bnds_norm, err_bnds_comp, nparams, params, work, iwork, info )
call csysvxx( fact, uplo, n, nrhs, a, lda, af, ldaf, ipiv, equed, s, b, ldb, x, ldx, rcond, rpvgrw, berr, n_err_bnds, err_bnds_norm, err_bnds_comp, nparams, params, work, rwork, info )
call zsysvxx( fact, uplo, n, nrhs, a, lda, af, ldaf, ipiv, equed, s, b, ldb, x, ldx, rcond, rpvgrw, berr, n_err_bnds, err_bnds_norm, err_bnds_comp, nparams, params, work, rwork, info )
C:
lapack_int LAPACKE_ssysvxx( int matrix_order, char fact, char uplo, lapack_int n, lapack_int nrhs, float* a, lapack_int lda, float* af, lapack_int ldaf, lapack_int* ipiv, char* equed, float* s, float* b, lapack_int ldb, float* x, lapack_int ldx, float* rcond, float* rpvgrw, float* berr, lapack_int n_err_bnds, float* err_bnds_norm, float* err_bnds_comp, lapack_int nparams, const float* params );
lapack_int LAPACKE_dsysvxx( int matrix_order, char fact, char uplo, lapack_int n, lapack_int nrhs, double* a, lapack_int lda, double* af, lapack_int ldaf, lapack_int* ipiv, char* equed, double* s, double* b, lapack_int ldb, double* x, lapack_int ldx, double* rcond, double* rpvgrw, double* berr, lapack_int n_err_bnds, double* err_bnds_norm, double* err_bnds_comp, lapack_int nparams, const double* params );
lapack_int LAPACKE_csysvxx( int matrix_order, char fact, char uplo, lapack_int n, lapack_int nrhs, lapack_complex_float* a, lapack_int lda, lapack_complex_float* af, lapack_int ldaf, lapack_int* ipiv, char* equed, float* s, lapack_complex_float* b, lapack_int ldb, lapack_complex_float* x, lapack_int ldx, float* rcond, float* rpvgrw, float* berr, lapack_int n_err_bnds, float*
err_bnds_norm, float* err_bnds_comp, lapack_int nparams, const float* params );
lapack_int LAPACKE_zsysvxx( int matrix_order, char fact, char uplo, lapack_int n, lapack_int nrhs, lapack_complex_double* a, lapack_int lda, lapack_complex_double* af, lapack_int ldaf, lapack_int* ipiv, char* equed, double* s, lapack_complex_double* b, lapack_int ldb, lapack_complex_double* x, lapack_int ldx, double* rcond, double* rpvgrw, double* berr, lapack_int n_err_bnds, double* err_bnds_norm, double* err_bnds_comp, lapack_int nparams, const double* params );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine uses the diagonal pivoting factorization A = U*D*U^T or A = L*D*L^T to compute the solution to a real or complex system of linear equations A*X = B, where A is an n-by-n symmetric indefinite matrix, the columns of matrix B are individual right-hand sides, and the columns of X are the corresponding solutions.
Both normwise and maximum componentwise error bounds are also provided on request. The routine returns a solution with a small guaranteed error (O(eps), where eps is the working machine precision) unless the matrix is very ill-conditioned, in which case a warning is returned. Relevant condition numbers are also calculated and returned.
The routine accepts user-provided factorizations and equilibration factors; see definitions of the fact and equed options. Solving with refinement and using a factorization from a previous call of the routine also produces a solution with O(eps) errors or warnings, but that may not be true for general user-provided factorizations and equilibration factors if they differ from what the routine would itself produce.
The routine ?sysvxx performs the following steps:
1.
If fact = 'E', scaling factors are computed to equilibrate the system:
diag(s)*A*diag(s)*inv(diag(s))*X = diag(s)*B
Whether or not the system will be equilibrated depends on the scaling of the matrix A, but if equilibration is used, A is overwritten by diag(s)*A*diag(s) and B by diag(s)*B.
2. If fact = 'N' or 'E', the diagonal pivoting method is used to factor the matrix A (after equilibration if fact = 'E') as A = U*D*U**T, if uplo = 'U', or A = L*D*L**T, if uplo = 'L', where U or L is a product of permutation and unit upper (lower) triangular matrices, and D is symmetric and block diagonal with 1-by-1 and 2-by-2 diagonal blocks.
3. If some D(i,i) = 0, so that D is exactly singular, the routine returns with info = i. Otherwise, the factored form of A is used to estimate the condition number of the matrix A (see the rcond parameter). If the reciprocal of the condition number is less than machine precision, the routine still goes on to solve for X and compute error bounds.
4. The system of equations is solved for X using the factored form of A.
5. By default, unless params(la_linrx_itref_i) is set to zero, the routine applies iterative refinement to get a small error and error bounds. Refinement calculates the residual to at least twice the working precision.
6. If equilibration was used, the matrix X is premultiplied by diag(s) so that it solves the original system before equilibration.

Input Parameters

The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

fact CHARACTER*1. Must be 'F', 'N', or 'E'. Specifies whether or not the factored form of the matrix A is supplied on entry, and if not, whether the matrix A should be equilibrated before it is factored. If fact = 'F', on entry, af and ipiv contain the factored form of A.
If equed is not 'N', the matrix A has been equilibrated with scaling factors given by s. Parameters a, af, and ipiv are not modified. If fact = 'N', the matrix A will be copied to af and factored. If fact = 'E', the matrix A will be equilibrated, if necessary, copied to af and factored.

uplo CHARACTER*1. Must be 'U' or 'L'. Indicates whether the upper or lower triangular part of A is stored: If uplo = 'U', the upper triangle of A is stored. If uplo = 'L', the lower triangle of A is stored.

n INTEGER. The number of linear equations; the order of the matrix A; n ≥ 0.

nrhs INTEGER. The number of right-hand sides; the number of columns of the matrices B and X; nrhs ≥ 0.

a, af, b, work REAL for ssysvxx, DOUBLE PRECISION for dsysvxx, COMPLEX for csysvxx, DOUBLE COMPLEX for zsysvxx. Arrays: a(lda,*), af(ldaf,*), b(ldb,*), work(*).
The array a contains the symmetric matrix A as specified by uplo. If uplo = 'U', the leading n-by-n upper triangular part of a contains the upper triangular part of the matrix A and the strictly lower triangular part of a is not referenced. If uplo = 'L', the leading n-by-n lower triangular part of a contains the lower triangular part of the matrix A and the strictly upper triangular part of a is not referenced. The second dimension of a must be at least max(1,n).
The array af is an input argument if fact = 'F'. It contains the block diagonal matrix D and the multipliers used to obtain the factor U or L from the factorization A = U*D*U**T or A = L*D*L**T as computed by ?sytrf. The second dimension of af must be at least max(1,n).
The array b contains the matrix B whose columns are the right-hand sides for the systems of equations. The second dimension of b must be at least max(1,nrhs).
work(*) is a workspace array. The dimension of work must be at least max(1,4*n) for real flavors, and at least max(1,2*n) for complex flavors.

lda INTEGER. The leading dimension of the array a; lda ≥ max(1,n).

ldaf INTEGER.
The leading dimension of the array af; ldaf ≥ max(1,n).

ipiv INTEGER. Array, DIMENSION at least max(1, n). The array ipiv is an input argument if fact = 'F'. It contains details of the interchanges and the block structure of D as determined by ?sytrf.
If ipiv(k) > 0, rows and columns k and ipiv(k) were interchanged and D(k,k) is a 1-by-1 diagonal block.
If uplo = 'U' and ipiv(k) = ipiv(k-1) < 0, rows and columns k-1 and -ipiv(k) were interchanged and D(k-1:k, k-1:k) is a 2-by-2 diagonal block.
If uplo = 'L' and ipiv(k) = ipiv(k+1) < 0, rows and columns k+1 and -ipiv(k) were interchanged and D(k:k+1, k:k+1) is a 2-by-2 diagonal block.

equed CHARACTER*1. Must be 'N' or 'Y'. equed is an input argument if fact = 'F'. It specifies the form of equilibration that was done:
If equed = 'N', no equilibration was done (always true if fact = 'N').
If equed = 'Y', both row and column equilibration was done, that is, A has been replaced by diag(s)*A*diag(s).

s REAL for single precision flavors, DOUBLE PRECISION for double precision flavors. Array, DIMENSION (n). The array s contains the scale factors for A. If equed = 'Y', A is multiplied on the left and right by diag(s). This array is an input argument if fact = 'F' only; otherwise it is an output argument. If fact = 'F' and equed = 'Y', each element of s must be positive. Each element of s should be a power of the radix to ensure a reliable solution and error estimates. Scaling by powers of the radix does not cause rounding errors unless the result underflows or overflows. Rounding errors during scaling lead to refining with a matrix that is not equivalent to the input matrix, producing error estimates that may not be reliable.

ldb INTEGER. The leading dimension of the array b; ldb ≥ max(1, n).

ldx INTEGER. The leading dimension of the output array x; ldx ≥ max(1, n).

n_err_bnds INTEGER. Number of error bounds to return for each right-hand side and each type (normwise or componentwise).
See err_bnds_norm and err_bnds_comp descriptions in the Output Parameters section below.

nparams INTEGER. Specifies the number of parameters set in params. If nparams = 0, the params array is never referenced and default values are used.

params REAL for single precision flavors, DOUBLE PRECISION for double precision flavors. Array, DIMENSION nparams. Specifies algorithm parameters. If an entry is less than 0.0, that entry is filled with the default value used for that parameter. Only positions up to nparams are accessed; defaults are used for higher-numbered parameters. If defaults are acceptable, you can pass nparams = 0, which prevents the source code from accessing the params argument.
params(la_linrx_itref_i = 1): Whether to perform iterative refinement or not. Default: 1.0 (for single precision flavors), 1.0D+0 (for double precision flavors).
= 0.0: No refinement is performed and no error bounds are computed.
= 1.0: Use the extra-precise refinement algorithm.
(Other values are reserved for future use.)
params(la_linrx_ithresh_i = 2): Maximum number of residual computations allowed for refinement.
Default: 10.
Aggressive: Set to 100 to permit convergence using approximate factorizations or factorizations other than LU. If the factorization uses a technique other than Gaussian elimination, the guarantees in err_bnds_norm and err_bnds_comp may no longer be trustworthy.
params(la_linrx_cwise_i = 3): Flag determining if the code will attempt to find a solution with a small componentwise relative error in the double-precision algorithm. Positive is true, 0.0 is false. Default: 1.0 (attempt componentwise convergence).

iwork INTEGER. Workspace array, DIMENSION at least max(1, n); used in real flavors only.

rwork REAL for single precision flavors, DOUBLE PRECISION for double precision flavors. Workspace array, DIMENSION at least max(1, 3*n); used in complex flavors only.
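As a minimal sketch of the defaulting rules for params described above, the following C helper computes the values that would effectively be used for a given nparams and params. The type xx_params and the function effective_params are hypothetical names introduced for illustration only; they are not part of Intel MKL, and the defaults are the ones stated in this section.

```c
#include <assert.h>
#include <stddef.h>

/* 1-based positions in params, as named in this section. */
enum { LA_LINRX_ITREF_I = 1, LA_LINRX_ITHRESH_I = 2, LA_LINRX_CWISE_I = 3 };

/* Hypothetical container for the effective settings (not an MKL type). */
typedef struct {
    double itref;    /* 0.0: no refinement, 1.0: extra-precise refinement */
    double ithresh;  /* maximum number of residual computations           */
    double cwise;    /* positive: attempt componentwise convergence       */
} xx_params;

/* Hypothetical helper: applies the documented rules without modifying params. */
xx_params effective_params(int nparams, const double *params)
{
    xx_params p = { 1.0, 10.0, 1.0 };   /* defaults stated in this section */

    assert(nparams >= 0);
    assert(nparams == 0 || params != NULL);

    /* Only positions up to nparams are read; a negative entry selects the default. */
    if (nparams >= LA_LINRX_ITREF_I && params[LA_LINRX_ITREF_I - 1] >= 0.0)
        p.itref = params[LA_LINRX_ITREF_I - 1];
    if (nparams >= LA_LINRX_ITHRESH_I && params[LA_LINRX_ITHRESH_I - 1] >= 0.0)
        p.ithresh = params[LA_LINRX_ITHRESH_I - 1];
    if (nparams >= LA_LINRX_CWISE_I && params[LA_LINRX_CWISE_I - 1] >= 0.0)
        p.cwise = params[LA_LINRX_CWISE_I - 1];

    return p;
}
```

For example, effective_params(0, NULL) yields the defaults, while passing { 0.0, -1.0, 0.0 } with nparams = 3 disables refinement and componentwise convergence and keeps the default limit of 10 residual computations.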
Output Parameters

x REAL for ssysvxx, DOUBLE PRECISION for dsysvxx, COMPLEX for csysvxx, DOUBLE COMPLEX for zsysvxx. Array, DIMENSION (ldx,nrhs). If info = 0, the array x contains the solution n-by-nrhs matrix X to the original system of equations. Note that A and B are modified on exit if equed ≠ 'N', and the solution to the equilibrated system is: inv(diag(s))*X.

a If fact = 'E' and equed = 'Y', overwritten by diag(s)*A*diag(s).

af If fact = 'N', af is an output argument and on exit returns the block diagonal matrix D and the multipliers used to obtain the factor U or L from the factorization A = U*D*U**T or A = L*D*L**T.

b If equed = 'N', B is not modified. If equed = 'Y', B is overwritten by diag(s)*B.

s This array is an output argument if fact ≠ 'F'. Each element of this array is a power of the radix. See the description of s in the Input Parameters section.

rcond REAL for single precision flavors, DOUBLE PRECISION for double precision flavors. Reciprocal scaled condition number. An estimate of the reciprocal Skeel condition number of the matrix A after equilibration (if done). If rcond is less than the machine precision, in particular, if rcond = 0, the matrix is singular to working precision. Note that the error may still be small even if this number is very small and the matrix appears ill-conditioned.

rpvgrw REAL for single precision flavors, DOUBLE PRECISION for double precision flavors. Contains the reciprocal pivot growth factor norm(A)/norm(U). The max absolute element norm is used. If this is much less than 1, the stability of the factorization of the (equilibrated) matrix A could be poor. This also means that the solution X, estimated condition numbers, and error bounds could be unreliable. If factorization fails with 0 < info ≤ n, this parameter contains the reciprocal pivot growth factor for the leading info columns of A.

berr REAL for single precision flavors, DOUBLE PRECISION for double precision flavors.
Array, DIMENSION at least max(1, nrhs). Contains the componentwise relative backward error for each solution vector x(j), that is, the smallest relative change in any element of A or B that makes x(j) an exact solution.

err_bnds_norm REAL for single precision flavors, DOUBLE PRECISION for double precision flavors. Array, DIMENSION (nrhs,n_err_bnds). For each right-hand side, contains information about various error bounds and condition numbers corresponding to the normwise relative error, which is defined as follows:
Normwise relative error in the i-th solution vector: max_j |xtrue(j,i) - x(j,i)| / max_j |x(j,i)|
The array is indexed by the type of error information as described below. There are currently up to three pieces of information returned.
The first index in err_bnds_norm(i,:) corresponds to the i-th right-hand side.
The second index in err_bnds_norm(:,err) contains the following three fields:
err=1 "Trust/don't trust" boolean. Trust the answer if the reciprocal condition number is not less than the threshold sqrt(n)*slamch('e') for single precision flavors and sqrt(n)*dlamch('e') for double precision flavors.
err=2 "Guaranteed" error bound. The estimated forward error, almost certainly within a factor of 10 of the true error so long as the next entry is greater than the threshold sqrt(n)*slamch('e') for single precision flavors and sqrt(n)*dlamch('e') for double precision flavors. This error bound should only be trusted if the previous boolean is true.
err=3 Reciprocal condition number. Estimated normwise reciprocal condition number. Compared with the threshold sqrt(n)*slamch('e') for single precision flavors and sqrt(n)*dlamch('e') for double precision flavors to determine if the error estimate is "guaranteed". These reciprocal condition numbers are 1/(norm(1/z,inf)*norm(z,inf)) for some appropriately scaled matrix Z. Let z=s*a, where s scales each row by a power of the radix so all absolute row sums of z are approximately 1.
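The following C sketch illustrates how the three fields of err_bnds_norm for one right-hand side might be read, and how the normwise relative error could be evaluated when a reference solution happens to be available. It assumes the Fortran column-major layout err_bnds_norm(nrhs, n_err_bnds) and the usual LAPACK definition of the normwise relative error, max_j |xtrue(j) - x(j)| / max_j |x(j)|. The names err_info, read_err_bnds, and normwise_rel_error are hypothetical and are not part of Intel MKL.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical view of the fields stored for one right-hand side. */
typedef struct {
    int    trusted;  /* err = 1: nonzero means the bound is "guaranteed" */
    double bound;    /* err = 2: estimated forward error bound           */
    double rcond;    /* err = 3: reciprocal condition number             */
} err_info;

/* Read the fields for right-hand side i (1-based) from a column-major
   array of shape (nrhs, n_err_bnds). Fields at positions beyond
   n_err_bnds are not read and are reported as -1. */
err_info read_err_bnds(const double *err_bnds, int nrhs, int n_err_bnds, int i)
{
    err_info e = { -1, -1.0, -1.0 };

    assert(err_bnds != NULL && i >= 1 && i <= nrhs);

    if (n_err_bnds >= 1)
        e.trusted = (err_bnds[(size_t)0 * nrhs + (i - 1)] != 0.0);
    if (n_err_bnds >= 2)
        e.bound = err_bnds[(size_t)1 * nrhs + (i - 1)];
    if (n_err_bnds >= 3)
        e.rcond = err_bnds[(size_t)2 * nrhs + (i - 1)];
    return e;
}

static double abs_d(double v) { return v < 0.0 ? -v : v; }

/* Normwise relative error of one solution vector x against xtrue.
   The caller must ensure that x is not identically zero. */
double normwise_rel_error(int n, const double *x, const double *xtrue)
{
    double num = 0.0, den = 0.0;

    for (int j = 0; j < n; ++j) {
        double diff = abs_d(xtrue[j] - x[j]);
        double mag  = abs_d(x[j]);
        if (diff > num) num = diff;
        if (mag > den)  den = mag;
    }
    return num / den;
}
```

A caller would typically check the trusted field first and use bound only when it is nonzero, in line with the description of err=2.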
err_bnds_comp REAL for single precision flavors, DOUBLE PRECISION for double precision flavors. Array, DIMENSION (nrhs,n_err_bnds). For each right-hand side, contains information about various error bounds and condition numbers corresponding to the componentwise relative error, which is defined as follows:
Componentwise relative error in the i-th solution vector: max_j |xtrue(j,i) - x(j,i)| / |x(j,i)|
The array is indexed by the right-hand side i, on which the componentwise relative error depends, and by the type of error information as described below. There are currently up to three pieces of information returned for each right-hand side. If componentwise accuracy is not requested (params(3) = 0.0), then err_bnds_comp is not accessed. If n_err_bnds < 3, then at most the first (:,n_err_bnds) entries are returned.
The first index in err_bnds_comp(i,:) corresponds to the i-th right-hand side.
The second index in err_bnds_comp(:,err) contains the following three fields:
err=1 "Trust/don't trust" boolean. Trust the answer if the reciprocal condition number is not less than the threshold sqrt(n)*slamch('e') for single precision flavors and sqrt(n)*dlamch('e') for double precision flavors.
err=2 "Guaranteed" error bound. The estimated forward error, almost certainly within a factor of 10 of the true error so long as the next entry is greater than the threshold sqrt(n)*slamch('e') for single precision flavors and sqrt(n)*dlamch('e') for double precision flavors. This error bound should only be trusted if the previous boolean is true.
err=3 Reciprocal condition number. Estimated componentwise reciprocal condition number. Compared with the threshold sqrt(n)*slamch('e') for single precision flavors and sqrt(n)*dlamch('e') for double precision flavors to determine if the error estimate is "guaranteed". These reciprocal condition numbers are 1/(norm(1/z,inf)*norm(z,inf)) for some appropriately scaled matrix Z.
Let z=s*(a*diag(x)), where x is the solution for the current right-hand side and s scales each row of a*diag(x) by a power of the radix so all absolute row sums of z are approximately 1.

ipiv If fact = 'N', ipiv is an output argument and on exit contains details of the interchanges and the block structure of D, as determined by ?sytrf.

equed If fact ≠ 'F', then equed is an output argument. It specifies the form of equilibration that was done (see the description of equed in the Input Parameters section).

params If an entry is less than 0.0, that entry is filled with the default value used for that parameter; otherwise the entry is not modified.

info INTEGER.
If info = 0, the execution is successful. The solution to every right-hand side is guaranteed.
If info = -i, the i-th parameter had an illegal value.
If 0 < info ≤ n: U(info,info) is exactly zero. The factorization has been completed, but the factor U is exactly singular, so the solution and error bounds could not be computed; rcond = 0 is returned.
If info = n+j: The solution corresponding to the j-th right-hand side is not guaranteed. The solutions corresponding to other right-hand sides k with k > j may not be guaranteed as well, but only the first such right-hand side is reported. If a small componentwise error is not requested (params(3) = 0.0), then the j-th right-hand side is the first with a normwise error bound that is not guaranteed (the smallest j such that err_bnds_norm(j,1) = 0.0). By default (params(3) = 1.0), the j-th right-hand side is the first with either a normwise or a componentwise error bound that is not guaranteed (the smallest j such that err_bnds_norm(j,1) = 0.0 or err_bnds_comp(j,1) = 0.0). See the definition of err_bnds_norm(:,1) and err_bnds_comp(:,1). To get information about all of the right-hand sides, check err_bnds_norm or err_bnds_comp.

?hesv
Computes the solution to the system of linear equations with a Hermitian matrix A and multiple right-hand sides.
Syntax

Fortran 77:
call chesv( uplo, n, nrhs, a, lda, ipiv, b, ldb, work, lwork, info )
call zhesv( uplo, n, nrhs, a, lda, ipiv, b, ldb, work, lwork, info )

Fortran 95:
call hesv( a, b [,uplo] [,ipiv] [,info] )

C:
lapack_int LAPACKE_<?>hesv( int matrix_order, char uplo, lapack_int n, lapack_int nrhs, <datatype>* a, lapack_int lda, lapack_int* ipiv, <datatype>* b, lapack_int ldb );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description

The routine solves for X the complex system of linear equations A*X = B, where A is an n-by-n Hermitian matrix, the columns of matrix B are individual right-hand sides, and the columns of X are the corresponding solutions.

The diagonal pivoting method is used to factor A as A = U*D*U**H or A = L*D*L**H, where U (or L) is a product of permutation and unit upper (lower) triangular matrices, and D is Hermitian and block diagonal with 1-by-1 and 2-by-2 diagonal blocks.

The factored form of A is then used to solve the system of equations A*X = B.

Input Parameters

The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

uplo CHARACTER*1. Must be 'U' or 'L'. Indicates whether the upper or lower triangular part of A is stored and how A is factored:
If uplo = 'U', the array a stores the upper triangular part of the matrix A, and A is factored as U*D*U**H.
If uplo = 'L', the array a stores the lower triangular part of the matrix A, and A is factored as L*D*L**H.

n INTEGER. The order of matrix A; n ≥ 0.

nrhs INTEGER. The number of right-hand sides, the number of columns in B; nrhs ≥ 0.

a, b, work COMPLEX for chesv, DOUBLE COMPLEX for zhesv. Arrays: a(lda,*), b(ldb,*), work(*).
The array a contains the upper or the lower triangular part of the Hermitian matrix A (see uplo).
The second dimension of a must be at least max(1, n).
The array b contains the matrix B whose columns are the right-hand sides for the systems of equations. The second dimension of b must be at least max(1,nrhs).
work is a workspace array, dimension at least max(1,lwork).

lda INTEGER. The leading dimension of a; lda ≥ max(1, n).

ldb INTEGER. The leading dimension of b; ldb ≥ max(1, n).

lwork INTEGER. The size of the work array (lwork ≥ 1). If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla. See Application Notes below for details and for the suggested value of lwork.

Output Parameters

a If info = 0, a is overwritten by the block-diagonal matrix D and the multipliers used to obtain the factor U (or L) from the factorization of A as computed by ?hetrf.

b If info = 0, b is overwritten by the solution matrix X.

ipiv INTEGER. Array, DIMENSION at least max(1, n). Contains details of the interchanges and the block structure of D, as determined by ?hetrf.
If ipiv(i) = k > 0, then d(i,i) is a 1-by-1 diagonal block, and the i-th row and column of A was interchanged with the k-th row and column.
If uplo = 'U' and ipiv(i) = ipiv(i-1) = -m < 0, then D has a 2-by-2 block in rows/columns i and i-1, and the (i-1)-th row and column of A was interchanged with the m-th row and column.
If uplo = 'L' and ipiv(i) = ipiv(i+1) = -m < 0, then D has a 2-by-2 block in rows/columns i and i+1, and the (i+1)-th row and column of A was interchanged with the m-th row and column.

work(1) If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs.

info INTEGER.
If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
If info = i, d(i,i) is 0.
The factorization has been completed, but D is exactly singular, so the solution could not be computed.

Fortran 95 Interface Notes

Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.

Specific details for the routine hesv interface are as follows:
a Holds the matrix A of size (n,n).
b Holds the matrix B of size (n,nrhs).
ipiv Holds the vector of length n.
uplo Must be 'U' or 'L'. The default value is 'U'.

Application Notes

For better performance, try using lwork = n*blocksize, where blocksize is a machine-dependent value (typically, 16 to 64) required for optimum performance of the blocked algorithm.

If you are in doubt how much workspace to supply, use a generous value of lwork for the first run or set lwork = -1.

If you choose the first option and set any admissible lwork size, which is no less than the minimal value described, the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the corresponding array work on exit. Use this value (work(1)) for subsequent runs.

If you set lwork = -1, the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work). This operation is called a workspace query.

Note that if you set lwork to less than the minimal required value and not -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.

?hesvx
Uses the diagonal pivoting factorization to compute the solution to the complex system of linear equations with a Hermitian matrix A, and provides error bounds on the solution.
Syntax

Fortran 77:
call chesvx( fact, uplo, n, nrhs, a, lda, af, ldaf, ipiv, b, ldb, x, ldx, rcond, ferr, berr, work, lwork, rwork, info )
call zhesvx( fact, uplo, n, nrhs, a, lda, af, ldaf, ipiv, b, ldb, x, ldx, rcond, ferr, berr, work, lwork, rwork, info )

Fortran 95:
call hesvx( a, b, x [,uplo] [,af] [,ipiv] [,fact] [,ferr] [,berr] [,rcond] [,info] )

C:
lapack_int LAPACKE_chesvx( int matrix_order, char fact, char uplo, lapack_int n, lapack_int nrhs, const lapack_complex_float* a, lapack_int lda, lapack_complex_float* af, lapack_int ldaf, lapack_int* ipiv, const lapack_complex_float* b, lapack_int ldb, lapack_complex_float* x, lapack_int ldx, float* rcond, float* ferr, float* berr );
lapack_int LAPACKE_zhesvx( int matrix_order, char fact, char uplo, lapack_int n, lapack_int nrhs, const lapack_complex_double* a, lapack_int lda, lapack_complex_double* af, lapack_int ldaf, lapack_int* ipiv, const lapack_complex_double* b, lapack_int ldb, lapack_complex_double* x, lapack_int ldx, double* rcond, double* ferr, double* berr );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description

The routine uses the diagonal pivoting factorization to compute the solution to a complex system of linear equations A*X = B, where A is an n-by-n Hermitian matrix, the columns of matrix B are individual right-hand sides, and the columns of X are the corresponding solutions.

Error bounds on the solution and a condition estimate are also provided.

The routine ?hesvx performs the following steps:
1. If fact = 'N', the diagonal pivoting method is used to factor the matrix A. The form of the factorization is A = U*D*U**H or A = L*D*L**H, where U (or L) is a product of permutation and unit upper (lower) triangular matrices, and D is Hermitian and block diagonal with 1-by-1 and 2-by-2 diagonal blocks.
2. If some d(i,i) = 0, so that D is exactly singular, then the routine returns with info = i.
Otherwise, the factored form of A is used to estimate the condition number of the matrix A. If the reciprocal of the condition number is less than machine precision, info = n+1 is returned as a warning, but the routine still goes on to solve for X and compute error bounds as described below.
3. The system of equations is solved for X using the factored form of A.
4. Iterative refinement is applied to improve the computed solution matrix and calculate error bounds and backward error estimates for it.

Input Parameters

The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

fact CHARACTER*1. Must be 'F' or 'N'. Specifies whether or not the factored form of the matrix A has been supplied on entry.
If fact = 'F': on entry, af and ipiv contain the factored form of A. Arrays a, af, and ipiv are not modified.
If fact = 'N', the matrix A is copied to af and factored.

uplo CHARACTER*1. Must be 'U' or 'L'. Indicates whether the upper or lower triangular part of A is stored and how A is factored:
If uplo = 'U', the array a stores the upper triangular part of the Hermitian matrix A, and A is factored as U*D*U**H.
If uplo = 'L', the array a stores the lower triangular part of the Hermitian matrix A, and A is factored as L*D*L**H.

n INTEGER. The order of matrix A; n ≥ 0.

nrhs INTEGER. The number of right-hand sides, the number of columns in B; nrhs ≥ 0.

a, af, b, work COMPLEX for chesvx, DOUBLE COMPLEX for zhesvx. Arrays: a(lda,*), af(ldaf,*), b(ldb,*), work(*).
The array a contains the upper or the lower triangular part of the Hermitian matrix A (see uplo). The second dimension of a must be at least max(1,n).
The array af is an input argument if fact = 'F'.
It contains the block diagonal matrix D and the multipliers used to obtain the factor U or L from the factorization A = U*D*U**H or A = L*D*L**H as computed by ?hetrf. The second dimension of af must be at least max(1,n).
The array b contains the matrix B whose columns are the right-hand sides for the systems of equations. The second dimension of b must be at least max(1, nrhs).
work(*) is a workspace array of dimension at least max(1, lwork).

lda INTEGER. The leading dimension of a; lda ≥ max(1, n).

ldaf INTEGER. The leading dimension of af; ldaf ≥ max(1, n).

ldb INTEGER. The leading dimension of b; ldb ≥ max(1, n).

ipiv INTEGER. Array, DIMENSION at least max(1, n). The array ipiv is an input argument if fact = 'F'. It contains details of the interchanges and the block structure of D, as determined by ?hetrf.
If ipiv(i) = k > 0, then d(i,i) is a 1-by-1 diagonal block, and the i-th row and column of A was interchanged with the k-th row and column.
If uplo = 'U' and ipiv(i) = ipiv(i-1) = -m < 0, then D has a 2-by-2 block in rows/columns i and i-1, and the (i-1)-th row and column of A was interchanged with the m-th row and column.
If uplo = 'L' and ipiv(i) = ipiv(i+1) = -m < 0, then D has a 2-by-2 block in rows/columns i and i+1, and the (i+1)-th row and column of A was interchanged with the m-th row and column.

ldx INTEGER. The leading dimension of the output array x; ldx ≥ max(1, n).

lwork INTEGER. The size of the work array. If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla. See Application Notes below for details and for the suggested value of lwork.

rwork REAL for chesvx, DOUBLE PRECISION for zhesvx. Workspace array, DIMENSION at least max(1, n).

Output Parameters

x COMPLEX for chesvx, DOUBLE COMPLEX for zhesvx. Array, DIMENSION (ldx,*).
If info = 0 or info = n+1, the array x contains the solution matrix X to the system of equations. The second dimension of x must be at least max(1,nrhs).

af, ipiv These arrays are output arguments if fact = 'N'. See the description of af, ipiv in the Input Parameters section.

rcond REAL for chesvx, DOUBLE PRECISION for zhesvx. An estimate of the reciprocal condition number of the matrix A. If rcond is less than the machine precision (in particular, if rcond = 0), the matrix is singular to working precision. This condition is indicated by a return code of info > 0.

ferr REAL for chesvx, DOUBLE PRECISION for zhesvx. Array, DIMENSION at least max(1, nrhs). Contains the estimated forward error bound for each solution vector x(j) (the j-th column of the solution matrix X). If xtrue is the true solution corresponding to x(j), ferr(j) is an estimated upper bound for the magnitude of the largest element in (x(j) - xtrue) divided by the magnitude of the largest element in x(j). The estimate is as reliable as the estimate for rcond, and is almost always a slight overestimate of the true error.

berr REAL for chesvx, DOUBLE PRECISION for zhesvx. Array, DIMENSION at least max(1, nrhs). Contains the componentwise relative backward error for each solution vector x(j), that is, the smallest relative change in any element of A or B that makes x(j) an exact solution.

work(1) If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs.

info INTEGER.
If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
If info = i, and i ≤ n, then d(i,i) is exactly zero. The factorization has been completed, but the block diagonal matrix D is exactly singular, so the solution and error bounds could not be computed; rcond = 0 is returned.
If info = i, and i = n + 1, then D is nonsingular, but rcond is less than machine precision, meaning that the matrix is singular to working precision. Nevertheless, the solution and error bounds are computed because there are a number of situations where the computed solution can be more accurate than the value of rcond would suggest.

Fortran 95 Interface Notes

Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.

Specific details for the routine hesvx interface are as follows:
a Holds the matrix A of size (n,n).
b Holds the matrix B of size (n,nrhs).
x Holds the matrix X of size (n,nrhs).
af Holds the matrix AF of size (n,n).
ipiv Holds the vector of length n.
ferr Holds the vector of length (nrhs).
berr Holds the vector of length (nrhs).
uplo Must be 'U' or 'L'. The default value is 'U'.
fact Must be 'N' or 'F'. The default value is 'N'. If fact = 'F', then both arguments af and ipiv must be present; otherwise, an error is returned.

Application Notes

The value of lwork must be at least 2*n. For better performance, try using lwork = n*blocksize, where blocksize is the optimal block size for ?hetrf.

If you are in doubt how much workspace to supply, use a generous value of lwork for the first run or set lwork = -1.

If you choose the first option and set any admissible lwork size, which is no less than the minimal value described, the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the corresponding array work on exit. Use this value (work(1)) for subsequent runs.

If you set lwork = -1, the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work). This operation is called a workspace query.
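The workspace query protocol described above can be sketched in C. Because a self-contained example cannot call the library, mock_hesvx_workspace below is a hypothetical stand-in that mimics only the documented lwork behavior (minimum of 2*n, query with lwork = -1, recommended size returned in the first element of work); it is not an MKL routine. The block size of 32, the use of a real-valued work array, and the error code -18 (derived from the info = -i convention, lwork being the 18th argument of the Fortran 77 call) are assumptions made for illustration.

```c
#include <assert.h>
#include <stdlib.h>

#define ASSUMED_BLOCKSIZE 32   /* assumption; the real value is machine-dependent */

/* Hypothetical stand-in for the lwork handling of ?hesvx.
   Returns an info-like code; work is real here only to keep the sketch short. */
int mock_hesvx_workspace(int n, double *work, int lwork)
{
    int minimal = (2 * n > 1) ? 2 * n : 1;
    int recommended = n * ASSUMED_BLOCKSIZE;

    assert(n >= 0 && work != NULL);
    if (recommended < minimal)
        recommended = minimal;

    if (lwork == -1) {               /* workspace query: report the size and return */
        work[0] = (double)recommended;
        return 0;
    }
    if (lwork < minimal)             /* too small: error exit, nothing is reported */
        return -18;

    /* ... the factorization and solve would happen here ... */
    work[0] = (double)recommended;   /* recommended size is also returned on exit */
    return 0;
}

/* Query first, then allocate exactly the recommended amount and call again. */
int run_with_workspace_query(int n, int *lwork_used)
{
    double query = 0.0;
    int info = mock_hesvx_workspace(n, &query, -1);
    if (info != 0)
        return info;

    int lwork = (int)query;
    double *work = malloc((size_t)lwork * sizeof *work);
    if (work == NULL)
        return -1000;                /* hypothetical allocation-failure code */

    info = mock_hesvx_workspace(n, work, lwork);
    *lwork_used = lwork;
    free(work);
    return info;
}
```

With the real routine, the same two-call pattern applies: the first call passes lwork = -1 and a work array of at least one element, and the second call passes a work array of the size read back from work(1).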
Note that if you set lwork to less than the minimal required value and not -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.

?hesvxx
Uses extra precise iterative refinement to compute the solution to the system of linear equations with a Hermitian indefinite matrix A applying the diagonal pivoting factorization.

Syntax

Fortran 77:
call chesvxx( fact, uplo, n, nrhs, a, lda, af, ldaf, ipiv, equed, s, b, ldb, x, ldx, rcond, rpvgrw, berr, n_err_bnds, err_bnds_norm, err_bnds_comp, nparams, params, work, rwork, info )
call zhesvxx( fact, uplo, n, nrhs, a, lda, af, ldaf, ipiv, equed, s, b, ldb, x, ldx, rcond, rpvgrw, berr, n_err_bnds, err_bnds_norm, err_bnds_comp, nparams, params, work, rwork, info )

C:
lapack_int LAPACKE_chesvxx( int matrix_order, char fact, char uplo, lapack_int n, lapack_int nrhs, lapack_complex_float* a, lapack_int lda, lapack_complex_float* af, lapack_int ldaf, lapack_int* ipiv, char* equed, float* s, lapack_complex_float* b, lapack_int ldb, lapack_complex_float* x, lapack_int ldx, float* rcond, float* rpvgrw, float* berr, lapack_int n_err_bnds, float* err_bnds_norm, float* err_bnds_comp, lapack_int nparams, const float* params );
lapack_int LAPACKE_zhesvxx( int matrix_order, char fact, char uplo, lapack_int n, lapack_int nrhs, lapack_complex_double* a, lapack_int lda, lapack_complex_double* af, lapack_int ldaf, lapack_int* ipiv, char* equed, double* s, lapack_complex_double* b, lapack_int ldb, lapack_complex_double* x, lapack_int ldx, double* rcond, double* rpvgrw, double* berr, lapack_int n_err_bnds, double* err_bnds_norm, double* err_bnds_comp, lapack_int nparams, const double* params );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description

The routine uses the diagonal pivoting factorization to compute the solution to a complex/double complex system of linear equations
A*X = B, where A is an n-by-n Hermitian matrix, the columns of matrix B are individual right-hand sides, and the columns of X are the corresponding solutions.

Both normwise and maximum componentwise error bounds are also provided on request. The routine returns a solution with a small guaranteed error (O(eps), where eps is the working machine precision) unless the matrix is very ill-conditioned, in which case a warning is returned. Relevant condition numbers are also calculated and returned.

The routine accepts user-provided factorizations and equilibration factors; see definitions of the fact and equed options. Solving with refinement and using a factorization from a previous call of the routine also produces a solution with O(eps) errors or warnings, but that may not be true for general user-provided factorizations and equilibration factors if they differ from what the routine would itself produce.

The routine ?hesvxx performs the following steps:

1. If fact = 'E', scaling factors are computed to equilibrate the system:
diag(s)*A*diag(s)*inv(diag(s))*X = diag(s)*B
Whether or not the system will be equilibrated depends on the scaling of the matrix A, but if equilibration is used, A is overwritten by diag(s)*A*diag(s) and B by diag(s)*B.
2. If fact = 'N' or 'E', the diagonal pivoting method is used to factor the matrix A (after equilibration if fact = 'E') as A = U*D*U**H, if uplo = 'U', or A = L*D*L**H, if uplo = 'L', where U or L is a product of permutation and unit upper (lower) triangular matrices, and D is Hermitian and block diagonal with 1-by-1 and 2-by-2 diagonal blocks.
3. If some D(i,i) = 0, so that D is exactly singular, the routine returns with info = i. Otherwise, the factored form of A is used to estimate the condition number of the matrix A (see the rcond parameter). If the reciprocal of the condition number is less than machine precision, the routine still goes on to solve for X and compute error bounds.
4.
The system of equations is solved for X using the factored form of A.
5. By default, unless params(la_linrx_itref_i) is set to zero, the routine applies iterative refinement to get a small error and error bounds. Refinement calculates the residual to at least twice the working precision.
6. If equilibration was used, the matrix X is premultiplied by diag(s) so that it solves the original system before equilibration.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

fact
CHARACTER*1. Must be 'F', 'N', or 'E'.
Specifies whether or not the factored form of the matrix A is supplied on entry, and if not, whether the matrix A should be equilibrated before it is factored.
If fact = 'F', on entry, af and ipiv contain the factored form of A. If equed is not 'N', the matrix A has been equilibrated with scaling factors given by s. Parameters a, af, and ipiv are not modified.
If fact = 'N', the matrix A will be copied to af and factored.
If fact = 'E', the matrix A will be equilibrated, if necessary, copied to af and factored.

uplo
CHARACTER*1. Must be 'U' or 'L'.
Indicates whether the upper or lower triangular part of A is stored:
If uplo = 'U', the upper triangle of A is stored.
If uplo = 'L', the lower triangle of A is stored.

n
INTEGER. The number of linear equations; the order of the matrix A; n ≥ 0.

nrhs
INTEGER. The number of right-hand sides; the number of columns of the matrices B and X; nrhs ≥ 0.

a, af, b, work
COMPLEX for chesvxx
DOUBLE COMPLEX for zhesvxx.
Arrays: a(lda,*), af(ldaf,*), b(ldb,*), work(*).
The array a contains the Hermitian matrix A as specified by uplo.
If uplo = 'U', the leading n-by-n upper triangular part of a contains the upper triangular part of the matrix A and the strictly lower triangular part of a is not referenced. If uplo = 'L', the leading n-by-n lower triangular part of a contains the lower triangular part of the matrix A and the strictly upper triangular part of a is not referenced. The second dimension of a must be at least max(1,n).
The array af is an input argument if fact = 'F'. It contains the block diagonal matrix D and the multipliers used to obtain the factor U or L from the factorization A = U*D*U**H or A = L*D*L**H as computed by ?hetrf. The second dimension of af must be at least max(1,n).
The array b contains the matrix B whose columns are the right-hand sides for the systems of equations. The second dimension of b must be at least max(1,nrhs).
work(*) is a workspace array. The dimension of work must be at least max(1,2*n).

lda
INTEGER. The leading dimension of the array a; lda ≥ max(1,n).

ldaf
INTEGER. The leading dimension of the array af; ldaf ≥ max(1,n).

ipiv
INTEGER. Array, DIMENSION at least max(1, n).
The array ipiv is an input argument if fact = 'F'. It contains details of the interchanges and the block structure of D as determined by ?hetrf.
If ipiv(k) > 0, rows and columns k and ipiv(k) were interchanged and D(k,k) is a 1-by-1 diagonal block.
If uplo = 'U' and ipiv(k) = ipiv(k-1) < 0, rows and columns k-1 and -ipiv(k) were interchanged and D(k-1:k, k-1:k) is a 2-by-2 diagonal block.
If uplo = 'L' and ipiv(k) = ipiv(k+1) < 0, rows and columns k+1 and -ipiv(k) were interchanged and D(k:k+1, k:k+1) is a 2-by-2 diagonal block.

equed
CHARACTER*1. Must be 'N' or 'Y'.
equed is an input argument if fact = 'F'. It specifies the form of equilibration that was done:
If equed = 'N', no equilibration was done (always true if fact = 'N').
If equed = 'Y', both row and column equilibration was done, that is, A has been replaced by diag(s)*A*diag(s).
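The pivot encoding described for ipiv above can be decoded with a simple forward scan. The following is a minimal sketch in C, not an MKL routine: the function name count_pivot_blocks and its return convention are assumptions made for illustration. It relies only on the rule stated above, namely that a positive entry marks a 1-by-1 block and two adjacent equal negative entries mark a 2-by-2 block, which holds for both uplo = 'U' and uplo = 'L'.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of MKL or LAPACK): scans a pivot array that
 * follows the encoding described above and counts the diagonal blocks of D.
 * ipiv holds the 1-based Fortran-style pivot values in a 0-based C array of
 * length n. Returns the number of 2-by-2 blocks, or -1 if the array does not
 * follow the encoding; the number of 1-by-1 blocks is stored in *n_1x1. */
int count_pivot_blocks(const int *ipiv, int n, int *n_1x1)
{
    int n_2x2 = 0;
    int k = 0;

    assert(ipiv != NULL && n_1x1 != NULL && n >= 0);
    *n_1x1 = 0;
    while (k < n) {
        if (ipiv[k] > 0) {
            /* Positive entry: D has a 1-by-1 block at this position. */
            ++*n_1x1;
            k += 1;
        } else if (ipiv[k] < 0 && k + 1 < n && ipiv[k + 1] == ipiv[k]) {
            /* Two adjacent equal negative entries: one 2-by-2 block. */
            ++n_2x2;
            k += 2;
        } else {
            /* Zero entry or unpaired negative entry: not a valid encoding. */
            return -1;
        }
    }
    return n_2x2;
}
```

A caller could use such a scan as a sanity check on a user-supplied ipiv before passing fact = 'F'; the routine itself does not require it.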
s
REAL for chesvxx
DOUBLE PRECISION for zhesvxx.
Array, DIMENSION (n). The array s contains the scale factors for A. If equed = 'Y', A is multiplied on the left and right by diag(s).
This array is an input argument if fact = 'F' only; otherwise it is an output argument.
If fact = 'F' and equed = 'Y', each element of s must be positive.
Each element of s should be a power of the radix to ensure a reliable solution and error estimates. Scaling by powers of the radix does not cause rounding errors unless the result underflows or overflows. Rounding errors during scaling lead to refining with a matrix that is not equivalent to the input matrix, producing error estimates that may not be reliable.

ldb
INTEGER. The leading dimension of the array b; ldb ≥ max(1, n).

ldx
INTEGER. The leading dimension of the output array x; ldx ≥ max(1, n).

n_err_bnds
INTEGER. Number of error bounds to return for each right-hand side and each type (normwise or componentwise). See err_bnds_norm and err_bnds_comp descriptions in the Output Arguments section below.

nparams
INTEGER. Specifies the number of parameters set in params. If nparams = 0, the params array is never referenced and default values are used.

params
REAL for single precision flavors
DOUBLE PRECISION for double precision flavors.
Array, DIMENSION nparams. Specifies algorithm parameters. If an entry is less than 0.0, that entry is filled with the default value used for that parameter. Only positions up to nparams are accessed; defaults are used for higher-numbered parameters. If defaults are acceptable, you can pass nparams = 0, which prevents the routine from accessing the params argument.
params(la_linrx_itref_i = 1) : Whether to perform iterative refinement or not. Default: 1.0 (for single precision flavors), 1.0D+0 (for double precision flavors).
=0.0 No refinement is performed and no error bounds are computed.
=1.0 Use the extra-precise refinement algorithm.
(Other values are reserved for future use.)
params(la_linrx_ithresh_i = 2) : Maximum number of residual computations allowed for refinement.
Default: 10.
Aggressive: set to 100 to permit convergence using approximate factorizations or factorizations other than LU. If the factorization uses a technique other than Gaussian elimination, the guarantees in err_bnds_norm and err_bnds_comp may no longer be trustworthy.
params(la_linrx_cwise_i = 3) : Flag determining if the code will attempt to find a solution with a small componentwise relative error in the double-precision algorithm. Positive is true, 0.0 is false. Default: 1.0 (attempt componentwise convergence).

rwork
REAL for chesvxx
DOUBLE PRECISION for zhesvxx.
Workspace array, DIMENSION at least max(1, 3*n); used in complex flavors only.

Output Parameters

x
COMPLEX for chesvxx
DOUBLE COMPLEX for zhesvxx.
Array, DIMENSION (ldx,nrhs).
If info = 0, the array x contains the n-by-nrhs solution matrix X to the original system of equations. Note that A and B are modified on exit if equed ≠ 'N', and the solution to the equilibrated system is inv(diag(s))*X.

a
If fact = 'E' and equed = 'Y', overwritten by diag(s)*A*diag(s).

af
If fact = 'N', af is an output argument and on exit returns the block diagonal matrix D and the multipliers used to obtain the factor U or L from the factorization A = U*D*U**H or A = L*D*L**H.

b
If equed = 'N', B is not modified.
If equed = 'Y', B is overwritten by diag(s)*B.

s
This array is an output argument if fact ≠ 'F'. Each element of this array is a power of the radix. See the description of s in Input Arguments section.

rcond
REAL for chesvxx
DOUBLE PRECISION for zhesvxx.
Reciprocal scaled condition number. An estimate of the reciprocal Skeel condition number of the matrix A after equilibration (if done). If rcond is less than the machine precision, in particular, if rcond = 0, the matrix is singular to working precision.
Note that the error may still be small even if this number is very small and the matrix appears ill-conditioned.

rpvgrw
REAL for chesvxx
DOUBLE PRECISION for zhesvxx.
Contains the reciprocal pivot growth factor norm(A)/norm(U). The max absolute element norm is used. If this is much less than 1, the stability of the factorization of the (equilibrated) matrix A could be poor. This also means that the solution X, estimated condition numbers, and error bounds could be unreliable. If factorization fails with 0 < info ≤ n, this parameter contains the reciprocal pivot growth factor for the leading info columns of A.

berr
REAL for chesvxx
DOUBLE PRECISION for zhesvxx.
Array, DIMENSION at least max(1, nrhs). Contains the componentwise relative backward error for each solution vector x(j), that is, the smallest relative change in any element of A or B that makes x(j) an exact solution.

err_bnds_norm
REAL for chesvxx
DOUBLE PRECISION for zhesvxx.
Array, DIMENSION (nrhs,n_err_bnds). For each right-hand side, contains information about various error bounds and condition numbers corresponding to the normwise relative error, which is defined as follows:
Normwise relative error in the i-th solution vector:
max_j ( abs(xtrue(j,i) - x(j,i)) ) / max_j abs(x(j,i))
The array is indexed by the type of error information as described below. There are currently up to three pieces of information returned.
The first index in err_bnds_norm(i,:) corresponds to the i-th right-hand side.
The second index in err_bnds_norm(:,err) contains the following three fields:
err=1 "Trust/don't trust" boolean. Trust the answer if the reciprocal condition number is greater than the threshold sqrt(n)*slamch('E') for chesvxx and sqrt(n)*dlamch('E') for zhesvxx.
err=2 "Guaranteed" error bound. The estimated forward error, almost certainly within a factor of 10 of the true error so long as the next entry is greater than the threshold sqrt(n)*slamch('E') for chesvxx and sqrt(n)*dlamch('E') for zhesvxx. This error bound should only be trusted if the previous boolean is true.
err=3 Reciprocal condition number. Estimated normwise reciprocal condition number. Compared with the threshold sqrt(n)*slamch('E') for chesvxx and sqrt(n)*dlamch('E') for zhesvxx to determine if the error estimate is "guaranteed". These reciprocal condition numbers are 1/(norm(inv(z),inf)*norm(z,inf)) for some appropriately scaled matrix Z. Let z=s*a, where s scales each row by a power of the radix so all absolute row sums of z are approximately 1.

err_bnds_comp
REAL for chesvxx
DOUBLE PRECISION for zhesvxx.
Array, DIMENSION (nrhs,n_err_bnds). For each right-hand side, contains information about various error bounds and condition numbers corresponding to the componentwise relative error, which is defined as follows:
Componentwise relative error in the i-th solution vector:
max_j ( abs(xtrue(j,i) - x(j,i)) / abs(x(j,i)) )
The array is indexed by the right-hand side i, on which the componentwise relative error depends, and by the type of error information as described below. There are currently up to three pieces of information returned for each right-hand side. If componentwise accuracy is not requested (params(3) = 0.0), then err_bnds_comp is not accessed. If n_err_bnds < 3, then at most the first (:,n_err_bnds) entries are returned.
The first index in err_bnds_comp(i,:) corresponds to the i-th right-hand side.
The second index in err_bnds_comp(:,err) contains the following three fields:
err=1 "Trust/don't trust" boolean. Trust the answer if the reciprocal condition number is greater than the threshold sqrt(n)*slamch('E') for chesvxx and sqrt(n)*dlamch('E') for zhesvxx.
err=2 "Guaranteed" error bound. The estimated forward error, almost certainly within a factor of 10 of the true error so long as the next entry is greater than the threshold sqrt(n)*slamch('E') for chesvxx and sqrt(n)*dlamch('E') for zhesvxx. This error bound should only be trusted if the previous boolean is true.
err=3 Reciprocal condition number. Estimated componentwise reciprocal condition number.
Compared with the threshold sqrt(n)*slamch('E') for chesvxx and sqrt(n)*dlamch('E') for zhesvxx to determine if the error estimate is "guaranteed". These reciprocal condition numbers are 1/(norm(inv(z),inf)*norm(z,inf)) for some appropriately scaled matrix Z. Let z=s*(a*diag(x)), where x is the solution for the current right-hand side and s scales each row of a*diag(x) by a power of the radix so all absolute row sums of z are approximately 1.

ipiv
If fact = 'N', ipiv is an output argument and on exit contains details of the interchanges and the block structure of D, as determined by chetrf for chesvxx and zhetrf for zhesvxx.

equed
If fact ≠ 'F', then equed is an output argument. It specifies the form of equilibration that was done (see the description of equed in Input Arguments section).

params
If an entry is less than 0.0, that entry is filled with the default value used for that parameter, otherwise the entry is not modified.

info
INTEGER.
If info = 0, the execution is successful. The solution to every right-hand side is guaranteed.
If info = -i, the i-th parameter had an illegal value.
If 0 < info ≤ n: D(info,info) is exactly zero. The factorization has been completed, but D is exactly singular, so the solution and error bounds could not be computed; rcond = 0 is returned.
If info = n+j: The solution corresponding to the j-th right-hand side is not guaranteed. The solutions corresponding to other right-hand sides k with k > j may not be guaranteed as well, but only the first such right-hand side is reported. If a small componentwise error is not requested (params(3) = 0.0), then the j-th right-hand side is the first with a normwise error bound that is not guaranteed (the smallest j such that err_bnds_norm(j,1) = 0.0). By default (params(3) = 1.0), the j-th right-hand side is the first with either a normwise or a componentwise error bound that is not guaranteed (the smallest j such that either err_bnds_norm(j,1) = 0.0 or err_bnds_comp(j,1) = 0.0). See the definition of err_bnds_norm(:,1) and err_bnds_comp(:,1).
To get information about all of the right-hand sides, check err_bnds_norm or err_bnds_comp.

?spsv
Computes the solution to the system of linear equations with a real or complex symmetric matrix A stored in packed format, and multiple right-hand sides.

Syntax

Fortran 77:
call sspsv( uplo, n, nrhs, ap, ipiv, b, ldb, info )
call dspsv( uplo, n, nrhs, ap, ipiv, b, ldb, info )
call cspsv( uplo, n, nrhs, ap, ipiv, b, ldb, info )
call zspsv( uplo, n, nrhs, ap, ipiv, b, ldb, info )

Fortran 95:
call spsv( ap, b [,uplo] [,ipiv] [,info] )

C:
lapack_int LAPACKE_<?>spsv( int matrix_order, char uplo, lapack_int n, lapack_int nrhs, <datatype>* ap, lapack_int* ipiv, <datatype>* b, lapack_int ldb );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine solves for X the real or complex system of linear equations A*X = B, where A is an n-by-n symmetric matrix stored in packed format, the columns of matrix B are individual right-hand sides, and the columns of X are the corresponding solutions.
The diagonal pivoting method is used to factor A as A = U*D*U**T or A = L*D*L**T, where U (or L) is a product of permutation and unit upper (lower) triangular matrices, and D is symmetric and block diagonal with 1-by-1 and 2-by-2 diagonal blocks.
The factored form of A is then used to solve the system of equations A*X = B.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

uplo
CHARACTER*1. Must be 'U' or 'L'.
Indicates whether the upper or lower triangular part of A is stored:
If uplo = 'U', the upper triangle of A is stored.
If uplo = 'L', the lower triangle of A is stored.

n
INTEGER. The order of matrix A; n ≥ 0.

nrhs
INTEGER.
The number of right-hand sides, the number of columns in B; nrhs ≥ 0.

ap, b
REAL for sspsv
DOUBLE PRECISION for dspsv
COMPLEX for cspsv
DOUBLE COMPLEX for zspsv.
Arrays: ap(*), b(ldb,*).
The dimension of ap must be at least max(1,n(n+1)/2). The array ap contains either the upper or the lower triangular part of the matrix A, as specified by uplo, in packed storage (see Matrix Storage Schemes).
The array b contains the matrix B whose columns are the right-hand sides for the systems of equations. The second dimension of b must be at least max(1,nrhs).

ldb
INTEGER. The leading dimension of b; ldb ≥ max(1, n).

Output Parameters

ap
The block-diagonal matrix D and the multipliers used to obtain the factor U (or L) from the factorization of A as computed by ?sptrf, stored as a packed triangular matrix in the same storage format as A.

b
If info = 0, b is overwritten by the solution matrix X.

ipiv
INTEGER.
Array, DIMENSION at least max(1, n). Contains details of the interchanges and the block structure of D, as determined by ?sptrf.
If ipiv(i) = k > 0, then d(i,i) is a 1-by-1 block, and the i-th row and column of A was interchanged with the k-th row and column.
If uplo = 'U' and ipiv(i) = ipiv(i-1) = -m < 0, then D has a 2-by-2 block in rows/columns i and i-1, and the (i-1)-th row and column of A was interchanged with the m-th row and column.
If uplo = 'L' and ipiv(i) = ipiv(i+1) = -m < 0, then D has a 2-by-2 block in rows/columns i and i+1, and the (i+1)-th row and column of A was interchanged with the m-th row and column.

info
INTEGER.
If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
If info = i, d(i,i) is 0. The factorization has been completed, but D is exactly singular, so the solution could not be computed.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts.
For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine spsv interface are as follows:
ap Holds the array A of size (n*(n+1)/2).
b Holds the matrix B of size (n,nrhs).
ipiv Holds the vector with the number of elements n.
uplo Must be 'U' or 'L'. The default value is 'U'.

?spsvx
Uses the diagonal pivoting factorization to compute the solution to the system of linear equations with a real or complex symmetric matrix A stored in packed format, and provides error bounds on the solution.

Syntax

Fortran 77:
call sspsvx( fact, uplo, n, nrhs, ap, afp, ipiv, b, ldb, x, ldx, rcond, ferr, berr, work, iwork, info )
call dspsvx( fact, uplo, n, nrhs, ap, afp, ipiv, b, ldb, x, ldx, rcond, ferr, berr, work, iwork, info )
call cspsvx( fact, uplo, n, nrhs, ap, afp, ipiv, b, ldb, x, ldx, rcond, ferr, berr, work, rwork, info )
call zspsvx( fact, uplo, n, nrhs, ap, afp, ipiv, b, ldb, x, ldx, rcond, ferr, berr, work, rwork, info )

Fortran 95:
call spsvx( ap, b, x [,uplo] [,afp] [,ipiv] [,fact] [,ferr] [,berr] [,rcond] [,info] )

C:
lapack_int LAPACKE_sspsvx( int matrix_order, char fact, char uplo, lapack_int n, lapack_int nrhs, const float* ap, float* afp, lapack_int* ipiv, const float* b, lapack_int ldb, float* x, lapack_int ldx, float* rcond, float* ferr, float* berr );
lapack_int LAPACKE_dspsvx( int matrix_order, char fact, char uplo, lapack_int n, lapack_int nrhs, const double* ap, double* afp, lapack_int* ipiv, const double* b, lapack_int ldb, double* x, lapack_int ldx, double* rcond, double* ferr, double* berr );
lapack_int LAPACKE_cspsvx( int matrix_order, char fact, char uplo, lapack_int n, lapack_int nrhs, const lapack_complex_float* ap, lapack_complex_float* afp, lapack_int* ipiv, const lapack_complex_float* b, lapack_int ldb, lapack_complex_float* x, lapack_int ldx, float* rcond, float* ferr, float* berr );
lapack_int LAPACKE_zspsvx( int
matrix_order, char fact, char uplo, lapack_int n, lapack_int nrhs, const lapack_complex_double* ap, lapack_complex_double* afp, lapack_int* ipiv, const lapack_complex_double* b, lapack_int ldb, lapack_complex_double* x, lapack_int ldx, double* rcond, double* ferr, double* berr );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine uses the diagonal pivoting factorization to compute the solution to a real or complex system of linear equations A*X = B, where A is an n-by-n symmetric matrix stored in packed format, the columns of matrix B are individual right-hand sides, and the columns of X are the corresponding solutions.
Error bounds on the solution and a condition estimate are also provided.

The routine ?spsvx performs the following steps:

1. If fact = 'N', the diagonal pivoting method is used to factor the matrix A. The form of the factorization is A = U*D*U**T or A = L*D*L**T, where U (or L) is a product of permutation and unit upper (lower) triangular matrices, and D is symmetric and block diagonal with 1-by-1 and 2-by-2 diagonal blocks.
2. If some d(i,i) = 0, so that D is exactly singular, then the routine returns with info = i. Otherwise, the factored form of A is used to estimate the condition number of the matrix A. If the reciprocal of the condition number is less than machine precision, info = n+1 is returned as a warning, but the routine still goes on to solve for X and compute error bounds as described below.
3. The system of equations is solved for X using the factored form of A.
4. Iterative refinement is applied to improve the computed solution matrix and calculate error bounds and backward error estimates for it.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
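As an illustration of how a caller might act on the info value that the ?spsvx steps in the Description produce, the following minimal C sketch sorts info into the four cases named there. The enum, its member names, and the function classify_spsvx_info are hypothetical and are not defined by Intel MKL; only the meaning of the info ranges is taken from this section.

```c
#include <assert.h>

/* Hypothetical status codes for this sketch; they are not defined by MKL. */
typedef enum {
    SOLVE_OK,              /* info = 0: successful execution                   */
    SOLVE_BAD_ARGUMENT,    /* info = -i: the i-th parameter was illegal        */
    SOLVE_SINGULAR,        /* 0 < info <= n: d(info,info) is exactly zero,     */
                           /* no solution or error bounds were computed        */
    SOLVE_ILL_CONDITIONED, /* info = n+1: solution computed, but rcond is      */
                           /* below machine precision (warning)                */
    SOLVE_UNKNOWN          /* any other value: not described in this section   */
} solve_status;

/* Classifies the info value returned for a system of order n. */
solve_status classify_spsvx_info(int info, int n)
{
    assert(n >= 0);
    if (info == 0)
        return SOLVE_OK;
    if (info < 0)
        return SOLVE_BAD_ARGUMENT;
    if (info <= n)
        return SOLVE_SINGULAR;
    if (info == n + 1)
        return SOLVE_ILL_CONDITIONED;
    return SOLVE_UNKNOWN;
}
```

Under this classification, x is usable for SOLVE_OK and SOLVE_ILL_CONDITIONED only; in the latter case it is prudent to inspect rcond, ferr, and berr before relying on the result.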
fact
CHARACTER*1. Must be 'F' or 'N'.
Specifies whether or not the factored form of the matrix A has been supplied on entry.
If fact = 'F': on entry, afp and ipiv contain the factored form of A. Arrays ap, afp, and ipiv are not modified.
If fact = 'N', the matrix A is copied to afp and factored.

uplo
CHARACTER*1. Must be 'U' or 'L'.
Indicates whether the upper or lower triangular part of A is stored and how A is factored:
If uplo = 'U', the array ap stores the upper triangular part of the symmetric matrix A, and A is factored as U*D*U**T.
If uplo = 'L', the array ap stores the lower triangular part of the symmetric matrix A, and A is factored as L*D*L**T.

n
INTEGER. The order of matrix A; n ≥ 0.

nrhs
INTEGER. The number of right-hand sides, the number of columns in B; nrhs ≥ 0.

ap, afp, b, work
REAL for sspsvx
DOUBLE PRECISION for dspsvx
COMPLEX for cspsvx
DOUBLE COMPLEX for zspsvx.
Arrays: ap(*), afp(*), b(ldb,*), work(*).
The array ap contains the upper or lower triangle of the symmetric matrix A in packed storage (see Matrix Storage Schemes).
The array afp is an input argument if fact = 'F'. It contains the block diagonal matrix D and the multipliers used to obtain the factor U or L from the factorization A = U*D*U**T or A = L*D*L**T as computed by ?sptrf, in the same storage format as A.
The array b contains the matrix B whose columns are the right-hand sides for the systems of equations.
work(*) is a workspace array.
The dimension of arrays ap and afp must be at least max(1, n(n+1)/2); the second dimension of b must be at least max(1,nrhs); the dimension of work must be at least max(1,3*n) for real flavors and max(1,2*n) for complex flavors.

ldb
INTEGER. The leading dimension of b; ldb ≥ max(1, n).

ipiv
INTEGER.
Array, DIMENSION at least max(1, n). The array ipiv is an input argument if fact = 'F'. It contains details of the interchanges and the block structure of D, as determined by ?sptrf.
If ipiv(i) = k > 0, then d(i,i) is a 1-by-1 diagonal block, and the i-th row and column of A was interchanged with the k-th row and column.
If uplo = 'U' and ipiv(i) = ipiv(i-1) = -m < 0, then D has a 2-by-2 block in rows/columns i and i-1, and the (i-1)-th row and column of A was interchanged with the m-th row and column.
If uplo = 'L' and ipiv(i) = ipiv(i+1) = -m < 0, then D has a 2-by-2 block in rows/columns i and i+1, and the (i+1)-th row and column of A was interchanged with the m-th row and column.

ldx
INTEGER. The leading dimension of the output array x; ldx ≥ max(1, n).

iwork
INTEGER.
Workspace array, DIMENSION at least max(1, n); used in real flavors only.

rwork
REAL for cspsvx
DOUBLE PRECISION for zspsvx.
Workspace array, DIMENSION at least max(1, n); used in complex flavors only.

Output Parameters

x
REAL for sspsvx
DOUBLE PRECISION for dspsvx
COMPLEX for cspsvx
DOUBLE COMPLEX for zspsvx.
Array, DIMENSION (ldx,*).
If info = 0 or info = n+1, the array x contains the solution matrix X to the system of equations. The second dimension of x must be at least max(1,nrhs).

afp, ipiv
These arrays are output arguments if fact = 'N'. See the description of afp, ipiv in Input Arguments section.

rcond
REAL for single precision flavors.
DOUBLE PRECISION for double precision flavors.
An estimate of the reciprocal condition number of the matrix A. If rcond is less than the machine precision (in particular, if rcond = 0), the matrix is singular to working precision. This condition is indicated by a return code of info > 0.

ferr, berr
REAL for single precision flavors
DOUBLE PRECISION for double precision flavors.
Arrays, DIMENSION at least max(1, nrhs). Contain the componentwise forward and relative backward errors, respectively, for each solution vector.

info
INTEGER.
If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
If info = i, and i ≤ n, then d(i,i) is exactly zero.
The factorization has been completed, but the block diagonal matrix D is exactly singular, so the solution and error bounds could not be computed; rcond = 0 is returned.
If info = i, and i = n + 1, then D is nonsingular, but rcond is less than machine precision, meaning that the matrix is singular to working precision. Nevertheless, the solution and error bounds are computed because there are a number of situations where the computed solution can be more accurate than the value of rcond would suggest.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine spsvx interface are as follows:
ap Holds the array A of size (n*(n+1)/2).
b Holds the matrix B of size (n,nrhs).
x Holds the matrix X of size (n,nrhs).
afp Holds the array AF of size (n*(n+1)/2).
ipiv Holds the vector with the number of elements n.
ferr Holds the vector with the number of elements nrhs.
berr Holds the vector with the number of elements nrhs.
uplo Must be 'U' or 'L'. The default value is 'U'.
fact Must be 'N' or 'F'. The default value is 'N'. If fact = 'F', then both arguments afp and ipiv must be present; otherwise, an error is returned.

?hpsv
Computes the solution to the system of linear equations with a Hermitian matrix A stored in packed format, and multiple right-hand sides.
Syntax

Fortran 77:
call chpsv( uplo, n, nrhs, ap, ipiv, b, ldb, info )
call zhpsv( uplo, n, nrhs, ap, ipiv, b, ldb, info )

Fortran 95:
call hpsv( ap, b [,uplo] [,ipiv] [,info] )

C:
lapack_int LAPACKE_<?>hpsv( int matrix_order, char uplo, lapack_int n, lapack_int nrhs, <datatype>* ap, lapack_int* ipiv, <datatype>* b, lapack_int ldb );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine solves for X the system of linear equations A*X = B, where A is an n-by-n Hermitian matrix stored in packed format, the columns of matrix B are individual right-hand sides, and the columns of X are the corresponding solutions.
The diagonal pivoting method is used to factor A as A = U*D*U**H or A = L*D*L**H, where U (or L) is a product of permutation and unit upper (lower) triangular matrices, and D is Hermitian and block diagonal with 1-by-1 and 2-by-2 diagonal blocks.
The factored form of A is then used to solve the system of equations A*X = B.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

uplo
CHARACTER*1. Must be 'U' or 'L'.
Indicates whether the upper or lower triangular part of A is stored:
If uplo = 'U', the upper triangle of A is stored.
If uplo = 'L', the lower triangle of A is stored.

n
INTEGER. The order of matrix A; n ≥ 0.

nrhs
INTEGER. The number of right-hand sides; the number of columns in B; nrhs ≥ 0.

ap, b
COMPLEX for chpsv
DOUBLE COMPLEX for zhpsv.
Arrays: ap(*), b(ldb,*).
The dimension of ap must be at least max(1,n(n+1)/2). The array ap contains either the upper or the lower triangular part of the matrix A, as specified by uplo, in packed storage (see Matrix Storage Schemes).
The array b contains the matrix B whose columns are the right-hand sides for the systems of equations.
The second dimension of b must be at least max(1,nrhs).

ldb
INTEGER. The leading dimension of b; ldb ≥ max(1, n).

Output Parameters

ap
The block-diagonal matrix D and the multipliers used to obtain the factor U (or L) from the factorization of A as computed by ?hptrf, stored as a packed triangular matrix in the same storage format as A.

b
If info = 0, b is overwritten by the solution matrix X.

ipiv
INTEGER.
Array, DIMENSION at least max(1, n). Contains details of the interchanges and the block structure of D, as determined by ?hptrf.
If ipiv(i) = k > 0, then d(i,i) is a 1-by-1 block, and the i-th row and column of A was interchanged with the k-th row and column.
If uplo = 'U' and ipiv(i) = ipiv(i-1) = -m < 0, then D has a 2-by-2 block in rows/columns i and i-1, and the (i-1)-th row and column of A was interchanged with the m-th row and column.
If uplo = 'L' and ipiv(i) = ipiv(i+1) = -m < 0, then D has a 2-by-2 block in rows/columns i and i+1, and the (i+1)-th row and column of A was interchanged with the m-th row and column.

info
INTEGER.
If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
If info = i, d(i,i) is 0. The factorization has been completed, but D is exactly singular, so the solution could not be computed.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine hpsv interface are as follows:
ap Holds the array A of size (n*(n+1)/2).
b Holds the matrix B of size (n,nrhs).
ipiv Holds the vector with the number of elements n.
uplo Must be 'U' or 'L'. The default value is 'U'.
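The packed array ap of size n*(n+1)/2 used by the packed-storage routines above stores one triangle of A column by column. The following minimal C sketch shows the index arithmetic for that layout with zero-based indices. The helper names are hypothetical (they are not MKL functions), and real double elements are assumed only to keep the sketch short; the same index arithmetic applies to the complex element types that ?hpsv takes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helpers (not MKL functions). Zero-based position in ap of
 * element (i, j) when the upper triangle is packed by columns; needs i <= j. */
size_t packed_index_upper(size_t i, size_t j)
{
    assert(i <= j);
    return i + j * (j + 1) / 2;
}

/* Zero-based position in ap of element (i, j) when the lower triangle of an
 * n-by-n matrix is packed by columns; needs j <= i < n. */
size_t packed_index_lower(size_t i, size_t j, size_t n)
{
    assert(j <= i && i < n);
    return i + j * (2 * n - j - 1) / 2;
}

/* Copies the upper triangle of a column-major n-by-n array a with leading
 * dimension lda into ap, which must hold n*(n+1)/2 elements. */
void pack_upper(const double *a, size_t n, size_t lda, double *ap)
{
    assert(a != NULL && ap != NULL && lda >= n);
    for (size_t j = 0; j < n; ++j) {
        for (size_t i = 0; i <= j; ++i) {
            ap[packed_index_upper(i, j)] = a[i + j * lda];
        }
    }
}
```

For example, with n = 3 the upper triangle is laid out as a(0,0), a(0,1), a(1,1), a(0,2), a(1,2), a(2,2), which is the order in which pack_upper fills ap.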
?hpsvx
Uses the diagonal pivoting factorization to compute the solution to the system of linear equations with a Hermitian matrix A stored in packed format, and provides error bounds on the solution.

Syntax
Fortran 77:
call chpsvx( fact, uplo, n, nrhs, ap, afp, ipiv, b, ldb, x, ldx, rcond, ferr, berr, work, rwork, info )
call zhpsvx( fact, uplo, n, nrhs, ap, afp, ipiv, b, ldb, x, ldx, rcond, ferr, berr, work, rwork, info )
Fortran 95:
call hpsvx( ap, b, x [,uplo] [,afp] [,ipiv] [,fact] [,ferr] [,berr] [,rcond] [,info] )
C:
lapack_int LAPACKE_chpsvx( int matrix_order, char fact, char uplo, lapack_int n, lapack_int nrhs, const lapack_complex_float* ap, lapack_complex_float* afp, lapack_int* ipiv, const lapack_complex_float* b, lapack_int ldb, lapack_complex_float* x, lapack_int ldx, float* rcond, float* ferr, float* berr );
lapack_int LAPACKE_zhpsvx( int matrix_order, char fact, char uplo, lapack_int n, lapack_int nrhs, const lapack_complex_double* ap, lapack_complex_double* afp, lapack_int* ipiv, const lapack_complex_double* b, lapack_int ldb, lapack_complex_double* x, lapack_int ldx, double* rcond, double* ferr, double* berr );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine uses the diagonal pivoting factorization to compute the solution to a complex system of linear equations A*X = B, where A is an n-by-n Hermitian matrix stored in packed format, the columns of matrix B are individual right-hand sides, and the columns of X are the corresponding solutions.
Error bounds on the solution and a condition estimate are also provided.
The routine ?hpsvx performs the following steps:
1. If fact = 'N', the diagonal pivoting method is used to factor the matrix A.
The form of the factorization is A = U*D*U^H or A = L*D*L^H, where U (or L) is a product of permutation and unit upper (lower) triangular matrices, and D is Hermitian and block diagonal with 1-by-1 and 2-by-2 diagonal blocks.
2. If some d_ii = 0, so that D is exactly singular, then the routine returns with info = i. Otherwise, the factored form of A is used to estimate the condition number of the matrix A. If the reciprocal of the condition number is less than machine precision, info = n+1 is returned as a warning, but the routine still goes on to solve for X and compute error bounds as described below.
3. The system of equations is solved for X using the factored form of A.
4. Iterative refinement is applied to improve the computed solution matrix and calculate error bounds and backward error estimates for it.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
fact  CHARACTER*1. Must be 'F' or 'N'.
Specifies whether or not the factored form of the matrix A has been supplied on entry.
If fact = 'F': on entry, afp and ipiv contain the factored form of A. Arrays ap, afp, and ipiv are not modified.
If fact = 'N', the matrix A is copied to afp and factored.
uplo  CHARACTER*1. Must be 'U' or 'L'.
Indicates whether the upper or lower triangular part of A is stored and how A is factored:
If uplo = 'U', the array ap stores the upper triangular part of the Hermitian matrix A, and A is factored as U*D*U^H.
If uplo = 'L', the array ap stores the lower triangular part of the Hermitian matrix A, and A is factored as L*D*L^H.
n  INTEGER. The order of matrix A; n ≥ 0.
nrhs  INTEGER. The number of right-hand sides, the number of columns in B; nrhs ≥ 0.
ap, afp, b, work  COMPLEX for chpsvx
DOUBLE COMPLEX for zhpsvx.
Arrays: ap(*), afp(*), b(ldb,*), work(*).
The array ap contains the upper or lower triangle of the Hermitian matrix A in packed storage (see Matrix Storage Schemes).
The array afp is an input argument if fact = 'F'. It contains the block diagonal matrix D and the multipliers used to obtain the factor U or L from the factorization A = U*D*U^H or A = L*D*L^H as computed by ?hptrf, in the same storage format as A.
The array b contains the matrix B whose columns are the right-hand sides for the systems of equations.
work(*) is a workspace array.
The dimension of arrays ap and afp must be at least max(1,n(n+1)/2); the second dimension of b must be at least max(1,nrhs); the dimension of work must be at least max(1,2*n).
ldb  INTEGER. The leading dimension of b; ldb ≥ max(1, n).
ipiv  INTEGER. Array, DIMENSION at least max(1, n). The array ipiv is an input argument if fact = 'F'. It contains details of the interchanges and the block structure of D, as determined by ?hptrf.
If ipiv(i) = k > 0, then d_ii is a 1-by-1 diagonal block, and the i-th row and column of A were interchanged with the k-th row and column.
If uplo = 'U' and ipiv(i) = ipiv(i-1) = -m < 0, then D has a 2-by-2 block in rows/columns i and i-1, and the (i-1)-th row and column of A were interchanged with the m-th row and column.
If uplo = 'L' and ipiv(i) = ipiv(i+1) = -m < 0, then D has a 2-by-2 block in rows/columns i and i+1, and the (i+1)-th row and column of A were interchanged with the m-th row and column.
ldx  INTEGER. The leading dimension of the output array x; ldx ≥ max(1, n).
rwork  REAL for chpsvx
DOUBLE PRECISION for zhpsvx.
Workspace array, DIMENSION at least max(1, n).

Output Parameters
x  COMPLEX for chpsvx
DOUBLE COMPLEX for zhpsvx.
Array, DIMENSION (ldx,*).
If info = 0 or info = n+1, the array x contains the solution matrix X to the system of equations. The second dimension of x must be at least max(1,nrhs).
afp, ipiv  These arrays are output arguments if fact = 'N'. See the description of afp, ipiv in the Input Parameters section.
rcond  REAL for chpsvx
DOUBLE PRECISION for zhpsvx.
An estimate of the reciprocal condition number of the matrix A. If rcond is less than the machine precision (in particular, if rcond = 0), the matrix is singular to working precision. This condition is indicated by a return code of info > 0.
ferr  REAL for chpsvx
DOUBLE PRECISION for zhpsvx.
Array, DIMENSION at least max(1, nrhs). Contains the estimated forward error bound for each solution vector x(j) (the j-th column of the solution matrix X). If xtrue is the true solution corresponding to x(j), ferr(j) is an estimated upper bound for the magnitude of the largest element in (x(j) - xtrue) divided by the magnitude of the largest element in x(j). The estimate is as reliable as the estimate for rcond, and is almost always a slight overestimate of the true error.
berr  REAL for chpsvx
DOUBLE PRECISION for zhpsvx.
Array, DIMENSION at least max(1, nrhs). Contains the componentwise relative backward error for each solution vector x(j), that is, the smallest relative change in any element of A or B that makes x(j) an exact solution.
info  INTEGER. If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
If info = i, and i ≤ n, then d_ii is exactly zero. The factorization has been completed, but the block diagonal matrix D is exactly singular, so the solution and error bounds could not be computed; rcond = 0 is returned.
If info = i, and i = n + 1, then D is nonsingular, but rcond is less than machine precision, meaning that the matrix is singular to working precision. Nevertheless, the solution and error bounds are computed because there are a number of situations where the computed solution can be more accurate than the value of rcond would suggest.
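The return codes described for info can be sorted into four cases before x, ferr, and berr are used. The following is a minimal sketch of that decision logic only; the enum and function names are hypothetical and do not belong to MKL or LAPACKE.

```c
#include <assert.h>

/* Hypothetical classification of the info value returned by ?hpsvx
   for a matrix of order n (names are illustrative, not MKL API). */
typedef enum {
    XSVX_OK,                /* info = 0                          */
    XSVX_ILLEGAL_ARGUMENT,  /* info = -i: i-th parameter illegal */
    XSVX_EXACTLY_SINGULAR,  /* 0 < info <= n: d_ii is zero       */
    XSVX_ILL_CONDITIONED    /* info = n+1: rcond below precision */
} xsvx_status;

xsvx_status classify_hpsvx_info(int info, int n)
{
    assert(n >= 0 && info <= n + 1);
    if (info == 0) return XSVX_OK;
    if (info < 0)  return XSVX_ILLEGAL_ARGUMENT;
    if (info <= n) return XSVX_EXACTLY_SINGULAR;
    return XSVX_ILL_CONDITIONED;
}

/* The solution and error bounds are computed only when info = 0
   or info = n+1 (the latter is a warning, not a failure). */
int hpsvx_solution_available(int info, int n)
{
    xsvx_status s = classify_hpsvx_info(info, n);
    return s == XSVX_OK || s == XSVX_ILL_CONDITIONED;
}
```

A caller would typically treat the ill-conditioned case as a warning and inspect rcond, ferr, and berr before trusting x.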
Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine hpsvx interface are as follows:
ap  Holds the array A of size (n*(n+1)/2).
b  Holds the matrix B of size (n,nrhs).
x  Holds the matrix X of size (n,nrhs).
afp  Holds the array AF of size (n*(n+1)/2).
ipiv  Holds the vector with the number of elements n.
ferr  Holds the vector with the number of elements nrhs.
berr  Holds the vector with the number of elements nrhs.
uplo  Must be 'U' or 'L'. The default value is 'U'.
fact  Must be 'N' or 'F'. The default value is 'N'. If fact = 'F', then both arguments afp and ipiv must be present; otherwise, an error is returned.

LAPACK Routines: Least Squares and Eigenvalue Problems 4
This chapter describes the Intel® Math Kernel Library implementation of routines from the LAPACK package that are used for solving linear least squares problems, eigenvalue and singular value problems, as well as performing a number of related computational tasks.
Sections in this chapter include descriptions of LAPACK computational routines and driver routines. For full reference on LAPACK routines and related information see [LUG].

Least Squares Problems. A typical least squares problem is as follows: given a matrix A and a vector b, find the vector x that minimizes the sum of squares Σ_i ((Ax)_i - b_i)^2 or, equivalently, find the vector x that minimizes the 2-norm ||Ax - b||_2.
In the most usual case, A is an m-by-n matrix with m ≥ n and rank(A) = n. This problem is also referred to as finding the least squares solution to an overdetermined system of linear equations (here we have more equations than unknowns).
To solve this problem, you can use the QR factorization of the matrix A (see QR Factorization).
If m < n and rank(A) = m, there exist an infinite number of solutions x which exactly satisfy Ax = b, and thus minimize the norm ||Ax - b||_2. In this case it is often useful to find the unique solution that minimizes ||x||_2. This problem is referred to as finding the minimum-norm solution to an underdetermined system of linear equations (here we have more unknowns than equations). To solve this problem, you can use the LQ factorization of the matrix A (see LQ Factorization).
In the general case you may have a rank-deficient least squares problem, with rank(A) < min(m, n): find the minimum-norm least squares solution that minimizes both ||x||_2 and ||Ax - b||_2. In this case (or when the rank of A is in doubt) you can use the QR factorization with pivoting or singular value decomposition (see Singular Value Decomposition).

Eigenvalue Problems. The eigenvalue problems (from German eigen "own") are stated as follows: given a matrix A, find the eigenvalues λ and the corresponding eigenvectors z that satisfy the equation
Az = λz (right eigenvectors z)
or the equation
z^H A = λ z^H (left eigenvectors z).
If A is a real symmetric or complex Hermitian matrix, the above two equations are equivalent, and the problem is called a symmetric eigenvalue problem. Routines for solving this type of problem are described in the section Symmetric Eigenvalue Problems.
Routines for solving eigenvalue problems with nonsymmetric or non-Hermitian matrices are described in the section Nonsymmetric Eigenvalue Problems.
The library also includes routines that handle generalized symmetric-definite eigenvalue problems: find the eigenvalues λ and the corresponding eigenvectors z that satisfy one of the following equations:
Az = λBz, ABz = λz, or BAz = λz,
where A is symmetric or Hermitian, and B is symmetric positive-definite or Hermitian positive-definite.
Routines for reducing these problems to standard symmetric eigenvalue problems are described in the section Generalized Symmetric-Definite Eigenvalue Problems.
To solve a particular problem, you usually call several computational routines. Sometimes you need to combine the routines of this chapter with other LAPACK routines described in Chapter 3 as well as with BLAS routines described in Chapter 2.
For example, to solve a set of least squares problems minimizing ||Ax - b||_2 for all columns b of a given matrix B (where A and B are real matrices), you can call ?geqrf to form the factorization A = QR, then call ?ormqr to compute C = Q^H B and finally call the BLAS routine ?trsm to solve for X the system of equations RX = C.
Another way is to call an appropriate driver routine that performs several tasks in one call. For example, to solve the least squares problem the driver routine ?gels can be used.

WARNING LAPACK routines assume that input matrices do not contain IEEE 754 special values such as INF or NaN values. Using these special values may cause LAPACK to return unexpected results or become unstable.

Starting from release 8.0, Intel MKL supports, along with the FORTRAN 77 interface to LAPACK computational and driver routines, the Fortran 95 interface, which uses simplified routine calls with shorter argument lists. The syntax section of the routine description gives the calling sequence for the Fortran 95 interface, where available, immediately after the FORTRAN 77 calls.

Routine Naming Conventions
For each routine in this chapter, when calling it from a FORTRAN 77 program you can use the LAPACK name.
LAPACK names have the structure xyyzzz, which is explained below.
The initial letter x indicates the data type:
s  real, single precision
c  complex, single precision
d  real, double precision
z  complex, double precision

The second and third letters yy indicate the matrix type and storage scheme:
bb  bidiagonal-block matrix
bd  bidiagonal matrix
ge  general matrix
gb  general band matrix
hs  upper Hessenberg matrix
or  (real) orthogonal matrix
op  (real) orthogonal matrix (packed storage)
un  (complex) unitary matrix
up  (complex) unitary matrix (packed storage)
pt  symmetric or Hermitian positive-definite tridiagonal matrix
sy  symmetric matrix
sp  symmetric matrix (packed storage)
sb  (real) symmetric band matrix
st  (real) symmetric tridiagonal matrix
he  Hermitian matrix
hp  Hermitian matrix (packed storage)
hb  (complex) Hermitian band matrix
tr  triangular or quasi-triangular matrix.

The last three letters zzz indicate the computation performed, for example:
qrf  form the QR factorization
lqf  form the LQ factorization.

Thus, the routine sgeqrf forms the QR factorization of general real matrices in single precision; the corresponding routine for complex matrices is cgeqrf.
Names of the LAPACK computational and driver routines for the Fortran 95 interface in Intel MKL are the same as the FORTRAN 77 names but without the first letter that indicates the data type. For example, the name of the routine that forms the QR factorization of general real matrices in the Fortran 95 interface is geqrf. Handling of different data types is done through defining a specific internal parameter referring to a module block with named constants for single and double precision.
For details on the design of the Fortran 95 interface for LAPACK computational and driver routines in Intel MKL and for the general information on how the optional arguments are reconstructed, see the Fortran 95 Interface Conventions in Chapter 3.
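The xyyzzz naming scheme can be decoded mechanically. The sketch below is only an illustration of the convention described above: the function names are hypothetical, and the lookup table simply repeats the two-letter codes listed in this section.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helpers (not MKL API) that decode a FORTRAN 77
   LAPACK routine name of the form x yy zzz. */

/* Data type encoded by the initial letter, or NULL if unknown. */
const char *lapack_data_type(const char *name)
{
    assert(name != NULL);
    switch (name[0]) {
    case 's': return "real, single precision";
    case 'c': return "complex, single precision";
    case 'd': return "real, double precision";
    case 'z': return "complex, double precision";
    default:  return NULL;
    }
}

/* Matrix type encoded by the second and third letters, or NULL. */
const char *lapack_matrix_type(const char *name)
{
    static const struct { const char *code; const char *meaning; } table[] = {
        { "bb", "bidiagonal-block matrix" },
        { "bd", "bidiagonal matrix" },
        { "ge", "general matrix" },
        { "gb", "general band matrix" },
        { "hs", "upper Hessenberg matrix" },
        { "or", "(real) orthogonal matrix" },
        { "op", "(real) orthogonal matrix (packed storage)" },
        { "un", "(complex) unitary matrix" },
        { "up", "(complex) unitary matrix (packed storage)" },
        { "pt", "symmetric or Hermitian positive-definite tridiagonal matrix" },
        { "sy", "symmetric matrix" },
        { "sp", "symmetric matrix (packed storage)" },
        { "sb", "(real) symmetric band matrix" },
        { "st", "(real) symmetric tridiagonal matrix" },
        { "he", "Hermitian matrix" },
        { "hp", "Hermitian matrix (packed storage)" },
        { "hb", "(complex) Hermitian band matrix" },
        { "tr", "triangular or quasi-triangular matrix" }
    };
    size_t i;

    assert(name != NULL);
    if (strlen(name) < 3)
        return NULL;
    for (i = 0; i < sizeof table / sizeof table[0]; ++i)
        if (strncmp(name + 1, table[i].code, 2) == 0)
            return table[i].meaning;
    return NULL;
}

/* Fortran 95 interface name: the same name without the type letter. */
const char *lapack_f95_name(const char *name)
{
    assert(name != NULL && name[0] != '\0');
    return name + 1;
}
```

For instance, under these assumptions sgeqrf decodes to a single-precision real routine on a general matrix, and its Fortran 95 name is geqrf.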
Matrix Storage Schemes
LAPACK routines use the following matrix storage schemes:
• Full storage: a matrix A is stored in a two-dimensional array a, with the matrix element a_ij stored in the array element a(i,j).
• Packed storage scheme allows you to store symmetric, Hermitian, or triangular matrices more compactly: the upper or lower triangle of the matrix is packed by columns in a one-dimensional array.
• Band storage: an m-by-n band matrix with kl sub-diagonals and ku super-diagonals is stored compactly in a two-dimensional array ab with kl+ku+1 rows and n columns. Columns of the matrix are stored in the corresponding columns of the array, and diagonals of the matrix are stored in rows of the array.
In Chapters 3 and 4, arrays that hold matrices in the packed storage have names ending in p; arrays with matrices in the band storage have names ending in b.
For more information on matrix storage schemes, see "Matrix Arguments" in Appendix B.

Mathematical Notation
In addition to the mathematical notation used in the description of BLAS and LAPACK Linear Equations routines, descriptions of the routines to solve Least Squares and Eigenvalue problems use the following notation:
λ_i  Eigenvalues of the matrix A (for the definition of eigenvalues, see Eigenvalue Problems).
σ_i  Singular values of the matrix A. They are equal to square roots of the eigenvalues of A^H A. (For more information, see Singular Value Decomposition).
||x||_2  The 2-norm of the vector x: ||x||_2 = (Σ_i |x_i|^2)^(1/2) = ||x||_E.
||A||_2  The 2-norm (or spectral norm) of the matrix A: ||A||_2 = max_i σ_i, ||A||_2^2 = max_{|x|=1} (Ax·Ax).
||A||_E  The Euclidean norm of the matrix A: ||A||_E^2 = Σ_i Σ_j |a_ij|^2 (for vectors, the Euclidean norm and the 2-norm are equal: ||x||_E = ||x||_2).
θ(x, y)  The acute angle between vectors x and y: cos θ(x, y) = |x·y| / (||x||_2 ||y||_2).

Computational Routines
In the sections that follow, the descriptions of LAPACK computational routines are given.
These routines perform distinct computational tasks that can be used for:
• Orthogonal Factorizations
• Singular Value Decomposition
• Symmetric Eigenvalue Problems
• Generalized Symmetric-Definite Eigenvalue Problems
• Nonsymmetric Eigenvalue Problems
• Generalized Nonsymmetric Eigenvalue Problems
• Generalized Singular Value Decomposition
See also the respective driver routines.

Orthogonal Factorizations
This section describes the LAPACK routines for the QR (RQ) and LQ (QL) factorization of matrices. Routines for the RZ factorization as well as for generalized QR and RQ factorizations are also included.

QR Factorization. Assume that A is an m-by-n matrix to be factored.
If m ≥ n, the QR factorization is given by
$$A = Q \begin{pmatrix} R \\ 0 \end{pmatrix} = (Q_1, Q_2) \begin{pmatrix} R \\ 0 \end{pmatrix} = Q_1 R$$
where R is an n-by-n upper triangular matrix with real diagonal elements, Q is an m-by-m orthogonal (or unitary) matrix, and Q_1 consists of the first n columns of Q.
You can use the QR factorization for solving the following least squares problem: minimize ||Ax - b||_2 where A is a full-rank m-by-n matrix (m ≥ n). After factoring the matrix, compute the solution x by solving Rx = (Q_1)^T b.
If m < n, the QR factorization is given by
A = QR = Q(R_1, R_2)
where R is trapezoidal, R_1 is upper triangular and R_2 is rectangular.
The LAPACK routines do not form the matrix Q explicitly. Instead, Q is represented as a product of min(m, n) elementary reflectors. Routines are provided to work with Q in this representation.

LQ Factorization
LQ factorization of an m-by-n matrix A is as follows. If m ≤ n,
$$A = (L, 0)\, Q = (L, 0) \begin{pmatrix} Q_1 \\ Q_2 \end{pmatrix} = L Q_1$$
where L is an m-by-m lower triangular matrix with real diagonal elements, Q is an n-by-n orthogonal (or unitary) matrix, and Q_1 consists of the first m rows of Q.
If m > n, the LQ factorization is
$$A = \begin{pmatrix} L_1 \\ L_2 \end{pmatrix} Q$$
where L_1 is an n-by-n lower triangular matrix, L_2 is rectangular, and Q is an n-by-n orthogonal (or unitary) matrix.
You can use the LQ factorization to find the minimum-norm solution of an underdetermined system of linear equations Ax = b where A is an m-by-n matrix of rank m (m < n).
After factoring the matrix, compute the solution vector x as follows: solve Ly = b for y, and then compute x = (Q_1)^H y.
Table "Computational Routines for Orthogonal Factorization" lists LAPACK routines (FORTRAN 77 interface) that perform orthogonal factorization of matrices. Respective routine names in Fortran 95 interface are without the first symbol (see Routine Naming Conventions).

Computational Routines for Orthogonal Factorization
Matrix type, factorization                        Factorize without pivoting   Factorize with pivoting   Generate matrix Q   Apply matrix Q
general matrices, QR factorization                geqrf, geqrfp                geqpf, geqp3              orgqr, ungqr        ormqr, unmqr
general matrices, RQ factorization                gerqf                                                  orgrq, ungrq        ormrq, unmrq
general matrices, LQ factorization                gelqf                                                  orglq, unglq        ormlq, unmlq
general matrices, QL factorization                geqlf                                                  orgql, ungql        ormql, unmql
trapezoidal matrices, RZ factorization            tzrzf                                                                      ormrz, unmrz
pair of matrices, generalized QR factorization    ggqrf
pair of matrices, generalized RQ factorization    ggrqf

?geqrf
Computes the QR factorization of a general m-by-n matrix.

Syntax
Fortran 77:
call sgeqrf(m, n, a, lda, tau, work, lwork, info)
call dgeqrf(m, n, a, lda, tau, work, lwork, info)
call cgeqrf(m, n, a, lda, tau, work, lwork, info)
call zgeqrf(m, n, a, lda, tau, work, lwork, info)
Fortran 95:
call geqrf(a [, tau] [,info])
C:
lapack_int LAPACKE_<?>geqrf( int matrix_order, lapack_int m, lapack_int n, <datatype>* a, lapack_int lda, <datatype>* tau );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine forms the QR factorization of a general m-by-n matrix A (see Orthogonal Factorizations). No pivoting is performed.
The routine does not form the matrix Q explicitly. Instead, Q is represented as a product of min(m, n) elementary reflectors. Routines are provided to work with Q in this representation.
NOTE This routine supports the Progress Routine feature. See Progress Function section for details.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
m  INTEGER. The number of rows in the matrix A (m ≥ 0).
n  INTEGER. The number of columns in A (n ≥ 0).
a, work  REAL for sgeqrf
DOUBLE PRECISION for dgeqrf
COMPLEX for cgeqrf
DOUBLE COMPLEX for zgeqrf.
Arrays: a(lda,*) contains the matrix A. The second dimension of a must be at least max(1, n).
work is a workspace array of dimension max(1, lwork).
lda  INTEGER. The leading dimension of a; at least max(1, m).
lwork  INTEGER. The size of the work array (lwork ≥ n).
If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla.
See Application Notes for the suggested value of lwork.

Output Parameters
a  Overwritten by the factorization data as follows:
If m ≥ n, the elements below the diagonal are overwritten by the details of the unitary matrix Q, and the upper triangle is overwritten by the corresponding elements of the upper triangular matrix R.
If m < n, the strictly lower triangular part is overwritten by the details of the unitary matrix Q, and the remaining elements are overwritten by the corresponding elements of the m-by-n upper trapezoidal matrix R.
tau  REAL for sgeqrf
DOUBLE PRECISION for dgeqrf
COMPLEX for cgeqrf
DOUBLE COMPLEX for zgeqrf.
Array, DIMENSION at least max(1, min(m, n)). Contains additional information on the matrix Q.
work(1)  If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs.
info  INTEGER. If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine geqrf interface are the following:
a  Holds the matrix A of size (m,n).
tau  Holds the vector of length min(m,n).

Application Notes
For better performance, try using lwork = n*blocksize, where blocksize is a machine-dependent value (typically, 16 to 64) required for optimum performance of the blocked algorithm.
If you are in doubt how much workspace to supply, use a generous value of lwork for the first run or set lwork = -1.
If you choose the first option and set lwork to any admissible size, that is, one no less than the minimal value described, the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the corresponding array work on exit. Use this value (work(1)) for subsequent runs.
If you set lwork = -1, the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work). This operation is called a workspace query.
Note that if you set lwork to less than the minimal required value and not -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.
The computed factorization is the exact factorization of a matrix A + E, where ||E||_2 = O(ε)||A||_2.
The approximate number of floating-point operations for real flavors is
(4/3)n^3 if m = n,
(2/3)n^2(3m-n) if m > n,
(2/3)m^2(3n-m) if m < n.
The number of operations for complex flavors is 4 times greater.
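The operation counts above can be evaluated with a short helper, for example to compare problem shapes before choosing a routine. This is a sketch of the stated formulas only; the function name is hypothetical and the result is an estimate, not a measurement.

```c
#include <assert.h>

/* Hypothetical helper (not MKL API): approximate floating-point
   operation count of an unpivoted QR factorization of an m-by-n
   matrix, following the formulas in the Application Notes.
   Complex flavors cost about 4 times as much as real flavors. */
double geqrf_flops(double m, double n, int is_complex)
{
    double flops;

    assert(m >= 0.0 && n >= 0.0);
    if (m == n)
        flops = 4.0 * n * n * n / 3.0;              /* (4/3) n^3         */
    else if (m > n)
        flops = 2.0 * n * n * (3.0 * m - n) / 3.0;  /* (2/3) n^2 (3m-n)  */
    else
        flops = 2.0 * m * m * (3.0 * n - m) / 3.0;  /* (2/3) m^2 (3n-m)  */

    return is_complex ? 4.0 * flops : flops;
}
```

The three branches agree at m = n, since (2/3)n^2(3n-n) = (4/3)n^3.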
To solve a set of least squares problems minimizing ||A*x - b||_2 for all columns b of a given matrix B, you can call the following:
?geqrf (this routine)  to factorize A = QR;
ormqr  to compute C = Q^T*B (for real matrices);
unmqr  to compute C = Q^H*B (for complex matrices);
trsm (a BLAS routine)  to solve R*X = C.
(The columns of the computed X are the least squares solution vectors x.)
To compute the elements of Q explicitly, call
orgqr  (for real matrices)
ungqr  (for complex matrices).

See Also
mkl_progress

?geqrfp
Computes the QR factorization of a general m-by-n matrix with non-negative diagonal elements.

Syntax
Fortran 77:
call sgeqrfp(m, n, a, lda, tau, work, lwork, info)
call dgeqrfp(m, n, a, lda, tau, work, lwork, info)
call cgeqrfp(m, n, a, lda, tau, work, lwork, info)
call zgeqrfp(m, n, a, lda, tau, work, lwork, info)
C:
lapack_int LAPACKE_<?>geqrfp( int matrix_order, lapack_int m, lapack_int n, <datatype>* a, lapack_int lda, <datatype>* tau );

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h
• C: mkl_lapacke.h

Description
The routine forms the QR factorization of a general m-by-n matrix A (see Orthogonal Factorizations). No pivoting is performed.
The routine does not form the matrix Q explicitly. Instead, Q is represented as a product of min(m, n) elementary reflectors. Routines are provided to work with Q in this representation.
NOTE This routine supports the Progress Routine feature. See Progress Function section for details.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
m  INTEGER. The number of rows in the matrix A (m ≥ 0).
n  INTEGER. The number of columns in A (n ≥ 0).
a, work  REAL for sgeqrfp
DOUBLE PRECISION for dgeqrfp
COMPLEX for cgeqrfp
DOUBLE COMPLEX for zgeqrfp.
Arrays: a(lda,*) contains the matrix A. The second dimension of a must be at least max(1, n).
work is a workspace array of dimension max(1, lwork).
lda  INTEGER. The leading dimension of a; at least max(1, m).
lwork  INTEGER. The size of the work array (lwork ≥ n).
If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla.
See Application Notes for the suggested value of lwork.

Output Parameters
a  Overwritten by the factorization data as follows:
If m ≥ n, the elements below the diagonal are overwritten by the details of the unitary matrix Q, and the upper triangle is overwritten by the corresponding elements of the upper triangular matrix R.
If m < n, the strictly lower triangular part is overwritten by the details of the unitary matrix Q, and the remaining elements are overwritten by the corresponding elements of the m-by-n upper trapezoidal matrix R.
The diagonal elements of the matrix R are non-negative.
tau  REAL for sgeqrfp
DOUBLE PRECISION for dgeqrfp
COMPLEX for cgeqrfp
DOUBLE COMPLEX for zgeqrfp.
Array, DIMENSION at least max(1, min(m, n)). Contains additional information on the matrix Q.
work(1)  If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs.
info  INTEGER. If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine geqrfp interface are the following:
a  Holds the matrix A of size (m,n).
tau  Holds the vector of length min(m,n).

Application Notes
For better performance, try using lwork = n*blocksize, where blocksize is a machine-dependent value (typically, 16 to 64) required for optimum performance of the blocked algorithm.
If you are in doubt how much workspace to supply, use a generous value of lwork for the first run or set lwork = -1.
If you choose the first option and set lwork to any admissible size, that is, one no less than the minimal value described, the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the corresponding array work on exit. Use this value (work(1)) for subsequent runs.
If you set lwork = -1, the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work). This operation is called a workspace query.
Note that if you set lwork to less than the minimal required value and not -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.
The computed factorization is the exact factorization of a matrix A + E, where ||E||_2 = O(ε)||A||_2.
The approximate number of floating-point operations for real flavors is
(4/3)n^3 if m = n,
(2/3)n^2(3m-n) if m > n,
(2/3)m^2(3n-m) if m < n.
The number of operations for complex flavors is 4 times greater.
To solve a set of least squares problems minimizing ||A*x - b||_2 for all columns b of a given matrix B, you can call the following:
?geqrfp (this routine)  to factorize A = QR;
ormqr  to compute C = Q^T*B (for real matrices);
unmqr  to compute C = Q^H*B (for complex matrices);
trsm (a BLAS routine)  to solve R*X = C.
(The columns of the computed X are the least squares solution vectors x.)
To compute the elements of Q explicitly, call
orgqr  (for real matrices)
ungqr  (for complex matrices).
See Also
mkl_progress

?geqpf
Computes the QR factorization of a general m-by-n matrix with pivoting.

Syntax
Fortran 77:
call sgeqpf(m, n, a, lda, jpvt, tau, work, info)
call dgeqpf(m, n, a, lda, jpvt, tau, work, info)
call cgeqpf(m, n, a, lda, jpvt, tau, work, rwork, info)
call zgeqpf(m, n, a, lda, jpvt, tau, work, rwork, info)
Fortran 95:
call geqpf(a, jpvt [,tau] [,info])
C:
lapack_int LAPACKE_<?>geqpf( int matrix_order, lapack_int m, lapack_int n, <datatype>* a, lapack_int lda, lapack_int* jpvt, <datatype>* tau );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine is deprecated and has been replaced by routine geqp3.
The routine ?geqpf forms the QR factorization of a general m-by-n matrix A with column pivoting: A*P = Q*R (see Orthogonal Factorizations). Here P denotes an n-by-n permutation matrix.
The routine does not form the matrix Q explicitly. Instead, Q is represented as a product of min(m, n) elementary reflectors. Routines are provided to work with Q in this representation.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
m  INTEGER. The number of rows in the matrix A (m ≥ 0).
n  INTEGER. The number of columns in A (n ≥ 0).
a, work  REAL for sgeqpf
DOUBLE PRECISION for dgeqpf
COMPLEX for cgeqpf
DOUBLE COMPLEX for zgeqpf.
Arrays: a(lda,*) contains the matrix A. The second dimension of a must be at least max(1, n).
work is a workspace array. The size of the work array must be at least max(1, 3*n) for real flavors and at least max(1, n) for complex flavors.
lda  INTEGER. The leading dimension of a; at least max(1, m).
jpvt  INTEGER. Array, DIMENSION at least max(1, n).
On entry, if jpvt(i) > 0, the i-th column of A is moved to the beginning of A*P before the computation, and fixed in place during the computation. If jpvt(i) = 0, the i-th column of A is a free column (that is, it may be interchanged during the computation with any other free column).
rwork REAL for cgeqpf, DOUBLE PRECISION for zgeqpf. A workspace array, DIMENSION at least max(1, 2*n).

Output Parameters
a Overwritten by the factorization data as follows:
If m ≥ n, the elements below the diagonal are overwritten by the details of the unitary (orthogonal) matrix Q, and the upper triangle is overwritten by the corresponding elements of the upper triangular matrix R.
If m < n, the strictly lower triangular part is overwritten by the details of the matrix Q, and the remaining elements are overwritten by the corresponding elements of the m-by-n upper trapezoidal matrix R.
tau REAL for sgeqpf, DOUBLE PRECISION for dgeqpf, COMPLEX for cgeqpf, DOUBLE COMPLEX for zgeqpf. Array, DIMENSION at least max(1, min(m, n)). Contains additional information on the matrix Q.
jpvt Overwritten by details of the permutation matrix P in the factorization A*P = Q*R. More precisely, the columns of A*P are the columns of A in the following order: jpvt(1), jpvt(2), ..., jpvt(n).
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine geqpf interface are the following:
a Holds the matrix A of size (m,n).
jpvt Holds the vector of length n.
tau Holds the vector of length min(m,n).

Application Notes
The computed factorization is the exact factorization of a matrix A + E, where ||E||_2 = O(ε)||A||_2.
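The meaning of the output array jpvt (column i of A*P is column jpvt(i) of A) can be illustrated without calling the library. This is a sketch under stated assumptions: both helper names are hypothetical, the matrix is stored column-major, and jpvt holds 1-based indices as in the Fortran description. The second helper uses the fact that if y solves the problem for A*P, then x = P*y, that is, x(jpvt(i)) = y(i).

```c
#include <assert.h>

/* Form AP = A*P for a column-major m-by-n matrix A, where column j
   of AP is column jpvt[j] (1-based) of A. Hypothetical helper. */
void apply_column_pivots(int m, int n, const double *a, int lda,
                         const int *jpvt, double *ap, int ldap)
{
    assert(lda >= m && ldap >= m);
    for (int j = 0; j < n; ++j) {
        int src = jpvt[j] - 1;            /* convert to 0-based */
        assert(src >= 0 && src < n);
        for (int i = 0; i < m; ++i)
            ap[i + j * ldap] = a[i + src * lda];
    }
}

/* Recover x = P*y from a solution y of the permuted problem:
   x(jpvt(i)) = y(i). Hypothetical helper. */
void unpermute_solution(int n, const int *jpvt, const double *y, double *x)
{
    for (int i = 0; i < n; ++i) {
        assert(jpvt[i] >= 1 && jpvt[i] <= n);
        x[jpvt[i] - 1] = y[i];
    }
}
```

For example, with jpvt = (3, 1, 2) the first column of A*P is the third column of A, and the first entry of y belongs to the third unknown.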
The approximate number of floating-point operations for real flavors is
(4/3)n^3 if m = n,
(2/3)n^2(3m-n) if m > n,
(2/3)m^2(3n-m) if m < n.
The number of operations for complex flavors is 4 times greater.
To solve a set of least squares problems minimizing ||A*x - b||_2 for all columns b of a given matrix B, you can call the following:
?geqpf (this routine) to factorize A*P = Q*R;
ormqr to compute C = Q^T*B (for real matrices);
unmqr to compute C = Q^H*B (for complex matrices);
trsm (a BLAS routine) to solve R*X = C.
(The columns of the computed X are the permuted least squares solution vectors x; the output array jpvt specifies the permutation order.)
To compute the elements of Q explicitly, call
orgqr (for real matrices)
ungqr (for complex matrices).

?geqp3
Computes the QR factorization of a general m-by-n matrix with column pivoting using level 3 BLAS.

Syntax
Fortran 77:
call sgeqp3(m, n, a, lda, jpvt, tau, work, lwork, info)
call dgeqp3(m, n, a, lda, jpvt, tau, work, lwork, info)
call cgeqp3(m, n, a, lda, jpvt, tau, work, lwork, rwork, info)
call zgeqp3(m, n, a, lda, jpvt, tau, work, lwork, rwork, info)
Fortran 95:
call geqp3(a, jpvt [,tau] [,info])
C:
lapack_int LAPACKE_<?>geqp3( int matrix_order, lapack_int m, lapack_int n, <datatype>* a, lapack_int lda, lapack_int* jpvt, <datatype>* tau );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine forms the QR factorization of a general m-by-n matrix A with column pivoting: A*P = Q*R (see Orthogonal Factorizations) using Level 3 BLAS. Here P denotes an n-by-n permutation matrix. Use this routine instead of geqpf for better performance.
The routine does not form the matrix Q explicitly. Instead, Q is represented as a product of min(m, n) elementary reflectors. Routines are provided to work with Q in this representation.

Input Parameters
The data types are given for the Fortran interface.
A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
m INTEGER. The number of rows in the matrix A (m ≥ 0).
n INTEGER. The number of columns in A (n ≥ 0).
a, work REAL for sgeqp3, DOUBLE PRECISION for dgeqp3, COMPLEX for cgeqp3, DOUBLE COMPLEX for zgeqp3.
Arrays: a (lda,*) contains the matrix A. The second dimension of a must be at least max(1, n).
work is a workspace array, its dimension max(1, lwork).
lda INTEGER. The leading dimension of a; at least max(1, m).
lwork INTEGER. The size of the work array; must be at least max(1, 3*n+1) for real flavors, and at least max(1, n+1) for complex flavors.
If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla. See Application Notes below for details.
jpvt INTEGER. Array, DIMENSION at least max(1, n).
On entry, if jpvt(i) ≠ 0, the i-th column of A is moved to the beginning of A*P before the computation, and fixed in place during the computation. If jpvt(i) = 0, the i-th column of A is a free column (that is, it may be interchanged during the computation with any other free column).
rwork REAL for cgeqp3, DOUBLE PRECISION for zgeqp3. A workspace array, DIMENSION at least max(1, 2*n). Used in complex flavors only.

Output Parameters
a Overwritten by the factorization data as follows:
If m ≥ n, the elements below the diagonal are overwritten by the details of the unitary (orthogonal) matrix Q, and the upper triangle is overwritten by the corresponding elements of the upper triangular matrix R.
If m < n, the strictly lower triangular part is overwritten by the details of the matrix Q, and the remaining elements are overwritten by the corresponding elements of the m-by-n upper trapezoidal matrix R.
tau REAL for sgeqp3, DOUBLE PRECISION for dgeqp3, COMPLEX for cgeqp3, DOUBLE COMPLEX for zgeqp3. Array, DIMENSION at least max(1, min(m, n)). Contains scalar factors of the elementary reflectors for the matrix Q.
jpvt Overwritten by details of the permutation matrix P in the factorization A*P = Q*R. More precisely, the columns of A*P are the columns of A in the following order: jpvt(1), jpvt(2), ..., jpvt(n).
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine geqp3 interface are the following:
a Holds the matrix A of size (m,n).
jpvt Holds the vector of length n.
tau Holds the vector of length min(m,n).

Application Notes
To solve a set of least squares problems minimizing ||A*x - b||_2 for all columns b of a given matrix B, you can call the following:
?geqp3 (this routine) to factorize A*P = Q*R;
ormqr to compute C = Q^T*B (for real matrices);
unmqr to compute C = Q^H*B (for complex matrices);
trsm (a BLAS routine) to solve R*X = C.
(The columns of the computed X are the permuted least squares solution vectors x; the output array jpvt specifies the permutation order.)
To compute the elements of Q explicitly, call
orgqr (for real matrices)
ungqr (for complex matrices).
If you are in doubt how much workspace to supply, use a generous value of lwork for the first run or set lwork = -1.
If you choose the first option and set lwork to any admissible size (no less than the minimal value described), the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the array work on exit. Use this value (work(1)) for subsequent runs.
If you set lwork = -1, the routine returns immediately and provides the recommended workspace in the first element of the array work. This operation is called a workspace query.
Note that if you set lwork to less than the minimal required value and not -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.

?orgqr
Generates the real orthogonal matrix Q of the QR factorization formed by ?geqrf.

Syntax
Fortran 77:
call sorgqr(m, n, k, a, lda, tau, work, lwork, info)
call dorgqr(m, n, k, a, lda, tau, work, lwork, info)
Fortran 95:
call orgqr(a, tau [,info])
C:
lapack_int LAPACKE_<?>orgqr( int matrix_order, lapack_int m, lapack_int n, lapack_int k, <datatype>* a, lapack_int lda, const <datatype>* tau );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine generates the whole or part of the m-by-m orthogonal matrix Q of the QR factorization formed by the routines ?geqrf or ?geqpf. Use this routine after a call to sgeqrf/dgeqrf or sgeqpf/dgeqpf.
Usually Q is determined from the QR factorization of an m-by-p matrix A with m ≥ p.
To compute the whole matrix Q, use:
call ?orgqr(m, m, p, a, lda, tau, work, lwork, info)
To compute the leading p columns of Q (which form an orthonormal basis in the space spanned by the columns of A):
call ?orgqr(m, p, p, a, lda, tau, work, lwork, info)
To compute the matrix Qk of the QR factorization of the leading k columns of the matrix A:
call ?orgqr(m, m, k, a, lda, tau, work, lwork, info)
To compute the leading k columns of Qk (which form an orthonormal basis in the space spanned by the leading k columns of the matrix A):
call ?orgqr(m, k, k, a, lda, tau, work, lwork, info)

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
m INTEGER. The order of the orthogonal matrix Q (m ≥ 0).
n INTEGER. The number of columns of Q to be computed (0 ≤ n ≤ m).
k INTEGER. The number of elementary reflectors whose product defines the matrix Q (0 ≤ k ≤ n).
a, tau, work REAL for sorgqr, DOUBLE PRECISION for dorgqr.
Arrays: a(lda,*) and tau(*) are the arrays returned by sgeqrf / dgeqrf or sgeqpf / dgeqpf. The second dimension of a must be at least max(1, n). The dimension of tau must be at least max(1, k).
work is a workspace array, its dimension max(1, lwork).
lda INTEGER. The leading dimension of a; at least max(1, m).
lwork INTEGER. The size of the work array (lwork ≥ n).
If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla. See Application Notes for the suggested value of lwork.

Output Parameters
a Overwritten by n leading columns of the m-by-m orthogonal matrix Q.
work(1) If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine orgqr interface are the following:
a Holds the matrix A of size (m,n).
tau Holds the vector of length (k).

Application Notes
For better performance, try using lwork = n*blocksize, where blocksize is a machine-dependent value (typically, 16 to 64) required for optimum performance of the blocked algorithm.
If you are in doubt how much workspace to supply, use a generous value of lwork for the first run or set lwork = -1.
If you choose the first option and set lwork to any admissible size (no less than the minimal value described), the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the array work on exit. Use this value (work(1)) for subsequent runs.
If you set lwork = -1, the routine returns immediately and provides the recommended workspace in the first element of the array work. This operation is called a workspace query.
Note that if you set lwork to less than the minimal required value and not -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.
The computed Q differs from an exactly orthogonal matrix by a matrix E such that ||E||_2 = O(ε)*||A||_2, where ε is the machine precision.
The total number of floating-point operations is approximately 4*m*n*k - 2*(m + n)*k^2 + (4/3)*k^3.
If n = k, the number is approximately (2/3)*n^2*(3m - n).
The complex counterpart of this routine is ungqr.

?ormqr
Multiplies a real matrix by the orthogonal matrix Q of the QR factorization formed by ?geqrf or ?geqpf.

Syntax
Fortran 77:
call sormqr(side, trans, m, n, k, a, lda, tau, c, ldc, work, lwork, info)
call dormqr(side, trans, m, n, k, a, lda, tau, c, ldc, work, lwork, info)
Fortran 95:
call ormqr(a, tau, c [,side] [,trans] [,info])
C:
lapack_int LAPACKE_<?>ormqr( int matrix_order, char side, char trans, lapack_int m, lapack_int n, lapack_int k, const <datatype>* a, lapack_int lda, const <datatype>* tau, <datatype>* c, lapack_int ldc );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine multiplies a real matrix C by Q or Q^T, where Q is the orthogonal matrix Q of the QR factorization formed by the routines ?geqrf or ?geqpf.
Depending on the parameters side and trans, the routine can form one of the matrix products Q*C, Q^T*C, C*Q, or C*Q^T (overwriting the result on C).

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
side CHARACTER*1. Must be either 'L' or 'R'.
If side = 'L', Q or Q^T is applied to C from the left.
If side = 'R', Q or Q^T is applied to C from the right.
trans CHARACTER*1. Must be either 'N' or 'T'.
If trans = 'N', the routine multiplies C by Q.
If trans = 'T', the routine multiplies C by Q^T.
m INTEGER. The number of rows in the matrix C (m ≥ 0).
n INTEGER. The number of columns in C (n ≥ 0).
k INTEGER. The number of elementary reflectors whose product defines the matrix Q. Constraints:
0 ≤ k ≤ m if side = 'L';
0 ≤ k ≤ n if side = 'R'.
a, tau, c, work REAL for sormqr, DOUBLE PRECISION for dormqr.
Arrays: a(lda,*) and tau(*) are the arrays returned by sgeqrf / dgeqrf or sgeqpf / dgeqpf. The second dimension of a must be at least max(1, k). The dimension of tau must be at least max(1, k).
c(ldc,*) contains the matrix C. The second dimension of c must be at least max(1, n).
work is a workspace array, its dimension max(1, lwork).
lda INTEGER. The leading dimension of a. Constraints:
lda ≥ max(1, m) if side = 'L';
lda ≥ max(1, n) if side = 'R'.
ldc INTEGER. The leading dimension of c. Constraint: ldc ≥ max(1, m).
lwork INTEGER. The size of the work array. Constraints:
lwork ≥ max(1, n) if side = 'L';
lwork ≥ max(1, m) if side = 'R'.
If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla. See Application Notes for the suggested value of lwork.

Output Parameters
c Overwritten by the product Q*C, Q^T*C, C*Q, or C*Q^T (as specified by side and trans).
work(1) If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine ormqr interface are the following:
a Holds the matrix A of size (r,k).
r = m if side = 'L'.
r = n if side = 'R'.
tau Holds the vector of length (k).
c Holds the matrix C of size (m,n).
side Must be 'L' or 'R'. The default value is 'L'.
trans Must be 'N' or 'T'. The default value is 'N'.
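The constraints on side, trans, k, lda, ldc, and lwork listed for ?ormqr can be collected into a pre-call check. The sketch below is illustrative only: ormqr_check_args and ormqr_min_lwork are hypothetical helpers, the return value imitates the info = -i convention using the argument positions of the Fortran 77 calling sequence, and the order in which the library itself tests its arguments is an assumption. Only uppercase option characters are accepted here.

```c
#include <assert.h>

static int max1(int v) { return v > 1 ? v : 1; }

/* Minimal admissible lwork: max(1, n) for side = 'L',
   max(1, m) for side = 'R'. Hypothetical helper. */
int ormqr_min_lwork(char side, int m, int n)
{
    assert(side == 'L' || side == 'R');
    return max1(side == 'L' ? n : m);
}

/* Returns 0 if the arguments satisfy the documented constraints,
   otherwise -i where i is the position of the offending argument in
   call ?ormqr(side, trans, m, n, k, a, lda, tau, c, ldc, work, lwork, info). */
int ormqr_check_args(char side, char trans, int m, int n, int k,
                     int lda, int ldc, int lwork)
{
    if (side != 'L' && side != 'R') return -1;
    if (trans != 'N' && trans != 'T') return -2;
    if (m < 0) return -3;
    if (n < 0) return -4;

    int r = (side == 'L') ? m : n;        /* order of Q */
    if (k < 0 || k > r) return -5;
    if (lda < max1(r)) return -7;
    if (ldc < max1(m)) return -10;

    /* lwork = -1 requests a workspace query and is always accepted. */
    if (lwork != -1 && lwork < ormqr_min_lwork(side, m, n)) return -12;
    return 0;
}
```

A caller could run such a check before the real call to get an early diagnostic, for example ormqr_check_args('L', 'T', m, n, k, lda, ldc, -1) before a workspace query.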
Application Notes
For better performance, try using lwork = n*blocksize (if side = 'L') or lwork = m*blocksize (if side = 'R'), where blocksize is a machine-dependent value (typically, 16 to 64) required for optimum performance of the blocked algorithm.
If you are in doubt how much workspace to supply, use a generous value of lwork for the first run or set lwork = -1.
If you choose the first option and set lwork to any admissible size (no less than the minimal value described), the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the array work on exit. Use this value (work(1)) for subsequent runs.
If you set lwork = -1, the routine returns immediately and provides the recommended workspace in the first element of the array work. This operation is called a workspace query.
Note that if you set lwork to less than the minimal required value and not -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.
The complex counterpart of this routine is unmqr.

?ungqr
Generates the complex unitary matrix Q of the QR factorization formed by ?geqrf.

Syntax
Fortran 77:
call cungqr(m, n, k, a, lda, tau, work, lwork, info)
call zungqr(m, n, k, a, lda, tau, work, lwork, info)
Fortran 95:
call ungqr(a, tau [,info])
C:
lapack_int LAPACKE_<?>ungqr( int matrix_order, lapack_int m, lapack_int n, lapack_int k, <datatype>* a, lapack_int lda, const <datatype>* tau );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine generates the whole or part of the m-by-m unitary matrix Q of the QR factorization formed by the routines ?geqrf or ?geqpf. Use this routine after a call to cgeqrf/zgeqrf or cgeqpf/zgeqpf.
Usually Q is determined from the QR factorization of an m-by-p matrix A with m ≥ p.
To compute the whole matrix Q, use:
call ?ungqr(m, m, p, a, lda, tau, work, lwork, info)
To compute the leading p columns of Q (which form an orthonormal basis in the space spanned by the columns of A):
call ?ungqr(m, p, p, a, lda, tau, work, lwork, info)
To compute the matrix Qk of the QR factorization of the leading k columns of the matrix A:
call ?ungqr(m, m, k, a, lda, tau, work, lwork, info)
To compute the leading k columns of Qk (which form an orthonormal basis in the space spanned by the leading k columns of the matrix A):
call ?ungqr(m, k, k, a, lda, tau, work, lwork, info)

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
m INTEGER. The order of the unitary matrix Q (m ≥ 0).
n INTEGER. The number of columns of Q to be computed (0 ≤ n ≤ m).
k INTEGER. The number of elementary reflectors whose product defines the matrix Q (0 ≤ k ≤ n).
a, tau, work COMPLEX for cungqr, DOUBLE COMPLEX for zungqr.
Arrays: a(lda,*) and tau(*) are the arrays returned by cgeqrf/zgeqrf or cgeqpf/zgeqpf. The second dimension of a must be at least max(1, n). The dimension of tau must be at least max(1, k).
work is a workspace array, its dimension max(1, lwork).
lda INTEGER. The leading dimension of a; at least max(1, m).
lwork INTEGER. The size of the work array (lwork ≥ n).
If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla. See Application Notes for the suggested value of lwork.
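The four call patterns in the Description differ only in the (m, n, k) triple passed to the routine. As a sketch, the mapping can be written as a small selector; q_request, gqr_args, and gqr_args_for are hypothetical names introduced for illustration, and the function only computes the triple, it does not call ?ungqr.

```c
#include <assert.h>
#include <stddef.h>

typedef enum {
    Q_WHOLE,       /* whole m-by-m matrix Q                        */
    Q_LEADING_P,   /* leading p columns of Q                       */
    QK_WHOLE,      /* whole Qk from the leading k columns of A     */
    QK_LEADING_K   /* leading k columns of Qk                      */
} q_request;

typedef struct { int m, n, k; } gqr_args;

/* Fill *out with the (m, n, k) triple for the requested part of Q,
   given an m-by-p factored matrix A (m >= p) and, for the Qk cases,
   the number k of leading columns (0 <= k <= p).
   Returns 0 on success, -1 if the sizes are inconsistent. */
int gqr_args_for(q_request what, int m, int p, int k, gqr_args *out)
{
    assert(out != NULL);
    if (p < 0 || p > m) return -1;
    if ((what == QK_WHOLE || what == QK_LEADING_K) && (k < 0 || k > p))
        return -1;

    out->m = m;
    switch (what) {
    case Q_WHOLE:      out->n = m; out->k = p; break;
    case Q_LEADING_P:  out->n = p; out->k = p; break;
    case QK_WHOLE:     out->n = m; out->k = k; break;
    case QK_LEADING_K: out->n = k; out->k = k; break;
    default:           return -1;
    }
    return 0;
}
```

Every triple produced this way satisfies the documented constraints 0 ≤ k ≤ n ≤ m, so the minimal workspace lwork ≥ n follows directly from out->n.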
Output Parameters
a Overwritten by n leading columns of the m-by-m unitary matrix Q.
work(1) If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine ungqr interface are the following:
a Holds the matrix A of size (m,n).
tau Holds the vector of length (k).

Application Notes
For better performance, try using lwork = n*blocksize, where blocksize is a machine-dependent value (typically, 16 to 64) required for optimum performance of the blocked algorithm.
If it is not clear how much workspace to supply, use a generous value of lwork for the first run, or set lwork = -1.
In the first case the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the array work on exit. Use this value (work(1)) for subsequent runs.
If lwork = -1, then the routine returns immediately and provides the recommended workspace in the first element of the array work. This operation is called a workspace query.
Note that if lwork is less than the minimal required value and is not equal to -1, then the routine returns immediately with an error exit and does not provide any information on the recommended workspace.
The computed Q differs from an exactly unitary matrix by a matrix E such that ||E||_2 = O(ε)*||A||_2, where ε is the machine precision.
The total number of floating-point operations is approximately 16*m*n*k - 8*(m + n)*k^2 + (16/3)*k^3.
If n = k, the number is approximately (8/3)*n^2*(3m - n).
The real counterpart of this routine is orgqr.

?unmqr
Multiplies a complex matrix by the unitary matrix Q of the QR factorization formed by ?geqrf.

Syntax
Fortran 77:
call cunmqr(side, trans, m, n, k, a, lda, tau, c, ldc, work, lwork, info)
call zunmqr(side, trans, m, n, k, a, lda, tau, c, ldc, work, lwork, info)
Fortran 95:
call unmqr(a, tau, c [,side] [,trans] [,info])
C:
lapack_int LAPACKE_<?>unmqr( int matrix_order, char side, char trans, lapack_int m, lapack_int n, lapack_int k, const <datatype>* a, lapack_int lda, const <datatype>* tau, <datatype>* c, lapack_int ldc );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine multiplies a rectangular complex matrix C by Q or Q^H, where Q is the unitary matrix Q of the QR factorization formed by the routines ?geqrf or ?geqpf.
Depending on the parameters side and trans, the routine can form one of the matrix products Q*C, Q^H*C, C*Q, or C*Q^H (overwriting the result on C).

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
side CHARACTER*1. Must be either 'L' or 'R'.
If side = 'L', Q or Q^H is applied to C from the left.
If side = 'R', Q or Q^H is applied to C from the right.
trans CHARACTER*1. Must be either 'N' or 'C'.
If trans = 'N', the routine multiplies C by Q.
If trans = 'C', the routine multiplies C by Q^H.
m INTEGER. The number of rows in the matrix C (m ≥ 0).
n INTEGER. The number of columns in C (n ≥ 0).
k INTEGER. The number of elementary reflectors whose product defines the matrix Q. Constraints:
0 ≤ k ≤ m if side = 'L';
0 ≤ k ≤ n if side = 'R'.
a, c, tau, work COMPLEX for cunmqr, DOUBLE COMPLEX for zunmqr.
Arrays: a(lda,*) and tau(*) are the arrays returned by cgeqrf / zgeqrf or cgeqpf / zgeqpf. The second dimension of a must be at least max(1, k). The dimension of tau must be at least max(1, k).
c(ldc,*) contains the matrix C. The second dimension of c must be at least max(1, n).
work is a workspace array, its dimension max(1, lwork).
lda INTEGER. The leading dimension of a. Constraints:
lda ≥ max(1, m) if side = 'L';
lda ≥ max(1, n) if side = 'R'.
ldc INTEGER. The leading dimension of c. Constraint: ldc ≥ max(1, m).
lwork INTEGER. The size of the work array. Constraints:
lwork ≥ max(1, n) if side = 'L';
lwork ≥ max(1, m) if side = 'R'.
If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla. See Application Notes for the suggested value of lwork.

Output Parameters
c Overwritten by the product Q*C, Q^H*C, C*Q, or C*Q^H (as specified by side and trans).
work(1) If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine unmqr interface are the following:
a Holds the matrix A of size (r,k).
r = m if side = 'L'.
r = n if side = 'R'.
tau Holds the vector of length (k).
c Holds the matrix C of size (m,n).
side Must be 'L' or 'R'. The default value is 'L'.
trans Must be 'N' or 'C'. The default value is 'N'.
Application Notes
For better performance, try using lwork = n*blocksize (if side = 'L') or lwork = m*blocksize (if side = 'R'), where blocksize is a machine-dependent value (typically, 16 to 64) required for optimum performance of the blocked algorithm.
If it is not clear how much workspace to supply, use a generous value of lwork for the first run, or set lwork = -1.
In the first case the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the array work on exit. Use this value (work(1)) for subsequent runs.
If lwork = -1, then the routine returns immediately and provides the recommended workspace in the first element of the array work. This operation is called a workspace query.
Note that if lwork is less than the minimal required value and is not equal to -1, then the routine returns immediately with an error exit and does not provide any information on the recommended workspace.
The real counterpart of this routine is ormqr.

?gelqf
Computes the LQ factorization of a general m-by-n matrix.

Syntax
Fortran 77:
call sgelqf(m, n, a, lda, tau, work, lwork, info)
call dgelqf(m, n, a, lda, tau, work, lwork, info)
call cgelqf(m, n, a, lda, tau, work, lwork, info)
call zgelqf(m, n, a, lda, tau, work, lwork, info)
Fortran 95:
call gelqf(a [, tau] [,info])
C:
lapack_int LAPACKE_<?>gelqf( int matrix_order, lapack_int m, lapack_int n, <datatype>* a, lapack_int lda, <datatype>* tau );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine forms the LQ factorization of a general m-by-n matrix A (see Orthogonal Factorizations). No pivoting is performed.
The routine does not form the matrix Q explicitly. Instead, Q is represented as a product of min(m, n) elementary reflectors. Routines are provided to work with Q in this representation.
NOTE This routine supports the Progress Routine feature. See Progress Function section for details.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
m INTEGER. The number of rows in the matrix A (m ≥ 0).
n INTEGER. The number of columns in A (n ≥ 0).
a, work REAL for sgelqf, DOUBLE PRECISION for dgelqf, COMPLEX for cgelqf, DOUBLE COMPLEX for zgelqf.
Arrays: a(lda,*) contains the matrix A. The second dimension of a must be at least max(1, n).
work is a workspace array, its dimension max(1, lwork).
lda INTEGER. The leading dimension of a; at least max(1, m).
lwork INTEGER. The size of the work array; at least max(1, m).
If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla. See Application Notes for the suggested value of lwork.

Output Parameters
a Overwritten by the factorization data as follows:
If m ≤ n, the elements above the diagonal are overwritten by the details of the unitary (orthogonal) matrix Q, and the lower triangle is overwritten by the corresponding elements of the lower triangular matrix L.
If m > n, the strictly upper triangular part is overwritten by the details of the matrix Q, and the remaining elements are overwritten by the corresponding elements of the m-by-n lower trapezoidal matrix L.
tau REAL for sgelqf, DOUBLE PRECISION for dgelqf, COMPLEX for cgelqf, DOUBLE COMPLEX for zgelqf. Array, DIMENSION at least max(1, min(m, n)). Contains additional information on the matrix Q.
work(1) If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance.
Use this lwork for subsequent runs.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine gelqf interface are the following:
a Holds the matrix A of size (m,n).
tau Holds the vector of length min(m,n).

Application Notes
For better performance, try using lwork = m*blocksize, where blocksize is a machine-dependent value (typically, 16 to 64) required for optimum performance of the blocked algorithm.
If you are in doubt how much workspace to supply, use a generous value of lwork for the first run or set lwork = -1.
If you choose the first option and set lwork to any admissible size (no less than the minimal value described), the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the array work on exit. Use this value (work(1)) for subsequent runs.
If you set lwork = -1, the routine returns immediately and provides the recommended workspace in the first element of the array work. This operation is called a workspace query.
Note that if you set lwork to less than the minimal required value and not -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.
The computed factorization is the exact factorization of a matrix A + E, where ||E||_2 = O(ε)||A||_2.
The approximate number of floating-point operations for real flavors is
(4/3)n^3 if m = n,
(2/3)n^2(3m-n) if m > n,
(2/3)m^2(3n-m) if m < n.
The number of operations for complex flavors is 4 times greater.
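The three lwork cases described in these Application Notes (workspace query, error exit, normal run) can be summarized as a decision function. This is a sketch of the documented behavior, not of the library's internals: lwork_action, classify_lwork, gelqf_min_lwork, and gelqf_suggested_lwork are hypothetical names, and blocksize is a value the caller must choose, since the text gives only a typical range of 16 to 64.

```c
#include <assert.h>

typedef enum {
    LWORK_QUERY,      /* lwork = -1: only the recommended size is returned */
    LWORK_TOO_SMALL,  /* error exit, no recommendation is provided         */
    LWORK_OK          /* the routine performs the computation              */
} lwork_action;

static int max1(int v) { return v > 1 ? v : 1; }

/* Minimal admissible lwork for ?gelqf: max(1, m). */
int gelqf_min_lwork(int m)
{
    assert(m >= 0);
    return max1(m);
}

/* Suggested lwork = m*blocksize, never below 1. */
int gelqf_suggested_lwork(int m, int blocksize)
{
    assert(m >= 0 && blocksize >= 1);
    return max1(m * blocksize);
}

/* Classify a caller-supplied lwork against the minimal required size. */
lwork_action classify_lwork(int lwork, int min_lwork)
{
    assert(min_lwork >= 1);
    if (lwork == -1)
        return LWORK_QUERY;
    if (lwork < min_lwork)
        return LWORK_TOO_SMALL;
    return LWORK_OK;
}
```

In practice the value returned in work(1) by a workspace query is the better source for the recommended size; the m*blocksize estimate is only a starting point when no query has been made.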
To find the minimum-norm solution of an underdetermined least squares problem minimizing ||A*x - b||_2 for all columns b of a given matrix B, you can call the following:
?gelqf (this routine) to factorize A = L*Q;
trsm (a BLAS routine) to solve L*Y = B for Y;
ormlq to compute X = (Q1)^T*Y (for real matrices);
unmlq to compute X = (Q1)^H*Y (for complex matrices).
(The columns of the computed X are the minimum-norm solution vectors x. Here A is an m-by-n matrix with m < n; Q1 denotes the first m rows of Q.)
To compute the elements of Q explicitly, call
orglq (for real matrices)
unglq (for complex matrices).

See Also
mkl_progress

?orglq
Generates the real orthogonal matrix Q of the LQ factorization formed by ?gelqf.

Syntax
Fortran 77:
call sorglq(m, n, k, a, lda, tau, work, lwork, info)
call dorglq(m, n, k, a, lda, tau, work, lwork, info)
Fortran 95:
call orglq(a, tau [,info])
C:
lapack_int LAPACKE_<?>orglq( int matrix_order, lapack_int m, lapack_int n, lapack_int k, <datatype>* a, lapack_int lda, const <datatype>* tau );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine generates the whole or part of the n-by-n orthogonal matrix Q of the LQ factorization formed by the routine ?gelqf. Use this routine after a call to sgelqf/dgelqf.
Usually Q is determined from the LQ factorization of a p-by-n matrix A with n ≥ p.
To compute the whole matrix Q, use:
call ?orglq(n, n, p, a, lda, tau, work, lwork, info)
To compute the leading p rows of Q, which form an orthonormal basis in the space spanned by the rows of A, use:
call ?orglq(p, n, p, a, lda, tau, work, lwork, info)
To compute the matrix Qk of the LQ factorization of the leading k rows of A, use:
call ?orglq(n, n, k, a, lda, tau, work, lwork, info)
To compute the leading k rows of Qk, which form an orthonormal basis in the space spanned by the leading k rows of A, use:
call ?orglq(k, n, k, a, lda, tau, work, lwork, info)

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
m INTEGER. The number of rows of Q to be computed (0 ≤ m ≤ n).
n INTEGER. The order of the orthogonal matrix Q (n ≥ m).
k INTEGER. The number of elementary reflectors whose product defines the matrix Q (0 ≤ k ≤ m).
a, tau, work REAL for sorglq; DOUBLE PRECISION for dorglq. Arrays: a(lda,*) and tau(*) are the arrays returned by sgelqf/dgelqf. The second dimension of a must be at least max(1, n). The dimension of tau must be at least max(1, k). work is a workspace array of dimension max(1, lwork).
lda INTEGER. The leading dimension of a; at least max(1, m).
lwork INTEGER. The size of the work array; at least max(1, m). If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla. See Application Notes for the suggested value of lwork.

Output Parameters
a Overwritten by m leading rows of the n-by-n orthogonal matrix Q.
work(1) If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance.
Use this lwork for subsequent runs.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions. Specific details for the routine orglq interface are the following:
a Holds the matrix A of size (m,n).
tau Holds the vector of length (k).

Application Notes
For better performance, try using lwork = m*blocksize, where blocksize is a machine-dependent value (typically, 16 to 64) required for optimum performance of the blocked algorithm.
If you are in doubt how much workspace to supply, use a generous value of lwork for the first run or set lwork = -1.
If you choose the first option and set any admissible lwork size (one that is no less than the minimal value described), the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace size in the first element of the array work on exit. Use this value (work(1)) for subsequent runs.
If you set lwork = -1, the routine returns immediately and provides the recommended workspace size in the first element of the array work. This operation is called a workspace query.
Note that if you set lwork to less than the minimal required value and not -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.
The computed Q differs from an exactly orthogonal matrix by a matrix E such that ||E||_2 = O(ε)*||A||_2, where ε is the machine precision.
The total number of floating-point operations is approximately 4*m*n*k - 2*(m + n)*k^2 + (4/3)*k^3.
If m = k, the number is approximately (2/3)*m^2*(3n - m).
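The special case quoted above follows from the general estimate by substituting k = m. The short derivation below is added only as a consistency check of the two formulas; it uses no symbols beyond m, n, and k as defined in the parameter list.

```latex
\begin{aligned}
\left.4mnk - 2(m+n)k^{2} + \tfrac{4}{3}k^{3}\right|_{k=m}
  &= 4m^{2}n - 2(m+n)m^{2} + \tfrac{4}{3}m^{3} \\
  &= 4m^{2}n - 2m^{3} - 2m^{2}n + \tfrac{4}{3}m^{3} \\
  &= 2m^{2}n - \tfrac{2}{3}m^{3} \\
  &= \tfrac{2}{3}\,m^{2}\,(3n - m).
\end{aligned}
```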
The complex counterpart of this routine is unglq.

?ormlq
Multiplies a real matrix by the orthogonal matrix Q of the LQ factorization formed by ?gelqf.

Syntax
Fortran 77:
call sormlq(side, trans, m, n, k, a, lda, tau, c, ldc, work, lwork, info)
call dormlq(side, trans, m, n, k, a, lda, tau, c, ldc, work, lwork, info)
Fortran 95:
call ormlq(a, tau, c [,side] [,trans] [,info])
C:
lapack_int LAPACKE_<?>ormlq( int matrix_order, char side, char trans, lapack_int m, lapack_int n, lapack_int k, const <datatype>* a, lapack_int lda, const <datatype>* tau, <datatype>* c, lapack_int ldc );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine multiplies a real m-by-n matrix C by Q or Q^T, where Q is the orthogonal matrix Q of the LQ factorization formed by the routine ?gelqf.
Depending on the parameters side and trans, the routine can form one of the matrix products Q*C, Q^T*C, C*Q, or C*Q^T (overwriting the result on C).

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
side CHARACTER*1. Must be either 'L' or 'R'. If side = 'L', Q or Q^T is applied to C from the left. If side = 'R', Q or Q^T is applied to C from the right.
trans CHARACTER*1. Must be either 'N' or 'T'. If trans = 'N', the routine multiplies C by Q. If trans = 'T', the routine multiplies C by Q^T.
m INTEGER. The number of rows in the matrix C (m ≥ 0).
n INTEGER. The number of columns in C (n ≥ 0).
k INTEGER. The number of elementary reflectors whose product defines the matrix Q. Constraints: 0 ≤ k ≤ m if side = 'L'; 0 ≤ k ≤ n if side = 'R'.
a, c, tau, work REAL for sormlq; DOUBLE PRECISION for dormlq. Arrays: a(lda,*) and tau(*) are arrays returned by ?gelqf.
The second dimension of a must be: at least max(1, m) if side = 'L'; at least max(1, n) if side = 'R'. The dimension of tau must be at least max(1, k). c(ldc,*) contains the matrix C. The second dimension of c must be at least max(1, n). work is a workspace array of dimension max(1, lwork).
lda INTEGER. The leading dimension of a; lda ≥ max(1, k).
ldc INTEGER. The leading dimension of c; ldc ≥ max(1, m).
lwork INTEGER. The size of the work array. Constraints: lwork ≥ max(1, n) if side = 'L'; lwork ≥ max(1, m) if side = 'R'. If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla. See Application Notes for the suggested value of lwork.

Output Parameters
c Overwritten by the product Q*C, Q^T*C, C*Q, or C*Q^T (as specified by side and trans).
work(1) If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions. Specific details for the routine ormlq interface are the following:
a Holds the matrix A of size (k,m).
tau Holds the vector of length (k).
c Holds the matrix C of size (m,n).
side Must be 'L' or 'R'. The default value is 'L'.
trans Must be 'N' or 'T'. The default value is 'N'.
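The argument constraints listed under Input Parameters for ?ormlq depend on side, which is easy to get wrong. The following minimal Python sketch collects them in one place. The function check_ormlq_args is a hypothetical helper written for this illustration and is not an Intel MKL routine; it assumes uppercase side and trans values and returns -i using the position of the offending argument in the FORTRAN 77 calling sequence, following the info convention described above.

```python
def check_ormlq_args(side, trans, m, n, k, lda, ldc, lwork):
    """Check the documented ?ormlq argument constraints.

    Returns 0 if all constraints hold, or -i if the i-th argument of the
    FORTRAN 77 calling sequence
    (side, trans, m, n, k, a, lda, tau, c, ldc, work, lwork, info)
    violates them. Hypothetical helper, not an Intel MKL routine.
    """
    if side not in ('L', 'R'):
        return -1
    if trans not in ('N', 'T'):
        return -2
    if m < 0:
        return -3
    if n < 0:
        return -4
    left = side == 'L'
    # k is bounded by m when Q is applied from the left, by n from the right.
    k_upper = m if left else n
    if not 0 <= k <= k_upper:
        return -5
    if lda < max(1, k):
        return -7
    if ldc < max(1, m):
        return -10
    # lwork = -1 requests a workspace query and is always accepted.
    min_lwork = max(1, n) if left else max(1, m)
    if lwork != -1 and lwork < min_lwork:
        return -12
    return 0
```

For example, check_ormlq_args('R', 'T', 5, 3, 4, 4, 5, 5) reports the fifth argument, because k = 4 exceeds n = 3 when side = 'R'.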

Application Notes
For better performance, try using lwork = n*blocksize (if side = 'L') or lwork = m*blocksize (if side = 'R'), where blocksize is a machine-dependent value (typically, 16 to 64) required for optimum performance of the blocked algorithm.
If you are in doubt how much workspace to supply, use a generous value of lwork for the first run or set lwork = -1.
If you choose the first option and set any admissible lwork size (one that is no less than the minimal value described), the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace size in the first element of the array work on exit. Use this value (work(1)) for subsequent runs.
If you set lwork = -1, the routine returns immediately and provides the recommended workspace size in the first element of the array work. This operation is called a workspace query.
Note that if you set lwork to less than the minimal required value and not -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.
The complex counterpart of this routine is unmlq.

?unglq
Generates the complex unitary matrix Q of the LQ factorization formed by ?gelqf.

Syntax
Fortran 77:
call cunglq(m, n, k, a, lda, tau, work, lwork, info)
call zunglq(m, n, k, a, lda, tau, work, lwork, info)
Fortran 95:
call unglq(a, tau [,info])
C:
lapack_int LAPACKE_<?>unglq( int matrix_order, lapack_int m, lapack_int n, lapack_int k, <datatype>* a, lapack_int lda, const <datatype>* tau );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine generates the whole or part of the n-by-n unitary matrix Q of the LQ factorization formed by the routine ?gelqf. Use this routine after a call to cgelqf/zgelqf.
Usually Q is determined from the LQ factorization of a p-by-n matrix A with n ≥ p.
To compute the whole matrix Q, use:
call ?unglq(n, n, p, a, lda, tau, work, lwork, info)
To compute the leading p rows of Q, which form an orthonormal basis in the space spanned by the rows of A, use:
call ?unglq(p, n, p, a, lda, tau, work, lwork, info)
To compute the matrix Qk of the LQ factorization of the leading k rows of the matrix A, use:
call ?unglq(n, n, k, a, lda, tau, work, lwork, info)
To compute the leading k rows of Qk, which form an orthonormal basis in the space spanned by the leading k rows of the matrix A, use:
call ?unglq(k, n, k, a, lda, tau, work, lwork, info)

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
m INTEGER. The number of rows of Q to be computed (0 ≤ m ≤ n).
n INTEGER. The order of the unitary matrix Q (n ≥ m).
k INTEGER. The number of elementary reflectors whose product defines the matrix Q (0 ≤ k ≤ m).
a, tau, work COMPLEX for cunglq; DOUBLE COMPLEX for zunglq. Arrays: a(lda,*) and tau(*) are the arrays returned by cgelqf/zgelqf. The second dimension of a must be at least max(1, n). The dimension of tau must be at least max(1, k). work is a workspace array of dimension max(1, lwork).
lda INTEGER. The leading dimension of a; at least max(1, m).
lwork INTEGER. The size of the work array; at least max(1, m). If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla. See Application Notes for the suggested value of lwork.

Output Parameters
a Overwritten by m leading rows of the n-by-n unitary matrix Q.
work(1) If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions. Specific details for the routine unglq interface are the following:
a Holds the matrix A of size (m,n).
tau Holds the vector of length (k).

Application Notes
For better performance, try using lwork = m*blocksize, where blocksize is a machine-dependent value (typically, 16 to 64) required for optimum performance of the blocked algorithm.
If it is not clear how much workspace to supply, use a generous value of lwork for the first run, or set lwork = -1.
In the first case, the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace size in the first element of the array work on exit. Use this value (work(1)) for subsequent runs.
If lwork = -1, the routine returns immediately and provides the recommended workspace size in the first element of the array work. This operation is called a workspace query.
Note that if lwork is less than the minimal required value and is not equal to -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.
The computed Q differs from an exactly unitary matrix by a matrix E such that ||E||_2 = O(ε)*||A||_2, where ε is the machine precision.
The total number of floating-point operations is approximately 16*m*n*k - 8*(m + n)*k^2 + (16/3)*k^3.
If m = k, the number is approximately (8/3)*m^2*(3n - m).
The real counterpart of this routine is orglq.

?unmlq
Multiplies a complex matrix by the unitary matrix Q of the LQ factorization formed by ?gelqf.

Syntax
Fortran 77:
call cunmlq(side, trans, m, n, k, a, lda, tau, c, ldc, work, lwork, info)
call zunmlq(side, trans, m, n, k, a, lda, tau, c, ldc, work, lwork, info)
Fortran 95:
call unmlq(a, tau, c [,side] [,trans] [,info])
C:
lapack_int LAPACKE_<?>unmlq( int matrix_order, char side, char trans, lapack_int m, lapack_int n, lapack_int k, const <datatype>* a, lapack_int lda, const <datatype>* tau, <datatype>* c, lapack_int ldc );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine multiplies a complex m-by-n matrix C by Q or Q^H, where Q is the unitary matrix Q of the LQ factorization formed by the routine ?gelqf.
Depending on the parameters side and trans, the routine can form one of the matrix products Q*C, Q^H*C, C*Q, or C*Q^H (overwriting the result on C).

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
side CHARACTER*1. Must be either 'L' or 'R'. If side = 'L', Q or Q^H is applied to C from the left. If side = 'R', Q or Q^H is applied to C from the right.
trans CHARACTER*1. Must be either 'N' or 'C'. If trans = 'N', the routine multiplies C by Q. If trans = 'C', the routine multiplies C by Q^H.
m INTEGER. The number of rows in the matrix C (m ≥ 0).
n INTEGER. The number of columns in C (n ≥ 0).
k INTEGER. The number of elementary reflectors whose product defines the matrix Q. Constraints: 0 ≤ k ≤ m if side = 'L'; 0 ≤ k ≤ n if side = 'R'.
a, c, tau, work COMPLEX for cunmlq; DOUBLE COMPLEX for zunmlq. Arrays: a(lda,*) and tau(*) are arrays returned by ?gelqf. The second dimension of a must be: at least max(1, m) if side = 'L'; at least max(1, n) if side = 'R'. The dimension of tau must be at least max(1, k). c(ldc,*) contains the matrix C.
The second dimension of c must be at least max(1, n). work is a workspace array of dimension max(1, lwork).
lda INTEGER. The leading dimension of a; lda ≥ max(1, k).
ldc INTEGER. The leading dimension of c; ldc ≥ max(1, m).
lwork INTEGER. The size of the work array. Constraints: lwork ≥ max(1, n) if side = 'L'; lwork ≥ max(1, m) if side = 'R'. If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla. See Application Notes for the suggested value of lwork.

Output Parameters
c Overwritten by the product Q*C, Q^H*C, C*Q, or C*Q^H (as specified by side and trans).
work(1) If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions. Specific details for the routine unmlq interface are the following:
a Holds the matrix A of size (k,m).
tau Holds the vector of length (k).
c Holds the matrix C of size (m,n).
side Must be 'L' or 'R'. The default value is 'L'.
trans Must be 'N' or 'C'. The default value is 'N'.

Application Notes
For better performance, try using lwork = n*blocksize (if side = 'L') or lwork = m*blocksize (if side = 'R'), where blocksize is a machine-dependent value (typically, 16 to 64) required for optimum performance of the blocked algorithm.
If it is not clear how much workspace to supply, use a generous value of lwork for the first run, or set lwork = -1.
In the first case, the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace size in the first element of the array work on exit. Use this value (work(1)) for subsequent runs.
If lwork = -1, the routine returns immediately and provides the recommended workspace size in the first element of the array work. This operation is called a workspace query.
Note that if lwork is less than the minimal required value and is not equal to -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.
The real counterpart of this routine is ormlq.

?geqlf
Computes the QL factorization of a general m-by-n matrix.

Syntax
Fortran 77:
call sgeqlf(m, n, a, lda, tau, work, lwork, info)
call dgeqlf(m, n, a, lda, tau, work, lwork, info)
call cgeqlf(m, n, a, lda, tau, work, lwork, info)
call zgeqlf(m, n, a, lda, tau, work, lwork, info)
Fortran 95:
call geqlf(a [, tau] [,info])
C:
lapack_int LAPACKE_<?>geqlf( int matrix_order, lapack_int m, lapack_int n, <datatype>* a, lapack_int lda, <datatype>* tau );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine forms the QL factorization of a general m-by-n matrix A. No pivoting is performed.
The routine does not form the matrix Q explicitly. Instead, Q is represented as a product of min(m, n) elementary reflectors. Routines are provided to work with Q in this representation.
NOTE This routine supports the Progress Routine feature. See Progress Function section for details.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
m INTEGER.
The number of rows in the matrix A (m ≥ 0).
n INTEGER. The number of columns in A (n ≥ 0).
a, work REAL for sgeqlf; DOUBLE PRECISION for dgeqlf; COMPLEX for cgeqlf; DOUBLE COMPLEX for zgeqlf. Arrays: a(lda,*) contains the matrix A. The second dimension of a must be at least max(1, n). work is a workspace array of dimension max(1, lwork).
lda INTEGER. The leading dimension of a; at least max(1, m).
lwork INTEGER. The size of the work array; at least max(1, n). If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla. See Application Notes for the suggested value of lwork.

Output Parameters
a Overwritten on exit by the factorization data as follows: if m ≥ n, the lower triangle of the subarray a(m-n+1:m, 1:n) contains the n-by-n lower triangular matrix L; if m ≤ n, the elements on and below the (n-m)-th superdiagonal contain the m-by-n lower trapezoidal matrix L; in both cases, the remaining elements, with the array tau, represent the orthogonal/unitary matrix Q as a product of elementary reflectors.
tau REAL for sgeqlf; DOUBLE PRECISION for dgeqlf; COMPLEX for cgeqlf; DOUBLE COMPLEX for zgeqlf. Array, DIMENSION at least max(1, min(m, n)). Contains scalar factors of the elementary reflectors for the matrix Q.
work(1) If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine geqlf interface are the following:
a Holds the matrix A of size (m,n).
tau Holds the vector of length min(m,n).

Application Notes
For better performance, try using lwork = n*blocksize, where blocksize is a machine-dependent value (typically, 16 to 64) required for optimum performance of the blocked algorithm.
If you are in doubt how much workspace to supply, use a generous value of lwork for the first run or set lwork = -1.
If you choose the first option and set any admissible lwork size (one that is no less than the minimal value described), the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace size in the first element of the array work on exit. Use this value (work(1)) for subsequent runs.
If you set lwork = -1, the routine returns immediately and provides the recommended workspace size in the first element of the array work. This operation is called a workspace query.
Note that if you set lwork to less than the minimal required value and not -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.
Related routines include:
orgql to generate matrix Q (for real matrices);
ungql to generate matrix Q (for complex matrices);
ormql to apply matrix Q (for real matrices);
unmql to apply matrix Q (for complex matrices).

See Also
mkl_progress

?orgql
Generates the real matrix Q of the QL factorization formed by ?geqlf.

Syntax
Fortran 77:
call sorgql(m, n, k, a, lda, tau, work, lwork, info)
call dorgql(m, n, k, a, lda, tau, work, lwork, info)
Fortran 95:
call orgql(a, tau [,info])
C:
lapack_int LAPACKE_<?>orgql( int matrix_order, lapack_int m, lapack_int n, lapack_int k, <datatype>* a, lapack_int lda, const <datatype>* tau );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine generates an m-by-n real matrix Q with orthonormal columns, which is defined as the last n columns of a product of k elementary reflectors H(i) of order m: Q = H(k)*...*H(2)*H(1), as returned by the routine ?geqlf. Use this routine after a call to sgeqlf/dgeqlf.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
m INTEGER. The number of rows of the matrix Q (m ≥ 0).
n INTEGER. The number of columns of the matrix Q (m ≥ n ≥ 0).
k INTEGER. The number of elementary reflectors whose product defines the matrix Q (n ≥ k ≥ 0).
a, tau, work REAL for sorgql; DOUBLE PRECISION for dorgql. Arrays: a(lda,*), tau(*). On entry, the (n - k + i)th column of a must contain the vector which defines the elementary reflector H(i), for i = 1,2,...,k, as returned by sgeqlf/dgeqlf in the last k columns of its array argument a; tau(i) must contain the scalar factor of the elementary reflector H(i), as returned by sgeqlf/dgeqlf. The second dimension of a must be at least max(1, n). The dimension of tau must be at least max(1, k). work is a workspace array of dimension max(1, lwork).
lda INTEGER. The leading dimension of a; at least max(1, m).
lwork INTEGER. The size of the work array; at least max(1, n).
If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla. See Application Notes for the suggested value of lwork.

Output Parameters
a Overwritten by the m-by-n matrix Q.
work(1) If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions. Specific details for the routine orgql interface are the following:
a Holds the matrix A of size (m,n).
tau Holds the vector of length (k).

Application Notes
For better performance, try using lwork = n*blocksize, where blocksize is a machine-dependent value (typically, 16 to 64) required for optimum performance of the blocked algorithm.
If you are in doubt how much workspace to supply, use a generous value of lwork for the first run or set lwork = -1.
If you choose the first option and set any admissible lwork size (one that is no less than the minimal value described), the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace size in the first element of the array work on exit. Use this value (work(1)) for subsequent runs.
If you set lwork = -1, the routine returns immediately and provides the recommended workspace size in the first element of the array work. This operation is called a workspace query.
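The two-call pattern behind a workspace query can be sketched with a stand-in routine. The function mock_orgql below is a pure-Python mock written only for this illustration: it imitates the documented lwork, work(1), and info behavior and generates no matrix. The block size of 32 is an assumed value within the 16 to 64 range mentioned above, and the error code -8 assumes that info = -i counts lwork as the eighth argument of the FORTRAN 77 calling sequence.

```python
ASSUMED_BLOCKSIZE = 32  # assumption: a value in the documented 16..64 range


def mock_orgql(m, n, k, lwork):
    """Stand-in that imitates only the lwork protocol of ?orgql.

    Returns (work1, info). No matrix is generated; this is a mock for
    illustration and not an Intel MKL routine.
    """
    recommended = max(1, n * ASSUMED_BLOCKSIZE)
    if lwork == -1:
        # Workspace query: report the recommended size and return at once.
        return recommended, 0
    if lwork < max(1, n):
        # lwork is the 8th argument (m, n, k, a, lda, tau, work, lwork).
        return None, -8
    # A real routine would generate Q here; work(1) holds the recommended size.
    return recommended, 0


def run_with_query(m, n, k):
    """Two-call pattern: query the workspace size first, then do the work."""
    work1, info = mock_orgql(m, n, k, lwork=-1)
    if info != 0:
        raise RuntimeError("workspace query failed")
    lwork = int(work1)
    return lwork, mock_orgql(m, n, k, lwork)
```

In a real program the second call would pass a work array of the queried size; the mock only tracks the integer sizes so that the control flow is visible.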
Note that if you set lwork to less than the minimal required value and not -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.
The complex counterpart of this routine is ungql.

?ungql
Generates the complex matrix Q of the QL factorization formed by ?geqlf.

Syntax
Fortran 77:
call cungql(m, n, k, a, lda, tau, work, lwork, info)
call zungql(m, n, k, a, lda, tau, work, lwork, info)
Fortran 95:
call ungql(a, tau [,info])
C:
lapack_int LAPACKE_<?>ungql( int matrix_order, lapack_int m, lapack_int n, lapack_int k, <datatype>* a, lapack_int lda, const <datatype>* tau );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine generates an m-by-n complex matrix Q with orthonormal columns, which is defined as the last n columns of a product of k elementary reflectors H(i) of order m: Q = H(k)*...*H(2)*H(1), as returned by the routine ?geqlf. Use this routine after a call to cgeqlf/zgeqlf.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
m INTEGER. The number of rows of the matrix Q (m ≥ 0).
n INTEGER. The number of columns of the matrix Q (m ≥ n ≥ 0).
k INTEGER. The number of elementary reflectors whose product defines the matrix Q (n ≥ k ≥ 0).
a, tau, work COMPLEX for cungql; DOUBLE COMPLEX for zungql. Arrays: a(lda,*), tau(*), work(lwork). On entry, the (n - k + i)th column of a must contain the vector which defines the elementary reflector H(i), for i = 1,2,...,k, as returned by cgeqlf/zgeqlf in the last k columns of its array argument a; tau(i) must contain the scalar factor of the elementary reflector H(i), as returned by cgeqlf/zgeqlf. The second dimension of a must be at least max(1, n).
The dimension of tau must be at least max(1, k). work is a workspace array of dimension max(1, lwork).
lda INTEGER. The leading dimension of a; at least max(1, m).
lwork INTEGER. The size of the work array; at least max(1, n). If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla. See Application Notes for the suggested value of lwork.

Output Parameters
a Overwritten by the m-by-n matrix Q.
work(1) If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions. Specific details for the routine ungql interface are the following:
a Holds the matrix A of size (m,n).
tau Holds the vector of length (k).

Application Notes
For better performance, try using lwork = n*blocksize, where blocksize is a machine-dependent value (typically, 16 to 64) required for optimum performance of the blocked algorithm.
If it is not clear how much workspace to supply, use a generous value of lwork for the first run, or set lwork = -1.
In the first case, the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace size in the first element of the array work on exit. Use this value (work(1)) for subsequent runs.
If lwork = -1, the routine returns immediately and provides the recommended workspace size in the first element of the array work.
This operation is called a workspace query.
Note that if lwork is less than the minimal required value and is not equal to -1, then the routine returns immediately with an error exit and does not provide any information on the recommended workspace.
The real counterpart of this routine is orgql.

?ormql
Multiplies a real matrix by the orthogonal matrix Q of the QL factorization formed by ?geqlf.

Syntax
Fortran 77:
call sormql(side, trans, m, n, k, a, lda, tau, c, ldc, work, lwork, info)
call dormql(side, trans, m, n, k, a, lda, tau, c, ldc, work, lwork, info)
Fortran 95:
call ormql(a, tau, c [,side] [,trans] [,info])
C:
lapack_int LAPACKE_<?>ormql( int matrix_order, char side, char trans, lapack_int m, lapack_int n, lapack_int k, const <datatype>* a, lapack_int lda, const <datatype>* tau, <datatype>* c, lapack_int ldc );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine multiplies a real m-by-n matrix C by Q or Q^T, where Q is the orthogonal matrix Q of the QL factorization formed by the routine ?geqlf.
Depending on the parameters side and trans, the routine ormql can form one of the matrix products Q*C, Q^T*C, C*Q, or C*Q^T (overwriting the result over C).

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
side CHARACTER*1. Must be either 'L' or 'R'. If side = 'L', Q or Q^T is applied to C from the left. If side = 'R', Q or Q^T is applied to C from the right.
trans CHARACTER*1. Must be either 'N' or 'T'. If trans = 'N', the routine multiplies C by Q. If trans = 'T', the routine multiplies C by Q^T.
m INTEGER. The number of rows in the matrix C (m ≥ 0).
n INTEGER. The number of columns in C (n ≥ 0).
k INTEGER.
The number of elementary reflectors whose product defines the matrix Q. Constraints: 0 ≤ k ≤ m if side = 'L'; 0 ≤ k ≤ n if side = 'R'.
a, tau, c, work REAL for sormql; DOUBLE PRECISION for dormql. Arrays: a(lda,*), tau(*), c(ldc,*). On entry, the i-th column of a must contain the vector which defines the elementary reflector H(i), for i = 1,2,...,k, as returned by sgeqlf/dgeqlf in the last k columns of its array argument a. The second dimension of a must be at least max(1, k). tau(i) must contain the scalar factor of the elementary reflector H(i), as returned by sgeqlf/dgeqlf. The dimension of tau must be at least max(1, k). c(ldc,*) contains the m-by-n matrix C. The second dimension of c must be at least max(1, n). work is a workspace array, its dimension max(1, lwork).
lda INTEGER. The leading dimension of a; if side = 'L', lda ≥ max(1, m); if side = 'R', lda ≥ max(1, n).
ldc INTEGER. The leading dimension of c; ldc ≥ max(1, m).
lwork INTEGER. The size of the work array. Constraints: lwork ≥ max(1, n) if side = 'L'; lwork ≥ max(1, m) if side = 'R'. If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla. See Application Notes for the suggested value of lwork.

Output Parameters
c Overwritten by the product Q*C, Q^T*C, C*Q, or C*Q^T (as specified by side and trans).
work(1) If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine ormql interface are the following:
a Holds the matrix A of size (r,k). r = m if side = 'L'. r = n if side = 'R'.
tau Holds the vector of length (k).
c Holds the matrix C of size (m,n).
side Must be 'L' or 'R'. The default value is 'L'.
trans Must be 'N' or 'T'. The default value is 'N'.

Application Notes
For better performance, try using lwork = n*blocksize (if side = 'L') or lwork = m*blocksize (if side = 'R'), where blocksize is a machine-dependent value (typically, 16 to 64) required for optimum performance of the blocked algorithm.
If you are in doubt how much workspace to supply, use a generous value of lwork for the first run or set lwork = -1.
If you choose the first option and set any admissible lwork size, that is, one no less than the minimal value described, the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the corresponding array work on exit. Use this value (work(1)) for subsequent runs.
If you set lwork = -1, the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work). This operation is called a workspace query.
Note that if you set lwork to less than the minimal required value and not -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.
The complex counterpart of this routine is unmql.

?unmql
Multiplies a complex matrix by the unitary matrix Q of the QL factorization formed by ?geqlf.
Syntax
Fortran 77:
call cunmql(side, trans, m, n, k, a, lda, tau, c, ldc, work, lwork, info)
call zunmql(side, trans, m, n, k, a, lda, tau, c, ldc, work, lwork, info)
Fortran 95:
call unmql(a, tau, c [,side] [,trans] [,info])
C:
lapack_int LAPACKE_<?>unmql( int matrix_order, char side, char trans, lapack_int m, lapack_int n, lapack_int k, const <datatype>* a, lapack_int lda, const <datatype>* tau, <datatype>* c, lapack_int ldc );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine multiplies a complex m-by-n matrix C by Q or Q^H, where Q is the unitary matrix Q of the QL factorization formed by the routine ?geqlf.
Depending on the parameters side and trans, the routine unmql can form one of the matrix products Q*C, Q^H*C, C*Q, or C*Q^H (overwriting the result over C).

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
side CHARACTER*1. Must be either 'L' or 'R'. If side = 'L', Q or Q^H is applied to C from the left. If side = 'R', Q or Q^H is applied to C from the right.
trans CHARACTER*1. Must be either 'N' or 'C'. If trans = 'N', the routine multiplies C by Q. If trans = 'C', the routine multiplies C by Q^H.
m INTEGER. The number of rows in the matrix C (m ≥ 0).
n INTEGER. The number of columns in C (n ≥ 0).
k INTEGER. The number of elementary reflectors whose product defines the matrix Q. Constraints: 0 ≤ k ≤ m if side = 'L'; 0 ≤ k ≤ n if side = 'R'.
a, tau, c, work COMPLEX for cunmql; DOUBLE COMPLEX for zunmql. Arrays: a(lda,*), tau(*), c(ldc,*), work(lwork). On entry, the i-th column of a must contain the vector which defines the elementary reflector H(i), for i = 1,2,...,k, as returned by cgeqlf/zgeqlf in the last k columns of its array argument a.
The second dimension of a must be at least max(1, k). tau(i) must contain the scalar factor of the elementary reflector H(i), as returned by cgeqlf/zgeqlf. The dimension of tau must be at least max(1, k). c(ldc,*) contains the m-by-n matrix C. The second dimension of c must be at least max(1, n). work is a workspace array, its dimension max(1, lwork).
lda INTEGER. The leading dimension of a; if side = 'L', lda ≥ max(1, m); if side = 'R', lda ≥ max(1, n).
ldc INTEGER. The leading dimension of c; ldc ≥ max(1, m).
lwork INTEGER. The size of the work array. Constraints: lwork ≥ max(1, n) if side = 'L'; lwork ≥ max(1, m) if side = 'R'. If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla. See Application Notes for the suggested value of lwork.

Output Parameters
c Overwritten by the product Q*C, Q^H*C, C*Q, or C*Q^H (as specified by side and trans).
work(1) If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions. Specific details for the routine unmql interface are the following:
a Holds the matrix A of size (r,k). r = m if side = 'L'. r = n if side = 'R'.
tau Holds the vector of length (k).
c Holds the matrix C of size (m,n).
side Must be 'L' or 'R'. The default value is 'L'.
trans Must be 'N' or 'C'. The default value is 'N'.
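
The constraints on the scalar arguments of ?unmql listed above can be gathered into a single pre-call check. The C sketch below is illustrative only: unmql_check_args is a hypothetical helper, not an Intel MKL function, and its return convention merely imitates the info = -i numbering of the Fortran 77 calling sequence. It assumes that the leading dimension of a must be at least max(1, r), where r = m for side = 'L' and r = n for side = 'R' (the row count of A given in the Fortran 95 notes), and it accepts uppercase option characters only.

```c
/* Largest of two ints. */
static int imax(int a, int b) { return a > b ? a : b; }

/*
 * Hypothetical helper (not part of MKL): checks the documented constraints
 * on the scalar arguments of cunmql/zunmql.
 * Returns 0 if every check passes, otherwise -i, where i is the 1-based
 * position of the first offending argument in the Fortran 77 sequence
 * (side, trans, m, n, k, a, lda, tau, c, ldc, work, lwork, info).
 */
int unmql_check_args(char side, char trans, int m, int n, int k,
                     int lda, int ldc, int lwork)
{
    int left  = (side == 'L');
    int right = (side == 'R');

    if (!left && !right)              return -1;   /* side  */
    if (trans != 'N' && trans != 'C') return -2;   /* trans */
    if (m < 0)                        return -3;   /* m >= 0 */
    if (n < 0)                        return -4;   /* n >= 0 */

    int r = left ? m : n;             /* order of Q */
    if (k < 0 || k > r)               return -5;   /* 0 <= k <= r */
    if (lda < imax(1, r))             return -7;   /* assumed: lda >= max(1, r) */
    if (ldc < imax(1, m))             return -10;  /* ldc >= max(1, m) */

    /* lwork = -1 requests a workspace query and is always acceptable. */
    int min_lwork = left ? imax(1, n) : imax(1, m);
    if (lwork != -1 && lwork < min_lwork) return -12;

    return 0;
}
```

For example, with side = 'L', m = 5, n = 3 and k = 3, the check passes for lda = 5, ldc = 5 and any lwork of at least 3 (or lwork = -1). The routine performs its own argument checks; a helper like this only lets a caller detect an inconsistent set of sizes before allocating arrays.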

Application Notes
For better performance, try using lwork = n*blocksize (if side = 'L') or lwork = m*blocksize (if side = 'R'), where blocksize is a machine-dependent value (typically, 16 to 64) required for optimum performance of the blocked algorithm.
If it is not clear how much workspace to supply, use a generous value of lwork for the first run, or set lwork = -1.
In the first case the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the corresponding array work on exit. Use this value (work(1)) for subsequent runs.
If lwork = -1, then the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work). This operation is called a workspace query.
Note that if lwork is less than the minimal required value and is not equal to -1, then the routine returns immediately with an error exit and does not provide any information on the recommended workspace.
The real counterpart of this routine is ormql.

?gerqf
Computes the RQ factorization of a general m-by-n matrix.

Syntax
Fortran 77:
call sgerqf(m, n, a, lda, tau, work, lwork, info)
call dgerqf(m, n, a, lda, tau, work, lwork, info)
call cgerqf(m, n, a, lda, tau, work, lwork, info)
call zgerqf(m, n, a, lda, tau, work, lwork, info)
Fortran 95:
call gerqf(a [, tau] [,info])
C:
lapack_int LAPACKE_<?>gerqf( int matrix_order, lapack_int m, lapack_int n, <datatype>* a, lapack_int lda, <datatype>* tau );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine forms the RQ factorization of a general m-by-n matrix A. No pivoting is performed.
The routine does not form the matrix Q explicitly. Instead, Q is represented as a product of min(m, n) elementary reflectors. Routines are provided to work with Q in this representation.
NOTE This routine supports the Progress Routine feature.
See Progress Function section for details.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
m INTEGER. The number of rows in the matrix A (m ≥ 0).
n INTEGER. The number of columns in A (n ≥ 0).
a, work REAL for sgerqf; DOUBLE PRECISION for dgerqf; COMPLEX for cgerqf; DOUBLE COMPLEX for zgerqf. Arrays: a(lda,*) contains the m-by-n matrix A. The second dimension of a must be at least max(1, n). work is a workspace array, its dimension max(1, lwork).
lda INTEGER. The leading dimension of a; at least max(1, m).
lwork INTEGER. The size of the work array; lwork ≥ max(1, m). If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla. See Application Notes for the suggested value of lwork.

Output Parameters
a Overwritten on exit by the factorization data as follows:
if m ≤ n, the upper triangle of the subarray a(1:m, n-m+1:n) contains the m-by-m upper triangular matrix R;
if m ≥ n, the elements on and above the (m-n)th subdiagonal contain the m-by-n upper trapezoidal matrix R;
in both cases, the remaining elements, with the array tau, represent the orthogonal/unitary matrix Q as a product of min(m,n) elementary reflectors.
tau REAL for sgerqf; DOUBLE PRECISION for dgerqf; COMPLEX for cgerqf; DOUBLE COMPLEX for zgerqf. Array, DIMENSION at least max(1, min(m, n)). Contains scalar factors of the elementary reflectors for the matrix Q.
work(1) If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance.
info INTEGER. If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions. Specific details for the routine gerqf interface are the following:
a Holds the matrix A of size (m,n).
tau Holds the vector of length min(m,n).

Application Notes
For better performance, try using lwork = m*blocksize, where blocksize is a machine-dependent value (typically, 16 to 64) required for optimum performance of the blocked algorithm.
If you are in doubt how much workspace to supply, use a generous value of lwork for the first run or set lwork = -1.
If you choose the first option and set any admissible lwork size, that is, one no less than the minimal value described, the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the corresponding array work on exit. Use this value (work(1)) for subsequent runs.
If you set lwork = -1, the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work). This operation is called a workspace query.
Note that if you set lwork to less than the minimal required value and not -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.
Related routines include:
orgrq to generate matrix Q (for real matrices);
ungrq to generate matrix Q (for complex matrices);
ormrq to apply matrix Q (for real matrices);
unmrq to apply matrix Q (for complex matrices).

See Also
mkl_progress

?orgrq
Generates the real matrix Q of the RQ factorization formed by ?gerqf.
Syntax
Fortran 77:
call sorgrq(m, n, k, a, lda, tau, work, lwork, info)
call dorgrq(m, n, k, a, lda, tau, work, lwork, info)
Fortran 95:
call orgrq(a, tau [,info])
C:
lapack_int LAPACKE_<?>orgrq( int matrix_order, lapack_int m, lapack_int n, lapack_int k, <datatype>* a, lapack_int lda, const <datatype>* tau );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine generates an m-by-n real matrix Q with orthonormal rows, which is defined as the last m rows of a product of k elementary reflectors H(i) of order n: Q = H(1)*H(2)*...*H(k), as returned by the routine ?gerqf. Use this routine after a call to sgerqf/dgerqf.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
m INTEGER. The number of rows of the matrix Q (m ≥ 0).
n INTEGER. The number of columns of the matrix Q (n ≥ m).
k INTEGER. The number of elementary reflectors whose product defines the matrix Q (m ≥ k ≥ 0).
a, tau, work REAL for sorgrq; DOUBLE PRECISION for dorgrq. Arrays: a(lda,*), tau(*). On entry, the (m - k + i)-th row of a must contain the vector which defines the elementary reflector H(i), for i = 1,2,...,k, as returned by sgerqf/dgerqf in the last k rows of its array argument a; tau(i) must contain the scalar factor of the elementary reflector H(i), as returned by sgerqf/dgerqf. The second dimension of a must be at least max(1, n). The dimension of tau must be at least max(1, k). work is a workspace array, its dimension max(1, lwork).
lda INTEGER. The leading dimension of a; at least max(1, m).
lwork INTEGER. The size of the work array; at least max(1, m).
If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla. See Application Notes for the suggested value of lwork.

Output Parameters
a Overwritten by the m-by-n matrix Q.
work(1) If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions. Specific details for the routine orgrq interface are the following:
a Holds the matrix A of size (m,n).
tau Holds the vector of length (k).

Application Notes
For better performance, try using lwork = m*blocksize, where blocksize is a machine-dependent value (typically, 16 to 64) required for optimum performance of the blocked algorithm.
If you are in doubt how much workspace to supply, use a generous value of lwork for the first run or set lwork = -1.
If you choose the first option and set any admissible lwork size, that is, one no less than the minimal value described, the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the corresponding array work on exit. Use this value (work(1)) for subsequent runs.
If you set lwork = -1, the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work). This operation is called a workspace query.
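
The two-step pattern described above (a query with lwork = -1, followed by the real call with the reported size) might be written in C roughly as follows. This is a sketch under stated assumptions rather than a definitive implementation: orgrq_fn mirrors the Fortran 77 argument list of dorgrq with every argument passed by reference and with the integer type assumed to be int; orgrq_with_query, mock_orgrq, mock_last_lwork and the error code -100 are hypothetical names introduced here. The mock stands in for the library routine so that the example runs without linking against MKL; it imitates only the lwork protocol and does no numerical work.

```c
#include <stdlib.h>

/* Fortran 77-style argument list of dorgrq, all arguments by reference
 * (assumption: the library's integer type is int). */
typedef void (*orgrq_fn)(const int *m, const int *n, const int *k,
                         double *a, const int *lda, const double *tau,
                         double *work, const int *lwork, int *info);

/* Hypothetical wrapper: workspace query first, then the real call.
 * Returns the routine's info, or -100 (arbitrary) if allocation fails. */
int orgrq_with_query(orgrq_fn orgrq, int m, int n, int k,
                     double *a, int lda, const double *tau)
{
    int info = 0;
    int lwork = -1;          /* -1 requests a workspace query */
    double wkopt = 0.0;      /* receives the recommended size in work(1) */

    orgrq(&m, &n, &k, a, &lda, tau, &wkopt, &lwork, &info);
    if (info != 0) return info;

    lwork = (int)wkopt;
    if (lwork < 1) lwork = 1;
    double *work = malloc((size_t)lwork * sizeof *work);
    if (work == NULL) return -100;

    orgrq(&m, &n, &k, a, &lda, tau, work, &lwork, &info);
    free(work);
    return info;
}

/* Records the lwork passed to the last non-query call of the mock. */
int mock_last_lwork = 0;

/* Stand-in for dorgrq (NOT the library routine): imitates the documented
 * lwork behaviour with a made-up blocksize of 32. */
void mock_orgrq(const int *m, const int *n, const int *k,
                double *a, const int *lda, const double *tau,
                double *work, const int *lwork, int *info)
{
    (void)n; (void)k; (void)a; (void)lda; (void)tau;
    int min_lwork = (*m > 1) ? *m : 1;          /* max(1, m) */
    *info = 0;
    if (*lwork == -1) {                         /* workspace query */
        work[0] = (double)(min_lwork * 32);
        return;
    }
    if (*lwork < min_lwork) {                   /* lwork is argument 8 */
        *info = -8;
        return;
    }
    mock_last_lwork = *lwork;
    work[0] = (double)(min_lwork * 32);
}
```

In a real program the function pointer would refer to the library's dorgrq entry point instead of mock_orgrq, and a and tau would hold the output of a preceding dgerqf call.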
Note that if you set lwork to less than the minimal required value and not -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.
The complex counterpart of this routine is ungrq.

?ungrq
Generates the complex matrix Q of the RQ factorization formed by ?gerqf.

Syntax
Fortran 77:
call cungrq(m, n, k, a, lda, tau, work, lwork, info)
call zungrq(m, n, k, a, lda, tau, work, lwork, info)
Fortran 95:
call ungrq(a, tau [,info])
C:
lapack_int LAPACKE_<?>ungrq( int matrix_order, lapack_int m, lapack_int n, lapack_int k, <datatype>* a, lapack_int lda, const <datatype>* tau );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine generates an m-by-n complex matrix Q with orthonormal rows, which is defined as the last m rows of a product of k elementary reflectors H(i) of order n: Q = H(1)^H * H(2)^H * ... * H(k)^H, as returned by the routine ?gerqf. Use this routine after a call to cgerqf/zgerqf.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
m INTEGER. The number of rows of the matrix Q (m ≥ 0).
n INTEGER. The number of columns of the matrix Q (n ≥ m).
k INTEGER. The number of elementary reflectors whose product defines the matrix Q (m ≥ k ≥ 0).
a, tau, work COMPLEX for cungrq; DOUBLE COMPLEX for zungrq. Arrays: a(lda,*), tau(*), work(lwork). On entry, the (m - k + i)th row of a must contain the vector which defines the elementary reflector H(i), for i = 1,2,...,k, as returned by cgerqf/zgerqf in the last k rows of its array argument a; tau(i) must contain the scalar factor of the elementary reflector H(i), as returned by cgerqf/zgerqf. The second dimension of a must be at least max(1, n).
The dimension of tau must be at least max(1, k). work is a workspace array, its dimension max(1, lwork).
lda INTEGER. The leading dimension of a; at least max(1, m).
lwork INTEGER. The size of the work array; at least max(1, m). If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla. See Application Notes for the suggested value of lwork.

Output Parameters
a Overwritten by the m-by-n matrix Q.
work(1) If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions. Specific details for the routine ungrq interface are the following:
a Holds the matrix A of size (m,n).
tau Holds the vector of length (k).

Application Notes
For better performance, try using lwork = m*blocksize, where blocksize is a machine-dependent value (typically, 16 to 64) required for optimum performance of the blocked algorithm.
If it is not clear how much workspace to supply, use a generous value of lwork for the first run, or set lwork = -1.
In the first case the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the corresponding array work on exit. Use this value (work(1)) for subsequent runs.
If lwork = -1, then the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work).
This operation is called a workspace query.
Note that if lwork is less than the minimal required value and is not equal to -1, then the routine returns immediately with an error exit and does not provide any information on the recommended workspace.
The real counterpart of this routine is orgrq.

?ormrq
Multiplies a real matrix by the orthogonal matrix Q of the RQ factorization formed by ?gerqf.

Syntax
Fortran 77:
call sormrq(side, trans, m, n, k, a, lda, tau, c, ldc, work, lwork, info)
call dormrq(side, trans, m, n, k, a, lda, tau, c, ldc, work, lwork, info)
Fortran 95:
call ormrq(a, tau, c [,side] [,trans] [,info])
C:
lapack_int LAPACKE_<?>ormrq( int matrix_order, char side, char trans, lapack_int m, lapack_int n, lapack_int k, const <datatype>* a, lapack_int lda, const <datatype>* tau, <datatype>* c, lapack_int ldc );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine multiplies a real m-by-n matrix C by Q or Q^T, where Q is the real orthogonal matrix defined as a product of k elementary reflectors H(i): Q = H(1) H(2) ... H(k), as returned by the RQ factorization routine ?gerqf.
Depending on the parameters side and trans, the routine can form one of the matrix products Q*C, Q^T*C, C*Q, or C*Q^T (overwriting the result over C).

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
side CHARACTER*1. Must be either 'L' or 'R'. If side = 'L', Q or Q^T is applied to C from the left. If side = 'R', Q or Q^T is applied to C from the right.
trans CHARACTER*1. Must be either 'N' or 'T'. If trans = 'N', the routine multiplies C by Q. If trans = 'T', the routine multiplies C by Q^T.
m INTEGER. The number of rows in the matrix C (m ≥ 0).
n INTEGER.
The number of columns in C (n ≥ 0).
k INTEGER. The number of elementary reflectors whose product defines the matrix Q. Constraints: 0 ≤ k ≤ m, if side = 'L'; 0 ≤ k ≤ n, if side = 'R'.
a, tau, c, work REAL for sormrq; DOUBLE PRECISION for dormrq. Arrays: a(lda,*), tau(*), c(ldc,*). On entry, the i-th row of a must contain the vector which defines the elementary reflector H(i), for i = 1,2,...,k, as returned by sgerqf/dgerqf in the last k rows of its array argument a. The second dimension of a must be at least max(1, m) if side = 'L', and at least max(1, n) if side = 'R'. tau(i) must contain the scalar factor of the elementary reflector H(i), as returned by sgerqf/dgerqf. The dimension of tau must be at least max(1, k). c(ldc,*) contains the m-by-n matrix C. The second dimension of c must be at least max(1, n). work is a workspace array, its dimension max(1, lwork).
lda INTEGER. The leading dimension of a; lda ≥ max(1, k).
ldc INTEGER. The leading dimension of c; ldc ≥ max(1, m).
lwork INTEGER. The size of the work array. Constraints: lwork ≥ max(1, n) if side = 'L'; lwork ≥ max(1, m) if side = 'R'. If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla. See Application Notes for the suggested value of lwork.

Output Parameters
c Overwritten by the product Q*C, Q^T*C, C*Q, or C*Q^T (as specified by side and trans).
work(1) If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts.
For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions. Specific details for the routine ormrq interface are the following:
a Holds the matrix A of size (k,m).
tau Holds the vector of length (k).
c Holds the matrix C of size (m,n).
side Must be 'L' or 'R'. The default value is 'L'.
trans Must be 'N' or 'T'. The default value is 'N'.

Application Notes
For better performance, try using lwork = n*blocksize (if side = 'L') or lwork = m*blocksize (if side = 'R'), where blocksize is a machine-dependent value (typically, 16 to 64) required for optimum performance of the blocked algorithm.
If you are in doubt how much workspace to supply, use a generous value of lwork for the first run or set lwork = -1.
If you choose the first option and set any admissible lwork size, that is, one no less than the minimal value described, the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the corresponding array work on exit. Use this value (work(1)) for subsequent runs.
If you set lwork = -1, the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work). This operation is called a workspace query.
Note that if you set lwork to less than the minimal required value and not -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.
The complex counterpart of this routine is unmrq.

?unmrq
Multiplies a complex matrix by the unitary matrix Q of the RQ factorization formed by ?gerqf.
Syntax
Fortran 77:
call cunmrq(side, trans, m, n, k, a, lda, tau, c, ldc, work, lwork, info)
call zunmrq(side, trans, m, n, k, a, lda, tau, c, ldc, work, lwork, info)
Fortran 95:
call unmrq(a, tau, c [,side] [,trans] [,info])
C:
lapack_int LAPACKE_<?>unmrq( int matrix_order, char side, char trans, lapack_int m, lapack_int n, lapack_int k, const <datatype>* a, lapack_int lda, const <datatype>* tau, <datatype>* c, lapack_int ldc );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine multiplies a complex m-by-n matrix C by Q or Q^H, where Q is the complex unitary matrix defined as a product of k elementary reflectors H(i) of order n: Q = H(1)^H * H(2)^H * ... * H(k)^H, as returned by the RQ factorization routine ?gerqf.
Depending on the parameters side and trans, the routine can form one of the matrix products Q*C, Q^H*C, C*Q, or C*Q^H (overwriting the result over C).

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
side CHARACTER*1. Must be either 'L' or 'R'. If side = 'L', Q or Q^H is applied to C from the left. If side = 'R', Q or Q^H is applied to C from the right.
trans CHARACTER*1. Must be either 'N' or 'C'. If trans = 'N', the routine multiplies C by Q. If trans = 'C', the routine multiplies C by Q^H.
m INTEGER. The number of rows in the matrix C (m ≥ 0).
n INTEGER. The number of columns in C (n ≥ 0).
k INTEGER. The number of elementary reflectors whose product defines the matrix Q. Constraints: 0 ≤ k ≤ m, if side = 'L'; 0 ≤ k ≤ n, if side = 'R'.
a, tau, c, work COMPLEX for cunmrq; DOUBLE COMPLEX for zunmrq. Arrays: a(lda,*), tau(*), c(ldc,*), work(lwork).
On entry, the i-th row of a must contain the vector which defines the elementary reflector H(i), for i = 1,2,...,k, as returned by cgerqf/zgerqf in the last k rows of its array argument a. The second dimension of a must be at least max(1, m) if side = 'L', and at least max(1, n) if side = 'R'. tau(i) must contain the scalar factor of the elementary reflector H(i), as returned by cgerqf/zgerqf. The dimension of tau must be at least max(1, k). c(ldc,*) contains the m-by-n matrix C. The second dimension of c must be at least max(1, n). work is a workspace array, its dimension max(1, lwork).
lda INTEGER. The leading dimension of a; lda ≥ max(1, k).
ldc INTEGER. The leading dimension of c; ldc ≥ max(1, m).
lwork INTEGER. The size of the work array. Constraints: lwork ≥ max(1, n) if side = 'L'; lwork ≥ max(1, m) if side = 'R'. If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla. See Application Notes for the suggested value of lwork.

Output Parameters
c Overwritten by the product Q*C, Q^H*C, C*Q, or C*Q^H (as specified by side and trans).
work(1) If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions. Specific details for the routine unmrq interface are the following:
a Holds the matrix A of size (k,m).
tau Holds the vector of length (k).
c Holds the matrix C of size (m,n).
side Must be 'L' or 'R'.
The default value is 'L'.
trans Must be 'N' or 'C'. The default value is 'N'.

Application Notes
For better performance, try using lwork = n*blocksize (if side = 'L') or lwork = m*blocksize (if side = 'R'), where blocksize is a machine-dependent value (typically, 16 to 64) required for optimum performance of the blocked algorithm.
If it is not clear how much workspace to supply, use a generous value of lwork for the first run, or set lwork = -1.
In the first case the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the corresponding array work on exit. Use this value (work(1)) for subsequent runs.
If lwork = -1, then the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work). This operation is called a workspace query.
Note that if lwork is less than the minimal required value and is not equal to -1, then the routine returns immediately with an error exit and does not provide any information on the recommended workspace.
The real counterpart of this routine is ?ormrq.

?tzrzf
Reduces the upper trapezoidal matrix A to upper triangular form.

Syntax

Fortran 77:
call stzrzf(m, n, a, lda, tau, work, lwork, info)
call dtzrzf(m, n, a, lda, tau, work, lwork, info)
call ctzrzf(m, n, a, lda, tau, work, lwork, info)
call ztzrzf(m, n, a, lda, tau, work, lwork, info)

Fortran 95:
call tzrzf(a [, tau] [,info])

C:
lapack_int LAPACKE_<?>tzrzf( int matrix_order, lapack_int m, lapack_int n, <datatype>* a, lapack_int lda, <datatype>* tau );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine reduces the m-by-n (m ≤ n) real/complex upper trapezoidal matrix A to upper triangular form by means of orthogonal/unitary transformations.
The upper trapezoidal matrix A is factored as
A = (R 0)*Z,
where Z is an n-by-n orthogonal/unitary matrix and R is an m-by-m upper triangular matrix.
See ?larz, which applies an elementary reflector returned by ?tzrzf to a general matrix.
The ?tzrzf routine replaces the deprecated ?tzrqf routine.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

m INTEGER. The number of rows in the matrix A (m ≥ 0).

n INTEGER. The number of columns in A (n ≥ m).

a, work REAL for stzrzf
DOUBLE PRECISION for dtzrzf
COMPLEX for ctzrzf
DOUBLE COMPLEX for ztzrzf.
Arrays: a(lda,*), work(lwork).
The leading m-by-n upper trapezoidal part of the array a contains the matrix A to be factorized.
The second dimension of a must be at least max(1, n).
work is a workspace array, its dimension max(1, lwork).

lda INTEGER. The leading dimension of a; at least max(1, m).

lwork INTEGER. The size of the work array; lwork ≥ max(1, m).
If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla.
See Application Notes for the suggested value of lwork.

Output Parameters

a Overwritten on exit by the factorization data as follows:
the leading m-by-m upper triangular part of a contains the upper triangular matrix R, and elements m+1 to n of the first m rows of a, with the array tau, represent the orthogonal matrix Z as a product of m elementary reflectors.

tau REAL for stzrzf
DOUBLE PRECISION for dtzrzf
COMPLEX for ctzrzf
DOUBLE COMPLEX for ztzrzf.
Array, DIMENSION at least max(1, m). Contains scalar factors of the elementary reflectors for the matrix Z.
work(1) If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs.

info INTEGER.
If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine tzrzf interface are the following:

a Holds the matrix A of size (m,n).
tau Holds the vector of length (m).

Application Notes
The factorization is obtained by Householder's method. The k-th transformation matrix, Z(k), which is used to introduce zeros into the (m - k + 1)-th row of A, is given in the form

Z(k) = [ I  0 ; 0  T(k) ]

where, for real flavors,

T(k) = I - tau*u(k)*u(k)^T,  u(k) = (1, 0, z(k))^T

and, for complex flavors,

T(k) = I - tau*u(k)*u(k)^H,  u(k) = (1, 0, z(k))^T.

Here tau is a scalar and z(k) is an l-element vector (l = n - m). tau and z(k) are chosen to annihilate the elements of the k-th row of X.
The scalar tau is returned in the k-th element of tau and the vector u(k) in the k-th row of A, such that the elements of z(k) are in a(k, m+1), ..., a(k, n). The elements of R are returned in the upper triangular part of A.
Z is given by
Z = Z(1)*Z(2)*...*Z(m).
For better performance, try using lwork = m*blocksize, where blocksize is a machine-dependent value (typically, 16 to 64) required for optimum performance of the blocked algorithm.
If it is not clear how much workspace to supply, use a generous value of lwork for the first run, or set lwork = -1.
In the first case the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the corresponding array work on exit. Use this value (work(1)) for subsequent runs.
If lwork = -1, then the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work). This operation is called a workspace query.
Note that if lwork is less than the minimal required value and is not equal to -1, then the routine returns immediately with an error exit and does not provide any information on the recommended workspace.
Related routines include:
?ormrz to apply matrix Q (for real matrices)
?unmrz to apply matrix Q (for complex matrices).

?ormrz
Multiplies a real matrix by the orthogonal matrix defined from the factorization formed by ?tzrzf.

Syntax

Fortran 77:
call sormrz(side, trans, m, n, k, l, a, lda, tau, c, ldc, work, lwork, info)
call dormrz(side, trans, m, n, k, l, a, lda, tau, c, ldc, work, lwork, info)

Fortran 95:
call ormrz(a, tau, c, l [, side] [,trans] [,info])

C:
lapack_int LAPACKE_<?>ormrz( int matrix_order, char side, char trans, lapack_int m, lapack_int n, lapack_int k, lapack_int l, const <datatype>* a, lapack_int lda, const <datatype>* tau, <datatype>* c, lapack_int ldc );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The ?ormrz routine multiplies a real m-by-n matrix C by Q or Q^T, where Q is the real orthogonal matrix defined as a product of k elementary reflectors H(i) of order n: Q = H(1)*H(2)*...*H(k), as returned by the factorization routine ?tzrzf.
Depending on the parameters side and trans, the routine can form one of the matrix products Q*C, Q^T*C, C*Q, or C*Q^T (overwriting the result over C).
The matrix Q is of order m if side = 'L' and of order n if side = 'R'.
The ?ormrz routine replaces the deprecated ?latzm routine.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above.
See the C Interface Conventions section for the C interface principal conventions and type definitions.

side CHARACTER*1. Must be either 'L' or 'R'.
If side = 'L', Q or Q^T is applied to C from the left.
If side = 'R', Q or Q^T is applied to C from the right.

trans CHARACTER*1. Must be either 'N' or 'T'.
If trans = 'N', the routine multiplies C by Q.
If trans = 'T', the routine multiplies C by Q^T.

m INTEGER. The number of rows in the matrix C (m ≥ 0).

n INTEGER. The number of columns in C (n ≥ 0).

k INTEGER. The number of elementary reflectors whose product defines the matrix Q. Constraints:
0 ≤ k ≤ m, if side = 'L';
0 ≤ k ≤ n, if side = 'R'.

l INTEGER. The number of columns of the matrix A containing the meaningful part of the Householder reflectors. Constraints:
0 ≤ l ≤ m, if side = 'L';
0 ≤ l ≤ n, if side = 'R'.

a, tau, c, work REAL for sormrz
DOUBLE PRECISION for dormrz.
Arrays: a(lda,*), tau(*), c(ldc,*), work(lwork).
On entry, the i-th row of a must contain the vector which defines the elementary reflector H(i), for i = 1,2,...,k, as returned by stzrzf/dtzrzf in the last k rows of its array argument a.
The second dimension of a must be at least max(1, m) if side = 'L', and at least max(1, n) if side = 'R'.
tau(i) must contain the scalar factor of the elementary reflector H(i), as returned by stzrzf/dtzrzf.
The dimension of tau must be at least max(1, k).
c(ldc,*) contains the m-by-n matrix C.
The second dimension of c must be at least max(1, n).
work is a workspace array, its dimension max(1, lwork).

lda INTEGER. The leading dimension of a; lda ≥ max(1, k).

ldc INTEGER. The leading dimension of c; ldc ≥ max(1, m).

lwork INTEGER. The size of the work array. Constraints:
lwork ≥ max(1, n) if side = 'L';
lwork ≥ max(1, m) if side = 'R'.
If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla.
See Application Notes for the suggested value of lwork.

Output Parameters

c Overwritten by the product Q*C, Q^T*C, C*Q, or C*Q^T (as specified by side and trans).

work(1) If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs.

info INTEGER.
If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine ormrz interface are the following:

a Holds the matrix A of size (k,m).
tau Holds the vector of length (k).
c Holds the matrix C of size (m,n).
side Must be 'L' or 'R'. The default value is 'L'.
trans Must be 'N' or 'T'. The default value is 'N'.

Application Notes
For better performance, try using lwork = n*blocksize (if side = 'L') or lwork = m*blocksize (if side = 'R'), where blocksize is a machine-dependent value (typically, 16 to 64) required for optimum performance of the blocked algorithm.
If you are in doubt how much workspace to supply, use a generous value of lwork for the first run or set lwork = -1.
If you choose the first option and set any admissible lwork size, that is, one no less than the minimal value described, the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the corresponding array work on exit. Use this value (work(1)) for subsequent runs.
If you set lwork = -1, the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work). This operation is called a workspace query.
Note that if you set lwork to less than the minimal required value and not -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.
The complex counterpart of this routine is ?unmrz.

?unmrz
Multiplies a complex matrix by the unitary matrix defined from the factorization formed by ?tzrzf.

Syntax

Fortran 77:
call cunmrz(side, trans, m, n, k, l, a, lda, tau, c, ldc, work, lwork, info)
call zunmrz(side, trans, m, n, k, l, a, lda, tau, c, ldc, work, lwork, info)

Fortran 95:
call unmrz(a, tau, c, l [,side] [,trans] [,info])

C:
lapack_int LAPACKE_<?>unmrz( int matrix_order, char side, char trans, lapack_int m, lapack_int n, lapack_int k, lapack_int l, const <datatype>* a, lapack_int lda, const <datatype>* tau, <datatype>* c, lapack_int ldc );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine multiplies a complex m-by-n matrix C by Q or Q^H, where Q is the unitary matrix defined as a product of k elementary reflectors H(i): Q = H(1)^H * H(2)^H * ... * H(k)^H, as returned by the factorization routine ?tzrzf.
Depending on the parameters side and trans, the routine can form one of the matrix products Q*C, Q^H*C, C*Q, or C*Q^H (overwriting the result over C).
The matrix Q is of order m if side = 'L' and of order n if side = 'R'.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

side CHARACTER*1. Must be either 'L' or 'R'.
If side = 'L', Q or Q^H is applied to C from the left.
If side = 'R', Q or Q^H is applied to C from the right.

trans CHARACTER*1. Must be either 'N' or 'C'.
If trans = 'N', the routine multiplies C by Q.
If trans = 'C', the routine multiplies C by Q^H.

m INTEGER. The number of rows in the matrix C (m ≥ 0).

n INTEGER. The number of columns in C (n ≥ 0).

k INTEGER. The number of elementary reflectors whose product defines the matrix Q. Constraints:
0 ≤ k ≤ m, if side = 'L';
0 ≤ k ≤ n, if side = 'R'.

l INTEGER. The number of columns of the matrix A containing the meaningful part of the Householder reflectors. Constraints:
0 ≤ l ≤ m, if side = 'L';
0 ≤ l ≤ n, if side = 'R'.

a, tau, c, work COMPLEX for cunmrz
DOUBLE COMPLEX for zunmrz.
Arrays: a(lda,*), tau(*), c(ldc,*), work(lwork).
On entry, the i-th row of a must contain the vector which defines the elementary reflector H(i), for i = 1,2,...,k, as returned by ctzrzf/ztzrzf in the last k rows of its array argument a.
The second dimension of a must be at least max(1, m) if side = 'L', and at least max(1, n) if side = 'R'.
tau(i) must contain the scalar factor of the elementary reflector H(i), as returned by ctzrzf/ztzrzf.
The dimension of tau must be at least max(1, k).
c(ldc,*) contains the m-by-n matrix C.
The second dimension of c must be at least max(1, n).
work is a workspace array, its dimension max(1, lwork).

lda INTEGER. The leading dimension of a; lda ≥ max(1, k).

ldc INTEGER. The leading dimension of c; ldc ≥ max(1, m).

lwork INTEGER. The size of the work array. Constraints:
lwork ≥ max(1, n) if side = 'L';
lwork ≥ max(1, m) if side = 'R'.
If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla.
See Application Notes for the suggested value of lwork.
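The input constraints listed above for ?unmrz can be sketched as a pre-call check in C. This is only an illustration, not MKL code: the helper name check_unmrz_args is hypothetical, it accepts upper-case side and trans values only, and it assumes that the position i reported as -i counts arguments in the Fortran 77 calling sequence (side, trans, m, n, k, l, a, lda, tau, c, ldc, work, lwork, info).

```c
#include <stddef.h>

/* Hypothetical helper, not part of MKL: mirrors the documented ?unmrz
   input constraints. Returns 0 if all checks pass, or -i where i is the
   assumed 1-based position of the first offending argument in the
   Fortran 77 call. */
static int max_int(int a, int b) { return a > b ? a : b; }

int check_unmrz_args(char side, char trans, int m, int n, int k, int l,
                     int lda, int ldc, int lwork)
{
    int left = (side == 'L');
    int right = (side == 'R');
    int nq, nw;

    if (!left && !right) return -1;               /* side must be 'L' or 'R' */
    if (trans != 'N' && trans != 'C') return -2;  /* trans must be 'N' or 'C' */
    if (m < 0) return -3;
    if (n < 0) return -4;

    nq = left ? m : n;   /* upper bound for k and l */
    nw = left ? n : m;   /* drives the minimal workspace size */

    if (k < 0 || k > nq) return -5;
    if (l < 0 || l > nq) return -6;
    if (lda < max_int(1, k)) return -8;
    if (ldc < max_int(1, m)) return -11;
    /* lwork = -1 is the workspace query and is always admissible */
    if (lwork != -1 && lwork < max_int(1, nw)) return -13;
    return 0;
}
```

For example, check_unmrz_args('L', 'N', 4, 3, 2, 2, 2, 4, 3) passes every check, while the same call with lwork = 2 violates lwork ≥ max(1, n) for side = 'L'.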
Output Parameters

c Overwritten by the product Q*C, Q^H*C, C*Q, or C*Q^H (as specified by side and trans).

work(1) If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs.

info INTEGER.
If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine unmrz interface are the following:

a Holds the matrix A of size (k,m).
tau Holds the vector of length (k).
c Holds the matrix C of size (m,n).
side Must be 'L' or 'R'. The default value is 'L'.
trans Must be 'N' or 'C'. The default value is 'N'.

Application Notes
For better performance, try using lwork = n*blocksize (if side = 'L') or lwork = m*blocksize (if side = 'R'), where blocksize is a machine-dependent value (typically, 16 to 64) required for optimum performance of the blocked algorithm.
If it is not clear how much workspace to supply, use a generous value of lwork for the first run, or set lwork = -1.
In the first case the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the corresponding array work on exit. Use this value (work(1)) for subsequent runs.
If lwork = -1, then the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work). This operation is called a workspace query.
Note that if lwork is less than the minimal required value and is not equal to -1, then the routine returns immediately with an error exit and does not provide any information on the recommended workspace.
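The sizing rules in the notes above can be sketched as two small C helpers. The names unmrz_lwork_min and unmrz_lwork_suggested are hypothetical, and blocksize is supplied by the caller because the document only gives a typical range (16 to 64); the value that the routine itself reports in work(1) remains the authoritative one.

```c
#include <assert.h>

/* Hypothetical helper: minimal admissible lwork for ?unmrz, that is,
   max(1, n) if side = 'L' and max(1, m) if side = 'R'. */
int unmrz_lwork_min(char side, int m, int n)
{
    int w = (side == 'L') ? n : m;
    return w > 1 ? w : 1;
}

/* Hypothetical helper: suggested lwork, n*blocksize or m*blocksize,
   never smaller than the minimal admissible value. */
int unmrz_lwork_suggested(char side, int m, int n, int blocksize)
{
    int w, lo;
    assert(blocksize >= 1);   /* caller-chosen, machine-dependent value */
    w = ((side == 'L') ? n : m) * blocksize;
    lo = unmrz_lwork_min(side, m, n);
    return w > lo ? w : lo;
}
```

With m = 100, n = 40 and an assumed blocksize of 32, the suggested size is 40*32 for side = 'L' and 100*32 for side = 'R'.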
The real counterpart of this routine is ?ormrz.

?ggqrf
Computes the generalized QR factorization of two matrices.

Syntax

Fortran 77:
call sggqrf(n, m, p, a, lda, taua, b, ldb, taub, work, lwork, info)
call dggqrf(n, m, p, a, lda, taua, b, ldb, taub, work, lwork, info)
call cggqrf(n, m, p, a, lda, taua, b, ldb, taub, work, lwork, info)
call zggqrf(n, m, p, a, lda, taua, b, ldb, taub, work, lwork, info)

Fortran 95:
call ggqrf(a, b [,taua] [,taub] [,info])

C:
lapack_int LAPACKE_<?>ggqrf( int matrix_order, lapack_int n, lapack_int m, lapack_int p, <datatype>* a, lapack_int lda, <datatype>* taua, <datatype>* b, lapack_int ldb, <datatype>* taub );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine forms the generalized QR factorization of an n-by-m matrix A and an n-by-p matrix B as A = Q*R, B = Q*T*Z, where Q is an n-by-n orthogonal/unitary matrix, Z is a p-by-p orthogonal/unitary matrix, and R and T assume one of the forms:

if n ≥ m, R = [ R11 ; 0 ], with R11 of size m-by-m and the zero block of size (n-m)-by-m,
or, if n < m, R = [ R11  R12 ], with R11 of size n-by-n and R12 of size n-by-(m-n),

where R11 is upper triangular, and

if n ≤ p, T = [ 0  T12 ], with the zero block of size n-by-(p-n) and T12 of size n-by-n,
or, if n > p, T = [ T11 ; T21 ], with T11 of size (n-p)-by-p and T21 of size p-by-p,

where T12 or T21 is upper triangular.
In particular, if B is square and nonsingular, the GQR factorization of A and B implicitly gives the QR factorization of B^(-1)*A as:
B^(-1)*A = Z^T*(T^(-1)*R) (for real flavors) or B^(-1)*A = Z^H*(T^(-1)*R) (for complex flavors).

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

n INTEGER. The number of rows of the matrices A and B (n ≥ 0).

m INTEGER. The number of columns in A (m ≥ 0).

p INTEGER. The number of columns in B (p ≥ 0).

a, b, work REAL for sggqrf
DOUBLE PRECISION for dggqrf
COMPLEX for cggqrf
DOUBLE COMPLEX for zggqrf.
Arrays:
a(lda,*) contains the matrix A.
The second dimension of a must be at least max(1, m).
b(ldb,*) contains the matrix B.
The second dimension of b must be at least max(1, p).
work is a workspace array, its dimension max(1, lwork).

lda INTEGER. The leading dimension of a; at least max(1, n).

ldb INTEGER. The leading dimension of b; at least max(1, n).

lwork INTEGER. The size of the work array; must be at least max(1, n, m, p).
If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla.
See Application Notes for the suggested value of lwork.

Output Parameters

a, b Overwritten by the factorization data as follows:
on exit, the elements on and above the diagonal of the array a contain the min(n,m)-by-m upper trapezoidal matrix R (R is upper triangular if n ≥ m); the elements below the diagonal, with the array taua, represent the orthogonal/unitary matrix Q as a product of min(n,m) elementary reflectors;
if n ≤ p, the upper triangle of the subarray b(1:n, p-n+1:p) contains the n-by-n upper triangular matrix T;
if n > p, the elements on and above the (n-p)-th subdiagonal contain the n-by-p upper trapezoidal matrix T; the remaining elements, with the array taub, represent the orthogonal/unitary matrix Z as a product of elementary reflectors.

taua, taub REAL for sggqrf
DOUBLE PRECISION for dggqrf
COMPLEX for cggqrf
DOUBLE COMPLEX for zggqrf.
Arrays, DIMENSION at least max(1, min(n, m)) for taua and at least max(1, min(n, p)) for taub.
The array taua contains the scalar factors of the elementary reflectors which represent the orthogonal/unitary matrix Q.
The array taub contains the scalar factors of the elementary reflectors which represent the orthogonal/unitary matrix Z.

work(1) If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs.

info INTEGER.
If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine ggqrf interface are the following:

a Holds the matrix A of size (n,m).
b Holds the matrix B of size (n,p).
taua Holds the vector of length min(n,m).
taub Holds the vector of length min(n,p).

Application Notes
The matrix Q is represented as a product of elementary reflectors
Q = H(1)H(2)...H(k), where k = min(n,m).
Each H(i) has the form
H(i) = I - taua*v*v^T for real flavors, or
H(i) = I - taua*v*v^H for complex flavors,
where taua is a real/complex scalar, and v is a real/complex vector with v(1:i-1) = 0, v(i) = 1.
On exit, v(i+1:n) is stored in a(i+1:n, i) and taua is stored in taua(i).
The matrix Z is represented as a product of elementary reflectors
Z = H(1)H(2)...H(k), where k = min(n,p).
Each H(i) has the form
H(i) = I - taub*v*v^T for real flavors, or
H(i) = I - taub*v*v^H for complex flavors,
where taub is a real/complex scalar, and v is a real/complex vector with v(p-k+i+1:p) = 0, v(p-k+i) = 1.
On exit, v(1:p-k+i-1) is stored in b(n-k+i, 1:p-k+i-1) and taub is stored in taub(i).
For better performance, try using lwork = max(n,m,p)*max(nb1,nb2,nb3), where nb1 is the optimal blocksize for the QR factorization of an n-by-m matrix, nb2 is the optimal blocksize for the RQ factorization of an n-by-p matrix, and nb3 is the optimal blocksize for a call of ?ormqr/?unmqr.
If you are in doubt how much workspace to supply, use a generous value of lwork for the first run or set lwork = -1.
If you choose the first option and set any admissible lwork size, that is, one no less than the minimal value described, the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the corresponding array work on exit. Use this value (work(1)) for subsequent runs.
If you set lwork = -1, the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work). This operation is called a workspace query.
Note that if you set lwork to less than the minimal required value and not -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.

?ggrqf
Computes the generalized RQ factorization of two matrices.

Syntax

Fortran 77:
call sggrqf(m, p, n, a, lda, taua, b, ldb, taub, work, lwork, info)
call dggrqf(m, p, n, a, lda, taua, b, ldb, taub, work, lwork, info)
call cggrqf(m, p, n, a, lda, taua, b, ldb, taub, work, lwork, info)
call zggrqf(m, p, n, a, lda, taua, b, ldb, taub, work, lwork, info)

Fortran 95:
call ggrqf(a, b [,taua] [,taub] [,info])

C:
lapack_int LAPACKE_<?>ggrqf( int matrix_order, lapack_int m, lapack_int p, lapack_int n, <datatype>* a, lapack_int lda, <datatype>* taua, <datatype>* b, lapack_int ldb, <datatype>* taub );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine forms the generalized RQ factorization of an m-by-n matrix A and a p-by-n matrix B as A = R*Q, B = Z*T*Q, where Q is an n-by-n orthogonal/unitary matrix, Z is a p-by-p orthogonal/unitary matrix, and R and T assume one of the forms:

if m ≤ n, R = [ 0  R12 ], with the zero block of size m-by-(n-m) and R12 of size m-by-m,
or, if m > n, R = [ R11 ; R21 ], with R11 of size (m-n)-by-n and R21 of size n-by-n,

where R12 or R21 is upper triangular, and

if p ≥ n, T = [ T11 ; 0 ], with T11 of size n-by-n and the zero block of size (p-n)-by-n,
or, if p < n, T = [ T11  T12 ], with T11 of size p-by-p and T12 of size p-by-(n-p),

where T11 is upper triangular.
In particular, if B is square and nonsingular, the GRQ factorization of A and B implicitly gives the RQ factorization of A*B^(-1) as:
A*B^(-1) = (R*T^(-1))*Z^T (for real flavors) or A*B^(-1) = (R*T^(-1))*Z^H (for complex flavors).

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

m INTEGER. The number of rows of the matrix A (m ≥ 0).

p INTEGER. The number of rows in B (p ≥ 0).

n INTEGER. The number of columns of the matrices A and B (n ≥ 0).

a, b, work REAL for sggrqf
DOUBLE PRECISION for dggrqf
COMPLEX for cggrqf
DOUBLE COMPLEX for zggrqf.
Arrays:
a(lda,*) contains the m-by-n matrix A.
The second dimension of a must be at least max(1, n).
b(ldb,*) contains the p-by-n matrix B.
The second dimension of b must be at least max(1, n).
work is a workspace array, its dimension max(1, lwork).

lda INTEGER. The leading dimension of a; at least max(1, m).

ldb INTEGER. The leading dimension of b; at least max(1, p).

lwork INTEGER. The size of the work array; must be at least max(1, n, m, p).
If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla.
See Application Notes for the suggested value of lwork.
Output Parameters

a, b Overwritten by the factorization data as follows:
on exit, if m ≤ n, the upper triangle of the subarray a(1:m, n-m+1:n) contains the m-by-m upper triangular matrix R;
if m > n, the elements on and above the (m-n)-th subdiagonal contain the m-by-n upper trapezoidal matrix R;
the remaining elements, with the array taua, represent the orthogonal/unitary matrix Q as a product of elementary reflectors;
the elements on and above the diagonal of the array b contain the min(p,n)-by-n upper trapezoidal matrix T (T is upper triangular if p ≥ n); the elements below the diagonal, with the array taub, represent the orthogonal/unitary matrix Z as a product of elementary reflectors.

taua, taub REAL for sggrqf
DOUBLE PRECISION for dggrqf
COMPLEX for cggrqf
DOUBLE COMPLEX for zggrqf.
Arrays, DIMENSION at least max(1, min(m, n)) for taua and at least max(1, min(p, n)) for taub.
The array taua contains the scalar factors of the elementary reflectors which represent the orthogonal/unitary matrix Q.
The array taub contains the scalar factors of the elementary reflectors which represent the orthogonal/unitary matrix Z.

work(1) If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs.

info INTEGER.
If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine ggrqf interface are the following:

a Holds the matrix A of size (m,n).
b Holds the matrix B of size (p,n).
taua Holds the vector of length min(m,n).
taub Holds the vector of length min(p,n).
Application Notes
The matrix Q is represented as a product of elementary reflectors
Q = H(1)H(2)...H(k), where k = min(m,n).
Each H(i) has the form
H(i) = I - taua*v*v^T for real flavors, or
H(i) = I - taua*v*v^H for complex flavors,
where taua is a real/complex scalar, and v is a real/complex vector with v(n-k+i+1:n) = 0, v(n-k+i) = 1.
On exit, v(1:n-k+i-1) is stored in a(m-k+i, 1:n-k+i-1) and taua is stored in taua(i).
The matrix Z is represented as a product of elementary reflectors
Z = H(1)H(2)...H(k), where k = min(p,n).
Each H(i) has the form
H(i) = I - taub*v*v^T for real flavors, or
H(i) = I - taub*v*v^H for complex flavors,
where taub is a real/complex scalar, and v is a real/complex vector with v(1:i-1) = 0, v(i) = 1.
On exit, v(i+1:p) is stored in b(i+1:p, i) and taub is stored in taub(i).
For better performance, try using
lwork = max(n,m,p)*max(nb1,nb2,nb3),
where nb1 is the optimal blocksize for the RQ factorization of an m-by-n matrix, nb2 is the optimal blocksize for the QR factorization of a p-by-n matrix, and nb3 is the optimal blocksize for a call of ?ormrq/?unmrq.
If you are in doubt how much workspace to supply, use a generous value of lwork for the first run or set lwork = -1.
If you choose the first option and set any admissible lwork size, that is, one no less than the minimal value described, the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the corresponding array work on exit. Use this value (work(1)) for subsequent runs.
If you set lwork = -1, the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work). This operation is called a workspace query.
Note that if you set lwork to less than the minimal required value and not -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.
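To make the reflector form H = I - tau*v*v^T used in the notes above concrete, here is a minimal C sketch for the real case. The function name apply_reflector is hypothetical and the code is not how MKL applies reflectors internally (the library uses blocked algorithms); it only shows that H*x can be formed without building H, as x - tau*(v^T*x)*v.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: overwrites x with H*x, where
   H = I - tau*v*v^T (real flavor), without forming H explicitly. */
void apply_reflector(size_t n, double tau, const double *v, double *x)
{
    double dot = 0.0;
    size_t i;

    assert(v != NULL && x != NULL);
    for (i = 0; i < n; ++i)
        dot += v[i] * x[i];           /* dot = v^T * x */
    for (i = 0; i < n; ++i)
        x[i] -= tau * dot * v[i];     /* x = x - tau*(v^T*x)*v */
}
```

For a real reflector, H is orthogonal when tau = 0 or tau = 2/(v^T*v), since H^T*H = I + tau*(tau*(v^T*v) - 2)*v*v^T. With v = (1, 1) and tau = 1, applying the sketch twice to any x therefore returns the original x.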
Singular Value Decomposition

This section describes LAPACK routines for computing the singular value decomposition (SVD) of a general m-by-n matrix A:
A = U*S*V^H.
In this decomposition, U and V are unitary (for complex A) or orthogonal (for real A); S is an m-by-n diagonal matrix with real diagonal elements s_i:
s_1 ≥ s_2 ≥ ... ≥ s_min(m, n) ≥ 0.
The diagonal elements s_i are singular values of A. The first min(m, n) columns of the matrices U and V are, respectively, left and right singular vectors of A. The singular values and singular vectors satisfy
A*v_i = s_i*u_i and A^H*u_i = s_i*v_i
where u_i and v_i are the i-th columns of U and V, respectively.
To find the SVD of a general matrix A, call the LAPACK routine ?gebrd or ?gbbrd for reducing A to a bidiagonal matrix B by a unitary (orthogonal) transformation: A = Q*B*P^H. Then call ?bdsqr, which forms the SVD of a bidiagonal matrix: B = U1*S*V1^H.
Thus, the sought-for SVD of A is given by A = U*S*V^H = (Q*U1)*S*(V1^H*P^H).
Table "Computational Routines for Singular Value Decomposition (SVD)" lists LAPACK routines (FORTRAN 77 interface) that perform singular value decomposition of matrices. Respective routine names in Fortran 95 interface are without the first symbol (see Routine Naming Conventions).
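The two-step procedure described above (bidiagonal reduction followed by the SVD of the bidiagonal matrix) can be written out as a short derivation. The subscripted names U_1 and V_1 stand for the factors U1 and V1 of B; nothing else is assumed beyond the relations already stated.

```latex
\begin{aligned}
A &= Q\,B\,P^{H}, \qquad B = U_{1}\,S\,V_{1}^{H} \\
\Rightarrow\quad A &= Q\,\bigl(U_{1}\,S\,V_{1}^{H}\bigr)\,P^{H}
                    = (Q\,U_{1})\,S\,(P\,V_{1})^{H} \\
\text{so}\quad U &= Q\,U_{1}, \qquad V = P\,V_{1}.
\end{aligned}
```

Because Q, P, U_1, and V_1 are each unitary (orthogonal in the real case), the products U and V are unitary (orthogonal) as well, so the result is a valid SVD of A.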
Computational Routines for Singular Value Decomposition (SVD)

Operation | Real matrices | Complex matrices
Reduce A to a bidiagonal matrix B: A = Q*B*P^H (full storage) | ?gebrd | ?gebrd
Reduce A to a bidiagonal matrix B: A = Q*B*P^H (band storage) | ?gbbrd | ?gbbrd
Generate the orthogonal (unitary) matrix Q or P | ?orgbr | ?ungbr
Apply the orthogonal (unitary) matrix Q or P | ?ormbr | ?unmbr
Form singular value decomposition of the bidiagonal matrix B: B = U*S*V^H | ?bdsqr, ?bdsdc | ?bdsqr

Decision Tree: Singular Value Decomposition

Figure "Decision Tree: Singular Value Decomposition" presents a decision tree that helps you choose the right sequence of routines for SVD, depending on whether you need singular values only or singular vectors as well, whether A is real or complex, and so on.
You can use the SVD to find a minimum-norm solution to a (possibly) rank-deficient least squares problem of minimizing ||A*x - b||_2. The effective rank k of the matrix A can be determined as the number of singular values which exceed a suitable threshold. The minimum-norm solution is
x = V_k*(S_k)^(-1)*c
where S_k is the leading k-by-k submatrix of S, the matrix V_k consists of the first k columns of V = P*V1, and the vector c consists of the first k elements of U^H*b = U1^H*Q^H*b.

?gebrd
Reduces a general matrix to bidiagonal form.
Syntax

Fortran 77:
call sgebrd(m, n, a, lda, d, e, tauq, taup, work, lwork, info)
call dgebrd(m, n, a, lda, d, e, tauq, taup, work, lwork, info)
call cgebrd(m, n, a, lda, d, e, tauq, taup, work, lwork, info)
call zgebrd(m, n, a, lda, d, e, tauq, taup, work, lwork, info)

Fortran 95:
call gebrd(a [, d] [,e] [,tauq] [,taup] [,info])

C:
lapack_int LAPACKE_sgebrd( int matrix_order, lapack_int m, lapack_int n, float* a, lapack_int lda, float* d, float* e, float* tauq, float* taup );
lapack_int LAPACKE_dgebrd( int matrix_order, lapack_int m, lapack_int n, double* a, lapack_int lda, double* d, double* e, double* tauq, double* taup );
lapack_int LAPACKE_cgebrd( int matrix_order, lapack_int m, lapack_int n, lapack_complex_float* a, lapack_int lda, float* d, float* e, lapack_complex_float* tauq, lapack_complex_float* taup );
lapack_int LAPACKE_zgebrd( int matrix_order, lapack_int m, lapack_int n, lapack_complex_double* a, lapack_int lda, double* d, double* e, lapack_complex_double* tauq, lapack_complex_double* taup );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine reduces a general m-by-n matrix A to a bidiagonal matrix B by an orthogonal (unitary) transformation.

If m ≥ n, the reduction is given by
A = Q*B*P^H = Q*(B1; 0)*P^H = Q1*B1*P^H,
where (B1; 0) denotes B1 stacked above a zero block, B1 is an n-by-n upper bidiagonal matrix, Q and P are orthogonal or, for a complex A, unitary matrices; Q1 consists of the first n columns of Q.

If m < n, the reduction is given by
A = Q*B*P^H = Q*(B1 0)*P^H = Q1*B1*P1^H,
where B1 is an m-by-m lower bidiagonal matrix, Q and P are orthogonal or, for a complex A, unitary matrices; P1 consists of the first m columns of P.

The routine does not form the matrices Q and P explicitly, but represents them as products of elementary reflectors. Routines are provided to work with the matrices Q and P in this representation:

If the matrix A is real,
• to compute Q and P explicitly, call orgbr.
• to multiply a general matrix by Q or P, call ormbr.
If the matrix A is complex,
• to compute Q and P explicitly, call ungbr.
• to multiply a general matrix by Q or P, call unmbr.

Input Parameters
The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

m  INTEGER. The number of rows in the matrix A (m ≥ 0).
n  INTEGER. The number of columns in A (n ≥ 0).
a, work  REAL for sgebrd
  DOUBLE PRECISION for dgebrd
  COMPLEX for cgebrd
  DOUBLE COMPLEX for zgebrd.
  Arrays:
  a(lda,*) contains the matrix A. The second dimension of a must be at least max(1, n).
  work is a workspace array, its dimension max(1, lwork).
lda  INTEGER. The leading dimension of a; at least max(1, m).
lwork  INTEGER. The dimension of work; at least max(1, m, n). If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla. See Application Notes for the suggested value of lwork.

Output Parameters
a  If m ≥ n, the diagonal and first super-diagonal of a are overwritten by the upper bidiagonal matrix B. Elements below the diagonal are overwritten by details of Q, and the remaining elements are overwritten by details of P.
  If m < n, the diagonal and first sub-diagonal of a are overwritten by the lower bidiagonal matrix B. Elements above the diagonal are overwritten by details of P, and the remaining elements are overwritten by details of Q.
d  REAL for single-precision flavors
  DOUBLE PRECISION for double-precision flavors.
  Array, DIMENSION at least max(1, min(m, n)). Contains the diagonal elements of B.
e  REAL for single-precision flavors
  DOUBLE PRECISION for double-precision flavors.
  Array, DIMENSION at least max(1, min(m, n) - 1). Contains the off-diagonal elements of B.
tauq, taup  REAL for sgebrd
  DOUBLE PRECISION for dgebrd
  COMPLEX for cgebrd
  DOUBLE COMPLEX for zgebrd.
  Arrays, DIMENSION at least max(1, min(m, n)). Contain further details of the matrices Q and P.
work(1)  If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs.
info  INTEGER.
  If info = 0, the execution is successful.
  If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine gebrd interface are the following:
a  Holds the matrix A of size (m,n).
d  Holds the vector of length min(m,n).
e  Holds the vector of length min(m,n)-1.
tauq  Holds the vector of length min(m,n).
taup  Holds the vector of length min(m,n).

Application Notes
For better performance, try using lwork = (m + n)*blocksize, where blocksize is a machine-dependent value (typically, 16 to 64) required for optimum performance of the blocked algorithm.
If you are in doubt how much workspace to supply, use a generous value of lwork for the first run or set lwork = -1.
If you choose the first option and set any admissible lwork size, that is, one no less than the minimal value described, the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the corresponding array work on exit. Use this value (work(1)) for subsequent runs.
If you set lwork = -1, the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work). This operation is called a workspace query.
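The sizing advice above reduces to simple arithmetic. A minimal sketch, assuming the minimum workspace max(1, m, n) given for lwork under Input Parameters; the helper names gebrd_min_lwork and gebrd_suggested_lwork are hypothetical, and blocksize is a guess supplied by the caller, not a value obtained from the library.

```c
/* Minimum workspace for ?gebrd as stated in the parameter list: max(1, m, n). */
int gebrd_min_lwork(int m, int n)
{
    int lw = (m > n) ? m : n;
    return (lw > 1) ? lw : 1;
}

/* Suggested workspace (m + n)*blocksize, never below the minimum.
   blocksize is a caller-supplied guess (the text suggests 16 to 64). */
int gebrd_suggested_lwork(int m, int n, int blocksize)
{
    int min_lw = gebrd_min_lwork(m, n);
    int lw = (m + n) * blocksize;
    return (lw > min_lw) ? lw : min_lw;
}
```

When the optimal size matters, a workspace query (lwork = -1) remains the more reliable choice, because it asks the routine itself rather than relying on a guessed block size.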
Note that if you set lwork to less than the minimal required value and not -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.

The computed matrices Q, B, and P satisfy Q*B*P^H = A + E, where ||E||_2 = c(n)*ε*||A||_2, c(n) is a modestly increasing function of n, and ε is the machine precision.

The approximate number of floating-point operations for real flavors is
(4/3)*n^2*(3*m - n) for m ≥ n,
(4/3)*m^2*(3*n - m) for m < n.
The number of operations for complex flavors is four times greater.

If n is much less than m, it can be more efficient to first form the QR factorization of A by calling geqrf and then reduce the factor R to bidiagonal form. This requires approximately 2*n^2*(m + n) floating-point operations.
If m is much less than n, it can be more efficient to first form the LQ factorization of A by calling gelqf and then reduce the factor L to bidiagonal form. This requires approximately 2*m^2*(m + n) floating-point operations.

?gbbrd
Reduces a general band matrix to bidiagonal form.
Syntax

Fortran 77:
call sgbbrd(vect, m, n, ncc, kl, ku, ab, ldab, d, e, q, ldq, pt, ldpt, c, ldc, work, info)
call dgbbrd(vect, m, n, ncc, kl, ku, ab, ldab, d, e, q, ldq, pt, ldpt, c, ldc, work, info)
call cgbbrd(vect, m, n, ncc, kl, ku, ab, ldab, d, e, q, ldq, pt, ldpt, c, ldc, work, rwork, info)
call zgbbrd(vect, m, n, ncc, kl, ku, ab, ldab, d, e, q, ldq, pt, ldpt, c, ldc, work, rwork, info)

Fortran 95:
call gbbrd(ab [, c] [,d] [,e] [,q] [,pt] [,kl] [,m] [,info])

C:
lapack_int LAPACKE_sgbbrd( int matrix_order, char vect, lapack_int m, lapack_int n, lapack_int ncc, lapack_int kl, lapack_int ku, float* ab, lapack_int ldab, float* d, float* e, float* q, lapack_int ldq, float* pt, lapack_int ldpt, float* c, lapack_int ldc );
lapack_int LAPACKE_dgbbrd( int matrix_order, char vect, lapack_int m, lapack_int n, lapack_int ncc, lapack_int kl, lapack_int ku, double* ab, lapack_int ldab, double* d, double* e, double* q, lapack_int ldq, double* pt, lapack_int ldpt, double* c, lapack_int ldc );
lapack_int LAPACKE_cgbbrd( int matrix_order, char vect, lapack_int m, lapack_int n, lapack_int ncc, lapack_int kl, lapack_int ku, lapack_complex_float* ab, lapack_int ldab, float* d, float* e, lapack_complex_float* q, lapack_int ldq, lapack_complex_float* pt, lapack_int ldpt, lapack_complex_float* c, lapack_int ldc );
lapack_int LAPACKE_zgbbrd( int matrix_order, char vect, lapack_int m, lapack_int n, lapack_int ncc, lapack_int kl, lapack_int ku, lapack_complex_double* ab, lapack_int ldab, double* d, double* e, lapack_complex_double* q, lapack_int ldq, lapack_complex_double* pt, lapack_int ldpt, lapack_complex_double* c, lapack_int ldc );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine reduces an m-by-n band matrix A to upper bidiagonal matrix B: A = Q*B*P^H. Here the matrices Q and P are orthogonal (for real A) or unitary (for complex A).
They are determined as products of Givens rotation matrices, and may be formed explicitly by the routine if required. The routine can also update a matrix C as follows: C = Q^H*C.

Input Parameters
The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

vect  CHARACTER*1. Must be 'N' or 'Q' or 'P' or 'B'.
  If vect = 'N', neither Q nor P^H is generated.
  If vect = 'Q', the routine generates the matrix Q.
  If vect = 'P', the routine generates the matrix P^H.
  If vect = 'B', the routine generates both Q and P^H.
m  INTEGER. The number of rows in the matrix A (m ≥ 0).
n  INTEGER. The number of columns in A (n ≥ 0).
ncc  INTEGER. The number of columns in C (ncc ≥ 0).
kl  INTEGER. The number of sub-diagonals within the band of A (kl ≥ 0).
ku  INTEGER. The number of super-diagonals within the band of A (ku ≥ 0).
ab, c, work  REAL for sgbbrd
  DOUBLE PRECISION for dgbbrd
  COMPLEX for cgbbrd
  DOUBLE COMPLEX for zgbbrd.
  Arrays:
  ab(ldab,*) contains the matrix A in band storage (see Matrix Storage Schemes). The second dimension of ab must be at least max(1, n).
  c(ldc,*) contains an m-by-ncc matrix C. If ncc = 0, the array c is not referenced. The second dimension of c must be at least max(1, ncc).
  work(*) is a workspace array. The dimension of work must be at least 2*max(m, n) for real flavors, or max(m, n) for complex flavors.
ldab  INTEGER. The leading dimension of the array ab (ldab ≥ kl + ku + 1).
ldq  INTEGER. The leading dimension of the output array q.
  ldq ≥ max(1, m) if vect = 'Q' or 'B', ldq ≥ 1 otherwise.
ldpt  INTEGER. The leading dimension of the output array pt.
  ldpt ≥ max(1, n) if vect = 'P' or 'B', ldpt ≥ 1 otherwise.
ldc  INTEGER. The leading dimension of the array c.
  ldc ≥ max(1, m) if ncc > 0; ldc ≥ 1 if ncc = 0.
rwork  REAL for cgbbrd
  DOUBLE PRECISION for zgbbrd.
  A workspace array, DIMENSION at least max(m, n).

Output Parameters
The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

ab  Overwritten by values generated during the reduction.
d  REAL for single-precision flavors
  DOUBLE PRECISION for double-precision flavors.
  Array, DIMENSION at least max(1, min(m, n)). Contains the diagonal elements of the matrix B.
e  REAL for single-precision flavors
  DOUBLE PRECISION for double-precision flavors.
  Array, DIMENSION at least max(1, min(m, n) - 1). Contains the off-diagonal elements of B.
q, pt  REAL for sgbbrd
  DOUBLE PRECISION for dgbbrd
  COMPLEX for cgbbrd
  DOUBLE COMPLEX for zgbbrd.
  Arrays:
  q(ldq,*) contains the output m-by-m matrix Q. The second dimension of q must be at least max(1, m).
  pt(ldpt,*) contains the output n-by-n matrix P^T. The second dimension of pt must be at least max(1, n).
c  Overwritten by the product Q^H*C. c is not referenced if ncc = 0.
info  INTEGER.
  If info = 0, the execution is successful.
  If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine gbbrd interface are the following:
ab  Holds the array A of size (kl+ku+1,n).
c  Holds the matrix C of size (m,ncc).
d  Holds the vector with the number of elements min(m,n).
e  Holds the vector with the number of elements min(m,n)-1.
q  Holds the matrix Q of size (m,m).
pt  Holds the matrix P^T of size (n,n).
m  If omitted, assumed m = n.
kl  If omitted, assumed kl = ku.
ku  Restored as ku = lda-kl-1, where lda is the leading dimension of ab.
vect  Restored based on the presence of arguments q and pt as follows:
  vect = 'B', if both q and pt are present,
  vect = 'Q', if q is present and pt omitted,
  vect = 'P', if q is omitted and pt present,
  vect = 'N', if both q and pt are omitted.

Application Notes
The computed matrices Q, B, and P satisfy Q*B*P^H = A + E, where ||E||_2 = c(n)*ε*||A||_2, c(n) is a modestly increasing function of n, and ε is the machine precision.

If m = n, the total number of floating-point operations for real flavors is approximately the sum of:
6*n^2*(kl + ku) if vect = 'N' and ncc = 0,
3*n^2*ncc*(kl + ku - 1)/(kl + ku) if C is updated, and
3*n^3*(kl + ku - 1)/(kl + ku) if either Q or P^H is generated (double this if both).
To estimate the number of operations for complex flavors, use the same formulas with the coefficients 20 and 10 (instead of 6 and 3).

?orgbr
Generates the real orthogonal matrix Q or P^T determined by ?gebrd.

Syntax

Fortran 77:
call sorgbr(vect, m, n, k, a, lda, tau, work, lwork, info)
call dorgbr(vect, m, n, k, a, lda, tau, work, lwork, info)

Fortran 95:
call orgbr(a, tau [,vect] [,info])

C:
lapack_int LAPACKE_<?>orgbr( int matrix_order, char vect, lapack_int m, lapack_int n, lapack_int k, <datatype>* a, lapack_int lda, const <datatype>* tau );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine generates the whole or part of the orthogonal matrices Q and P^T formed by the routine gebrd. Use this routine after a call to sgebrd/dgebrd. All valid combinations of arguments are described in Input Parameters. In most cases you need the following:

To compute the whole m-by-m matrix Q:
call ?orgbr('Q', m, m, n, a ... )
(note that the array a must have at least m columns).
To form the n leading columns of Q if m > n:
call ?orgbr('Q', m, n, n, a ... )
To compute the whole n-by-n matrix P^T:
call ?orgbr('P', n, n, m, a ... )
(note that the array a must have at least n rows).
To form the m leading rows of P^T if m < n:
call ?orgbr('P', m, n, m, a ... )

Input Parameters
The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

vect  CHARACTER*1. Must be 'Q' or 'P'.
  If vect = 'Q', the routine generates the matrix Q.
  If vect = 'P', the routine generates the matrix P^T.
m, n  INTEGER. The number of rows (m) and columns (n) in the matrix Q or P^T to be returned (m ≥ 0, n ≥ 0).
  If vect = 'Q', m ≥ n ≥ min(m, k).
  If vect = 'P', n ≥ m ≥ min(n, k).
k  INTEGER.
  If vect = 'Q', the number of columns in the original m-by-k matrix reduced by gebrd.
  If vect = 'P', the number of rows in the original k-by-n matrix reduced by gebrd.
a  REAL for sorgbr
  DOUBLE PRECISION for dorgbr
  The vectors which define the elementary reflectors, as returned by gebrd.
lda  INTEGER. The leading dimension of the array a. lda ≥ max(1, m).
tau  REAL for sorgbr
  DOUBLE PRECISION for dorgbr
  Array, DIMENSION min(m,k) if vect = 'Q', min(n,k) if vect = 'P'. Scalar factor of the elementary reflector H(i) or G(i), which determines Q and P^T as returned by gebrd in the array tauq or taup.
work  REAL for sorgbr
  DOUBLE PRECISION for dorgbr
  Workspace array, DIMENSION max(1, lwork).
lwork  INTEGER. Dimension of the array work. See Application Notes for the suggested value of lwork.
  If lwork = -1, then the routine performs a workspace query and calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla.

Output Parameters
a  Overwritten by the orthogonal matrix Q or P^T (or the leading rows or columns thereof) as specified by vect, m, and n.
work(1)  If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs.
info  INTEGER.
  If info = 0, the execution is successful.
  If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine orgbr interface are the following:
a  Holds the matrix A of size (m,n).
tau  Holds the vector of length min(m,k) where
  k = m, if vect = 'P',
  k = n, if vect = 'Q'.
vect  Must be 'Q' or 'P'. The default value is 'Q'.

Application Notes
For better performance, try using lwork = min(m,n)*blocksize, where blocksize is a machine-dependent value (typically, 16 to 64) required for optimum performance of the blocked algorithm.
If you are in doubt how much workspace to supply, use a generous value of lwork for the first run or set lwork = -1.
If you choose the first option and set any admissible lwork size, that is, one no less than the minimal value described, the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the corresponding array work on exit. Use this value (work(1)) for subsequent runs.
If you set lwork = -1, the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work). This operation is called a workspace query.
Note that if you set lwork to less than the minimal required value and not -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.

The computed matrix Q differs from an exactly orthogonal matrix by a matrix E such that ||E||_2 = O(ε).
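The orthogonality claim above can be spot-checked on a generated Q. A minimal sketch, assuming a real matrix stored column-major with leading dimension ldq; the name orthogonality_error is hypothetical, and the largest absolute entry of Q^T*Q - I is used as a cheap stand-in for the 2-norm of the deviation named in the text.

```c
#include <stddef.h>

/* Largest absolute entry of Q^T*Q - I for a real m-by-n matrix q
   (column-major, leading dimension ldq >= m). Hypothetical helper. */
double orthogonality_error(int m, int n, const double *q, int ldq)
{
    double worst = 0.0;
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            double dot = 0.0;
            for (int r = 0; r < m; ++r)
                dot += q[r + (size_t)i * ldq] * q[r + (size_t)j * ldq];
            double diff = dot - ((i == j) ? 1.0 : 0.0);
            if (diff < 0.0) diff = -diff;
            if (diff > worst) worst = diff;
        }
    }
    return worst;
}
```

For a matrix returned by ?orgbr the result should be a small multiple of the machine precision; a value near 1 usually points to a wrong vect, dimension, or tau argument.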
The approximate numbers of floating-point operations for the cases listed in Description are as follows:
To form the whole of Q:
(4/3)*n*(3*m^2 - 3*m*n + n^2) if m > n;
(4/3)*m^3 if m ≤ n.
To form the n leading columns of Q when m > n:
(2/3)*n^2*(3*m - n).
To form the whole of P^T:
(4/3)*n^3 if m ≥ n;
(4/3)*m*(3*n^2 - 3*m*n + m^2) if m < n.
To form the m leading rows of P^T when m < n:
(2/3)*m^2*(3*n - m).

The complex counterpart of this routine is ungbr.

?ormbr
Multiplies an arbitrary real matrix by the real orthogonal matrix Q or P^T determined by ?gebrd.

Syntax

Fortran 77:
call sormbr(vect, side, trans, m, n, k, a, lda, tau, c, ldc, work, lwork, info)
call dormbr(vect, side, trans, m, n, k, a, lda, tau, c, ldc, work, lwork, info)

Fortran 95:
call ormbr(a, tau, c [,vect] [,side] [,trans] [,info])

C:
lapack_int LAPACKE_<?>ormbr( int matrix_order, char vect, char side, char trans, lapack_int m, lapack_int n, lapack_int k, const <datatype>* a, lapack_int lda, const <datatype>* tau, <datatype>* c, lapack_int ldc );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
Given an arbitrary real matrix C, this routine forms one of the matrix products Q*C, Q^T*C, C*Q, C*Q^T, P*C, P^T*C, C*P, C*P^T, where Q and P are orthogonal matrices computed by a call to gebrd. The routine overwrites the product on C.

Input Parameters
The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

In the descriptions below, r denotes the order of Q or P^T:
If side = 'L', r = m; if side = 'R', r = n.

vect  CHARACTER*1. Must be 'Q' or 'P'.
  If vect = 'Q', then Q or Q^T is applied to C.
  If vect = 'P', then P or P^T is applied to C.
side  CHARACTER*1. Must be 'L' or 'R'.
  If side = 'L', multipliers are applied to C from the left.
  If side = 'R', they are applied to C from the right.
trans  CHARACTER*1. Must be 'N' or 'T'.
  If trans = 'N', then Q or P is applied to C.
  If trans = 'T', then Q^T or P^T is applied to C.
m  INTEGER. The number of rows in C.
n  INTEGER. The number of columns in C.
k  INTEGER. One of the dimensions of A in ?gebrd:
  If vect = 'Q', the number of columns in A;
  If vect = 'P', the number of rows in A.
  Constraints: m ≥ 0, n ≥ 0, k ≥ 0.
a, c, work  REAL for sormbr
  DOUBLE PRECISION for dormbr.
  Arrays:
  a(lda,*) is the array a as returned by ?gebrd. Its second dimension must be at least max(1, min(r,k)) for vect = 'Q', or max(1, r) for vect = 'P'.
  c(ldc,*) holds the matrix C. Its second dimension must be at least max(1, n).
  work is a workspace array, its dimension max(1, lwork).
lda  INTEGER. The leading dimension of a. Constraints:
  lda ≥ max(1, r) if vect = 'Q';
  lda ≥ max(1, min(r,k)) if vect = 'P'.
ldc  INTEGER. The leading dimension of c; ldc ≥ max(1, m).
tau  REAL for sormbr
  DOUBLE PRECISION for dormbr.
  Array, DIMENSION at least max(1, min(r, k)).
  For vect = 'Q', the array tauq as returned by ?gebrd. For vect = 'P', the array taup as returned by ?gebrd.
lwork  INTEGER. The size of the work array. Constraints:
  lwork ≥ max(1, n) if side = 'L';
  lwork ≥ max(1, m) if side = 'R'.
  If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla.
  See Application Notes for the suggested value of lwork.

Output Parameters
c  Overwritten by the product Q*C, Q^T*C, C*Q, C*Q^T, P*C, P^T*C, C*P, or C*P^T, as specified by vect, side, and trans.
work(1)  If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs.
info  INTEGER.
  If info = 0, the execution is successful.
  If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine ormbr interface are the following:
a  Holds the matrix A of size (r,min(nq,k)) where
  r = nq, if vect = 'Q',
  r = min(nq,k), if vect = 'P',
  nq = m, if side = 'L',
  nq = n, if side = 'R',
  k = m, if vect = 'P',
  k = n, if vect = 'Q'.
tau  Holds the vector of length min(nq,k).
c  Holds the matrix C of size (m,n).
vect  Must be 'Q' or 'P'. The default value is 'Q'.
side  Must be 'L' or 'R'. The default value is 'L'.
trans  Must be 'N' or 'T'. The default value is 'N'.

Application Notes
For better performance, try using
lwork = n*blocksize for side = 'L', or
lwork = m*blocksize for side = 'R',
where blocksize is a machine-dependent value (typically, 16 to 64) required for optimum performance of the blocked algorithm.
If you are in doubt how much workspace to supply, use a generous value of lwork for the first run or set lwork = -1.
If you choose the first option and set any admissible lwork size, that is, one no less than the minimal value described, the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the corresponding array work on exit. Use this value (work(1)) for subsequent runs.
If you set lwork = -1, the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work). This operation is called a workspace query.
Note that if you set lwork to less than the minimal required value and not -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.

The computed product differs from the exact product by a matrix E such that ||E||_2 = O(ε)*||C||_2.

The total number of floating-point operations is approximately
2*n*k*(2*m - k) if side = 'L' and m ≥ k;
2*m*k*(2*n - k) if side = 'R' and n ≥ k;
2*m^2*n if side = 'L' and m < k;
2*n^2*m if side = 'R' and n < k.

The complex counterpart of this routine is unmbr.

?ungbr
Generates the complex unitary matrix Q or P^H determined by ?gebrd.

Syntax

Fortran 77:
call cungbr(vect, m, n, k, a, lda, tau, work, lwork, info)
call zungbr(vect, m, n, k, a, lda, tau, work, lwork, info)

Fortran 95:
call ungbr(a, tau [,vect] [,info])

C:
lapack_int LAPACKE_<?>ungbr( int matrix_order, char vect, lapack_int m, lapack_int n, lapack_int k, <datatype>* a, lapack_int lda, const <datatype>* tau );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine generates the whole or part of the unitary matrices Q and P^H formed by the routine gebrd. Use this routine after a call to cgebrd/zgebrd. All valid combinations of arguments are described in Input Parameters; in most cases you need the following:

To compute the whole m-by-m matrix Q, use:
call ?ungbr('Q', m, m, n, a ... )
(note that the array a must have at least m columns).
To form the n leading columns of Q if m > n, use:
call ?ungbr('Q', m, n, n, a ... )
To compute the whole n-by-n matrix P^H, use:
call ?ungbr('P', n, n, m, a ... )
(note that the array a must have at least n rows).
To form the m leading rows of P^H if m < n, use:
call ?ungbr('P', m, n, m, a ... )

Input Parameters
The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above.
See the C Interface Conventions section for the C interface principal conventions and type definitions.

vect  CHARACTER*1. Must be 'Q' or 'P'.
  If vect = 'Q', the routine generates the matrix Q.
  If vect = 'P', the routine generates the matrix P^H.
m  INTEGER. The number of required rows of Q or P^H.
n  INTEGER. The number of required columns of Q or P^H.
k  INTEGER. One of the dimensions of A in ?gebrd:
  If vect = 'Q', the number of columns in A;
  If vect = 'P', the number of rows in A.
  Constraints: m ≥ 0, n ≥ 0, k ≥ 0.
  For vect = 'Q': k ≤ n ≤ m if m > k, or m = n if m ≤ k.
  For vect = 'P': k ≤ m ≤ n if n > k, or m = n if n ≤ k.
a, work  COMPLEX for cungbr
  DOUBLE COMPLEX for zungbr.
  Arrays:
  a(lda,*) is the array a as returned by ?gebrd. The second dimension of a must be at least max(1, n).
  work is a workspace array, its dimension max(1, lwork).
lda  INTEGER. The leading dimension of a; at least max(1, m).
tau  COMPLEX for cungbr
  DOUBLE COMPLEX for zungbr.
  For vect = 'Q', the array tauq as returned by ?gebrd. For vect = 'P', the array taup as returned by ?gebrd.
  The dimension of tau must be at least max(1, min(m, k)) for vect = 'Q', or max(1, min(n, k)) for vect = 'P'.
lwork  INTEGER. The size of the work array.
  Constraint: lwork ≥ max(1, min(m, n)).
  If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla.
  See Application Notes for the suggested value of lwork.

Output Parameters
a  Overwritten by the unitary matrix Q or P^H (or the leading rows or columns thereof) as specified by vect, m, and n.
work(1)  If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs.
info  INTEGER.
  If info = 0, the execution is successful.
  If info = -i, the i-th parameter had an illegal value.
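Because an illegal argument is reported only through info = -i, it can help to validate the dimension triple before the call. A minimal sketch, reading the constraints on m, n, and k as m ≥ n ≥ min(m, k) for vect = 'Q' and n ≥ m ≥ min(n, k) for vect = 'P' (the form given for the real counterpart orgbr); the helper name ungbr_dims_ok is hypothetical and is not part of the library.

```c
static int min_int(int a, int b) { return (a < b) ? a : b; }

/* Returns 1 if (vect, m, n, k) satisfy the assumed dimension constraints
   for ?ungbr, and 0 otherwise. Hypothetical helper for illustration. */
int ungbr_dims_ok(char vect, int m, int n, int k)
{
    if (m < 0 || n < 0 || k < 0) return 0;
    if (vect == 'Q' || vect == 'q')
        return m >= n && n >= min_int(m, k);
    if (vect == 'P' || vect == 'p')
        return n >= m && m >= min_int(n, k);
    return 0; /* vect must be 'Q' or 'P' */
}
```

For a 5-by-3 matrix reduced by ?gebrd, both ('Q', 5, 5, 3) and ('Q', 5, 3, 3) pass, matching the "whole Q" and "leading columns" calls in the Description, while ('Q', 5, 2, 3) is rejected.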
Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine ungbr interface are the following:
a  Holds the matrix A of size (m,n).
tau  Holds the vector of length min(m,k) where
  k = m, if vect = 'P',
  k = n, if vect = 'Q'.
vect  Must be 'Q' or 'P'. The default value is 'Q'.

Application Notes
For better performance, try using lwork = min(m,n)*blocksize, where blocksize is a machine-dependent value (typically, 16 to 64) required for optimum performance of the blocked algorithm.
If it is not clear how much workspace to supply, use a generous value of lwork for the first run, or set lwork = -1.
In the first case the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the corresponding array work on exit. Use this value (work(1)) for subsequent runs.
If lwork = -1, then the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work). This operation is called a workspace query.
Note that if lwork is less than the minimal required value and is not equal to -1, then the routine returns immediately with an error exit and does not provide any information on the recommended workspace.

The computed matrix Q differs from an exactly unitary matrix by a matrix E such that ||E||_2 = O(ε).

The approximate numbers of floating-point operations are listed below:
To compute the whole matrix Q:
(16/3)*n*(3*m^2 - 3*m*n + n^2) if m > n;
(16/3)*m^3 if m ≤ n.
To form the n leading columns of Q when m > n:
(8/3)*n^2*(3*m - n).
To compute the whole matrix P^H:
(16/3)*n^3 if m ≥ n;
(16/3)*m*(3*n^2 - 3*m*n + m^2) if m < n.
To form the m leading rows of P^H when m < n:
(8/3)*m^2*(3*n - m).
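The operation counts for ?ungbr can be wrapped in small functions for cost estimates. A minimal sketch with hypothetical function names; it assumes the partial-matrix counts are (8/3)*n^2*(3*m - n) for the n leading columns of Q and (8/3)*m^2*(3*n - m) for the m leading rows of P^H, which are four times the real-case counts and agree with the whole-matrix formulas at m = n.

```c
/* Approximate flop counts for ?ungbr (complex flavors). Hypothetical helpers;
   arguments are doubles so that large sizes do not overflow an int. */
double ungbr_flops_whole_q(double m, double n)
{
    if (m > n)
        return (16.0 / 3.0) * n * (3.0 * m * m - 3.0 * m * n + n * n);
    return (16.0 / 3.0) * m * m * m;            /* m <= n */
}

double ungbr_flops_leading_q(double m, double n)   /* intended for m > n */
{
    return (8.0 / 3.0) * n * n * (3.0 * m - n);
}

double ungbr_flops_whole_ph(double m, double n)
{
    if (m < n)
        return (16.0 / 3.0) * m * (3.0 * n * n - 3.0 * m * n + m * m);
    return (16.0 / 3.0) * n * n * n;            /* m >= n */
}

double ungbr_flops_leading_ph(double m, double n)  /* intended for m < n */
{
    return (8.0 / 3.0) * m * m * (3.0 * n - m);
}
```

Comparing ungbr_flops_whole_q with ungbr_flops_leading_q for a tall matrix shows how much work is saved when only the first n columns of Q are needed.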
The real counterpart of this routine is orgbr.

?unmbr
Multiplies an arbitrary complex matrix by the unitary matrix Q or P determined by ?gebrd.

Syntax

Fortran 77:
call cunmbr(vect, side, trans, m, n, k, a, lda, tau, c, ldc, work, lwork, info)
call zunmbr(vect, side, trans, m, n, k, a, lda, tau, c, ldc, work, lwork, info)

Fortran 95:
call unmbr(a, tau, c [,vect] [,side] [,trans] [,info])

C:
lapack_int LAPACKE_<?>unmbr( int matrix_order, char vect, char side, char trans, lapack_int m, lapack_int n, lapack_int k, const <datatype>* a, lapack_int lda, const <datatype>* tau, <datatype>* c, lapack_int ldc );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
Given an arbitrary complex matrix C, this routine forms one of the matrix products Q*C, Q^H*C, C*Q, C*Q^H, P*C, P^H*C, C*P, or C*P^H, where Q and P are unitary matrices computed by a call to gebrd. The routine overwrites the product on C.

Input Parameters
The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

In the descriptions below, r denotes the order of Q or P^H:
If side = 'L', r = m; if side = 'R', r = n.

vect  CHARACTER*1. Must be 'Q' or 'P'.
  If vect = 'Q', then Q or Q^H is applied to C.
  If vect = 'P', then P or P^H is applied to C.
side  CHARACTER*1. Must be 'L' or 'R'.
  If side = 'L', multipliers are applied to C from the left.
  If side = 'R', they are applied to C from the right.
trans  CHARACTER*1. Must be 'N' or 'C'.
  If trans = 'N', then Q or P is applied to C.
  If trans = 'C', then Q^H or P^H is applied to C.
m  INTEGER. The number of rows in C.
n  INTEGER. The number of columns in C.
k  INTEGER. One of the dimensions of A in ?gebrd:
  If vect = 'Q', the number of columns in A;
  If vect = 'P', the number of rows in A.
Constraints: m ≥ 0, n ≥ 0, k ≥ 0.
a, c, work COMPLEX for cunmbr; DOUBLE COMPLEX for zunmbr. Arrays:
a(lda,*) is the array a as returned by ?gebrd. Its second dimension must be at least max(1, min(r,k)) for vect = 'Q', or max(1, r) for vect = 'P'.
c(ldc,*) holds the matrix C. Its second dimension must be at least max(1, n).
work is a workspace array of dimension max(1, lwork).
lda INTEGER. The leading dimension of a. Constraints: lda ≥ max(1, r) if vect = 'Q'; lda ≥ max(1, min(r,k)) if vect = 'P'.
ldc INTEGER. The leading dimension of c; ldc ≥ max(1, m).
tau COMPLEX for cunmbr; DOUBLE COMPLEX for zunmbr. Array, DIMENSION at least max(1, min(r, k)). For vect = 'Q', the array tauq as returned by ?gebrd. For vect = 'P', the array taup as returned by ?gebrd.
lwork INTEGER. The size of the work array. Constraints: lwork ≥ max(1, n) if side = 'L'; lwork ≥ max(1, m) if side = 'R'; lwork ≥ 1 if n = 0 or m = 0. For optimum performance, lwork ≥ max(1, n*nb) if side = 'L', and lwork ≥ max(1, m*nb) if side = 'R', where nb is the optimal blocksize (nb = 0 if m = 0 or n = 0). If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla. See Application Notes for the suggested value of lwork.

Output Parameters

c Overwritten by the product Q*C, Q^H*C, C*Q, C*Q^H, P*C, P^H*C, C*P, or C*P^H, as specified by vect, side, and trans.
work(1) If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes

Routines in the Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts.
For the general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.

Specific details for the routine unmbr interface are the following:

a Holds the matrix A of size (r,min(nq,k)), where
r = nq if vect = 'Q'; r = min(nq,k) if vect = 'P';
nq = m if side = 'L'; nq = n if side = 'R';
k = m if vect = 'P'; k = n if vect = 'Q'.
tau Holds the vector of length min(nq,k).
c Holds the matrix C of size (m,n).
vect Must be 'Q' or 'P'. The default value is 'Q'.
side Must be 'L' or 'R'. The default value is 'L'.
trans Must be 'N' or 'C'. The default value is 'N'.

Application Notes

For better performance, use lwork = n*blocksize for side = 'L', or lwork = m*blocksize for side = 'R', where blocksize is a machine-dependent value (typically, 16 to 64) required for optimum performance of the blocked algorithm.

If it is not clear how much workspace to supply, use a generous value of lwork for the first run, or set lwork = -1.

In the first case, the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the array work on exit. Use this value (work(1)) for subsequent runs.

If lwork = -1, the routine returns immediately and provides the recommended workspace in the first element of the array work. This operation is called a workspace query.

Note that if lwork is less than the minimal required value and is not equal to -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.

The computed product differs from the exact product by a matrix E such that $\|E\|_2 = O(\varepsilon)\,\|C\|_2$, where $\varepsilon$ is the machine precision.

The total number of floating-point operations is approximately
$8nk(2m - k)$ if side = 'L' and m ≥ k;
$8mk(2n - k)$ if side = 'R' and n ≥ k;
$8m^2 n$ if side = 'L' and m < k;
$8n^2 m$ if side = 'R' and n < k.
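The workspace and operation-count rules for ?unmbr can be evaluated ahead of a call. The following C sketch does only that arithmetic; it does not call Intel MKL. The function names and the blocksize argument are hypothetical. It assumes that the stated workspace sizes are lower bounds (max(1, n) for side = 'L', max(1, m) for side = 'R') and that the first two operation-count cases apply when m ≥ k and n ≥ k, respectively.

```c
#include <assert.h>

/* Hypothetical helpers, not part of Intel MKL. They only evaluate the
   formulas quoted in the ?unmbr description. */

/* Smallest accepted lwork: max(1, n) for side = 'L', max(1, m) for 'R'. */
long unmbr_min_lwork(char side, long m, long n)
{
    assert(side == 'L' || side == 'R');
    assert(m >= 0 && n >= 0);
    long dim = (side == 'L') ? n : m;
    return dim > 1 ? dim : 1;
}

/* Suggested lwork: n*blocksize ('L') or m*blocksize ('R'), never below the
   minimum. blocksize is a caller-supplied guess, typically 16 to 64. */
long unmbr_suggested_lwork(char side, long m, long n, long blocksize)
{
    assert(blocksize >= 1);
    long min_lwork = unmbr_min_lwork(side, m, n);
    long dim = (side == 'L') ? n : m;
    long want = dim * blocksize;
    return want > min_lwork ? want : min_lwork;
}

/* Approximate floating-point operation count from the Application Notes. */
double unmbr_flops(char side, long m, long n, long k)
{
    assert(side == 'L' || side == 'R');
    double dm = (double)m, dn = (double)n, dk = (double)k;
    if (side == 'L')
        return (m >= k) ? 8.0 * dn * dk * (2.0 * dm - dk)
                        : 8.0 * dm * dm * dn;
    return (n >= k) ? 8.0 * dm * dk * (2.0 * dn - dk)
                    : 8.0 * dn * dn * dm;
}
```

In practice the value returned by a workspace query (lwork = -1) is the better guide, because the optimal blocksize depends on the machine; the helpers above are only useful for a first allocation or a rough cost estimate.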
The real counterpart of this routine is ormbr.

?bdsqr

Computes the singular value decomposition of a general matrix that has been reduced to bidiagonal form.

Syntax

Fortran 77:
call sbdsqr(uplo, n, ncvt, nru, ncc, d, e, vt, ldvt, u, ldu, c, ldc, work, info)
call dbdsqr(uplo, n, ncvt, nru, ncc, d, e, vt, ldvt, u, ldu, c, ldc, work, info)
call cbdsqr(uplo, n, ncvt, nru, ncc, d, e, vt, ldvt, u, ldu, c, ldc, work, info)
call zbdsqr(uplo, n, ncvt, nru, ncc, d, e, vt, ldvt, u, ldu, c, ldc, work, info)

Fortran 95:
call rbdsqr(d, e [,vt] [,u] [,c] [,uplo] [,info])
call bdsqr(d, e [,vt] [,u] [,c] [,uplo] [,info])

C:
lapack_int LAPACKE_sbdsqr( int matrix_order, char uplo, lapack_int n, lapack_int ncvt, lapack_int nru, lapack_int ncc, float* d, float* e, float* vt, lapack_int ldvt, float* u, lapack_int ldu, float* c, lapack_int ldc );
lapack_int LAPACKE_dbdsqr( int matrix_order, char uplo, lapack_int n, lapack_int ncvt, lapack_int nru, lapack_int ncc, double* d, double* e, double* vt, lapack_int ldvt, double* u, lapack_int ldu, double* c, lapack_int ldc );
lapack_int LAPACKE_cbdsqr( int matrix_order, char uplo, lapack_int n, lapack_int ncvt, lapack_int nru, lapack_int ncc, float* d, float* e, lapack_complex_float* vt, lapack_int ldvt, lapack_complex_float* u, lapack_int ldu, lapack_complex_float* c, lapack_int ldc );
lapack_int LAPACKE_zbdsqr( int matrix_order, char uplo, lapack_int n, lapack_int ncvt, lapack_int nru, lapack_int ncc, double* d, double* e, lapack_complex_double* vt, lapack_int ldvt, lapack_complex_double* u, lapack_int ldu, lapack_complex_double* c, lapack_int ldc );

Include Files

• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description

The routine computes the singular values and, optionally, the right and/or left singular vectors from the Singular Value Decomposition (SVD) of a real n-by-n (upper or lower) bidiagonal matrix B using the implicit zero-shift QR algorithm.
The SVD of B has the form B = Q*S*P^H, where S is the diagonal matrix of singular values, Q is an orthogonal matrix of left singular vectors, and P is an orthogonal matrix of right singular vectors. If left singular vectors are requested, this subroutine actually returns U*Q instead of Q, and, if right singular vectors are requested, this subroutine returns P^H*VT instead of P^H, for given real/complex input matrices U and VT. When U and VT are the orthogonal/unitary matrices that reduce a general matrix A to bidiagonal form: A = U*B*VT, as computed by ?gebrd, then A = (U*Q)*S*(P^H*VT) is the SVD of A. Optionally, the subroutine may also compute Q^H*C for a given real/complex input matrix C.

See also lasq1, lasq2, lasq3, lasq4, lasq5, lasq6 used by this routine.

Input Parameters

The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

uplo CHARACTER*1. Must be 'U' or 'L'. If uplo = 'U', B is an upper bidiagonal matrix. If uplo = 'L', B is a lower bidiagonal matrix.
n INTEGER. The order of the matrix B (n ≥ 0).
ncvt INTEGER. The number of columns of the matrix VT, that is, the number of right singular vectors (ncvt ≥ 0). Set ncvt = 0 if no right singular vectors are required.
nru INTEGER. The number of rows in U, that is, the number of left singular vectors (nru ≥ 0). Set nru = 0 if no left singular vectors are required.
ncc INTEGER. The number of columns in the matrix C used for computing the product Q^H*C (ncc ≥ 0). Set ncc = 0 if no matrix C is supplied.
d, e, work REAL for single-precision flavors; DOUBLE PRECISION for double-precision flavors. Arrays:
d(*) contains the diagonal elements of B. The dimension of d must be at least max(1, n).
e(*) contains the (n-1) off-diagonal elements of B.
The dimension of e must be at least max(1, n).
work(*) is a workspace array. The dimension of work must be at least max(1, 4*n).
vt, u, c REAL for sbdsqr; DOUBLE PRECISION for dbdsqr; COMPLEX for cbdsqr; DOUBLE COMPLEX for zbdsqr. Arrays:
vt(ldvt,*) contains an n-by-ncvt matrix VT. The second dimension of vt must be at least max(1, ncvt). vt is not referenced if ncvt = 0.
u(ldu,*) contains an nru-by-n unit matrix U. The second dimension of u must be at least max(1, n). u is not referenced if nru = 0.
c(ldc,*) contains the matrix C for computing the product Q^H*C. The second dimension of c must be at least max(1, ncc). The array is not referenced if ncc = 0.
ldvt INTEGER. The leading dimension of vt. Constraints: ldvt ≥ max(1, n) if ncvt > 0; ldvt ≥ 1 if ncvt = 0.
ldu INTEGER. The leading dimension of u. Constraint: ldu ≥ max(1, nru).
ldc INTEGER. The leading dimension of c. Constraints: ldc ≥ max(1, n) if ncc > 0; ldc ≥ 1 otherwise.

Output Parameters

d On exit, if info = 0, overwritten by the singular values in decreasing order (see info).
e On exit, if info = 0, e is destroyed. See also info below.
c Overwritten by the product Q^H*C.
vt On exit, this array is overwritten by P^H*VT.
u On exit, this array is overwritten by U*Q.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value. If info > 0:
If ncvt = nru = ncc = 0,
• info = 1, a split was marked by a positive value in e
• info = 2, the current block of z was not diagonalized after 30*n iterations (in the inner while loop)
• info = 3, the termination criterion of the outer while loop was not met (the program created more than n unreduced blocks).
In all other cases (when at least one of ncvt, nru, or ncc is nonzero), the algorithm did not converge; d and e contain the elements of a bidiagonal matrix that is orthogonally similar to the input matrix B; if info = i, i elements of e have not converged to zero.
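Because the meaning of a positive info value depends on whether any vectors were requested, a caller may want to decode it in one place. The C sketch below shows one way to do that. The enum and function names are hypothetical and are not part of Intel MKL. It assumes that the second group of positive values applies whenever at least one of ncvt, nru, or ncc is nonzero.

```c
#include <assert.h>

/* Hypothetical status codes for interpreting the info value of ?bdsqr. */
typedef enum {
    BDSQR_OK,                    /* info = 0 */
    BDSQR_BAD_ARGUMENT,          /* info = -i: the i-th parameter is illegal */
    BDSQR_SPLIT_MARKED,          /* values only, info = 1 */
    BDSQR_BLOCK_NOT_DIAGONALIZED,/* values only, info = 2 */
    BDSQR_TOO_MANY_BLOCKS,       /* values only, info = 3 */
    BDSQR_NOT_CONVERGED,         /* vectors or C requested, info = i > 0 */
    BDSQR_UNKNOWN                /* positive value not covered by the text */
} bdsqr_status;

/* Classify info; ncvt, nru, ncc are the values passed to the routine. */
bdsqr_status bdsqr_classify_info(int info, int ncvt, int nru, int ncc)
{
    assert(ncvt >= 0 && nru >= 0 && ncc >= 0);
    if (info == 0)
        return BDSQR_OK;
    if (info < 0)
        return BDSQR_BAD_ARGUMENT;
    if (ncvt == 0 && nru == 0 && ncc == 0) {
        switch (info) {
        case 1:  return BDSQR_SPLIT_MARKED;
        case 2:  return BDSQR_BLOCK_NOT_DIAGONALIZED;
        case 3:  return BDSQR_TOO_MANY_BLOCKS;
        default: return BDSQR_UNKNOWN;
        }
    }
    /* Here info counts the elements of e that did not converge to zero. */
    return BDSQR_NOT_CONVERGED;
}
```

For BDSQR_BAD_ARGUMENT the offending argument position is -info, and for BDSQR_NOT_CONVERGED the value of info itself is the number of unconverged elements of e.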
Fortran 95 Interface Notes

Routines in the Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For the general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.

Specific details for the routine bdsqr interface are the following:

d Holds the vector of length (n).
e Holds the vector of length (n).
vt Holds the matrix VT of size (n, ncvt).
u Holds the matrix U of size (nru,n).
c Holds the matrix C of size (n,ncc).
uplo Must be 'U' or 'L'. The default value is 'U'.
ncvt If argument vt is present, then ncvt is equal to the number of columns in matrix VT; otherwise, ncvt is set to zero.
nru If argument u is present, then nru is equal to the number of rows in matrix U; otherwise, nru is set to zero.
ncc If argument c is present, then ncc is equal to the number of columns in matrix C; otherwise, ncc is set to zero.

Note that two variants of the Fortran 95 interface for the bdsqr routine are needed because the choice between the real and complex cases is ambiguous when vt, u, and c are omitted. Thus, the name rbdsqr is used in real cases (single or double precision), and the name bdsqr is used in complex cases (single or double precision).

Application Notes

Each singular value and singular vector is computed to high relative accuracy. However, the reduction to bidiagonal form (prior to calling the routine) may decrease the relative accuracy in the small singular values of the original matrix if its singular values vary widely in magnitude.

If $\sigma_i$ is an exact singular value of B, and $s_i$ is the corresponding computed value, then

$|s_i - \sigma_i| \le p(m,n)\,\varepsilon\,\sigma_i$

where p(m, n) is a modestly increasing function of m and n, and $\varepsilon$ is the machine precision.

If only singular values are computed, they are computed more accurately than when some singular vectors are also computed (that is, the function p(m, n) is smaller).
If $u_i$ is the corresponding exact left singular vector of B, and $w_i$ is the corresponding computed left singular vector, then the angle $\theta(u_i, w_i)$ between them is bounded as follows:

$\theta(u_i, w_i) \le p(m,n)\,\varepsilon \,/\, \min_{i \ne j}\left(|\sigma_i - \sigma_j| / |\sigma_i + \sigma_j|\right)$.

Here $\min_{i \ne j}\left(|\sigma_i - \sigma_j| / |\sigma_i + \sigma_j|\right)$ is the relative gap between $\sigma_i$ and the other singular values. A similar error bound holds for the right singular vectors.

The total number of real floating-point operations is roughly proportional to $n^2$ if only the singular values are computed. About $6n^2 \cdot \text{nru}$ additional operations ($12n^2 \cdot \text{nru}$ for complex flavors) are required to compute the left singular vectors and about $6n^2 \cdot \text{ncvt}$ operations ($12n^2 \cdot \text{ncvt}$ for complex flavors) to compute the right singular vectors.

?bdsdc

Computes the singular value decomposition of a real bidiagonal matrix using a divide and conquer method.

Syntax

Fortran 77:
call sbdsdc(uplo, compq, n, d, e, u, ldu, vt, ldvt, q, iq, work, iwork, info)
call dbdsdc(uplo, compq, n, d, e, u, ldu, vt, ldvt, q, iq, work, iwork, info)

Fortran 95:
call bdsdc(d, e [,u] [,vt] [,q] [,iq] [,uplo] [,info])

C:
lapack_int LAPACKE_<?>bdsdc( int matrix_order, char uplo, char compq, lapack_int n, <datatype>* d, <datatype>* e, <datatype>* u, lapack_int ldu, <datatype>* vt, lapack_int ldvt, <datatype>* q, lapack_int* iq );

Include Files

• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description

The routine computes the Singular Value Decomposition (SVD) of a real n-by-n (upper or lower) bidiagonal matrix B: B = U*S*V^T, using a divide and conquer method, where S is a diagonal matrix with non-negative diagonal elements (the singular values of B), and U and V are orthogonal matrices of left and right singular vectors, respectively. ?bdsdc can be used to compute all singular values, and optionally, singular vectors or singular vectors in compact form.

This routine uses ?lasd0, ?lasd1, ?lasd2, ?lasd3, ?lasd4, ?lasd5, ?lasd6, ?lasd7, ?lasd8, ?lasd9, ?lasda, ?lasdq, ?lasdt.

Input Parameters

The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

uplo CHARACTER*1. Must be 'U' or 'L'. If uplo = 'U', B is an upper bidiagonal matrix. If uplo = 'L', B is a lower bidiagonal matrix.
compq CHARACTER*1. Must be 'N', 'P', or 'I'. If compq = 'N', compute singular values only. If compq = 'P', compute singular values and compute singular vectors in compact form. If compq = 'I', compute singular values and singular vectors.
n INTEGER. The order of the matrix B (n ≥ 0).
d, e, work REAL for sbdsdc; DOUBLE PRECISION for dbdsdc. Arrays:
d(*) contains the n diagonal elements of the bidiagonal matrix B. The dimension of d must be at least max(1, n).
e(*) contains the off-diagonal elements of the bidiagonal matrix B. The dimension of e must be at least max(1, n).
work(*) is a workspace array. The dimension of work must be at least: max(1, 4*n), if compq = 'N'; max(1, 6*n), if compq = 'P'; max(1, 3*n^2+4*n), if compq = 'I'.
ldu INTEGER. The leading dimension of the output array u; ldu ≥ 1. If singular vectors are desired, then ldu ≥ max(1, n).
ldvt INTEGER. The leading dimension of the output array vt; ldvt ≥ 1. If singular vectors are desired, then ldvt ≥ max(1, n).
iwork INTEGER. Workspace array, dimension at least max(1, 8*n).

Output Parameters

d If info = 0, overwritten by the singular values of B.
e On exit, e is overwritten.
u, vt, q REAL for sbdsdc; DOUBLE PRECISION for dbdsdc. Arrays: u(ldu,*), vt(ldvt,*), q(*).
If compq = 'I', then on exit u contains the left singular vectors of the bidiagonal matrix B, unless info ≠ 0 (see info). For other values of compq, u is not referenced. The second dimension of u must be at least max(1,n).
If compq = 'I', then on exit vt^T contains the right singular vectors of the bidiagonal matrix B, unless info ≠ 0 (see info). For other values of compq, vt is not referenced. The second dimension of vt must be at least max(1,n).
If compq = 'P', then on exit, if info = 0, q and iq contain the left and right singular vectors in a compact form. Specifically, q contains all the REAL (for sbdsdc) or DOUBLE PRECISION (for dbdsdc) data for singular vectors. For other values of compq, q is not referenced. See Application Notes for details.
iq INTEGER. Array: iq(*). If compq = 'P', then on exit, if info = 0, q and iq contain the left and right singular vectors in a compact form. Specifically, iq contains all the INTEGER data for singular vectors. For other values of compq, iq is not referenced. See Application Notes for details.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value. If info = i, the algorithm failed to compute a singular value. The update process of divide and conquer failed.

Fortran 95 Interface Notes

Routines in the Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For the general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.

Specific details for the routine bdsdc interface are the following:

d Holds the vector of length n.
e Holds the vector of length n.
u Holds the matrix U of size (n,n).
vt Holds the matrix VT of size (n,n).
q Holds the vector of length (ldq), where ldq ≥ n*(11 + 2*smlsiz + 8*int(log_2(n/(smlsiz + 1)))) and smlsiz is returned by ilaenv and is equal to the maximum size of the subproblems at the bottom of the computation tree (usually about 25).
compq Restored based on the presence of arguments u, vt, q, and iq as follows:
compq = 'N', if none of u, vt, q, and iq are present,
compq = 'I', if both u and vt are present.
Arguments u and vt must either be both present or both omitted,
compq = 'P', if both q and iq are present. Arguments q and iq must either be both present or both omitted.
Note that there will be an error condition if all of u, vt, q, and iq arguments are present simultaneously.

See Also

?lasd0 ?lasd1 ?lasd2 ?lasd3 ?lasd4 ?lasd5 ?lasd6 ?lasd7 ?lasd8 ?lasd9 ?lasda ?lasdq ?lasdt

Symmetric Eigenvalue Problems

Symmetric eigenvalue problems are posed as follows: given an n-by-n real symmetric or complex Hermitian matrix A, find the eigenvalues $\lambda$ and the corresponding eigenvectors z that satisfy the equation

$Az = \lambda z$ (or, equivalently, $z^H A = \lambda z^H$).

In such eigenvalue problems, all n eigenvalues are real not only for real symmetric but also for complex Hermitian matrices A, and there exists an orthonormal system of n eigenvectors. If A is a symmetric or Hermitian positive-definite matrix, all eigenvalues are positive.

To solve a symmetric eigenvalue problem with LAPACK, you usually need to reduce the matrix to tridiagonal form and then solve the eigenvalue problem with the tridiagonal matrix obtained. LAPACK includes routines for reducing the matrix to a tridiagonal form by an orthogonal (or unitary) similarity transformation $A = QTQ^H$ as well as for solving tridiagonal symmetric eigenvalue problems. These routines (for FORTRAN 77 interface) are listed in Table "Computational Routines for Solving Symmetric Eigenvalue Problems". Respective routine names in Fortran 95 interface are without the first symbol (see Routine Naming Conventions).

There are different routines for symmetric eigenvalue problems, depending on whether you need all eigenvectors or only some of them or eigenvalues only, whether the matrix A is positive-definite or not, and so on.
These routines are based on three primary algorithms for computing eigenvalues and eigenvectors of symmetric problems: the divide and conquer algorithm, the QR algorithm, and bisection followed by inverse iteration. The divide and conquer algorithm is generally more efficient and is recommended for computing all eigenvalues and eigenvectors. Furthermore, to solve an eigenvalue problem using the divide and conquer algorithm, you need to call only one routine. In general, more than one routine has to be called if the QR algorithm or bisection followed by inverse iteration is used.

The decision tree in Figure "Decision Tree: Real Symmetric Eigenvalue Problems" will help you choose the right routine or sequence of routines for eigenvalue problems with real symmetric matrices. Figure "Decision Tree: Complex Hermitian Eigenvalue Problems" presents a similar decision tree for complex Hermitian matrices.

[Figure: Decision Tree: Real Symmetric Eigenvalue Problems]

[Figure: Decision Tree: Complex Hermitian Eigenvalue Problems]

Computational Routines for Solving Symmetric Eigenvalue Problems

Operation | Real symmetric matrices | Complex Hermitian matrices
----------|-------------------------|---------------------------
Reduce to tridiagonal form A = QTQ^H (full storage) | sytrd, syrdb | hetrd, herdb
Reduce to tridiagonal form A = QTQ^H (packed storage) | sptrd | hptrd
Reduce to tridiagonal form A = QTQ^H (band storage) | sbtrd | hbtrd
Generate matrix Q (full storage) | orgtr | ungtr
Generate matrix Q (packed storage) | opgtr | upgtr
Apply matrix Q (full storage) | ormtr | unmtr
Apply matrix Q (packed storage) | opmtr | upmtr
Find all eigenvalues of a tridiagonal matrix T | sterf |
Find all eigenvalues and eigenvectors of a tridiagonal matrix T | steqr, stedc | steqr, stedc
Find all eigenvalues and eigenvectors of a tridiagonal positive-definite matrix T | pteqr | pteqr
Find selected eigenvalues of a tridiagonal matrix T | stebz, stegr | stegr
Find selected eigenvectors of a tridiagonal matrix T | stein, stegr | stein, stegr
Find selected eigenvalues and eigenvectors of a real symmetric tridiagonal matrix T | stemr | stemr
Compute the reciprocal condition numbers for the eigenvectors | disna | disna

?sytrd

Reduces a real symmetric matrix to tridiagonal form.

Syntax

Fortran 77:
call ssytrd(uplo, n, a, lda, d, e, tau, work, lwork, info)
call dsytrd(uplo, n, a, lda, d, e, tau, work, lwork, info)

Fortran 95:
call sytrd(a, tau [,uplo] [,info])

C:
lapack_int LAPACKE_<?>sytrd( int matrix_order, char uplo, lapack_int n, <datatype>* a, lapack_int lda, <datatype>* d, <datatype>* e, <datatype>* tau );

Include Files

• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description

The routine reduces a real symmetric matrix A to symmetric tridiagonal form T by an orthogonal similarity transformation: A = Q*T*Q^T. The orthogonal matrix Q is not formed explicitly but is represented as a product of n-1 elementary reflectors. Routines are provided for working with Q in this representation (see Application Notes below).

This routine calls latrd to reduce a real symmetric matrix to tridiagonal form by an orthogonal similarity transformation.

Input Parameters

The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

uplo CHARACTER*1. Must be 'U' or 'L'. If uplo = 'U', a stores the upper triangular part of A. If uplo = 'L', a stores the lower triangular part of A.
n INTEGER. The order of the matrix A (n ≥ 0).
a, work REAL for ssytrd; DOUBLE PRECISION for dsytrd.
a(lda,*) is an array containing either upper or lower triangular part of the matrix A, as specified by uplo.
If uplo = 'U', the leading n-by-n upper triangular part of a contains the upper triangular part of the matrix A, and the strictly lower triangular part of A is not referenced. If uplo = 'L', the leading n-by-n lower triangular part of a contains the lower triangular part of the matrix A, and the strictly upper triangular part of A is not referenced. The second dimension of a must be at least max(1, n).
work is a workspace array of dimension max(1, lwork).
lda INTEGER. The leading dimension of a; at least max(1, n).
lwork INTEGER. The size of the work array (lwork ≥ n). If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla. See Application Notes for the suggested value of lwork.

Output Parameters

a On exit, if uplo = 'U', the diagonal and first superdiagonal of A are overwritten by the corresponding elements of the tridiagonal matrix T, and the elements above the first superdiagonal, with the array tau, represent the orthogonal matrix Q as a product of elementary reflectors; if uplo = 'L', the diagonal and first subdiagonal of A are overwritten by the corresponding elements of the tridiagonal matrix T, and the elements below the first subdiagonal, with the array tau, represent the orthogonal matrix Q as a product of elementary reflectors.
d, e, tau REAL for ssytrd; DOUBLE PRECISION for dsytrd. Arrays:
d(*) contains the diagonal elements of the matrix T. The dimension of d must be at least max(1, n).
e(*) contains the off-diagonal elements of T. The dimension of e must be at least max(1, n-1).
tau(*) stores further details of the orthogonal matrix Q in the first n-1 elements. tau(n) is used as workspace. The dimension of tau must be at least max(1, n).
work(1) If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes

Routines in the Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For the general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.

Specific details for the routine sytrd interface are the following:

a Holds the matrix A of size (n,n).
tau Holds the vector of length (n-1).
uplo Must be 'U' or 'L'. The default value is 'U'.

Note that diagonal (d) and off-diagonal (e) elements of the matrix T are omitted because they are kept in the matrix A on exit.

Application Notes

For better performance, try using lwork = n*blocksize, where blocksize is a machine-dependent value (typically, 16 to 64) required for optimum performance of the blocked algorithm.

If it is not clear how much workspace to supply, use a generous value of lwork for the first run, or set lwork = -1.

In the first case, the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the array work on exit. Use this value (work(1)) for subsequent runs.

If lwork = -1, the routine returns immediately and provides the recommended workspace in the first element of the array work. This operation is called a workspace query.

Note that if lwork is less than the minimal required value and is not equal to -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.

The computed matrix T is exactly similar to a matrix A+E, where $\|E\|_2 = c(n)\,\varepsilon\,\|A\|_2$, c(n) is a modestly increasing function of n, and $\varepsilon$ is the machine precision.

The approximate number of floating-point operations is $(4/3)\,n^3$.
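The Fortran 95 interface omits d and e because the diagonal and the first superdiagonal (uplo = 'U') or subdiagonal (uplo = 'L') of T stay in a on exit. The C sketch below reads them back from such an array. It assumes column-major storage with leading dimension lda; the function name is hypothetical and is not part of Intel MKL, and the code does not call the library.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: copy the tridiagonal matrix T out of the array a
   left by ?sytrd. a is column-major with leading dimension lda, so the
   element in row i, column j (0-based) is a[i + j*lda].
   d receives n diagonal elements, e receives n-1 off-diagonal elements. */
void sytrd_extract_tridiagonal(char uplo, int n, const double *a, int lda,
                               double *d, double *e)
{
    assert(uplo == 'U' || uplo == 'L');
    assert(n >= 0 && lda >= (n > 1 ? n : 1));

    for (int i = 0; i < n; ++i)
        d[i] = a[(size_t)i + (size_t)i * (size_t)lda];

    for (int i = 0; i + 1 < n; ++i) {
        if (uplo == 'U')   /* first superdiagonal: row i, column i+1 */
            e[i] = a[(size_t)i + (size_t)(i + 1) * (size_t)lda];
        else               /* first subdiagonal: row i+1, column i */
            e[i] = a[(size_t)(i + 1) + (size_t)i * (size_t)lda];
    }
}
```

The remaining entries of a (above the superdiagonal or below the subdiagonal) hold the reflector details and are left untouched by this helper.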
After calling this routine, you can call the following:

orgtr to form the computed matrix Q explicitly
ormtr to multiply a real matrix by Q.

The complex counterpart of this routine is hetrd.

?syrdb

Reduces a real symmetric matrix to tridiagonal form with Successive Bandwidth Reduction approach.

Syntax

Fortran 77:
call ssyrdb(jobz, uplo, n, kd, a, lda, d, e, tau, z, ldz, work, lwork, info)
call dsyrdb(jobz, uplo, n, kd, a, lda, d, e, tau, z, ldz, work, lwork, info)

Include Files

• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description

The routine reduces a real symmetric matrix A to symmetric tridiagonal form T by an orthogonal similarity transformation: A = Q*T*Q^T and optionally multiplies matrix Z by Q, or simply forms Q.

This routine reduces a full symmetric matrix to the banded symmetric form, and then to the tridiagonal symmetric form with a Successive Bandwidth Reduction approach after Prof. C. Bischof's works (see, for instance, [Bischof92]). ?syrdb is functionally close to the ?sytrd routine, but the tridiagonal form may differ from that obtained by ?sytrd. Unlike ?sytrd, the orthogonal matrix Q cannot be restored from the details of matrix A on exit.

Input Parameters

jobz CHARACTER*1. Must be 'N', 'V', or 'U'. If jobz = 'N', then only A is reduced to T. If jobz = 'V', then A is reduced to T and A contains Q on exit. If jobz = 'U', then A is reduced to T and Z contains Z*Q on exit.
uplo CHARACTER*1. Must be 'U' or 'L'. If uplo = 'U', a stores the upper triangular part of A. If uplo = 'L', a stores the lower triangular part of A.
n INTEGER. The order of the matrix A (n ≥ 0).
kd INTEGER. The bandwidth of the banded matrix B (kd ≥ 1).
a, z, work REAL for ssyrdb; DOUBLE PRECISION for dsyrdb.
a(lda,*) is an array containing either upper or lower triangular part of the matrix A, as specified by uplo. The second dimension of a must be at least max(1, n).
z(ldz,*), the second dimension of z must be at least max(1, n).
If jobz = 'U', then the matrix z is multiplied by Q. If jobz = 'N' or 'V', then z is not referenced.
work(lwork) is a workspace array.
lda INTEGER. The leading dimension of a; at least max(1, n).
ldz INTEGER. The leading dimension of z; at least max(1, n). Not referenced if jobz = 'N'.
lwork INTEGER. The size of the work array (lwork ≥ (2kd+1)n+kd). If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla. See Application Notes for the suggested value of lwork.

Output Parameters

a If jobz = 'V', then overwritten by the matrix Q. If jobz = 'N' or 'U', then overwritten by the banded matrix B and details of the orthogonal matrix Q_B to reduce A to B as specified by uplo.
z On exit, if jobz = 'U', then the matrix z is overwritten by Z*Q. If jobz = 'N' or 'V', then z is not referenced.
d, e, tau REAL for ssyrdb; DOUBLE PRECISION for dsyrdb. Arrays:
d(*) contains the diagonal elements of the matrix T. The dimension of d must be at least max(1, n).
e(*) contains the off-diagonal elements of T. The dimension of e must be at least max(1, n-1).
tau(*) stores further details of the orthogonal matrix Q. The dimension of tau must be at least max(1, n-kd-1).
work(1) If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Application Notes

For better performance, try using lwork = n*(3*kd+3).

If it is not clear how much workspace to supply, use a generous value of lwork for the first run, or set lwork = -1.
In the first case, the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the array work on exit. Use this value (work(1)) for subsequent runs.

If lwork = -1, the routine returns immediately and provides the recommended workspace in the first element of the array work. This operation is called a workspace query.

Note that if lwork is less than the minimal required value and is not equal to -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.

For better performance, try using kd equal to 40 if n ≤ 2000 and 64 otherwise.

Try using ?syrdb instead of ?sytrd on large matrices when only eigenvalues are needed and no eigenvectors, especially in a multi-threaded environment. ?syrdb becomes faster beginning approximately with n = 1000, and much faster for larger matrices, with better scalability than ?sytrd.

Avoid applying ?syrdb for computing eigenvectors: because of the two-step reduction, the number of operations needed to apply orthogonal transformations to Z is doubled compared to the traditional one-step reduction. In that case it is better to apply ?sytrd and ?ormtr/?orgtr to obtain the tridiagonal form along with the orthogonal transformation matrix Q.

?herdb

Reduces a complex Hermitian matrix to tridiagonal form with Successive Bandwidth Reduction approach.

Syntax

Fortran 77:
call cherdb(jobz, uplo, n, kd, a, lda, d, e, tau, z, ldz, work, lwork, info)
call zherdb(jobz, uplo, n, kd, a, lda, d, e, tau, z, ldz, work, lwork, info)

Include Files

• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description

The routine reduces a complex Hermitian matrix A to symmetric tridiagonal form T by a unitary similarity transformation: A = Q*T*Q^H and optionally multiplies matrix Z by Q, or simply forms Q.
This routine reduces a full Hermitian matrix to the banded Hermitian form, and then to the tridiagonal symmetric form with a Successive Bandwidth Reduction approach after Prof. C. Bischof's works (see, for instance, [Bischof92]). ?herdb is functionally close to the ?hetrd routine, but the tridiagonal form may differ from that obtained by ?hetrd. Unlike ?hetrd, the unitary matrix Q cannot be restored from the details of matrix A on exit.

Input Parameters

jobz CHARACTER*1. Must be 'N', 'V', or 'U'. If jobz = 'N', then only A is reduced to T. If jobz = 'V', then A is reduced to T and A contains Q on exit. If jobz = 'U', then A is reduced to T and Z contains Z*Q on exit.
uplo CHARACTER*1. Must be 'U' or 'L'. If uplo = 'U', a stores the upper triangular part of A. If uplo = 'L', a stores the lower triangular part of A.
n INTEGER. The order of the matrix A (n ≥ 0).
kd INTEGER. The bandwidth of the banded matrix B (kd ≥ 1).
a, z, work COMPLEX for cherdb. DOUBLE COMPLEX for zherdb.
a(lda,*) is an array containing either the upper or lower triangular part of the matrix A, as specified by uplo. The second dimension of a must be at least max(1, n).
z(ldz,*), the second dimension of z must be at least max(1, n). If jobz = 'U', then the matrix z is multiplied by Q. If jobz = 'N' or 'V', then z is not referenced.
work(lwork) is a workspace array.
lda INTEGER. The leading dimension of a; at least max(1, n).
ldz INTEGER. The leading dimension of z; at least max(1, n). Not referenced if jobz = 'N'.
lwork INTEGER. The size of the work array (lwork ≥ (2kd+1)n+kd). If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla. See Application Notes for the suggested value of lwork.

Output Parameters

a If jobz = 'V', then overwritten by the matrix Q.
If jobz = 'N' or 'U', then overwritten by the banded matrix B and details of the unitary matrix Q_B used to reduce A to B, as specified by uplo.
z On exit, if jobz = 'U', then the matrix z is overwritten by Z*Q. If jobz = 'N' or 'V', then z is not referenced.
d, e COMPLEX for cherdb. DOUBLE COMPLEX for zherdb. Arrays:
d(*) contains the diagonal elements of the matrix T. The dimension of d must be at least max(1, n).
e(*) contains the off-diagonal elements of T. The dimension of e must be at least max(1, n-1).
tau COMPLEX for cherdb. DOUBLE COMPLEX for zherdb. Array, DIMENSION at least max(1, n-1). Stores further details of the unitary matrix Q_B. The dimension of tau must be at least max(1, n-kd-1).
work(1) If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Application Notes

For better performance, try using lwork = n*(3*kd+3). If it is not clear how much workspace to supply, use a generous value of lwork for the first run, or set lwork = -1. In the first case the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the array work on exit. Use this value (work(1)) for subsequent runs. If lwork = -1, the routine returns immediately and provides the recommended workspace in the first element of the array work. This operation is called a workspace query. Note that if lwork is less than the minimal required value and is not equal to -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.
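The lwork protocol described in these notes can be sketched in C. This is only an illustration under stated assumptions: toy_rdb is a hypothetical stand-in that performs no reduction and is not an MKL function; it merely mimics the documented behaviour for lwork = -1, for a too-small lwork, and for an admissible lwork. The helper names and the use of -13 as the error code (lwork being the 13th argument in the Fortran 77 calling sequence) are choices made for this sketch, and n ≥ 1, kd ≥ 1 are assumed.

```c
#include <assert.h>
#include <stdlib.h>

/* Minimum workspace stated for ?syrdb/?herdb: (2*kd+1)*n + kd. */
static long rdb_min_lwork(long n, long kd) { return (2 * kd + 1) * n + kd; }

/* Workspace suggested in the Application Notes: n*(3*kd+3). */
static long rdb_suggested_lwork(long n, long kd) { return n * (3 * kd + 3); }

/*
 * HYPOTHETICAL stand-in for a routine such as dsyrdb; it does no numerical
 * work and only reproduces the documented lwork handling:
 *   lwork == -1   -> workspace query, suggested size returned in work[0]
 *   lwork <  min  -> error exit, work is left untouched
 *   otherwise     -> success, suggested size returned in work[0]
 */
static int toy_rdb(long n, long kd, double *work, long lwork)
{
    if (lwork != -1 && lwork < rdb_min_lwork(n, kd))
        return -13; /* lwork is argument 13 of the Fortran 77 interface */
    work[0] = (double)rdb_suggested_lwork(n, kd);
    return 0;
}

/* Two-call pattern: query the size, allocate, then make the real call. */
static int run_with_query(long n, long kd, long *lwork_used)
{
    double query = 0.0;
    int info;
    long lwork;
    double *work;

    assert(n >= 1 && kd >= 1);

    info = toy_rdb(n, kd, &query, -1);   /* first call: workspace query */
    if (info != 0) return info;

    lwork = (long)query;
    work = malloc((size_t)lwork * sizeof *work);
    if (work == NULL) return 1;

    info = toy_rdb(n, kd, work, lwork);  /* second call: does the work */
    *lwork_used = lwork;
    free(work);
    return info;
}
```

With n = 1000 and kd = 40 the minimum is 81040 elements and the suggested size is 123000 elements, so the query-then-allocate pattern requests noticeably more than the minimum.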
For better performance, try using kd equal to 40 if n ≤ 2000 and 64 otherwise.
Try using ?herdb instead of ?hetrd on large matrices when only eigenvalues are required and no eigenvectors are needed, especially in a multi-threaded environment. ?herdb becomes faster beginning approximately with n = 1000, and much faster for larger matrices, with better scalability than ?hetrd.
Avoid applying ?herdb for computing eigenvectors: because of the two-step reduction, the number of operations needed to apply the unitary transformations to Z is doubled compared to the traditional one-step reduction. In that case it is better to apply ?hetrd and ?unmtr/?ungtr to obtain the tridiagonal form along with the unitary transformation matrix Q.

?orgtr
Generates the real orthogonal matrix Q determined by ?sytrd.

Syntax

Fortran 77:
call sorgtr(uplo, n, a, lda, tau, work, lwork, info)
call dorgtr(uplo, n, a, lda, tau, work, lwork, info)
Fortran 95:
call orgtr(a, tau [,uplo] [,info])
C:
lapack_int LAPACKE_<?>orgtr( int matrix_order, char uplo, lapack_int n, <datatype>* a, lapack_int lda, const <datatype>* tau );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description

The routine explicitly generates the n-by-n orthogonal matrix Q formed by ?sytrd when reducing a real symmetric matrix A to tridiagonal form: A = Q*T*Q^T. Use this routine after a call to ?sytrd.

Input Parameters

The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'. Use the same uplo as supplied to ?sytrd.
n INTEGER. The order of the matrix Q (n ≥ 0).
a, tau, work REAL for sorgtr. DOUBLE PRECISION for dorgtr. Arrays:
a(lda,*) is the array a as returned by ?sytrd. The second dimension of a must be at least max(1, n).
tau(*) is the array tau as returned by ?sytrd. The dimension of tau must be at least max(1, n-1).
work is a workspace array, its dimension max(1, lwork).
lda INTEGER. The leading dimension of a; at least max(1, n).
lwork INTEGER. The size of the work array (lwork ≥ n). If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla. See Application Notes for the suggested value of lwork.

Output Parameters

a Overwritten by the orthogonal matrix Q.
work(1) If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes

Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine orgtr interface are the following:
a Holds the matrix A of size (n,n).
tau Holds the vector of length (n-1).
uplo Must be 'U' or 'L'. The default value is 'U'.

Application Notes

For better performance, try using lwork = (n-1)*blocksize, where blocksize is a machine-dependent value (typically, 16 to 64) required for optimum performance of the blocked algorithm.
If you are in doubt how much workspace to supply, use a generous value of lwork for the first run or set lwork = -1.
If you choose the first option and set any admissible lwork size, that is, one no less than the minimal value described, the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the array work on exit. Use this value (work(1)) for subsequent runs. If you set lwork = -1, the routine returns immediately and provides the recommended workspace in the first element of the array work. This operation is called a workspace query. Note that if you set lwork to less than the minimal required value and not -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.
The computed matrix Q differs from an exactly orthogonal matrix by a matrix E such that ||E||_2 = O(ε), where ε is the machine precision.
The approximate number of floating-point operations is (4/3)n^3.
The complex counterpart of this routine is ungtr.

?ormtr
Multiplies a real matrix by the real orthogonal matrix Q determined by ?sytrd.

Syntax

Fortran 77:
call sormtr(side, uplo, trans, m, n, a, lda, tau, c, ldc, work, lwork, info)
call dormtr(side, uplo, trans, m, n, a, lda, tau, c, ldc, work, lwork, info)
Fortran 95:
call ormtr(a, tau, c [,side] [,uplo] [,trans] [,info])
C:
lapack_int LAPACKE_<?>ormtr( int matrix_order, char side, char uplo, char trans, lapack_int m, lapack_int n, const <datatype>* a, lapack_int lda, const <datatype>* tau, <datatype>* c, lapack_int ldc );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description

The routine multiplies a real matrix C by Q or Q^T, where Q is the orthogonal matrix formed by ?sytrd when reducing a real symmetric matrix A to tridiagonal form: A = Q*T*Q^T. Use this routine after a call to ?sytrd.
Depending on the parameters side and trans, the routine can form one of the matrix products Q*C, Q^T*C, C*Q, or C*Q^T (overwriting the result on C).
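To make the meaning of side and trans concrete, the following C sketch forms the same four products with an explicit dense Q. It is an illustration only: apply_q is a hypothetical helper, not an MKL routine, and unlike ?ormtr it takes Q as a full column-major matrix rather than in the reflector form returned by ?sytrd.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Column-major element access, matching the Fortran array layout. */
#define AT(x, ld, i, j) ((x)[(size_t)(j) * (size_t)(ld) + (size_t)(i)])

/*
 * HYPOTHETICAL helper (not part of MKL). Overwrites the m-by-n matrix c with
 *   side='L', trans='N':  Q*C       side='L', trans='T':  Q^T*C
 *   side='R', trans='N':  C*Q       side='R', trans='T':  C*Q^T
 * q is an explicit r-by-r matrix, r = m for side 'L' and r = n for side 'R'.
 * Returns 0 on success, -1 on an invalid flag or allocation failure.
 */
static int apply_q(char side, char trans, int m, int n,
                   const double *q, double *c)
{
    int r, i, j, k;
    double *out;

    if ((side != 'L' && side != 'R') || (trans != 'N' && trans != 'T'))
        return -1;
    if (m <= 0 || n <= 0)
        return 0; /* nothing to do for an empty matrix */

    r = (side == 'L') ? m : n;
    out = calloc((size_t)m * (size_t)n, sizeof *out);
    if (out == NULL)
        return -1;

    for (i = 0; i < m; ++i) {
        for (j = 0; j < n; ++j) {
            double s = 0.0;
            for (k = 0; k < r; ++k) {
                if (side == 'L') {
                    /* (op(Q)*C)(i,j) = sum over k of op(Q)(i,k)*C(k,j) */
                    double qik = (trans == 'N') ? AT(q, r, i, k) : AT(q, r, k, i);
                    s += qik * AT(c, m, k, j);
                } else {
                    /* (C*op(Q))(i,j) = sum over k of C(i,k)*op(Q)(k,j) */
                    double qkj = (trans == 'N') ? AT(q, r, k, j) : AT(q, r, j, k);
                    s += AT(c, m, i, k) * qkj;
                }
            }
            AT(out, m, i, j) = s;
        }
    }
    memcpy(c, out, (size_t)m * (size_t)n * sizeof *out);
    free(out);
    return 0;
}
```

For example, with Q a 90-degree rotation stored column by column as {0, 1, -1, 0} and C stored as {1, 3, 2, 4}, the call with side = 'L' and trans = 'N' swaps the two rows of C and negates the new first row.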

Input Parameters

The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
In the descriptions below, r denotes the order of Q: if side = 'L', r = m; if side = 'R', r = n.
side CHARACTER*1. Must be either 'L' or 'R'. If side = 'L', Q or Q^T is applied to C from the left. If side = 'R', Q or Q^T is applied to C from the right.
uplo CHARACTER*1. Must be 'U' or 'L'. Use the same uplo as supplied to ?sytrd.
trans CHARACTER*1. Must be either 'N' or 'T'. If trans = 'N', the routine multiplies C by Q. If trans = 'T', the routine multiplies C by Q^T.
m INTEGER. The number of rows in the matrix C (m ≥ 0).
n INTEGER. The number of columns in C (n ≥ 0).
a, c, tau, work REAL for sormtr. DOUBLE PRECISION for dormtr.
a(lda,*) and tau are the arrays returned by ?sytrd. The second dimension of a must be at least max(1, r). The dimension of tau must be at least max(1, r-1).
c(ldc,*) contains the matrix C. The second dimension of c must be at least max(1, n).
work is a workspace array, its dimension max(1, lwork).
lda INTEGER. The leading dimension of a; lda ≥ max(1, r).
ldc INTEGER. The leading dimension of c; ldc ≥ max(1, n).
lwork INTEGER. The size of the work array. Constraints: lwork ≥ max(1, n) if side = 'L'; lwork ≥ max(1, m) if side = 'R'. If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla. See Application Notes for the suggested value of lwork.

Output Parameters

c Overwritten by the product Q*C, Q^T*C, C*Q, or C*Q^T (as specified by side and trans).
work(1) If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance.
Use this lwork for subsequent runs.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes

Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine ormtr interface are the following:
a Holds the matrix A of size (r,r). r = m if side = 'L'. r = n if side = 'R'.
tau Holds the vector of length (r-1).
c Holds the matrix C of size (m,n).
side Must be 'L' or 'R'. The default value is 'L'.
uplo Must be 'U' or 'L'. The default value is 'U'.
trans Must be 'N' or 'T'. The default value is 'N'.

Application Notes

For better performance, try using lwork = n*blocksize for side = 'L', or lwork = m*blocksize for side = 'R', where blocksize is a machine-dependent value (typically, 16 to 64) required for optimum performance of the blocked algorithm.
If you are in doubt how much workspace to supply, use a generous value of lwork for the first run or set lwork = -1.
If you choose the first option and set any admissible lwork size, that is, one no less than the minimal value described, the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the array work on exit. Use this value (work(1)) for subsequent runs. If you set lwork = -1, the routine returns immediately and provides the recommended workspace in the first element of the array work. This operation is called a workspace query. Note that if you set lwork to less than the minimal required value and not -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.
The computed product differs from the exact product by a matrix E such that ||E||_2 = O(ε)*||C||_2, where ε is the machine precision.
The total number of floating-point operations is approximately 2*m^2*n if side = 'L', or 2*n^2*m if side = 'R'.
The complex counterpart of this routine is unmtr.

?hetrd
Reduces a complex Hermitian matrix to tridiagonal form.

Syntax

Fortran 77:
call chetrd(uplo, n, a, lda, d, e, tau, work, lwork, info)
call zhetrd(uplo, n, a, lda, d, e, tau, work, lwork, info)
Fortran 95:
call hetrd(a, tau [,uplo] [,info])
C:
lapack_int LAPACKE_chetrd( int matrix_order, char uplo, lapack_int n, lapack_complex_float* a, lapack_int lda, float* d, float* e, lapack_complex_float* tau );
lapack_int LAPACKE_zhetrd( int matrix_order, char uplo, lapack_int n, lapack_complex_double* a, lapack_int lda, double* d, double* e, lapack_complex_double* tau );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description

The routine reduces a complex Hermitian matrix A to symmetric tridiagonal form T by a unitary similarity transformation: A = Q*T*Q^H. The unitary matrix Q is not formed explicitly but is represented as a product of n-1 elementary reflectors. Routines are provided to work with Q in this representation. (They are described later in this section.)
This routine calls latrd to reduce a complex Hermitian matrix A to Hermitian tridiagonal form by a unitary similarity transformation.

Input Parameters

The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'. If uplo = 'U', a stores the upper triangular part of A. If uplo = 'L', a stores the lower triangular part of A.
n INTEGER. The order of the matrix A (n ≥ 0).
a, work COMPLEX for chetrd. DOUBLE COMPLEX for zhetrd.
a(lda,*) is an array containing either the upper or lower triangular part of the matrix A, as specified by uplo. If uplo = 'U', the leading n-by-n upper triangular part of a contains the upper triangular part of the matrix A, and the strictly lower triangular part of A is not referenced. If uplo = 'L', the leading n-by-n lower triangular part of a contains the lower triangular part of the matrix A, and the strictly upper triangular part of A is not referenced. The second dimension of a must be at least max(1, n).
work is a workspace array, its dimension max(1, lwork).
lda INTEGER. The leading dimension of a; at least max(1, n).
lwork INTEGER. The size of the work array (lwork ≥ n). If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla. See Application Notes for the suggested value of lwork.

Output Parameters

a On exit, if uplo = 'U', the diagonal and first superdiagonal of A are overwritten by the corresponding elements of the tridiagonal matrix T, and the elements above the first superdiagonal, with the array tau, represent the unitary matrix Q as a product of elementary reflectors; if uplo = 'L', the diagonal and first subdiagonal of A are overwritten by the corresponding elements of the tridiagonal matrix T, and the elements below the first subdiagonal, with the array tau, represent the unitary matrix Q as a product of elementary reflectors.
d, e REAL for chetrd. DOUBLE PRECISION for zhetrd. Arrays:
d(*) contains the diagonal elements of the matrix T. The dimension of d must be at least max(1, n).
e(*) contains the off-diagonal elements of T. The dimension of e must be at least max(1, n-1).
tau COMPLEX for chetrd. DOUBLE COMPLEX for zhetrd. Array, DIMENSION at least max(1, n-1). Stores further details of the unitary matrix Q.
work(1) If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes

Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine hetrd interface are the following:
a Holds the matrix A of size (n,n).
tau Holds the vector of length (n-1).
uplo Must be 'U' or 'L'. The default value is 'U'.
Note that diagonal (d) and off-diagonal (e) elements of the matrix T are omitted because they are kept in the matrix A on exit.

Application Notes

For better performance, try using lwork = n*blocksize, where blocksize is a machine-dependent value (typically, 16 to 64) required for optimum performance of the blocked algorithm.
If you are in doubt how much workspace to supply, use a generous value of lwork for the first run or set lwork = -1.
If you choose the first option and set any admissible lwork size, that is, one no less than the minimal value described, the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the array work on exit. Use this value (work(1)) for subsequent runs. If you set lwork = -1, the routine returns immediately and provides the recommended workspace in the first element of the array work. This operation is called a workspace query. Note that if you set lwork to less than the minimal required value and not -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.
The computed matrix T is exactly similar to a matrix A + E, where ||E||_2 = c(n)*ε*||A||_2, c(n) is a modestly increasing function of n, and ε is the machine precision.
The approximate number of floating-point operations is (16/3)n^3.
After calling this routine, you can call the following:
ungtr to form the computed matrix Q explicitly
unmtr to multiply a complex matrix by Q.
The real counterpart of this routine is sytrd.

?ungtr
Generates the complex unitary matrix Q determined by ?hetrd.

Syntax

Fortran 77:
call cungtr(uplo, n, a, lda, tau, work, lwork, info)
call zungtr(uplo, n, a, lda, tau, work, lwork, info)
Fortran 95:
call ungtr(a, tau [,uplo] [,info])
C:
lapack_int LAPACKE_<?>ungtr( int matrix_order, char uplo, lapack_int n, <datatype>* a, lapack_int lda, const <datatype>* tau );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description

The routine explicitly generates the n-by-n unitary matrix Q formed by ?hetrd when reducing a complex Hermitian matrix A to tridiagonal form: A = Q*T*Q^H. Use this routine after a call to ?hetrd.

Input Parameters

The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'. Use the same uplo as supplied to ?hetrd.
n INTEGER. The order of the matrix Q (n ≥ 0).
a, tau, work COMPLEX for cungtr. DOUBLE COMPLEX for zungtr. Arrays:
a(lda,*) is the array a as returned by ?hetrd. The second dimension of a must be at least max(1, n).
tau(*) is the array tau as returned by ?hetrd. The dimension of tau must be at least max(1, n-1).
work is a workspace array, its dimension max(1, lwork).
lda INTEGER. The leading dimension of a; at least max(1, n).
lwork INTEGER. The size of the work array (lwork ≥ n).
If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla. See Application Notes for the suggested value of lwork.

Output Parameters

a Overwritten by the unitary matrix Q.
work(1) If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes

Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine ungtr interface are the following:
a Holds the matrix A of size (n,n).
tau Holds the vector of length (n-1).
uplo Must be 'U' or 'L'. The default value is 'U'.

Application Notes

For better performance, try using lwork = (n-1)*blocksize, where blocksize is a machine-dependent value (typically, 16 to 64) required for optimum performance of the blocked algorithm.
If it is not clear how much workspace to supply, use a generous value of lwork for the first run, or set lwork = -1.
In the first case the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the array work on exit. Use this value (work(1)) for subsequent runs. If lwork = -1, the routine returns immediately and provides the recommended workspace in the first element of the array work. This operation is called a workspace query.
Note that if lwork is less than the minimal required value and is not equal to -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.
The computed matrix Q differs from an exactly unitary matrix by a matrix E such that ||E||_2 = O(ε), where ε is the machine precision.
The approximate number of floating-point operations is (16/3)n^3.
The real counterpart of this routine is orgtr.

?unmtr
Multiplies a complex matrix by the complex unitary matrix Q determined by ?hetrd.

Syntax

Fortran 77:
call cunmtr(side, uplo, trans, m, n, a, lda, tau, c, ldc, work, lwork, info)
call zunmtr(side, uplo, trans, m, n, a, lda, tau, c, ldc, work, lwork, info)
Fortran 95:
call unmtr(a, tau, c [,side] [,uplo] [,trans] [,info])
C:
lapack_int LAPACKE_<?>unmtr( int matrix_order, char side, char uplo, char trans, lapack_int m, lapack_int n, const <datatype>* a, lapack_int lda, const <datatype>* tau, <datatype>* c, lapack_int ldc );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description

The routine multiplies a complex matrix C by Q or Q^H, where Q is the unitary matrix formed by ?hetrd when reducing a complex Hermitian matrix A to tridiagonal form: A = Q*T*Q^H. Use this routine after a call to ?hetrd.
Depending on the parameters side and trans, the routine can form one of the matrix products Q*C, Q^H*C, C*Q, or C*Q^H (overwriting the result on C).

Input Parameters

The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
In the descriptions below, r denotes the order of Q: if side = 'L', r = m; if side = 'R', r = n.
side CHARACTER*1. Must be either 'L' or 'R'. If side = 'L', Q or Q^H is applied to C from the left.
If side = 'R', Q or Q^H is applied to C from the right.
uplo CHARACTER*1. Must be 'U' or 'L'. Use the same uplo as supplied to ?hetrd.
trans CHARACTER*1. Must be either 'N' or 'C'. If trans = 'N', the routine multiplies C by Q. If trans = 'C', the routine multiplies C by Q^H.
m INTEGER. The number of rows in the matrix C (m ≥ 0).
n INTEGER. The number of columns in C (n ≥ 0).
a, c, tau, work COMPLEX for cunmtr. DOUBLE COMPLEX for zunmtr.
a(lda,*) and tau are the arrays returned by ?hetrd. The second dimension of a must be at least max(1, r). The dimension of tau must be at least max(1, r-1).
c(ldc,*) contains the matrix C. The second dimension of c must be at least max(1, n).
work is a workspace array, its dimension max(1, lwork).
lda INTEGER. The leading dimension of a; lda ≥ max(1, r).
ldc INTEGER. The leading dimension of c; ldc ≥ max(1, n).
lwork INTEGER. The size of the work array. Constraints: lwork ≥ max(1, n) if side = 'L'; lwork ≥ max(1, m) if side = 'R'. If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla. See Application Notes for the suggested value of lwork.

Output Parameters

c Overwritten by the product Q*C, Q^H*C, C*Q, or C*Q^H (as specified by side and trans).
work(1) If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes

Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine unmtr interface are the following:
a Holds the matrix A of size (r,r). r = m if side = 'L'. r = n if side = 'R'.
tau Holds the vector of length (r-1).
c Holds the matrix C of size (m,n).
side Must be 'L' or 'R'. The default value is 'L'.
uplo Must be 'U' or 'L'. The default value is 'U'.
trans Must be 'N' or 'C'. The default value is 'N'.

Application Notes

For better performance, try using lwork = n*blocksize (for side = 'L') or lwork = m*blocksize (for side = 'R'), where blocksize is a machine-dependent value (typically, 16 to 64) required for optimum performance of the blocked algorithm.
If it is not clear how much workspace to supply, use a generous value of lwork for the first run, or set lwork = -1.
In the first case the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the array work on exit. Use this value (work(1)) for subsequent runs. If lwork = -1, the routine returns immediately and provides the recommended workspace in the first element of the array work. This operation is called a workspace query. Note that if lwork is less than the minimal required value and is not equal to -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.
The computed product differs from the exact product by a matrix E such that ||E||_2 = O(ε)*||C||_2, where ε is the machine precision.
The total number of floating-point operations is approximately 8*m^2*n if side = 'L' or 8*n^2*m if side = 'R'.
The real counterpart of this routine is ormtr.

?sptrd
Reduces a real symmetric matrix to tridiagonal form using packed storage.

Syntax

Fortran 77:
call ssptrd(uplo, n, ap, d, e, tau, info)
call dsptrd(uplo, n, ap, d, e, tau, info)
Fortran 95:
call sptrd(ap, tau [,uplo] [,info])
C:
lapack_int LAPACKE_<?>sptrd( int matrix_order, char uplo, lapack_int n, <datatype>* ap, <datatype>* d, <datatype>* e, <datatype>* tau );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description

The routine reduces a packed real symmetric matrix A to symmetric tridiagonal form T by an orthogonal similarity transformation: A = Q*T*Q^T. The orthogonal matrix Q is not formed explicitly but is represented as a product of n-1 elementary reflectors. Routines are provided for working with Q in this representation. See Application Notes below for details.

Input Parameters

The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'. If uplo = 'U', ap stores the packed upper triangle of A. If uplo = 'L', ap stores the packed lower triangle of A.
n INTEGER. The order of the matrix A (n ≥ 0).
ap REAL for ssptrd. DOUBLE PRECISION for dsptrd. Array, DIMENSION at least max(1, n(n+1)/2). Contains either the upper or lower triangle of A (as specified by uplo) in the packed form described in "Matrix Arguments" in Appendix B.

Output Parameters

ap Overwritten by the tridiagonal matrix T and details of the orthogonal matrix Q, as specified by uplo.
d, e, tau REAL for ssptrd. DOUBLE PRECISION for dsptrd. Arrays:
d(*) contains the diagonal elements of the matrix T. The dimension of d must be at least max(1, n).
e(*) contains the off-diagonal elements of T. The dimension of e must be at least max(1, n-1).
tau(*) stores further details of the matrix Q. The dimension of tau must be at least max(1, n-1).
info INTEGER.
If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes

Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine sptrd interface are the following:
ap Holds the array A of size (n*(n+1)/2).
tau Holds the vector with the number of elements n-1.
uplo Must be 'U' or 'L'. The default value is 'U'.
Note that diagonal (d) and off-diagonal (e) elements of the matrix T are omitted because they are kept in the matrix A on exit.

Application Notes

The matrix Q is represented as a product of n-1 elementary reflectors, as follows:
• If uplo = 'U', Q = H(n-1) ... H(2)H(1)
Each H(i) has the form H(i) = I - tau*v*v^T for real flavors, or H(i) = I - tau*v*v^H for complex flavors, where tau is a real/complex scalar, and v is a real/complex vector with v(i+1:n) = 0 and v(i) = 1.
On exit, tau is stored in tau(i), and v(1:i-1) is stored in AP, overwriting A(1:i-1, i+1).
• If uplo = 'L', Q = H(1)H(2) ... H(n-1)
Each H(i) has the form H(i) = I - tau*v*v^T for real flavors, or H(i) = I - tau*v*v^H for complex flavors, where tau is a real/complex scalar, and v is a real/complex vector with v(1:i) = 0 and v(i+1) = 1.
On exit, tau is stored in tau(i), and v(i+2:n) is stored in AP, overwriting A(i+2:n, i).
The computed matrix T is exactly similar to a matrix A + E, where ||E||_2 = c(n)*ε*||A||_2, c(n) is a modestly increasing function of n, and ε is the machine precision. The approximate number of floating-point operations is (4/3)n^3.
After calling this routine, you can call the following:
opgtr to form the computed matrix Q explicitly
opmtr to multiply a real matrix by Q.
The complex counterpart of this routine is hptrd.
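The elementary reflector form H(i) = I - tau*v*v^T used in the Application Notes above can be checked numerically with a small C sketch. It is an illustration for the real flavors only, and the helper names are hypothetical, not MKL routines. It relies on the standard Householder property that I - tau*v*v^T is orthogonal when tau = 2/(v^T*v); that the tau values returned by ?sptrd satisfy this relation is an assumption of the sketch, not a statement taken from this manual.

```c
#include <assert.h>
#include <stddef.h>

/* tau = 2/(v^T v): the scaling for which I - tau*v*v^T is orthogonal
 * (standard Householder property; v must be nonzero). */
static double reflector_tau(int n, const double *v)
{
    double vtv = 0.0;
    int k;
    for (k = 0; k < n; ++k)
        vtv += v[k] * v[k];
    assert(vtv > 0.0);
    return 2.0 / vtv;
}

/* Fills the row-major n-by-n array h with H = I - tau*v*v^T. */
static void build_reflector(int n, double tau, const double *v, double *h)
{
    int i, j;
    for (i = 0; i < n; ++i)
        for (j = 0; j < n; ++j)
            h[(size_t)i * n + j] = (i == j ? 1.0 : 0.0) - tau * v[i] * v[j];
}

/* Largest absolute entry of H*H^T - I; near zero when H is orthogonal. */
static double orthogonality_error(int n, const double *h)
{
    double worst = 0.0;
    int i, j, k;
    for (i = 0; i < n; ++i) {
        for (j = 0; j < n; ++j) {
            double s = (i == j) ? -1.0 : 0.0;
            for (k = 0; k < n; ++k)
                s += h[(size_t)i * n + k] * h[(size_t)j * n + k];
            if (s < 0.0)
                s = -s;
            if (s > worst)
                worst = s;
        }
    }
    return worst;
}
```

For instance, v = (0.5, 1, 0) has the shape described for uplo = 'U' with n = 3 and i = 2: v(i) = 1, v(i+1:n) = 0, and only v(1:i-1) would need to be stored. Building H from it and calling orthogonality_error gives a value at rounding-error level.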
?opgtr
Generates the real orthogonal matrix Q determined by ?sptrd.
Syntax
Fortran 77:
call sopgtr(uplo, n, ap, tau, q, ldq, work, info)
call dopgtr(uplo, n, ap, tau, q, ldq, work, info)
Fortran 95:
call opgtr(ap, tau, q [,uplo] [,info])
C:
lapack_int LAPACKE_<?>opgtr( int matrix_order, char uplo, lapack_int n, const <datatype>* ap, const <datatype>* tau, <datatype>* q, lapack_int ldq );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine explicitly generates the n-by-n orthogonal matrix Q formed by ?sptrd when reducing a packed real symmetric matrix A to tridiagonal form: A = Q*T*Q^T. Use this routine after a call to ?sptrd.
Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'. Use the same uplo as supplied to ?sptrd.
n INTEGER. The order of the matrix Q (n ≥ 0).
ap, tau REAL for sopgtr
DOUBLE PRECISION for dopgtr.
Arrays ap and tau, as returned by ?sptrd.
The dimension of ap must be at least max(1, n(n+1)/2).
The dimension of tau must be at least max(1, n-1).
ldq INTEGER. The leading dimension of the output array q; at least max(1, n).
work REAL for sopgtr
DOUBLE PRECISION for dopgtr.
Workspace array, DIMENSION at least max(1, n-1).
Output Parameters
q REAL for sopgtr
DOUBLE PRECISION for dopgtr.
Array, DIMENSION (ldq,*).
Contains the computed matrix Q.
The second dimension of q must be at least max(1, n).
info INTEGER.
If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
Fortran 95 Interface Notes
Routines in the Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts.
For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine opgtr interface are the following:
ap Holds the array A of size (n*(n+1)/2).
tau Holds the vector with the number of elements n - 1.
q Holds the matrix Q of size (n,n).
uplo Must be 'U' or 'L'. The default value is 'U'.
Application Notes
The computed matrix Q differs from an exactly orthogonal matrix by a matrix E such that ||E||_2 = O(ε), where ε is the machine precision.
The approximate number of floating-point operations is (4/3)n^3.
The complex counterpart of this routine is upgtr.
?opmtr
Multiplies a real matrix by the real orthogonal matrix Q determined by ?sptrd.
Syntax
Fortran 77:
call sopmtr(side, uplo, trans, m, n, ap, tau, c, ldc, work, info)
call dopmtr(side, uplo, trans, m, n, ap, tau, c, ldc, work, info)
Fortran 95:
call opmtr(ap, tau, c [,side] [,uplo] [,trans] [,info])
C:
lapack_int LAPACKE_<?>opmtr( int matrix_order, char side, char uplo, char trans, lapack_int m, lapack_int n, const <datatype>* ap, const <datatype>* tau, <datatype>* c, lapack_int ldc );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine multiplies a real matrix C by Q or Q^T, where Q is the orthogonal matrix formed by ?sptrd when reducing a packed real symmetric matrix A to tridiagonal form: A = Q*T*Q^T. Use this routine after a call to ?sptrd.
Depending on the parameters side and trans, the routine can form one of the matrix products Q*C, Q^T*C, C*Q, or C*Q^T (overwriting the result on C).
Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
In the descriptions below, r denotes the order of Q:
If side = 'L', r = m; if side = 'R', r = n.
side CHARACTER*1. Must be either 'L' or 'R'.
If side = 'L', Q or Q^T is applied to C from the left.
If side = 'R', Q or Q^T is applied to C from the right.
uplo CHARACTER*1. Must be 'U' or 'L'.
Use the same uplo as supplied to ?sptrd.
trans CHARACTER*1. Must be either 'N' or 'T'.
If trans = 'N', the routine multiplies C by Q.
If trans = 'T', the routine multiplies C by Q^T.
m INTEGER. The number of rows in the matrix C (m ≥ 0).
n INTEGER. The number of columns in C (n ≥ 0).
ap, tau, c, work REAL for sopmtr
DOUBLE PRECISION for dopmtr.
ap and tau are the arrays returned by ?sptrd.
The dimension of ap must be at least max(1, r(r+1)/2).
The dimension of tau must be at least max(1, r-1).
c(ldc,*) contains the matrix C.
The second dimension of c must be at least max(1, n).
work(*) is a workspace array.
The dimension of work must be at least max(1, n) if side = 'L'; max(1, m) if side = 'R'.
ldc INTEGER. The leading dimension of c; ldc ≥ max(1, m).
Output Parameters
c Overwritten by the product Q*C, Q^T*C, C*Q, or C*Q^T (as specified by side and trans).
info INTEGER.
If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
Fortran 95 Interface Notes
Routines in the Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine opmtr interface are the following:
ap Holds the array A of size (r*(r+1)/2), where
r = m if side = 'L'.
r = n if side = 'R'.
tau Holds the vector with the number of elements r - 1.
c Holds the matrix C of size (m,n).
side Must be 'L' or 'R'. The default value is 'L'.
uplo Must be 'U' or 'L'. The default value is 'U'.
trans Must be 'N', 'C', or 'T'. The default value is 'N'.
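The array-size rules stated above for ?opmtr depend on side through r, which is easy to get wrong when allocating buffers. The following C sketch collects those rules in one place. It is an illustration only and does not call Intel MKL; the type opmtr_sizes and the function opmtr_min_sizes are hypothetical names introduced here, not part of any library interface.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal array lengths for a call to ?opmtr, following the
   parameter descriptions: r is the order of Q. */
typedef struct {
    size_t r;        /* order of Q: m if side = 'L', n if side = 'R' */
    size_t ap_len;   /* at least max(1, r(r+1)/2) */
    size_t tau_len;  /* at least max(1, r-1) */
    size_t work_len; /* at least max(1, n) if side = 'L'; max(1, m) if 'R' */
} opmtr_sizes;

/* max(1, x) for unsigned sizes. */
static size_t max1(size_t x)
{
    return x > 1 ? x : 1;
}

/* Compute the minimal lengths for an m-by-n matrix C. */
opmtr_sizes opmtr_min_sizes(char side, size_t m, size_t n)
{
    opmtr_sizes s;
    assert(side == 'L' || side == 'R');
    s.r = (side == 'L') ? m : n;
    s.ap_len = max1(s.r * (s.r + 1) / 2);
    /* Guard r = 0 so the unsigned subtraction cannot wrap around. */
    s.tau_len = max1(s.r > 0 ? s.r - 1 : 0);
    s.work_len = max1(side == 'L' ? n : m);
    return s;
}
```

For example, with a 4-by-2 matrix C and side = 'L', Q has order 4, so ap needs at least 10 elements, tau at least 3, and work at least 2.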
Application Notes
The computed product differs from the exact product by a matrix E such that ||E||_2 = O(ε)*||C||_2, where ε is the machine precision.
The total number of floating-point operations is approximately 2*m^2*n if side = 'L', or 2*n^2*m if side = 'R'.
The complex counterpart of this routine is upmtr.
?hptrd
Reduces a complex Hermitian matrix to tridiagonal form using packed storage.
Syntax
Fortran 77:
call chptrd(uplo, n, ap, d, e, tau, info)
call zhptrd(uplo, n, ap, d, e, tau, info)
Fortran 95:
call hptrd(ap, tau [,uplo] [,info])
C:
lapack_int LAPACKE_chptrd( int matrix_order, char uplo, lapack_int n, lapack_complex_float* ap, float* d, float* e, lapack_complex_float* tau );
lapack_int LAPACKE_zhptrd( int matrix_order, char uplo, lapack_int n, lapack_complex_double* ap, double* d, double* e, lapack_complex_double* tau );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine reduces a packed complex Hermitian matrix A to symmetric tridiagonal form T by a unitary similarity transformation: A = Q*T*Q^H. The unitary matrix Q is not formed explicitly but is represented as a product of n-1 elementary reflectors. Routines are provided for working with Q in this representation (see Application Notes below).
Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'.
If uplo = 'U', ap stores the packed upper triangle of A.
If uplo = 'L', ap stores the packed lower triangle of A.
n INTEGER. The order of the matrix A (n ≥ 0).
ap COMPLEX for chptrd
DOUBLE COMPLEX for zhptrd.
Array, DIMENSION at least max(1, n(n+1)/2).
Contains either upper or lower triangle of A (as specified by uplo) in the packed form described in "Matrix Arguments" in Appendix B.
Output Parameters
ap Overwritten by the tridiagonal matrix T and details of the unitary matrix Q, as specified by uplo.
d, e REAL for chptrd
DOUBLE PRECISION for zhptrd.
Arrays:
d(*) contains the diagonal elements of the matrix T. The dimension of d must be at least max(1, n).
e(*) contains the off-diagonal elements of T. The dimension of e must be at least max(1, n-1).
tau COMPLEX for chptrd
DOUBLE COMPLEX for zhptrd.
Array, DIMENSION at least max(1, n-1). Contains further details of the unitary matrix Q.
info INTEGER.
If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
Fortran 95 Interface Notes
Routines in the Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine hptrd interface are the following:
ap Holds the array A of size (n*(n+1)/2).
tau Holds the vector with the number of elements n - 1.
uplo Must be 'U' or 'L'. The default value is 'U'.
Note that diagonal (d) and off-diagonal (e) elements of the matrix T are omitted because they are kept in the matrix A on exit.
Application Notes
The computed matrix T is exactly similar to a matrix A + E, where ||E||_2 = c(n)*ε*||A||_2, c(n) is a modestly increasing function of n, and ε is the machine precision.
The approximate number of floating-point operations is (16/3)n^3.
After calling this routine, you can call the following:
upgtr to form the computed matrix Q explicitly
upmtr to multiply a complex matrix by Q.
The real counterpart of this routine is sptrd.
?upgtr
Generates the complex unitary matrix Q determined by ?hptrd.
Syntax
Fortran 77:
call cupgtr(uplo, n, ap, tau, q, ldq, work, info)
call zupgtr(uplo, n, ap, tau, q, ldq, work, info)
Fortran 95:
call upgtr(ap, tau, q [,uplo] [,info])
C:
lapack_int LAPACKE_<?>upgtr( int matrix_order, char uplo, lapack_int n, const <datatype>* ap, const <datatype>* tau, <datatype>* q, lapack_int ldq );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine explicitly generates the n-by-n unitary matrix Q formed by ?hptrd when reducing a packed complex Hermitian matrix A to tridiagonal form: A = Q*T*Q^H. Use this routine after a call to ?hptrd.
Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'. Use the same uplo as supplied to ?hptrd.
n INTEGER. The order of the matrix Q (n ≥ 0).
ap, tau COMPLEX for cupgtr
DOUBLE COMPLEX for zupgtr.
Arrays ap and tau, as returned by ?hptrd.
The dimension of ap must be at least max(1, n(n+1)/2).
The dimension of tau must be at least max(1, n-1).
ldq INTEGER. The leading dimension of the output array q; at least max(1, n).
work COMPLEX for cupgtr
DOUBLE COMPLEX for zupgtr.
Workspace array, DIMENSION at least max(1, n-1).
Output Parameters
q COMPLEX for cupgtr
DOUBLE COMPLEX for zupgtr.
Array, DIMENSION (ldq,*).
Contains the computed matrix Q.
The second dimension of q must be at least max(1, n).
info INTEGER.
If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
Fortran 95 Interface Notes
Routines in the Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine upgtr interface are the following:
ap Holds the array A of size (n*(n+1)/2).
tau Holds the vector with the number of elements n - 1.
q Holds the matrix Q of size (n,n).
uplo Must be 'U' or 'L'. The default value is 'U'.
Application Notes
The computed matrix Q differs from an exactly unitary matrix by a matrix E such that ||E||_2 = O(ε), where ε is the machine precision.
The approximate number of floating-point operations is (16/3)n^3.
The real counterpart of this routine is opgtr.
?upmtr
Multiplies a complex matrix by the unitary matrix Q determined by ?hptrd.
Syntax
Fortran 77:
call cupmtr(side, uplo, trans, m, n, ap, tau, c, ldc, work, info)
call zupmtr(side, uplo, trans, m, n, ap, tau, c, ldc, work, info)
Fortran 95:
call upmtr(ap, tau, c [,side] [,uplo] [,trans] [,info])
C:
lapack_int LAPACKE_<?>upmtr( int matrix_order, char side, char uplo, char trans, lapack_int m, lapack_int n, const <datatype>* ap, const <datatype>* tau, <datatype>* c, lapack_int ldc );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine multiplies a complex matrix C by Q or Q^H, where Q is the unitary matrix formed by ?hptrd when reducing a packed complex Hermitian matrix A to tridiagonal form: A = Q*T*Q^H. Use this routine after a call to ?hptrd.
Depending on the parameters side and trans, the routine can form one of the matrix products Q*C, Q^H*C, C*Q, or C*Q^H (overwriting the result on C).
Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
In the descriptions below, r denotes the order of Q:
If side = 'L', r = m; if side = 'R', r = n.
side CHARACTER*1. Must be either 'L' or 'R'.
If side = 'L', Q or Q^H is applied to C from the left.
If side = 'R', Q or Q^H is applied to C from the right.
uplo CHARACTER*1. Must be 'U' or 'L'.
Use the same uplo as supplied to ?hptrd.
trans CHARACTER*1. Must be either 'N' or 'C'.
If trans = 'N', the routine multiplies C by Q.
If trans = 'C', the routine multiplies C by Q^H.
m INTEGER. The number of rows in the matrix C (m ≥ 0).
n INTEGER. The number of columns in C (n ≥ 0).
ap, tau, c, work COMPLEX for cupmtr
DOUBLE COMPLEX for zupmtr.
ap and tau are the arrays returned by ?hptrd.
The dimension of ap must be at least max(1, r(r+1)/2).
The dimension of tau must be at least max(1, r-1).
c(ldc,*) contains the matrix C.
The second dimension of c must be at least max(1, n).
work(*) is a workspace array.
The dimension of work must be at least max(1, n) if side = 'L'; max(1, m) if side = 'R'.
ldc INTEGER. The leading dimension of c; ldc ≥ max(1, m).
Output Parameters
c Overwritten by the product Q*C, Q^H*C, C*Q, or C*Q^H (as specified by side and trans).
info INTEGER.
If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
Fortran 95 Interface Notes
Routines in the Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine upmtr interface are the following:
ap Holds the array A of size (r*(r+1)/2), where
r = m if side = 'L'.
r = n if side = 'R'.
tau Holds the vector with the number of elements r - 1.
c Holds the matrix C of size (m,n).
side Must be 'L' or 'R'. The default value is 'L'.
uplo Must be 'U' or 'L'. The default value is 'U'.
trans Must be 'N' or 'C'. The default value is 'N'.
Application Notes
The computed product differs from the exact product by a matrix E such that ||E||_2 = O(ε)*||C||_2, where ε is the machine precision.
The total number of floating-point operations is approximately 8*m^2*n if side = 'L' or 8*n^2*m if side = 'R'.
The real counterpart of this routine is opmtr.
?sbtrd
Reduces a real symmetric band matrix to tridiagonal form.
Syntax
Fortran 77:
call ssbtrd(vect, uplo, n, kd, ab, ldab, d, e, q, ldq, work, info)
call dsbtrd(vect, uplo, n, kd, ab, ldab, d, e, q, ldq, work, info)
Fortran 95:
call sbtrd(ab [,q] [,vect] [,uplo] [,info])
C:
lapack_int LAPACKE_<?>sbtrd( int matrix_order, char vect, char uplo, lapack_int n, lapack_int kd, <datatype>* ab, lapack_int ldab, <datatype>* d, <datatype>* e, <datatype>* q, lapack_int ldq );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine reduces a real symmetric band matrix A to symmetric tridiagonal form T by an orthogonal similarity transformation: A = Q*T*Q^T. The orthogonal matrix Q is determined as a product of Givens rotations.
If required, the routine can also form the matrix Q explicitly.
Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
vect CHARACTER*1. Must be 'V', 'N', or 'U'.
If vect = 'V', the routine returns the explicit matrix Q.
If vect = 'N', the routine does not return Q.
If vect = 'U', the routine updates a matrix X supplied in q by forming X*Q.
uplo CHARACTER*1. Must be 'U' or 'L'.
If uplo = 'U', ab stores the upper triangular part of A.
If uplo = 'L', ab stores the lower triangular part of A.
n INTEGER. The order of the matrix A (n ≥ 0).
kd INTEGER. The number of super- or sub-diagonals in A (kd ≥ 0).
ab, q, work REAL for ssbtrd
DOUBLE PRECISION for dsbtrd.
ab(ldab,*) is an array containing either upper or lower triangular part of the matrix A (as specified by uplo) in band storage format.
The second dimension of ab must be at least max(1, n).
q(ldq,*) is an array.
If vect = 'U', the q array must contain an n-by-n matrix X.
If vect = 'N' or 'V', the q parameter need not be set.
The second dimension of q must be at least max(1, n).
work(*) is a workspace array.
The dimension of work must be at least max(1, n).
ldab INTEGER. The leading dimension of ab; at least kd+1.
ldq INTEGER. The leading dimension of q. Constraints:
ldq ≥ max(1, n) if vect = 'V';
ldq ≥ 1 if vect = 'N'.
Output Parameters
ab On exit, the diagonal elements of the array ab are overwritten by the diagonal elements of the tridiagonal matrix T. If kd > 0, the elements on the first superdiagonal (if uplo = 'U') or the first subdiagonal (if uplo = 'L') are overwritten by the off-diagonal elements of T. The rest of ab is overwritten by values generated during the reduction.
d, e, q REAL for ssbtrd
DOUBLE PRECISION for dsbtrd.
Arrays:
d(*) contains the diagonal elements of the matrix T.
The dimension of d must be at least max(1, n).
e(*) contains the off-diagonal elements of T.
The dimension of e must be at least max(1, n-1).
q(ldq,*) is not referenced if vect = 'N'.
If vect = 'V', q contains the n-by-n matrix Q.
The second dimension of q must be:
at least max(1, n) if vect = 'V';
at least 1 if vect = 'N'.
info INTEGER.
If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
Fortran 95 Interface Notes
Routines in the Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine sbtrd interface are the following:
ab Holds the array A of size (kd+1,n).
q Holds the matrix Q of size (n,n).
uplo Must be 'U' or 'L'. The default value is 'U'.
vect If omitted, this argument is restored based on the presence of argument q as follows:
vect = 'V', if q is present,
vect = 'N', if q is omitted.
If present, vect must be equal to 'V' or 'U' and the argument q must also be present. Note that there will be an error condition if vect is present and q omitted.
Note that diagonal (d) and off-diagonal (e) elements of the matrix T are omitted because they are kept in the matrix A on exit.
Application Notes
The computed matrix T is exactly similar to a matrix A+E, where ||E||_2 = c(n)*ε*||A||_2, c(n) is a modestly increasing function of n, and ε is the machine precision. The computed matrix Q differs from an exactly orthogonal matrix by a matrix E such that ||E||_2 = O(ε).
The total number of floating-point operations is approximately 6n^2*kd if vect = 'N', with 3n^3*(kd-1)/kd additional operations if vect = 'V'.
The complex counterpart of this routine is hbtrd.
?hbtrd
Reduces a complex Hermitian band matrix to tridiagonal form.
Syntax
Fortran 77:
call chbtrd(vect, uplo, n, kd, ab, ldab, d, e, q, ldq, work, info)
call zhbtrd(vect, uplo, n, kd, ab, ldab, d, e, q, ldq, work, info)
Fortran 95:
call hbtrd(ab [,q] [,vect] [,uplo] [,info])
C:
lapack_int LAPACKE_chbtrd( int matrix_order, char vect, char uplo, lapack_int n, lapack_int kd, lapack_complex_float* ab, lapack_int ldab, float* d, float* e, lapack_complex_float* q, lapack_int ldq );
lapack_int LAPACKE_zhbtrd( int matrix_order, char vect, char uplo, lapack_int n, lapack_int kd, lapack_complex_double* ab, lapack_int ldab, double* d, double* e, lapack_complex_double* q, lapack_int ldq );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine reduces a complex Hermitian band matrix A to symmetric tridiagonal form T by a unitary similarity transformation: A = Q*T*Q^H. The unitary matrix Q is determined as a product of Givens rotations.
If required, the routine can also form the matrix Q explicitly.
Input Parameters
The data types are given for the Fortran interface.
A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
vect CHARACTER*1. Must be 'V' or 'N'.
If vect = 'V', the routine returns the explicit matrix Q.
If vect = 'N', the routine does not return Q.
uplo CHARACTER*1. Must be 'U' or 'L'.
If uplo = 'U', ab stores the upper triangular part of A.
If uplo = 'L', ab stores the lower triangular part of A.
n INTEGER. The order of the matrix A (n ≥ 0).
kd INTEGER. The number of super- or sub-diagonals in A (kd ≥ 0).
ab, work COMPLEX for chbtrd
DOUBLE COMPLEX for zhbtrd.
ab(ldab,*) is an array containing either upper or lower triangular part of the matrix A (as specified by uplo) in band storage format.
The second dimension of ab must be at least max(1, n).
work(*) is a workspace array.
The dimension of work must be at least max(1, n).
ldab INTEGER. The leading dimension of ab; at least kd+1.
ldq INTEGER. The leading dimension of q. Constraints:
ldq ≥ max(1, n) if vect = 'V';
ldq ≥ 1 if vect = 'N'.
Output Parameters
ab On exit, the diagonal elements of the array ab are overwritten by the diagonal elements of the tridiagonal matrix T. If kd > 0, the elements on the first superdiagonal (if uplo = 'U') or the first subdiagonal (if uplo = 'L') are overwritten by the off-diagonal elements of T. The rest of ab is overwritten by values generated during the reduction.
d, e REAL for chbtrd
DOUBLE PRECISION for zhbtrd.
Arrays:
d(*) contains the diagonal elements of the matrix T.
The dimension of d must be at least max(1, n).
e(*) contains the off-diagonal elements of T.
The dimension of e must be at least max(1, n-1).
q COMPLEX for chbtrd
DOUBLE COMPLEX for zhbtrd.
Array, DIMENSION (ldq,*).
If vect = 'N', q is not referenced.
If vect = 'V', q contains the n-by-n matrix Q.
The second dimension of q must be:
at least max(1, n) if vect = 'V';
at least 1 if vect = 'N'.
info INTEGER.
If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
Fortran 95 Interface Notes
Routines in the Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine hbtrd interface are the following:
ab Holds the array A of size (kd+1,n).
q Holds the matrix Q of size (n,n).
uplo Must be 'U' or 'L'. The default value is 'U'.
vect If omitted, this argument is restored based on the presence of argument q as follows:
vect = 'V', if q is present,
vect = 'N', if q is omitted.
If present, vect must be equal to 'V' or 'U' and the argument q must also be present. Note that there will be an error condition if vect is present and q omitted.
Note that diagonal (d) and off-diagonal (e) elements of the matrix T are omitted because they are kept in the matrix A on exit.
Application Notes
The computed matrix T is exactly similar to a matrix A + E, where ||E||_2 = c(n)*ε*||A||_2, c(n) is a modestly increasing function of n, and ε is the machine precision. The computed matrix Q differs from an exactly unitary matrix by a matrix E such that ||E||_2 = O(ε).
The total number of floating-point operations is approximately 20n^2*kd if vect = 'N', with 10n^3*(kd-1)/kd additional operations if vect = 'V'.
The real counterpart of this routine is sbtrd.
?sterf
Computes all eigenvalues of a real symmetric tridiagonal matrix using QR algorithm.
Syntax
Fortran 77:
call ssterf(n, d, e, info)
call dsterf(n, d, e, info)
Fortran 95:
call sterf(d, e [,info])
C:
lapack_int LAPACKE_<?>sterf( lapack_int n, <datatype>* d, <datatype>* e );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine computes all the eigenvalues of a real symmetric tridiagonal matrix T (which can be obtained by reducing a symmetric or Hermitian matrix to tridiagonal form). The routine uses a square-root-free variant of the QR algorithm.
If you need not only the eigenvalues but also the eigenvectors, call steqr.
Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
n INTEGER. The order of the matrix T (n ≥ 0).
d, e REAL for ssterf
DOUBLE PRECISION for dsterf.
Arrays:
d(*) contains the diagonal elements of T.
The dimension of d must be at least max(1, n).
e(*) contains the off-diagonal elements of T.
The dimension of e must be at least max(1, n-1).
Output Parameters
d The n eigenvalues in ascending order, unless info > 0.
See also info.
e On exit, the array is overwritten; see info.
info INTEGER.
If info = 0, the execution is successful.
If info = i, the algorithm failed to find all the eigenvalues after 30n iterations: i off-diagonal elements have not converged to zero. On exit, d and e contain, respectively, the diagonal and off-diagonal elements of a tridiagonal matrix orthogonally similar to T.
If info = -i, the i-th parameter had an illegal value.
Fortran 95 Interface Notes
Routines in the Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine sterf interface are the following:
d Holds the vector of length n.
e Holds the vector of length (n-1).
Application Notes
The computed eigenvalues are exact for a matrix T+E such that ||E||_2 = O(ε)*||T||_2, where ε is the machine precision.
If λi is an exact eigenvalue, and µi is the corresponding computed value, then
|µi - λi| ≤ c(n)*ε*||T||_2
where c(n) is a modestly increasing function of n.
The total number of floating-point operations depends on how rapidly the algorithm converges. Typically, it is about 14n^2.
?steqr
Computes all eigenvalues and eigenvectors of a symmetric or Hermitian matrix reduced to tridiagonal form (QR algorithm).
Syntax
Fortran 77:
call ssteqr(compz, n, d, e, z, ldz, work, info)
call dsteqr(compz, n, d, e, z, ldz, work, info)
call csteqr(compz, n, d, e, z, ldz, work, info)
call zsteqr(compz, n, d, e, z, ldz, work, info)
Fortran 95:
call rsteqr(d, e [,z] [,compz] [,info])
call steqr(d, e [,z] [,compz] [,info])
C:
lapack_int LAPACKE_ssteqr( int matrix_order, char compz, lapack_int n, float* d, float* e, float* z, lapack_int ldz );
lapack_int LAPACKE_dsteqr( int matrix_order, char compz, lapack_int n, double* d, double* e, double* z, lapack_int ldz );
lapack_int LAPACKE_csteqr( int matrix_order, char compz, lapack_int n, float* d, float* e, lapack_complex_float* z, lapack_int ldz );
lapack_int LAPACKE_zsteqr( int matrix_order, char compz, lapack_int n, double* d, double* e, lapack_complex_double* z, lapack_int ldz );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine computes all the eigenvalues and (optionally) all the eigenvectors of a real symmetric tridiagonal matrix T. In other words, the routine can compute the spectral factorization: T = Z*Λ*Z^T. Here Λ
is a diagonal matrix whose diagonal elements are the eigenvalues λi; Z is an orthogonal matrix whose columns are eigenvectors. Thus,
T*zi = λi*zi for i = 1, 2, ..., n.
The routine normalizes the eigenvectors so that ||zi||_2 = 1.
You can also use the routine for computing the eigenvalues and eigenvectors of an arbitrary real symmetric (or complex Hermitian) matrix A reduced to tridiagonal form T: A = Q*T*Q^H. In this case, the spectral factorization is as follows: A = Q*T*Q^H = (Q*Z)*Λ*(Q*Z)^H. Before calling ?steqr, you must reduce A to tridiagonal form and generate the explicit matrix Q by calling the following routines:

                  for real matrices:     for complex matrices:
full storage      ?sytrd, ?orgtr         ?hetrd, ?ungtr
packed storage    ?sptrd, ?opgtr         ?hptrd, ?upgtr
band storage      ?sbtrd (vect='V')      ?hbtrd (vect='V')

If you need eigenvalues only, it is more efficient to call sterf. If T is positive-definite, pteqr can compute small eigenvalues more accurately than ?steqr.
To solve the problem by a single call, use one of the divide and conquer routines stevd, syevd, spevd, or sbevd for real symmetric matrices or heevd, hpevd, or hbevd for complex Hermitian matrices.
Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
compz CHARACTER*1. Must be 'N' or 'I' or 'V'.
If compz = 'N', the routine computes eigenvalues only.
If compz = 'I', the routine computes the eigenvalues and eigenvectors of the tridiagonal matrix T.
If compz = 'V', the routine computes the eigenvalues and eigenvectors of A (and the array z must contain the matrix Q on entry).
n INTEGER. The order of the matrix T (n ≥ 0).
d, e, work REAL for single-precision flavors
DOUBLE PRECISION for double-precision flavors.
Arrays:
d(*) contains the diagonal elements of T.
The dimension of d must be at least max(1, n).
e(*) contains the off-diagonal elements of T.
The dimension of e must be at least max(1, n-1).
work(*) is a workspace array.
The dimension of work must be:
at least 1 if compz = 'N';
at least max(1, 2*n-2) if compz = 'V' or 'I'.
z REAL for ssteqr
DOUBLE PRECISION for dsteqr
COMPLEX for csteqr
DOUBLE COMPLEX for zsteqr.
Array, DIMENSION (ldz, *).
If compz = 'N' or 'I', z need not be set.
If compz = 'V', z must contain the n-by-n matrix Q.
The second dimension of z must be:
at least 1 if compz = 'N';
at least max(1, n) if compz = 'V' or 'I'.
ldz INTEGER. The leading dimension of z. Constraints:
ldz ≥ 1 if compz = 'N';
ldz ≥ max(1, n) if compz = 'V' or 'I'.
Output Parameters
d The n eigenvalues in ascending order, unless info > 0.
See also info.
e On exit, the array is overwritten; see info.
z If info = 0, contains the n orthonormal eigenvectors, stored by columns. (The i-th column corresponds to the i-th eigenvalue.)
info INTEGER.
If info = 0, the execution is successful.
If info = i, the algorithm failed to find all the eigenvalues after 30n iterations: i off-diagonal elements have not converged to zero. On exit, d and e contain, respectively, the diagonal and off-diagonal elements of a tridiagonal matrix orthogonally similar to T.
If info = -i, the i-th parameter had an illegal value.
Fortran 95 Interface Notes
Routines in the Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine steqr interface are the following:
d Holds the vector of length n.
e Holds the vector of length (n-1).
z        Holds the matrix Z of size (n,n).

compz    If omitted, this argument is restored based on the presence of argument z as follows:
         compz = 'I', if z is present,
         compz = 'N', if z is omitted.
         If present, compz must be equal to 'I' or 'V' and the argument z must also be present. Note that there will be an error condition if compz is present and z omitted.

Note that two variants of the Fortran 95 interface for the steqr routine are needed because an ambiguous choice between the real and complex cases appears when z is omitted. Thus, the name rsteqr is used in real cases (single or double precision), and the name steqr is used in complex cases (single or double precision).

Application Notes

The computed eigenvalues and eigenvectors are exact for a matrix T+E such that ||E||_2 = O(ε)*||T||_2, where ε is the machine precision.

If λ_i is an exact eigenvalue, and μ_i is the corresponding computed value, then

|μ_i - λ_i| ≤ c(n)*ε*||T||_2

where c(n) is a modestly increasing function of n.

If z_i is the corresponding exact eigenvector, and w_i is the corresponding computed vector, then the angle θ(z_i, w_i) between them is bounded as follows:

θ(z_i, w_i) ≤ c(n)*ε*||T||_2 / min_{i≠j} |λ_i - λ_j|.

The total number of floating-point operations depends on how rapidly the algorithm converges. Typically, it is about
24*n^2 if compz = 'N';
7*n^3 (for complex flavors, 14*n^3) if compz = 'V' or 'I'.

?stemr
Computes selected eigenvalues and eigenvectors of a real symmetric tridiagonal matrix.
Syntax

Fortran 77:
call sstemr(jobz, range, n, d, e, vl, vu, il, iu, m, w, z, ldz, nzc, isuppz, tryrac, work, lwork, iwork, liwork, info)
call dstemr(jobz, range, n, d, e, vl, vu, il, iu, m, w, z, ldz, nzc, isuppz, tryrac, work, lwork, iwork, liwork, info)
call cstemr(jobz, range, n, d, e, vl, vu, il, iu, m, w, z, ldz, nzc, isuppz, tryrac, work, lwork, iwork, liwork, info)
call zstemr(jobz, range, n, d, e, vl, vu, il, iu, m, w, z, ldz, nzc, isuppz, tryrac, work, lwork, iwork, liwork, info)

C:
lapack_int LAPACKE_sstemr( int matrix_order, char jobz, char range, lapack_int n, const float* d, float* e, float vl, float vu, lapack_int il, lapack_int iu, lapack_int* m, float* w, float* z, lapack_int ldz, lapack_int nzc, lapack_int* isuppz, lapack_logical* tryrac );
lapack_int LAPACKE_dstemr( int matrix_order, char jobz, char range, lapack_int n, const double* d, double* e, double vl, double vu, lapack_int il, lapack_int iu, lapack_int* m, double* w, double* z, lapack_int ldz, lapack_int nzc, lapack_int* isuppz, lapack_logical* tryrac );
lapack_int LAPACKE_cstemr( int matrix_order, char jobz, char range, lapack_int n, const float* d, float* e, float vl, float vu, lapack_int il, lapack_int iu, lapack_int* m, float* w, lapack_complex_float* z, lapack_int ldz, lapack_int nzc, lapack_int* isuppz, lapack_logical* tryrac );
lapack_int LAPACKE_zstemr( int matrix_order, char jobz, char range, lapack_int n, const double* d, double* e, double vl, double vu, lapack_int il, lapack_int iu, lapack_int* m, double* w, lapack_complex_double* z, lapack_int ldz, lapack_int nzc, lapack_int* isuppz, lapack_logical* tryrac );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description

The routine computes selected eigenvalues and, optionally, eigenvectors of a real symmetric tridiagonal matrix T. Any such unreduced matrix has a well-defined set of pairwise different real eigenvalues; the corresponding real eigenvectors are pairwise orthogonal.
The spectrum may be computed either completely or partially by specifying either an interval (vl,vu] or a range of indices il:iu for the desired eigenvalues.

Depending on the number of desired eigenvalues, these are computed either by bisection or the dqds algorithm. Numerically orthogonal eigenvectors are computed by the use of various suitable L*D*L^T factorizations near clusters of close eigenvalues (referred to as RRRs, Relatively Robust Representations). An informal sketch of the algorithm follows.

For each unreduced block (submatrix) of T,
a. Compute T - sigma*I = L*D*L^T, so that L and D define all the wanted eigenvalues to high relative accuracy. This means that small relative changes in the entries of L and D cause only small relative changes in the eigenvalues and eigenvectors. The standard (unfactored) representation of the tridiagonal matrix T does not have this property in general.
b. Compute the eigenvalues to suitable accuracy. If the eigenvectors are desired, the algorithm attains full accuracy of the computed eigenvalues only right before the corresponding vectors have to be computed, see steps c and d.
c. For each cluster of close eigenvalues, select a new shift close to the cluster, find a new factorization, and refine the shifted eigenvalues to suitable accuracy.
d. For each eigenvalue with a large enough relative separation, compute the corresponding eigenvector by forming a rank-revealing twisted factorization. Go back to step c for any clusters that remain.

For more details, see: [Dhillon04], [Dhillon04-02], [Dhillon97]

The routine works only on machines which follow the IEEE-754 floating-point standard in their handling of infinities and NaNs (NaN stands for "not a number"). This permits the use of efficient inner loops avoiding a check for zero divisors.

LAPACK routines can be used to reduce a complex Hermitian matrix to real symmetric tridiagonal form.
(Any complex Hermitian tridiagonal matrix has real values on its diagonal and potentially complex numbers on its off-diagonals. By applying a similarity transform with an appropriate diagonal matrix diag(1, e^{i*φ_1}, ..., e^{i*φ_{n-1}}), the complex Hermitian matrix can be transformed into a real symmetric matrix and complex arithmetic can be entirely avoided.) While the eigenvectors of the real symmetric tridiagonal matrix are real, the eigenvectors of the original complex Hermitian matrix have complex entries in general. Since LAPACK drivers overwrite the matrix data with the eigenvectors, zstemr accepts complex workspace to facilitate interoperability with zunmtr or zupmtr.

Input Parameters

The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

jobz     CHARACTER*1. Must be 'N' or 'V'.
         If jobz = 'N', then only eigenvalues are computed.
         If jobz = 'V', then eigenvalues and eigenvectors are computed.

range    CHARACTER*1. Must be 'A' or 'V' or 'I'.
         If range = 'A', the routine computes all eigenvalues.
         If range = 'V', the routine computes all eigenvalues in the half-open interval: (vl, vu].
         If range = 'I', the routine computes eigenvalues with indices il to iu.

n        INTEGER. The order of the matrix T (n ≥ 0).

d        REAL for single precision flavors
         DOUBLE PRECISION for double precision flavors.
         Array, DIMENSION (n).
         Contains n diagonal elements of the tridiagonal matrix T.

e        REAL for single precision flavors
         DOUBLE PRECISION for double precision flavors.
         Array, DIMENSION (n).
         Contains (n-1) off-diagonal elements of the tridiagonal matrix T in elements 1 to n-1 of e. e(n) need not be set on input, but is used internally as workspace.

vl, vu   REAL for single precision flavors
         DOUBLE PRECISION for double precision flavors.
         If range = 'V', the lower and upper bounds of the interval to be searched for eigenvalues. Constraint: vl < vu.
         If range = 'A' or 'I', vl and vu are not referenced.

il, iu   INTEGER.
         If range = 'I', the indices in ascending order of the smallest and largest eigenvalues to be returned. Constraint: 1 ≤ il ≤ iu ≤ n, if n > 0.
         If range = 'A' or 'V', il and iu are not referenced.

ldz      INTEGER. The leading dimension of the output array z.
         If jobz = 'V', then ldz ≥ max(1, n);
         ldz ≥ 1 otherwise.

nzc      INTEGER. The number of eigenvectors to be held in the array z.
         If range = 'A', then nzc ≥ max(1, n);
         If range = 'V', then nzc is greater than or equal to the number of eigenvalues in the half-open interval: (vl, vu].
         If range = 'I', then nzc ≥ iu-il+1.
         If nzc = -1, then a workspace query is assumed; the routine calculates the number of columns of the array z that are needed to hold the eigenvectors. This value is returned as the first entry of the array z, and no error message related to nzc is issued by the routine xerbla.

tryrac   LOGICAL.
         If tryrac = .TRUE., it indicates that the code should check whether the tridiagonal matrix defines its eigenvalues to high relative accuracy. If so, the code uses relative-accuracy preserving algorithms that might be (a bit) slower depending on the matrix. If the matrix does not define its eigenvalues to high relative accuracy, the code can use possibly faster algorithms.
         If tryrac = .FALSE., the code is not required to guarantee relatively accurate eigenvalues and can use the fastest possible techniques.

work     REAL for single precision flavors
         DOUBLE PRECISION for double precision flavors.
         Workspace array, DIMENSION (lwork).

lwork    INTEGER.
         The dimension of the array work, lwork ≥ max(1, 18*n).
         If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla.

iwork    INTEGER.
         Workspace array, DIMENSION (liwork).

liwork   INTEGER.
         The dimension of the array iwork. liwork ≥ max(1, 10*n) if the eigenvectors are desired, and liwork ≥ max(1, 8*n) if only the eigenvalues are to be computed.
         If liwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the iwork array, returns this value as the first entry of the iwork array, and no error message related to liwork is issued by xerbla.

Output Parameters

e        On exit, the array e is overwritten.

m        INTEGER.
         The total number of eigenvalues found, 0 ≤ m ≤ n.
         If range = 'A', then m = n, and if range = 'I', then m = iu-il+1.

w        REAL for single precision flavors
         DOUBLE PRECISION for double precision flavors.
         Array, DIMENSION (n).
         The first m elements contain the selected eigenvalues in ascending order.

z        REAL for sstemr
         DOUBLE PRECISION for dstemr
         COMPLEX for cstemr
         DOUBLE COMPLEX for zstemr.
         Array z(ldz, *), the second dimension of z must be at least max(1, m).
         If jobz = 'V', and info = 0, then the first m columns of z contain the orthonormal eigenvectors of the matrix T corresponding to the selected eigenvalues, with the i-th column of z holding the eigenvector associated with w(i).
         If jobz = 'N', then z is not referenced.
         Note: you must ensure that at least max(1,m) columns are supplied in the array z; if range = 'V', the exact value of m is not known in advance and can be computed with a workspace query by setting nzc = -1, see description of the parameter nzc.

isuppz   INTEGER.
         Array, DIMENSION (2*max(1, m)).
         The support of the eigenvectors in z, that is, the indices indicating the nonzero elements in z. The i-th computed eigenvector is nonzero only in elements isuppz(2*i-1) through isuppz(2*i). This is relevant in the case when the matrix is split. isuppz is only accessed when jobz = 'V' and n > 0.

tryrac   On exit, tryrac is set to .FALSE. if it was .TRUE. on entry and the matrix does not define its eigenvalues to high relative accuracy.

work(1)  On exit, if info = 0, then work(1) returns the optimal (and minimal) size of lwork.

iwork(1) On exit, if info = 0, then iwork(1) returns the optimal size of liwork.

info     INTEGER.
         If info = 0, the execution is successful.
         If info = -i, the i-th parameter had an illegal value.
         If info = 1x, internal error in ?larre occurred,
         If info = 2x, internal error in ?larrv occurred.
         Here the digit x = abs(iinfo) < 10, where iinfo is the non-zero error code returned by ?larre or ?larrv, respectively.

?stedc
Computes all eigenvalues and eigenvectors of a symmetric tridiagonal matrix using the divide and conquer method.

Syntax

Fortran 77:
call sstedc(compz, n, d, e, z, ldz, work, lwork, iwork, liwork, info)
call dstedc(compz, n, d, e, z, ldz, work, lwork, iwork, liwork, info)
call cstedc(compz, n, d, e, z, ldz, work, lwork, rwork, lrwork, iwork, liwork, info)
call zstedc(compz, n, d, e, z, ldz, work, lwork, rwork, lrwork, iwork, liwork, info)

Fortran 95:
call rstedc(d, e [,z] [,compz] [,info])
call stedc(d, e [,z] [,compz] [,info])

C:
lapack_int LAPACKE_sstedc( int matrix_order, char compz, lapack_int n, float* d, float* e, float* z, lapack_int ldz );
lapack_int LAPACKE_dstedc( int matrix_order, char compz, lapack_int n, double* d, double* e, double* z, lapack_int ldz );
lapack_int LAPACKE_cstedc( int matrix_order, char compz, lapack_int n, float* d, float* e, lapack_complex_float* z, lapack_int ldz );
lapack_int LAPACKE_zstedc( int matrix_order, char compz, lapack_int n, double* d, double* e, lapack_complex_double* z, lapack_int ldz );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description

The routine computes all the eigenvalues and (optionally) all the eigenvectors of a symmetric tridiagonal matrix using the divide and conquer method. The eigenvectors of a full or band real symmetric or complex Hermitian matrix can also be found if sytrd/hetrd or sptrd/hptrd or sbtrd/hbtrd has been used to reduce this matrix to tridiagonal form.

See also laed0, laed1, laed2, laed3, laed4, laed5, laed6, laed7, laed8, laed9, and laeda used by this function.

Input Parameters

The data types are given for the Fortran interface.
A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

compz    CHARACTER*1. Must be 'N' or 'I' or 'V'.
         If compz = 'N', the routine computes eigenvalues only.
         If compz = 'I', the routine computes the eigenvalues and eigenvectors of the tridiagonal matrix.
         If compz = 'V', the routine computes the eigenvalues and eigenvectors of the original symmetric/Hermitian matrix. On entry, the array z must contain the orthogonal/unitary matrix used to reduce the original matrix to tridiagonal form.

n        INTEGER. The order of the symmetric tridiagonal matrix (n ≥ 0).

d, e, rwork    REAL for single-precision flavors
         DOUBLE PRECISION for double-precision flavors.
         Arrays:
         d(*) contains the diagonal elements of the tridiagonal matrix. The dimension of d must be at least max(1, n).
         e(*) contains the subdiagonal elements of the tridiagonal matrix. The dimension of e must be at least max(1, n-1).
         rwork is a workspace array, its dimension max(1, lrwork).

z, work  REAL for sstedc
         DOUBLE PRECISION for dstedc
         COMPLEX for cstedc
         DOUBLE COMPLEX for zstedc.
         Arrays: z(ldz, *), work(*).
         If compz = 'V', then, on entry, z must contain the orthogonal/unitary matrix used to reduce the original matrix to tridiagonal form.
         The second dimension of z must be at least max(1, n).
         work is a workspace array, its dimension max(1, lwork).

ldz      INTEGER. The leading dimension of z. Constraints:
         ldz ≥ 1 if compz = 'N';
         ldz ≥ max(1, n) if compz = 'V' or 'I'.

lwork    INTEGER. The dimension of the array work.
         For real functions sstedc and dstedc:
         • If compz = 'N' or n = 1, lwork must be at least 1.
         • If compz = 'V' and n > 1, lwork must be at least 1 + 3*n + 2*n*log2(n) + 4*n^2, where log2(n) is the smallest integer k such that 2^k ≥ n.
         • If compz = 'I' and n > 1, lwork must be at least 1 + 4*n + n^2.
         Note that for compz = 'I' or 'V' and if n is less than or equal to the minimum divide size, usually 25, then lwork need only be max(1, 2*(n-1)).
         For complex functions cstedc and zstedc:
         • If compz = 'N' or 'I', or n = 1, lwork must be at least 1.
         • If compz = 'V' and n > 1, lwork must be at least n^2.
         Note that for compz = 'V', and if n is less than or equal to the minimum divide size, usually 25, then lwork need only be 1.
         If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work, rwork and iwork arrays, returns these values as the first entries of the work, rwork and iwork arrays, and no error message related to lwork or lrwork or liwork is issued by xerbla. See Application Notes for the required value of lwork.

lrwork   INTEGER. The dimension of the array rwork (used for complex flavors only).
         If compz = 'N', or n = 1, lrwork must be at least 1.
         If compz = 'V' and n > 1, lrwork must be at least (1 + 3*n + 2*n*lg(n) + 4*n*n), where lg(n) is the smallest integer k such that 2**k ≥ n.
         If compz = 'I' and n > 1, lrwork must be at least (1 + 4*n + 2*n*n).
         Note that for compz = 'V' or 'I', and if n is less than or equal to the minimum divide size, usually 25, then lrwork need only be max(1, 2*(n-1)).
         If lrwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work, rwork and iwork arrays, returns these values as the first entries of the work, rwork and iwork arrays, and no error message related to lwork or lrwork or liwork is issued by xerbla. See Application Notes for the required value of lrwork.

iwork    INTEGER. Workspace array, its dimension max(1, liwork).

liwork   INTEGER. The dimension of the array iwork.
         If compz = 'N', or n = 1, liwork must be at least 1.
         If compz = 'V' and n > 1, liwork must be at least (6 + 6*n + 5*n*lg(n)), where lg(n) is the smallest integer k such that 2**k ≥ n.
         If compz = 'I' and n > 1, liwork must be at least (3 + 5*n).
         Note that for compz = 'V' or 'I', and if n is less than or equal to the minimum divide size, usually 25, then liwork need only be 1.
         If liwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work, rwork and iwork arrays, returns these values as the first entries of the work, rwork and iwork arrays, and no error message related to lwork or lrwork or liwork is issued by xerbla. See Application Notes for the required value of liwork.

Output Parameters

d        The n eigenvalues in ascending order, unless info ≠ 0. See also info.

e        On exit, the array is overwritten; see info.

z        If info = 0, then if compz = 'V', z contains the orthonormal eigenvectors of the original symmetric/Hermitian matrix, and if compz = 'I', z contains the orthonormal eigenvectors of the symmetric tridiagonal matrix. If compz = 'N', z is not referenced.

work(1)  On exit, if info = 0, then work(1) returns the optimal lwork.

rwork(1) On exit, if info = 0, then rwork(1) returns the optimal lrwork (for complex flavors only).

iwork(1) On exit, if info = 0, then iwork(1) returns the optimal liwork.

info     INTEGER.
         If info = 0, the execution is successful.
         If info = -i, the i-th parameter had an illegal value.
         If info = i, the algorithm failed to compute an eigenvalue while working on the submatrix lying in rows and columns i/(n+1) through mod(i, n+1).

Fortran 95 Interface Notes

Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.

Specific details for the routine stedc interface are the following:

d        Holds the vector of length n.

e        Holds the vector of length (n-1).

z        Holds the matrix Z of size (n,n).
compz    If omitted, this argument is restored based on the presence of argument z as follows:
         compz = 'I', if z is present,
         compz = 'N', if z is omitted.
         If present, compz must be equal to 'I' or 'V' and the argument z must also be present. Note that there will be an error condition if compz is present and z omitted.

Note that two variants of the Fortran 95 interface for the stedc routine are needed because an ambiguous choice between the real and complex cases appears when z and work are omitted. Thus, the name rstedc is used in real cases (single or double precision), and the name stedc is used in complex cases (single or double precision).

Application Notes

The required size of workspace arrays must be as follows.

For sstedc/dstedc:
If compz = 'N' or n = 1 then lwork must be at least 1.
If compz = 'V' and n > 1 then lwork must be at least (1 + 3*n + 2*n*lg(n) + 4*n^2), where lg(n) = smallest integer k such that 2^k ≥ n.
If compz = 'I' and n > 1 then lwork must be at least (1 + 4*n + n^2).
If compz = 'N' or n = 1 then liwork must be at least 1.
If compz = 'V' and n > 1 then liwork must be at least (6 + 6*n + 5*n*lg(n)).
If compz = 'I' and n > 1 then liwork must be at least (3 + 5*n).

For cstedc/zstedc:
If compz = 'N' or 'I', or n = 1, lwork must be at least 1.
If compz = 'V' and n > 1, lwork must be at least n^2.
If compz = 'N' or n = 1, lrwork must be at least 1.
If compz = 'V' and n > 1, lrwork must be at least (1 + 3*n + 2*n*lg(n) + 4*n^2), where lg(n) = smallest integer k such that 2^k ≥ n.
If compz = 'I' and n > 1, lrwork must be at least (1 + 4*n + 2*n^2).
The required value of liwork for complex flavors is the same as for real flavors.

If lwork (or liwork or lrwork, if supplied) is equal to -1, then the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work, iwork, rwork). This operation is called a workspace query.
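The workspace bounds for sstedc/dstedc are plain arithmetic in n, so they can be evaluated ahead of the call. The C sketch below is an illustration only and is not part of Intel MKL: the helper names ceil_log2, stedc_real_lwork, and stedc_liwork are hypothetical, the compz = 'V' case assumes the 4*n^2 term given for lwork under Input Parameters, and the relaxed bounds for n at or below the minimum divide size are ignored (the general bound is always sufficient). A workspace query (lwork = -1) remains the more robust way to size the arrays.

```c
#include <assert.h>

/* Hypothetical helpers for illustration; they are not Intel MKL functions. */

/* lg(n): the smallest integer k such that 2^k >= n (requires n >= 1). */
int ceil_log2(long n)
{
    int k = 0;
    long p = 1;
    assert(n >= 1);
    while (p < n) {
        p *= 2;
        ++k;
    }
    return k;
}

/* Sufficient lwork for sstedc/dstedc (general bound, small-n case ignored). */
long stedc_real_lwork(char compz, long n)
{
    assert(n >= 0);
    assert(compz == 'N' || compz == 'I' || compz == 'V');
    if (compz == 'N' || n <= 1)
        return 1;
    if (compz == 'V')
        return 1 + 3 * n + 2 * n * ceil_log2(n) + 4 * n * n;
    return 1 + 4 * n + n * n;               /* compz == 'I' */
}

/* Sufficient liwork; the same bound applies to real and complex flavors. */
long stedc_liwork(char compz, long n)
{
    assert(n >= 0);
    assert(compz == 'N' || compz == 'I' || compz == 'V');
    if (compz == 'N' || n <= 1)
        return 1;
    if (compz == 'V')
        return 6 + 6 * n + 5 * n * ceil_log2(n);
    return 3 + 5 * n;                       /* compz == 'I' */
}
```

For example, with n = 100 and compz = 'V', lg(n) = 7, so the sketch gives lwork = 41701 and liwork = 4106.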
Note that if lwork (liwork, lrwork) is less than the minimal required value and is not equal to -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.

?stegr
Computes selected eigenvalues and eigenvectors of a real symmetric tridiagonal matrix.

Syntax

Fortran 77:
call sstegr(jobz, range, n, d, e, vl, vu, il, iu, abstol, m, w, z, ldz, isuppz, work, lwork, iwork, liwork, info)
call dstegr(jobz, range, n, d, e, vl, vu, il, iu, abstol, m, w, z, ldz, isuppz, work, lwork, iwork, liwork, info)
call cstegr(jobz, range, n, d, e, vl, vu, il, iu, abstol, m, w, z, ldz, isuppz, work, lwork, iwork, liwork, info)
call zstegr(jobz, range, n, d, e, vl, vu, il, iu, abstol, m, w, z, ldz, isuppz, work, lwork, iwork, liwork, info)

Fortran 95:
call rstegr(d, e, w [,z] [,vl] [,vu] [,il] [,iu] [,m] [,isuppz] [,abstol] [,info])
call stegr(d, e, w [,z] [,vl] [,vu] [,il] [,iu] [,m] [,isuppz] [,abstol] [,info])

C:
lapack_int LAPACKE_sstegr( int matrix_order, char jobz, char range, lapack_int n, float* d, float* e, float vl, float vu, lapack_int il, lapack_int iu, float abstol, lapack_int* m, float* w, float* z, lapack_int ldz, lapack_int* isuppz );
lapack_int LAPACKE_dstegr( int matrix_order, char jobz, char range, lapack_int n, double* d, double* e, double vl, double vu, lapack_int il, lapack_int iu, double abstol, lapack_int* m, double* w, double* z, lapack_int ldz, lapack_int* isuppz );
lapack_int LAPACKE_cstegr( int matrix_order, char jobz, char range, lapack_int n, float* d, float* e, float vl, float vu, lapack_int il, lapack_int iu, float abstol, lapack_int* m, float* w, lapack_complex_float* z, lapack_int ldz, lapack_int* isuppz );
lapack_int LAPACKE_zstegr( int matrix_order, char jobz, char range, lapack_int n, double* d, double* e, double vl, double vu, lapack_int il, lapack_int iu, double abstol, lapack_int* m, double* w, lapack_complex_double* z,
lapack_int ldz, lapack_int* isuppz );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description

The routine computes selected eigenvalues and, optionally, eigenvectors of a real symmetric tridiagonal matrix T. Any such unreduced matrix has a well-defined set of pairwise different real eigenvalues; the corresponding real eigenvectors are pairwise orthogonal.

The spectrum may be computed either completely or partially by specifying either an interval (vl,vu] or a range of indices il:iu for the desired eigenvalues.

?stegr is a compatibility wrapper around the improved stemr routine. See its description for further details.

Note that the abstol parameter no longer provides any benefit and hence is no longer used.

See also auxiliary lasq2, lasq5, lasq6, used by this routine.

Input Parameters

The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

jobz     CHARACTER*1. Must be 'N' or 'V'.
         If jobz = 'N', then only eigenvalues are computed.
         If jobz = 'V', then eigenvalues and eigenvectors are computed.

range    CHARACTER*1. Must be 'A' or 'V' or 'I'.
         If range = 'A', the routine computes all eigenvalues.
         If range = 'V', the routine computes eigenvalues lambda(i) in the half-open interval: vl < lambda(i) ≤ vu.
         If range = 'I', the routine computes eigenvalues with indices il to iu.

n        INTEGER. The order of the matrix T (n ≥ 0).

d, e, work    REAL for single precision flavors
         DOUBLE PRECISION for double precision flavors.
         Arrays:
         d(*) contains the diagonal elements of T. The dimension of d must be at least max(1, n).
         e(*) contains the subdiagonal elements of T in elements 1 to n-1; e(n) need not be set on input, but it is used as a workspace. The dimension of e must be at least max(1, n).
         work(lwork) is a workspace array.

vl, vu   REAL for single precision flavors
         DOUBLE PRECISION for double precision flavors.
         If range = 'V', the lower and upper bounds of the interval to be searched for eigenvalues. Constraint: vl < vu.
         If range = 'A' or 'I', vl and vu are not referenced.

il, iu   INTEGER.
         If range = 'I', the indices in ascending order of the smallest and largest eigenvalues to be returned. Constraint: 1 ≤ il ≤ iu ≤ n, if n > 0.
         If range = 'A' or 'V', il and iu are not referenced.

abstol   REAL for single precision flavors
         DOUBLE PRECISION for double precision flavors.
         Unused. Was the absolute error tolerance for the eigenvalues/eigenvectors in previous versions.

ldz      INTEGER. The leading dimension of the output array z. Constraints:
         ldz ≥ 1 if jobz = 'N';
         ldz ≥ max(1, n) if jobz = 'V'.

lwork    INTEGER.
         The dimension of the array work, lwork ≥ max(1, 18*n) if jobz = 'V', and lwork ≥ max(1, 12*n) if jobz = 'N'.
         If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla. See Application Notes below for details.

iwork    INTEGER.
         Workspace array, DIMENSION (liwork).

liwork   INTEGER.
         The dimension of the array iwork, liwork ≥ max(1, 10*n) if the eigenvectors are desired, and liwork ≥ max(1, 8*n) if only the eigenvalues are to be computed.
         If liwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the iwork array, returns this value as the first entry of the iwork array, and no error message related to liwork is issued by xerbla. See Application Notes below for details.

Output Parameters

d, e     On exit, d and e are overwritten.

m        INTEGER. The total number of eigenvalues found, 0 ≤ m ≤ n.
         If range = 'A', m = n, and if range = 'I', m = iu-il+1.
w        REAL for single precision flavors
         DOUBLE PRECISION for double precision flavors.
         Array, DIMENSION at least max(1, n).
         The selected eigenvalues in ascending order, stored in w(1) to w(m).

z        REAL for sstegr
         DOUBLE PRECISION for dstegr
         COMPLEX for cstegr
         DOUBLE COMPLEX for zstegr.
         Array z(ldz, *), the second dimension of z must be at least max(1, m).
         If jobz = 'V', and if info = 0, the first m columns of z contain the orthonormal eigenvectors of the matrix T corresponding to the selected eigenvalues, with the i-th column of z holding the eigenvector associated with w(i).
         If jobz = 'N', then z is not referenced.
         Note: you must ensure that at least max(1,m) columns are supplied in the array z; if range = 'V', the exact value of m is not known in advance and an upper bound must be used. Supplying n columns is always safe.

isuppz   INTEGER.
         Array, DIMENSION at least (2*max(1, m)).
         The support of the eigenvectors in z, that is, the indices indicating the nonzero elements in z. The i-th computed eigenvector is nonzero only in elements isuppz(2*i-1) through isuppz(2*i). This is relevant in the case when the matrix is split. isuppz is only accessed when jobz = 'V', and n > 0.

work(1)  On exit, if info = 0, then work(1) returns the required minimal size of lwork.

iwork(1) On exit, if info = 0, then iwork(1) returns the required minimal size of liwork.

info     INTEGER.
         If info = 0, the execution is successful.
         If info = -i, the i-th parameter had an illegal value.
         If info = 1x, internal error in ?larre occurred,
         If info = 2x, internal error in ?larrv occurred.
         Here the digit x = abs(iinfo) < 10, where iinfo is the non-zero error code returned by ?larre or ?larrv, respectively.

Fortran 95 Interface Notes

Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine stegr interface are the following:

d        Holds the vector of length n.

e        Holds the vector of length n.

w        Holds the vector of length n.

z        Holds the matrix Z of size (n,m).

isuppz   Holds the vector of length (2*m).

vl       Default value for this argument is vl = -HUGE(vl), where HUGE(a) means the largest machine number of the same precision as argument a.

vu       Default value for this argument is vu = HUGE(vl).

il       Default value for this argument is il = 1.

iu       Default value for this argument is iu = n.

abstol   Default value for this argument is abstol = 0.0_WP.

jobz     Restored based on the presence of the argument z as follows:
         jobz = 'V', if z is present,
         jobz = 'N', if z is omitted.

range    Restored based on the presence of arguments vl, vu, il, iu as follows:
         range = 'V', if one of or both vl and vu are present,
         range = 'I', if one of or both il and iu are present,
         range = 'A', if none of vl, vu, il, iu is present.
         Note that there will be an error condition if one of or both vl and vu are present and at the same time one of or both il and iu are present.

Note that two variants of the Fortran 95 interface for the stegr routine are needed because an ambiguous choice between the real and complex cases appears when z is omitted. Thus, the name rstegr is used in real cases (single or double precision), and the name stegr is used in complex cases (single or double precision).

Application Notes

Currently ?stegr is only set up to find all the n eigenvalues and eigenvectors of T in O(n^2) time, that is, only range = 'A' is supported.

?stegr works only on machines which follow the IEEE-754 floating-point standard in their handling of infinities and NaNs. Normal execution of ?stegr may create NaNs and infinities and hence may abort due to a floating point exception in environments which do not conform to the IEEE-754 standard.
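The rule stated in the Fortran 95 Interface Notes for restoring range from the optional arguments vl, vu, il, and iu can be written as a small decision function. The C sketch below only illustrates that rule; restore_range is a hypothetical name, not an Intel MKL function, and the 0 return value used to flag the error condition is an assumption made for this example.

```c
#include <stdbool.h>

/* Hypothetical helper for illustration; it is not an Intel MKL function.
 * Each flag says whether the corresponding optional argument is present.
 * Returns 'V', 'I', or 'A', or 0 when value bounds and index bounds are
 * both supplied, which the interface treats as an error condition. */
char restore_range(bool has_vl, bool has_vu, bool has_il, bool has_iu)
{
    bool by_value = has_vl || has_vu;   /* one of or both vl, vu present */
    bool by_index = has_il || has_iu;   /* one of or both il, iu present */

    if (by_value && by_index)
        return 0;                       /* conflicting selection: error */
    if (by_value)
        return 'V';
    if (by_index)
        return 'I';
    return 'A';                         /* none of vl, vu, il, iu present */
}
```

Passing only vu, for instance, selects range = 'V', with vl taking its default of -HUGE(vl).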
If it is not clear how much workspace to supply, use a generous value of lwork (or liwork) for the first run, or set lwork = -1 (liwork = -1).

If lwork (or liwork) has any admissible size, that is, a size no less than the minimal value described, then the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the corresponding array (work, iwork) on exit. Use this value (work(1), iwork(1)) for subsequent runs.

If lwork = -1 (liwork = -1), then the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work, iwork). This operation is called a workspace query.

Note that if lwork (liwork) is less than the minimal required value and is not equal to -1, then the routine returns immediately with an error exit and does not provide any information on the recommended workspace.

?pteqr
Computes all eigenvalues and (optionally) all eigenvectors of a real symmetric positive-definite tridiagonal matrix.
Syntax

Fortran 77:
call spteqr(compz, n, d, e, z, ldz, work, info)
call dpteqr(compz, n, d, e, z, ldz, work, info)
call cpteqr(compz, n, d, e, z, ldz, work, info)
call zpteqr(compz, n, d, e, z, ldz, work, info)

Fortran 95:
call rpteqr(d, e [,z] [,compz] [,info])
call pteqr(d, e [,z] [,compz] [,info])

C:
lapack_int LAPACKE_spteqr( int matrix_order, char compz, lapack_int n, float* d, float* e, float* z, lapack_int ldz );
lapack_int LAPACKE_dpteqr( int matrix_order, char compz, lapack_int n, double* d, double* e, double* z, lapack_int ldz );
lapack_int LAPACKE_cpteqr( int matrix_order, char compz, lapack_int n, float* d, float* e, lapack_complex_float* z, lapack_int ldz );
lapack_int LAPACKE_zpteqr( int matrix_order, char compz, lapack_int n, double* d, double* e, lapack_complex_double* z, lapack_int ldz );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description

The routine computes all the eigenvalues and (optionally) all the eigenvectors of a real symmetric positive-definite tridiagonal matrix T. In other words, the routine can compute the spectral factorization: T = Z*Λ*Z^T.

Here Λ is a diagonal matrix whose diagonal elements are the eigenvalues λ_i; Z is an orthogonal matrix whose columns are eigenvectors. Thus,

T*z_i = λ_i*z_i for i = 1, 2, ..., n.

(The routine normalizes the eigenvectors so that ||z_i||_2 = 1.)

You can also use the routine for computing the eigenvalues and eigenvectors of real symmetric (or complex Hermitian) positive-definite matrices A reduced to tridiagonal form T: A = Q*T*Q^H. In this case, the spectral factorization is as follows: A = Q*T*Q^H = (Q*Z)*Λ*(Q*Z)^H.
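The step from the factorization of T to the factorization of A can be spelled out as follows. This is a short derivation sketch using only the symbols defined above; it relies on Z being real orthogonal and Q having orthonormal columns (Q^H*Q = I).

```latex
\begin{aligned}
A &= Q\,T\,Q^{H}, \qquad T = Z\,\Lambda\,Z^{T} \\
  &= Q\,\bigl(Z\,\Lambda\,Z^{T}\bigr)\,Q^{H} \\
  &= (QZ)\,\Lambda\,(QZ)^{H}
     \quad\text{since } Z \text{ is real, so } Z^{T} = Z^{H}, \\
(QZ)^{H}(QZ) &= Z^{T}\,Q^{H}Q\,Z = Z^{T}Z = I .
\end{aligned}
```

The last line shows that the columns of Q*Z are orthonormal, so they are eigenvectors of A for the same eigenvalues λ_i.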
Before calling ?pteqr, you must reduce A to tridiagonal form and generate the explicit matrix Q by calling the following routines:

                  for real matrices:     for complex matrices:
full storage      ?sytrd, ?orgtr         ?hetrd, ?ungtr
packed storage    ?sptrd, ?opgtr         ?hptrd, ?upgtr
band storage      ?sbtrd (vect='V')      ?hbtrd (vect='V')

The routine first factorizes T as L*D*L^H where L is a unit lower bidiagonal matrix, and D is a diagonal matrix. Then it forms the bidiagonal matrix B = L*D^(1/2) and calls ?bdsqr to compute the singular values of B, which are the same as the eigenvalues of T.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

compz CHARACTER*1. Must be 'N' or 'I' or 'V'. If compz = 'N', the routine computes eigenvalues only. If compz = 'I', the routine computes the eigenvalues and eigenvectors of the tridiagonal matrix T. If compz = 'V', the routine computes the eigenvalues and eigenvectors of A (and the array z must contain the matrix Q on entry).

n INTEGER. The order of the matrix T (n ≥ 0).

d, e, work REAL for single-precision flavors; DOUBLE PRECISION for double-precision flavors. Arrays: d(*) contains the diagonal elements of T. The dimension of d must be at least max(1, n). e(*) contains the off-diagonal elements of T. The dimension of e must be at least max(1, n-1). work(*) is a workspace array. The dimension of work must be: at least 1 if compz = 'N'; at least max(1, 4*n-4) if compz = 'V' or 'I'.

z REAL for spteqr; DOUBLE PRECISION for dpteqr; COMPLEX for cpteqr; DOUBLE COMPLEX for zpteqr. Array, DIMENSION (ldz,*). If compz = 'N' or 'I', z need not be set. If compz = 'V', z must contain the n-by-n matrix Q. The second dimension of z must be: at least 1 if compz = 'N'; at least max(1, n) if compz = 'V' or 'I'.
ldz INTEGER. The leading dimension of z. Constraints: ldz ≥ 1 if compz = 'N'; ldz ≥ max(1, n) if compz = 'V' or 'I'.

Output Parameters

d The n eigenvalues in descending order, unless info > 0. See also info.

e On exit, the array is overwritten.

z If info = 0, contains the n orthonormal eigenvectors, stored by columns. (The i-th column corresponds to the i-th eigenvalue.)

info INTEGER. If info = 0, the execution is successful. If info = i, the leading minor of order i (and hence T itself) is not positive-definite. If info = n + i, the algorithm for computing singular values failed to converge; i off-diagonal elements have not converged to zero. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions. Specific details for the routine pteqr interface are the following:

d Holds the vector of length n.
e Holds the vector of length (n-1).
z Holds the matrix Z of size (n,n).
compz If omitted, this argument is restored based on the presence of argument z as follows: compz = 'I', if z is present; compz = 'N', if z is omitted. If present, compz must be equal to 'I' or 'V' and the argument z must also be present. Note that there will be an error condition if compz is present and z is omitted.

Note that two variants of the Fortran 95 interface for the pteqr routine are needed because the choice between real and complex cases becomes ambiguous when z is omitted. Thus, the name rpteqr is used in real cases (single or double precision), and the name pteqr is used in complex cases (single or double precision).
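The description says the routine first factorizes T as L*D*L^H, and the info parameter reports the order of the first leading minor that is not positive-definite. The following is a minimal C sketch of that factorization step for a real tridiagonal matrix, given for illustration only. It is not the MKL implementation; the function name ldlt_tridiag and its in-place calling convention are assumptions made for this example.

```c
/* Hypothetical helper, not part of MKL or LAPACKE.
 *
 * Factorizes a real symmetric tridiagonal matrix T = L*D*L^T in place.
 *   d[0..n-1]: diagonal of T on entry, diagonal of D on exit
 *   e[0..n-2]: off-diagonal of T on entry, subdiagonal of L on exit
 * Returns 0 on success, or i (1-based) if the leading minor of order i
 * is not positive-definite, mirroring the meaning of info = i above. */
int ldlt_tridiag(int n, double *d, double *e)
{
    for (int i = 0; i < n; ++i) {
        if (d[i] <= 0.0)
            return i + 1;             /* pivot not positive: T is not SPD */
        if (i < n - 1) {
            double t = e[i];          /* original entry T(i+1,i)          */
            e[i] = t / d[i];          /* multiplier L(i+1,i)              */
            d[i + 1] -= e[i] * t;     /* update the next pivot            */
        }
    }
    return 0;
}
```

For example, with diagonal (2, 2, 2) and off-diagonal (-1, -1) the sketch returns 0 and leaves D = (2, 1.5, 4/3); with diagonal (1, 1) and off-diagonal (2) it returns 2, because the second pivot becomes negative.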
Application Notes
If λ_i is an exact eigenvalue, and μ_i is the corresponding computed value, then

|μ_i - λ_i| ≤ c(n)*ε*K*λ_i

where c(n) is a modestly increasing function of n, ε is the machine precision, and K = ||D*T*D||_2 * ||(D*T*D)^(-1)||_2, where D is diagonal with d_ii = t_ii^(-1/2).

If z_i is the corresponding exact eigenvector, and w_i is the corresponding computed vector, then the angle θ(z_i, w_i) between them is bounded as follows:

θ(z_i, w_i) ≤ c(n)*ε*K / min_{i≠j}(|λ_i - λ_j|/|λ_i + λ_j|).

Here min_{i≠j}(|λ_i - λ_j|/|λ_i + λ_j|) is the relative gap between λ_i and the other eigenvalues.

The total number of floating-point operations depends on how rapidly the algorithm converges. Typically, it is about 30n^2 if compz = 'N'; 6n^3 (for complex flavors, 12n^3) if compz = 'V' or 'I'.

?stebz
Computes selected eigenvalues of a real symmetric tridiagonal matrix by bisection.

Syntax

Fortran 77:
call sstebz(range, order, n, vl, vu, il, iu, abstol, d, e, m, nsplit, w, iblock, isplit, work, iwork, info)
call dstebz(range, order, n, vl, vu, il, iu, abstol, d, e, m, nsplit, w, iblock, isplit, work, iwork, info)

Fortran 95:
call stebz(d, e, m, nsplit, w, iblock, isplit [, order] [,vl] [,vu] [,il] [,iu] [,abstol] [,info])

C:
lapack_int LAPACKE_<?>stebz( char range, char order, lapack_int n, <datatype> vl, <datatype> vu, lapack_int il, lapack_int iu, <datatype> abstol, const <datatype>* d, const <datatype>* e, lapack_int* m, lapack_int* nsplit, <datatype>* w, lapack_int* iblock, lapack_int* isplit );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine computes some (or all) of the eigenvalues of a real symmetric tridiagonal matrix T by bisection. The routine searches for zero or negligible off-diagonal elements to see if T splits into block-diagonal form T = diag(T1, T2, ...).
Then it performs bisection on each of the blocks Ti and returns the block index of each computed eigenvalue, so that a subsequent call to stein can also take advantage of the block structure. See also laebz.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

range CHARACTER*1. Must be 'A' or 'V' or 'I'. If range = 'A', the routine computes all eigenvalues. If range = 'V', the routine computes eigenvalues lambda(i) in the half-open interval: vl < lambda(i) ≤ vu. If range = 'I', the routine computes eigenvalues with indices il to iu.

order CHARACTER*1. Must be 'B' or 'E'. If order = 'B', the eigenvalues are to be ordered from smallest to largest within each split-off block. If order = 'E', the eigenvalues for the entire matrix are to be ordered from smallest to largest.

n INTEGER. The order of the matrix T (n ≥ 0).

vl, vu REAL for sstebz; DOUBLE PRECISION for dstebz. If range = 'V', the routine computes eigenvalues lambda(i) in the half-open interval: vl < lambda(i) ≤ vu. If range = 'A' or 'I', vl and vu are not referenced.

il, iu INTEGER. Constraint: 1 ≤ il ≤ iu ≤ n. If range = 'I', the routine computes eigenvalues lambda(i) such that il ≤ i ≤ iu (assuming that the eigenvalues lambda(i) are in ascending order). If range = 'A' or 'V', il and iu are not referenced.

abstol REAL for sstebz; DOUBLE PRECISION for dstebz. The absolute tolerance to which each eigenvalue is required. An eigenvalue (or cluster) is considered to have converged if it lies in an interval of width abstol. If abstol ≤ 0.0, then the tolerance is taken as eps*|T|, where eps is the machine precision, and |T| is the 1-norm of the matrix T.

d, e, work REAL for sstebz; DOUBLE PRECISION for dstebz.
Arrays: d(*) contains the diagonal elements of T. The dimension of d must be at least max(1, n). e(*) contains the off-diagonal elements of T. The dimension of e must be at least max(1, n-1). work(*) is a workspace array. The dimension of work must be at least max(1, 4*n).

iwork INTEGER. Workspace. Array, DIMENSION at least max(1, 3*n).

Output Parameters

m INTEGER. The actual number of eigenvalues found.

nsplit INTEGER. The number of diagonal blocks detected in T.

w REAL for sstebz; DOUBLE PRECISION for dstebz. Array, DIMENSION at least max(1, n). The computed eigenvalues, stored in w(1) to w(m).

iblock, isplit INTEGER. Arrays, DIMENSION at least max(1, n). A positive value iblock(i) is the block number of the eigenvalue stored in w(i) (see also info). The leading nsplit elements of isplit contain points at which T splits into blocks Ti as follows: the block T1 contains rows/columns 1 to isplit(1); the block T2 contains rows/columns isplit(1)+1 to isplit(2), and so on.

info INTEGER. If info = 0, the execution is successful. If info = 1, for range = 'A' or 'V', the algorithm failed to compute some of the required eigenvalues to the desired accuracy; iblock(i) < 0 indicates that the eigenvalue stored in w(i) failed to converge. If info = 2, for range = 'I', the algorithm failed to compute some of the required eigenvalues. Try calling the routine again with range = 'A'. If info = 3: for range = 'A' or 'V', same as info = 1; for range = 'I', same as info = 2. If info = 4, no eigenvalues have been computed. The floating-point arithmetic on the computer is not behaving as expected. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine stebz interface are the following:

d Holds the vector of length n.
e Holds the vector of length (n-1).
w Holds the vector of length n.
iblock Holds the vector of length n.
isplit Holds the vector of length n.
order Must be 'B' or 'E'. The default value is 'B'.
vl Default value for this argument is vl = -HUGE(vl), where HUGE(a) means the largest machine number of the same precision as argument a.
vu Default value for this argument is vu = HUGE(vl).
il Default value for this argument is il = 1.
iu Default value for this argument is iu = n.
abstol Default value for this argument is abstol = 0.0_WP.
range Restored based on the presence of arguments vl, vu, il, iu as follows: range = 'V', if one of or both vl and vu are present; range = 'I', if one of or both il and iu are present; range = 'A', if none of vl, vu, il, iu is present. Note that there will be an error condition if one of or both vl and vu are present and at the same time one of or both il and iu are present.

Application Notes
The eigenvalues of T are computed to high relative accuracy, which means that if they vary widely in magnitude, then any small eigenvalues will be computed more accurately than, for example, with the standard QR method. However, the reduction to tridiagonal form (prior to calling the routine) may exclude the possibility of obtaining high relative accuracy in the small eigenvalues of the original matrix if its eigenvalues vary widely in magnitude.

?stein
Computes the eigenvectors corresponding to specified eigenvalues of a real symmetric tridiagonal matrix.
Syntax

Fortran 77:
call sstein(n, d, e, m, w, iblock, isplit, z, ldz, work, iwork, ifailv, info)
call dstein(n, d, e, m, w, iblock, isplit, z, ldz, work, iwork, ifailv, info)
call cstein(n, d, e, m, w, iblock, isplit, z, ldz, work, iwork, ifailv, info)
call zstein(n, d, e, m, w, iblock, isplit, z, ldz, work, iwork, ifailv, info)

Fortran 95:
call stein(d, e, w, iblock, isplit, z [,ifailv] [,info])

C:
lapack_int LAPACKE_sstein( int matrix_order, lapack_int n, const float* d, const float* e, lapack_int m, const float* w, const lapack_int* iblock, const lapack_int* isplit, float* z, lapack_int ldz, lapack_int* ifailv );
lapack_int LAPACKE_dstein( int matrix_order, lapack_int n, const double* d, const double* e, lapack_int m, const double* w, const lapack_int* iblock, const lapack_int* isplit, double* z, lapack_int ldz, lapack_int* ifailv );
lapack_int LAPACKE_cstein( int matrix_order, lapack_int n, const float* d, const float* e, lapack_int m, const float* w, const lapack_int* iblock, const lapack_int* isplit, lapack_complex_float* z, lapack_int ldz, lapack_int* ifailv );
lapack_int LAPACKE_zstein( int matrix_order, lapack_int n, const double* d, const double* e, lapack_int m, const double* w, const lapack_int* iblock, const lapack_int* isplit, lapack_complex_double* z, lapack_int ldz, lapack_int* ifailv );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine computes the eigenvectors of a real symmetric tridiagonal matrix T corresponding to specified eigenvalues, by inverse iteration. It is designed to be used in particular after the specified eigenvalues have been computed by ?stebz with order = 'B', but may also be used when the eigenvalues have been computed by other routines.
If you use this routine after ?stebz, it can take advantage of the block structure by performing inverse iteration on each block Ti separately, which is more efficient than using the whole matrix T.

If T has been formed by reduction of a full symmetric or Hermitian matrix A to tridiagonal form, you can transform eigenvectors of T to eigenvectors of A by calling ?ormtr or ?opmtr (for real flavors) or by calling ?unmtr or ?upmtr (for complex flavors).

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

n INTEGER. The order of the matrix T (n ≥ 0).

m INTEGER. The number of eigenvectors to be returned.

d, e, w REAL for single-precision flavors; DOUBLE PRECISION for double-precision flavors. Arrays: d(*) contains the diagonal elements of T. The dimension of d must be at least max(1, n). e(*) contains the sub-diagonal elements of T stored in elements 1 to n-1. The dimension of e must be at least max(1, n-1). w(*) contains the eigenvalues of T, stored in w(1) to w(m) (as returned by ?stebz). Eigenvalues of T1 must be supplied first, in non-decreasing order; then those of T2, again in non-decreasing order, and so on. Constraint: if iblock(i) = iblock(i+1), w(i) ≤ w(i+1). The dimension of w must be at least max(1, n).

iblock, isplit INTEGER. Arrays, DIMENSION at least max(1, n). The arrays iblock and isplit, as returned by ?stebz with order = 'B'. If you did not call ?stebz with order = 'B', set all elements of iblock to 1, and isplit(1) to n.

ldz INTEGER. The leading dimension of the output array z; ldz ≥ max(1, n).

work REAL for single-precision flavors; DOUBLE PRECISION for double-precision flavors. Workspace array, DIMENSION at least max(1, 5*n).

iwork INTEGER. Workspace array, DIMENSION at least max(1, n).
Output Parameters

z REAL for sstein; DOUBLE PRECISION for dstein; COMPLEX for cstein; DOUBLE COMPLEX for zstein. Array, DIMENSION (ldz, *). If info = 0, z contains the m orthonormal eigenvectors, stored by columns. (The i-th column corresponds to the i-th specified eigenvalue.)

ifailv INTEGER. Array, DIMENSION at least max(1, m). If info = i > 0, the first i elements of ifailv contain the indices of any eigenvectors that failed to converge.

info INTEGER. If info = 0, the execution is successful. If info = i, then i eigenvectors (as indicated by the parameter ifailv) each failed to converge in 5 iterations. The current iterates are stored in the corresponding columns of the array z. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions. Specific details for the routine stein interface are the following:

d Holds the vector of length n.
e Holds the vector of length n.
w Holds the vector of length n.
iblock Holds the vector of length n.
isplit Holds the vector of length n.
z Holds the matrix Z of size (n,m).
ifailv Holds the vector of length (m).

Application Notes
Each computed eigenvector z_i is an exact eigenvector of a matrix T+E_i, where ||E_i||_2 = O(ε)*||T||_2. However, a set of eigenvectors computed by this routine may not be orthogonal to as high a degree of accuracy as those computed by ?steqr.

?disna
Computes the reciprocal condition numbers for the eigenvectors of a symmetric/Hermitian matrix or for the left or right singular vectors of a general matrix.
Syntax

Fortran 77:
call sdisna(job, m, n, d, sep, info)
call ddisna(job, m, n, d, sep, info)

Fortran 95:
call disna(d, sep [,job] [,minmn] [,info])

C:
lapack_int LAPACKE_<?>disna( char job, lapack_int m, lapack_int n, const <datatype>* d, <datatype>* sep );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine computes the reciprocal condition numbers for the eigenvectors of a real symmetric or complex Hermitian matrix or for the left or right singular vectors of a general m-by-n matrix. The reciprocal condition number is the 'gap' between the corresponding eigenvalue or singular value and the nearest other one.

The bound on the error, measured by angle in radians, in the i-th computed vector is given by

slamch('E')*(anorm/sep(i))

where anorm = ||A||_2 = max( |d(j)| ). sep(i) is not allowed to be smaller than slamch('E')*anorm in order to limit the size of the error bound.

?disna may also be used to compute error bounds for eigenvectors of the generalized symmetric definite eigenproblem.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

job CHARACTER*1. Must be 'E', 'L', or 'R'. Specifies for which problem the reciprocal condition numbers should be computed: job = 'E': for the eigenvectors of a symmetric/Hermitian matrix; job = 'L': for the left singular vectors of a general matrix; job = 'R': for the right singular vectors of a general matrix.

m INTEGER. The number of rows of the matrix (m ≥ 0).

n INTEGER. If job = 'L' or 'R', the number of columns of the matrix (n ≥ 0). Ignored if job = 'E'.

d REAL for sdisna; DOUBLE PRECISION for ddisna. Array, dimension at least max(1,m) if job = 'E', and at least max(1, min(m,n)) if job = 'L' or 'R'.
This array must contain the eigenvalues (if job = 'E') or singular values (if job = 'L' or 'R') of the matrix, in either increasing or decreasing order. If singular values, they must be non-negative.

Output Parameters

sep REAL for sdisna; DOUBLE PRECISION for ddisna. Array, dimension at least max(1,m) if job = 'E', and at least max(1, min(m,n)) if job = 'L' or 'R'. The reciprocal condition numbers of the vectors.

info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions. Specific details for the routine disna interface are the following:

d Holds the vector of length min(m,n).
sep Holds the vector of length min(m,n).
job Must be 'E', 'L', or 'R'. The default value is 'E'.
minmn Indicates which of the values m or n is smaller. Must be either 'M' or 'N'; the default is 'M'. If job = 'E', this argument is superfluous. If job = 'L' or 'R', this argument is used by the routine.

Generalized Symmetric-Definite Eigenvalue Problems
Generalized symmetric-definite eigenvalue problems are as follows: find the eigenvalues λ and the corresponding eigenvectors z that satisfy one of these equations:

Az = λBz, ABz = λz, or BAz = λz,

where A is an n-by-n symmetric or Hermitian matrix, and B is an n-by-n symmetric positive-definite or Hermitian positive-definite matrix. In these problems, there exist n real eigenvectors corresponding to real eigenvalues (even for complex Hermitian matrices A and B).
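To see how the first of these problems can take a standard symmetric form in the real case, suppose B has the Cholesky factorization B = U^T*U. Substituting y = U*z into A*z = λ*B*z and multiplying by inv(U^T) gives C*y = λ*y with C = inv(U^T)*A*inv(U). The C sketch below forms C with two triangular solves on small dense row-major arrays. It is an illustration under stated assumptions, not MKL code: the function name reduce_itype1_upper is hypothetical, and it works on the full matrix rather than on one triangle.

```c
/* Hypothetical helper, not part of MKL or LAPACKE.
 *
 * Forms C = inv(U^T) * A * inv(U) for row-major n-by-n arrays.
 *   a: full symmetric matrix A on entry, overwritten by C on exit
 *   u: upper triangular Cholesky factor of B (B = U^T * U) */
void reduce_itype1_upper(int n, double *a, const double *u)
{
    /* Step 1: solve U^T * X = A by forward substitution, one column of A
     * at a time, overwriting A with X. Entry (i,k) of U^T is u[k*n + i]. */
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < n; ++i) {
            double s = a[i * n + j];
            for (int k = 0; k < i; ++k)
                s -= u[k * n + i] * a[k * n + j];
            a[i * n + j] = s / u[i * n + i];
        }
    }
    /* Step 2: solve C * U = X, one row at a time, overwriting X with C. */
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            double s = a[i * n + j];
            for (int k = 0; k < j; ++k)
                s -= a[i * n + k] * u[k * n + j];
            a[i * n + j] = s / u[j * n + j];
        }
    }
}
```

For instance, with A = [1 2; 2 3] and U = [2 1; 0 2] (so B = [4 2; 2 5]), the sketch yields the symmetric matrix C = [0.25 0.375; 0.375 0.3125]. An eigenvector y of C maps back to an eigenvector of the original problem through z = inv(U)*y.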
Routines described in this section allow you to reduce the above generalized problems to the standard symmetric eigenvalue problem Cy = λy, which you can solve by calling LAPACK routines described earlier in this chapter (see Symmetric Eigenvalue Problems).

Different routines allow the matrices to be stored either conventionally or in packed storage. Prior to reduction, the positive-definite matrix B must first be factorized using either ?potrf or ?pptrf.

The reduction routine for the banded matrices A and B uses a split Cholesky factorization for which a specific routine ?pbstf is provided. This refinement halves the amount of work required to form matrix C.

Table "Computational Routines for Reducing Generalized Eigenproblems to Standard Problems" lists LAPACK routines (FORTRAN 77 interface) that can be used to solve generalized symmetric-definite eigenvalue problems. Respective routine names in Fortran 95 interface are without the first symbol (see Routine Naming Conventions).

Computational Routines for Reducing Generalized Eigenproblems to Standard Problems

Matrix type                   Reduce to standard   Reduce to standard   Reduce to standard   Factorize
                              problems             problems             problems             band matrix
                              (full storage)       (packed storage)     (band matrices)
real symmetric matrices       ?sygst               ?spgst               ?sbgst               ?pbstf
complex Hermitian matrices    ?hegst               ?hpgst               ?hbgst               ?pbstf

?sygst
Reduces a real symmetric-definite generalized eigenvalue problem to the standard form.
Syntax

Fortran 77:
call ssygst(itype, uplo, n, a, lda, b, ldb, info)
call dsygst(itype, uplo, n, a, lda, b, ldb, info)

Fortran 95:
call sygst(a, b [,itype] [,uplo] [,info])

C:
lapack_int LAPACKE_<?>sygst( int matrix_order, lapack_int itype, char uplo, lapack_int n, <datatype>* a, lapack_int lda, const <datatype>* b, lapack_int ldb );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine reduces real symmetric-definite generalized eigenproblems A*z = λ*B*z, A*B*z = λ*z, or B*A*z = λ*z to the standard form C*y = λ*y. Here A is a real symmetric matrix, and B is a real symmetric positive-definite matrix. Before calling this routine, call ?potrf to compute the Cholesky factorization: B = U^T*U or B = L*L^T.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

itype INTEGER. Must be 1 or 2 or 3.
If itype = 1, the generalized eigenproblem is A*z = lambda*B*z;
for uplo = 'U': C = inv(U^T)*A*inv(U), z = inv(U)*y;
for uplo = 'L': C = inv(L)*A*inv(L^T), z = inv(L^T)*y.
If itype = 2, the generalized eigenproblem is A*B*z = lambda*z;
for uplo = 'U': C = U*A*U^T, z = inv(U)*y;
for uplo = 'L': C = L^T*A*L, z = inv(L^T)*y.
If itype = 3, the generalized eigenproblem is B*A*z = lambda*z;
for uplo = 'U': C = U*A*U^T, z = U^T*y;
for uplo = 'L': C = L^T*A*L, z = L*y.

uplo CHARACTER*1. Must be 'U' or 'L'. If uplo = 'U', the array a stores the upper triangle of A; you must supply B in the factored form B = U^T*U. If uplo = 'L', the array a stores the lower triangle of A; you must supply B in the factored form B = L*L^T.

n INTEGER. The order of the matrices A and B (n ≥ 0).

a, b REAL for ssygst; DOUBLE PRECISION for dsygst. Arrays: a(lda,*) contains the upper or lower triangle of A.
The second dimension of a must be at least max(1, n). b(ldb,*) contains the Cholesky-factored matrix B: B = U^T*U or B = L*L^T (as returned by ?potrf). The second dimension of b must be at least max(1, n).

lda INTEGER. The leading dimension of a; at least max(1, n).

ldb INTEGER. The leading dimension of b; at least max(1, n).

Output Parameters

a The upper or lower triangle of A is overwritten by the upper or lower triangle of C, as specified by the arguments itype and uplo.

info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions. Specific details for the routine sygst interface are the following:

a Holds the matrix A of size (n,n).
b Holds the matrix B of size (n,n).
itype Must be 1, 2, or 3. The default value is 1.
uplo Must be 'U' or 'L'. The default value is 'U'.

Application Notes
Forming the reduced matrix C is a stable procedure. However, it involves implicit multiplication by inv(B) (if itype = 1) or B (if itype = 2 or 3). When the routine is used as a step in the computation of eigenvalues and eigenvectors of the original problem, there may be a significant loss of accuracy if B is ill-conditioned with respect to inversion.

The approximate number of floating-point operations is n^3.

?hegst
Reduces a complex Hermitian-definite generalized eigenvalue problem to the standard form.
Syntax

Fortran 77:
call chegst(itype, uplo, n, a, lda, b, ldb, info)
call zhegst(itype, uplo, n, a, lda, b, ldb, info)

Fortran 95:
call hegst(a, b [,itype] [,uplo] [,info])

C:
lapack_int LAPACKE_<?>hegst( int matrix_order, lapack_int itype, char uplo, lapack_int n, <datatype>* a, lapack_int lda, const <datatype>* b, lapack_int ldb );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine reduces complex Hermitian-definite generalized eigenvalue problems A*x = λ*B*x, A*B*x = λ*x, or B*A*x = λ*x to the standard form C*y = λ*y. Here the matrix A is complex Hermitian, and B is complex Hermitian positive-definite. Before calling this routine, you must call ?potrf to compute the Cholesky factorization: B = U^H*U or B = L*L^H.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

itype INTEGER. Must be 1 or 2 or 3.
If itype = 1, the generalized eigenproblem is A*z = lambda*B*z;
for uplo = 'U': C = inv(U^H)*A*inv(U), z = inv(U)*y;
for uplo = 'L': C = inv(L)*A*inv(L^H), z = inv(L^H)*y.
If itype = 2, the generalized eigenproblem is A*B*z = lambda*z;
for uplo = 'U': C = U*A*U^H, z = inv(U)*y;
for uplo = 'L': C = L^H*A*L, z = inv(L^H)*y.
If itype = 3, the generalized eigenproblem is B*A*z = lambda*z;
for uplo = 'U': C = U*A*U^H, z = U^H*y;
for uplo = 'L': C = L^H*A*L, z = L*y.

uplo CHARACTER*1. Must be 'U' or 'L'. If uplo = 'U', the array a stores the upper triangle of A; you must supply B in the factored form B = U^H*U. If uplo = 'L', the array a stores the lower triangle of A; you must supply B in the factored form B = L*L^H.

n INTEGER. The order of the matrices A and B (n ≥ 0).

a, b COMPLEX for chegst; DOUBLE COMPLEX for zhegst. Arrays: a(lda,*) contains the upper or lower triangle of A.
The second dimension of a must be at least max(1, n). b(ldb,*) contains the Cholesky-factored matrix B: B = U^H*U or B = L*L^H (as returned by ?potrf). The second dimension of b must be at least max(1, n).

lda INTEGER. The leading dimension of a; at least max(1, n).

ldb INTEGER. The leading dimension of b; at least max(1, n).

Output Parameters

a The upper or lower triangle of A is overwritten by the upper or lower triangle of C, as specified by the arguments itype and uplo.

info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions. Specific details for the routine hegst interface are the following:

a Holds the matrix A of size (n,n).
b Holds the matrix B of size (n,n).
itype Must be 1, 2, or 3. The default value is 1.
uplo Must be 'U' or 'L'. The default value is 'U'.

Application Notes
Forming the reduced matrix C is a stable procedure. However, it involves implicit multiplication by inv(B) (if itype = 1) or B (if itype = 2 or 3). When the routine is used as a step in the computation of eigenvalues and eigenvectors of the original problem, there may be a significant loss of accuracy if B is ill-conditioned with respect to inversion.

The approximate number of floating-point operations is n^3.

?spgst
Reduces a real symmetric-definite generalized eigenvalue problem to the standard form using packed storage.
Syntax

Fortran 77:
call sspgst(itype, uplo, n, ap, bp, info)
call dspgst(itype, uplo, n, ap, bp, info)

Fortran 95:
call spgst(ap, bp [,itype] [,uplo] [,info])

C:
lapack_int LAPACKE_<?>spgst( int matrix_order, lapack_int itype, char uplo, lapack_int n, <datatype>* ap, const <datatype>* bp );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine reduces real symmetric-definite generalized eigenproblems A*x = λ*B*x, A*B*x = λ*x, or B*A*x = λ*x to the standard form C*y = λ*y, using packed matrix storage. Here A is a real symmetric matrix, and B is a real symmetric positive-definite matrix. Before calling this routine, call ?pptrf to compute the Cholesky factorization: B = U^T*U or B = L*L^T.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

itype INTEGER. Must be 1 or 2 or 3.
If itype = 1, the generalized eigenproblem is A*z = lambda*B*z;
for uplo = 'U': C = inv(U^T)*A*inv(U), z = inv(U)*y;
for uplo = 'L': C = inv(L)*A*inv(L^T), z = inv(L^T)*y.
If itype = 2, the generalized eigenproblem is A*B*z = lambda*z;
for uplo = 'U': C = U*A*U^T, z = inv(U)*y;
for uplo = 'L': C = L^T*A*L, z = inv(L^T)*y.
If itype = 3, the generalized eigenproblem is B*A*z = lambda*z;
for uplo = 'U': C = U*A*U^T, z = U^T*y;
for uplo = 'L': C = L^T*A*L, z = L*y.

uplo CHARACTER*1. Must be 'U' or 'L'. If uplo = 'U', ap stores the packed upper triangle of A; you must supply B in the factored form B = U^T*U. If uplo = 'L', ap stores the packed lower triangle of A; you must supply B in the factored form B = L*L^T.

n INTEGER. The order of the matrices A and B (n ≥ 0).

ap, bp REAL for sspgst; DOUBLE PRECISION for dspgst. Arrays: ap(*) contains the packed upper or lower triangle of A.
The dimension of ap must be at least max(1, n*(n+1)/2). bp(*) contains the packed Cholesky factor of B (as returned by ?pptrf with the same uplo value). The dimension of bp must be at least max(1, n*(n+1)/2).

Output Parameters

ap The upper or lower triangle of A is overwritten by the upper or lower triangle of C, as specified by the arguments itype and uplo.

info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions. Specific details for the routine spgst interface are the following:

ap Holds the array A of size (n*(n+1)/2).
bp Holds the array B of size (n*(n+1)/2).
itype Must be 1, 2, or 3. The default value is 1.
uplo Must be 'U' or 'L'. The default value is 'U'.

Application Notes
Forming the reduced matrix C is a stable procedure. However, it involves implicit multiplication by inv(B) (if itype = 1) or B (if itype = 2 or 3). When the routine is used as a step in the computation of eigenvalues and eigenvectors of the original problem, there may be a significant loss of accuracy if B is ill-conditioned with respect to inversion.

The approximate number of floating-point operations is n^3.

?hpgst
Reduces a complex Hermitian-definite generalized eigenvalue problem to the standard form using packed storage.
Syntax

Fortran 77:
call chpgst(itype, uplo, n, ap, bp, info)
call zhpgst(itype, uplo, n, ap, bp, info)

Fortran 95:
call hpgst(ap, bp [,itype] [,uplo] [,info])

C:
lapack_int LAPACKE_<?>hpgst( int matrix_order, lapack_int itype, char uplo, lapack_int n, <datatype>* ap, const <datatype>* bp );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description

The routine reduces complex Hermitian-definite generalized eigenproblems A*z = lambda*B*z, A*B*z = lambda*z, or B*A*z = lambda*z to the standard form C*y = lambda*y, using packed matrix storage. Here A is a complex Hermitian matrix, and B is a complex Hermitian positive-definite matrix. Before calling this routine, you must call ?pptrf to compute the Cholesky factorization: B = U^H*U or B = L*L^H.

Input Parameters

The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

itype  INTEGER. Must be 1 or 2 or 3.
    If itype = 1, the generalized eigenproblem is A*z = lambda*B*z
        for uplo = 'U': C = inv(U^H)*A*inv(U), z = inv(U)*y;
        for uplo = 'L': C = inv(L)*A*inv(L^H), z = inv(L^H)*y.
    If itype = 2, the generalized eigenproblem is A*B*z = lambda*z
        for uplo = 'U': C = U*A*U^H, z = inv(U)*y;
        for uplo = 'L': C = L^H*A*L, z = inv(L^H)*y.
    If itype = 3, the generalized eigenproblem is B*A*z = lambda*z
        for uplo = 'U': C = U*A*U^H, z = U^H*y;
        for uplo = 'L': C = L^H*A*L, z = L*y.
uplo  CHARACTER*1. Must be 'U' or 'L'.
    If uplo = 'U', ap stores the packed upper triangle of A; you must supply B in the factored form B = U^H*U.
    If uplo = 'L', ap stores the packed lower triangle of A; you must supply B in the factored form B = L*L^H.
n  INTEGER. The order of the matrices A and B (n ≥ 0).
ap, bp  COMPLEX for chpgst
    DOUBLE COMPLEX for zhpgst.
    Arrays:
    ap(*) contains the packed upper or lower triangle of A.
    The dimension of ap must be at least max(1, n*(n+1)/2).
    bp(*) contains the packed Cholesky factor of B (as returned by ?pptrf with the same uplo value).
    The dimension of bp must be at least max(1, n*(n+1)/2).

Output Parameters

ap  The upper or lower triangle of A is overwritten by the upper or lower triangle of C, as specified by the arguments itype and uplo.
info  INTEGER.
    If info = 0, the execution is successful.
    If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes

Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.

Specific details for the routine hpgst interface are the following:

ap  Holds the array A of size (n*(n+1)/2).
bp  Holds the array B of size (n*(n+1)/2).
itype  Must be 1, 2, or 3. The default value is 1.
uplo  Must be 'U' or 'L'. The default value is 'U'.

Application Notes

Forming the reduced matrix C is a stable procedure. However, it involves implicit multiplication by inv(B) (if itype = 1) or B (if itype = 2 or 3). When the routine is used as a step in the computation of eigenvalues and eigenvectors of the original problem, there may be a significant loss of accuracy if B is ill-conditioned with respect to inversion.

The approximate number of floating-point operations is n^3.

?sbgst
Reduces a real symmetric-definite generalized eigenproblem for banded matrices to the standard form using the factorization performed by ?pbstf.
Syntax

Fortran 77:
call ssbgst(vect, uplo, n, ka, kb, ab, ldab, bb, ldbb, x, ldx, work, info)
call dsbgst(vect, uplo, n, ka, kb, ab, ldab, bb, ldbb, x, ldx, work, info)

Fortran 95:
call sbgst(ab, bb [,x] [,uplo] [,info])

C:
lapack_int LAPACKE_<?>sbgst( int matrix_order, char vect, char uplo, lapack_int n, lapack_int ka, lapack_int kb, <datatype>* ab, lapack_int ldab, const <datatype>* bb, lapack_int ldbb, <datatype>* x, lapack_int ldx );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description

To reduce the real symmetric-definite generalized eigenproblem A*z = lambda*B*z to the standard form C*y = lambda*y, where A, B and C are banded, this routine must be preceded by a call to ?pbstf, which computes the split Cholesky factorization of the positive-definite matrix B: B = S^T*S. The split Cholesky factorization, compared with the ordinary Cholesky factorization, allows the work to be approximately halved.

This routine overwrites A with C = X^T*A*X, where X = inv(S)*Q and Q is an orthogonal matrix chosen (implicitly) to preserve the bandwidth of A. The routine also has an option to allow the accumulation of X, and then, if z is an eigenvector of C, X*z is an eigenvector of the original system.

Input Parameters

The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

vect  CHARACTER*1. Must be 'N' or 'V'.
    If vect = 'N', then matrix X is not returned;
    If vect = 'V', then matrix X is returned.
uplo  CHARACTER*1. Must be 'U' or 'L'.
    If uplo = 'U', ab stores the upper triangular part of A.
    If uplo = 'L', ab stores the lower triangular part of A.
n  INTEGER. The order of the matrices A and B (n ≥ 0).
ka  INTEGER. The number of super- or sub-diagonals in A (ka ≥ 0).
kb  INTEGER.
    The number of super- or sub-diagonals in B (ka ≥ kb ≥ 0).
ab, bb, work  REAL for ssbgst
    DOUBLE PRECISION for dsbgst.
    ab(ldab,*) is an array containing either upper or lower triangular part of the symmetric matrix A (as specified by uplo) in band storage format.
    The second dimension of the array ab must be at least max(1, n).
    bb(ldbb,*) is an array containing the banded split Cholesky factor of B as specified by uplo, n and kb and returned by ?pbstf.
    The second dimension of the array bb must be at least max(1, n).
    work(*) is a workspace array, dimension at least max(1, 2*n).
ldab  INTEGER. The leading dimension of the array ab; must be at least ka+1.
ldbb  INTEGER. The leading dimension of the array bb; must be at least kb+1.
ldx  The leading dimension of the output array x. Constraints:
    if vect = 'N', then ldx ≥ 1;
    if vect = 'V', then ldx ≥ max(1, n).

Output Parameters

ab  On exit, this array is overwritten by the upper or lower triangle of C as specified by uplo.
x  REAL for ssbgst
    DOUBLE PRECISION for dsbgst.
    Array.
    If vect = 'V', then x(ldx,*) contains the n-by-n matrix X = inv(S)*Q.
    If vect = 'N', then x is not referenced.
    The second dimension of x must be:
    at least max(1, n), if vect = 'V';
    at least 1, if vect = 'N'.
info  INTEGER.
    If info = 0, the execution is successful.
    If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes

Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.

Specific details for the routine sbgst interface are the following:

ab  Holds the array A of size (ka+1,n).
bb  Holds the array B of size (kb+1,n).
x  Holds the matrix X of size (n,n).
uplo  Must be 'U' or 'L'. The default value is 'U'.
vect  Restored based on the presence of the argument x as follows:
    vect = 'V', if x is present,
    vect = 'N', if x is omitted.

Application Notes

Forming the reduced matrix C involves implicit multiplication by inv(B). When the routine is used as a step in the computation of eigenvalues and eigenvectors of the original problem, there may be a significant loss of accuracy if B is ill-conditioned with respect to inversion.

If ka and kb are much less than n, then the total number of floating-point operations is approximately 6n^2*kb when vect = 'N'. An additional (3/2)n^3*(kb/ka) operations are required when vect = 'V'.

?hbgst
Reduces a complex Hermitian-definite generalized eigenproblem for banded matrices to the standard form using the factorization performed by ?pbstf.

Syntax

Fortran 77:
call chbgst(vect, uplo, n, ka, kb, ab, ldab, bb, ldbb, x, ldx, work, rwork, info)
call zhbgst(vect, uplo, n, ka, kb, ab, ldab, bb, ldbb, x, ldx, work, rwork, info)

Fortran 95:
call hbgst(ab, bb [,x] [,uplo] [,info])

C:
lapack_int LAPACKE_<?>hbgst( int matrix_order, char vect, char uplo, lapack_int n, lapack_int ka, lapack_int kb, <datatype>* ab, lapack_int ldab, const <datatype>* bb, lapack_int ldbb, <datatype>* x, lapack_int ldx );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description

To reduce the complex Hermitian-definite generalized eigenproblem A*z = lambda*B*z to the standard form C*y = lambda*y, where A, B and C are banded, this routine must be preceded by a call to ?pbstf, which computes the split Cholesky factorization of the positive-definite matrix B: B = S^H*S. The split Cholesky factorization, compared with the ordinary Cholesky factorization, allows the work to be approximately halved.

This routine overwrites A with C = X^H*A*X, where X = inv(S)*Q, and Q is a unitary matrix chosen (implicitly) to preserve the bandwidth of A.
The routine also has an option to allow the accumulation of X, and then, if z is an eigenvector of C, X*z is an eigenvector of the original system.

Input Parameters

The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

vect  CHARACTER*1. Must be 'N' or 'V'.
    If vect = 'N', then matrix X is not returned;
    If vect = 'V', then matrix X is returned.
uplo  CHARACTER*1. Must be 'U' or 'L'.
    If uplo = 'U', ab stores the upper triangular part of A.
    If uplo = 'L', ab stores the lower triangular part of A.
n  INTEGER. The order of the matrices A and B (n ≥ 0).
ka  INTEGER. The number of super- or sub-diagonals in A (ka ≥ 0).
kb  INTEGER. The number of super- or sub-diagonals in B (ka ≥ kb ≥ 0).
ab, bb, work  COMPLEX for chbgst
    DOUBLE COMPLEX for zhbgst.
    ab(ldab,*) is an array containing either upper or lower triangular part of the Hermitian matrix A (as specified by uplo) in band storage format.
    The second dimension of the array ab must be at least max(1, n).
    bb(ldbb,*) is an array containing the banded split Cholesky factor of B as specified by uplo, n and kb and returned by ?pbstf.
    The second dimension of the array bb must be at least max(1, n).
    work(*) is a workspace array, dimension at least max(1, n).
ldab  INTEGER. The leading dimension of the array ab; must be at least ka+1.
ldbb  INTEGER. The leading dimension of the array bb; must be at least kb+1.
ldx  The leading dimension of the output array x. Constraints:
    if vect = 'N', then ldx ≥ 1;
    if vect = 'V', then ldx ≥ max(1, n).
rwork  REAL for chbgst
    DOUBLE PRECISION for zhbgst.
    Workspace array, dimension at least max(1, n).

Output Parameters

ab  On exit, this array is overwritten by the upper or lower triangle of C as specified by uplo.
x  COMPLEX for chbgst
    DOUBLE COMPLEX for zhbgst.
    Array.
    If vect = 'V', then x(ldx,*) contains the n-by-n matrix X = inv(S)*Q.
    If vect = 'N', then x is not referenced.
    The second dimension of x must be:
    at least max(1, n), if vect = 'V';
    at least 1, if vect = 'N'.
info  INTEGER.
    If info = 0, the execution is successful.
    If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes

Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.

Specific details for the routine hbgst interface are the following:

ab  Holds the array A of size (ka+1,n).
bb  Holds the array B of size (kb+1,n).
x  Holds the matrix X of size (n,n).
uplo  Must be 'U' or 'L'. The default value is 'U'.
vect  Restored based on the presence of the argument x as follows:
    vect = 'V', if x is present,
    vect = 'N', if x is omitted.

Application Notes

Forming the reduced matrix C involves implicit multiplication by inv(B). When the routine is used as a step in the computation of eigenvalues and eigenvectors of the original problem, there may be a significant loss of accuracy if B is ill-conditioned with respect to inversion.

The total number of floating-point operations is approximately 20n^2*kb when vect = 'N'. An additional 5n^3*(kb/ka) operations are required when vect = 'V'. All these estimates assume that both ka and kb are much less than n.

?pbstf
Computes a split Cholesky factorization of a real symmetric or complex Hermitian positive-definite banded matrix used in ?sbgst/?hbgst.
Syntax

Fortran 77:
call spbstf(uplo, n, kb, bb, ldbb, info)
call dpbstf(uplo, n, kb, bb, ldbb, info)
call cpbstf(uplo, n, kb, bb, ldbb, info)
call zpbstf(uplo, n, kb, bb, ldbb, info)

Fortran 95:
call pbstf(bb [, uplo] [,info])

C:
lapack_int LAPACKE_<?>pbstf( int matrix_order, char uplo, lapack_int n, lapack_int kb, <datatype>* bb, lapack_int ldbb );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description

The routine computes a split Cholesky factorization of a real symmetric or complex Hermitian positive-definite band matrix B. It is to be used in conjunction with ?sbgst/?hbgst.

The factorization has the form B = S^T*S (or B = S^H*S for complex flavors), where S is a band matrix of the same bandwidth as B and the following structure: S is upper triangular in the first (n+kb)/2 rows and lower triangular in the remaining rows.

Input Parameters

The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

uplo  CHARACTER*1. Must be 'U' or 'L'.
    If uplo = 'U', bb stores the upper triangular part of B.
    If uplo = 'L', bb stores the lower triangular part of B.
n  INTEGER. The order of the matrix B (n ≥ 0).
kb  INTEGER. The number of super- or sub-diagonals in B (kb ≥ 0).
bb  REAL for spbstf
    DOUBLE PRECISION for dpbstf
    COMPLEX for cpbstf
    DOUBLE COMPLEX for zpbstf.
    bb(ldbb,*) is an array containing either upper or lower triangular part of the matrix B (as specified by uplo) in band storage format.
    The second dimension of the array bb must be at least max(1, n).
ldbb  INTEGER. The leading dimension of bb; must be at least kb+1.

Output Parameters

bb  On exit, this array is overwritten by the elements of the split Cholesky factor S.
info  INTEGER.
    If info = 0, the execution is successful.
    If info = i, then the factorization could not be completed, because the updated element b(i,i) would be the square root of a negative number; hence the matrix B is not positive-definite.
    If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes

Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.

Specific details for the routine pbstf interface are the following:

bb  Holds the array B of size (kb+1,n).
uplo  Must be 'U' or 'L'. The default value is 'U'.

Application Notes

The computed factor S is the exact factor of a perturbed matrix B + E; in the bound on E, c(n) is a modest linear function of n, and ε is the machine precision.

The total number of floating-point operations for real flavors is approximately n(kb+1)^2. The number of operations for complex flavors is 4 times greater. All these estimates assume that kb is much less than n.

After calling this routine, you can call ?sbgst/?hbgst to solve the generalized eigenproblem A*z = lambda*B*z, where A and B are banded and B is positive-definite.

Nonsymmetric Eigenvalue Problems

This section describes LAPACK routines for solving nonsymmetric eigenvalue problems, computing the Schur factorization of general matrices, as well as performing a number of related computational tasks.

A nonsymmetric eigenvalue problem is as follows: given a nonsymmetric (or non-Hermitian) matrix A, find the eigenvalues lambda and the corresponding eigenvectors z that satisfy the equation
A*z = lambda*z (right eigenvectors z)
or the equation
z^H*A = lambda*z^H (left eigenvectors z).

Nonsymmetric eigenvalue problems have the following properties:
• The number of eigenvectors may be less than the matrix order (but is not less than the number of distinct eigenvalues of A).
• Eigenvalues may be complex even for a real matrix A.
• If a real nonsymmetric matrix has a complex eigenvalue a+bi corresponding to an eigenvector z, then a-bi is also an eigenvalue. The eigenvalue a-bi corresponds to the eigenvector whose elements are complex conjugate to the elements of z.

To solve a nonsymmetric eigenvalue problem with LAPACK, you usually need to reduce the matrix to the upper Hessenberg form and then solve the eigenvalue problem with the Hessenberg matrix obtained. Table "Computational Routines for Solving Nonsymmetric Eigenvalue Problems" lists LAPACK routines (FORTRAN 77 interface) to reduce the matrix to the upper Hessenberg form by an orthogonal (or unitary) similarity transformation A = Q*H*Q^H as well as routines to solve eigenvalue problems with Hessenberg matrices, forming the Schur factorization of such matrices and computing the corresponding condition numbers. Respective routine names in the Fortran 95 interface are without the first symbol (see Routine Naming Conventions).

The decision tree in Figure "Decision Tree: Real Nonsymmetric Eigenvalue Problems" helps you choose the right routine or sequence of routines for an eigenvalue problem with a real nonsymmetric matrix. If you need to solve an eigenvalue problem with a complex non-Hermitian matrix, use the decision tree shown in Figure "Decision Tree: Complex Non-Hermitian Eigenvalue Problems".
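As a small worked illustration of the conjugate-pair property listed above, consider a 2-by-2 real matrix chosen here purely for illustration (it does not come from the manual), with real a and b ≠ 0:

```latex
A = \begin{pmatrix} a & -b \\ b & a \end{pmatrix}, \qquad
\det(A - \lambda I) = (a - \lambda)^2 + b^2 = 0
\;\Longrightarrow\; \lambda_{1,2} = a \pm bi,
\qquad
z_1 = \begin{pmatrix} 1 \\ -i \end{pmatrix}, \quad
z_2 = \overline{z_1} = \begin{pmatrix} 1 \\ i \end{pmatrix}.
```

Here z_1 belongs to a+bi and its elementwise complex conjugate z_2 belongs to a-bi, so both the eigenvalues and the eigenvectors occur in conjugate pairs even though A is real.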
Computational Routines for Solving Nonsymmetric Eigenvalue Problems

Operation performed                                         Routines for real matrices  Routines for complex matrices
Reduce to Hessenberg form A = Q*H*Q^H                       ?gehrd                      ?gehrd
Generate the matrix Q                                       ?orghr                      ?unghr
Apply the matrix Q                                          ?ormhr                      ?unmhr
Balance matrix                                              ?gebal                      ?gebal
Transform eigenvectors of balanced matrix to those of       ?gebak                      ?gebak
  the original matrix
Find eigenvalues and Schur factorization (QR algorithm)     ?hseqr                      ?hseqr
Find eigenvectors from Hessenberg form (inverse iteration)  ?hsein                      ?hsein
Find eigenvectors from Schur factorization                  ?trevc                      ?trevc
Estimate sensitivities of eigenvalues and eigenvectors      ?trsna                      ?trsna
Reorder Schur factorization                                 ?trexc                      ?trexc
Reorder Schur factorization, find the invariant subspace    ?trsen                      ?trsen
  and estimate sensitivities
Solve Sylvester's equation                                  ?trsyl                      ?trsyl

[Figure: Decision Tree: Real Nonsymmetric Eigenvalue Problems]

[Figure: Decision Tree: Complex Non-Hermitian Eigenvalue Problems]

?gehrd
Reduces a general matrix to upper Hessenberg form.

Syntax

Fortran 77:
call sgehrd(n, ilo, ihi, a, lda, tau, work, lwork, info)
call dgehrd(n, ilo, ihi, a, lda, tau, work, lwork, info)
call cgehrd(n, ilo, ihi, a, lda, tau, work, lwork, info)
call zgehrd(n, ilo, ihi, a, lda, tau, work, lwork, info)

Fortran 95:
call gehrd(a [, tau] [,ilo] [,ihi] [,info])

C:
lapack_int LAPACKE_<?>gehrd( int matrix_order, lapack_int n, lapack_int ilo, lapack_int ihi, <datatype>* a, lapack_int lda, <datatype>* tau );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description

The routine reduces a general matrix A to upper Hessenberg form H by an orthogonal or unitary similarity transformation A = Q*H*Q^H. Here H has real subdiagonal elements.
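For orientation, upper Hessenberg form means that every element below the first subdiagonal is zero, that is, h(i,j) = 0 whenever i > j+1. The C sketch below checks this property for a real matrix in column-major storage (the layout Fortran uses). The function name is a hypothetical helper, not part of Intel MKL. It applies to H itself: the array a returned by ?gehrd also holds details of Q in addition to H, so it would not pass this check as returned.

```c
#include <assert.h>
#include <stdbool.h>

/* Returns true if the n-by-n column-major matrix h (leading dimension ldh)
   has only zeros below its first subdiagonal, i.e. h(i,j) == 0 for i > j+1
   (0-based indices). */
bool is_upper_hessenberg(int n, const double *h, int ldh) {
    assert(n >= 0 && ldh >= (n > 1 ? n : 1));
    for (int j = 0; j < n; ++j) {
        for (int i = j + 2; i < n; ++i) {
            if (h[i + j * ldh] != 0.0) {
                return false;
            }
        }
    }
    return true;
}
```

For a 3-by-3 matrix the only entry that must vanish is the one in row 3, column 1; everything on or above the first subdiagonal is unconstrained.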
The routine does not form the matrix Q explicitly. Instead, Q is represented as a product of elementary reflectors. Routines are provided to work with Q in this representation.

Input Parameters

The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

n  INTEGER. The order of the matrix A (n ≥ 0).
ilo, ihi  INTEGER. If A is an output by ?gebal, then ilo and ihi must contain the values returned by that routine. Otherwise ilo = 1 and ihi = n. (If n > 0, then 1 ≤ ilo ≤ ihi ≤ n; if n = 0, ilo = 1 and ihi = 0.)
a, work  REAL for sgehrd
    DOUBLE PRECISION for dgehrd
    COMPLEX for cgehrd
    DOUBLE COMPLEX for zgehrd.
    Arrays:
    a(lda,*) contains the matrix A.
    The second dimension of a must be at least max(1, n).
    work(lwork) is a workspace array.
lda  INTEGER. The leading dimension of a; at least max(1, n).
lwork  INTEGER. The size of the work array; at least max(1, n).
    If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla.
    See Application Notes for the suggested value of lwork.

Output Parameters

a  Overwritten by the upper Hessenberg matrix H and details of the matrix Q. The subdiagonal elements of H are real.
tau  REAL for sgehrd
    DOUBLE PRECISION for dgehrd
    COMPLEX for cgehrd
    DOUBLE COMPLEX for zgehrd.
    Array, DIMENSION at least max(1, n-1).
    Contains additional information on the matrix Q.
work(1)  If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs.
info  INTEGER.
    If info = 0, the execution is successful.
    If info = -i, the i-th parameter had an illegal value.
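The range and size requirements stated above for ?gehrd can be collected into a few checks that a caller might run before invoking the routine. The following C sketch is only an illustration: all function names are hypothetical and none of them belongs to Intel MKL or LAPACKE.

```c
#include <assert.h>
#include <stdbool.h>

/* max(1, x), the pattern used in the size requirements. */
int at_least_one(int x) { return x > 1 ? x : 1; }

/* Range rule for ilo and ihi:
   1 <= ilo <= ihi <= n if n > 0; ilo = 1 and ihi = 0 if n = 0. */
bool gehrd_range_ok(int n, int ilo, int ihi) {
    if (n < 0) return false;
    if (n == 0) return ilo == 1 && ihi == 0;
    return 1 <= ilo && ilo <= ihi && ihi <= n;
}

/* Minimum sizes: lda >= max(1, n), lwork >= max(1, n),
   and tau needs at least max(1, n-1) elements. */
int gehrd_min_lda(int n)   { assert(n >= 0); return at_least_one(n); }
int gehrd_min_lwork(int n) { assert(n >= 0); return at_least_one(n); }
int gehrd_min_tau(int n)   { assert(n >= 0); return at_least_one(n - 1); }
```

For an unbalanced 5-by-5 matrix, the pair ilo = 1, ihi = 5 passes the range check, and tau needs at least 4 elements.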
Fortran 95 Interface Notes

Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.

Specific details for the routine gehrd interface are the following:

a  Holds the matrix A of size (n,n).
tau  Holds the vector of length (n-1).
ilo  Default value for this argument is ilo = 1.
ihi  Default value for this argument is ihi = n.

Application Notes

For better performance, try using lwork = n*blocksize, where blocksize is a machine-dependent value (typically, 16 to 64) required for optimum performance of the blocked algorithm.

If you are in doubt how much workspace to supply, use a generous value of lwork for the first run or set lwork = -1.

If you choose the first option and set any admissible lwork size, which is no less than the minimal value described, the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the corresponding array work on exit. Use this value (work(1)) for subsequent runs.

If you set lwork = -1, the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work). This operation is called a workspace query.

Note that if you set lwork to less than the minimal required value and not -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.

The computed Hessenberg matrix H is exactly similar to a nearby matrix A + E, where ||E||_2 < c(n)*ε*||A||_2, c(n) is a modestly increasing function of n, and ε is the machine precision.

The approximate number of floating-point operations for real flavors is (2/3)*(ihi - ilo)^2*(2ihi + 2ilo + 3n); for complex flavors it is 4 times greater.

?orghr
Generates the real orthogonal matrix Q determined by ?gehrd.
Syntax

Fortran 77:
call sorghr(n, ilo, ihi, a, lda, tau, work, lwork, info)
call dorghr(n, ilo, ihi, a, lda, tau, work, lwork, info)

Fortran 95:
call orghr(a, tau [,ilo] [,ihi] [,info])

C:
lapack_int LAPACKE_<?>orghr( int matrix_order, lapack_int n, lapack_int ilo, lapack_int ihi, <datatype>* a, lapack_int lda, const <datatype>* tau );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description

The routine explicitly generates the orthogonal matrix Q that has been determined by a preceding call to sgehrd/dgehrd. (The routine ?gehrd reduces a real general matrix A to upper Hessenberg form H by an orthogonal similarity transformation, A = Q*H*Q^T, and represents the matrix Q as a product of ihi-ilo elementary reflectors. Here ilo and ihi are values determined by sgebal/dgebal when balancing the matrix; if the matrix has not been balanced, ilo = 1 and ihi = n.)

The matrix Q generated by ?orghr has the structure:

        [ I   0    0 ]
    Q = [ 0   Q22  0 ]
        [ 0   0    I ]

where Q22 occupies rows and columns ilo to ihi.

Input Parameters

The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

n  INTEGER. The order of the matrix Q (n ≥ 0).
ilo, ihi  INTEGER. These must be the same parameters ilo and ihi, respectively, as supplied to ?gehrd. (If n > 0, then 1 ≤ ilo ≤ ihi ≤ n; if n = 0, ilo = 1 and ihi = 0.)
a, tau, work  REAL for sorghr
    DOUBLE PRECISION for dorghr.
    Arrays:
    a(lda,*) contains details of the vectors which define the elementary reflectors, as returned by ?gehrd.
    The second dimension of a must be at least max(1, n).
    tau(*) contains further details of the elementary reflectors, as returned by ?gehrd.
    The dimension of tau must be at least max(1, n-1).
    work is a workspace array, its dimension max(1, lwork).
lda  INTEGER.
    The leading dimension of a; at least max(1, n).
lwork  INTEGER. The size of the work array; lwork ≥ max(1, ihi-ilo).
    If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla.
    See Application Notes for the suggested value of lwork.

Output Parameters

a  Overwritten by the n-by-n orthogonal matrix Q.
work(1)  If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs.
info  INTEGER.
    If info = 0, the execution is successful.
    If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes

Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.

Specific details for the routine orghr interface are the following:

a  Holds the matrix A of size (n,n).
tau  Holds the vector of length (n-1).
ilo  Default value for this argument is ilo = 1.
ihi  Default value for this argument is ihi = n.

Application Notes

For better performance, try using lwork = (ihi-ilo)*blocksize, where blocksize is a machine-dependent value (typically, 16 to 64) required for optimum performance of the blocked algorithm.

If you are in doubt how much workspace to supply, use a generous value of lwork for the first run or set lwork = -1.

If you choose the first option and set any admissible lwork size, which is no less than the minimal value described, the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the corresponding array work on exit. Use this value (work(1)) for subsequent runs.
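The two workspace sizes quoted above for ?orghr, the minimum max(1, ihi-ilo) and the suggested (ihi-ilo)*blocksize, can be computed as in the C sketch below. The function names are hypothetical, and blocksize is whatever value the caller assumes for the machine (the text above only says it is typically 16 to 64).

```c
#include <assert.h>

/* Minimum lwork for ?orghr: max(1, ihi - ilo). */
int orghr_min_lwork(int ilo, int ihi) {
    int q = ihi - ilo;
    return q > 1 ? q : 1;
}

/* Suggested lwork: (ihi - ilo) * blocksize, never below the minimum. */
int orghr_suggested_lwork(int ilo, int ihi, int blocksize) {
    assert(blocksize >= 1);
    int suggested = (ihi - ilo) * blocksize;
    int minimum = orghr_min_lwork(ilo, ihi);
    return suggested > minimum ? suggested : minimum;
}
```

With ilo = 1, ihi = 10 and an assumed blocksize of 32, the minimum is 9 and the suggested size is 288.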
If you set lwork = -1, the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work). This operation is called a workspace query.

Note that if you set lwork to less than the minimal required value and not -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.

The computed matrix Q differs from the exact result by a matrix E such that ||E||_2 = O(ε), where ε is the machine precision.

The approximate number of floating-point operations is (4/3)(ihi-ilo)^3.

The complex counterpart of this routine is unghr.

?ormhr
Multiplies an arbitrary real matrix C by the real orthogonal matrix Q determined by ?gehrd.

Syntax

Fortran 77:
call sormhr(side, trans, m, n, ilo, ihi, a, lda, tau, c, ldc, work, lwork, info)
call dormhr(side, trans, m, n, ilo, ihi, a, lda, tau, c, ldc, work, lwork, info)

Fortran 95:
call ormhr(a, tau, c [,ilo] [,ihi] [,side] [,trans] [,info])

C:
lapack_int LAPACKE_<?>ormhr( int matrix_order, char side, char trans, lapack_int m, lapack_int n, lapack_int ilo, lapack_int ihi, const <datatype>* a, lapack_int lda, const <datatype>* tau, <datatype>* c, lapack_int ldc );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description

The routine multiplies a matrix C by the orthogonal matrix Q that has been determined by a preceding call to sgehrd/dgehrd. (The routine ?gehrd reduces a real general matrix A to upper Hessenberg form H by an orthogonal similarity transformation, A = Q*H*Q^T, and represents the matrix Q as a product of ihi-ilo elementary reflectors. Here ilo and ihi are values determined by sgebal/dgebal when balancing the matrix; if the matrix has not been balanced, ilo = 1 and ihi = n.)

With ?ormhr, you can form one of the matrix products Q*C, Q^T*C, C*Q, or C*Q^T, overwriting the result on C (which may be any real rectangular matrix).
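The similarity transformation A = Q*H*Q^T quoted above has a direct consequence for eigenvectors, spelled out here as a short derivation. The symbols v and λ are introduced only for this illustration, and the step uses the orthogonality of Q:

```latex
H v = \lambda v
\;\Longrightarrow\;
A\,(Q v) = Q H Q^{T} Q v = Q H v = \lambda\,(Q v),
\qquad \text{since } Q^{T} Q = I .
```

So if v is an eigenvector of H for the eigenvalue λ, then Q*v is an eigenvector of A for the same eigenvalue, and applying Q to a whole matrix of eigenvectors of H is a product of the form Q*C.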
A common application of ?ormhr is to transform a matrix V of eigenvectors of H to the matrix Q*V of eigenvectors of A.

Input Parameters

The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

side  CHARACTER*1. Must be 'L' or 'R'.
    If side = 'L', then the routine forms Q*C or Q^T*C.
    If side = 'R', then the routine forms C*Q or C*Q^T.
trans  CHARACTER*1. Must be 'N' or 'T'.
    If trans = 'N', then Q is applied to C.
    If trans = 'T', then Q^T is applied to C.
m  INTEGER. The number of rows in C (m ≥ 0).
n  INTEGER. The number of columns in C (n ≥ 0).
ilo, ihi  INTEGER. These must be the same parameters ilo and ihi, respectively, as supplied to ?gehrd.
    If m > 0 and side = 'L', then 1 ≤ ilo ≤ ihi ≤ m.
    If m = 0 and side = 'L', then ilo = 1 and ihi = 0.
    If n > 0 and side = 'R', then 1 ≤ ilo ≤ ihi ≤ n.
    If n = 0 and side = 'R', then ilo = 1 and ihi = 0.
a, tau, c, work  REAL for sormhr
    DOUBLE PRECISION for dormhr.
    Arrays:
    a(lda,*) contains details of the vectors which define the elementary reflectors, as returned by ?gehrd.
    The second dimension of a must be at least max(1, m) if side = 'L' and at least max(1, n) if side = 'R'.
    tau(*) contains further details of the elementary reflectors, as returned by ?gehrd.
    The dimension of tau must be at least max(1, m-1) if side = 'L' and at least max(1, n-1) if side = 'R'.
    c(ldc,*) contains the m-by-n matrix C.
    The second dimension of c must be at least max(1, n).
    work is a workspace array, its dimension max(1, lwork).
lda  INTEGER. The leading dimension of a; at least max(1, m) if side = 'L' and at least max(1, n) if side = 'R'.
ldc  INTEGER. The leading dimension of c; at least max(1, m).
lwork  INTEGER. The size of the work array.
    If side = 'L', lwork ≥ max(1, n).
    If side = 'R', lwork ≥ max(1, m).
  If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla.
  See Application Notes for the suggested value of lwork.

Output Parameters
c  C is overwritten by the product Q*C, Q^T*C, C*Q, or C*Q^T as specified by side and trans.
work(1)  If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs.
info  INTEGER.
  If info = 0, the execution is successful.
  If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine ormhr interface are the following:
a  Holds the matrix A of size (r,r). r = m if side = 'L'. r = n if side = 'R'.
tau  Holds the vector of length (r-1).
c  Holds the matrix C of size (m,n).
ilo  Default value for this argument is ilo = 1.
ihi  Default value for this argument is ihi = n.
side  Must be 'L' or 'R'. The default value is 'L'.
trans  Must be 'N' or 'T'. The default value is 'N'.

Application Notes
For better performance, lwork should be at least n*blocksize if side = 'L' and at least m*blocksize if side = 'R', where blocksize is a machine-dependent value (typically, 16 to 64) required for optimum performance of the blocked algorithm.
If you are in doubt how much workspace to supply, use a generous value of lwork for the first run or set lwork = -1.
If you choose the first option and set any admissible lwork size, that is, one no less than the minimal value described, the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the corresponding array work on exit. Use this value (work(1)) for subsequent runs.
If you set lwork = -1, the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work). This operation is called a workspace query.
Note that if you set lwork to less than the minimal required value and not -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.
The computed product differs from the exact result by a matrix E such that ||E||_2 = O(ε)*||C||_2, where ε is the machine precision.
The approximate number of floating-point operations is 2n(ihi-ilo)^2 if side = 'L'; 2m(ihi-ilo)^2 if side = 'R'.
The complex counterpart of this routine is unmhr.

?unghr
Generates the complex unitary matrix Q determined by ?gehrd.

Syntax
Fortran 77:
call cunghr(n, ilo, ihi, a, lda, tau, work, lwork, info)
call zunghr(n, ilo, ihi, a, lda, tau, work, lwork, info)
Fortran 95:
call unghr(a, tau [,ilo] [,ihi] [,info])
C:
lapack_int LAPACKE_<?>unghr( int matrix_order, lapack_int n, lapack_int ilo, lapack_int ihi, <datatype>* a, lapack_int lda, const <datatype>* tau );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine is intended to be used following a call to cgehrd/zgehrd, which reduces a complex matrix A to upper Hessenberg form H by a unitary similarity transformation: A = Q*H*Q^H. ?gehrd represents the matrix Q as a product of ihi-ilo elementary reflectors. Here ilo and ihi are values determined by cgebal/zgebal when balancing the matrix; if the matrix has not been balanced, ilo = 1 and ihi = n.
Use the routine unghr to generate Q explicitly as a square matrix. The matrix Q has the structure:

    Q = [ I   0    0 ]
        [ 0   Q22  0 ]
        [ 0   0    I ]

where Q22 occupies rows and columns ilo to ihi.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
n  INTEGER. The order of the matrix Q (n ≥ 0).
ilo, ihi  INTEGER. These must be the same parameters ilo and ihi, respectively, as supplied to ?gehrd. (If n > 0, then 1 ≤ ilo ≤ ihi ≤ n. If n = 0, then ilo = 1 and ihi = 0.)
a, tau, work  COMPLEX for cunghr
  DOUBLE COMPLEX for zunghr.
  Arrays:
  a(lda,*) contains details of the vectors which define the elementary reflectors, as returned by ?gehrd. The second dimension of a must be at least max(1, n).
  tau(*) contains further details of the elementary reflectors, as returned by ?gehrd. The dimension of tau must be at least max(1, n-1).
  work is a workspace array, its dimension max(1, lwork).
lda  INTEGER. The leading dimension of a; at least max(1, n).
lwork  INTEGER. The size of the work array; lwork ≥ max(1, ihi-ilo).
  If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla.
  See Application Notes for the suggested value of lwork.

Output Parameters
a  Overwritten by the n-by-n unitary matrix Q.
work(1)  If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs.
info  INTEGER.
  If info = 0, the execution is successful.
  If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts.
For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine unghr interface are the following:
a  Holds the matrix A of size (n,n).
tau  Holds the vector of length (n-1).
ilo  Default value for this argument is ilo = 1.
ihi  Default value for this argument is ihi = n.

Application Notes
For better performance, try using lwork = (ihi-ilo)*blocksize, where blocksize is a machine-dependent value (typically, 16 to 64) required for optimum performance of the blocked algorithm.
If it is not clear how much workspace to supply, use a generous value of lwork for the first run, or set lwork = -1.
In the first case the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the corresponding array work on exit. Use this value (work(1)) for subsequent runs.
If lwork = -1, then the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work). This operation is called a workspace query.
Note that if lwork is less than the minimal required value and is not equal to -1, then the routine returns immediately with an error exit and does not provide any information on the recommended workspace.
The computed matrix Q differs from the exact result by a matrix E such that ||E||_2 = O(ε), where ε is the machine precision.
The approximate number of real floating-point operations is (16/3)(ihi-ilo)^3.
The real counterpart of this routine is orghr.

?unmhr
Multiplies an arbitrary complex matrix C by the complex unitary matrix Q determined by ?gehrd.
Syntax
Fortran 77:
call cunmhr(side, trans, m, n, ilo, ihi, a, lda, tau, c, ldc, work, lwork, info)
call zunmhr(side, trans, m, n, ilo, ihi, a, lda, tau, c, ldc, work, lwork, info)
Fortran 95:
call unmhr(a, tau, c [,ilo] [,ihi] [,side] [,trans] [,info])
C:
lapack_int LAPACKE_<?>unmhr( int matrix_order, char side, char trans, lapack_int m, lapack_int n, lapack_int ilo, lapack_int ihi, const <datatype>* a, lapack_int lda, const <datatype>* tau, <datatype>* c, lapack_int ldc );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine multiplies a matrix C by the unitary matrix Q that has been determined by a preceding call to cgehrd/zgehrd. (The routine ?gehrd reduces a complex general matrix A to upper Hessenberg form H by a unitary similarity transformation, A = Q*H*Q^H, and represents the matrix Q as a product of ihi-ilo elementary reflectors. Here ilo and ihi are values determined by cgebal/zgebal when balancing the matrix; if the matrix has not been balanced, ilo = 1 and ihi = n.)
With ?unmhr, you can form one of the matrix products Q*C, Q^H*C, C*Q, or C*Q^H, overwriting the result on C (which may be any complex rectangular matrix).
A common application of this routine is to transform a matrix V of eigenvectors of H to the matrix QV of eigenvectors of A.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
side  CHARACTER*1. Must be 'L' or 'R'.
  If side = 'L', then the routine forms Q*C or Q^H*C.
  If side = 'R', then the routine forms C*Q or C*Q^H.
trans  CHARACTER*1. Must be 'N' or 'C'.
  If trans = 'N', then Q is applied to C.
  If trans = 'C', then Q^H is applied to C.
m  INTEGER. The number of rows in C (m ≥ 0).
n  INTEGER. The number of columns in C (n ≥ 0).
ilo, ihi  INTEGER.
  These must be the same parameters ilo and ihi, respectively, as supplied to ?gehrd.
  If m > 0 and side = 'L', then 1 ≤ ilo ≤ ihi ≤ m.
  If m = 0 and side = 'L', then ilo = 1 and ihi = 0.
  If n > 0 and side = 'R', then 1 ≤ ilo ≤ ihi ≤ n.
  If n = 0 and side = 'R', then ilo = 1 and ihi = 0.
a, tau, c, work  COMPLEX for cunmhr
  DOUBLE COMPLEX for zunmhr.
  Arrays:
  a(lda,*) contains details of the vectors which define the elementary reflectors, as returned by ?gehrd. The second dimension of a must be at least max(1, m) if side = 'L' and at least max(1, n) if side = 'R'.
  tau(*) contains further details of the elementary reflectors, as returned by ?gehrd. The dimension of tau must be at least max(1, m-1) if side = 'L' and at least max(1, n-1) if side = 'R'.
  c(ldc,*) contains the m-by-n matrix C. The second dimension of c must be at least max(1, n).
  work is a workspace array, its dimension max(1, lwork).
lda  INTEGER. The leading dimension of a; at least max(1, m) if side = 'L' and at least max(1, n) if side = 'R'.
ldc  INTEGER. The leading dimension of c; at least max(1, m).
lwork  INTEGER. The size of the work array.
  If side = 'L', lwork ≥ max(1, n).
  If side = 'R', lwork ≥ max(1, m).
  If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla.
  See Application Notes for the suggested value of lwork.

Output Parameters
c  C is overwritten by Q*C, Q^H*C, C*Q^H, or C*Q as specified by side and trans.
work(1)  If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs.
info  INTEGER.
  If info = 0, the execution is successful.
  If info = -i, the i-th parameter had an illegal value.
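The input constraints listed for ?unmhr can be checked before the call. The following C function is a minimal sketch of such a check, not part of Intel MKL: the name check_unmhr_args and its return convention are assumptions made for illustration. It returns 0 when the scalar arguments satisfy the documented constraints, or -i where i is the position of the first offending argument in the Fortran 77 calling sequence, mirroring the meaning of info = -i. The range checks assume the constraints read 1 ≤ ilo ≤ ihi ≤ r with r = m for side = 'L' and r = n for side = 'R'.

```c
#include <assert.h>

static int max_int(int a, int b) { return a > b ? a : b; }

/* Hypothetical pre-flight check for the scalar arguments of ?unmhr.
 * Argument positions follow the Fortran 77 calling sequence:
 * side(1) trans(2) m(3) n(4) ilo(5) ihi(6) a(7) lda(8) tau(9)
 * c(10) ldc(11) work(12) lwork(13) info(14). */
int check_unmhr_args(char side, char trans, int m, int n,
                     int ilo, int ihi, int lda, int ldc, int lwork)
{
    int left = (side == 'L');
    int r, min_lwork;

    if (!left && side != 'R') return -1;
    if (trans != 'N' && trans != 'C') return -2;
    if (m < 0) return -3;
    if (n < 0) return -4;

    r = left ? m : n;                 /* order of Q */
    if (r > 0) {
        if (ilo < 1 || ilo > r) return -5;
        if (ihi < ilo || ihi > r) return -6;
    } else {
        if (ilo != 1) return -5;
        if (ihi != 0) return -6;
    }
    if (lda < max_int(1, r)) return -8;
    if (ldc < max_int(1, m)) return -11;

    /* lwork = -1 requests a workspace query and is always accepted. */
    min_lwork = max_int(1, left ? n : m);
    assert(min_lwork >= 1);
    if (lwork != -1 && lwork < min_lwork) return -13;
    return 0;
}
```

For example, check_unmhr_args('L', 'N', 4, 3, 1, 4, 4, 4, 3) accepts the arguments, while passing trans = 'T' is rejected with -2 because the complex routine expects 'N' or 'C'.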

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine unmhr interface are the following:
a  Holds the matrix A of size (r,r). r = m if side = 'L'. r = n if side = 'R'.
tau  Holds the vector of length (r-1).
c  Holds the matrix C of size (m,n).
ilo  Default value for this argument is ilo = 1.
ihi  Default value for this argument is ihi = n.
side  Must be 'L' or 'R'. The default value is 'L'.
trans  Must be 'N' or 'C'. The default value is 'N'.

Application Notes
For better performance, lwork should be at least n*blocksize if side = 'L' and at least m*blocksize if side = 'R', where blocksize is a machine-dependent value (typically, 16 to 64) required for optimum performance of the blocked algorithm.
If it is not clear how much workspace to supply, use a generous value of lwork for the first run, or set lwork = -1.
In the first case the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the corresponding array work on exit. Use this value (work(1)) for subsequent runs.
If lwork = -1, then the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work). This operation is called a workspace query.
Note that if lwork is less than the minimal required value and is not equal to -1, then the routine returns immediately with an error exit and does not provide any information on the recommended workspace.
The computed product differs from the exact result by a matrix E such that ||E||_2 = O(ε)*||C||_2, where ε is the machine precision.
The approximate number of floating-point operations is 8n(ihi-ilo)^2 if side = 'L'; 8m(ihi-ilo)^2 if side = 'R'.
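The workspace recommendation and the operation count for ?unmhr are simple closed-form expressions. The C sketch below evaluates them; the function names unmhr_flops and unmhr_lwork_hint are hypothetical helpers, not MKL routines, and blocksize is whatever machine-dependent value you choose to assume (the text above suggests 16 to 64). The value returned by a workspace query remains the authoritative figure.

```c
#include <assert.h>

/* Approximate operation count for ?unmhr as quoted above:
 * 8*n*(ihi-ilo)^2 for side = 'L', 8*m*(ihi-ilo)^2 for side = 'R'. */
double unmhr_flops(char side, int m, int n, int ilo, int ihi)
{
    double span, other;
    assert(side == 'L' || side == 'R');
    span = (double)(ihi - ilo);
    other = (side == 'L') ? (double)n : (double)m;
    return 8.0 * other * span * span;
}

/* Suggested lwork: n*blocksize for side = 'L', m*blocksize for
 * side = 'R', never below 1. blocksize is an assumed tuning value. */
int unmhr_lwork_hint(char side, int m, int n, int blocksize)
{
    int dim, hint;
    assert(side == 'L' || side == 'R');
    assert(blocksize >= 1);
    dim = (side == 'L') ? n : m;
    hint = dim * blocksize;
    return hint > 1 ? hint : 1;
}
```

With m = 100, n = 50, ilo = 1, ihi = 100 and side = 'L', the estimate is 8*50*99^2 operations, and an assumed blocksize of 32 gives a suggested lwork of 1600.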
The real counterpart of this routine is ormhr.

?gebal
Balances a general matrix to improve the accuracy of computed eigenvalues and eigenvectors.

Syntax
Fortran 77:
call sgebal(job, n, a, lda, ilo, ihi, scale, info)
call dgebal(job, n, a, lda, ilo, ihi, scale, info)
call cgebal(job, n, a, lda, ilo, ihi, scale, info)
call zgebal(job, n, a, lda, ilo, ihi, scale, info)
Fortran 95:
call gebal(a [, scale] [,ilo] [,ihi] [,job] [,info])
C:
lapack_int LAPACKE_sgebal( int matrix_order, char job, lapack_int n, float* a, lapack_int lda, lapack_int* ilo, lapack_int* ihi, float* scale );
lapack_int LAPACKE_dgebal( int matrix_order, char job, lapack_int n, double* a, lapack_int lda, lapack_int* ilo, lapack_int* ihi, double* scale );
lapack_int LAPACKE_cgebal( int matrix_order, char job, lapack_int n, lapack_complex_float* a, lapack_int lda, lapack_int* ilo, lapack_int* ihi, float* scale );
lapack_int LAPACKE_zgebal( int matrix_order, char job, lapack_int n, lapack_complex_double* a, lapack_int lda, lapack_int* ilo, lapack_int* ihi, double* scale );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine balances a matrix A by performing either or both of the following two similarity transformations:
(1) The routine first attempts to permute A to block upper triangular form:

    A' = P*A*P^T = [ A'11  A'12  A'13 ]
                   [ 0     A'22  A'23 ]
                   [ 0     0     A'33 ]

where P is a permutation matrix, and A'11 and A'33 are upper triangular. The diagonal elements of A'11 and A'33 are eigenvalues of A. The rest of the eigenvalues of A are the eigenvalues of the central diagonal block A'22, in rows and columns ilo to ihi. Subsequent operations to compute the eigenvalues of A (or its Schur factorization) need only be applied to these rows and columns; this can save a significant amount of work if ilo > 1 and ihi < n. If no suitable permutation exists (as is often the case), the routine sets ilo = 1 and ihi = n, and A'22 is the whole of A.
(2) The routine applies a diagonal similarity transformation to A', to make the rows and columns of A'22 as close in norm as possible:

    A'' = D*A'*inv(D), where D is a diagonal scaling matrix.

This scaling can reduce the norm of the matrix (that is, ||A''22|| < ||A'22||), and hence reduce the effect of rounding errors on the accuracy of computed eigenvalues and eigenvectors.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
job  CHARACTER*1. Must be 'N' or 'P' or 'S' or 'B'.
  If job = 'N', then A is neither permuted nor scaled (but ilo, ihi, and scale get their values).
  If job = 'P', then A is permuted but not scaled.
  If job = 'S', then A is scaled but not permuted.
  If job = 'B', then A is both scaled and permuted.
n  INTEGER. The order of the matrix A (n ≥ 0).
a  REAL for sgebal
  DOUBLE PRECISION for dgebal
  COMPLEX for cgebal
  DOUBLE COMPLEX for zgebal.
  Arrays: a(lda,*) contains the matrix A. The second dimension of a must be at least max(1, n). a is not referenced if job = 'N'.
lda  INTEGER. The leading dimension of a; at least max(1, n).

Output Parameters
a  Overwritten by the balanced matrix (a is not referenced if job = 'N').
ilo, ihi  INTEGER. The values ilo and ihi such that on exit a(i,j) is zero if i > j and 1 ≤ j < ilo or ihi < i ≤ n.
  If job = 'N' or 'S', then ilo = 1 and ihi = n.
scale  REAL for single-precision flavors
  DOUBLE PRECISION for double-precision flavors
  Array, DIMENSION at least max(1, n).
  Contains details of the permutations and scaling factors. More precisely, if p_j is the index of the row and column interchanged with row and column j, and d_j is the scaling factor used to balance row and column j, then
  scale(j) = p_j for j = 1, 2,..., ilo-1, ihi+1,..., n;
  scale(j) = d_j for j = ilo, ilo + 1,..., ihi.
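The layout of scale can be made concrete with a small decoder. The C sketch below is a hypothetical helper (split_gebal_scale is not an MKL routine) that separates the two kinds of entries for the double-precision case. It takes a 0-based C array holding the Fortran elements scale(1), ..., scale(n) and 1-based ilo and ihi as returned by ?gebal. As an illustrative convention, it reports p_j = j and d_j = 1 wherever the array does not carry that kind of information.

```c
#include <assert.h>
#include <stddef.h>

/* Splits the scale array into permutation indices and scaling factors.
 * For j outside ilo..ihi, scale(j) holds the interchange index p_j.
 * For j inside ilo..ihi, scale(j) holds the scaling factor d_j.
 * perm and factor must each have room for n entries. */
void split_gebal_scale(int n, int ilo, int ihi, const double *scale,
                       int *perm, double *factor)
{
    int j;
    assert(n >= 0);
    assert(n == 0 || (scale != NULL && perm != NULL && factor != NULL));

    for (j = 1; j <= n; ++j) {            /* j is the 1-based index */
        if (j >= ilo && j <= ihi) {
            perm[j - 1] = j;              /* no interchange recorded */
            factor[j - 1] = scale[j - 1];
        } else {
            perm[j - 1] = (int)scale[j - 1];
            factor[j - 1] = 1.0;          /* no scaling recorded */
        }
    }
}
```

For n = 5, ilo = 2, ihi = 4 and scale = {3.0, 0.5, 2.0, 1.0, 5.0}, the decoder reports that row and column 1 were interchanged with row and column 3, that rows and columns 2 to 4 were scaled by 0.5, 2.0 and 1.0, and that row and column 5 stayed in place.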
  The order in which the interchanges are made is n to ihi+1, then 1 to ilo-1.
info  INTEGER.
  If info = 0, the execution is successful.
  If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine gebal interface are the following:
a  Holds the matrix A of size (n,n).
scale  Holds the vector of length n.
ilo  Default value for this argument is ilo = 1.
ihi  Default value for this argument is ihi = n.
job  Must be 'B', 'S', 'P', or 'N'. The default value is 'B'.

Application Notes
The errors are negligible, compared with those in subsequent computations.
If the matrix A is balanced by this routine, then any eigenvectors computed subsequently are eigenvectors of the matrix A'' and hence you must call gebak to transform them back to eigenvectors of A.
If the Schur vectors of A are required, do not call this routine with job = 'S' or 'B', because then the balancing transformation is not orthogonal (not unitary for complex flavors).
If you call this routine with job = 'P', then any Schur vectors computed subsequently are Schur vectors of the matrix A'', and you need to call gebak (with side = 'R') to transform them back to Schur vectors of A.
The total number of floating-point operations is proportional to n^2.

?gebak
Transforms eigenvectors of a balanced matrix to those of the original nonsymmetric matrix.
Syntax
Fortran 77:
call sgebak(job, side, n, ilo, ihi, scale, m, v, ldv, info)
call dgebak(job, side, n, ilo, ihi, scale, m, v, ldv, info)
call cgebak(job, side, n, ilo, ihi, scale, m, v, ldv, info)
call zgebak(job, side, n, ilo, ihi, scale, m, v, ldv, info)
Fortran 95:
call gebak(v, scale [,ilo] [,ihi] [,job] [,side] [,info])
C:
lapack_int LAPACKE_sgebak( int matrix_order, char job, char side, lapack_int n, lapack_int ilo, lapack_int ihi, const float* scale, lapack_int m, float* v, lapack_int ldv );
lapack_int LAPACKE_dgebak( int matrix_order, char job, char side, lapack_int n, lapack_int ilo, lapack_int ihi, const double* scale, lapack_int m, double* v, lapack_int ldv );
lapack_int LAPACKE_cgebak( int matrix_order, char job, char side, lapack_int n, lapack_int ilo, lapack_int ihi, const float* scale, lapack_int m, lapack_complex_float* v, lapack_int ldv );
lapack_int LAPACKE_zgebak( int matrix_order, char job, char side, lapack_int n, lapack_int ilo, lapack_int ihi, const double* scale, lapack_int m, lapack_complex_double* v, lapack_int ldv );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine is intended to be used after a matrix A has been balanced by a call to ?gebal, and eigenvectors of the balanced matrix A''22 have subsequently been computed. For a description of balancing, see gebal. The balanced matrix A'' is obtained as A'' = D*P*A*P^T*inv(D), where P is a permutation matrix and D is a diagonal scaling matrix. This routine transforms the eigenvectors as follows:
if x is a right eigenvector of A'', then P^T*inv(D)*x is a right eigenvector of A;
if y is a left eigenvector of A'', then P^T*D*y is a left eigenvector of A.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above.
See the C Interface Conventions section for the C interface principal conventions and type definitions.
job  CHARACTER*1. Must be 'N' or 'P' or 'S' or 'B'. The same parameter job as supplied to ?gebal.
side  CHARACTER*1. Must be 'L' or 'R'.
  If side = 'L', then left eigenvectors are transformed.
  If side = 'R', then right eigenvectors are transformed.
n  INTEGER. The number of rows of the matrix of eigenvectors (n ≥ 0).
ilo, ihi  INTEGER. The values ilo and ihi, as returned by ?gebal. (If n > 0, then 1 ≤ ilo ≤ ihi ≤ n; if n = 0, then ilo = 1 and ihi = 0.)
scale  REAL for single-precision flavors
  DOUBLE PRECISION for double-precision flavors
  Array, DIMENSION at least max(1, n).
  Contains details of the permutations and/or the scaling factors used to balance the original general matrix, as returned by ?gebal.
m  INTEGER. The number of columns of the matrix of eigenvectors (m ≥ 0).
v  REAL for sgebak
  DOUBLE PRECISION for dgebak
  COMPLEX for cgebak
  DOUBLE COMPLEX for zgebak.
  Arrays: v(ldv,*) contains the matrix of left or right eigenvectors to be transformed. The second dimension of v must be at least max(1, m).
ldv  INTEGER. The leading dimension of v; at least max(1, n).

Output Parameters
v  Overwritten by the transformed eigenvectors.
info  INTEGER.
  If info = 0, the execution is successful.
  If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine gebak interface are the following:
v  Holds the matrix V of size (n,m).
scale  Holds the vector of length n.
ilo  Default value for this argument is ilo = 1.
ihi  Default value for this argument is ihi = n.
job  Must be 'B', 'S', 'P', or 'N'. The default value is 'B'.
side  Must be 'L' or 'R'.
  The default value is 'L'.

Application Notes
The errors in this routine are negligible.
The number of floating-point operations is approximately proportional to m*n.

?hseqr
Computes all eigenvalues and (optionally) the Schur factorization of a matrix reduced to Hessenberg form.

Syntax
Fortran 77:
call shseqr(job, compz, n, ilo, ihi, h, ldh, wr, wi, z, ldz, work, lwork, info)
call dhseqr(job, compz, n, ilo, ihi, h, ldh, wr, wi, z, ldz, work, lwork, info)
call chseqr(job, compz, n, ilo, ihi, h, ldh, w, z, ldz, work, lwork, info)
call zhseqr(job, compz, n, ilo, ihi, h, ldh, w, z, ldz, work, lwork, info)
Fortran 95:
call hseqr(h, wr, wi [,ilo] [,ihi] [,z] [,job] [,compz] [,info])
call hseqr(h, w [,ilo] [,ihi] [,z] [,job] [,compz] [,info])
C:
lapack_int LAPACKE_shseqr( int matrix_order, char job, char compz, lapack_int n, lapack_int ilo, lapack_int ihi, float* h, lapack_int ldh, float* wr, float* wi, float* z, lapack_int ldz );
lapack_int LAPACKE_dhseqr( int matrix_order, char job, char compz, lapack_int n, lapack_int ilo, lapack_int ihi, double* h, lapack_int ldh, double* wr, double* wi, double* z, lapack_int ldz );
lapack_int LAPACKE_chseqr( int matrix_order, char job, char compz, lapack_int n, lapack_int ilo, lapack_int ihi, lapack_complex_float* h, lapack_int ldh, lapack_complex_float* w, lapack_complex_float* z, lapack_int ldz );
lapack_int LAPACKE_zhseqr( int matrix_order, char job, char compz, lapack_int n, lapack_int ilo, lapack_int ihi, lapack_complex_double* h, lapack_int ldh, lapack_complex_double* w, lapack_complex_double* z, lapack_int ldz );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine computes all the eigenvalues, and optionally the Schur factorization, of an upper Hessenberg matrix H: H = Z*T*Z^H, where T is an upper triangular (or, for real flavors, quasi-triangular) matrix (the Schur form of H), and Z is
the unitary or orthogonal matrix whose columns are the Schur vectors z_i.
You can also use this routine to compute the Schur factorization of a general matrix A which has been reduced to upper Hessenberg form H:
A = Q*H*Q^H, where Q is unitary (orthogonal for real flavors);
A = (QZ)*T*(QZ)^H.
In this case, after reducing A to Hessenberg form by gehrd, call orghr (unghr for complex flavors) to form Q explicitly and then pass Q to ?hseqr with compz = 'V'.
You can also call gebal to balance the original matrix before reducing it to Hessenberg form by ?gehrd, so that the Hessenberg matrix H will have the structure:

    H = [ H11  H12  H13 ]
        [ 0    H22  H23 ]
        [ 0    0    H33 ]

where H11 and H33 are upper triangular.
If so, only the central diagonal block H22 (in rows and columns ilo to ihi) needs to be further reduced to Schur form (the blocks H12 and H23 are also affected). Therefore the values of ilo and ihi can be supplied to ?hseqr directly. Also, after calling this routine you must call gebak to permute the Schur vectors of the balanced matrix to those of the original matrix.
If ?gebal has not been called, however, then ilo must be set to 1 and ihi to n. Note that if the Schur factorization of A is required, ?gebal must not be called with job = 'S' or 'B', because the balancing transformation is not unitary (for real flavors, it is not orthogonal).
?hseqr uses a multishift form of the upper Hessenberg QR algorithm. The Schur vectors are normalized so that ||z_i||_2 = 1, but are determined only to within a complex factor of absolute value 1 (for the real flavors, to within a factor ±1).

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
job  CHARACTER*1. Must be 'E' or 'S'.
  If job = 'E', then eigenvalues only are required.
  If job = 'S', then the Schur form T is required.
compz  CHARACTER*1.
  Must be 'N' or 'I' or 'V'.
  If compz = 'N', then no Schur vectors are computed (and the array z is not referenced).
  If compz = 'I', then the Schur vectors of H are computed (and the array z is initialized by the routine).
  If compz = 'V', then the Schur vectors of A are computed (and the array z must contain the matrix Q on entry).
n  INTEGER. The order of the matrix H (n ≥ 0).
ilo, ihi  INTEGER. If A has been balanced by ?gebal, then ilo and ihi must contain the values returned by ?gebal. Otherwise, ilo must be set to 1 and ihi to n.
h, z, work  REAL for shseqr
  DOUBLE PRECISION for dhseqr
  COMPLEX for chseqr
  DOUBLE COMPLEX for zhseqr.
  Arrays:
  h(ldh,*) The n-by-n upper Hessenberg matrix H. The second dimension of h must be at least max(1, n).
  z(ldz,*) If compz = 'V', then z must contain the matrix Q from the reduction to Hessenberg form. If compz = 'I', then z need not be set. If compz = 'N', then z is not referenced. The second dimension of z must be at least max(1, n) if compz = 'V' or 'I'; at least 1 if compz = 'N'.
  work(lwork) is a workspace array. The dimension of work must be at least max(1, n).
ldh  INTEGER. The leading dimension of h; at least max(1, n).
ldz  INTEGER. The leading dimension of z;
  If compz = 'N', then ldz ≥ 1.
  If compz = 'V' or 'I', then ldz ≥ max(1, n).
lwork  INTEGER. The dimension of the array work.
  lwork ≥ max(1, n) is sufficient and delivers very good and sometimes optimal performance. However, lwork as large as 11*n may be required for optimal performance. A workspace query is recommended to determine the optimal workspace size.
  If lwork = -1, then a workspace query is assumed; the routine only estimates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla. See Application Notes for details.

Output Parameters
w  COMPLEX for chseqr
  DOUBLE COMPLEX for zhseqr.
  Array, DIMENSION at least max(1, n). Contains the computed eigenvalues, unless info > 0. The eigenvalues are stored in the same order as on the diagonal of the Schur form T (if computed).
wr, wi  REAL for shseqr
  DOUBLE PRECISION for dhseqr
  Arrays, DIMENSION at least max(1, n) each.
  Contain the real and imaginary parts, respectively, of the computed eigenvalues, unless info > 0. Complex conjugate pairs of eigenvalues appear consecutively with the eigenvalue having positive imaginary part first. The eigenvalues are stored in the same order as on the diagonal of the Schur form T (if computed).
h  If info = 0 and job = 'S', h contains the upper triangular matrix T from the Schur decomposition (the Schur form).
  If info = 0 and job = 'E', the contents of h are unspecified on exit. (The output value of h when info > 0 is given under the description of info below.)
z  If compz = 'V' and info = 0, then z contains Q*Z.
  If compz = 'I' and info = 0, then z contains the unitary or orthogonal matrix Z of the Schur vectors of H.
  If compz = 'N', then z is not referenced.
work(1)  On exit, if info = 0, then work(1) returns the optimal lwork.
info  INTEGER.
  If info = 0, the execution is successful.
  If info = -i, the i-th parameter had an illegal value.
  If info = i, elements 1, 2, ..., ilo-1 and i+1, i+2, ..., n of wr and wi contain the real and imaginary parts of those eigenvalues that have been successfully found.
  If info > 0, and job = 'E', then on exit, the remaining unconverged eigenvalues are the eigenvalues of the upper Hessenberg matrix rows and columns ilo through info of the final output value of H.
  If info > 0, and job = 'S', then on exit (initial value of H)*U = U*(final value of H), where U is a unitary matrix. The final value of H is upper Hessenberg and triangular in rows and columns info+1 through ihi.
  If info > 0, and compz = 'V', then on exit (final value of Z) = (initial value of Z)*U, where U is the unitary matrix (regardless of the value of job).
  If info > 0, and compz = 'I', then on exit (final value of Z) = U, where U is the unitary matrix (regardless of the value of job).
  If info > 0, and compz = 'N', then Z is not accessed.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine hseqr interface are the following:
h  Holds the matrix H of size (n,n).
wr  Holds the vector of length n. Used in real flavors only.
wi  Holds the vector of length n. Used in real flavors only.
w  Holds the vector of length n. Used in complex flavors only.
z  Holds the matrix Z of size (n,n).
job  Must be 'E' or 'S'. The default value is 'E'.
compz  If omitted, this argument is restored based on the presence of argument z as follows:
  compz = 'I', if z is present,
  compz = 'N', if z is omitted.
  If present, compz must be equal to 'I' or 'V' and the argument z must also be present. Note that there will be an error condition if compz is present and z omitted.

Application Notes
The computed Schur factorization is the exact factorization of a nearby matrix H + E, where ||E||_2 < O(ε) ||H||_2/s_i, and ε is the machine precision.
If λ_i is an exact eigenvalue, and μ_i is the corresponding computed value, then |λ_i - μ_i| ≤ c(n)*ε*||H||_2/s_i, where c(n) is a modestly increasing function of n, and s_i is the reciprocal condition number of λ_i. The condition numbers s_i may be computed by calling trsna.
The total number of floating-point operations depends on how rapidly the algorithm converges; typical numbers are as follows.
If only eigenvalues are computed: 7n^3 for real flavors, 25n^3 for complex flavors.
If the Schur form is computed: 10n^3 for real flavors, 35n^3 for complex flavors.
If the full Schur factorization is computed: 20n^3 for real flavors, 70n^3 for complex flavors.
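The typical operation counts quoted above can be tabulated for quick estimates. The following C sketch does only that; the enum hseqr_task and the function hseqr_typical_flops are hypothetical names introduced here, and the returned figures are the typical values from the text, not guarantees, because the actual count depends on how fast the QR iteration converges.

```c
#include <assert.h>

/* What ?hseqr is asked to compute (illustrative names). */
enum hseqr_task {
    HSEQR_EIGENVALUES_ONLY = 0,
    HSEQR_SCHUR_FORM = 1,
    HSEQR_FULL_FACTORIZATION = 2
};

/* Typical floating-point operation count, coefficient * n^3, using
 * the coefficients quoted in the Application Notes above. */
double hseqr_typical_flops(enum hseqr_task task, int is_complex, int n)
{
    /* rows: task; columns: real flavors, complex flavors */
    static const double coeff[3][2] = {
        {  7.0, 25.0 },
        { 10.0, 35.0 },
        { 20.0, 70.0 }
    };
    double x;
    assert(n >= 0);
    assert(task >= HSEQR_EIGENVALUES_ONLY && task <= HSEQR_FULL_FACTORIZATION);
    x = (double)n;
    return coeff[task][is_complex ? 1 : 0] * x * x * x;
}
```

For instance, computing only the eigenvalues of a real Hessenberg matrix of order 10 is estimated at 7*10^3 operations, and the full Schur factorization of a complex matrix of order 100 at 70*100^3.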
If you are in doubt how much workspace to supply, use a generous value of lwork for the first run or set lwork = -1.
If you choose the first option and set lwork to any admissible size, that is, no less than the minimal value described, the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the corresponding array work on exit. Use this value (work(1)) for subsequent runs.
If you set lwork = -1, the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work). This operation is called a workspace query.
Note that if you set lwork to less than the minimal required value and not -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.

?hsein
Computes selected eigenvectors of an upper Hessenberg matrix that correspond to specified eigenvalues.

Syntax

Fortran 77:
call shsein(job, eigsrc, initv, select, n, h, ldh, wr, wi, vl, ldvl, vr, ldvr, mm, m, work, ifaill, ifailr, info)
call dhsein(job, eigsrc, initv, select, n, h, ldh, wr, wi, vl, ldvl, vr, ldvr, mm, m, work, ifaill, ifailr, info)
call chsein(job, eigsrc, initv, select, n, h, ldh, w, vl, ldvl, vr, ldvr, mm, m, work, rwork, ifaill, ifailr, info)
call zhsein(job, eigsrc, initv, select, n, h, ldh, w, vl, ldvl, vr, ldvr, mm, m, work, rwork, ifaill, ifailr, info)

Fortran 95:
call hsein(h, wr, wi, select [,vl] [,vr] [,ifaill] [,ifailr] [,initv] [,eigsrc] [,m] [,info])
call hsein(h, w, select [,vl] [,vr] [,ifaill] [,ifailr] [,initv] [,eigsrc] [,m] [,info])

C:
lapack_int LAPACKE_shsein( int matrix_order, char job, char eigsrc, char initv, lapack_logical* select, lapack_int n, const float* h, lapack_int ldh, float* wr, const float* wi, float* vl, lapack_int ldvl, float* vr, lapack_int ldvr, lapack_int mm, lapack_int* m, lapack_int* ifaill, lapack_int* ifailr );
lapack_int LAPACKE_dhsein( int matrix_order, char job, char eigsrc, char initv, lapack_logical* select, lapack_int n, const double* h, lapack_int ldh, double* wr, const double* wi, double* vl, lapack_int ldvl, double* vr, lapack_int ldvr, lapack_int mm, lapack_int* m, lapack_int* ifaill, lapack_int* ifailr );
lapack_int LAPACKE_chsein( int matrix_order, char job, char eigsrc, char initv, const lapack_logical* select, lapack_int n, const lapack_complex_float* h, lapack_int ldh, lapack_complex_float* w, lapack_complex_float* vl, lapack_int ldvl, lapack_complex_float* vr, lapack_int ldvr, lapack_int mm, lapack_int* m, lapack_int* ifaill, lapack_int* ifailr );
lapack_int LAPACKE_zhsein( int matrix_order, char job, char eigsrc, char initv, const lapack_logical* select, lapack_int n, const lapack_complex_double* h, lapack_int ldh, lapack_complex_double* w, lapack_complex_double* vl, lapack_int ldvl, lapack_complex_double* vr, lapack_int ldvr, lapack_int mm, lapack_int* m, lapack_int* ifaill, lapack_int* ifailr );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description

The routine computes left and/or right eigenvectors of an upper Hessenberg matrix H, corresponding to selected eigenvalues.
The right eigenvector x and the left eigenvector y, corresponding to an eigenvalue $\lambda$, are defined by: $H x = \lambda x$ and $y^H H = \lambda y^H$ (or $H^H y = \lambda^* y$). Here $\lambda^*$ denotes the conjugate of $\lambda$.
The eigenvectors are computed by inverse iteration. They are scaled so that, for a real eigenvector x, $\max_i |x_i| = 1$, and for a complex eigenvector, $\max_i (|\mathrm{Re}\,x_i| + |\mathrm{Im}\,x_i|) = 1$.
If H has been formed by reduction of a general matrix A to upper Hessenberg form, then eigenvectors of H may be transformed to eigenvectors of A by ormhr or unmhr.

Input Parameters

The data types are given for the Fortran interface.
A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

job
CHARACTER*1. Must be 'R' or 'L' or 'B'.
If job = 'R', then only right eigenvectors are computed.
If job = 'L', then only left eigenvectors are computed.
If job = 'B', then all eigenvectors are computed.

eigsrc
CHARACTER*1. Must be 'Q' or 'N'.
If eigsrc = 'Q', then the eigenvalues of H were found using hseqr; thus if H has any zero sub-diagonal elements (and so is block triangular), then the j-th eigenvalue can be assumed to be an eigenvalue of the block containing the j-th row/column. This property allows the routine to perform inverse iteration on just one diagonal block.
If eigsrc = 'N', then no such assumption is made and the routine performs inverse iteration using the whole matrix.

initv
CHARACTER*1. Must be 'N' or 'U'.
If initv = 'N', then no initial estimates for the selected eigenvectors are supplied.
If initv = 'U', then initial estimates for the selected eigenvectors are supplied in vl and/or vr.

select
LOGICAL.
Array, DIMENSION at least max(1, n). Specifies which eigenvectors are to be computed.
For real flavors:
To obtain the real eigenvector corresponding to the real eigenvalue wr(j), set select(j) to .TRUE.
To select the complex eigenvector corresponding to the complex eigenvalue (wr(j), wi(j)) with complex conjugate (wr(j+1), wi(j+1)), set select(j) and/or select(j+1) to .TRUE.; the eigenvector corresponding to the first eigenvalue in the pair is computed.
For complex flavors:
To select the eigenvector corresponding to the eigenvalue w(j), set select(j) to .TRUE.

n
INTEGER. The order of the matrix H (n ≥ 0).

h, vl, vr, work
REAL for shsein
DOUBLE PRECISION for dhsein
COMPLEX for chsein
DOUBLE COMPLEX for zhsein.
Arrays:
h(ldh,*) The n-by-n upper Hessenberg matrix H.
The second dimension of h must be at least max(1, n).
vl(ldvl,*)
If initv = 'U' and job = 'L' or 'B', then vl must contain starting vectors for inverse iteration for the left eigenvectors. Each starting vector must be stored in the same column or columns as will be used to store the corresponding eigenvector.
If initv = 'N', then vl need not be set.
The second dimension of vl must be at least max(1, mm) if job = 'L' or 'B' and at least 1 if job = 'R'.
The array vl is not referenced if job = 'R'.
vr(ldvr,*)
If initv = 'U' and job = 'R' or 'B', then vr must contain starting vectors for inverse iteration for the right eigenvectors. Each starting vector must be stored in the same column or columns as will be used to store the corresponding eigenvector.
If initv = 'N', then vr need not be set.
The second dimension of vr must be at least max(1, mm) if job = 'R' or 'B' and at least 1 if job = 'L'.
The array vr is not referenced if job = 'L'.
work(*) is a workspace array.
DIMENSION at least max(1, n*(n+2)) for real flavors and at least max(1, n*n) for complex flavors.

ldh
INTEGER. The leading dimension of h; at least max(1, n).

w
COMPLEX for chsein
DOUBLE COMPLEX for zhsein.
Array, DIMENSION at least max(1, n). Contains the eigenvalues of the matrix H.
If eigsrc = 'Q', the array must be exactly as returned by ?hseqr.

wr, wi
REAL for shsein
DOUBLE PRECISION for dhsein
Arrays, DIMENSION at least max(1, n) each. Contain the real and imaginary parts, respectively, of the eigenvalues of the matrix H. Complex conjugate pairs of values must be stored in consecutive elements of the arrays.
If eigsrc = 'Q', the arrays must be exactly as returned by ?hseqr.

ldvl
INTEGER. The leading dimension of vl.
If job = 'L' or 'B', ldvl ≥ max(1, n).
If job = 'R', ldvl ≥ 1.

ldvr
INTEGER. The leading dimension of vr.
If job = 'R' or 'B', ldvr ≥ max(1, n).
If job = 'L', ldvr ≥ 1.

mm
INTEGER. The number of columns in vl and/or vr.
Must be at least m, the actual number of columns required (see Output Parameters below).
For real flavors, m is obtained by counting 1 for each selected real eigenvector and 2 for each selected complex eigenvector (see select).
For complex flavors, m is the number of selected eigenvectors (see select).
Constraint: 0 ≤ mm ≤ n.

rwork
REAL for chsein
DOUBLE PRECISION for zhsein.
Array, DIMENSION at least max(1, n).

Output Parameters

select
Overwritten for real flavors only. If a complex eigenvector was selected as specified above, then select(j) is set to .TRUE. and select(j+1) to .FALSE.

w
The real parts of some elements of w may be modified, as close eigenvalues are perturbed slightly in searching for independent eigenvectors.

wr
Some elements of wr may be modified, as close eigenvalues are perturbed slightly in searching for independent eigenvectors.

vl, vr
If job = 'L' or 'B', vl contains the computed left eigenvectors (as specified by select).
If job = 'R' or 'B', vr contains the computed right eigenvectors (as specified by select).
The eigenvectors are stored consecutively in the columns of the array, in the same order as their eigenvalues.
For real flavors: a real eigenvector corresponding to a selected real eigenvalue occupies one column; a complex eigenvector corresponding to a selected complex eigenvalue occupies two columns: the first column holds the real part and the second column holds the imaginary part.

m
INTEGER.
For real flavors: the number of columns of vl and/or vr required to store the selected eigenvectors.
For complex flavors: the number of selected eigenvectors.

ifaill, ifailr
INTEGER.
Arrays, DIMENSION at least max(1, mm) each.
ifaill(i) = 0 if the i-th column of vl converged;
ifaill(i) = j > 0 if the eigenvector stored in the i-th column of vl (corresponding to the j-th eigenvalue) failed to converge.
ifailr(i) = 0 if the i-th column of vr converged;
ifailr(i) = j > 0 if the eigenvector stored in the i-th column of vr (corresponding to the j-th eigenvalue) failed to converge.
For real flavors: if the i-th and (i+1)-th columns of vl contain a selected complex eigenvector, then ifaill(i) and ifaill(i+1) are set to the same value. A similar rule holds for vr and ifailr.
The array ifaill is not referenced if job = 'R'. The array ifailr is not referenced if job = 'L'.

info
INTEGER.
If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
If info = i > 0, then i eigenvectors (as indicated by the parameters ifaill and/or ifailr above) failed to converge. The corresponding columns of vl and/or vr contain no useful information.

Fortran 95 Interface Notes

Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine hsein interface are the following:

h
Holds the matrix H of size (n,n).

wr
Holds the vector of length n. Used in real flavors only.

wi
Holds the vector of length n. Used in real flavors only.

w
Holds the vector of length n. Used in complex flavors only.

select
Holds the vector of length n.

vl
Holds the matrix VL of size (n,mm).

vr
Holds the matrix VR of size (n,mm).

ifaill
Holds the vector of length (mm). Note that there will be an error condition if ifaill is present and vl is omitted.

ifailr
Holds the vector of length (mm). Note that there will be an error condition if ifailr is present and vr is omitted.

initv
Must be 'N' or 'U'. The default value is 'N'.

eigsrc
Must be 'N' or 'Q'. The default value is 'N'.
job
Restored based on the presence of arguments vl and vr as follows:
job = 'B', if both vl and vr are present,
job = 'L', if vl is present and vr omitted,
job = 'R', if vl is omitted and vr present.
Note that there will be an error condition if both vl and vr are omitted.

Application Notes

Each computed right eigenvector $x_i$ is the exact eigenvector of a nearby matrix $A + E_i$, such that $\|E_i\| < O(\epsilon)\|A\|$. Hence the residual is small: $\|A x_i - \lambda_i x_i\| = O(\epsilon)\|A\|$.
However, eigenvectors corresponding to close or coincident eigenvalues may not accurately span the relevant subspaces.
Similar remarks apply to computed left eigenvectors.

?trevc
Computes selected eigenvectors of an upper (quasi-) triangular matrix computed by ?hseqr.

Syntax

Fortran 77:
call strevc(side, howmny, select, n, t, ldt, vl, ldvl, vr, ldvr, mm, m, work, info)
call dtrevc(side, howmny, select, n, t, ldt, vl, ldvl, vr, ldvr, mm, m, work, info)
call ctrevc(side, howmny, select, n, t, ldt, vl, ldvl, vr, ldvr, mm, m, work, rwork, info)
call ztrevc(side, howmny, select, n, t, ldt, vl, ldvl, vr, ldvr, mm, m, work, rwork, info)

Fortran 95:
call trevc(t [,howmny] [,select] [,vl] [,vr] [,m] [,info])

C:
lapack_int LAPACKE_strevc( int matrix_order, char side, char howmny, lapack_logical* select, lapack_int n, const float* t, lapack_int ldt, float* vl, lapack_int ldvl, float* vr, lapack_int ldvr, lapack_int mm, lapack_int* m );
lapack_int LAPACKE_dtrevc( int matrix_order, char side, char howmny, lapack_logical* select, lapack_int n, const double* t, lapack_int ldt, double* vl, lapack_int ldvl, double* vr, lapack_int ldvr, lapack_int mm, lapack_int* m );
lapack_int LAPACKE_ctrevc( int matrix_order, char side, char howmny, const lapack_logical* select, lapack_int n, lapack_complex_float* t, lapack_int ldt, lapack_complex_float* vl, lapack_int ldvl, lapack_complex_float* vr, lapack_int ldvr, lapack_int mm, lapack_int* m );
lapack_int LAPACKE_ztrevc( int matrix_order, char side, char howmny, const lapack_logical* select, lapack_int n, lapack_complex_double* t, lapack_int ldt, lapack_complex_double* vl, lapack_int ldvl, lapack_complex_double* vr, lapack_int ldvr, lapack_int mm, lapack_int* m );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description

The routine computes some or all of the right and/or left eigenvectors of an upper triangular matrix T (or, for real flavors, an upper quasi-triangular matrix T). Matrices of this type are produced by the Schur factorization of a general matrix: $A = Q T Q^H$, as computed by hseqr.
The right eigenvector x and the left eigenvector y of T corresponding to an eigenvalue w are defined by: $T x = w x$, $y^H T = w y^H$, where $y^H$ denotes the conjugate transpose of y.
The eigenvalues are not input to this routine, but are read directly from the diagonal blocks of T.
This routine returns the matrices X and/or Y of right and left eigenvectors of T, or the products Q*X and/or Q*Y, where Q is an input matrix.
If Q is the orthogonal/unitary factor that reduces a matrix A to Schur form T, then Q*X and Q*Y are the matrices of right and left eigenvectors of A.

Input Parameters

The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

side
CHARACTER*1. Must be 'R' or 'L' or 'B'.
If side = 'R', then only right eigenvectors are computed.
If side = 'L', then only left eigenvectors are computed.
If side = 'B', then all eigenvectors are computed.

howmny
CHARACTER*1. Must be 'A' or 'B' or 'S'.
If howmny = 'A', then all eigenvectors (as specified by side) are computed.
If howmny = 'B', then all eigenvectors (as specified by side) are computed and backtransformed by the matrices supplied in vl and vr.
If howmny = 'S', then selected eigenvectors (as specified by side and select) are computed.

select
LOGICAL.
Array, DIMENSION at least max(1, n).
If howmny = 'S', select specifies which eigenvectors are to be computed.
If howmny = 'A' or 'B', select is not referenced.
For real flavors:
If omega(j) is a real eigenvalue, the corresponding real eigenvector is computed if select(j) is .TRUE.
If omega(j) and omega(j+1) are the real and imaginary parts of a complex eigenvalue, the corresponding complex eigenvector is computed if either select(j) or select(j+1) is .TRUE., and on exit select(j) is set to .TRUE. and select(j+1) is set to .FALSE.
For complex flavors:
The eigenvector corresponding to the j-th eigenvalue is computed if select(j) is .TRUE.

n
INTEGER. The order of the matrix T (n ≥ 0).

t, vl, vr, work
REAL for strevc
DOUBLE PRECISION for dtrevc
COMPLEX for ctrevc
DOUBLE COMPLEX for ztrevc.
Arrays:
t(ldt,*) contains the n-by-n matrix T in Schur canonical form. For complex flavors ctrevc and ztrevc, contains the upper triangular matrix T.
The second dimension of t must be at least max(1, n).
vl(ldvl,*)
If howmny = 'B' and side = 'L' or 'B', then vl must contain an n-by-n matrix Q (usually the matrix of Schur vectors returned by ?hseqr).
If howmny = 'A' or 'S', then vl need not be set.
The second dimension of vl must be at least max(1, mm) if side = 'L' or 'B' and at least 1 if side = 'R'.
The array vl is not referenced if side = 'R'.
vr(ldvr,*)
If howmny = 'B' and side = 'R' or 'B', then vr must contain an n-by-n matrix Q (usually the matrix of Schur vectors returned by ?hseqr).
If howmny = 'A' or 'S', then vr need not be set.
The second dimension of vr must be at least max(1, mm) if side = 'R' or 'B' and at least 1 if side = 'L'.
The array vr is not referenced if side = 'L'.
work(*) is a workspace array.
DIMENSION at least max(1, 3*n) for real flavors and at least max(1, 2*n) for complex flavors.
ldt
INTEGER. The leading dimension of t; at least max(1, n).

ldvl
INTEGER. The leading dimension of vl.
If side = 'L' or 'B', ldvl ≥ n.
If side = 'R', ldvl ≥ 1.

ldvr
INTEGER. The leading dimension of vr.
If side = 'R' or 'B', ldvr ≥ n.
If side = 'L', ldvr ≥ 1.

mm
INTEGER. The number of columns in the arrays vl and/or vr. Must be at least m (the precise number of columns required).
If howmny = 'A' or 'B', m = n.
If howmny = 'S': for real flavors, m is obtained by counting 1 for each selected real eigenvector and 2 for each selected complex eigenvector; for complex flavors, m is the number of selected eigenvectors (see select).
Constraint: 0 ≤ m ≤ n.

rwork
REAL for ctrevc
DOUBLE PRECISION for ztrevc.
Workspace array, DIMENSION at least max(1, n).

Output Parameters

select
If a complex eigenvector of a real matrix was selected as specified above, then select(j) is set to .TRUE. and select(j+1) to .FALSE.

t
COMPLEX for ctrevc
DOUBLE COMPLEX for ztrevc.
ctrevc/ztrevc modify the t(ldt,*) array, which is restored on exit.

vl, vr
If side = 'L' or 'B', vl contains the computed left eigenvectors (as specified by howmny and select).
If side = 'R' or 'B', vr contains the computed right eigenvectors (as specified by howmny and select).
The eigenvectors are stored consecutively in the columns of the array, in the same order as their eigenvalues.
For real flavors: corresponding to each real eigenvalue is a real eigenvector, occupying one column; corresponding to each complex conjugate pair of eigenvalues is a complex eigenvector, occupying two columns: the first column holds the real part and the second column holds the imaginary part.

m
INTEGER.
For complex flavors: the number of selected eigenvectors. If howmny = 'A' or 'B', m is set to n.
For real flavors: the number of columns of vl and/or vr actually used to store the selected eigenvectors. If howmny = 'A' or 'B', m is set to n.

info
INTEGER.
If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes

Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine trevc interface are the following:

t
Holds the matrix T of size (n,n).

select
Holds the vector of length n.

vl
Holds the matrix VL of size (n,mm).

vr
Holds the matrix VR of size (n,mm).

side
If omitted, this argument is restored based on the presence of arguments vl and vr as follows:
side = 'B', if both vl and vr are present,
side = 'L', if vr is omitted,
side = 'R', if vl is omitted.
Note that there will be an error condition if both vl and vr are omitted.

howmny
If omitted, this argument is restored based on the presence of argument select as follows:
howmny = 'S', if select is present,
howmny = 'A', if select is omitted.
If present, howmny must be 'A' or 'B' and the argument select must be omitted.
Note that there will be an error condition if both select and howmny are present.

Application Notes

If $x_i$ is an exact right eigenvector and $y_i$ is the corresponding computed eigenvector, then the angle $\theta(y_i, x_i)$ between them is bounded as follows: $\theta(y_i, x_i) \le c(n)\,\epsilon\,\|T\|_2 / \mathrm{sep}_i$, where $\mathrm{sep}_i$ is the reciprocal condition number of $x_i$.
The condition number $\mathrm{sep}_i$ may be computed by calling ?trsna.

?trsna
Estimates condition numbers for specified eigenvalues and right eigenvectors of an upper (quasi-) triangular matrix.
Syntax

Fortran 77:
call strsna(job, howmny, select, n, t, ldt, vl, ldvl, vr, ldvr, s, sep, mm, m, work, ldwork, iwork, info)
call dtrsna(job, howmny, select, n, t, ldt, vl, ldvl, vr, ldvr, s, sep, mm, m, work, ldwork, iwork, info)
call ctrsna(job, howmny, select, n, t, ldt, vl, ldvl, vr, ldvr, s, sep, mm, m, work, ldwork, rwork, info)
call ztrsna(job, howmny, select, n, t, ldt, vl, ldvl, vr, ldvr, s, sep, mm, m, work, ldwork, rwork, info)

Fortran 95:
call trsna(t [,s] [,sep] [,vl] [,vr] [,select] [,m] [,info])

C:
lapack_int LAPACKE_strsna( int matrix_order, char job, char howmny, const lapack_logical* select, lapack_int n, const float* t, lapack_int ldt, const float* vl, lapack_int ldvl, const float* vr, lapack_int ldvr, float* s, float* sep, lapack_int mm, lapack_int* m );
lapack_int LAPACKE_dtrsna( int matrix_order, char job, char howmny, const lapack_logical* select, lapack_int n, const double* t, lapack_int ldt, const double* vl, lapack_int ldvl, const double* vr, lapack_int ldvr, double* s, double* sep, lapack_int mm, lapack_int* m );
lapack_int LAPACKE_ctrsna( int matrix_order, char job, char howmny, const lapack_logical* select, lapack_int n, const lapack_complex_float* t, lapack_int ldt, const lapack_complex_float* vl, lapack_int ldvl, const lapack_complex_float* vr, lapack_int ldvr, float* s, float* sep, lapack_int mm, lapack_int* m );
lapack_int LAPACKE_ztrsna( int matrix_order, char job, char howmny, const lapack_logical* select, lapack_int n, const lapack_complex_double* t, lapack_int ldt, const lapack_complex_double* vl, lapack_int ldvl, const lapack_complex_double* vr, lapack_int ldvr, double* s, double* sep, lapack_int mm, lapack_int* m );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description

The routine estimates condition numbers for specified eigenvalues and/or right eigenvectors of an upper triangular matrix T (or, for real flavors, upper quasi-triangular matrix T in canonical Schur form). These are the same as the condition numbers of the eigenvalues and right eigenvectors of an original matrix $A = Z T Z^H$ (with unitary or, for real flavors, orthogonal Z), from which T may have been derived.
The routine computes the reciprocal of the condition number of an eigenvalue lambda(i) as $s_i = |v^T u| / (\|u\|_E \|v\|_E)$ for real flavors and $s_i = |v^H u| / (\|u\|_E \|v\|_E)$ for complex flavors, where:
• u and v are the right and left eigenvectors of T, respectively, corresponding to lambda(i).
• $v^T$ and $v^H$ denote the transpose and conjugate transpose of v, respectively.
This reciprocal condition number always lies between zero (ill-conditioned) and one (well-conditioned).
An approximate error estimate for a computed eigenvalue lambda(i) is then given by $\epsilon\,\|T\| / s_i$, where $\epsilon$ is the machine precision.
To estimate the reciprocal of the condition number of the right eigenvector corresponding to lambda(i), the routine first calls trexc to reorder the eigenvalues so that lambda(i) is in the leading position:
$T = Q \begin{pmatrix} \lambda_i & c^H \\ 0 & T_{22} \end{pmatrix} Q^H$
The reciprocal condition number of the eigenvector is then estimated as $\mathrm{sep}_i$, the smallest singular value of the matrix $T_{22} - \lambda_i I$. This number ranges from zero (ill-conditioned) to very large (well-conditioned).
An approximate error estimate for a computed right eigenvector u corresponding to lambda(i) is then given by $\epsilon\,\|T\| / \mathrm{sep}_i$.

Input Parameters

The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

job
CHARACTER*1. Must be 'E' or 'V' or 'B'.
If job = 'E', then condition numbers for eigenvalues only are computed.
If job = 'V', then condition numbers for eigenvectors only are computed.
If job = 'B', then condition numbers for both eigenvalues and eigenvectors are computed.

howmny
CHARACTER*1. Must be 'A' or 'S'.
If howmny = 'A', then the condition numbers for all eigenpairs are computed.
If howmny = 'S', then condition numbers for selected eigenpairs (as specified by select) are computed.

select
LOGICAL.
Array, DIMENSION at least max(1, n) if howmny = 'S' and at least 1 otherwise.
Specifies the eigenpairs for which condition numbers are to be computed if howmny = 'S'.
For real flavors:
To select condition numbers for the eigenpair corresponding to the real eigenvalue lambda(j), select(j) must be set to .TRUE.;
to select condition numbers for the eigenpair corresponding to a complex conjugate pair of eigenvalues lambda(j) and lambda(j+1), select(j) and/or select(j+1) must be set to .TRUE.
For complex flavors:
To select condition numbers for the eigenpair corresponding to the eigenvalue lambda(j), select(j) must be set to .TRUE.
select is not referenced if howmny = 'A'.

n
INTEGER. The order of the matrix T (n ≥ 0).

t, vl, vr, work
REAL for strsna
DOUBLE PRECISION for dtrsna
COMPLEX for ctrsna
DOUBLE COMPLEX for ztrsna.
Arrays:
t(ldt,*) contains the n-by-n matrix T.
The second dimension of t must be at least max(1, n).
vl(ldvl,*)
If job = 'E' or 'B', then vl must contain the left eigenvectors of T (or of any matrix $Q T Q^H$ with Q unitary or orthogonal) corresponding to the eigenpairs specified by howmny and select. The eigenvectors must be stored in consecutive columns of vl, as returned by trevc or hsein.
The second dimension of vl must be at least max(1, mm) if job = 'E' or 'B' and at least 1 if job = 'V'.
The array vl is not referenced if job = 'V'.
vr(ldvr,*)
If job = 'E' or 'B', then vr must contain the right eigenvectors of T (or of any matrix $Q T Q^H$ with Q unitary or orthogonal) corresponding to the eigenpairs specified by howmny and select. The eigenvectors must be stored in consecutive columns of vr, as returned by trevc or hsein.
The second dimension of vr must be at least max(1, mm) if job = 'E' or 'B' and at least 1 if job = 'V'.
The array vr is not referenced if job = 'V'.
work is a workspace array, its dimension (ldwork, n+6).
The array work is not referenced if job = 'E'.

ldt
INTEGER. The leading dimension of t; at least max(1, n).

ldvl
INTEGER. The leading dimension of vl.
If job = 'E' or 'B', ldvl ≥ max(1, n).
If job = 'V', ldvl ≥ 1.

ldvr
INTEGER. The leading dimension of vr.
If job = 'E' or 'B', ldvr ≥ max(1, n).
If job = 'V', ldvr ≥ 1.

mm
INTEGER. The number of elements in the arrays s and sep, and the number of columns in vl and vr (if used). Must be at least m (the precise number required).
If howmny = 'A', m = n.
If howmny = 'S': for real flavors, m is obtained by counting 1 for each selected real eigenvalue and 2 for each selected complex conjugate pair of eigenvalues; for complex flavors, m is the number of selected eigenpairs (see select).
Constraint: 0 ≤ m ≤ n.

ldwork
INTEGER. The leading dimension of work.
If job = 'V' or 'B', ldwork ≥ max(1, n).
If job = 'E', ldwork ≥ 1.

rwork
REAL for ctrsna
DOUBLE PRECISION for ztrsna.
Array, DIMENSION at least max(1, n). The array is not referenced if job = 'E'.

iwork
INTEGER for strsna, dtrsna.
Array, DIMENSION at least max(1, 2*(n - 1)). The array is not referenced if job = 'E'.

Output Parameters

s
REAL for single-precision flavors
DOUBLE PRECISION for double-precision flavors.
Array, DIMENSION at least max(1, mm) if job = 'E' or 'B' and at least 1 if job = 'V'.
Contains the reciprocal condition numbers of the selected eigenvalues if job = 'E' or 'B', stored in consecutive elements of the array. Thus s(j), sep(j) and the j-th columns of vl and vr all correspond to the same eigenpair (but not in general the j-th eigenpair unless all eigenpairs have been selected).
For real flavors: for a complex conjugate pair of eigenvalues, two consecutive elements of s are set to the same value.
The array s is not referenced if job = 'V'.

sep
REAL for single-precision flavors
DOUBLE PRECISION for double-precision flavors.
Array, DIMENSION at least max(1, mm) if job = 'V' or 'B' and at least 1 if job = 'E'.
Contains the estimated reciprocal condition numbers of the selected right eigenvectors if job = 'V' or 'B', stored in consecutive elements of the array.
For real flavors: for a complex eigenvector, two consecutive elements of sep are set to the same value; if the eigenvalues cannot be reordered to compute sep(j), then sep(j) is set to zero; this can only occur when the true value would be very small anyway.
The array sep is not referenced if job = 'E'.

m
INTEGER.
For complex flavors: the number of selected eigenpairs. If howmny = 'A', m is set to n.
For real flavors: the number of elements of s and/or sep actually used to store the estimated condition numbers. If howmny = 'A', m is set to n.

info
INTEGER.
If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes

Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine trsna interface are the following:

t
Holds the matrix T of size (n,n).

s
Holds the vector of length (mm).

sep
Holds the vector of length (mm).

vl
Holds the matrix VL of size (n,mm).

vr
Holds the matrix VR of size (n,mm).

select
Holds the vector of length n.

job
Restored based on the presence of arguments s and sep as follows:
job = 'B', if both s and sep are present,
job = 'E', if s is present and sep omitted,
job = 'V', if s is omitted and sep present.
Note that there will be an error condition if both s and sep are omitted.
howmny
Restored based on the presence of the argument select as follows:
howmny = 'S', if select is present,
howmny = 'A', if select is omitted.
Note that the arguments s, vl, and vr must either be all present or all omitted. Otherwise, an error condition is observed.

Application Notes

The computed values $\mathrm{sep}_i$ may overestimate the true value, but seldom by a factor of more than 3.

?trexc
Reorders the Schur factorization of a general matrix.

Syntax

Fortran 77:
call strexc(compq, n, t, ldt, q, ldq, ifst, ilst, work, info)
call dtrexc(compq, n, t, ldt, q, ldq, ifst, ilst, work, info)
call ctrexc(compq, n, t, ldt, q, ldq, ifst, ilst, info)
call ztrexc(compq, n, t, ldt, q, ldq, ifst, ilst, info)

Fortran 95:
call trexc(t, ifst, ilst [,q] [,info])

C:
lapack_int LAPACKE_strexc( int matrix_order, char compq, lapack_int n, float* t, lapack_int ldt, float* q, lapack_int ldq, lapack_int* ifst, lapack_int* ilst );
lapack_int LAPACKE_dtrexc( int matrix_order, char compq, lapack_int n, double* t, lapack_int ldt, double* q, lapack_int ldq, lapack_int* ifst, lapack_int* ilst );
lapack_int LAPACKE_ctrexc( int matrix_order, char compq, lapack_int n, lapack_complex_float* t, lapack_int ldt, lapack_complex_float* q, lapack_int ldq, lapack_int ifst, lapack_int ilst );
lapack_int LAPACKE_ztrexc( int matrix_order, char compq, lapack_int n, lapack_complex_double* t, lapack_int ldt, lapack_complex_double* q, lapack_int ldq, lapack_int ifst, lapack_int ilst );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description

The routine reorders the Schur factorization of a general matrix $A = Q T Q^H$, so that the diagonal element or block of T with row index ifst is moved to row ilst.
The reordered Schur form S is computed by a unitary (or, for real flavors, orthogonal) similarity transformation: $S = Z^H T Z$.
Optionally the updated matrix P of Schur vectors is computed as P = Q*Z, giving A = P*S*P^H.
Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
compq CHARACTER*1. Must be 'V' or 'N'. If compq = 'V', then the Schur vectors (Q) are updated. If compq = 'N', then no Schur vectors are updated.
n INTEGER. The order of the matrix T (n ≥ 0).
t, q REAL for strexc, DOUBLE PRECISION for dtrexc, COMPLEX for ctrexc, DOUBLE COMPLEX for ztrexc. Arrays: t(ldt,*) contains the n-by-n matrix T. The second dimension of t must be at least max(1, n). q(ldq,*): if compq = 'V', then q must contain Q (Schur vectors). If compq = 'N', then q is not referenced. The second dimension of q must be at least max(1, n) if compq = 'V' and at least 1 if compq = 'N'.
ldt INTEGER. The leading dimension of t; at least max(1, n).
ldq INTEGER. The leading dimension of q. If compq = 'N', then ldq ≥ 1. If compq = 'V', then ldq ≥ max(1, n).
ifst, ilst INTEGER. 1 ≤ ifst ≤ n; 1 ≤ ilst ≤ n. Must specify the reordering of the diagonal elements (or blocks, which is possible for real flavors) of the matrix T. The element (or block) with row index ifst is moved to row ilst by a sequence of exchanges between adjacent elements (or blocks).
work REAL for strexc, DOUBLE PRECISION for dtrexc. Array, DIMENSION at least max(1, n).
Output Parameters
t Overwritten by the updated matrix S.
q If compq = 'V', q contains the updated matrix of Schur vectors.
ifst, ilst Overwritten for real flavors only. If ifst pointed to the second row of a 2-by-2 block on entry, it is changed to point to the first row; ilst always points to the first row of the block in its final position (which may differ from its input value by ±1).
info INTEGER. If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
Fortran 95 Interface Notes
Routines in the Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions. Specific details for the routine trexc interface are the following:
t Holds the matrix T of size (n,n).
q Holds the matrix Q of size (n,n).
compq Restored based on the presence of the argument q as follows: compq = 'V', if q is present; compq = 'N', if q is omitted.
Application Notes
The computed matrix S is exactly similar to a matrix T+E, where ||E||_2 = O(ε)*||T||_2, and ε is the machine precision.
Note that if a 2-by-2 diagonal block is involved in the re-ordering, its off-diagonal elements are in general changed; the diagonal elements and the eigenvalues of the block are unchanged unless the block is sufficiently ill-conditioned, in which case they may be noticeably altered. It is possible for a 2-by-2 block to break into two 1-by-1 blocks, that is, for a pair of complex eigenvalues to become purely real. The values of eigenvalues however are never changed by the re-ordering.
The approximate number of floating-point operations is
for real flavors: 6n(ifst-ilst) if compq = 'N'; 12n(ifst-ilst) if compq = 'V';
for complex flavors: 20n(ifst-ilst) if compq = 'N'; 40n(ifst-ilst) if compq = 'V'.
?trsen
Reorders the Schur factorization of a matrix and (optionally) computes the reciprocal condition numbers and invariant subspace for the selected cluster of eigenvalues.
Syntax
Fortran 77:
call strsen(job, compq, select, n, t, ldt, q, ldq, wr, wi, m, s, sep, work, lwork, iwork, liwork, info)
call dtrsen(job, compq, select, n, t, ldt, q, ldq, wr, wi, m, s, sep, work, lwork, iwork, liwork, info)
call ctrsen(job, compq, select, n, t, ldt, q, ldq, w, m, s, sep, work, lwork, info)
call ztrsen(job, compq, select, n, t, ldt, q, ldq, w, m, s, sep, work, lwork, info)
Fortran 95:
call trsen(t, select [,wr] [,wi] [,m] [,s] [,sep] [,q] [,info])
call trsen(t, select [,w] [,m] [,s] [,sep] [,q] [,info])
C:
lapack_int LAPACKE_strsen( int matrix_order, char job, char compq, const lapack_logical* select, lapack_int n, float* t, lapack_int ldt, float* q, lapack_int ldq, float* wr, float* wi, lapack_int* m, float* s, float* sep );
lapack_int LAPACKE_dtrsen( int matrix_order, char job, char compq, const lapack_logical* select, lapack_int n, double* t, lapack_int ldt, double* q, lapack_int ldq, double* wr, double* wi, lapack_int* m, double* s, double* sep );
lapack_int LAPACKE_ctrsen( int matrix_order, char job, char compq, const lapack_logical* select, lapack_int n, lapack_complex_float* t, lapack_int ldt, lapack_complex_float* q, lapack_int ldq, lapack_complex_float* w, lapack_int* m, float* s, float* sep );
lapack_int LAPACKE_ztrsen( int matrix_order, char job, char compq, const lapack_logical* select, lapack_int n, lapack_complex_double* t, lapack_int ldt, lapack_complex_double* q, lapack_int ldq, lapack_complex_double* w, lapack_int* m, double* s, double* sep );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine reorders the Schur factorization of a general matrix A = Q*T*Q^T (for real flavors) or A = Q*T*Q^H (for complex flavors) so that a selected cluster of eigenvalues appears in the leading diagonal elements (or, for real flavors, diagonal blocks) of the Schur form.
The reordered Schur form R is computed by a unitary (orthogonal) similarity transformation: R = Z^H*T*Z. Optionally the updated matrix P of Schur vectors is computed as P = Q*Z, giving A = P*R*P^H.
Let the reordered matrix be partitioned as R = [T11 T12; 0 T22], where the selected eigenvalues are precisely the eigenvalues of the leading m-by-m submatrix T11. Let P be correspondingly partitioned as (Q1 Q2), where Q1 consists of the first m columns of P. Then A*Q1 = Q1*T11, and so the m columns of Q1 form an orthonormal basis for the invariant subspace corresponding to the selected cluster of eigenvalues.
Optionally the routine also computes estimates of the reciprocal condition numbers of the average of the cluster of eigenvalues and of the invariant subspace.
Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
job CHARACTER*1. Must be 'N' or 'E' or 'V' or 'B'. If job = 'N', then no condition numbers are required. If job = 'E', then only the condition number for the cluster of eigenvalues is computed. If job = 'V', then only the condition number for the invariant subspace is computed. If job = 'B', then condition numbers for both the cluster and the invariant subspace are computed.
compq CHARACTER*1. Must be 'V' or 'N'. If compq = 'V', then the matrix Q of Schur vectors is updated. If compq = 'N', then no Schur vectors are updated.
select LOGICAL. Array, DIMENSION at least max(1, n). Specifies the eigenvalues in the selected cluster. To select an eigenvalue lambda(j), select(j) must be .TRUE.
For real flavors: to select a complex conjugate pair of eigenvalues lambda(j) and lambda(j+1) (corresponding to a 2-by-2 diagonal block), select(j) and/or select(j+1) must be .TRUE.; the complex conjugate pair lambda(j) and lambda(j+1) must be either both included in the cluster or both excluded.
n INTEGER. The order of the matrix T (n ≥ 0).
t, q, work REAL for strsen, DOUBLE PRECISION for dtrsen, COMPLEX for ctrsen, DOUBLE COMPLEX for ztrsen. Arrays: t(ldt,*) contains the n-by-n matrix T. The second dimension of t must be at least max(1, n). q(ldq,*): if compq = 'V', then q must contain the matrix Q of Schur vectors. If compq = 'N', then q is not referenced. The second dimension of q must be at least max(1, n) if compq = 'V' and at least 1 if compq = 'N'. work is a workspace array of dimension max(1, lwork).
ldt INTEGER. The leading dimension of t; at least max(1, n).
ldq INTEGER. The leading dimension of q. If compq = 'N', then ldq ≥ 1. If compq = 'V', then ldq ≥ max(1, n).
lwork INTEGER. The dimension of the array work. If job = 'V' or 'B', lwork ≥ max(1, 2*m*(n-m)). If job = 'E', then lwork ≥ max(1, m*(n-m)). If job = 'N', then lwork ≥ 1 for complex flavors and lwork ≥ max(1, n) for real flavors. If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla. See Application Notes for details.
iwork INTEGER. iwork(liwork) is a workspace array. The array iwork is not referenced if job = 'N' or 'E'. The actual amount of workspace required cannot exceed n^2/2 if job = 'V' or 'B'.
liwork INTEGER. The dimension of the array iwork. If job = 'V' or 'B', liwork ≥ max(1, 2m(n-m)). If job = 'N' or 'E', liwork ≥ 1.
If liwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the iwork array, returns this value as the first entry of the iwork array, and no error message related to liwork is issued by xerbla. See Application Notes for details.
Output Parameters
t Overwritten by the updated matrix R.
q If compq = 'V', q contains the updated matrix of Schur vectors; the first m columns of q form an orthonormal basis for the specified invariant subspace.
w COMPLEX for ctrsen, DOUBLE COMPLEX for ztrsen. Array, DIMENSION at least max(1, n). The reordered eigenvalues of R. The eigenvalues are stored in the same order as on the diagonal of R.
wr, wi REAL for strsen, DOUBLE PRECISION for dtrsen. Arrays, DIMENSION at least max(1, n). Contain the real and imaginary parts, respectively, of the reordered eigenvalues of R. The eigenvalues are stored in the same order as on the diagonal of R. Note that if a complex eigenvalue is sufficiently ill-conditioned, then its value may differ significantly from its value before reordering.
m INTEGER. For complex flavors: the dimension of the specified invariant subspace, which is the same as the number of selected eigenvalues (see select). For real flavors: the dimension of the specified invariant subspace. The value of m is obtained by counting 1 for each selected real eigenvalue and 2 for each selected complex conjugate pair of eigenvalues (see select). Constraint: 0 ≤ m ≤ n.
s REAL for single-precision flavors, DOUBLE PRECISION for double-precision flavors. If job = 'E' or 'B', s is a lower bound on the reciprocal condition number of the average of the selected cluster of eigenvalues. If m = 0 or n, then s = 1. For real flavors: if info = 1, then s is set to zero. s is not referenced if job = 'N' or 'V'.
sep REAL for single-precision flavors, DOUBLE PRECISION for double-precision flavors. If job = 'V' or 'B', sep is the estimated reciprocal condition number of the specified invariant subspace.
If m = 0 or n, then sep = ||T||. For real flavors: if info = 1, then sep is set to zero. sep is not referenced if job = 'N' or 'E'.
work(1) On exit, if info = 0, then work(1) returns the optimal size of lwork.
iwork(1) On exit, if info = 0, then iwork(1) returns the optimal size of liwork.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.
Fortran 95 Interface Notes
Routines in the Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions. Specific details for the routine trsen interface are the following:
t Holds the matrix T of size (n,n).
select Holds the vector of length n.
wr Holds the vector of length n. Used in real flavors only.
wi Holds the vector of length n. Used in real flavors only.
w Holds the vector of length n. Used in complex flavors only.
q Holds the matrix Q of size (n,n).
compq Restored based on the presence of the argument q as follows: compq = 'V', if q is present; compq = 'N', if q is omitted.
job Restored based on the presence of the arguments s and sep as follows: job = 'B', if both s and sep are present; job = 'E', if s is present and sep is omitted; job = 'V', if s is omitted and sep is present; job = 'N', if both s and sep are omitted.
Application Notes
The computed matrix R is exactly similar to a matrix T+E, where ||E||_2 = O(ε)*||T||_2, and ε is the machine precision. The computed s cannot underestimate the true reciprocal condition number by more than a factor of (min(m, n-m))^(1/2); sep may differ from the true value by a factor of (m*n-m^2)^(1/2). The angle between the computed invariant subspace and the true subspace is O(ε)*||A||_2/sep.
Note that if a 2-by-2 diagonal block is involved in the re-ordering, its off-diagonal elements are in general changed; the diagonal elements and the eigenvalues of the block are unchanged unless the block is sufficiently ill-conditioned, in which case they may be noticeably altered. It is possible for a 2-by-2 block to break into two 1-by-1 blocks, that is, for a pair of complex eigenvalues to become purely real. The values of eigenvalues however are never changed by the re-ordering.
If it is not clear how much workspace to supply, use a generous value of lwork (or liwork) for the first run or set lwork = -1 (liwork = -1).
If lwork (or liwork) has any admissible size, that is, no less than the minimal value described, the routine completes the task, though possibly not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the corresponding array (work, iwork) on exit. Use this value (work(1), iwork(1)) for subsequent runs.
If lwork = -1 (liwork = -1), the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work, iwork). This operation is called a workspace query.
Note that if lwork (liwork) is less than the minimal required value and is not equal to -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.
?trsyl
Solves Sylvester equation for real quasi-triangular or complex triangular matrices.
Syntax
Fortran 77:
call strsyl(trana, tranb, isgn, m, n, a, lda, b, ldb, c, ldc, scale, info)
call dtrsyl(trana, tranb, isgn, m, n, a, lda, b, ldb, c, ldc, scale, info)
call ctrsyl(trana, tranb, isgn, m, n, a, lda, b, ldb, c, ldc, scale, info)
call ztrsyl(trana, tranb, isgn, m, n, a, lda, b, ldb, c, ldc, scale, info)
Fortran 95:
call trsyl(a, b, c, scale [, trana] [,tranb] [,isgn] [,info])
C:
lapack_int LAPACKE_strsyl( int matrix_order, char trana, char tranb, lapack_int isgn, lapack_int m, lapack_int n, const float* a, lapack_int lda, const float* b, lapack_int ldb, float* c, lapack_int ldc, float* scale );
lapack_int LAPACKE_dtrsyl( int matrix_order, char trana, char tranb, lapack_int isgn, lapack_int m, lapack_int n, const double* a, lapack_int lda, const double* b, lapack_int ldb, double* c, lapack_int ldc, double* scale );
lapack_int LAPACKE_ctrsyl( int matrix_order, char trana, char tranb, lapack_int isgn, lapack_int m, lapack_int n, const lapack_complex_float* a, lapack_int lda, const lapack_complex_float* b, lapack_int ldb, lapack_complex_float* c, lapack_int ldc, float* scale );
lapack_int LAPACKE_ztrsyl( int matrix_order, char trana, char tranb, lapack_int isgn, lapack_int m, lapack_int n, const lapack_complex_double* a, lapack_int lda, const lapack_complex_double* b, lapack_int ldb, lapack_complex_double* c, lapack_int ldc, double* scale );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine solves the Sylvester matrix equation op(A)*X ± X*op(B) = α*C, where op(A) = A or A^H, and the matrices A and B are upper triangular (or, for real flavors, upper quasi-triangular in canonical Schur form); α ≤ 1 is a scale factor determined by the routine to avoid overflow in X; A is m-by-m, B is n-by-n, and C and X are both m-by-n. The matrix X is obtained by a straightforward process of back substitution.
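To illustrate the back substitution mentioned above, here is a minimal sketch in C. It is not the ?trsyl implementation: it assumes real upper triangular A and B with no 2-by-2 diagonal blocks, op(A) = A and op(B) = B, row-major storage, and no scaling (the scale factor is effectively 1). The function name sylvester_upper and its argument layout are hypothetical and chosen only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Sketch only: solves A*X + isgn*X*B = C in place (c is overwritten by X)
 * for upper triangular A (m-by-m) and B (n-by-n), all stored row-major.
 * Returns 0 on success, or 1 if some a(i,i) + isgn*b(j,j) is exactly zero,
 * in which case the solution is not unique and c is left partly updated. */
int sylvester_upper(int isgn, size_t m, size_t n,
                    const double *a, const double *b, double *c)
{
    assert(isgn == 1 || isgn == -1);
    for (size_t i = m; i-- > 0;) {           /* rows, bottom to top      */
        for (size_t j = 0; j < n; ++j) {     /* columns, left to right   */
            double rhs = c[i * n + j];
            /* Subtract terms that use rows of X below row i. */
            for (size_t k = i + 1; k < m; ++k)
                rhs -= a[i * m + k] * c[k * n + j];
            /* Subtract terms that use entries of X left of column j. */
            for (size_t k = 0; k < j; ++k)
                rhs -= isgn * c[i * n + k] * b[k * n + j];
            double d = a[i * m + i] + isgn * b[j * n + j];
            if (d == 0.0)
                return 1;
            c[i * n + j] = rhs / d;
        }
    }
    return 0;
}
```

For A = [1 2; 0 3], B = [4 1; 0 5], and C = [11 21; 21 35], a call with isgn = +1 overwrites c with X = [1 2; 3 4]. Unlike ?trsyl, this sketch does not rescale to avoid overflow and does not perturb close eigenvalues; it simply reports the singular case.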
The equation has a unique solution if and only if α_i ± β_i ≠ 0, where {α_i} and {β_i} are the eigenvalues of A and B, respectively, and the sign (+ or -) is the same as that used in the equation to be solved.
Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
trana CHARACTER*1. Must be 'N' or 'T' or 'C'. If trana = 'N', then op(A) = A. If trana = 'T', then op(A) = A^T (real flavors only). If trana = 'C', then op(A) = A^H.
tranb CHARACTER*1. Must be 'N' or 'T' or 'C'. If tranb = 'N', then op(B) = B. If tranb = 'T', then op(B) = B^T (real flavors only). If tranb = 'C', then op(B) = B^H.
isgn INTEGER. Indicates the form of the Sylvester equation. If isgn = +1, op(A)*X + X*op(B) = α*C. If isgn = -1, op(A)*X - X*op(B) = α*C.
m INTEGER. The order of A, and the number of rows in X and C (m ≥ 0).
n INTEGER. The order of B, and the number of columns in X and C (n ≥ 0).
a, b, c REAL for strsyl, DOUBLE PRECISION for dtrsyl, COMPLEX for ctrsyl, DOUBLE COMPLEX for ztrsyl. Arrays: a(lda,*) contains the matrix A. The second dimension of a must be at least max(1, m). b(ldb,*) contains the matrix B. The second dimension of b must be at least max(1, n). c(ldc,*) contains the matrix C. The second dimension of c must be at least max(1, n).
lda INTEGER. The leading dimension of a; at least max(1, m).
ldb INTEGER. The leading dimension of b; at least max(1, n).
ldc INTEGER. The leading dimension of c; at least max(1, m).
Output Parameters
c Overwritten by the solution matrix X.
scale REAL for single-precision flavors, DOUBLE PRECISION for double-precision flavors. The value of the scale factor α.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.
If info = 1, A and B have common or close eigenvalues; perturbed values were used to solve the equation.
Fortran 95 Interface Notes
Routines in the Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions. Specific details for the routine trsyl interface are the following:
a Holds the matrix A of size (m,m).
b Holds the matrix B of size (n,n).
c Holds the matrix C of size (m,n).
trana Must be 'N', 'C', or 'T'. The default value is 'N'.
tranb Must be 'N', 'C', or 'T'. The default value is 'N'.
isgn Must be +1 or -1. The default value is +1.
Application Notes
Let X be the exact, Y the corresponding computed solution, and R the residual matrix: R = C - (AY ± YB). Then the residual is always small: ||R||_F = O(ε)*(||A||_F + ||B||_F)*||Y||_F.
However, Y is not necessarily the exact solution of a slightly perturbed equation; in other words, the solution is not backward stable.
For the forward error, the following bound holds: ||Y - X||_F ≤ ||R||_F/sep(A,B), but this may be a considerable overestimate. See [Golub96] for a definition of sep(A, B).
The approximate number of floating-point operations for real flavors is m*n*(m + n). For complex flavors it is 4 times greater.
Generalized Nonsymmetric Eigenvalue Problems
This section describes LAPACK routines for solving generalized nonsymmetric eigenvalue problems, reordering the generalized Schur factorization of a pair of matrices, as well as performing a number of related computational tasks.
A generalized nonsymmetric eigenvalue problem is as follows: given a pair of nonsymmetric (or non-Hermitian) n-by-n matrices A and B, find the generalized eigenvalues λ and the corresponding generalized eigenvectors x and y that satisfy the equations Ax = λBx (right generalized eigenvectors x) and y^H*A = λ*y^H*B (left generalized eigenvectors y).
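As a small worked check of the defining equation for right generalized eigenvectors, the sketch below evaluates the residual A*x - λ*B*x for a candidate pair (λ, x). It is an illustration only and is not part of the library: the helper name gen_eig_residual and the row-major, real-valued layout are assumptions made for this example. It relies on the fact that for an upper triangular pair (A,B) with nonzero b(i,i), the generalized eigenvalues are the ratios a(i,i)/b(i,i).

```c
#include <assert.h>
#include <stddef.h>

/* Sketch only: returns the largest absolute entry of A*x - lambda*B*x
 * for real n-by-n matrices A and B stored row-major. A value of zero
 * (or close to rounding level) means (lambda, x) satisfies A*x = lambda*B*x. */
double gen_eig_residual(size_t n, const double *a, const double *b,
                        const double *x, double lambda)
{
    double worst = 0.0;
    assert(a != NULL && b != NULL && x != NULL);
    for (size_t i = 0; i < n; ++i) {
        double r = 0.0;
        /* Row i of (A - lambda*B) applied to x. */
        for (size_t j = 0; j < n; ++j)
            r += (a[i * n + j] - lambda * b[i * n + j]) * x[j];
        if (r < 0.0)
            r = -r;
        if (r > worst)
            worst = r;
    }
    return worst;
}
```

For A = [2 1; 0 6] and B = [1 0; 0 2], the diagonal ratios give λ = 2 and λ = 3; the residual vanishes for (λ, x) = (2, (1,0)) and (3, (1,1)), while pairing λ = 2 with x = (1,1) leaves a nonzero residual.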
Table "Computational Routines for Solving Generalized Nonsymmetric Eigenvalue Problems" lists LAPACK routines (FORTRAN 77 interface) used to solve the generalized nonsymmetric eigenvalue problems and the generalized Sylvester equation. Respective routine names in the Fortran 95 interface are without the first symbol (see Routine Naming Conventions).
Computational Routines for Solving Generalized Nonsymmetric Eigenvalue Problems
Routine name    Operation performed
gghrd    Reduces a pair of matrices to generalized upper Hessenberg form using orthogonal/unitary transformations.
ggbal    Balances a pair of general real or complex matrices.
ggbak    Forms the right or left eigenvectors of a generalized eigenvalue problem.
hgeqz    Implements the QZ method for finding the generalized eigenvalues of the matrix pair (H,T).
tgevc    Computes some or all of the right and/or left generalized eigenvectors of a pair of upper triangular matrices.
tgexc    Reorders the generalized Schur decomposition of a pair of matrices (A,B) so that one diagonal block of (A,B) moves to another row index.
tgsen    Reorders the generalized Schur decomposition of a pair of matrices (A,B) so that a selected cluster of eigenvalues appears in the leading diagonal blocks of (A,B).
tgsyl    Solves the generalized Sylvester equation.
tgsna    Estimates reciprocal condition numbers for specified eigenvalues and/or eigenvectors of a pair of matrices in generalized real Schur canonical form.
?gghrd
Reduces a pair of matrices to generalized upper Hessenberg form using orthogonal/unitary transformations.
Syntax
Fortran 77:
call sgghrd(compq, compz, n, ilo, ihi, a, lda, b, ldb, q, ldq, z, ldz, info)
call dgghrd(compq, compz, n, ilo, ihi, a, lda, b, ldb, q, ldq, z, ldz, info)
call cgghrd(compq, compz, n, ilo, ihi, a, lda, b, ldb, q, ldq, z, ldz, info)
call zgghrd(compq, compz, n, ilo, ihi, a, lda, b, ldb, q, ldq, z, ldz, info)
Fortran 95:
call gghrd(a, b [,ilo] [,ihi] [,q] [,z] [,compq] [,compz] [,info])
C:
lapack_int LAPACKE_<?>gghrd( int matrix_order, char compq, char compz, lapack_int n, lapack_int ilo, lapack_int ihi, <datatype>* a, lapack_int lda, <datatype>* b, lapack_int ldb, <datatype>* q, lapack_int ldq, <datatype>* z, lapack_int ldz );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine reduces a pair of real/complex matrices (A,B) to generalized upper Hessenberg form using orthogonal/unitary transformations, where A is a general matrix and B is upper triangular. The form of the generalized eigenvalue problem is A*x = λ*B*x, and B is typically made upper triangular by computing its QR factorization and moving the orthogonal matrix Q to the left side of the equation.
This routine simultaneously reduces A to a Hessenberg matrix H:
Q^H*A*Z = H
and transforms B to another upper triangular matrix T:
Q^H*B*Z = T
in order to reduce the problem to its standard form H*y = λ*T*y, where y = Z^H*x.
The orthogonal/unitary matrices Q and Z are determined as products of Givens rotations. They may either be formed explicitly, or they may be postmultiplied into input matrices Q1 and Z1, so that
Q1*A*Z1^H = (Q1*Q)*H*(Z1*Z)^H
Q1*B*Z1^H = (Q1*Q)*T*(Z1*Z)^H
If Q1 is the orthogonal/unitary matrix from the QR factorization of B in the original equation A*x = λ*B*x, then the routine ?gghrd reduces the original problem to generalized Hessenberg form.
Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above.
See the C Interface Conventions section for the C interface principal conventions and type definitions.
compq CHARACTER*1. Must be 'N', 'I', or 'V'. If compq = 'N', matrix Q is not computed. If compq = 'I', Q is initialized to the unit matrix, and the orthogonal/unitary matrix Q is returned. If compq = 'V', Q must contain an orthogonal/unitary matrix Q1 on entry, and the product Q1*Q is returned.
compz CHARACTER*1. Must be 'N', 'I', or 'V'. If compz = 'N', matrix Z is not computed. If compz = 'I', Z is initialized to the unit matrix, and the orthogonal/unitary matrix Z is returned. If compz = 'V', Z must contain an orthogonal/unitary matrix Z1 on entry, and the product Z1*Z is returned.
n INTEGER. The order of the matrices A and B (n ≥ 0).
ilo, ihi INTEGER. ilo and ihi mark the rows and columns of A which are to be reduced. It is assumed that A is already upper triangular in rows and columns 1:ilo-1 and ihi+1:n. Values of ilo and ihi are normally set by a previous call to ggbal; otherwise they should be set to 1 and n respectively. Constraint: If n > 0, then 1 ≤ ilo ≤ ihi ≤ n; if n = 0, then ilo = 1 and ihi = 0.
a, b, q, z REAL for sgghrd, DOUBLE PRECISION for dgghrd, COMPLEX for cgghrd, DOUBLE COMPLEX for zgghrd. Arrays: a(lda,*) contains the n-by-n general matrix A. The second dimension of a must be at least max(1, n). b(ldb,*) contains the n-by-n upper triangular matrix B. The second dimension of b must be at least max(1, n). q(ldq,*): if compq = 'N', then q is not referenced. If compq = 'V', then q must contain the orthogonal/unitary matrix Q1, typically from the QR factorization of B. The second dimension of q must be at least max(1, n). z(ldz,*): if compz = 'N', then z is not referenced. If compz = 'V', then z must contain the orthogonal/unitary matrix Z1. The second dimension of z must be at least max(1, n).
lda INTEGER. The leading dimension of a; at least max(1, n).
ldb INTEGER.
The leading dimension of b; at least max(1, n).
ldq INTEGER. The leading dimension of q. If compq = 'N', then ldq ≥ 1. If compq = 'I' or 'V', then ldq ≥ max(1, n).
ldz INTEGER. The leading dimension of z. If compz = 'N', then ldz ≥ 1. If compz = 'I' or 'V', then ldz ≥ max(1, n).
Output Parameters
a On exit, the upper triangle and the first subdiagonal of A are overwritten with the upper Hessenberg matrix H, and the rest is set to zero.
b On exit, overwritten by the upper triangular matrix T = Q^H*B*Z. The elements below the diagonal are set to zero.
q If compq = 'I', then q contains the orthogonal/unitary matrix Q. If compq = 'V', then q is overwritten by the product Q1*Q.
z If compz = 'I', then z contains the orthogonal/unitary matrix Z. If compz = 'V', then z is overwritten by the product Z1*Z.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.
Fortran 95 Interface Notes
Routines in the Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions. Specific details for the routine gghrd interface are the following:
a Holds the matrix A of size (n,n).
b Holds the matrix B of size (n,n).
q Holds the matrix Q of size (n,n).
z Holds the matrix Z of size (n,n).
ilo Default value for this argument is ilo = 1.
ihi Default value for this argument is ihi = n.
compq If omitted, this argument is restored based on the presence of argument q as follows: compq = 'I', if q is present; compq = 'N', if q is omitted. If present, compq must be equal to 'I' or 'V' and the argument q must also be present. Note that there will be an error condition if compq is present and q is omitted.
compz If omitted, this argument is restored based on the presence of argument z as follows: compz = 'I', if z is present; compz = 'N', if z is omitted. If present, compz must be equal to 'I' or 'V' and the argument z must also be present. Note that there will be an error condition if compz is present and z is omitted.
?ggbal
Balances a pair of general real or complex matrices.
Syntax
Fortran 77:
call sggbal(job, n, a, lda, b, ldb, ilo, ihi, lscale, rscale, work, info)
call dggbal(job, n, a, lda, b, ldb, ilo, ihi, lscale, rscale, work, info)
call cggbal(job, n, a, lda, b, ldb, ilo, ihi, lscale, rscale, work, info)
call zggbal(job, n, a, lda, b, ldb, ilo, ihi, lscale, rscale, work, info)
Fortran 95:
call ggbal(a, b [,ilo] [,ihi] [,lscale] [,rscale] [,job] [,info])
C:
lapack_int LAPACKE_sggbal( int matrix_order, char job, lapack_int n, float* a, lapack_int lda, float* b, lapack_int ldb, lapack_int* ilo, lapack_int* ihi, float* lscale, float* rscale );
lapack_int LAPACKE_dggbal( int matrix_order, char job, lapack_int n, double* a, lapack_int lda, double* b, lapack_int ldb, lapack_int* ilo, lapack_int* ihi, double* lscale, double* rscale );
lapack_int LAPACKE_cggbal( int matrix_order, char job, lapack_int n, lapack_complex_float* a, lapack_int lda, lapack_complex_float* b, lapack_int ldb, lapack_int* ilo, lapack_int* ihi, float* lscale, float* rscale );
lapack_int LAPACKE_zggbal( int matrix_order, char job, lapack_int n, lapack_complex_double* a, lapack_int lda, lapack_complex_double* b, lapack_int ldb, lapack_int* ilo, lapack_int* ihi, double* lscale, double* rscale );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine balances a pair of general real/complex matrices (A,B).
This involves, first, permuting A and B by similarity transformations to isolate eigenvalues in the first 1 to ilo-1 and last ihi+1 to n elements on the diagonal; and second, applying a diagonal similarity transformation to rows and columns ilo to ihi to make the rows and columns as close in norm as possible. Both steps are optional. Balancing may reduce the 1-norm of the matrices, and improve the accuracy of the computed eigenvalues and/or eigenvectors in the generalized eigenvalue problem A*x = λ*B*x.
Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
job CHARACTER*1. Specifies the operations to be performed on A and B. Must be 'N' or 'P' or 'S' or 'B'. If job = 'N', then no operations are done; the routine simply sets ilo = 1, ihi = n, lscale(i) = 1.0 and rscale(i) = 1.0 for i = 1,..., n. If job = 'P', then permute only. If job = 'S', then scale only. If job = 'B', then both permute and scale.
n INTEGER. The order of the matrices A and B (n ≥ 0).
a, b REAL for sggbal, DOUBLE PRECISION for dggbal, COMPLEX for cggbal, DOUBLE COMPLEX for zggbal. Arrays: a(lda,*) contains the matrix A. The second dimension of a must be at least max(1, n). b(ldb,*) contains the matrix B. The second dimension of b must be at least max(1, n). If job = 'N', a and b are not referenced.
lda INTEGER. The leading dimension of a; at least max(1, n).
ldb INTEGER. The leading dimension of b; at least max(1, n).
work REAL for single-precision flavors, DOUBLE PRECISION for double-precision flavors. Workspace array, DIMENSION at least max(1, 6n) when job = 'S' or 'B', or at least 1 when job = 'N' or 'P'.
Output Parameters
a, b Overwritten by the balanced matrices A and B, respectively.
ilo, ihi INTEGER.
ilo and ihi are set to integers such that on exit a(i,j) = 0 and b(i,j) = 0 if i > j and j = 1,..., ilo-1 or i = ihi+1,..., n. If job = 'N' or 'S', then ilo = 1 and ihi = n.
lscale, rscale REAL for single-precision flavors, DOUBLE PRECISION for double-precision flavors. Arrays, DIMENSION at least max(1, n).
lscale contains details of the permutations and scaling factors applied to the left side of A and B. If Pj is the index of the row interchanged with row j, and Dj is the scaling factor applied to row j, then
lscale(j) = Pj, for j = 1,..., ilo-1
lscale(j) = Dj, for j = ilo,..., ihi
lscale(j) = Pj, for j = ihi+1,..., n.
rscale contains details of the permutations and scaling factors applied to the right side of A and B. If Pj is the index of the column interchanged with column j, and Dj is the scaling factor applied to column j, then
rscale(j) = Pj, for j = 1,..., ilo-1
rscale(j) = Dj, for j = ilo,..., ihi
rscale(j) = Pj, for j = ihi+1,..., n.
The order in which the interchanges are made is n to ihi+1, then 1 to ilo-1.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.
Fortran 95 Interface Notes
Routines in the Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions. Specific details for the routine ggbal interface are the following:
a Holds the matrix A of size (n,n).
b Holds the matrix B of size (n,n).
lscale Holds the vector of length (n).
rscale Holds the vector of length (n).
ilo Default value for this argument is ilo = 1.
ihi Default value for this argument is ihi = n.
job Must be 'B', 'S', 'P', or 'N'. The default value is 'B'.
?ggbak
Forms the right or left eigenvectors of a generalized eigenvalue problem.
Syntax

Fortran 77:
call sggbak(job, side, n, ilo, ihi, lscale, rscale, m, v, ldv, info)
call dggbak(job, side, n, ilo, ihi, lscale, rscale, m, v, ldv, info)
call cggbak(job, side, n, ilo, ihi, lscale, rscale, m, v, ldv, info)
call zggbak(job, side, n, ilo, ihi, lscale, rscale, m, v, ldv, info)

Fortran 95:
call ggbak(v [,ilo] [,ihi] [,lscale] [,rscale] [,job] [,info])

C:
lapack_int LAPACKE_sggbak( int matrix_order, char job, char side, lapack_int n, lapack_int ilo, lapack_int ihi, const float* lscale, const float* rscale, lapack_int m, float* v, lapack_int ldv );
lapack_int LAPACKE_dggbak( int matrix_order, char job, char side, lapack_int n, lapack_int ilo, lapack_int ihi, const double* lscale, const double* rscale, lapack_int m, double* v, lapack_int ldv );
lapack_int LAPACKE_cggbak( int matrix_order, char job, char side, lapack_int n, lapack_int ilo, lapack_int ihi, const float* lscale, const float* rscale, lapack_int m, lapack_complex_float* v, lapack_int ldv );
lapack_int LAPACKE_zggbak( int matrix_order, char job, char side, lapack_int n, lapack_int ilo, lapack_int ihi, const double* lscale, const double* rscale, lapack_int m, lapack_complex_double* v, lapack_int ldv );

Include Files

• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description

The routine forms the right or left eigenvectors of a real/complex generalized eigenvalue problem A*x = λ*B*x by backward transformation on the computed eigenvectors of the balanced pair of matrices output by ?ggbal.

Input Parameters

The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

job CHARACTER*1. Specifies the type of backward transformation required. Must be 'N', 'P', 'S', or 'B'.
If job = 'N', then no operations are done; return.
If job = 'P', then do backward transformation for permutation only.
If job = 'S', then do backward transformation for scaling only.
If job = 'B', then do backward transformation for both permutation and scaling.
This argument must be the same as the argument job supplied to ?ggbal.

side CHARACTER*1. Must be 'L' or 'R'.
If side = 'L', then v contains left eigenvectors.
If side = 'R', then v contains right eigenvectors.

n INTEGER. The number of rows of the matrix V (n ≥ 0).

ilo, ihi INTEGER. The integers ilo and ihi determined by ?ggbal.
Constraint: If n > 0, then 1 ≤ ilo ≤ ihi ≤ n; if n = 0, then ilo = 1 and ihi = 0.

lscale, rscale REAL for single precision flavors; DOUBLE PRECISION for double precision flavors.
Arrays, DIMENSION at least max(1, n).
The array lscale contains details of the permutations and/or scaling factors applied to the left side of A and B, as returned by ?ggbal.
The array rscale contains details of the permutations and/or scaling factors applied to the right side of A and B, as returned by ?ggbal.

m INTEGER. The number of columns of the matrix V (m ≥ 0).

v REAL for sggbak; DOUBLE PRECISION for dggbak; COMPLEX for cggbak; DOUBLE COMPLEX for zggbak.
Array v(ldv,*). Contains the matrix of right or left eigenvectors to be transformed, as returned by ?tgevc. The second dimension of v must be at least max(1, m).

ldv INTEGER. The leading dimension of v; at least max(1, n).

Output Parameters

v Overwritten by the transformed eigenvectors.

info INTEGER.
If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes

Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.

Specific details for the routine ggbak interface are the following:

v Holds the matrix V of size (n,m).
lscale Holds the vector of length n.
rscale Holds the vector of length n.
ilo Default value for this argument is ilo = 1.
ihi Default value for this argument is ihi = n.
job Must be 'B', 'S', 'P', or 'N'. The default value is 'B'.
side If omitted, this argument is restored based on the presence of arguments lscale and rscale as follows:
side = 'L', if lscale is present and rscale omitted,
side = 'R', if lscale is omitted and rscale present.
Note that there will be an error condition if both lscale and rscale are present or if they both are omitted.

?hgeqz

Implements the QZ method for finding the generalized eigenvalues of the matrix pair (H,T).

Syntax

Fortran 77:
call shgeqz(job, compq, compz, n, ilo, ihi, h, ldh, t, ldt, alphar, alphai, beta, q, ldq, z, ldz, work, lwork, info)
call dhgeqz(job, compq, compz, n, ilo, ihi, h, ldh, t, ldt, alphar, alphai, beta, q, ldq, z, ldz, work, lwork, info)
call chgeqz(job, compq, compz, n, ilo, ihi, h, ldh, t, ldt, alpha, beta, q, ldq, z, ldz, work, lwork, rwork, info)
call zhgeqz(job, compq, compz, n, ilo, ihi, h, ldh, t, ldt, alpha, beta, q, ldq, z, ldz, work, lwork, rwork, info)

Fortran 95:
call hgeqz(h, t [,ilo] [,ihi] [,alphar] [,alphai] [,beta] [,q] [,z] [,job] [,compq] [,compz] [,info])
call hgeqz(h, t [,ilo] [,ihi] [,alpha] [,beta] [,q] [,z] [,job] [,compq] [,compz] [,info])

C:
lapack_int LAPACKE_shgeqz( int matrix_order, char job, char compq, char compz, lapack_int n, lapack_int ilo, lapack_int ihi, float* h, lapack_int ldh, float* t, lapack_int ldt, float* alphar, float* alphai, float* beta, float* q, lapack_int ldq, float* z, lapack_int ldz );
lapack_int LAPACKE_dhgeqz( int matrix_order, char job, char compq, char compz, lapack_int n, lapack_int ilo, lapack_int ihi, double* h, lapack_int ldh, double* t, lapack_int ldt, double* alphar, double* alphai, double* beta, double* q, lapack_int ldq, double* z, lapack_int ldz );
lapack_int LAPACKE_chgeqz( int matrix_order, char
job, char compq, char compz, lapack_int n, lapack_int ilo, lapack_int ihi, lapack_complex_float* h, lapack_int ldh, lapack_complex_float* t, lapack_int ldt, lapack_complex_float* alpha, lapack_complex_float* beta, lapack_complex_float* q, lapack_int ldq, lapack_complex_float* z, lapack_int ldz );
lapack_int LAPACKE_zhgeqz( int matrix_order, char job, char compq, char compz, lapack_int n, lapack_int ilo, lapack_int ihi, lapack_complex_double* h, lapack_int ldh, lapack_complex_double* t, lapack_int ldt, lapack_complex_double* alpha, lapack_complex_double* beta, lapack_complex_double* q, lapack_int ldq, lapack_complex_double* z, lapack_int ldz );

Include Files

• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description

The routine computes the eigenvalues of a real/complex matrix pair (H,T), where H is an upper Hessenberg matrix and T is upper triangular, using the double-shift version (for real flavors) or single-shift version (for complex flavors) of the QZ method. Matrix pairs of this type are produced by the reduction to generalized upper Hessenberg form of a real/complex matrix pair (A,B):

A = Q1*H*Z1^H, B = Q1*T*Z1^H,

as computed by ?gghrd.

For real flavors:
If job = 'S', then the Hessenberg-triangular pair (H,T) is also reduced to generalized Schur form,

H = Q*S*Z^T, T = Q*P*Z^T,

where Q and Z are orthogonal matrices, P is an upper triangular matrix, and S is a quasi-triangular matrix with 1-by-1 and 2-by-2 diagonal blocks. The 1-by-1 blocks correspond to real eigenvalues of the matrix pair (H,T) and the 2-by-2 blocks correspond to complex conjugate pairs of eigenvalues.
Additionally, the 2-by-2 upper triangular diagonal blocks of P corresponding to 2-by-2 blocks of S are reduced to positive diagonal form, that is, if S(j+1,j) is non-zero, then P(j+1,j) = P(j,j+1) = 0, P(j,j) > 0, and P(j+1,j+1) > 0.
For complex flavors:
If job = 'S', then the Hessenberg-triangular pair (H,T) is also reduced to generalized Schur form,

H = Q*S*Z^H, T = Q*P*Z^H,

where Q and Z are unitary matrices, and S and P are upper triangular.

For all function flavors:
Optionally, the orthogonal/unitary matrix Q from the generalized Schur factorization may be postmultiplied into an input matrix Q1, and the orthogonal/unitary matrix Z may be postmultiplied into an input matrix Z1.
If Q1 and Z1 are the orthogonal/unitary matrices from ?gghrd that reduced the matrix pair (A,B) to generalized upper Hessenberg form, then the output matrices Q1*Q and Z1*Z are the orthogonal/unitary factors from the generalized Schur factorization of (A,B):

A = (Q1*Q)*S*(Z1*Z)^H, B = (Q1*Q)*P*(Z1*Z)^H.

To avoid overflow, eigenvalues of the matrix pair (H,T) (equivalently, of (A,B)) are computed as a pair of values (alpha,beta). For chgeqz/zhgeqz, alpha and beta are complex, and for shgeqz/dhgeqz, alpha is complex and beta real. If beta is nonzero, λ = alpha/beta is an eigenvalue of the generalized nonsymmetric eigenvalue problem (GNEP)

A*x = λ*B*x

and if alpha is nonzero, µ = beta/alpha is an eigenvalue of the alternate form of the GNEP

µ*A*y = B*y.

Real eigenvalues (for real flavors) or the values of alpha and beta for the i-th eigenvalue (for complex flavors) can be read directly from the generalized Schur form:

alpha = S(i,i), beta = P(i,i).

Input Parameters

The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

job CHARACTER*1. Specifies the operations to be performed. Must be 'E' or 'S'.
If job = 'E', then compute eigenvalues only;
If job = 'S', then compute eigenvalues and the Schur form.

compq CHARACTER*1. Must be 'N', 'I', or 'V'.
If compq = 'N', left Schur vectors (q) are not computed;
If compq = 'I', q is initialized to the unit matrix and the matrix of left Schur vectors of (H,T) is returned;
If compq = 'V', q must contain an orthogonal/unitary matrix Q1 on entry and the product Q1*Q is returned.

compz CHARACTER*1. Must be 'N', 'I', or 'V'.
If compz = 'N', right Schur vectors (z) are not computed;
If compz = 'I', z is initialized to the unit matrix and the matrix of right Schur vectors of (H,T) is returned;
If compz = 'V', z must contain an orthogonal/unitary matrix Z1 on entry and the product Z1*Z is returned.

n INTEGER. The order of the matrices H, T, Q, and Z (n ≥ 0).

ilo, ihi INTEGER. ilo and ihi mark the rows and columns of H which are in Hessenberg form. It is assumed that H is already upper triangular in rows and columns 1:ilo-1 and ihi+1:n.
Constraint: If n > 0, then 1 ≤ ilo ≤ ihi ≤ n; if n = 0, then ilo = 1 and ihi = 0.

h, t, q, z, work REAL for shgeqz; DOUBLE PRECISION for dhgeqz; COMPLEX for chgeqz; DOUBLE COMPLEX for zhgeqz.
Arrays:
On entry, h(ldh,*) contains the n-by-n upper Hessenberg matrix H. The second dimension of h must be at least max(1, n).
On entry, t(ldt,*) contains the n-by-n upper triangular matrix T. The second dimension of t must be at least max(1, n).
q(ldq,*): On entry, if compq = 'V', this array contains the orthogonal/unitary matrix Q1 used in the reduction of (A,B) to generalized Hessenberg form. If compq = 'N', then q is not referenced. The second dimension of q must be at least max(1, n).
z(ldz,*): On entry, if compz = 'V', this array contains the orthogonal/unitary matrix Z1 used in the reduction of (A,B) to generalized Hessenberg form. If compz = 'N', then z is not referenced. The second dimension of z must be at least max(1, n).
work is a workspace array, its dimension max(1, lwork).

ldh INTEGER. The leading dimension of h; at least max(1, n).

ldt INTEGER. The leading dimension of t; at least max(1, n).

ldq INTEGER.
The leading dimension of q;
If compq = 'N', then ldq ≥ 1.
If compq = 'I' or 'V', then ldq ≥ max(1, n).

ldz INTEGER. The leading dimension of z;
If compz = 'N', then ldz ≥ 1.
If compz = 'I' or 'V', then ldz ≥ max(1, n).

lwork INTEGER. The dimension of the array work. lwork ≥ max(1, n).
If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla. See Application Notes for details.

rwork REAL for chgeqz; DOUBLE PRECISION for zhgeqz.
Workspace array, DIMENSION at least max(1, n). Used in complex flavors only.

Output Parameters

h For real flavors:
If job = 'S', then on exit h contains the upper quasi-triangular matrix S from the generalized Schur factorization.
If job = 'E', then on exit the diagonal blocks of h match those of S, but the rest of h is unspecified.
For complex flavors:
If job = 'S', then, on exit, h contains the upper triangular matrix S from the generalized Schur factorization.
If job = 'E', then on exit the diagonal of h matches that of S, but the rest of h is unspecified.

t If job = 'S', then, on exit, t contains the upper triangular matrix P from the generalized Schur factorization.
For real flavors:
2-by-2 diagonal blocks of P corresponding to 2-by-2 blocks of S are reduced to positive diagonal form, that is, if h(j+1,j) is non-zero, then t(j+1,j) = t(j,j+1) = 0 and t(j,j) and t(j+1,j+1) will be positive.
If job = 'E', then on exit the diagonal blocks of t match those of P, but the rest of t is unspecified.
For complex flavors:
If job = 'E', then on exit the diagonal of t matches that of P, but the rest of t is unspecified.

alphar, alphai REAL for shgeqz; DOUBLE PRECISION for dhgeqz.
Arrays, DIMENSION at least max(1, n). The real and imaginary parts, respectively, of each scalar alpha defining an eigenvalue of GNEP.
If alphai(j) is zero, then the j-th eigenvalue is real; if positive, then the j-th and (j+1)-th eigenvalues are a complex conjugate pair, with alphai(j+1) = -alphai(j).

alpha COMPLEX for chgeqz; DOUBLE COMPLEX for zhgeqz.
Array, DIMENSION at least max(1, n). The complex scalars alpha that define the eigenvalues of GNEP. alpha(i) = S(i,i) in the generalized Schur factorization.

beta REAL for shgeqz; DOUBLE PRECISION for dhgeqz; COMPLEX for chgeqz; DOUBLE COMPLEX for zhgeqz.
Array, DIMENSION at least max(1, n).
For real flavors:
The scalars beta that define the eigenvalues of GNEP.
Together, the quantities alpha = (alphar(j), alphai(j)) and beta = beta(j) represent the j-th eigenvalue of the matrix pair (A,B), in one of the forms lambda = alpha/beta or mu = beta/alpha. Since either lambda or mu may overflow, they should not, in general, be computed.
For complex flavors:
The real non-negative scalars beta that define the eigenvalues of GNEP. beta(i) = P(i,i) in the generalized Schur factorization.
Together, the quantities alpha = alpha(j) and beta = beta(j) represent the j-th eigenvalue of the matrix pair (A,B), in one of the forms lambda = alpha/beta or mu = beta/alpha. Since either lambda or mu may overflow, they should not, in general, be computed.

q On exit, if compq = 'I', q is overwritten by the orthogonal/unitary matrix of left Schur vectors of the pair (H,T), and if compq = 'V', q is overwritten by the orthogonal/unitary matrix of left Schur vectors of (A,B).

z On exit, if compz = 'I', z is overwritten by the orthogonal/unitary matrix of right Schur vectors of the pair (H,T), and if compz = 'V', z is overwritten by the orthogonal/unitary matrix of right Schur vectors of (A,B).

work(1) If info = 0, on exit, work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs.

info INTEGER.
If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
If info = 1,..., n, the QZ iteration did not converge. (H,T) is not in Schur form, but alphar(i), alphai(i) (for real flavors), alpha(i) (for complex flavors), and beta(i), i = info+1,..., n should be correct.
If info = n+1,..., 2n, the shift calculation failed. (H,T) is not in Schur form, but alphar(i), alphai(i) (for real flavors), alpha(i) (for complex flavors), and beta(i), i = info-n+1,..., n should be correct.

Fortran 95 Interface Notes

Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.

Specific details for the routine hgeqz interface are the following:

h Holds the matrix H of size (n,n).
t Holds the matrix T of size (n,n).
alphar Holds the vector of length n. Used in real flavors only.
alphai Holds the vector of length n. Used in real flavors only.
alpha Holds the vector of length n. Used in complex flavors only.
beta Holds the vector of length n.
q Holds the matrix Q of size (n,n).
z Holds the matrix Z of size (n,n).
ilo Default value for this argument is ilo = 1.
ihi Default value for this argument is ihi = n.
job Must be 'E' or 'S'. The default value is 'E'.
compq If omitted, this argument is restored based on the presence of argument q as follows:
compq = 'I', if q is present,
compq = 'N', if q is omitted.
If present, compq must be equal to 'I' or 'V' and the argument q must also be present. Note that there will be an error condition if compq is present and q omitted.
compz If omitted, this argument is restored based on the presence of argument z as follows:
compz = 'I', if z is present,
compz = 'N', if z is omitted.
If present, compz must be equal to 'I' or 'V' and the argument z must also be present. Note that there will be an error condition if compz is present and z omitted.
Application Notes

If you are in doubt how much workspace to supply, use a generous value of lwork for the first run or set lwork = -1.

If you choose the first option and set lwork to any admissible size, that is, a size no less than the minimal value described, the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the corresponding array work on exit. Use this value (work(1)) for subsequent runs.

If you set lwork = -1, the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work). This operation is called a workspace query.

Note that if you set lwork to less than the minimal required value and not -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.

?tgevc

Computes some or all of the right and/or left generalized eigenvectors of a pair of upper triangular matrices.
Syntax

Fortran 77:
call stgevc(side, howmny, select, n, s, lds, p, ldp, vl, ldvl, vr, ldvr, mm, m, work, info)
call dtgevc(side, howmny, select, n, s, lds, p, ldp, vl, ldvl, vr, ldvr, mm, m, work, info)
call ctgevc(side, howmny, select, n, s, lds, p, ldp, vl, ldvl, vr, ldvr, mm, m, work, rwork, info)
call ztgevc(side, howmny, select, n, s, lds, p, ldp, vl, ldvl, vr, ldvr, mm, m, work, rwork, info)

Fortran 95:
call tgevc(s, p [,howmny] [,select] [,vl] [,vr] [,m] [,info])

C:
lapack_int LAPACKE_<?>tgevc( int matrix_order, char side, char howmny, const lapack_logical* select, lapack_int n, const <datatype>* s, lapack_int lds, const <datatype>* p, lapack_int ldp, <datatype>* vl, lapack_int ldvl, <datatype>* vr, lapack_int ldvr, lapack_int mm, lapack_int* m );

Include Files

• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description

The routine computes some or all of the right and/or left eigenvectors of a pair of real/complex matrices (S,P), where S is quasi-triangular (for real flavors) or upper triangular (for complex flavors) and P is upper triangular. Matrix pairs of this type are produced by the generalized Schur factorization of a real/complex matrix pair (A,B):

A = Q*S*Z^H, B = Q*P*Z^H

as computed by ?gghrd plus ?hgeqz.

The right eigenvector x and the left eigenvector y of (S,P) corresponding to an eigenvalue w are defined by:

S*x = w*P*x, y^H*S = w*y^H*P

The eigenvalues are not input to this routine, but are computed directly from the diagonal blocks or diagonal elements of S and P.

This routine returns the matrices X and/or Y of right and left eigenvectors of (S,P), or the products Z*X and/or Q*Y, where Z and Q are input matrices.

If Q and Z are the orthogonal/unitary factors from the generalized Schur factorization of a matrix pair (A,B), then Z*X and Q*Y are the matrices of right and left eigenvectors of (A,B).

Input Parameters

The data types are given for the Fortran interface.
A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

side CHARACTER*1. Must be 'R', 'L', or 'B'.
If side = 'R', compute right eigenvectors only.
If side = 'L', compute left eigenvectors only.
If side = 'B', compute both right and left eigenvectors.

howmny CHARACTER*1. Must be 'A', 'B', or 'S'.
If howmny = 'A', compute all right and/or left eigenvectors.
If howmny = 'B', compute all right and/or left eigenvectors, backtransformed by the matrices in vr and/or vl.
If howmny = 'S', compute selected right and/or left eigenvectors, specified by the logical array select.

select LOGICAL.
Array, DIMENSION at least max(1, n).
If howmny = 'S', select specifies the eigenvectors to be computed.
If howmny = 'A' or 'B', select is not referenced.
For real flavors:
If omega(j) is a real eigenvalue, the corresponding real eigenvector is computed if select(j) is .TRUE..
If omega(j) and omega(j+1) are the real and imaginary parts of a complex eigenvalue, the corresponding complex eigenvector is computed if either select(j) or select(j+1) is .TRUE., and on exit select(j) is set to .TRUE. and select(j+1) is set to .FALSE..
For complex flavors:
The eigenvector corresponding to the j-th eigenvalue is computed if select(j) is .TRUE..

n INTEGER. The order of the matrices S and P (n ≥ 0).

s, p, vl, vr, work REAL for stgevc; DOUBLE PRECISION for dtgevc; COMPLEX for ctgevc; DOUBLE COMPLEX for ztgevc.
Arrays:
s(lds,*) contains the matrix S from a generalized Schur factorization as computed by ?hgeqz. This matrix is upper quasi-triangular for real flavors, and upper triangular for complex flavors. The second dimension of s must be at least max(1, n).
p(ldp,*) contains the upper triangular matrix P from a generalized Schur factorization as computed by ?hgeqz.
For real flavors, 2-by-2 diagonal blocks of P corresponding to 2-by-2 blocks of S must be in positive diagonal form. For complex flavors, P must have real diagonal elements. The second dimension of p must be at least max(1, n).
If side = 'L' or 'B' and howmny = 'B', vl(ldvl,*) must contain an n-by-n matrix Q (usually the orthogonal/unitary matrix Q of left Schur vectors returned by ?hgeqz). The second dimension of vl must be at least max(1, mm). If side = 'R', vl is not referenced.
If side = 'R' or 'B' and howmny = 'B', vr(ldvr,*) must contain an n-by-n matrix Z (usually the orthogonal/unitary matrix Z of right Schur vectors returned by ?hgeqz). The second dimension of vr must be at least max(1, mm). If side = 'L', vr is not referenced.
work(*) is a workspace array. DIMENSION at least max(1, 6*n) for real flavors and at least max(1, 2*n) for complex flavors.

lds INTEGER. The leading dimension of s; at least max(1, n).

ldp INTEGER. The leading dimension of p; at least max(1, n).

ldvl INTEGER. The leading dimension of vl;
If side = 'L' or 'B', then ldvl ≥ n.
If side = 'R', then ldvl ≥ 1.

ldvr INTEGER. The leading dimension of vr;
If side = 'R' or 'B', then ldvr ≥ n.
If side = 'L', then ldvr ≥ 1.

mm INTEGER. The number of columns in the arrays vl and/or vr (mm ≥ m).

rwork REAL for ctgevc; DOUBLE PRECISION for ztgevc.
Workspace array, DIMENSION at least max(1, 2*n). Used in complex flavors only.

Output Parameters

vl On exit, if side = 'L' or 'B', vl contains:
if howmny = 'A', the matrix Y of left eigenvectors of (S,P);
if howmny = 'B', the matrix Q*Y;
if howmny = 'S', the left eigenvectors of (S,P) specified by select, stored consecutively in the columns of vl, in the same order as their eigenvalues.
For real flavors:
A complex eigenvector corresponding to a complex eigenvalue is stored in two consecutive columns, the first holding the real part, and the second the imaginary part.
vr On exit, if side = 'R' or 'B', vr contains:
if howmny = 'A', the matrix X of right eigenvectors of (S,P);
if howmny = 'B', the matrix Z*X;
if howmny = 'S', the right eigenvectors of (S,P) specified by select, stored consecutively in the columns of vr, in the same order as their eigenvalues.
For real flavors:
A complex eigenvector corresponding to a complex eigenvalue is stored in two consecutive columns, the first holding the real part, and the second the imaginary part.

m INTEGER. The number of columns in the arrays vl and/or vr actually used to store the eigenvectors.
If howmny = 'A' or 'B', m is set to n.
For real flavors:
Each selected real eigenvector occupies one column and each selected complex eigenvector occupies two columns.
For complex flavors:
Each selected eigenvector occupies one column.

info INTEGER.
If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
For real flavors:
if info = i > 0, the 2-by-2 block (i:i+1) does not have a complex eigenvalue.

Fortran 95 Interface Notes

Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.

Specific details for the routine tgevc interface are the following:

s Holds the matrix S of size (n,n).
p Holds the matrix P of size (n,n).
select Holds the vector of length n.
vl Holds the matrix VL of size (n,mm).
vr Holds the matrix VR of size (n,mm).
side Restored based on the presence of arguments vl and vr as follows:
side = 'B', if both vl and vr are present,
side = 'L', if vl is present and vr omitted,
side = 'R', if vl is omitted and vr present.
Note that there will be an error condition if both vl and vr are omitted.
howmny If omitted, this argument is restored based on the presence of argument select as follows:
howmny = 'S', if select is present,
howmny = 'A', if select is omitted.
If present, howmny must be equal to 'A' or 'B' and the argument select must be omitted. Note that there will be an error condition if both howmny and select are present.

?tgexc

Reorders the generalized Schur decomposition of a pair of matrices (A,B) so that one diagonal block of (A,B) moves to another row index.

Syntax

Fortran 77:
call stgexc(wantq, wantz, n, a, lda, b, ldb, q, ldq, z, ldz, ifst, ilst, work, lwork, info)
call dtgexc(wantq, wantz, n, a, lda, b, ldb, q, ldq, z, ldz, ifst, ilst, work, lwork, info)
call ctgexc(wantq, wantz, n, a, lda, b, ldb, q, ldq, z, ldz, ifst, ilst, info)
call ztgexc(wantq, wantz, n, a, lda, b, ldb, q, ldq, z, ldz, ifst, ilst, info)

Fortran 95:
call tgexc(a, b [,ifst] [,ilst] [,z] [,q] [,info])

C:
lapack_int LAPACKE_<?>tgexc( int matrix_order, lapack_logical wantq, lapack_logical wantz, lapack_int n, <datatype>* a, lapack_int lda, <datatype>* b, lapack_int ldb, <datatype>* q, lapack_int ldq, <datatype>* z, lapack_int ldz, lapack_int* ifst, lapack_int* ilst );

Include Files

• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description

The routine reorders the generalized real-Schur/Schur decomposition of a real/complex matrix pair (A,B) using an orthogonal/unitary equivalence transformation (A,B) = Q*(A,B)*Z^H, so that the diagonal block of (A, B) with row index ifst is moved to row ilst. Matrix pair (A, B) must be in a generalized real-Schur/Schur canonical form (as returned by gges), that is, A is block upper triangular with 1-by-1 and 2-by-2 diagonal blocks and B is upper triangular. Optionally, the matrices Q and Z of generalized Schur vectors are updated.

Q(in)*A(in)*Z(in)' = Q(out)*A(out)*Z(out)'
Q(in)*B(in)*Z(in)' = Q(out)*B(out)*Z(out)'.
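The two identities that close the description can be read as an invariance statement, where the prime denotes the (conjugate) transpose. As a sketch, assume the routine accumulates its block swaps into unitary matrices U and V (names introduced here for illustration only, not taken from the manual), so that Q(out) = Q(in)*U, Z(out) = Z(in)*V, and A(out) = U^H*A(in)*V. Under that assumption the product is unchanged, and the same argument applies to B:

```latex
\begin{aligned}
Q_{\mathrm{out}}\, A_{\mathrm{out}}\, Z_{\mathrm{out}}^{H}
  &= (Q_{\mathrm{in}} U)\,(U^{H} A_{\mathrm{in}} V)\,(Z_{\mathrm{in}} V)^{H} \\
  &= Q_{\mathrm{in}}\,(U U^{H})\,A_{\mathrm{in}}\,(V V^{H})\,Z_{\mathrm{in}}^{H} \\
  &= Q_{\mathrm{in}}\, A_{\mathrm{in}}\, Z_{\mathrm{in}}^{H}.
\end{aligned}
```

Because U and V are unitary (orthogonal in the real case, where H reduces to T), the generalized eigenvalues of the pair are preserved; only the order of the diagonal blocks changes.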
Input Parameters

The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

wantq, wantz LOGICAL.
If wantq = .TRUE., update the left transformation matrix Q;
If wantq = .FALSE., do not update Q;
If wantz = .TRUE., update the right transformation matrix Z;
If wantz = .FALSE., do not update Z.

n INTEGER. The order of the matrices A and B (n ≥ 0).

a, b, q, z REAL for stgexc; DOUBLE PRECISION for dtgexc; COMPLEX for ctgexc; DOUBLE COMPLEX for ztgexc.
Arrays:
a(lda,*) contains the matrix A. The second dimension of a must be at least max(1, n).
b(ldb,*) contains the matrix B. The second dimension of b must be at least max(1, n).
q(ldq,*)
If wantq = .FALSE., then q is not referenced.
If wantq = .TRUE., then q must contain the orthogonal/unitary matrix Q.
The second dimension of q must be at least max(1, n).
z(ldz,*)
If wantz = .FALSE., then z is not referenced.
If wantz = .TRUE., then z must contain the orthogonal/unitary matrix Z.
The second dimension of z must be at least max(1, n).

lda INTEGER. The leading dimension of a; at least max(1, n).

ldb INTEGER. The leading dimension of b; at least max(1, n).

ldq INTEGER. The leading dimension of q;
If wantq = .FALSE., then ldq ≥ 1.
If wantq = .TRUE., then ldq ≥ max(1, n).

ldz INTEGER. The leading dimension of z;
If wantz = .FALSE., then ldz ≥ 1.
If wantz = .TRUE., then ldz ≥ max(1, n).

ifst, ilst INTEGER. Specify the reordering of the diagonal blocks of (A, B). The block with row index ifst is moved to row ilst, by a sequence of swaps between adjacent blocks.
Constraint: 1 ≤ ifst, ilst ≤ n.

work REAL for stgexc; DOUBLE PRECISION for dtgexc.
Workspace array, DIMENSION (lwork). Used in real flavors only.

lwork INTEGER. The dimension of work; must be at least 4n + 16.
If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla. See Application Notes for details.

Output Parameters

a, b, q, z Overwritten by the updated matrices A, B, Q, and Z respectively.

ifst, ilst Overwritten for real flavors only.
If ifst pointed to the second row of a 2-by-2 block on entry, it is changed to point to the first row; ilst always points to the first row of the block in its final position (which may differ from its input value by ±1).

info INTEGER.
If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
If info = 1, the transformed matrix pair (A, B) would be too far from generalized Schur form; the problem is ill-conditioned. (A, B) may have been partially reordered, and ilst points to the first row of the current position of the block being moved.

Fortran 95 Interface Notes

Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.

Specific details for the routine tgexc interface are the following:

a Holds the matrix A of size (n,n).
b Holds the matrix B of size (n,n).
z Holds the matrix Z of size (n,n).
q Holds the matrix Q of size (n,n).
wantq Restored based on the presence of the argument q as follows:
wantq = .TRUE., if q is present,
wantq = .FALSE., if q is omitted.
wantz Restored based on the presence of the argument z as follows:
wantz = .TRUE., if z is present,
wantz = .FALSE., if z is omitted.

Application Notes

If it is not clear how much workspace to supply, use a generous value of lwork for the first run, or set lwork = -1.
In the first case the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the array work on exit. Use this value (work(1)) for subsequent runs.
If lwork = -1, then the routine returns immediately and provides the recommended workspace in the first element of the array work. This operation is called a workspace query.
Note that if lwork is less than the minimal required value and is not equal to -1, then the routine returns immediately with an error exit and does not provide any information on the recommended workspace.

?tgsen
Reorders the generalized Schur decomposition of a pair of matrices (A,B) so that a selected cluster of eigenvalues appears in the leading diagonal blocks of (A,B).
Syntax
Fortran 77:
call stgsen(ijob, wantq, wantz, select, n, a, lda, b, ldb, alphar, alphai, beta, q, ldq, z, ldz, m, pl, pr, dif, work, lwork, iwork, liwork, info)
call dtgsen(ijob, wantq, wantz, select, n, a, lda, b, ldb, alphar, alphai, beta, q, ldq, z, ldz, m, pl, pr, dif, work, lwork, iwork, liwork, info)
call ctgsen(ijob, wantq, wantz, select, n, a, lda, b, ldb, alpha, beta, q, ldq, z, ldz, m, pl, pr, dif, work, lwork, iwork, liwork, info)
call ztgsen(ijob, wantq, wantz, select, n, a, lda, b, ldb, alpha, beta, q, ldq, z, ldz, m, pl, pr, dif, work, lwork, iwork, liwork, info)
Fortran 95:
call tgsen(a, b, select [,alphar] [,alphai] [,beta] [,ijob] [,q] [,z] [,pl] [,pr] [,dif] [,m] [,info])
call tgsen(a, b, select [,alpha] [,beta] [,ijob] [,q] [,z] [,pl] [,pr] [,dif] [,m] [,info])
C:
lapack_int LAPACKE_stgsen( int matrix_order, lapack_int ijob, lapack_logical wantq, lapack_logical wantz, const lapack_logical* select, lapack_int n, float* a, lapack_int lda, float* b, lapack_int ldb, float* alphar, float* alphai, float* beta, float* q, lapack_int ldq, float* z, lapack_int ldz, lapack_int* m, float* pl,
float* pr, float* dif );
lapack_int LAPACKE_dtgsen( int matrix_order, lapack_int ijob, lapack_logical wantq, lapack_logical wantz, const lapack_logical* select, lapack_int n, double* a, lapack_int lda, double* b, lapack_int ldb, double* alphar, double* alphai, double* beta, double* q, lapack_int ldq, double* z, lapack_int ldz, lapack_int* m, double* pl, double* pr, double* dif );
lapack_int LAPACKE_ctgsen( int matrix_order, lapack_int ijob, lapack_logical wantq, lapack_logical wantz, const lapack_logical* select, lapack_int n, lapack_complex_float* a, lapack_int lda, lapack_complex_float* b, lapack_int ldb, lapack_complex_float* alpha, lapack_complex_float* beta, lapack_complex_float* q, lapack_int ldq, lapack_complex_float* z, lapack_int ldz, lapack_int* m, float* pl, float* pr, float* dif );
lapack_int LAPACKE_ztgsen( int matrix_order, lapack_int ijob, lapack_logical wantq, lapack_logical wantz, const lapack_logical* select, lapack_int n, lapack_complex_double* a, lapack_int lda, lapack_complex_double* b, lapack_int ldb, lapack_complex_double* alpha, lapack_complex_double* beta, lapack_complex_double* q, lapack_int ldq, lapack_complex_double* z, lapack_int ldz, lapack_int* m, double* pl, double* pr, double* dif );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine reorders the generalized real-Schur/Schur decomposition of a real/complex matrix pair (A, B) (in terms of an orthogonal/unitary equivalence transformation Q^T*(A,B)*Z for real flavors or Q^H*(A,B)*Z for complex flavors), so that a selected cluster of eigenvalues appears in the leading diagonal blocks of the pair (A, B). The leading columns of Q and Z form orthonormal/unitary bases of the corresponding left and right eigenspaces (deflating subspaces). (A, B) must be in generalized real-Schur/Schur canonical form (as returned by ?gges), that is, A and B are both upper triangular.
?tgsen also computes the generalized eigenvalues
omega(j) = (alphar(j) + alphai(j)*i)/beta(j) (for real flavors)
omega(j) = alpha(j)/beta(j) (for complex flavors)
of the reordered matrix pair (A, B).
Optionally, the routine computes the estimates of reciprocal condition numbers for eigenvalues and eigenspaces. These are Difu[(A11, B11), (A22, B22)] and Difl[(A11, B11), (A22, B22)], that is, the separation(s) between the matrix pairs (A11, B11) and (A22, B22) that correspond to the selected cluster and the eigenvalues outside the cluster, respectively, and norms of "projections" onto left and right eigenspaces with respect to the selected cluster in the (1,1)-block.
Input Parameters
The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
ijob INTEGER. Specifies whether condition numbers are required for the cluster of eigenvalues (pl and pr) or the deflating subspaces Difu and Difl.
If ijob = 0, only reorder with respect to select;
If ijob = 1, compute the reciprocals of norms of "projections" onto left and right eigenspaces with respect to the selected cluster (pl and pr);
If ijob = 2, compute upper bounds on Difu and Difl, using an F-norm-based estimate (dif(1:2));
If ijob = 3, compute estimates of Difu and Difl, using a 1-norm-based estimate (dif(1:2)). This option is about 5 times as expensive as ijob = 2;
If ijob = 4, compute pl, pr and dif (i.e., options 0, 1 and 2 above). This is an economic version to get it all;
If ijob = 5, compute pl, pr and dif (i.e., options 0, 1 and 3 above).
wantq, wantz LOGICAL. If wantq = .TRUE., update the left transformation matrix Q; if wantq = .FALSE., do not update Q. If wantz = .TRUE., update the right transformation matrix Z; if wantz = .FALSE., do not update Z.
select LOGICAL.
Array, DIMENSION at least max(1, n). Specifies the eigenvalues in the selected cluster. To select an eigenvalue omega(j), select(j) must be .TRUE. For real flavors: to select a complex conjugate pair of eigenvalues omega(j) and omega(j+1) (corresponding to a 2-by-2 diagonal block), select(j) and/or select(j+1) must be set to .TRUE.; the complex conjugate omega(j) and omega(j+1) must be either both included in the cluster or both excluded.
n INTEGER. The order of the matrices A and B (n ≥ 0).
a, b, q, z, work REAL for stgsen, DOUBLE PRECISION for dtgsen, COMPLEX for ctgsen, DOUBLE COMPLEX for ztgsen. Arrays:
a(lda,*) contains the matrix A. For real flavors: A is upper quasi-triangular, with (A, B) in generalized real Schur canonical form. For complex flavors: A is upper triangular, in generalized Schur canonical form. The second dimension of a must be at least max(1, n).
b(ldb,*) contains the matrix B. For real flavors: B is upper triangular, with (A, B) in generalized real Schur canonical form. For complex flavors: B is upper triangular, in generalized Schur canonical form. The second dimension of b must be at least max(1, n).
q(ldq,*) If wantq = .TRUE., then q is an n-by-n matrix; if wantq = .FALSE., then q is not referenced. The second dimension of q must be at least max(1, n).
z(ldz,*) If wantz = .TRUE., then z is an n-by-n matrix; if wantz = .FALSE., then z is not referenced. The second dimension of z must be at least max(1, n).
work is a workspace array, its dimension max(1, lwork).
lda INTEGER. The leading dimension of a; at least max(1, n).
ldb INTEGER. The leading dimension of b; at least max(1, n).
ldq INTEGER. The leading dimension of q; ldq ≥ 1. If wantq = .TRUE., then ldq ≥ max(1, n).
ldz INTEGER. The leading dimension of z; ldz ≥ 1. If wantz = .TRUE., then ldz ≥ max(1, n).
lwork INTEGER. The dimension of the array work. For real flavors: If ijob = 1, 2, or 4, lwork ≥ max(4n+16, 2m(n-m)).
If ijob = 3 or 5, lwork ≥ max(4n+16, 4m(n-m)). For complex flavors: If ijob = 1, 2, or 4, lwork ≥ max(1, 2m(n-m)). If ijob = 3 or 5, lwork ≥ max(1, 4m(n-m)). If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla. See Application Notes for details.
iwork INTEGER. Workspace array, its dimension max(1, liwork).
liwork INTEGER. The dimension of the array iwork. For real flavors: If ijob = 1, 2, or 4, liwork ≥ n+6. If ijob = 3 or 5, liwork ≥ max(n+6, 2m(n-m)). For complex flavors: If ijob = 1, 2, or 4, liwork ≥ n+2. If ijob = 3 or 5, liwork ≥ max(n+2, 2m(n-m)). If liwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the iwork array, returns this value as the first entry of the iwork array, and no error message related to liwork is issued by xerbla. See Application Notes for details.
Output Parameters
a, b Overwritten by the reordered matrices A and B, respectively.
alphar, alphai REAL for stgsen; DOUBLE PRECISION for dtgsen. Arrays, DIMENSION at least max(1, n). Contain values that form generalized eigenvalues in real flavors. See beta.
alpha COMPLEX for ctgsen; DOUBLE COMPLEX for ztgsen. Array, DIMENSION at least max(1, n). Contains values that form generalized eigenvalues in complex flavors. See beta.
beta REAL for stgsen, DOUBLE PRECISION for dtgsen, COMPLEX for ctgsen, DOUBLE COMPLEX for ztgsen. Array, DIMENSION at least max(1, n). For real flavors: On exit, (alphar(j) + alphai(j)*i)/beta(j), j=1,..., n, will be the generalized eigenvalues.
alphar(j) + alphai(j)*i and beta(j), j=1,..., n are the diagonals of the complex Schur form (S,T) that would result if the 2-by-2 diagonal blocks of the real generalized Schur form of (A,B) were further reduced to triangular form using complex unitary transformations. If alphai(j) is zero, then the j-th eigenvalue is real; if positive, then the j-th and (j+1)-st eigenvalues are a complex conjugate pair, with alphai(j+1) negative. For complex flavors: The diagonal elements of A and B, respectively, when the pair (A,B) has been reduced to generalized Schur form. alpha(i)/beta(i), i=1,..., n are the generalized eigenvalues.
q If wantq = .TRUE., then, on exit, Q has been postmultiplied by the left orthogonal transformation matrix that reorders (A, B). The leading m columns of Q form orthonormal bases for the specified pair of left eigenspaces (deflating subspaces).
z If wantz = .TRUE., then, on exit, Z has been postmultiplied by the right orthogonal transformation matrix that reorders (A, B). The leading m columns of Z form orthonormal bases for the specified pair of right eigenspaces (deflating subspaces).
m INTEGER. The dimension of the specified pair of left and right eigenspaces (deflating subspaces); 0 ≤ m ≤ n.
pl, pr REAL for single precision flavors; DOUBLE PRECISION for double precision flavors. If ijob = 1, 4, or 5, pl and pr are lower bounds on the reciprocal of the norm of "projections" onto left and right eigenspaces with respect to the selected cluster. 0 < pl, pr ≤ 1. If m = 0 or m = n, pl = pr = 1. If ijob = 0, 2 or 3, pl and pr are not referenced.
dif REAL for single precision flavors; DOUBLE PRECISION for double precision flavors. Array, DIMENSION (2). If ijob ≥ 2, dif(1:2) store the estimates of Difu and Difl. If ijob = 2 or 4, dif(1:2) are F-norm-based upper bounds on Difu and Difl. If ijob = 3 or 5, dif(1:2) are 1-norm-based estimates of Difu and Difl. If m = 0 or n, dif(1:2) = F-norm([A, B]).
If ijob = 0 or 1, dif is not referenced.
work(1) If ijob is not 0 and info = 0, on exit, work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs.
iwork(1) If ijob is not 0 and info = 0, on exit, iwork(1) contains the minimum value of liwork required for optimum performance. Use this liwork for subsequent runs.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value. If info = 1, reordering of (A, B) failed because the transformed matrix pair (A, B) would be too far from generalized Schur form; the problem is very ill-conditioned. (A, B) may have been partially reordered. If requested, 0 is returned in dif(*), pl and pr.
Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions. Specific details for the routine tgsen interface are the following:
a Holds the matrix A of size (n,n).
b Holds the matrix B of size (n,n).
select Holds the vector of length n.
alphar Holds the vector of length n. Used in real flavors only.
alphai Holds the vector of length n. Used in real flavors only.
alpha Holds the vector of length n. Used in complex flavors only.
beta Holds the vector of length n.
q Holds the matrix Q of size (n,n).
z Holds the matrix Z of size (n,n).
dif Holds the vector of length (2).
ijob Must be 0, 1, 2, 3, 4, or 5. The default value is 0.
wantq Restored based on the presence of the argument q as follows: wantq = .TRUE. if q is present, wantq = .FALSE. if q is omitted.
wantz Restored based on the presence of the argument z as follows: wantz = .TRUE. if z is present, wantz = .FALSE. if z is omitted.
Application Notes
If it is not clear how much workspace to supply, use a generous value of lwork (or liwork) for the first run or set lwork = -1 (liwork = -1).
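As an alternative to guessing a generous size, the minimal workspace lengths given for ?tgsen under lwork and liwork can be evaluated directly once the cluster dimension m is known (for example, by counting the selected eigenvalues). The following C sketch is illustrative only: the function names tgsen_min_lwork and tgsen_min_liwork are hypothetical helpers, not part of Intel MKL, and the values used for ijob = 0 (4n+16 or 1 for lwork, 1 for liwork) are an assumption taken from the reference LAPACK comments rather than from the text above.

```c
#include <assert.h>

/* Larger of two values. */
static long max_long(long a, long b) { return a > b ? a : b; }

/* Minimal lwork for ?tgsen according to the documented formulas.
 * is_complex: 0 for stgsen/dtgsen, nonzero for ctgsen/ztgsen.
 * m: dimension of the selected cluster, 0 <= m <= n. */
long tgsen_min_lwork(int is_complex, int ijob, long n, long m)
{
    assert(ijob >= 0 && ijob <= 5);
    assert(n >= 0 && m >= 0 && m <= n);

    /* Assumption: this floor also covers ijob = 0 (reference LAPACK). */
    long floor_size = is_complex ? 1 : 4 * n + 16;

    if (ijob == 1 || ijob == 2 || ijob == 4)
        return max_long(floor_size, 2 * m * (n - m));
    if (ijob == 3 || ijob == 5)
        return max_long(floor_size, 4 * m * (n - m));
    return floor_size;
}

/* Minimal liwork for ?tgsen according to the documented formulas. */
long tgsen_min_liwork(int is_complex, int ijob, long n, long m)
{
    assert(ijob >= 0 && ijob <= 5);
    assert(n >= 0 && m >= 0 && m <= n);

    long base = is_complex ? n + 2 : n + 6;

    if (ijob == 1 || ijob == 2 || ijob == 4)
        return base;
    if (ijob == 3 || ijob == 5)
        return max_long(base, 2 * m * (n - m));
    return 1; /* Assumption for ijob = 0 (reference LAPACK). */
}
```

For instance, tgsen_min_lwork(0, 3, 10, 5) evaluates max(4*10+16, 4*5*5). These are minimal sizes only; the recommended size for best performance is still the one returned by a workspace query.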
If lwork (or liwork) has any admissible size, that is, no less than the minimal value described, the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the corresponding array (work, iwork) on exit. Use this value (work(1), iwork(1)) for subsequent runs.
If lwork = -1 (liwork = -1), the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work, iwork). This operation is called a workspace query.
Note that if lwork (liwork) is less than the minimal required value and is not equal to -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.

?tgsyl
Solves the generalized Sylvester equation.
Syntax
Fortran 77:
call stgsyl(trans, ijob, m, n, a, lda, b, ldb, c, ldc, d, ldd, e, lde, f, ldf, scale, dif, work, lwork, iwork, info)
call dtgsyl(trans, ijob, m, n, a, lda, b, ldb, c, ldc, d, ldd, e, lde, f, ldf, scale, dif, work, lwork, iwork, info)
call ctgsyl(trans, ijob, m, n, a, lda, b, ldb, c, ldc, d, ldd, e, lde, f, ldf, scale, dif, work, lwork, iwork, info)
call ztgsyl(trans, ijob, m, n, a, lda, b, ldb, c, ldc, d, ldd, e, lde, f, ldf, scale, dif, work, lwork, iwork, info)
Fortran 95:
call tgsyl(a, b, c, d, e, f [,ijob] [,trans] [,scale] [,dif] [,info])
C:
lapack_int LAPACKE_stgsyl( int matrix_order, char trans, lapack_int ijob, lapack_int m, lapack_int n, const float* a, lapack_int lda, const float* b, lapack_int ldb, float* c, lapack_int ldc, const float* d, lapack_int ldd, const float* e, lapack_int lde, float* f, lapack_int ldf, float* scale, float* dif );
lapack_int LAPACKE_dtgsyl( int matrix_order, char trans, lapack_int ijob, lapack_int m, lapack_int n, const double* a, lapack_int lda, const double* b, lapack_int ldb, double* c, lapack_int ldc, const double* d, lapack_int ldd,
const double* e, lapack_int lde, double* f, lapack_int ldf, double* scale, double* dif );
lapack_int LAPACKE_ctgsyl( int matrix_order, char trans, lapack_int ijob, lapack_int m, lapack_int n, const lapack_complex_float* a, lapack_int lda, const lapack_complex_float* b, lapack_int ldb, lapack_complex_float* c, lapack_int ldc, const lapack_complex_float* d, lapack_int ldd, const lapack_complex_float* e, lapack_int lde, lapack_complex_float* f, lapack_int ldf, float* scale, float* dif );
lapack_int LAPACKE_ztgsyl( int matrix_order, char trans, lapack_int ijob, lapack_int m, lapack_int n, const lapack_complex_double* a, lapack_int lda, const lapack_complex_double* b, lapack_int ldb, lapack_complex_double* c, lapack_int ldc, const lapack_complex_double* d, lapack_int ldd, const lapack_complex_double* e, lapack_int lde, lapack_complex_double* f, lapack_int ldf, double* scale, double* dif );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine solves the generalized Sylvester equation:
A*R - L*B = scale*C
D*R - L*E = scale*F
where R and L are unknown m-by-n matrices, (A, D), (B, E) and (C, F) are given matrix pairs of size m-by-m, n-by-n and m-by-n, respectively, with real/complex entries. (A, D) and (B, E) must be in generalized real-Schur/Schur canonical form, that is, A, B are upper quasi-triangular/triangular and D, E are upper triangular.
The solution (R, L) overwrites (C, F). The factor scale, 0 ≤ scale ≤ 1, is an output scaling factor chosen to avoid overflow.
In matrix notation the above equation is equivalent to the following: solve Z*x = scale*b, where Z is defined as
Z = [ kron(In, A)  -kron(B', Im) ]
    [ kron(In, D)  -kron(E', Im) ]    (2)
Here Ik is the identity matrix of size k and X' is the transpose/conjugate-transpose of X. kron(X, Y) is the Kronecker product between the matrices X and Y.
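To make the matrix notation concrete, the sketch below assembles the 2mn-by-2mn matrix Z explicitly for real inputs, assuming the block structure used by the reference LAPACK documentation, Z = [ kron(In, A), -kron(B', Im) ; kron(In, D), -kron(E', Im) ], with x = [vec(R); vec(L)] stacked column by column. The function name build_sylvester_z and the IDX macro are hypothetical, all arrays are assumed to be tightly packed in column-major order, and this is not how ?tgsyl works internally: the routine never forms Z, so the code is useful only for checking small examples.

```c
#include <assert.h>
#include <stddef.h>

/* Column-major index of element (i, j) in an array with leading dimension ld. */
#define IDX(i, j, ld) ((size_t)(j) * (size_t)(ld) + (size_t)(i))

/* Assemble Z (2mn-by-2mn, column-major, leading dimension 2mn) from
 * a, d (m-by-m) and b, e (n-by-n), all column-major and tightly packed.
 * Real data only; complex flavors would need conjugated B', E'. */
void build_sylvester_z(int m, int n,
                       const double *a, const double *b,
                       const double *d, const double *e,
                       double *z)
{
    assert(m > 0 && n > 0);
    const int mn = m * n;
    const int ldz = 2 * mn;

    for (size_t k = 0; k < (size_t)ldz * (size_t)ldz; ++k)
        z[k] = 0.0;

    /* Left block column: kron(In, A) on top, kron(In, D) below.
     * Each is block diagonal with n copies of the m-by-m matrix. */
    for (int j = 0; j < n; ++j)
        for (int k = 0; k < m; ++k)
            for (int i = 0; i < m; ++i) {
                z[IDX(j * m + i, j * m + k, ldz)] = a[IDX(i, k, m)];
                z[IDX(mn + j * m + i, j * m + k, ldz)] = d[IDX(i, k, m)];
            }

    /* Right block column: -kron(B', Im) on top, -kron(E', Im) below.
     * Block (p, q) of kron(B', Im) is B(q, p) times the identity Im. */
    for (int p = 0; p < n; ++p)
        for (int q = 0; q < n; ++q)
            for (int i = 0; i < m; ++i) {
                z[IDX(p * m + i, mn + q * m + i, ldz)] = -b[IDX(q, p, n)];
                z[IDX(mn + p * m + i, mn + q * m + i, ldz)] = -e[IDX(q, p, n)];
            }
}
```

With m = n = 1 the result reduces to the 2-by-2 matrix [ A -B ; D -E ], which matches the scalar form of the two equations. The caller is expected to provide z with room for (2mn)*(2mn) elements.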
If trans = 'T' (for real flavors), or trans = 'C' (for complex flavors), the routine ?tgsyl solves the transposed/conjugate-transposed system Z'*y = scale*b, which is equivalent to solving for R and L in
A'*R + D'*L = scale*C
R*B' + L*E' = scale*(-F)
This case (trans = 'T' for stgsyl/dtgsyl or trans = 'C' for ctgsyl/ztgsyl) is used to compute a one-norm-based estimate of Dif[(A, D), (B, E)], the separation between the matrix pairs (A,D) and (B,E), using ?lacon.
If ijob ≥ 1, ?tgsyl computes a Frobenius norm-based estimate of Dif[(A, D), (B,E)], that is, the reciprocal of a lower bound on the reciprocal of the smallest singular value of Z. This is a level 3 BLAS algorithm.
Input Parameters
The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
trans CHARACTER*1. Must be 'N', 'T', or 'C'. If trans = 'N', solve the generalized Sylvester equation. If trans = 'T', solve the 'transposed' system (for real flavors only). If trans = 'C', solve the 'conjugate transposed' system (for complex flavors only).
ijob INTEGER. Specifies what kind of functionality is to be performed:
If ijob = 0, solve the generalized Sylvester equation only;
If ijob = 1, perform the functionality of ijob = 0 and ijob = 3;
If ijob = 2, perform the functionality of ijob = 0 and ijob = 4;
If ijob = 3, only an estimate of Dif[(A, D), (B, E)] is computed (look ahead strategy is used);
If ijob = 4, only an estimate of Dif[(A, D), (B,E)] is computed (?gecon on sub-systems is used).
If trans = 'T' or 'C', ijob is not referenced.
m INTEGER. The order of the matrices A and D, and the row dimension of the matrices C, F, R and L.
n INTEGER. The order of the matrices B and E, and the column dimension of the matrices C, F, R and L.
a, b, c, d, e, f, work REAL for stgsyl, DOUBLE PRECISION for dtgsyl, COMPLEX for ctgsyl, DOUBLE COMPLEX for ztgsyl. Arrays:
a(lda,*) contains the upper quasi-triangular (for real flavors) or upper triangular (for complex flavors) matrix A. The second dimension of a must be at least max(1, m).
b(ldb,*) contains the upper quasi-triangular (for real flavors) or upper triangular (for complex flavors) matrix B. The second dimension of b must be at least max(1, n).
c(ldc,*) contains the right-hand side of the first matrix equation in the generalized Sylvester equation (as defined by trans). The second dimension of c must be at least max(1, n).
d(ldd,*) contains the upper triangular matrix D. The second dimension of d must be at least max(1, m).
e(lde,*) contains the upper triangular matrix E. The second dimension of e must be at least max(1, n).
f(ldf,*) contains the right-hand side of the second matrix equation in the generalized Sylvester equation (as defined by trans). The second dimension of f must be at least max(1, n).
work is a workspace array, its dimension max(1, lwork).
lda INTEGER. The leading dimension of a; at least max(1, m).
ldb INTEGER. The leading dimension of b; at least max(1, n).
ldc INTEGER. The leading dimension of c; at least max(1, m).
ldd INTEGER. The leading dimension of d; at least max(1, m).
lde INTEGER. The leading dimension of e; at least max(1, n).
ldf INTEGER. The leading dimension of f; at least max(1, m).
lwork INTEGER. The dimension of the array work. lwork ≥ 1. If ijob = 1 or 2 and trans = 'N', lwork ≥ max(1, 2*m*n). If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla. See Application Notes for details.
iwork INTEGER. Workspace array, DIMENSION at least (m+n+6) for real flavors, and at least (m+n+2) for complex flavors.
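The workspace requirements listed for lwork and iwork above can be written as two small C helpers. This is a minimal sketch of the documented formulas only; the names tgsyl_min_lwork and tgsyl_iwork_len are hypothetical and do not exist in Intel MKL, and treating a lowercase 'n' like 'N' is an assumption based on the usual case-insensitive handling of character arguments in LAPACK.

```c
#include <assert.h>

/* Minimal lwork for ?tgsyl: at least 1, and at least 2*m*n when
 * ijob is 1 or 2 and the non-transposed system is solved. */
long tgsyl_min_lwork(char trans, int ijob, long m, long n)
{
    assert(ijob >= 0 && ijob <= 4);
    assert(m >= 0 && n >= 0);

    int no_trans = (trans == 'N' || trans == 'n'); /* case handling assumed */
    if (no_trans && (ijob == 1 || ijob == 2)) {
        long need = 2 * m * n;
        return need > 1 ? need : 1;
    }
    return 1;
}

/* Minimal length of iwork for ?tgsyl.
 * is_complex: 0 for stgsyl/dtgsyl, nonzero for ctgsyl/ztgsyl. */
long tgsyl_iwork_len(int is_complex, long m, long n)
{
    assert(m >= 0 && n >= 0);
    return m + n + (is_complex ? 2 : 6);
}
```

A caller of the FORTRAN 77 interface could allocate work and iwork with these lengths before the call; the LAPACKE prototypes shown in the Syntax section take no workspace arguments, so the helpers are not needed there.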
Output Parameters
c If ijob = 0, 1, or 2, overwritten by the solution R. If ijob = 3 or 4 and trans = 'N', c holds R, the solution achieved during the computation of the Dif-estimate.
f If ijob = 0, 1, or 2, overwritten by the solution L. If ijob = 3 or 4 and trans = 'N', f holds L, the solution achieved during the computation of the Dif-estimate.
dif REAL for single-precision flavors, DOUBLE PRECISION for double-precision flavors. On exit, dif is the reciprocal of a lower bound of the reciprocal of the Dif-function, that is, dif is an upper bound of Dif[(A, D), (B, E)] = sigma_min(Z), where Z is as in (2). If ijob = 0, or trans = 'T' (for real flavors), or trans = 'C' (for complex flavors), dif is not touched.
scale REAL for single-precision flavors, DOUBLE PRECISION for double-precision flavors. On exit, scale is the scaling factor in the generalized Sylvester equation. If 0 < scale < 1, c and f hold the solutions R and L, respectively, to a slightly perturbed system but the input matrices A, B, D and E have not been changed. If scale = 0, c and f hold the solutions R and L, respectively, to the homogeneous system with C = F = 0. Normally, scale = 1.
work(1) If info = 0, work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value. If info > 0, (A, D) and (B, E) have common or close eigenvalues.
Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions. Specific details for the routine tgsyl interface are the following:
a Holds the matrix A of size (m,m).
b Holds the matrix B of size (n,n).
c Holds the matrix C of size (m,n).
d Holds the matrix D of size (m,m).
e Holds the matrix E of size (n,n).
f Holds the matrix F of size (m,n).
ijob Must be 0, 1, 2, 3, or 4. The default value is 0.
trans Must be 'N' or 'T'. The default value is 'N'.
Application Notes
If it is not clear how much workspace to supply, use a generous value of lwork for the first run, or set lwork = -1.
In the first case the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the array work on exit. Use this value (work(1)) for subsequent runs.
If lwork = -1, then the routine returns immediately and provides the recommended workspace in the first element of the array work. This operation is called a workspace query.
Note that if lwork is less than the minimal required value and is not equal to -1, then the routine returns immediately with an error exit and does not provide any information on the recommended workspace.

?tgsna
Estimates reciprocal condition numbers for specified eigenvalues and/or eigenvectors of a pair of matrices in generalized real Schur canonical form.
Syntax
Fortran 77:
call stgsna(job, howmny, select, n, a, lda, b, ldb, vl, ldvl, vr, ldvr, s, dif, mm, m, work, lwork, iwork, info)
call dtgsna(job, howmny, select, n, a, lda, b, ldb, vl, ldvl, vr, ldvr, s, dif, mm, m, work, lwork, iwork, info)
call ctgsna(job, howmny, select, n, a, lda, b, ldb, vl, ldvl, vr, ldvr, s, dif, mm, m, work, lwork, iwork, info)
call ztgsna(job, howmny, select, n, a, lda, b, ldb, vl, ldvl, vr, ldvr, s, dif, mm, m, work, lwork, iwork, info)
Fortran 95:
call tgsna(a, b [,s] [,dif] [,vl] [,vr] [,select] [,m] [,info])
C:
lapack_int LAPACKE_stgsna( int matrix_order, char job, char howmny, const lapack_logical* select, lapack_int n, const float* a, lapack_int lda, const float* b, lapack_int ldb, const float* vl, lapack_int ldvl, const float* vr, lapack_int ldvr, float* s, float* dif, lapack_int mm, lapack_int* m );
lapack_int LAPACKE_dtgsna( int matrix_order, char job, char howmny, const lapack_logical* select, lapack_int n, const double* a, lapack_int lda, const double* b, lapack_int ldb, const double* vl, lapack_int ldvl, const double* vr, lapack_int ldvr, double* s, double* dif, lapack_int mm, lapack_int* m );
lapack_int LAPACKE_ctgsna( int matrix_order, char job, char howmny, const lapack_logical* select, lapack_int n, const lapack_complex_float* a, lapack_int lda, const lapack_complex_float* b, lapack_int ldb, const lapack_complex_float* vl, lapack_int ldvl, const lapack_complex_float* vr, lapack_int ldvr, float* s, float* dif, lapack_int mm, lapack_int* m );
lapack_int LAPACKE_ztgsna( int matrix_order, char job, char howmny, const lapack_logical* select, lapack_int n, const lapack_complex_double* a, lapack_int lda, const lapack_complex_double* b, lapack_int ldb, const lapack_complex_double* vl, lapack_int ldvl, const lapack_complex_double* vr, lapack_int ldvr, double* s, double* dif, lapack_int mm, lapack_int* m );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The real flavors stgsna/dtgsna of this routine estimate reciprocal condition numbers for specified eigenvalues and/or eigenvectors of a matrix pair (A, B) in generalized real Schur canonical form (or of any matrix pair (Q*A*Z^T, Q*B*Z^T) with orthogonal matrices Q and Z). (A, B) must be in generalized real Schur form (as returned by ?gges), that is, A is block upper triangular with 1-by-1 and 2-by-2 diagonal blocks. B is upper triangular.
The complex flavors ctgsna/ztgsna estimate reciprocal condition numbers for specified eigenvalues and/or eigenvectors of a matrix pair (A, B). (A, B) must be in generalized Schur canonical form, that is, A and B are both upper triangular.
Input Parameters
The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
job CHARACTER*1. Specifies whether condition numbers are required for eigenvalues or eigenvectors. Must be 'E' or 'V' or 'B'. If job = 'E', for eigenvalues only (compute s). If job = 'V', for eigenvectors only (compute dif). If job = 'B', for both eigenvalues and eigenvectors (compute both s and dif).
howmny CHARACTER*1. Must be 'A' or 'S'. If howmny = 'A', compute condition numbers for all eigenpairs. If howmny = 'S', compute condition numbers for selected eigenpairs specified by the logical array select.
select LOGICAL. Array, DIMENSION at least max(1, n). If howmny = 'S', select specifies the eigenpairs for which condition numbers are required. If howmny = 'A', select is not referenced. For real flavors: To select condition numbers for the eigenpair corresponding to a real eigenvalue omega(j), select(j) must be set to .TRUE.; to select condition numbers corresponding to a complex conjugate pair of eigenvalues omega(j) and omega(j+1), either select(j) or select(j+1) must be set to .TRUE.
For complex flavors: To select condition numbers for the corresponding j-th eigenvalue and/or eigenvector, select(j) must be set to .TRUE.
n INTEGER. The order of the square matrix pair (A, B) (n ≥ 0).
a, b, vl, vr, work REAL for stgsna, DOUBLE PRECISION for dtgsna, COMPLEX for ctgsna, DOUBLE COMPLEX for ztgsna. Arrays:
a(lda,*) contains the upper quasi-triangular (for real flavors) or upper triangular (for complex flavors) matrix A in the pair (A, B). The second dimension of a must be at least max(1, n).
b(ldb,*) contains the upper triangular matrix B in the pair (A, B). The second dimension of b must be at least max(1, n).
If job = 'E' or 'B', vl(ldvl,*) must contain left eigenvectors of (A, B), corresponding to the eigenpairs specified by howmny and select. The eigenvectors must be stored in consecutive columns of vl, as returned by ?tgevc. If job = 'V', vl is not referenced. The second dimension of vl must be at least max(1, m).
If job = 'E' or 'B', vr(ldvr,*) must contain right eigenvectors of (A, B), corresponding to the eigenpairs specified by howmny and select. The eigenvectors must be stored in consecutive columns of vr, as returned by ?tgevc. If job = 'V', vr is not referenced. The second dimension of vr must be at least max(1, m).
work is a workspace array, its dimension max(1, lwork). If job = 'E', work is not referenced.
lda INTEGER. The leading dimension of a; at least max(1, n).
ldb INTEGER. The leading dimension of b; at least max(1, n).
ldvl INTEGER. The leading dimension of vl; ldvl ≥ 1. If job = 'E' or 'B', then ldvl ≥ max(1, n).
ldvr INTEGER. The leading dimension of vr; ldvr ≥ 1. If job = 'E' or 'B', then ldvr ≥ max(1, n).
mm INTEGER. The number of elements in the arrays s and dif (mm ≥ m).
lwork INTEGER. The dimension of the array work. lwork ≥ max(1, n). If job = 'V' or 'B', lwork ≥ 2*n*(n+2)+16 for real flavors, and lwork ≥ max(1, 2*n*n) for complex flavors.
If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla. See Application Notes for details.
iwork INTEGER. Workspace array, DIMENSION at least (n+6) for real flavors, and at least (n+2) for complex flavors. If job = 'E', iwork is not referenced.
Output Parameters
s REAL for single-precision flavors, DOUBLE PRECISION for double-precision flavors. Array, DIMENSION (mm). If job = 'E' or 'B', contains the reciprocal condition numbers of the selected eigenvalues, stored in consecutive elements of the array. If job = 'V', s is not referenced. For real flavors: For a complex conjugate pair of eigenvalues two consecutive elements of s are set to the same value. Thus, s(j), dif(j), and the j-th columns of vl and vr all correspond to the same eigenpair (but not in general the j-th eigenpair, unless all eigenpairs are selected).
dif REAL for single-precision flavors, DOUBLE PRECISION for double-precision flavors. Array, DIMENSION (mm). If job = 'V' or 'B', contains the estimated reciprocal condition numbers of the selected eigenvectors, stored in consecutive elements of the array. If the eigenvalues cannot be reordered to compute dif(j), dif(j) is set to 0; this can only occur when the true value would be very small anyway. If job = 'E', dif is not referenced. For real flavors: For a complex eigenvector, two consecutive elements of dif are set to the same value. For complex flavors: For each eigenvalue/vector specified by select, dif stores a Frobenius norm-based estimate of Difl.
m INTEGER. The number of elements in the arrays s and dif used to store the specified condition numbers; for each selected eigenvalue one element is used. If howmny = 'A', m is set to n.
work(1) If job is not 'E' and info = 0, on exit, work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.
Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine tgsna interface are the following:
a Holds the matrix A of size (n,n).
b Holds the matrix B of size (n,n).
s Holds the vector of length (mm).
dif Holds the vector of length (mm).
vl Holds the matrix VL of size (n,mm).
vr Holds the matrix VR of size (n,mm).
select Holds the vector of length n.
howmny Restored based on the presence of the argument select as follows: howmny = 'S', if select is present; howmny = 'A', if select is omitted.
job Restored based on the presence of arguments s and dif as follows: job = 'B', if both s and dif are present; job = 'E', if s is present and dif omitted; job = 'V', if s is omitted and dif present. Note that there will be an error condition if both s and dif are omitted.
Application Notes
If it is not clear how much workspace to supply, use a generous value of lwork for the first run, or set lwork = -1.
In the first case the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the corresponding array work on exit. Use this value (work(1)) for subsequent runs.
If lwork = -1, then the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work). This operation is called a workspace query.
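The minimum workspace sizes quoted for ?tgsna in the lwork and iwork descriptions can be written down as a small helper. The following is only a sketch of that arithmetic in Python; it does not call Intel MKL, and the function names min_lwork_tgsna and min_iwork_tgsna are hypothetical, introduced here for illustration. A workspace query (lwork = -1) remains the way to obtain the size recommended for performance.

```python
def min_lwork_tgsna(job: str, n: int, complex_flavor: bool) -> int:
    """Smallest lwork that satisfies the documented lower bounds for ?tgsna.

    Hypothetical helper: it only evaluates the bounds from the parameter
    description and is not part of any MKL interface.
    """
    job = job.upper()
    if job not in ("E", "V", "B"):
        raise ValueError("job must be 'E', 'V', or 'B'")
    if n < 0:
        raise ValueError("n must be non-negative")
    if job == "E":
        # work is not referenced, but lwork >= max(1, n) still applies.
        return max(1, n)
    if complex_flavor:
        # ctgsna / ztgsna with job = 'V' or 'B'
        return max(1, 2 * n * n)
    # stgsna / dtgsna with job = 'V' or 'B'
    return 2 * n * (n + 2) + 16


def min_iwork_tgsna(n: int, complex_flavor: bool) -> int:
    """Documented minimum length of iwork: n+2 (complex) or n+6 (real)."""
    if n < 0:
        raise ValueError("n must be non-negative")
    return n + 2 if complex_flavor else n + 6


if __name__ == "__main__":
    for flavor, is_complex in (("real", False), ("complex", True)):
        print(flavor, "job='B', n=100:",
              min_lwork_tgsna("B", 100, is_complex),
              min_iwork_tgsna(100, is_complex))
```

Supplying less than these minimums (other than lwork = -1) triggers the error exit described in these notes, so a helper of this kind is useful mainly as a pre-call sanity check.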
Note that if lwork is less than the minimal required value and is not equal to -1, then the routine returns immediately with an error exit and does not provide any information on the recommended workspace.
Generalized Singular Value Decomposition
This section describes LAPACK computational routines used for finding the generalized singular value decomposition (GSVD) of two matrices A and B as
$$U^H A Q = D_1\,(0\ R), \qquad V^H B Q = D_2\,(0\ R),$$
where U, V, and Q are orthogonal/unitary matrices, R is a nonsingular upper triangular matrix, and D1, D2 are "diagonal" matrices of the structure detailed in the routines description section.
Table "Computational Routines for Generalized Singular Value Decomposition" lists LAPACK routines (FORTRAN 77 interface) that perform generalized singular value decomposition of matrices. Respective routine names in Fortran 95 interface are without the first symbol (see Routine Naming Conventions).
Computational Routines for Generalized Singular Value Decomposition
Routine name    Operation performed
ggsvp           Computes the preprocessing decomposition for the generalized SVD
tgsja           Computes the generalized SVD of two upper triangular or trapezoidal matrices
You can use routines listed in the above table as well as the driver routine ggsvd to find the GSVD of a pair of general rectangular matrices.
?ggsvp
Computes the preprocessing decomposition for the generalized SVD.
Syntax
Fortran 77:
call sggsvp(jobu, jobv, jobq, m, p, n, a, lda, b, ldb, tola, tolb, k, l, u, ldu, v, ldv, q, ldq, iwork, tau, work, info)
call dggsvp(jobu, jobv, jobq, m, p, n, a, lda, b, ldb, tola, tolb, k, l, u, ldu, v, ldv, q, ldq, iwork, tau, work, info)
call cggsvp(jobu, jobv, jobq, m, p, n, a, lda, b, ldb, tola, tolb, k, l, u, ldu, v, ldv, q, ldq, iwork, rwork, tau, work, info)
call zggsvp(jobu, jobv, jobq, m, p, n, a, lda, b, ldb, tola, tolb, k, l, u, ldu, v, ldv, q, ldq, iwork, rwork, tau, work, info)
Fortran 95:
call ggsvp(a, b, tola, tolb [, k] [,l] [,u] [,v] [,q] [,info])
C:
lapack_int LAPACKE_sggsvp( int matrix_order, char jobu, char jobv, char jobq, lapack_int m, lapack_int p, lapack_int n, float* a, lapack_int lda, float* b, lapack_int ldb, float tola, float tolb, lapack_int* k, lapack_int* l, float* u, lapack_int ldu, float* v, lapack_int ldv, float* q, lapack_int ldq );
lapack_int LAPACKE_dggsvp( int matrix_order, char jobu, char jobv, char jobq, lapack_int m, lapack_int p, lapack_int n, double* a, lapack_int lda, double* b, lapack_int ldb, double tola, double tolb, lapack_int* k, lapack_int* l, double* u, lapack_int ldu, double* v, lapack_int ldv, double* q, lapack_int ldq );
lapack_int LAPACKE_cggsvp( int matrix_order, char jobu, char jobv, char jobq, lapack_int m, lapack_int p, lapack_int n, lapack_complex_float* a, lapack_int lda, lapack_complex_float* b, lapack_int ldb, float tola, float tolb, lapack_int* k, lapack_int* l, lapack_complex_float* u, lapack_int ldu, lapack_complex_float* v, lapack_int ldv, lapack_complex_float* q, lapack_int ldq );
lapack_int LAPACKE_zggsvp( int matrix_order, char jobu, char jobv, char jobq, lapack_int m, lapack_int p, lapack_int n, lapack_complex_double* a, lapack_int lda, lapack_complex_double* b, lapack_int ldb, double tola, double tolb, lapack_int* k, lapack_int* l, lapack_complex_double* u, lapack_int ldu, lapack_complex_double* v, lapack_int ldv,
lapack_complex_double* q, lapack_int ldq );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine computes orthogonal/unitary matrices U, V and Q such that
$$U^H A Q = \begin{pmatrix} 0 & A_{12} & A_{13} \\ 0 & 0 & A_{23} \\ 0 & 0 & 0 \end{pmatrix} \ \text{if } m-k-l \ge 0, \qquad U^H A Q = \begin{pmatrix} 0 & A_{12} & A_{13} \\ 0 & 0 & A_{23} \end{pmatrix} \ \text{if } m-k-l < 0,$$
$$V^H B Q = \begin{pmatrix} 0 & 0 & B_{13} \\ 0 & 0 & 0 \end{pmatrix},$$
where the column blocks have widths n-k-l, k, and l; the row blocks of the first form have heights k, l, and m-k-l (k and m-k in the second form); and the row blocks of V^H*B*Q have heights l and p-l. The k-by-k matrix A12 and l-by-l matrix B13 are nonsingular upper triangular; A23 is l-by-l upper triangular if m-k-l ≥ 0, otherwise A23 is (m-k)-by-l upper trapezoidal.
The sum k+l is equal to the effective numerical rank of the (m+p)-by-n matrix (A^H, B^H)^H.
This decomposition is the preprocessing step for computing the Generalized Singular Value Decomposition (GSVD), see subroutine ?ggsvd.
Input Parameters
The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
jobu CHARACTER*1. Must be 'U' or 'N'. If jobu = 'U', orthogonal/unitary matrix U is computed. If jobu = 'N', U is not computed.
jobv CHARACTER*1. Must be 'V' or 'N'. If jobv = 'V', orthogonal/unitary matrix V is computed. If jobv = 'N', V is not computed.
jobq CHARACTER*1. Must be 'Q' or 'N'. If jobq = 'Q', orthogonal/unitary matrix Q is computed. If jobq = 'N', Q is not computed.
m INTEGER. The number of rows of the matrix A (m ≥ 0).
p INTEGER. The number of rows of the matrix B (p ≥ 0).
n INTEGER. The number of columns of the matrices A and B (n ≥ 0).
a, b, tau, work REAL for sggsvp; DOUBLE PRECISION for dggsvp; COMPLEX for cggsvp; DOUBLE COMPLEX for zggsvp.
Arrays:
a(lda,*) contains the m-by-n matrix A. The second dimension of a must be at least max(1, n).
b(ldb,*) contains the p-by-n matrix B. The second dimension of b must be at least max(1, n).
tau(*) is a workspace array. The dimension of tau must be at least max(1, n).
work(*) is a workspace array. The dimension of work must be at least max(1, 3n, m, p).
lda INTEGER. The leading dimension of a; at least max(1, m).
ldb INTEGER. The leading dimension of b; at least max(1, p).
tola, tolb REAL for single-precision flavors; DOUBLE PRECISION for double-precision flavors. tola and tolb are the thresholds to determine the effective numerical rank of matrix B and a subblock of A. Generally, they are set to tola = max(m, n)*||A||*MACHEPS, tolb = max(p, n)*||B||*MACHEPS. The size of tola and tolb may affect the size of backward errors of the decomposition.
ldu INTEGER. The leading dimension of the output array u. ldu ≥ max(1, m) if jobu = 'U'; ldu ≥ 1 otherwise.
ldv INTEGER. The leading dimension of the output array v. ldv ≥ max(1, p) if jobv = 'V'; ldv ≥ 1 otherwise.
ldq INTEGER. The leading dimension of the output array q. ldq ≥ max(1, n) if jobq = 'Q'; ldq ≥ 1 otherwise.
iwork INTEGER. Workspace array, DIMENSION at least max(1, n).
rwork REAL for cggsvp; DOUBLE PRECISION for zggsvp. Workspace array, DIMENSION at least max(1, 2n). Used in complex flavors only.
Output Parameters
a Overwritten by the triangular (or trapezoidal) matrix described in the Description section.
b Overwritten by the triangular matrix described in the Description section.
k, l INTEGER. On exit, k and l specify the dimension of subblocks. The sum k + l is equal to the effective numerical rank of (A^H, B^H)^H.
u, v, q REAL for sggsvp; DOUBLE PRECISION for dggsvp; COMPLEX for cggsvp; DOUBLE COMPLEX for zggsvp.
Arrays:
If jobu = 'U', u(ldu,*) contains the orthogonal/unitary matrix U. The second dimension of u must be at least max(1, m). If jobu = 'N', u is not referenced.
If jobv = 'V', v(ldv,*) contains the orthogonal/unitary matrix V. The second dimension of v must be at least max(1, p). If jobv = 'N', v is not referenced.
If jobq = 'Q', q(ldq,*) contains the orthogonal/unitary matrix Q. The second dimension of q must be at least max(1, n). If jobq = 'N', q is not referenced.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.
Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine ggsvp interface are the following:
a Holds the matrix A of size (m,n).
b Holds the matrix B of size (p,n).
u Holds the matrix U of size (m,m).
v Holds the matrix V of size (p,p).
q Holds the matrix Q of size (n,n).
jobu Restored based on the presence of the argument u as follows: jobu = 'U', if u is present; jobu = 'N', if u is omitted.
jobv Restored based on the presence of the argument v as follows: jobv = 'V', if v is present; jobv = 'N', if v is omitted.
jobq Restored based on the presence of the argument q as follows: jobq = 'Q', if q is present; jobq = 'N', if q is omitted.
?tgsja
Computes the generalized SVD of two upper triangular or trapezoidal matrices.
Syntax
Fortran 77:
call stgsja(jobu, jobv, jobq, m, p, n, k, l, a, lda, b, ldb, tola, tolb, alpha, beta, u, ldu, v, ldv, q, ldq, work, ncycle, info)
call dtgsja(jobu, jobv, jobq, m, p, n, k, l, a, lda, b, ldb, tola, tolb, alpha, beta, u, ldu, v, ldv, q, ldq, work, ncycle, info)
call ctgsja(jobu, jobv, jobq, m, p, n, k, l, a, lda, b, ldb, tola, tolb, alpha, beta, u, ldu, v, ldv, q, ldq, work, ncycle, info)
call ztgsja(jobu, jobv, jobq, m, p, n, k, l, a, lda, b, ldb, tola, tolb, alpha, beta, u, ldu, v, ldv, q, ldq, work, ncycle, info)
Fortran 95:
call tgsja(a, b, tola, tolb, k, l [,u] [,v] [,q] [,jobu] [,jobv] [,jobq] [,alpha] [,beta] [,ncycle] [,info])
C:
lapack_int LAPACKE_stgsja( int matrix_order, char jobu, char jobv, char jobq, lapack_int m, lapack_int p, lapack_int n, lapack_int k, lapack_int l, float* a, lapack_int lda, float* b, lapack_int ldb, float tola, float tolb, float* alpha, float* beta, float* u, lapack_int ldu, float* v, lapack_int ldv, float* q, lapack_int ldq, lapack_int* ncycle );
lapack_int LAPACKE_dtgsja( int matrix_order, char jobu, char jobv, char jobq, lapack_int m, lapack_int p, lapack_int n, lapack_int k, lapack_int l, double* a, lapack_int lda, double* b, lapack_int ldb, double tola, double tolb, double* alpha, double* beta, double* u, lapack_int ldu, double* v, lapack_int ldv, double* q, lapack_int ldq, lapack_int* ncycle );
lapack_int LAPACKE_ctgsja( int matrix_order, char jobu, char jobv, char jobq, lapack_int m, lapack_int p, lapack_int n, lapack_int k, lapack_int l, lapack_complex_float* a, lapack_int lda, lapack_complex_float* b, lapack_int ldb, float tola, float tolb, float* alpha, float* beta, lapack_complex_float* u, lapack_int ldu, lapack_complex_float* v, lapack_int ldv, lapack_complex_float* q, lapack_int ldq, lapack_int* ncycle );
lapack_int LAPACKE_ztgsja( int matrix_order, char jobu, char jobv, char jobq, lapack_int m, lapack_int p, lapack_int n, lapack_int k, lapack_int l, lapack_complex_double* a, lapack_int lda,
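lapack_complex_double* b, lapack_int ldb, double tola, double tolb, double* alpha, double* beta, lapack_complex_double* u, lapack_int ldu, lapack_complex_double* v, lapack_int ldv, lapack_complex_double* q, lapack_int ldq, lapack_int* ncycle );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine computes the generalized singular value decomposition (GSVD) of two real/complex upper triangular (or trapezoidal) matrices A and B. On entry, it is assumed that matrices A and B have the following forms, which may be obtained by the preprocessing subroutine ggsvp from a general m-by-n matrix A and p-by-n matrix B:
$$A = \begin{pmatrix} 0 & A_{12} & A_{13} \\ 0 & 0 & A_{23} \\ 0 & 0 & 0 \end{pmatrix} \ \text{if } m-k-l \ge 0, \qquad A = \begin{pmatrix} 0 & A_{12} & A_{13} \\ 0 & 0 & A_{23} \end{pmatrix} \ \text{if } m-k-l < 0, \qquad B = \begin{pmatrix} 0 & 0 & B_{13} \\ 0 & 0 & 0 \end{pmatrix},$$
where the column blocks have widths n-k-l, k, and l; the row blocks of A have heights k, l, and m-k-l (k and m-k in the second form); and the row blocks of B have heights l and p-l. The k-by-k matrix A12 and l-by-l matrix B13 are nonsingular upper triangular; A23 is l-by-l upper triangular if m-k-l ≥ 0, otherwise A23 is (m-k)-by-l upper trapezoidal.
On exit,
$$U^H A Q = D_1\,(0\ R), \qquad V^H B Q = D_2\,(0\ R),$$
where U, V and Q are orthogonal/unitary matrices, R is a nonsingular upper triangular matrix, and D1 and D2 are "diagonal" matrices, which are of the following structures:
If m-k-l ≥ 0,
$$D_1 = \begin{pmatrix} I & 0 \\ 0 & C \\ 0 & 0 \end{pmatrix}, \qquad D_2 = \begin{pmatrix} 0 & S \\ 0 & 0 \end{pmatrix}, \qquad (0\ R) = \begin{pmatrix} 0 & R_{11} & R_{12} \\ 0 & 0 & R_{22} \end{pmatrix},$$
where D1 has row blocks of heights k, l, m-k-l and column blocks of widths k, l; D2 has row blocks of heights l, p-l; (0 R) has row blocks of heights k, l and column blocks of widths n-k-l, k, l; and
C = diag(alpha(k+1),...,alpha(k+l))
S = diag(beta(k+1),...,beta(k+l))
C^2 + S^2 = I
R is stored in a(1:k+l, n-k-l+1:n) on exit.
If m-k-l < 0,
$$D_1 = \begin{pmatrix} I & 0 & 0 \\ 0 & C & 0 \end{pmatrix}, \qquad D_2 = \begin{pmatrix} 0 & S & 0 \\ 0 & 0 & I \\ 0 & 0 & 0 \end{pmatrix}, \qquad (0\ R) = \begin{pmatrix} 0 & R_{11} & R_{12} & R_{13} \\ 0 & 0 & R_{22} & R_{23} \\ 0 & 0 & 0 & R_{33} \end{pmatrix},$$
where D1 has row blocks of heights k, m-k and column blocks of widths k, m-k, k+l-m; D2 has row blocks of heights m-k, k+l-m, p-l; (0 R) has row blocks of heights k, m-k, k+l-m and column blocks of widths n-k-l, k, m-k, k+l-m; and
C = diag(alpha(k+1),...,alpha(m)),
S = diag(beta(k+1),...,beta(m)),
C^2 + S^2 = I
On exit, $\begin{pmatrix} R_{11} & R_{12} & R_{13} \\ 0 & R_{22} & R_{23} \end{pmatrix}$ is stored in a(1:m, n-k-l+1:n) and R33 is stored in b(m-k+1:l, n+m-k-l+1:n).
The computation of the orthogonal/unitary transformation matrices U, V or Q is optional. These matrices may either be formed explicitly, or they may be postmultiplied into input matrices U1, V1, or Q1.
Input Parameters
The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above.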
See the C Interface Conventions section for the C interface principal conventions and type definitions.
jobu CHARACTER*1. Must be 'U', 'I', or 'N'. If jobu = 'U', u must contain an orthogonal/unitary matrix U1 on entry. If jobu = 'I', u is initialized to the unit matrix. If jobu = 'N', u is not computed.
jobv CHARACTER*1. Must be 'V', 'I', or 'N'. If jobv = 'V', v must contain an orthogonal/unitary matrix V1 on entry. If jobv = 'I', v is initialized to the unit matrix. If jobv = 'N', v is not computed.
jobq CHARACTER*1. Must be 'Q', 'I', or 'N'. If jobq = 'Q', q must contain an orthogonal/unitary matrix Q1 on entry. If jobq = 'I', q is initialized to the unit matrix. If jobq = 'N', q is not computed.
m INTEGER. The number of rows of the matrix A (m ≥ 0).
p INTEGER. The number of rows of the matrix B (p ≥ 0).
n INTEGER. The number of columns of the matrices A and B (n ≥ 0).
k, l INTEGER. Specify the subblocks in the input matrices A and B, whose GSVD is computed.
a, b, u, v, q, work REAL for stgsja; DOUBLE PRECISION for dtgsja; COMPLEX for ctgsja; DOUBLE COMPLEX for ztgsja.
Arrays:
a(lda,*) contains the m-by-n matrix A. The second dimension of a must be at least max(1, n).
b(ldb,*) contains the p-by-n matrix B. The second dimension of b must be at least max(1, n).
If jobu = 'U', u(ldu,*) must contain a matrix U1 (usually the orthogonal/unitary matrix returned by ?ggsvp). The second dimension of u must be at least max(1, m).
If jobv = 'V', v(ldv,*) must contain a matrix V1 (usually the orthogonal/unitary matrix returned by ?ggsvp). The second dimension of v must be at least max(1, p).
If jobq = 'Q', q(ldq,*) must contain a matrix Q1 (usually the orthogonal/unitary matrix returned by ?ggsvp). The second dimension of q must be at least max(1, n).
work(*) is a workspace array. The dimension of work must be at least max(1, 2n).
lda INTEGER. The leading dimension of a; at least max(1, m).
ldb INTEGER. The leading dimension of b; at least max(1, p).
ldu INTEGER. The leading dimension of the array u. ldu ≥ max(1, m) if jobu = 'U'; ldu ≥ 1 otherwise.
ldv INTEGER. The leading dimension of the array v. ldv ≥ max(1, p) if jobv = 'V'; ldv ≥ 1 otherwise.
ldq INTEGER. The leading dimension of the array q. ldq ≥ max(1, n) if jobq = 'Q'; ldq ≥ 1 otherwise.
tola, tolb REAL for single-precision flavors; DOUBLE PRECISION for double-precision flavors. tola and tolb are the convergence criteria for the Jacobi-Kogbetliantz iteration procedure. Generally, they are the same as used in ?ggsvp: tola = max(m, n)*||A||*MACHEPS, tolb = max(p, n)*||B||*MACHEPS.
Output Parameters
a On exit, a(n-k+1:n, 1:min(k+l, m)) contains the triangular matrix R or part of R.
b On exit, if necessary, b(m-k+1:l, n+m-k-l+1:n) contains a part of R.
alpha, beta REAL for single-precision flavors; DOUBLE PRECISION for double-precision flavors. Arrays, DIMENSION at least max(1, n). Contain the generalized singular value pairs of A and B:
alpha(1:k) = 1, beta(1:k) = 0,
and if m-k-l ≥ 0,
alpha(k+1:k+l) = diag(C), beta(k+1:k+l) = diag(S),
or if m-k-l < 0,
alpha(k+1:m) = C, alpha(m+1:k+l) = 0,
beta(k+1:m) = S, beta(m+1:k+l) = 1.
Furthermore, if k+l < n, alpha(k+l+1:n) = 0 and beta(k+l+1:n) = 0.
u If jobu = 'I', u contains the orthogonal/unitary matrix U. If jobu = 'U', u contains the product U1*U. If jobu = 'N', u is not referenced.
v If jobv = 'I', v contains the orthogonal/unitary matrix V. If jobv = 'V', v contains the product V1*V. If jobv = 'N', v is not referenced.
q If jobq = 'I', q contains the orthogonal/unitary matrix Q. If jobq = 'Q', q contains the product Q1*Q. If jobq = 'N', q is not referenced.
ncycle INTEGER. The number of cycles required for convergence.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value. If info = 1, the procedure does not converge after MAXIT cycles.
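The layout of the alpha and beta arrays described above is easy to misread because it depends on the sign of m-k-l. The sketch below, in Python and not calling Intel MKL, only rebuilds that layout from given diagonals of C and S so the index ranges can be checked. The function name alpha_beta_layout and its argument order are hypothetical, and Python lists are 0-based, so alpha[j] here corresponds to alpha(j+1) in the Fortran description.

```python
def alpha_beta_layout(m, n, k, l, c, s):
    """Rebuild the alpha/beta layout documented for ?tgsja.

    c and s are the diagonals of C and S. Their length is l when
    m-k-l >= 0 and m-k otherwise. Hypothetical helper for illustration;
    it performs no decomposition.
    """
    if min(m, n, k, l) < 0 or k + l > n or k > m:
        raise ValueError("need m, n, k, l >= 0, k + l <= n and k <= m")
    ncs = l if m - k - l >= 0 else m - k
    if len(c) != ncs or len(s) != ncs:
        raise ValueError("c and s must each have %d entries" % ncs)

    # alpha(k+l+1:n) = 0 and beta(k+l+1:n) = 0 come from the initial fill.
    alpha = [0.0] * n
    beta = [0.0] * n

    # alpha(1:k) = 1, beta(1:k) = 0
    for j in range(k):
        alpha[j] = 1.0

    # alpha(k+1:k+ncs) = C, beta(k+1:k+ncs) = S
    for i in range(ncs):
        alpha[k + i] = c[i]
        beta[k + i] = s[i]

    # Only when m-k-l < 0: alpha(m+1:k+l) = 0, beta(m+1:k+l) = 1
    if m - k - l < 0:
        for j in range(m, k + l):
            beta[j] = 1.0

    return alpha, beta


if __name__ == "__main__":
    print(alpha_beta_layout(5, 4, 1, 2, [0.6, 0.28], [0.8, 0.96]))
    print(alpha_beta_layout(2, 4, 1, 2, [0.6], [0.8]))
```

In the second call m-k-l = -1, so only one (C, S) pair is stored and the next position carries the pair (0, 1).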
Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine tgsja interface are the following:
a Holds the matrix A of size (m,n).
b Holds the matrix B of size (p,n).
u Holds the matrix U of size (m,m).
v Holds the matrix V of size (p,p).
q Holds the matrix Q of size (n,n).
alpha Holds the vector of length n.
beta Holds the vector of length n.
jobu If omitted, this argument is restored based on the presence of argument u as follows: jobu = 'U', if u is present; jobu = 'N', if u is omitted. If present, jobu must be equal to 'I' or 'U' and the argument u must also be present. Note that there will be an error condition if jobu is present and u omitted.
jobv If omitted, this argument is restored based on the presence of argument v as follows: jobv = 'V', if v is present; jobv = 'N', if v is omitted. If present, jobv must be equal to 'I' or 'V' and the argument v must also be present. Note that there will be an error condition if jobv is present and v omitted.
jobq If omitted, this argument is restored based on the presence of argument q as follows: jobq = 'Q', if q is present; jobq = 'N', if q is omitted. If present, jobq must be equal to 'I' or 'Q' and the argument q must also be present. Note that there will be an error condition if jobq is present and q omitted.
Cosine-Sine Decomposition
This section describes LAPACK computational routines for computing the cosine-sine decomposition (CS decomposition) of a partitioned unitary/orthogonal matrix. The algorithm computes a complete 2-by-2 CS decomposition, which requires simultaneous diagonalization of all the four blocks of a unitary/orthogonal matrix partitioned into a 2-by-2 block structure.
The computation has the following phases:
1. The matrix is reduced to a bidiagonal block form.
2. The blocks are simultaneously diagonalized using techniques from the bidiagonal SVD algorithms.
Table "Computational Routines for Cosine-Sine Decomposition (CSD)" lists LAPACK routines (FORTRAN 77 interface) that perform CS decomposition of matrices. Respective routine names in Fortran 95 interface are without the first symbol (see Routine Naming Conventions).
Computational Routines for Cosine-Sine Decomposition (CSD)
Operation                                                                                  Real matrices    Complex matrices
Compute the CS decomposition of an orthogonal/unitary matrix in bidiagonal-block form      bbcsd/bbcsd      bbcsd/bbcsd
Simultaneously bidiagonalize the blocks of a partitioned orthogonal matrix                 orbdb            unbdb
Simultaneously bidiagonalize the blocks of a partitioned unitary matrix                    orbdb            unbdb
See Also
Cosine-Sine Decomposition
?bbcsd
Computes the CS decomposition of an orthogonal/unitary matrix in bidiagonal-block form.
Syntax
Fortran 77:
call sbbcsd( jobu1, jobu2, jobv1t, jobv2t, trans, m, p, q, theta, phi, u1, ldu1, u2, ldu2, v1t, ldv1t, v2t, ldv2t, b11d, b11e, b12d, b12e, b21d, b21e, b22d, b22e, work, lwork, info )
call dbbcsd( jobu1, jobu2, jobv1t, jobv2t, trans, m, p, q, theta, phi, u1, ldu1, u2, ldu2, v1t, ldv1t, v2t, ldv2t, b11d, b11e, b12d, b12e, b21d, b21e, b22d, b22e, work, lwork, info )
call cbbcsd( jobu1, jobu2, jobv1t, jobv2t, trans, m, p, q, theta, phi, u1, ldu1, u2, ldu2, v1t, ldv1t, v2t, ldv2t, b11d, b11e, b12d, b12e, b21d, b21e, b22d, b22e, rwork, rlwork, info )
call zbbcsd( jobu1, jobu2, jobv1t, jobv2t, trans, m, p, q, theta, phi, u1, ldu1, u2, ldu2, v1t, ldv1t, v2t, ldv2t, b11d, b11e, b12d, b12e, b21d, b21e, b22d, b22e, rwork, rlwork, info )
Fortran 95:
call bbcsd( theta,phi,u1,u2,v1t,v2t[,b11d][,b11e][,b12d][,b12e][,b21d][,b21e][,b22d][,b22e][,jobu1][,jobu2][,jobv1t][,jobv2t][,trans][,info] )
C:
lapack_int LAPACKE_sbbcsd( int matrix_order, char jobu1, char jobu2, char jobv1t, char jobv2t, char trans, lapack_int m, lapack_int p, lapack_int q,
float* theta, float* phi, float* u1, lapack_int ldu1, float* u2, lapack_int ldu2, float* v1t, lapack_int ldv1t, float* v2t, lapack_int ldv2t, float* b11d, float* b11e, float* b12d, float* b12e, float* b21d, float* b21e, float* b22d, float* b22e );
lapack_int LAPACKE_dbbcsd( int matrix_order, char jobu1, char jobu2, char jobv1t, char jobv2t, char trans, lapack_int m, lapack_int p, lapack_int q, double* theta, double* phi, double* u1, lapack_int ldu1, double* u2, lapack_int ldu2, double* v1t, lapack_int ldv1t, double* v2t, lapack_int ldv2t, double* b11d, double* b11e, double* b12d, double* b12e, double* b21d, double* b21e, double* b22d, double* b22e );
lapack_int LAPACKE_cbbcsd( int matrix_order, char jobu1, char jobu2, char jobv1t, char jobv2t, char trans, lapack_int m, lapack_int p, lapack_int q, float* theta, float* phi, lapack_complex_float* u1, lapack_int ldu1, lapack_complex_float* u2, lapack_int ldu2, lapack_complex_float* v1t, lapack_int ldv1t, lapack_complex_float* v2t, lapack_int ldv2t, float* b11d, float* b11e, float* b12d, float* b12e, float* b21d, float* b21e, float* b22d, float* b22e );
lapack_int LAPACKE_zbbcsd( int matrix_order, char jobu1, char jobu2, char jobv1t, char jobv2t, char trans, lapack_int m, lapack_int p, lapack_int q, double* theta, double* phi, lapack_complex_double* u1, lapack_int ldu1, lapack_complex_double* u2, lapack_int ldu2, lapack_complex_double* v1t, lapack_int ldv1t, lapack_complex_double* v2t, lapack_int ldv2t, double* b11d, double* b11e, double* b12d, double* b12e, double* b21d, double* b21e, double* b22d, double* b22e );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine ?bbcsd computes the CS decomposition of an orthogonal or unitary matrix in bidiagonal-block form:
$$X = \begin{pmatrix} B_{11} & B_{12} & 0 & 0 \\ 0 & 0 & -I & 0 \\ B_{21} & B_{22} & 0 & 0 \\ 0 & 0 & 0 & I \end{pmatrix} = \begin{pmatrix} U_1 & \\ & U_2 \end{pmatrix} \begin{pmatrix} C & -S & 0 & 0 \\ 0 & 0 & -I & 0 \\ S & C & 0 & 0 \\ 0 & 0 & 0 & I \end{pmatrix} \begin{pmatrix} V_1 & \\ & V_2 \end{pmatrix}^T$$
or, with the conjugate transpose $(\cdot)^H$ in place of $(\cdot)^T$, respectively.
x is m-by-m with the top-left block p-by-q. Note that q must not be larger than p, m-p, or m-q.
If q is not the smallest index, x must be transposed and/or permuted in constant time using the trans option. See ?orcsd/?uncsd for details.
The bidiagonal matrices b11, b12, b21, and b22 are represented implicitly by angles theta(1:q) and phi(1:q-1).
The orthogonal/unitary matrices u1, u2, v1t, and v2t are input/output. The input matrices are pre- or postmultiplied by the appropriate singular vector matrices.
Input Parameters
The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
jobu1 CHARACTER. If equal to 'Y', then u1 is updated. Otherwise, u1 is not updated.
jobu2 CHARACTER. If equal to 'Y', then u2 is updated. Otherwise, u2 is not updated.
jobv1t CHARACTER. If equal to 'Y', then v1t is updated. Otherwise, v1t is not updated.
jobv2t CHARACTER. If equal to 'Y', then v2t is updated. Otherwise, v2t is not updated.
trans CHARACTER.
= 'T': x, u1, u2, v1t, v2t are stored in row-major order.
otherwise: x, u1, u2, v1t, v2t are stored in column-major order.
m INTEGER. The number of rows and columns of the orthogonal/unitary matrix X in bidiagonal-block form.
p INTEGER. The number of rows in the top-left block of x. 0 ≤ p ≤ m.
q INTEGER. The number of columns in the top-left block of x. 0 ≤ q ≤ min(p,m-p,m-q).
theta REAL for sbbcsd and cbbcsd; DOUBLE PRECISION for dbbcsd and zbbcsd (the C prototypes above declare theta as float* or double* in all flavors). Array, DIMENSION (q). On entry, the angles theta(1), ..., theta(q) that, along with phi(1), ..., phi(q-1), define the matrix in bidiagonal-block form.
phi REAL for sbbcsd and cbbcsd; DOUBLE PRECISION for dbbcsd and zbbcsd. Array, DIMENSION (q-1). The angles phi(1), ..., phi(q-1) that, along with theta(1), ..., theta(q), define the matrix in bidiagonal-block form.
u1 REAL for sbbcsd; DOUBLE PRECISION for dbbcsd; COMPLEX for cbbcsd; DOUBLE COMPLEX for zbbcsd. Array, DIMENSION (ldu1,p). On entry, an ldu1-by-p matrix.
ldu1 INTEGER. The leading dimension of the array u1.
u2 REAL for sbbcsd; DOUBLE PRECISION for dbbcsd; COMPLEX for cbbcsd; DOUBLE COMPLEX for zbbcsd. Array, DIMENSION (ldu2,m-p). On entry, an ldu2-by-(m-p) matrix.
ldu2 INTEGER. The leading dimension of the array u2.
v1t REAL for sbbcsd; DOUBLE PRECISION for dbbcsd; COMPLEX for cbbcsd; DOUBLE COMPLEX for zbbcsd. Array, DIMENSION (ldv1t,q). On entry, an ldv1t-by-q matrix.
ldv1t INTEGER. The leading dimension of the array v1t.
v2t REAL for sbbcsd; DOUBLE PRECISION for dbbcsd; COMPLEX for cbbcsd; DOUBLE COMPLEX for zbbcsd. Array, DIMENSION (ldv2t,m-q). On entry, an ldv2t-by-(m-q) matrix.
ldv2t INTEGER. The leading dimension of the array v2t.
work (rwork in the complex flavors) REAL for sbbcsd and cbbcsd; DOUBLE PRECISION for dbbcsd and zbbcsd. Workspace array, DIMENSION (max(1,lwork)).
lwork (rlwork in the complex flavors) INTEGER. The size of the work array. lwork ≥ max(1,8*q). If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla.
Output Parameters
theta REAL for sbbcsd and cbbcsd; DOUBLE PRECISION for dbbcsd and zbbcsd. On exit, the angles whose cosines and sines define the diagonal blocks in the CS decomposition.
u1 REAL for sbbcsd; DOUBLE PRECISION for dbbcsd; COMPLEX for cbbcsd; DOUBLE COMPLEX for zbbcsd. On exit, u1 is postmultiplied by the left singular vector matrix common to [ b11 ; 0 ] and [ b12 0 0 ; 0 -I 0 ].
u2 REAL for sbbcsd; DOUBLE PRECISION for dbbcsd; COMPLEX for cbbcsd; DOUBLE COMPLEX for zbbcsd. On exit, u2 is postmultiplied by the left singular vector matrix common to [ b21 ; 0 ] and [ b22 0 0 ; 0 0 I ].
v1t REAL for sbbcsd; DOUBLE PRECISION for dbbcsd; COMPLEX for cbbcsd; DOUBLE COMPLEX for zbbcsd. Array, DIMENSION (ldv1t,q). On exit, v1t is premultiplied by the transpose of the right singular vector matrix common to [ b11 ; 0 ] and [ b21 ; 0 ].
v2t REAL for sbbcsd; DOUBLE PRECISION for dbbcsd; COMPLEX for cbbcsd; DOUBLE COMPLEX for zbbcsd. On exit, v2t is premultiplied by the transpose of the right singular vector matrix common to [ b12 0 0 ; 0 -I 0 ] and [ b22 0 0 ; 0 0 I ].
b11d REAL for sbbcsd and cbbcsd; DOUBLE PRECISION for dbbcsd and zbbcsd. Array, DIMENSION (q). When ?bbcsd converges, b11d contains the cosines of theta(1), ..., theta(q). If ?bbcsd fails to converge, b11d contains the diagonal of the partially reduced top left block.
b11e REAL for sbbcsd and cbbcsd; DOUBLE PRECISION for dbbcsd and zbbcsd. Array, DIMENSION (q-1). When ?bbcsd converges, b11e contains zeros. If ?bbcsd fails to converge, b11e contains the superdiagonal of the partially reduced top left block.
b12d REAL for sbbcsd and cbbcsd; DOUBLE PRECISION for dbbcsd and zbbcsd. Array, DIMENSION (q). When ?bbcsd converges, b12d contains the negative sines of theta(1), ..., theta(q). If ?bbcsd fails to converge, b12d contains the diagonal of the partially reduced top right block.
b12e REAL for sbbcsd and cbbcsd; DOUBLE PRECISION for dbbcsd and zbbcsd. Array, DIMENSION (q-1). When ?bbcsd converges, b12e contains zeros. If ?bbcsd fails to converge, b12e contains the superdiagonal of the partially reduced top right block.
info INTEGER.
= 0: successful exit.
< 0: if info = -i, the i-th argument has an illegal value.
> 0: if ?bbcsd did not converge, info specifies the number of nonzero entries in phi, and b11d, b11e, etc. contain the partially reduced matrix.
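The converged values of b11d, b11e, b12d, and b12e described above follow directly from the output angles theta. As a rough illustration only, the Python sketch below computes what those four arrays should contain for given angles; it does not call ?bbcsd or any MKL routine, and the function name converged_top_blocks is hypothetical.

```python
import math


def converged_top_blocks(theta):
    """Expected b11d, b11e, b12d, b12e after a converged ?bbcsd call.

    theta is the list of q output angles. Hypothetical helper: it only
    evaluates the relations stated in the parameter descriptions
    (b11d = cos(theta), b12d = -sin(theta), zero off-diagonals).
    """
    q = len(theta)
    b11d = [math.cos(t) for t in theta]
    b12d = [-math.sin(t) for t in theta]
    # The superdiagonals have q-1 entries and are zero on convergence.
    b11e = [0.0] * max(q - 1, 0)
    b12e = [0.0] * max(q - 1, 0)
    return b11d, b11e, b12d, b12e


if __name__ == "__main__":
    b11d, b11e, b12d, b12e = converged_top_blocks([0.1, 0.7, 1.3])
    for c, s in zip(b11d, b12d):
        # Each (cosine, negative sine) pair lies on the unit circle.
        print(c, s, math.isclose(c * c + s * s, 1.0))
```

Comparing the arrays returned by the routine against these relations is one inexpensive way to confirm convergence in addition to checking info = 0.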
Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine ?bbcsd interface are as follows:
theta Holds the vector of length q.
phi Holds the vector of length q-1.
u1 Holds the matrix of size (p,p).
u2 Holds the matrix of size (m-p,m-p).
v1t Holds the matrix of size (q,q).
v2t Holds the matrix of size (m-q,m-q).
b11d Holds the vector of length q.
b11e Holds the vector of length q-1.
b12d Holds the vector of length q.
b12e Holds the vector of length q-1.
b21d Holds the vector of length q.
b21e Holds the vector of length q-1.
b22d Holds the vector of length q.
b22e Holds the vector of length q-1.
jobu1 Indicates whether u1 is computed. Must be 'Y' or 'O'.
jobu2 Indicates whether u2 is computed. Must be 'Y' or 'O'.
jobv1t Indicates whether v1t is computed. Must be 'Y' or 'O'.
jobv2t Indicates whether v2t is computed. Must be 'Y' or 'O'.
trans Must be 'N' or 'T'.
See Also
?orcsd/?uncsd
xerbla
?orbdb/?unbdb
Simultaneously bidiagonalizes the blocks of a partitioned orthogonal/unitary matrix.
Syntax
Fortran 77:
call sorbdb( trans, signs, m, p, q, x11, ldx11, x12, ldx12, x21, ldx21, x22, ldx22, theta, phi, taup1, taup2, tauq1, tauq2, work, lwork, info )
call dorbdb( trans, signs, m, p, q, x11, ldx11, x12, ldx12, x21, ldx21, x22, ldx22, theta, phi, taup1, taup2, tauq1, tauq2, work, lwork, info )
call cunbdb( trans, signs, m, p, q, x11, ldx11, x12, ldx12, x21, ldx21, x22, ldx22, theta, phi, taup1, taup2, tauq1, tauq2, work, lwork, info )
call zunbdb( trans, signs, m, p, q, x11, ldx11, x12, ldx12, x21, ldx21, x22, ldx22, theta, phi, taup1, taup2, tauq1, tauq2, work, lwork, info )
Fortran 95:
call orbdb( x11,x12,x21,x22,theta,phi,taup1,taup2,tauq1,tauq2[,trans][,signs][,info] )
call unbdb( x11,x12,x21,x22,theta,phi,taup1,taup2,tauq1,tauq2[,trans][,signs][,info] )
C:
lapack_int LAPACKE_sorbdb( int matrix_order, char trans, char signs, lapack_int m, lapack_int p, lapack_int q, float* x11, lapack_int ldx11, float* x12, lapack_int ldx12, float* x21, lapack_int ldx21, float* x22, lapack_int ldx22, float* theta, float* phi, float* taup1, float* taup2, float* tauq1, float* tauq2 );
lapack_int LAPACKE_dorbdb( int matrix_order, char trans, char signs, lapack_int m, lapack_int p, lapack_int q, double* x11, lapack_int ldx11, double* x12, lapack_int ldx12, double* x21, lapack_int ldx21, double* x22, lapack_int ldx22, double* theta, double* phi, double* taup1, double* taup2, double* tauq1, double* tauq2 );
lapack_int LAPACKE_cunbdb( int matrix_order, char trans, char signs, lapack_int m, lapack_int p, lapack_int q, lapack_complex_float* x11, lapack_int ldx11, lapack_complex_float* x12, lapack_int ldx12, lapack_complex_float* x21, lapack_int ldx21, lapack_complex_float* x22, lapack_int ldx22, float* theta, float* phi, lapack_complex_float* taup1, lapack_complex_float* taup2, lapack_complex_float* tauq1, lapack_complex_float* tauq2 );
lapack_int LAPACKE_zunbdb( int matrix_order, char trans, char signs, lapack_int
m, lapack_int p, lapack_int q, lapack_complex_double* x11, lapack_int ldx11, lapack_complex_double* x12, lapack_int ldx12, lapack_complex_double* x21, lapack_int ldx21, lapack_complex_double* x22, lapack_int ldx22, double* theta, double* phi, lapack_complex_double* taup1, lapack_complex_double* taup2, lapack_complex_double* tauq1, lapack_complex_double* tauq2 );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routines ?orbdb/?unbdb simultaneously bidiagonalize the blocks of an m-by-m partitioned orthogonal matrix X (unitary in the complex case, with the conjugate transpose ^H in place of ^T):
X = [ x11 x12; x21 x22 ] = [ p1 0; 0 p2 ] * [ b11 b12 0 0; 0 0 -I 0; b21 b22 0 0; 0 0 0 I ] * [ q1 0; 0 q2 ]^T
x11 is p-by-q. q must not be larger than p, m-p, or m-q. Otherwise, X must be transposed and/or permuted, which can be done in constant time using the trans and signs options. See ?orcsd/?uncsd for details.
The orthogonal/unitary matrices p1, p2, q1, and q2 are p-by-p, (m-p)-by-(m-p), q-by-q, and (m-q)-by-(m-q), respectively. They are represented implicitly by Householder vectors.
The bidiagonal matrices b11, b12, b21, and b22 are q-by-q bidiagonal matrices represented implicitly by angles theta(1), ..., theta(q) and phi(1), ..., phi(q-1). b11 and b12 are upper bidiagonal, while b21 and b22 are lower bidiagonal. Every entry in each bidiagonal band is a product of a sine or cosine of theta with a sine or cosine of phi. See [Sutton09] or the description of ?orcsd/?uncsd for details.
p1, p2, q1, and q2 are represented as products of elementary reflectors. See the description of ?orcsd/?uncsd for details on generating p1, p2, q1, and q2 using ?orgqr and ?orglq.
Input Parameters
The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
trans CHARACTER = 'T': x, u1, u2, v1^T, v2^T are stored in row-major order.
otherwise x, u1, u2, v1^T, v2^T are stored in column-major order.
signs CHARACTER = 'O': The lower-left block is made nonpositive (the "other" convention). otherwise The upper-right block is made nonpositive (the "default" convention).
m INTEGER. The number of rows and columns of the matrix X.
p INTEGER. The number of rows in x11 and x12. 0 ≤ p ≤ m.
q INTEGER. The number of columns in x11 and x21. 0 ≤ q ≤ min(p, m-p, m-q).
x11 REAL for sorbdb DOUBLE PRECISION for dorbdb COMPLEX for cunbdb DOUBLE COMPLEX for zunbdb Array, DIMENSION (ldx11,q). On entry, the top-left block of the orthogonal/unitary matrix to be reduced.
ldx11 INTEGER. The leading dimension of the array x11. If trans = 'N', ldx11 ≥ p. Otherwise, ldx11 ≥ q.
x12 REAL for sorbdb DOUBLE PRECISION for dorbdb COMPLEX for cunbdb DOUBLE COMPLEX for zunbdb Array, DIMENSION (ldx12,m-q). On entry, the top-right block of the orthogonal/unitary matrix to be reduced.
ldx12 INTEGER. The leading dimension of the array x12. If trans = 'N', ldx12 ≥ p. Otherwise, ldx12 ≥ m-q.
x21 REAL for sorbdb DOUBLE PRECISION for dorbdb COMPLEX for cunbdb DOUBLE COMPLEX for zunbdb Array, DIMENSION (ldx21,q). On entry, the bottom-left block of the orthogonal/unitary matrix to be reduced.
ldx21 INTEGER. The leading dimension of the array x21. If trans = 'N', ldx21 ≥ m-p. Otherwise, ldx21 ≥ q.
x22 REAL for sorbdb DOUBLE PRECISION for dorbdb COMPLEX for cunbdb DOUBLE COMPLEX for zunbdb Array, DIMENSION (ldx22,m-q). On entry, the bottom-right block of the orthogonal/unitary matrix to be reduced.
ldx22 INTEGER. The leading dimension of the array x22. If trans = 'N', ldx22 ≥ m-p. Otherwise, ldx22 ≥ m-q.
work REAL for sorbdb DOUBLE PRECISION for dorbdb COMPLEX for cunbdb DOUBLE COMPLEX for zunbdb Workspace array, DIMENSION (lwork).
lwork INTEGER. The size of the work array.
lwork ≥ m-q. If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla.
Output Parameters
x11 On exit, the form depends on trans:
If trans = 'N', the columns of tril(x11) specify reflectors for p1, and the rows of triu(x11,1) specify reflectors for q1;
otherwise (trans = 'T'), the rows of triu(x11) specify reflectors for p1, and the columns of tril(x11,-1) specify reflectors for q1.
x12 On exit, the form depends on trans:
If trans = 'N', the rows of triu(x12) specify the first p reflectors for q2;
otherwise (trans = 'T'), the columns of tril(x12) specify the first p reflectors for q2.
x21 On exit, the form depends on trans:
If trans = 'N', the columns of tril(x21) specify the reflectors for p2;
otherwise (trans = 'T'), the rows of triu(x21) specify the reflectors for p2.
x22 On exit, the form depends on trans:
If trans = 'N', the rows of triu(x22(q+1:m-p,p+1:m-q)) specify the last m-p-q reflectors for q2;
otherwise (trans = 'T'), the columns of tril(x22(p+1:m-q,q+1:m-p)) specify the last m-p-q reflectors for p2.
theta REAL for sorbdb and cunbdb DOUBLE PRECISION for dorbdb and zunbdb Array, DIMENSION (q). The entries of bidiagonal blocks b11, b12, b21, and b22 can be computed from the angles theta and phi. See the Description section for details.
phi REAL for sorbdb and cunbdb DOUBLE PRECISION for dorbdb and zunbdb Array, DIMENSION (q-1). The entries of bidiagonal blocks b11, b12, b21, and b22 can be computed from the angles theta and phi. See the Description section for details.
taup1 REAL for sorbdb DOUBLE PRECISION for dorbdb COMPLEX for cunbdb DOUBLE COMPLEX for zunbdb Array, DIMENSION (p). Scalar factors of the elementary reflectors that define p1.
taup2 REAL for sorbdb DOUBLE PRECISION for dorbdb COMPLEX for cunbdb DOUBLE COMPLEX for zunbdb Array, DIMENSION (m-p). Scalar factors of the elementary reflectors that define p2.
tauq1 REAL for sorbdb DOUBLE PRECISION for dorbdb COMPLEX for cunbdb DOUBLE COMPLEX for zunbdb Array, DIMENSION (q). Scalar factors of the elementary reflectors that define q1.
tauq2 REAL for sorbdb DOUBLE PRECISION for dorbdb COMPLEX for cunbdb DOUBLE COMPLEX for zunbdb Array, DIMENSION (m-q). Scalar factors of the elementary reflectors that define q2.
info INTEGER. = 0: successful exit. < 0: if info = -i, the i-th argument has an illegal value.
Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions. Specific details for the routine ?orbdb/?unbdb interface are as follows:
x11 Holds the block of matrix X of size (p, q).
x12 Holds the block of matrix X of size (p, m-q).
x21 Holds the block of matrix X of size (m-p, q).
x22 Holds the block of matrix X of size (m-p, m-q).
theta Holds the vector of length q.
phi Holds the vector of length q-1.
taup1 Holds the vector of length p.
taup2 Holds the vector of length m-p.
tauq1 Holds the vector of length q.
tauq2 Holds the vector of length m-q.
trans Must be 'N' or 'T'.
signs Must be 'O' or 'D'.
See Also
?orcsd/?uncsd
?orgqr
?ungqr
?orglq
?unglq
xerbla
Driver Routines
Each of the LAPACK driver routines solves a complete problem. To arrive at the solution, driver routines typically call a sequence of appropriate computational routines.
Driver routines are described in the following sections:
Linear Least Squares (LLS) Problems
Generalized LLS Problems
Symmetric Eigenproblems
Nonsymmetric Eigenproblems
Singular Value Decomposition
Cosine-Sine Decomposition
Generalized Symmetric Definite Eigenproblems
Generalized Nonsymmetric Eigenproblems
Linear Least Squares (LLS) Problems
This section describes LAPACK driver routines used for solving linear least squares problems. Table "Driver Routines for Solving LLS Problems" lists all such routines for the FORTRAN 77 interface. Respective routine names in the Fortran 95 interface are without the first symbol (see Routine Naming Conventions).
Driver Routines for Solving LLS Problems
Routine Name    Operation performed
gels            Uses QR or LQ factorization to solve an overdetermined or underdetermined linear system with full rank matrix.
gelsy           Computes the minimum-norm solution to a linear least squares problem using a complete orthogonal factorization of A.
gelss           Computes the minimum-norm solution to a linear least squares problem using the singular value decomposition of A.
gelsd           Computes the minimum-norm solution to a linear least squares problem using the singular value decomposition of A and a divide and conquer method.
?gels
Uses QR or LQ factorization to solve an overdetermined or underdetermined linear system with full rank matrix.
Syntax
Fortran 77:
call sgels(trans, m, n, nrhs, a, lda, b, ldb, work, lwork, info)
call dgels(trans, m, n, nrhs, a, lda, b, ldb, work, lwork, info)
call cgels(trans, m, n, nrhs, a, lda, b, ldb, work, lwork, info)
call zgels(trans, m, n, nrhs, a, lda, b, ldb, work, lwork, info)
Fortran 95:
call gels(a, b [,trans] [,info])
C:
lapack_int LAPACKE_<t>gels( int matrix_order, char trans, lapack_int m, lapack_int n, lapack_int nrhs, <datatype>* a, lapack_int lda, <datatype>* b, lapack_int ldb );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine solves overdetermined or underdetermined real/complex linear systems involving an m-by-n matrix A, or its transpose/conjugate-transpose, using a QR or LQ factorization of A. It is assumed that A has full rank. The following options are provided:
1. If trans = 'N' and m ≥ n: find the least squares solution of an overdetermined system, that is, solve the least squares problem minimize ||b - A*x||_2
2. If trans = 'N' and m < n: find the minimum norm solution of an underdetermined system A*X = B.
3. If trans = 'T' or 'C' and m ≥ n: find the minimum norm solution of an underdetermined system A^H*X = B.
4. If trans = 'T' or 'C' and m < n: find the least squares solution of an overdetermined system, that is, solve the least squares problem minimize ||b - A^H*x||_2
Several right hand side vectors b and solution vectors x can be handled in a single call; they are stored as the columns of the m-by-nrhs right hand side matrix B and the n-by-nrhs solution matrix X.
Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
trans CHARACTER*1. Must be 'N', 'T', or 'C'.
If trans = 'N', the linear system involves matrix A; If trans = 'T', the linear system involves the transposed matrix A^T (for real flavors only); If trans = 'C', the linear system involves the conjugate-transposed matrix A^H (for complex flavors only).
m INTEGER. The number of rows of the matrix A (m ≥ 0).
n INTEGER. The number of columns of the matrix A (n ≥ 0).
nrhs INTEGER. The number of right-hand sides; the number of columns in B (nrhs ≥ 0).
a, b, work REAL for sgels DOUBLE PRECISION for dgels COMPLEX for cgels DOUBLE COMPLEX for zgels. Arrays:
a(lda,*) contains the m-by-n matrix A. The second dimension of a must be at least max(1, n).
b(ldb,*) contains the matrix B of right hand side vectors, stored columnwise; B is m-by-nrhs if trans = 'N', or n-by-nrhs if trans = 'T' or 'C'. The second dimension of b must be at least max(1, nrhs).
work is a workspace array, its dimension max(1, lwork).
lda INTEGER. The leading dimension of a; at least max(1, m).
ldb INTEGER. The leading dimension of b; must be at least max(1, m, n).
lwork INTEGER. The size of the work array; must be at least min(m, n)+max(1, m, n, nrhs). If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla. See Application Notes for the suggested value of lwork.
Output Parameters
a On exit, overwritten by the factorization data as follows: if m ≥ n, array a contains the details of the QR factorization of the matrix A as returned by ?geqrf; if m < n, array a contains the details of the LQ factorization of the matrix A as returned by ?gelqf.
b If info = 0, b is overwritten by the solution vectors, stored columnwise:
if trans = 'N' and m ≥ n, rows 1 to n of b contain the least squares solution vectors; the residual sum of squares for the solution in each column is given by the sum of squares of modulus of elements n+1 to m in that column;
if trans = 'N' and m < n, rows 1 to n of b contain the minimum norm solution vectors;
if trans = 'T' or 'C' and m ≥ n, rows 1 to m of b contain the minimum norm solution vectors;
if trans = 'T' or 'C' and m < n, rows 1 to m of b contain the least squares solution vectors; the residual sum of squares for the solution in each column is given by the sum of squares of modulus of elements m+1 to n in that column.
work(1) If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value. If info = i, the i-th diagonal element of the triangular factor of A is zero, so that A does not have full rank; the least squares solution could not be computed.
Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions. Specific details for the routine gels interface are the following:
a Holds the matrix A of size (m,n).
b Holds the matrix of size max(m,n)-by-nrhs. If trans = 'N', then, on entry, the size of b is m-by-nrhs. If trans = 'T', then, on entry, the size of b is n-by-nrhs.
trans Must be 'N' or 'T'. The default value is 'N'.
Application Notes
For better performance, try using lwork = min(m, n)+max(1, m, n, nrhs)*blocksize, where blocksize is a machine-dependent value (typically, 16 to 64) required for optimum performance of the blocked algorithm.
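The two ?gels workspace sizes stated in this section (the minimum in the lwork description and the suggested value in the Application Notes) can be sketched as small C helpers. This is only an illustration: the function names gels_min_lwork and gels_suggested_lwork are hypothetical and are not part of Intel MKL or LAPACKE, and the blocksize argument is whatever value you assume for your machine.

```c
#include <assert.h>

/* Largest of four integers (local helper, not a library routine). */
static int max4(int a, int b, int c, int d)
{
    int r = a;
    if (b > r) r = b;
    if (c > r) r = c;
    if (d > r) r = d;
    return r;
}

/* Hypothetical helper: minimum lwork accepted by ?gels,
   min(m, n) + max(1, m, n, nrhs), as stated for the lwork parameter. */
int gels_min_lwork(int m, int n, int nrhs)
{
    assert(m >= 0 && n >= 0 && nrhs >= 0);
    int mn = (m < n) ? m : n;
    return mn + max4(1, m, n, nrhs);
}

/* Hypothetical helper: suggested lwork from the Application Notes,
   min(m, n) + max(1, m, n, nrhs) * blocksize.
   blocksize is machine dependent (the text quotes 16 to 64). */
int gels_suggested_lwork(int m, int n, int nrhs, int blocksize)
{
    assert(m >= 0 && n >= 0 && nrhs >= 0 && blocksize >= 1);
    int mn = (m < n) ? m : n;
    return mn + max4(1, m, n, nrhs) * blocksize;
}
```

For a 5-by-3 system with 2 right-hand sides, the first helper gives 3 + 5 = 8 and the second, with an assumed blocksize of 32, gives 3 + 5*32 = 163. A workspace query (lwork = -1) remains the more reliable way to obtain the recommended size, since it asks the routine itself.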
If you are in doubt how much workspace to supply, use a generous value of lwork for the first run or set lwork = -1. If you choose the first option and set any of admissible lwork sizes, which is no less than the minimal value described, the routine completes the task, though probably not so fast as with a recommended workspace, and provides the recommended workspace in the first element of the corresponding array work on exit. Use this value (work(1)) for subsequent runs. If you set lwork = -1, the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work). This operation is called a workspace query. Note that if you set lwork to less than the minimal required value and not -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace. ?gelsy Computes the minimum-norm solution to a linear least squares problem using a complete orthogonal factorization of A. Syntax Fortran 77: call sgelsy(m, n, nrhs, a, lda, b, ldb, jpvt, rcond, rank, work, lwork, info) call dgelsy(m, n, nrhs, a, lda, b, ldb, jpvt, rcond, rank, work, lwork, info) call cgelsy(m, n, nrhs, a, lda, b, ldb, jpvt, rcond, rank, work, lwork, rwork, info) call zgelsy(m, n, nrhs, a, lda, b, ldb, jpvt, rcond, rank, work, lwork, rwork, info) Fortran 95: call gelsy(a, b [,rank] [,jpvt] [,rcond] [,info]) C: lapack_int LAPACKE_sgelsy( int matrix_order, lapack_int m, lapack_int n, lapack_int nrhs, float* a, lapack_int lda, float* b, lapack_int ldb, lapack_int* jpvt, float rcond, lapack_int* rank ); lapack_int LAPACKE_dgelsy( int matrix_order, lapack_int m, lapack_int n, lapack_int nrhs, double* a, lapack_int lda, double* b, lapack_int ldb, lapack_int* jpvt, double rcond, lapack_int* rank ); lapack_int LAPACKE_cgelsy( int matrix_order, lapack_int m, lapack_int n, lapack_int nrhs, lapack_complex_float* a, lapack_int lda, lapack_complex_float* b, lapack_int ldb, lapack_int* jpvt, float rcond, 
lapack_int* rank );
lapack_int LAPACKE_zgelsy( int matrix_order, lapack_int m, lapack_int n, lapack_int nrhs, lapack_complex_double* a, lapack_int lda, lapack_complex_double* b, lapack_int ldb, lapack_int* jpvt, double rcond, lapack_int* rank );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The ?gelsy routine computes the minimum-norm solution to a real/complex linear least squares problem:
minimize ||b - A*x||_2
using a complete orthogonal factorization of A. A is an m-by-n matrix which may be rank-deficient. Several right hand side vectors b and solution vectors x can be handled in a single call; they are stored as the columns of the m-by-nrhs right hand side matrix B and the n-by-nrhs solution matrix X.
The routine first computes a QR factorization with column pivoting:
A*P = Q*[ R11 R12; 0 R22 ]
with R11 defined as the largest leading submatrix whose estimated condition number is less than 1/rcond. The order of R11, rank, is the effective rank of A. Then, R22 is considered to be negligible, and R12 is annihilated by orthogonal/unitary transformations from the right, arriving at the complete orthogonal factorization:
A*P = Q*[ T11 0; 0 0 ]*Z
The minimum-norm solution is then
X = P*Z^T*[ inv(T11)*Q1^T*B; 0 ] for real flavors and
X = P*Z^H*[ inv(T11)*Q1^H*B; 0 ] for complex flavors,
where Q1 consists of the first rank columns of Q.
The ?gelsy routine is identical to the original deprecated ?gelsx routine except for the following differences:
• The call to the subroutine ?geqpf has been substituted by the call to the subroutine ?geqp3, which is a BLAS-3 version of the QR factorization with column pivoting.
• The matrix B (the right hand side) is updated with BLAS-3.
• The permutation of the matrix B (the right hand side) is faster and simpler.
Input Parameters
The data types are given for the Fortran interface.
A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
m INTEGER. The number of rows of the matrix A (m ≥ 0).
n INTEGER. The number of columns of the matrix A (n ≥ 0).
nrhs INTEGER. The number of right-hand sides; the number of columns in B (nrhs ≥ 0).
a, b, work REAL for sgelsy DOUBLE PRECISION for dgelsy COMPLEX for cgelsy DOUBLE COMPLEX for zgelsy. Arrays:
a(lda,*) contains the m-by-n matrix A. The second dimension of a must be at least max(1, n).
b(ldb,*) contains the m-by-nrhs right hand side matrix B. The second dimension of b must be at least max(1, nrhs).
work is a workspace array, its dimension max(1, lwork).
lda INTEGER. The leading dimension of a; at least max(1, m).
ldb INTEGER. The leading dimension of b; must be at least max(1, m, n).
jpvt INTEGER. Array, DIMENSION at least max(1, n). On entry, if jpvt(i) ≠ 0, the i-th column of A is permuted to the front of AP, otherwise the i-th column of A is a free column.
rcond REAL for single-precision flavors DOUBLE PRECISION for double-precision flavors. rcond is used to determine the effective rank of A, which is defined as the order of the largest leading triangular submatrix R11 in the QR factorization with pivoting of A, whose estimated condition number < 1/rcond.
lwork INTEGER. The size of the work array. If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla. See Application Notes for the suggested value of lwork.
rwork REAL for cgelsy DOUBLE PRECISION for zgelsy. Workspace array, DIMENSION at least max(1, 2*n). Used in complex flavors only.
Output Parameters
a On exit, overwritten by the details of the complete orthogonal factorization of A.
b Overwritten by the n-by-nrhs solution matrix X.
jpvt On exit, if jpvt(i) = k, then the i-th column of AP was the k-th column of A.
rank INTEGER. The effective rank of A, that is, the order of the submatrix R11. This is the same as the order of the submatrix T11 in the complete orthogonal factorization of A.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.
Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions. Specific details for the routine gelsy interface are the following:
a Holds the matrix A of size (m,n).
b Holds the matrix of size max(m,n)-by-nrhs. On entry, contains the m-by-nrhs right hand side matrix B. On exit, overwritten by the n-by-nrhs solution matrix X.
jpvt Holds the vector of length n. Default value for this element is jpvt(i) = 0.
rcond Default value for this element is rcond = 100*EPSILON(1.0_WP).
Application Notes
For real flavors:
The unblocked strategy requires that: lwork ≥ max( mn+3n+1, 2*mn + nrhs ), where mn = min( m, n ).
The block algorithm requires that: lwork ≥ max( mn+2n+nb*(n+1), 2*mn+nb*nrhs ), where nb is an upper bound on the blocksize returned by ilaenv for the routines sgeqp3/dgeqp3, stzrzf/dtzrzf, stzrqf/dtzrqf, sormqr/dormqr, and sormrz/dormrz.
For complex flavors:
The unblocked strategy requires that: lwork ≥ mn + max( 2*mn, n+1, mn + nrhs ), where mn = min( m, n ).
The block algorithm requires that: lwork ≥ mn + max( 2*mn, nb*(n+1), mn+mn*nb, mn+nb*nrhs ), where nb is an upper bound on the blocksize returned by ilaenv for the routines cgeqp3/zgeqp3, ctzrzf/ztzrzf, ctzrqf/ztzrqf, cunmqr/zunmqr, and cunmrz/zunmrz.
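The four ?gelsy workspace bounds above are easy to mistype, so the following C sketch evaluates them directly. The function names are hypothetical (they are not MKL or LAPACKE routines), and nb is simply the caller's assumed upper bound on the block size; the sketch does not call ilaenv.

```c
#include <assert.h>

/* Local integer helpers, not library routines. */
static int imin(int a, int b) { return a < b ? a : b; }
static int imax(int a, int b) { return a > b ? a : b; }

/* Real flavors, unblocked: max( mn+3n+1, 2*mn+nrhs ), mn = min(m, n). */
int gelsy_real_unblocked_lwork(int m, int n, int nrhs)
{
    assert(m >= 0 && n >= 0 && nrhs >= 0);
    int mn = imin(m, n);
    return imax(mn + 3 * n + 1, 2 * mn + nrhs);
}

/* Real flavors, blocked: max( mn+2n+nb*(n+1), 2*mn+nb*nrhs ). */
int gelsy_real_blocked_lwork(int m, int n, int nrhs, int nb)
{
    assert(m >= 0 && n >= 0 && nrhs >= 0 && nb >= 1);
    int mn = imin(m, n);
    return imax(mn + 2 * n + nb * (n + 1), 2 * mn + nb * nrhs);
}

/* Complex flavors, unblocked: mn + max( 2*mn, n+1, mn+nrhs ). */
int gelsy_complex_unblocked_lwork(int m, int n, int nrhs)
{
    assert(m >= 0 && n >= 0 && nrhs >= 0);
    int mn = imin(m, n);
    return mn + imax(imax(2 * mn, n + 1), mn + nrhs);
}

/* Complex flavors, blocked:
   mn + max( 2*mn, nb*(n+1), mn+mn*nb, mn+nb*nrhs ). */
int gelsy_complex_blocked_lwork(int m, int n, int nrhs, int nb)
{
    assert(m >= 0 && n >= 0 && nrhs >= 0 && nb >= 1);
    int mn = imin(m, n);
    int a = imax(2 * mn, nb * (n + 1));
    int b = imax(mn + mn * nb, mn + nb * nrhs);
    return mn + imax(a, b);
}
```

As a worked instance, m = 6, n = 4, nrhs = 2 gives mn = 4, so the real unblocked bound is max(17, 10) = 17 and, with an assumed nb = 32, the real blocked bound is max(172, 72) = 172.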
If you are in doubt how much workspace to supply, use a generous value of lwork for the first run or set lwork = -1. If you choose the first option and set any of admissible lwork sizes, which is no less than the minimal value described, the routine completes the task, though probably not so fast as with a recommended workspace, and provides the recommended workspace in the first element of the corresponding array work on exit. Use this value (work(1)) for subsequent runs. If you set lwork = -1, the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work). This operation is called a workspace query. Note that if you set lwork to less than the minimal required value and not -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.
?gelss
Computes the minimum-norm solution to a linear least squares problem using the singular value decomposition of A.
Syntax Fortran 77: call sgelss(m, n, nrhs, a, lda, b, ldb, s, rcond, rank, work, lwork, info) call dgelss(m, n, nrhs, a, lda, b, ldb, s, rcond, rank, work, lwork, info) call cgelss(m, n, nrhs, a, lda, b, ldb, s, rcond, rank, work, lwork, rwork, info) call zgelss(m, n, nrhs, a, lda, b, ldb, s, rcond, rank, work, lwork, rwork, info) Fortran 95: call gelss(a, b [,rank] [,s] [,rcond] [,info]) C: lapack_int LAPACKE_sgelss( int matrix_order, lapack_int m, lapack_int n, lapack_int nrhs, float* a, lapack_int lda, float* b, lapack_int ldb, float* s, float rcond, lapack_int* rank ); lapack_int LAPACKE_dgelss( int matrix_order, lapack_int m, lapack_int n, lapack_int nrhs, double* a, lapack_int lda, double* b, lapack_int ldb, double* s, double rcond, lapack_int* rank ); lapack_int LAPACKE_cgelss( int matrix_order, lapack_int m, lapack_int n, lapack_int nrhs, lapack_complex_float* a, lapack_int lda, lapack_complex_float* b, lapack_int ldb, float* s, float rcond, lapack_int* rank ); lapack_int LAPACKE_zgelss( int matrix_order, lapack_int m, lapack_int n, lapack_int nrhs, lapack_complex_double* a, lapack_int lda, lapack_complex_double* b, lapack_int ldb, double* s, double rcond, lapack_int* rank ); Include Files • Fortran: mkl_lapack.fi and mkl_lapack.h • Fortran 95: lapack.f90 • C: mkl_lapacke.h Description The routine computes the minimum norm solution to a real linear least squares problem: minimize ||b - A*x||2 using the singular value decomposition (SVD) of A. A is an m-by-n matrix which may be rank-deficient. Several right hand side vectors b and solution vectors x can be handled in a single call; they are stored as the columns of the m-by-nrhs right hand side matrix B and the n-by-nrhs solution matrix X. The effective rank of A is determined by treating as zero those singular values which are less than rcond times the largest singular value. Input Parameters The data types are given for the Fortran interface. 
A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
m INTEGER. The number of rows of the matrix A (m ≥ 0).
n INTEGER. The number of columns of the matrix A (n ≥ 0).
nrhs INTEGER. The number of right-hand sides; the number of columns in B (nrhs ≥ 0).
a, b, work REAL for sgelss DOUBLE PRECISION for dgelss COMPLEX for cgelss DOUBLE COMPLEX for zgelss. Arrays:
a(lda,*) contains the m-by-n matrix A. The second dimension of a must be at least max(1, n).
b(ldb,*) contains the m-by-nrhs right hand side matrix B. The second dimension of b must be at least max(1, nrhs).
work is a workspace array, its dimension max(1, lwork).
lda INTEGER. The leading dimension of a; at least max(1, m).
ldb INTEGER. The leading dimension of b; must be at least max(1, m, n).
rcond REAL for single-precision flavors DOUBLE PRECISION for double-precision flavors. rcond is used to determine the effective rank of A. Singular values s(i) ≤ rcond*s(1) are treated as zero. If rcond < 0, machine precision is used instead.
lwork INTEGER. The size of the work array; lwork ≥ 1. If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla. See Application Notes for the suggested value of lwork.
rwork REAL for cgelss DOUBLE PRECISION for zgelss. Workspace array used in complex flavors only. DIMENSION at least max(1, 5*min(m, n)).
Output Parameters
a On exit, the first min(m, n) rows of A are overwritten with its right singular vectors, stored row-wise.
b Overwritten by the n-by-nrhs solution matrix X.
If m ≥ n and rank = n, the residual sum-of-squares for the solution in the i-th column is given by the sum of squares of modulus of elements n+1:m in that column.
s REAL for single precision flavors DOUBLE PRECISION for double precision flavors. Array, DIMENSION at least max(1, min(m, n)). The singular values of A in decreasing order. The condition number of A in the 2-norm is k2(A) = s(1)/s(min(m, n)).
rank INTEGER. The effective rank of A, that is, the number of singular values which are greater than rcond*s(1).
work(1) If info = 0, on exit, work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value. If info = i, then the algorithm for computing the SVD failed to converge; i indicates the number of off-diagonal elements of an intermediate bidiagonal form which did not converge to zero.
Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions. Specific details for the routine gelss interface are the following:
a Holds the matrix A of size (m,n).
b Holds the matrix of size max(m,n)-by-nrhs. On entry, contains the m-by-nrhs right hand side matrix B. On exit, overwritten by the n-by-nrhs solution matrix X.
s Holds the vector of length min(m,n).
rcond Default value for this element is rcond = 100*EPSILON(1.0_WP).
Application Notes
For real flavors: lwork ≥ 3*min(m, n) + max( 2*min(m, n), max(m, n), nrhs )
For complex flavors: lwork ≥ 2*min(m, n) + max(m, n, nrhs)
For good performance, lwork should generally be larger. If you are in doubt how much workspace to supply, use a generous value of lwork for the first run or set lwork = -1.
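The rank rule, the condition number formula, and the ?gelss workspace minimums described above can be sketched in C as follows. All function names here are hypothetical illustrations, not MKL or LAPACKE routines. Using DBL_EPSILON as the "machine precision" substituted for a negative rcond is an assumption of this sketch; the library may use a slightly different constant.

```c
#include <assert.h>
#include <float.h>

/* Hypothetical helper: effective rank from singular values s[0..k-1],
   sorted in decreasing order. Counts the values strictly greater than
   rcond * s[0]. For rcond < 0 this sketch substitutes DBL_EPSILON
   (an assumption standing in for "machine precision"). */
int effective_rank(const double *s, int k, double rcond)
{
    assert(k >= 0);
    if (k == 0) return 0;
    if (rcond < 0.0) rcond = DBL_EPSILON;
    double threshold = rcond * s[0];
    int rank = 0;
    for (int i = 0; i < k; ++i)
        if (s[i] > threshold) ++rank;
    return rank;
}

/* Hypothetical helper: 2-norm condition number k2(A) = s(1)/s(min(m, n)).
   The caller must ensure the smallest singular value is nonzero. */
double condition_number(const double *s, int k)
{
    assert(k >= 1 && s[k - 1] != 0.0);
    return s[0] / s[k - 1];
}

/* Minimum lwork for real flavors of ?gelss:
   3*min(m, n) + max( 2*min(m, n), max(m, n), nrhs ). */
int gelss_real_min_lwork(int m, int n, int nrhs)
{
    int mn = m < n ? m : n, mx = m > n ? m : n;
    int t = 2 * mn > mx ? 2 * mn : mx;
    if (nrhs > t) t = nrhs;
    return 3 * mn + t;
}

/* Minimum lwork for complex flavors: 2*min(m, n) + max(m, n, nrhs). */
int gelss_complex_min_lwork(int m, int n, int nrhs)
{
    int mn = m < n ? m : n, mx = m > n ? m : n;
    if (nrhs > mx) mx = nrhs;
    return 2 * mn + mx;
}
```

With singular values 8, 2, 0.5 and rcond = 0.1, the threshold is 0.8, so the effective rank is 2 while k2(A) = 8/0.5 = 16.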
If you supply a generous lwork that is no less than the minimal value described, the routine completes the task, though possibly more slowly than with the recommended workspace, and provides the recommended workspace size in the first element of the array work on exit. Use this value (work(1)) for subsequent runs. If you set lwork = -1, the routine returns immediately and provides the recommended workspace size in the first element of the array work. This operation is called a workspace query. Note that if you set lwork to less than the minimal required value and not -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.
?gelsd
Computes the minimum-norm solution to a linear least squares problem using the singular value decomposition of A and a divide and conquer method.
Syntax
Fortran 77:
call sgelsd(m, n, nrhs, a, lda, b, ldb, s, rcond, rank, work, lwork, iwork, info)
call dgelsd(m, n, nrhs, a, lda, b, ldb, s, rcond, rank, work, lwork, iwork, info)
call cgelsd(m, n, nrhs, a, lda, b, ldb, s, rcond, rank, work, lwork, rwork, iwork, info)
call zgelsd(m, n, nrhs, a, lda, b, ldb, s, rcond, rank, work, lwork, rwork, iwork, info)
Fortran 95:
call gelsd(a, b [,rank] [,s] [,rcond] [,info])
C:
lapack_int LAPACKE_sgelsd( int matrix_order, lapack_int m, lapack_int n, lapack_int nrhs, float* a, lapack_int lda, float* b, lapack_int ldb, float* s, float rcond, lapack_int* rank );
lapack_int LAPACKE_dgelsd( int matrix_order, lapack_int m, lapack_int n, lapack_int nrhs, double* a, lapack_int lda, double* b, lapack_int ldb, double* s, double rcond, lapack_int* rank );
lapack_int LAPACKE_cgelsd( int matrix_order, lapack_int m, lapack_int n, lapack_int nrhs, lapack_complex_float* a, lapack_int lda, lapack_complex_float* b, lapack_int ldb, float* s, float rcond, lapack_int* rank );
lapack_int
LAPACKE_zgelsd( int matrix_order, lapack_int m, lapack_int n, lapack_int nrhs, lapack_complex_double* a, lapack_int lda, lapack_complex_double* b, lapack_int ldb, double* s, double rcond, lapack_int* rank );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine computes the minimum-norm solution to a real linear least squares problem:
minimize ||b - A*x||_2
using the singular value decomposition (SVD) of A. A is an m-by-n matrix which may be rank-deficient. Several right hand side vectors b and solution vectors x can be handled in a single call; they are stored as the columns of the m-by-nrhs right hand side matrix B and the n-by-nrhs solution matrix X.
The problem is solved in three steps:
1. Reduce the coefficient matrix A to bidiagonal form with Householder transformations, reducing the original problem into a "bidiagonal least squares problem" (BLS).
2. Solve the BLS using a divide and conquer approach.
3. Apply back all the Householder transformations to solve the original least squares problem.
The effective rank of A is determined by treating as zero those singular values which are less than rcond times the largest singular value. The routine uses auxiliary routines lals0 and lalsa.
Input Parameters
The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
m INTEGER. The number of rows of the matrix A (m ≥ 0).
n INTEGER. The number of columns of the matrix A (n ≥ 0).
nrhs INTEGER. The number of right-hand sides; the number of columns in B (nrhs ≥ 0).
a, b, work REAL for sgelsd DOUBLE PRECISION for dgelsd COMPLEX for cgelsd DOUBLE COMPLEX for zgelsd. Arrays:
a(lda,*) contains the m-by-n matrix A.
The second dimension of a must be at least max(1, n).
b(ldb,*) contains the m-by-nrhs right hand side matrix B. The second dimension of b must be at least max(1, nrhs).
work is a workspace array, its dimension max(1, lwork).
lda  INTEGER. The leading dimension of a; at least max(1, m).
ldb  INTEGER. The leading dimension of b; must be at least max(1, m, n).
rcond  REAL for single-precision flavors, DOUBLE PRECISION for double-precision flavors. rcond is used to determine the effective rank of A. Singular values s(i) ≤ rcond*s(1) are treated as zero. If rcond ≤ 0, machine precision is used instead.
lwork  INTEGER. The size of the work array; lwork ≥ 1. If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the array work and the minimum sizes of the arrays rwork and iwork, and returns these values as the first entries of the work, rwork and iwork arrays, and no error message related to lwork is issued by xerbla. See Application Notes for the suggested value of lwork.
iwork  INTEGER. Workspace array. See Application Notes for the suggested dimension of iwork.
rwork  REAL for cgelsd, DOUBLE PRECISION for zgelsd. Workspace array, used in complex flavors only. See Application Notes for the suggested dimension of rwork.

Output Parameters
a  On exit, A has been overwritten.
b  Overwritten by the n-by-nrhs solution matrix X. If m ≥ n and rank = n, the residual sum-of-squares for the solution in the i-th column is given by the sum of squares of the moduli of elements n+1:m in that column.
s  REAL for single-precision flavors, DOUBLE PRECISION for double-precision flavors. Array, DIMENSION at least max(1, min(m, n)). The singular values of A in decreasing order. The condition number of A in the 2-norm is κ_2(A) = s(1)/s(min(m, n)).
rank  INTEGER. The effective rank of A, that is, the number of singular values which are greater than rcond*s(1).
work(1)  If info = 0, on exit, work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs.
rwork(1)  If info = 0, on exit, rwork(1) returns the minimum size of the workspace array rwork required for optimum performance.
iwork(1)  If info = 0, on exit, iwork(1) returns the minimum size of the workspace array iwork required for optimum performance.
info  INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value. If info = i, then the algorithm for computing the SVD failed to converge; i indicates the number of off-diagonal elements of an intermediate bidiagonal form that did not converge to zero.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine gelsd interface are the following:
a  Holds the matrix A of size (m,n).
b  Holds the matrix of size max(m,n)-by-nrhs. On entry, contains the m-by-nrhs right hand side matrix B. On exit, overwritten by the n-by-nrhs solution matrix X.
s  Holds the vector of length min(m,n).
rcond  Default value for this element is rcond = 100*EPSILON(1.0_WP).

Application Notes
The divide and conquer algorithm makes very mild assumptions about floating point arithmetic. It will work on machines with a guard digit in add/subtract. It could conceivably fail on hexadecimal or decimal machines without guard digits, but we know of none.
The exact minimum amount of workspace needed depends on m, n and nrhs. The size lwork of the workspace array work must be as given below.
For real flavors:
If m ≥ n, lwork ≥ 12n + 2n*smlsiz + 8n*nlvl + n*nrhs + (smlsiz+1)^2;
If m < n, lwork ≥ 12m + 2m*smlsiz + 8m*nlvl + m*nrhs + (smlsiz+1)^2;
For complex flavors:
If m ≥ n, lwork ≥ 2n + n*nrhs;
If m < n, lwork ≥ 2m + m*nrhs;
where smlsiz is returned by ilaenv and is equal to the maximum size of the subproblems at the bottom of the computation tree (usually about 25), and
nlvl = INT( log2( min( m, n )/(smlsiz+1) ) ) + 1.
For good performance, lwork should generally be larger.
If you are in doubt how much workspace to supply, use a generous value of lwork for the first run or set lwork = -1.
If you choose the first option and set lwork to any admissible size (no less than the minimal value described), the routine completes the task, though probably not as fast as with the recommended workspace, and returns the recommended workspace size in the first element of the array work on exit. Use this value (work(1)) for subsequent runs.
If you set lwork = -1, the routine returns immediately and provides the recommended workspace size in the first element of the array work. This operation is called a workspace query.
Note that if you set lwork to a value less than the minimal required value and not equal to -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.
The dimension of the workspace array iwork must be at least 3*min( m, n )*nlvl + 11*min( m, n ).
The dimension of the workspace array rwork (for complex flavors) must be at least max(1, lrwork), where
lrwork ≥ 10n + 2n*smlsiz + 8n*nlvl + 3*smlsiz*nrhs + (smlsiz+1)^2 if m ≥ n, and
lrwork ≥ 10m + 2m*smlsiz + 8m*nlvl + 3*smlsiz*nrhs + (smlsiz+1)^2 if m < n.

Generalized LLS Problems
This section describes LAPACK driver routines used for solving generalized linear least squares problems.
Table "Driver Routines for Solving Generalized LLS Problems" lists all such routines for the FORTRAN 77 interface. Respective routine names in the Fortran 95 interface are without the first symbol (see Routine Naming Conventions).

Driver Routines for Solving Generalized LLS Problems
Routine Name    Operation performed
gglse           Solves the linear equality-constrained least squares problem using a generalized RQ factorization.
ggglm           Solves a general Gauss-Markov linear model problem using a generalized QR factorization.

?gglse
Solves the linear equality-constrained least squares problem using a generalized RQ factorization.

Syntax
Fortran 77:
call sgglse(m, n, p, a, lda, b, ldb, c, d, x, work, lwork, info)
call dgglse(m, n, p, a, lda, b, ldb, c, d, x, work, lwork, info)
call cgglse(m, n, p, a, lda, b, ldb, c, d, x, work, lwork, info)
call zgglse(m, n, p, a, lda, b, ldb, c, d, x, work, lwork, info)
Fortran 95:
call gglse(a, b, c, d, x [,info])
C:
lapack_int LAPACKE_<?>gglse( int matrix_order, lapack_int m, lapack_int n, lapack_int p, <datatype>* a, lapack_int lda, <datatype>* b, lapack_int ldb, <datatype>* c, <datatype>* d, <datatype>* x );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine solves the linear equality-constrained least squares (LSE) problem:
minimize ||c - A*x||_2 subject to B*x = d
where A is an m-by-n matrix, B is a p-by-n matrix, c is a given m-vector, and d is a given p-vector. It is assumed that p ≤ n ≤ m+p, and
rank(B) = p and rank((A; B)) = n,
where (A; B) denotes A stacked on top of B. These conditions ensure that the LSE problem has a unique solution, which is obtained using a generalized RQ factorization of the matrices (B, A) given by
B = (0 R)*Q, A = Z*T*Q

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
m  INTEGER.
The number of rows of the matrix A (m ≥ 0).
n  INTEGER. The number of columns of the matrices A and B (n ≥ 0).
p  INTEGER. The number of rows of the matrix B (0 ≤ p ≤ n ≤ m+p).
a, b, c, d, work  REAL for sgglse, DOUBLE PRECISION for dgglse, COMPLEX for cgglse, DOUBLE COMPLEX for zgglse.
Arrays: a(lda,*) contains the m-by-n matrix A. The second dimension of a must be at least max(1, n).
b(ldb,*) contains the p-by-n matrix B. The second dimension of b must be at least max(1, n).
c(*), dimension at least max(1, m), contains the right hand side vector for the least squares part of the LSE problem.
d(*), dimension at least max(1, p), contains the right hand side vector for the constrained equation.
work is a workspace array, its dimension max(1, lwork).
lda  INTEGER. The leading dimension of a; at least max(1, m).
ldb  INTEGER. The leading dimension of b; at least max(1, p).
lwork  INTEGER. The size of the work array; lwork ≥ max(1, m+n+p). If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla. See Application Notes for the suggested value of lwork.

Output Parameters
x  REAL for sgglse, DOUBLE PRECISION for dgglse, COMPLEX for cgglse, DOUBLE COMPLEX for zgglse. Array, DIMENSION at least max(1, n). On exit, contains the solution of the LSE problem.
b  On exit, the upper triangle of the subarray b(1:p, n-p+1:n) contains the p-by-p upper triangular matrix R.
d  On exit, d is destroyed.
c  On exit, the residual sum-of-squares for the solution is given by the sum of squares of elements n-p+1 to m of vector c.
work(1)  If info = 0, on exit, work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs.
info  INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.
If info = 1, the upper triangular factor R associated with B in the generalized RQ factorization of the pair (B, A) is singular, so that rank(B) < p; the least squares solution could not be computed. If info = 2, the (n-p)-by-(n-p) part of the upper trapezoidal factor T associated with A in the generalized RQ factorization of the pair (B, A) is singular, so that rank((A; B)) < n; the least squares solution could not be computed.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine gglse interface are the following:
a  Holds the matrix A of size (m,n).
b  Holds the matrix B of size (p,n).
c  Holds the vector of length m.
d  Holds the vector of length p.
x  Holds the vector of length n.

Application Notes
For optimum performance, use lwork ≥ p + min(m, n) + max(m, n)*nb, where nb is an upper bound for the optimal blocksizes for ?geqrf, ?gerqf, ?ormqr/?unmqr and ?ormrq/?unmrq.
You may set lwork to -1. The routine returns immediately and provides the recommended workspace size in the first element of the array work. This operation is called a workspace query.
Note that if you set lwork to a value less than the minimal required value and not equal to -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.

?ggglm
Solves a general Gauss-Markov linear model problem using a generalized QR factorization.
Syntax
Fortran 77:
call sggglm(n, m, p, a, lda, b, ldb, d, x, y, work, lwork, info)
call dggglm(n, m, p, a, lda, b, ldb, d, x, y, work, lwork, info)
call cggglm(n, m, p, a, lda, b, ldb, d, x, y, work, lwork, info)
call zggglm(n, m, p, a, lda, b, ldb, d, x, y, work, lwork, info)
Fortran 95:
call ggglm(a, b, d, x, y [,info])
C:
lapack_int LAPACKE_<?>ggglm( int matrix_order, lapack_int n, lapack_int m, lapack_int p, <datatype>* a, lapack_int lda, <datatype>* b, lapack_int ldb, <datatype>* d, <datatype>* x, <datatype>* y );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine solves a general Gauss-Markov linear model (GLM) problem:
minimize_x ||y||_2 subject to d = A*x + B*y
where A is an n-by-m matrix, B is an n-by-p matrix, and d is a given n-vector. It is assumed that m ≤ n ≤ m+p, and rank(A) = m and rank(A B) = n.
Under these assumptions, the constrained equation is always consistent, and there is a unique solution x and a minimal 2-norm solution y, which is obtained using a generalized QR factorization of the matrices (A, B) given by
A = Q*(R; 0), B = Q*T*Z,
where (R; 0) denotes R stacked on top of a zero block.
In particular, if matrix B is square nonsingular, then the problem GLM is equivalent to the following weighted linear least squares problem:
minimize_x ||B^(-1)*(d - A*x)||_2.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
n  INTEGER. The number of rows of the matrices A and B (n ≥ 0).
m  INTEGER. The number of columns in A (m ≥ 0).
p  INTEGER. The number of columns in B (p ≥ n - m).
a, b, d, work  REAL for sggglm, DOUBLE PRECISION for dggglm, COMPLEX for cggglm, DOUBLE COMPLEX for zggglm.
Arrays: a(lda,*) contains the n-by-m matrix A. The second dimension of a must be at least max(1, m).
b(ldb,*) contains the n-by-p matrix B.
The second dimension of b must be at least max(1, p).
d(*), dimension at least max(1, n), contains the left hand side of the GLM equation.
work is a workspace array, its dimension max(1, lwork).
lda  INTEGER. The leading dimension of a; at least max(1, n).
ldb  INTEGER. The leading dimension of b; at least max(1, n).
lwork  INTEGER. The size of the work array; lwork ≥ max(1, n+m+p). If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla. See Application Notes for the suggested value of lwork.

Output Parameters
x, y  REAL for sggglm, DOUBLE PRECISION for dggglm, COMPLEX for cggglm, DOUBLE COMPLEX for zggglm. Arrays x(*), y(*). DIMENSION at least max(1, m) for x and at least max(1, p) for y. On exit, x and y are the solutions of the GLM problem.
a  On exit, the upper triangular part of the array a contains the m-by-m upper triangular matrix R.
b  On exit, if n ≤ p, the upper triangle of the subarray b(1:n, p-n+1:p) contains the n-by-n upper triangular matrix T; if n > p, the elements on and above the (n-p)-th subdiagonal contain the n-by-p upper trapezoidal matrix T.
d  On exit, d is destroyed.
work(1)  If info = 0, on exit, work(1) contains the minimum value of lwork required for optimum performance.
info  INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value. If info = 1, the upper triangular factor R associated with A in the generalized QR factorization of the pair (A, B) is singular, so that rank(A) < m; the least squares solution could not be computed. If info = 2, the bottom (n-m)-by-(n-m) part of the upper trapezoidal factor T associated with B in the generalized QR factorization of the pair (A, B) is singular, so that rank(A B) < n; the least squares solution could not be computed.
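The dimension and workspace constraints listed above for ?ggglm can be checked before the call. The following is a minimal sketch in C, not part of Intel MKL: the helper name ggglm_check_args is hypothetical, and the returned negative values merely imitate the "info = -i" convention using the argument positions of the FORTRAN 77 calling sequence (n, m, p, a, lda, b, ldb, d, x, y, work, lwork, info). The library's own argument checking may differ in detail.

```c
/* Hypothetical pre-check for ?ggglm arguments (illustration only, not an MKL routine).
   Returns 0 if the documented constraints hold, otherwise -i, where i is the
   position of the offending argument in the FORTRAN 77 calling sequence. */

static int ggglm_imax(int a, int b) { return a > b ? a : b; }

int ggglm_check_args(int n, int m, int p, int lda, int ldb, int lwork)
{
    if (n < 0)                   return -1;   /* n >= 0                      */
    if (m < 0 || m > n)          return -2;   /* 0 <= m <= n                 */
    if (p < 0 || p < n - m)      return -3;   /* p >= n - m, so n <= m + p   */
    if (lda < ggglm_imax(1, n))  return -5;   /* lda >= max(1, n)            */
    if (ldb < ggglm_imax(1, n))  return -7;   /* ldb >= max(1, n)            */
    /* lwork = -1 requests a workspace query and is always accepted. */
    if (lwork != -1 && lwork < ggglm_imax(1, n + m + p))
        return -12;                           /* lwork >= max(1, n+m+p)      */
    return 0;
}
```

For instance, with n = 5, m = 3, p = 2, lda = ldb = 5 the smallest admissible lwork is 10, so ggglm_check_args(5, 3, 2, 5, 5, 10) returns 0, while lwork = 9 yields -12.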
Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine ggglm interface are the following:
a  Holds the matrix A of size (n,m).
b  Holds the matrix B of size (n,p).
d  Holds the vector of length n.
x  Holds the vector of length m.
y  Holds the vector of length p.

Application Notes
For optimum performance, use lwork ≥ m + min(n, p) + max(n, p)*nb, where nb is an upper bound for the optimal blocksizes for ?geqrf, ?gerqf, ?ormqr/?unmqr and ?ormrq/?unmrq.
You may set lwork to -1. The routine returns immediately and provides the recommended workspace size in the first element of the array work. This operation is called a workspace query.
Note that if you set lwork to a value less than the minimal required value and not equal to -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.

Symmetric Eigenproblems
This section describes LAPACK driver routines used for solving symmetric eigenvalue problems. See also computational routines that can be called to solve these problems.
Table "Driver Routines for Solving Symmetric Eigenproblems" lists all such driver routines for the FORTRAN 77 interface. Respective routine names in the Fortran 95 interface are without the first symbol (see Routine Naming Conventions).

Driver Routines for Solving Symmetric Eigenproblems
Routine Name    Operation performed
syev/heev       Computes all eigenvalues and, optionally, eigenvectors of a real symmetric / Hermitian matrix.
syevd/heevd     Computes all eigenvalues and (optionally) all eigenvectors of a real symmetric / Hermitian matrix using divide and conquer algorithm.
syevx/heevx     Computes selected eigenvalues and, optionally, eigenvectors of a symmetric / Hermitian matrix.
syevr/heevr     Computes selected eigenvalues and, optionally, eigenvectors of a real symmetric / Hermitian matrix using the Relatively Robust Representations.
spev/hpev       Computes all eigenvalues and, optionally, eigenvectors of a real symmetric / Hermitian matrix in packed storage.
spevd/hpevd     Uses divide and conquer algorithm to compute all eigenvalues and (optionally) all eigenvectors of a real symmetric / Hermitian matrix held in packed storage.
spevx/hpevx     Computes selected eigenvalues and, optionally, eigenvectors of a real symmetric / Hermitian matrix in packed storage.
sbev/hbev       Computes all eigenvalues and, optionally, eigenvectors of a real symmetric / Hermitian band matrix.
sbevd/hbevd     Computes all eigenvalues and (optionally) all eigenvectors of a real symmetric / Hermitian band matrix using divide and conquer algorithm.
sbevx/hbevx     Computes selected eigenvalues and, optionally, eigenvectors of a real symmetric / Hermitian band matrix.
stev            Computes all eigenvalues and, optionally, eigenvectors of a real symmetric tridiagonal matrix.
stevd           Computes all eigenvalues and (optionally) all eigenvectors of a real symmetric tridiagonal matrix using divide and conquer algorithm.
stevx           Computes selected eigenvalues and eigenvectors of a real symmetric tridiagonal matrix.
stevr           Computes selected eigenvalues and, optionally, eigenvectors of a real symmetric tridiagonal matrix using the Relatively Robust Representations.

?syev
Computes all eigenvalues and, optionally, eigenvectors of a real symmetric matrix.
Syntax
Fortran 77:
call ssyev(jobz, uplo, n, a, lda, w, work, lwork, info)
call dsyev(jobz, uplo, n, a, lda, w, work, lwork, info)
Fortran 95:
call syev(a, w [,jobz] [,uplo] [,info])
C:
lapack_int LAPACKE_<?>syev( int matrix_order, char jobz, char uplo, lapack_int n, <datatype>* a, lapack_int lda, <datatype>* w );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine computes all eigenvalues and, optionally, eigenvectors of a real symmetric matrix A.
Note that for most cases of real symmetric eigenvalue problems the default choice should be the syevr function, as its underlying algorithm is faster and uses less workspace.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
jobz  CHARACTER*1. Must be 'N' or 'V'. If jobz = 'N', then only eigenvalues are computed. If jobz = 'V', then eigenvalues and eigenvectors are computed.
uplo  CHARACTER*1. Must be 'U' or 'L'. If uplo = 'U', a stores the upper triangular part of A. If uplo = 'L', a stores the lower triangular part of A.
n  INTEGER. The order of the matrix A (n ≥ 0).
a, work  REAL for ssyev, DOUBLE PRECISION for dsyev.
Arrays: a(lda,*) is an array containing either upper or lower triangular part of the symmetric matrix A, as specified by uplo. The second dimension of a must be at least max(1, n).
work is a workspace array, its dimension max(1, lwork).
lda  INTEGER. The leading dimension of the array a. Must be at least max(1, n).
lwork  INTEGER. The dimension of the array work. Constraint: lwork ≥ max(1, 3n-1).
If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla. See Application Notes for the suggested value of lwork.

Output Parameters
a  On exit, if jobz = 'V', then if info = 0, array a contains the orthonormal eigenvectors of the matrix A. If jobz = 'N', then on exit the lower triangle (if uplo = 'L') or the upper triangle (if uplo = 'U') of A, including the diagonal, is overwritten.
w  REAL for ssyev, DOUBLE PRECISION for dsyev. Array, DIMENSION at least max(1, n). If info = 0, contains the eigenvalues of the matrix A in ascending order.
work(1)  On exit, if lwork > 0, then work(1) returns the required minimal size of lwork.
info  INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value. If info = i, then the algorithm failed to converge; i indicates the number of elements of an intermediate tridiagonal form which did not converge to zero.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine syev interface are the following:
a  Holds the matrix A of size (n, n).
w  Holds the vector of length n.
jobz  Must be 'N' or 'V'. The default value is 'N'.
uplo  Must be 'U' or 'L'. The default value is 'U'.

Application Notes
For optimum performance set lwork ≥ (nb+2)*n, where nb is the blocksize for ?sytrd returned by ilaenv.
If you are in doubt how much workspace to supply, use a generous value of lwork for the first run or set lwork = -1.
If you choose the first option and set lwork to any admissible size (no less than the minimal value described), the routine completes the task, though probably not as fast as with the recommended workspace, and returns the recommended workspace size in the first element of the array work on exit. Use this value (work(1)) for subsequent runs.
If you set lwork = -1, the routine returns immediately and provides the recommended workspace size in the first element of the array work. This operation is called a workspace query.
Note that if you set lwork to a value less than the minimal required value and not equal to -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.

?heev
Computes all eigenvalues and, optionally, eigenvectors of a Hermitian matrix.
Syntax
Fortran 77:
call cheev(jobz, uplo, n, a, lda, w, work, lwork, rwork, info)
call zheev(jobz, uplo, n, a, lda, w, work, lwork, rwork, info)
Fortran 95:
call heev(a, w [,jobz] [,uplo] [,info])
C:
lapack_int LAPACKE_cheev( int matrix_order, char jobz, char uplo, lapack_int n, lapack_complex_float* a, lapack_int lda, float* w );
lapack_int LAPACKE_zheev( int matrix_order, char jobz, char uplo, lapack_int n, lapack_complex_double* a, lapack_int lda, double* w );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine computes all eigenvalues and, optionally, eigenvectors of a complex Hermitian matrix A.
Note that for most cases of complex Hermitian eigenvalue problems the default choice should be the heevr function, as its underlying algorithm is faster and uses less workspace.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
jobz  CHARACTER*1. Must be 'N' or 'V'. If jobz = 'N', then only eigenvalues are computed. If jobz = 'V', then eigenvalues and eigenvectors are computed.
uplo  CHARACTER*1. Must be 'U' or 'L'. If uplo = 'U', a stores the upper triangular part of A. If uplo = 'L', a stores the lower triangular part of A.
n  INTEGER. The order of the matrix A (n ≥ 0).
a, work  COMPLEX for cheev, DOUBLE COMPLEX for zheev.
Arrays: a(lda,*) is an array containing either upper or lower triangular part of the Hermitian matrix A, as specified by uplo. The second dimension of a must be at least max(1, n).
work is a workspace array, its dimension max(1, lwork).
lda  INTEGER. The leading dimension of the array a. Must be at least max(1, n).
lwork  INTEGER. The dimension of the array work. Constraint: lwork ≥ max(1, 2n-1).
If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla. See Application Notes for the suggested value of lwork.
rwork  REAL for cheev, DOUBLE PRECISION for zheev. Workspace array, DIMENSION at least max(1, 3n-2).

Output Parameters
a  On exit, if jobz = 'V', then if info = 0, array a contains the orthonormal eigenvectors of the matrix A. If jobz = 'N', then on exit the lower triangle (if uplo = 'L') or the upper triangle (if uplo = 'U') of A, including the diagonal, is overwritten.
w  REAL for cheev, DOUBLE PRECISION for zheev. Array, DIMENSION at least max(1, n). If info = 0, contains the eigenvalues of the matrix A in ascending order.
work(1)  On exit, if lwork > 0, then work(1) returns the required minimal size of lwork.
info  INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value. If info = i, then the algorithm failed to converge; i indicates the number of elements of an intermediate tridiagonal form which did not converge to zero.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine heev interface are the following:
a  Holds the matrix A of size (n, n).
w  Holds the vector of length n.
jobz  Must be 'N' or 'V'. The default value is 'N'.
uplo  Must be 'U' or 'L'. The default value is 'U'.

Application Notes
For optimum performance use lwork ≥ (nb+1)*n, where nb is the blocksize for ?hetrd returned by ilaenv.
If you are in doubt how much workspace to supply, use a generous value of lwork for the first run or set lwork = -1.
If you choose the first option and set lwork to any admissible size (no less than the minimal value described), the routine completes the task, though probably not as fast as with the recommended workspace, and returns the recommended workspace size in the first element of the array work on exit. Use this value (work(1)) for subsequent runs.
If you set lwork = -1, the routine returns immediately and provides the recommended workspace size in the first element of the array work. This operation is called a workspace query.
Note that if you set lwork to a value less than the minimal required value and not equal to -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.

?syevd
Computes all eigenvalues and (optionally) all eigenvectors of a real symmetric matrix using divide and conquer algorithm.

Syntax
Fortran 77:
call ssyevd(jobz, uplo, n, a, lda, w, work, lwork, iwork, liwork, info)
call dsyevd(jobz, uplo, n, a, lda, w, work, lwork, iwork, liwork, info)
Fortran 95:
call syevd(a, w [,jobz] [,uplo] [,info])
C:
lapack_int LAPACKE_<?>syevd( int matrix_order, char jobz, char uplo, lapack_int n, <datatype>* a, lapack_int lda, <datatype>* w );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine computes all the eigenvalues, and optionally all the eigenvectors, of a real symmetric matrix A. In other words, it can compute the spectral factorization of A as: A = Z*Λ*Z^T.
Here Λ is a diagonal matrix whose diagonal elements are the eigenvalues λ_i, and Z is the orthogonal matrix whose columns are the eigenvectors z_i. Thus,
A*z_i = λ_i*z_i for i = 1, 2, ..., n.
If the eigenvectors are requested, then this routine uses a divide and conquer algorithm to compute eigenvalues and eigenvectors. However, if only eigenvalues are required, then it uses the Pal-Walker-Kahan variant of the QL or QR algorithm.
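The spectral factorization described above (A equals Z times the diagonal eigenvalue matrix times the transpose of Z, with each column of Z an eigenvector) can be verified on computed results with a few loops. The sketch below is plain C and does not call Intel MKL; the helper names spectral_reconstruct and eig_residual are hypothetical, and column-major storage with leading dimensions (as in the Fortran interface) is an assumption of this example. It only checks eigenpairs that are already available and says nothing about how ?syevd computes them.

```c
/* Illustration only: checks eigenpairs of a real symmetric matrix.
   All matrices are column-major: element (i,j) is stored at [i + j*ld]. */

/* Rebuilds A(i,j) = sum over k of Z(i,k) * lambda(k) * Z(j,k). */
void spectral_reconstruct(int n, const double *z, int ldz,
                          const double *lambda, double *a, int lda)
{
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < n; ++i) {
            double sum = 0.0;
            for (int k = 0; k < n; ++k)
                sum += z[i + k * ldz] * lambda[k] * z[j + k * ldz];
            a[i + j * lda] = sum;
        }
}

/* Returns the largest absolute entry of A*z_k - lambda_k*z_k over all k. */
double eig_residual(int n, const double *a, int lda,
                    const double *z, int ldz, const double *lambda)
{
    double worst = 0.0;
    for (int k = 0; k < n; ++k)
        for (int i = 0; i < n; ++i) {
            double r = -lambda[k] * z[i + k * ldz];
            for (int j = 0; j < n; ++j)
                r += a[i + j * lda] * z[j + k * ldz];
            if (r < 0.0) r = -r;
            if (r > worst) worst = r;
        }
    return worst;
}
```

A residual close to machine precision times the norm of A suggests the eigenpairs are consistent; the acceptable threshold is a choice left to the caller.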
Note that for most cases of real symmetric eigenvalue problems the default choice should be the syevr function, as its underlying algorithm is faster and uses less workspace. ?syevd requires more workspace but is faster in some cases, especially for large matrices.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
jobz  CHARACTER*1. Must be 'N' or 'V'. If jobz = 'N', then only eigenvalues are computed. If jobz = 'V', then eigenvalues and eigenvectors are computed.
uplo  CHARACTER*1. Must be 'U' or 'L'. If uplo = 'U', a stores the upper triangular part of A. If uplo = 'L', a stores the lower triangular part of A.
n  INTEGER. The order of the matrix A (n ≥ 0).
a  REAL for ssyevd, DOUBLE PRECISION for dsyevd. Array, DIMENSION (lda, *). a(lda,*) is an array containing either upper or lower triangular part of the symmetric matrix A, as specified by uplo. The second dimension of a must be at least max(1, n).
lda  INTEGER. The leading dimension of the array a. Must be at least max(1, n).
work  REAL for ssyevd, DOUBLE PRECISION for dsyevd. Workspace array, DIMENSION at least lwork.
lwork  INTEGER. The dimension of the array work. Constraints:
if n ≤ 1, then lwork ≥ 1;
if jobz = 'N' and n > 1, then lwork ≥ 2*n + 1;
if jobz = 'V' and n > 1, then lwork ≥ 2*n^2 + 6*n + 1.
If lwork = -1, then a workspace query is assumed; the routine only calculates the required sizes of the work and iwork arrays, returns these values as the first entries of the work and iwork arrays, and no error message related to lwork or liwork is issued by xerbla. See Application Notes for details.
iwork  INTEGER. Workspace array, its dimension max(1, liwork).
liwork  INTEGER. The dimension of the array iwork.
Constraints:
if n ≤ 1, then liwork ≥ 1;
if jobz = 'N' and n > 1, then liwork ≥ 1;
if jobz = 'V' and n > 1, then liwork ≥ 5*n + 3.
If liwork = -1, then a workspace query is assumed; the routine only calculates the required sizes of the work and iwork arrays, returns these values as the first entries of the work and iwork arrays, and no error message related to lwork or liwork is issued by xerbla. See Application Notes for details.

Output Parameters
w  REAL for ssyevd, DOUBLE PRECISION for dsyevd. Array, DIMENSION at least max(1, n). If info = 0, contains the eigenvalues of the matrix A in ascending order. See also info.
a  If jobz = 'V', then on exit this array is overwritten by the orthogonal matrix Z which contains the eigenvectors of A.
work(1)  On exit, if lwork > 0, then work(1) returns the required minimal size of lwork.
iwork(1)  On exit, if liwork > 0, then iwork(1) returns the required minimal size of liwork.
info  INTEGER. If info = 0, the execution is successful. If info = i, and jobz = 'N', then the algorithm failed to converge; i indicates the number of off-diagonal elements of an intermediate tridiagonal form which did not converge to zero. If info = i, and jobz = 'V', then the algorithm failed to compute an eigenvalue while working on the submatrix lying in rows and columns info/(n+1) through mod(info,n+1). If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine syevd interface are the following:
a  Holds the matrix A of size (n, n).
w  Holds the vector of length n.
jobz  Must be 'N' or 'V'. The default value is 'N'.
uplo  Must be 'U' or 'L'. The default value is 'U'.
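For callers of the FORTRAN 77 style interface who prefer to size the workspaces from the documented lower bounds rather than issue a workspace query, the ?syevd constraints on lwork and liwork, and the decoding of a positive info value for jobz = 'V', can be written as small helpers. This is a sketch only: the function names are hypothetical and not part of Intel MKL, the values are minimal admissible sizes rather than the recommended ones, and plain int arithmetic is assumed not to overflow for the given n.

```c
/* Hypothetical helpers (illustration only) for the documented ?syevd bounds. */

/* Minimal lwork: 1 if n <= 1; 2n+1 for jobz = 'N'; 2n^2+6n+1 for jobz = 'V'. */
int syevd_min_lwork(char jobz, int n)
{
    if (n <= 1) return 1;
    if (jobz == 'V' || jobz == 'v') return 2 * n * n + 6 * n + 1;
    return 2 * n + 1;
}

/* Minimal liwork: 1 if n <= 1 or jobz = 'N'; 5n+3 for jobz = 'V'. */
int syevd_min_liwork(char jobz, int n)
{
    if (n <= 1) return 1;
    if (jobz == 'V' || jobz == 'v') return 5 * n + 3;
    return 1;
}

/* For jobz = 'V' and info > 0: the failure occurred in the submatrix lying in
   rows and columns info/(n+1) through mod(info, n+1). */
void syevd_failed_block(int info, int n, int *first, int *last)
{
    *first = info / (n + 1);
    *last  = info % (n + 1);
}
```

For n = 100 and jobz = 'V' these bounds give lwork ≥ 20601 and liwork ≥ 503. A workspace query (lwork = -1 or liwork = -1) remains the more reliable way to obtain the sizes recommended for performance.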
Application Notes

The computed eigenvalues and eigenvectors are exact for a matrix T+E such that ||E||_2 = O(ε)*||T||_2, where ε is the machine precision.

If it is not clear how much workspace to supply, use a generous value of lwork (or liwork) for the first run, or set lwork = -1 (liwork = -1).

If lwork (or liwork) has any admissible size, that is, a size no less than the minimal value described, then the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the corresponding array (work, iwork) on exit. Use this value (work(1), iwork(1)) for subsequent runs.

If lwork = -1 (liwork = -1), then the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work, iwork). This operation is called a workspace query.

Note that if lwork (liwork) is less than the minimal required value and is not equal to -1, then the routine returns immediately with an error exit and does not provide any information on the recommended workspace.

The complex analogue of this routine is heevd.

?heevd
Computes all eigenvalues and (optionally) all eigenvectors of a complex Hermitian matrix using divide and conquer algorithm.
Syntax

Fortran 77:
call cheevd(jobz, uplo, n, a, lda, w, work, lwork, rwork, lrwork, iwork, liwork, info)
call zheevd(jobz, uplo, n, a, lda, w, work, lwork, rwork, lrwork, iwork, liwork, info)

Fortran 95:
call heevd(a, w [,jobz] [,uplo] [,info])

C:
lapack_int LAPACKE_cheevd( int matrix_order, char jobz, char uplo, lapack_int n, lapack_complex_float* a, lapack_int lda, float* w );
lapack_int LAPACKE_zheevd( int matrix_order, char jobz, char uplo, lapack_int n, lapack_complex_double* a, lapack_int lda, double* w );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description

The routine computes all the eigenvalues, and optionally all the eigenvectors, of a complex Hermitian matrix A. In other words, it can compute the spectral factorization of A as: A = Z*Λ*Z^H.
Here Λ is a real diagonal matrix whose diagonal elements are the eigenvalues λ_i, and Z is the (complex) unitary matrix whose columns are the eigenvectors z_i. Thus, A*z_i = λ_i*z_i for i = 1, 2, ..., n.

If the eigenvectors are requested, then this routine uses a divide and conquer algorithm to compute eigenvalues and eigenvectors. However, if only eigenvalues are required, then it uses the Pal-Walker-Kahan variant of the QL or QR algorithm.

Note that for most complex Hermitian eigenvalue problems the default choice should be the heevr function, as its underlying algorithm is faster and uses less workspace. ?heevd requires more workspace but is faster in some cases, especially for large matrices.

Input Parameters

The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

jobz CHARACTER*1. Must be 'N' or 'V'.
If jobz = 'N', then only eigenvalues are computed.
If jobz = 'V', then eigenvalues and eigenvectors are computed.

uplo CHARACTER*1. Must be 'U' or 'L'.
If uplo = 'U', a stores the upper triangular part of A.
If uplo = 'L', a stores the lower triangular part of A.

n INTEGER. The order of the matrix A (n ≥ 0).

a COMPLEX for cheevd
DOUBLE COMPLEX for zheevd
Array, DIMENSION (lda, *). a(lda,*) is an array containing either upper or lower triangular part of the Hermitian matrix A, as specified by uplo. The second dimension of a must be at least max(1, n).

lda INTEGER. The leading dimension of the array a. Must be at least max(1, n).

work COMPLEX for cheevd
DOUBLE COMPLEX for zheevd.
Workspace array, DIMENSION max(1, lwork).

lwork INTEGER. The dimension of the array work.
Constraints:
if n ≤ 1, then lwork ≥ 1;
if jobz = 'N' and n > 1, then lwork ≥ n + 1;
if jobz = 'V' and n > 1, then lwork ≥ n^2 + 2*n.
If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work, rwork and iwork arrays, returns these values as the first entries of the work, rwork and iwork arrays, and no error message related to lwork or lrwork or liwork is issued by xerbla. See Application Notes for details.

rwork REAL for cheevd
DOUBLE PRECISION for zheevd
Workspace array, DIMENSION at least lrwork.

lrwork INTEGER. The dimension of the array rwork.
Constraints:
if n ≤ 1, then lrwork ≥ 1;
if jobz = 'N' and n > 1, then lrwork ≥ n;
if jobz = 'V' and n > 1, then lrwork ≥ 2*n^2 + 5*n + 1.
If lrwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work, rwork and iwork arrays, returns these values as the first entries of the work, rwork and iwork arrays, and no error message related to lwork or lrwork or liwork is issued by xerbla. See Application Notes for details.

iwork INTEGER. Workspace array, its dimension max(1, liwork).

liwork INTEGER. The dimension of the array iwork.
Constraints:
if n ≤ 1, then liwork ≥ 1;
if jobz = 'N' and n > 1, then liwork ≥ 1;
if jobz = 'V' and n > 1, then liwork ≥ 5*n + 3.
If liwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work, rwork and iwork arrays, returns these values as the first entries of the work, rwork and iwork arrays, and no error message related to lwork or lrwork or liwork is issued by xerbla. See Application Notes for details.

Output Parameters

w REAL for cheevd
DOUBLE PRECISION for zheevd
Array, DIMENSION at least max(1, n). If info = 0, contains the eigenvalues of the matrix A in ascending order. See also info.

a If jobz = 'V', then on exit this array is overwritten by the unitary matrix Z which contains the eigenvectors of A.

work(1) On exit, if lwork > 0, then the real part of work(1) returns the required minimal size of lwork.

rwork(1) On exit, if lrwork > 0, then rwork(1) returns the required minimal size of lrwork.

iwork(1) On exit, if liwork > 0, then iwork(1) returns the required minimal size of liwork.

info INTEGER.
If info = 0, the execution is successful.
If info = i, and jobz = 'N', then the algorithm failed to converge; i off-diagonal elements of an intermediate tridiagonal form did not converge to zero.
If info = i, and jobz = 'V', then the algorithm failed to compute an eigenvalue while working on the submatrix lying in rows and columns info/(n+1) through mod(info, n+1).
If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes

Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine heevd interface are the following:

a Holds the matrix A of size (n, n).
w Holds the vector of length n.
jobz Must be 'N' or 'V'. The default value is 'N'.
uplo Must be 'U' or 'L'. The default value is 'U'.

Application Notes

The computed eigenvalues and eigenvectors are exact for a matrix A + E such that ||E||_2 = O(ε)*||A||_2, where ε is the machine precision.

If you are in doubt how much workspace to supply, use a generous value of lwork (liwork or lrwork) for the first run or set lwork = -1 (liwork = -1, lrwork = -1).

If you choose the first option and set lwork (liwork or lrwork) to any admissible size, that is, a size no less than the minimal value described, the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the corresponding array (work, iwork, rwork) on exit. Use this value (work(1), iwork(1), rwork(1)) for subsequent runs.

If you set lwork = -1 (liwork = -1, lrwork = -1), the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work, iwork, rwork). This operation is called a workspace query.

Note that if you set lwork (liwork, lrwork) to less than the minimal required value and not -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.

The real analogue of this routine is syevd. See also hpevd for matrices held in packed storage, and hbevd for banded matrices.

?syevx
Computes selected eigenvalues and, optionally, eigenvectors of a symmetric matrix.
Syntax

Fortran 77:
call ssyevx(jobz, range, uplo, n, a, lda, vl, vu, il, iu, abstol, m, w, z, ldz, work, lwork, iwork, ifail, info)
call dsyevx(jobz, range, uplo, n, a, lda, vl, vu, il, iu, abstol, m, w, z, ldz, work, lwork, iwork, ifail, info)

Fortran 95:
call syevx(a, w [,uplo] [,z] [,vl] [,vu] [,il] [,iu] [,m] [,ifail] [,abstol] [,info])

C:
lapack_int LAPACKE_<?>syevx( int matrix_order, char jobz, char range, char uplo, lapack_int n, <datatype>* a, lapack_int lda, <datatype> vl, <datatype> vu, lapack_int il, lapack_int iu, <datatype> abstol, lapack_int* m, <datatype>* w, <datatype>* z, lapack_int ldz, lapack_int* ifail );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description

The routine computes selected eigenvalues and, optionally, eigenvectors of a real symmetric matrix A. Eigenvalues and eigenvectors can be selected by specifying either a range of values or a range of indices for the desired eigenvalues.

Note that for most real symmetric eigenvalue problems the default choice should be the syevr function, as its underlying algorithm is faster and uses less workspace. ?syevx is faster for a few selected eigenvalues.

Input Parameters

The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

jobz CHARACTER*1. Must be 'N' or 'V'.
If jobz = 'N', then only eigenvalues are computed.
If jobz = 'V', then eigenvalues and eigenvectors are computed.

range CHARACTER*1. Must be 'A', 'V', or 'I'.
If range = 'A', all eigenvalues will be found.
If range = 'V', all eigenvalues in the half-open interval (vl, vu] will be found.
If range = 'I', the eigenvalues with indices il through iu will be found.

uplo CHARACTER*1. Must be 'U' or 'L'.
If uplo = 'U', a stores the upper triangular part of A.
If uplo = 'L', a stores the lower triangular part of A.

n INTEGER. The order of the matrix A (n ≥ 0).

a, work REAL for ssyevx
DOUBLE PRECISION for dsyevx.
Arrays:
a(lda,*) is an array containing either upper or lower triangular part of the symmetric matrix A, as specified by uplo. The second dimension of a must be at least max(1, n).
work is a workspace array, its dimension max(1, lwork).

lda INTEGER. The leading dimension of the array a. Must be at least max(1, n).

vl, vu REAL for ssyevx
DOUBLE PRECISION for dsyevx.
If range = 'V', the lower and upper bounds of the interval to be searched for eigenvalues; vl ≤ vu. Not referenced if range = 'A' or 'I'.

il, iu INTEGER.
If range = 'I', the indices of the smallest and largest eigenvalues to be returned.
Constraints: 1 ≤ il ≤ iu ≤ n, if n > 0; il = 1 and iu = 0, if n = 0.
Not referenced if range = 'A' or 'V'.

abstol REAL for ssyevx
DOUBLE PRECISION for dsyevx.
The absolute error tolerance for the eigenvalues. See Application Notes for more information.

ldz INTEGER. The leading dimension of the output array z; ldz ≥ 1. If jobz = 'V', then ldz ≥ max(1, n).

lwork INTEGER. The dimension of the array work.
If n ≤ 1 then lwork ≥ 1, otherwise lwork ≥ 8*n.
If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla. See Application Notes for the suggested value of lwork.

iwork INTEGER. Workspace array, DIMENSION at least max(1, 5n).

Output Parameters

a On exit, the lower triangle (if uplo = 'L') or the upper triangle (if uplo = 'U') of A, including the diagonal, is overwritten.

m INTEGER. The total number of eigenvalues found; 0 ≤ m ≤ n.
If range = 'A', m = n, and if range = 'I', m = iu-il+1.

w REAL for ssyevx
DOUBLE PRECISION for dsyevx
Array, DIMENSION at least max(1, n).
The first m elements contain the selected eigenvalues of the matrix A in ascending order.

z REAL for ssyevx
DOUBLE PRECISION for dsyevx.
Array z(ldz,*) contains eigenvectors. The second dimension of z must be at least max(1, m).
If jobz = 'V', then if info = 0, the first m columns of z contain the orthonormal eigenvectors of the matrix A corresponding to the selected eigenvalues, with the i-th column of z holding the eigenvector associated with w(i). If an eigenvector fails to converge, then that column of z contains the latest approximation to the eigenvector, and the index of the eigenvector is returned in ifail.
If jobz = 'N', then z is not referenced.
Note: you must ensure that at least max(1,m) columns are supplied in the array z; if range = 'V', the exact value of m is not known in advance and an upper bound must be used.

work(1) On exit, if lwork > 0, then work(1) returns the required minimal size of lwork.

ifail INTEGER. Array, DIMENSION at least max(1, n).
If jobz = 'V', then if info = 0, the first m elements of ifail are zero; if info > 0, then ifail contains the indices of the eigenvectors that failed to converge.
If jobz = 'N', then ifail is not referenced.

info INTEGER.
If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
If info = i, then i eigenvectors failed to converge; their indices are stored in the array ifail.

Fortran 95 Interface Notes

Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine syevx interface are the following:

a Holds the matrix A of size (n, n).
w Holds the vector of length n.
z Holds the matrix Z of size (n, n).
ifail Holds the vector of length n.
uplo Must be 'U' or 'L'. The default value is 'U'.
vl Default value for this element is vl = -HUGE(vl).
vu Default value for this element is vu = HUGE(vl).
il Default value for this argument is il = 1.
iu Default value for this argument is iu = n.
abstol Default value for this element is abstol = 0.0_WP.
jobz Restored based on the presence of the argument z as follows:
jobz = 'V', if z is present,
jobz = 'N', if z is omitted.
Note that there will be an error condition if ifail is present and z is omitted.
range Restored based on the presence of arguments vl, vu, il, iu as follows:
range = 'V', if one of or both vl and vu are present,
range = 'I', if one of or both il and iu are present,
range = 'A', if none of vl, vu, il, iu is present.
Note that there will be an error condition if one of or both vl and vu are present and at the same time one of or both il and iu are present.

Application Notes

For optimum performance use lwork ≥ (nb+3)*n, where nb is the maximum of the blocksize for ?sytrd and ?ormtr returned by ilaenv.

If it is not clear how much workspace to supply, use a generous value of lwork for the first run or set lwork = -1.

If lwork has any admissible size, that is, a size no less than the minimal value described, then the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the corresponding array work on exit. Use this value (work(1)) for subsequent runs.

If lwork = -1, then the routine returns immediately and provides the recommended workspace in the first element of the corresponding array work. This operation is called a workspace query.

Note that if lwork is less than the minimal required value and is not equal to -1, then the routine returns immediately with an error exit and does not provide any information on the recommended workspace.
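The three cases described above for lwork (workspace query, error exit, admissible size) can be summarized in a few lines of C. This is only a sketch of the documented rules for ?syevx, not MKL code: the type lwork_status and the functions syevx_min_lwork, syevx_recommended_lwork and syevx_classify_lwork are hypothetical names, nb is assumed to be supplied by the caller (for example, from ilaenv), and clamping the recommended size to the minimum is a choice made for this illustration.

```c
#include <assert.h>

/* Hypothetical classification of an lwork value passed to ?syevx. */
typedef enum {
    LWORK_QUERY,       /* lwork = -1: workspace query, no computation */
    LWORK_TOO_SMALL,   /* below the minimum: immediate error exit      */
    LWORK_ADMISSIBLE   /* at least the minimum: the task is completed  */
} lwork_status;

/* Minimal lwork for ?syevx: 1 if n <= 1, otherwise 8*n. */
long syevx_min_lwork(long n)
{
    assert(n >= 0);
    return (n <= 1) ? 1 : 8 * n;
}

/* Suggested lwork = (nb+3)*n, clamped so it never falls below the
   minimum (the clamping is an assumption of this sketch). */
long syevx_recommended_lwork(long n, long nb)
{
    assert(nb >= 1);
    long min_lwork = syevx_min_lwork(n);
    long suggested = (nb + 3) * n;
    return (suggested > min_lwork) ? suggested : min_lwork;
}

/* Mirrors the behavior described in the Application Notes. */
lwork_status syevx_classify_lwork(long n, long lwork)
{
    if (lwork == -1)
        return LWORK_QUERY;
    if (lwork < syevx_min_lwork(n))
        return LWORK_TOO_SMALL;
    return LWORK_ADMISSIBLE;
}
```

For n = 10 and an assumed block size nb = 32, the sketch suggests lwork = 350, while any value from 80 upward is admissible.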
An approximate eigenvalue is accepted as converged when it is determined to lie in an interval [a,b] of width less than or equal to abstol+ε*max(|a|,|b|), where ε is the machine precision.

If abstol is less than or equal to zero, then ε*|T| is used as tolerance, where |T| is the 1-norm of the tridiagonal matrix obtained by reducing A to tridiagonal form. Eigenvalues are computed most accurately when abstol is set to twice the underflow threshold 2*slamch('S'), not zero.

If this routine returns with info > 0, indicating that some eigenvectors did not converge, try setting abstol to 2*slamch('S').

?heevx
Computes selected eigenvalues and, optionally, eigenvectors of a Hermitian matrix.

Syntax

Fortran 77:
call cheevx(jobz, range, uplo, n, a, lda, vl, vu, il, iu, abstol, m, w, z, ldz, work, lwork, rwork, iwork, ifail, info)
call zheevx(jobz, range, uplo, n, a, lda, vl, vu, il, iu, abstol, m, w, z, ldz, work, lwork, rwork, iwork, ifail, info)

Fortran 95:
call heevx(a, w [,uplo] [,z] [,vl] [,vu] [,il] [,iu] [,m] [,ifail] [,abstol] [,info])

C:
lapack_int LAPACKE_cheevx( int matrix_order, char jobz, char range, char uplo, lapack_int n, lapack_complex_float* a, lapack_int lda, float vl, float vu, lapack_int il, lapack_int iu, float abstol, lapack_int* m, float* w, lapack_complex_float* z, lapack_int ldz, lapack_int* ifail );
lapack_int LAPACKE_zheevx( int matrix_order, char jobz, char range, char uplo, lapack_int n, lapack_complex_double* a, lapack_int lda, double vl, double vu, lapack_int il, lapack_int iu, double abstol, lapack_int* m, double* w, lapack_complex_double* z, lapack_int ldz, lapack_int* ifail );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description

The routine computes selected eigenvalues and, optionally, eigenvectors of a complex Hermitian matrix A. Eigenvalues and eigenvectors can be selected by specifying either a range of values or a range of indices for the desired eigenvalues.
Note that for most complex Hermitian eigenvalue problems the default choice should be the heevr function, as its underlying algorithm is faster and uses less workspace. ?heevx is faster for a few selected eigenvalues.

Input Parameters

The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

jobz CHARACTER*1. Must be 'N' or 'V'.
If jobz = 'N', then only eigenvalues are computed.
If jobz = 'V', then eigenvalues and eigenvectors are computed.

range CHARACTER*1. Must be 'A', 'V', or 'I'.
If range = 'A', all eigenvalues will be found.
If range = 'V', all eigenvalues in the half-open interval (vl, vu] will be found.
If range = 'I', the eigenvalues with indices il through iu will be found.

uplo CHARACTER*1. Must be 'U' or 'L'.
If uplo = 'U', a stores the upper triangular part of A.
If uplo = 'L', a stores the lower triangular part of A.

n INTEGER. The order of the matrix A (n ≥ 0).

a, work COMPLEX for cheevx
DOUBLE COMPLEX for zheevx.
Arrays:
a(lda,*) is an array containing either upper or lower triangular part of the Hermitian matrix A, as specified by uplo. The second dimension of a must be at least max(1, n).
work is a workspace array, its dimension max(1, lwork).

lda INTEGER. The leading dimension of the array a. Must be at least max(1, n).

vl, vu REAL for cheevx
DOUBLE PRECISION for zheevx.
If range = 'V', the lower and upper bounds of the interval to be searched for eigenvalues; vl ≤ vu. Not referenced if range = 'A' or 'I'.

il, iu INTEGER.
If range = 'I', the indices of the smallest and largest eigenvalues to be returned.
Constraints: 1 ≤ il ≤ iu ≤ n, if n > 0; il = 1 and iu = 0, if n = 0.
Not referenced if range = 'A' or 'V'.

abstol REAL for cheevx
DOUBLE PRECISION for zheevx.
The absolute error tolerance for the eigenvalues. See Application Notes for more information.

ldz INTEGER. The leading dimension of the output array z; ldz ≥ 1. If jobz = 'V', then ldz ≥ max(1, n).

lwork INTEGER. The dimension of the array work.
lwork ≥ 1 if n ≤ 1; otherwise at least 2*n.
If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla. See Application Notes for the suggested value of lwork.

rwork REAL for cheevx
DOUBLE PRECISION for zheevx.
Workspace array, DIMENSION at least max(1, 7n).

iwork INTEGER. Workspace array, DIMENSION at least max(1, 5n).

Output Parameters

a On exit, the lower triangle (if uplo = 'L') or the upper triangle (if uplo = 'U') of A, including the diagonal, is overwritten.

m INTEGER. The total number of eigenvalues found; 0 ≤ m ≤ n.
If range = 'A', m = n, and if range = 'I', m = iu-il+1.

w REAL for cheevx
DOUBLE PRECISION for zheevx
Array, DIMENSION at least max(1, n). The first m elements contain the selected eigenvalues of the matrix A in ascending order.

z COMPLEX for cheevx
DOUBLE COMPLEX for zheevx.
Array z(ldz,*) contains eigenvectors. The second dimension of z must be at least max(1, m).
If jobz = 'V', then if info = 0, the first m columns of z contain the orthonormal eigenvectors of the matrix A corresponding to the selected eigenvalues, with the i-th column of z holding the eigenvector associated with w(i). If an eigenvector fails to converge, then that column of z contains the latest approximation to the eigenvector, and the index of the eigenvector is returned in ifail.
If jobz = 'N', then z is not referenced.
Note: you must ensure that at least max(1,m) columns are supplied in the array z; if range = 'V', the exact value of m is not known in advance and an upper bound must be used.
work(1) On exit, if lwork > 0, then work(1) returns the required minimal size of lwork.

ifail INTEGER. Array, DIMENSION at least max(1, n).
If jobz = 'V', then if info = 0, the first m elements of ifail are zero; if info > 0, then ifail contains the indices of the eigenvectors that failed to converge.
If jobz = 'N', then ifail is not referenced.

info INTEGER.
If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
If info = i, then i eigenvectors failed to converge; their indices are stored in the array ifail.

Fortran 95 Interface Notes

Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine heevx interface are the following:

a Holds the matrix A of size (n, n).
w Holds the vector of length n.
z Holds the matrix Z of size (n, n).
ifail Holds the vector of length n.
uplo Must be 'U' or 'L'. The default value is 'U'.
vl Default value for this element is vl = -HUGE(vl).
vu Default value for this element is vu = HUGE(vl).
il Default value for this argument is il = 1.
iu Default value for this argument is iu = n.
abstol Default value for this element is abstol = 0.0_WP.
jobz Restored based on the presence of the argument z as follows:
jobz = 'V', if z is present,
jobz = 'N', if z is omitted.
Note that there will be an error condition if ifail is present and z is omitted.
range Restored based on the presence of arguments vl, vu, il, iu as follows:
range = 'V', if one of or both vl and vu are present,
range = 'I', if one of or both il and iu are present,
range = 'A', if none of vl, vu, il, iu is present.
Note that there will be an error condition if one of or both vl and vu are present and at the same time one of or both il and iu are present.
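The rules above for restoring jobz and range from the presence of optional arguments amount to a small decision table. The sketch below expresses that table in C for illustration only; it is not the Fortran 95 wrapper itself, and the names restore_jobz, restore_range and ifail_without_z are hypothetical. A return value of '\0' is used here, as an assumption of the sketch, to signal the documented error condition.

```c
#include <assert.h>
#include <stdbool.h>

/* jobz = 'V' if z is present, 'N' if z is omitted. */
char restore_jobz(bool z_present)
{
    return z_present ? 'V' : 'N';
}

/* Documented error condition: ifail present while z is omitted. */
bool ifail_without_z(bool ifail_present, bool z_present)
{
    return ifail_present && !z_present;
}

/* range = 'V' if vl and/or vu is present, 'I' if il and/or iu is
   present, 'A' if none is present. Mixing the two groups is the
   documented error condition, reported here as '\0'. */
char restore_range(bool vl_present, bool vu_present,
                   bool il_present, bool iu_present)
{
    bool by_value = vl_present || vu_present;
    bool by_index = il_present || iu_present;

    if (by_value && by_index)
        return '\0';
    if (by_value)
        return 'V';
    if (by_index)
        return 'I';
    return 'A';
}
```

For example, a call that supplies only iu selects range = 'I' with il taking its default value of 1, while a call that supplies both vl and il is rejected.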
Application Notes

For optimum performance use lwork ≥ (nb+1)*n, where nb is the maximum of the blocksize for ?hetrd and ?unmtr returned by ilaenv.

If you are in doubt how much workspace to supply, use a generous value of lwork for the first run or set lwork = -1.

If you choose the first option and set lwork to any admissible size, that is, a size no less than the minimal value described, the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the corresponding array work on exit. Use this value (work(1)) for subsequent runs.

If you set lwork = -1, the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work). This operation is called a workspace query.

Note that if you set lwork to less than the minimal required value and not -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.

An approximate eigenvalue is accepted as converged when it is determined to lie in an interval [a,b] of width less than or equal to abstol+ε*max(|a|,|b|), where ε is the machine precision.

If abstol is less than or equal to zero, then ε*|T| will be used in its place, where |T| is the 1-norm of the tridiagonal matrix obtained by reducing A to tridiagonal form. Eigenvalues will be computed most accurately when abstol is set to twice the underflow threshold 2*slamch('S'), not zero.

If this routine returns with info > 0, indicating that some eigenvectors did not converge, try setting abstol to 2*slamch('S').

?syevr
Computes selected eigenvalues and, optionally, eigenvectors of a real symmetric matrix using the Relatively Robust Representations.
Syntax

Fortran 77:
call ssyevr(jobz, range, uplo, n, a, lda, vl, vu, il, iu, abstol, m, w, z, ldz, isuppz, work, lwork, iwork, liwork, info)
call dsyevr(jobz, range, uplo, n, a, lda, vl, vu, il, iu, abstol, m, w, z, ldz, isuppz, work, lwork, iwork, liwork, info)

Fortran 95:
call syevr(a, w [,uplo] [,z] [,vl] [,vu] [,il] [,iu] [,m] [,isuppz] [,abstol] [,info])

C:
lapack_int LAPACKE_<?>syevr( int matrix_order, char jobz, char range, char uplo, lapack_int n, <datatype>* a, lapack_int lda, <datatype> vl, <datatype> vu, lapack_int il, lapack_int iu, <datatype> abstol, lapack_int* m, <datatype>* w, <datatype>* z, lapack_int ldz, lapack_int* isuppz );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description

The routine computes selected eigenvalues and, optionally, eigenvectors of a real symmetric matrix A. Eigenvalues and eigenvectors can be selected by specifying either a range of values or a range of indices for the desired eigenvalues.

The routine first reduces the matrix A to tridiagonal form T with a call to sytrd. Then, whenever possible, ?syevr calls stemr to compute the eigenspectrum using Relatively Robust Representations. stemr computes eigenvalues by the dqds algorithm, while orthogonal eigenvectors are computed from various "good" L*D*L^T representations (also known as Relatively Robust Representations). Gram-Schmidt orthogonalization is avoided as far as possible. More specifically, the various steps of the algorithm are as follows. For each unreduced block of T:

a. Compute T - σ*I = L*D*L^T, so that L and D define all the wanted eigenvalues to high relative accuracy. This means that small relative changes in the entries of D and L cause only small relative changes in the eigenvalues and eigenvectors. The standard (unfactored) representation of the tridiagonal matrix T does not have this property in general.
b. Compute the eigenvalues to suitable accuracy.
If the eigenvectors are desired, the algorithm attains full accuracy of the computed eigenvalues only right before the corresponding vectors have to be computed, see Steps c) and d).
c. For each cluster of close eigenvalues, select a new shift close to the cluster, find a new factorization, and refine the shifted eigenvalues to suitable accuracy.
d. For each eigenvalue with a large enough relative separation, compute the corresponding eigenvector by forming a rank revealing twisted factorization. Go back to Step c) for any clusters that remain.

The desired accuracy of the output can be specified by the input parameter abstol.

The routine ?syevr calls stemr when the full spectrum is requested on machines that conform to the IEEE-754 floating point standard. ?syevr calls stebz and stein on non-IEEE machines and when partial spectrum requests are made.

Note that ?syevr is preferable for most cases of real symmetric eigenvalue problems as its underlying algorithm is fast and uses less workspace.

Input Parameters

The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

jobz CHARACTER*1. Must be 'N' or 'V'.
If jobz = 'N', then only eigenvalues are computed.
If jobz = 'V', then eigenvalues and eigenvectors are computed.

range CHARACTER*1. Must be 'A' or 'V' or 'I'.
If range = 'A', the routine computes all eigenvalues.
If range = 'V', the routine computes eigenvalues lambda(i) in the half-open interval: vl < lambda(i) ≤ vu.
If range = 'I', the routine computes eigenvalues with indices il to iu.
For range = 'V' or 'I' and iu-il < n-1, sstebz/dstebz and sstein/dstein are called.

uplo CHARACTER*1. Must be 'U' or 'L'.
If uplo = 'U', a stores the upper triangular part of A.
If uplo = 'L', a stores the lower triangular part of A.

n INTEGER. The order of the matrix A (n ≥ 0).

a, work REAL for ssyevr
DOUBLE PRECISION for dsyevr.
Arrays:
a(lda,*) is an array containing either upper or lower triangular part of the symmetric matrix A, as specified by uplo. The second dimension of a must be at least max(1, n).
work is a workspace array, its dimension max(1, lwork).

lda INTEGER. The leading dimension of the array a. Must be at least max(1, n).

vl, vu REAL for ssyevr
DOUBLE PRECISION for dsyevr.
If range = 'V', the lower and upper bounds of the interval to be searched for eigenvalues.
Constraint: vl < vu.
If range = 'A' or 'I', vl and vu are not referenced.

il, iu INTEGER.
If range = 'I', the indices in ascending order of the smallest and largest eigenvalues to be returned.
Constraint: 1 ≤ il ≤ iu ≤ n, if n > 0; il = 1 and iu = 0, if n = 0.
If range = 'A' or 'V', il and iu are not referenced.

abstol REAL for ssyevr
DOUBLE PRECISION for dsyevr.
The absolute error tolerance to which each eigenvalue/eigenvector is required.
If jobz = 'V', the eigenvalues and eigenvectors output have residual norms bounded by abstol, and the dot products between different eigenvectors are bounded by abstol.
If abstol < n*eps*|T|, then n*eps*|T| is used instead, where eps is the machine precision, and |T| is the 1-norm of the matrix T. The eigenvalues are computed to an accuracy of eps*|T| irrespective of abstol.
If high relative accuracy is important, set abstol to ?lamch('S').

ldz INTEGER. The leading dimension of the output array z.
Constraints: ldz ≥ 1, and ldz ≥ max(1, n) if jobz = 'V'.

lwork INTEGER. The dimension of the array work.
Constraint: lwork ≥ max(1, 26n).
If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla.
See Application Notes for the suggested value of lwork.

iwork INTEGER. Workspace array, its dimension max(1, liwork).

liwork INTEGER. The dimension of the array iwork, liwork ≥ max(1, 10n).
If liwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the iwork array, returns this value as the first entry of the iwork array, and no error message related to liwork is issued by xerbla.

Output Parameters

a On exit, the lower triangle (if uplo = 'L') or the upper triangle (if uplo = 'U') of A, including the diagonal, is overwritten.

m INTEGER. The total number of eigenvalues found, 0 ≤ m ≤ n.
If range = 'A', m = n, and if range = 'I', m = iu-il+1.

w, z REAL for ssyevr
DOUBLE PRECISION for dsyevr.
Arrays:
w(*), DIMENSION at least max(1, n), contains the selected eigenvalues in ascending order, stored in w(1) to w(m);
z(ldz, *), the second dimension of z must be at least max(1, m).
If jobz = 'V', then if info = 0, the first m columns of z contain the orthonormal eigenvectors of the matrix A corresponding to the selected eigenvalues, with the i-th column of z holding the eigenvector associated with w(i).
If jobz = 'N', then z is not referenced.
Note that you must ensure that at least max(1, m) columns are supplied in the array z; if range = 'V', the exact value of m is not known in advance and an upper bound must be used.

isuppz INTEGER. Array, DIMENSION at least 2*max(1, m).
The support of the eigenvectors in z, that is, the indices indicating the nonzero elements in z. The i-th eigenvector is nonzero only in elements isuppz(2i-1) through isuppz(2i). Referenced only if eigenvectors are needed (jobz = 'V') and all eigenvalues are needed, that is, range = 'A' or range = 'I' and il = 1 and iu = n.

work(1) On exit, if info = 0, then work(1) returns the required minimal size of lwork.

iwork(1) On exit, if info = 0, then iwork(1) returns the required minimal size of liwork.

info INTEGER.
If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value. If info = i, an internal error has occurred.

Fortran 95 Interface Notes

Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.

Specific details for the routine syevr interface are the following:

a Holds the matrix A of size (n, n).

w Holds the vector of length n.

z Holds the matrix Z of size (n, n), where the values n and m are significant.

isuppz Holds the vector of length (2*m), where the values (2*m) are significant.

uplo Must be 'U' or 'L'. The default value is 'U'.

vl Default value for this element is vl = -HUGE(vl).

vu Default value for this element is vu = HUGE(vl).

il Default value for this argument is il = 1.

iu Default value for this argument is iu = n.

abstol Default value for this element is abstol = 0.0_WP.

jobz Restored based on the presence of the argument z as follows: jobz = 'V', if z is present; jobz = 'N', if z is omitted. Note that there will be an error condition if isuppz is present and z is omitted.

range Restored based on the presence of arguments vl, vu, il, iu as follows: range = 'V', if one of or both vl and vu are present; range = 'I', if one of or both il and iu are present; range = 'A', if none of vl, vu, il, iu is present. Note that there will be an error condition if one of or both vl and vu are present and at the same time one of or both il and iu are present.

Application Notes

For optimum performance use lwork = (nb+6)*n, where nb is the maximum of the blocksize for ?sytrd and ?ormtr returned by ilaenv.

If it is not clear how much workspace to supply, use a generous value of lwork (or liwork) for the first run or set lwork = -1 (liwork = -1).
If lwork (or liwork) has any admissible size, that is, a size no less than the minimal value described, then the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the corresponding array (work, iwork) on exit. Use this value (work(1), iwork(1)) for subsequent runs.

If lwork = -1 (liwork = -1), then the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work, iwork). This operation is called a workspace query.

Note that if lwork (liwork) is less than the minimal required value and is not equal to -1, then the routine returns immediately with an error exit and does not provide any information on the recommended workspace.

Normal execution of ?stegr may create NaNs and infinities and hence may abort due to a floating point exception in environments which do not handle NaNs and infinities in the IEEE standard default manner.

?heevr

Computes selected eigenvalues and, optionally, eigenvectors of a Hermitian matrix using the Relatively Robust Representations.
Syntax

Fortran 77:

call cheevr(jobz, range, uplo, n, a, lda, vl, vu, il, iu, abstol, m, w, z, ldz, isuppz, work, lwork, rwork, lrwork, iwork, liwork, info)

call zheevr(jobz, range, uplo, n, a, lda, vl, vu, il, iu, abstol, m, w, z, ldz, isuppz, work, lwork, rwork, lrwork, iwork, liwork, info)

Fortran 95:

call heevr(a, w [,uplo] [,z] [,vl] [,vu] [,il] [,iu] [,m] [,isuppz] [,abstol] [,info])

C:

lapack_int LAPACKE_cheevr( int matrix_order, char jobz, char range, char uplo, lapack_int n, lapack_complex_float* a, lapack_int lda, float vl, float vu, lapack_int il, lapack_int iu, float abstol, lapack_int* m, float* w, lapack_complex_float* z, lapack_int ldz, lapack_int* isuppz );

lapack_int LAPACKE_zheevr( int matrix_order, char jobz, char range, char uplo, lapack_int n, lapack_complex_double* a, lapack_int lda, double vl, double vu, lapack_int il, lapack_int iu, double abstol, lapack_int* m, double* w, lapack_complex_double* z, lapack_int ldz, lapack_int* isuppz );

Include Files

• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description

The routine computes selected eigenvalues and, optionally, eigenvectors of a complex Hermitian matrix A. Eigenvalues and eigenvectors can be selected by specifying either a range of values or a range of indices for the desired eigenvalues.

The routine first reduces the matrix A to tridiagonal form T with a call to hetrd. Then, whenever possible, ?heevr calls stegr to compute the eigenspectrum using Relatively Robust Representations. ?stegr computes eigenvalues by the dqds algorithm, while orthogonal eigenvectors are computed from various "good" L*D*L^T representations (also known as Relatively Robust Representations). Gram-Schmidt orthogonalization is avoided as far as possible. More specifically, the various steps of the algorithm are as follows. For each unreduced block (submatrix) of T:

a.
Compute T - s*I = L*D*L^T, so that L and D define all the wanted eigenvalues to high relative accuracy. This means that small relative changes in the entries of D and L cause only small relative changes in the eigenvalues and eigenvectors. The standard (unfactored) representation of the tridiagonal matrix T does not have this property in general.

b. Compute the eigenvalues to suitable accuracy. If the eigenvectors are desired, the algorithm attains full accuracy of the computed eigenvalues only right before the corresponding vectors have to be computed, see Steps c) and d).

c. For each cluster of close eigenvalues, select a new shift close to the cluster, find a new factorization, and refine the shifted eigenvalues to suitable accuracy.

d. For each eigenvalue with a large enough relative separation, compute the corresponding eigenvector by forming a rank revealing twisted factorization. Go back to Step c) for any clusters that remain.

The desired accuracy of the output can be specified by the input parameter abstol.

The routine ?heevr calls stemr when the full spectrum is requested on machines which conform to the IEEE-754 floating point standard, or stebz and stein on non-IEEE machines and when partial spectrum requests are made.

Note that the routine ?heevr is preferable for most cases of complex Hermitian eigenvalue problems as its underlying algorithm is fast and uses less workspace.

Input Parameters

The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

jobz CHARACTER*1. Must be 'N' or 'V'. If jobz = 'N', then only eigenvalues are computed. If jobz = 'V', then eigenvalues and eigenvectors are computed.

range CHARACTER*1. Must be 'A' or 'V' or 'I'. If range = 'A', the routine computes all eigenvalues.
If range = 'V', the routine computes eigenvalues lambda(i) in the half-open interval: vl < lambda(i) <= vu. If range = 'I', the routine computes eigenvalues with indices il to iu. For range = 'V' or 'I', sstebz/dstebz and cstein/zstein are called.

uplo CHARACTER*1. Must be 'U' or 'L'. If uplo = 'U', a stores the upper triangular part of A. If uplo = 'L', a stores the lower triangular part of A.

n INTEGER. The order of the matrix A (n >= 0).

a, work COMPLEX for cheevr; DOUBLE COMPLEX for zheevr. Arrays: a(lda,*) is an array containing either the upper or the lower triangular part of the Hermitian matrix A, as specified by uplo. The second dimension of a must be at least max(1, n). work is a workspace array of dimension max(1, lwork).

lda INTEGER. The leading dimension of the array a. Must be at least max(1, n).

vl, vu REAL for cheevr; DOUBLE PRECISION for zheevr. If range = 'V', the lower and upper bounds of the interval to be searched for eigenvalues. Constraint: vl < vu. If range = 'A' or 'I', vl and vu are not referenced.

il, iu INTEGER. If range = 'I', the indices in ascending order of the smallest and largest eigenvalues to be returned. Constraint: 1 <= il <= iu <= n, if n > 0; il = 1 and iu = 0 if n = 0. If range = 'A' or 'V', il and iu are not referenced.

abstol REAL for cheevr; DOUBLE PRECISION for zheevr. The absolute error tolerance to which each eigenvalue/eigenvector is required. If jobz = 'V', the eigenvalues and eigenvectors output have residual norms bounded by abstol, and the dot products between different eigenvectors are bounded by abstol. If abstol < n*eps*|T|, then n*eps*|T| will be used in its place, where eps is the machine precision, and |T| is the 1-norm of the matrix T. The eigenvalues are computed to an accuracy of eps*|T| irrespective of abstol. If high relative accuracy is important, set abstol to ?lamch('S').

ldz INTEGER. The leading dimension of the output array z.
Constraints: ldz >= 1 if jobz = 'N'; ldz >= max(1, n) if jobz = 'V'.

lwork INTEGER. The dimension of the array work. Constraint: lwork >= max(1, 2n). If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work, rwork and iwork arrays, returns these values as the first entries of the work, rwork and iwork arrays, and no error message related to lwork or lrwork or liwork is issued by xerbla. See Application Notes for the suggested value of lwork.

rwork REAL for cheevr; DOUBLE PRECISION for zheevr. Workspace array, DIMENSION max(1, lrwork).

lrwork INTEGER. The dimension of the array rwork; lrwork >= max(1, 24n). If lrwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work, rwork and iwork arrays, returns these values as the first entries of the work, rwork and iwork arrays, and no error message related to lwork or lrwork or liwork is issued by xerbla.

iwork INTEGER. Workspace array of dimension max(1, liwork).

liwork INTEGER. The dimension of the array iwork, liwork >= max(1, 10n). If liwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work, rwork and iwork arrays, returns these values as the first entries of the work, rwork and iwork arrays, and no error message related to lwork or lrwork or liwork is issued by xerbla.

Output Parameters

a On exit, the lower triangle (if uplo = 'L') or the upper triangle (if uplo = 'U') of A, including the diagonal, is overwritten.

m INTEGER. The total number of eigenvalues found, 0 <= m <= n. If range = 'A', m = n, and if range = 'I', m = iu-il+1.

w REAL for cheevr; DOUBLE PRECISION for zheevr. Array, DIMENSION at least max(1, n), contains the selected eigenvalues in ascending order, stored in w(1) to w(m).

z COMPLEX for cheevr; DOUBLE COMPLEX for zheevr. Array z(ldz, *); the second dimension of z must be at least max(1, m).
If jobz = 'V', then if info = 0, the first m columns of z contain the orthonormal eigenvectors of the matrix T corresponding to the selected eigenvalues, with the i-th column of z holding the eigenvector associated with w(i). If jobz = 'N', then z is not referenced. Note: you must ensure that at least max(1, m) columns are supplied in the array z; if range = 'V', the exact value of m is not known in advance and an upper bound must be used.

isuppz INTEGER. Array, DIMENSION at least 2*max(1, m). The support of the eigenvectors in z, that is, the indices indicating the nonzero elements in z. The i-th eigenvector is nonzero only in elements isuppz(2i-1) through isuppz(2i). Referenced only if eigenvectors are needed (jobz = 'V') and all eigenvalues are needed, that is, range = 'A' or range = 'I' and il = 1 and iu = n.

work(1) On exit, if info = 0, then work(1) returns the required minimal size of lwork.

rwork(1) On exit, if info = 0, then rwork(1) returns the required minimal size of lrwork.

iwork(1) On exit, if info = 0, then iwork(1) returns the required minimal size of liwork.

info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value. If info = i, an internal error has occurred.

Fortran 95 Interface Notes

Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.

Specific details for the routine heevr interface are the following:

a Holds the matrix A of size (n, n).

w Holds the vector of length n.

z Holds the matrix Z of size (n, n), where the values n and m are significant.

isuppz Holds the vector of length (2*n), where the values (2*m) are significant.

uplo Must be 'U' or 'L'. The default value is 'U'.

vl Default value for this element is vl = -HUGE(vl).
vu Default value for this element is vu = HUGE(vl).

il Default value for this argument is il = 1.

iu Default value for this argument is iu = n.

abstol Default value for this element is abstol = 0.0_WP.

jobz Restored based on the presence of the argument z as follows: jobz = 'V', if z is present; jobz = 'N', if z is omitted. Note that there will be an error condition if isuppz is present and z is omitted.

range Restored based on the presence of arguments vl, vu, il, iu as follows: range = 'V', if one of or both vl and vu are present; range = 'I', if one of or both il and iu are present; range = 'A', if none of vl, vu, il, iu is present. Note that there will be an error condition if one of or both vl and vu are present and at the same time one of or both il and iu are present.

Application Notes

For optimum performance use lwork = (nb+1)*n, where nb is the maximum of the blocksize for ?hetrd and ?unmtr returned by ilaenv.

If you are in doubt how much workspace to supply, use a generous value of lwork (or lrwork, or liwork) for the first run or set lwork = -1 (lrwork = -1, liwork = -1).

If you choose the first option and set any admissible lwork (or lrwork, liwork) size, that is, a size no less than the minimal value described, the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the corresponding array (work, rwork, iwork) on exit. Use this value (work(1), rwork(1), iwork(1)) for subsequent runs.

If you set lwork = -1, the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work, rwork, iwork). This operation is called a workspace query.

Note that if you set lwork (lrwork, liwork) to less than the minimal required value and not -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.
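As a rough illustration of the sizes discussed in this section, the following C sketch collects the minimal workspace sizes stated for ?heevr (lwork >= max(1, 2n), lrwork >= max(1, 24n), liwork >= max(1, 10n)) together with the suggested lwork = (nb+1)*n. The type heevr_workspace and the functions heevr_min_workspace and heevr_suggested_lwork are hypothetical helpers, not part of Intel MKL, and nb is assumed to be supplied by the caller (for example, from an ilaenv query). A workspace query with lwork = -1 remains the authoritative way to obtain the recommended sizes.

```c
#include <assert.h>

/* Hypothetical helper type (not part of Intel MKL):
   workspace lengths for a Fortran 77 call to ?heevr. */
typedef struct {
    int lwork;   /* length of the complex array work  */
    int lrwork;  /* length of the real array rwork    */
    int liwork;  /* length of the integer array iwork */
} heevr_workspace;

static int max_int(int a, int b) { return a > b ? a : b; }

/* Minimal sizes as stated in the parameter descriptions. */
heevr_workspace heevr_min_workspace(int n)
{
    heevr_workspace ws;
    assert(n >= 0);
    ws.lwork  = max_int(1, 2 * n);
    ws.lrwork = max_int(1, 24 * n);
    ws.liwork = max_int(1, 10 * n);
    return ws;
}

/* Suggested lwork = (nb+1)*n, never below the stated minimum.
   nb is an assumption here: the caller supplies the block size. */
int heevr_suggested_lwork(int n, int nb)
{
    assert(n >= 0 && nb >= 1);
    return max_int(heevr_min_workspace(n).lwork, (nb + 1) * n);
}
```

For n = 100 the minimal sizes are 200, 2400, and 1000; with an assumed block size nb = 32 the suggested lwork is 3300.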
Normal execution of ?stegr may create NaNs and infinities and hence may abort due to a floating point exception in environments which do not handle NaNs and infinities in the IEEE standard default manner.

?spev

Computes all eigenvalues and, optionally, eigenvectors of a real symmetric matrix in packed storage.

Syntax

Fortran 77:

call sspev(jobz, uplo, n, ap, w, z, ldz, work, info)

call dspev(jobz, uplo, n, ap, w, z, ldz, work, info)

Fortran 95:

call spev(ap, w [,uplo] [,z] [,info])

C:

lapack_int LAPACKE_<?>spev( int matrix_order, char jobz, char uplo, lapack_int n, <datatype>* ap, <datatype>* w, <datatype>* z, lapack_int ldz );

Include Files

• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description

The routine computes all the eigenvalues and, optionally, eigenvectors of a real symmetric matrix A in packed storage.

Input Parameters

The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

jobz CHARACTER*1. Must be 'N' or 'V'. If jobz = 'N', then only eigenvalues are computed. If jobz = 'V', then eigenvalues and eigenvectors are computed.

uplo CHARACTER*1. Must be 'U' or 'L'. If uplo = 'U', ap stores the packed upper triangular part of A. If uplo = 'L', ap stores the packed lower triangular part of A.

n INTEGER. The order of the matrix A (n >= 0).

ap, work REAL for sspev; DOUBLE PRECISION for dspev. Arrays: ap(*) contains the packed upper or lower triangle of the symmetric matrix A, as specified by uplo. The dimension of ap must be at least max(1, n*(n+1)/2). work(*) is a workspace array, DIMENSION at least max(1, 3n).

ldz INTEGER. The leading dimension of the output array z. Constraints: if jobz = 'N', then ldz >= 1; if jobz = 'V', then ldz >= max(1, n).
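To illustrate the layout that ap is expected to hold, the sketch below packs the upper or lower triangle of a symmetric matrix by columns, as described under Matrix Storage Schemes. It assumes column-major (Fortran-style) storage addressed with zero-based C indices. The function pack_symmetric is a hypothetical helper, not an Intel MKL routine.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not an Intel MKL routine).
   Packs one triangle of the n-by-n symmetric matrix a (column-major,
   leading dimension lda) into ap, column by column.
   ap must have room for n*(n+1)/2 elements. */
void pack_symmetric(char uplo, int n, const double *a, int lda, double *ap)
{
    size_t k = 0;
    assert(uplo == 'U' || uplo == 'L');
    assert(n >= 0 && lda >= (n > 1 ? n : 1));
    for (int j = 0; j < n; ++j) {
        int first = (uplo == 'U') ? 0 : j;      /* first row stored in column j */
        int last  = (uplo == 'U') ? j : n - 1;  /* last row stored in column j  */
        for (int i = first; i <= last; ++i)
            ap[k++] = a[(size_t)j * lda + i];
    }
}
```

For the 3-by-3 matrix with rows (1 2 3), (2 4 5), (3 5 6), uplo = 'U' yields ap = {1, 2, 4, 3, 5, 6} and uplo = 'L' yields ap = {1, 2, 3, 4, 5, 6}.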
Output Parameters

w, z REAL for sspev; DOUBLE PRECISION for dspev. Arrays: w(*), DIMENSION at least max(1, n). If info = 0, w contains the eigenvalues of the matrix A in ascending order. z(ldz,*). The second dimension of z must be at least max(1, n). If jobz = 'V', then if info = 0, z contains the orthonormal eigenvectors of the matrix A, with the i-th column of z holding the eigenvector associated with w(i). If jobz = 'N', then z is not referenced.

ap On exit, this array is overwritten by the values generated during the reduction to tridiagonal form. The elements of the diagonal and the off-diagonal of the tridiagonal matrix overwrite the corresponding elements of A.

info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value. If info = i, then the algorithm failed to converge; i indicates the number of elements of an intermediate tridiagonal form which did not converge to zero.

Fortran 95 Interface Notes

Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.

Specific details for the routine spev interface are the following:

ap Holds the array A of size (n*(n+1)/2).

w Holds the vector with the number of elements n.

z Holds the matrix Z of size (n, n).

uplo Must be 'U' or 'L'. The default value is 'U'.

jobz Restored based on the presence of the argument z as follows: jobz = 'V', if z is present; jobz = 'N', if z is omitted.

?hpev

Computes all eigenvalues and, optionally, eigenvectors of a Hermitian matrix in packed storage.
Syntax

Fortran 77:

call chpev(jobz, uplo, n, ap, w, z, ldz, work, rwork, info)

call zhpev(jobz, uplo, n, ap, w, z, ldz, work, rwork, info)

Fortran 95:

call hpev(ap, w [,uplo] [,z] [,info])

C:

lapack_int LAPACKE_chpev( int matrix_order, char jobz, char uplo, lapack_int n, lapack_complex_float* ap, float* w, lapack_complex_float* z, lapack_int ldz );

lapack_int LAPACKE_zhpev( int matrix_order, char jobz, char uplo, lapack_int n, lapack_complex_double* ap, double* w, lapack_complex_double* z, lapack_int ldz );

Include Files

• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description

The routine computes all the eigenvalues and, optionally, eigenvectors of a complex Hermitian matrix A in packed storage.

Input Parameters

The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

jobz CHARACTER*1. Must be 'N' or 'V'. If jobz = 'N', then only eigenvalues are computed. If jobz = 'V', then eigenvalues and eigenvectors are computed.

uplo CHARACTER*1. Must be 'U' or 'L'. If uplo = 'U', ap stores the packed upper triangular part of A. If uplo = 'L', ap stores the packed lower triangular part of A.

n INTEGER. The order of the matrix A (n >= 0).

ap, work COMPLEX for chpev; DOUBLE COMPLEX for zhpev. Arrays: ap(*) contains the packed upper or lower triangle of the Hermitian matrix A, as specified by uplo. The dimension of ap must be at least max(1, n*(n+1)/2). work(*) is a workspace array, DIMENSION at least max(1, 2n-1).

ldz INTEGER. The leading dimension of the output array z. Constraints: if jobz = 'N', then ldz >= 1; if jobz = 'V', then ldz >= max(1, n).

rwork REAL for chpev; DOUBLE PRECISION for zhpev. Workspace array, DIMENSION at least max(1, 3n-2).
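Because only one triangle of a Hermitian matrix is kept in ap, an element of the other triangle is recovered as the complex conjugate of its mirror image. The sketch below shows this lookup for zero-based indices and column-wise packing. The type cplx and the function hp_get are hypothetical illustrations, not Intel MKL definitions; real code would use the complex type of the chosen interface.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical complex type, for illustration only. */
typedef struct { double re, im; } cplx;

/* Returns A(i,j) (zero-based) of an n-by-n Hermitian matrix whose
   uplo triangle is packed by columns in ap. */
cplx hp_get(char uplo, int n, const cplx *ap, int i, int j)
{
    assert(uplo == 'U' || uplo == 'L');
    assert(0 <= i && i < n && 0 <= j && j < n);
    /* Does (i,j) lie inside the stored triangle? */
    int stored = (uplo == 'U') ? (i <= j) : (i >= j);
    int r = stored ? i : j;   /* row of the stored element    */
    int c = stored ? j : i;   /* column of the stored element */
    size_t k = (uplo == 'U')
        ? (size_t)r + (size_t)c * (c + 1) / 2
        : (size_t)r + (size_t)c * (2 * n - c - 1) / 2;
    cplx v = ap[k];
    if (!stored)
        v.im = -v.im;         /* conjugate the mirrored element */
    return v;
}
```

For the 2-by-2 matrix with rows (2, i) and (-i, 3), the upper packed array is {2, i, 3}; hp_get returns i for element (0,1) and -i for element (1,0).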
Output Parameters

w REAL for chpev; DOUBLE PRECISION for zhpev. Array, DIMENSION at least max(1, n). If info = 0, w contains the eigenvalues of the matrix A in ascending order.

z COMPLEX for chpev; DOUBLE COMPLEX for zhpev. Array z(ldz,*). The second dimension of z must be at least max(1, n). If jobz = 'V', then if info = 0, z contains the orthonormal eigenvectors of the matrix A, with the i-th column of z holding the eigenvector associated with w(i). If jobz = 'N', then z is not referenced.

ap On exit, this array is overwritten by the values generated during the reduction to tridiagonal form. The elements of the diagonal and the off-diagonal of the tridiagonal matrix overwrite the corresponding elements of A.

info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value. If info = i, then the algorithm failed to converge; i indicates the number of elements of an intermediate tridiagonal form which did not converge to zero.

Fortran 95 Interface Notes

Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.

Specific details for the routine hpev interface are the following:

ap Holds the array A of size (n*(n+1)/2).

w Holds the vector with the number of elements n.

z Holds the matrix Z of size (n, n).

uplo Must be 'U' or 'L'. The default value is 'U'.

jobz Restored based on the presence of the argument z as follows: jobz = 'V', if z is present; jobz = 'N', if z is omitted.

?spevd

Uses divide and conquer algorithm to compute all eigenvalues and (optionally) all eigenvectors of a real symmetric matrix held in packed storage.
Syntax

Fortran 77:

call sspevd(jobz, uplo, n, ap, w, z, ldz, work, lwork, iwork, liwork, info)

call dspevd(jobz, uplo, n, ap, w, z, ldz, work, lwork, iwork, liwork, info)

Fortran 95:

call spevd(ap, w [,uplo] [,z] [,info])

C:

lapack_int LAPACKE_<?>spevd( int matrix_order, char jobz, char uplo, lapack_int n, <datatype>* ap, <datatype>* w, <datatype>* z, lapack_int ldz );

Include Files

• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description

The routine computes all the eigenvalues, and optionally all the eigenvectors, of a real symmetric matrix A (held in packed storage). In other words, it can compute the spectral factorization of A as: A = Z*Λ*Z^T. Here Λ is a diagonal matrix whose diagonal elements are the eigenvalues λ_i, and Z is the orthogonal matrix whose columns are the eigenvectors z_i. Thus, A*z_i = λ_i*z_i for i = 1, 2, ..., n.

If the eigenvectors are requested, then this routine uses a divide and conquer algorithm to compute eigenvalues and eigenvectors. However, if only eigenvalues are required, then it uses the Pal-Walker-Kahan variant of the QL or QR algorithm.

Input Parameters

The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

jobz CHARACTER*1. Must be 'N' or 'V'. If jobz = 'N', then only eigenvalues are computed. If jobz = 'V', then eigenvalues and eigenvectors are computed.

uplo CHARACTER*1. Must be 'U' or 'L'. If uplo = 'U', ap stores the packed upper triangular part of A. If uplo = 'L', ap stores the packed lower triangular part of A.

n INTEGER. The order of the matrix A (n >= 0).

ap, work REAL for sspevd; DOUBLE PRECISION for dspevd. Arrays: ap(*) contains the packed upper or lower triangle of symmetric matrix A, as specified by uplo.
The dimension of ap must be at least max(1, n*(n+1)/2). work is a workspace array of dimension max(1, lwork).

ldz INTEGER. The leading dimension of the output array z. Constraints: if jobz = 'N', then ldz >= 1; if jobz = 'V', then ldz >= max(1, n).

lwork INTEGER. The dimension of the array work. Constraints: if n <= 1, then lwork >= 1; if jobz = 'N' and n > 1, then lwork >= 2*n; if jobz = 'V' and n > 1, then lwork >= n^2 + 6*n + 1. If lwork = -1, then a workspace query is assumed; the routine only calculates the required sizes of the work and iwork arrays, returns these values as the first entries of the work and iwork arrays, and no error message related to lwork or liwork is issued by xerbla. See Application Notes for details.

iwork INTEGER. Workspace array of dimension max(1, liwork).

liwork INTEGER. The dimension of the array iwork. Constraints: if n <= 1, then liwork >= 1; if jobz = 'N' and n > 1, then liwork >= 1; if jobz = 'V' and n > 1, then liwork >= 5*n+3. If liwork = -1, then a workspace query is assumed; the routine only calculates the required sizes of the work and iwork arrays, returns these values as the first entries of the work and iwork arrays, and no error message related to lwork or liwork is issued by xerbla. See Application Notes for details.

Output Parameters

w, z REAL for sspevd; DOUBLE PRECISION for dspevd. Arrays: w(*), DIMENSION at least max(1, n). If info = 0, contains the eigenvalues of the matrix A in ascending order. See also info. z(ldz,*). The second dimension of z must be: at least 1 if jobz = 'N'; at least max(1, n) if jobz = 'V'. If jobz = 'V', then this array is overwritten by the orthogonal matrix Z which contains the eigenvectors of A. If jobz = 'N', then z is not referenced.

ap On exit, this array is overwritten by the values generated during the reduction to tridiagonal form. The elements of the diagonal and the off-diagonal of the tridiagonal matrix overwrite the corresponding elements of A.
work(1) On exit, if info = 0, then work(1) returns the required lwork.

iwork(1) On exit, if info = 0, then iwork(1) returns the required liwork.

info INTEGER. If info = 0, the execution is successful. If info = i, then the algorithm failed to converge; i indicates the number of elements of an intermediate tridiagonal form which did not converge to zero. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes

Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.

Specific details for the routine spevd interface are the following:

ap Holds the array A of size (n*(n+1)/2).

w Holds the vector with the number of elements n.

z Holds the matrix Z of size (n, n).

uplo Must be 'U' or 'L'. The default value is 'U'.

jobz Restored based on the presence of the argument z as follows: jobz = 'V', if z is present; jobz = 'N', if z is omitted.

Application Notes

The computed eigenvalues and eigenvectors are exact for a matrix T+E such that ||E||_2 = O(ε)*||T||_2, where ε is the machine precision.

If it is not clear how much workspace to supply, use a generous value of lwork (or liwork) for the first run or set lwork = -1 (liwork = -1).

If lwork (or liwork) has any admissible size, that is, a size no less than the minimal value described, the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the corresponding array (work, iwork) on exit. Use this value (work(1), iwork(1)) for subsequent runs.

If lwork = -1 (liwork = -1), the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work, iwork). This operation is called a workspace query.
Note that if lwork (liwork) is less than the minimal required value and is not equal to -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.

The complex analogue of this routine is hpevd.

See also syevd for matrices held in full storage, and sbevd for banded matrices.

?hpevd

Uses divide and conquer algorithm to compute all eigenvalues and (optionally) all eigenvectors of a complex Hermitian matrix held in packed storage.

Syntax

Fortran 77:

call chpevd(jobz, uplo, n, ap, w, z, ldz, work, lwork, rwork, lrwork, iwork, liwork, info)

call zhpevd(jobz, uplo, n, ap, w, z, ldz, work, lwork, rwork, lrwork, iwork, liwork, info)

Fortran 95:

call hpevd(ap, w [,uplo] [,z] [,info])

C:

lapack_int LAPACKE_chpevd( int matrix_order, char jobz, char uplo, lapack_int n, lapack_complex_float* ap, float* w, lapack_complex_float* z, lapack_int ldz );

lapack_int LAPACKE_zhpevd( int matrix_order, char jobz, char uplo, lapack_int n, lapack_complex_double* ap, double* w, lapack_complex_double* z, lapack_int ldz );

Include Files

• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description

The routine computes all the eigenvalues, and optionally all the eigenvectors, of a complex Hermitian matrix A (held in packed storage). In other words, it can compute the spectral factorization of A as: A = Z*Λ*Z^H. Here Λ is a real diagonal matrix whose diagonal elements are the eigenvalues λ_i, and Z is the (complex) unitary matrix whose columns are the eigenvectors z_i. Thus, A*z_i = λ_i*z_i for i = 1, 2, ..., n.

If the eigenvectors are requested, then this routine uses a divide and conquer algorithm to compute eigenvalues and eigenvectors. However, if only eigenvalues are required, then it uses the Pal-Walker-Kahan variant of the QL or QR algorithm.

Input Parameters

The data types are given for the Fortran interface.
A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

jobz CHARACTER*1. Must be 'N' or 'V'. If jobz = 'N', then only eigenvalues are computed. If jobz = 'V', then eigenvalues and eigenvectors are computed.

uplo CHARACTER*1. Must be 'U' or 'L'. If uplo = 'U', ap stores the packed upper triangular part of A. If uplo = 'L', ap stores the packed lower triangular part of A.

n INTEGER. The order of the matrix A (n >= 0).

ap, work COMPLEX for chpevd; DOUBLE COMPLEX for zhpevd. Arrays: ap(*) contains the packed upper or lower triangle of the Hermitian matrix A, as specified by uplo. The dimension of ap must be at least max(1, n*(n+1)/2). work is a workspace array of dimension max(1, lwork).

ldz INTEGER. The leading dimension of the output array z. Constraints: if jobz = 'N', then ldz >= 1; if jobz = 'V', then ldz >= max(1, n).

lwork INTEGER. The dimension of the array work. Constraints: if n <= 1, then lwork >= 1; if jobz = 'N' and n > 1, then lwork >= n; if jobz = 'V' and n > 1, then lwork >= 2*n. If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work, rwork and iwork arrays, returns these values as the first entries of the work, rwork and iwork arrays, and no error message related to lwork or lrwork or liwork is issued by xerbla. See Application Notes for details.

rwork REAL for chpevd; DOUBLE PRECISION for zhpevd. Workspace array of dimension max(1, lrwork).

lrwork INTEGER. The dimension of the array rwork. Constraints: if n <= 1, then lrwork >= 1; if jobz = 'N' and n > 1, then lrwork >= n; if jobz = 'V' and n > 1, then lrwork >= 2*n^2 + 5*n + 1.
If lrwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work, rwork and iwork arrays, returns these values as the first entries of the work, rwork and iwork arrays, and no error message related to lwork or lrwork or liwork is issued by xerbla. See Application Notes for details.

iwork INTEGER. Workspace array of dimension max(1, liwork).

liwork INTEGER. The dimension of the array iwork. Constraints: if n <= 1, then liwork >= 1; if jobz = 'N' and n > 1, then liwork >= 1; if jobz = 'V' and n > 1, then liwork >= 5*n+3. If liwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work, rwork and iwork arrays, returns these values as the first entries of the work, rwork and iwork arrays, and no error message related to lwork or lrwork or liwork is issued by xerbla. See Application Notes for details.

Output Parameters

w REAL for chpevd; DOUBLE PRECISION for zhpevd. Array, DIMENSION at least max(1, n). If info = 0, contains the eigenvalues of the matrix A in ascending order. See also info.

z COMPLEX for chpevd; DOUBLE COMPLEX for zhpevd. Array, DIMENSION (ldz,*). The second dimension of z must be: at least 1 if jobz = 'N'; at least max(1, n) if jobz = 'V'. If jobz = 'V', then this array is overwritten by the unitary matrix Z which contains the eigenvectors of A. If jobz = 'N', then z is not referenced.

ap On exit, this array is overwritten by the values generated during the reduction to tridiagonal form. The elements of the diagonal and the off-diagonal of the tridiagonal matrix overwrite the corresponding elements of A.

work(1) On exit, if info = 0, then work(1) returns the required minimal size of lwork.

rwork(1) On exit, if info = 0, then rwork(1) returns the required minimal size of lrwork.

iwork(1) On exit, if info = 0, then iwork(1) returns the required minimal size of liwork.

info INTEGER.
If info = 0, the execution is successful. If info = i, then the algorithm failed to converge; i indicates the number of elements of an intermediate tridiagonal form which did not converge to zero. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions. Specific details for the routine hpevd interface are the following:
ap Holds the array A of size (n*(n+1)/2).
w Holds the vector with the number of elements n.
z Holds the matrix Z of size (n, n).
uplo Must be 'U' or 'L'. The default value is 'U'.
jobz Restored based on the presence of the argument z as follows: jobz = 'V', if z is present; jobz = 'N', if z is omitted.

Application Notes
The computed eigenvalues and eigenvectors are exact for a matrix T + E such that ||E||_2 = O(ε)*||T||_2, where ε is the machine precision.
If you are in doubt how much workspace to supply, use a generous value of lwork (liwork or lrwork) for the first run or set lwork = -1 (liwork = -1, lrwork = -1).
If you choose the first option and supply any admissible lwork (liwork or lrwork) size, that is, one no less than the minimal value described, the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the corresponding array (work, iwork, rwork) on exit. Use this value (work(1), iwork(1), rwork(1)) for subsequent runs.
If you set lwork = -1 (liwork = -1, lrwork = -1), the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work, iwork, rwork). This operation is called a workspace query.
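The minimal workspace lengths listed for ?hpevd under lwork, lrwork and liwork can be collected in one small helper. The following is only a sketch in C of those constraints: the names hpevd_sizes and hpevd_min_workspace are hypothetical and not part of Intel MKL, and the sizes recommended by a workspace query may be larger than these minima.

```c
#include <assert.h>

/* Minimal workspace lengths for ?hpevd, transcribed from the lwork, lrwork
   and liwork constraints in the parameter descriptions.
   hpevd_sizes and hpevd_min_workspace are illustrative names, not MKL API. */
typedef struct {
    int lwork;   /* minimal length of the complex array work  */
    int lrwork;  /* minimal length of the real array rwork    */
    int liwork;  /* minimal length of the integer array iwork */
} hpevd_sizes;

static hpevd_sizes hpevd_min_workspace(char jobz, int n)
{
    hpevd_sizes s = {1, 1, 1};   /* n <= 1: one element in every array */

    assert(n >= 0);
    assert(jobz == 'N' || jobz == 'V');

    if (n > 1 && jobz == 'N') {  /* eigenvalues only */
        s.lwork  = n;
        s.lrwork = n;
        s.liwork = 1;
    } else if (n > 1) {          /* jobz == 'V': eigenvalues and eigenvectors */
        s.lwork  = 2 * n;
        s.lrwork = 2 * n * n + 5 * n + 1;
        s.liwork = 5 * n + 3;
    }
    return s;
}
```

For example, with jobz = 'V' and n = 4 the helper yields lwork = 8, lrwork = 53 and liwork = 23. Arrays allocated with these lengths satisfy the stated constraints, but a workspace query remains the way to obtain the recommended sizes.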
Note that if you set lwork (liwork, lrwork) to less than the minimal required value and not -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.
The real analogue of this routine is spevd. See also heevd for matrices held in full storage, and hbevd for banded matrices.

?spevx
Computes selected eigenvalues and, optionally, eigenvectors of a real symmetric matrix in packed storage.

Syntax
Fortran 77:
call sspevx(jobz, range, uplo, n, ap, vl, vu, il, iu, abstol, m, w, z, ldz, work, iwork, ifail, info)
call dspevx(jobz, range, uplo, n, ap, vl, vu, il, iu, abstol, m, w, z, ldz, work, iwork, ifail, info)
Fortran 95:
call spevx(ap, w [,uplo] [,z] [,vl] [,vu] [,il] [,iu] [,m] [,ifail] [,abstol] [,info])
C:
lapack_int LAPACKE_<?>spevx( int matrix_order, char jobz, char range, char uplo, lapack_int n, <datatype>* ap, <datatype> vl, <datatype> vu, lapack_int il, lapack_int iu, <datatype> abstol, lapack_int* m, <datatype>* w, <datatype>* z, lapack_int ldz, lapack_int* ifail );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine computes selected eigenvalues and, optionally, eigenvectors of a real symmetric matrix A in packed storage. Eigenvalues and eigenvectors can be selected by specifying either a range of values or a range of indices for the desired eigenvalues.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
jobz CHARACTER*1. Must be 'N' or 'V'. If jobz = 'N', then only eigenvalues are computed. If jobz = 'V', then eigenvalues and eigenvectors are computed.
range CHARACTER*1. Must be 'A' or 'V' or 'I'. If range = 'A', the routine computes all eigenvalues.
If range = 'V', the routine computes eigenvalues lambda(i) in the half-open interval: vl < lambda(i) ≤ vu. If range = 'I', the routine computes eigenvalues with indices il to iu.
uplo CHARACTER*1. Must be 'U' or 'L'. If uplo = 'U', ap stores the packed upper triangular part of A. If uplo = 'L', ap stores the packed lower triangular part of A.
n INTEGER. The order of the matrix A (n ≥ 0).
ap, work REAL for sspevx; DOUBLE PRECISION for dspevx. Arrays: ap(*) contains the packed upper or lower triangle of the symmetric matrix A, as specified by uplo. The dimension of ap must be at least max(1, n*(n+1)/2). work(*) is a workspace array, DIMENSION at least max(1, 8n).
vl, vu REAL for sspevx; DOUBLE PRECISION for dspevx. If range = 'V', the lower and upper bounds of the interval to be searched for eigenvalues. Constraint: vl < vu. If range = 'A' or 'I', vl and vu are not referenced.
il, iu INTEGER. If range = 'I', the indices in ascending order of the smallest and largest eigenvalues to be returned. Constraint: 1 ≤ il ≤ iu ≤ n, if n > 0; il = 1 and iu = 0 if n = 0. If range = 'A' or 'V', il and iu are not referenced.
abstol REAL for sspevx; DOUBLE PRECISION for dspevx. The absolute error tolerance to which each eigenvalue is required. See Application Notes for details on error tolerance.
ldz INTEGER. The leading dimension of the output array z. Constraints: if jobz = 'N', then ldz ≥ 1; if jobz = 'V', then ldz ≥ max(1, n).
iwork INTEGER. Workspace array, DIMENSION at least max(1, 5n).

Output Parameters
ap On exit, this array is overwritten by the values generated during the reduction to tridiagonal form. The elements of the diagonal and the off-diagonal of the tridiagonal matrix overwrite the corresponding elements of A.
m INTEGER. The total number of eigenvalues found, 0 ≤ m ≤ n. If range = 'A', m = n, and if range = 'I', m = iu-il+1.
w, z REAL for sspevx; DOUBLE PRECISION for dspevx. Arrays: w(*), DIMENSION at least max(1, n).
If info = 0, contains the selected eigenvalues of the matrix A in ascending order. z(ldz,*). The second dimension of z must be at least max(1, m). If jobz = 'V', then if info = 0, the first m columns of z contain the orthonormal eigenvectors of the matrix A corresponding to the selected eigenvalues, with the i-th column of z holding the eigenvector associated with w(i). If an eigenvector fails to converge, then that column of z contains the latest approximation to the eigenvector, and the index of the eigenvector is returned in ifail. If jobz = 'N', then z is not referenced. Note: you must ensure that at least max(1,m) columns are supplied in the array z; if range = 'V', the exact value of m is not known in advance and an upper bound must be used.
ifail INTEGER. Array, DIMENSION at least max(1, n). If jobz = 'V', then if info = 0, the first m elements of ifail are zero; if info > 0, ifail contains the indices of the eigenvectors that failed to converge. If jobz = 'N', then ifail is not referenced.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value. If info = i, then i eigenvectors failed to converge; their indices are stored in the array ifail.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions. Specific details for the routine spevx interface are the following:
ap Holds the array A of size (n*(n+1)/2).
w Holds the vector with the number of elements n.
z Holds the matrix Z of size (n, n), where the values n and m are significant.
ifail Holds the vector with the number of elements n.
uplo Must be 'U' or 'L'. The default value is 'U'.
vl Default value for this element is vl = -HUGE(vl).
vu Default value for this element is vu = HUGE(vl).
il Default value for this argument is il = 1.
iu Default value for this argument is iu = n.
abstol Default value for this element is abstol = 0.0_WP.
jobz Restored based on the presence of the argument z as follows: jobz = 'V', if z is present; jobz = 'N', if z is omitted. Note that there will be an error condition if ifail is present and z is omitted.
range Restored based on the presence of arguments vl, vu, il, iu as follows: range = 'V', if one of or both vl and vu are present; range = 'I', if one of or both il and iu are present; range = 'A', if none of vl, vu, il, iu is present. Note that there will be an error condition if one of or both vl and vu are present and at the same time one of or both il and iu are present.

Application Notes
An approximate eigenvalue is accepted as converged when it is determined to lie in an interval [a,b] of width less than or equal to abstol+ε*max(|a|,|b|), where ε is the machine precision. If abstol is less than or equal to zero, then ε*||T||_1 will be used in its place, where T is the tridiagonal matrix obtained by reducing A to tridiagonal form. Eigenvalues will be computed most accurately when abstol is set to twice the underflow threshold 2*?lamch('S'), not zero. If this routine returns with info > 0, indicating that some eigenvectors did not converge, try setting abstol to 2*?lamch('S').

?hpevx
Computes selected eigenvalues and, optionally, eigenvectors of a Hermitian matrix in packed storage.
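Syntax
Fortran 77:
call chpevx(jobz, range, uplo, n, ap, vl, vu, il, iu, abstol, m, w, z, ldz, work, rwork, iwork, ifail, info)
call zhpevx(jobz, range, uplo, n, ap, vl, vu, il, iu, abstol, m, w, z, ldz, work, rwork, iwork, ifail, info)
Fortran 95:
call hpevx(ap, w [,uplo] [,z] [,vl] [,vu] [,il] [,iu] [,m] [,ifail] [,abstol] [,info])
C:
lapack_int LAPACKE_chpevx( int matrix_order, char jobz, char range, char uplo, lapack_int n, lapack_complex_float* ap, float vl, float vu, lapack_int il, lapack_int iu, float abstol, lapack_int* m, float* w, lapack_complex_float* z, lapack_int ldz, lapack_int* ifail );
lapack_int LAPACKE_zhpevx( int matrix_order, char jobz, char range, char uplo, lapack_int n, lapack_complex_double* ap, double vl, double vu, lapack_int il, lapack_int iu, double abstol, lapack_int* m, double* w, lapack_complex_double* z, lapack_int ldz, lapack_int* ifail );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine computes selected eigenvalues and, optionally, eigenvectors of a complex Hermitian matrix A in packed storage. Eigenvalues and eigenvectors can be selected by specifying either a range of values or a range of indices for the desired eigenvalues.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
jobz CHARACTER*1. Must be 'N' or 'V'. If jobz = 'N', then only eigenvalues are computed. If jobz = 'V', then eigenvalues and eigenvectors are computed.
range CHARACTER*1. Must be 'A' or 'V' or 'I'. If range = 'A', the routine computes all eigenvalues. If range = 'V', the routine computes eigenvalues lambda(i) in the half-open interval: vl < lambda(i) ≤ vu. If range = 'I', the routine computes eigenvalues with indices il to iu.
uplo CHARACTER*1. Must be 'U' or 'L'. If uplo = 'U', ap stores the packed upper triangular part of A. If uplo = 'L', ap stores the packed lower triangular part of A.
n INTEGER. The order of the matrix A (n ≥ 0).
ap, work COMPLEX for chpevx; DOUBLE COMPLEX for zhpevx. Arrays: ap(*) contains the packed upper or lower triangle of the Hermitian matrix A, as specified by uplo. The dimension of ap must be at least max(1, n*(n+1)/2). work(*) is a workspace array, DIMENSION at least max(1, 2n).
vl, vu REAL for chpevx; DOUBLE PRECISION for zhpevx. If range = 'V', the lower and upper bounds of the interval to be searched for eigenvalues. Constraint: vl < vu. If range = 'A' or 'I', vl and vu are not referenced.
il, iu INTEGER. If range = 'I', the indices in ascending order of the smallest and largest eigenvalues to be returned. Constraint: 1 ≤ il ≤ iu ≤ n, if n > 0; il = 1 and iu = 0 if n = 0. If range = 'A' or 'V', il and iu are not referenced.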
abstol REAL for chpevx; DOUBLE PRECISION for zhpevx. The absolute error tolerance to which each eigenvalue is required. See Application Notes for details on error tolerance.
ldz INTEGER. The leading dimension of the output array z. Constraints: if jobz = 'N', then ldz ≥ 1; if jobz = 'V', then ldz ≥ max(1, n).
rwork REAL for chpevx; DOUBLE PRECISION for zhpevx. Workspace array, DIMENSION at least max(1, 7n).
iwork INTEGER. Workspace array, DIMENSION at least max(1, 5n).

Output Parameters
ap On exit, this array is overwritten by the values generated during the reduction to tridiagonal form. The elements of the diagonal and the off-diagonal of the tridiagonal matrix overwrite the corresponding elements of A.
m INTEGER. The total number of eigenvalues found, 0 ≤ m ≤ n. If range = 'A', m = n, and if range = 'I', m = iu-il+1.
w REAL for chpevx; DOUBLE PRECISION for zhpevx. Array, DIMENSION at least max(1, n). If info = 0, contains the selected eigenvalues of the matrix A in ascending order.
z COMPLEX for chpevx; DOUBLE COMPLEX for zhpevx. Array z(ldz,*). The second dimension of z must be at least max(1, m). If jobz = 'V', then if info = 0, the first m columns of z contain the orthonormal eigenvectors of the matrix A corresponding to the selected eigenvalues, with the i-th column of z holding the eigenvector associated with w(i). If an eigenvector fails to converge, then that column of z contains the latest approximation to the eigenvector, and the index of the eigenvector is returned in ifail. If jobz = 'N', then z is not referenced. Note: you must ensure that at least max(1,m) columns are supplied in the array z; if range = 'V', the exact value of m is not known in advance and an upper bound must be used.
ifail INTEGER. Array, DIMENSION at least max(1, n). If jobz = 'V', then if info = 0, the first m elements of ifail are zero; if info > 0, ifail contains the indices of the eigenvectors that failed to converge.
If jobz = 'N', then ifail is not referenced.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value. If info = i, then i eigenvectors failed to converge; their indices are stored in the array ifail.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions. Specific details for the routine hpevx interface are the following:
ap Holds the array A of size (n*(n+1)/2).
w Holds the vector with the number of elements n.
z Holds the matrix Z of size (n, n), where the values n and m are significant.
ifail Holds the vector with the number of elements n.
uplo Must be 'U' or 'L'. The default value is 'U'.
vl Default value for this element is vl = -HUGE(vl).
vu Default value for this element is vu = HUGE(vl).
il Default value for this argument is il = 1.
iu Default value for this argument is iu = n.
abstol Default value for this element is abstol = 0.0_WP.
jobz Restored based on the presence of the argument z as follows: jobz = 'V', if z is present; jobz = 'N', if z is omitted. Note that there will be an error condition if ifail is present and z is omitted.
range Restored based on the presence of arguments vl, vu, il, iu as follows: range = 'V', if one of or both vl and vu are present; range = 'I', if one of or both il and iu are present; range = 'A', if none of vl, vu, il, iu is present. Note that there will be an error condition if one of or both vl and vu are present and at the same time one of or both il and iu are present.

Application Notes
An approximate eigenvalue is accepted as converged when it is determined to lie in an interval [a,b] of width less than or equal to abstol+ε*max(|a|,|b|), where ε is the machine precision.
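The acceptance criterion just stated can be written as a one-line test. The sketch below, in C, is only an illustration of that inequality: interval_converged and abs_d are hypothetical helper names rather than MKL routines, and the machine precision eps is assumed to be supplied by the caller.

```c
#include <assert.h>

/* Absolute value without depending on the math library. */
static double abs_d(double x)
{
    return x < 0.0 ? -x : x;
}

/* Acceptance test for an eigenvalue bracketed by the interval [a, b]:
   converged when (b - a) <= abstol + eps * max(|a|, |b|).
   eps is the machine precision, passed in by the caller (an assumption of
   this sketch); interval_converged is an illustrative name. */
static int interval_converged(double a, double b, double abstol, double eps)
{
    double abs_a = abs_d(a);
    double abs_b = abs_d(b);
    double scale = abs_a > abs_b ? abs_a : abs_b;

    assert(a <= b);
    return (b - a) <= abstol + eps * scale;
}
```

The relative term eps*max(|a|,|b|) means that a large eigenvalue can be accepted on a wider interval than abstol alone would allow, while eigenvalues near zero are governed mainly by abstol.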
If abstol is less than or equal to zero, then ε*||T||_1 will be used in its place, where T is the tridiagonal matrix obtained by reducing A to tridiagonal form. Eigenvalues will be computed most accurately when abstol is set to twice the underflow threshold 2*?lamch('S'), not zero. If this routine returns with info > 0, indicating that some eigenvectors did not converge, try setting abstol to 2*?lamch('S').

?sbev
Computes all eigenvalues and, optionally, eigenvectors of a real symmetric band matrix.

Syntax
Fortran 77:
call ssbev(jobz, uplo, n, kd, ab, ldab, w, z, ldz, work, info)
call dsbev(jobz, uplo, n, kd, ab, ldab, w, z, ldz, work, info)
Fortran 95:
call sbev(ab, w [,uplo] [,z] [,info])
C:
lapack_int LAPACKE_<?>sbev( int matrix_order, char jobz, char uplo, lapack_int n, lapack_int kd, <datatype>* ab, lapack_int ldab, <datatype>* w, <datatype>* z, lapack_int ldz );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine computes all eigenvalues and, optionally, eigenvectors of a real symmetric band matrix A.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
jobz CHARACTER*1. Must be 'N' or 'V'. If jobz = 'N', then only eigenvalues are computed. If jobz = 'V', then eigenvalues and eigenvectors are computed.
uplo CHARACTER*1. Must be 'U' or 'L'. If uplo = 'U', ab stores the upper triangular part of A. If uplo = 'L', ab stores the lower triangular part of A.
n INTEGER. The order of the matrix A (n ≥ 0).
kd INTEGER. The number of super- or sub-diagonals in A (kd ≥ 0).
ab, work REAL for ssbev; DOUBLE PRECISION for dsbev.
Arrays: ab(ldab,*) is an array containing either upper or lower triangular part of the symmetric matrix A (as specified by uplo) in band storage format. The second dimension of ab must be at least max(1, n). work(*) is a workspace array. The dimension of work must be at least max(1, 3n-2).
ldab INTEGER. The leading dimension of ab; must be at least kd+1.
ldz INTEGER. The leading dimension of the output array z. Constraints: if jobz = 'N', then ldz ≥ 1; if jobz = 'V', then ldz ≥ max(1, n).

Output Parameters
w, z REAL for ssbev; DOUBLE PRECISION for dsbev. Arrays: w(*), DIMENSION at least max(1, n). If info = 0, contains the eigenvalues of the matrix A in ascending order. z(ldz,*). The second dimension of z must be at least max(1, n). If jobz = 'V', then if info = 0, z contains the orthonormal eigenvectors of the matrix A, with the i-th column of z holding the eigenvector associated with w(i). If jobz = 'N', then z is not referenced.
ab On exit, this array is overwritten by the values generated during the reduction to tridiagonal form. If uplo = 'U', the first superdiagonal and the diagonal of the tridiagonal matrix T are returned in rows kd and kd+1 of ab, and if uplo = 'L', the diagonal and first subdiagonal of T are returned in the first two rows of ab.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value. If info = i, then the algorithm failed to converge; i indicates the number of elements of an intermediate tridiagonal form which did not converge to zero.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions. Specific details for the routine sbev interface are the following:
ab Holds the array A of size (kd+1,n).
w Holds the vector with the number of elements n.
z Holds the matrix Z of size (n, n).
uplo Must be 'U' or 'L'. The default value is 'U'.
jobz Restored based on the presence of the argument z as follows: jobz = 'V', if z is present; jobz = 'N', if z is omitted.

?hbev
Computes all eigenvalues and, optionally, eigenvectors of a Hermitian band matrix.

Syntax
Fortran 77:
call chbev(jobz, uplo, n, kd, ab, ldab, w, z, ldz, work, rwork, info)
call zhbev(jobz, uplo, n, kd, ab, ldab, w, z, ldz, work, rwork, info)
Fortran 95:
call hbev(ab, w [,uplo] [,z] [,info])
C:
lapack_int LAPACKE_chbev( int matrix_order, char jobz, char uplo, lapack_int n, lapack_int kd, lapack_complex_float* ab, lapack_int ldab, float* w, lapack_complex_float* z, lapack_int ldz );
lapack_int LAPACKE_zhbev( int matrix_order, char jobz, char uplo, lapack_int n, lapack_int kd, lapack_complex_double* ab, lapack_int ldab, double* w, lapack_complex_double* z, lapack_int ldz );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine computes all eigenvalues and, optionally, eigenvectors of a complex Hermitian band matrix A.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
jobz CHARACTER*1. Must be 'N' or 'V'. If jobz = 'N', then only eigenvalues are computed. If jobz = 'V', then eigenvalues and eigenvectors are computed.
uplo CHARACTER*1. Must be 'U' or 'L'. If uplo = 'U', ab stores the upper triangular part of A. If uplo = 'L', ab stores the lower triangular part of A.
n INTEGER. The order of the matrix A (n ≥ 0).
kd INTEGER. The number of super- or sub-diagonals in A (kd ≥ 0).
ab, work COMPLEX for chbev; DOUBLE COMPLEX for zhbev.
Arrays: ab(ldab,*) is an array containing either upper or lower triangular part of the Hermitian matrix A (as specified by uplo) in band storage format. The second dimension of ab must be at least max(1, n). work(*) is a workspace array. The dimension of work must be at least max(1, n).
ldab INTEGER. The leading dimension of ab; must be at least kd+1.
ldz INTEGER. The leading dimension of the output array z. Constraints: if jobz = 'N', then ldz ≥ 1; if jobz = 'V', then ldz ≥ max(1, n).
rwork REAL for chbev; DOUBLE PRECISION for zhbev. Workspace array, DIMENSION at least max(1, 3n-2).

Output Parameters
w REAL for chbev; DOUBLE PRECISION for zhbev. Array, DIMENSION at least max(1, n). If info = 0, contains the eigenvalues in ascending order.
z COMPLEX for chbev; DOUBLE COMPLEX for zhbev. Array z(ldz,*). The second dimension of z must be at least max(1, n). If jobz = 'V', then if info = 0, z contains the orthonormal eigenvectors of the matrix A, with the i-th column of z holding the eigenvector associated with w(i). If jobz = 'N', then z is not referenced.
ab On exit, this array is overwritten by the values generated during the reduction to tridiagonal form. If uplo = 'U', the first superdiagonal and the diagonal of the tridiagonal matrix T are returned in rows kd and kd+1 of ab, and if uplo = 'L', the diagonal and first subdiagonal of T are returned in the first two rows of ab.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value. If info = i, then the algorithm failed to converge; i indicates the number of elements of an intermediate tridiagonal form which did not converge to zero.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine hbev interface are the following:
ab Holds the array A of size (kd+1,n).
w Holds the vector with the number of elements n.
z Holds the matrix Z of size (n, n).
uplo Must be 'U' or 'L'. The default value is 'U'.
jobz Restored based on the presence of the argument z as follows: jobz = 'V', if z is present; jobz = 'N', if z is omitted.

?sbevd
Computes all eigenvalues and (optionally) all eigenvectors of a real symmetric band matrix using divide and conquer algorithm.

Syntax
Fortran 77:
call ssbevd(jobz, uplo, n, kd, ab, ldab, w, z, ldz, work, lwork, iwork, liwork, info)
call dsbevd(jobz, uplo, n, kd, ab, ldab, w, z, ldz, work, lwork, iwork, liwork, info)
Fortran 95:
call sbevd(ab, w [,uplo] [,z] [,info])
C:
lapack_int LAPACKE_<?>sbevd( int matrix_order, char jobz, char uplo, lapack_int n, lapack_int kd, <datatype>* ab, lapack_int ldab, <datatype>* w, <datatype>* z, lapack_int ldz );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine computes all the eigenvalues, and optionally all the eigenvectors, of a real symmetric band matrix A. In other words, it can compute the spectral factorization of A as: A = Z*Λ*Z^T. Here Λ is a diagonal matrix whose diagonal elements are the eigenvalues λ_i, and Z is the orthogonal matrix whose columns are the eigenvectors z_i. Thus, A*z_i = λ_i*z_i for i = 1, 2, ..., n.
If the eigenvectors are requested, then this routine uses a divide and conquer algorithm to compute eigenvalues and eigenvectors. However, if only eigenvalues are required, then it uses the Pal-Walker-Kahan variant of the QL or QR algorithm.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
jobz CHARACTER*1. Must be 'N' or 'V'.
If jobz = 'N', then only eigenvalues are computed. If jobz = 'V', then eigenvalues and eigenvectors are computed.
uplo CHARACTER*1. Must be 'U' or 'L'. If uplo = 'U', ab stores the upper triangular part of A. If uplo = 'L', ab stores the lower triangular part of A.
n INTEGER. The order of the matrix A (n ≥ 0).
kd INTEGER. The number of super- or sub-diagonals in A (kd ≥ 0).
ab, work REAL for ssbevd; DOUBLE PRECISION for dsbevd. Arrays: ab(ldab,*) is an array containing either upper or lower triangular part of the symmetric matrix A (as specified by uplo) in band storage format. The second dimension of ab must be at least max(1, n). work is a workspace array of dimension max(1, lwork).
ldab INTEGER. The leading dimension of ab; must be at least kd+1.
ldz INTEGER. The leading dimension of the output array z. Constraints: if jobz = 'N', then ldz ≥ 1; if jobz = 'V', then ldz ≥ max(1, n).
lwork INTEGER. The dimension of the array work. Constraints: if n ≤ 1, then lwork ≥ 1; if jobz = 'N' and n > 1, then lwork ≥ 2n; if jobz = 'V' and n > 1, then lwork ≥ 2*n^2 + 5*n + 1. If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work and iwork arrays, returns these values as the first entries of the work and iwork arrays, and no error message related to lwork or liwork is issued by xerbla. See Application Notes for details.
iwork INTEGER. Workspace array of dimension max(1, liwork).
liwork INTEGER. The dimension of the array iwork. Constraints: if n ≤ 1, then liwork ≥ 1; if jobz = 'N' and n > 1, then liwork ≥ 1; if jobz = 'V' and n > 1, then liwork ≥ 5*n+3. If liwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work and iwork arrays, returns these values as the first entries of the work and iwork arrays, and no error message related to lwork or liwork is issued by xerbla. See Application Notes for details.
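The ab argument described above expects one triangle of A in band storage with ldab at least kd+1. The following C sketch shows how a dense symmetric matrix could be copied into that layout. It assumes the usual LAPACK band storage convention (stated in the comments) and column-major arrays; pack_sym_band and the IDX macro are hypothetical helpers, not MKL routines.

```c
#include <assert.h>
#include <stddef.h>

/* Column-major offset of element (i, j), both 1-based, leading dimension ld. */
#define IDX(i, j, ld) ((size_t)((j) - 1) * (size_t)(ld) + (size_t)((i) - 1))

/* Copy one triangle of a dense symmetric n-by-n matrix a (column-major,
   leading dimension lda) into band storage ab (leading dimension ldab).
   Assumed layout (the usual LAPACK band storage convention):
     uplo = 'U': ab(kd+1+i-j, j) = a(i, j) for max(1, j-kd) <= i <= j
     uplo = 'L': ab(1+i-j,    j) = a(i, j) for j <= i <= min(n, j+kd)
   Entries of ab outside the band are left untouched.
   pack_sym_band is an illustrative name, not an MKL routine. */
static void pack_sym_band(char uplo, int n, int kd,
                          const double *a, int lda, double *ab, int ldab)
{
    assert(uplo == 'U' || uplo == 'L');
    assert(n >= 0 && kd >= 0);
    assert(ldab >= kd + 1 && lda >= (n > 1 ? n : 1));

    for (int j = 1; j <= n; ++j) {
        if (uplo == 'U') {
            int first = (j - kd > 1) ? j - kd : 1;
            for (int i = first; i <= j; ++i)
                ab[IDX(kd + 1 + i - j, j, ldab)] = a[IDX(i, j, lda)];
        } else {
            int last = (j + kd < n) ? j + kd : n;
            for (int i = j; i <= last; ++i)
                ab[IDX(1 + i - j, j, ldab)] = a[IDX(i, j, lda)];
        }
    }
}
```

With uplo = 'U' the diagonal of A lands in row kd+1 of ab, and with uplo = 'L' it lands in row 1, which is why ldab must be at least kd+1 in both cases.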

Output Parameters
w, z REAL for ssbevd; DOUBLE PRECISION for dsbevd. Arrays: w(*), DIMENSION at least max(1, n). If info = 0, contains the eigenvalues of the matrix A in ascending order. See also info. z(ldz,*). The second dimension of z must be: at least 1 if jobz = 'N'; at least max(1, n) if jobz = 'V'. If jobz = 'V', then this array is overwritten by the orthogonal matrix Z which contains the eigenvectors of A. The i-th column of Z contains the eigenvector which corresponds to the eigenvalue w(i). If jobz = 'N', then z is not referenced.
ab On exit, this array is overwritten by the values generated during the reduction to tridiagonal form.
work(1) On exit, if lwork > 0, then work(1) returns the required minimal size of lwork.
iwork(1) On exit, if liwork > 0, then iwork(1) returns the required minimal size of liwork.
info INTEGER. If info = 0, the execution is successful. If info = i, then the algorithm failed to converge; i indicates the number of elements of an intermediate tridiagonal form which did not converge to zero. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions. Specific details for the routine sbevd interface are the following:
ab Holds the array A of size (kd+1,n).
w Holds the vector with the number of elements n.
z Holds the matrix Z of size (n, n).
uplo Must be 'U' or 'L'. The default value is 'U'.
jobz Restored based on the presence of the argument z as follows: jobz = 'V', if z is present; jobz = 'N', if z is omitted.

Application Notes
The computed eigenvalues and eigenvectors are exact for a matrix T+E such that ||E||_2 = O(ε)*||T||_2, where ε is the machine precision.
If it is not clear how much workspace to supply, use a generous value of lwork (or liwork) for the first run or set lwork = -1 (liwork = -1).
If you supply any admissible lwork (or liwork) size, that is, one no less than the minimal value described, the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the corresponding array (work, iwork) on exit. Use this value (work(1), iwork(1)) for subsequent runs.
If lwork = -1 (liwork = -1), the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work, iwork). This operation is called a workspace query.
Note that if lwork (liwork) is less than the minimal required value and is not equal to -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.
The complex analogue of this routine is hbevd. See also syevd for matrices held in full storage, and spevd for matrices held in packed storage.

?hbevd
Computes all eigenvalues and (optionally) all eigenvectors of a complex Hermitian band matrix using divide and conquer algorithm.
Syntax
Fortran 77:
call chbevd(jobz, uplo, n, kd, ab, ldab, w, z, ldz, work, lwork, rwork, lrwork, iwork, liwork, info)
call zhbevd(jobz, uplo, n, kd, ab, ldab, w, z, ldz, work, lwork, rwork, lrwork, iwork, liwork, info)
Fortran 95:
call hbevd(ab, w [,uplo] [,z] [,info])
C:
lapack_int LAPACKE_chbevd( int matrix_order, char jobz, char uplo, lapack_int n, lapack_int kd, lapack_complex_float* ab, lapack_int ldab, float* w, lapack_complex_float* z, lapack_int ldz );
lapack_int LAPACKE_zhbevd( int matrix_order, char jobz, char uplo, lapack_int n, lapack_int kd, lapack_complex_double* ab, lapack_int ldab, double* w, lapack_complex_double* z, lapack_int ldz );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine computes all the eigenvalues, and optionally all the eigenvectors, of a complex Hermitian band matrix A. In other words, it can compute the spectral factorization of A as: A = Z*Λ*Z^H. Here Λ is a real diagonal matrix whose diagonal elements are the eigenvalues λ_i, and Z is the (complex) unitary matrix whose columns are the eigenvectors z_i. Thus, A*z_i = λ_i*z_i for i = 1, 2, ..., n.
If the eigenvectors are requested, then this routine uses a divide and conquer algorithm to compute eigenvalues and eigenvectors. However, if only eigenvalues are required, then it uses the Pal-Walker-Kahan variant of the QL or QR algorithm.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
jobz CHARACTER*1. Must be 'N' or 'V'. If jobz = 'N', then only eigenvalues are computed. If jobz = 'V', then eigenvalues and eigenvectors are computed.
uplo CHARACTER*1. Must be 'U' or 'L'. If uplo = 'U', ab stores the upper triangular part of A. If uplo = 'L', ab stores the lower triangular part of A.
n  INTEGER. The order of the matrix A (n ≥ 0).
kd  INTEGER. The number of super- or sub-diagonals in A (kd ≥ 0).
ab, work  COMPLEX for chbevd, DOUBLE COMPLEX for zhbevd. Arrays: ab(ldab,*) is an array containing either the upper or lower triangular part of the Hermitian matrix A (as specified by uplo) in band storage format. The second dimension of ab must be at least max(1, n). work(*) is a workspace array of dimension max(1, lwork).
ldab  INTEGER. The leading dimension of ab; must be at least kd+1.
ldz  INTEGER. The leading dimension of the output array z. Constraints: if jobz = 'N', then ldz ≥ 1; if jobz = 'V', then ldz ≥ max(1, n).
lwork  INTEGER. The dimension of the array work. Constraints: if n ≤ 1, then lwork ≥ 1; if jobz = 'N' and n > 1, then lwork ≥ n; if jobz = 'V' and n > 1, then lwork ≥ 2*n^2. If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work, rwork and iwork arrays, returns these values as the first entries of the work, rwork and iwork arrays, and no error message related to lwork or lrwork or liwork is issued by xerbla. See Application Notes for details.
rwork  REAL for chbevd, DOUBLE PRECISION for zhbevd. Workspace array, DIMENSION at least lrwork.
lrwork  INTEGER. The dimension of the array rwork. Constraints: if n ≤ 1, then lrwork ≥ 1; if jobz = 'N' and n > 1, then lrwork ≥ n; if jobz = 'V' and n > 1, then lrwork ≥ 2*n^2 + 5*n + 1. If lrwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work, rwork and iwork arrays, returns these values as the first entries of the work, rwork and iwork arrays, and no error message related to lwork or lrwork or liwork is issued by xerbla. See Application Notes for details.
iwork  INTEGER. Workspace array, DIMENSION max(1, liwork).
liwork  INTEGER. The dimension of the array iwork.
Constraints: if jobz = 'N' or n ≤ 1, then liwork ≥ 1; if jobz = 'V' and n > 1, then liwork ≥ 5*n+3. If liwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work, rwork and iwork arrays, returns these values as the first entries of the work, rwork and iwork arrays, and no error message related to lwork or lrwork or liwork is issued by xerbla. See Application Notes for details.

Output Parameters

w  REAL for chbevd, DOUBLE PRECISION for zhbevd. Array, DIMENSION at least max(1, n). If info = 0, contains the eigenvalues of the matrix A in ascending order. See also info.
z  COMPLEX for chbevd, DOUBLE COMPLEX for zhbevd. Array, DIMENSION (ldz,*). The second dimension of z must be: at least 1 if jobz = 'N'; at least max(1, n) if jobz = 'V'. If jobz = 'V', then this array is overwritten by the unitary matrix Z which contains the eigenvectors of A. The i-th column of Z contains the eigenvector which corresponds to the eigenvalue w(i). If jobz = 'N', then z is not referenced.
ab  On exit, this array is overwritten by the values generated during the reduction to tridiagonal form.
work(1)  On exit, if lwork > 0, then the real part of work(1) returns the required minimal size of lwork.
rwork(1)  On exit, if lrwork > 0, then rwork(1) returns the required minimal size of lrwork.
iwork(1)  On exit, if liwork > 0, then iwork(1) returns the required minimal size of liwork.
info  INTEGER. If info = 0, the execution is successful. If info = i, then the algorithm failed to converge; i indicates the number of elements of an intermediate tridiagonal form which did not converge to zero. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes

Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine hbevd interface are the following:

ab  Holds the array A of size (kd+1,n).
w  Holds the vector with the number of elements n.
z  Holds the matrix Z of size (n, n).
uplo  Must be 'U' or 'L'. The default value is 'U'.
jobz  Restored based on the presence of the argument z as follows: jobz = 'V', if z is present, jobz = 'N', if z is omitted.

Application Notes

The computed eigenvalues and eigenvectors are exact for a matrix T + E such that ||E||_2 = O(ε)*||T||_2, where ε is the machine precision.

If you are in doubt how much workspace to supply, use a generous value of lwork (liwork or lrwork) for the first run or set lwork = -1 (liwork = -1, lrwork = -1).

If you choose the first option and set lwork (liwork or lrwork) to any admissible size, that is, a size no less than the minimal value described, the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the corresponding array (work, iwork, rwork) on exit. Use this value (work(1), iwork(1), rwork(1)) for subsequent runs.

If you set lwork = -1 (liwork = -1, lrwork = -1), the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work, iwork, rwork). This operation is called a workspace query.

Note that if you set lwork (liwork, lrwork) to less than the minimal required value and not -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.

The real analogue of this routine is sbevd. See also heevd for matrices held in full storage, and hpevd for matrices held in packed storage.

?sbevx
Computes selected eigenvalues and, optionally, eigenvectors of a real symmetric band matrix.
Syntax

Fortran 77:
call ssbevx(jobz, range, uplo, n, kd, ab, ldab, q, ldq, vl, vu, il, iu, abstol, m, w, z, ldz, work, iwork, ifail, info)
call dsbevx(jobz, range, uplo, n, kd, ab, ldab, q, ldq, vl, vu, il, iu, abstol, m, w, z, ldz, work, iwork, ifail, info)

Fortran 95:
call sbevx(ab, w [,uplo] [,z] [,vl] [,vu] [,il] [,iu] [,m] [,ifail] [,q] [,abstol] [,info])

C:
lapack_int LAPACKE_<?>sbevx( int matrix_order, char jobz, char range, char uplo, lapack_int n, lapack_int kd, <datatype>* ab, lapack_int ldab, <datatype>* q, lapack_int ldq, <datatype> vl, <datatype> vu, lapack_int il, lapack_int iu, <datatype> abstol, lapack_int* m, <datatype>* w, <datatype>* z, lapack_int ldz, lapack_int* ifail );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description

The routine computes selected eigenvalues and, optionally, eigenvectors of a real symmetric band matrix A. Eigenvalues and eigenvectors can be selected by specifying either a range of values or a range of indices for the desired eigenvalues.

Input Parameters

The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

jobz  CHARACTER*1. Must be 'N' or 'V'. If jobz = 'N', then only eigenvalues are computed. If jobz = 'V', then eigenvalues and eigenvectors are computed.
range  CHARACTER*1. Must be 'A' or 'V' or 'I'. If range = 'A', the routine computes all eigenvalues. If range = 'V', the routine computes eigenvalues lambda(i) in the half-open interval: vl < lambda(i) ≤ vu. If range = 'I', the routine computes eigenvalues with indices il to iu.
il, iu  INTEGER. If range = 'I', the indices in ascending order of the smallest and largest eigenvalues to be returned. Constraint: 1 ≤ il ≤ iu ≤ n, if n > 0; il=1 and iu=0 if n = 0. If range = 'A' or 'V', il and iu are not referenced.
abstol  REAL for ssbevx, DOUBLE PRECISION for dsbevx. The absolute error tolerance to which each eigenvalue is required. See Application Notes for details on error tolerance.
ldq, ldz  INTEGER. The leading dimensions of the output arrays q and z, respectively.
Constraints: ldq ≥ 1, ldz ≥ 1; if jobz = 'V', then ldq ≥ max(1, n) and ldz ≥ max(1, n).
iwork  INTEGER. Workspace array, DIMENSION at least max(1, 5n).

Output Parameters

q  REAL for ssbevx, DOUBLE PRECISION for dsbevx. Array, DIMENSION (ldq,n). If jobz = 'V', contains the n-by-n orthogonal matrix used in the reduction to tridiagonal form. If jobz = 'N', the array q is not referenced.
m  INTEGER. The total number of eigenvalues found, 0 ≤ m ≤ n. If range = 'A', m = n, and if range = 'I', m = iu-il+1.
w, z  REAL for ssbevx, DOUBLE PRECISION for dsbevx. Arrays: w(*), DIMENSION at least max(1, n). The first m elements of w contain the selected eigenvalues of the matrix A in ascending order. z(ldz,*). The second dimension of z must be at least max(1, m). If jobz = 'V', then if info = 0, the first m columns of z contain the orthonormal eigenvectors of the matrix A corresponding to the selected eigenvalues, with the i-th column of z holding the eigenvector associated with w(i). If an eigenvector fails to converge, then that column of z contains the latest approximation to the eigenvector, and the index of the eigenvector is returned in ifail. If jobz = 'N', then z is not referenced. Note: you must ensure that at least max(1,m) columns are supplied in the array z; if range = 'V', the exact value of m is not known in advance and an upper bound must be used.
ab  On exit, this array is overwritten by the values generated during the reduction to tridiagonal form. If uplo = 'U', the first superdiagonal and the diagonal of the tridiagonal matrix T are returned in rows kd and kd+1 of ab, and if uplo = 'L', the diagonal and first subdiagonal of T are returned in the first two rows of ab.
ifail  INTEGER. Array, DIMENSION at least max(1, n). If jobz = 'V', then if info = 0, the first m elements of ifail are zero; if info > 0, ifail contains the indices of the eigenvectors that failed to converge. If jobz = 'N', then ifail is not referenced.
info  INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value. If info = i, then i eigenvectors failed to converge; their indices are stored in the array ifail.

Fortran 95 Interface Notes

Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.

Specific details for the routine sbevx interface are the following:

ab  Holds the array A of size (kd+1,n).
w  Holds the vector with the number of elements n.
z  Holds the matrix Z of size (n, n), where the values n and m are significant.
ifail  Holds the vector with the number of elements n.
q  Holds the matrix Q of size (n, n).
uplo  Must be 'U' or 'L'. The default value is 'U'.
vl  Default value for this element is vl = -HUGE(vl).
vu  Default value for this element is vu = HUGE(vl).
il  Default value for this argument is il = 1.
iu  Default value for this argument is iu = n.
abstol  Default value for this element is abstol = 0.0_WP.
jobz  Restored based on the presence of the argument z as follows: jobz = 'V', if z is present, jobz = 'N', if z is omitted. Note that there will be an error condition if either ifail or q is present and z is omitted.
range  Restored based on the presence of arguments vl, vu, il, iu as follows: range = 'V', if one of or both vl and vu are present, range = 'I', if one of or both il and iu are present, range = 'A', if none of vl, vu, il, iu is present. Note that there will be an error condition if one of or both vl and vu are present and at the same time one of or both il and iu are present.

Application Notes

An approximate eigenvalue is accepted as converged when it is determined to lie in an interval [a,b] of width less than or equal to abstol+ε*max(|a|,|b|), where ε is the machine precision.
If abstol is less than or equal to zero, then ε*||T||_1 is used as tolerance, where T is the tridiagonal matrix obtained by reducing A to tridiagonal form. Eigenvalues will be computed most accurately when abstol is set to twice the underflow threshold 2*?lamch('S'), not zero. If this routine returns with info > 0, indicating that some eigenvectors did not converge, try setting abstol to 2*?lamch('S').

?hbevx
Computes selected eigenvalues and, optionally, eigenvectors of a Hermitian band matrix.

Syntax

Fortran 77:
call chbevx(jobz, range, uplo, n, kd, ab, ldab, q, ldq, vl, vu, il, iu, abstol, m, w, z, ldz, work, rwork, iwork, ifail, info)
call zhbevx(jobz, range, uplo, n, kd, ab, ldab, q, ldq, vl, vu, il, iu, abstol, m, w, z, ldz, work, rwork, iwork, ifail, info)

Fortran 95:
call hbevx(ab, w [,uplo] [,z] [,vl] [,vu] [,il] [,iu] [,m] [,ifail] [,q] [,abstol] [,info])

C:
lapack_int LAPACKE_chbevx( int matrix_order, char jobz, char range, char uplo, lapack_int n, lapack_int kd, lapack_complex_float* ab, lapack_int ldab, lapack_complex_float* q, lapack_int ldq, float vl, float vu, lapack_int il, lapack_int iu, float abstol, lapack_int* m, float* w, lapack_complex_float* z, lapack_int ldz, lapack_int* ifail );
lapack_int LAPACKE_zhbevx( int matrix_order, char jobz, char range, char uplo, lapack_int n, lapack_int kd, lapack_complex_double* ab, lapack_int ldab, lapack_complex_double* q, lapack_int ldq, double vl, double vu, lapack_int il, lapack_int iu, double abstol, lapack_int* m, double* w, lapack_complex_double* z, lapack_int ldz, lapack_int* ifail );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description

The routine computes selected eigenvalues and, optionally, eigenvectors of a complex Hermitian band matrix A. Eigenvalues and eigenvectors can be selected by specifying either a range of values or a range of indices for the desired eigenvalues.
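The two selection modes can be illustrated with a short C sketch. The helper select_eigenvalues below is hypothetical and is not part of Intel MKL or LAPACKE. It assumes the eigenvalues are already available in ascending order and only mimics how the range argument picks a subset and determines the count m: range = 'A' keeps everything, range = 'V' keeps values in the half-open interval vl < lambda(i) ≤ vu, and range = 'I' keeps the 1-based indices il to iu.

```c
#include <stddef.h>

/* Hypothetical illustration, not an MKL routine.
 * w[0..n-1] holds eigenvalues in ascending order.
 * Copies the selected eigenvalues to out and returns their count m,
 * or -1 if range is not one of 'A', 'V', 'I'. */
int select_eigenvalues(char range, const double *w, int n,
                       double vl, double vu, int il, int iu,
                       double *out)
{
    int m = 0;
    for (int i = 0; i < n; ++i) {
        int keep;
        switch (range) {
        case 'A':               /* all eigenvalues */
            keep = 1;
            break;
        case 'V':               /* half-open interval: vl < lambda <= vu */
            keep = (w[i] > vl && w[i] <= vu);
            break;
        case 'I':               /* 1-based indices il..iu */
            keep = (i + 1 >= il && i + 1 <= iu);
            break;
        default:
            return -1;
        }
        if (keep)
            out[m++] = w[i];
    }
    return m;
}
```

For example, with w = {-2.0, 0.5, 1.0, 3.0, 7.5}, range = 'V' with vl = 0.5 and vu = 3.0 excludes 0.5 (the lower bound is open) and includes 3.0 (the upper bound is closed). With range = 'I', the count is always iu-il+1, while with range = 'V' it is not known in advance, which is why an upper bound on m is needed when sizing z.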
Input Parameters

The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

jobz  CHARACTER*1. Must be 'N' or 'V'. If jobz = 'N', then only eigenvalues are computed. If jobz = 'V', then eigenvalues and eigenvectors are computed.
range  CHARACTER*1. Must be 'A' or 'V' or 'I'. If range = 'A', the routine computes all eigenvalues. If range = 'V', the routine computes eigenvalues lambda(i) in the half-open interval: vl < lambda(i) ≤ vu. If range = 'I', the routine computes eigenvalues with indices il to iu.
uplo  CHARACTER*1. Must be 'U' or 'L'. If uplo = 'U', ab stores the upper triangular part of A. If uplo = 'L', ab stores the lower triangular part of A.
n  INTEGER. The order of the matrix A (n ≥ 0).
kd  INTEGER. The number of super- or sub-diagonals in A (kd ≥ 0).
ab, work  COMPLEX for chbevx, DOUBLE COMPLEX for zhbevx. Arrays: ab(ldab,*) is an array containing either the upper or lower triangular part of the Hermitian matrix A (as specified by uplo) in band storage format. The second dimension of ab must be at least max(1, n). work(*) is a workspace array. The dimension of work must be at least max(1, n).
ldab  INTEGER. The leading dimension of ab; must be at least kd+1.
vl, vu  REAL for chbevx, DOUBLE PRECISION for zhbevx. If range = 'V', the lower and upper bounds of the interval to be searched for eigenvalues. Constraint: vl < vu. If range = 'A' or 'I', vl and vu are not referenced.
il, iu  INTEGER. If range = 'I', the indices in ascending order of the smallest and largest eigenvalues to be returned. Constraint: 1 ≤ il ≤ iu ≤ n, if n > 0; il=1 and iu=0 if n = 0. If range = 'A' or 'V', il and iu are not referenced.
abstol  REAL for chbevx, DOUBLE PRECISION for zhbevx.
The absolute error tolerance to which each eigenvalue is required. See Application Notes for details on error tolerance.
ldq, ldz  INTEGER. The leading dimensions of the output arrays q and z, respectively. Constraints: ldq ≥ 1, ldz ≥ 1; if jobz = 'V', then ldq ≥ max(1, n) and ldz ≥ max(1, n).
rwork  REAL for chbevx, DOUBLE PRECISION for zhbevx. Workspace array, DIMENSION at least max(1, 7n).
iwork  INTEGER. Workspace array, DIMENSION at least max(1, 5n).

Output Parameters

q  COMPLEX for chbevx, DOUBLE COMPLEX for zhbevx. Array, DIMENSION (ldq,n). If jobz = 'V', contains the n-by-n unitary matrix used in the reduction to tridiagonal form. If jobz = 'N', the array q is not referenced.
m  INTEGER. The total number of eigenvalues found, 0 ≤ m ≤ n. If range = 'A', m = n, and if range = 'I', m = iu-il+1.
w  REAL for chbevx, DOUBLE PRECISION for zhbevx. Array, DIMENSION at least max(1, n). The first m elements contain the selected eigenvalues of the matrix A in ascending order.
z  COMPLEX for chbevx, DOUBLE COMPLEX for zhbevx. Array z(ldz,*). The second dimension of z must be at least max(1, m). If jobz = 'V', then if info = 0, the first m columns of z contain the orthonormal eigenvectors of the matrix A corresponding to the selected eigenvalues, with the i-th column of z holding the eigenvector associated with w(i). If an eigenvector fails to converge, then that column of z contains the latest approximation to the eigenvector, and the index of the eigenvector is returned in ifail. If jobz = 'N', then z is not referenced. Note: you must ensure that at least max(1,m) columns are supplied in the array z; if range = 'V', the exact value of m is not known in advance and an upper bound must be used.
ab  On exit, this array is overwritten by the values generated during the reduction to tridiagonal form.
If uplo = 'U', the first superdiagonal and the diagonal of the tridiagonal matrix T are returned in rows kd and kd+1 of ab, and if uplo = 'L', the diagonal and first subdiagonal of T are returned in the first two rows of ab.
ifail  INTEGER. Array, DIMENSION at least max(1, n). If jobz = 'V', then if info = 0, the first m elements of ifail are zero; if info > 0, ifail contains the indices of the eigenvectors that failed to converge. If jobz = 'N', then ifail is not referenced.
info  INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value. If info = i, then i eigenvectors failed to converge; their indices are stored in the array ifail.

Fortran 95 Interface Notes

Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.

Specific details for the routine hbevx interface are the following:

ab  Holds the array A of size (kd+1,n).
w  Holds the vector with the number of elements n.
z  Holds the matrix Z of size (n, n), where the values n and m are significant.
ifail  Holds the vector with the number of elements n.
q  Holds the matrix Q of size (n, n).
uplo  Must be 'U' or 'L'. The default value is 'U'.
vl  Default value for this element is vl = -HUGE(vl).
vu  Default value for this element is vu = HUGE(vl).
il  Default value for this argument is il = 1.
iu  Default value for this argument is iu = n.
abstol  Default value for this element is abstol = 0.0_WP.
jobz  Restored based on the presence of the argument z as follows: jobz = 'V', if z is present, jobz = 'N', if z is omitted. Note that there will be an error condition if either ifail or q is present and z is omitted.
range  Restored based on the presence of arguments vl, vu, il, iu as follows: range = 'V', if one of or both vl and vu are present, range = 'I', if one of or both il and iu are present, range = 'A', if none of vl, vu, il, iu is present. Note that there will be an error condition if one of or both vl and vu are present and at the same time one of or both il and iu are present.

Application Notes

An approximate eigenvalue is accepted as converged when it is determined to lie in an interval [a,b] of width less than or equal to abstol + ε*max(|a|,|b|), where ε is the machine precision. If abstol is less than or equal to zero, then ε*||T||_1 will be used in its place, where T is the tridiagonal matrix obtained by reducing A to tridiagonal form. Eigenvalues will be computed most accurately when abstol is set to twice the underflow threshold 2*?lamch('S'), not zero. If this routine returns with info > 0, indicating that some eigenvectors did not converge, try setting abstol to 2*?lamch('S').

?stev
Computes all eigenvalues and, optionally, eigenvectors of a real symmetric tridiagonal matrix.

Syntax

Fortran 77:
call sstev(jobz, n, d, e, z, ldz, work, info)
call dstev(jobz, n, d, e, z, ldz, work, info)

Fortran 95:
call stev(d, e [,z] [,info])

C:
lapack_int LAPACKE_<?>stev( int matrix_order, char jobz, lapack_int n, <datatype>* d, <datatype>* e, <datatype>* z, lapack_int ldz );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description

The routine computes all eigenvalues and, optionally, eigenvectors of a real symmetric tridiagonal matrix A.

Input Parameters

The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

jobz  CHARACTER*1. Must be 'N' or 'V'.
If jobz = 'N', then only eigenvalues are computed. If jobz = 'V', then eigenvalues and eigenvectors are computed.
n  INTEGER. The order of the matrix A (n ≥ 0).
d, e, work  REAL for sstev, DOUBLE PRECISION for dstev. Arrays: d(*) contains the n diagonal elements of the tridiagonal matrix A. The dimension of d must be at least max(1, n). e(*) contains the n-1 subdiagonal elements of the tridiagonal matrix A. The dimension of e must be at least max(1, n-1). The n-th element of this array is used as workspace. work(*) is a workspace array. The dimension of work must be at least max(1, 2n-2). If jobz = 'N', work is not referenced.
ldz  INTEGER. The leading dimension of the output array z; ldz ≥ 1. If jobz = 'V', then ldz ≥ max(1, n).

Output Parameters

d  On exit, if info = 0, contains the eigenvalues of the matrix A in ascending order.
z  REAL for sstev, DOUBLE PRECISION for dstev. Array, DIMENSION (ldz, *). The second dimension of z must be at least max(1, n). If jobz = 'V', then if info = 0, z contains the orthonormal eigenvectors of the matrix A, with the i-th column of z holding the eigenvector associated with the eigenvalue returned in d(i). If jobz = 'N', then z is not referenced.
e  On exit, this array is overwritten with intermediate results.
info  INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value. If info = i, then the algorithm failed to converge; i elements of e did not converge to zero.

Fortran 95 Interface Notes

Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.

Specific details for the routine stev interface are the following:

d  Holds the vector of length n.
e  Holds the vector of length n.
z  Holds the matrix Z of size (n, n).
jobz  Restored based on the presence of the argument z as follows: jobz = 'V', if z is present, jobz = 'N', if z is omitted.

?stevd
Computes all eigenvalues and (optionally) all eigenvectors of a real symmetric tridiagonal matrix using divide and conquer algorithm.

Syntax

Fortran 77:
call sstevd(jobz, n, d, e, z, ldz, work, lwork, iwork, liwork, info)
call dstevd(jobz, n, d, e, z, ldz, work, lwork, iwork, liwork, info)

Fortran 95:
call stevd(d, e [,z] [,info])

C:
lapack_int LAPACKE_<?>stevd( int matrix_order, char jobz, lapack_int n, <datatype>* d, <datatype>* e, <datatype>* z, lapack_int ldz );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description

The routine computes all the eigenvalues, and optionally all the eigenvectors, of a real symmetric tridiagonal matrix T. In other words, the routine can compute the spectral factorization of T as: T = Z*Λ*Z^T.
Here Λ is a diagonal matrix whose diagonal elements are the eigenvalues λi, and Z is the orthogonal matrix whose columns are the eigenvectors zi. Thus, T*zi = λi*zi for i = 1, 2, ..., n.
If the eigenvectors are requested, then this routine uses a divide and conquer algorithm to compute eigenvalues and eigenvectors. However, if only eigenvalues are required, then it uses the Pal-Walker-Kahan variant of the QL or QR algorithm.
There is no complex analogue of this routine.

Input Parameters

The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

jobz  CHARACTER*1. Must be 'N' or 'V'. If jobz = 'N', then only eigenvalues are computed. If jobz = 'V', then eigenvalues and eigenvectors are computed.
n  INTEGER. The order of the matrix T (n ≥ 0).
d, e, work  REAL for sstevd, DOUBLE PRECISION for dstevd.
Arrays: d(*) contains the n diagonal elements of the tridiagonal matrix T. The dimension of d must be at least max(1, n). e(*) contains the n-1 off-diagonal elements of T. The dimension of e must be at least max(1, n-1). The n-th element of this array is used as workspace. work(*) is a workspace array. The dimension of work must be at least lwork.
ldz  INTEGER. The leading dimension of the output array z. Constraints: ldz ≥ 1 if jobz = 'N'; ldz ≥ max(1, n) if jobz = 'V'.
lwork  INTEGER. The dimension of the array work. Constraints: if jobz = 'N' or n ≤ 1, then lwork ≥ 1; if jobz = 'V' and n > 1, then lwork ≥ n^2 + 4*n + 1. If lwork = -1, then a workspace query is assumed; the routine only calculates the required sizes of the work and iwork arrays, returns these values as the first entries of the work and iwork arrays, and no error message related to lwork or liwork is issued by xerbla. See Application Notes for details.
iwork  INTEGER. Workspace array of dimension max(1, liwork).
liwork  INTEGER. The dimension of the array iwork. Constraints: if jobz = 'N' or n ≤ 1, then liwork ≥ 1; if jobz = 'V' and n > 1, then liwork ≥ 5*n+3. If liwork = -1, then a workspace query is assumed; the routine only calculates the required sizes of the work and iwork arrays, returns these values as the first entries of the work and iwork arrays, and no error message related to lwork or liwork is issued by xerbla. See Application Notes for details.

Output Parameters

d  On exit, if info = 0, contains the eigenvalues of the matrix T in ascending order. See also info.
z  REAL for sstevd, DOUBLE PRECISION for dstevd. Array, DIMENSION (ldz, *). The second dimension of z must be: at least 1 if jobz = 'N'; at least max(1, n) if jobz = 'V'. If jobz = 'V', then this array is overwritten by the orthogonal matrix Z which contains the eigenvectors of T. If jobz = 'N', then z is not referenced.
e  On exit, this array is overwritten with intermediate results.
work(1)  On exit, if lwork > 0, then work(1) returns the required minimal size of lwork.
iwork(1)  On exit, if liwork > 0, then iwork(1) returns the required minimal size of liwork.
info  INTEGER. If info = 0, the execution is successful. If info = i, then the algorithm failed to converge; i indicates the number of elements of an intermediate tridiagonal form which did not converge to zero. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes

Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.

Specific details for the routine stevd interface are the following:

d  Holds the vector of length n.
e  Holds the vector of length n.
z  Holds the matrix Z of size (n, n).
jobz  Restored based on the presence of the argument z as follows: jobz = 'V', if z is present, jobz = 'N', if z is omitted.

Application Notes

The computed eigenvalues and eigenvectors are exact for a matrix T+E such that ||E||_2 = O(ε)*||T||_2, where ε is the machine precision.

If λi is an exact eigenvalue, and µi is the corresponding computed value, then
|µi - λi| ≤ c(n)*ε*||T||_2
where c(n) is a modestly increasing function of n.

If zi is the corresponding exact eigenvector, and wi is the corresponding computed vector, then the angle θ(zi, wi) between them is bounded as follows:
θ(zi, wi) ≤ c(n)*ε*||T||_2 / min(i≠j) |λi - λj|.
Thus the accuracy of a computed eigenvector depends on the gap between its eigenvalue and all the other eigenvalues.

If it is not clear how much workspace to supply, use a generous value of lwork (or liwork) for the first run, or set lwork = -1 (liwork = -1).
If lwork (or liwork) has any admissible size, that is, a size no less than the minimal value described, then the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the corresponding array (work, iwork) on exit. Use this value (work(1), iwork(1)) for subsequent runs.

If lwork = -1 (liwork = -1), then the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work, iwork). This operation is called a workspace query.

Note that if lwork (liwork) is less than the minimal required value and is not equal to -1, then the routine returns immediately with an error exit and does not provide any information on the recommended workspace.

?stevx
Computes selected eigenvalues and eigenvectors of a real symmetric tridiagonal matrix.

Syntax

Fortran 77:
call sstevx(jobz, range, n, d, e, vl, vu, il, iu, abstol, m, w, z, ldz, work, iwork, ifail, info)
call dstevx(jobz, range, n, d, e, vl, vu, il, iu, abstol, m, w, z, ldz, work, iwork, ifail, info)

Fortran 95:
call stevx(d, e, w [, z] [,vl] [,vu] [,il] [,iu] [,m] [,ifail] [,abstol] [,info])

C:
lapack_int LAPACKE_<?>stevx( int matrix_order, char jobz, char range, lapack_int n, <datatype>* d, <datatype>* e, <datatype> vl, <datatype> vu, lapack_int il, lapack_int iu, <datatype> abstol, lapack_int* m, <datatype>* w, <datatype>* z, lapack_int ldz, lapack_int* ifail );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description

The routine computes selected eigenvalues and, optionally, eigenvectors of a real symmetric tridiagonal matrix A. Eigenvalues and eigenvectors can be selected by specifying either a range of values or a range of indices for the desired eigenvalues.

Input Parameters

The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above.
See the C Interface Conventions section for the C interface principal conventions and type definitions.

jobz  CHARACTER*1. Must be 'N' or 'V'. If jobz = 'N', then only eigenvalues are computed. If jobz = 'V', then eigenvalues and eigenvectors are computed.
range  CHARACTER*1. Must be 'A' or 'V' or 'I'. If range = 'A', the routine computes all eigenvalues. If range = 'V', the routine computes eigenvalues lambda(i) in the half-open interval: vl < lambda(i) ≤ vu. If range = 'I', the routine computes eigenvalues with indices il to iu.
il, iu  INTEGER. If range = 'I', the indices in ascending order of the smallest and largest eigenvalues to be returned. Constraint: 1 ≤ il ≤ iu ≤ n, if n > 0; il=1 and iu=0 if n = 0. If range = 'A' or 'V', il and iu are not referenced.
abstol  REAL for sstevx, DOUBLE PRECISION for dstevx. The absolute error tolerance to which each eigenvalue is required. See Application Notes for details on error tolerance.
ldz  INTEGER. The leading dimension of the output array z; ldz ≥ 1. If jobz = 'V', then ldz ≥ max(1, n).
iwork  INTEGER. Workspace array, DIMENSION at least max(1, 5n).

Output Parameters

m  INTEGER. The total number of eigenvalues found, 0 ≤ m ≤ n. If range = 'A', m = n, and if range = 'I', m = iu-il+1.
w, z  REAL for sstevx, DOUBLE PRECISION for dstevx. Arrays: w(*), DIMENSION at least max(1, n). The first m elements of w contain the selected eigenvalues of the matrix A in ascending order. z(ldz,*). The second dimension of z must be at least max(1, m). If jobz = 'V', then if info = 0, the first m columns of z contain the orthonormal eigenvectors of the matrix A corresponding to the selected eigenvalues, with the i-th column of z holding the eigenvector associated with w(i). If an eigenvector fails to converge, then that column of z contains the latest approximation to the eigenvector, and the index of the eigenvector is returned in ifail. If jobz = 'N', then z is not referenced. Note: you must ensure that at least max(1,m) columns are supplied in the array z; if range = 'V', the exact value of m is not known in advance and an upper bound must be used.
d, e On exit, these arrays may be multiplied by a constant factor chosen to avoid overflow or underflow in computing the eigenvalues.
ifail INTEGER. Array, DIMENSION at least max(1, n). If jobz = 'V', then if info = 0, the first m elements of ifail are zero; if info > 0, ifail contains the indices of the eigenvectors that failed to converge. If jobz = 'N', then ifail is not referenced.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value. If info = i, then i eigenvectors failed to converge; their indices are stored in the array ifail.
Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions. Specific details for the routine stevx interface are the following:
d Holds the vector of length n.
e Holds the vector of length n.
w Holds the vector of length n.
z Holds the matrix Z of size (n, n), where the values n and m are significant.
ifail Holds the vector of length n.
vl Default value for this element is vl = -HUGE(vl).
vu Default value for this element is vu = HUGE(vl).
il Default value for this argument is il = 1.
iu Default value for this argument is iu = n.
abstol Default value for this element is abstol = 0.0_WP.
jobz Restored based on the presence of the argument z as follows: jobz = 'V', if z is present; jobz = 'N', if z is omitted. Note that there will be an error condition if ifail is present and z is omitted.
range Restored based on the presence of arguments vl, vu, il, iu as follows: range = 'V', if one of or both vl and vu are present; range = 'I', if one of or both il and iu are present; range = 'A', if none of vl, vu, il, iu is present. Note that there will be an error condition if one of or both vl and vu are present and at the same time one of or both il and iu are present.
Application Notes
An approximate eigenvalue is accepted as converged when it is determined to lie in an interval [a,b] of width less than or equal to abstol+ε*max(|a|,|b|), where ε is the machine precision. If abstol is less than or equal to zero, then ε*||A||_1 is used instead. Eigenvalues are computed most accurately when abstol is set to twice the underflow threshold 2*?lamch('S'), not zero. If this routine returns with info > 0, indicating that some eigenvectors did not converge, set abstol to 2*?lamch('S').
?stevr
Computes selected eigenvalues and, optionally, eigenvectors of a real symmetric tridiagonal matrix using the Relatively Robust Representations.
Syntax
Fortran 77:
call sstevr(jobz, range, n, d, e, vl, vu, il, iu, abstol, m, w, z, ldz, isuppz, work, lwork, iwork, liwork, info)
call dstevr(jobz, range, n, d, e, vl, vu, il, iu, abstol, m, w, z, ldz, isuppz, work, lwork, iwork, liwork, info)
Fortran 95:
call stevr(d, e, w [, z] [,vl] [,vu] [,il] [,iu] [,m] [,isuppz] [,abstol] [,info])
C:
lapack_int LAPACKE_<?>stevr( int matrix_order, char jobz, char range, lapack_int n, <datatype>* d, <datatype>* e, <datatype> vl, <datatype> vu, lapack_int il, lapack_int iu, <datatype> abstol, lapack_int* m, <datatype>* w, <datatype>* z, lapack_int ldz, lapack_int* isuppz );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine computes selected eigenvalues and, optionally, eigenvectors of a real symmetric tridiagonal matrix T.
Eigenvalues and eigenvectors can be selected by specifying either a range of values or a range of indices for the desired eigenvalues. Whenever possible, the routine calls stemr to compute the eigenspectrum using Relatively Robust Representations. stemr computes eigenvalues by the dqds algorithm, while orthogonal eigenvectors are computed from various "good" L*D*L^T representations (also known as Relatively Robust Representations). Gram-Schmidt orthogonalization is avoided as far as possible. More specifically, the various steps of the algorithm are as follows. For the i-th unreduced block of T:
a. Compute T - σi = Li*Di*Li^T, such that Li*Di*Li^T is a relatively robust representation.
b. Compute the eigenvalues, λj, of Li*Di*Li^T to high relative accuracy by the dqds algorithm.
c. If there is a cluster of close eigenvalues, "choose" σi close to the cluster, and go to Step (a).
d. Given the approximate eigenvalue λj of Li*Di*Li^T, compute the corresponding eigenvector by forming a rank-revealing twisted factorization.
The desired accuracy of the output can be specified by the input parameter abstol. The routine ?stevr calls stemr when the full spectrum is requested on machines which conform to the IEEE-754 floating point standard. ?stevr calls stebz and stein on non-IEEE machines and when partial spectrum requests are made.
Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
jobz CHARACTER*1. Must be 'N' or 'V'. If jobz = 'N', then only eigenvalues are computed. If jobz = 'V', then eigenvalues and eigenvectors are computed.
range CHARACTER*1. Must be 'A' or 'V' or 'I'. If range = 'A', the routine computes all eigenvalues. If range = 'V', the routine computes eigenvalues lambda(i) in the half-open interval: vl < lambda(i) ≤ vu. If range = 'I', the routine computes eigenvalues with indices il to iu.
il, iu INTEGER. If range = 'I', the indices in ascending order of the smallest and largest eigenvalues to be returned. Constraint: 1 ≤ il ≤ iu ≤ n, if n > 0; il=1 and iu=0 if n = 0.
If range = 'A' or 'V', il and iu are not referenced.
abstol REAL for sstevr DOUBLE PRECISION for dstevr. The absolute error tolerance to which each eigenvalue/eigenvector is required. If jobz = 'V', the eigenvalues and eigenvectors output have residual norms bounded by abstol, and the dot products between different eigenvectors are bounded by abstol. If abstol < n*eps*|T|, then n*eps*|T| will be used in its place, where eps is the machine precision, and |T| is the 1-norm of the matrix T. The eigenvalues are computed to an accuracy of eps*|T| irrespective of abstol. If high relative accuracy is important, set abstol to ?lamch('S').
ldz INTEGER. The leading dimension of the output array z. Constraints: ldz ≥ 1 if jobz = 'N'; ldz ≥ max(1, n) if jobz = 'V'.
lwork INTEGER. The dimension of the array work. Constraint: lwork ≥ max(1, 20*n). If lwork = -1, then a workspace query is assumed; the routine only calculates the required sizes of the work and iwork arrays, returns these values as the first entries of the work and iwork arrays, and no error message related to lwork or liwork is issued by xerbla. See Application Notes for details.
iwork INTEGER. Workspace array, its dimension max(1, liwork).
liwork INTEGER. The dimension of the array iwork, liwork ≥ max(1, 10*n). If liwork = -1, then a workspace query is assumed; the routine only calculates the required sizes of the work and iwork arrays, returns these values as the first entries of the work and iwork arrays, and no error message related to lwork or liwork is issued by xerbla. See Application Notes for details.
Output Parameters
m INTEGER. The total number of eigenvalues found, 0 ≤ m ≤ n. If range = 'A', m = n, and if range = 'I', m = iu-il+1.
w, z REAL for sstevr DOUBLE PRECISION for dstevr. Arrays:
w(*), DIMENSION at least max(1, n). The first m elements of w contain the selected eigenvalues of the matrix T in ascending order.
z(ldz,*).
The second dimension of z must be at least max(1, m). If jobz = 'V', then if info = 0, the first m columns of z contain the orthonormal eigenvectors of the matrix T corresponding to the selected eigenvalues, with the i-th column of z holding the eigenvector associated with w(i). If jobz = 'N', then z is not referenced. Note: you must ensure that at least max(1,m) columns are supplied in the array z; if range = 'V', the exact value of m is not known in advance and an upper bound must be used. d, e On exit, these arrays may be multiplied by a constant factor chosen to avoid overflow or underflow in computing the eigenvalues. isuppz INTEGER. Array, DIMENSION at least 2 *max(1, m). The support of the eigenvectors in z, i.e., the indices indicating the nonzero elements in z. The i-th eigenvector is nonzero only in elements isuppz( 2i-1) through isuppz( 2i ). Implemented only for range = 'A' or 'I' and iu-il = n-1. work(1) On exit, if info = 0, then work(1) returns the required minimal size of lwork. iwork(1) On exit, if info = 0, then iwork(1) returns the required minimal size of liwork. info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value. If info = i, an internal error has occurred. Fortran 95 Interface Notes Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions. Specific details for the routine stevr interface are the following: d Holds the vector of length n. e Holds the vector of length n. w Holds the vector of length n. z Holds the matrix Z of size (n, n), where the values n and m are significant. isuppz Holds the vector of length (2*n), where the values (2*m) are significant. vl Default value for this element is vl = -HUGE(vl). vu Default value for this element is vu = HUGE(vl). 
il Default value for this argument is il = 1.
iu Default value for this argument is iu = n.
abstol Default value for this element is abstol = 0.0_WP.
jobz Restored based on the presence of the argument z as follows: jobz = 'V', if z is present; jobz = 'N', if z is omitted. Note that there will be an error condition if ifail is present and z is omitted.
range Restored based on the presence of arguments vl, vu, il, iu as follows: range = 'V', if one of or both vl and vu are present; range = 'I', if one of or both il and iu are present; range = 'A', if none of vl, vu, il, iu is present. Note that there will be an error condition if one of or both vl and vu are present and at the same time one of or both il and iu are present.
Application Notes
Normal execution of the routine ?stegr may create NaNs and infinities and hence may abort due to a floating point exception in environments which do not handle NaNs and infinities in the IEEE standard default manner.
If it is not clear how much workspace to supply, use a generous value of lwork (or liwork) for the first run, or set lwork = -1 (liwork = -1). If lwork (or liwork) has any admissible size, that is, a size no less than the minimal value described, the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace size in the first element of the corresponding array (work, iwork) on exit. Use this value (work(1), iwork(1)) for subsequent runs. If lwork = -1 (liwork = -1), the routine returns immediately and provides the recommended workspace size in the first element of the corresponding array (work, iwork). This operation is called a workspace query. Note that if lwork (liwork) is less than the minimal required value and is not equal to -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.
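The three workspace cases described above (query, too small, admissible) can be sketched in C as follows. This is only an illustration of the documented rule: the names ws_status, classify_workspace, stevr_min_lwork and stevr_min_liwork are hypothetical helpers invented for this sketch and are not part of Intel MKL or LAPACK. The minimal sizes are the ones stated for ?stevr in the parameter descriptions.

```c
/* Hypothetical helpers for illustration only; not part of Intel MKL. */
typedef enum {
    WS_QUERY,     /* size == -1: routine only reports the recommended size   */
    WS_TOO_SMALL, /* below the minimum: error exit, no size is reported      */
    WS_OK         /* admissible: routine runs and reports recommended size   */
} ws_status;

/* Classify a workspace length against a documented minimal value. */
ws_status classify_workspace(long size, long minimum)
{
    if (size == -1) {
        return WS_QUERY;
    }
    if (size < minimum) {
        return WS_TOO_SMALL;
    }
    return WS_OK;
}

/* Minimal lwork documented for ?stevr: max(1, 20*n). */
long stevr_min_lwork(long n)
{
    long need = 20 * n;
    return need > 1 ? need : 1;
}

/* Minimal liwork documented for ?stevr: max(1, 10*n). */
long stevr_min_liwork(long n)
{
    long need = 10 * n;
    return need > 1 ? need : 1;
}
```

For n = 5, for example, classify_workspace(99, stevr_min_lwork(5)) reports WS_TOO_SMALL, while a length of 100 or more is admissible. An admissible length is not necessarily the recommended one; that value is only known after the routine has written it to work(1).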
Nonsymmetric Eigenproblems
This section describes LAPACK driver routines used for solving nonsymmetric eigenproblems. See also computational routines that can be called to solve these problems. Table "Driver Routines for Solving Nonsymmetric Eigenproblems" lists all such driver routines for the FORTRAN 77 interface. Respective routine names in the Fortran 95 interface are without the first symbol (see Routine Naming Conventions).
Driver Routines for Solving Nonsymmetric Eigenproblems
Routine Name    Operation performed
gees            Computes the eigenvalues and Schur factorization of a general matrix, and orders the factorization so that selected eigenvalues are at the top left of the Schur form.
geesx           Computes the eigenvalues and Schur factorization of a general matrix, orders the factorization and computes reciprocal condition numbers.
geev            Computes the eigenvalues and left and right eigenvectors of a general matrix.
geevx           Computes the eigenvalues and left and right eigenvectors of a general matrix, with preliminary matrix balancing, and computes reciprocal condition numbers for the eigenvalues and right eigenvectors.
?gees
Computes the eigenvalues and Schur factorization of a general matrix, and orders the factorization so that selected eigenvalues are at the top left of the Schur form.
Syntax
Fortran 77:
call sgees(jobvs, sort, select, n, a, lda, sdim, wr, wi, vs, ldvs, work, lwork, bwork, info)
call dgees(jobvs, sort, select, n, a, lda, sdim, wr, wi, vs, ldvs, work, lwork, bwork, info)
call cgees(jobvs, sort, select, n, a, lda, sdim, w, vs, ldvs, work, lwork, rwork, bwork, info)
call zgees(jobvs, sort, select, n, a, lda, sdim, w, vs, ldvs, work, lwork, rwork, bwork, info)
Fortran 95:
call gees(a, wr, wi [,vs] [,select] [,sdim] [,info])
call gees(a, w [,vs] [,select] [,sdim] [,info])
C:
lapack_int LAPACKE_sgees( int matrix_order, char jobvs, char sort, LAPACK_S_SELECT2 select, lapack_int n, float* a, lapack_int lda, lapack_int* sdim, float* wr, float* wi, float* vs, lapack_int ldvs );
lapack_int LAPACKE_dgees( int matrix_order, char jobvs, char sort, LAPACK_D_SELECT2 select, lapack_int n, double* a, lapack_int lda, lapack_int* sdim, double* wr, double* wi, double* vs, lapack_int ldvs );
lapack_int LAPACKE_cgees( int matrix_order, char jobvs, char sort, LAPACK_C_SELECT1 select, lapack_int n, lapack_complex_float* a, lapack_int lda, lapack_int* sdim, lapack_complex_float* w, lapack_complex_float* vs, lapack_int ldvs );
lapack_int LAPACKE_zgees( int matrix_order, char jobvs, char sort, LAPACK_Z_SELECT1 select, lapack_int n, lapack_complex_double* a, lapack_int lda, lapack_int* sdim, lapack_complex_double* w, lapack_complex_double* vs, lapack_int ldvs );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine computes for an n-by-n real/complex nonsymmetric matrix A, the eigenvalues, the real Schur form T, and, optionally, the matrix of Schur vectors Z. This gives the Schur factorization A = Z*T*Z^H. Optionally, it also orders the eigenvalues on the diagonal of the real-Schur/Schur form so that selected eigenvalues are at the top left.
The leading columns of Z then form an orthonormal basis for the invariant subspace corresponding to the selected eigenvalues. A real matrix is in real-Schur form if it is upper quasi-triangular with 1-by-1 and 2-by-2 blocks. 2-by-2 blocks will be standardized in the form [ a b ; c a ] (first row a, b; second row c, a), where b*c < 0. The eigenvalues of such a block are a ± sqrt(b*c). A complex matrix is in Schur form if it is upper triangular.
Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
jobvs CHARACTER*1. Must be 'N' or 'V'. If jobvs = 'N', then Schur vectors are not computed. If jobvs = 'V', then Schur vectors are computed.
sort CHARACTER*1. Must be 'N' or 'S'. Specifies whether or not to order the eigenvalues on the diagonal of the Schur form. If sort = 'N', then eigenvalues are not ordered. If sort = 'S', eigenvalues are ordered (see select).
select LOGICAL FUNCTION of two REAL arguments for real flavors. LOGICAL FUNCTION of one COMPLEX argument for complex flavors. select must be declared EXTERNAL in the calling subroutine. If sort = 'S', select is used to select eigenvalues to sort to the top left of the Schur form. If sort = 'N', select is not referenced.
For real flavors: An eigenvalue wr(j)+sqrt(-1)*wi(j) is selected if select(wr(j), wi(j)) is true; that is, if either one of a complex conjugate pair of eigenvalues is selected, then both complex eigenvalues are selected. Note that a selected complex eigenvalue may no longer satisfy select(wr(j), wi(j)) = .TRUE. after ordering, since ordering may change the value of complex eigenvalues (especially if the eigenvalue is ill-conditioned); in this case info may be set to n+2 (see info below).
For complex flavors: An eigenvalue w(j) is selected if select(w(j)) is true.
n INTEGER. The order of the matrix A (n ≥ 0).
a, work REAL for sgees DOUBLE PRECISION for dgees COMPLEX for cgees DOUBLE COMPLEX for zgees. Arrays:
a(lda,*) is an array containing the n-by-n matrix A. The second dimension of a must be at least max(1, n).
work is a workspace array, its dimension max(1, lwork).
lda INTEGER. The leading dimension of the array a. Must be at least max(1, n).
ldvs INTEGER. The leading dimension of the output array vs. Constraints: ldvs ≥ 1; ldvs ≥ max(1, n) if jobvs = 'V'.
lwork INTEGER. The dimension of the array work. Constraint: lwork ≥ max(1, 3n) for real flavors; lwork ≥ max(1, 2n) for complex flavors. If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla.
rwork REAL for cgees DOUBLE PRECISION for zgees Workspace array, DIMENSION at least max(1, n). Used in complex flavors only.
bwork LOGICAL. Workspace array, DIMENSION at least max(1, n). Not referenced if sort = 'N'.
Output Parameters
a On exit, this array is overwritten by the real-Schur/Schur form T.
sdim INTEGER. If sort = 'N', sdim = 0. If sort = 'S', sdim is equal to the number of eigenvalues (after sorting) for which select is true. Note that for real flavors complex conjugate pairs for which select is true for either eigenvalue count as 2.
wr, wi REAL for sgees DOUBLE PRECISION for dgees Arrays, DIMENSION at least max(1, n) each. Contain the real and imaginary parts, respectively, of the computed eigenvalues, in the same order that they appear on the diagonal of the output real-Schur form T. Complex conjugate pairs of eigenvalues appear consecutively with the eigenvalue having positive imaginary part first.
w COMPLEX for cgees DOUBLE COMPLEX for zgees. Array, DIMENSION at least max(1, n). Contains the computed eigenvalues.
The eigenvalues are stored in the same order as they appear on the diagonal of the output Schur form T.
vs REAL for sgees DOUBLE PRECISION for dgees COMPLEX for cgees DOUBLE COMPLEX for zgees. Array vs(ldvs,*); the second dimension of vs must be at least max(1, n). If jobvs = 'V', vs contains the orthogonal/unitary matrix Z of Schur vectors. If jobvs = 'N', vs is not referenced.
work(1) On exit, if info = 0, then work(1) returns the required minimal size of lwork.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value. If info = i, and
i ≤ n: the QR algorithm failed to compute all the eigenvalues; elements 1:ilo-1 and i+1:n of wr and wi (for real flavors) or w (for complex flavors) contain those eigenvalues which have converged; if jobvs = 'V', vs contains the matrix which reduces A to its partially converged Schur form;
i = n+1: the eigenvalues could not be reordered because some eigenvalues were too close to separate (the problem is very ill-conditioned);
i = n+2: after reordering, round-off changed values of some complex eigenvalues so that leading eigenvalues in the Schur form no longer satisfy select = .TRUE.. This could also be caused by underflow due to scaling.
Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions. Specific details for the routine gees interface are the following:
a Holds the matrix A of size (n, n).
wr Holds the vector of length n. Used in real flavors only.
wi Holds the vector of length n. Used in real flavors only.
w Holds the vector of length n. Used in complex flavors only.
vs Holds the matrix VS of size (n, n).
jobvs Restored based on the presence of the argument vs as follows: jobvs = 'V', if vs is present; jobvs = 'N', if vs is omitted.
sort Restored based on the presence of the argument select as follows: sort = 'S', if select is present; sort = 'N', if select is omitted.
Application Notes
If you are in doubt how much workspace to supply, use a generous value of lwork for the first run or set lwork = -1. If you choose the first option and set any admissible lwork size, that is, a size no less than the minimal value described, the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace size in the first element of the corresponding array work on exit. Use this value (work(1)) for subsequent runs. If you set lwork = -1, the routine returns immediately and provides the recommended workspace size in the first element of the corresponding array (work). This operation is called a workspace query. Note that if you set lwork to less than the minimal required value and not -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.
?geesx
Computes the eigenvalues and Schur factorization of a general matrix, orders the factorization and computes reciprocal condition numbers.
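The select argument used by the real flavors of ?gees and ?geesx is a user-supplied predicate on the real and imaginary parts of an eigenvalue. The sketch below shows one possible predicate in C and a small helper that applies the documented counting rule for sdim (a complex conjugate pair counts as 2 if either member is selected). The names select_result, select2_fn, select_stable and count_selected are assumptions made for this illustration; in particular, select_result is a plain int standing in for the LAPACKE logical type, and the code does not include or call any MKL header or routine.

```c
/* Stand-in for the LAPACKE logical return type (assumption: plain int). */
typedef int select_result;

/* Shape of a two-argument selector for real flavors (hypothetical alias). */
typedef select_result (*select2_fn)(const double *wr, const double *wi);

/* Example predicate: select eigenvalues with negative real part. */
select_result select_stable(const double *wr, const double *wi)
{
    (void)wi; /* the imaginary part is not needed for this criterion */
    return *wr < 0.0;
}

/* Apply the documented sdim rule to eigenvalues given as (wr, wi):
   complex conjugate pairs are stored consecutively, and a pair counts
   as 2 if the selector is true for either of its members. */
int count_selected(int n, const double *wr, const double *wi, select2_fn sel)
{
    int sdim = 0;
    int j = 0;
    while (j < n) {
        if (wi[j] != 0.0 && j + 1 < n) {
            /* wr[j], wr[j+1] form a complex conjugate pair */
            if (sel(&wr[j], &wi[j]) || sel(&wr[j + 1], &wi[j + 1])) {
                sdim += 2;
            }
            j += 2;
        } else {
            if (sel(&wr[j], &wi[j])) {
                sdim += 1;
            }
            j += 1;
        }
    }
    return sdim;
}
```

A predicate of this shape would be passed as the select argument with sort = 'S'; count_selected is only a way to check, on already computed eigenvalues, which value of sdim the rule implies.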
Syntax Fortran 77: call sgeesx(jobvs, sort, select, sense, n, a, lda, sdim, wr, wi, vs, ldvs, rconde, rcondv, work, lwork, iwork, liwork, bwork, info) call dgeesx(jobvs, sort, select, sense, n, a, lda, sdim, wr, wi, vs, ldvs, rconde, rcondv, work, lwork, iwork, liwork, bwork, info) call cgeesx(jobvs, sort, select, sense, n, a, lda, sdim, w, vs, ldvs, rconde, rcondv, work, lwork, rwork, bwork, info) call zgeesx(jobvs, sort, select, sense, n, a, lda, sdim, w, vs, ldvs, rconde, rcondv, work, lwork, rwork, bwork, info) Fortran 95: call geesx(a, wr, wi [,vs] [,select] [,sdim] [,rconde] [,rcondev] [,info]) call geesx(a, w [,vs] [,select] [,sdim] [,rconde] [,rcondev] [,info]) C: lapack_int LAPACKE_sgeesx( int matrix_order, char jobvs, char sort, LAPACK_S_SELECT2 select, char sense, lapack_int n, float* a, lapack_int lda, lapack_int* sdim, float* wr, float* wi, float* vs, lapack_int ldvs, float* rconde, float* rcondv ); lapack_int LAPACKE_dgeesx( int matrix_order, char jobvs, char sort, LAPACK_D_SELECT2 select, char sense, lapack_int n, double* a, lapack_int lda, lapack_int* sdim, double* wr, double* wi, double* vs, lapack_int ldvs, double* rconde, double* rcondv ); lapack_int LAPACKE_cgeesx( int matrix_order, char jobvs, char sort, LAPACK_C_SELECT1 select, char sense, lapack_int n, lapack_complex_float* a, lapack_int lda, lapack_int* sdim, lapack_complex_float* w, lapack_complex_float* vs, lapack_int ldvs, float* rconde, float* rcondv ); lapack_int LAPACKE_zgeesx( int matrix_order, char jobvs, char sort, LAPACK_Z_SELECT1 select, char sense, lapack_int n, lapack_complex_double* a, lapack_int lda, lapack_int* sdim, lapack_complex_double* w, lapack_complex_double* vs, lapack_int ldvs, double* rconde, double* rcondv ); Include Files • Fortran: mkl_lapack.fi and mkl_lapack.h • Fortran 95: lapack.f90 • C: mkl_lapacke.h Description The routine computes for an n-by-n real/complex nonsymmetric matrix A, the eigenvalues, the real-Schur/ Schur form T, and, optionally, the matrix of 
Schur vectors Z. This gives the Schur factorization A = Z*T*Z^H. Optionally, it also orders the eigenvalues on the diagonal of the real-Schur/Schur form so that selected eigenvalues are at the top left; computes a reciprocal condition number for the average of the selected eigenvalues (rconde); and computes a reciprocal condition number for the right invariant subspace corresponding to the selected eigenvalues (rcondv). The leading columns of Z form an orthonormal basis for this invariant subspace. For further explanation of the reciprocal condition numbers rconde and rcondv, see [LUG], Section 4.10 (where these quantities are called s and sep respectively). A real matrix is in real-Schur form if it is upper quasi-triangular with 1-by-1 and 2-by-2 blocks. 2-by-2 blocks will be standardized in the form [ a b ; c a ] (first row a, b; second row c, a), where b*c < 0. The eigenvalues of such a block are a ± sqrt(b*c). A complex matrix is in Schur form if it is upper triangular.
Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
jobvs CHARACTER*1. Must be 'N' or 'V'. If jobvs = 'N', then Schur vectors are not computed. If jobvs = 'V', then Schur vectors are computed.
sort CHARACTER*1. Must be 'N' or 'S'. Specifies whether or not to order the eigenvalues on the diagonal of the Schur form. If sort = 'N', then eigenvalues are not ordered. If sort = 'S', eigenvalues are ordered (see select).
select LOGICAL FUNCTION of two REAL arguments for real flavors. LOGICAL FUNCTION of one COMPLEX argument for complex flavors. select must be declared EXTERNAL in the calling subroutine. If sort = 'S', select is used to select eigenvalues to sort to the top left of the Schur form. If sort = 'N', select is not referenced.
For real flavors: An eigenvalue wr(j)+sqrt(-1)*wi(j) is selected if select(wr(j), wi(j)) is true; that is, if either one of a complex conjugate pair of eigenvalues is selected, then both complex eigenvalues are selected. Note that a selected complex eigenvalue may no longer satisfy select(wr(j), wi(j)) = .TRUE. after ordering, since ordering may change the value of complex eigenvalues (especially if the eigenvalue is ill-conditioned); in this case info may be set to n+2 (see info below).
For complex flavors: An eigenvalue w(j) is selected if select(w(j)) is true.
sense CHARACTER*1. Must be 'N', 'E', 'V', or 'B'. Determines which reciprocal condition numbers are computed. If sense = 'N', none are computed; if sense = 'E', computed for average of selected eigenvalues only; if sense = 'V', computed for selected right invariant subspace only; if sense = 'B', computed for both. If sense is 'E', 'V', or 'B', then sort must equal 'S'.
n INTEGER. The order of the matrix A (n ≥ 0).
a, work REAL for sgeesx DOUBLE PRECISION for dgeesx COMPLEX for cgeesx DOUBLE COMPLEX for zgeesx. Arrays:
a(lda,*) is an array containing the n-by-n matrix A. The second dimension of a must be at least max(1, n).
work is a workspace array, its dimension max(1, lwork).
lda INTEGER. The leading dimension of the array a. Must be at least max(1, n).
ldvs INTEGER. The leading dimension of the output array vs. Constraints: ldvs ≥ 1; ldvs ≥ max(1, n) if jobvs = 'V'.
lwork INTEGER. The dimension of the array work. Constraint: lwork ≥ max(1, 3n) for real flavors; lwork ≥ max(1, 2n) for complex flavors. Also, if sense = 'E', 'V', or 'B', then lwork ≥ n+2*sdim*(n-sdim) for real flavors; lwork ≥ 2*sdim*(n-sdim) for complex flavors; where sdim is the number of selected eigenvalues computed by this routine. Note that 2*sdim*(n-sdim) ≤ n*n/2.
Note also that an error is only returned if lwork
A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
jobvl CHARACTER*1. Must be 'N' or 'V'. If jobvl = 'N', then left eigenvectors of A are not computed. If jobvl = 'V', then left eigenvectors of A are computed.
jobvr CHARACTER*1. Must be 'N' or 'V'. If jobvr = 'N', then right eigenvectors of A are not computed. If jobvr = 'V', then right eigenvectors of A are computed.
n INTEGER. The order of the matrix A (n ≥ 0).
a, work REAL for sgeev DOUBLE PRECISION for dgeev COMPLEX for cgeev DOUBLE COMPLEX for zgeev. Arrays:
a(lda,*) is an array containing the n-by-n matrix A. The second dimension of a must be at least max(1, n).
work is a workspace array, its dimension max(1, lwork).
lda INTEGER. The leading dimension of the array a. Must be at least max(1, n).
ldvl, ldvr INTEGER. The leading dimensions of the output arrays vl and vr, respectively. Constraints: ldvl ≥ 1; ldvr ≥ 1. If jobvl = 'V', ldvl ≥ max(1, n); if jobvr = 'V', ldvr ≥ max(1, n).
lwork INTEGER. The dimension of the array work. Constraint: lwork ≥ max(1, 3n), and if jobvl = 'V' or jobvr = 'V', lwork ≥ max(1, 4n) (for real flavors); lwork ≥ max(1, 2n) (for complex flavors). For good performance, lwork must generally be larger. If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla.
rwork REAL for cgeev DOUBLE PRECISION for zgeev Workspace array, DIMENSION at least max(1, 2n). Used in complex flavors only.
Output Parameters
a On exit, this array is overwritten by intermediate results.
wr, wi REAL for sgeev DOUBLE PRECISION for dgeev Arrays, DIMENSION at least max(1, n) each.
Contain the real and imaginary parts, respectively, of the computed eigenvalues. Complex conjugate pairs of eigenvalues appear consecutively with the eigenvalue having positive imaginary part first.
w COMPLEX for cgeev DOUBLE COMPLEX for zgeev. Array, DIMENSION at least max(1, n). Contains the computed eigenvalues.
vl, vr REAL for sgeev DOUBLE PRECISION for dgeev COMPLEX for cgeev DOUBLE COMPLEX for zgeev. Arrays:
vl(ldvl,*); the second dimension of vl must be at least max(1, n). If jobvl = 'V', the left eigenvectors u(j) are stored one after another in the columns of vl, in the same order as their eigenvalues. If jobvl = 'N', vl is not referenced.
For real flavors: If the j-th eigenvalue is real, then u(j) = vl(:,j), the j-th column of vl. If the j-th and (j+1)-st eigenvalues form a complex conjugate pair, then u(j) = vl(:,j) + i*vl(:,j+1) and u(j+1) = vl(:,j) - i*vl(:,j+1), where i = sqrt(-1).
For complex flavors: u(j) = vl(:,j), the j-th column of vl.
vr(ldvr,*); the second dimension of vr must be at least max(1, n). If jobvr = 'V', the right eigenvectors v(j) are stored one after another in the columns of vr, in the same order as their eigenvalues. If jobvr = 'N', vr is not referenced.
For real flavors: If the j-th eigenvalue is real, then v(j) = vr(:,j), the j-th column of vr. If the j-th and (j+1)-st eigenvalues form a complex conjugate pair, then v(j) = vr(:,j) + i*vr(:,j+1) and v(j+1) = vr(:,j) - i*vr(:,j+1), where i = sqrt(-1).
For complex flavors: v(j) = vr(:,j), the j-th column of vr.
work(1) On exit, if info = 0, then work(1) returns the required minimal size of lwork.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value. If info = i, the QR algorithm failed to compute all the eigenvalues, and no eigenvectors have been computed; elements i+1:n of wr and wi (for real flavors) or w (for complex flavors) contain those eigenvalues which have converged.
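The packed storage of complex eigenvectors in the real flavors, as described for vl and vr above, can be unpacked with a few lines of C. The following is a minimal sketch under stated assumptions: unpack_eigenvector is a hypothetical helper (not an MKL routine), the eigenvector array is assumed to be in column-major layout with leading dimension ldv, and the column index j is zero-based, unlike the one-based indices in the text.

```c
#include <assert.h>
#include <complex.h>
#include <stddef.h>

/* Hypothetical helper: rebuild the j-th eigenvector (zero-based j) from the
   real-flavor storage of vl or vr.
   wi  - imaginary parts of the eigenvalues (length n)
   v   - column-major eigenvector array with leading dimension ldv
   out - receives the n complex components of the eigenvector */
void unpack_eigenvector(int n, const double *wi, const double *v, int ldv,
                        int j, double complex *out)
{
    assert(n > 0 && ldv >= n && j >= 0 && j < n);
    const double *col = v + (size_t)j * (size_t)ldv;

    if (wi[j] == 0.0) {
        /* real eigenvalue: the column is the eigenvector itself */
        for (int k = 0; k < n; ++k) {
            out[k] = col[k];
        }
    } else if (wi[j] > 0.0) {
        /* first member of a pair: v(:,j) + i*v(:,j+1) */
        assert(j + 1 < n);
        const double *next = col + ldv;
        for (int k = 0; k < n; ++k) {
            out[k] = col[k] + I * next[k];
        }
    } else {
        /* second member of a pair: v(:,j-1) - i*v(:,j) */
        assert(j > 0);
        const double *prev = col - ldv;
        for (int k = 0; k < n; ++k) {
            out[k] = prev[k] - I * col[k];
        }
    }
}
```

The sign test on wi[j] relies on the documented ordering, in which the eigenvalue with positive imaginary part comes first in each conjugate pair.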
Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions. Specific details for the routine geev interface are the following:
a Holds the matrix A of size (n, n).
wr Holds the vector of length n. Used in real flavors only.
wi Holds the vector of length n. Used in real flavors only.
w Holds the vector of length n. Used in complex flavors only.
vl Holds the matrix VL of size (n, n).
vr Holds the matrix VR of size (n, n).
jobvl Restored based on the presence of the argument vl as follows: jobvl = 'V', if vl is present; jobvl = 'N', if vl is omitted.
jobvr Restored based on the presence of the argument vr as follows: jobvr = 'V', if vr is present; jobvr = 'N', if vr is omitted.
Application Notes
If you are in doubt how much workspace to supply, use a generous value of lwork for the first run or set lwork = -1. If you choose the first option and set any admissible lwork size, that is, a size no less than the minimal value described, the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace size in the first element of the corresponding array work on exit. Use this value (work(1)) for subsequent runs. If you set lwork = -1, the routine returns immediately and provides the recommended workspace size in the first element of the corresponding array (work). This operation is called a workspace query. Note that if you set lwork to less than the minimal required value and not -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.
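As a small worked example of the minimal workspace rule for ?geev, the helper below computes the smallest admissible lwork from n, the flavor, and whether any eigenvectors are requested. It assumes the constraints read as lower bounds: max(1, 3n) for real flavors, max(1, 4n) for real flavors when jobvl = 'V' or jobvr = 'V', and max(1, 2n) for complex flavors. The function name geev_min_lwork and its flag arguments are hypothetical and not part of Intel MKL.

```c
/* Hypothetical helper: smallest admissible lwork for ?geev.
   is_complex   - nonzero for cgeev/zgeev, zero for sgeev/dgeev
   want_vectors - nonzero if jobvl = 'V' or jobvr = 'V' */
long geev_min_lwork(long n, int is_complex, int want_vectors)
{
    long factor;
    if (is_complex) {
        factor = 2;                    /* max(1, 2n) */
    } else {
        factor = want_vectors ? 4 : 3; /* max(1, 4n) or max(1, 3n) */
    }
    long need = factor * n;
    return need > 1 ? need : 1;
}
```

For n = 10, this gives 30 for eigenvalues only with a real flavor, 40 when eigenvectors are requested, and 20 for the complex flavors. These are only the minimal values; for good performance the size returned by a workspace query is generally larger.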

?geevx
Computes the eigenvalues and left and right eigenvectors of a general matrix, with preliminary matrix balancing, and computes reciprocal condition numbers for the eigenvalues and right eigenvectors.

Syntax
Fortran 77:
call sgeevx(balanc, jobvl, jobvr, sense, n, a, lda, wr, wi, vl, ldvl, vr, ldvr, ilo, ihi, scale, abnrm, rconde, rcondv, work, lwork, iwork, info)
call dgeevx(balanc, jobvl, jobvr, sense, n, a, lda, wr, wi, vl, ldvl, vr, ldvr, ilo, ihi, scale, abnrm, rconde, rcondv, work, lwork, iwork, info)
call cgeevx(balanc, jobvl, jobvr, sense, n, a, lda, w, vl, ldvl, vr, ldvr, ilo, ihi, scale, abnrm, rconde, rcondv, work, lwork, rwork, info)
call zgeevx(balanc, jobvl, jobvr, sense, n, a, lda, w, vl, ldvl, vr, ldvr, ilo, ihi, scale, abnrm, rconde, rcondv, work, lwork, rwork, info)
Fortran 95:
call geevx(a, wr, wi [,vl] [,vr] [,balanc] [,ilo] [,ihi] [,scale] [,abnrm] [,rconde] [,rcondv] [,info])
call geevx(a, w [,vl] [,vr] [,balanc] [,ilo] [,ihi] [,scale] [,abnrm] [,rconde] [,rcondv] [,info])
C:
lapack_int LAPACKE_sgeevx( int matrix_order, char balanc, char jobvl, char jobvr, char sense, lapack_int n, float* a, lapack_int lda, float* wr, float* wi, float* vl, lapack_int ldvl, float* vr, lapack_int ldvr, lapack_int* ilo, lapack_int* ihi, float* scale, float* abnrm, float* rconde, float* rcondv );
lapack_int LAPACKE_dgeevx( int matrix_order, char balanc, char jobvl, char jobvr, char sense, lapack_int n, double* a, lapack_int lda, double* wr, double* wi, double* vl, lapack_int ldvl, double* vr, lapack_int ldvr, lapack_int* ilo, lapack_int* ihi, double* scale, double* abnrm, double* rconde, double* rcondv );
lapack_int LAPACKE_cgeevx( int matrix_order, char balanc, char jobvl, char jobvr, char sense, lapack_int n, lapack_complex_float* a, lapack_int lda, lapack_complex_float* w, lapack_complex_float* vl, lapack_int ldvl, lapack_complex_float* vr, lapack_int ldvr, lapack_int* ilo, lapack_int* ihi, float* scale, float* abnrm, float* rconde, float* rcondv );
lapack_int LAPACKE_zgeevx( int matrix_order, char balanc, char jobvl, char jobvr, char sense, lapack_int n, lapack_complex_double* a, lapack_int lda, lapack_complex_double* w, lapack_complex_double* vl, lapack_int ldvl, lapack_complex_double* vr, lapack_int ldvr, lapack_int* ilo, lapack_int* ihi, double* scale, double* abnrm, double* rconde, double* rcondv );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine computes for an n-by-n real/complex nonsymmetric matrix A, the eigenvalues and, optionally, the left and/or right eigenvectors. Optionally also, it computes a balancing transformation to improve the conditioning of the eigenvalues and eigenvectors (ilo, ihi, scale, and abnrm), reciprocal condition numbers for the eigenvalues (rconde), and reciprocal condition numbers for the right eigenvectors (rcondv).
The right eigenvector v(j) of A satisfies
A*v(j) = λ(j)*v(j)
where λ(j) is its eigenvalue.
The left eigenvector u(j) of A satisfies
u(j)^T*A = λ(j)*u(j)^T
where u(j)^T denotes the transpose of u(j). The computed eigenvectors are normalized to have Euclidean norm equal to 1 and largest component real.
Balancing a matrix means permuting the rows and columns to make it more nearly upper triangular, and applying a diagonal similarity transformation D*A*inv(D), where D is a diagonal matrix, to make its rows and columns closer in norm and the condition numbers of its eigenvalues and eigenvectors smaller. The computed reciprocal condition numbers correspond to the balanced matrix. Permuting rows and columns will not change the condition numbers (in exact arithmetic) but diagonal scaling will. For further explanation of balancing, see [LUG], Section 4.10.

Input Parameters
The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above.
See the C Interface Conventions section for the C interface principal conventions and type definitions.
balanc CHARACTER*1. Must be 'N', 'P', 'S', or 'B'. Indicates how the input matrix should be diagonally scaled and/or permuted to improve the conditioning of its eigenvalues.
If balanc = 'N', do not diagonally scale or permute;
If balanc = 'P', perform permutations to make the matrix more nearly upper triangular. Do not diagonally scale;
If balanc = 'S', diagonally scale the matrix, i.e. replace A by D*A*inv(D), where D is a diagonal matrix chosen to make the rows and columns of A more equal in norm. Do not permute;
If balanc = 'B', both diagonally scale and permute A.
Computed reciprocal condition numbers will be for the matrix after balancing and/or permuting. Permuting does not change condition numbers (in exact arithmetic), but balancing does.
jobvl CHARACTER*1. Must be 'N' or 'V'. If jobvl = 'N', left eigenvectors of A are not computed; If jobvl = 'V', left eigenvectors of A are computed. If sense = 'E' or 'B', then jobvl must be 'V'.
jobvr CHARACTER*1. Must be 'N' or 'V'. If jobvr = 'N', right eigenvectors of A are not computed; If jobvr = 'V', right eigenvectors of A are computed. If sense = 'E' or 'B', then jobvr must be 'V'.
sense CHARACTER*1. Must be 'N', 'E', 'V', or 'B'. Determines which reciprocal condition numbers are computed.
If sense = 'N', none are computed;
If sense = 'E', computed for eigenvalues only;
If sense = 'V', computed for right eigenvectors only;
If sense = 'B', computed for eigenvalues and right eigenvectors.
If sense is 'E' or 'B', both left and right eigenvectors must also be computed (jobvl = 'V' and jobvr = 'V').
n INTEGER. The order of the matrix A (n ≥ 0).
a, work REAL for sgeevx, DOUBLE PRECISION for dgeevx, COMPLEX for cgeevx, DOUBLE COMPLEX for zgeevx. Arrays: a(lda,*) is an array containing the n-by-n matrix A. The second dimension of a must be at least max(1, n).
work is a workspace array, its dimension max(1, lwork).
lda INTEGER. The leading dimension of the array a. Must be at least max(1, n).
ldvl, ldvr INTEGER. The leading dimensions of the output arrays vl and vr, respectively. Constraints: ldvl ≥ 1; ldvr ≥ 1. If jobvl = 'V', ldvl ≥ max(1, n); If jobvr = 'V', ldvr ≥ max(1, n).
lwork INTEGER. The dimension of the array work.
For real flavors: If sense = 'N' or 'E', lwork ≥ max(1, 2n), and if jobvl = 'V' or jobvr = 'V', lwork ≥ 3n; If sense = 'V' or 'B', lwork ≥ n*(n+6). For good performance, lwork must generally be larger.
For complex flavors: If sense = 'N' or 'E', lwork ≥ max(1, 2n); If sense = 'V' or 'B', lwork ≥ n^2 + 2n. For good performance, lwork must generally be larger.
If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla.
rwork REAL for cgeevx, DOUBLE PRECISION for zgeevx. Workspace array, DIMENSION at least max(1, 2n). Used in complex flavors only.
iwork INTEGER. Workspace array, DIMENSION at least max(1, 2n-2). Used in real flavors only. Not referenced if sense = 'N' or 'E'.

Output Parameters
a On exit, this array is overwritten. If jobvl = 'V' or jobvr = 'V', it contains the real-Schur/Schur form of the balanced version of the input matrix A.
wr, wi REAL for sgeevx, DOUBLE PRECISION for dgeevx. Arrays, DIMENSION at least max(1, n) each. Contain the real and imaginary parts, respectively, of the computed eigenvalues. Complex conjugate pairs of eigenvalues appear consecutively with the eigenvalue having positive imaginary part first.
w COMPLEX for cgeevx, DOUBLE COMPLEX for zgeevx. Array, DIMENSION at least max(1, n). Contains the computed eigenvalues.
vl, vr REAL for sgeevx, DOUBLE PRECISION for dgeevx, COMPLEX for cgeevx, DOUBLE COMPLEX for zgeevx.
Arrays:
vl(ldvl,*); the second dimension of vl must be at least max(1, n). If jobvl = 'V', the left eigenvectors u(j) are stored one after another in the columns of vl, in the same order as their eigenvalues. If jobvl = 'N', vl is not referenced.
For real flavors: If the j-th eigenvalue is real, then u(j) = vl(:,j), the j-th column of vl. If the j-th and (j+1)-st eigenvalues form a complex conjugate pair, then u(j) = vl(:,j) + i*vl(:,j+1) and u(j+1) = vl(:,j) - i*vl(:,j+1), where i = sqrt(-1).
For complex flavors: u(j) = vl(:,j), the j-th column of vl.
vr(ldvr,*); the second dimension of vr must be at least max(1, n). If jobvr = 'V', the right eigenvectors v(j) are stored one after another in the columns of vr, in the same order as their eigenvalues. If jobvr = 'N', vr is not referenced.
For real flavors: If the j-th eigenvalue is real, then v(j) = vr(:,j), the j-th column of vr. If the j-th and (j+1)-st eigenvalues form a complex conjugate pair, then v(j) = vr(:,j) + i*vr(:,j+1) and v(j+1) = vr(:,j) - i*vr(:,j+1), where i = sqrt(-1).
For complex flavors: v(j) = vr(:,j), the j-th column of vr.
ilo, ihi INTEGER. ilo and ihi are integer values determined when A was balanced. The balanced A(i,j) = 0 if i > j and j = 1,..., ilo-1 or i = ihi+1,..., n. If balanc = 'N' or 'S', ilo = 1 and ihi = n.
scale REAL for single-precision flavors, DOUBLE PRECISION for double-precision flavors. Array, DIMENSION at least max(1, n). Details of the permutations and scaling factors applied when balancing A. If P(j) is the index of the row and column interchanged with row and column j, and D(j) is the scaling factor applied to row and column j, then
scale(j) = P(j), for j = 1,..., ilo-1
scale(j) = D(j), for j = ilo,..., ihi
scale(j) = P(j), for j = ihi+1,..., n.
The order in which the interchanges are made is n to ihi+1, then 1 to ilo-1.
abnrm REAL for single-precision flavors, DOUBLE PRECISION for double-precision flavors.
The one-norm of the balanced matrix (the maximum of the sum of absolute values of elements of any column).
rconde, rcondv REAL for single precision flavors, DOUBLE PRECISION for double precision flavors. Arrays, DIMENSION at least max(1, n) each. rconde(j) is the reciprocal condition number of the j-th eigenvalue. rcondv(j) is the reciprocal condition number of the j-th right eigenvector.
work(1) On exit, if info = 0, then work(1) returns the required minimal size of lwork.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value. If info = i, the QR algorithm failed to compute all the eigenvalues, and no eigenvectors or condition numbers have been computed; elements 1:ilo-1 and i+1:n of wr and wi (for real flavors) or w (for complex flavors) contain eigenvalues which have converged.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions. Specific details for the routine geevx interface are the following:
a Holds the matrix A of size (n, n).
wr Holds the vector of length n. Used in real flavors only.
wi Holds the vector of length n. Used in real flavors only.
w Holds the vector of length n. Used in complex flavors only.
vl Holds the matrix VL of size (n, n).
vr Holds the matrix VR of size (n, n).
scale Holds the vector of length n.
rconde Holds the vector of length n.
rcondv Holds the vector of length n.
balanc Must be 'N', 'B', 'P' or 'S'. The default value is 'N'.
jobvl Restored based on the presence of the argument vl as follows: jobvl = 'V', if vl is present, jobvl = 'N', if vl is omitted.
jobvr Restored based on the presence of the argument vr as follows: jobvr = 'V', if vr is present, jobvr = 'N', if vr is omitted.
sense Restored based on the presence of arguments rconde and rcondv as follows:
sense = 'B', if both rconde and rcondv are present,
sense = 'E', if rconde is present and rcondv omitted,
sense = 'V', if rconde is omitted and rcondv present,
sense = 'N', if both rconde and rcondv are omitted.

Application Notes
If you are in doubt how much workspace to supply, use a generous value of lwork for the first run or set lwork = -1.
If you choose the first option and set any admissible lwork size, that is, one no less than the minimal value described, the routine completes the task, though probably not as fast as with a recommended workspace, and provides the recommended workspace in the first element of the corresponding array work on exit. Use this value (work(1)) for subsequent runs.
If you set lwork = -1, the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work). This operation is called a workspace query.
Note that if you set lwork to less than the minimal required value and not -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.

Singular Value Decomposition
This section describes LAPACK driver routines used for solving singular value problems. See also computational routines that can be called to solve these problems. Table "Driver Routines for Singular Value Decomposition" lists all such driver routines for the FORTRAN 77 interface. Respective routine names in the Fortran 95 interface are without the first symbol (see Routine Naming Conventions).

Driver Routines for Singular Value Decomposition
Routine Name    Operation performed
gesvd    Computes the singular value decomposition of a general rectangular matrix.
gesdd    Computes the singular value decomposition of a general rectangular matrix using a divide and conquer method.
gejsv    Computes the singular value decomposition of a real matrix using a preconditioned Jacobi SVD method.
gesvj    Computes the singular value decomposition of a real matrix using Jacobi plane rotations.
ggsvd    Computes the generalized singular value decomposition of a pair of general rectangular matrices.

?gesvd
Computes the singular value decomposition of a general rectangular matrix.

Syntax
Fortran 77:
call sgesvd(jobu, jobvt, m, n, a, lda, s, u, ldu, vt, ldvt, work, lwork, info)
call dgesvd(jobu, jobvt, m, n, a, lda, s, u, ldu, vt, ldvt, work, lwork, info)
call cgesvd(jobu, jobvt, m, n, a, lda, s, u, ldu, vt, ldvt, work, lwork, rwork, info)
call zgesvd(jobu, jobvt, m, n, a, lda, s, u, ldu, vt, ldvt, work, lwork, rwork, info)
Fortran 95:
call gesvd(a, s [,u] [,vt] [,ww] [,job] [,info])
C:
lapack_int LAPACKE_sgesvd( int matrix_order, char jobu, char jobvt, lapack_int m, lapack_int n, float* a, lapack_int lda, float* s, float* u, lapack_int ldu, float* vt, lapack_int ldvt, float* superb );
lapack_int LAPACKE_dgesvd( int matrix_order, char jobu, char jobvt, lapack_int m, lapack_int n, double* a, lapack_int lda, double* s, double* u, lapack_int ldu, double* vt, lapack_int ldvt, double* superb );
lapack_int LAPACKE_cgesvd( int matrix_order, char jobu, char jobvt, lapack_int m, lapack_int n, lapack_complex_float* a, lapack_int lda, float* s, lapack_complex_float* u, lapack_int ldu, lapack_complex_float* vt, lapack_int ldvt, float* superb );
lapack_int LAPACKE_zgesvd( int matrix_order, char jobu, char jobvt, lapack_int m, lapack_int n, lapack_complex_double* a, lapack_int lda, double* s, lapack_complex_double* u, lapack_int ldu, lapack_complex_double* vt, lapack_int ldvt, double* superb );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine computes the singular value decomposition (SVD) of a real/complex m-by-n matrix A, optionally computing the left and/or right singular vectors.
The SVD is written as
A = U*S*V^T for real routines
A = U*S*V^H for complex routines
where S is an m-by-n matrix which is zero except for its min(m,n) diagonal elements, U is an m-by-m orthogonal/unitary matrix, and V is an n-by-n orthogonal/unitary matrix. The diagonal elements of S are the singular values of A; they are real and non-negative, and are returned in descending order. The first min(m, n) columns of U and V are the left and right singular vectors of A.
Note that the routine returns V^T (for real flavors) or V^H (for complex flavors), not V.

Input Parameters
The data types are given for the Fortran interface, except for superb. A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
jobu CHARACTER*1. Must be 'A', 'S', 'O', or 'N'. Specifies options for computing all or part of the matrix U.
If jobu = 'A', all m columns of U are returned in the array u;
if jobu = 'S', the first min(m, n) columns of U (the left singular vectors) are returned in the array u;
if jobu = 'O', the first min(m, n) columns of U (the left singular vectors) are overwritten on the array a;
if jobu = 'N', no columns of U (no left singular vectors) are computed.
jobvt CHARACTER*1. Must be 'A', 'S', 'O', or 'N'. Specifies options for computing all or part of the matrix V^T/V^H.
If jobvt = 'A', all n rows of V^T/V^H are returned in the array vt;
if jobvt = 'S', the first min(m,n) rows of V^T/V^H (the right singular vectors) are returned in the array vt;
if jobvt = 'O', the first min(m,n) rows of V^T/V^H (the right singular vectors) are overwritten on the array a;
if jobvt = 'N', no rows of V^T/V^H (no right singular vectors) are computed.
jobvt and jobu cannot both be 'O'.
m INTEGER. The number of rows of the matrix A (m ≥ 0).
n INTEGER. The number of columns in A (n ≥ 0).
a, work REAL for sgesvd, DOUBLE PRECISION for dgesvd, COMPLEX for cgesvd, DOUBLE COMPLEX for zgesvd. Arrays: a(lda,*) is an array containing the m-by-n matrix A. The second dimension of a must be at least max(1, n). work is a workspace array, its dimension max(1, lwork).
lda INTEGER. The leading dimension of the array a. Must be at least max(1, m).
ldu, ldvt INTEGER. The leading dimensions of the output arrays u and vt, respectively. Constraints: ldu ≥ 1; ldvt ≥ 1. If jobu = 'S' or 'A', ldu ≥ m; If jobvt = 'A', ldvt ≥ n; If jobvt = 'S', ldvt ≥ min(m, n).
lwork INTEGER. The dimension of the array work. Constraints: lwork ≥ 1;
lwork ≥ max(3*min(m, n)+max(m, n), 5*min(m,n)) (for real flavors);
lwork ≥ 2*min(m, n)+max(m, n) (for complex flavors).
For good performance, lwork must generally be larger. If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla. See Application Notes for details.
rwork REAL for cgesvd, DOUBLE PRECISION for zgesvd. Workspace array, DIMENSION at least max(1, 5*min(m, n)). Used in complex flavors only.

Output Parameters
a On exit:
If jobu = 'O', a is overwritten with the first min(m,n) columns of U (the left singular vectors stored columnwise);
If jobvt = 'O', a is overwritten with the first min(m, n) rows of V^T/V^H (the right singular vectors stored rowwise);
If jobu ≠ 'O' and jobvt ≠ 'O', the contents of a are destroyed.
s REAL for single precision flavors, DOUBLE PRECISION for double precision flavors. Array, DIMENSION at least max(1, min(m,n)). Contains the singular values of A sorted so that s(i) ≥ s(i+1).
u, vt REAL for sgesvd, DOUBLE PRECISION for dgesvd, COMPLEX for cgesvd, DOUBLE COMPLEX for zgesvd.
Arrays:
u(ldu,*); the second dimension of u must be at least max(1, m) if jobu = 'A', and at least max(1, min(m, n)) if jobu = 'S'. If jobu = 'A', u contains the m-by-m orthogonal/unitary matrix U. If jobu = 'S', u contains the first min(m, n) columns of U (the left singular vectors stored column-wise). If jobu = 'N' or 'O', u is not referenced.
vt(ldvt,*); the second dimension of vt must be at least max(1, n). If jobvt = 'A', vt contains the n-by-n orthogonal/unitary matrix V^T/V^H. If jobvt = 'S', vt contains the first min(m, n) rows of V^T/V^H (the right singular vectors stored row-wise). If jobvt = 'N' or 'O', vt is not referenced.
work On exit, if info = 0, then work(1) returns the required minimal size of lwork. For real flavors: If info > 0, work(2:min(m,n)) contains the unconverged superdiagonal elements of an upper bidiagonal matrix B whose diagonal is in s (not necessarily sorted). B satisfies A = u*B*vt, so it has the same singular values as A, and singular vectors related by u and vt.
rwork On exit (for complex flavors), if info > 0, rwork(1:min(m,n)-1) contains the unconverged superdiagonal elements of an upper bidiagonal matrix B whose diagonal is in s (not necessarily sorted). B satisfies A = u*B*vt, so it has the same singular values as A, and singular vectors related by u and vt.
superb (C interface) On exit, superb(0:min(m,n)-2) contains the unconverged superdiagonal elements of an upper bidiagonal matrix B whose diagonal is in s (not necessarily sorted). B satisfies A = u*B*vt, so it has the same singular values as A, and singular vectors related by u and vt.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value. If info = i, then if ?bdsqr did not converge, i specifies how many superdiagonals of the intermediate bidiagonal form B did not converge to zero.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts.
For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions. Specific details for the routine gesvd interface are the following:
a Holds the matrix A of size (m, n).
s Holds the vector of length min(m, n).
u If present and is a square m-by-m matrix, on exit contains the m-by-m orthogonal/unitary matrix U. Otherwise, if present, on exit contains the first min(m,n) columns of the matrix U (left singular vectors stored column-wise).
vt If present and is a square n-by-n matrix, on exit contains the n-by-n orthogonal/unitary matrix V^T/V^H. Otherwise, if present, on exit contains the first min(m,n) rows of the matrix V^T/V^H (right singular vectors stored row-wise).
ww Holds the vector of length min(m, n)-1. ww contains the unconverged superdiagonal elements of an upper bidiagonal matrix B whose diagonal is in s (not necessarily sorted). B satisfies A = U*B*V^T, so it has the same singular values as A, and singular vectors related by U and V^T.
job Must be either 'N', or 'U', or 'V'. The default value is 'N'. If job = 'U', and u is not present, then u is returned in the array a. If job = 'V', and vt is not present, then vt is returned in the array a.

Application Notes
If you are in doubt how much workspace to supply, use a generous value of lwork for the first run or set lwork = -1.
If you choose the first option and set any admissible lwork size, that is, one no less than the minimal value described, the routine completes the task, though probably not as fast as with a recommended workspace, and provides the recommended workspace in the first element of the corresponding array work on exit. Use this value (work(1)) for subsequent runs.
If you set lwork = -1, the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work). This operation is called a workspace query.
Note that if you set lwork to less than the minimal required value and not -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.

?gesdd
Computes the singular value decomposition of a general rectangular matrix using a divide and conquer method.

Syntax
Fortran 77:
call sgesdd(jobz, m, n, a, lda, s, u, ldu, vt, ldvt, work, lwork, iwork, info)
call dgesdd(jobz, m, n, a, lda, s, u, ldu, vt, ldvt, work, lwork, iwork, info)
call cgesdd(jobz, m, n, a, lda, s, u, ldu, vt, ldvt, work, lwork, rwork, iwork, info)
call zgesdd(jobz, m, n, a, lda, s, u, ldu, vt, ldvt, work, lwork, rwork, iwork, info)
Fortran 95:
call gesdd(a, s [,u] [,vt] [,jobz] [,info])
C:
lapack_int LAPACKE_sgesdd( int matrix_order, char jobz, lapack_int m, lapack_int n, float* a, lapack_int lda, float* s, float* u, lapack_int ldu, float* vt, lapack_int ldvt );
lapack_int LAPACKE_dgesdd( int matrix_order, char jobz, lapack_int m, lapack_int n, double* a, lapack_int lda, double* s, double* u, lapack_int ldu, double* vt, lapack_int ldvt );
lapack_int LAPACKE_cgesdd( int matrix_order, char jobz, lapack_int m, lapack_int n, lapack_complex_float* a, lapack_int lda, float* s, lapack_complex_float* u, lapack_int ldu, lapack_complex_float* vt, lapack_int ldvt );
lapack_int LAPACKE_zgesdd( int matrix_order, char jobz, lapack_int m, lapack_int n, lapack_complex_double* a, lapack_int lda, double* s, lapack_complex_double* u, lapack_int ldu, lapack_complex_double* vt, lapack_int ldvt );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine computes the singular value decomposition (SVD) of a real/complex m-by-n matrix A, optionally computing the left and/or right singular vectors. If singular vectors are desired, it uses a divide-and-conquer algorithm.
The SVD is written
A = U*S*V' for real routines,
A = U*S*conjg(V') for complex routines,
where S is an m-by-n matrix which is zero except for its min(m,n) diagonal elements, U is an m-by-m orthogonal/unitary matrix, and V is an n-by-n orthogonal/unitary matrix. The diagonal elements of S are the singular values of A; they are real and non-negative, and are returned in descending order. The first min(m, n) columns of U and V are the left and right singular vectors of A.
Note that the routine returns vt = V' (for real flavors) or vt = conjg(V') (for complex flavors), not V.

Input Parameters
The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
jobz CHARACTER*1. Must be 'A', 'S', 'O', or 'N'. Specifies options for computing all or part of the matrix U.
If jobz = 'A', all m columns of U and all n rows of V'/conjg(V') are returned in the arrays u and vt;
if jobz = 'S', the first min(m, n) columns of U and the first min(m, n) rows of V'/conjg(V') are returned in the arrays u and vt;
if jobz = 'O', then
if m ≥ n, the first n columns of U are overwritten in the array a and all rows of V'/conjg(V') are returned in the array vt;
if m < n, all columns of U are returned in the array u and the first m rows of V'/conjg(V') are overwritten in the array a;
if jobz = 'N', no columns of U or rows of V'/conjg(V') are computed.
m INTEGER. The number of rows of the matrix A (m ≥ 0).
n INTEGER. The number of columns in A (n ≥ 0).
a, work REAL for sgesdd, DOUBLE PRECISION for dgesdd, COMPLEX for cgesdd, DOUBLE COMPLEX for zgesdd. Arrays: a(lda,*) is an array containing the m-by-n matrix A. The second dimension of a must be at least max(1, n). work is a workspace array, its dimension max(1, lwork).
lda INTEGER.
The leading dimension of the array a. Must be at least max(1, m).
ldu, ldvt INTEGER. The leading dimensions of the output arrays u and vt, respectively. Constraints: ldu ≥ 1; ldvt ≥ 1. If jobz = 'S' or 'A', or jobz = 'O' and m < n, then ldu ≥ m; If jobz = 'A' or jobz = 'O' and m ≥ n, then ldvt ≥ n; If jobz = 'S', ldvt ≥ min(m, n).
lwork INTEGER. The dimension of the array work; lwork ≥ 1. If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as work(1), and no error message related to lwork is issued by xerbla. See Application Notes for the suggested value of lwork.
rwork REAL for cgesdd, DOUBLE PRECISION for zgesdd. Workspace array, DIMENSION at least max(1, 5*min(m,n)) if jobz = 'N'. Otherwise, the dimension of rwork must be at least max(1, min(m,n)*max(5*min(m,n)+7, 2*max(m,n)+2*min(m,n)+1)). This array is used in complex flavors only.
iwork INTEGER. Workspace array, DIMENSION at least max(1, 8*min(m, n)).

Output Parameters
a On exit:
If jobz = 'O', then if m ≥ n, a is overwritten with the first n columns of U (the left singular vectors, stored columnwise). If m < n, a is overwritten with the first m rows of V^T (the right singular vectors, stored rowwise);
If jobz ≠ 'O', the contents of a are destroyed.
s REAL for single precision flavors, DOUBLE PRECISION for double precision flavors. Array, DIMENSION at least max(1, min(m,n)). Contains the singular values of A sorted so that s(i) ≥ s(i+1).
u, vt REAL for sgesdd, DOUBLE PRECISION for dgesdd, COMPLEX for cgesdd, DOUBLE COMPLEX for zgesdd. Arrays:
u(ldu,*); the second dimension of u must be at least max(1, m) if jobz = 'A' or jobz = 'O' and m < n. If jobz = 'S', the second dimension of u must be at least max(1, min(m, n)).
If jobz = 'A' or jobz = 'O' and m < n, u contains the m-by-m orthogonal/unitary matrix U.
If jobz = 'S', u contains the first min(m, n) columns of U (the left singular vectors, stored columnwise). If jobz = 'O' and m ≥ n, or jobz = 'N', u is not referenced.
vt(ldvt,*); the second dimension of vt must be at least max(1, n). If jobz = 'A' or jobz = 'O' and m ≥ n, vt contains the n-by-n orthogonal/unitary matrix V^T. If jobz = 'S', vt contains the first min(m, n) rows of V^T (the right singular vectors, stored rowwise). If jobz = 'O' and m < n, or jobz = 'N', vt is not referenced.
work(1) On exit, if info = 0, then work(1) returns the required minimal size of lwork.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value. If info = i, then ?bdsdc did not converge, updating process failed.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions. Specific details for the routine gesdd interface are the following:
a Holds the matrix A of size (m, n).
s Holds the vector of length min(m, n).
u Holds the matrix U of size
• (m,m) if jobz='A' or jobz='O' and m < n
• (m,min(m, n)) if jobz='S'
u is not referenced if jobz is not supplied or if jobz='N' or jobz='O' and m ≥ n.
vt Holds the matrix VT of size
• (n,n) if jobz='A' or jobz='O' and m ≥ n
• (min(m, n), n) if jobz='S'
vt is not referenced if jobz is not supplied or if jobz='N' or jobz='O' and m < n.
jobz Must be 'N', 'A', 'S', or 'O'. The default value is 'N'.
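
The dependence of the u and vt sizes on jobz, m, and n listed above can be summarized in a small lookup helper. The following C sketch is illustrative only: the struct gesdd_shapes and the function gesdd_output_shapes are hypothetical names, not part of Intel MKL. It reads the jobz = 'O' case as splitting on m ≥ n versus m < n, which is an assumption based on the standard LAPACK ?gesdd convention, and it reports a 0-by-0 shape for an array that is not referenced.

```c
/* Hypothetical container for the logical sizes of u and vt.
   A 0-by-0 entry means the array is not referenced for that jobz. */
typedef struct {
    int u_rows, u_cols;
    int vt_rows, vt_cols;
} gesdd_shapes;

/* Hypothetical helper (not an MKL routine). Returns 0 on success and
   -1 if jobz is not one of 'A', 'S', 'O', 'N'. */
int gesdd_output_shapes(char jobz, int m, int n, gesdd_shapes *out)
{
    int k = (m < n) ? m : n;             /* min(m, n) */
    gesdd_shapes s = {0, 0, 0, 0};

    switch (jobz) {
    case 'A':                            /* full U and full V^T */
        s.u_rows = m;  s.u_cols = m;
        s.vt_rows = n; s.vt_cols = n;
        break;
    case 'S':                            /* first min(m,n) vectors only */
        s.u_rows = m;  s.u_cols = k;
        s.vt_rows = k; s.vt_cols = n;
        break;
    case 'O':
        if (m >= n) {                    /* U overwrites a, V^T goes to vt */
            s.vt_rows = n; s.vt_cols = n;
        } else {                         /* V^T overwrites a, U goes to u */
            s.u_rows = m;  s.u_cols = m;
        }
        break;
    case 'N':                            /* singular values only */
        break;
    default:
        return -1;
    }
    *out = s;
    return 0;
}
```

A caller could use such a helper to size the u and vt buffers before the routine is called, taking max(1, rows) as the leading dimension in column-major storage.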

Application Notes
For real flavors:
If jobz = 'N', lwork ≥ 3*min(m, n) + max(max(m, n), 6*min(m, n));
If jobz = 'O', lwork ≥ 3*(min(m, n))^2 + max(max(m, n), 5*(min(m, n))^2 + 4*min(m, n));
If jobz = 'S' or 'A', lwork ≥ 3*(min(m, n))^2 + max(max(m, n), 4*(min(m, n))^2 + 4*min(m, n)).
For complex flavors:
If jobz = 'N', lwork ≥ 2*min(m, n) + max(m, n);
If jobz = 'O', lwork ≥ 2*(min(m, n))^2 + max(m, n) + 2*min(m, n);
If jobz = 'S' or 'A', lwork ≥ (min(m, n))^2 + max(m, n) + 2*min(m, n).
For good performance, lwork should generally be larger.
If you are in doubt how much workspace to supply, use a generous value of lwork for the first run or set lwork = -1.
If you choose the first option and set any admissible lwork size, that is, one no less than the minimal value described, the routine completes the task, though probably not as fast as with a recommended workspace, and provides the recommended workspace in the first element of the corresponding array work on exit. Use this value (work(1)) for subsequent runs.
If you set lwork = -1, the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work). This operation is called a workspace query.
Note that if you set lwork to less than the minimal required value and not -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.

?gejsv
Computes the singular value decomposition of a real matrix using a preconditioned Jacobi SVD method.
Syntax
Fortran 77:
call sgejsv(joba, jobu, jobv, jobr, jobt, jobp, m, n, a, lda, sva, u, ldu, v, ldv, work, lwork, iwork, info)
call dgejsv(joba, jobu, jobv, jobr, jobt, jobp, m, n, a, lda, sva, u, ldu, v, ldv, work, lwork, iwork, info)
C:
lapack_int LAPACKE_<?>gejsv( int matrix_order, char joba, char jobu, char jobv, char jobr, char jobt, char jobp, lapack_int m, lapack_int n, const <datatype>* a, lapack_int lda, <datatype>* sva, <datatype>* u, lapack_int ldu, <datatype>* v, lapack_int ldv, <datatype>* stat, lapack_int* istat );

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h
• C: mkl_lapacke.h

Description
The routine computes the singular value decomposition (SVD) of a real m-by-n matrix A, where m ≥ n. The SVD is written as A = U*S*V**T, where S is an m-by-n matrix which is zero except for its n diagonal elements, U is an m-by-n (or m-by-m) orthonormal matrix, and V is an n-by-n orthogonal matrix. The diagonal elements of S are the singular values of A; the columns of U and V are the left and right singular vectors of A, respectively. The matrices U and V are computed and stored in the arrays u and v, respectively. The diagonal of S is computed and stored in the array sva.
The routine implements a preconditioned Jacobi SVD algorithm. It uses ?geqp3, ?geqrf, and ?gelqf as preprocessors and preconditioners. Optionally, an additional row pivoting can be used as a preprocessor, which in some cases results in much higher accuracy. An example is a matrix A with the structure A = D1 * C * D2, where D1, D2 are arbitrarily ill-conditioned diagonal matrices and C is a well-conditioned matrix. In that case, complete pivoting in the first QR factorizations provides accuracy dependent on the condition number of C, and independent of D1, D2. Such higher accuracy is not completely understood theoretically, but it works well in practice.
If A can be written as A = B*D, with well-conditioned B and some diagonal D, then the high accuracy is guaranteed, both theoretically and in software, independent of D. For more details see [Drmac08-1], [Drmac08-2].
The computational range for the singular values can be the full range ( UNDERFLOW,OVERFLOW ), provided that the machine arithmetic and the BLAS and LAPACK routines called by ?gejsv are implemented to work in that range. If that is not the case, the restriction for safe computation with the singular values in the range of normalized IEEE numbers is that the spectral condition number kappa(A) = sigma_max(A)/sigma_min(A) does not overflow. This code (?gejsv) is best used in this restricted range, meaning that singular values of magnitude below ||A||_2 / slamch('O') (for single precision) or ||A||_2 / dlamch('O') (for double precision) are returned as zeros. See jobr for details on this.
This implementation is slower than the one described in [Drmac08-1], [Drmac08-2] due to replacement of some non-LAPACK components, and because the choice of some tuning parameters in the iterative part (?gesvj) is left to the implementer on a particular machine.
The rank revealing QR factorization (in this code: ?geqp3) should be implemented as in [Drmac08-3].
If m is much larger than n, the initial QRF with column pivoting can be preceded by a QRF without pivoting. That well-known trick is not used in ?gejsv because in some cases heavy row weighting can be treated with complete pivoting. The overhead in cases where m is much larger than n is then only due to pivoting, but the benefits in accuracy have prevailed. You can incorporate this extra QRF step easily and also improve data movement (matrix transpose, matrix copy, matrix transposed copy); this implementation of ?gejsv uses only the simplest, naive data movement.

Input Parameters
The data types are given for the Fortran interface, except for istat.
A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

joba CHARACTER*1. Must be 'C', 'E', 'F', 'G', 'A', or 'R'.
Specifies the level of accuracy:
If joba = 'C', high relative accuracy is achieved if A = B*D with well-conditioned B and arbitrary diagonal matrix D. The accuracy cannot be spoiled by column scaling. The accuracy of the computed output depends on the condition of B, and the procedure aims at the best theoretical accuracy. The relative error max_{i=1:N}|d sigma_i| / sigma_i is bounded by f(M,N)*epsilon*cond(B), independent of D. The input matrix is preprocessed with the QRF with column pivoting. This initial preprocessing and preconditioning by a rank revealing QR factorization is common for all values of joba. Additional actions are specified as follows:
If joba = 'E', computation as with 'C' with an additional estimate of the condition number of B. It provides a realistic error bound.
If joba = 'F', accuracy higher than in the 'C' option is achieved, if A = D1*C*D2 with ill-conditioned diagonal scalings D1, D2, and a well-conditioned matrix C. This option is advisable if the structure of the input matrix is not known and relative accuracy is desirable. The input matrix A is preprocessed with QR factorization with full (row and column) pivoting.
If joba = 'G', computation as with 'F' with an additional estimate of the condition number of B, where A = B*D. If A has heavily weighted rows, using this condition number gives too pessimistic an error bound.
If joba = 'A', small singular values are treated as noise and the matrix is treated as numerically rank deficient. The error in the computed singular values is bounded by f(m,n)*epsilon*||A||. The computed SVD A = U*S*V**t restores A up to f(m,n)*epsilon*||A||.
This enables the procedure to set all singular values below n*epsilon*||A|| to zero.
If joba = 'R', the procedure is similar to the 'A' option. The rank revealing property of the initial QR factorization is used to reveal (using the triangular factor) a gap sigma_{r+1} < epsilon * sigma_r, in which case the numerical rank is declared to be r. The SVD is computed with absolute error bounds, but more accurately than with 'A'.

jobu CHARACTER*1. Must be 'U', 'F', 'W', or 'N'.
Specifies whether to compute the columns of the matrix U:
If jobu = 'U', n columns of U are returned in the array u.
If jobu = 'F', a full set of m left singular vectors is returned in the array u.
If jobu = 'W', u may be used as workspace of length m*n. See the description of u.
If jobu = 'N', u is not computed.

jobv CHARACTER*1. Must be 'V', 'J', 'W', or 'N'.
Specifies whether to compute the matrix V:
If jobv = 'V', n columns of V are returned in the array v; Jacobi rotations are not explicitly accumulated.
If jobv = 'J', n columns of V are returned in the array v but they are computed as the product of Jacobi rotations. This option is allowed only if jobu ≠ 'N'.
If jobv = 'W', v may be used as workspace of length n*n. See the description of v.
If jobv = 'N', v is not computed.

jobr CHARACTER*1. Must be 'N' or 'R'.
Specifies the range for the singular values. If small positive singular values are outside the specified range, they may be set to zero. If A is scaled so that the largest singular value of the scaled matrix is around sqrt(big), big = ?lamch('O'), the function can remove columns of A whose norm in the scaled matrix is less than sqrt(?lamch('S')) (for jobr = 'R'), or less than small = ?lamch('S')/?lamch('E') (otherwise).
If jobr = 'N', the function does not remove small columns of the scaled matrix. This option assumes that BLAS and QR factorizations and triangular solvers are implemented to work in that range. If the condition of A is greater than big, use ?gesvj.
If jobr = 'R', the restricted range for singular values of the scaled matrix A is [sqrt(?lamch('S')), sqrt(big)], roughly as described above. This option is recommended.
For computing the singular values in the full range [?lamch('S'), big], use ?gesvj.

jobt CHARACTER*1. Must be 'T' or 'N'.
If the matrix is square, the procedure may decide to use a transposed A if A**t seems to be better with respect to convergence. If the matrix is not square, jobt is ignored. This is subject to changes in the future.
The decision is based on two values of entropy over the adjoint orbit of A**t * A. See the descriptions of work(6) and work(7).
If jobt = 'T', the function performs transposition if the entropy test indicates possibly faster convergence of the Jacobi process when A**t is taken as input. If A is replaced with A**t, the row pivoting is included automatically.
If jobt = 'N', the function attempts no speculations. This option can be used to compute only the singular values, or the full SVD (u, sigma, and v). For only one set of singular vectors (u or v), the caller should provide both u and v, as one of the matrices is used as workspace if the matrix A is transposed. The implementer can easily remove this constraint and make the code more complicated. See the descriptions of u and v.

jobp CHARACTER*1. Must be 'P' or 'N'.
Enables structured perturbations of denormalized numbers. This option should be active if the denormals are poorly implemented, causing slow computation, especially in cases of fast convergence. For details, see [Drmac08-1], [Drmac08-2]. For simplicity, such perturbations are included only when the full SVD or only the singular values are requested. You can add the perturbation for the cases of computing one set of singular vectors.
If jobp = 'P', the function introduces perturbation.
If jobp = 'N', the function introduces no perturbation.

m INTEGER. The number of rows of the input matrix A; m ≥ 0.
n INTEGER. The number of columns in the input matrix A; n ≥ 0.

a, work, sva, u, v REAL for sgejsv, DOUBLE PRECISION for dgejsv.
Array a(lda,*) is an array containing the m-by-n matrix A. The second dimension of a must be at least max(1, n).
work is a workspace array, its dimension is max(1, lwork).
sva is a workspace array, its dimension is n.
u is a workspace array, its dimension is (ldu,*); the second dimension of u must be at least max(1, n).
v is a workspace array, its dimension is (ldv,*); the second dimension of v must be at least max(1, n).

lda INTEGER. The leading dimension of the array a. Must be at least max(1, m).

ldu INTEGER. The leading dimension of the array u; ldu ≥ 1.
If jobu = 'U' or 'F' or 'W', ldu ≥ m.

ldv INTEGER. The leading dimension of the array v; ldv ≥ 1.
If jobv = 'V' or 'J' or 'W', ldv ≥ n.

lwork INTEGER. Length of work to confirm proper allocation of work space. lwork depends on the task performed:
If only sigma is needed (jobu = 'N', jobv = 'N') and
– ... no scaled condition estimate is required, then lwork ≥ max(2*m+n, 4*n+1, 7). This is the minimal requirement. For optimal performance (blocked code) the optimal value is lwork ≥ max(2*m+n, 3*n+(n+1)*nb, 7). Here nb is the optimal block size for ?geqp3/?geqrf.
In general, the optimal length lwork is computed as
lwork ≥ max(2*m+n, n+lwork(sgeqp3), n+lwork(sgeqrf), 7) for sgejsv
lwork ≥ max(2*m+n, n+lwork(dgeqp3), n+lwork(dgeqrf), 7) for dgejsv
– ... an estimate of the scaled condition number of A is required (joba = 'E', 'G'). In this case, lwork is the maximum of the above and n*n+4*n, that is, lwork ≥ max(2*m+n, n*n+4*n, 7). For optimal performance (blocked code) the optimal value is lwork ≥ max(2*m+n, 3*n+(n+1)*nb, n*n+4*n, 7).
In general, the optimal length lwork is computed as
lwork ≥ max(2*m+n, n+lwork(sgeqp3), n+lwork(sgeqrf), n+n*n+lwork(spocon), 7) for sgejsv
lwork ≥ max(2*m+n, n+lwork(dgeqp3), n+lwork(dgeqrf), n+n*n+lwork(dpocon), 7) for dgejsv
If sigma and the right singular vectors are needed (jobv = 'V'),
– the minimal requirement is lwork ≥ max(2*m+n, 4*n+1, 7).
– for optimal performance, lwork ≥ max(2*m+n, 3*n+(n+1)*nb, 7), where nb is the optimal block size for ?geqp3, ?geqrf, ?gelqf, ?ormlq.
In general, the optimal length lwork is computed as
lwork ≥ max(2*m+n, n+lwork(sgeqp3), n+lwork(spocon), n+lwork(sgelqf), 2*n+lwork(sgeqrf), n+lwork(sormlq)) for sgejsv
lwork ≥ max(2*m+n, n+lwork(dgeqp3), n+lwork(dpocon), n+lwork(dgelqf), 2*n+lwork(dgeqrf), n+lwork(dormlq)) for dgejsv
If sigma and the left singular vectors are needed
– the minimal requirement is lwork ≥ max(2*n+m, 4*n+1, 7).
– for optimal performance,
if jobu = 'U' :: lwork ≥ max(2*m+n, 3*n+(n+1)*nb, 7),
if jobu = 'F' :: lwork ≥ max(2*m+n, 3*n+(n+1)*nb, n+m*nb, 7),
where nb is the optimal block size for ?geqp3, ?geqrf, ?ormlq.
In general, the optimal length lwork is computed as
lwork ≥ max(2*m+n, n+lwork(sgeqp3), n+lwork(spocon), 2*n+lwork(sgeqrf), n+lwork(sormlq)) for sgejsv
lwork ≥ max(2*m+n, n+lwork(dgeqp3), n+lwork(dpocon), 2*n+lwork(dgeqrf), n+lwork(dormlq)) for dgejsv
Here lwork(?ormlq) equals n*nb (for jobu = 'U') or m*nb (for jobu = 'F').
If full SVD is needed (jobu = 'U' or 'F') and
– if jobv = 'V', the minimal requirement is lwork ≥ max(2*m+n, 6*n+2*n*n)
– if jobv = 'J', the minimal requirement is lwork ≥ max(2*m+n, 4*n+n*n, 2*n+n*n+6)
– For optimal performance, lwork should be additionally larger than n+m*nb, where nb is the optimal block size for ?ormlq.

iwork INTEGER. Workspace array, DIMENSION max(3, m+3*n).

Output Parameters
sva On exit:
For work(1)/work(2) = 1: the singular values of A.
During the computation sva contains Euclidean column norms of the iterated matrices in the array a.
For work(1) ≠ work(2): the singular values of A are (work(1)/work(2)) * sva(1:n). This factored form is used if sigma_max(A) overflows or if small singular values have been saved from underflow by scaling the input matrix A.
If jobr = 'R', some of the singular values may be returned as exact zeros obtained by 'setting to zero' because they are below the numerical rank threshold or are denormalized numbers.

u On exit:
If jobu = 'U', contains the m-by-n matrix of the left singular vectors.
If jobu = 'F', contains the m-by-m matrix of the left singular vectors, including an orthonormal basis of the orthogonal complement of the range of A.
If jobu = 'W' and jobv = 'V', jobt = 'T', and m = n, then u is used as workspace if the procedure replaces A with A**t. In that case, v is computed in u as left singular vectors of A**t and copied back to the v array. This 'W' option is just a reminder to the caller that in this case u is reserved as workspace of length n*n.
If jobu = 'N', u is not referenced.

v On exit:
If jobv = 'V' or 'J', contains the n-by-n matrix of the right singular vectors.
If jobv = 'W' and jobu = 'U', jobt = 'T', and m = n, then v is used as workspace if the procedure replaces A with A**t. In that case, u is computed in v as right singular vectors of A**t and copied back to the u array. This 'W' option is just a reminder to the caller that in this case v is reserved as workspace of length n*n.
If jobv = 'N', v is not referenced.

work On exit,
work(1) = scale = work(2)/work(1) is the scaling factor such that scale*sva(1:n) are the computed singular values of A. See the description of sva().
work(2) = see the description of work(1).
work(3) = sconda is an estimate for the condition number of column equilibrated A. If joba = 'E' or 'G', sconda is an estimate of sqrt(||(R**t * R)**(-1)||_1). It is computed using ?pocon.
It holds n**(-1/4) * sconda ≤ ||R**(-1)||_2 ≤ n**(1/4) * sconda, where R is the triangular factor from the QRF of A. However, if R is truncated and the numerical rank is determined to be strictly smaller than n, sconda is returned as -1, indicating that the smallest singular values might be lost.
If full SVD is needed, the following two condition numbers are useful for the analysis of the algorithm. They are provided for a user who is familiar with the details of the method.
work(4) = an estimate of the scaled condition number of the triangular factor in the first QR factorization.
work(5) = an estimate of the scaled condition number of the triangular factor in the second QR factorization.
The following two parameters are computed if jobt = 'T'. They are provided for a user who is familiar with the details of the method.
work(6) = the entropy of A**t*A :: this is the Shannon entropy of diag(A**t*A) / Trace(A**t*A) taken as a point in the probability simplex.
work(7) = the entropy of A*A**t.

iwork (Fortran), istat (C) INTEGER. On exit,
iwork(1)/istat[0] = the numerical rank determined after the initial QR factorization with pivoting. See the descriptions of joba and jobr.
iwork(2)/istat[1] = the number of the computed nonzero singular values.
iwork(3)/istat[2] = if nonzero, a warning message. If iwork(3)/istat[2] = 1, some of the column norms of A were denormalized floats. The requested high accuracy is not warranted by the data.

info INTEGER.
If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
If info > 0, the function did not converge in the maximal number of sweeps. The computed values may be inaccurate.

See Also
?geqp3 ?geqrf ?gelqf ?gesvj ?lamch ?pocon ?ormlq

?gesvj
Computes the singular value decomposition of a real matrix using Jacobi plane rotations.
Syntax
Fortran 77:
call sgesvj(joba, jobu, jobv, m, n, a, lda, sva, mv, v, ldv, work, lwork, info)
call dgesvj(joba, jobu, jobv, m, n, a, lda, sva, mv, v, ldv, work, lwork, info)
C:
lapack_int LAPACKE_<?>gesvj( int matrix_order, char joba, char jobu, char jobv, lapack_int m, lapack_int n, <datatype>* a, lapack_int lda, <datatype>* sva, lapack_int mv, <datatype>* v, lapack_int ldv, <datatype>* stat );

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h
• C: mkl_lapacke.h

Description
The routine computes the singular value decomposition (SVD) of a real m-by-n matrix A, where m ≥ n.
The SVD of A is written as A = U*S*V', where S is an m-by-n diagonal matrix, U is an m-by-n orthonormal matrix, and V is an n-by-n orthogonal matrix. The diagonal elements of S are the singular values of A; the columns of U and V are the left and right singular vectors of A, respectively. The matrices U and V are computed and stored in the arrays u and v, respectively. The diagonal of S is computed and stored in the array sva.
The n-by-n orthogonal matrix V is obtained as a product of Jacobi plane rotations. The rotations are implemented as fast scaled rotations of Anda and Park [AndaPark94]. In the case of underflow of the Jacobi angle, a modified Jacobi transformation of Drmac ([Drmac08-4]) is used. The pivot strategy uses column interchanges of de Rijk ([deRijk98]). The relative accuracy of the computed singular values and the accuracy of the computed singular vectors (in angle metric) is as guaranteed by the theory of Demmel and Veselic [Demmel92]. The condition number that determines the accuracy in the full rank case is essentially min_{D=diag} kappa(A*D), where kappa(.) is the spectral condition number. The best performance of this Jacobi SVD procedure is achieved if used in an accelerated version of Drmac and Veselic [Drmac08-1], [Drmac08-2]. Some tuning parameters (marked with TP) are available for the implementer.
The computational range for the nonzero singular values is the machine number interval ( UNDERFLOW,OVERFLOW ). In extreme cases, even denormalized singular values can be computed with the corresponding gradual loss of accurate digits.

Input Parameters
The data types are given for the Fortran interface, except for stat. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

joba CHARACTER*1. Must be 'L', 'U' or 'G'.
Specifies the structure of A:
If joba = 'L', the input matrix A is lower triangular.
If joba = 'U', the input matrix A is upper triangular.
If joba = 'G', the input matrix A is a general m-by-n matrix, m ≥ n.

jobu CHARACTER*1. Must be 'U', 'C' or 'N'.
Specifies whether to compute the left singular vectors (columns of U):
If jobu = 'U', the left singular vectors corresponding to the nonzero singular values are computed and returned in the leading columns of A. See more details in the description of a. The default numerical orthogonality threshold is set to approximately TOL = CTOL*EPS, CTOL = sqrt(m), EPS = ?lamch('E').
If jobu = 'C', analogous to jobu = 'U', except that you can control the level of numerical orthogonality of the computed left singular vectors. TOL can be set to TOL = CTOL*EPS, where CTOL is given on input in the array work. No CTOL smaller than ONE is allowed. CTOL greater than 1 / EPS is meaningless. The option 'C' can be used if m*EPS is satisfactory orthogonality of the computed left singular vectors, so CTOL = m could save a few sweeps of Jacobi rotations. See the descriptions of a and work(1).
If jobu = 'N', u is not computed. However, see the description of a.

jobv CHARACTER*1. Must be 'V', 'A' or 'N'.
Specifies whether to compute the right singular vectors, that is, the matrix V:
If jobv = 'V', the matrix V is computed and returned in the array v.
If jobv = 'A', the Jacobi rotations are applied to the mv-by-n array v. In other words, the right singular vector matrix V is not computed explicitly; instead it is applied to an mv-by-n matrix initially stored in the first mv rows of v.
If jobv = 'N', the matrix V is not computed and the array v is not referenced.

m INTEGER. The number of rows of the input matrix A.
1/slamch('E') > m ≥ 0 for sgesvj.
1/dlamch('E') > m ≥ 0 for dgesvj.

n INTEGER. The number of columns in the input matrix A; n ≥ 0.

a, work, sva, v REAL for sgesvj, DOUBLE PRECISION for dgesvj.
Array a(lda,*) is an array containing the m-by-n matrix A. The second dimension of a is max(1, n).
work is a workspace array, its dimension is max(4, m+n).
If jobu = 'C', work(1) = CTOL, where CTOL defines the threshold for convergence. The process stops if all columns of A are mutually orthogonal up to CTOL*EPS, EPS = ?lamch('E'). It is required that CTOL ≥ 1, that is, it is not allowed to force the routine to obtain orthogonality below EPS.
sva is a workspace array, its dimension is n.
v is a workspace array, its dimension is (ldv,*); the second dimension of v must be at least max(1, n).

lda INTEGER. The leading dimension of the array a. Must be at least max(1, m).

mv INTEGER. If jobv = 'A', the product of Jacobi rotations in ?gesvj is applied to the first mv rows of v. See the description of jobv.

ldv INTEGER. The leading dimension of the array v; ldv ≥ 1.
If jobv = 'V', ldv ≥ max(1, n).
If jobv = 'A', ldv ≥ max(1, mv).

lwork INTEGER. Length of work, lwork ≥ max(6, m+n).

Output Parameters
a On exit:
If jobu = 'U' or jobu = 'C':
• if info = 0, the leading columns of A contain left singular vectors corresponding to the computed singular values of a that are above the underflow threshold ?lamch('S'), that is, non-zero singular values. The number of the computed non-zero singular values is returned in work(2). Also see the descriptions of sva and work.
The computed columns of u are mutually numerically orthogonal up to approximately TOL = sqrt(m)*EPS (default), or TOL = CTOL*EPS if jobu = 'C'; see the description of jobu.
• if info > 0, the procedure ?gesvj did not converge in the given number of iterations (sweeps). In that case, the computed columns of u may not be orthogonal up to TOL. The output u (stored in a), sigma (given by the computed singular values in sva(1:n)) and v is still a decomposition of the input matrix A in the sense that the residual ||a - scale*u*sigma*v**t||_2 / ||a||_2 is small.
If jobu = 'N':
• if info = 0, note that the left singular vectors are 'for free' in the one-sided Jacobi SVD algorithm. However, if only the singular values are needed, the level of numerical orthogonality of u is not an issue and iterations are stopped when the columns of the iterated matrix are numerically orthogonal up to approximately m*EPS. Thus, on exit, a contains the columns of u scaled with the corresponding singular values.
• if info > 0, the procedure ?gesvj did not converge in the given number of iterations (sweeps).

sva On exit:
If info = 0, depending on the value scale = work(1), where scale is the scaling factor:
• if scale = 1, sva(1:n) contains the computed singular values of a. During the computation, sva contains the Euclidean column norms of the iterated matrices in the array a.
• if scale ≠ 1, the singular values of a are scale*sva(1:n), and this factored representation is due to the fact that some of the singular values of a might underflow or overflow.
If info > 0, the procedure ?gesvj did not converge in the given number of iterations (sweeps) and scale*sva(1:n) may not be accurate.

v On exit:
If jobv = 'V', contains the n-by-n matrix of the right singular vectors.
If jobv = 'A', then v contains the product of the computed right singular vector matrix and the initial matrix in the array v.
If jobv = 'N', v is not referenced.
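As a small illustration of the factored representation scale*sva(1:n) described for sva, the following sketch multiplies the returned values by the scaling factor. The helper name gesvj_singular_values is hypothetical and not part of Intel MKL; the caller is assumed to pass the scale taken from work(1) (Fortran) or stat[0] (C).

```c
#include <assert.h>
#include <stddef.h>

/* Recover the singular values sigma[i] = scale * sva[i], i = 0..n-1.
   Hypothetical helper: sva and scale are assumed to come from a
   successful ?gesvj call (info = 0). */
void gesvj_singular_values(size_t n, const double *sva, double scale,
                           double *sigma)
{
    assert(sva != NULL && sigma != NULL);
    for (size_t i = 0; i < n; ++i)
        sigma[i] = scale * sva[i];
}
```

Keeping the product unevaluated until it is needed can matter in the extreme cases mentioned above, where scale*sva(i) itself might overflow or underflow.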
work (Fortran), stat (C) On exit,
work(1)/stat[0] = scale is the scaling factor such that scale*sva(1:n) are the computed singular values of A. See the description of sva().
work(2)/stat[1] is the number of the computed nonzero singular values.
work(3)/stat[2] is the number of the computed singular values that are larger than the underflow threshold.
work(4)/stat[3] is the number of sweeps of Jacobi rotations needed for numerical convergence.
work(5)/stat[4] = max_{i.NE.j} |COS(A(:,i),A(:,j))| in the last sweep. This is useful information in cases when ?gesvj did not converge, as it can be used to estimate whether the output is still useful and for post festum analysis.
work(6)/stat[5] is the largest absolute value over all sines of the Jacobi rotation angles in the last sweep. It can be useful in a post festum analysis.

info INTEGER.
If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
If info > 0, the function did not converge in the maximal number (30) of sweeps. The output may still be useful. See the description of work.

See Also
?lamch

?ggsvd
Computes the generalized singular value decomposition of a pair of general rectangular matrices.
Syntax Fortran 77: call sggsvd(jobu, jobv, jobq, m, n, p, k, l, a, lda, b, ldb, alpha, beta, u, ldu, v, ldv, q, ldq, work, iwork, info) call dggsvd(jobu, jobv, jobq, m, n, p, k, l, a, lda, b, ldb, alpha, beta, u, ldu, v, ldv, q, ldq, work, iwork, info) call cggsvd(jobu, jobv, jobq, m, n, p, k, l, a, lda, b, ldb, alpha, beta, u, ldu, v, ldv, q, ldq, work, rwork, iwork, info) call zggsvd(jobu, jobv, jobq, m, n, p, k, l, a, lda, b, ldb, alpha, beta, u, ldu, v, ldv, q, ldq, work, rwork, iwork, info) Fortran 95: call ggsvd(a, b, alpha, beta [, k] [,l] [,u] [,v] [,q] [,iwork] [,info]) C: lapack_int LAPACKE_sggsvd( int matrix_order, char jobu, char jobv, char jobq, lapack_int m, lapack_int n, lapack_int p, lapack_int* k, lapack_int* l, float* a, lapack_int lda, float* b, lapack_int ldb, float* alpha, float* beta, float* u, lapack_int ldu, float* v, lapack_int ldv, float* q, lapack_int ldq, lapack_int* iwork ); lapack_int LAPACKE_dggsvd( int matrix_order, char jobu, char jobv, char jobq, lapack_int m, lapack_int n, lapack_int p, lapack_int* k, lapack_int* l, double* a, lapack_int lda, double* b, lapack_int ldb, double* alpha, double* beta, double* u, lapack_int ldu, double* v, lapack_int ldv, double* q, lapack_int ldq, lapack_int* iwork ); lapack_int LAPACKE_cggsvd( int matrix_order, char jobu, char jobv, char jobq, lapack_int m, lapack_int n, lapack_int p, lapack_int* k, lapack_int* l, lapack_complex_float* a, lapack_int lda, lapack_complex_float* b, lapack_int ldb, float* alpha, float* beta, lapack_complex_float* u, lapack_int ldu, lapack_complex_float* v, lapack_int ldv, lapack_complex_float* q, lapack_int ldq, lapack_int* iwork ); lapack_int LAPACKE_zggsvd( int matrix_order, char jobu, char jobv, char jobq, lapack_int m, lapack_int n, lapack_int p, lapack_int* k, lapack_int* l, lapack_complex_double* a, lapack_int lda, lapack_complex_double* b, lapack_int ldb, double* alpha, double* beta, lapack_complex_double* u, lapack_int ldu, lapack_complex_double* v, lapack_int 
ldv, lapack_complex_double* q, lapack_int ldq, lapack_int* iwork );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine computes the generalized singular value decomposition (GSVD) of an m-by-n real/complex matrix A and p-by-n real/complex matrix B:
U'*A*Q = D1*(0 R), V'*B*Q = D2*(0 R),
where U, V and Q are orthogonal/unitary matrices and U', V' mean transpose/conjugate transpose of U and V respectively.
Let k+l = the effective numerical rank of the matrix (A', B')', then R is a (k+l)-by-(k+l) nonsingular upper triangular matrix, D1 and D2 are m-by-(k+l) and p-by-(k+l) "diagonal" matrices of the following structures, respectively:

If m-k-l ≥ 0,

                  k  l
    D1 =     k  ( I  0 )
             l  ( 0  C )
          m-k-l ( 0  0 )

                  k  l
    D2 =     l  ( 0  S )
            p-l ( 0  0 )

                n-k-l   k    l
    (0 R) =  k  (  0   R11  R12 )
             l  (  0    0   R22 )

where
C = diag(alpha(k+1),..., alpha(k+l))
S = diag(beta(k+1),..., beta(k+l))
C**2 + S**2 = I
R is stored in a(1:k+l, n-k-l+1:n) on exit.

If m-k-l < 0,

                  k  m-k  k+l-m
    D1 =     k  ( I   0     0  )
            m-k ( 0   C     0  )

                  k  m-k  k+l-m
    D2 =    m-k ( 0   S     0  )
          k+l-m ( 0   0     I  )
            p-l ( 0   0     0  )

                 n-k-l   k   m-k  k+l-m
    (0 R) =    k  (  0  R11  R12  R13 )
             m-k  (  0   0   R22  R23 )
           k+l-m  (  0   0    0   R33 )

where
C = diag(alpha(k+1),..., alpha(m)),
S = diag(beta(k+1),..., beta(m)),
C**2 + S**2 = I
On exit, the block
    ( R11 R12 R13 )
    (  0  R22 R23 )
is stored in a(1:m, n-k-l+1:n) and R33 is stored in b(m-k+1:l, n+m-k-l+1:n).

The routine computes C, S, R, and optionally the orthogonal/unitary transformation matrices U, V and Q.
In particular, if B is an n-by-n nonsingular matrix, then the GSVD of A and B implicitly gives the SVD of A*B**(-1):
A*B**(-1) = U*(D1*D2**(-1))*V'.
If (A', B')' has orthonormal columns, then the GSVD of A and B is also equal to the CS decomposition of A and B. Furthermore, the GSVD can be used to derive the solution of the eigenvalue problem:
A'*A*x = lambda*B'*B*x.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

jobu CHARACTER*1. Must be 'U' or 'N'.
If jobu = 'U', orthogonal/unitary matrix U is computed.
If jobu = 'N', U is not computed.

jobv CHARACTER*1. Must be 'V' or 'N'.
If jobv = 'V', orthogonal/unitary matrix V is computed.
If jobv = 'N', V is not computed.

jobq CHARACTER*1. Must be 'Q' or 'N'.
If jobq = 'Q', orthogonal/unitary matrix Q is computed.
If jobq = 'N', Q is not computed.

m INTEGER. The number of rows of the matrix A (m ≥ 0).

n INTEGER. The number of columns of the matrices A and B (n ≥ 0).

p INTEGER. The number of rows of the matrix B (p ≥ 0).

a, b, work REAL for sggsvd, DOUBLE PRECISION for dggsvd, COMPLEX for cggsvd, DOUBLE COMPLEX for zggsvd.
Arrays:
a(lda,*) contains the m-by-n matrix A. The second dimension of a must be at least max(1, n).
b(ldb,*) contains the p-by-n matrix B. The second dimension of b must be at least max(1, n).
work(*) is a workspace array. The dimension of work must be at least max(3*n, m, p)+n.

lda INTEGER. The leading dimension of a; at least max(1, m).

ldb INTEGER. The leading dimension of b; at least max(1, p).

ldu INTEGER. The leading dimension of the array u. ldu ≥ max(1, m) if jobu = 'U'; ldu ≥ 1 otherwise.

ldv INTEGER. The leading dimension of the array v. ldv ≥ max(1, p) if jobv = 'V'; ldv ≥ 1 otherwise.

ldq INTEGER. The leading dimension of the array q. ldq ≥ max(1, n) if jobq = 'Q'; ldq ≥ 1 otherwise.

iwork INTEGER. Workspace array, DIMENSION at least max(1, n).

rwork REAL for cggsvd, DOUBLE PRECISION for zggsvd. Workspace array, DIMENSION at least max(1, 2*n). Used in complex flavors only.

Output Parameters
k, l INTEGER. On exit, k and l specify the dimension of the subblocks. The sum k+l is equal to the effective numerical rank of (A', B')'.

a On exit, a contains the triangular matrix R or part of R.

b On exit, b contains part of the triangular matrix R if m-k-l < 0.

alpha, beta REAL for single-precision flavors, DOUBLE PRECISION for double-precision flavors.
Arrays, DIMENSION at least max(1, n) each.
Contain the generalized singular value pairs of A and B:
alpha(1:k) = 1, beta(1:k) = 0,
and if m-k-l ≥ 0,
alpha(k+1:k+l) = C, beta(k+1:k+l) = S,
or if m-k-l < 0,
alpha(k+1:m) = C, alpha(m+1:k+l) = 0,
beta(k+1:m) = S, beta(m+1:k+l) = 1,
and
alpha(k+l+1:n) = 0, beta(k+l+1:n) = 0.
u, v, q REAL for sggsvd, DOUBLE PRECISION for dggsvd, COMPLEX for cggsvd, DOUBLE COMPLEX for zggsvd. Arrays:
u(ldu,*); the second dimension of u must be at least max(1, m). If jobu = 'U', u contains the m-by-m orthogonal/unitary matrix U. If jobu = 'N', u is not referenced.
v(ldv,*); the second dimension of v must be at least max(1, p). If jobv = 'V', v contains the p-by-p orthogonal/unitary matrix V. If jobv = 'N', v is not referenced.
q(ldq,*); the second dimension of q must be at least max(1, n). If jobq = 'Q', q contains the n-by-n orthogonal/unitary matrix Q. If jobq = 'N', q is not referenced.
iwork On exit, iwork stores the sorting information.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value. If info = 1, the Jacobi-type procedure failed to converge. For further details, see subroutine tgsja.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine ggsvd interface are the following:
a Holds the matrix A of size (m, n).
b Holds the matrix B of size (p, n).
alpha Holds the vector of length n.
beta Holds the vector of length n.
u Holds the matrix U of size (m, m).
v Holds the matrix V of size (p, p).
q Holds the matrix Q of size (n, n).
iwork Holds the vector of length n.
jobu Restored based on the presence of the argument u as follows: jobu = 'U', if u is present, jobu = 'N', if u is omitted.
jobv Restored based on the presence of the argument v as follows: jobv = 'V', if v is present, jobv = 'N', if v is omitted.
jobq Restored based on the presence of the argument q as follows: jobq = 'Q', if q is present, jobq = 'N', if q is omitted.

Cosine-Sine Decomposition
This section describes LAPACK driver routines for computing the cosine-sine decomposition (CS decomposition). You can also call the corresponding computational routines to perform the same task.
The computation has the following phases:
1. The matrix is reduced to a bidiagonal block form.
2. The blocks are simultaneously diagonalized using techniques from the bidiagonal SVD algorithms.
Table "Driver Routines for Cosine-Sine Decomposition (CSD)" lists LAPACK routines (FORTRAN 77 interface) that perform CS decomposition of matrices. Respective routine names in Fortran 95 interface are without the first symbol (see Routine Naming Conventions).

Driver Routines for Cosine-Sine Decomposition (CSD)
Operation    Real matrices    Complex matrices
Compute the CS decomposition of a block-partitioned orthogonal matrix    orcsd    uncsd
Compute the CS decomposition of a block-partitioned unitary matrix    orcsd    uncsd

See Also
Cosine-Sine Decomposition

?orcsd/?uncsd
Computes the CS decomposition of a block-partitioned orthogonal/unitary matrix.
Syntax
Fortran 77:
call sorcsd( jobu1, jobu2, jobv1t, jobv2t, trans, signs, m, p, q, x11, ldx11, x12, ldx12, x21, ldx21, x22, ldx22, theta, u1, ldu1, u2, ldu2, v1t, ldv1t, v2t, ldv2t, work, lwork, iwork, info )
call dorcsd( jobu1, jobu2, jobv1t, jobv2t, trans, signs, m, p, q, x11, ldx11, x12, ldx12, x21, ldx21, x22, ldx22, theta, u1, ldu1, u2, ldu2, v1t, ldv1t, v2t, ldv2t, work, lwork, iwork, info )
call cuncsd( jobu1, jobu2, jobv1t, jobv2t, trans, signs, m, p, q, x11, ldx11, x12, ldx12, x21, ldx21, x22, ldx22, theta, u1, ldu1, u2, ldu2, v1t, ldv1t, v2t, ldv2t, work, lwork, rwork, lrwork, iwork, info )
call zuncsd( jobu1, jobu2, jobv1t, jobv2t, trans, signs, m, p, q, x11, ldx11, x12, ldx12, x21, ldx21, x22, ldx22, theta, u1, ldu1, u2, ldu2, v1t, ldv1t, v2t, ldv2t, work, lwork, rwork, lrwork, iwork, info )
Fortran 95:
call orcsd( x11,x12,x21,x22,theta,u1,u2,v1t,v2t[,jobu1][,jobu2][,jobv1t][,jobv2t] [,trans][,signs][,info] )
call uncsd( x11,x12,x21,x22,theta,u1,u2,v1t,v2t[,jobu1][,jobu2][,jobv1t][,jobv2t] [,trans][,signs][,info] )
C:
lapack_int LAPACKE_sorcsd( int matrix_order, char jobu1, char jobu2, char jobv1t, char jobv2t, char trans, char signs, lapack_int m, lapack_int p, lapack_int q, float* x11, lapack_int ldx11, float* x12, lapack_int ldx12, float* x21, lapack_int ldx21, float* x22, lapack_int ldx22, float* theta, float* u1, lapack_int ldu1, float* u2, lapack_int ldu2, float* v1t, lapack_int ldv1t, float* v2t, lapack_int ldv2t );
lapack_int LAPACKE_dorcsd( int matrix_order, char jobu1, char jobu2, char jobv1t, char jobv2t, char trans, char signs, lapack_int m, lapack_int p, lapack_int q, double* x11, lapack_int ldx11, double* x12, lapack_int ldx12, double* x21, lapack_int ldx21, double* x22, lapack_int ldx22, double* theta, double* u1, lapack_int ldu1, double* u2, lapack_int ldu2, double* v1t, lapack_int ldv1t, double* v2t, lapack_int ldv2t );
lapack_int LAPACKE_cuncsd( int matrix_order, char jobu1, char
jobu2, char jobv1t, char jobv2t, char trans, char signs, lapack_int m, lapack_int p, lapack_int q, lapack_complex_float* x11, lapack_int ldx11, lapack_complex_float* x12, lapack_int ldx12, lapack_complex_float* x21, lapack_int ldx21, lapack_complex_float* x22, lapack_int ldx22, float* theta, lapack_complex_float* u1, lapack_int ldu1, lapack_complex_float* u2, lapack_int ldu2, lapack_complex_float* v1t, lapack_int ldv1t, lapack_complex_float* v2t, lapack_int ldv2t );
lapack_int LAPACKE_zuncsd( int matrix_order, char jobu1, char jobu2, char jobv1t, char jobv2t, char trans, char signs, lapack_int m, lapack_int p, lapack_int q, lapack_complex_double* x11, lapack_int ldx11, lapack_complex_double* x12, lapack_int ldx12, lapack_complex_double* x21, lapack_int ldx21, lapack_complex_double* x22, lapack_int ldx22, double* theta, lapack_complex_double* u1, lapack_int ldu1, lapack_complex_double* u2, lapack_int ldu2, lapack_complex_double* v1t, lapack_int ldv1t, lapack_complex_double* v2t, lapack_int ldv2t );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routines ?orcsd/?uncsd compute the CS decomposition of an m-by-m partitioned orthogonal matrix X (?orcsd) or unitary matrix X (?uncsd).
(defining equations of the decomposition: figure not reproduced)
x11 is p-by-q. The orthogonal/unitary matrices u1, u2, v1, and v2 are p-by-p, (m-p)-by-(m-p), q-by-q, and (m-q)-by-(m-q), respectively. C and S are r-by-r nonnegative diagonal matrices satisfying C^2 + S^2 = I, in which r = min(p, m-p, q, m-q).

Input Parameters
The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
jobu1 CHARACTER. If jobu1 = 'Y', then u1 is computed. Otherwise, u1 is not computed.
jobu2 CHARACTER. If jobu2 = 'Y', then u2 is computed. Otherwise, u2 is not computed.
jobv1t CHARACTER. If jobv1t = 'Y', then v1t is computed. Otherwise, v1t is not computed.
jobv2t CHARACTER. If jobv2t = 'Y', then v2t is computed. Otherwise, v2t is not computed.
trans CHARACTER.
= 'T': x, u1, u2, v1t, v2t are stored in row-major order.
otherwise: x, u1, u2, v1t, v2t are stored in column-major order.
signs CHARACTER.
= 'O': The lower-left block is made nonpositive (the "other" convention).
otherwise: The upper-right block is made nonpositive (the "default" convention).
m INTEGER. The number of rows and columns of the matrix X.
p INTEGER. The number of rows in x11 and x12. 0 ≤ p ≤ m.
q INTEGER. The number of columns in x11 and x21. 0 ≤ q ≤ m.
x REAL for sorcsd, DOUBLE PRECISION for dorcsd, COMPLEX for cuncsd, DOUBLE COMPLEX for zuncsd. Array, DIMENSION (ldx,m). On entry, the orthogonal/unitary matrix whose CSD is desired.
ldx INTEGER. The leading dimension of the array x. ldx ≥ max(1,m).
ldu1 INTEGER. The leading dimension of the array u1. If jobu1 = 'Y', ldu1 ≥ max(1,p).
ldu2 INTEGER. The leading dimension of the array u2. If jobu2 = 'Y', ldu2 ≥ max(1,m-p).
ldv1t INTEGER. The leading dimension of the array v1t. If jobv1t = 'Y', ldv1t ≥ max(1,q).
ldv2t INTEGER. The leading dimension of the array v2t. If jobv2t = 'Y', ldv2t ≥ max(1,m-q).
work REAL for sorcsd, DOUBLE PRECISION for dorcsd, COMPLEX for cuncsd, DOUBLE COMPLEX for zuncsd. Workspace array, DIMENSION (max(1,lwork)).
lwork INTEGER. The size of the work array. Constraints: If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla.
rwork REAL for cuncsd, DOUBLE PRECISION for zuncsd. Workspace array, DIMENSION (max(1,lrwork)).
lrwork INTEGER. The size of the rwork array.
Constraints: If lrwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the rwork array, returns this value as the first entry of the rwork array, and no error message related to lrwork is issued by xerbla.
iwork INTEGER. Workspace array, dimension m.

Output Parameters
theta REAL for sorcsd and cuncsd, DOUBLE PRECISION for dorcsd and zuncsd. Array, DIMENSION (r), in which r = min(p, m-p, q, m-q). C = diag( cos(theta(1)), ..., cos(theta(r)) ), and S = diag( sin(theta(1)), ..., sin(theta(r)) ).
u1 REAL for sorcsd, DOUBLE PRECISION for dorcsd, COMPLEX for cuncsd, DOUBLE COMPLEX for zuncsd. Array, DIMENSION (ldu1,p). If jobu1 = 'Y', u1 contains the p-by-p orthogonal/unitary matrix u1.
u2 REAL for sorcsd, DOUBLE PRECISION for dorcsd, COMPLEX for cuncsd, DOUBLE COMPLEX for zuncsd. Array, DIMENSION (ldu2,m-p). If jobu2 = 'Y', u2 contains the (m-p)-by-(m-p) orthogonal/unitary matrix u2.
v1t REAL for sorcsd, DOUBLE PRECISION for dorcsd, COMPLEX for cuncsd, DOUBLE COMPLEX for zuncsd. Array, DIMENSION (ldv1t,q). If jobv1t = 'Y', v1t contains the q-by-q orthogonal matrix v1^T or unitary matrix v1^H.
v2t REAL for sorcsd, DOUBLE PRECISION for dorcsd, COMPLEX for cuncsd, DOUBLE COMPLEX for zuncsd. Array, DIMENSION (ldv2t,m-q). If jobv2t = 'Y', v2t contains the (m-q)-by-(m-q) orthogonal matrix v2^T or unitary matrix v2^H.
work On exit, if info = 0, work(1) returns the optimal lwork. If info > 0, work(2:r) contains the values phi(1), ..., phi(r-1) that, together with theta(1), ..., theta(r), define the matrix in intermediate bidiagonal-block form remaining after nonconvergence. info specifies the number of nonzero phi's.
rwork On exit, if info = 0, rwork(1) returns the optimal lrwork. If info > 0, rwork(2:r) contains the values phi(1), ..., phi(r-1) that, together with theta(1), ..., theta(r), define the matrix in intermediate bidiagonal-block form remaining after nonconvergence.
info specifies the number of nonzero phi's.
info INTEGER.
= 0: successful exit
< 0: if info = -i, the i-th argument has an illegal value
> 0: ?bbcsd did not converge. See the description of work above for details.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine ?orcsd/?uncsd interface are as follows:
x11 Holds the block of matrix X of size (p, q).
x12 Holds the block of matrix X of size (p, m-q).
x21 Holds the block of matrix X of size (m-p, q).
x22 Holds the block of matrix X of size (m-p, m-q).
theta Holds the vector of length r = min(p, m-p, q, m-q).
u1 Holds the matrix of size (p, p).
u2 Holds the matrix of size (m-p, m-p).
v1t Holds the matrix of size (q, q).
v2t Holds the matrix of size (m-q, m-q).
jobu1 Indicates whether u1 is computed. Must be 'Y' or 'O'.
jobu2 Indicates whether u2 is computed. Must be 'Y' or 'O'.
jobv1t Indicates whether v1t is computed. Must be 'Y' or 'O'.
jobv2t Indicates whether v2t is computed. Must be 'Y' or 'O'.
trans Must be 'N' or 'T'.
signs Must be 'O' or 'D'.

See Also
?bbcsd
xerbla

Generalized Symmetric Definite Eigenproblems
This section describes LAPACK driver routines used for solving generalized symmetric definite eigenproblems. See also computational routines that can be called to solve these problems. Table "Driver Routines for Solving Generalized Symmetric Definite Eigenproblems" lists all such driver routines for the FORTRAN 77 interface. Respective routine names in the Fortran 95 interface are without the first symbol (see Routine Naming Conventions).
Driver Routines for Solving Generalized Symmetric Definite Eigenproblems
Routine Name    Operation performed
sygv/hegv    Computes all eigenvalues and, optionally, eigenvectors of a real/complex generalized symmetric/Hermitian definite eigenproblem.
sygvd/hegvd    Computes all eigenvalues and, optionally, eigenvectors of a real/complex generalized symmetric/Hermitian definite eigenproblem. If eigenvectors are desired, it uses a divide and conquer method.
sygvx/hegvx    Computes selected eigenvalues and, optionally, eigenvectors of a real/complex generalized symmetric/Hermitian definite eigenproblem.
spgv/hpgv    Computes all eigenvalues and, optionally, eigenvectors of a real/complex generalized symmetric/Hermitian definite eigenproblem with matrices in packed storage.
spgvd/hpgvd    Computes all eigenvalues and, optionally, eigenvectors of a real/complex generalized symmetric/Hermitian definite eigenproblem with matrices in packed storage. If eigenvectors are desired, it uses a divide and conquer method.
spgvx/hpgvx    Computes selected eigenvalues and, optionally, eigenvectors of a real/complex generalized symmetric/Hermitian definite eigenproblem with matrices in packed storage.
sbgv/hbgv    Computes all eigenvalues and, optionally, eigenvectors of a real/complex generalized symmetric/Hermitian definite eigenproblem with banded matrices.
sbgvd/hbgvd    Computes all eigenvalues and, optionally, eigenvectors of a real/complex generalized symmetric/Hermitian definite eigenproblem with banded matrices. If eigenvectors are desired, it uses a divide and conquer method.
sbgvx/hbgvx    Computes selected eigenvalues and, optionally, eigenvectors of a real/complex generalized symmetric/Hermitian definite eigenproblem with banded matrices.

?sygv
Computes all eigenvalues and, optionally, eigenvectors of a real generalized symmetric definite eigenproblem.
Syntax
Fortran 77:
call ssygv(itype, jobz, uplo, n, a, lda, b, ldb, w, work, lwork, info)
call dsygv(itype, jobz, uplo, n, a, lda, b, ldb, w, work, lwork, info)
Fortran 95:
call sygv(a, b, w [,itype] [,jobz] [,uplo] [,info])
C:
lapack_int LAPACKE_ssygv( int matrix_order, lapack_int itype, char jobz, char uplo, lapack_int n, float* a, lapack_int lda, float* b, lapack_int ldb, float* w );
lapack_int LAPACKE_dsygv( int matrix_order, lapack_int itype, char jobz, char uplo, lapack_int n, double* a, lapack_int lda, double* b, lapack_int ldb, double* w );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine computes all the eigenvalues, and optionally, the eigenvectors of a real generalized symmetric-definite eigenproblem, of the form
A*x = λ*B*x, A*B*x = λ*x, or B*A*x = λ*x.
Here A and B are assumed to be symmetric and B is also positive definite.

Input Parameters
The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
itype INTEGER. Must be 1 or 2 or 3. Specifies the problem type to be solved:
if itype = 1, the problem type is A*x = lambda*B*x;
if itype = 2, the problem type is A*B*x = lambda*x;
if itype = 3, the problem type is B*A*x = lambda*x.
jobz CHARACTER*1. Must be 'N' or 'V'. If jobz = 'N', then compute eigenvalues only. If jobz = 'V', then compute eigenvalues and eigenvectors.
uplo CHARACTER*1. Must be 'U' or 'L'. If uplo = 'U', arrays a and b store the upper triangles of A and B; if uplo = 'L', arrays a and b store the lower triangles of A and B.
n INTEGER. The order of the matrices A and B (n ≥ 0).
a, b, work REAL for ssygv, DOUBLE PRECISION for dsygv. Arrays:
a(lda,*) contains the upper or lower triangle of the symmetric matrix A, as specified by uplo. The second dimension of a must be at least max(1, n).
b(ldb,*) contains the upper or lower triangle of the symmetric positive definite matrix B, as specified by uplo.
The second dimension of b must be at least max(1, n).
work is a workspace array, its dimension max(1, lwork).
lda INTEGER. The leading dimension of a; at least max(1, n).
ldb INTEGER. The leading dimension of b; at least max(1, n).
lwork INTEGER. The dimension of the array work; lwork ≥ max(1, 3n-1). If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla. See Application Notes for the suggested value of lwork.

Output Parameters
a On exit, if jobz = 'V', then if info = 0, a contains the matrix Z of eigenvectors. The eigenvectors are normalized as follows:
if itype = 1 or 2, Z^T*B*Z = I;
if itype = 3, Z^T*inv(B)*Z = I;
If jobz = 'N', then on exit the upper triangle (if uplo = 'U') or the lower triangle (if uplo = 'L') of A, including the diagonal, is destroyed.
b On exit, if info ≤ n, the part of b containing the matrix is overwritten by the triangular factor U or L from the Cholesky factorization B = U^T*U or B = L*L^T.
w REAL for ssygv, DOUBLE PRECISION for dsygv. Array, DIMENSION at least max(1, n). If info = 0, contains the eigenvalues in ascending order.
work(1) On exit, if info = 0, then work(1) returns the required minimal size of lwork.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th argument had an illegal value. If info > 0, spotrf/dpotrf and ssyev/dsyev returned an error code:
If info = i ≤ n, ssyev/dsyev failed to converge, and i off-diagonal elements of an intermediate tridiagonal did not converge to zero;
If info = n + i, for 1 ≤ i ≤ n, then the leading minor of order i of B is not positive-definite. The factorization of B could not be completed and no eigenvalues or eigenvectors were computed.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts.
For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine sygv interface are the following:
a Holds the matrix A of size (n, n).
b Holds the matrix B of size (n, n).
w Holds the vector of length n.
itype Must be 1, 2, or 3. The default value is 1.
jobz Must be 'N' or 'V'. The default value is 'N'.
uplo Must be 'U' or 'L'. The default value is 'U'.

Application Notes
For optimum performance use lwork ≥ (nb+2)*n, where nb is the blocksize for ssytrd/dsytrd returned by ilaenv.
If it is not clear how much workspace to supply, use a generous value of lwork for the first run or set lwork = -1.
If lwork has any admissible size, that is, a size no less than the minimal value described, the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the array work on exit. Use this value (work(1)) for subsequent runs.
If lwork = -1, the routine returns immediately and provides the recommended workspace in the first element of the array work. This operation is called a workspace query.
Note that if lwork is less than the minimal required value and is not equal to -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.

?hegv
Computes all eigenvalues and, optionally, eigenvectors of a complex generalized Hermitian definite eigenproblem.
Syntax
Fortran 77:
call chegv(itype, jobz, uplo, n, a, lda, b, ldb, w, work, lwork, rwork, info)
call zhegv(itype, jobz, uplo, n, a, lda, b, ldb, w, work, lwork, rwork, info)
Fortran 95:
call hegv(a, b, w [,itype] [,jobz] [,uplo] [,info])
C:
lapack_int LAPACKE_chegv( int matrix_order, lapack_int itype, char jobz, char uplo, lapack_int n, lapack_complex_float* a, lapack_int lda, lapack_complex_float* b, lapack_int ldb, float* w );
lapack_int LAPACKE_zhegv( int matrix_order, lapack_int itype, char jobz, char uplo, lapack_int n, lapack_complex_double* a, lapack_int lda, lapack_complex_double* b, lapack_int ldb, double* w );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine computes all the eigenvalues, and optionally, the eigenvectors of a complex generalized Hermitian-definite eigenproblem, of the form
A*x = λ*B*x, A*B*x = λ*x, or B*A*x = λ*x.
Here A and B are assumed to be Hermitian and B is also positive definite.

Input Parameters
The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
itype INTEGER. Must be 1 or 2 or 3. Specifies the problem type to be solved:
if itype = 1, the problem type is A*x = lambda*B*x;
if itype = 2, the problem type is A*B*x = lambda*x;
if itype = 3, the problem type is B*A*x = lambda*x.
jobz CHARACTER*1. Must be 'N' or 'V'. If jobz = 'N', then compute eigenvalues only. If jobz = 'V', then compute eigenvalues and eigenvectors.
uplo CHARACTER*1. Must be 'U' or 'L'. If uplo = 'U', arrays a and b store the upper triangles of A and B; if uplo = 'L', arrays a and b store the lower triangles of A and B.
n INTEGER. The order of the matrices A and B (n ≥ 0).
a, b, work COMPLEX for chegv, DOUBLE COMPLEX for zhegv.
Arrays:
a(lda,*) contains the upper or lower triangle of the Hermitian matrix A, as specified by uplo. The second dimension of a must be at least max(1, n).
b(ldb,*) contains the upper or lower triangle of the Hermitian positive definite matrix B, as specified by uplo. The second dimension of b must be at least max(1, n).
work is a workspace array, its dimension max(1, lwork).
lda INTEGER. The leading dimension of a; at least max(1, n).
ldb INTEGER. The leading dimension of b; at least max(1, n).
lwork INTEGER. The dimension of the array work; lwork ≥ max(1, 2n-1). If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla. See Application Notes for the suggested value of lwork.
rwork REAL for chegv, DOUBLE PRECISION for zhegv. Workspace array, DIMENSION at least max(1, 3n-2).

Output Parameters
a On exit, if jobz = 'V', then if info = 0, a contains the matrix Z of eigenvectors. The eigenvectors are normalized as follows:
if itype = 1 or 2, Z^H*B*Z = I;
if itype = 3, Z^H*inv(B)*Z = I;
If jobz = 'N', then on exit the upper triangle (if uplo = 'U') or the lower triangle (if uplo = 'L') of A, including the diagonal, is destroyed.
b On exit, if info ≤ n, the part of b containing the matrix is overwritten by the triangular factor U or L from the Cholesky factorization B = U^H*U or B = L*L^H.
w REAL for chegv, DOUBLE PRECISION for zhegv. Array, DIMENSION at least max(1, n). If info = 0, contains the eigenvalues in ascending order.
work(1) On exit, if info = 0, then work(1) returns the required minimal size of lwork.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th argument has an illegal value.
If info > 0, cpotrf/zpotrf and cheev/zheev return an error code:
If info = i ≤ n, cheev/zheev fails to converge, and i off-diagonal elements of an intermediate tridiagonal do not converge to zero;
If info = n + i, for 1 ≤ i ≤ n, then the leading minor of order i of B is not positive-definite. The factorization of B cannot be completed and no eigenvalues or eigenvectors are computed.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine hegv interface are the following:
a Holds the matrix A of size (n, n).
b Holds the matrix B of size (n, n).
w Holds the vector of length n.
itype Must be 1, 2, or 3. The default value is 1.
jobz Must be 'N' or 'V'. The default value is 'N'.
uplo Must be 'U' or 'L'. The default value is 'U'.

Application Notes
For optimum performance use lwork ≥ (nb+1)*n, where nb is the blocksize for chetrd/zhetrd returned by ilaenv.
If you are in doubt how much workspace to supply, use a generous value of lwork for the first run or set lwork = -1.
If you choose the first option and set lwork to any admissible size, that is, a size no less than the minimal value described, the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the array work on exit. Use this value (work(1)) for subsequent runs.
If you set lwork = -1, the routine returns immediately and provides the recommended workspace in the first element of the array work. This operation is called a workspace query.
Note that if you set lwork to less than the minimal required value and not -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.
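The info values described for ?hegv (and, in the same way, for ?sygv) fall into four ranges. The following C sketch shows one way a caller might interpret them, reading the conditions as info = i ≤ n and info = n + i with 1 ≤ i ≤ n. The type gv_status and the functions gv_classify_info and gv_info_index are hypothetical names introduced for illustration; they are not part of Intel MKL or LAPACK.

```c
/* Hypothetical helper, not an MKL or LAPACK API: interprets the info value
   returned by ?sygv/?hegv for matrices of order n. */
typedef enum {
    GV_SUCCESS,          /* info = 0 */
    GV_ILLEGAL_ARGUMENT, /* info = -i: the i-th argument is illegal */
    GV_NOT_CONVERGED,    /* 1 <= info <= n: the eigensolver did not converge */
    GV_B_NOT_POSDEF,     /* info = n + i, 1 <= i <= n: B is not positive definite */
    GV_OUT_OF_RANGE      /* a value the routine description does not cover */
} gv_status;

/* Maps info to one of the documented outcome classes. */
gv_status gv_classify_info(int info, int n)
{
    if (info == 0) return GV_SUCCESS;
    if (info < 0) return GV_ILLEGAL_ARGUMENT;
    if (info <= n) return GV_NOT_CONVERGED;
    if (info <= 2 * n) return GV_B_NOT_POSDEF;
    return GV_OUT_OF_RANGE;
}

/* Extracts the index i carried by info:
   - illegal argument: position of the offending argument;
   - no convergence: number of off-diagonal elements that did not converge;
   - B not positive definite: order of the failing leading minor of B;
   - otherwise 0. */
int gv_info_index(int info, int n)
{
    switch (gv_classify_info(info, n)) {
    case GV_ILLEGAL_ARGUMENT: return -info;
    case GV_NOT_CONVERGED:    return info;
    case GV_B_NOT_POSDEF:     return info - n;
    default:                  return 0;
    }
}
```

For example, with n = 4 a return value of info = 6 would be classified as GV_B_NOT_POSDEF with index 2, meaning the leading minor of order 2 of B is not positive definite.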
?sygvd
Computes all eigenvalues and, optionally, eigenvectors of a real generalized symmetric definite eigenproblem. If eigenvectors are desired, it uses a divide and conquer method.

Syntax
Fortran 77:
call ssygvd(itype, jobz, uplo, n, a, lda, b, ldb, w, work, lwork, iwork, liwork, info)
call dsygvd(itype, jobz, uplo, n, a, lda, b, ldb, w, work, lwork, iwork, liwork, info)
Fortran 95:
call sygvd(a, b, w [,itype] [,jobz] [,uplo] [,info])
C:
lapack_int LAPACKE_ssygvd( int matrix_order, lapack_int itype, char jobz, char uplo, lapack_int n, float* a, lapack_int lda, float* b, lapack_int ldb, float* w );
lapack_int LAPACKE_dsygvd( int matrix_order, lapack_int itype, char jobz, char uplo, lapack_int n, double* a, lapack_int lda, double* b, lapack_int ldb, double* w );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine computes all the eigenvalues, and optionally, the eigenvectors of a real generalized symmetric-definite eigenproblem, of the form
A*x = λ*B*x, A*B*x = λ*x, or B*A*x = λ*x.
Here A and B are assumed to be symmetric and B is also positive definite. If eigenvectors are desired, it uses a divide and conquer algorithm.

Input Parameters
The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
itype INTEGER. Must be 1 or 2 or 3. Specifies the problem type to be solved:
if itype = 1, the problem type is A*x = lambda*B*x;
if itype = 2, the problem type is A*B*x = lambda*x;
if itype = 3, the problem type is B*A*x = lambda*x.
jobz CHARACTER*1. Must be 'N' or 'V'. If jobz = 'N', then compute eigenvalues only. If jobz = 'V', then compute eigenvalues and eigenvectors.
uplo CHARACTER*1. Must be 'U' or 'L'. If uplo = 'U', arrays a and b store the upper triangles of A and B; if uplo = 'L', arrays a and b store the lower triangles of A and B.
n INTEGER. The order of the matrices A and B (n ≥ 0).
a, b, work REAL for ssygvd, DOUBLE PRECISION for dsygvd. Arrays:
a(lda,*) contains the upper or lower triangle of the symmetric matrix A, as specified by uplo. The second dimension of a must be at least max(1, n).
b(ldb,*) contains the upper or lower triangle of the symmetric positive definite matrix B, as specified by uplo. The second dimension of b must be at least max(1, n).
work is a workspace array, its dimension max(1, lwork).
lda INTEGER. The leading dimension of a; at least max(1, n).
ldb INTEGER. The leading dimension of b; at least max(1, n).
lwork INTEGER. The dimension of the array work. Constraints:
If n ≤ 1, lwork ≥ 1;
If jobz = 'N' and n > 1, lwork ≥ 2n+1;
If jobz = 'V' and n > 1, lwork ≥ 2n^2+6n+1.
If lwork = -1, then a workspace query is assumed; the routine only calculates the required sizes of the work and iwork arrays, returns these values as the first entries of the work and iwork arrays, and no error message related to lwork or liwork is issued by xerbla. See Application Notes for details.
iwork INTEGER. Workspace array, its dimension max(1, liwork).
liwork INTEGER. The dimension of the array iwork. Constraints:
If n ≤ 1, liwork ≥ 1;
If jobz = 'N' and n > 1, liwork ≥ 1;
If jobz = 'V' and n > 1, liwork ≥ 5n+3.
If liwork = -1, then a workspace query is assumed; the routine only calculates the required sizes of the work and iwork arrays, returns these values as the first entries of the work and iwork arrays, and no error message related to lwork or liwork is issued by xerbla. See Application Notes for details.

Output Parameters
a On exit, if jobz = 'V', then if info = 0, a contains the matrix Z of eigenvectors. The eigenvectors are normalized as follows:
if itype = 1 or 2, Z^T*B*Z = I;
if itype = 3, Z^T*inv(B)*Z = I;
If jobz = 'N', then on exit the upper triangle (if uplo = 'U') or the lower triangle (if uplo = 'L') of A, including the diagonal, is destroyed.
b On exit, if info ≤ n, the part of b containing the matrix is overwritten by the triangular factor U or L from the Cholesky factorization B = U^T*U or B = L*L^T.
w REAL for ssygvd, DOUBLE PRECISION for dsygvd. Array, DIMENSION at least max(1, n). If info = 0, contains the eigenvalues in ascending order.
work(1) On exit, if info = 0, then work(1) returns the required minimal size of lwork.
iwork(1) On exit, if info = 0, then iwork(1) returns the required minimal size of liwork.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th argument had an illegal value. If info > 0, an error code is returned as specified below.
• For info ≤ n:
  • If info = i, with i ≤ n, and jobz = 'N', then the algorithm failed to converge; i off-diagonal elements of an intermediate tridiagonal form did not converge to zero.
  • If jobz = 'V', then the algorithm failed to compute an eigenvalue while working on the submatrix lying in rows and columns info/(n+1) through mod(info,n+1).
• For info > n:
  • If info = n + i, for 1 ≤ i ≤ n, then the leading minor of order i of B is not positive-definite. The factorization of B could not be completed and no eigenvalues or eigenvectors were computed.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine sygvd interface are the following:
a Holds the matrix A of size (n, n).
b Holds the matrix B of size (n, n).
w Holds the vector of length n.
itype Must be 1, 2, or 3. The default value is 1.
jobz Must be 'N' or 'V'. The default value is 'N'.
uplo Must be 'U' or 'L'. The default value is 'U'.

Application Notes
If it is not clear how much workspace to supply, use a generous value of lwork (or liwork) for the first run or set lwork = -1 (liwork = -1).
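As a sketch of the minimal sizes that the lwork and liwork constraints for ?sygvd describe, the C helpers below return the smallest admissible values, assuming the listed expressions are lower bounds (lwork ≥ 2n+1, lwork ≥ 2n^2+6n+1, liwork ≥ 5n+3, and size 1 when n ≤ 1). The function names sygvd_min_lwork and sygvd_min_liwork are hypothetical and are not part of Intel MKL; a workspace query with lwork = -1 remains the way to obtain the size the routine itself recommends.

```c
/* Hypothetical helpers, not MKL APIs: smallest admissible workspace sizes
   for ?sygvd, taken from the documented constraints on lwork and liwork. */

/* Minimal lwork: 1 for n <= 1, 2n+1 for jobz = 'N',
   2n^2 + 6n + 1 for jobz = 'V'. */
long long sygvd_min_lwork(char jobz, long long n)
{
    if (n <= 1) return 1;
    if (jobz == 'V' || jobz == 'v') return 2 * n * n + 6 * n + 1;
    return 2 * n + 1;
}

/* Minimal liwork: 1 for n <= 1 or jobz = 'N', 5n + 3 for jobz = 'V'. */
long long sygvd_min_liwork(char jobz, long long n)
{
    if (n <= 1) return 1;
    if (jobz == 'V' || jobz == 'v') return 5 * n + 3;
    return 1;
}
```

For n = 4 and jobz = 'V' these give lwork = 57 and liwork = 23; any smaller value other than -1 would be rejected with an error exit.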
If lwork (or liwork) has any admissible size, that is, one no less than the minimal value described, the routine completes the task, though probably not as fast as with the recommended workspace, and returns the recommended workspace size in the first element of the corresponding array (work, iwork) on exit. Use this value (work(1), iwork(1)) for subsequent runs. If lwork = -1 (liwork = -1), the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work, iwork). This operation is called a workspace query. Note that if lwork (liwork) is less than the minimal required value and is not equal to -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.
?hegvd Computes all eigenvalues and, optionally, eigenvectors of a complex generalized Hermitian definite eigenproblem. If eigenvectors are desired, it uses a divide and conquer method.
Syntax Fortran 77: call chegvd(itype, jobz, uplo, n, a, lda, b, ldb, w, work, lwork, rwork, lrwork, iwork, liwork, info) call zhegvd(itype, jobz, uplo, n, a, lda, b, ldb, w, work, lwork, rwork, lrwork, iwork, liwork, info) Fortran 95: call hegvd(a, b, w [,itype] [,jobz] [,uplo] [,info]) C: lapack_int LAPACKE_chegvd( int matrix_order, lapack_int itype, char jobz, char uplo, lapack_int n, lapack_complex_float* a, lapack_int lda, lapack_complex_float* b, lapack_int ldb, float* w ); lapack_int LAPACKE_zhegvd( int matrix_order, lapack_int itype, char jobz, char uplo, lapack_int n, lapack_complex_double* a, lapack_int lda, lapack_complex_double* b, lapack_int ldb, double* w ); Include Files • Fortran: mkl_lapack.fi and mkl_lapack.h • Fortran 95: lapack.f90 • C: mkl_lapacke.h Description The routine computes all the eigenvalues, and optionally, the eigenvectors of a complex generalized Hermitian-definite eigenproblem, of the form A*x = λ*B*x, A*B*x = λ*x, or B*A*x = λ*x. Here A and B are assumed to be Hermitian and B is also positive definite. If eigenvectors are desired, it uses a divide and conquer algorithm. Input Parameters The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions. itype INTEGER. Must be 1 or 2 or 3. Specifies the problem type to be solved: if itype = 1, the problem type is A*x = lambda*B*x; if itype = 2, the problem type is A*B*x = lambda*x; if itype = 3, the problem type is B*A*x = lambda*x. jobz CHARACTER*1. Must be 'N' or 'V'. If jobz = 'N', then compute eigenvalues only. If jobz = 'V', then compute eigenvalues and eigenvectors. uplo CHARACTER*1. Must be 'U' or 'L'. If uplo = 'U', arrays a and b store the upper triangles of A and B; If uplo = 'L', arrays a and b store the lower triangles of A and B.
n INTEGER. The order of the matrices A and B (n ≥ 0). a, b, work COMPLEX for chegvd DOUBLE COMPLEX for zhegvd. Arrays: a(lda,*) contains the upper or lower triangle of the Hermitian matrix A, as specified by uplo. The second dimension of a must be at least max(1, n). b(ldb,*) contains the upper or lower triangle of the Hermitian positive definite matrix B, as specified by uplo. The second dimension of b must be at least max(1, n). work is a workspace array, its dimension max(1, lwork). lda INTEGER. The leading dimension of a; at least max(1, n). ldb INTEGER. The leading dimension of b; at least max(1, n). lwork INTEGER. The dimension of the array work. Constraints: If n ≤ 1, lwork ≥ 1; If jobz = 'N' and n>1, lwork ≥ n+1; If jobz = 'V' and n>1, lwork ≥ n^2+2n. If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work, rwork and iwork arrays, returns these values as the first entries of the work, rwork and iwork arrays, and no error message related to lwork or lrwork or liwork is issued by xerbla. See Application Notes for details. rwork REAL for chegvd DOUBLE PRECISION for zhegvd. Workspace array, DIMENSION max(1, lrwork). lrwork INTEGER. The dimension of the array rwork. Constraints: If n ≤ 1, lrwork ≥ 1; If jobz = 'N' and n>1, lrwork ≥ n; If jobz = 'V' and n>1, lrwork ≥ 2n^2+5n+1. If lrwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work, rwork and iwork arrays, returns these values as the first entries of the work, rwork and iwork arrays, and no error message related to lwork or lrwork or liwork is issued by xerbla. See Application Notes for details. iwork INTEGER. Workspace array, DIMENSION max(1, liwork). liwork INTEGER. The dimension of the array iwork. Constraints: If n ≤ 1, liwork ≥ 1; If jobz = 'N' and n>1, liwork ≥ 1; If jobz = 'V' and n>1, liwork ≥ 5n+3.
If liwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work, rwork and iwork arrays, returns these values as the first entries of the work, rwork and iwork arrays, and no error message related to lwork or lrwork or liwork is issued by xerbla. See Application Notes for details. Output Parameters a On exit, if jobz = 'V', then if info = 0, a contains the matrix Z of eigenvectors. The eigenvectors are normalized as follows: if itype = 1 or 2, Z^H*B*Z = I; if itype = 3, Z^H*inv(B)*Z = I; If jobz = 'N', then on exit the upper triangle (if uplo = 'U') or the lower triangle (if uplo = 'L') of A, including the diagonal, is destroyed. b On exit, if info ≤ n, the part of b containing the matrix is overwritten by the triangular factor U or L from the Cholesky factorization B = U^H*U or B = L*L^H. w REAL for chegvd DOUBLE PRECISION for zhegvd. Array, DIMENSION at least max(1, n). If info = 0, contains the eigenvalues in ascending order. work(1) On exit, if info = 0, then work(1) returns the required minimal size of lwork. rwork(1) On exit, if info = 0, then rwork(1) returns the required minimal size of lrwork. iwork(1) On exit, if info = 0, then iwork(1) returns the required minimal size of liwork. info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th argument had an illegal value. If info = i, with i ≤ n, and jobz = 'N', then the algorithm failed to converge; i off-diagonal elements of an intermediate tridiagonal form did not converge to zero; if info = i, with i ≤ n, and jobz = 'V', then the algorithm failed to compute an eigenvalue while working on the submatrix lying in rows and columns info/(n+1) through mod(info, n+1). If info = n + i, for 1 ≤ i ≤ n, then the leading minor of order i of B is not positive-definite. The factorization of B could not be completed and no eigenvalues or eigenvectors were computed.
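The info ranges described above can be sketched as a small classifier. This is only an illustration: hegvd_status and hegvd_classify are hypothetical names, not part of Intel MKL, and the sketch assumes the documented split into info ≤ n (eigensolver failure) and n < info ≤ 2n (Cholesky failure for B). It does not decode the submatrix rows and columns that the jobz = 'V' case packs into info.

```c
/* Hypothetical helper, not part of Intel MKL: classifies the info value
   returned by chegvd/zhegvd for matrices of order n, following the
   ranges given in the Output Parameters description. */
typedef enum {
    HEGVD_OK,             /* info = 0: successful execution              */
    HEGVD_BAD_ARGUMENT,   /* info = -i: the i-th argument was illegal    */
    HEGVD_NO_CONVERGENCE, /* 1 <= info <= n: eigensolver failure         */
    HEGVD_B_NOT_POS_DEF,  /* n < info <= 2n: factorization of B failed   */
    HEGVD_UNKNOWN         /* anything outside the documented ranges      */
} hegvd_status;

/* On return, *detail holds the argument index (bad argument), the raw
   info value (no convergence), or the order i of the leading minor of B
   that is not positive-definite. It is 0 otherwise. */
static hegvd_status hegvd_classify(long info, long n, long *detail)
{
    *detail = 0;
    if (info == 0) {
        return HEGVD_OK;
    }
    if (info < 0) {
        *detail = -info;
        return HEGVD_BAD_ARGUMENT;
    }
    if (info <= n) {
        *detail = info;
        return HEGVD_NO_CONVERGENCE;
    }
    if (info <= 2 * n) {
        *detail = info - n;
        return HEGVD_B_NOT_POS_DEF;
    }
    return HEGVD_UNKNOWN;
}
```

A caller would pass the info value returned by the routine together with n, then report, for example, "leading minor of order detail of B is not positive-definite" when the result is HEGVD_B_NOT_POS_DEF.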
Fortran 95 Interface Notes Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions. Specific details for the routine hegvd interface are the following: a Holds the matrix A of size (n, n). b Holds the matrix B of size (n, n). w Holds the vector of length n. itype Must be 1, 2, or 3. The default value is 1. jobz Must be 'N' or 'V'. The default value is 'N'. uplo Must be 'U' or 'L'. The default value is 'U'. Application Notes If you are in doubt how much workspace to supply, use a generous value of lwork (liwork or lrwork) for the first run or set lwork = -1 (liwork = -1, lrwork = -1). If you choose the first option and set lwork (liwork or lrwork) to any admissible size, that is, one no less than the minimal value described, the routine completes the task, though probably not as fast as with the recommended workspace, and returns the recommended workspace size in the first element of the corresponding array (work, iwork, rwork) on exit. Use this value (work(1), iwork(1), rwork(1)) for subsequent runs. If you set lwork = -1 (liwork = -1, lrwork = -1), the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work, iwork, rwork). This operation is called a workspace query. Note that if you set lwork (liwork, lrwork) to less than the minimal required value and not -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.
?sygvx Computes selected eigenvalues and, optionally, eigenvectors of a real generalized symmetric definite eigenproblem.
Syntax Fortran 77: call ssygvx(itype, jobz, range, uplo, n, a, lda, b, ldb, vl, vu, il, iu, abstol, m, w, z, ldz, work, lwork, iwork, ifail, info) call dsygvx(itype, jobz, range, uplo, n, a, lda, b, ldb, vl, vu, il, iu, abstol, m, w, z, ldz, work, lwork, iwork, ifail, info) Fortran 95: call sygvx(a, b, w [,itype] [,uplo] [,z] [,vl] [,vu] [,il] [,iu] [,m] [,ifail] [,abstol] [,info]) C: lapack_int LAPACKE_<?>sygvx( int matrix_order, lapack_int itype, char jobz, char range, char uplo, lapack_int n, <datatype>* a, lapack_int lda, <datatype>* b, lapack_int ldb, <datatype> vl, <datatype> vu, lapack_int il, lapack_int iu, <datatype> abstol, lapack_int* m, <datatype>* w, <datatype>* z, lapack_int ldz, lapack_int* ifail ); Include Files • Fortran: mkl_lapack.fi and mkl_lapack.h • Fortran 95: lapack.f90 • C: mkl_lapacke.h Description The routine computes selected eigenvalues, and optionally, the eigenvectors of a real generalized symmetric-definite eigenproblem, of the form A*x = λ*B*x, A*B*x = λ*x, or B*A*x = λ*x. Here A and B are assumed to be symmetric and B is also positive definite. Eigenvalues and eigenvectors can be selected by specifying either a range of values or a range of indices for the desired eigenvalues. Input Parameters The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions. itype INTEGER. Must be 1 or 2 or 3. Specifies the problem type to be solved: if itype = 1, the problem type is A*x = lambda*B*x; if itype = 2, the problem type is A*B*x = lambda*x; if itype = 3, the problem type is B*A*x = lambda*x. jobz CHARACTER*1. Must be 'N' or 'V'. If jobz = 'N', then compute eigenvalues only. If jobz = 'V', then compute eigenvalues and eigenvectors. range CHARACTER*1. Must be 'A' or 'V' or 'I'. If range = 'A', the routine computes all eigenvalues.
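If range = 'V', the routine computes eigenvalues lambda(i) in the half-open interval: vl < lambda(i) ≤ vu. If range = 'I', the routine computes eigenvalues with indices il to iu. uplo CHARACTER*1. Must be 'U' or 'L'. If uplo = 'U', arrays a and b store the upper triangles of A and B; If uplo = 'L', arrays a and b store the lower triangles of A and B. n INTEGER. The order of the matrices A and B (n ≥ 0). a, b, work REAL for ssygvx DOUBLE PRECISION for dsygvx. Arrays: a(lda,*) contains the upper or lower triangle of the symmetric matrix A, as specified by uplo. The second dimension of a must be at least max(1, n). b(ldb,*) contains the upper or lower triangle of the symmetric positive definite matrix B, as specified by uplo. The second dimension of b must be at least max(1, n). work is a workspace array, its dimension max(1, lwork). lda INTEGER. The leading dimension of a; at least max(1, n). ldb INTEGER. The leading dimension of b; at least max(1, n). vl, vu REAL for ssygvx DOUBLE PRECISION for dsygvx. If range = 'V', the lower and upper bounds of the interval to be searched for eigenvalues. Constraint: vl < vu. If range = 'A' or 'I', vl and vu are not referenced. il, iu INTEGER. If range = 'I', the indices in ascending order of the smallest and largest eigenvalues to be returned. Constraint: 1 ≤ il ≤ iu ≤ n, if n > 0; il=1 and iu=0 if n = 0. If range = 'A' or 'V', il and iu are not referenced. abstol REAL for ssygvx DOUBLE PRECISION for dsygvx. The absolute error tolerance for the eigenvalues. See Application Notes for more information. ldz INTEGER. The leading dimension of the output array z. Constraints: ldz ≥ 1; if jobz = 'V', ldz ≥ max(1, n). lwork INTEGER. The dimension of the array work; lwork ≥ max(1, 8n). If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla. See Application Notes for the suggested value of lwork. iwork INTEGER. Workspace array, DIMENSION at least max(1, 5n). Output Parameters a On exit, the upper triangle (if uplo = 'U') or the lower triangle (if uplo = 'L') of A, including the diagonal, is overwritten. b On exit, if info ≤ n, the part of b containing the matrix is overwritten by the triangular factor U or L from the Cholesky factorization B = U^T*U or B = L*L^T. m INTEGER. The total number of eigenvalues found, 0 ≤ m ≤ n. If range = 'A', m = n, and if range = 'I', m = iu-il+1. w, z REAL for ssygvx DOUBLE PRECISION for dsygvx. Arrays: w(*), DIMENSION at least max(1, n). The first m elements of w contain the selected eigenvalues in ascending order. z(ldz,*). The second dimension of z must be at least max(1, m). If jobz = 'V', then if info = 0, the first m columns of z contain the orthonormal eigenvectors of the matrix A corresponding to the selected eigenvalues, with the i-th column of z holding the eigenvector associated with w(i). The eigenvectors are normalized as follows: if itype = 1 or 2, Z^T*B*Z = I; if itype = 3, Z^T*inv(B)*Z = I; If jobz = 'N', then z is not referenced.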
If an eigenvector fails to converge, then that column of z contains the latest approximation to the eigenvector, and the index of the eigenvector is returned in ifail. Note: you must ensure that at least max(1,m) columns are supplied in the array z; if range = 'V', the exact value of m is not known in advance and an upper bound must be used. work(1) On exit, if info = 0, then work(1) returns the required minimal size of lwork. ifail INTEGER. Array, DIMENSION at least max(1, n). If jobz = 'V', then if info = 0, the first m elements of ifail are zero; if info > 0, ifail contains the indices of the eigenvectors that failed to converge. If jobz = 'N', then ifail is not referenced. info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th argument had an illegal value. If info > 0, spotrf/dpotrf and ssyevx/dsyevx returned an error code: If info = i ≤ n, ssyevx/dsyevx failed to converge, and i eigenvectors failed to converge. Their indices are stored in the array ifail; If info = n + i, for 1 ≤ i ≤ n, then the leading minor of order i of B is not positive-definite. The factorization of B could not be completed and no eigenvalues or eigenvectors were computed. Fortran 95 Interface Notes Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions. Specific details for the routine sygvx interface are the following: a Holds the matrix A of size (n, n). b Holds the matrix B of size (n, n). w Holds the vector of length n. z Holds the matrix Z of size (n, n), where the values n and m are significant. ifail Holds the vector of length n. itype Must be 1, 2, or 3. The default value is 1. uplo Must be 'U' or 'L'. The default value is 'U'. vl Default value for this element is vl = -HUGE(vl).
vu Default value for this element is vu = HUGE(vl). il Default value for this argument is il = 1. iu Default value for this argument is iu = n. abstol Default value for this element is abstol = 0.0_WP. jobz Restored based on the presence of the argument z as follows: jobz = 'V', if z is present, jobz = 'N', if z is omitted. Note that there will be an error condition if ifail is present and z is omitted. range Restored based on the presence of arguments vl, vu, il, iu as follows: range = 'V', if one of or both vl and vu are present, range = 'I', if one of or both il and iu are present, range = 'A', if none of vl, vu, il, iu is present. Note that there will be an error condition if one of or both vl and vu are present and at the same time one of or both il and iu are present. Application Notes An approximate eigenvalue is accepted as converged when it is determined to lie in an interval [a,b] of width less than or equal to abstol+ε*max(|a|,|b|), where ε is the machine precision. If abstol is less than or equal to zero, then ε*||T||_1 is used as tolerance, where T is the tridiagonal matrix obtained by reducing C to tridiagonal form, where C is the symmetric matrix of the standard symmetric problem to which the generalized problem is transformed. Eigenvalues will be computed most accurately when abstol is set to twice the underflow threshold 2*?lamch('S'), not zero. If this routine returns with info > 0, indicating that some eigenvectors did not converge, set abstol to 2*?lamch('S'). For optimum performance use lwork ≥ (nb+3)*n, where nb is the blocksize for ssytrd/dsytrd returned by ilaenv. If it is not clear how much workspace to supply, use a generous value of lwork for the first run, or set lwork = -1. In the first case the routine completes the task, though probably not as fast as with the recommended workspace, and returns the recommended workspace size in the first element of the corresponding array work on exit.
Use this value (work(1)) for subsequent runs. If lwork = -1, then the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work). This operation is called a workspace query. Note that if lwork is less than the minimal required value and is not equal to -1, then the routine returns immediately with an error exit and does not provide any information on the recommended workspace.
?hegvx Computes selected eigenvalues and, optionally, eigenvectors of a complex generalized Hermitian definite eigenproblem. Syntax Fortran 77: call chegvx(itype, jobz, range, uplo, n, a, lda, b, ldb, vl, vu, il, iu, abstol, m, w, z, ldz, work, lwork, rwork, iwork, ifail, info) call zhegvx(itype, jobz, range, uplo, n, a, lda, b, ldb, vl, vu, il, iu, abstol, m, w, z, ldz, work, lwork, rwork, iwork, ifail, info) Fortran 95: call hegvx(a, b, w [,itype] [,uplo] [,z] [,vl] [,vu] [,il] [,iu] [,m] [,ifail] [,abstol] [,info]) C: lapack_int LAPACKE_chegvx( int matrix_order, lapack_int itype, char jobz, char range, char uplo, lapack_int n, lapack_complex_float* a, lapack_int lda, lapack_complex_float* b, lapack_int ldb, float vl, float vu, lapack_int il, lapack_int iu, float abstol, lapack_int* m, float* w, lapack_complex_float* z, lapack_int ldz, lapack_int* ifail ); lapack_int LAPACKE_zhegvx( int matrix_order, lapack_int itype, char jobz, char range, char uplo, lapack_int n, lapack_complex_double* a, lapack_int lda, lapack_complex_double* b, lapack_int ldb, double vl, double vu, lapack_int il, lapack_int iu, double abstol, lapack_int* m, double* w, lapack_complex_double* z, lapack_int ldz, lapack_int* ifail ); Include Files • Fortran: mkl_lapack.fi and mkl_lapack.h • Fortran 95: lapack.f90 • C: mkl_lapacke.h Description The routine computes selected eigenvalues, and optionally, the eigenvectors of a complex generalized Hermitian-definite eigenproblem, of the form A*x = λ*B*x,
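A*B*x = λ*x, or B*A*x = λ*x. Here A and B are assumed to be Hermitian and B is also positive definite. Eigenvalues and eigenvectors can be selected by specifying either a range of values or a range of indices for the desired eigenvalues. Input Parameters The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions. itype INTEGER. Must be 1 or 2 or 3. Specifies the problem type to be solved: if itype = 1, the problem type is A*x = lambda*B*x; if itype = 2, the problem type is A*B*x = lambda*x; if itype = 3, the problem type is B*A*x = lambda*x. jobz CHARACTER*1. Must be 'N' or 'V'. If jobz = 'N', then compute eigenvalues only. If jobz = 'V', then compute eigenvalues and eigenvectors. range CHARACTER*1. Must be 'A' or 'V' or 'I'. If range = 'A', the routine computes all eigenvalues. If range = 'V', the routine computes eigenvalues lambda(i) in the half-open interval: vl < lambda(i) ≤ vu. If range = 'I', the routine computes eigenvalues with indices il to iu. uplo CHARACTER*1. Must be 'U' or 'L'. If uplo = 'U', arrays a and b store the upper triangles of A and B; If uplo = 'L', arrays a and b store the lower triangles of A and B. n INTEGER. The order of the matrices A and B (n ≥ 0). a, b, work COMPLEX for chegvx DOUBLE COMPLEX for zhegvx. Arrays: a(lda,*) contains the upper or lower triangle of the Hermitian matrix A, as specified by uplo. The second dimension of a must be at least max(1, n). b(ldb,*) contains the upper or lower triangle of the Hermitian positive definite matrix B, as specified by uplo. The second dimension of b must be at least max(1, n). work is a workspace array, its dimension max(1, lwork). lda INTEGER. The leading dimension of a; at least max(1, n). ldb INTEGER. The leading dimension of b; at least max(1, n). vl, vu REAL for chegvx DOUBLE PRECISION for zhegvx. If range = 'V', the lower and upper bounds of the interval to be searched for eigenvalues. Constraint: vl < vu. If range = 'A' or 'I', vl and vu are not referenced. il, iu INTEGER. If range = 'I', the indices in ascending order of the smallest and largest eigenvalues to be returned. Constraint: 1 ≤ il ≤ iu ≤ n, if n > 0; il=1 and iu=0 if n = 0. If range = 'A' or 'V', il and iu are not referenced. abstol REAL for chegvx DOUBLE PRECISION for zhegvx. The absolute error tolerance for the eigenvalues. See Application Notes for more information. ldz INTEGER. The leading dimension of the output array z. Constraints: ldz ≥ 1; if jobz = 'V', ldz ≥ max(1, n). lwork INTEGER. The dimension of the array work; lwork ≥ max(1, 2n). If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla. See Application Notes for the suggested value of lwork. rwork REAL for chegvx DOUBLE PRECISION for zhegvx. Workspace array, DIMENSION at least max(1, 7n). iwork INTEGER. Workspace array, DIMENSION at least max(1, 5n).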
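The minimal array sizes listed above for the Fortran 77 interface of ?hegvx can be collected in one place before allocating. The following is a sketch under assumptions: hegvx_sizes and hegvx_min_sizes are hypothetical names, not MKL functions, and the values are read as lower bounds (lwork at least max(1, 2n), rwork at least max(1, 7n), iwork at least max(1, 5n), ldz at least max(1, n) when jobz = 'V' and at least 1 otherwise). These are minimum sizes only; a workspace query with lwork = -1 reports the size recommended for performance.

```c
#include <stddef.h>

/* Hypothetical container for the minimal sizes of the ?hegvx workspace
   arrays and the leading dimension of z. Not part of Intel MKL. */
typedef struct {
    size_t lwork; /* minimal length of work  */
    size_t rwork; /* minimal length of rwork */
    size_t iwork; /* minimal length of iwork */
    size_t ldz;   /* minimal leading dimension of z */
} hegvx_sizes;

static size_t max_size(size_t a, size_t b)
{
    return a > b ? a : b;
}

/* Computes the documented lower bounds for matrices of order n. */
static hegvx_sizes hegvx_min_sizes(size_t n, char jobz)
{
    hegvx_sizes s;
    s.lwork = max_size(1, 2 * n);
    s.rwork = max_size(1, 7 * n);
    s.iwork = max_size(1, 5 * n);
    /* z is referenced only when eigenvectors are requested. */
    s.ldz = (jobz == 'V') ? max_size(1, n) : 1;
    return s;
}
```

For n = 10 and jobz = 'V' this gives lwork = 20, rwork = 70, iwork = 50, and ldz = 10.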
Output Parameters a On exit, the upper triangle (if uplo = 'U') or the lower triangle (if uplo = 'L') of A, including the diagonal, is overwritten. b On exit, if info ≤ n, the part of b containing the matrix is overwritten by the triangular factor U or L from the Cholesky factorization B = U^H*U or B = L*L^H. m INTEGER. The total number of eigenvalues found, 0 ≤ m ≤ n. If range = 'A', m = n, and if range = 'I', m = iu-il+1. w REAL for chegvx DOUBLE PRECISION for zhegvx. Array, DIMENSION at least max(1, n). The first m elements of w contain the selected eigenvalues in ascending order. z COMPLEX for chegvx DOUBLE COMPLEX for zhegvx. Array z(ldz,*). The second dimension of z must be at least max(1, m). If jobz = 'V', then if info = 0, the first m columns of z contain the orthonormal eigenvectors of the matrix A corresponding to the selected eigenvalues, with the i-th column of z holding the eigenvector associated with w(i). The eigenvectors are normalized as follows: if itype = 1 or 2, Z^H*B*Z = I; if itype = 3, Z^H*inv(B)*Z = I; If jobz = 'N', then z is not referenced. If an eigenvector fails to converge, then that column of z contains the latest approximation to the eigenvector, and the index of the eigenvector is returned in ifail. Note: you must ensure that at least max(1,m) columns are supplied in the array z; if range = 'V', the exact value of m is not known in advance and an upper bound must be used. work(1) On exit, if info = 0, then work(1) returns the required minimal size of lwork. ifail INTEGER. Array, DIMENSION at least max(1, n). If jobz = 'V', then if info = 0, the first m elements of ifail are zero; if info > 0, ifail contains the indices of the eigenvectors that failed to converge. If jobz = 'N', then ifail is not referenced. info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th argument had an illegal value.
If info > 0, cpotrf/zpotrf and cheevx/zheevx returned an error code: If info = i ≤ n, cheevx/zheevx failed to converge, and i eigenvectors failed to converge. Their indices are stored in the array ifail; If info = n + i, for 1 ≤ i ≤ n, then the leading minor of order i of B is not positive-definite. The factorization of B could not be completed and no eigenvalues or eigenvectors were computed. Fortran 95 Interface Notes Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions. Specific details for the routine hegvx interface are the following: a Holds the matrix A of size (n, n). b Holds the matrix B of size (n, n). w Holds the vector of length n. z Holds the matrix Z of size (n, n), where the values n and m are significant. ifail Holds the vector of length n. itype Must be 1, 2, or 3. The default value is 1. uplo Must be 'U' or 'L'. The default value is 'U'. vl Default value for this element is vl = -HUGE(vl). vu Default value for this element is vu = HUGE(vl). il Default value for this argument is il = 1. iu Default value for this argument is iu = n. abstol Default value for this element is abstol = 0.0_WP. jobz Restored based on the presence of the argument z as follows: jobz = 'V', if z is present, jobz = 'N', if z is omitted. Note that there will be an error condition if ifail is present and z is omitted. range Restored based on the presence of arguments vl, vu, il, iu as follows: range = 'V', if one of or both vl and vu are present, range = 'I', if one of or both il and iu are present, range = 'A', if none of vl, vu, il, iu is present. Note that there will be an error condition if one of or both vl and vu are present and at the same time one of or both il and iu are present.
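The rules by which the Fortran 95 interface restores jobz and range from the presence of optional arguments can be written down as a small decision function. The sketch below only models the documented rules in C. The function names are hypothetical and the boolean flags stand in for Fortran's PRESENT() tests. It is not how the MKL wrappers are implemented.

```c
#include <stdbool.h>

/* Hypothetical model of the documented rule: jobz = 'V' if z is
   present, 'N' if z is omitted. */
static char f95_restore_jobz(bool has_z)
{
    return has_z ? 'V' : 'N';
}

/* Hypothetical model of the documented rule for range. Returns 'V',
   'I', or 'A', or '\0' for the error condition in which value bounds
   (vl, vu) and index bounds (il, iu) are both supplied. */
static char f95_restore_range(bool has_vl, bool has_vu,
                              bool has_il, bool has_iu)
{
    bool by_value = has_vl || has_vu;
    bool by_index = has_il || has_iu;

    if (by_value && by_index) {
        return '\0'; /* error: conflicting selection criteria */
    }
    if (by_value) {
        return 'V';
    }
    if (by_index) {
        return 'I';
    }
    return 'A';
}

/* The notes also flag ifail without z as an error condition. */
static bool f95_ifail_is_valid(bool has_ifail, bool has_z)
{
    return !(has_ifail && !has_z);
}
```

For instance, a call such as hegvx(a, b, w, z=z, il=1, iu=3) corresponds to jobz = 'V' and range = 'I' under these rules, while supplying vl together with il maps to the error condition.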
Application Notes An approximate eigenvalue is accepted as converged when it is determined to lie in an interval [a,b] of width less than or equal to abstol+ε*max(|a|,|b|), where ε is the machine precision. If abstol is less than or equal to zero, then ε*||T||_1 will be used in its place, where T is the tridiagonal matrix obtained by reducing C to tridiagonal form, where C is the symmetric matrix of the standard symmetric problem to which the generalized problem is transformed. Eigenvalues will be computed most accurately when abstol is set to twice the underflow threshold 2*?lamch('S'), not zero. If this routine returns with info > 0, indicating that some eigenvectors did not converge, try setting abstol to 2*?lamch('S'). For optimum performance use lwork ≥ (nb+1)*n, where nb is the blocksize for chetrd/zhetrd returned by ilaenv. If you are in doubt how much workspace to supply, use a generous value of lwork for the first run or set lwork = -1. If you choose the first option and set lwork to any admissible size, that is, one no less than the minimal value described, the routine completes the task, though probably not as fast as with the recommended workspace, and returns the recommended workspace size in the first element of the corresponding array work on exit. Use this value (work(1)) for subsequent runs. If you set lwork = -1, the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work). This operation is called a workspace query. Note that if you set lwork to less than the minimal required value and not -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.
?spgv Computes all eigenvalues and, optionally, eigenvectors of a real generalized symmetric definite eigenproblem with matrices in packed storage.
Syntax Fortran 77: call sspgv(itype, jobz, uplo, n, ap, bp, w, z, ldz, work, info) call dspgv(itype, jobz, uplo, n, ap, bp, w, z, ldz, work, info) Fortran 95: call spgv(ap, bp, w [,itype] [,uplo] [,z] [,info]) C: lapack_int LAPACKE_<?>spgv( int matrix_order, lapack_int itype, char jobz, char uplo, lapack_int n, <datatype>* ap, <datatype>* bp, <datatype>* w, <datatype>* z, lapack_int ldz ); Include Files • Fortran: mkl_lapack.fi and mkl_lapack.h • Fortran 95: lapack.f90 • C: mkl_lapacke.h Description The routine computes all the eigenvalues, and optionally, the eigenvectors of a real generalized symmetric-definite eigenproblem, of the form A*x = λ*B*x, A*B*x = λ*x, or B*A*x = λ*x. Here A and B are assumed to be symmetric, stored in packed format, and B is also positive definite. Input Parameters The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions. itype INTEGER. Must be 1 or 2 or 3. Specifies the problem type to be solved: if itype = 1, the problem type is A*x = lambda*B*x; if itype = 2, the problem type is A*B*x = lambda*x; if itype = 3, the problem type is B*A*x = lambda*x. jobz CHARACTER*1. Must be 'N' or 'V'. If jobz = 'N', then compute eigenvalues only. If jobz = 'V', then compute eigenvalues and eigenvectors. uplo CHARACTER*1. Must be 'U' or 'L'. If uplo = 'U', arrays ap and bp store the upper triangles of A and B; If uplo = 'L', arrays ap and bp store the lower triangles of A and B. n INTEGER. The order of the matrices A and B (n ≥ 0). ap, bp, work REAL for sspgv DOUBLE PRECISION for dspgv. Arrays: ap(*) contains the packed upper or lower triangle of the symmetric matrix A, as specified by uplo. The dimension of ap must be at least max(1, n*(n+1)/2).
bp(*) contains the packed upper or lower triangle of the symmetric matrix B, as specified by uplo. The dimension of bp must be at least max(1, n*(n+1)/2). work(*) is a workspace array, DIMENSION at least max(1, 3n). ldz INTEGER. The leading dimension of the output array z; ldz ≥ 1. If jobz = 'V', ldz ≥ max(1, n). Output Parameters ap On exit, the contents of ap are overwritten. bp On exit, contains the triangular factor U or L from the Cholesky factorization B = U^T*U or B = L*L^T, in the same storage format as B. w, z REAL for sspgv DOUBLE PRECISION for dspgv. Arrays: w(*), DIMENSION at least max(1, n). If info = 0, contains the eigenvalues in ascending order. z(ldz,*). The second dimension of z must be at least max(1, n). If jobz = 'V', then if info = 0, z contains the matrix Z of eigenvectors. The eigenvectors are normalized as follows: if itype = 1 or 2, Z^T*B*Z = I; if itype = 3, Z^T*inv(B)*Z = I; If jobz = 'N', then z is not referenced. info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th argument had an illegal value. If info > 0, spptrf/dpptrf and sspev/dspev returned an error code: If info = i ≤ n, sspev/dspev failed to converge, and i off-diagonal elements of an intermediate tridiagonal did not converge to zero; If info = n + i, for 1 ≤ i ≤ n, then the leading minor of order i of B is not positive-definite. The factorization of B could not be completed and no eigenvalues or eigenvectors were computed. Fortran 95 Interface Notes Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions. Specific details for the routine spgv interface are the following: ap Holds the array A of size (n*(n+1)/2). bp Holds the array B of size (n*(n+1)/2). w Holds the vector with the number of elements n. z Holds the matrix Z of size (n, n).
itype Must be 1, 2, or 3. The default value is 1. uplo Must be 'U' or 'L'. The default value is 'U'. jobz Restored based on the presence of the argument z as follows: jobz = 'V', if z is present, jobz = 'N', if z is omitted.
?hpgv Computes all eigenvalues and, optionally, eigenvectors of a complex generalized Hermitian definite eigenproblem with matrices in packed storage. Syntax Fortran 77: call chpgv(itype, jobz, uplo, n, ap, bp, w, z, ldz, work, rwork, info) call zhpgv(itype, jobz, uplo, n, ap, bp, w, z, ldz, work, rwork, info) Fortran 95: call hpgv(ap, bp, w [,itype] [,uplo] [,z] [,info]) C: lapack_int LAPACKE_chpgv( int matrix_order, lapack_int itype, char jobz, char uplo, lapack_int n, lapack_complex_float* ap, lapack_complex_float* bp, float* w, lapack_complex_float* z, lapack_int ldz ); lapack_int LAPACKE_zhpgv( int matrix_order, lapack_int itype, char jobz, char uplo, lapack_int n, lapack_complex_double* ap, lapack_complex_double* bp, double* w, lapack_complex_double* z, lapack_int ldz ); Include Files • Fortran: mkl_lapack.fi and mkl_lapack.h • Fortran 95: lapack.f90 • C: mkl_lapacke.h Description The routine computes all the eigenvalues, and optionally, the eigenvectors of a complex generalized Hermitian-definite eigenproblem, of the form A*x = λ*B*x, A*B*x = λ*x, or B*A*x = λ*x. Here A and B are assumed to be Hermitian, stored in packed format, and B is also positive definite. Input Parameters The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions. itype INTEGER. Must be 1 or 2 or 3. Specifies the problem type to be solved: if itype = 1, the problem type is A*x = lambda*B*x; if itype = 2, the problem type is A*B*x = lambda*x; if itype = 3, the problem type is B*A*x = lambda*x. jobz CHARACTER*1.
Must be 'N' or 'V'. If jobz = 'N', then compute eigenvalues only. If jobz = 'V', then compute eigenvalues and eigenvectors.
uplo CHARACTER*1. Must be 'U' or 'L'. If uplo = 'U', arrays ap and bp store the upper triangles of A and B; If uplo = 'L', arrays ap and bp store the lower triangles of A and B.
n INTEGER. The order of the matrices A and B (n ≥ 0).
ap, bp, work COMPLEX for chpgv DOUBLE COMPLEX for zhpgv. Arrays:
ap(*) contains the packed upper or lower triangle of the Hermitian matrix A, as specified by uplo. The dimension of ap must be at least max(1, n*(n+1)/2).
bp(*) contains the packed upper or lower triangle of the Hermitian matrix B, as specified by uplo. The dimension of bp must be at least max(1, n*(n+1)/2).
work(*) is a workspace array, DIMENSION at least max(1, 2n-1).
ldz INTEGER. The leading dimension of the output array z; ldz ≥ 1. If jobz = 'V', ldz ≥ max(1, n).
rwork REAL for chpgv DOUBLE PRECISION for zhpgv. Workspace array, DIMENSION at least max(1, 3n-2).

Output Parameters
ap On exit, the contents of ap are overwritten.
bp On exit, contains the triangular factor U or L from the Cholesky factorization B = U^H*U or B = L*L^H, in the same storage format as B.
w REAL for chpgv DOUBLE PRECISION for zhpgv. Array, DIMENSION at least max(1, n). If info = 0, contains the eigenvalues in ascending order.
z COMPLEX for chpgv DOUBLE COMPLEX for zhpgv. Array z(ldz,*). The second dimension of z must be at least max(1, n). If jobz = 'V', then if info = 0, z contains the matrix Z of eigenvectors. The eigenvectors are normalized as follows: if itype = 1 or 2, Z^H*B*Z = I; if itype = 3, Z^H*inv(B)*Z = I. If jobz = 'N', then z is not referenced.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th argument had an illegal value.
If info > 0, cpptrf/zpptrf and chpev/zhpev returned an error code: If info = i ≤ n, chpev/zhpev failed to converge, and i off-diagonal elements of an intermediate tridiagonal did not converge to zero; If info = n + i, for 1 ≤ i ≤ n, then the leading minor of order i of B is not positive-definite. The factorization of B could not be completed and no eigenvalues or eigenvectors were computed.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions. Specific details for the routine hpgv interface are the following:
ap Holds the array A of size (n*(n+1)/2).
bp Holds the array B of size (n*(n+1)/2).
w Holds the vector with the number of elements n.
z Holds the matrix Z of size (n, n).
itype Must be 1, 2, or 3. The default value is 1.
uplo Must be 'U' or 'L'. The default value is 'U'.
jobz Restored based on the presence of the argument z as follows: jobz = 'V', if z is present, jobz = 'N', if z is omitted.

?spgvd
Computes all eigenvalues and, optionally, eigenvectors of a real generalized symmetric definite eigenproblem with matrices in packed storage. If eigenvectors are desired, it uses a divide and conquer method.

Syntax
Fortran 77:
call sspgvd(itype, jobz, uplo, n, ap, bp, w, z, ldz, work, lwork, iwork, liwork, info)
call dspgvd(itype, jobz, uplo, n, ap, bp, w, z, ldz, work, lwork, iwork, liwork, info)
Fortran 95:
call spgvd(ap, bp, w [,itype] [,uplo] [,z] [,info])
C:
lapack_int LAPACKE_<?>spgvd( int matrix_order, lapack_int itype, char jobz, char uplo, lapack_int n, <datatype>* ap, <datatype>* bp, <datatype>* w, <datatype>* z, lapack_int ldz );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine computes all the eigenvalues, and optionally, the eigenvectors of a real generalized symmetric-definite eigenproblem, of the form A*x = λ*B*x, A*B*x = λ*x, or B*A*x = λ*x. Here A and B are assumed to be symmetric, stored in packed format, and B is also positive definite. If eigenvectors are desired, it uses a divide and conquer algorithm.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
itype INTEGER. Must be 1 or 2 or 3. Specifies the problem type to be solved: if itype = 1, the problem type is A*x = lambda*B*x; if itype = 2, the problem type is A*B*x = lambda*x; if itype = 3, the problem type is B*A*x = lambda*x.
jobz CHARACTER*1. Must be 'N' or 'V'. If jobz = 'N', then compute eigenvalues only. If jobz = 'V', then compute eigenvalues and eigenvectors.
uplo CHARACTER*1. Must be 'U' or 'L'. If uplo = 'U', arrays ap and bp store the upper triangles of A and B; If uplo = 'L', arrays ap and bp store the lower triangles of A and B.
n INTEGER. The order of the matrices A and B (n ≥ 0).
ap, bp, work REAL for sspgvd DOUBLE PRECISION for dspgvd. Arrays:
ap(*) contains the packed upper or lower triangle of the symmetric matrix A, as specified by uplo.
The dimension of ap must be at least max(1, n*(n+1)/2).
bp(*) contains the packed upper or lower triangle of the symmetric matrix B, as specified by uplo. The dimension of bp must be at least max(1, n*(n+1)/2).
work is a workspace array, its dimension max(1, lwork).
ldz INTEGER. The leading dimension of the output array z; ldz ≥ 1. If jobz = 'V', ldz ≥ max(1, n).
lwork INTEGER. The dimension of the array work. Constraints: If n ≤ 1, lwork ≥ 1; If jobz = 'N' and n>1, lwork ≥ 2n; If jobz = 'V' and n>1, lwork ≥ 2n^2+6n+1. If lwork = -1, then a workspace query is assumed; the routine only calculates the required sizes of the work and iwork arrays, returns these values as the first entries of the work and iwork arrays, and no error message related to lwork or liwork is issued by xerbla. See Application Notes for details.
iwork INTEGER. Workspace array, dimension max(1, liwork).
liwork INTEGER. The dimension of the array iwork. Constraints: If n ≤ 1, liwork ≥ 1; If jobz = 'N' and n>1, liwork ≥ 1; If jobz = 'V' and n>1, liwork ≥ 5n+3. If liwork = -1, then a workspace query is assumed; the routine only calculates the required sizes of the work and iwork arrays, returns these values as the first entries of the work and iwork arrays, and no error message related to lwork or liwork is issued by xerbla. See Application Notes for details.

Output Parameters
ap On exit, the contents of ap are overwritten.
bp On exit, contains the triangular factor U or L from the Cholesky factorization B = U^T*U or B = L*L^T, in the same storage format as B.
w, z REAL for sspgvd DOUBLE PRECISION for dspgvd. Arrays:
w(*), DIMENSION at least max(1, n). If info = 0, contains the eigenvalues in ascending order.
z(ldz,*). The second dimension of z must be at least max(1, n). If jobz = 'V', then if info = 0, z contains the matrix Z of eigenvectors.
The eigenvectors are normalized as follows: if itype = 1 or 2, Z^T*B*Z = I; if itype = 3, Z^T*inv(B)*Z = I. If jobz = 'N', then z is not referenced.
work(1) On exit, if info = 0, then work(1) returns the required minimal size of lwork.
iwork(1) On exit, if info = 0, then iwork(1) returns the required minimal size of liwork.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th argument had an illegal value. If info > 0, spptrf/dpptrf and sspevd/dspevd returned an error code: If info = i ≤ n, sspevd/dspevd failed to converge, and i off-diagonal elements of an intermediate tridiagonal did not converge to zero; If info = n + i, for 1 ≤ i ≤ n, then the leading minor of order i of B is not positive-definite. The factorization of B could not be completed and no eigenvalues or eigenvectors were computed.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions. Specific details for the routine spgvd interface are the following:
ap Holds the array A of size (n*(n+1)/2).
bp Holds the array B of size (n*(n+1)/2).
w Holds the vector with the number of elements n.
z Holds the matrix Z of size (n, n).
itype Must be 1, 2, or 3. The default value is 1.
uplo Must be 'U' or 'L'. The default value is 'U'.
jobz Restored based on the presence of the argument z as follows: jobz = 'V', if z is present, jobz = 'N', if z is omitted.

Application Notes
If it is not clear how much workspace to supply, use a generous value of lwork (or liwork) for the first run, or set lwork = -1 (liwork = -1).
If lwork (or liwork) has any admissible size, that is, a size no less than the minimal value described, then the routine completes the task, though probably not as fast as with the recommended workspace, and returns the recommended workspace size in the first element of the corresponding array (work, iwork) on exit. Use this value (work(1), iwork(1)) for subsequent runs.
If lwork = -1 (liwork = -1), then the routine returns immediately and provides the recommended workspace size in the first element of the corresponding array (work, iwork). This operation is called a workspace query.
Note that if lwork (liwork) is less than the minimal required value and is not equal to -1, then the routine returns immediately with an error exit and does not provide any information on the recommended workspace.

?hpgvd
Computes all eigenvalues and, optionally, eigenvectors of a complex generalized Hermitian definite eigenproblem with matrices in packed storage. If eigenvectors are desired, it uses a divide and conquer method.

Syntax
Fortran 77:
call chpgvd(itype, jobz, uplo, n, ap, bp, w, z, ldz, work, lwork, rwork, lrwork, iwork, liwork, info)
call zhpgvd(itype, jobz, uplo, n, ap, bp, w, z, ldz, work, lwork, rwork, lrwork, iwork, liwork, info)
Fortran 95:
call hpgvd(ap, bp, w [,itype] [,uplo] [,z] [,info])
C:
lapack_int LAPACKE_chpgvd( int matrix_order, lapack_int itype, char jobz, char uplo, lapack_int n, lapack_complex_float* ap, lapack_complex_float* bp, float* w, lapack_complex_float* z, lapack_int ldz );
lapack_int LAPACKE_zhpgvd( int matrix_order, lapack_int itype, char jobz, char uplo, lapack_int n, lapack_complex_double* ap, lapack_complex_double* bp, double* w, lapack_complex_double* z, lapack_int ldz );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine computes all the eigenvalues, and optionally, the eigenvectors of a complex generalized Hermitian-definite eigenproblem, of the form A*x = λ*B*x, A*B*x = λ*x, or B*A*x = λ*x. Here A and B are assumed to be Hermitian, stored in packed format, and B is also positive definite. If eigenvectors are desired, it uses a divide and conquer algorithm.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
itype INTEGER. Must be 1 or 2 or 3. Specifies the problem type to be solved: if itype = 1, the problem type is A*x = lambda*B*x; if itype = 2, the problem type is A*B*x = lambda*x; if itype = 3, the problem type is B*A*x = lambda*x.
jobz CHARACTER*1. Must be 'N' or 'V'. If jobz = 'N', then compute eigenvalues only. If jobz = 'V', then compute eigenvalues and eigenvectors.
uplo CHARACTER*1. Must be 'U' or 'L'.
If uplo = 'U', arrays ap and bp store the upper triangles of A and B; If uplo = 'L', arrays ap and bp store the lower triangles of A and B.
n INTEGER. The order of the matrices A and B (n ≥ 0).
ap, bp, work COMPLEX for chpgvd DOUBLE COMPLEX for zhpgvd. Arrays:
ap(*) contains the packed upper or lower triangle of the Hermitian matrix A, as specified by uplo. The dimension of ap must be at least max(1, n*(n+1)/2).
bp(*) contains the packed upper or lower triangle of the Hermitian matrix B, as specified by uplo. The dimension of bp must be at least max(1, n*(n+1)/2).
work is a workspace array, its dimension max(1, lwork).
ldz INTEGER. The leading dimension of the output array z; ldz ≥ 1. If jobz = 'V', ldz ≥ max(1, n).
lwork INTEGER. The dimension of the array work. Constraints: If n ≤ 1, lwork ≥ 1; If jobz = 'N' and n>1, lwork ≥ n; If jobz = 'V' and n>1, lwork ≥ 2n. If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work, rwork and iwork arrays, returns these values as the first entries of the work, rwork and iwork arrays, and no error message related to lwork or lrwork or liwork is issued by xerbla. See Application Notes for details.
rwork REAL for chpgvd DOUBLE PRECISION for zhpgvd. Workspace array, its dimension max(1, lrwork).
lrwork INTEGER. The dimension of the array rwork. Constraints: If n ≤ 1, lrwork ≥ 1; If jobz = 'N' and n>1, lrwork ≥ n; If jobz = 'V' and n>1, lrwork ≥ 2n^2+5n+1. If lrwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work, rwork and iwork arrays, returns these values as the first entries of the work, rwork and iwork arrays, and no error message related to lwork or lrwork or liwork is issued by xerbla. See Application Notes for details.
iwork INTEGER. Workspace array, its dimension max(1, liwork).
liwork INTEGER. The dimension of the array iwork.
Constraints: If n ≤ 1, liwork ≥ 1; If jobz = 'N' and n>1, liwork ≥ 1; If jobz = 'V' and n>1, liwork ≥ 5n+3. If liwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work, rwork and iwork arrays, returns these values as the first entries of the work, rwork and iwork arrays, and no error message related to lwork or lrwork or liwork is issued by xerbla. See Application Notes for details.

Output Parameters
ap On exit, the contents of ap are overwritten.
bp On exit, contains the triangular factor U or L from the Cholesky factorization B = U^H*U or B = L*L^H, in the same storage format as B.
w REAL for chpgvd DOUBLE PRECISION for zhpgvd. Array, DIMENSION at least max(1, n). If info = 0, contains the eigenvalues in ascending order.
z COMPLEX for chpgvd DOUBLE COMPLEX for zhpgvd. Array z(ldz,*). The second dimension of z must be at least max(1, n). If jobz = 'V', then if info = 0, z contains the matrix Z of eigenvectors. The eigenvectors are normalized as follows: if itype = 1 or 2, Z^H*B*Z = I; if itype = 3, Z^H*inv(B)*Z = I. If jobz = 'N', then z is not referenced.
work(1) On exit, if info = 0, then work(1) returns the required minimal size of lwork.
rwork(1) On exit, if info = 0, then rwork(1) returns the required minimal size of lrwork.
iwork(1) On exit, if info = 0, then iwork(1) returns the required minimal size of liwork.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th argument had an illegal value. If info > 0, cpptrf/zpptrf and chpevd/zhpevd returned an error code: If info = i ≤ n, chpevd/zhpevd failed to converge, and i off-diagonal elements of an intermediate tridiagonal did not converge to zero; If info = n + i, for 1 ≤ i ≤ n, then the leading minor of order i of B is not positive-definite. The factorization of B could not be completed and no eigenvalues or eigenvectors were computed.
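
The workspace constraints and info codes described above for ?hpgvd can be sketched in C as follows. This is only an illustration under stated assumptions: the helper names (hpgvd_min_workspace, hpgvd_classify_info) and types are hypothetical and not part of Intel MKL, and the constraints on lwork, lrwork, and liwork are read here as lower bounds.

```c
#include <assert.h>

/* Hypothetical helper type (not part of MKL): minimal workspace lengths. */
typedef struct {
    long lwork;   /* length of the complex array work */
    long lrwork;  /* length of the real array rwork   */
    long liwork;  /* length of the integer array iwork */
} hpgvd_min_work;

/* Minimal sizes from the Input Parameters section, read as lower bounds. */
hpgvd_min_work hpgvd_min_workspace(char jobz, long n)
{
    hpgvd_min_work s = {1, 1, 1};          /* case n <= 1 */
    assert(n >= 0);
    assert(jobz == 'N' || jobz == 'V');
    if (n > 1) {
        if (jobz == 'V') {
            s.lwork  = 2 * n;
            s.lrwork = 2 * n * n + 5 * n + 1;
            s.liwork = 5 * n + 3;
        } else {
            s.lwork  = n;
            s.lrwork = n;
            s.liwork = 1;
        }
    }
    return s;
}

/* Hypothetical classification of the info value returned by ?hpgvd. */
typedef enum {
    HPGVD_OK,              /* info = 0                              */
    HPGVD_BAD_ARGUMENT,    /* info = -i: i-th argument is illegal   */
    HPGVD_NO_CONVERGENCE,  /* info = i <= n: eigensolver failed     */
    HPGVD_B_NOT_POSDEF,    /* info = n + i: minor of order i of B   */
    HPGVD_UNKNOWN          /* value outside the documented ranges   */
} hpgvd_status;

/* Stores the index i from the description in *detail (0 on success). */
hpgvd_status hpgvd_classify_info(long info, long n, long *detail)
{
    assert(detail != 0);
    if (info == 0)     { *detail = 0;        return HPGVD_OK; }
    if (info < 0)      { *detail = -info;    return HPGVD_BAD_ARGUMENT; }
    if (info <= n)     { *detail = info;     return HPGVD_NO_CONVERGENCE; }
    if (info <= 2 * n) { *detail = info - n; return HPGVD_B_NOT_POSDEF; }
    *detail = info;
    return HPGVD_UNKNOWN;
}
```

For example, with n = 4 and jobz = 'V' the sketch gives lwork = 8, lrwork = 53, and liwork = 23, and info = 6 is classified as "leading minor of order 2 of B is not positive-definite". A workspace query (lwork = -1) remains the way to obtain the recommended sizes from the routine itself.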

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions. Specific details for the routine hpgvd interface are the following:
ap Holds the array A of size (n*(n+1)/2).
bp Holds the array B of size (n*(n+1)/2).
w Holds the vector with the number of elements n.
z Holds the matrix Z of size (n, n).
itype Must be 1, 2, or 3. The default value is 1.
uplo Must be 'U' or 'L'. The default value is 'U'.
jobz Restored based on the presence of the argument z as follows: jobz = 'V', if z is present, jobz = 'N', if z is omitted.

Application Notes
If you are in doubt how much workspace to supply, use a generous value of lwork (liwork or lrwork) for the first run or set lwork = -1 (liwork = -1, lrwork = -1).
If you choose the first option and set lwork (liwork or lrwork) to any admissible size, that is, a size no less than the minimal value described, the routine completes the task, though probably not as fast as with the recommended workspace, and returns the recommended workspace size in the first element of the corresponding array (work, iwork, rwork) on exit. Use this value (work(1), iwork(1), rwork(1)) for subsequent runs.
If you set lwork = -1 (liwork = -1, lrwork = -1), the routine returns immediately and provides the recommended workspace size in the first element of the corresponding array (work, iwork, rwork). This operation is called a workspace query.
Note that if you set lwork (liwork, lrwork) to less than the minimal required value and not -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.

?spgvx
Computes selected eigenvalues and, optionally, eigenvectors of a real generalized symmetric definite eigenproblem with matrices in packed storage.

Syntax
Fortran 77:
call sspgvx(itype, jobz, range, uplo, n, ap, bp, vl, vu, il, iu, abstol, m, w, z, ldz, work, iwork, ifail, info)
call dspgvx(itype, jobz, range, uplo, n, ap, bp, vl, vu, il, iu, abstol, m, w, z, ldz, work, iwork, ifail, info)
Fortran 95:
call spgvx(ap, bp, w [,itype] [,uplo] [,z] [,vl] [,vu] [,il] [,iu] [,m] [,ifail] [,abstol] [,info])
C:
lapack_int LAPACKE_<?>spgvx( int matrix_order, lapack_int itype, char jobz, char range, char uplo, lapack_int n, <datatype>* ap, <datatype>* bp, <datatype> vl, <datatype> vu, lapack_int il, lapack_int iu, <datatype> abstol, lapack_int* m, <datatype>* w, <datatype>* z, lapack_int ldz, lapack_int* ifail );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine computes selected eigenvalues, and optionally, the eigenvectors of a real generalized symmetric-definite eigenproblem, of the form A*x = λ*B*x, A*B*x = λ*x, or B*A*x = λ*x. Here A and B are assumed to be symmetric, stored in packed format, and B is also positive definite. Eigenvalues and eigenvectors can be selected by specifying either a range of values or a range of indices for the desired eigenvalues.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
itype INTEGER. Must be 1 or 2 or 3. Specifies the problem type to be solved: if itype = 1, the problem type is A*x = lambda*B*x; if itype = 2, the problem type is A*B*x = lambda*x; if itype = 3, the problem type is B*A*x = lambda*x.
jobz CHARACTER*1. Must be 'N' or 'V'. If jobz = 'N', then compute eigenvalues only. If jobz = 'V', then compute eigenvalues and eigenvectors.
range CHARACTER*1. Must be 'A' or 'V' or 'I'. If range = 'A', the routine computes all eigenvalues.
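If range = 'V', the routine computes eigenvalues lambda(i) in the half-open interval: vl < lambda(i) ≤ vu. If range = 'I', the routine computes eigenvalues with indices il to iu.
uplo CHARACTER*1. Must be 'U' or 'L'. If uplo = 'U', arrays ap and bp store the upper triangles of A and B; If uplo = 'L', arrays ap and bp store the lower triangles of A and B.
n INTEGER. The order of the matrices A and B (n ≥ 0).
ap, bp, work REAL for sspgvx DOUBLE PRECISION for dspgvx. Arrays:
ap(*) contains the packed upper or lower triangle of the symmetric matrix A, as specified by uplo. The dimension of ap must be at least max(1, n*(n+1)/2).
bp(*) contains the packed upper or lower triangle of the symmetric matrix B, as specified by uplo. The dimension of bp must be at least max(1, n*(n+1)/2).
work(*) is a workspace array, DIMENSION at least max(1, 8n).
vl, vu REAL for sspgvx DOUBLE PRECISION for dspgvx. If range = 'V', the lower and upper bounds of the interval to be searched for eigenvalues. Constraint: vl < vu. If range = 'A' or 'I', vl and vu are not referenced.
il, iu INTEGER. If range = 'I', the indices in ascending order of the smallest and largest eigenvalues to be returned. Constraint: 1 ≤ il ≤ iu ≤ n, if n > 0; il=1 and iu=0 if n = 0. If range = 'A' or 'V', il and iu are not referenced.
abstol REAL for sspgvx DOUBLE PRECISION for dspgvx. The absolute error tolerance for the eigenvalues. See Application Notes for more information.
ldz INTEGER. The leading dimension of the output array z. Constraints: ldz ≥ 1; if jobz = 'V', ldz ≥ max(1, n).
iwork INTEGER. Workspace array, DIMENSION at least max(1, 5n).

Output Parameters
ap On exit, the contents of ap are overwritten.
bp On exit, contains the triangular factor U or L from the Cholesky factorization B = U^T*U or B = L*L^T, in the same storage format as B.
m INTEGER. The total number of eigenvalues found, 0 ≤ m ≤ n. If range = 'A', m = n, and if range = 'I', m = iu-il+1.
w, z REAL for sspgvx DOUBLE PRECISION for dspgvx. Arrays:
w(*), DIMENSION at least max(1, n). If info = 0, contains the eigenvalues in ascending order.
z(ldz,*). The second dimension of z must be at least max(1, n). If jobz = 'V', then if info = 0, the first m columns of z contain the orthonormal eigenvectors of the matrix A corresponding to the selected eigenvalues, with the i-th column of z holding the eigenvector associated with w(i). The eigenvectors are normalized as follows: if itype = 1 or 2, Z^T*B*Z = I; if itype = 3, Z^T*inv(B)*Z = I. If jobz = 'N', then z is not referenced. If an eigenvector fails to converge, then that column of z contains the latest approximation to the eigenvector, and the index of the eigenvector is returned in ifail. Note: you must ensure that at least max(1,m) columns are supplied in the array z; if range = 'V', the exact value of m is not known in advance and an upper bound must be used.
ifail INTEGER. Array, DIMENSION at least max(1, n).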
If jobz = 'V', then if info = 0, the first m elements of ifail are zero; if info > 0, ifail contains the indices of the eigenvectors that failed to converge. If jobz = 'N', then ifail is not referenced.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th argument had an illegal value. If info > 0, spptrf/dpptrf and sspevx/dspevx returned an error code: If info = i ≤ n, sspevx/dspevx failed to converge, and i eigenvectors failed to converge. Their indices are stored in the array ifail; If info = n + i, for 1 ≤ i ≤ n, then the leading minor of order i of B is not positive-definite. The factorization of B could not be completed and no eigenvalues or eigenvectors were computed.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions. Specific details for the routine spgvx interface are the following:
ap Holds the array A of size (n*(n+1)/2).
bp Holds the array B of size (n*(n+1)/2).
w Holds the vector with the number of elements n.
z Holds the matrix Z of size (n, n), where the values n and m are significant.
ifail Holds the vector with the number of elements n.
itype Must be 1, 2, or 3. The default value is 1.
uplo Must be 'U' or 'L'. The default value is 'U'.
vl Default value for this element is vl = -HUGE(vl).
vu Default value for this element is vu = HUGE(vl).
il Default value for this argument is il = 1.
iu Default value for this argument is iu = n.
abstol Default value for this element is abstol = 0.0_WP.
jobz Restored based on the presence of the argument z as follows: jobz = 'V', if z is present, jobz = 'N', if z is omitted. Note that there will be an error condition if ifail is present and z is omitted.
range Restored based on the presence of arguments vl, vu, il, iu as follows: range = 'V', if one of or both vl and vu are present, range = 'I', if one of or both il and iu are present, range = 'A', if none of vl, vu, il, iu is present. Note that there will be an error condition if one of or both vl and vu are present and at the same time one of or both il and iu are present.

Application Notes
An approximate eigenvalue is accepted as converged when it is determined to lie in an interval [a,b] of width less than or equal to abstol+ε*max(|a|,|b|), where ε is the machine precision.
If abstol is less than or equal to zero, then ε*||T||_1 is used instead, where T is the tridiagonal matrix obtained by reducing A to tridiagonal form. Eigenvalues are computed most accurately when abstol is set to twice the underflow threshold 2*?lamch('S'), not zero.
If this routine returns with info > 0, indicating that some eigenvectors did not converge, set abstol to 2*?lamch('S').

?hpgvx
Computes selected eigenvalues and, optionally, eigenvectors of a generalized Hermitian definite eigenproblem with matrices in packed storage.

Syntax
Fortran 77:
call chpgvx(itype, jobz, range, uplo, n, ap, bp, vl, vu, il, iu, abstol, m, w, z, ldz, work, rwork, iwork, ifail, info)
call zhpgvx(itype, jobz, range, uplo, n, ap, bp, vl, vu, il, iu, abstol, m, w, z, ldz, work, rwork, iwork, ifail, info)
Fortran 95:
call hpgvx(ap, bp, w [,itype] [,uplo] [,z] [,vl] [,vu] [,il] [,iu] [,m] [,ifail] [,abstol] [,info])
C:
lapack_int LAPACKE_chpgvx( int matrix_order, lapack_int itype, char jobz, char range, char uplo, lapack_int n, lapack_complex_float* ap, lapack_complex_float* bp, float vl, float vu, lapack_int il, lapack_int iu, float abstol, lapack_int* m, float* w, lapack_complex_float* z, lapack_int ldz, lapack_int* ifail );
lapack_int LAPACKE_zhpgvx( int matrix_order, lapack_int itype, char jobz, char range, char uplo, lapack_int n, lapack_complex_double* ap, lapack_complex_double* bp, double vl, double vu, lapack_int il, lapack_int iu, double abstol, lapack_int* m, double* w, lapack_complex_double* z, lapack_int ldz, lapack_int* ifail );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine computes selected eigenvalues, and optionally, the eigenvectors of a complex generalized Hermitian-definite eigenproblem, of the form A*x = λ*B*x, A*B*x = λ*x, or B*A*x = λ*x. Here A and B are assumed to be Hermitian, stored in packed format, and B is also positive definite. Eigenvalues and eigenvectors can be selected by specifying either a range of values or a range of indices for the desired eigenvalues.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
itype INTEGER. Must be 1 or 2 or 3.
Specifies the problem type to be solved: if itype = 1, the problem type is A*x = lambda*B*x; if itype = 2, the problem type is A*B*x = lambda*x; if itype = 3, the problem type is B*A*x = lambda*x.
jobz CHARACTER*1. Must be 'N' or 'V'. If jobz = 'N', then compute eigenvalues only. If jobz = 'V', then compute eigenvalues and eigenvectors.
range CHARACTER*1. Must be 'A' or 'V' or 'I'. If range = 'A', the routine computes all eigenvalues. If range = 'V', the routine computes eigenvalues lambda(i) in the half-open interval: vl < lambda(i) ≤ vu. If range = 'I', the routine computes eigenvalues with indices il to iu.
uplo CHARACTER*1. Must be 'U' or 'L'. If uplo = 'U', arrays ap and bp store the upper triangles of A and B; If uplo = 'L', arrays ap and bp store the lower triangles of A and B.
n INTEGER. The order of the matrices A and B (n ≥ 0).
ap, bp, work COMPLEX for chpgvx DOUBLE COMPLEX for zhpgvx. Arrays:
ap(*) contains the packed upper or lower triangle of the Hermitian matrix A, as specified by uplo. The dimension of ap must be at least max(1, n*(n+1)/2).
bp(*) contains the packed upper or lower triangle of the Hermitian matrix B, as specified by uplo. The dimension of bp must be at least max(1, n*(n+1)/2).
work(*) is a workspace array, DIMENSION at least max(1, 2n).
vl, vu REAL for chpgvx DOUBLE PRECISION for zhpgvx. If range = 'V', the lower and upper bounds of the interval to be searched for eigenvalues. Constraint: vl < vu. If range = 'A' or 'I', vl and vu are not referenced.
il, iu INTEGER. If range = 'I', the indices in ascending order of the smallest and largest eigenvalues to be returned. Constraint: 1 ≤ il ≤ iu ≤ n, if n > 0; il=1 and iu=0 if n = 0. If range = 'A' or 'V', il and iu are not referenced.
abstol REAL for chpgvx DOUBLE PRECISION for zhpgvx. The absolute error tolerance for the eigenvalues. See Application Notes for more information.
ldz INTEGER.
The leading dimension of the output array z; ldz ≥ 1. If jobz = 'V', ldz ≥ max(1, n).
rwork REAL for chpgvx DOUBLE PRECISION for zhpgvx. Workspace array, DIMENSION at least max(1, 7n).
iwork INTEGER. Workspace array, DIMENSION at least max(1, 5n).

Output Parameters
ap On exit, the contents of ap are overwritten.
bp On exit, contains the triangular factor U or L from the Cholesky factorization B = U^H*U or B = L*L^H, in the same storage format as B.
m INTEGER. The total number of eigenvalues found, 0 ≤ m ≤ n. If range = 'A', m = n, and if range = 'I', m = iu-il+1.
w REAL for chpgvx DOUBLE PRECISION for zhpgvx. Array, DIMENSION at least max(1, n). If info = 0, contains the eigenvalues in ascending order.
z COMPLEX for chpgvx DOUBLE COMPLEX for zhpgvx. Array z(ldz,*). The second dimension of z must be at least max(1, n). If jobz = 'V', then if info = 0, the first m columns of z contain the orthonormal eigenvectors of the matrix A corresponding to the selected eigenvalues, with the i-th column of z holding the eigenvector associated with w(i). The eigenvectors are normalized as follows: if itype = 1 or 2, Z^H*B*Z = I; if itype = 3, Z^H*inv(B)*Z = I. If jobz = 'N', then z is not referenced. If an eigenvector fails to converge, then that column of z contains the latest approximation to the eigenvector, and the index of the eigenvector is returned in ifail. Note: you must ensure that at least max(1,m) columns are supplied in the array z; if range = 'V', the exact value of m is not known in advance and an upper bound must be used.
ifail INTEGER. Array, DIMENSION at least max(1, n). If jobz = 'V', then if info = 0, the first m elements of ifail are zero; if info > 0, ifail contains the indices of the eigenvectors that failed to converge. If jobz = 'N', then ifail is not referenced.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th argument had an illegal value.
If info > 0, cpptrf/zpptrf and chpevx/zhpevx returned an error code: If info = i ≤ n, chpevx/zhpevx failed to converge, and i eigenvectors failed to converge. Their indices are stored in the array ifail; If info = n + i, for 1 ≤ i ≤ n, then the leading minor of order i of B is not positive-definite. The factorization of B could not be completed and no eigenvalues or eigenvectors were computed.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions. Specific details for the routine hpgvx interface are the following:
ap Holds the array A of size (n*(n+1)/2).
bp Holds the array B of size (n*(n+1)/2).
w Holds the vector with the number of elements n.
z Holds the matrix Z of size (n, n), where the values n and m are significant.
ifail Holds the vector with the number of elements n.
itype Must be 1, 2, or 3. The default value is 1.
uplo Must be 'U' or 'L'. The default value is 'U'.
vl Default value for this element is vl = -HUGE(vl).
vu Default value for this element is vu = HUGE(vl).
il Default value for this argument is il = 1.
iu Default value for this argument is iu = n.
abstol Default value for this element is abstol = 0.0_WP.
jobz Restored based on the presence of the argument z as follows: jobz = 'V', if z is present, jobz = 'N', if z is omitted. Note that there will be an error condition if ifail is present and z is omitted.
range Restored based on the presence of arguments vl, vu, il, iu as follows: range = 'V', if one of or both vl and vu are present, range = 'I', if one of or both il and iu are present, range = 'A', if none of vl, vu, il, iu is present. Note that there will be an error condition if one of or both vl and vu are present and at the same time one of or both il and iu are present.
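
The rules above for restoring jobz and range from the optional arguments can be sketched in C as follows. This is a minimal illustration of the stated rules only, not the actual wrapper code: the structure hpgvx_present and both function names are hypothetical, and returning 0 to signal the error condition is an assumption made for the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical record of which optional arguments the caller supplied. */
typedef struct {
    bool z, ifail;        /* eigenvector output and failure indices */
    bool vl, vu, il, iu;  /* value bounds and index bounds          */
} hpgvx_present;

/* jobz = 'V' if z is present, 'N' if z is omitted.
   Returns 0 for the error condition: ifail present while z is omitted. */
char hpgvx_restore_jobz(const hpgvx_present *p)
{
    assert(p != NULL);
    if (p->ifail && !p->z)
        return 0;
    return p->z ? 'V' : 'N';
}

/* range = 'V' if vl and/or vu is present, 'I' if il and/or iu is present,
   'A' if none is present. Returns 0 for the error condition: value bounds
   and index bounds supplied at the same time. */
char hpgvx_restore_range(const hpgvx_present *p)
{
    assert(p != NULL);
    bool by_value = p->vl || p->vu;
    bool by_index = p->il || p->iu;
    if (by_value && by_index)
        return 0;
    if (by_value)
        return 'V';
    if (by_index)
        return 'I';
    return 'A';
}
```

For instance, a call that supplies only z and il corresponds to jobz = 'V' and range = 'I', with iu taking its default value n.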
Application Notes
An approximate eigenvalue is accepted as converged when it is determined to lie in an interval [a,b] of width less than or equal to abstol+ε*max(|a|,|b|), where ε is the machine precision. If abstol is less than or equal to zero, then ε*||T||_1 is used as tolerance, where T is the tridiagonal matrix obtained by reducing A to tridiagonal form. Eigenvalues will be computed most accurately when abstol is set to twice the underflow threshold 2*?lamch('S'), not zero. If this routine returns with info > 0, indicating that some eigenvectors did not converge, try setting abstol to 2*?lamch('S').
?sbgv
Computes all eigenvalues and, optionally, eigenvectors of a real generalized symmetric definite eigenproblem with banded matrices.
Syntax
Fortran 77:
call ssbgv(jobz, uplo, n, ka, kb, ab, ldab, bb, ldbb, w, z, ldz, work, info)
call dsbgv(jobz, uplo, n, ka, kb, ab, ldab, bb, ldbb, w, z, ldz, work, info)
Fortran 95:
call sbgv(ab, bb, w [,uplo] [,z] [,info])
C:
lapack_int LAPACKE_<?>sbgv( int matrix_order, char jobz, char uplo, lapack_int n, lapack_int ka, lapack_int kb, <datatype>* ab, lapack_int ldab, <datatype>* bb, lapack_int ldbb, <datatype>* w, <datatype>* z, lapack_int ldz );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine computes all the eigenvalues, and optionally, the eigenvectors of a real generalized symmetric-definite banded eigenproblem, of the form A*x = λ*B*x. Here A and B are assumed to be symmetric and banded, and B is also positive definite.
Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
jobz CHARACTER*1. Must be 'N' or 'V'. If jobz = 'N', then compute eigenvalues only. If jobz = 'V', then compute eigenvalues and eigenvectors.
uplo CHARACTER*1. Must be 'U' or 'L'. If uplo = 'U', arrays ab and bb store the upper triangles of A and B; if uplo = 'L', arrays ab and bb store the lower triangles of A and B.
n INTEGER. The order of the matrices A and B (n ≥ 0).
ka INTEGER. The number of super- or sub-diagonals in A (ka ≥ 0).
kb INTEGER. The number of super- or sub-diagonals in B (kb ≥ 0).
ab, bb, work REAL for ssbgv; DOUBLE PRECISION for dsbgv. Arrays: ab(ldab,*) is an array containing either upper or lower triangular part of the symmetric matrix A (as specified by uplo) in band storage format. The second dimension of the array ab must be at least max(1, n). bb(ldbb,*) is an array containing either upper or lower triangular part of the symmetric matrix B (as specified by uplo) in band storage format. The second dimension of the array bb must be at least max(1, n). work(*) is a workspace array, dimension at least max(1, 3n).
ldab INTEGER. The leading dimension of the array ab; must be at least ka+1.
ldbb INTEGER. The leading dimension of the array bb; must be at least kb+1.
ldz INTEGER. The leading dimension of the output array z; ldz ≥ 1. If jobz = 'V', ldz ≥ max(1, n).
Output Parameters
ab On exit, the contents of ab are overwritten.
bb On exit, contains the factor S from the split Cholesky factorization B = S^T*S, as returned by ?pbstf.
w, z REAL for ssbgv; DOUBLE PRECISION for dsbgv. Arrays: w(*), DIMENSION at least max(1, n). If info = 0, contains the eigenvalues in ascending order. z(ldz,*). The second dimension of z must be at least max(1, n). If jobz = 'V', then if info = 0, z contains the matrix Z of eigenvectors, with the i-th column of z holding the eigenvector associated with w(i). The eigenvectors are normalized so that Z^T*B*Z = I. If jobz = 'N', then z is not referenced.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th argument had an illegal value.
If info = i > 0 and i ≤ n, the algorithm failed to converge, and i off-diagonal elements of an intermediate tridiagonal did not converge to zero; if info = n + i, for 1 ≤ i ≤ n, then ?pbstf returned info = i and B is not positive-definite. The factorization of B could not be completed and no eigenvalues or eigenvectors were computed.
Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine sbgv interface are the following:
ab Holds the array A of size (ka+1,n).
bb Holds the array B of size (kb+1,n).
w Holds the vector with the number of elements n.
z Holds the matrix Z of size (n, n).
uplo Must be 'U' or 'L'. The default value is 'U'.
jobz Restored based on the presence of the argument z as follows: jobz = 'V', if z is present; jobz = 'N', if z is omitted.
?hbgv
Computes all eigenvalues and, optionally, eigenvectors of a complex generalized Hermitian definite eigenproblem with banded matrices.
Syntax
Fortran 77:
call chbgv(jobz, uplo, n, ka, kb, ab, ldab, bb, ldbb, w, z, ldz, work, rwork, info)
call zhbgv(jobz, uplo, n, ka, kb, ab, ldab, bb, ldbb, w, z, ldz, work, rwork, info)
Fortran 95:
call hbgv(ab, bb, w [,uplo] [,z] [,info])
C:
lapack_int LAPACKE_chbgv( int matrix_order, char jobz, char uplo, lapack_int n, lapack_int ka, lapack_int kb, lapack_complex_float* ab, lapack_int ldab, lapack_complex_float* bb, lapack_int ldbb, float* w, lapack_complex_float* z, lapack_int ldz );
lapack_int LAPACKE_zhbgv( int matrix_order, char jobz, char uplo, lapack_int n, lapack_int ka, lapack_int kb, lapack_complex_double* ab, lapack_int ldab, lapack_complex_double* bb, lapack_int ldbb, double* w, lapack_complex_double* z, lapack_int ldz );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine computes all the eigenvalues, and optionally, the eigenvectors of a complex generalized Hermitian-definite banded eigenproblem, of the form A*x = λ*B*x. Here A and B are Hermitian and banded matrices, and matrix B is also positive definite.
Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
jobz CHARACTER*1. Must be 'N' or 'V'. If jobz = 'N', then compute eigenvalues only. If jobz = 'V', then compute eigenvalues and eigenvectors.
uplo CHARACTER*1. Must be 'U' or 'L'. If uplo = 'U', arrays ab and bb store the upper triangles of A and B; if uplo = 'L', arrays ab and bb store the lower triangles of A and B.
n INTEGER. The order of the matrices A and B (n ≥ 0).
ka INTEGER. The number of super- or sub-diagonals in A (ka ≥ 0).
kb INTEGER. The number of super- or sub-diagonals in B (kb ≥ 0).
ab, bb, work COMPLEX for chbgv; DOUBLE COMPLEX for zhbgv. Arrays: ab(ldab,*) is an array containing either upper or lower triangular part of the Hermitian matrix A (as specified by uplo) in band storage format. The second dimension of the array ab must be at least max(1, n). bb(ldbb,*) is an array containing either upper or lower triangular part of the Hermitian matrix B (as specified by uplo) in band storage format. The second dimension of the array bb must be at least max(1, n). work(*) is a workspace array, dimension at least max(1, n).
ldab INTEGER. The leading dimension of the array ab; must be at least ka+1.
ldbb INTEGER. The leading dimension of the array bb; must be at least kb+1.
ldz INTEGER. The leading dimension of the output array z; ldz ≥ 1. If jobz = 'V', ldz ≥ max(1, n).
rwork REAL for chbgv; DOUBLE PRECISION for zhbgv. Workspace array, DIMENSION at least max(1, 3n).
Output Parameters
ab On exit, the contents of ab are overwritten.
bb On exit, contains the factor S from the split Cholesky factorization B = S^H*S, as returned by ?pbstf.
w REAL for chbgv; DOUBLE PRECISION for zhbgv. Array, DIMENSION at least max(1, n). If info = 0, contains the eigenvalues in ascending order.
z COMPLEX for chbgv; DOUBLE COMPLEX for zhbgv. Array z(ldz,*). The second dimension of z must be at least max(1, n). If jobz = 'V', then if info = 0, z contains the matrix Z of eigenvectors, with the i-th column of z holding the eigenvector associated with w(i). The eigenvectors are normalized so that Z^H*B*Z = I. If jobz = 'N', then z is not referenced.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th argument had an illegal value. If info = i > 0 and i ≤ n, the algorithm failed to converge, and i off-diagonal elements of an intermediate tridiagonal did not converge to zero; if info = n + i, for 1 ≤ i ≤ n, then ?pbstf returned info = i and B is not positive-definite.
The factorization of B could not be completed and no eigenvalues or eigenvectors were computed.
Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine hbgv interface are the following:
ab Holds the array A of size (ka+1,n).
bb Holds the array B of size (kb+1,n).
w Holds the vector with the number of elements n.
z Holds the matrix Z of size (n, n).
uplo Must be 'U' or 'L'. The default value is 'U'.
jobz Restored based on the presence of the argument z as follows: jobz = 'V', if z is present; jobz = 'N', if z is omitted.
?sbgvd
Computes all eigenvalues and, optionally, eigenvectors of a real generalized symmetric definite eigenproblem with banded matrices. If eigenvectors are desired, it uses a divide and conquer method.
Syntax
Fortran 77:
call ssbgvd(jobz, uplo, n, ka, kb, ab, ldab, bb, ldbb, w, z, ldz, work, lwork, iwork, liwork, info)
call dsbgvd(jobz, uplo, n, ka, kb, ab, ldab, bb, ldbb, w, z, ldz, work, lwork, iwork, liwork, info)
Fortran 95:
call sbgvd(ab, bb, w [,uplo] [,z] [,info])
C:
lapack_int LAPACKE_<?>sbgvd( int matrix_order, char jobz, char uplo, lapack_int n, lapack_int ka, lapack_int kb, <datatype>* ab, lapack_int ldab, <datatype>* bb, lapack_int ldbb, <datatype>* w, <datatype>* z, lapack_int ldz );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine computes all the eigenvalues, and optionally, the eigenvectors of a real generalized symmetric-definite banded eigenproblem, of the form A*x = λ*B*x. Here A and B are assumed to be symmetric and banded, and B is also positive definite. If eigenvectors are desired, it uses a divide and conquer algorithm.
Input Parameters
The data types are given for the Fortran interface.
A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
jobz CHARACTER*1. Must be 'N' or 'V'. If jobz = 'N', then compute eigenvalues only. If jobz = 'V', then compute eigenvalues and eigenvectors.
uplo CHARACTER*1. Must be 'U' or 'L'. If uplo = 'U', arrays ab and bb store the upper triangles of A and B; if uplo = 'L', arrays ab and bb store the lower triangles of A and B.
n INTEGER. The order of the matrices A and B (n ≥ 0).
ka INTEGER. The number of super- or sub-diagonals in A (ka ≥ 0).
kb INTEGER. The number of super- or sub-diagonals in B (kb ≥ 0).
ab, bb, work REAL for ssbgvd; DOUBLE PRECISION for dsbgvd. Arrays: ab(ldab,*) is an array containing either upper or lower triangular part of the symmetric matrix A (as specified by uplo) in band storage format. The second dimension of the array ab must be at least max(1, n). bb(ldbb,*) is an array containing either upper or lower triangular part of the symmetric matrix B (as specified by uplo) in band storage format. The second dimension of the array bb must be at least max(1, n). work is a workspace array, its dimension max(1, lwork).
ldab INTEGER. The leading dimension of the array ab; must be at least ka+1.
ldbb INTEGER. The leading dimension of the array bb; must be at least kb+1.
ldz INTEGER. The leading dimension of the output array z; ldz ≥ 1. If jobz = 'V', ldz ≥ max(1, n).
lwork INTEGER. The dimension of the array work. Constraints: if n ≤ 1, lwork ≥ 1; if jobz = 'N' and n > 1, lwork ≥ 3n; if jobz = 'V' and n > 1, lwork ≥ 2n^2+5n+1. If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work and iwork arrays, returns these values as the first entries of the work and iwork arrays, and no error message related to lwork or liwork is issued by xerbla.
See Application Notes for details.
iwork INTEGER. Workspace array, its dimension max(1, liwork).
liwork INTEGER. The dimension of the array iwork. Constraints: if n ≤ 1, liwork ≥ 1; if jobz = 'N' and n > 1, liwork ≥ 1; if jobz = 'V' and n > 1, liwork ≥ 5n+3. If liwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work and iwork arrays, returns these values as the first entries of the work and iwork arrays, and no error message related to lwork or liwork is issued by xerbla. See Application Notes for details.
Output Parameters
ab On exit, the contents of ab are overwritten.
bb On exit, contains the factor S from the split Cholesky factorization B = S^T*S, as returned by ?pbstf.
w, z REAL for ssbgvd; DOUBLE PRECISION for dsbgvd. Arrays: w(*), DIMENSION at least max(1, n). If info = 0, contains the eigenvalues in ascending order. z(ldz,*). The second dimension of z must be at least max(1, n). If jobz = 'V', then if info = 0, z contains the matrix Z of eigenvectors, with the i-th column of z holding the eigenvector associated with w(i). The eigenvectors are normalized so that Z^T*B*Z = I. If jobz = 'N', then z is not referenced.
work(1) On exit, if info = 0, then work(1) returns the required minimal size of lwork.
iwork(1) On exit, if info = 0, then iwork(1) returns the required minimal size of liwork.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th argument had an illegal value. If info = i > 0 and i ≤ n, the algorithm failed to converge, and i off-diagonal elements of an intermediate tridiagonal did not converge to zero; if info = n + i, for 1 ≤ i ≤ n, then ?pbstf returned info = i and B is not positive-definite. The factorization of B could not be completed and no eigenvalues or eigenvectors were computed.
Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine sbgvd interface are the following:
ab Holds the array A of size (ka+1,n).
bb Holds the array B of size (kb+1,n).
w Holds the vector with the number of elements n.
z Holds the matrix Z of size (n, n).
uplo Must be 'U' or 'L'. The default value is 'U'.
jobz Restored based on the presence of the argument z as follows: jobz = 'V', if z is present; jobz = 'N', if z is omitted.
Application Notes
If it is not clear how much workspace to supply, use a generous value of lwork (or liwork) for the first run or set lwork = -1 (liwork = -1). If lwork (or liwork) has any admissible size, that is, no less than the minimal value described, the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the corresponding array (work, iwork) on exit. Use this value (work(1), iwork(1)) for subsequent runs. If lwork = -1 (liwork = -1), the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work, iwork). This operation is called a workspace query. Note that if lwork (liwork) is less than the minimal required value and is not equal to -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.
?hbgvd
Computes all eigenvalues and, optionally, eigenvectors of a complex generalized Hermitian definite eigenproblem with banded matrices. If eigenvectors are desired, it uses a divide and conquer method.
Syntax
Fortran 77:
call chbgvd(jobz, uplo, n, ka, kb, ab, ldab, bb, ldbb, w, z, ldz, work, lwork, rwork, lrwork, iwork, liwork, info)
call zhbgvd(jobz, uplo, n, ka, kb, ab, ldab, bb, ldbb, w, z, ldz, work, lwork, rwork, lrwork, iwork, liwork, info)
Fortran 95:
call hbgvd(ab, bb, w [,uplo] [,z] [,info])
C:
lapack_int LAPACKE_chbgvd( int matrix_order, char jobz, char uplo, lapack_int n, lapack_int ka, lapack_int kb, lapack_complex_float* ab, lapack_int ldab, lapack_complex_float* bb, lapack_int ldbb, float* w, lapack_complex_float* z, lapack_int ldz );
lapack_int LAPACKE_zhbgvd( int matrix_order, char jobz, char uplo, lapack_int n, lapack_int ka, lapack_int kb, lapack_complex_double* ab, lapack_int ldab, lapack_complex_double* bb, lapack_int ldbb, double* w, lapack_complex_double* z, lapack_int ldz );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine computes all the eigenvalues, and optionally, the eigenvectors of a complex generalized Hermitian-definite banded eigenproblem, of the form A*x = λ*B*x. Here A and B are assumed to be Hermitian and banded, and B is also positive definite. If eigenvectors are desired, it uses a divide and conquer algorithm.
Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
jobz CHARACTER*1. Must be 'N' or 'V'. If jobz = 'N', then compute eigenvalues only. If jobz = 'V', then compute eigenvalues and eigenvectors.
uplo CHARACTER*1. Must be 'U' or 'L'. If uplo = 'U', arrays ab and bb store the upper triangles of A and B; if uplo = 'L', arrays ab and bb store the lower triangles of A and B.
n INTEGER. The order of the matrices A and B (n ≥ 0).
ka INTEGER.
The number of super- or sub-diagonals in A (ka ≥ 0).
kb INTEGER. The number of super- or sub-diagonals in B (kb ≥ 0).
ab, bb, work COMPLEX for chbgvd; DOUBLE COMPLEX for zhbgvd. Arrays: ab(ldab,*) is an array containing either upper or lower triangular part of the Hermitian matrix A (as specified by uplo) in band storage format. The second dimension of the array ab must be at least max(1, n). bb(ldbb,*) is an array containing either upper or lower triangular part of the Hermitian matrix B (as specified by uplo) in band storage format. The second dimension of the array bb must be at least max(1, n). work is a workspace array, its dimension max(1, lwork).
ldab INTEGER. The leading dimension of the array ab; must be at least ka+1.
ldbb INTEGER. The leading dimension of the array bb; must be at least kb+1.
ldz INTEGER. The leading dimension of the output array z; ldz ≥ 1. If jobz = 'V', ldz ≥ max(1, n).
lwork INTEGER. The dimension of the array work. Constraints: if n ≤ 1, lwork ≥ 1; if jobz = 'N' and n > 1, lwork ≥ n; if jobz = 'V' and n > 1, lwork ≥ 2n^2. If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work, rwork and iwork arrays, returns these values as the first entries of the work, rwork and iwork arrays, and no error message related to lwork or lrwork or liwork is issued by xerbla. See Application Notes for details.
rwork REAL for chbgvd; DOUBLE PRECISION for zhbgvd. Workspace array, DIMENSION max(1, lrwork).
lrwork INTEGER. The dimension of the array rwork. Constraints: if n ≤ 1, lrwork ≥ 1; if jobz = 'N' and n > 1, lrwork ≥ n; if jobz = 'V' and n > 1, lrwork ≥ 2n^2+5n+1. If lrwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work, rwork and iwork arrays, returns these values as the first entries of the work, rwork and iwork arrays, and no error message related to lwork or lrwork or liwork is issued by xerbla. See Application Notes for details.
iwork INTEGER.
Workspace array, DIMENSION max(1, liwork).
liwork INTEGER. The dimension of the array iwork. Constraints: if n ≤ 1, liwork ≥ 1; if jobz = 'N' and n > 1, liwork ≥ 1; if jobz = 'V' and n > 1, liwork ≥ 5n+3. If liwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work, rwork and iwork arrays, returns these values as the first entries of the work, rwork and iwork arrays, and no error message related to lwork or lrwork or liwork is issued by xerbla. See Application Notes for details.
Output Parameters
ab On exit, the contents of ab are overwritten.
bb On exit, contains the factor S from the split Cholesky factorization B = S^H*S, as returned by ?pbstf.
w REAL for chbgvd; DOUBLE PRECISION for zhbgvd. Array, DIMENSION at least max(1, n). If info = 0, contains the eigenvalues in ascending order.
z COMPLEX for chbgvd; DOUBLE COMPLEX for zhbgvd. Array z(ldz,*). The second dimension of z must be at least max(1, n). If jobz = 'V', then if info = 0, z contains the matrix Z of eigenvectors, with the i-th column of z holding the eigenvector associated with w(i). The eigenvectors are normalized so that Z^H*B*Z = I. If jobz = 'N', then z is not referenced.
work(1) On exit, if info = 0, then work(1) returns the required minimal size of lwork.
rwork(1) On exit, if info = 0, then rwork(1) returns the required minimal size of lrwork.
iwork(1) On exit, if info = 0, then iwork(1) returns the required minimal size of liwork.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th argument had an illegal value. If info = i > 0 and i ≤ n, the algorithm failed to converge, and i off-diagonal elements of an intermediate tridiagonal did not converge to zero; if info = n + i, for 1 ≤ i ≤ n, then ?pbstf returned info = i and B is not positive-definite. The factorization of B could not be completed and no eigenvalues or eigenvectors were computed.
Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine hbgvd interface are the following:
ab Holds the array A of size (ka+1,n).
bb Holds the array B of size (kb+1,n).
w Holds the vector with the number of elements n.
z Holds the matrix Z of size (n, n).
uplo Must be 'U' or 'L'. The default value is 'U'.
jobz Restored based on the presence of the argument z as follows: jobz = 'V', if z is present; jobz = 'N', if z is omitted.
Application Notes
If you are in doubt how much workspace to supply, use a generous value of lwork (liwork or lrwork) for the first run or set lwork = -1 (liwork = -1, lrwork = -1). If you choose the first option and set any admissible lwork (liwork or lrwork) size, that is, no less than the minimal value described, the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the corresponding array (work, iwork, rwork) on exit. Use this value (work(1), iwork(1), rwork(1)) for subsequent runs. If you set lwork = -1 (liwork = -1, lrwork = -1), the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work, iwork, rwork). This operation is called a workspace query. Note that if you set lwork (liwork, lrwork) to less than the minimal required value and not -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.
?sbgvx
Computes selected eigenvalues and, optionally, eigenvectors of a real generalized symmetric definite eigenproblem with banded matrices.
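Syntax
Fortran 77:
call ssbgvx(jobz, range, uplo, n, ka, kb, ab, ldab, bb, ldbb, q, ldq, vl, vu, il, iu, abstol, m, w, z, ldz, work, iwork, ifail, info)
call dsbgvx(jobz, range, uplo, n, ka, kb, ab, ldab, bb, ldbb, q, ldq, vl, vu, il, iu, abstol, m, w, z, ldz, work, iwork, ifail, info)
Fortran 95:
call sbgvx(ab, bb, w [,uplo] [,z] [,vl] [,vu] [,il] [,iu] [,m] [,ifail] [,q] [,abstol] [,info])
C:
lapack_int LAPACKE_<?>sbgvx( int matrix_order, char jobz, char range, char uplo, lapack_int n, lapack_int ka, lapack_int kb, <datatype>* ab, lapack_int ldab, <datatype>* bb, lapack_int ldbb, <datatype>* q, lapack_int ldq, <datatype> vl, <datatype> vu, lapack_int il, lapack_int iu, <datatype> abstol, lapack_int* m, <datatype>* w, <datatype>* z, lapack_int ldz, lapack_int* ifail );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine computes selected eigenvalues, and optionally, the eigenvectors of a real generalized symmetric-definite banded eigenproblem, of the form A*x = λ*B*x. Here A and B are assumed to be symmetric and banded, and B is also positive definite. Eigenvalues and eigenvectors can be selected by specifying either all eigenvalues, a range of values or a range of indices for the desired eigenvalues.
Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
jobz CHARACTER*1. Must be 'N' or 'V'. If jobz = 'N', then compute eigenvalues only. If jobz = 'V', then compute eigenvalues and eigenvectors.
range CHARACTER*1. Must be 'A' or 'V' or 'I'. If range = 'A', the routine computes all eigenvalues. If range = 'V', the routine computes eigenvalues lambda(i) in the half-open interval: vl < lambda(i) ≤ vu. If range = 'I', the routine computes eigenvalues with indices il to iu.
uplo CHARACTER*1. Must be 'U' or 'L'. If uplo = 'U', arrays ab and bb store the upper triangles of A and B; if uplo = 'L', arrays ab and bb store the lower triangles of A and B.
n INTEGER. The order of the matrices A and B (n ≥ 0).
ka INTEGER. The number of super- or sub-diagonals in A (ka ≥ 0).
kb INTEGER. The number of super- or sub-diagonals in B (kb ≥ 0).
ab, bb, work REAL for ssbgvx; DOUBLE PRECISION for dsbgvx. Arrays: ab(ldab,*) is an array containing either upper or lower triangular part of the symmetric matrix A (as specified by uplo) in band storage format. The second dimension of the array ab must be at least max(1, n). bb(ldbb,*) is an array containing either upper or lower triangular part of the symmetric matrix B (as specified by uplo) in band storage format. The second dimension of the array bb must be at least max(1, n). work(*) is a workspace array, DIMENSION (7*n).
ldab INTEGER. The leading dimension of the array ab; must be at least ka+1.
ldbb INTEGER. The leading dimension of the array bb; must be at least kb+1.
vl, vu REAL for ssbgvx; DOUBLE PRECISION for dsbgvx. If range = 'V', the lower and upper bounds of the interval to be searched for eigenvalues. Constraint: vl < vu. If range = 'A' or 'I', vl and vu are not referenced.
il, iu INTEGER. If range = 'I', the indices in ascending order of the smallest and largest eigenvalues to be returned. Constraint: 1 ≤ il ≤ iu ≤ n, if n > 0; il=1 and iu=0 if n = 0. If range = 'A' or 'V', il and iu are not referenced.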
abstol REAL for ssbgvx; DOUBLE PRECISION for dsbgvx. The absolute error tolerance for the eigenvalues. See Application Notes for more information.
ldz INTEGER. The leading dimension of the output array z; ldz ≥ 1. If jobz = 'V', ldz ≥ max(1, n).
ldq INTEGER. The leading dimension of the output array q; ldq ≥ 1. If jobz = 'V', ldq ≥ max(1, n).
iwork INTEGER. Workspace array, DIMENSION (5*n).
Output Parameters
ab On exit, the contents of ab are overwritten.
bb On exit, contains the factor S from the split Cholesky factorization B = S^T*S, as returned by ?pbstf.
m INTEGER. The total number of eigenvalues found, 0 ≤ m ≤ n. If range = 'A', m = n, and if range = 'I', m = iu-il+1.
w, z, q REAL for ssbgvx; DOUBLE PRECISION for dsbgvx. Arrays: w(*), DIMENSION at least max(1, n). If info = 0, contains the eigenvalues in ascending order. z(ldz,*). The second dimension of z must be at least max(1, n). If jobz = 'V', then if info = 0, z contains the matrix Z of eigenvectors, with the i-th column of z holding the eigenvector associated with w(i). The eigenvectors are normalized so that Z^T*B*Z = I. If jobz = 'N', then z is not referenced. q(ldq,*). The second dimension of q must be at least max(1, n). If jobz = 'V', then q contains the n-by-n matrix used in the reduction of A*x = lambda*B*x to standard form, that is, C*x = lambda*x, and consequently C to tridiagonal form. If jobz = 'N', then q is not referenced.
ifail INTEGER. Array, DIMENSION (m). If jobz = 'V', then if info = 0, the first m elements of ifail are zero; if info > 0, ifail contains the indices of the eigenvectors that failed to converge. If jobz = 'N', then ifail is not referenced.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th argument had an illegal value.
If info = i > 0 and i ≤ n, the algorithm failed to converge, and i off-diagonal elements of an intermediate tridiagonal did not converge to zero; if info = n + i, for 1 ≤ i ≤ n, then ?pbstf returned info = i and B is not positive-definite. The factorization of B could not be completed and no eigenvalues or eigenvectors were computed.
Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine sbgvx interface are the following:
ab Holds the array A of size (ka+1,n).
bb Holds the array B of size (kb+1,n).
w Holds the vector with the number of elements n.
z Holds the matrix Z of size (n, n).
ifail Holds the vector with the number of elements n.
q Holds the matrix Q of size (n, n).
uplo Must be 'U' or 'L'. The default value is 'U'.
vl Default value for this element is vl = -HUGE(vl).
vu Default value for this element is vu = HUGE(vl).
il Default value for this argument is il = 1.
iu Default value for this argument is iu = n.
abstol Default value for this element is abstol = 0.0_WP.
jobz Restored based on the presence of the argument z as follows: jobz = 'V', if z is present; jobz = 'N', if z is omitted. Note that there will be an error condition if ifail or q is present and z is omitted.
range Restored based on the presence of arguments vl, vu, il, iu as follows: range = 'V', if one of or both vl and vu are present; range = 'I', if one of or both il and iu are present; range = 'A', if none of vl, vu, il, iu is present. Note that there will be an error condition if one of or both vl and vu are present and at the same time one of or both il and iu are present.
Application Notes
An approximate eigenvalue is accepted as converged when it is determined to lie in an interval [a,b] of width less than or equal to abstol+ε*max(|a|,|b|), where ε is the machine precision. If abstol is less than or equal to zero, then ε*||T||_1 is used as tolerance, where T is the tridiagonal matrix obtained by reducing A to tridiagonal form. Eigenvalues will be computed most accurately when abstol is set to twice the underflow threshold 2*?lamch('S'), not zero. If this routine returns with info > 0, indicating that some eigenvectors did not converge, try setting abstol to 2*?lamch('S').
?hbgvx
Computes selected eigenvalues and, optionally, eigenvectors of a complex generalized Hermitian definite eigenproblem with banded matrices.
Syntax
Fortran 77:
call chbgvx(jobz, range, uplo, n, ka, kb, ab, ldab, bb, ldbb, q, ldq, vl, vu, il, iu, abstol, m, w, z, ldz, work, rwork, iwork, ifail, info)
call zhbgvx(jobz, range, uplo, n, ka, kb, ab, ldab, bb, ldbb, q, ldq, vl, vu, il, iu, abstol, m, w, z, ldz, work, rwork, iwork, ifail, info)
Fortran 95:
call hbgvx(ab, bb, w [,uplo] [,z] [,vl] [,vu] [,il] [,iu] [,m] [,ifail] [,q] [,abstol] [,info])
C:
lapack_int LAPACKE_chbgvx( int matrix_order, char jobz, char range, char uplo, lapack_int n, lapack_int ka, lapack_int kb, lapack_complex_float* ab, lapack_int ldab, lapack_complex_float* bb, lapack_int ldbb, lapack_complex_float* q, lapack_int ldq, float vl, float vu, lapack_int il, lapack_int iu, float abstol, lapack_int* m, float* w, lapack_complex_float* z, lapack_int ldz, lapack_int* ifail );
lapack_int LAPACKE_zhbgvx( int matrix_order, char jobz, char range, char uplo, lapack_int n, lapack_int ka, lapack_int kb, lapack_complex_double* ab, lapack_int ldab, lapack_complex_double* bb, lapack_int ldbb, lapack_complex_double* q, lapack_int ldq, double vl, double vu, lapack_int il, lapack_int iu, double abstol, lapack_int* m, double* w, lapack_complex_double* z,
lapack_int ldz, lapack_int* ifail );

Include Files

• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description

The routine computes selected eigenvalues, and optionally, the eigenvectors of a complex generalized Hermitian-definite banded eigenproblem, of the form A*x = λ*B*x. Here A and B are assumed to be Hermitian and banded, and B is also positive definite. Eigenvalues and eigenvectors can be selected by specifying either all eigenvalues, a range of values, or a range of indices for the desired eigenvalues.

Input Parameters

The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

jobz  CHARACTER*1. Must be 'N' or 'V'. If jobz = 'N', then compute eigenvalues only. If jobz = 'V', then compute eigenvalues and eigenvectors.
range  CHARACTER*1. Must be 'A' or 'V' or 'I'. If range = 'A', the routine computes all eigenvalues. If range = 'V', the routine computes eigenvalues lambda(i) in the half-open interval vl < lambda(i) ≤ vu. If range = 'I', the routine computes eigenvalues with indices il to iu.
uplo  CHARACTER*1. Must be 'U' or 'L'. If uplo = 'U', arrays ab and bb store the upper triangles of A and B; if uplo = 'L', arrays ab and bb store the lower triangles of A and B.
n  INTEGER. The order of the matrices A and B (n ≥ 0).
ka  INTEGER. The number of super- or sub-diagonals in A (ka ≥ 0).
kb  INTEGER. The number of super- or sub-diagonals in B (kb ≥ 0).
ab, bb, work  COMPLEX for chbgvx, DOUBLE COMPLEX for zhbgvx. Arrays:
ab(ldab,*) is an array containing either upper or lower triangular part of the Hermitian matrix A (as specified by uplo) in band storage format. The second dimension of the array ab must be at least max(1, n).
bb(ldbb,*) is an array containing either upper or lower triangular part of the Hermitian matrix B (as specified by uplo) in band storage format. The second dimension of the array bb must be at least max(1, n).
work(*) is a workspace array, DIMENSION at least max(1, n).
ldab  INTEGER. The leading dimension of the array ab; must be at least ka+1.
ldbb  INTEGER. The leading dimension of the array bb; must be at least kb+1.
vl, vu  REAL for chbgvx, DOUBLE PRECISION for zhbgvx. If range = 'V', the lower and upper bounds of the interval to be searched for eigenvalues. Constraint: vl < vu. If range = 'A' or 'I', vl and vu are not referenced.
il, iu  INTEGER. If range = 'I', the indices in ascending order of the smallest and largest eigenvalues to be returned. Constraint: 1 ≤ il ≤ iu ≤ n, if n > 0; il = 1 and iu = 0 if n = 0. If range = 'A' or 'V', il and iu are not referenced.
abstol  REAL for chbgvx, DOUBLE PRECISION for zhbgvx. The absolute error tolerance for the eigenvalues. See Application Notes for more information.
ldz  INTEGER. The leading dimension of the output array z; ldz ≥ 1. If jobz = 'V', ldz ≥ max(1, n).
ldq  INTEGER. The leading dimension of the output array q; ldq ≥ 1. If jobz = 'V', ldq ≥ max(1, n).
rwork  REAL for chbgvx, DOUBLE PRECISION for zhbgvx. Workspace array, DIMENSION at least max(1, 7n).
iwork  INTEGER. Workspace array, DIMENSION at least max(1, 5n).

Output Parameters

ab  On exit, the contents of ab are overwritten.
bb  On exit, contains the factor S from the split Cholesky factorization B = S^H*S, as returned by ?pbstf.
m  INTEGER. The total number of eigenvalues found, 0 ≤ m ≤ n. If range = 'A', m = n, and if range = 'I', m = iu-il+1.
w  REAL for chbgvx, DOUBLE PRECISION for zhbgvx. Array w(*), DIMENSION at least max(1, n). If info = 0, contains the eigenvalues in ascending order.
z, q  COMPLEX for chbgvx, DOUBLE COMPLEX for zhbgvx. Arrays:
z(ldz,*). The second dimension of z must be at least max(1, n).
If jobz = 'V', then if info = 0, z contains the matrix Z of eigenvectors, with the i-th column of z holding the eigenvector associated with w(i). The eigenvectors are normalized so that Z^H*B*Z = I. If jobz = 'N', then z is not referenced.
q(ldq,*). The second dimension of q must be at least max(1, n). If jobz = 'V', then q contains the n-by-n matrix used in the reduction of A*x = λ*B*x to standard form, that is, C*x = λ*x, and consequently C to tridiagonal form. If jobz = 'N', then q is not referenced.
ifail  INTEGER. Array, DIMENSION at least max(1, n). If jobz = 'V', then if info = 0, the first m elements of ifail are zero; if info > 0, ifail contains the indices of the eigenvectors that failed to converge. If jobz = 'N', then ifail is not referenced.
info  INTEGER. If info = 0, the execution is successful. If info = -i, the i-th argument had an illegal value. If info > 0, then: if info = i ≤ n, the algorithm failed to converge, and i off-diagonal elements of an intermediate tridiagonal form did not converge to zero; if info = n + i, for 1 ≤ i ≤ n, then ?pbstf returned info = i and B is not positive-definite. The factorization of B could not be completed and no eigenvalues or eigenvectors were computed.

Fortran 95 Interface Notes

Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.

Specific details for the routine hbgvx interface are the following:

ab  Holds the array A of size (ka+1,n).
bb  Holds the array B of size (kb+1,n).
w  Holds the vector with the number of elements n.
z  Holds the matrix Z of size (n, n).
ifail  Holds the vector with the number of elements n.
q  Holds the matrix Q of size (n, n).
uplo  Must be 'U' or 'L'. The default value is 'U'.
vl  Default value for this element is vl = -HUGE(vl).
vu  Default value for this element is vu = HUGE(vl).
il  Default value for this argument is il = 1.
iu  Default value for this argument is iu = n.
abstol  Default value for this element is abstol = 0.0_WP.
jobz  Restored based on the presence of the argument z as follows:
  jobz = 'V', if z is present,
  jobz = 'N', if z is omitted.
  Note that there will be an error condition if ifail or q is present and z is omitted.
range  Restored based on the presence of arguments vl, vu, il, iu as follows:
  range = 'V', if one of or both vl and vu are present,
  range = 'I', if one of or both il and iu are present,
  range = 'A', if none of vl, vu, il, iu is present.
  Note that there will be an error condition if one of or both vl and vu are present and at the same time one of or both il and iu are present.

Application Notes

An approximate eigenvalue is accepted as converged when it is determined to lie in an interval [a,b] of width less than or equal to abstol + ε*max(|a|,|b|), where ε is the machine precision. If abstol is less than or equal to zero, then ε*||T||_1 will be used in its place, where T is the tridiagonal matrix obtained by reducing A to tridiagonal form. Eigenvalues will be computed most accurately when abstol is set to twice the underflow threshold, 2*?lamch('S'), not zero. If this routine returns with info > 0, indicating that some eigenvectors did not converge, try setting abstol to 2*?lamch('S').

Generalized Nonsymmetric Eigenproblems

This section describes LAPACK driver routines used for solving generalized nonsymmetric eigenproblems. See also computational routines that can be called to solve these problems. Table "Driver Routines for Solving Generalized Nonsymmetric Eigenproblems" lists all such driver routines for the FORTRAN 77 interface. Respective routine names in the Fortran 95 interface are without the first symbol (see Routine Naming Conventions).
Driver Routines for Solving Generalized Nonsymmetric Eigenproblems

Routine Name   Operation performed
gges           Computes the generalized eigenvalues, Schur form, and the left and/or right Schur vectors for a pair of nonsymmetric matrices.
ggesx          Computes the generalized eigenvalues, Schur form, and, optionally, the left and/or right matrices of Schur vectors.
ggev           Computes the generalized eigenvalues, and the left and/or right generalized eigenvectors for a pair of nonsymmetric matrices.
ggevx          Computes the generalized eigenvalues, and, optionally, the left and/or right generalized eigenvectors.

?gges

Computes the generalized eigenvalues, Schur form, and the left and/or right Schur vectors for a pair of nonsymmetric matrices.

Syntax

Fortran 77:
call sgges(jobvsl, jobvsr, sort, selctg, n, a, lda, b, ldb, sdim, alphar, alphai, beta, vsl, ldvsl, vsr, ldvsr, work, lwork, bwork, info)
call dgges(jobvsl, jobvsr, sort, selctg, n, a, lda, b, ldb, sdim, alphar, alphai, beta, vsl, ldvsl, vsr, ldvsr, work, lwork, bwork, info)
call cgges(jobvsl, jobvsr, sort, selctg, n, a, lda, b, ldb, sdim, alpha, beta, vsl, ldvsl, vsr, ldvsr, work, lwork, rwork, bwork, info)
call zgges(jobvsl, jobvsr, sort, selctg, n, a, lda, b, ldb, sdim, alpha, beta, vsl, ldvsl, vsr, ldvsr, work, lwork, rwork, bwork, info)

Fortran 95:
call gges(a, b, alphar, alphai, beta [,vsl] [,vsr] [,select] [,sdim] [,info])
call gges(a, b, alpha, beta [,vsl] [,vsr] [,select] [,sdim] [,info])

C:
lapack_int LAPACKE_sgges( int matrix_order, char jobvsl, char jobvsr, char sort, LAPACK_S_SELECT3 select, lapack_int n, float* a, lapack_int lda, float* b, lapack_int ldb, lapack_int* sdim, float* alphar, float* alphai, float* beta, float* vsl, lapack_int ldvsl, float* vsr, lapack_int ldvsr );
lapack_int LAPACKE_dgges( int matrix_order, char jobvsl, char jobvsr, char sort, LAPACK_D_SELECT3 select, lapack_int n, double* a, lapack_int lda, double* b,
lapack_int ldb, lapack_int* sdim, double* alphar, double* alphai, double* beta, double* vsl, lapack_int ldvsl, double* vsr, lapack_int ldvsr );
lapack_int LAPACKE_cgges( int matrix_order, char jobvsl, char jobvsr, char sort, LAPACK_C_SELECT2 select, lapack_int n, lapack_complex_float* a, lapack_int lda, lapack_complex_float* b, lapack_int ldb, lapack_int* sdim, lapack_complex_float* alpha, lapack_complex_float* beta, lapack_complex_float* vsl, lapack_int ldvsl, lapack_complex_float* vsr, lapack_int ldvsr );
lapack_int LAPACKE_zgges( int matrix_order, char jobvsl, char jobvsr, char sort, LAPACK_Z_SELECT2 select, lapack_int n, lapack_complex_double* a, lapack_int lda, lapack_complex_double* b, lapack_int ldb, lapack_int* sdim, lapack_complex_double* alpha, lapack_complex_double* beta, lapack_complex_double* vsl, lapack_int ldvsl, lapack_complex_double* vsr, lapack_int ldvsr );

Include Files

• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description

The ?gges routine computes the generalized eigenvalues, the generalized real/complex Schur form (S,T), and, optionally, the left and/or right matrices of Schur vectors (vsl and vsr) for a pair of n-by-n real/complex nonsymmetric matrices (A,B). This gives the generalized Schur factorization

(A,B) = ( vsl*S*vsr^H, vsl*T*vsr^H )

Optionally, it also orders the eigenvalues so that a selected cluster of eigenvalues appears in the leading diagonal blocks of the upper quasi-triangular matrix S and the upper triangular matrix T. The leading columns of vsl and vsr then form an orthonormal/unitary basis for the corresponding left and right eigenspaces (deflating subspaces).

If only the generalized eigenvalues are needed, use the driver ggev instead, which is faster.

A generalized eigenvalue for a pair of matrices (A,B) is a scalar w or a ratio alpha / beta = w, such that A - w*B is singular.
It is usually represented as the pair (alpha, beta), as there is a reasonable interpretation for beta = 0 or for both being zero.

A pair of matrices (S,T) is in the generalized real Schur form if T is upper triangular with non-negative diagonal and S is block upper triangular with 1-by-1 and 2-by-2 blocks. 1-by-1 blocks correspond to real generalized eigenvalues, while 2-by-2 blocks of S are "standardized" by making the corresponding elements of T have the form:

    [ a  0 ]
    [ 0  b ]

and the pair of corresponding 2-by-2 blocks in S and T will have a complex conjugate pair of generalized eigenvalues. A pair of matrices (S,T) is in generalized complex Schur form if S and T are upper triangular and, in addition, the diagonal elements of T are non-negative real numbers.

The ?gges routine replaces the deprecated ?gegs routine.

Input Parameters

The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

jobvsl  CHARACTER*1. Must be 'N' or 'V'. If jobvsl = 'N', then the left Schur vectors are not computed. If jobvsl = 'V', then the left Schur vectors are computed.
jobvsr  CHARACTER*1. Must be 'N' or 'V'. If jobvsr = 'N', then the right Schur vectors are not computed. If jobvsr = 'V', then the right Schur vectors are computed.
sort  CHARACTER*1. Must be 'N' or 'S'. Specifies whether or not to order the eigenvalues on the diagonal of the generalized Schur form. If sort = 'N', then eigenvalues are not ordered. If sort = 'S', eigenvalues are ordered (see selctg).
selctg  LOGICAL FUNCTION of three REAL arguments for real flavors. LOGICAL FUNCTION of two COMPLEX arguments for complex flavors. selctg must be declared EXTERNAL in the calling subroutine. If sort = 'S', selctg is used to select eigenvalues to sort to the top left of the Schur form. If sort = 'N', selctg is not referenced.
For real flavors: An eigenvalue (alphar(j) + alphai(j))/beta(j) is selected if selctg(alphar(j), alphai(j), beta(j)) is true; that is, if either one of a complex conjugate pair of eigenvalues is selected, then both complex eigenvalues are selected. Note that in the ill-conditioned case, a selected complex eigenvalue may no longer satisfy selctg(alphar(j), alphai(j), beta(j)) = .TRUE. after ordering. In this case info is set to n+2.
For complex flavors: An eigenvalue alpha(j) / beta(j) is selected if selctg(alpha(j), beta(j)) is true. Note that a selected complex eigenvalue may no longer satisfy selctg(alpha(j), beta(j)) = .TRUE. after ordering, since ordering may change the value of complex eigenvalues (especially if the eigenvalue is ill-conditioned); in this case info is set to n+2 (see info below).
n  INTEGER. The order of the matrices A, B, vsl, and vsr (n ≥ 0).
a, b, work  REAL for sgges, DOUBLE PRECISION for dgges, COMPLEX for cgges, DOUBLE COMPLEX for zgges. Arrays:
a(lda,*) is an array containing the n-by-n matrix A (first of the pair of matrices). The second dimension of a must be at least max(1, n).
b(ldb,*) is an array containing the n-by-n matrix B (second of the pair of matrices). The second dimension of b must be at least max(1, n).
work is a workspace array, its dimension max(1, lwork).
lda  INTEGER. The leading dimension of the array a. Must be at least max(1, n).
ldb  INTEGER. The leading dimension of the array b. Must be at least max(1, n).
ldvsl, ldvsr  INTEGER. The leading dimensions of the output matrices vsl and vsr, respectively. Constraints: ldvsl ≥ 1. If jobvsl = 'V', ldvsl ≥ max(1, n). ldvsr ≥ 1. If jobvsr = 'V', ldvsr ≥ max(1, n).
lwork  INTEGER. The dimension of the array work. lwork ≥ max(1, 8n+16) for real flavors; lwork ≥ max(1, 2n) for complex flavors. For good performance, lwork must generally be larger.
If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla.
rwork  REAL for cgges, DOUBLE PRECISION for zgges. Workspace array, DIMENSION at least max(1, 8n). This array is used in complex flavors only.
bwork  LOGICAL. Workspace array, DIMENSION at least max(1, n). Not referenced if sort = 'N'.

Output Parameters

a  On exit, this array has been overwritten by its generalized Schur form S.
b  On exit, this array has been overwritten by its generalized Schur form T.
sdim  INTEGER. If sort = 'N', sdim = 0. If sort = 'S', sdim is equal to the number of eigenvalues (after sorting) for which selctg is true. Note that for real flavors complex conjugate pairs for which selctg is true for either eigenvalue count as 2.
alphar, alphai  REAL for sgges; DOUBLE PRECISION for dgges. Arrays, DIMENSION at least max(1, n) each. Contain values that form generalized eigenvalues in real flavors. See beta.
alpha  COMPLEX for cgges; DOUBLE COMPLEX for zgges. Array, DIMENSION at least max(1, n). Contains values that form generalized eigenvalues in complex flavors. See beta.
beta  REAL for sgges, DOUBLE PRECISION for dgges, COMPLEX for cgges, DOUBLE COMPLEX for zgges. Array, DIMENSION at least max(1, n).
For real flavors: On exit, (alphar(j) + alphai(j)*i)/beta(j), j=1,..., n, will be the generalized eigenvalues. alphar(j) + alphai(j)*i and beta(j), j=1,..., n are the diagonals of the complex Schur form (S,T) that would result if the 2-by-2 diagonal blocks of the real generalized Schur form of (A,B) were further reduced to triangular form using complex unitary transformations. If alphai(j) is zero, then the j-th eigenvalue is real; if positive, then the j-th and (j+1)-st eigenvalues are a complex conjugate pair, with alphai(j+1) negative.
For complex flavors: On exit, alpha(j)/beta(j), j=1,..., n, will be the generalized eigenvalues. alpha(j), j=1,..., n, and beta(j), j=1,..., n are the diagonals of the complex Schur form (S,T) output by cgges/zgges. The beta(j) will be non-negative real.
See also Application Notes below.
vsl, vsr  REAL for sgges, DOUBLE PRECISION for dgges, COMPLEX for cgges, DOUBLE COMPLEX for zgges. Arrays:
vsl(ldvsl,*), the second dimension of vsl must be at least max(1, n). If jobvsl = 'V', this array will contain the left Schur vectors. If jobvsl = 'N', vsl is not referenced.
vsr(ldvsr,*), the second dimension of vsr must be at least max(1, n). If jobvsr = 'V', this array will contain the right Schur vectors. If jobvsr = 'N', vsr is not referenced.
work(1)  On exit, if info = 0, then work(1) returns the required minimal size of lwork.
info  INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value. If info = i, and
  i ≤ n: the QZ iteration failed. (A, B) is not in Schur form, but alphar(j), alphai(j) (for real flavors), or alpha(j) (for complex flavors), and beta(j), j=info+1,..., n should be correct.
  i > n: errors that usually indicate LAPACK problems:
    i = n+1: other than QZ iteration failed in ?hgeqz;
    i = n+2: after reordering, roundoff changed values of some complex eigenvalues so that leading eigenvalues in the generalized Schur form no longer satisfy selctg = .TRUE.. This could also be caused by scaling;
    i = n+3: reordering failed in ?tgsen.

Fortran 95 Interface Notes

Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.

Specific details for the routine gges interface are the following:

a  Holds the matrix A of size (n, n).
b  Holds the matrix B of size (n, n).
alphar  Holds the vector of length n.
Used in real flavors only.
alphai  Holds the vector of length n. Used in real flavors only.
alpha  Holds the vector of length n. Used in complex flavors only.
beta  Holds the vector of length n.
vsl  Holds the matrix VSL of size (n, n).
vsr  Holds the matrix VSR of size (n, n).
jobvsl  Restored based on the presence of the argument vsl as follows:
  jobvsl = 'V', if vsl is present,
  jobvsl = 'N', if vsl is omitted.
jobvsr  Restored based on the presence of the argument vsr as follows:
  jobvsr = 'V', if vsr is present,
  jobvsr = 'N', if vsr is omitted.
sort  Restored based on the presence of the argument select as follows:
  sort = 'S', if select is present,
  sort = 'N', if select is omitted.

Application Notes

If you are in doubt how much workspace to supply, use a generous value of lwork for the first run or set lwork = -1.

If you choose the first option and set any of admissible lwork sizes, which is no less than the minimal value described, the routine completes the task, though probably not so fast as with a recommended workspace, and provides the recommended workspace in the first element of the corresponding array work on exit. Use this value (work(1)) for subsequent runs.

If you set lwork = -1, the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work). This operation is called a workspace query.

Note that if you set lwork to less than the minimal required value and not -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.

The quotients alphar(j)/beta(j) and alphai(j)/beta(j) may easily over- or underflow, and beta(j) may even be zero. Thus, you should avoid simply computing the ratio. However, alphar and alphai will be always less than and usually comparable with norm(A) in magnitude, and beta always less than and usually comparable with norm(B).
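The advice above against blindly forming alphar(j)/beta(j) and alphai(j)/beta(j) can be illustrated with a guarded division. The following C sketch is not part of Intel MKL: the names gev_classify and gev_quotient are hypothetical, and the thresholds tol_alpha and tol_beta are assumptions that the caller would choose, for example relative to norm(A) and norm(B).

```c
#include <float.h>

typedef enum { GEV_FINITE, GEV_INFINITE, GEV_INDETERMINATE } gev_kind;

static double dabs(double x) { return x < 0.0 ? -x : x; }
static double dmax(double x, double y) { return x > y ? x : y; }

/* Hypothetical helper: classify the pair (alpha, beta) for a real flavor,
 * where alpha = alphar + alphai*i. tol_alpha and tol_beta are
 * caller-chosen thresholds below which a value is treated as zero. */
gev_kind gev_classify(double alphar, double alphai, double beta,
                      double tol_alpha, double tol_beta)
{
    double amag = dmax(dabs(alphar), dabs(alphai));
    if (dabs(beta) <= tol_beta)
        return (amag <= tol_alpha) ? GEV_INDETERMINATE : GEV_INFINITE;
    return GEV_FINITE;
}

/* Hypothetical helper: form alpha/beta only when neither component of
 * the quotient can overflow. Returns 1 and stores the result on success,
 * 0 when the division is unsafe (beta is zero or the quotient overflows). */
int gev_quotient(double alphar, double alphai, double beta,
                 double *re, double *im)
{
    double amag = dmax(dabs(alphar), dabs(alphai));
    double bmag = dabs(beta);

    if (bmag == 0.0)
        return 0;
    if (bmag < 1.0 && amag > bmag * DBL_MAX)
        return 0;
    *re = alphar / beta;
    *im = alphai / beta;
    return 1;
}
```

A caller would typically keep the pair (alpha, beta) as returned and call a check such as gev_quotient only at the point where an explicit eigenvalue is really needed.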
?ggesx

Computes the generalized eigenvalues, Schur form, and, optionally, the left and/or right matrices of Schur vectors.

Syntax

Fortran 77:
call sggesx(jobvsl, jobvsr, sort, selctg, sense, n, a, lda, b, ldb, sdim, alphar, alphai, beta, vsl, ldvsl, vsr, ldvsr, rconde, rcondv, work, lwork, iwork, liwork, bwork, info)
call dggesx(jobvsl, jobvsr, sort, selctg, sense, n, a, lda, b, ldb, sdim, alphar, alphai, beta, vsl, ldvsl, vsr, ldvsr, rconde, rcondv, work, lwork, iwork, liwork, bwork, info)
call cggesx(jobvsl, jobvsr, sort, selctg, sense, n, a, lda, b, ldb, sdim, alpha, beta, vsl, ldvsl, vsr, ldvsr, rconde, rcondv, work, lwork, rwork, iwork, liwork, bwork, info)
call zggesx(jobvsl, jobvsr, sort, selctg, sense, n, a, lda, b, ldb, sdim, alpha, beta, vsl, ldvsl, vsr, ldvsr, rconde, rcondv, work, lwork, rwork, iwork, liwork, bwork, info)

Fortran 95:
call ggesx(a, b, alphar, alphai, beta [,vsl] [,vsr] [,select] [,sdim] [,rconde] [,rcondv] [,info])
call ggesx(a, b, alpha, beta [,vsl] [,vsr] [,select] [,sdim] [,rconde] [,rcondv] [,info])

C:
lapack_int LAPACKE_sggesx( int matrix_order, char jobvsl, char jobvsr, char sort, LAPACK_S_SELECT3 select, char sense, lapack_int n, float* a, lapack_int lda, float* b, lapack_int ldb, lapack_int* sdim, float* alphar, float* alphai, float* beta, float* vsl, lapack_int ldvsl, float* vsr, lapack_int ldvsr, float* rconde, float* rcondv );
lapack_int LAPACKE_dggesx( int matrix_order, char jobvsl, char jobvsr, char sort, LAPACK_D_SELECT3 select, char sense, lapack_int n, double* a, lapack_int lda, double* b, lapack_int ldb, lapack_int* sdim, double* alphar, double* alphai, double* beta, double* vsl, lapack_int ldvsl, double* vsr, lapack_int ldvsr, double* rconde, double* rcondv );
lapack_int LAPACKE_cggesx( int matrix_order, char jobvsl, char jobvsr, char sort, LAPACK_C_SELECT2 select, char sense, lapack_int n, lapack_complex_float* a, lapack_int lda, lapack_complex_float* b, lapack_int ldb, lapack_int* sdim, lapack_complex_float*
alpha, lapack_complex_float* beta, lapack_complex_float* vsl, lapack_int ldvsl, lapack_complex_float* vsr, lapack_int ldvsr, float* rconde, float* rcondv );
lapack_int LAPACKE_zggesx( int matrix_order, char jobvsl, char jobvsr, char sort, LAPACK_Z_SELECT2 select, char sense, lapack_int n, lapack_complex_double* a, lapack_int lda, lapack_complex_double* b, lapack_int ldb, lapack_int* sdim, lapack_complex_double* alpha, lapack_complex_double* beta, lapack_complex_double* vsl, lapack_int ldvsl, lapack_complex_double* vsr, lapack_int ldvsr, double* rconde, double* rcondv );

Include Files

• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description

The routine computes for a pair of n-by-n real/complex nonsymmetric matrices (A,B), the generalized eigenvalues, the generalized real/complex Schur form (S,T), and, optionally, the left and/or right matrices of Schur vectors (vsl and vsr). This gives the generalized Schur factorization

(A,B) = ( vsl*S*vsr^H, vsl*T*vsr^H )

Optionally, it also orders the eigenvalues so that a selected cluster of eigenvalues appears in the leading diagonal blocks of the upper quasi-triangular matrix S and the upper triangular matrix T; computes a reciprocal condition number for the average of the selected eigenvalues (rconde); and computes a reciprocal condition number for the right and left deflating subspaces corresponding to the selected eigenvalues (rcondv). The leading columns of vsl and vsr then form an orthonormal/unitary basis for the corresponding left and right eigenspaces (deflating subspaces).

A generalized eigenvalue for a pair of matrices (A,B) is a scalar w or a ratio alpha / beta = w, such that A - w*B is singular. It is usually represented as the pair (alpha, beta), as there is a reasonable interpretation for beta = 0 or for both being zero.
A pair of matrices (S,T) is in generalized real Schur form if T is upper triangular with non-negative diagonal and S is block upper triangular with 1-by-1 and 2-by-2 blocks. 1-by-1 blocks correspond to real generalized eigenvalues, while 2-by-2 blocks of S will be "standardized" by making the corresponding elements of T have the form:

    [ a  0 ]
    [ 0  b ]

and the pair of corresponding 2-by-2 blocks in S and T will have a complex conjugate pair of generalized eigenvalues. A pair of matrices (S,T) is in generalized complex Schur form if S and T are upper triangular and, in addition, the diagonal elements of T are non-negative real numbers.

Input Parameters

The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

jobvsl  CHARACTER*1. Must be 'N' or 'V'. If jobvsl = 'N', then the left Schur vectors are not computed. If jobvsl = 'V', then the left Schur vectors are computed.
jobvsr  CHARACTER*1. Must be 'N' or 'V'. If jobvsr = 'N', then the right Schur vectors are not computed. If jobvsr = 'V', then the right Schur vectors are computed.
sort  CHARACTER*1. Must be 'N' or 'S'. Specifies whether or not to order the eigenvalues on the diagonal of the generalized Schur form. If sort = 'N', then eigenvalues are not ordered. If sort = 'S', eigenvalues are ordered (see selctg).
selctg  LOGICAL FUNCTION of three REAL arguments for real flavors. LOGICAL FUNCTION of two COMPLEX arguments for complex flavors. selctg must be declared EXTERNAL in the calling subroutine. If sort = 'S', selctg is used to select eigenvalues to sort to the top left of the Schur form. If sort = 'N', selctg is not referenced.
For real flavors: An eigenvalue (alphar(j) + alphai(j))/beta(j) is selected if selctg(alphar(j), alphai(j), beta(j)) is true; that is, if either one of a complex conjugate pair of eigenvalues is selected, then both complex eigenvalues are selected. Note that in the ill-conditioned case, a selected complex eigenvalue may no longer satisfy selctg(alphar(j), alphai(j), beta(j)) = .TRUE. after ordering. In this case info is set to n+2.
For complex flavors: An eigenvalue alpha(j) / beta(j) is selected if selctg(alpha(j), beta(j)) is true. Note that a selected complex eigenvalue may no longer satisfy selctg(alpha(j), beta(j)) = .TRUE. after ordering, since ordering may change the value of complex eigenvalues (especially if the eigenvalue is ill-conditioned); in this case info is set to n+2 (see info below).
sense  CHARACTER*1. Must be 'N', 'E', 'V', or 'B'. Determines which reciprocal condition numbers are computed.
  If sense = 'N', none are computed;
  If sense = 'E', computed for average of selected eigenvalues only;
  If sense = 'V', computed for selected deflating subspaces only;
  If sense = 'B', computed for both.
  If sense is 'E', 'V', or 'B', then sort must equal 'S'.
n  INTEGER. The order of the matrices A, B, vsl, and vsr (n ≥ 0).
a, b, work  REAL for sggesx, DOUBLE PRECISION for dggesx, COMPLEX for cggesx, DOUBLE COMPLEX for zggesx. Arrays:
a(lda,*) is an array containing the n-by-n matrix A (first of the pair of matrices). The second dimension of a must be at least max(1, n).
b(ldb,*) is an array containing the n-by-n matrix B (second of the pair of matrices). The second dimension of b must be at least max(1, n).
work is a workspace array, its dimension max(1, lwork).
lda  INTEGER. The leading dimension of the array a. Must be at least max(1, n).
ldb  INTEGER. The leading dimension of the array b. Must be at least max(1, n).
ldvsl, ldvsr  INTEGER. The leading dimensions of the output matrices vsl and vsr, respectively. Constraints: ldvsl ≥ 1.
If jobvsl = 'V', ldvsl ≥ max(1, n). ldvsr ≥ 1. If jobvsr = 'V', ldvsr ≥ max(1, n).
lwork  INTEGER. The dimension of the array work.
For real flavors:
  If n = 0, then lwork ≥ 1.
  If n > 0 and sense = 'N', then lwork ≥ max(8*n, 6*n+16).
  If n > 0 and sense = 'E', 'V', or 'B', then lwork ≥ max(8*n, 6*n+16, 2*sdim*(n-sdim)).
For complex flavors:
  If n = 0, then lwork ≥ 1.
  If n > 0 and sense = 'N', then lwork ≥ max(1, 2*n).
  If n > 0 and sense = 'E', 'V', or 'B', then lwork ≥ max(1, 2*n, 2*sdim*(n-sdim)).
Note that 2*sdim*(n-sdim) ≤ n*n/2. An error is only returned if lwork < max(8*n, 6*n+16) for real flavors, and lwork < max(1, 2*n) for complex flavors, but if sense = 'E', 'V', or 'B', this may not be large enough.
If lwork = -1, then a workspace query is assumed; the routine only calculates the bound on the optimal size of the work array and the minimum size of the iwork array, returns these values as the first entries of the work and iwork arrays, and no error message related to lwork or liwork is issued by xerbla.
rwork  REAL for cggesx, DOUBLE PRECISION for zggesx. Workspace array, DIMENSION at least max(1, 8n). This array is used in complex flavors only.
iwork  INTEGER. Workspace array, DIMENSION max(1, liwork).
liwork  INTEGER. The dimension of the array iwork. If sense = 'N', or n = 0, then liwork ≥ 1, otherwise liwork ≥ (n+6) for real flavors, and liwork ≥ (n+2) for complex flavors.
If liwork = -1, then a workspace query is assumed; the routine only calculates the bound on the optimal size of the work array and the minimum size of the iwork array, returns these values as the first entries of the work and iwork arrays, and no error message related to lwork or liwork is issued by xerbla.
bwork  LOGICAL. Workspace array, DIMENSION at least max(1, n). Not referenced if sort = 'N'.

Output Parameters

a  On exit, this array has been overwritten by its generalized Schur form S.
b  On exit, this array has been overwritten by its generalized Schur form T.
sdim  INTEGER. If sort = 'N', sdim = 0. If sort = 'S', sdim is equal to the number of eigenvalues (after sorting) for which selctg is true. Note that for real flavors complex conjugate pairs for which selctg is true for either eigenvalue count as 2.
alphar, alphai  REAL for sggesx; DOUBLE PRECISION for dggesx. Arrays, DIMENSION at least max(1, n) each. Contain values that form generalized eigenvalues in real flavors. See beta.
alpha  COMPLEX for cggesx; DOUBLE COMPLEX for zggesx. Array, DIMENSION at least max(1, n). Contains values that form generalized eigenvalues in complex flavors. See beta.
beta  REAL for sggesx, DOUBLE PRECISION for dggesx, COMPLEX for cggesx, DOUBLE COMPLEX for zggesx. Array, DIMENSION at least max(1, n).
For real flavors: On exit, (alphar(j) + alphai(j)*i)/beta(j), j=1,..., n will be the generalized eigenvalues. alphar(j) + alphai(j)*i and beta(j), j=1,..., n are the diagonals of the complex Schur form (S,T) that would result if the 2-by-2 diagonal blocks of the real generalized Schur form of (A,B) were further reduced to triangular form using complex unitary transformations. If alphai(j) is zero, then the j-th eigenvalue is real; if positive, then the j-th and (j+1)-st eigenvalues are a complex conjugate pair, with alphai(j+1) negative.
For complex flavors: On exit, alpha(j)/beta(j), j=1,..., n will be the generalized eigenvalues. alpha(j), j=1,..., n, and beta(j), j=1,..., n are the diagonals of the complex Schur form (S,T) output by cggesx/zggesx. The beta(j) will be non-negative real.
See also Application Notes below.
vsl, vsr  REAL for sggesx, DOUBLE PRECISION for dggesx, COMPLEX for cggesx, DOUBLE COMPLEX for zggesx. Arrays:
vsl(ldvsl,*), the second dimension of vsl must be at least max(1, n). If jobvsl = 'V', this array will contain the left Schur vectors. If jobvsl = 'N', vsl is not referenced.
vsr(ldvsr,*), the second dimension of vsr must be at least max(1, n).
If jobvsr = 'V', this array will contain the right Schur vectors. If jobvsr = 'N', vsr is not referenced.
rconde, rcondv REAL for single precision flavors; DOUBLE PRECISION for double precision flavors. Arrays, DIMENSION (2) each.
If sense = 'E' or 'B', rconde(1) and rconde(2) contain the reciprocal condition numbers for the average of the selected eigenvalues. Not referenced if sense = 'N' or 'V'.
If sense = 'V' or 'B', rcondv(1) and rcondv(2) contain the reciprocal condition numbers for the selected deflating subspaces. Not referenced if sense = 'N' or 'E'.
work(1) On exit, if info = 0, then work(1) returns the required minimal size of lwork.
iwork(1) On exit, if info = 0, then iwork(1) returns the required minimal size of liwork.
info INTEGER.
If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
If info = i, and
i ≤ n: the QZ iteration failed. (A, B) is not in Schur form, but alphar(j), alphai(j) (for real flavors), or alpha(j) (for complex flavors), and beta(j), j = info+1,..., n should be correct.
i > n: errors that usually indicate LAPACK problems:
i = n+1: something other than the QZ iteration failed in ?hgeqz;
i = n+2: after reordering, roundoff changed values of some complex eigenvalues so that leading eigenvalues in the generalized Schur form no longer satisfy selctg = .TRUE.. This could also be caused by scaling;
i = n+3: reordering failed in ?tgsen.
Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine ggesx interface are the following:
a Holds the matrix A of size (n, n).
b Holds the matrix B of size (n, n).
alphar Holds the vector of length n. Used in real flavors only.
alphai Holds the vector of length n. Used in real flavors only.
alpha Holds the vector of length n. Used in complex flavors only.
beta Holds the vector of length n.
vsl Holds the matrix VSL of size (n, n).
vsr Holds the matrix VSR of size (n, n).
rconde Holds the vector of length (2).
rcondv Holds the vector of length (2).
jobvsl Restored based on the presence of the argument vsl as follows: jobvsl = 'V', if vsl is present, jobvsl = 'N', if vsl is omitted.
jobvsr Restored based on the presence of the argument vsr as follows: jobvsr = 'V', if vsr is present, jobvsr = 'N', if vsr is omitted.
sort Restored based on the presence of the argument select as follows: sort = 'S', if select is present, sort = 'N', if select is omitted.
sense Restored based on the presence of arguments rconde and rcondv as follows: sense = 'B', if both rconde and rcondv are present, sense = 'E', if rconde is present and rcondv omitted, sense = 'V', if rconde is omitted and rcondv present, sense = 'N', if both rconde and rcondv are omitted.
Note that there will be an error condition if rconde or rcondv are present and select is omitted.
Application Notes
If you are in doubt how much workspace to supply, use a generous value of lwork (or liwork) for the first run or set lwork = -1 (liwork = -1).
If you choose the first option and set any admissible lwork (or liwork) size, that is, one no less than the minimal value described, the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the corresponding array (work, iwork) on exit. Use this value (work(1), iwork(1)) for subsequent runs.
If you set lwork = -1, the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work, iwork). This operation is called a workspace query.
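The lower bounds on lwork and liwork listed under Input Parameters for ?ggesx can be tabulated in a small helper. The following is only a sketch: the function names ggesx_min_lwork and ggesx_min_liwork are hypothetical and are not part of Intel MKL or LAPACK. Because sdim is known only on exit, the sketch assumes the worst case and substitutes the bound 2*sdim*(n-sdim) ≤ n*n/2 stated above.

```c
#include <assert.h>

/* Hypothetical helpers, not part of Intel MKL or LAPACK. */

/* Larger of two values. */
static long max_long(long a, long b) { return a > b ? a : b; }

/* Worst-case minimal lwork for ?ggesx.
   is_complex: 0 for sggesx/dggesx, nonzero for cggesx/zggesx.
   sense: 'N', 'E', 'V', or 'B'.
   Assumption: n*n/2 replaces 2*sdim*(n-sdim), since sdim is an output. */
static long ggesx_min_lwork(int is_complex, long n, char sense)
{
    long need;
    assert(n >= 0);
    if (n == 0)
        return 1;
    need = is_complex ? max_long(1, 2 * n) : max_long(8 * n, 6 * n + 16);
    if (sense != 'N')
        need = max_long(need, (n * n) / 2);
    return need;
}

/* Minimal liwork for ?ggesx, following the liwork description. */
static long ggesx_min_liwork(int is_complex, long n, char sense)
{
    assert(n >= 0);
    if (sense == 'N' || n == 0)
        return 1;
    return is_complex ? n + 2 : n + 6;
}
```

These values are minimal sizes only; a workspace query (lwork = -1) remains the way to obtain the size the routine itself recommends.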
Note that if you set lwork (liwork) to less than the minimal required value and not -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.
The quotients alphar(j)/beta(j) and alphai(j)/beta(j) may easily over- or underflow, and beta(j) may even be zero. Thus, you should avoid simply computing the ratio. However, alphar and alphai will always be less than and usually comparable with norm(A) in magnitude, and beta always less than and usually comparable with norm(B).
?ggev
Computes the generalized eigenvalues, and the left and/or right generalized eigenvectors for a pair of nonsymmetric matrices.
Syntax
Fortran 77:
call sggev(jobvl, jobvr, n, a, lda, b, ldb, alphar, alphai, beta, vl, ldvl, vr, ldvr, work, lwork, info)
call dggev(jobvl, jobvr, n, a, lda, b, ldb, alphar, alphai, beta, vl, ldvl, vr, ldvr, work, lwork, info)
call cggev(jobvl, jobvr, n, a, lda, b, ldb, alpha, beta, vl, ldvl, vr, ldvr, work, lwork, rwork, info)
call zggev(jobvl, jobvr, n, a, lda, b, ldb, alpha, beta, vl, ldvl, vr, ldvr, work, lwork, rwork, info)
Fortran 95:
call ggev(a, b, alphar, alphai, beta [,vl] [,vr] [,info])
call ggev(a, b, alpha, beta [,vl] [,vr] [,info])
C:
lapack_int LAPACKE_sggev( int matrix_order, char jobvl, char jobvr, lapack_int n, float* a, lapack_int lda, float* b, lapack_int ldb, float* alphar, float* alphai, float* beta, float* vl, lapack_int ldvl, float* vr, lapack_int ldvr );
lapack_int LAPACKE_dggev( int matrix_order, char jobvl, char jobvr, lapack_int n, double* a, lapack_int lda, double* b, lapack_int ldb, double* alphar, double* alphai, double* beta, double* vl, lapack_int ldvl, double* vr, lapack_int ldvr );
lapack_int LAPACKE_cggev( int matrix_order, char jobvl, char jobvr, lapack_int n, lapack_complex_float* a, lapack_int lda, lapack_complex_float* b, lapack_int ldb, lapack_complex_float* alpha, lapack_complex_float* beta, lapack_complex_float*
vl, lapack_int ldvl, lapack_complex_float* vr, lapack_int ldvr );
lapack_int LAPACKE_zggev( int matrix_order, char jobvl, char jobvr, lapack_int n, lapack_complex_double* a, lapack_int lda, lapack_complex_double* b, lapack_int ldb, lapack_complex_double* alpha, lapack_complex_double* beta, lapack_complex_double* vl, lapack_int ldvl, lapack_complex_double* vr, lapack_int ldvr );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The ?ggev routine computes the generalized eigenvalues, and optionally, the left and/or right generalized eigenvectors for a pair of n-by-n real/complex nonsymmetric matrices (A,B).
A generalized eigenvalue for a pair of matrices (A,B) is a scalar λ or a ratio alpha / beta = λ, such that A - λ*B is singular. It is usually represented as the pair (alpha, beta), as there is a reasonable interpretation for beta = 0 and even for both being zero.
The right generalized eigenvector v(j) corresponding to the generalized eigenvalue λ(j) of (A,B) satisfies
A*v(j) = λ(j)*B*v(j).
The left generalized eigenvector u(j) corresponding to the generalized eigenvalue λ(j) of (A,B) satisfies
u(j)^H*A = λ(j)*u(j)^H*B
where u(j)^H denotes the conjugate transpose of u(j).
The ?ggev routine replaces the deprecated ?gegv routine.
Input Parameters
The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
jobvl CHARACTER*1. Must be 'N' or 'V'. If jobvl = 'N', the left generalized eigenvectors are not computed; if jobvl = 'V', the left generalized eigenvectors are computed.
jobvr CHARACTER*1. Must be 'N' or 'V'. If jobvr = 'N', the right generalized eigenvectors are not computed; if jobvr = 'V', the right generalized eigenvectors are computed.
n INTEGER. The order of the matrices A, B, vl, and vr (n ≥ 0).
a, b, work REAL for sggev; DOUBLE PRECISION for dggev; COMPLEX for cggev; DOUBLE COMPLEX for zggev. Arrays:
a(lda,*) is an array containing the n-by-n matrix A (first of the pair of matrices). The second dimension of a must be at least max(1, n).
b(ldb,*) is an array containing the n-by-n matrix B (second of the pair of matrices). The second dimension of b must be at least max(1, n).
work is a workspace array, its dimension max(1, lwork).
lda INTEGER. The leading dimension of the array a. Must be at least max(1, n).
ldb INTEGER. The leading dimension of the array b. Must be at least max(1, n).
ldvl, ldvr INTEGER. The leading dimensions of the output matrices vl and vr, respectively. Constraints:
ldvl ≥ 1. If jobvl = 'V', ldvl ≥ max(1, n).
ldvr ≥ 1. If jobvr = 'V', ldvr ≥ max(1, n).
lwork INTEGER. The dimension of the array work. lwork ≥ max(1, 8n+16) for real flavors; lwork ≥ max(1, 2n) for complex flavors. For good performance, lwork must generally be larger. If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla.
rwork REAL for cggev; DOUBLE PRECISION for zggev. Workspace array, DIMENSION at least max(1, 8n). This array is used in complex flavors only.
Output Parameters
a, b On exit, these arrays have been overwritten.
alphar, alphai REAL for sggev; DOUBLE PRECISION for dggev. Arrays, DIMENSION at least max(1, n) each. Contain values that form generalized eigenvalues in real flavors. See beta.
alpha COMPLEX for cggev; DOUBLE COMPLEX for zggev. Array, DIMENSION at least max(1, n). Contains values that form generalized eigenvalues in complex flavors. See beta.
beta REAL for sggev; DOUBLE PRECISION for dggev; COMPLEX for cggev; DOUBLE COMPLEX for zggev. Array, DIMENSION at least max(1, n).
For real flavors: On exit, (alphar(j) + alphai(j)*i)/beta(j), j=1,..., n, are the generalized eigenvalues. If alphai(j) is zero, then the j-th eigenvalue is real; if positive, then the j-th and (j+1)-st eigenvalues are a complex conjugate pair, with alphai(j+1) negative.
For complex flavors: On exit, alpha(j)/beta(j), j=1,..., n, are the generalized eigenvalues.
See also Application Notes below.
vl, vr REAL for sggev; DOUBLE PRECISION for dggev; COMPLEX for cggev; DOUBLE COMPLEX for zggev. Arrays:
vl(ldvl,*); the second dimension of vl must be at least max(1, n). If jobvl = 'V', the left generalized eigenvectors u(j) are stored one after another in the columns of vl, in the same order as their eigenvalues. Each eigenvector is scaled so the largest component has abs(Re) + abs(Im) = 1. If jobvl = 'N', vl is not referenced.
For real flavors: If the j-th eigenvalue is real, then u(j) = vl(:,j), the j-th column of vl. If the j-th and (j+1)-st eigenvalues form a complex conjugate pair, then u(j) = vl(:,j) + i*vl(:,j+1) and u(j+1) = vl(:,j) - i*vl(:,j+1), where i = sqrt(-1).
For complex flavors: u(j) = vl(:,j), the j-th column of vl.
vr(ldvr,*); the second dimension of vr must be at least max(1, n). If jobvr = 'V', the right generalized eigenvectors v(j) are stored one after another in the columns of vr, in the same order as their eigenvalues. Each eigenvector is scaled so the largest component has abs(Re) + abs(Im) = 1. If jobvr = 'N', vr is not referenced.
For real flavors: If the j-th eigenvalue is real, then v(j) = vr(:,j), the j-th column of vr. If the j-th and (j+1)-st eigenvalues form a complex conjugate pair, then v(j) = vr(:,j) + i*vr(:,j+1) and v(j+1) = vr(:,j) - i*vr(:,j+1).
For complex flavors: v(j) = vr(:,j), the j-th column of vr.
work(1) On exit, if info = 0, then work(1) returns the required minimal size of lwork.
info INTEGER.
If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
If info = i, and
i ≤ n: the QZ iteration failed. No eigenvectors have been calculated, but alphar(j), alphai(j) (for real flavors), or alpha(j) (for complex flavors), and beta(j), j = info+1,..., n should be correct.
i > n: errors that usually indicate LAPACK problems:
i = n+1: something other than the QZ iteration failed in ?hgeqz;
i = n+2: error return from ?tgevc.
Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine ggev interface are the following:
a Holds the matrix A of size (n, n).
b Holds the matrix B of size (n, n).
alphar Holds the vector of length n. Used in real flavors only.
alphai Holds the vector of length n. Used in real flavors only.
alpha Holds the vector of length n. Used in complex flavors only.
beta Holds the vector of length n.
vl Holds the matrix VL of size (n, n).
vr Holds the matrix VR of size (n, n).
jobvl Restored based on the presence of the argument vl as follows: jobvl = 'V', if vl is present, jobvl = 'N', if vl is omitted.
jobvr Restored based on the presence of the argument vr as follows: jobvr = 'V', if vr is present, jobvr = 'N', if vr is omitted.
Application Notes
If you are in doubt how much workspace to supply, use a generous value of lwork for the first run or set lwork = -1.
If you choose the first option and set any admissible lwork size, that is, one no less than the minimal value described, the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the corresponding array work on exit. Use this value (work(1)) for subsequent runs.
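The real-flavor output convention described under Output Parameters (alphar, alphai, beta) can be illustrated with a short sketch. The type gen_eig and the functions classify_eig and count_conjugate_pairs are hypothetical names introduced here, not MKL interfaces. The sketch assumes the usual reading that beta = 0 with a nonzero alpha denotes an infinite eigenvalue, and it only guards the beta = 0 case; the quotient can still overflow or underflow for very small nonzero beta.

```c
#include <assert.h>

/* Hypothetical classification of one (alphar, alphai, beta) triple. */
typedef enum { EIG_REAL, EIG_COMPLEX, EIG_INFINITE, EIG_UNDEFINED } eig_kind;

typedef struct {
    eig_kind kind;
    double re, im; /* meaningful only for EIG_REAL and EIG_COMPLEX */
} gen_eig;

/* Form the quotient only when beta is nonzero. */
static gen_eig classify_eig(double alphar, double alphai, double beta)
{
    gen_eig e = { EIG_UNDEFINED, 0.0, 0.0 };
    if (beta == 0.0) {
        /* Assumption: alpha != 0 with beta == 0 is treated as infinite;
           alpha == beta == 0 is left undefined. */
        if (alphar != 0.0 || alphai != 0.0)
            e.kind = EIG_INFINITE;
        return e;
    }
    e.re = alphar / beta;
    e.im = alphai / beta;
    e.kind = (alphai == 0.0) ? EIG_REAL : EIG_COMPLEX;
    return e;
}

/* Count complex conjugate pairs: a positive alphai[j] marks the first
   member of a pair, and alphai[j+1] is then negative. */
static int count_conjugate_pairs(int n, const double *alphai)
{
    int j, pairs = 0;
    assert(n >= 0);
    for (j = 0; j < n; ++j) {
        if (alphai[j] > 0.0 && j + 1 < n) {
            ++pairs;
            ++j; /* skip the conjugate partner */
        }
    }
    return pairs;
}
```

Keeping the pair (alpha, beta) and classifying it, rather than dividing unconditionally, follows the manual's remark that the eigenvalue is usually represented as a pair.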
If you set lwork = -1, the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work). This operation is called a workspace query.
Note that if you set lwork to less than the minimal required value and not -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.
The quotients alphar(j)/beta(j) and alphai(j)/beta(j) may easily over- or underflow, and beta(j) may even be zero. Thus, you should avoid simply computing the ratio. However, alphar and alphai (for real flavors) or alpha (for complex flavors) will always be less than and usually comparable with norm(A) in magnitude, and beta always less than and usually comparable with norm(B).
?ggevx
Computes the generalized eigenvalues, and, optionally, the left and/or right generalized eigenvectors.
Syntax
Fortran 77:
call sggevx(balanc, jobvl, jobvr, sense, n, a, lda, b, ldb, alphar, alphai, beta, vl, ldvl, vr, ldvr, ilo, ihi, lscale, rscale, abnrm, bbnrm, rconde, rcondv, work, lwork, iwork, bwork, info)
call dggevx(balanc, jobvl, jobvr, sense, n, a, lda, b, ldb, alphar, alphai, beta, vl, ldvl, vr, ldvr, ilo, ihi, lscale, rscale, abnrm, bbnrm, rconde, rcondv, work, lwork, iwork, bwork, info)
call cggevx(balanc, jobvl, jobvr, sense, n, a, lda, b, ldb, alpha, beta, vl, ldvl, vr, ldvr, ilo, ihi, lscale, rscale, abnrm, bbnrm, rconde, rcondv, work, lwork, rwork, iwork, bwork, info)
call zggevx(balanc, jobvl, jobvr, sense, n, a, lda, b, ldb, alpha, beta, vl, ldvl, vr, ldvr, ilo, ihi, lscale, rscale, abnrm, bbnrm, rconde, rcondv, work, lwork, rwork, iwork, bwork, info)
Fortran 95:
call ggevx(a, b, alphar, alphai, beta [,vl] [,vr] [,balanc] [,ilo] [,ihi] [,lscale] [,rscale] [,abnrm] [,bbnrm] [,rconde] [,rcondv] [,info])
call ggevx(a, b, alpha, beta [,vl] [,vr] [,balanc] [,ilo] [,ihi] [,lscale] [,rscale] [,abnrm] [,bbnrm] [,rconde] [,rcondv] [,info])
C:
lapack_int LAPACKE_sggevx( int matrix_order, char balanc, char jobvl, char jobvr, char sense, lapack_int n, float* a, lapack_int lda, float* b, lapack_int ldb, float* alphar, float* alphai, float* beta, float* vl, lapack_int ldvl, float* vr, lapack_int ldvr, lapack_int* ilo, lapack_int* ihi, float* lscale, float* rscale, float* abnrm, float* bbnrm, float* rconde, float* rcondv );
lapack_int LAPACKE_dggevx( int matrix_order, char balanc, char jobvl, char jobvr, char sense, lapack_int n, double* a, lapack_int lda, double* b, lapack_int ldb, double* alphar, double* alphai, double* beta, double* vl, lapack_int ldvl, double* vr, lapack_int ldvr, lapack_int* ilo, lapack_int* ihi, double* lscale, double* rscale, double* abnrm, double* bbnrm, double* rconde, double* rcondv );
lapack_int LAPACKE_cggevx( int matrix_order, char balanc, char jobvl, char jobvr, char sense, lapack_int n, lapack_complex_float* a, lapack_int lda, lapack_complex_float* b, lapack_int ldb, lapack_complex_float* alpha, lapack_complex_float* beta, lapack_complex_float* vl, lapack_int ldvl, lapack_complex_float* vr, lapack_int ldvr, lapack_int* ilo, lapack_int* ihi, float* lscale, float* rscale, float* abnrm, float* bbnrm, float* rconde, float* rcondv );
lapack_int LAPACKE_zggevx( int matrix_order, char balanc, char jobvl, char jobvr, char sense, lapack_int n, lapack_complex_double* a, lapack_int lda, lapack_complex_double* b, lapack_int ldb, lapack_complex_double* alpha, lapack_complex_double* beta, lapack_complex_double* vl, lapack_int ldvl, lapack_complex_double* vr, lapack_int ldvr, lapack_int* ilo, lapack_int* ihi, double* lscale, double* rscale, double* abnrm, double* bbnrm, double* rconde, double* rcondv );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine computes for a pair of n-by-n real/complex nonsymmetric matrices (A,B), the generalized eigenvalues, and optionally, the left and/or right generalized eigenvectors.
Optionally also, it computes a balancing transformation to improve the conditioning of the eigenvalues and eigenvectors (ilo, ihi, lscale, rscale, abnrm, and bbnrm), reciprocal condition numbers for the eigenvalues (rconde), and reciprocal condition numbers for the right eigenvectors (rcondv).
A generalized eigenvalue for a pair of matrices (A,B) is a scalar λ or a ratio alpha / beta = λ, such that A - λ*B is singular. It is usually represented as the pair (alpha, beta), as there is a reasonable interpretation for beta = 0 and even for both being zero.
The right generalized eigenvector v(j) corresponding to the generalized eigenvalue λ(j) of (A,B) satisfies
A*v(j) = λ(j)*B*v(j).
The left generalized eigenvector u(j) corresponding to the generalized eigenvalue λ(j) of (A,B) satisfies
u(j)^H*A = λ(j)*u(j)^H*B
where u(j)^H denotes the conjugate transpose of u(j).
Input Parameters
The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
balanc CHARACTER*1. Must be 'N', 'P', 'S', or 'B'. Specifies the balance option to be performed.
If balanc = 'N', do not diagonally scale or permute;
If balanc = 'P', permute only;
If balanc = 'S', scale only;
If balanc = 'B', both permute and scale.
Computed reciprocal condition numbers will be for the matrices after balancing and/or permuting. Permuting does not change condition numbers (in exact arithmetic), but balancing does.
jobvl CHARACTER*1. Must be 'N' or 'V'. If jobvl = 'N', the left generalized eigenvectors are not computed; if jobvl = 'V', the left generalized eigenvectors are computed.
jobvr CHARACTER*1. Must be 'N' or 'V'. If jobvr = 'N', the right generalized eigenvectors are not computed; if jobvr = 'V', the right generalized eigenvectors are computed.
sense CHARACTER*1. Must be 'N', 'E', 'V', or 'B'. Determines which reciprocal condition numbers are computed.
If sense = 'N', none are computed;
If sense = 'E', computed for eigenvalues only;
If sense = 'V', computed for eigenvectors only;
If sense = 'B', computed for eigenvalues and eigenvectors.
n INTEGER. The order of the matrices A, B, vl, and vr (n ≥ 0).
a, b, work REAL for sggevx; DOUBLE PRECISION for dggevx; COMPLEX for cggevx; DOUBLE COMPLEX for zggevx. Arrays:
a(lda,*) is an array containing the n-by-n matrix A (first of the pair of matrices). The second dimension of a must be at least max(1, n).
b(ldb,*) is an array containing the n-by-n matrix B (second of the pair of matrices). The second dimension of b must be at least max(1, n).
work is a workspace array, its dimension max(1, lwork).
lda INTEGER. The leading dimension of the array a. Must be at least max(1, n).
ldb INTEGER. The leading dimension of the array b. Must be at least max(1, n).
ldvl, ldvr INTEGER. The leading dimensions of the output matrices vl and vr, respectively. Constraints:
ldvl ≥ 1. If jobvl = 'V', ldvl ≥ max(1, n).
ldvr ≥ 1. If jobvr = 'V', ldvr ≥ max(1, n).
lwork INTEGER. The dimension of the array work. lwork ≥ max(1, 2*n);
For real flavors:
If balanc = 'S' or 'B', or jobvl = 'V', or jobvr = 'V', then lwork ≥ max(1, 6*n);
if sense = 'E' or 'B', then lwork ≥ max(1, 10*n);
if sense = 'V' or 'B', lwork ≥ (2*n^2 + 8*n + 16).
For complex flavors:
if sense = 'E', lwork ≥ max(1, 4*n);
if sense = 'V' or 'B', lwork ≥ max(1, 2*n^2 + 2*n).
If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla.
rwork REAL for cggevx; DOUBLE PRECISION for zggevx. Workspace array, DIMENSION at least max(1, 6*n) if balanc = 'S' or 'B', and at least max(1, 2*n) otherwise.
This array is used in complex flavors only.
iwork INTEGER. Workspace array, DIMENSION at least (n+6) for real flavors and at least (n+2) for complex flavors. Not referenced if sense = 'E'.
bwork LOGICAL. Workspace array, DIMENSION at least max(1, n). Not referenced if sense = 'N'.
Output Parameters
a, b On exit, these arrays have been overwritten. If jobvl = 'V' or jobvr = 'V' or both, then a contains the first part of the real Schur form of the "balanced" versions of the input A and B, and b contains its second part.
alphar, alphai REAL for sggevx; DOUBLE PRECISION for dggevx. Arrays, DIMENSION at least max(1, n) each. Contain values that form generalized eigenvalues in real flavors. See beta.
alpha COMPLEX for cggevx; DOUBLE COMPLEX for zggevx. Array, DIMENSION at least max(1, n). Contains values that form generalized eigenvalues in complex flavors. See beta.
beta REAL for sggevx; DOUBLE PRECISION for dggevx; COMPLEX for cggevx; DOUBLE COMPLEX for zggevx. Array, DIMENSION at least max(1, n).
For real flavors: On exit, (alphar(j) + alphai(j)*i)/beta(j), j=1,..., n, will be the generalized eigenvalues. If alphai(j) is zero, then the j-th eigenvalue is real; if positive, then the j-th and (j+1)-st eigenvalues are a complex conjugate pair, with alphai(j+1) negative.
For complex flavors: On exit, alpha(j)/beta(j), j=1,..., n, will be the generalized eigenvalues.
See also Application Notes below.
vl, vr REAL for sggevx; DOUBLE PRECISION for dggevx; COMPLEX for cggevx; DOUBLE COMPLEX for zggevx. Arrays:
vl(ldvl,*); the second dimension of vl must be at least max(1, n). If jobvl = 'V', the left generalized eigenvectors u(j) are stored one after another in the columns of vl, in the same order as their eigenvalues. Each eigenvector will be scaled so the largest component has abs(Re) + abs(Im) = 1. If jobvl = 'N', vl is not referenced.
For real flavors: If the j-th eigenvalue is real, then u(j) = vl(:,j), the j-th column of vl. If the j-th and (j+1)-st eigenvalues form a complex conjugate pair, then u(j) = vl(:,j) + i*vl(:,j+1) and u(j+1) = vl(:,j) - i*vl(:,j+1), where i = sqrt(-1).
For complex flavors: u(j) = vl(:,j), the j-th column of vl.
vr(ldvr,*); the second dimension of vr must be at least max(1, n). If jobvr = 'V', the right generalized eigenvectors v(j) are stored one after another in the columns of vr, in the same order as their eigenvalues. Each eigenvector will be scaled so the largest component has abs(Re) + abs(Im) = 1. If jobvr = 'N', vr is not referenced.
For real flavors: If the j-th eigenvalue is real, then v(j) = vr(:,j), the j-th column of vr. If the j-th and (j+1)-st eigenvalues form a complex conjugate pair, then v(j) = vr(:,j) + i*vr(:,j+1) and v(j+1) = vr(:,j) - i*vr(:,j+1).
For complex flavors: v(j) = vr(:,j), the j-th column of vr.
ilo, ihi INTEGER. ilo and ihi are integer values such that on exit A(i,j) = 0 and B(i,j) = 0 if i > j and j = 1,..., ilo-1 or i = ihi+1,..., n. If balanc = 'N' or 'S', ilo = 1 and ihi = n.
lscale, rscale REAL for single-precision flavors; DOUBLE PRECISION for double-precision flavors. Arrays, DIMENSION at least max(1, n) each.
lscale contains details of the permutations and scaling factors applied to the left side of A and B. If PL(j) is the index of the row interchanged with row j, and DL(j) is the scaling factor applied to row j, then
lscale(j) = PL(j), for j = 1,..., ilo-1
          = DL(j), for j = ilo,..., ihi
          = PL(j), for j = ihi+1,..., n.
The order in which the interchanges are made is n to ihi+1, then 1 to ilo-1.
rscale contains details of the permutations and scaling factors applied to the right side of A and B.
If PR(j) is the index of the column interchanged with column j, and DR(j) is the scaling factor applied to column j, then
rscale(j) = PR(j), for j = 1,..., ilo-1
          = DR(j), for j = ilo,..., ihi
          = PR(j), for j = ihi+1,..., n.
The order in which the interchanges are made is n to ihi+1, then 1 to ilo-1.
abnrm, bbnrm REAL for single-precision flavors; DOUBLE PRECISION for double-precision flavors. The one-norms of the balanced matrices A and B, respectively.
rconde, rcondv REAL for single precision flavors; DOUBLE PRECISION for double precision flavors. Arrays, DIMENSION at least max(1, n) each.
If sense = 'E' or 'B', rconde contains the reciprocal condition numbers of the eigenvalues, stored in consecutive elements of the array. For a complex conjugate pair of eigenvalues two consecutive elements of rconde are set to the same value. Thus rconde(j), rcondv(j), and the j-th columns of vl and vr all correspond to the same eigenpair (but not in general the j-th eigenpair, unless all eigenpairs are selected). If sense = 'N' or 'V', rconde is not referenced.
If sense = 'V' or 'B', rcondv contains the estimated reciprocal condition numbers of the eigenvectors, stored in consecutive elements of the array. For a complex eigenvector two consecutive elements of rcondv are set to the same value. If the eigenvalues cannot be reordered to compute rcondv(j), rcondv(j) is set to 0; this can only occur when the true value would be very small anyway. If sense = 'N' or 'E', rcondv is not referenced.
work(1) On exit, if info = 0, then work(1) returns the required minimal size of lwork.
info INTEGER.
If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
If info = i, and
i ≤ n: the QZ iteration failed. No eigenvectors have been calculated, but alphar(j), alphai(j) (for real flavors), or alpha(j) (for complex flavors), and beta(j), j = info+1,..., n should be correct.
i > n: errors that usually indicate LAPACK problems:
i = n+1: something other than the QZ iteration failed in ?hgeqz;
i = n+2: error return from ?tgevc.
Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine ggevx interface are the following:
a Holds the matrix A of size (n, n).
b Holds the matrix B of size (n, n).
alphar Holds the vector of length n. Used in real flavors only.
alphai Holds the vector of length n. Used in real flavors only.
alpha Holds the vector of length n. Used in complex flavors only.
beta Holds the vector of length n.
vl Holds the matrix VL of size (n, n).
vr Holds the matrix VR of size (n, n).
lscale Holds the vector of length n.
rscale Holds the vector of length n.
rconde Holds the vector of length n.
rcondv Holds the vector of length n.
balanc Must be 'N', 'B', or 'P'. The default value is 'N'.
jobvl Restored based on the presence of the argument vl as follows: jobvl = 'V', if vl is present, jobvl = 'N', if vl is omitted.
jobvr Restored based on the presence of the argument vr as follows: jobvr = 'V', if vr is present, jobvr = 'N', if vr is omitted.
sense Restored based on the presence of arguments rconde and rcondv as follows: sense = 'B', if both rconde and rcondv are present, sense = 'E', if rconde is present and rcondv omitted, sense = 'V', if rconde is omitted and rcondv present, sense = 'N', if both rconde and rcondv are omitted.
Application Notes
If you are in doubt how much workspace to supply, use a generous value of lwork for the first run or set lwork = -1.
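For ?ggevx the minimal lwork depends on several job options at once, so it can help to see the conditions from the lwork description combined in one place. The sketch below is an illustration only: ggevx_min_lwork is a hypothetical name, not an MKL or LAPACK function, and it assumes that each listed condition is a lower bound and that the largest applicable bound governs.

```c
#include <assert.h>

/* Larger of two values. */
static long max2(long a, long b) { return a > b ? a : b; }

/* Hypothetical helper: minimal lwork for ?ggevx.
   is_complex: 0 for sggevx/dggevx, nonzero for cggevx/zggevx.
   balanc, jobvl, jobvr, sense: the option characters passed to the routine.
   Assumption: the largest applicable lower bound is the one that counts. */
static long ggevx_min_lwork(int is_complex, long n, char balanc,
                            char jobvl, char jobvr, char sense)
{
    long need;
    assert(n >= 0);
    need = max2(1, 2 * n);                      /* applies to all flavors */
    if (is_complex) {
        if (sense == 'E')
            need = max2(need, 4 * n);
        if (sense == 'V' || sense == 'B')
            need = max2(need, 2 * n * n + 2 * n);
    } else {
        if (balanc == 'S' || balanc == 'B' || jobvl == 'V' || jobvr == 'V')
            need = max2(need, 6 * n);
        if (sense == 'E' || sense == 'B')
            need = max2(need, 10 * n);
        if (sense == 'V' || sense == 'B')
            need = max2(need, 2 * n * n + 8 * n + 16);
    }
    return need;
}
```

For example, with n = 10 a real-flavor call with sense = 'B' needs at least 296 elements under these rules, while the same call with sense = 'N' and no eigenvectors or scaling needs only 20.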
If you choose the first option and set any admissible lwork size, that is, one no less than the minimal value described, the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the corresponding array work on exit. Use this value (work(1)) for subsequent runs.
If you set lwork = -1, the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work). This operation is called a workspace query.
Note that if you set lwork to less than the minimal required value and not -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.
The quotients alphar(j)/beta(j) and alphai(j)/beta(j) may easily over- or underflow, and beta(j) may even be zero. Thus, you should avoid simply computing the ratio. However, alphar and alphai (for real flavors) or alpha (for complex flavors) will always be less than and usually comparable with norm(A) in magnitude, and beta always less than and usually comparable with norm(B).

LAPACK Auxiliary and Utility Routines 5
This chapter describes the Intel® Math Kernel Library implementation of LAPACK auxiliary and utility routines. The library includes auxiliary routines for both real and complex data.
Auxiliary Routines
Routine naming conventions, mathematical notation, and matrix storage schemes used for LAPACK auxiliary routines are the same as for the driver and computational routines described in previous chapters.
The table below summarizes information about the available LAPACK auxiliary routines.
LAPACK Auxiliary Routines
Routine Name  Data Types  Description
?lacgv        c, z        Conjugates a complex vector.
?lacrm        c, z        Multiplies a complex matrix by a square real matrix.
?lacrt        c, z        Performs a linear transformation of a pair of complex vectors.
?laesy        c, z        Computes the eigenvalues and eigenvectors of a 2-by-2 complex symmetric matrix.
?rot          c, z        Applies a plane rotation with real cosine and complex sine to a pair of complex vectors.
?spmv         c, z        Computes a matrix-vector product for complex vectors using a complex symmetric packed matrix.
?spr          c, z        Performs the symmetric rank-1 update of a complex symmetric packed matrix.
?symv         c, z        Computes a matrix-vector product for a complex symmetric matrix.
?syr          c, z        Performs the symmetric rank-1 update of a complex symmetric matrix.
i?max1        c, z        Finds the index of the vector element whose real part has maximum absolute value.
?sum1         sc, dz      Forms the 1-norm of the complex vector using the true absolute value.
?gbtf2        s, d, c, z  Computes the LU factorization of a general band matrix using the unblocked version of the algorithm.
?gebd2        s, d, c, z  Reduces a general matrix to bidiagonal form using an unblocked algorithm.
?gehd2        s, d, c, z  Reduces a general square matrix to upper Hessenberg form using an unblocked algorithm.
?gelq2        s, d, c, z  Computes the LQ factorization of a general rectangular matrix using an unblocked algorithm.
?geql2        s, d, c, z  Computes the QL factorization of a general rectangular matrix using an unblocked algorithm.
?geqr2        s, d, c, z  Computes the QR factorization of a general rectangular matrix using an unblocked algorithm.
?geqr2p       s, d, c, z  Computes the QR factorization of a general rectangular matrix with non-negative diagonal elements using an unblocked algorithm.
?gerq2        s, d, c, z  Computes the RQ factorization of a general rectangular matrix using an unblocked algorithm.
?gesc2        s, d, c, z  Solves a system of linear equations using the LU factorization with complete pivoting computed by ?getc2.
?getc2        s, d, c, z  Computes the LU factorization with complete pivoting of the general n-by-n matrix.
?getf2    s, d, c, z    Computes the LU factorization of a general m-by-n matrix using partial pivoting with row interchanges (unblocked algorithm).
?gtts2    s, d, c, z    Solves a system of linear equations with a tridiagonal matrix using the LU factorization computed by ?gttrf.
?isnan    s, d    Tests input for NaN.
?laisnan    s, d    Tests input for NaN by comparing its two arguments for inequality.
?labrd    s, d, c, z    Reduces the first nb rows and columns of a general matrix to a bidiagonal form.
?lacn2    s, d, c, z    Estimates the 1-norm of a square matrix, using reverse communication for evaluating matrix-vector products.
?lacon    s, d, c, z    Estimates the 1-norm of a square matrix, using reverse communication for evaluating matrix-vector products.
?lacpy    s, d, c, z    Copies all or part of one two-dimensional array to another.
?ladiv    s, d, c, z    Performs complex division in real arithmetic, avoiding unnecessary overflow.
?lae2    s, d    Computes the eigenvalues of a 2-by-2 symmetric matrix.
?laebz    s, d    Computes the number of eigenvalues of a real symmetric tridiagonal matrix which are less than or equal to a given value, and performs other tasks required by the routine ?stebz.
?laed0    s, d, c, z    Used by ?stedc. Computes all eigenvalues and corresponding eigenvectors of an unreduced symmetric tridiagonal matrix using the divide and conquer method.
?laed1    s, d    Used by sstedc/dstedc. Computes the updated eigensystem of a diagonal matrix after modification by a rank-one symmetric matrix. Used when the original matrix is tridiagonal.
?laed2    s, d    Used by sstedc/dstedc. Merges eigenvalues and deflates secular equation. Used when the original matrix is tridiagonal.
?laed3    s, d    Used by sstedc/dstedc. Finds the roots of the secular equation and updates the eigenvectors. Used when the original matrix is tridiagonal.
?laed4    s, d    Used by sstedc/dstedc. Finds a single root of the secular equation.
?laed5    s, d    Used by sstedc/dstedc. Solves the 2-by-2 secular equation.
?laed6    s, d    Used by sstedc/dstedc. Computes one Newton step in solution of the secular equation.
?laed7    s, d, c, z    Used by ?stedc. Computes the updated eigensystem of a diagonal matrix after modification by a rank-one symmetric matrix. Used when the original matrix is dense.
?laed8    s, d, c, z    Used by ?stedc. Merges eigenvalues and deflates secular equation. Used when the original matrix is dense.
?laed9    s, d    Used by sstedc/dstedc. Finds the roots of the secular equation and updates the eigenvectors. Used when the original matrix is dense.
?laeda    s, d    Used by ?stedc. Computes the Z vector determining the rank-one modification of the diagonal matrix. Used when the original matrix is dense.
?laein    s, d, c, z    Computes a specified right or left eigenvector of an upper Hessenberg matrix by inverse iteration.
?laev2    s, d, c, z    Computes the eigenvalues and eigenvectors of a 2-by-2 symmetric/Hermitian matrix.
?laexc    s, d    Swaps adjacent diagonal blocks of a real upper quasi-triangular matrix in Schur canonical form, by an orthogonal similarity transformation.
?lag2    s, d    Computes the eigenvalues of a 2-by-2 generalized eigenvalue problem, with scaling as necessary to avoid over-/underflow.
?lags2    s, d    Computes 2-by-2 orthogonal matrices U, V, and Q, and applies them to matrices A and B such that the rows of the transformed A and B are parallel.
?lagtf    s, d    Computes an LU factorization of a matrix T - λ*I, where T is a general tridiagonal matrix and λ is a scalar, using partial pivoting with row interchanges.
?lagtm    s, d, c, z    Performs a matrix-matrix product of the form C = α*A*B + β*C, where A is a tridiagonal matrix, B and C are rectangular matrices, and α and β are scalars, which may be 0, 1, or -1.
?lagts    s, d    Solves the system of equations (T - λ*I)*x = y or (T - λ*I)^T*x = y, where T is a general tridiagonal matrix and λ is a scalar, using the LU factorization computed by ?lagtf.
?lagv2    s, d    Computes the Generalized Schur factorization of a real 2-by-2 matrix pencil (A,B) where B is upper triangular.
?lahqr    s, d, c, z    Computes the eigenvalues and Schur factorization of an upper Hessenberg matrix, using the double-shift/single-shift QR algorithm.
?lahrd    s, d, c, z    Reduces the first nb columns of a general rectangular matrix A so that elements below the k-th subdiagonal are zero, and returns auxiliary matrices which are needed to apply the transformation to the unreduced part of A.
?lahr2    s, d, c, z    Reduces the specified number of first columns of a general rectangular matrix A so that elements below the specified subdiagonal are zero, and returns auxiliary matrices which are needed to apply the transformation to the unreduced part of A.
?laic1    s, d, c, z    Applies one step of incremental condition estimation.
?laln2    s, d    Solves a 1-by-1 or 2-by-2 linear system of equations of the specified form.
?lals0    s, d, c, z    Applies back multiplying factors in solving the least squares problem using divide and conquer SVD approach. Used by ?gelsd.
?lalsa    s, d, c, z    Computes the SVD of the coefficient matrix in compact form. Used by ?gelsd.
?lalsd    s, d, c, z    Uses the singular value decomposition of A to solve the least squares problem.
?lamrg    s, d    Creates a permutation list to merge the entries of two independently sorted sets into a single set sorted in ascending order.
?laneg    s, d    Computes the Sturm count.
?langb    s, d, c, z    Returns the value of the 1-norm, Frobenius norm, infinity-norm, or the largest absolute value of any element of a general band matrix.
?lange    s, d, c, z    Returns the value of the 1-norm, Frobenius norm, infinity-norm, or the largest absolute value of any element of a general rectangular matrix.
?langt    s, d, c, z    Returns the value of the 1-norm, Frobenius norm, infinity-norm, or the largest absolute value of any element of a general tridiagonal matrix.
?lanhs    s, d, c, z    Returns the value of the 1-norm, Frobenius norm, infinity-norm, or the largest absolute value of any element of an upper Hessenberg matrix.
?lansb    s, d, c, z    Returns the value of the 1-norm, or the Frobenius norm, or the infinity norm, or the element of largest absolute value of a symmetric band matrix.
?lanhb    c, z    Returns the value of the 1-norm, or the Frobenius norm, or the infinity norm, or the element of largest absolute value of a Hermitian band matrix.
?lansp    s, d, c, z    Returns the value of the 1-norm, or the Frobenius norm, or the infinity norm, or the element of largest absolute value of a symmetric matrix supplied in packed form.
?lanhp    c, z    Returns the value of the 1-norm, or the Frobenius norm, or the infinity norm, or the element of largest absolute value of a complex Hermitian matrix supplied in packed form.
?lanst/?lanht    s, d/c, z    Returns the value of the 1-norm, or the Frobenius norm, or the infinity norm, or the element of largest absolute value of a real symmetric or complex Hermitian tridiagonal matrix.
?lansy    s, d, c, z    Returns the value of the 1-norm, or the Frobenius norm, or the infinity norm, or the element of largest absolute value of a real/complex symmetric matrix.
?lanhe    c, z    Returns the value of the 1-norm, or the Frobenius norm, or the infinity norm, or the element of largest absolute value of a complex Hermitian matrix.
?lantb    s, d, c, z    Returns the value of the 1-norm, or the Frobenius norm, or the infinity norm, or the element of largest absolute value of a triangular band matrix.
?lantp    s, d, c, z    Returns the value of the 1-norm, or the Frobenius norm, or the infinity norm, or the element of largest absolute value of a triangular matrix supplied in packed form.
?lantr    s, d, c, z    Returns the value of the 1-norm, or the Frobenius norm, or the infinity norm, or the element of largest absolute value of a trapezoidal or triangular matrix.
?lanv2    s, d    Computes the Schur factorization of a real 2-by-2 nonsymmetric matrix in standard form.
?lapll    s, d, c, z    Measures the linear dependence of two vectors.
?lapmr    s, d, c, z    Rearranges rows of a matrix as specified by a permutation vector.
?lapmt    s, d, c, z    Performs a forward or backward permutation of the columns of a matrix.
?lapy2    s, d    Returns sqrt(x^2 + y^2).
?lapy3    s, d    Returns sqrt(x^2 + y^2 + z^2).
?laqgb    s, d, c, z    Scales a general band matrix, using row and column scaling factors computed by ?gbequ.
?laqge    s, d, c, z    Scales a general rectangular matrix, using row and column scaling factors computed by ?geequ.
?laqhb    c, z    Scales a Hermitian band matrix, using scaling factors computed by ?pbequ.
?laqp2    s, d, c, z    Computes a QR factorization with column pivoting of the matrix block.
?laqps    s, d, c, z    Computes a step of QR factorization with column pivoting of a real m-by-n matrix A by using BLAS level 3.
?laqr0    s, d, c, z    Computes the eigenvalues of a Hessenberg matrix, and optionally the matrices from the Schur decomposition.
?laqr1    s, d, c, z    Sets a scalar multiple of the first column of the product of 2-by-2 or 3-by-3 matrix H and specified shifts.
?laqr2    s, d, c, z    Performs the orthogonal/unitary similarity transformation of a Hessenberg matrix to detect and deflate fully converged eigenvalues from a trailing principal submatrix (aggressive early deflation).
?laqr3    s, d, c, z    Performs the orthogonal/unitary similarity transformation of a Hessenberg matrix to detect and deflate fully converged eigenvalues from a trailing principal submatrix (aggressive early deflation).
?laqr4    s, d, c, z    Computes the eigenvalues of a Hessenberg matrix, and optionally the matrices from the Schur decomposition.
?laqr5    s, d, c, z    Performs a single small-bulge multi-shift QR sweep.
?laqsb    s, d, c, z    Scales a symmetric/Hermitian band matrix, using scaling factors computed by ?pbequ.
?laqsp    s, d, c, z    Scales a symmetric/Hermitian matrix in packed storage, using scaling factors computed by ?ppequ.
?laqsy    s, d, c, z    Scales a symmetric/Hermitian matrix, using scaling factors computed by ?poequ.
?laqtr    s, d    Solves a real quasi-triangular system of equations, or a complex quasi-triangular system of special form, in real arithmetic.
?lar1v    s, d, c, z    Computes the (scaled) r-th column of the inverse of the submatrix in rows b1 through bn of the tridiagonal matrix L*D*L^T - σ*I.
?lar2v    s, d, c, z    Applies a vector of plane rotations with real cosines and real/complex sines from both sides to a sequence of 2-by-2 symmetric/Hermitian matrices.
?larf    s, d, c, z    Applies an elementary reflector to a general rectangular matrix.
?larfb    s, d, c, z    Applies a block reflector or its transpose/conjugate-transpose to a general rectangular matrix.
?larfg    s, d, c, z    Generates an elementary reflector (Householder matrix).
?larfgp    s, d, c, z    Generates an elementary reflector (Householder matrix) with nonnegative beta.
?larft    s, d, c, z    Forms the triangular factor T of a block reflector H = I - V*T*V^H.
?larfx    s, d, c, z    Applies an elementary reflector to a general rectangular matrix, with loop unrolling when the reflector has order ≤ 10.
?largv    s, d, c, z    Generates a vector of plane rotations with real cosines and real/complex sines.
?larnv    s, d, c, z    Returns a vector of random numbers from a uniform or normal distribution.
?larra    s, d    Computes the splitting points with the specified threshold.
?larrb    s, d    Provides limited bisection to locate eigenvalues for more accuracy.
?larrc    s, d    Computes the number of eigenvalues of the symmetric tridiagonal matrix.
?larrd    s, d    Computes the eigenvalues of a symmetric tridiagonal matrix to suitable accuracy.
?larre    s, d    Given the tridiagonal matrix T, sets small off-diagonal elements to zero and for each unreduced block T_i, finds base representations and eigenvalues.
?larrf    s, d    Finds a new relatively robust representation such that at least one of the eigenvalues is relatively isolated.
?larrj    s, d    Performs refinement of the initial estimates of the eigenvalues of the matrix T.
?larrk    s, d    Computes one eigenvalue of a symmetric tridiagonal matrix T to suitable accuracy.
?larrr    s, d    Performs tests to decide whether the symmetric tridiagonal matrix T warrants expensive computations which guarantee high relative accuracy in the eigenvalues.
?larrv    s, d, c, z    Computes the eigenvectors of the tridiagonal matrix T = L*D*L^T given L, D and the eigenvalues of L*D*L^T.
?lartg    s, d, c, z    Generates a plane rotation with real cosine and real/complex sine.
?lartgp    s, d    Generates a plane rotation so that the diagonal is nonnegative.
?lartgs    s, d    Generates a plane rotation designed to introduce a bulge in implicit QR iteration for the bidiagonal SVD problem.
?lartv    s, d, c, z    Applies a vector of plane rotations with real cosines and real/complex sines to the elements of a pair of vectors.
?laruv    s, d    Returns a vector of n random real numbers from a uniform distribution.
?larz    s, d, c, z    Applies an elementary reflector (as returned by ?tzrzf) to a general matrix.
?larzb    s, d, c, z    Applies a block reflector or its transpose/conjugate-transpose to a general matrix.
?larzt    s, d, c, z    Forms the triangular factor T of a block reflector H = I - V*T*V^H.
?las2    s, d    Computes singular values of a 2-by-2 triangular matrix.
?lascl    s, d, c, z    Multiplies a general rectangular matrix by a real scalar defined as cto/cfrom.
?lasd0    s, d    Computes the singular values of a real upper bidiagonal n-by-m matrix B with diagonal d and off-diagonal e. Used by ?bdsdc.
?lasd1    s, d    Computes the SVD of an upper bidiagonal matrix B of the specified size. Used by ?bdsdc.
?lasd2    s, d    Merges the two sets of singular values together into a single sorted set. Used by ?bdsdc.
?lasd3    s, d    Finds all square roots of the roots of the secular equation, as defined by the values in D and Z, and then updates the singular vectors by matrix multiplication. Used by ?bdsdc.
?lasd4    s, d    Computes the square root of the i-th updated eigenvalue of a positive symmetric rank-one modification to a positive diagonal matrix. Used by ?bdsdc.
?lasd5    s, d    Computes the square root of the i-th eigenvalue of a positive symmetric rank-one modification of a 2-by-2 diagonal matrix. Used by ?bdsdc.
?lasd6    s, d    Computes the SVD of an updated upper bidiagonal matrix obtained by merging two smaller ones by appending a row. Used by ?bdsdc.
?lasd7    s, d    Merges the two sets of singular values together into a single sorted set. Then it tries to deflate the size of the problem. Used by ?bdsdc.
?lasd8    s, d    Finds the square roots of the roots of the secular equation, and stores, for each element in D, the distance to its two nearest poles. Used by ?bdsdc.
?lasd9    s, d    Finds the square roots of the roots of the secular equation, and stores, for each element in D, the distance to its two nearest poles. Used by ?bdsdc.
?lasda    s, d    Computes the singular value decomposition (SVD) of a real upper bidiagonal matrix with diagonal d and off-diagonal e. Used by ?bdsdc.
?lasdq    s, d    Computes the SVD of a real bidiagonal matrix with diagonal d and off-diagonal e. Used by ?bdsdc.
?lasdt    s, d    Creates a tree of subproblems for bidiagonal divide and conquer. Used by ?bdsdc.
?laset    s, d, c, z    Initializes the off-diagonal elements and the diagonal elements of a matrix to given values.
?lasq1    s, d    Computes the singular values of a real square bidiagonal matrix. Used by ?bdsqr.
?lasq2    s, d    Computes all the eigenvalues of the symmetric positive definite tridiagonal matrix associated with the qd Array Z to high relative accuracy. Used by ?bdsqr and ?stegr.
?lasq3    s, d    Checks for deflation, computes a shift and calls dqds. Used by ?bdsqr.
?lasq4    s, d    Computes an approximation to the smallest eigenvalue using values of d from the previous transform. Used by ?bdsqr.
?lasq5    s, d    Computes one dqds transform in ping-pong form. Used by ?bdsqr and ?stegr.
?lasq6    s, d    Computes one dqd transform in ping-pong form. Used by ?bdsqr and ?stegr.
?lasr    s, d, c, z    Applies a sequence of plane rotations to a general rectangular matrix.
?lasrt    s, d    Sorts numbers in increasing or decreasing order.
?lassq    s, d, c, z    Updates a sum of squares represented in scaled form.
?lasv2    s, d    Computes the singular value decomposition of a 2-by-2 triangular matrix.
?laswp    s, d, c, z    Performs a series of row interchanges on a general rectangular matrix.
?lasy2    s, d    Solves the Sylvester matrix equation where the matrices are of order 1 or 2.
?lasyf    s, d, c, z    Computes a partial factorization of a real/complex symmetric matrix, using the diagonal pivoting method.
?lahef    c, z    Computes a partial factorization of a complex Hermitian indefinite matrix, using the diagonal pivoting method.
?latbs    s, d, c, z    Solves a triangular banded system of equations.
?latdf    s, d, c, z    Uses the LU factorization of the n-by-n matrix computed by ?getc2 and computes a contribution to the reciprocal Dif-estimate.
?latps    s, d, c, z    Solves a triangular system of equations with the matrix held in packed storage.
?latrd    s, d, c, z    Reduces the first nb rows and columns of a symmetric/Hermitian matrix A to real tridiagonal form by an orthogonal/unitary similarity transformation.
?latrs    s, d, c, z    Solves a triangular system of equations with the scale factor set to prevent overflow.
?latrz    s, d, c, z    Factors an upper trapezoidal matrix by means of orthogonal/unitary transformations.
?lauu2    s, d, c, z    Computes the product U*U^H or L^H*L, where U and L are upper or lower triangular matrices (unblocked algorithm).
?lauum    s, d, c, z    Computes the product U*U^H or L^H*L, where U and L are upper or lower triangular matrices (blocked algorithm).
?org2l/?ung2l    s, d/c, z    Generates all or part of the orthogonal/unitary matrix Q from a QL factorization determined by ?geqlf (unblocked algorithm).
?org2r/?ung2r    s, d/c, z    Generates all or part of the orthogonal/unitary matrix Q from a QR factorization determined by ?geqrf (unblocked algorithm).
?orgl2/?ungl2    s, d/c, z    Generates all or part of the orthogonal/unitary matrix Q from an LQ factorization determined by ?gelqf (unblocked algorithm).
?orgr2/?ungr2    s, d/c, z    Generates all or part of the orthogonal/unitary matrix Q from an RQ factorization determined by ?gerqf (unblocked algorithm).
?orm2l/?unm2l    s, d/c, z    Multiplies a general matrix by the orthogonal/unitary matrix from a QL factorization determined by ?geqlf (unblocked algorithm).
?orm2r/?unm2r    s, d/c, z    Multiplies a general matrix by the orthogonal/unitary matrix from a QR factorization determined by ?geqrf (unblocked algorithm).
?orml2/?unml2    s, d/c, z    Multiplies a general matrix by the orthogonal/unitary matrix from an LQ factorization determined by ?gelqf (unblocked algorithm).
?ormr2/?unmr2    s, d/c, z    Multiplies a general matrix by the orthogonal/unitary matrix from an RQ factorization determined by ?gerqf (unblocked algorithm).
?ormr3/?unmr3    s, d/c, z    Multiplies a general matrix by the orthogonal/unitary matrix from an RZ factorization determined by ?tzrzf (unblocked algorithm).
?pbtf2    s, d, c, z    Computes the Cholesky factorization of a symmetric/Hermitian positive definite band matrix (unblocked algorithm).
?potf2    s, d, c, z    Computes the Cholesky factorization of a symmetric/Hermitian positive definite matrix (unblocked algorithm).
?ptts2    s, d, c, z    Solves a tridiagonal system of the form AX=B using the L*D*L^H factorization computed by ?pttrf.
?rscl    s, d, cs, zd    Multiplies a vector by the reciprocal of a real scalar.
?syswapr    s, d, c, z    Applies an elementary permutation on the rows and columns of a symmetric matrix.
?heswapr    c, z    Applies an elementary permutation on the rows and columns of a Hermitian matrix.
?sygs2/?hegs2    s, d/c, z    Reduces a symmetric/Hermitian definite generalized eigenproblem to standard form, using the factorization results obtained from ?potrf (unblocked algorithm).
?sytd2/?hetd2    s, d/c, z    Reduces a symmetric/Hermitian matrix to real symmetric tridiagonal form by an orthogonal/unitary similarity transformation (unblocked algorithm).
?sytf2    s, d, c, z    Computes the factorization of a real/complex symmetric indefinite matrix, using the diagonal pivoting method (unblocked algorithm).
?hetf2    c, z    Computes the factorization of a complex Hermitian matrix, using the diagonal pivoting method (unblocked algorithm).
?tgex2    s, d, c, z    Swaps adjacent diagonal blocks in an upper (quasi) triangular matrix pair by an orthogonal/unitary equivalence transformation.
?tgsy2    s, d, c, z    Solves the generalized Sylvester equation (unblocked algorithm).
?trti2    s, d, c, z    Computes the inverse of a triangular matrix (unblocked algorithm).
clag2z    c → z    Converts a complex single precision matrix to a complex double precision matrix.
dlag2s    d → s    Converts a double precision matrix to a single precision matrix.
slag2d    s → d    Converts a single precision matrix to a double precision matrix.
zlag2c    z → c    Converts a complex double precision matrix to a complex single precision matrix.
?larfp    s, d, c, z    Generates a real or complex elementary reflector.
ila?lc    s, d, c, z    Scans a matrix for its last non-zero column.
ila?lr    s, d, c, z    Scans a matrix for its last non-zero row.
?gsvj0    s, d    Pre-processor for the routine ?gesvj.
?gsvj1    s, d    Pre-processor for the routine ?gesvj, applies Jacobi rotations targeting only particular pivots.
?sfrk    s, d    Performs a symmetric rank-k operation for matrix in RFP format.
?hfrk    c, z    Performs a Hermitian rank-k operation for matrix in RFP format.
?tfsm    s, d, c, z    Solves a matrix equation (one operand is a triangular matrix in RFP format).
?lansf    s, d    Returns the value of the 1-norm, or the Frobenius norm, or the infinity norm, or the element of largest absolute value of a symmetric matrix in RFP format.
?lanhf    c, z    Returns the value of the 1-norm, or the Frobenius norm, or the infinity norm, or the element of largest absolute value of a Hermitian matrix in RFP format.
?tfttp    s, d, c, z    Copies a triangular matrix from the rectangular full packed format (TF) to the standard packed format (TP).
?tfttr    s, d, c, z    Copies a triangular matrix from the rectangular full packed format (TF) to the standard full format (TR).
?tpttf    s, d, c, z    Copies a triangular matrix from the standard packed format (TP) to the rectangular full packed format (TF).
?tpttr    s, d, c, z    Copies a triangular matrix from the standard packed format (TP) to the standard full format (TR).
?trttf    s, d, c, z    Copies a triangular matrix from the standard full format (TR) to the rectangular full packed format (TF).
?trttp    s, d, c, z    Copies a triangular matrix from the standard full format (TR) to the standard packed format (TP).
?pstf2    s, d, c, z    Computes the Cholesky factorization with complete pivoting of a real symmetric or complex Hermitian positive semi-definite matrix.
dlat2s    d → s    Converts a double-precision triangular matrix to a single-precision triangular matrix.
zlat2c    z → c    Converts a double complex triangular matrix to a complex triangular matrix.
?lacp2    c, z    Copies all or part of a real two-dimensional array to a complex array.
?la_gbamv    s, d, c, z    Performs a matrix-vector operation to calculate error bounds.
?la_gbrcond    s, d    Estimates the Skeel condition number for a general banded matrix.
?la_gbrcond_c    c, z    Computes the infinity norm condition number of op(A)*inv(diag(c)) for general banded matrices.
?la_gbrcond_x    c, z    Computes the infinity norm condition number of op(A)*diag(x) for general banded matrices.
?la_gbrfsx_extended    s, d, c, z    Improves the computed solution to a system of linear equations for general banded matrices by performing extra-precise iterative refinement and provides error bounds and backward error estimates for the solution.
?la_gbrpvgrw    s, d, c, z    Computes the reciprocal pivot growth factor norm(A)/norm(U) for a general banded matrix.
?la_geamv    s, d, c, z    Computes a matrix-vector product using a general matrix to calculate error bounds.
?la_gercond    s, d    Estimates the Skeel condition number for a general matrix.
?la_gercond_c    c, z    Computes the infinity norm condition number of op(A)*inv(diag(c)) for general matrices.
?la_gercond_x    c, z    Computes the infinity norm condition number of op(A)*diag(x) for general matrices.
?la_gerfsx_extended    s, d    Improves the computed solution to a system of linear equations for general matrices by performing extra-precise iterative refinement and provides error bounds and backward error estimates for the solution.
?la_heamv    c, z    Computes a matrix-vector product using a Hermitian indefinite matrix to calculate error bounds.
?la_hercond_c    c, z    Computes the infinity norm condition number of op(A)*inv(diag(c)) for Hermitian indefinite matrices.
?la_hercond_x    c, z    Computes the infinity norm condition number of op(A)*diag(x) for Hermitian indefinite matrices.
?la_herfsx_extended    c, z    Improves the computed solution to a system of linear equations for Hermitian indefinite matrices by performing extra-precise iterative refinement and provides error bounds and backward error estimates for the solution.
?la_lin_berr    s, d, c, z    Computes a component-wise relative backward error.
?la_porcond    s, d    Estimates the Skeel condition number for a symmetric positive-definite matrix.
?la_porcond_c    c, z    Computes the infinity norm condition number of op(A)*inv(diag(c)) for Hermitian positive-definite matrices.
?la_porcond_x    c, z    Computes the infinity norm condition number of op(A)*diag(x) for Hermitian positive-definite matrices.
?la_porfsx_extended    s, d, c, z    Improves the computed solution to a system of linear equations for symmetric or Hermitian positive-definite matrices by performing extra-precise iterative refinement and provides error bounds and backward error estimates for the solution.
?la_porpvgrw    s, d, c, z    Computes the reciprocal pivot growth factor norm(A)/norm(U) for a symmetric or Hermitian positive-definite matrix.
?laqhe    c, z    Scales a Hermitian matrix.
?laqhp    c, z    Scales a Hermitian matrix stored in packed form.
?larcm    c, z    Copies all or part of a real two-dimensional array to a complex array.
?la_rpvgrw    c, z    Multiplies a square real matrix by a complex matrix.
?larscl2    s, d, c, z    Performs reciprocal diagonal scaling on a vector.
?lascl2    s, d, c, z    Performs diagonal scaling on a vector.
?la_syamv    s, d, c, z    Computes a matrix-vector product using a symmetric indefinite matrix to calculate error bounds.
?la_syrcond    s, d    Estimates the Skeel condition number for a symmetric indefinite matrix.
?la_syrcond_c    c, z    Computes the infinity norm condition number of op(A)*inv(diag(c)) for symmetric indefinite matrices.
?la_syrcond_x    c, z    Computes the infinity norm condition number of op(A)*diag(x) for symmetric indefinite matrices.
?la_syrfsx_extended    s, d, c, z    Improves the computed solution to a system of linear equations for symmetric indefinite matrices by performing extra-precise iterative refinement and provides error bounds and backward error estimates for the solution.
?la_syrpvgrw    s, d, c, z    Computes the reciprocal pivot growth factor norm(A)/norm(U) for a symmetric indefinite matrix.
?la_wwaddw    s, d, c, z    Adds a vector into a doubled-single vector.

?lacgv
Conjugates a complex vector.

Syntax
call clacgv( n, x, incx )
call zlacgv( n, x, incx )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine conjugates a complex vector x of length n and increment incx (see "Vector Arguments in BLAS" in Appendix B).

Input Parameters
n    INTEGER. The length of the vector x (n ≥ 0).
x    COMPLEX for clacgv
    DOUBLE COMPLEX for zlacgv.
    Array, dimension (1+(n-1)*|incx|). Contains the vector of length n to be conjugated.
incx    INTEGER. The spacing between successive elements of x.

Output Parameters
x    On exit, overwritten with conjg(x).

?lacrm
Multiplies a complex matrix by a square real matrix.

Syntax
call clacrm( m, n, a, lda, b, ldb, c, ldc, rwork )
call zlacrm( m, n, a, lda, b, ldb, c, ldc, rwork )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine performs a simple matrix-matrix multiplication of the form C = A*B, where A is m-by-n and complex, B is n-by-n and real, C is m-by-n and complex.

Input Parameters
m    INTEGER. The number of rows of the matrix A and of the matrix C (m ≥ 0).
n    INTEGER. The number of columns and rows of the matrix B and the number of columns of the matrix C (n ≥ 0).
a    COMPLEX for clacrm
    DOUBLE COMPLEX for zlacrm
    Array, DIMENSION (lda, n). Contains the m-by-n matrix A.
lda    INTEGER. The leading dimension of the array a, lda ≥ max(1, m).
b    REAL for clacrm
    DOUBLE PRECISION for zlacrm
    Array, DIMENSION (ldb, n). Contains the n-by-n matrix B.
ldb    INTEGER. The leading dimension of the array b, ldb ≥ max(1, n).
ldc    INTEGER. The leading dimension of the output array c, ldc ≥ max(1, n).
rwork    REAL for clacrm
    DOUBLE PRECISION for zlacrm
    Workspace array, DIMENSION (2*m*n).

Output Parameters
c    COMPLEX for clacrm
    DOUBLE COMPLEX for zlacrm
    Array, DIMENSION (ldc, n). Contains the m-by-n matrix C.

?lacrt
Performs a linear transformation of a pair of complex vectors.

Syntax
call clacrt( n, cx, incx, cy, incy, c, s )
call zlacrt( n, cx, incx, cy, incy, c, s )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine performs the following transformation:
    x := c*x + s*y
    y := -s*x + c*y
where c, s are complex scalars and x, y are complex vectors. Both right-hand sides use the input values of x and y.

Input Parameters
n    INTEGER. The number of elements in the vectors cx and cy (n ≥ 0).
cx, cy    COMPLEX for clacrt
    DOUBLE COMPLEX for zlacrt
    Arrays, dimension (n). Contain input vectors x and y, respectively.
incx    INTEGER. The increment between successive elements of cx.
incy    INTEGER. The increment between successive elements of cy.
c, s    COMPLEX for clacrt
    DOUBLE COMPLEX for zlacrt
    Complex scalars that define the transform matrix.

Output Parameters
cx    On exit, overwritten with c*x + s*y.
cy    On exit, overwritten with -s*x + c*y.

?laesy
Computes the eigenvalues and eigenvectors of a 2-by-2 complex symmetric matrix, and checks that the norm of the matrix of eigenvectors is larger than a threshold value.

Syntax
call claesy( a, b, c, rt1, rt2, evscal, cs1, sn1 )
call zlaesy( a, b, c, rt1, rt2, evscal, cs1, sn1 )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine performs the eigendecomposition of a 2-by-2 symmetric matrix provided the norm of the matrix of eigenvectors is larger than some threshold value. rt1 is the eigenvalue of larger absolute value, and rt2 of smaller absolute value.
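The ordering convention for rt1 and rt2 can be illustrated with a small Python sketch. It assumes the input matrix is [[a, b], [b, c]], as the parameter names a, b, c suggest, and uses the plain quadratic formula. It is not the algorithm ?laesy uses (this sketch does no scaling and computes no eigenvectors), and the function name eig2x2_complex_symmetric is hypothetical.

```python
import cmath


def eig2x2_complex_symmetric(a, b, c):
    """Eigenvalues of the complex symmetric matrix [[a, b], [b, c]].

    Returns (rt1, rt2) ordered so that abs(rt1) >= abs(rt2).
    Illustration only: no overflow protection, no eigenvectors.
    """
    t = (a + c) / 2                              # half of the trace
    d = cmath.sqrt(((a - c) / 2) ** 2 + b * b)   # half of the eigenvalue gap
    rt1, rt2 = t + d, t - d
    if abs(rt1) < abs(rt2):                      # enforce the modulus ordering
        rt1, rt2 = rt2, rt1
    return rt1, rt2


if __name__ == "__main__":
    print(eig2x2_complex_symmetric(1, 2, 1))
```

A quick sanity check for any input is that rt1 + rt2 equals a + c and rt1*rt2 equals a*c - b*b, up to rounding.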
If the eigenvectors are computed, then on return (cs1, sn1) is the unit eigenvector for rt1, hence

    [  cs1  sn1 ] [ a  b ] [ cs1  -sn1 ]   [ rt1   0  ]
    [ -sn1  cs1 ] [ b  c ] [ sn1   cs1 ] = [  0   rt2 ]

Input Parameters
a, b, c    COMPLEX for claesy
    DOUBLE COMPLEX for zlaesy
    Elements of the input matrix.

Output Parameters
rt1, rt2    COMPLEX for claesy
    DOUBLE COMPLEX for zlaesy
    Eigenvalues of larger and smaller modulus, respectively.
evscal    COMPLEX for claesy
    DOUBLE COMPLEX for zlaesy
    The complex value by which the eigenvector matrix was scaled to make it orthonormal. If evscal is zero, the eigenvectors were not computed. This means one of two things: the 2-by-2 matrix could not be diagonalized, or the norm of the matrix of eigenvectors before scaling was larger than the threshold value thresh (set to 0.1E0).
cs1, sn1    COMPLEX for claesy
    DOUBLE COMPLEX for zlaesy
    If evscal is not zero, then (cs1, sn1) is the unit right eigenvector for rt1.

?rot
Applies a plane rotation with real cosine and complex sine to a pair of complex vectors.

Syntax
call crot( n, cx, incx, cy, incy, c, s )
call zrot( n, cx, incx, cy, incy, c, s )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine applies a plane rotation, where the cosine (c) is real and the sine (s) is complex, and the vectors cx and cy are complex. This routine has its real equivalents in BLAS (see ?rot in Chapter 2).

Input Parameters
n    INTEGER. The number of elements in the vectors cx and cy.
cx, cy    REAL for srot
    DOUBLE PRECISION for drot
    COMPLEX for crot
    DOUBLE COMPLEX for zrot
    Arrays of dimension (n), contain input vectors x and y, respectively.
incx    INTEGER. The increment between successive elements of cx.
incy    INTEGER. The increment between successive elements of cy.
c    REAL for crot
    DOUBLE PRECISION for zrot
s    REAL for srot
    DOUBLE PRECISION for drot
    COMPLEX for crot
    DOUBLE COMPLEX for zrot
    Values that define a rotation, where c*c + s*conjg(s) = 1.0.

Output Parameters
cx    On exit, overwritten with c*x + s*y.
cy On exit, overwritten with -conjg(s)*x + c*y.

?spmv
Computes a matrix-vector product for complex vectors using a complex symmetric packed matrix.

Syntax
call cspmv( uplo, n, alpha, ap, x, incx, beta, y, incy )
call zspmv( uplo, n, alpha, ap, x, incx, beta, y, incy )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The ?spmv routines perform a matrix-vector operation defined as
y := alpha*a*x + beta*y,
where:
alpha and beta are complex scalars,
x and y are n-element complex vectors,
a is an n-by-n complex symmetric matrix, supplied in packed form.
These routines have their real equivalents in BLAS (see ?spmv in Chapter 2).

Input Parameters
uplo CHARACTER*1. Specifies whether the upper or lower triangular part of the matrix a is supplied in the packed array ap.
If uplo = 'U' or 'u', the upper triangular part of the matrix a is supplied in the array ap.
If uplo = 'L' or 'l', the lower triangular part of the matrix a is supplied in the array ap.
n INTEGER. Specifies the order of the matrix a. The value of n must be at least zero.
alpha, beta COMPLEX for cspmv; DOUBLE COMPLEX for zspmv. Specify complex scalars alpha and beta. When beta is supplied as zero, then y need not be set on input.
ap COMPLEX for cspmv; DOUBLE COMPLEX for zspmv. Array, DIMENSION at least ((n*(n + 1))/2).
Before entry, with uplo = 'U' or 'u', the array ap must contain the upper triangular part of the symmetric matrix packed sequentially, column-by-column, so that ap(1) contains a(1, 1), ap(2) and ap(3) contain a(1, 2) and a(2, 2) respectively, and so on.
Before entry, with uplo = 'L' or 'l', the array ap must contain the lower triangular part of the symmetric matrix packed sequentially, column-by-column, so that ap(1) contains a(1, 1), ap(2) and ap(3) contain a(2, 1) and a(3, 1) respectively, and so on.
x COMPLEX for cspmv; DOUBLE COMPLEX for zspmv. Array, DIMENSION at least (1 + (n - 1)*abs(incx)).
Before entry, the incremented array x must contain the n-element vector x.
incx INTEGER. Specifies the increment for the elements of x. The value of incx must not be zero.
y COMPLEX for cspmv; DOUBLE COMPLEX for zspmv. Array, DIMENSION at least (1 + (n - 1)*abs(incy)). Before entry, the incremented array y must contain the n-element vector y.
incy INTEGER. Specifies the increment for the elements of y. The value of incy must not be zero.

Output Parameters
y Overwritten by the updated vector y.

?spr
Performs the symmetric rank-1 update of a complex symmetric packed matrix.

Syntax
call cspr( uplo, n, alpha, x, incx, ap )
call zspr( uplo, n, alpha, x, incx, ap )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The ?spr routines perform a matrix-vector operation defined as
a := alpha*x*x^H + a,
where:
alpha is a complex scalar,
x is an n-element complex vector,
a is an n-by-n complex symmetric matrix, supplied in packed form.
These routines have their real equivalents in BLAS (see ?spr in Chapter 2).

Input Parameters
uplo CHARACTER*1. Specifies whether the upper or lower triangular part of the matrix a is supplied in the packed array ap, as follows:
If uplo = 'U' or 'u', the upper triangular part of the matrix a is supplied in the array ap.
If uplo = 'L' or 'l', the lower triangular part of the matrix a is supplied in the array ap.
n INTEGER. Specifies the order of the matrix a. The value of n must be at least zero.
alpha COMPLEX for cspr; DOUBLE COMPLEX for zspr. Specifies the scalar alpha.
x COMPLEX for cspr; DOUBLE COMPLEX for zspr. Array, DIMENSION at least (1 + (n - 1)*abs(incx)). Before entry, the incremented array x must contain the n-element vector x.
incx INTEGER. Specifies the increment for the elements of x. The value of incx must not be zero.
ap COMPLEX for cspr; DOUBLE COMPLEX for zspr. Array, DIMENSION at least ((n*(n + 1))/2).
Before entry, with uplo = 'U' or 'u', the array ap must contain the upper triangular part of the symmetric matrix packed sequentially, column-by-column, so that ap(1) contains a(1,1), ap(2) and ap(3) contain a(1,2) and a(2,2) respectively, and so on.
Before entry, with uplo = 'L' or 'l', the array ap must contain the lower triangular part of the symmetric matrix packed sequentially, column-by-column, so that ap(1) contains a(1,1), ap(2) and ap(3) contain a(2,1) and a(3,1) respectively, and so on.
Note that the imaginary parts of the diagonal elements need not be set, they are assumed to be zero, and on exit they are set to zero.

Output Parameters
ap With uplo = 'U' or 'u', overwritten by the upper triangular part of the updated matrix. With uplo = 'L' or 'l', overwritten by the lower triangular part of the updated matrix.

?symv
Computes a matrix-vector product for a complex symmetric matrix.

Syntax
call csymv( uplo, n, alpha, a, lda, x, incx, beta, y, incy )
call zsymv( uplo, n, alpha, a, lda, x, incx, beta, y, incy )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine performs the matrix-vector operation defined as
y := alpha*a*x + beta*y,
where:
alpha and beta are complex scalars,
x and y are n-element complex vectors,
a is an n-by-n symmetric complex matrix.
These routines have their real equivalents in BLAS (see ?symv in Chapter 2).

Input Parameters
uplo CHARACTER*1. Specifies whether the upper or lower triangular part of the array a is used:
If uplo = 'U' or 'u', then the upper triangular part of the array a is used.
If uplo = 'L' or 'l', then the lower triangular part of the array a is used.
n INTEGER. Specifies the order of the matrix a. The value of n must be at least zero.
alpha, beta COMPLEX for csymv; DOUBLE COMPLEX for zsymv. Specify the scalars alpha and beta. When beta is supplied as zero, then y need not be set on input.
a COMPLEX for csymv; DOUBLE COMPLEX for zsymv. Array, DIMENSION (lda, n).
Before entry with uplo = 'U' or 'u', the leading n-by-n upper triangular part of the array a must contain the upper triangular part of the symmetric matrix and the strictly lower triangular part of a is not referenced.
Before entry with uplo = 'L' or 'l', the leading n-by-n lower triangular part of the array a must contain the lower triangular part of the symmetric matrix and the strictly upper triangular part of a is not referenced.
lda INTEGER. Specifies the leading dimension of a as declared in the calling (sub)program. The value of lda must be at least max(1,n).
x COMPLEX for csymv; DOUBLE COMPLEX for zsymv. Array, DIMENSION at least (1 + (n - 1)*abs(incx)). Before entry, the incremented array x must contain the n-element vector x.
incx INTEGER. Specifies the increment for the elements of x. The value of incx must not be zero.
y COMPLEX for csymv; DOUBLE COMPLEX for zsymv. Array, DIMENSION at least (1 + (n - 1)*abs(incy)). Before entry, the incremented array y must contain the n-element vector y.
incy INTEGER. Specifies the increment for the elements of y. The value of incy must not be zero.

Output Parameters
y Overwritten by the updated vector y.

?syr
Performs the symmetric rank-1 update of a complex symmetric matrix.

Syntax
call csyr( uplo, n, alpha, x, incx, a, lda )
call zsyr( uplo, n, alpha, x, incx, a, lda )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine performs the symmetric rank 1 operation defined as
a := alpha*x*x^H + a,
where:
• alpha is a complex scalar.
• x is an n-element complex vector.
• a is an n-by-n complex symmetric matrix.
These routines have their real equivalents in BLAS (see ?syr in Chapter 2).

Input Parameters
uplo CHARACTER*1.
Specifies whether the upper or lower triangular part of the array a is used:
If uplo = 'U' or 'u', then the upper triangular part of the array a is used.
If uplo = 'L' or 'l', then the lower triangular part of the array a is used.
n INTEGER. Specifies the order of the matrix a. The value of n must be at least zero.
alpha COMPLEX for csyr; DOUBLE COMPLEX for zsyr. Specifies the scalar alpha.
x COMPLEX for csyr; DOUBLE COMPLEX for zsyr. Array, DIMENSION at least (1 + (n - 1)*abs(incx)). Before entry, the incremented array x must contain the n-element vector x.
incx INTEGER. Specifies the increment for the elements of x. The value of incx must not be zero.
a COMPLEX for csyr; DOUBLE COMPLEX for zsyr. Array, DIMENSION (lda, n).
Before entry with uplo = 'U' or 'u', the leading n-by-n upper triangular part of the array a must contain the upper triangular part of the symmetric matrix and the strictly lower triangular part of a is not referenced.
Before entry with uplo = 'L' or 'l', the leading n-by-n lower triangular part of the array a must contain the lower triangular part of the symmetric matrix and the strictly upper triangular part of a is not referenced.
lda INTEGER. Specifies the leading dimension of a as declared in the calling (sub)program. The value of lda must be at least max(1,n).

Output Parameters
a With uplo = 'U' or 'u', the upper triangular part of the array a is overwritten by the upper triangular part of the updated matrix. With uplo = 'L' or 'l', the lower triangular part of the array a is overwritten by the lower triangular part of the updated matrix.

i?max1
Finds the index of the vector element whose real part has maximum absolute value.

Syntax
index = icmax1( n, cx, incx )
index = izmax1( n, cx, incx )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
Given a complex vector cx, the i?max1 functions return the index of the vector element whose real part has maximum absolute value.
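The selection rule stated for i?max1 can be sketched in a few lines of Python. This is only an illustration of the description above, not the Intel MKL implementation: the function name imax1_sketch is hypothetical, the result is a 1-based index as in Fortran, and returning 0 for n < 1 and keeping the first element on ties are assumptions of this sketch.

```python
def imax1_sketch(n, cx, incx=1):
    """Return the 1-based position of the element whose real part
    has the largest absolute value (hypothetical helper, not MKL)."""
    if n < 1:
        return 0  # assumed convention for an empty vector
    best_index = 1
    best_value = abs(cx[0].real)
    for i in range(1, n):
        # the i-th logical element lives at offset i*incx in the storage
        value = abs(cx[i * incx].real)
        if value > best_value:  # strict comparison keeps the first maximum
            best_index, best_value = i + 1, value
    return best_index


# Real parts are 1, -4, 2, so the second element is selected even though
# the third element has the largest modulus.
result = imax1_sketch(3, [1 + 5j, -4 + 0j, 2 - 9j])
```

With a stride, for example imax1_sketch(3, data, 2), only every second stored element is inspected.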
These functions are based on the BLAS functions icamax/izamax, but using the absolute value of the real part. They are designed for use with clacon/zlacon.

Input Parameters
n INTEGER. Specifies the number of elements in the vector cx.
cx COMPLEX for icmax1; DOUBLE COMPLEX for izmax1. Array, DIMENSION at least (1+(n-1)*abs(incx)). Contains the input vector.
incx INTEGER. Specifies the spacing between successive elements of cx.

Output Parameters
index INTEGER. Contains the index of the vector element whose real part has maximum absolute value.

?sum1
Forms the 1-norm of the complex vector using the true absolute value.

Syntax
res = scsum1( n, cx, incx )
res = dzsum1( n, cx, incx )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
Given a complex vector cx, scsum1/dzsum1 functions take the sum of the absolute values of vector elements and return a single/double precision result, respectively. These functions are based on scasum/dzasum from Level 1 BLAS, but use the true absolute value and were designed for use with clacon/zlacon.

Input Parameters
n INTEGER. Specifies the number of elements in the vector cx.
cx COMPLEX for scsum1; DOUBLE COMPLEX for dzsum1. Array, DIMENSION at least (1+(n-1)*abs(incx)). Contains the input vector whose elements will be summed.
incx INTEGER. Specifies the spacing between successive elements of cx (incx > 0).

Output Parameters
res REAL for scsum1; DOUBLE PRECISION for dzsum1. Contains the sum of absolute values.

?gbtf2
Computes the LU factorization of a general band matrix using the unblocked version of the algorithm.
Syntax
call sgbtf2( m, n, kl, ku, ab, ldab, ipiv, info )
call dgbtf2( m, n, kl, ku, ab, ldab, ipiv, info )
call cgbtf2( m, n, kl, ku, ab, ldab, ipiv, info )
call zgbtf2( m, n, kl, ku, ab, ldab, ipiv, info )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine forms the LU factorization of a general real/complex m-by-n band matrix A with kl sub-diagonals and ku super-diagonals. The routine uses partial pivoting with row interchanges and implements the unblocked version of the algorithm, calling Level 2 BLAS. See also ?gbtrf.

Input Parameters
m INTEGER. The number of rows of the matrix A (m ≥ 0).
n INTEGER. The number of columns in A (n ≥ 0).
kl INTEGER. The number of sub-diagonals within the band of A (kl ≥ 0).
ku INTEGER. The number of super-diagonals within the band of A (ku ≥ 0).
ab REAL for sgbtf2; DOUBLE PRECISION for dgbtf2; COMPLEX for cgbtf2; DOUBLE COMPLEX for zgbtf2. Array, DIMENSION (ldab,*). The array ab contains the matrix A in band storage (see Matrix Arguments). The second dimension of ab must be at least max(1, n).
ldab INTEGER. The leading dimension of the array ab (ldab ≥ 2*kl + ku + 1).

Output Parameters
ab Overwritten by details of the factorization. The diagonal and kl + ku superdiagonals of U are stored in the first 1 + kl + ku rows of ab. The multipliers used during the factorization are stored in the next kl rows.
ipiv INTEGER. Array, DIMENSION at least max(1,min(m,n)). The pivot indices: row i was interchanged with row ipiv(i).
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value. If info = i, u(i,i) is 0. The factorization has been completed, but U is exactly singular. Division by 0 will occur if you use the factor U for solving a system of linear equations.

?gebd2
Reduces a general matrix to bidiagonal form using an unblocked algorithm.
Syntax
call sgebd2( m, n, a, lda, d, e, tauq, taup, work, info )
call dgebd2( m, n, a, lda, d, e, tauq, taup, work, info )
call cgebd2( m, n, a, lda, d, e, tauq, taup, work, info )
call zgebd2( m, n, a, lda, d, e, tauq, taup, work, info )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine reduces a general m-by-n matrix A to upper or lower bidiagonal form B by an orthogonal (unitary) transformation: Q^T*A*P = B (for real flavors) or Q^H*A*P = B (for complex flavors).
If m ≥ n, B is upper bidiagonal; if m < n, B is lower bidiagonal.
The routine does not form the matrices Q and P explicitly, but represents them as products of elementary reflectors.
If m ≥ n, Q = H(1)*H(2)*...*H(n), and P = G(1)*G(2)*...*G(n-1).
If m < n, Q = H(1)*H(2)*...*H(m-1), and P = G(1)*G(2)*...*G(m).
Each H(i) and G(i) has the form
H(i) = I - tauq*v*v^T and G(i) = I - taup*u*u^T for real flavors, or
H(i) = I - tauq*v*v^H and G(i) = I - taup*u*u^H for complex flavors,
where tauq and taup are scalars (real for sgebd2/dgebd2, complex for cgebd2/zgebd2), and v and u are vectors (real for sgebd2/dgebd2, complex for cgebd2/zgebd2).

Input Parameters
m INTEGER. The number of rows in the matrix A (m ≥ 0).
n INTEGER. The number of columns in A (n ≥ 0).
a, work REAL for sgebd2; DOUBLE PRECISION for dgebd2; COMPLEX for cgebd2; DOUBLE COMPLEX for zgebd2. Arrays:
a(lda,*) contains the m-by-n general matrix A to be reduced. The second dimension of a must be at least max(1, n).
work(*) is a workspace array, the dimension of work must be at least max(1, m, n).
lda INTEGER. The leading dimension of a; at least max(1, m).

Output Parameters
a If m ≥ n, the diagonal and first super-diagonal of a are overwritten with the upper bidiagonal matrix B.
Elements below the diagonal, with the array tauq, represent the orthogonal/unitary matrix Q as a product of elementary reflectors, and elements above the first superdiagonal, with the array taup, represent the orthogonal/unitary matrix P as a product of elementary reflectors.
If m < n, the diagonal and first sub-diagonal of a are overwritten by the lower bidiagonal matrix B. Elements below the first subdiagonal, with the array tauq, represent the orthogonal/unitary matrix Q as a product of elementary reflectors, and elements above the diagonal, with the array taup, represent the orthogonal/unitary matrix P as a product of elementary reflectors.
d REAL for single-precision flavors; DOUBLE PRECISION for double-precision flavors. Array, DIMENSION at least max(1, min(m, n)). Contains the diagonal elements of the bidiagonal matrix B: d(i) = a(i, i).
e REAL for single-precision flavors; DOUBLE PRECISION for double-precision flavors. Array, DIMENSION at least max(1, min(m, n) - 1). Contains the off-diagonal elements of the bidiagonal matrix B:
if m ≥ n, e(i) = a(i, i+1) for i = 1,2,..., n-1;
if m < n, e(i) = a(i+1, i) for i = 1,2,..., m-1.
tauq, taup REAL for sgebd2; DOUBLE PRECISION for dgebd2; COMPLEX for cgebd2; DOUBLE COMPLEX for zgebd2. Arrays, DIMENSION at least max(1, min(m, n)). Contain scalar factors of the elementary reflectors which represent orthogonal/unitary matrices Q and P, respectively.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

?gehd2
Reduces a general square matrix to upper Hessenberg form using an unblocked algorithm.
Syntax
call sgehd2( n, ilo, ihi, a, lda, tau, work, info )
call dgehd2( n, ilo, ihi, a, lda, tau, work, info )
call cgehd2( n, ilo, ihi, a, lda, tau, work, info )
call zgehd2( n, ilo, ihi, a, lda, tau, work, info )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine reduces a real/complex general matrix A to upper Hessenberg form H by an orthogonal or unitary similarity transformation Q^T*A*Q = H (for real flavors) or Q^H*A*Q = H (for complex flavors).
The routine does not form the matrix Q explicitly. Instead, Q is represented as a product of elementary reflectors.

Input Parameters
n INTEGER. The order of the matrix A (n ≥ 0).
ilo, ihi INTEGER. It is assumed that A is already upper triangular in rows and columns 1:ilo-1 and ihi+1:n. If A has been output by ?gebal, then ilo and ihi must contain the values returned by that routine. Otherwise they should be set to ilo = 1 and ihi = n. Constraint: 1 ≤ ilo ≤ ihi ≤ max(1, n).
a, work REAL for sgehd2; DOUBLE PRECISION for dgehd2; COMPLEX for cgehd2; DOUBLE COMPLEX for zgehd2. Arrays:
a(lda,*) contains the n-by-n matrix A to be reduced. The second dimension of a must be at least max(1, n).
work(n) is a workspace array.
lda INTEGER. The leading dimension of a; at least max(1, n).

Output Parameters
a On exit, the upper triangle and the first subdiagonal of A are overwritten with the upper Hessenberg matrix H and the elements below the first subdiagonal, with the array tau, represent the orthogonal/unitary matrix Q as a product of elementary reflectors. See Application Notes below.
tau REAL for sgehd2; DOUBLE PRECISION for dgehd2; COMPLEX for cgehd2; DOUBLE COMPLEX for zgehd2. Array, DIMENSION at least max(1, n-1). Contains the scalar factors of elementary reflectors. See Application Notes below.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.
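The reduction routines in this section represent Q as a product of elementary reflectors of the form H = I - tau*v*v^T (real flavors). The Python sketch below is not part of Intel MKL; it only illustrates how one such reflector can be applied to a vector without forming H explicitly. The name apply_reflector and the sample values are hypothetical, and tau = 2/(v^T*v) is an assumption of this sketch, chosen because that value makes H orthogonal.

```python
def apply_reflector(tau, v, x):
    """Return H*x for H = I - tau*v*v^T (real case) without building H."""
    dot = sum(vi * xi for vi, xi in zip(v, x))  # v^T * x
    return [xi - tau * dot * vi for vi, xi in zip(v, x)]


# Hypothetical data: v has a leading 1, as in the reflector vectors above.
v = [1.0, 0.5, -2.0]
tau = 2.0 / sum(vi * vi for vi in v)  # assumed choice that makes H orthogonal
x = [3.0, -1.0, 4.0]

hx = apply_reflector(tau, v, x)    # H*x
hhx = apply_reflector(tau, v, hx)  # H*(H*x), equal to x up to rounding
```

Because H is symmetric and orthogonal for this tau, applying it twice restores x and the Euclidean norm of x is preserved. Only the scalar tau and the vector v need to be stored, which is why the routines return tau and the nontrivial part of v instead of Q itself.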
Application Notes
The matrix Q is represented as a product of (ihi - ilo) elementary reflectors
Q = H(ilo)*H(ilo+1)*...*H(ihi-1)
Each H(i) has the form
H(i) = I - tau*v*v^T for real flavors, or
H(i) = I - tau*v*v^H for complex flavors,
where tau is a real/complex scalar, and v is a real/complex vector with v(1:i) = 0, v(i+1) = 1 and v(ihi+1:n) = 0.
On exit, v(i+2:ihi) is stored in a(i+2:ihi, i) and tau in tau(i).
The contents of a are illustrated by the following example, with n = 7, ilo = 2 and ihi = 6:
[matrix illustration]
where a denotes an element of the original matrix A, h denotes a modified element of the upper Hessenberg matrix H, and vi denotes an element of the vector defining H(i).

?gelq2
Computes the LQ factorization of a general rectangular matrix using an unblocked algorithm.

Syntax
call sgelq2( m, n, a, lda, tau, work, info )
call dgelq2( m, n, a, lda, tau, work, info )
call cgelq2( m, n, a, lda, tau, work, info )
call zgelq2( m, n, a, lda, tau, work, info )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine computes an LQ factorization of a real/complex m-by-n matrix A as A = L*Q.
The routine does not form the matrix Q explicitly. Instead, Q is represented as a product of min(m, n) elementary reflectors:
Q = H(k) ... H(2) H(1) (or Q = H(k)^H ... H(2)^H H(1)^H for complex flavors), where k = min(m, n).
Each H(i) has the form
H(i) = I - tau*v*v^T for real flavors, or
H(i) = I - tau*v*v^H for complex flavors,
where tau is a real/complex scalar stored in tau(i), and v is a real/complex vector with v(1:i-1) = 0 and v(i) = 1.
On exit, v(i+1:n) (for real functions) and conjg(v(i+1:n)) (for complex functions) are stored in a(i, i+1:n).

Input Parameters
m INTEGER. The number of rows in the matrix A (m ≥ 0).
n INTEGER. The number of columns in A (n ≥ 0).
a, work REAL for sgelq2; DOUBLE PRECISION for dgelq2; COMPLEX for cgelq2; DOUBLE COMPLEX for zgelq2.
Arrays:
a(lda,*) contains the m-by-n matrix A. The second dimension of a must be at least max(1, n).
work(m) is a workspace array.
lda INTEGER. The leading dimension of a; at least max(1, m).

Output Parameters
a Overwritten by the factorization data as follows: on exit, the elements on and below the diagonal of the array a contain the m-by-min(n,m) lower trapezoidal matrix L (L is lower triangular if n ≥ m); the elements above the diagonal, with the array tau, represent the orthogonal/unitary matrix Q as a product of min(n,m) elementary reflectors.
tau REAL for sgelq2; DOUBLE PRECISION for dgelq2; COMPLEX for cgelq2; DOUBLE COMPLEX for zgelq2. Array, DIMENSION at least max(1, min(m, n)). Contains scalar factors of the elementary reflectors.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

?geql2
Computes the QL factorization of a general rectangular matrix using an unblocked algorithm.

Syntax
call sgeql2( m, n, a, lda, tau, work, info )
call dgeql2( m, n, a, lda, tau, work, info )
call cgeql2( m, n, a, lda, tau, work, info )
call zgeql2( m, n, a, lda, tau, work, info )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine computes a QL factorization of a real/complex m-by-n matrix A as A = Q*L.
The routine does not form the matrix Q explicitly. Instead, Q is represented as a product of min(m, n) elementary reflectors:
Q = H(k)* ... *H(2)*H(1), where k = min(m, n).
Each H(i) has the form
H(i) = I - tau*v*v^T for real flavors, or
H(i) = I - tau*v*v^H for complex flavors,
where tau is a real/complex scalar stored in tau(i), and v is a real/complex vector with v(m-k+i+1:m) = 0 and v(m-k+i) = 1.
On exit, v(1:m-k+i-1) is stored in a(1:m-k+i-1, n-k+i).

Input Parameters
m INTEGER. The number of rows in the matrix A (m ≥ 0).
n INTEGER. The number of columns in A (n ≥ 0).
a, work REAL for sgeql2; DOUBLE PRECISION for dgeql2; COMPLEX for cgeql2; DOUBLE COMPLEX for zgeql2. Arrays:
a(lda,*) contains the m-by-n matrix A. The second dimension of a must be at least max(1, n).
work(m) is a workspace array.
lda INTEGER. The leading dimension of a; at least max(1, m).

Output Parameters
a Overwritten by the factorization data as follows: on exit, if m ≥ n, the lower triangle of the subarray a(m-n+1:m, 1:n) contains the n-by-n lower triangular matrix L; if m < n, the elements on and below the (n-m)-th superdiagonal contain the m-by-n lower trapezoidal matrix L; the remaining elements, with the array tau, represent the orthogonal/unitary matrix Q as a product of elementary reflectors.
tau REAL for sgeql2; DOUBLE PRECISION for dgeql2; COMPLEX for cgeql2; DOUBLE COMPLEX for zgeql2. Array, DIMENSION at least max(1, min(m, n)). Contains scalar factors of the elementary reflectors.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

?geqr2
Computes the QR factorization of a general rectangular matrix using an unblocked algorithm.

Syntax
call sgeqr2( m, n, a, lda, tau, work, info )
call dgeqr2( m, n, a, lda, tau, work, info )
call cgeqr2( m, n, a, lda, tau, work, info )
call zgeqr2( m, n, a, lda, tau, work, info )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine computes a QR factorization of a real/complex m-by-n matrix A as A = Q*R.
The routine does not form the matrix Q explicitly. Instead, Q is represented as a product of min(m, n) elementary reflectors:
Q = H(1)*H(2)* ... *H(k), where k = min(m, n).
Each H(i) has the form
H(i) = I - tau*v*v^T for real flavors, or
H(i) = I - tau*v*v^H for complex flavors,
where tau is a real/complex scalar stored in tau(i), and v is a real/complex vector with v(1:i-1) = 0 and v(i) = 1.
On exit, v(i+1:m) is stored in a(i+1:m, i).

Input Parameters
m INTEGER.
The number of rows in the matrix A (m ≥ 0).
n INTEGER. The number of columns in A (n ≥ 0).
a, work REAL for sgeqr2; DOUBLE PRECISION for dgeqr2; COMPLEX for cgeqr2; DOUBLE COMPLEX for zgeqr2. Arrays:
a(lda,*) contains the m-by-n matrix A. The second dimension of a must be at least max(1, n).
work(n) is a workspace array.
lda INTEGER. The leading dimension of a; at least max(1, m).

Output Parameters
a Overwritten by the factorization data as follows: on exit, the elements on and above the diagonal of the array a contain the min(n,m)-by-n upper trapezoidal matrix R (R is upper triangular if m ≥ n); the elements below the diagonal, with the array tau, represent the orthogonal/unitary matrix Q as a product of elementary reflectors.
tau REAL for sgeqr2; DOUBLE PRECISION for dgeqr2; COMPLEX for cgeqr2; DOUBLE COMPLEX for zgeqr2. Array, DIMENSION at least max(1, min(m, n)). Contains scalar factors of the elementary reflectors.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

?geqr2p
Computes the QR factorization of a general rectangular matrix with non-negative diagonal elements using an unblocked algorithm.

Syntax
call sgeqr2p( m, n, a, lda, tau, work, info )
call dgeqr2p( m, n, a, lda, tau, work, info )
call cgeqr2p( m, n, a, lda, tau, work, info )
call zgeqr2p( m, n, a, lda, tau, work, info )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine computes a QR factorization of a real/complex m-by-n matrix A as A = Q*R.
The routine does not form the matrix Q explicitly. Instead, Q is represented as a product of min(m, n) elementary reflectors:
Q = H(1)*H(2)* ... *H(k), where k = min(m, n).
Each H(i) has the form
H(i) = I - tau*v*v^T for real flavors, or
H(i) = I - tau*v*v^H for complex flavors,
where tau is a real/complex scalar stored in tau(i), and v is a real/complex vector with v(1:i-1) = 0 and v(i) = 1.
On exit, v(i+1:m) is stored in a(i+1:m, i).

Input Parameters
m INTEGER. The number of rows in the matrix A (m ≥ 0).
n INTEGER. The number of columns in A (n ≥ 0).
a, work REAL for sgeqr2p; DOUBLE PRECISION for dgeqr2p; COMPLEX for cgeqr2p; DOUBLE COMPLEX for zgeqr2p. Arrays:
a(lda,*) contains the m-by-n matrix A. The second dimension of a must be at least max(1, n).
work(n) is a workspace array.
lda INTEGER. The leading dimension of a; at least max(1, m).

Output Parameters
a Overwritten by the factorization data as follows: on exit, the elements on and above the diagonal of the array a contain the min(n,m)-by-n upper trapezoidal matrix R (R is upper triangular if m ≥ n); the elements below the diagonal, with the array tau, represent the orthogonal/unitary matrix Q as a product of elementary reflectors. The diagonal elements of the matrix R are non-negative.
tau REAL for sgeqr2p; DOUBLE PRECISION for dgeqr2p; COMPLEX for cgeqr2p; DOUBLE COMPLEX for zgeqr2p. Array, DIMENSION at least max(1, min(m, n)). Contains scalar factors of the elementary reflectors.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

?gerq2
Computes the RQ factorization of a general rectangular matrix using an unblocked algorithm.

Syntax
call sgerq2( m, n, a, lda, tau, work, info )
call dgerq2( m, n, a, lda, tau, work, info )
call cgerq2( m, n, a, lda, tau, work, info )
call zgerq2( m, n, a, lda, tau, work, info )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine computes an RQ factorization of a real/complex m-by-n matrix A as A = R*Q.
The routine does not form the matrix Q explicitly. Instead, Q is represented as a product of min(m, n) elementary reflectors:
Q = H(1)*H(2)* ... *H(k) for real flavors, or
Q = H(1)^H*H(2)^H* ... *H(k)^H for complex flavors,
where k = min(m, n).
Each H(i) has the form
H(i) = I - tau*v*v^T for real flavors, or
H(i) = I - tau*v*v^H for complex flavors,
where tau is a real/complex scalar stored in tau(i), and v is a real/complex vector with v(n-k+i+1:n) = 0 and v(n-k+i) = 1.
On exit, v(1:n-k+i-1) is stored in a(m-k+i, 1:n-k+i-1).

Input Parameters
m INTEGER. The number of rows in the matrix A (m ≥ 0).
n INTEGER. The number of columns in A (n ≥ 0).
a, work REAL for sgerq2; DOUBLE PRECISION for dgerq2; COMPLEX for cgerq2; DOUBLE COMPLEX for zgerq2. Arrays:
a(lda,*) contains the m-by-n matrix A. The second dimension of a must be at least max(1, n).
work(m) is a workspace array.
lda INTEGER. The leading dimension of a; at least max(1, m).

Output Parameters
a Overwritten by the factorization data as follows: on exit, if m ≤ n, the upper triangle of the subarray a(1:m, n-m+1:n) contains the m-by-m upper triangular matrix R; if m > n, the elements on and above the (m-n)-th subdiagonal contain the m-by-n upper trapezoidal matrix R; the remaining elements, with the array tau, represent the orthogonal/unitary matrix Q as a product of elementary reflectors.
tau REAL for sgerq2; DOUBLE PRECISION for dgerq2; COMPLEX for cgerq2; DOUBLE COMPLEX for zgerq2. Array, DIMENSION at least max(1, min(m, n)). Contains scalar factors of the elementary reflectors.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

?gesc2
Solves a system of linear equations using the LU factorization with complete pivoting computed by ?getc2.
Syntax
call sgesc2( n, a, lda, rhs, ipiv, jpiv, scale )
call dgesc2( n, a, lda, rhs, ipiv, jpiv, scale )
call cgesc2( n, a, lda, rhs, ipiv, jpiv, scale )
call zgesc2( n, a, lda, rhs, ipiv, jpiv, scale )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine solves a system of linear equations A*X = scale*RHS with a general n-by-n matrix A using the LU factorization with complete pivoting computed by ?getc2.

Input Parameters
n INTEGER. The order of the matrix A.
a, rhs REAL for sgesc2; DOUBLE PRECISION for dgesc2; COMPLEX for cgesc2; DOUBLE COMPLEX for zgesc2. Arrays:
a(lda,*) contains the LU part of the factorization of the n-by-n matrix A computed by ?getc2: A = P*L*U*Q. The second dimension of a must be at least max(1, n);
rhs(n) contains on entry the right hand side vector for the system of equations.
lda INTEGER. The leading dimension of a; at least max(1, n).
ipiv INTEGER. Array, DIMENSION at least max(1,n). The pivot indices: for 1 ≤ i ≤ n, row i of the matrix has been interchanged with row ipiv(i).
jpiv INTEGER. Array, DIMENSION at least max(1,n). The pivot indices: for 1 ≤ j ≤ n, column j of the matrix has been interchanged with column jpiv(j).

Output Parameters
rhs On exit, overwritten with the solution vector X.
scale REAL for sgesc2/cgesc2; DOUBLE PRECISION for dgesc2/zgesc2. Contains the scale factor. scale is chosen in the range 0 ≤ scale ≤ 1 to prevent overflow in the solution.

?getc2
Computes the LU factorization with complete pivoting of the general n-by-n matrix.

Syntax
call sgetc2( n, a, lda, ipiv, jpiv, info )
call dgetc2( n, a, lda, ipiv, jpiv, info )
call cgetc2( n, a, lda, ipiv, jpiv, info )
call zgetc2( n, a, lda, ipiv, jpiv, info )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine computes an LU factorization with complete pivoting of the n-by-n matrix A.
The factorization has the form A = P*L*U*Q, where P and Q are permutation matrices, L is lower triangular with unit diagonal elements and U is upper triangular.
The LU factorization computed by this routine is used by ?latdf to compute a contribution to the reciprocal Dif-estimate.

Input Parameters
n INTEGER. The order of the matrix A (n ≥ 0).
a REAL for sgetc2; DOUBLE PRECISION for dgetc2; COMPLEX for cgetc2; DOUBLE COMPLEX for zgetc2. Array a(lda,*) contains the n-by-n matrix A to be factored. The second dimension of a must be at least max(1, n).
lda INTEGER. The leading dimension of a; at least max(1, n).

Output Parameters
a On exit, the factors L and U from the factorization A = P*L*U*Q; the unit diagonal elements of L are not stored. If U(k, k) appears to be less than smin, U(k, k) is given the value of smin, which gives a nonsingular perturbed system.
ipiv INTEGER. Array, DIMENSION at least max(1,n). The pivot indices: for 1 ≤ i ≤ n, row i of the matrix has been interchanged with row ipiv(i).
jpiv INTEGER. Array, DIMENSION at least max(1,n). The pivot indices: for 1 ≤ j ≤ n, column j of the matrix has been interchanged with column jpiv(j).
info INTEGER. If info = 0, the execution is successful. If info = k > 0, U(k, k) is likely to produce overflow when solving for x in A*x = b, so U is perturbed to avoid the overflow.

?getf2
Computes the LU factorization of a general m-by-n matrix using partial pivoting with row interchanges (unblocked algorithm).

Syntax
call sgetf2( m, n, a, lda, ipiv, info )
call dgetf2( m, n, a, lda, ipiv, info )
call cgetf2( m, n, a, lda, ipiv, info )
call zgetf2( m, n, a, lda, ipiv, info )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine computes the LU factorization of a general m-by-n matrix A using partial pivoting with row interchanges.
The factorization has the form A = P*L*U, where P is a permutation matrix, L is lower triangular with unit diagonal elements (lower trapezoidal if m > n), and U is upper triangular (upper trapezoidal if m < n).
Input Parameters
m INTEGER. The number of rows in the matrix A (m ≥ 0).
n INTEGER. The number of columns in A (n ≥ 0).
a REAL for sgetf2, DOUBLE PRECISION for dgetf2, COMPLEX for cgetf2, DOUBLE COMPLEX for zgetf2. Array, DIMENSION (lda,*). Contains the matrix A to be factored. The second dimension of a must be at least max(1, n).
lda INTEGER. The leading dimension of a; at least max(1, m).
Output Parameters
a Overwritten by L and U. The unit diagonal elements of L are not stored.
ipiv INTEGER. Array, DIMENSION at least max(1,min(m,n)). The pivot indices: for 1 ≤ i ≤ min(m,n), row i was interchanged with row ipiv(i).
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value. If info = i > 0, U(i, i) is 0. The factorization has been completed, but U is exactly singular. Division by 0 will occur if you use the factor U for solving a system of linear equations.

?gtts2
Solves a system of linear equations with a tridiagonal matrix using the LU factorization computed by ?gttrf.
Syntax
call sgtts2( itrans, n, nrhs, dl, d, du, du2, ipiv, b, ldb )
call dgtts2( itrans, n, nrhs, dl, d, du, du2, ipiv, b, ldb )
call cgtts2( itrans, n, nrhs, dl, d, du, du2, ipiv, b, ldb )
call zgtts2( itrans, n, nrhs, dl, d, du, du2, ipiv, b, ldb )
Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h
Description
The routine solves for X one of the following systems of linear equations with multiple right-hand sides: A*X = B, A^T*X = B, or A^H*X = B (for complex matrices only), with a tridiagonal matrix A using the LU factorization computed by ?gttrf.
Input Parameters
itrans INTEGER. Must be 0, 1, or 2.
Indicates the form of the equations to be solved: If itrans = 0, then A*X = B (no transpose). If itrans = 1, then A^T*X = B (transpose). If itrans = 2, then A^H*X = B (conjugate transpose).
n INTEGER. The order of the matrix A (n ≥ 0).
nrhs INTEGER. The number of right-hand sides, i.e., the number of columns in B (nrhs ≥ 0).
dl, d, du, du2, b REAL for sgtts2, DOUBLE PRECISION for dgtts2, COMPLEX for cgtts2, DOUBLE COMPLEX for zgtts2. Arrays: dl(n - 1), d(n), du(n - 1), du2(n - 2), b(ldb,nrhs). The array dl contains the (n - 1) multipliers that define the matrix L from the LU factorization of A. The array d contains the n diagonal elements of the upper triangular matrix U from the LU factorization of A. The array du contains the (n - 1) elements of the first super-diagonal of U. The array du2 contains the (n - 2) elements of the second super-diagonal of U. The array b contains the matrix B whose columns are the right-hand sides for the systems of equations.
ldb INTEGER. The leading dimension of b; ldb ≥ max(1, n).
ipiv INTEGER. Array, DIMENSION (n). The pivot indices array, as returned by ?gttrf.
Output Parameters
b Overwritten by the solution matrix X.

?isnan
Tests input for NaN.
Syntax
val = sisnan( sin )
val = disnan( din )
Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h
Description
This logical routine returns .TRUE. if its argument is NaN, and .FALSE. otherwise.
Input Parameters
sin REAL for sisnan. Input to test for NaN.
din DOUBLE PRECISION for disnan. Input to test for NaN.
Output Parameters
val Logical. Result of the test.

?laisnan
Tests input for NaN.
Syntax
val = slaisnan( sin1, sin2 )
val = dlaisnan( din1, din2 )
Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h
Description
This logical routine checks for NaNs (NaN stands for 'Not A Number') by comparing its two arguments for inequality. NaN is the only floating-point value where NaN ≠ NaN returns .TRUE.
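The inequality test described above can be sketched in Python. This is an illustration only, not the library's code; the function names laisnan_sketch and isnan_sketch are hypothetical and simply mirror the two-argument comparison and its one-argument wrapper.

```python
def laisnan_sketch(x1: float, x2: float) -> bool:
    """Return True when the two arguments compare unequal."""
    return x1 != x2


def isnan_sketch(x: float) -> bool:
    """Test for NaN by comparing a value with itself."""
    # Only NaN is unequal to itself under IEEE 754 comparison rules.
    return laisnan_sketch(x, x)


if __name__ == "__main__":
    print(isnan_sketch(float("nan")))
    print(isnan_sketch(1.5))
```

Infinities compare equal to themselves, so the sketch reports them as not NaN.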
To check for NaNs, pass the same variable as both arguments. This routine is not for general use. It exists solely to avoid over-optimization in ?isnan.
Input Parameters
sin1, sin2 REAL for slaisnan. Two numbers to compare for inequality.
din1, din2 DOUBLE PRECISION for dlaisnan. Two numbers to compare for inequality.
Output Parameters
val Logical. Result of the comparison.

?labrd
Reduces the first nb rows and columns of a general matrix to a bidiagonal form.
Syntax
call slabrd( m, n, nb, a, lda, d, e, tauq, taup, x, ldx, y, ldy )
call dlabrd( m, n, nb, a, lda, d, e, tauq, taup, x, ldx, y, ldy )
call clabrd( m, n, nb, a, lda, d, e, tauq, taup, x, ldx, y, ldy )
call zlabrd( m, n, nb, a, lda, d, e, tauq, taup, x, ldx, y, ldy )
Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h
Description
The routine reduces the first nb rows and columns of a general m-by-n matrix A to upper or lower bidiagonal form by an orthogonal/unitary transformation Q'*A*P, and returns the matrices X and Y which are needed to apply the transformation to the unreduced part of A. If m ≥ n, A is reduced to upper bidiagonal form; if m < n, to lower bidiagonal form.
The matrices Q and P are represented as products of elementary reflectors: Q = H(1)*H(2)*...*H(nb), and P = G(1)*G(2)*...*G(nb).
Each H(i) and G(i) has the form H(i) = I - tauq*v*v' and G(i) = I - taup*u*u', where tauq and taup are scalars, and v and u are vectors.
The elements of the vectors v and u together form the m-by-nb matrix V and the nb-by-n matrix U' which are needed, with X and Y, to apply the transformation to the unreduced part of the matrix, using a block update of the form: A := A - V*Y' - X*U'.
This is an auxiliary routine called by ?gebrd.
Input Parameters
m INTEGER. The number of rows in the matrix A (m ≥ 0).
n INTEGER. The number of columns in A (n ≥ 0).
nb INTEGER. The number of leading rows and columns of A to be reduced.
a REAL for slabrd, DOUBLE PRECISION for dlabrd, COMPLEX for clabrd, DOUBLE COMPLEX for zlabrd. Array a(lda,*) contains the matrix A to be reduced. The second dimension of a must be at least max(1, n).
lda INTEGER. The leading dimension of a; at least max(1, m).
ldx INTEGER. The leading dimension of the output array x; must be at least max(1, m).
ldy INTEGER. The leading dimension of the output array y; must be at least max(1, n).
Output Parameters
a On exit, the first nb rows and columns of the matrix are overwritten; the rest of the array is unchanged.
If m ≥ n, elements on and below the diagonal in the first nb columns, with the array tauq, represent the orthogonal/unitary matrix Q as a product of elementary reflectors; and elements above the diagonal in the first nb rows, with the array taup, represent the orthogonal/unitary matrix P as a product of elementary reflectors.
If m < n, elements below the diagonal in the first nb columns, with the array tauq, represent the orthogonal/unitary matrix Q as a product of elementary reflectors, and elements on and above the diagonal in the first nb rows, with the array taup, represent the orthogonal/unitary matrix P as a product of elementary reflectors.
d, e REAL for single-precision flavors, DOUBLE PRECISION for double-precision flavors. Arrays, DIMENSION (nb) each. The array d contains the diagonal elements of the first nb rows and columns of the reduced matrix: d(i) = a(i,i). The array e contains the off-diagonal elements of the first nb rows and columns of the reduced matrix.
tauq, taup REAL for slabrd, DOUBLE PRECISION for dlabrd, COMPLEX for clabrd, DOUBLE COMPLEX for zlabrd. Arrays, DIMENSION (nb) each. Contain scalar factors of the elementary reflectors which represent the orthogonal/unitary matrices Q and P, respectively.
x, y REAL for slabrd, DOUBLE PRECISION for dlabrd, COMPLEX for clabrd, DOUBLE COMPLEX for zlabrd. Arrays, dimension x(ldx, nb), y(ldy, nb).
The array x contains the m-by-nb matrix X required to update the unreduced part of A. The array y contains the n-by-nb matrix Y required to update the unreduced part of A.
Application Notes
If m ≥ n, then for the elementary reflectors H(i) and G(i), v(1:i-1) = 0, v(i) = 1, and v(i:m) is stored on exit in a(i:m, i); u(1:i) = 0, u(i+1) = 1, and u(i+1:n) is stored on exit in a(i, i+1:n); tauq is stored in tauq(i) and taup in taup(i).
If m < n, v(1:i) = 0, v(i+1) = 1, and v(i+1:m) is stored on exit in a(i+2:m, i); u(1:i-1) = 0, u(i) = 1, and u(i:n) is stored on exit in a(i, i+1:n); tauq is stored in tauq(i) and taup in taup(i).
The contents of a on exit are illustrated by the following examples with nb = 2:
[Figure: contents of a on exit, nb = 2]
where a denotes an element of the original matrix which is unchanged, vi denotes an element of the vector defining H(i), and ui an element of the vector defining G(i).

?lacn2
Estimates the 1-norm of a square matrix, using reverse communication for evaluating matrix-vector products.
Syntax
call slacn2( n, v, x, isgn, est, kase, isave )
call dlacn2( n, v, x, isgn, est, kase, isave )
call clacn2( n, v, x, est, kase, isave )
call zlacn2( n, v, x, est, kase, isave )
Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h
Description
The routine estimates the 1-norm of a square, real or complex matrix A. Reverse communication is used for evaluating matrix-vector products.
Input Parameters
n INTEGER. The order of the matrix A (n ≥ 1).
v, x REAL for slacn2, DOUBLE PRECISION for dlacn2, COMPLEX for clacn2, DOUBLE COMPLEX for zlacn2. Arrays, DIMENSION (n) each. v is a workspace array. x is used as input after an intermediate return.
isgn INTEGER. Workspace array, DIMENSION (n), used with real flavors only.
est REAL for slacn2/clacn2, DOUBLE PRECISION for dlacn2/zlacn2. On entry with kase set to 1 or 2, and isave(1) = 1, est must be unchanged from the previous call to the routine.
kase INTEGER.
On the initial call to the routine, kase must be set to 0.
isave INTEGER. Array, DIMENSION (3). Contains variables from the previous call to the routine.
Output Parameters
est An estimate (a lower bound) for norm(A).
kase On an intermediate return, kase is set to 1 or 2, indicating whether x is overwritten by A*x or A^T*x for real flavors and A*x or A^H*x for complex flavors. On the final return, kase is set to 0.
v On the final return, v = A*w, where est = norm(v)/norm(w) (w is not returned).
x On an intermediate return, x is overwritten by A*x, if kase = 1, A^T*x, if kase = 2 (for real flavors), A^H*x, if kase = 2 (for complex flavors), and the routine must be re-called with all the other parameters unchanged.
isave This parameter is used to save variables between calls to the routine.

?lacon
Estimates the 1-norm of a square matrix, using reverse communication for evaluating matrix-vector products.
Syntax
call slacon( n, v, x, isgn, est, kase )
call dlacon( n, v, x, isgn, est, kase )
call clacon( n, v, x, est, kase )
call zlacon( n, v, x, est, kase )
Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h
Description
The routine estimates the 1-norm of a square, real/complex matrix A. Reverse communication is used for evaluating matrix-vector products.
WARNING The ?lacon routine is not thread-safe. It is deprecated and retained for backward compatibility only. Use the thread-safe ?lacn2 routine instead.
Input Parameters
n INTEGER. The order of the matrix A (n ≥ 1).
v, x REAL for slacon, DOUBLE PRECISION for dlacon, COMPLEX for clacon, DOUBLE COMPLEX for zlacon. Arrays, DIMENSION (n) each. v is a workspace array. x is used as input after an intermediate return.
isgn INTEGER. Workspace array, DIMENSION (n), used with real flavors only.
est REAL for slacon/clacon, DOUBLE PRECISION for dlacon/zlacon. An estimate that with kase = 1 or 2 should be unchanged from the previous call to ?lacon.
kase INTEGER.
On the initial call to ?lacon, kase should be 0.
Output Parameters
est REAL for slacon/clacon, DOUBLE PRECISION for dlacon/zlacon. An estimate (a lower bound) for norm(A).
kase On an intermediate return, kase will be 1 or 2, indicating whether x should be overwritten by A*x or A^T*x for real flavors and A*x or A^H*x for complex flavors. On the final return from ?lacon, kase will again be 0.
v On the final return, v = A*w, where est = norm(v)/norm(w) (w is not returned).
x On an intermediate return, x should be overwritten by A*x, if kase = 1, A^T*x, if kase = 2 (for real flavors), A^H*x, if kase = 2 (for complex flavors), and ?lacon must be re-called with all the other parameters unchanged.

?lacpy
Copies all or part of one two-dimensional array to another.
Syntax
call slacpy( uplo, m, n, a, lda, b, ldb )
call dlacpy( uplo, m, n, a, lda, b, ldb )
call clacpy( uplo, m, n, a, lda, b, ldb )
call zlacpy( uplo, m, n, a, lda, b, ldb )
Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h
Description
The routine copies all or part of a two-dimensional matrix A to another matrix B.
Input Parameters
uplo CHARACTER*1. Specifies the part of the matrix A to be copied to B. If uplo = 'U', the upper triangular part of A; if uplo = 'L', the lower triangular part of A. Otherwise, all of the matrix A is copied.
m INTEGER. The number of rows in the matrix A (m ≥ 0).
n INTEGER. The number of columns in A (n ≥ 0).
a REAL for slacpy, DOUBLE PRECISION for dlacpy, COMPLEX for clacpy, DOUBLE COMPLEX for zlacpy. Array a(lda,*), contains the m-by-n matrix A. The second dimension of a must be at least max(1,n). If uplo = 'U', only the upper triangle or trapezoid is accessed; if uplo = 'L', only the lower triangle or trapezoid is accessed.
lda INTEGER. The leading dimension of a; lda ≥ max(1, m).
ldb INTEGER. The leading dimension of the output array b; ldb ≥ max(1, m).
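The effect of the uplo argument on which elements are copied can be sketched in Python. This is a minimal illustration under stated assumptions, not the library routine: the name lacpy_sketch is hypothetical, matrices are 0-based nested lists rather than column-major Fortran arrays, and the leading dimensions lda and ldb are therefore not needed.

```python
def lacpy_sketch(uplo, m, n, a, b):
    """Copy the part of the m-by-n matrix a selected by uplo into b, in place."""
    for j in range(n):
        if uplo.upper() == "U":
            rows = range(0, min(j + 1, m))  # on and above the diagonal
        elif uplo.upper() == "L":
            rows = range(j, m)              # on and below the diagonal
        else:
            rows = range(m)                 # the whole column
        for i in rows:
            b[i][j] = a[i][j]
    return b


if __name__ == "__main__":
    a = [[1, 2, 3], [4, 5, 6]]
    print(lacpy_sketch("U", 2, 3, a, [[0] * 3 for _ in range(2)]))
    print(lacpy_sketch("L", 2, 3, a, [[0] * 3 for _ in range(2)]))
```

Elements of b outside the selected part keep whatever values they had on entry, which is why the example starts from a zero-filled b.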
Output Parameters
b REAL for slacpy, DOUBLE PRECISION for dlacpy, COMPLEX for clacpy, DOUBLE COMPLEX for zlacpy. Array b(ldb,*), contains the m-by-n matrix B. The second dimension of b must be at least max(1,n). On exit, B = A in the locations specified by uplo.

?ladiv
Performs complex division in real arithmetic, avoiding unnecessary overflow.
Syntax
call sladiv( a, b, c, d, p, q )
call dladiv( a, b, c, d, p, q )
res = cladiv( x, y )
res = zladiv( x, y )
Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h
Description
The routines sladiv/dladiv perform complex division in real arithmetic as
p + i*q = (a + i*b) / (c + i*d).
Complex functions cladiv/zladiv compute the result as res = x/y, where x and y are complex. The computation of x / y will not overflow on an intermediary step unless the result overflows.
Input Parameters
a, b, c, d REAL for sladiv, DOUBLE PRECISION for dladiv. The scalars a, b, c, and d in the above expression (for real flavors only).
x, y COMPLEX for cladiv, DOUBLE COMPLEX for zladiv. The complex scalars x and y (for complex flavors only).
Output Parameters
p, q REAL for sladiv, DOUBLE PRECISION for dladiv. The scalars p and q in the above expression (for real flavors only).
res COMPLEX for cladiv, DOUBLE COMPLEX for zladiv. Contains the result of division x / y.

?lae2
Computes the eigenvalues of a 2-by-2 symmetric matrix.
Syntax
call slae2( a, b, c, rt1, rt2 )
call dlae2( a, b, c, rt1, rt2 )
Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h
Description
The routines slae2/dlae2 compute the eigenvalues of a 2-by-2 symmetric matrix
[ a  b ]
[ b  c ]
On return, rt1 is the eigenvalue of larger absolute value, and rt2 is the eigenvalue of smaller absolute value.
Input Parameters
a, b, c REAL for slae2, DOUBLE PRECISION for dlae2. The elements a, b, and c of the 2-by-2 matrix above.
Output Parameters
rt1, rt2 REAL for slae2, DOUBLE PRECISION for dlae2. The computed eigenvalues of larger and smaller absolute value, respectively.
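The computation described for ?lae2 can be sketched in Python. This is one cancellation-aware formulation written for illustration, under the assumption that rt2 is obtained from the determinant a*c - b*b divided by rt1; it is not claimed to be the library's implementation, and the name lae2_sketch is hypothetical.

```python
import math


def lae2_sketch(a: float, b: float, c: float):
    """Eigenvalues of [[a, b], [b, c]]: (larger |value|, smaller |value|)."""
    sm = a + c
    adf = abs(a - c)
    ab = abs(b + b)
    # acmx is whichever of a and c has the larger absolute value.
    acmx, acmn = (a, c) if abs(a) > abs(c) else (c, a)
    # rt = sqrt(adf**2 + ab**2), scaled to avoid overflow in the squares.
    if adf > ab:
        rt = adf * math.sqrt(1.0 + (ab / adf) ** 2)
    elif adf < ab:
        rt = ab * math.sqrt(1.0 + (adf / ab) ** 2)
    else:
        rt = ab * math.sqrt(2.0)
    if sm < 0.0:
        rt1 = 0.5 * (sm - rt)
        rt2 = (acmx / rt1) * acmn - (b / rt1) * b  # determinant / rt1
    elif sm > 0.0:
        rt1 = 0.5 * (sm + rt)
        rt2 = (acmx / rt1) * acmn - (b / rt1) * b  # determinant / rt1
    else:
        rt1 = 0.5 * rt
        rt2 = -0.5 * rt
    return rt1, rt2


if __name__ == "__main__":
    print(lae2_sketch(2.0, 0.0, 1.0))
    print(lae2_sketch(1.0, 2.0, 1.0))
```

The subtraction in the rt2 expression is where the cancellation in a*c - b*b can occur.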
Application Notes
rt1 is accurate to a few ulps barring over/underflow. rt2 may be inaccurate if there is massive cancellation in the determinant a*c-b*b; higher precision or correctly rounded or correctly truncated arithmetic would be needed to compute rt2 accurately in all cases. Overflow is possible only if rt1 is within a factor of 5 of overflow. Underflow is harmless if the input data is 0 or exceeds underflow_threshold / macheps.

?laebz
Computes the number of eigenvalues of a real symmetric tridiagonal matrix which are less than or equal to a given value, and performs other tasks required by the routine ?stebz.
Syntax
call slaebz( ijob, nitmax, n, mmax, minp, nbmin, abstol, reltol, pivmin, d, e, e2, nval, ab, c, mout, nab, work, iwork, info )
call dlaebz( ijob, nitmax, n, mmax, minp, nbmin, abstol, reltol, pivmin, d, e, e2, nval, ab, c, mout, nab, work, iwork, info )
Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h
Description
The routine ?laebz contains the iteration loops which compute and use the function n(w), which is the count of eigenvalues of a symmetric tridiagonal matrix T less than or equal to its argument w. It performs a choice of two types of loops:
ijob = 1, followed by ijob = 2: It takes as input a list of intervals and returns a list of sufficiently small intervals whose union contains the same eigenvalues as the union of the original intervals. The input intervals are (ab(j,1),ab(j,2)], j=1,...,minp. The output interval (ab(j,1),ab(j,2)] will contain eigenvalues nab(j,1)+1,...,nab(j,2), where 1 ≤ j ≤ mout.
ijob = 3: It performs a binary search in each input interval (ab(j,1),ab(j,2)] for a point w(j) such that n(w(j))=nval(j), and uses c(j) as the starting point of the search. If such a w(j) is found, then on output ab(j,1)=ab(j,2)=w(j).
If no such w(j) is found, then on output (ab(j,1),ab(j,2)] will be a small interval containing the point where n(w) jumps through nval(j), unless that point lies outside the initial interval.
Note that the intervals are in all cases half-open intervals, that is, of the form (a,b], which includes b but not a.
To avoid underflow, the matrix should be scaled so that its largest element is no greater than overflow^(1/2) * overflow^(1/4) in absolute value. To assure the most accurate computation of small eigenvalues, the matrix should be scaled to be not much smaller than that, either.
NOTE In general, the arguments are not checked for unreasonable values.
Input Parameters
ijob INTEGER. Specifies what is to be done:
= 1: Compute nab for the initial intervals.
= 2: Perform bisection iteration to find eigenvalues of T.
= 3: Perform bisection iteration to invert n(w), i.e., to find a point which has a specified number of eigenvalues of T to its left.
Other values will cause ?laebz to return with info=-1.
nitmax INTEGER. The maximum number of "levels" of bisection to be performed, i.e., an interval of width W will not be made smaller than 2^(-nitmax)*W. If not all intervals have converged after nitmax iterations, then info is set to the number of non-converged intervals.
n INTEGER. The dimension n of the tridiagonal matrix T. It must be at least 1.
mmax INTEGER. The maximum number of intervals. If more than mmax intervals are generated, then ?laebz will quit with info=mmax+1.
minp INTEGER. The initial number of intervals. It may not be greater than mmax.
nbmin INTEGER. The smallest number of intervals that should be processed using a vector loop. If zero, then only the scalar loop will be used.
abstol REAL for slaebz, DOUBLE PRECISION for dlaebz. The minimum (absolute) width of an interval.
When an interval is narrower than abstol, or than reltol times the larger (in magnitude) endpoint, then it is considered to be sufficiently small, i.e., converged. This must be at least zero.
reltol REAL for slaebz, DOUBLE PRECISION for dlaebz. The minimum relative width of an interval. When an interval is narrower than abstol, or than reltol times the larger (in magnitude) endpoint, then it is considered to be sufficiently small, i.e., converged. Note: this should always be at least radix*machine epsilon.
pivmin REAL for slaebz, DOUBLE PRECISION for dlaebz. The minimum absolute value of a "pivot" in the Sturm sequence loop. This value must be at least (max |e(j)**2|*safe_min) and at least safe_min, where safe_min is at least the smallest number that can divide one without overflow.
d, e, e2 REAL for slaebz, DOUBLE PRECISION for dlaebz. Arrays, dimension (n) each. The array d contains the diagonal elements of the tridiagonal matrix T. The array e contains the off-diagonal elements of the tridiagonal matrix T in positions 1 through n-1. e(n) is arbitrary. The array e2 contains the squares of the off-diagonal elements of the tridiagonal matrix T. e2(n) is ignored.
nval INTEGER. Array, dimension (minp). If ijob=1 or 2, not referenced. If ijob=3, the desired values of n(w).
ab REAL for slaebz, DOUBLE PRECISION for dlaebz. Array, dimension (mmax,2). The endpoints of the intervals. ab(j,1) is a(j), the left endpoint of the j-th interval, and ab(j,2) is b(j), the right endpoint of the j-th interval.
c REAL for slaebz, DOUBLE PRECISION for dlaebz. Array, dimension (mmax). If ijob=1, ignored. If ijob=2, workspace. If ijob=3, then on input c(j) should be initialized to the first search point in the binary search.
nab INTEGER. Array, dimension (mmax,2). If ijob=2, then on input, nab(i,j) should be set.
It must satisfy the condition: n(ab(i,1)) ≤ nab(i,1) ≤ nab(i,2) ≤ n(ab(i,2)), which means that in interval i only eigenvalues nab(i,1)+1,...,nab(i,2) are considered. Usually, nab(i,j)=n(ab(i,j)), from a previous call to ?laebz with ijob=1. If ijob=3, normally, nab should be set to some distinctive value(s) before ?laebz is called.
work REAL for slaebz, DOUBLE PRECISION for dlaebz. Workspace array, dimension (mmax).
iwork INTEGER. Workspace array, dimension (mmax).
Output Parameters
nval The elements of nval will be reordered to correspond with the intervals in ab. Thus, nval(j) on output will not, in general, be the same as nval(j) on input, but it will correspond with the interval (ab(j,1),ab(j,2)] on output.
ab The input intervals will, in general, be modified, split, and reordered by the calculation.
mout INTEGER. If ijob=1, the number of eigenvalues in the intervals. If ijob=2 or 3, the number of intervals output. If ijob=3, mout will equal minp.
nab If ijob=1, then on output nab(i,j) will be set to n(ab(i,j)).
If ijob=2, then on output, nab(i,j) will contain max(na(k), min(nb(k), n(ab(i,j)))), where k is the index of the input interval that the output interval (ab(j,1),ab(j,2)] came from, and na(k) and nb(k) are the input values of nab(k,1) and nab(k,2).
If ijob=3, then on output, nab(i,j) contains n(ab(i,j)), unless n(w) > nval(i) for all search points w, in which case nab(i,1) will not be modified, i.e., the output value will be the same as the input value (modulo reorderings, see nval and ab), or unless n(w) < nval(i) for all search points w, in which case nab(i,2) will not be modified.
info INTEGER.
If info = 0, all intervals converged.
If info = 1,...,mmax, the last info intervals did not converge.
If info = mmax+1, more than mmax intervals were generated.
Application Notes
This routine is intended to be called only by other LAPACK routines, thus the interface is less user-friendly. It is intended for two purposes:
(a) finding eigenvalues.
In this case, ?laebz should have one or more initial intervals set up in ab, and ?laebz should be called with ijob=1. This sets up nab, and also counts the eigenvalues. Intervals with no eigenvalues would usually be thrown out at this point. Also, if not all the eigenvalues in an interval i are desired, nab(i,1) can be increased or nab(i,2) decreased. For example, set nab(i,1)=nab(i,2)-1 to get the largest eigenvalue. ?laebz is then called with ijob=2 and mmax no smaller than the value of mout returned by the call with ijob=1. After this (ijob=2) call, eigenvalues nab(i,1)+1 through nab(i,2) are approximately ab(i,1) (or ab(i,2)) to the tolerance specified by abstol and reltol.
(b) finding an interval (a',b'] containing eigenvalues w(f),...,w(l). In this case, start with a Gershgorin interval (a,b). Set up ab to contain 2 search intervals, both initially (a,b). One nval element should contain f-1 and the other should contain l, while c should contain a and b, respectively. nab(i,1) should be -1 and nab(i,2) should be n+1, to flag an error if the desired interval does not lie in (a,b). ?laebz is then called with ijob=3. On exit, if w(f-1) < w(f), then one of the intervals, say j, will have ab(j,1)=ab(j,2) and nab(j,1)=nab(j,2)=f-1, while if, to the specified tolerance, w(f-k)=...=w(f+r), k > 0 and r ≥ 0, then the interval will have n(ab(j,1))=nab(j,1)=f-k and n(ab(j,2))=nab(j,2)=f+r. The cases w(l) < w(l+1) and w(l-r)=...=w(l+k) are handled similarly.

?laed0
Used by ?stedc. Computes all eigenvalues and corresponding eigenvectors of an unreduced symmetric tridiagonal matrix using the divide and conquer method.
Syntax
call slaed0( icompq, qsiz, n, d, e, q, ldq, qstore, ldqs, work, iwork, info )
call dlaed0( icompq, qsiz, n, d, e, q, ldq, qstore, ldqs, work, iwork, info )
call claed0( qsiz, n, d, e, q, ldq, qstore, ldqs, rwork, iwork, info )
call zlaed0( qsiz, n, d, e, q, ldq, qstore, ldqs, rwork, iwork, info )
Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h
Description
Real flavors of this routine compute all eigenvalues and (optionally) corresponding eigenvectors of a symmetric tridiagonal matrix using the divide and conquer method.
Complex flavors claed0/zlaed0 compute all eigenvalues of a symmetric tridiagonal matrix which is one diagonal block of those from reducing a dense or band Hermitian matrix and corresponding eigenvectors of the dense or band matrix.
Input Parameters
icompq INTEGER. Used with real flavors only.
If icompq = 0, compute eigenvalues only.
If icompq = 1, compute eigenvectors of original dense symmetric matrix also. On entry, the array q must contain the orthogonal matrix used to reduce the original matrix to tridiagonal form.
If icompq = 2, compute eigenvalues and eigenvectors of the tridiagonal matrix.
qsiz INTEGER. The dimension of the orthogonal/unitary matrix used to reduce the full matrix to tridiagonal form; qsiz ≥ n (for real flavors, qsiz ≥ n if icompq = 1).
n INTEGER. The dimension of the symmetric tridiagonal matrix (n ≥ 0).
d, e, rwork REAL for single-precision flavors, DOUBLE PRECISION for double-precision flavors. Arrays:
d(*) contains the main diagonal of the tridiagonal matrix. The dimension of d must be at least max(1, n).
e(*) contains the off-diagonal elements of the tridiagonal matrix. The dimension of e must be at least max(1, n-1).
rwork(*) is a workspace array used in complex flavors only. The dimension of rwork must be at least (1 + 3n + 2n*lg(n) + 3n^2), where lg(n) = smallest integer k such that 2^k ≥ n.
q, qstore REAL for slaed0, DOUBLE PRECISION for dlaed0, COMPLEX for claed0, DOUBLE COMPLEX for zlaed0. Arrays: q(ldq, *), qstore(ldqs, *). The second dimension of these arrays must be at least max(1, n).
For real flavors:
If icompq = 0, array q is not referenced.
If icompq = 1, on entry, q is a subset of the columns of the orthogonal matrix used to reduce the full matrix to tridiagonal form corresponding to the subset of the full matrix which is being decomposed at this time.
If icompq = 2, on entry, q will be the identity matrix.
The array qstore is a workspace array referenced only when icompq = 1. Used to store parts of the eigenvector matrix when the updating matrix multiplies take place.
For complex flavors:
On entry, q must contain a qsiz-by-n matrix whose columns are unitarily orthonormal. It is a part of the unitary matrix that reduces the full dense Hermitian matrix to a (reducible) symmetric tridiagonal matrix.
The array qstore is a workspace array used to store parts of the eigenvector matrix when the updating matrix multiplies take place.
ldq INTEGER. The leading dimension of the array q; ldq ≥ max(1, n).
ldqs INTEGER. The leading dimension of the array qstore; ldqs ≥ max(1, n).
work REAL for slaed0, DOUBLE PRECISION for dlaed0. Workspace array, used in real flavors only. If icompq = 0 or 1, the dimension of work must be at least (1 + 3n + 2n*lg(n) + 3n^2), where lg(n) = smallest integer k such that 2^k ≥ n. If icompq = 2, the dimension of work must be at least (4n + n^2).
iwork INTEGER. Workspace array. For real flavors, if icompq = 0 or 1, and for complex flavors, the dimension of iwork must be at least (6 + 6n + 5n*lg(n)). For real flavors, if icompq = 2, the dimension of iwork must be at least (3 + 5n).
Output Parameters
d On exit, contains eigenvalues in ascending order.
e On exit, the array is destroyed.
q If icompq = 2, on exit, q contains the eigenvectors of the tridiagonal matrix.
info INTEGER.
If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value. If info = i > 0, the algorithm failed to compute an eigenvalue while working on the submatrix lying in rows and columns i/(n+1) through mod(i, n+1).

?laed1
Used by sstedc/dstedc. Computes the updated eigensystem of a diagonal matrix after modification by a rank-one symmetric matrix. Used when the original matrix is tridiagonal.
Syntax
call slaed1( n, d, q, ldq, indxq, rho, cutpnt, work, iwork, info )
call dlaed1( n, d, q, ldq, indxq, rho, cutpnt, work, iwork, info )
Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h
Description
The routine ?laed1 computes the updated eigensystem of a diagonal matrix after modification by a rank-one symmetric matrix. This routine is used only for the eigenproblem which requires all eigenvalues and eigenvectors of a tridiagonal matrix. ?laed7 handles the case in which eigenvalues only or eigenvalues and eigenvectors of a full symmetric matrix (which was reduced to tridiagonal form) are desired.
T = Q(in)*(D(in) + rho*Z*Z^T)*Q^T(in) = Q(out)*D(out)*Q^T(out)
where Z = Q^T*u, u is a vector of length n with ones in the cutpnt and (cutpnt+1)-th elements and zeros elsewhere. The eigenvectors of the original matrix are stored in Q, and the eigenvalues are in D. The algorithm consists of three stages:
The first stage consists of deflating the size of the problem when there are multiple eigenvalues or if there is a zero in the z vector. For each such occurrence the dimension of the secular equation problem is reduced by one. This stage is performed by the routine ?laed2.
The second stage consists of calculating the updated eigenvalues. This is done by finding the roots of the secular equation via the routine ?laed4 (as called by ?laed3). This routine also calculates the eigenvectors of the current problem.
The final stage consists of computing the updated eigenvectors directly using the updated eigenvalues. The eigenvectors for the current problem are multiplied with the eigenvectors from the overall problem.
Input Parameters
n INTEGER. The dimension of the symmetric tridiagonal matrix (n ≥ 0).
d, q, work REAL for slaed1, DOUBLE PRECISION for dlaed1. Arrays:
d(*) contains the eigenvalues of the rank-1-perturbed matrix. The dimension of d must be at least max(1, n).
q(ldq, *) contains the eigenvectors of the rank-1-perturbed matrix. The second dimension of q must be at least max(1, n).
work(*) is a workspace array, dimension at least (4n + n^2).
ldq INTEGER. The leading dimension of the array q; ldq ≥ max(1, n).
indxq INTEGER. Array, dimension (n). On entry, the permutation which separately sorts the two subproblems in d into ascending order.
rho REAL for slaed1, DOUBLE PRECISION for dlaed1. The subdiagonal entry used to create the rank-1 modification. This parameter can be modified by ?laed2, where it is input/output.
cutpnt INTEGER. The location of the last eigenvalue in the leading sub-matrix. min(1,n) ≤ cutpnt ≤ n/2.
iwork INTEGER. Workspace array, dimension (4n).
Output Parameters
d On exit, contains the eigenvalues of the repaired matrix.
q On exit, q contains the eigenvectors of the repaired tridiagonal matrix.
indxq On exit, contains the permutation which will reintegrate the subproblems back into sorted order, that is, d(indxq(i)), i = 1,...,n, will be in ascending order.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value. If info = 1, an eigenvalue did not converge.

?laed2
Used by sstedc/dstedc. Merges eigenvalues and deflates secular equation. Used when the original matrix is tridiagonal.
Syntax
call slaed2( k, n, n1, d, q, ldq, indxq, rho, z, dlamda, w, q2, indx, indxc, indxp, coltyp, info )
call dlaed2( k, n, n1, d, q, ldq, indxq, rho, z, dlamda, w, q2, indx, indxc, indxp, coltyp, info )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine ?laed2 merges the two sets of eigenvalues together into a single sorted set. Then it tries to deflate the size of the problem. There are two ways in which deflation can occur: when two or more eigenvalues are close together or if there is a tiny entry in the z vector. For each such occurrence the order of the related secular equation problem is reduced by one.

Input Parameters
k INTEGER. The number of non-deflated eigenvalues, and the order of the related secular equation (0 ≤ k ≤ n).
n INTEGER. The dimension of the symmetric tridiagonal matrix (n ≥ 0).
n1 INTEGER. The location of the last eigenvalue in the leading sub-matrix; min(1,n) ≤ n1 ≤ n/2.
d, q, z REAL for slaed2 DOUBLE PRECISION for dlaed2. Arrays: d(*) contains the eigenvalues of the two submatrices to be combined. The dimension of d must be at least max(1, n). q(ldq, *) contains the eigenvectors of the two submatrices in the two square blocks with corners at (1,1), (n1,n1) and (n1+1,n1+1), (n,n). The second dimension of q must be at least max(1, n). z(*) contains the updating vector (the last row of the first sub-eigenvector matrix and the first row of the second sub-eigenvector matrix).
ldq INTEGER. The leading dimension of the array q; ldq ≥ max(1, n).
indxq INTEGER. Array, dimension (n). On entry, the permutation which separately sorts the two subproblems in d into ascending order. Note that elements in the second half of this permutation must first have n1 added to their values.
rho REAL for slaed2 DOUBLE PRECISION for dlaed2. On entry, the off-diagonal element associated with the rank-1 cut which originally split the two submatrices which are now being recombined.
indx, indxp INTEGER. Workspace arrays, dimension (n) each. Array indx contains the permutation used to sort the contents of dlamda into ascending order. Array indxp contains the permutation used to place deflated values of d at the end of the array. indxp(1:k) points to the nondeflated d-values and indxp(k+1:n) points to the deflated eigenvalues.
coltyp INTEGER. Workspace array, dimension (n). During execution, a label which indicates which of the following types a column in the q2 matrix is: 1 : non-zero in the upper half only; 2 : dense; 3 : non-zero in the lower half only; 4 : deflated.

Output Parameters
d On exit, d contains the trailing (n-k) updated eigenvalues (those which were deflated) sorted into increasing order.
q On exit, q contains the trailing (n-k) updated eigenvectors (those which were deflated) in its last n-k columns.
z On exit, the contents of z are destroyed by the updating process.
indxq Destroyed on exit.
rho On exit, rho has been modified to the value required by ?laed3.
dlamda, w, q2 REAL for slaed2 DOUBLE PRECISION for dlaed2. Arrays: dlamda(n), w(n), q2(n1^2 + (n-n1)^2). The array dlamda contains a copy of the first k eigenvalues which is used by ?laed3 to form the secular equation. The array w contains the first k values of the final deflation-altered z-vector which is passed to ?laed3. The array q2 contains a copy of the first k eigenvectors which is used by ?laed3 in a matrix multiply (sgemm/dgemm) to solve for the new eigenvectors.
indxc INTEGER. Array, dimension (n). The permutation used to arrange the columns of the deflated q matrix into three groups: the first group contains non-zero elements only at and above n1, the second contains non-zero elements only below n1, and the third is dense.
coltyp On exit, coltyp(i) is the number of columns of type i, for i = 1 to 4 only (see the definition of types in the description of coltyp in Input Parameters).
info INTEGER.
If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

?laed3
Used by sstedc/dstedc. Finds the roots of the secular equation and updates the eigenvectors. Used when the original matrix is tridiagonal.

Syntax
call slaed3( k, n, n1, d, q, ldq, rho, dlamda, q2, indx, ctot, w, s, info )
call dlaed3( k, n, n1, d, q, ldq, rho, dlamda, q2, indx, ctot, w, s, info )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine ?laed3 finds the roots of the secular equation, as defined by the values in d, w, and rho, between 1 and k. It makes the appropriate calls to ?laed4 and then updates the eigenvectors by multiplying the matrix of eigenvectors of the pair of eigensystems being combined by the matrix of eigenvectors of the k-by-k system which is solved here.
This code makes very mild assumptions about floating point arithmetic. It will work on machines with a guard digit in add/subtract, or on those binary machines without guard digits which subtract like the Cray X-MP, Cray Y-MP, Cray C-90, or Cray-2. It could conceivably fail on hexadecimal or decimal machines without guard digits, but none are known.

Input Parameters
k INTEGER. The number of terms in the rational function to be solved by ?laed4 (k ≥ 0).
n INTEGER. The number of rows and columns in the q matrix. n ≥ k (deflation may result in n > k).
n1 INTEGER. The location of the last eigenvalue in the leading sub-matrix; min(1,n) ≤ n1 ≤ n/2.
q REAL for slaed3 DOUBLE PRECISION for dlaed3. Array q(ldq, *). The second dimension of q must be at least max(1, n). Initially, the first k columns of this array are used as workspace.
ldq INTEGER. The leading dimension of the array q; ldq ≥ max(1, n).
rho REAL for slaed3 DOUBLE PRECISION for dlaed3. The value of the parameter in the rank one update equation. rho ≥ 0 required.
dlamda, q2, w REAL for slaed3 DOUBLE PRECISION for dlaed3. Arrays: dlamda(k), q2(ldq2, *), w(k).
The first k elements of the array dlamda contain the old roots of the deflated updating problem. These are the poles of the secular equation. The first k columns of the array q2 contain the non-deflated eigenvectors for the split problem. The second dimension of q2 must be at least max(1, n). The first k elements of the array w contain the components of the deflation-adjusted updating vector.
indx INTEGER. Array, dimension (n). The permutation used to arrange the columns of the deflated q matrix into three groups (see ?laed2). The rows of the eigenvectors found by ?laed4 must be likewise permuted before the matrix multiply can take place.
ctot INTEGER. Array, dimension (4). A count of the total number of the various types of columns in q, as described in indx. The fourth column type is any column which has been deflated.
s REAL for slaed3 DOUBLE PRECISION for dlaed3. Workspace array, dimension (n1+1)*k. Will contain the eigenvectors of the repaired matrix which will be multiplied by the previously accumulated eigenvectors to update the system.

Output Parameters
d REAL for slaed3 DOUBLE PRECISION for dlaed3. Array, dimension at least max(1, n). d(i) contains the updated eigenvalues for 1 ≤ i ≤ k.
q On exit, the columns 1 to k of q contain the updated eigenvectors.
dlamda May be changed on output by having lowest order bit set to zero on Cray X-MP, Cray Y-MP, Cray-2, or Cray C-90, as described above.
w Destroyed on exit.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value. If info = 1, an eigenvalue did not converge.

?laed4
Used by sstedc/dstedc. Finds a single root of the secular equation.
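To make the secular equation concrete, the following Python sketch locates the i-th eigenvalue of diag(d) + rho*z*z^T as a root of 1 + rho*sum(z(j)^2/(d(j) - lambda)) = 0. It is a sketch under stated assumptions, not the method used by ?laed4: plain bisection is swapped in for the rational interpolation the routine uses, the names secular and secular_root are hypothetical, indices are 0-based, and the inputs are assumed to satisfy d strictly increasing, rho > 0, and every z(j) nonzero.

```python
import math


def secular(lam, d, z, rho):
    """Evaluate the secular function 1 + rho * sum(z_j^2 / (d_j - lam))."""
    return 1.0 + rho * sum(zj * zj / (dj - lam) for dj, zj in zip(d, z))


def secular_root(i, d, z, rho, iters=200):
    """Return the i-th (0-based) root of the secular function by bisection.

    The root lies between d[i] and d[i+1]; the last root lies between
    d[-1] and d[-1] + rho * ||z||^2.
    """
    n = len(d)
    lo = d[i]
    if i + 1 < n:
        hi = d[i + 1]
    else:
        hi = d[-1] + rho * sum(zj * zj for zj in z)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mid == lo or mid == hi:
            break  # interval can no longer be split in floating point
        # The function increases from -inf to +inf between two poles.
        if secular(mid, d, z, rho) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


s = math.sqrt(0.5)
roots = [secular_root(i, [1.0, 2.0], [s, s], 1.0) for i in range(2)]
```

For this input the modified matrix is [[1.5, 0.5], [0.5, 2.5]], whose eigenvalues are 2 - sqrt(0.5) and 2 + sqrt(0.5), so each root falls in its expected interval.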
Syntax
call slaed4( n, i, d, z, delta, rho, dlam, info )
call dlaed4( n, i, d, z, delta, rho, dlam, info )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
This routine computes the i-th updated eigenvalue of a symmetric rank-one modification to a diagonal matrix whose elements are given in the array d. It is assumed that D(i) < D(j) for i < j and that rho > 0. This is arranged by the calling routine, and is no loss in generality. The rank-one modified system is thus diag(D) + rho*Z*transpose(Z), where we assume the Euclidean norm of Z is 1. The method consists of approximating the rational functions in the secular equation by simpler interpolating rational functions.

Input Parameters
n INTEGER. The length of all arrays.
i INTEGER. The index of the eigenvalue to be computed; 1 ≤ i ≤ n.
d, z REAL for slaed4 DOUBLE PRECISION for dlaed4 Arrays, dimension (n) each. The array d contains the original eigenvalues. It is assumed that they are in order, d(i) < d(j) for i < j. The array z contains the components of the updating vector Z.
rho REAL for slaed4 DOUBLE PRECISION for dlaed4 The scalar in the symmetric updating formula.

Output Parameters
delta REAL for slaed4 DOUBLE PRECISION for dlaed4 Array, dimension (n). If n ≠ 1, delta contains (d(j) - lambda_i) in its j-th component. If n = 1, then delta(1) = 1. The vector delta contains the information necessary to construct the eigenvectors.
dlam REAL for slaed4 DOUBLE PRECISION for dlaed4 The computed lambda_i, the i-th updated eigenvalue.
info INTEGER. If info = 0, the execution is successful. If info = 1, the updating process failed.

?laed5
Used by sstedc/dstedc. Solves the 2-by-2 secular equation.
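For the 2-by-2 case the updated eigenvalues have a closed form, which the following Python sketch evaluates directly from the entries of diag(d) + rho*z*z^T. It is only an illustration: the function name eig_2x2_rank_one is hypothetical, and the textbook quadratic formula is used here rather than the formulas inside ?laed5, which this sketch does not attempt to reproduce.

```python
import math


def eig_2x2_rank_one(d, z, rho):
    """Return both eigenvalues (ascending) of diag(d) + rho * z * z^T, n = 2."""
    # Entries of the symmetric 2-by-2 matrix [[a, b], [b, c]].
    a = d[0] + rho * z[0] * z[0]
    c = d[1] + rho * z[1] * z[1]
    b = rho * z[0] * z[1]
    mean = 0.5 * (a + c)
    # hypot avoids overflow in sqrt(((a - c) / 2)^2 + b^2).
    radius = math.hypot(0.5 * (a - c), b)
    return mean - radius, mean + radius


s = math.sqrt(0.5)
lam1, lam2 = eig_2x2_rank_one([1.0, 2.0], [s, s], 1.0)
```

With d(1) < d(2), rho > 0 and a unit-norm z, the results interlace with the original values: d(1) < lam1 < d(2) < lam2 ≤ d(2) + rho.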
Syntax
call slaed5( i, d, z, delta, rho, dlam )
call dlaed5( i, d, z, delta, rho, dlam )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine computes the i-th eigenvalue of a symmetric rank-one modification of a 2-by-2 diagonal matrix diag(D) + rho*Z*transpose(Z). The diagonal elements in the array D are assumed to satisfy D(i) < D(j) for i < j. We also assume rho > 0 and that the Euclidean norm of the vector Z is one.

Input Parameters
i INTEGER. The index of the eigenvalue to be computed; 1 ≤ i ≤ 2.
d, z REAL for slaed5 DOUBLE PRECISION for dlaed5 Arrays, dimension (2) each. The array d contains the original eigenvalues. It is assumed that d(1) < d(2). The array z contains the components of the updating vector.
rho REAL for slaed5 DOUBLE PRECISION for dlaed5 The scalar in the symmetric updating formula.

Output Parameters
delta REAL for slaed5 DOUBLE PRECISION for dlaed5 Array, dimension (2). The vector delta contains the information necessary to construct the eigenvectors.
dlam REAL for slaed5 DOUBLE PRECISION for dlaed5 The computed lambda_i, the i-th updated eigenvalue.

?laed6
Used by sstedc/dstedc. Computes one Newton step in solution of the secular equation.

Syntax
call slaed6( kniter, orgati, rho, d, z, finit, tau, info )
call dlaed6( kniter, orgati, rho, d, z, finit, tau, info )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine computes the positive or negative root (closest to the origin) of

f(x) = rho + z(1)/(d(1)-x) + z(2)/(d(2)-x) + z(3)/(d(3)-x)

It is assumed that if orgati = .TRUE. the root is between d(2) and d(3); otherwise it is between d(1) and d(2). This routine is called by ?laed4 when necessary. In most cases, the root sought is the smallest in magnitude, though it might not be in some extremely rare situations.

Input Parameters
kniter INTEGER. Refer to ?laed4 for its significance.
orgati LOGICAL.
If orgati = .TRUE., the needed root is between d(2) and d(3); otherwise it is between d(1) and d(2). See ?laed4 for further details.
rho REAL for slaed6 DOUBLE PRECISION for dlaed6 Refer to the equation for f(x) above.
d, z REAL for slaed6 DOUBLE PRECISION for dlaed6 Arrays, dimension (3) each. The array d satisfies d(1) < d(2) < d(3). Each of the elements in the array z must be positive.
finit REAL for slaed6 DOUBLE PRECISION for dlaed6 The value of f(x) at 0. It is more accurate than the one evaluated inside this routine (if someone wants to do so).

Output Parameters
tau REAL for slaed6 DOUBLE PRECISION for dlaed6 The root of the equation for f(x).
info INTEGER. If info = 0, the execution is successful. If info = 1, failure to converge.

?laed7
Used by ?stedc. Computes the updated eigensystem of a diagonal matrix after modification by a rank-one symmetric matrix. Used when the original matrix is dense.

Syntax
call slaed7( icompq, n, qsiz, tlvls, curlvl, curpbm, d, q, ldq, indxq, rho, cutpnt, qstore, qptr, prmptr, perm, givptr, givcol, givnum, work, iwork, info )
call dlaed7( icompq, n, qsiz, tlvls, curlvl, curpbm, d, q, ldq, indxq, rho, cutpnt, qstore, qptr, prmptr, perm, givptr, givcol, givnum, work, iwork, info )
call claed7( n, cutpnt, qsiz, tlvls, curlvl, curpbm, d, q, ldq, rho, indxq, qstore, qptr, prmptr, perm, givptr, givcol, givnum, work, rwork, iwork, info )
call zlaed7( n, cutpnt, qsiz, tlvls, curlvl, curpbm, d, q, ldq, rho, indxq, qstore, qptr, prmptr, perm, givptr, givcol, givnum, work, rwork, iwork, info )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine ?laed7 computes the updated eigensystem of a diagonal matrix after modification by a rank-one symmetric matrix. This routine is used only for the eigenproblem which requires all eigenvalues and optionally eigenvectors of a dense symmetric/Hermitian matrix that has been reduced to tridiagonal form.
For real flavors, slaed1/dlaed1 handles the case in which all eigenvalues and eigenvectors of a symmetric tridiagonal matrix are desired.

T = Q(in)*(D(in) + rho*Z*Z^T)*Q^T(in) = Q(out)*D(out)*Q^T(out) for real flavors, or
T = Q(in)*(D(in) + rho*Z*Z^H)*Q^H(in) = Q(out)*D(out)*Q^H(out) for complex flavors

where Z = Q^T*u for real flavors and Z = Q^H*u for complex flavors, and u is a vector of length n with ones in the cutpnt-th and (cutpnt+1)-th elements and zeros elsewhere. The eigenvectors of the original matrix are stored in Q, and the eigenvalues are in D. The algorithm consists of three stages:
The first stage consists of deflating the size of the problem when there are multiple eigenvalues or if there is a zero in the z vector. For each such occurrence the dimension of the secular equation problem is reduced by one. This stage is performed by the routine slaed8/dlaed8 (for real flavors) or by the routine slaed2/dlaed2 (for complex flavors).
The second stage consists of calculating the updated eigenvalues. This is done by finding the roots of the secular equation via the routine ?laed4 (as called by ?laed9 or ?laed3). This routine also calculates the eigenvectors of the current problem.
The final stage consists of computing the updated eigenvectors directly using the updated eigenvalues. The eigenvectors for the current problem are multiplied with the eigenvectors from the overall problem.

Input Parameters
icompq INTEGER. Used with real flavors only. If icompq = 0, compute eigenvalues only. If icompq = 1, compute eigenvectors of original dense symmetric matrix also. On entry, the array q must contain the orthogonal matrix used to reduce the original matrix to tridiagonal form.
n INTEGER. The dimension of the symmetric tridiagonal matrix (n ≥ 0).
cutpnt INTEGER. The location of the last eigenvalue in the leading sub-matrix. min(1,n) ≤ cutpnt ≤ n.
qsiz INTEGER.
The dimension of the orthogonal/unitary matrix used to reduce the full matrix to tridiagonal form; qsiz ≥ n (for real flavors, qsiz ≥ n if icompq = 1).
tlvls INTEGER. The total number of merging levels in the overall divide and conquer tree.
curlvl INTEGER. The current level in the overall merge routine, 0 ≤ curlvl ≤ tlvls.
curpbm INTEGER. The current problem in the current level in the overall merge routine (counting from upper left to lower right).
d REAL for slaed7/claed7 DOUBLE PRECISION for dlaed7/zlaed7. Array, dimension at least max(1, n). Array d(*) contains the eigenvalues of the rank-1-perturbed matrix.
q, work REAL for slaed7 DOUBLE PRECISION for dlaed7 COMPLEX for claed7 DOUBLE COMPLEX for zlaed7. Arrays: q(ldq, *) contains the eigenvectors of the rank-1-perturbed matrix. The second dimension of q must be at least max(1, n). work(*) is a workspace array, dimension at least (3n + 2*qsiz*n) for real flavors and at least (qsiz*n) for complex flavors.
ldq INTEGER. The leading dimension of the array q; ldq ≥ max(1, n).
indxq INTEGER. Array, dimension (n). Contains the permutation that separately sorts the two sub-problems in d into ascending order.
rho REAL for slaed7/claed7 DOUBLE PRECISION for dlaed7/zlaed7. The subdiagonal element used to create the rank-1 modification.
qstore REAL for slaed7/claed7 DOUBLE PRECISION for dlaed7/zlaed7. Array, dimension (n^2 + 1). Serves also as output parameter. Stores eigenvectors of submatrices encountered during divide and conquer, packed together. qptr points to the beginning of the submatrices.
qptr INTEGER. Array, dimension (n+2). Serves also as output parameter. List of indices pointing to the beginning of submatrices stored in qstore. The submatrices are numbered starting at the bottom left of the divide and conquer tree, from left to right and bottom to top.
prmptr, perm, givptr INTEGER. Arrays, dimension (n lg n) each.
The array prmptr(*) contains a list of pointers which indicate where in perm a level's permutation is stored. prmptr(i+1) - prmptr(i) indicates the size of the permutation and also the size of the full, non-deflated problem. The array perm(*) contains the permutations (from deflation and sorting) to be applied to each eigenblock. This parameter can be modified by ?laed8, where it is output. The array givptr(*) contains a list of pointers which indicate where in givcol a level's Givens rotations are stored. givptr(i+1) - givptr(i) indicates the number of Givens rotations.
givcol INTEGER. Array, dimension (2, n lg n). Each pair of numbers indicates a pair of columns to take place in a Givens rotation.
givnum REAL for slaed7/claed7 DOUBLE PRECISION for dlaed7/zlaed7. Array, dimension (2, n lg n). Each number indicates the S value to be used in the corresponding Givens rotation.
iwork INTEGER. Workspace array, dimension (4n).
rwork REAL for claed7 DOUBLE PRECISION for zlaed7. Workspace array, dimension (3n + 2*qsiz*n). Used in complex flavors only.

Output Parameters
d On exit, contains the eigenvalues of the repaired matrix.
q On exit, q contains the eigenvectors of the repaired tridiagonal matrix.
indxq INTEGER. Array, dimension (n). Contains the permutation that reintegrates the subproblems back into a sorted order, that is, d(indxq(i = 1, n)) will be in ascending order.
rho This parameter can be modified by ?laed8, where it is input/output.
prmptr, perm, givptr INTEGER. Arrays, dimension (n lg n) each. The array prmptr contains an updated list of pointers. The array perm contains an updated permutation. The array givptr contains an updated list of pointers.
givcol This parameter can be modified by ?laed8, where it is output.
givnum This parameter can be modified by ?laed8, where it is output.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.
If info = 1, an eigenvalue did not converge.

?laed8
Used by ?stedc. Merges eigenvalues and deflates secular equation. Used when the original matrix is dense.

Syntax
call slaed8( icompq, k, n, qsiz, d, q, ldq, indxq, rho, cutpnt, z, dlamda, q2, ldq2, w, perm, givptr, givcol, givnum, indxp, indx, info )
call dlaed8( icompq, k, n, qsiz, d, q, ldq, indxq, rho, cutpnt, z, dlamda, q2, ldq2, w, perm, givptr, givcol, givnum, indxp, indx, info )
call claed8( k, n, qsiz, q, ldq, d, rho, cutpnt, z, dlamda, q2, ldq2, w, indxp, indx, indxq, perm, givptr, givcol, givnum, info )
call zlaed8( k, n, qsiz, q, ldq, d, rho, cutpnt, z, dlamda, q2, ldq2, w, indxp, indx, indxq, perm, givptr, givcol, givnum, info )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine merges the two sets of eigenvalues together into a single sorted set. Then it tries to deflate the size of the problem. There are two ways in which deflation can occur: when two or more eigenvalues are close together or if there is a tiny element in the z vector. For each such occurrence the order of the related secular equation problem is reduced by one.

Input Parameters
icompq INTEGER. Used with real flavors only. If icompq = 0, compute eigenvalues only. If icompq = 1, compute eigenvectors of original dense symmetric matrix also. On entry, the array q must contain the orthogonal matrix used to reduce the original matrix to tridiagonal form.
n INTEGER. The dimension of the symmetric tridiagonal matrix (n ≥ 0).
cutpnt INTEGER. The location of the last eigenvalue in the leading sub-matrix. min(1,n) ≤ cutpnt ≤ n.
qsiz INTEGER. The dimension of the orthogonal/unitary matrix used to reduce the full matrix to tridiagonal form; qsiz ≥ n (for real flavors, qsiz ≥ n if icompq = 1).
d, z REAL for slaed8/claed8 DOUBLE PRECISION for dlaed8/zlaed8. Arrays, dimension at least max(1, n) each.
The array d(*) contains the eigenvalues of the two submatrices to be combined. On entry, z(*) contains the updating vector (the last row of the first sub-eigenvector matrix and the first row of the second sub-eigenvector matrix). The contents of z are destroyed by the updating process.
q REAL for slaed8 DOUBLE PRECISION for dlaed8 COMPLEX for claed8 DOUBLE COMPLEX for zlaed8. Array q(ldq, *). The second dimension of q must be at least max(1, n). On entry, q contains the eigenvectors of the partially solved system which has been previously updated in matrix multiplies with other partially solved eigensystems. For real flavors, if icompq = 0, q is not referenced.
ldq INTEGER. The leading dimension of the array q; ldq ≥ max(1, n).
ldq2 INTEGER. The leading dimension of the output array q2; ldq2 ≥ max(1, n).
indxq INTEGER. Array, dimension (n). The permutation that separately sorts the two sub-problems in d into ascending order. Note that elements in the second half of this permutation must first have cutpnt added to their values in order to be accurate.
rho REAL for slaed8/claed8 DOUBLE PRECISION for dlaed8/zlaed8. On entry, the off-diagonal element associated with the rank-1 cut which originally split the two submatrices which are now being recombined.

Output Parameters
k INTEGER. The number of non-deflated eigenvalues, and the order of the related secular equation.
d On exit, contains the trailing (n-k) updated eigenvalues (those which were deflated) sorted into increasing order.
z On exit, the updating process destroys the contents of z.
q On exit, q contains the trailing (n-k) updated eigenvectors (those which were deflated) in its last (n-k) columns.
indxq INTEGER. Array, dimension (n). The permutation of the merged eigenvalues set.
rho On exit, rho has been modified to the value required by ?laed3.
dlamda, w REAL for slaed8/claed8 DOUBLE PRECISION for dlaed8/zlaed8. Arrays, dimension (n) each.
The array dlamda(*) contains a copy of the first k eigenvalues which will be used by ?laed3 to form the secular equation. The array w(*) will hold the first k values of the final deflation-altered z-vector and will be passed to ?laed3.
q2 REAL for slaed8 DOUBLE PRECISION for dlaed8 COMPLEX for claed8 DOUBLE COMPLEX for zlaed8. Array q2(ldq2, *). The second dimension of q2 must be at least max(1, n). Contains a copy of the first k eigenvectors which will be used by slaed7/dlaed7 in a matrix multiply (sgemm/dgemm) to update the new eigenvectors. For real flavors, if icompq = 0, q2 is not referenced.
indxp, indx INTEGER. Workspace arrays, dimension (n) each. The array indxp(*) will contain the permutation used to place deflated values of d at the end of the array. On output, indxp(1:k) points to the nondeflated d-values and indxp(k+1:n) points to the deflated eigenvalues. The array indx(*) will contain the permutation used to sort the contents of d into ascending order.
perm INTEGER. Array, dimension (n). Contains the permutations (from deflation and sorting) to be applied to each eigenblock.
givptr INTEGER. Contains the number of Givens rotations which took place in this subproblem.
givcol INTEGER. Array, dimension (2, n). Each pair of numbers indicates a pair of columns to take place in a Givens rotation.
givnum REAL for slaed8/claed8 DOUBLE PRECISION for dlaed8/zlaed8. Array, dimension (2, n). Each number indicates the S value to be used in the corresponding Givens rotation.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

?laed9
Used by sstedc/dstedc. Finds the roots of the secular equation and updates the eigenvectors. Used when the original matrix is dense.
Syntax
call slaed9( k, kstart, kstop, n, d, q, ldq, rho, dlamda, w, s, lds, info )
call dlaed9( k, kstart, kstop, n, d, q, ldq, rho, dlamda, w, s, lds, info )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine finds the roots of the secular equation, as defined by the values in d, z, and rho, between kstart and kstop. It makes the appropriate calls to slaed4/dlaed4 and then stores the new matrix of eigenvectors for use in calculating the next level of z vectors.

Input Parameters
k INTEGER. The number of terms in the rational function to be solved by slaed4/dlaed4 (k ≥ 0).
kstart, kstop INTEGER. The updated eigenvalues lambda(i), kstart ≤ i ≤ kstop, are to be computed. 1 ≤ kstart ≤ kstop ≤ k.
n INTEGER. The number of rows and columns in the Q matrix. n ≥ k (deflation may result in n > k).
q REAL for slaed9 DOUBLE PRECISION for dlaed9. Workspace array, dimension (ldq, *). The second dimension of q must be at least max(1, n).
ldq INTEGER. The leading dimension of the array q; ldq ≥ max(1, n).
rho REAL for slaed9 DOUBLE PRECISION for dlaed9 The value of the parameter in the rank one update equation. rho ≥ 0 required.
dlamda, w REAL for slaed9 DOUBLE PRECISION for dlaed9 Arrays, dimension (k) each. The first k elements of the array dlamda(*) contain the old roots of the deflated updating problem. These are the poles of the secular equation. The first k elements of the array w(*) contain the components of the deflation-adjusted updating vector.
lds INTEGER. The leading dimension of the output array s; lds ≥ max(1, k).

Output Parameters
d REAL for slaed9 DOUBLE PRECISION for dlaed9 Array, dimension (n). Elements in d(i) are not referenced for 1 ≤ i < kstart or kstop < i ≤ n.
s REAL for slaed9 DOUBLE PRECISION for dlaed9. Array, dimension (lds, *). The second dimension of s must be at least max(1, k).
Will contain the eigenvectors of the repaired matrix which will be stored for subsequent z vector calculation and multiplied by the previously accumulated eigenvectors to update the system.
dlamda On exit, the value is modified to make sure all dlamda(i) - dlamda(j) can be computed with high relative accuracy, barring overflow and underflow.
w Destroyed on exit.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value. If info = 1, the eigenvalue did not converge.

?laeda
Used by ?stedc. Computes the Z vector determining the rank-one modification of the diagonal matrix. Used when the original matrix is dense.

Syntax
call slaeda( n, tlvls, curlvl, curpbm, prmptr, perm, givptr, givcol, givnum, q, qptr, z, ztemp, info )
call dlaeda( n, tlvls, curlvl, curpbm, prmptr, perm, givptr, givcol, givnum, q, qptr, z, ztemp, info )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine ?laeda computes the Z vector corresponding to the merge step in the curlvl-th step of the merge process with tlvls steps for the curpbm-th problem.

Input Parameters
n INTEGER. The dimension of the symmetric tridiagonal matrix (n ≥ 0).
tlvls INTEGER. The total number of merging levels in the overall divide and conquer tree.
curlvl INTEGER. The current level in the overall merge routine, 0 ≤ curlvl ≤ tlvls.
curpbm INTEGER. The current problem in the current level in the overall merge routine (counting from upper left to lower right).
prmptr, perm, givptr INTEGER. Arrays, dimension (n lg n) each. The array prmptr(*) contains a list of pointers which indicate where in perm a level's permutation is stored. prmptr(i+1) - prmptr(i) indicates the size of the permutation and also the size of the full, non-deflated problem. The array perm(*) contains the permutations (from deflation and sorting) to be applied to each eigenblock.
The array givptr(*) contains a list of pointers which indicate where in givcol a level's Givens rotations are stored. givptr(i+1) - givptr(i) indicates the number of Givens rotations.
givcol INTEGER. Array, dimension (2, n lg n). Each pair of numbers indicates a pair of columns to take place in a Givens rotation.
givnum REAL for slaeda DOUBLE PRECISION for dlaeda. Array, dimension (2, n lg n). Each number indicates the S value to be used in the corresponding Givens rotation.
q REAL for slaeda DOUBLE PRECISION for dlaeda. Array, dimension (n^2). Contains the square eigenblocks from previous levels; the starting positions for blocks are given by qptr.
qptr INTEGER. Array, dimension (n+2). Contains a list of pointers which indicate where in q an eigenblock is stored. sqrt(qptr(i+1) - qptr(i)) indicates the size of the block.
ztemp REAL for slaeda DOUBLE PRECISION for dlaeda. Workspace array, dimension (n).

Output Parameters
z REAL for slaeda DOUBLE PRECISION for dlaeda. Array, dimension (n). Contains the updating vector (the last row of the first sub-eigenvector matrix and the first row of the second sub-eigenvector matrix).
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

?laein
Computes a specified right or left eigenvector of an upper Hessenberg matrix by inverse iteration.
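The idea of inverse iteration can be sketched in a few lines of Python for a real eigenvalue and a right eigenvector. This is a simplified illustration, not the algorithm implemented in ?laein: the names solve and inverse_iteration are hypothetical, the starting vector of all ones and the fixed iteration count are assumptions, and a general dense elimination is used instead of one that exploits the Hessenberg structure. Pivots smaller than eps3 in magnitude are replaced by eps3, and the result is scaled so that its largest component is 1.

```python
def solve(a, b, eps3):
    """Solve a*x = b by Gaussian elimination with partial pivoting.

    Pivots smaller than eps3 in magnitude are replaced by eps3 so that a
    (nearly) singular shifted matrix can still be factored.
    """
    n = len(b)
    a = [row[:] for row in a]
    b = b[:]
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(a[r][k]))
        a[k], a[p] = a[p], a[k]
        b[k], b[p] = b[p], b[k]
        if abs(a[k][k]) < eps3:
            a[k][k] = eps3
        for r in range(k + 1, n):
            m = a[r][k] / a[k][k]
            for c in range(k, n):
                a[r][c] -= m * a[k][c]
            b[r] -= m * b[k]
    x = [0.0] * n
    for k in range(n - 1, -1, -1):
        s = b[k] - sum(a[k][c] * x[c] for c in range(k + 1, n))
        x[k] = s / a[k][k]
    return x


def inverse_iteration(h, w, iters=5, eps3=1e-12):
    """Approximate a right eigenvector of h for the real eigenvalue w."""
    n = len(h)
    shifted = [[h[r][c] - (w if r == c else 0.0) for c in range(n)]
               for r in range(n)]
    v = [1.0] * n
    for _ in range(iters):
        v = solve(shifted, v, eps3)
        big = max(v, key=abs)          # component of largest magnitude
        v = [x / big for x in v]       # normalize that component to 1
    return v


# Upper Hessenberg matrix with eigenvalues 1, 3 and 5.
h = [[2.0, 1.0, 0.0],
     [1.0, 2.0, 1.0],
     [0.0, 0.0, 5.0]]
v = inverse_iteration(h, 1.0)
```

For the eigenvalue 1 of this matrix the exact eigenvector is proportional to (1, -1, 0), which the iterate approaches after a few steps.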
Syntax
call slaein( rightv, noinit, n, h, ldh, wr, wi, vr, vi, b, ldb, work, eps3, smlnum, bignum, info )
call dlaein( rightv, noinit, n, h, ldh, wr, wi, vr, vi, b, ldb, work, eps3, smlnum, bignum, info )
call claein( rightv, noinit, n, h, ldh, w, v, b, ldb, rwork, eps3, smlnum, info )
call zlaein( rightv, noinit, n, h, ldh, w, v, b, ldb, rwork, eps3, smlnum, info )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine ?laein uses inverse iteration to find a right or left eigenvector corresponding to the eigenvalue (wr,wi) of a real upper Hessenberg matrix H (for real flavors slaein/dlaein) or to the eigenvalue w of a complex upper Hessenberg matrix H (for complex flavors claein/zlaein).

Input Parameters
rightv LOGICAL. If rightv = .TRUE., compute right eigenvector; if rightv = .FALSE., compute left eigenvector.
noinit LOGICAL. If noinit = .TRUE., no initial vector is supplied in (vr,vi) or in v (for complex flavors); if noinit = .FALSE., initial vector is supplied in (vr,vi) or in v (for complex flavors).
n INTEGER. The order of the matrix H (n ≥ 0).
h REAL for slaein DOUBLE PRECISION for dlaein COMPLEX for claein DOUBLE COMPLEX for zlaein. Array h(ldh, *). The second dimension of h must be at least max(1, n). Contains the upper Hessenberg matrix H.
ldh INTEGER. The leading dimension of the array h; ldh ≥ max(1, n).
wr, wi REAL for slaein DOUBLE PRECISION for dlaein. The real and imaginary parts of the eigenvalue of H whose corresponding right or left eigenvector is to be computed (for real flavors of the routine).
w COMPLEX for claein DOUBLE COMPLEX for zlaein. The eigenvalue of H whose corresponding right or left eigenvector is to be computed (for complex flavors of the routine).
vr, vi REAL for slaein DOUBLE PRECISION for dlaein. Arrays, dimension (n) each. Used for real flavors only. On entry, if noinit = .FALSE.
and wi = 0.0, vr must contain a real starting vector for inverse iteration using the real eigenvalue wr; if noinit = .FALSE. and wi ≠ 0.0, vr and vi must contain the real and imaginary parts of a complex starting vector for inverse iteration using the complex eigenvalue (wr,wi); otherwise vr and vi need not be set.
v COMPLEX for claein DOUBLE COMPLEX for zlaein. Array, dimension (n). Used for complex flavors only. On entry, if noinit = .FALSE., v must contain a starting vector for inverse iteration; otherwise v need not be set.
b REAL for slaein DOUBLE PRECISION for dlaein COMPLEX for claein DOUBLE COMPLEX for zlaein. Workspace array b(ldb, *). The second dimension of b must be at least max(1, n).
ldb INTEGER. The leading dimension of the array b; ldb ≥ n+1 for real flavors; ldb ≥ max(1, n) for complex flavors.
work REAL for slaein DOUBLE PRECISION for dlaein. Workspace array, dimension (n). Used for real flavors only.
rwork REAL for claein DOUBLE PRECISION for zlaein. Workspace array, dimension (n). Used for complex flavors only.
eps3, smlnum REAL for slaein/claein DOUBLE PRECISION for dlaein/zlaein. eps3 is a small machine-dependent value which is used to perturb close eigenvalues, and to replace zero pivots. smlnum is a machine-dependent value close to the underflow threshold. A suggested value for smlnum is slamch('s') * (n/slamch('p')) for slaein/claein or dlamch('s') * (n/dlamch('p')) for dlaein/zlaein. See lamch.
bignum REAL for slaein DOUBLE PRECISION for dlaein. bignum is a machine-dependent value close to the overflow threshold. Used for real flavors only. A suggested value for bignum is 1 / slamch('s') for slaein or 1 / dlamch('s') for dlaein.

Output Parameters
vr, vi On exit, if wi = 0.0 (real eigenvalue), vr contains the computed real eigenvector; if wi ≠ 0.0 (complex eigenvalue), vr and vi contain the real and imaginary parts of the computed complex eigenvector.
The eigenvector is normalized so that the component of largest magnitude has magnitude 1; here the magnitude of a complex number (x,y) is taken to be |x| + |y|. vi is not referenced if wi = 0.0.
v On exit, v contains the computed eigenvector, normalized so that the component of largest magnitude has magnitude 1; here the magnitude of a complex number (x,y) is taken to be |x| + |y|.
info INTEGER. If info = 0, the execution is successful. If info = 1, inverse iteration did not converge. For real flavors, vr is set to the last iterate, and so is vi, if wi ≠ 0.0. For complex flavors, v is set to the last iterate.

?laev2
Computes the eigenvalues and eigenvectors of a 2-by-2 symmetric/Hermitian matrix.

Syntax
call slaev2( a, b, c, rt1, rt2, cs1, sn1 )
call dlaev2( a, b, c, rt1, rt2, cs1, sn1 )
call claev2( a, b, c, rt1, rt2, cs1, sn1 )
call zlaev2( a, b, c, rt1, rt2, cs1, sn1 )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine performs the eigendecomposition of a 2-by-2 symmetric matrix
\[ \begin{pmatrix} a & b \\ b & c \end{pmatrix} \]
(for slaev2/dlaev2) or Hermitian matrix
\[ \begin{pmatrix} a & b \\ \overline{b} & c \end{pmatrix} \]
(for claev2/zlaev2). On return, rt1 is the eigenvalue of larger absolute value, rt2 of smaller absolute value, and (cs1, sn1) is the unit right eigenvector for rt1, giving the decomposition
\[ \begin{pmatrix} cs1 & sn1 \\ -sn1 & cs1 \end{pmatrix} \begin{pmatrix} a & b \\ b & c \end{pmatrix} \begin{pmatrix} cs1 & -sn1 \\ sn1 & cs1 \end{pmatrix} = \begin{pmatrix} rt1 & 0 \\ 0 & rt2 \end{pmatrix} \]
(for slaev2/dlaev2), or
\[ \begin{pmatrix} cs1 & \overline{sn1} \\ -sn1 & cs1 \end{pmatrix} \begin{pmatrix} a & b \\ \overline{b} & c \end{pmatrix} \begin{pmatrix} cs1 & -\overline{sn1} \\ sn1 & cs1 \end{pmatrix} = \begin{pmatrix} rt1 & 0 \\ 0 & rt2 \end{pmatrix} \]
(for claev2/zlaev2), where the overline denotes complex conjugation.

Input Parameters
a, b, c REAL for slaev2 DOUBLE PRECISION for dlaev2 COMPLEX for claev2 DOUBLE COMPLEX for zlaev2. Elements of the input matrix.

Output Parameters
rt1, rt2 REAL for slaev2/claev2 DOUBLE PRECISION for dlaev2/zlaev2. Eigenvalues of larger and smaller absolute value, respectively.
cs1 REAL for slaev2/claev2 DOUBLE PRECISION for dlaev2/zlaev2.
sn1 REAL for slaev2 DOUBLE PRECISION for dlaev2 COMPLEX for claev2 DOUBLE COMPLEX for zlaev2. The vector (cs1, sn1) is the unit right eigenvector for rt1.

Application Notes
rt1 is accurate to a few ulps barring over/underflow.
rt2 may be inaccurate if there is massive cancellation in the determinant a*c-b*b; higher precision or correctly rounded or correctly truncated arithmetic would be needed to compute rt2 accurately in all cases. cs1 and sn1 are accurate to a few ulps barring over/underflow. Overflow is possible only if rt1 is within a factor of 5 of overflow. Underflow is harmless if the input data is 0 or exceeds underflow_threshold / macheps.

?laexc
Swaps adjacent diagonal blocks of a real upper quasi-triangular matrix in Schur canonical form, by an orthogonal similarity transformation.

Syntax
call slaexc( wantq, n, t, ldt, q, ldq, j1, n1, n2, work, info )
call dlaexc( wantq, n, t, ldt, q, ldq, j1, n1, n2, work, info )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine swaps adjacent diagonal blocks T11 and T22 of order 1 or 2 in an upper quasi-triangular matrix T by an orthogonal similarity transformation. T must be in Schur canonical form, that is, block upper triangular with 1-by-1 and 2-by-2 diagonal blocks; each 2-by-2 diagonal block has its diagonal elements equal and its off-diagonal elements of opposite sign.

Input Parameters
wantq LOGICAL. If wantq = .TRUE., accumulate the transformation in the matrix Q; if wantq = .FALSE., do not accumulate the transformation.
n INTEGER. The order of the matrix T (n ≥ 0).
t, q REAL for slaexc DOUBLE PRECISION for dlaexc Arrays: t(ldt,*) contains on entry the upper quasi-triangular matrix T, in Schur canonical form. The second dimension of t must be at least max(1, n). q(ldq,*) contains on entry, if wantq = .TRUE., the orthogonal matrix Q. If wantq = .FALSE., q is not referenced. The second dimension of q must be at least max(1, n).
ldt INTEGER. The leading dimension of t; at least max(1, n).
ldq INTEGER. The leading dimension of q; if wantq = .FALSE., then ldq ≥ 1; if wantq = .TRUE., then ldq ≥ max(1,n).
j1 INTEGER.
The index of the first row of the first block T11.
n1 INTEGER. The order of the first block T11 (n1 = 0, 1, or 2).
n2 INTEGER. The order of the second block T22 (n2 = 0, 1, or 2).
work REAL for slaexc; DOUBLE PRECISION for dlaexc. Workspace array, DIMENSION (n).

Output Parameters
t On exit, the updated matrix T, again in Schur canonical form.
q On exit, if wantq = .TRUE., the updated matrix Q.
info INTEGER. If info = 0, the execution is successful. If info = 1, the transformed matrix T would be too far from Schur form; the blocks are not swapped and T and Q are unchanged.

?lag2
Computes the eigenvalues of a 2-by-2 generalized eigenvalue problem, with scaling as necessary to avoid over-/underflow.

Syntax
call slag2( a, lda, b, ldb, safmin, scale1, scale2, wr1, wr2, wi )
call dlag2( a, lda, b, ldb, safmin, scale1, scale2, wr1, wr2, wi )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine computes the eigenvalues of a 2 x 2 generalized eigenvalue problem A - w*B, with scaling as necessary to avoid over-/underflow. The scaling results in a modified eigenvalue equation s*A - w*B, where s is a non-negative scaling factor chosen so that w, w*B, and s*A do not overflow and, if possible, do not underflow either.

Input Parameters
a, b REAL for slag2 DOUBLE PRECISION for dlag2 Arrays: a(lda,2) contains, on entry, the 2 x 2 matrix A. It is assumed that its 1-norm is less than 1/safmin. Entries less than sqrt(safmin)*norm(A) are subject to being treated as zero. b(ldb,2) contains, on entry, the 2 x 2 upper triangular matrix B. It is assumed that the 1-norm of B is less than 1/safmin. The diagonals should be at least sqrt(safmin) times the largest element of B (in absolute value); if a diagonal is smaller than that, then +/- sqrt(safmin) will be used instead of that diagonal.
lda INTEGER. The leading dimension of a; lda ≥ 2.
ldb INTEGER. The leading dimension of b; ldb ≥ 2.
safmin REAL for slag2; DOUBLE PRECISION for dlag2. The smallest positive number such that 1/safmin does not overflow. (This should always be ?lamch('S'); it is an argument in order to avoid having to call ?lamch frequently.)

Output Parameters
scale1 REAL for slag2; DOUBLE PRECISION for dlag2. A scaling factor used to avoid over-/underflow in the eigenvalue equation which defines the first eigenvalue. If the eigenvalues are complex, then the eigenvalues are (wr1 +/- wi*i)/scale1 (which may lie outside the exponent range of the machine), scale1=scale2, and scale1 will always be positive. If the eigenvalues are real, then the first (real) eigenvalue is wr1/scale1, but this may overflow or underflow, and in fact, scale1 may be zero or less than the underflow threshold if the exact eigenvalue is sufficiently large.
scale2 REAL for slag2; DOUBLE PRECISION for dlag2. A scaling factor used to avoid over-/underflow in the eigenvalue equation which defines the second eigenvalue. If the eigenvalues are complex, then scale2=scale1. If the eigenvalues are real, then the second (real) eigenvalue is wr2/scale2, but this may overflow or underflow, and in fact, scale2 may be zero or less than the underflow threshold if the exact eigenvalue is sufficiently large.
wr1 REAL for slag2; DOUBLE PRECISION for dlag2. If the eigenvalues are real, then wr1 is scale1 times the eigenvalue closest to the (2,2) element of A*inv(B). If the eigenvalues are complex, then wr1=wr2 is scale1 times the real part of the eigenvalues.
wr2 REAL for slag2; DOUBLE PRECISION for dlag2. If the eigenvalues are real, then wr2 is scale2 times the other eigenvalue. If the eigenvalues are complex, then wr1=wr2 is scale1 times the real part of the eigenvalues.
wi REAL for slag2; DOUBLE PRECISION for dlag2. If the eigenvalues are real, then wi is zero. If the eigenvalues are complex, then wi is scale1 times the imaginary part of the eigenvalues. wi will always be non-negative.
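The eigenvalue problem that ?lag2 addresses can be illustrated without the scaling logic. The following is a minimal Python sketch, not the MKL implementation: it assumes B is nonsingular and that no intermediate product overflows, and it simply solves the quadratic det(A - w*B) = 0. The function name pencil_eigenvalues_2x2 is hypothetical and is not part of any library.

```python
import cmath


def pencil_eigenvalues_2x2(a, b):
    """Return both roots w of det(A - w*B) = 0 for a 2-by-2 A and upper triangular B.

    Illustrative only: no scaling is applied (unlike ?lag2), so b11*b22 must be
    nonzero and the intermediate products must stay within floating-point range.
    """
    (a11, a12), (a21, a22) = a
    (b11, b12), (_, b22) = b          # b21 is assumed to be zero (B upper triangular)
    qa = b11 * b22                    # coefficient of w**2
    if qa == 0.0:
        raise ValueError("B is singular; this unscaled sketch cannot proceed")
    qb = -(a11 * b22 + a22 * b11 - a21 * b12)   # coefficient of w
    qc = a11 * a22 - a12 * a21                  # constant term, det(A)
    root = cmath.sqrt(qb * qb - 4.0 * qa * qc)
    # The root with non-negative imaginary part is returned first.
    return (-qb + root) / (2.0 * qa), (-qb - root) / (2.0 * qa)


if __name__ == "__main__":
    # Real eigenvalues 3 and 2.
    print(pencil_eigenvalues_2x2([[4.0, 0.0], [0.0, 9.0]], [[2.0, 5.0], [0.0, 3.0]]))
    # Complex conjugate pair +i and -i.
    print(pencil_eigenvalues_2x2([[0.0, 1.0], [-1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]]))
```

The direct quadratic formula can lose accuracy or overflow for badly scaled input, which is the situation the scale1 and scale2 outputs of ?lag2 are designed to handle.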
?lags2
Computes 2-by-2 orthogonal matrices U, V, and Q, and applies them to matrices A and B such that the rows of the transformed A and B are parallel.

Syntax
call slags2( upper, a1, a2, a3, b1, b2, b3, csu, snu, csv, snv, csq, snq)
call dlags2( upper, a1, a2, a3, b1, b2, b3, csu, snu, csv, snv, csq, snq)
call clags2( upper, a1, a2, a3, b1, b2, b3, csu, snu, csv, snv, csq, snq)
call zlags2( upper, a1, a2, a3, b1, b2, b3, csu, snu, csv, snv, csq, snq)

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
For real flavors, the routine computes 2-by-2 orthogonal matrices U, V and Q, such that if upper = .TRUE., then
\[ U^T A Q = U^T \begin{pmatrix} a_1 & a_2 \\ 0 & a_3 \end{pmatrix} Q = \begin{pmatrix} x & 0 \\ x & x \end{pmatrix} \]
and
\[ V^T B Q = V^T \begin{pmatrix} b_1 & b_2 \\ 0 & b_3 \end{pmatrix} Q = \begin{pmatrix} x & 0 \\ x & x \end{pmatrix} \]
or if upper = .FALSE., then
\[ U^T A Q = U^T \begin{pmatrix} a_1 & 0 \\ a_2 & a_3 \end{pmatrix} Q = \begin{pmatrix} x & x \\ 0 & x \end{pmatrix} \]
and
\[ V^T B Q = V^T \begin{pmatrix} b_1 & 0 \\ b_2 & b_3 \end{pmatrix} Q = \begin{pmatrix} x & x \\ 0 & x \end{pmatrix} \]
The rows of the transformed A and B are parallel, where
\[ U = \begin{pmatrix} csu & snu \\ -snu & csu \end{pmatrix}, \quad V = \begin{pmatrix} csv & snv \\ -snv & csv \end{pmatrix}, \quad Q = \begin{pmatrix} csq & snq \\ -snq & csq \end{pmatrix} \]
Here Z^T denotes the transpose of Z.
For complex flavors, the routine computes 2-by-2 unitary matrices U, V and Q, such that if upper = .TRUE., then
\[ U^H A Q = U^H \begin{pmatrix} a_1 & a_2 \\ 0 & a_3 \end{pmatrix} Q = \begin{pmatrix} x & 0 \\ x & x \end{pmatrix} \]
and
\[ V^H B Q = V^H \begin{pmatrix} b_1 & b_2 \\ 0 & b_3 \end{pmatrix} Q = \begin{pmatrix} x & 0 \\ x & x \end{pmatrix} \]
or if upper = .FALSE., then
\[ U^H A Q = U^H \begin{pmatrix} a_1 & 0 \\ a_2 & a_3 \end{pmatrix} Q = \begin{pmatrix} x & x \\ 0 & x \end{pmatrix} \]
and
\[ V^H B Q = V^H \begin{pmatrix} b_1 & 0 \\ b_2 & b_3 \end{pmatrix} Q = \begin{pmatrix} x & x \\ 0 & x \end{pmatrix} \]
The rows of the transformed A and B are parallel, where
\[ U = \begin{pmatrix} csu & snu \\ -\overline{snu} & csu \end{pmatrix}, \quad V = \begin{pmatrix} csv & snv \\ -\overline{snv} & csv \end{pmatrix}, \quad Q = \begin{pmatrix} csq & snq \\ -\overline{snq} & csq \end{pmatrix} \]
Here Z^H denotes the conjugate transpose of Z and the overline denotes complex conjugation.

Input Parameters
upper LOGICAL. If upper = .TRUE., the input matrices A and B are upper triangular; if upper = .FALSE., the input matrices A and B are lower triangular.
a1, a3 REAL for slags2 and clags2 DOUBLE PRECISION for dlags2 and zlags2
a2 REAL for slags2 DOUBLE PRECISION for dlags2 COMPLEX for clags2 COMPLEX*16 for zlags2 On entry, a1, a2 and a3 are elements of the input 2-by-2 upper (lower) triangular matrix A.
b1, b3 REAL for slags2 and clags2 DOUBLE PRECISION for dlags2 and zlags2
b2 REAL for slags2 DOUBLE PRECISION for dlags2 COMPLEX for clags2 COMPLEX*16 for zlags2 On entry, b1, b2 and b3 are elements of the input 2-by-2 upper (lower) triangular matrix B.

Output Parameters
csu REAL for slags2 and clags2 DOUBLE PRECISION for dlags2 and zlags2 Element of the desired orthogonal matrix U.
snu REAL for slags2 DOUBLE PRECISION for dlags2 COMPLEX for clags2 COMPLEX*16 for zlags2 Element of the desired orthogonal matrix U.
csv REAL for slags2 and clags2 DOUBLE PRECISION for dlags2 and zlags2 Element of the desired orthogonal matrix V.
snv REAL for slags2 DOUBLE PRECISION for dlags2 COMPLEX for clags2 COMPLEX*16 for zlags2 Element of the desired orthogonal matrix V.
csq REAL for slags2 and clags2 DOUBLE PRECISION for dlags2 and zlags2 Element of the desired orthogonal matrix Q.
snq REAL for slags2 DOUBLE PRECISION for dlags2 COMPLEX for clags2 COMPLEX*16 for zlags2 Element of the desired orthogonal matrix Q.

?lagtf
Computes an LU factorization of a matrix T-λ*I, where T is a general tridiagonal matrix, and λ is a scalar, using partial pivoting with row interchanges.

Syntax
call slagtf( n, a, lambda, b, c, tol, d, in, info )
call dlagtf( n, a, lambda, b, c, tol, d, in, info )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine factorizes the matrix (T - lambda*I), where T is an n-by-n tridiagonal matrix and lambda is a scalar, as T - lambda*I = P*L*U, where P is a permutation matrix, L is a unit lower tridiagonal matrix with at most one non-zero sub-diagonal element per column, and U is an upper triangular matrix with at most two non-zero super-diagonal elements per column. The factorization is obtained by Gaussian elimination with partial pivoting and implicit row scaling. The parameter lambda is included in the routine so that ?lagtf may be used, in conjunction with ?lagts, to obtain eigenvectors of T by inverse iteration.

Input Parameters
n INTEGER. The order of the matrix T (n ≥ 0).
a, b, c REAL for slagtf DOUBLE PRECISION for dlagtf Arrays, dimension a(n), b(n-1), c(n-1): On entry, a(*) must contain the diagonal elements of the matrix T. On entry, b(*) must contain the (n-1) super-diagonal elements of T. On entry, c(*) must contain the (n-1) sub-diagonal elements of T.
tol REAL for slagtf DOUBLE PRECISION for dlagtf On entry, a relative tolerance used to indicate whether or not the matrix (T - lambda*I) is nearly singular. tol should normally be chosen as approximately the largest relative error in the elements of T. For example, if the elements of T are correct to about 4 significant figures, then tol should be set to about 5*10^-4. If tol is supplied as less than eps, where eps is the relative machine precision, then the value eps is used in place of tol.

Output Parameters
a On exit, a is overwritten by the n diagonal elements of the upper triangular matrix U of the factorization of T.
b On exit, b is overwritten by the n-1 super-diagonal elements of the matrix U of the factorization of T.
c On exit, c is overwritten by the n-1 sub-diagonal elements of the matrix L of the factorization of T.
d REAL for slagtf DOUBLE PRECISION for dlagtf Array, dimension (n-2). On exit, d is overwritten by the n-2 second super-diagonal elements of the matrix U of the factorization of T.
in INTEGER. Array, dimension (n). On exit, in contains details of the permutation matrix P. If an interchange occurred at the k-th step of the elimination, then in(k) = 1, otherwise in(k) = 0. The element in(n) returns the smallest positive integer j such that abs(u(j,j)) ≤ norm((T - lambda*I)(j))*tol, where norm(A(j)) denotes the sum of the absolute values of the j-th row of the matrix A. If no such j exists then in(n) is returned as zero. If in(n) is returned as positive, then a diagonal element of U is small, indicating that (T - lambda*I) is singular or nearly singular.
info INTEGER. If info = 0, the execution is successful. If info = -k, the k-th parameter had an illegal value.

?lagtm
Performs a matrix-matrix product of the form C = alpha*A*B+beta*C, where A is a tridiagonal matrix, B and C are rectangular matrices, and alpha and beta are scalars, which may be 0, 1, or -1.
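The product named in the ?lagtm summary above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the MKL routine: the function name tridiag_update is hypothetical, matrices are plain lists of rows, only the untransposed product is shown, and alpha and beta are treated as ordinary numbers with no restriction on their values.

```python
def tridiag_update(alpha, dl, d, du, x, beta, b):
    """Return alpha*A*X + beta*B, where A is tridiagonal.

    dl, d, du hold the sub-diagonal (n-1), diagonal (n) and super-diagonal (n-1)
    entries of A; x and b are n-by-nrhs matrices stored as lists of rows.
    """
    n = len(d)
    nrhs = len(x[0]) if x else 0
    result = []
    for i in range(n):
        row = []
        for j in range(nrhs):
            s = d[i] * x[i][j]                 # diagonal contribution
            if i > 0:
                s += dl[i - 1] * x[i - 1][j]   # A(i, i-1) is dl[i-1] (0-based)
            if i < n - 1:
                s += du[i] * x[i + 1][j]       # A(i, i+1) is du[i] (0-based)
            row.append(alpha * s + beta * b[i][j])
        result.append(row)
    return result


if __name__ == "__main__":
    ones = [[1.0], [1.0], [1.0]]
    # A has diagonal 2, sub-diagonal 1, super-diagonal 3; A*ones = (5, 6, 3).
    print(tridiag_update(1.0, [1.0, 1.0], [2.0, 2.0, 2.0], [3.0, 3.0], ones, -1.0, ones))
```

Only the three stored diagonals are touched, so the cost per right-hand side is proportional to n rather than n squared.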
Syntax
call slagtm( trans, n, nrhs, alpha, dl, d, du, x, ldx, beta, b, ldb )
call dlagtm( trans, n, nrhs, alpha, dl, d, du, x, ldx, beta, b, ldb )
call clagtm( trans, n, nrhs, alpha, dl, d, du, x, ldx, beta, b, ldb )
call zlagtm( trans, n, nrhs, alpha, dl, d, du, x, ldx, beta, b, ldb )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine performs a matrix-matrix product of the form: B := alpha*A*X + beta*B where A is a tridiagonal matrix of order n, B and X are n-by-nrhs matrices, and alpha and beta are real scalars, each of which may be 0., 1., or -1.

Input Parameters
trans CHARACTER*1. Must be 'N' or 'T' or 'C'. Indicates the form of the equations: If trans = 'N', then B := alpha*A*X + beta*B (no transpose); If trans = 'T', then B := alpha*A^T*X + beta*B (transpose); If trans = 'C', then B := alpha*A^H*X + beta*B (conjugate transpose).
n INTEGER. The order of the matrix A (n ≥ 0).
nrhs INTEGER. The number of right-hand sides, i.e., the number of columns in X and B (nrhs ≥ 0).
alpha, beta REAL for slagtm/clagtm DOUBLE PRECISION for dlagtm/zlagtm Specify the scalars alpha and beta respectively. alpha must be 0., 1., or -1.; otherwise, it is assumed to be 0. beta must be 0., 1., or -1.; otherwise, it is assumed to be 1.
dl, d, du REAL for slagtm DOUBLE PRECISION for dlagtm COMPLEX for clagtm DOUBLE COMPLEX for zlagtm. Arrays: dl(n - 1), d(n), du(n - 1). The array dl contains the (n - 1) sub-diagonal elements of the tridiagonal matrix A. The array d contains the n diagonal elements of A. The array du contains the (n - 1) super-diagonal elements of A.
x, b REAL for slagtm DOUBLE PRECISION for dlagtm COMPLEX for clagtm DOUBLE COMPLEX for zlagtm. Arrays: x(ldx,*) contains the n-by-nrhs matrix X. The second dimension of x must be at least max(1, nrhs). b(ldb,*) contains the n-by-nrhs matrix B. The second dimension of b must be at least max(1, nrhs).
ldx INTEGER. The leading dimension of the array x; ldx ≥ max(1, n).
ldb INTEGER. The leading dimension of the array b; ldb ≥ max(1, n).

Output Parameters
b Overwritten by the matrix expression B := alpha*A*X + beta*B.

?lagts
Solves the system of equations (T - lambda*I)*x = y or (T - lambda*I)^T*x = y, where T is a general tridiagonal matrix and lambda is a scalar, using the LU factorization computed by ?lagtf.

Syntax
call slagts( job, n, a, b, c, d, in, y, tol, info )
call dlagts( job, n, a, b, c, d, in, y, tol, info )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine may be used to solve for x one of the systems of equations: (T - lambda*I)*x = y or (T - lambda*I)^T*x = y, where T is an n-by-n tridiagonal matrix, following the factorization of (T - lambda*I) as T - lambda*I = P*L*U, computed by the routine ?lagtf. The choice of equation to be solved is controlled by the argument job, and in each case there is an option to perturb zero or very small diagonal elements of U, this option being intended for use in applications such as inverse iteration.

Input Parameters
job INTEGER. Specifies the job to be performed by ?lagts as follows:
= 1: The equations (T - lambda*I)*x = y are to be solved, but diagonal elements of U are not to be perturbed.
= -1: The equations (T - lambda*I)*x = y are to be solved and, if overflow would otherwise occur, the diagonal elements of U are to be perturbed. See argument tol below.
= 2: The equations (T - lambda*I)^T*x = y are to be solved, but diagonal elements of U are not to be perturbed.
= -2: The equations (T - lambda*I)^T*x = y are to be solved and, if overflow would otherwise occur, the diagonal elements of U are to be perturbed. See argument tol below.
n INTEGER. The order of the matrix T (n ≥ 0).
a, b, c, d REAL for slagts DOUBLE PRECISION for dlagts Arrays, dimension a(n), b(n-1), c(n-1), d(n-2): On entry, a(*) must contain the diagonal elements of U as returned from ?lagtf.
On entry, b(*) must contain the first super-diagonal elements of U as returned from ?lagtf. On entry, c(*) must contain the sub-diagonal elements of L as returned from ?lagtf. On entry, d(*) must contain the second super-diagonal elements of U as returned from ?lagtf.
in INTEGER. Array, dimension (n). On entry, in(*) must contain details of the matrix P as returned from ?lagtf.
y REAL for slagts DOUBLE PRECISION for dlagts Array, dimension (n). On entry, the right-hand side vector y.
tol REAL for slagts DOUBLE PRECISION for dlagts. On entry, with job < 0, tol should be the minimum perturbation to be made to very small diagonal elements of U. tol should normally be chosen as about eps*norm(U), where eps is the relative machine precision, but if tol is supplied as non-positive, then it is reset to eps*max( abs( u(i,j)) ). If job > 0 then tol is not referenced.

Output Parameters
y On exit, y is overwritten by the solution vector x.
tol On exit, tol is changed as described in the Input Parameters section above, only if tol is non-positive on entry. Otherwise tol is unchanged.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value. If info = i > 0, overflow would occur when computing the i-th element of the solution vector x. This can only occur when job is supplied as positive, and it means either that a diagonal element of U is very small or that the elements of the right-hand side vector y are very large.

?lagv2
Computes the Generalized Schur factorization of a real 2-by-2 matrix pencil (A,B) where B is upper triangular.

Syntax
call slagv2( a, lda, b, ldb, alphar, alphai, beta, csl, snl, csr, snr )
call dlagv2( a, lda, b, ldb, alphar, alphai, beta, csl, snl, csr, snr )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine computes the Generalized Schur factorization of a real 2-by-2 matrix pencil (A,B) where B is upper triangular.
The routine computes orthogonal (rotation) matrices given by csl, snl and csr, snr such that:
1) if the pencil (A,B) has two real eigenvalues (including 0/0 or 1/0 types), then
\[ \begin{pmatrix} a_{11} & a_{12} \\ 0 & a_{22} \end{pmatrix} := \begin{pmatrix} csl & snl \\ -snl & csl \end{pmatrix} \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix} \begin{pmatrix} csr & -snr \\ snr & csr \end{pmatrix} \]
\[ \begin{pmatrix} b_{11} & b_{12} \\ 0 & b_{22} \end{pmatrix} := \begin{pmatrix} csl & snl \\ -snl & csl \end{pmatrix} \begin{pmatrix} b_{11} & b_{12} \\ 0 & b_{22} \end{pmatrix} \begin{pmatrix} csr & -snr \\ snr & csr \end{pmatrix} \]
2) if the pencil (A,B) has a pair of complex conjugate eigenvalues, then
\[ \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix} := \begin{pmatrix} csl & snl \\ -snl & csl \end{pmatrix} \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix} \begin{pmatrix} csr & -snr \\ snr & csr \end{pmatrix} \]
\[ \begin{pmatrix} b_{11} & 0 \\ 0 & b_{22} \end{pmatrix} := \begin{pmatrix} csl & snl \\ -snl & csl \end{pmatrix} \begin{pmatrix} b_{11} & b_{12} \\ 0 & b_{22} \end{pmatrix} \begin{pmatrix} csr & -snr \\ snr & csr \end{pmatrix} \]
where b11 ≥ b22 > 0.

Input Parameters
a, b REAL for slagv2 DOUBLE PRECISION for dlagv2 Arrays: a(lda,2) contains the 2-by-2 matrix A; b(ldb,2) contains the upper triangular 2-by-2 matrix B.
lda INTEGER. The leading dimension of the array a; lda ≥ 2.
ldb INTEGER. The leading dimension of the array b; ldb ≥ 2.

Output Parameters
a On exit, a is overwritten by the "A-part" of the generalized Schur form.
b On exit, b is overwritten by the "B-part" of the generalized Schur form.
alphar, alphai, beta REAL for slagv2 DOUBLE PRECISION for dlagv2. Arrays, dimension (2) each. (alphar(k) + i*alphai(k))/beta(k) are the eigenvalues of the pencil (A,B), k=1,2 and i = sqrt(-1). Note that beta(k) may be zero.
csl, snl REAL for slagv2 DOUBLE PRECISION for dlagv2 The cosine and sine of the left rotation matrix, respectively.
csr, snr REAL for slagv2 DOUBLE PRECISION for dlagv2 The cosine and sine of the right rotation matrix, respectively.

?lahqr
Computes the eigenvalues and Schur factorization of an upper Hessenberg matrix, using the double-shift/single-shift QR algorithm.

Syntax
call slahqr( wantt, wantz, n, ilo, ihi, h, ldh, wr, wi, iloz, ihiz, z, ldz, info )
call dlahqr( wantt, wantz, n, ilo, ihi, h, ldh, wr, wi, iloz, ihiz, z, ldz, info )
call clahqr( wantt, wantz, n, ilo, ihi, h, ldh, w, iloz, ihiz, z, ldz, info )
call zlahqr( wantt, wantz, n, ilo, ihi, h, ldh, w, iloz, ihiz, z, ldz, info )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine is an auxiliary routine called by ?hseqr to update the eigenvalues and Schur decomposition already computed by ?hseqr, by dealing with the Hessenberg submatrix in rows and columns ilo to ihi.
Input Parameters
wantt LOGICAL. If wantt = .TRUE., the full Schur form T is required; if wantt = .FALSE., eigenvalues only are required.
wantz LOGICAL. If wantz = .TRUE., the matrix of Schur vectors Z is required; if wantz = .FALSE., Schur vectors are not required.
n INTEGER. The order of the matrix H (n ≥ 0).
ilo, ihi INTEGER. It is assumed that h is already upper quasi-triangular in rows and columns ihi+1:n, and that h(ilo,ilo-1) = 0 (unless ilo = 1). The routine ?lahqr works primarily with the Hessenberg submatrix in rows and columns ilo to ihi, but applies transformations to all of h if wantt = .TRUE.. Constraints: 1 ≤ ilo ≤ max(1,ihi); ihi ≤ n.
h, z REAL for slahqr DOUBLE PRECISION for dlahqr COMPLEX for clahqr DOUBLE COMPLEX for zlahqr. Arrays: h(ldh,*) contains the upper Hessenberg matrix H. The second dimension of h must be at least max(1, n). z(ldz,*): if wantz = .TRUE., then, on entry, z must contain the current matrix Z of transformations accumulated by ?hseqr. If wantz = .FALSE., then z is not referenced. The second dimension of z must be at least max(1, n).
ldh INTEGER. The leading dimension of h; at least max(1, n).
ldz INTEGER. The leading dimension of z; at least max(1, n).
iloz, ihiz INTEGER. Specify the rows of z to which transformations must be applied if wantz = .TRUE.. 1 ≤ iloz ≤ ilo; ihi ≤ ihiz ≤ n.

Output Parameters
h On exit, if info = 0 and wantt = .TRUE., then,
• for slahqr/dlahqr, h is upper quasi-triangular in rows and columns ilo:ihi with any 2-by-2 diagonal blocks in standard form.
• for clahqr/zlahqr, h is upper triangular in rows and columns ilo:ihi.
If info = 0 and wantt = .FALSE., the contents of h are unspecified on exit. If info is positive, see description of info for the output state of h.
wr, wi REAL for slahqr DOUBLE PRECISION for dlahqr Arrays, DIMENSION at least max(1, n) each. Used with real flavors only.
The real and imaginary parts, respectively, of the computed eigenvalues ilo to ihi are stored in the corresponding elements of wr and wi. If two eigenvalues are computed as a complex conjugate pair, they are stored in consecutive elements of wr and wi, say the i-th and (i+1)-th, with wi(i) > 0 and wi(i+1) < 0. If wantt = .TRUE., the eigenvalues are stored in the same order as on the diagonal of the Schur form returned in h, with wr(i) = h(i,i), and, if h(i:i+1, i:i+1) is a 2-by-2 diagonal block, wi(i) = sqrt(-h(i+1,i)*h(i,i+1)) and wi(i+1) = -wi(i).
w COMPLEX for clahqr DOUBLE COMPLEX for zlahqr. Array, DIMENSION at least max(1, n). Used with complex flavors only. The computed eigenvalues ilo to ihi are stored in the corresponding elements of w. If wantt = .TRUE., the eigenvalues are stored in the same order as on the diagonal of the Schur form returned in h, with w(i) = h(i,i).
z If wantz = .TRUE., then, on exit z has been updated; transformations are applied only to the submatrix z(iloz:ihiz, ilo:ihi).
info INTEGER. If info = 0, the execution is successful. With info > 0,
• if info = i, ?lahqr failed to compute all the eigenvalues ilo to ihi in a total of 30 iterations per eigenvalue; elements i+1:ihi of wr and wi (for slahqr/dlahqr) or w (for clahqr/zlahqr) contain those eigenvalues which have been successfully computed.
• if wantt is .FALSE., then on exit the remaining unconverged eigenvalues are the eigenvalues of the upper Hessenberg matrix in rows and columns ilo through info of the final output value of h.
• if wantt is .TRUE., then on exit (initial value of h)*u = u*(final value of h), (*) where u is an orthogonal matrix. The final value of h is upper Hessenberg and triangular in rows and columns info+1 through ihi.
• if wantz is .TRUE., then on exit (final value of z) = (initial value of z)*u, where u is the orthogonal matrix in (*) regardless of the value of wantt.
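The relation between a 2-by-2 diagonal block of the real Schur form and the stored pair (wr, wi) can be checked with a short Python sketch. It assumes the block is in standard form, meaning equal diagonal entries and off-diagonal entries of opposite sign, so that the product of the off-diagonal entries is negative and the square root is taken of its magnitude. The function name standard_block_eigenvalues is hypothetical.

```python
import math


def standard_block_eigenvalues(h11, h12, h21, h22):
    """Return ((wr, wi), (wr, -wi)) for a standard-form 2-by-2 block.

    The block [[h11, h12], [h21, h22]] must satisfy h11 == h22 and h12*h21 < 0.
    Its characteristic polynomial (h11 - w)**2 - h12*h21 = 0 then has the
    complex conjugate roots h11 +/- i*sqrt(-h12*h21).
    """
    if h11 != h22 or h12 * h21 >= 0.0:
        raise ValueError("block is not in standard form with complex eigenvalues")
    wi = math.sqrt(-h21 * h12)
    return (h11, wi), (h11, -wi)


if __name__ == "__main__":
    (wr1, wi1), (wr2, wi2) = standard_block_eigenvalues(1.0, 2.0, -8.0, 1.0)
    print(wr1, wi1, wr2, wi2)
    # Consistency checks: the eigenvalue sum is the trace, the product is the determinant.
    assert wr1 + wr2 == 1.0 + 1.0
    assert wr1 * wr2 + wi1 * wi1 == 1.0 * 1.0 - 2.0 * (-8.0)
```

The positive imaginary part comes first, matching the storage order wi(i) > 0 and wi(i+1) < 0 described for the real flavors.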
?lahrd
Reduces the first nb columns of a general rectangular matrix A so that elements below the k-th subdiagonal are zero, and returns auxiliary matrices which are needed to apply the transformation to the unreduced part of A.

Syntax
call slahrd( n, k, nb, a, lda, tau, t, ldt, y, ldy )
call dlahrd( n, k, nb, a, lda, tau, t, ldt, y, ldy )
call clahrd( n, k, nb, a, lda, tau, t, ldt, y, ldy )
call zlahrd( n, k, nb, a, lda, tau, t, ldt, y, ldy )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine reduces the first nb columns of a real/complex general n-by-(n-k+1) matrix A so that elements below the k-th subdiagonal are zero. The reduction is performed by an orthogonal/unitary similarity transformation Q^T*A*Q for real flavors, or Q^H*A*Q for complex flavors. The routine returns the matrices V and T which determine Q as a block reflector I - V*T*V^T (for real flavors) or I - V*T*V^H (for complex flavors), and also the matrix Y = A*V*T.
The matrix Q is represented as a product of nb elementary reflectors: Q = H(1)*H(2)*...*H(nb). Each H(i) has the form H(i) = I - tau*v*v^T for real flavors, or H(i) = I - tau*v*v^H for complex flavors, where tau is a real/complex scalar, and v is a real/complex vector.
This is an obsolete auxiliary routine. Please use the new routine ?lahr2 instead.

Input Parameters
n INTEGER. The order of the matrix A (n ≥ 0).
k INTEGER. The offset for the reduction. Elements below the k-th subdiagonal in the first nb columns are reduced to zero.
nb INTEGER. The number of columns to be reduced.
a REAL for slahrd DOUBLE PRECISION for dlahrd COMPLEX for clahrd DOUBLE COMPLEX for zlahrd. Array a(lda, n-k+1) contains the n-by-(n-k+1) general matrix A to be reduced.
lda INTEGER. The leading dimension of a; at least max(1, n).
ldt INTEGER. The leading dimension of the output array t; must be at least max(1, nb).
ldy INTEGER.
The leading dimension of the output array y; must be at least max(1, n).

Output Parameters
a On exit, the elements on and above the k-th subdiagonal in the first nb columns are overwritten with the corresponding elements of the reduced matrix; the elements below the k-th subdiagonal, with the array tau, represent the matrix Q as a product of elementary reflectors. The other columns of a are unchanged. See Application Notes below.
tau REAL for slahrd DOUBLE PRECISION for dlahrd COMPLEX for clahrd DOUBLE COMPLEX for zlahrd. Array, DIMENSION (nb). Contains scalar factors of the elementary reflectors.
t, y REAL for slahrd DOUBLE PRECISION for dlahrd COMPLEX for clahrd DOUBLE COMPLEX for zlahrd. Arrays, dimension t(ldt, nb), y(ldy, nb). The array t contains the upper triangular matrix T. The array y contains the n-by-nb matrix Y.

Application Notes
For the elementary reflector H(i), v(1:i+k-1) = 0, v(i+k) = 1; v(i+k+1:n) is stored on exit in a(i+k+1:n, i) and tau is stored in tau(i). The elements of the vectors v together form the (n-k+1)-by-nb matrix V which is needed, with T and Y, to apply the transformation to the unreduced part of the matrix, using an update of the form: A := (I - V*T*V^T) * (A - Y*V^T) for real flavors, or A := (I - V*T*V^H) * (A - Y*V^H) for complex flavors.
The contents of A on exit are illustrated by an example with n = 7, k = 3 and nb = 2 [matrix display missing], where a denotes an element of the original matrix A, h denotes a modified element of the upper Hessenberg matrix H, and vi denotes an element of the vector defining H(i).

See Also
?lahr2

?lahr2
Reduces the specified number of first columns of a general rectangular matrix A so that elements below the specified subdiagonal are zero, and returns auxiliary matrices which are needed to apply the transformation to the unreduced part of A.
Syntax
call slahr2( n, k, nb, a, lda, tau, t, ldt, y, ldy )
call dlahr2( n, k, nb, a, lda, tau, t, ldt, y, ldy )
call clahr2( n, k, nb, a, lda, tau, t, ldt, y, ldy )
call zlahr2( n, k, nb, a, lda, tau, t, ldt, y, ldy )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine reduces the first nb columns of a real/complex general n-by-(n-k+1) matrix A so that elements below the k-th subdiagonal are zero. The reduction is performed by an orthogonal/unitary similarity transformation Q^T*A*Q for real flavors, or Q^H*A*Q for complex flavors. The routine returns the matrices V and T which determine Q as a block reflector I - V*T*V^T (for real flavors) or I - V*T*V^H (for complex flavors), and also the matrix Y = A*V*T.
The matrix Q is represented as a product of nb elementary reflectors: Q = H(1)*H(2)*...*H(nb). Each H(i) has the form H(i) = I - tau*v*v^T for real flavors, or H(i) = I - tau*v*v^H for complex flavors, where tau is a real/complex scalar, and v is a real/complex vector.
This is an auxiliary routine called by ?gehrd.

Input Parameters
n INTEGER. The order of the matrix A (n ≥ 0).
k INTEGER. The offset for the reduction. Elements below the k-th subdiagonal in the first nb columns are reduced to zero (k < n).
nb INTEGER. The number of columns to be reduced.
a REAL for slahr2 DOUBLE PRECISION for dlahr2 COMPLEX for clahr2 DOUBLE COMPLEX for zlahr2. Array, DIMENSION (lda, n-k+1), contains the n-by-(n-k+1) general matrix A to be reduced.
lda INTEGER. The leading dimension of the array a; lda ≥ max(1, n).
ldt INTEGER. The leading dimension of the output array t; ldt ≥ nb.
ldy INTEGER. The leading dimension of the output array y; ldy ≥ n.
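The representation of Q as a product of elementary reflectors can be made concrete with a small real-valued Python sketch. It is an illustration only and does not reproduce how ?lahr2 chooses v and tau: the sketch assumes tau = 2/(v^T*v), which makes each H(i) = I - tau*v*v^T orthogonal, and all helper names (reflector, tau_for, matmul, transpose, max_deviation_from_identity) are hypothetical.

```python
def reflector(tau, v):
    """Build H = I - tau*v*v^T as a list of rows."""
    n = len(v)
    return [[(1.0 if i == j else 0.0) - tau * v[i] * v[j] for j in range(n)]
            for i in range(n)]


def tau_for(v):
    """Scalar that makes I - tau*v*v^T orthogonal for a real nonzero v (assumed choice)."""
    return 2.0 / sum(x * x for x in v)


def matmul(a, b):
    """Dense matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def max_deviation_from_identity(m):
    """Largest absolute entry of m - I, used as an orthogonality check."""
    n = len(m)
    return max(abs(m[i][j] - (1.0 if i == j else 0.0))
               for i in range(n) for j in range(n))


if __name__ == "__main__":
    # Vectors shaped like the reflector vectors for n = 4, k = 1:
    # leading zeros, then a unit entry, then free entries.
    v1 = [0.0, 1.0, 0.5, -0.5]
    v2 = [0.0, 0.0, 1.0, 2.0]
    q = matmul(reflector(tau_for(v1), v1), reflector(tau_for(v2), v2))  # Q = H(1)*H(2)
    print(max_deviation_from_identity(matmul(transpose(q), q)))        # close to zero
```

Because each factor is orthogonal, so is the product Q, which is why the transformation Q^T*A*Q is a similarity transformation that preserves eigenvalues.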
Output Parameters
a On exit, the elements on and above the k-th subdiagonal in the first nb columns are overwritten with the corresponding elements of the reduced matrix; the elements below the k-th subdiagonal, with the array tau, represent the matrix Q as a product of elementary reflectors. The other columns of a are unchanged. See Application Notes below.
tau REAL for slahr2 DOUBLE PRECISION for dlahr2 COMPLEX for clahr2 DOUBLE COMPLEX for zlahr2. Array, DIMENSION (nb). Contains scalar factors of the elementary reflectors.
t, y REAL for slahr2 DOUBLE PRECISION for dlahr2 COMPLEX for clahr2 DOUBLE COMPLEX for zlahr2. Arrays, dimension t(ldt, nb), y(ldy, nb). The array t contains the upper triangular matrix T. The array y contains the n-by-nb matrix Y.

Application Notes
For the elementary reflector H(i), v(1:i+k-1) = 0, v(i+k) = 1; v(i+k+1:n) is stored on exit in a(i+k+1:n, i) and tau is stored in tau(i). The elements of the vectors v together form the (n-k+1)-by-nb matrix V which is needed, with T and Y, to apply the transformation to the unreduced part of the matrix, using an update of the form: A := (I - V*T*V^T) * (A - Y*V^T) for real flavors, or A := (I - V*T*V^H) * (A - Y*V^H) for complex flavors.
The contents of A on exit are illustrated by an example with n = 7, k = 3 and nb = 2 [matrix display missing], where a denotes an element of the original matrix A, h denotes a modified element of the upper Hessenberg matrix H, and vi denotes an element of the vector defining H(i).

?laic1
Applies one step of incremental condition estimation.

Syntax
call slaic1( job, j, x, sest, w, gamma, sestpr, s, c )
call dlaic1( job, j, x, sest, w, gamma, sestpr, s, c )
call claic1( job, j, x, sest, w, gamma, sestpr, s, c )
call zlaic1( job, j, x, sest, w, gamma, sestpr, s, c )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine ?laic1 applies one step of incremental condition estimation in its simplest version.
Let x, ||x||_2 = 1 (where ||a||_2 denotes the 2-norm of a), be an approximate singular vector of a j-by-j lower triangular matrix L, such that ||L*x||_2 = sest.
Then ?laic1 computes sestpr, s, c such that the vector (rows separated by semicolons)
xhat = [ s*x ; c ]
is an approximate singular vector of
Lhat = [ L 0 ; w^H gamma ] (for complex flavors), or
Lhat = [ L 0 ; w^T gamma ] (for real flavors),
in the sense that ||Lhat*xhat||_2 = sestpr.
Depending on job, an estimate for the largest or smallest singular value is computed.
For real flavors, [s c]^T and sestpr^2 is an eigenpair of the system
diag(sest*sest, 0) + [alpha gamma] * [ alpha ; gamma ]
where alpha = x^T*w.
For complex flavors, [s c]^H and sestpr^2 is an eigenpair of the system
diag(sest*sest, 0) + [alpha gamma] * [ conjg(alpha) ; conjg(gamma) ]
where alpha = x^H*w.
Input Parameters
job INTEGER. If job = 1, an estimate for the largest singular value is computed; if job = 2, an estimate for the smallest singular value is computed.
j INTEGER. Length of x and w.
x, w REAL for slaic1 DOUBLE PRECISION for dlaic1 COMPLEX for claic1 DOUBLE COMPLEX for zlaic1. Arrays, DIMENSION (j) each. Contain vectors x and w, respectively.
sest REAL for slaic1/claic1; DOUBLE PRECISION for dlaic1/zlaic1. Estimated singular value of the j-by-j matrix L.
gamma REAL for slaic1 DOUBLE PRECISION for dlaic1 COMPLEX for claic1 DOUBLE COMPLEX for zlaic1. The diagonal element gamma.
Output Parameters
sestpr REAL for slaic1/claic1; DOUBLE PRECISION for dlaic1/zlaic1. Estimated singular value of the (j+1)-by-(j+1) matrix Lhat.
s, c REAL for slaic1 DOUBLE PRECISION for dlaic1 COMPLEX for claic1 DOUBLE COMPLEX for zlaic1. Sine and cosine needed in forming xhat.
?laln2
Solves a 1-by-1 or 2-by-2 linear system of equations of the specified form.
Syntax
call slaln2( ltrans, na, nw, smin, ca, a, lda, d1, d2, b, ldb, wr, wi, x, ldx, scale, xnorm, info )
call dlaln2( ltrans, na, nw, smin, ca, a, lda, d1, d2, b, ldb, wr, wi, x, ldx, scale, xnorm, info )
Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h
Description
The routine solves a system of the form
(ca*A - w*D)*X = s*B, or (ca*A^T - w*D)*X = s*B
with possible scaling (s) and perturbation of A.
A is an na-by-na real matrix, ca is a real scalar, D is an na-by-na real diagonal matrix, w is a real or complex value, and X and B are na-by-1 matrices: real if w is real, complex if w is complex. The parameter na may be 1 or 2.
If w is complex, X and B are represented as na-by-2 matrices, the first column of each being the real part and the second being the imaginary part.
The routine computes the scaling factor s (≤ 1), chosen so that X can be computed without overflow. X is further scaled if necessary to ensure that norm(ca*A - w*D)*norm(X) is less than overflow.
If both singular values of (ca*A - w*D) are less than smin, smin*I (where I stands for identity) will be used instead of (ca*A - w*D). If only one singular value is less than smin, one element of (ca*A - w*D) will be perturbed enough to make the smallest singular value roughly smin. If both singular values are at least smin, (ca*A - w*D) will not be perturbed. In any case, the perturbation will be at most some small multiple of max(smin, ulp*norm(ca*A - w*D)).
The singular values are computed by infinity-norm approximations, and thus will only be correct to a factor of 2 or so.
NOTE All input quantities are assumed to be smaller than overflow by a reasonable factor (see bignum).
Input Parameters
ltrans LOGICAL. If ltrans = .TRUE., the transpose of A is used. If ltrans = .FALSE., A is used (not transposed).
na INTEGER. The size of the matrix A, possible values 1 or 2.
nw INTEGER. This parameter must be 1 if w is real, and 2 if w is complex.
Possible values 1 or 2.
smin REAL for slaln2 DOUBLE PRECISION for dlaln2. The desired lower bound on the singular values of A. This should be a safe distance away from underflow or overflow, for example, between (underflow/machine_precision) and (machine_precision * overflow). (See bignum and ulp.)
ca REAL for slaln2 DOUBLE PRECISION for dlaln2. The coefficient by which A is multiplied.
a REAL for slaln2 DOUBLE PRECISION for dlaln2. Array, DIMENSION (lda,na). The na-by-na matrix A.
lda INTEGER. The leading dimension of a. Must be at least na.
d1, d2 REAL for slaln2 DOUBLE PRECISION for dlaln2. The (1,1) and (2,2) elements in the diagonal matrix D, respectively. d2 is not used if na = 1.
b REAL for slaln2 DOUBLE PRECISION for dlaln2. Array, DIMENSION (ldb,nw). The na-by-nw matrix B (right-hand side). If nw = 2 (w is complex), column 1 contains the real part of B and column 2 contains the imaginary part.
ldb INTEGER. The leading dimension of b. Must be at least na.
wr, wi REAL for slaln2 DOUBLE PRECISION for dlaln2. The real and imaginary part of the scalar w, respectively. wi is not used if nw = 1.
ldx INTEGER. The leading dimension of the output array x. Must be at least na.
Output Parameters
x REAL for slaln2 DOUBLE PRECISION for dlaln2. Array, DIMENSION (ldx,nw). The na-by-nw matrix X (unknowns), as computed by the routine. If nw = 2 (w is complex), on exit, column 1 will contain the real part of X and column 2 will contain the imaginary part.
scale REAL for slaln2 DOUBLE PRECISION for dlaln2. The scale factor that B must be multiplied by to ensure that overflow does not occur when computing X. Thus (ca*A - w*D)*X will be scale*B, not B (ignoring perturbations of A). It will be at most 1.
xnorm REAL for slaln2 DOUBLE PRECISION for dlaln2. The infinity-norm of X, when X is regarded as an na-by-nw real matrix.
info INTEGER. An error flag.
It will be zero if no error occurs, a negative number if an argument is in error, or a positive number if (ca*A - w*D) had to be perturbed. The possible values are:
If info = 0: no error occurred, and (ca*A - w*D) did not have to be perturbed.
If info = 1: (ca*A - w*D) had to be perturbed to make its smallest (or only) singular value greater than smin.
NOTE For higher speed, this routine does not check the inputs for errors.
?lals0
Applies back the multiplying factors in solving the least squares problem using the divide-and-conquer SVD approach. Used by ?gelsd.
Syntax
call slals0( icompq, nl, nr, sqre, nrhs, b, ldb, bx, ldbx, perm, givptr, givcol, ldgcol, givnum, ldgnum, poles, difl, difr, z, k, c, s, work, info )
call dlals0( icompq, nl, nr, sqre, nrhs, b, ldb, bx, ldbx, perm, givptr, givcol, ldgcol, givnum, ldgnum, poles, difl, difr, z, k, c, s, work, info )
call clals0( icompq, nl, nr, sqre, nrhs, b, ldb, bx, ldbx, perm, givptr, givcol, ldgcol, givnum, ldgnum, poles, difl, difr, z, k, c, s, rwork, info )
call zlals0( icompq, nl, nr, sqre, nrhs, b, ldb, bx, ldbx, perm, givptr, givcol, ldgcol, givnum, ldgnum, poles, difl, difr, z, k, c, s, rwork, info )
Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h
Description
The routine applies back the multiplying factors of either the left or right singular vector matrix of a diagonal matrix appended by a row to the right hand side matrix B in solving the least squares problem using the divide-and-conquer SVD approach.
For the left singular vector matrix, three types of orthogonal matrices are involved:
(1L) Givens rotations: the number of such rotations is givptr; the pairs of columns/rows they were applied to are stored in givcol; and the c- and s-values of these rotations are stored in givnum.
(2L) Permutation. The (nl+1)-st row of B is to be moved to the first row, and for j=2:n, the perm(j)-th row of B is to be moved to the j-th row.
(3L) The left singular vector matrix of the remaining matrix.
For the right singular vector matrix, four types of orthogonal matrices are involved:
(1R) The right singular vector matrix of the remaining matrix.
(2R) If sqre = 1, one extra Givens rotation to generate the right null space.
(3R) The inverse transformation of (2L).
(4R) The inverse transformation of (1L).
Input Parameters
icompq INTEGER. Specifies whether singular vectors are to be computed in factored form: If icompq = 0: Left singular vector matrix. If icompq = 1: Right singular vector matrix.
nl INTEGER. The row dimension of the upper block. nl ≥ 1.
nr INTEGER. The row dimension of the lower block. nr ≥ 1.
sqre INTEGER. If sqre = 0: the lower block is an nr-by-nr square matrix. If sqre = 1: the lower block is an nr-by-(nr+1) rectangular matrix. The bidiagonal matrix has row dimension n = nl + nr + 1, and column dimension m = n + sqre.
nrhs INTEGER. The number of columns of B and bx. Must be at least 1.
b REAL for slals0 DOUBLE PRECISION for dlals0 COMPLEX for clals0 DOUBLE COMPLEX for zlals0. Array, DIMENSION ( ldb, nrhs ). Contains the right hand sides of the least squares problem in rows 1 through m.
ldb INTEGER. The leading dimension of b. Must be at least max(1,max( m, n )).
bx REAL for slals0 DOUBLE PRECISION for dlals0 COMPLEX for clals0 DOUBLE COMPLEX for zlals0. Workspace array, DIMENSION ( ldbx, nrhs ).
ldbx INTEGER. The leading dimension of bx.
perm INTEGER. Array, DIMENSION (n). The permutations (from deflation and sorting) applied to the two blocks.
givptr INTEGER. The number of Givens rotations which took place in this subproblem.
givcol INTEGER. Array, DIMENSION ( ldgcol, 2 ). Each pair of numbers indicates a pair of rows/columns involved in a Givens rotation.
ldgcol INTEGER. The leading dimension of givcol, must be at least n.
givnum REAL for slals0/clals0 DOUBLE PRECISION for dlals0/zlals0 Array, DIMENSION ( ldgnum, 2 ).
Each number indicates the c or s value used in the corresponding Givens rotation.
ldgnum INTEGER. The leading dimension of arrays difr, poles and givnum, must be at least k.
poles REAL for slals0/clals0 DOUBLE PRECISION for dlals0/zlals0 Array, DIMENSION ( ldgnum, 2 ). On entry, poles(1:k, 1) contains the new singular values obtained from solving the secular equation, and poles(1:k, 2) is an array containing the poles in the secular equation.
difl REAL for slals0/clals0 DOUBLE PRECISION for dlals0/zlals0 Array, DIMENSION ( k ). On entry, difl(i) is the distance between the i-th updated (undeflated) singular value and the i-th (undeflated) old singular value.
difr REAL for slals0/clals0 DOUBLE PRECISION for dlals0/zlals0 Array, DIMENSION ( ldgnum, 2 ). On entry, difr(i, 1) contains the distance between the i-th updated (undeflated) singular value and the (i+1)-th (undeflated) old singular value, and difr(i, 2) is the normalizing factor for the i-th right singular vector.
z REAL for slals0/clals0 DOUBLE PRECISION for dlals0/zlals0 Array, DIMENSION ( k ). Contains the components of the deflation-adjusted updating row vector.
k INTEGER. Contains the dimension of the non-deflated matrix. This is the order of the related secular equation. 1 ≤ k ≤ n.
c REAL for slals0/clals0 DOUBLE PRECISION for dlals0/zlals0 Contains garbage if sqre = 0 and the c value of a Givens rotation related to the right null space if sqre = 1.
s REAL for slals0/clals0 DOUBLE PRECISION for dlals0/zlals0 Contains garbage if sqre = 0 and the s value of a Givens rotation related to the right null space if sqre = 1.
work REAL for slals0 DOUBLE PRECISION for dlals0 Workspace array, DIMENSION ( k ). Used with real flavors only.
rwork REAL for clals0 DOUBLE PRECISION for zlals0 Workspace array, DIMENSION (k*(1+nrhs) + 2*nrhs). Used with complex flavors only.
Output Parameters
b On exit, contains the solution X in rows 1 through n.
info INTEGER.
If info = 0: successful exit. If info = -i < 0, the i-th argument had an illegal value.
?lalsa
Computes the SVD of the coefficient matrix in compact form. Used by ?gelsd.
Syntax
call slalsa( icompq, smlsiz, n, nrhs, b, ldb, bx, ldbx, u, ldu, vt, k, difl, difr, z, poles, givptr, givcol, ldgcol, perm, givnum, c, s, work, iwork, info )
call dlalsa( icompq, smlsiz, n, nrhs, b, ldb, bx, ldbx, u, ldu, vt, k, difl, difr, z, poles, givptr, givcol, ldgcol, perm, givnum, c, s, work, iwork, info )
call clalsa( icompq, smlsiz, n, nrhs, b, ldb, bx, ldbx, u, ldu, vt, k, difl, difr, z, poles, givptr, givcol, ldgcol, perm, givnum, c, s, rwork, iwork, info )
call zlalsa( icompq, smlsiz, n, nrhs, b, ldb, bx, ldbx, u, ldu, vt, k, difl, difr, z, poles, givptr, givcol, ldgcol, perm, givnum, c, s, rwork, iwork, info )
Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h
Description
The routine is an intermediate step in solving the least squares problem by computing the SVD of the coefficient matrix in compact form. The singular vectors are computed as products of simple orthogonal matrices.
If icompq = 0, ?lalsa applies the inverse of the left singular vector matrix of an upper bidiagonal matrix to the right hand side; and if icompq = 1, the routine applies the right singular vector matrix to the right hand side. The singular vector matrices were generated in the compact form by ?lalsa.
Input Parameters
icompq INTEGER. Specifies whether the left or the right singular vector matrix is involved. If icompq = 0: the left singular vector matrix is used. If icompq = 1: the right singular vector matrix is used.
smlsiz INTEGER. The maximum size of the subproblems at the bottom of the computation tree.
n INTEGER. The row and column dimensions of the upper bidiagonal matrix.
nrhs INTEGER. The number of columns of b and bx. Must be at least 1.
b REAL for slalsa DOUBLE PRECISION for dlalsa COMPLEX for clalsa DOUBLE COMPLEX for zlalsa Array, DIMENSION (ldb, nrhs). Contains the right hand sides of the least squares problem in rows 1 through m.
ldb INTEGER. The leading dimension of b in the calling subprogram. Must be at least max(1,max( m, n )).
ldbx INTEGER. The leading dimension of the output array bx.
u REAL for slalsa/clalsa DOUBLE PRECISION for dlalsa/zlalsa Array, DIMENSION (ldu, smlsiz). On entry, u contains the left singular vector matrices of all subproblems at the bottom level.
ldu INTEGER, ldu ≥ n. The leading dimension of arrays u, vt, difl, difr, poles, givnum, and z.
vt REAL for slalsa/clalsa DOUBLE PRECISION for dlalsa/zlalsa Array, DIMENSION (ldu, smlsiz+1). On entry, vt^T (for real flavors) or vt^H (for complex flavors) contains the right singular vector matrices of all subproblems at the bottom level.
k INTEGER array, DIMENSION ( n ).
difl REAL for slalsa/clalsa DOUBLE PRECISION for dlalsa/zlalsa Array, DIMENSION (ldu, nlvl), where nlvl = int(log2(n/(smlsiz+1))) + 1.
difr REAL for slalsa/clalsa DOUBLE PRECISION for dlalsa/zlalsa Array, DIMENSION (ldu, 2*nlvl). On entry, difl(*, i) and difr(*, 2i-1) record distances between singular values on the i-th level and singular values on the (i-1)-th level, and difr(*, 2i) records the normalizing factors of the right singular vector matrices of subproblems on the i-th level.
z REAL for slalsa/clalsa DOUBLE PRECISION for dlalsa/zlalsa Array, DIMENSION (ldu, nlvl). On entry, z(1, i) contains the components of the deflation-adjusted updating row vector for subproblems on the i-th level.
poles REAL for slalsa/clalsa DOUBLE PRECISION for dlalsa/zlalsa Array, DIMENSION (ldu, 2*nlvl). On entry, poles(*, 2i-1: 2i) contains the new and old singular values involved in the secular equations on the i-th level.
givptr INTEGER. Array, DIMENSION ( n ).
On entry, givptr( i ) records the number of Givens rotations performed on the i-th problem on the computation tree.
givcol INTEGER. Array, DIMENSION ( ldgcol, 2*nlvl ). On entry, for each i, givcol(*, 2i-1: 2i) records the locations of Givens rotations performed on the i-th level on the computation tree.
ldgcol INTEGER, ldgcol ≥ n. The leading dimension of arrays givcol and perm.
perm INTEGER. Array, DIMENSION ( ldgcol, nlvl ). On entry, perm(*, i) records permutations done on the i-th level of the computation tree.
givnum REAL for slalsa/clalsa DOUBLE PRECISION for dlalsa/zlalsa Array, DIMENSION (ldu, 2*nlvl). On entry, givnum(*, 2i-1 : 2i) records the c and s values of Givens rotations performed on the i-th level on the computation tree.
c REAL for slalsa/clalsa DOUBLE PRECISION for dlalsa/zlalsa Array, DIMENSION ( n ). On entry, if the i-th subproblem is not square, c( i ) contains the c value of a Givens rotation related to the right null space of the i-th subproblem.
s REAL for slalsa/clalsa DOUBLE PRECISION for dlalsa/zlalsa Array, DIMENSION ( n ). On entry, if the i-th subproblem is not square, s( i ) contains the s value of a Givens rotation related to the right null space of the i-th subproblem.
work REAL for slalsa DOUBLE PRECISION for dlalsa Workspace array, DIMENSION at least (n). Used with real flavors only.
rwork REAL for clalsa DOUBLE PRECISION for zlalsa Workspace array, DIMENSION at least max(n, (smlsiz+1)*nrhs*3). Used with complex flavors only.
iwork INTEGER. Workspace array, DIMENSION at least (3n).
Output Parameters
b On exit, contains the solution X in rows 1 through n.
bx REAL for slalsa DOUBLE PRECISION for dlalsa COMPLEX for clalsa DOUBLE COMPLEX for zlalsa Array, DIMENSION (ldbx, nrhs). On exit, the result of applying the left or right singular vector matrix to b.
info INTEGER. If info = 0: successful exit. If info = -i < 0, the i-th argument had an illegal value.
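The array dimensions listed for ?lalsa all follow from n, smlsiz, and the level count nlvl = int(log2(n/(smlsiz+1))) + 1. The following Python sketch tabulates them for a given problem size. It is only an illustration of the arithmetic: the helper name lalsa_shapes is hypothetical (not part of Intel MKL), it assumes n is at least 1, and by default it takes the smallest leading dimensions the parameter descriptions allow (ldu = n, ldgcol = n).

```python
import math


def lalsa_shapes(n, smlsiz, ldu=None, ldgcol=None):
    """Tabulate the ?lalsa array shapes given in the parameter descriptions.

    Hypothetical helper, not an MKL routine. Assumes n >= 1. The leading
    dimensions default to their smallest allowed values (ldu = n, ldgcol = n).
    """
    ldu = n if ldu is None else ldu
    ldgcol = n if ldgcol is None else ldgcol
    # int() truncates toward zero, as the Fortran INT intrinsic does.
    nlvl = int(math.log2(n / (smlsiz + 1))) + 1
    return {
        "nlvl": nlvl,
        "u": (ldu, smlsiz),
        "vt": (ldu, smlsiz + 1),
        "difl": (ldu, nlvl),
        "difr": (ldu, 2 * nlvl),
        "z": (ldu, nlvl),
        "poles": (ldu, 2 * nlvl),
        "givcol": (ldgcol, 2 * nlvl),
        "perm": (ldgcol, nlvl),
        "givnum": (ldu, 2 * nlvl),
    }


shapes = lalsa_shapes(n=1000, smlsiz=25)
print(shapes["nlvl"], shapes["difr"])  # prints: 6 (1000, 12)
```

For n = 1000 and smlsiz = 25, n/(smlsiz+1) is about 38.5, whose base-2 logarithm truncates to 5, so nlvl = 6.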
?lalsd
Uses the singular value decomposition of A to solve the least squares problem.
Syntax
call slalsd( uplo, smlsiz, n, nrhs, d, e, b, ldb, rcond, rank, work, iwork, info )
call dlalsd( uplo, smlsiz, n, nrhs, d, e, b, ldb, rcond, rank, work, iwork, info )
call clalsd( uplo, smlsiz, n, nrhs, d, e, b, ldb, rcond, rank, work, rwork, iwork, info )
call zlalsd( uplo, smlsiz, n, nrhs, d, e, b, ldb, rcond, rank, work, rwork, iwork, info )
Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h
Description
The routine uses the singular value decomposition of A to solve the least squares problem of finding X to minimize the Euclidean norm of each column of A*X-B, where A is n-by-n upper bidiagonal, and X and B are n-by-nrhs. The solution X overwrites B.
The singular values of A smaller than rcond times the largest singular value are treated as zero in solving the least squares problem; in this case a minimum norm solution is returned. The actual singular values are returned in d in ascending order.
This code makes very mild assumptions about floating point arithmetic. It will work on machines with a guard digit in add/subtract, or on those binary machines without guard digits which subtract like the Cray XMP, Cray YMP, Cray C 90, or Cray 2. It could conceivably fail on hexadecimal or decimal machines without guard digits, but we know of none.
Input Parameters
uplo CHARACTER*1. If uplo = 'U', d and e define an upper bidiagonal matrix. If uplo = 'L', d and e define a lower bidiagonal matrix.
smlsiz INTEGER. The maximum size of the subproblems at the bottom of the computation tree.
n INTEGER. The dimension of the bidiagonal matrix. n ≥ 0.
nrhs INTEGER. The number of columns of B. Must be at least 1.
d REAL for slalsd/clalsd DOUBLE PRECISION for dlalsd/zlalsd Array, DIMENSION (n). On entry, d contains the main diagonal of the bidiagonal matrix.
e REAL for slalsd/clalsd DOUBLE PRECISION for dlalsd/zlalsd Array, DIMENSION (n-1).
Contains the super-diagonal entries of the bidiagonal matrix. On exit, e is destroyed.
b REAL for slalsd DOUBLE PRECISION for dlalsd COMPLEX for clalsd DOUBLE COMPLEX for zlalsd Array, DIMENSION (ldb,nrhs). On input, b contains the right hand sides of the least squares problem. On output, b contains the solution X.
ldb INTEGER. The leading dimension of b in the calling subprogram. Must be at least max(1,n).
rcond REAL for slalsd/clalsd DOUBLE PRECISION for dlalsd/zlalsd The singular values of A less than or equal to rcond times the largest singular value are treated as zero in solving the least squares problem. If rcond is negative, machine precision is used instead. For example, for the least squares problem diag(S)*X=B, where diag(S) is a diagonal matrix of singular values, the solution is X(i)=B(i)/S(i) if S(i) is greater than rcond*max(S), and X(i)=0 if S(i) is less than or equal to rcond*max(S).
rank INTEGER. The number of singular values of A greater than rcond times the largest singular value.
work REAL for slalsd DOUBLE PRECISION for dlalsd COMPLEX for clalsd DOUBLE COMPLEX for zlalsd Workspace array. DIMENSION for real flavors at least (9n + 2n*smlsiz + 8n*nlvl + n*nrhs + (smlsiz+1)^2), where nlvl = max(0, int(log2(n/(smlsiz+1))) + 1). DIMENSION for complex flavors is (n*nrhs).
rwork REAL for clalsd DOUBLE PRECISION for zlalsd Workspace array, used with complex flavors only. DIMENSION at least (9n + 2n*smlsiz + 8n*nlvl + 3*smlsiz*nrhs + (smlsiz+1)^2), where nlvl = max(0, int(log2(min(m,n)/(smlsiz+1))) + 1).
iwork INTEGER. Workspace array of DIMENSION (3n*nlvl + 11n).
Output Parameters
d On exit, if info = 0, d contains singular values of the bidiagonal matrix.
e On exit, destroyed.
b On exit, b contains the solution X.
info INTEGER. If info = 0: successful exit. If info = -i < 0, the i-th argument had an illegal value.
If info > 0: The algorithm failed to compute a singular value while working on the submatrix lying in rows and columns info/(n+1) through mod(info,n+1).
?lamrg
Creates a permutation list to merge the entries of two independently sorted sets into a single set sorted in ascending order.
Syntax
call slamrg( n1, n2, a, strd1, strd2, index )
call dlamrg( n1, n2, a, strd1, strd2, index )
Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h
Description
The routine creates a permutation list which will merge the elements of a (which is composed of two independently sorted sets) into a single set which is sorted in ascending order.
Input Parameters
n1, n2 INTEGER. These arguments contain the respective lengths of the two sorted lists to be merged.
a REAL for slamrg DOUBLE PRECISION for dlamrg. Array, DIMENSION (n1+n2). The first n1 elements of a contain a list of numbers which are sorted in either ascending or descending order. Likewise for the final n2 elements.
strd1, strd2 INTEGER. These are the strides to be taken through the array a. Allowable strides are 1 and -1. They indicate whether a subset of a is sorted in ascending (strdx = 1) or descending (strdx = -1) order.
Output Parameters
index INTEGER. Array, DIMENSION (n1+n2). On exit, this array will contain a permutation such that if b(i) = a(index(i)) for i = 1, ..., n1+n2, then b will be sorted in ascending order.
?laneg
Computes the Sturm count, the number of negative pivots encountered while factoring tridiagonal T - sigma*I = L*D*L^T.
Syntax
value = slaneg( n, d, lld, sigma, pivmin, r )
value = dlaneg( n, d, lld, sigma, pivmin, r )
Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h
Description
The routine computes the Sturm count, the number of negative pivots encountered while factoring tridiagonal T - sigma*I = L*D*L^T. This implementation works directly on the factors without forming the tridiagonal matrix T.
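A count of this kind can be obtained from a short recurrence on the factors, sketched below in Python. This is a simplified illustration rather than the MKL implementation: the function name sturm_count is hypothetical, the sketch covers only the untwisted case (twist index r = n), it assumes n is at least 1, and it assumes that no pivot is exactly zero, so it does not rely on the propagation of infinities and NaNs.

```python
def sturm_count(d, lld, sigma):
    """Count the negative pivots of T - sigma*I, where T = L*D*L^T.

    d   : the n diagonal entries of D
    lld : the n-1 products L(i)*L(i)*D(i)
    Hypothetical helper, not an MKL routine. Handles only the untwisted
    case (r = n) and assumes n >= 1 and that no pivot is exactly zero.
    """
    count = 0
    t = -sigma
    for i in range(len(d) - 1):
        dplus = d[i] + t              # i-th pivot of the shifted factorization
        if dplus < 0:
            count += 1
        t = (t / dplus) * lld[i] - sigma
    if d[-1] + t < 0:                 # last pivot
        count += 1
    return count


# T = [[2, 1], [1, 2]] factors as L*D*L^T with L(1) = 0.5 and D = diag(2, 1.5),
# so lld = [0.5 * 0.5 * 2] = [0.5].
print(sturm_count([2.0, 1.5], [0.5], 2.5))  # prints: 1
```

In this example T has eigenvalues 1 and 3, and exactly one pivot is negative for the shift sigma = 2.5.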
The Sturm count is also the number of eigenvalues of T less than sigma. This routine is called from ?larrb. The current routine does not use the pivmin parameter but rather requires IEEE-754 propagation of infinities and NaNs (NaN stands for 'Not A Number'). This routine also has no input range restrictions but does require default exception handling such that x/0 produces Inf when x is non-zero, and Inf/Inf produces NaN. (For more information see [Marques06]).
Input Parameters
n INTEGER. The order of the matrix.
d REAL for slaneg DOUBLE PRECISION for dlaneg Array, DIMENSION (n). Contains n diagonal elements of the matrix D.
lld REAL for slaneg DOUBLE PRECISION for dlaneg Array, DIMENSION (n-1). Contains (n-1) elements L(i)*L(i)*D(i).
sigma REAL for slaneg DOUBLE PRECISION for dlaneg Shift amount in T - sigma*I = L*D*L^T.
pivmin REAL for slaneg DOUBLE PRECISION for dlaneg The minimum pivot in the Sturm sequence. May be used when zero pivots are encountered on non-IEEE-754 architectures.
r INTEGER. The twist index for the twisted factorization that is used for the negcount.
Output Parameters
value INTEGER. The number of negative pivots encountered while factoring.
?langb
Returns the value of the 1-norm, Frobenius norm, infinity-norm, or the largest absolute value of any element of a general band matrix.
Syntax
val = slangb( norm, n, kl, ku, ab, ldab, work )
val = dlangb( norm, n, kl, ku, ab, ldab, work )
val = clangb( norm, n, kl, ku, ab, ldab, work )
val = zlangb( norm, n, kl, ku, ab, ldab, work )
Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h
Description
The function returns the value of the 1-norm, or the Frobenius norm, or the infinity norm, or the element of largest absolute value of an n-by-n band matrix A, with kl sub-diagonals and ku super-diagonals.
Input Parameters
norm CHARACTER*1.
Specifies the value to be returned by the routine:
= 'M' or 'm': val = max(abs(Aij)), largest absolute value of the matrix A.
= '1' or 'O' or 'o': val = norm1(A), 1-norm of the matrix A (maximum column sum),
= 'I' or 'i': val = normI(A), infinity norm of the matrix A (maximum row sum),
= 'F', 'f', 'E' or 'e': val = normF(A), Frobenius norm of the matrix A (square root of sum of squares).
n INTEGER. The order of the matrix A. n ≥ 0. When n = 0, ?langb is set to zero.
kl INTEGER. The number of sub-diagonals of the matrix A. kl ≥ 0.
ku INTEGER. The number of super-diagonals of the matrix A. ku ≥ 0.
ab REAL for slangb DOUBLE PRECISION for dlangb COMPLEX for clangb DOUBLE COMPLEX for zlangb Array, DIMENSION (ldab,n). The band matrix A, stored in rows 1 to kl+ku+1. The j-th column of A is stored in the j-th column of the array ab as follows: ab(ku+1+i-j,j) = a(i,j) for max(1,j-ku) ≤ i ≤ min(n,j+kl).
ldab INTEGER. The leading dimension of the array ab. ldab ≥ kl+ku+1.
work REAL for slangb/clangb DOUBLE PRECISION for dlangb/zlangb Workspace array, DIMENSION (max(1,lwork)), where lwork ≥ n when norm = 'I'; otherwise, work is not referenced.
Output Parameters
val REAL for slangb/clangb DOUBLE PRECISION for dlangb/zlangb Value returned by the function.
?lange
Returns the value of the 1-norm, Frobenius norm, infinity-norm, or the largest absolute value of any element of a general rectangular matrix.
Syntax
val = slange( norm, m, n, a, lda, work )
val = dlange( norm, m, n, a, lda, work )
val = clange( norm, m, n, a, lda, work )
val = zlange( norm, m, n, a, lda, work )
Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h
Description
The function ?lange returns the value of the 1-norm, or the Frobenius norm, or the infinity norm, or the element of largest absolute value of a real/complex matrix A.
Input Parameters
norm CHARACTER*1.
Specifies the value to be returned by the routine:
= 'M' or 'm': val = max(abs(Aij)), largest absolute value of the matrix A.
= '1' or 'O' or 'o': val = norm1(A), 1-norm of the matrix A (maximum column sum),
= 'I' or 'i': val = normI(A), infinity norm of the matrix A (maximum row sum),
= 'F', 'f', 'E' or 'e': val = normF(A), Frobenius norm of the matrix A (square root of sum of squares).
m INTEGER. The number of rows of the matrix A. m ≥ 0. When m = 0, ?lange is set to zero.
n INTEGER. The number of columns of the matrix A. n ≥ 0. When n = 0, ?lange is set to zero.
a REAL for slange DOUBLE PRECISION for dlange COMPLEX for clange DOUBLE COMPLEX for zlange Array, DIMENSION (lda,n). The m-by-n matrix A.
lda INTEGER. The leading dimension of the array a. lda ≥ max(m,1).
work REAL for slange and clange. DOUBLE PRECISION for dlange and zlange. Workspace array, DIMENSION max(1,lwork), where lwork ≥ m when norm = 'I'; otherwise, work is not referenced.
Output Parameters
val REAL for slange/clange DOUBLE PRECISION for dlange/zlange Value returned by the function.
?langt
Returns the value of the 1-norm, Frobenius norm, infinity-norm, or the largest absolute value of any element of a general tridiagonal matrix.
Syntax
val = slangt( norm, n, dl, d, du )
val = dlangt( norm, n, dl, d, du )
val = clangt( norm, n, dl, d, du )
val = zlangt( norm, n, dl, d, du )
Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h
Description
The routine returns the value of the 1-norm, or the Frobenius norm, or the infinity norm, or the element of largest absolute value of a real/complex tridiagonal matrix A.
Input Parameters
norm CHARACTER*1. Specifies the value to be returned by the routine:
= 'M' or 'm': val = max(abs(Aij)), largest absolute value of the matrix A.
= '1' or 'O' or 'o': val = norm1(A), 1-norm of the matrix A (maximum column sum),
= 'I' or 'i': val = normI(A), infinity norm of the matrix A (maximum row sum),
= 'F', 'f', 'E' or 'e': val = normF(A), Frobenius norm of the matrix A (square root of sum of squares).
n INTEGER. The order of the matrix A. n ≥ 0. When n = 0, ?langt is set to zero.
dl, d, du REAL for slangt DOUBLE PRECISION for dlangt COMPLEX for clangt DOUBLE COMPLEX for zlangt Arrays: dl (n-1), d (n), du (n-1). The array dl contains the (n-1) sub-diagonal elements of A. The array d contains the diagonal elements of A. The array du contains the (n-1) super-diagonal elements of A.
Output Parameters
val REAL for slangt/clangt DOUBLE PRECISION for dlangt/zlangt Value returned by the function.
?lanhs
Returns the value of the 1-norm, Frobenius norm, infinity-norm, or the largest absolute value of any element of an upper Hessenberg matrix.
Syntax
val = slanhs( norm, n, a, lda, work )
val = dlanhs( norm, n, a, lda, work )
val = clanhs( norm, n, a, lda, work )
val = zlanhs( norm, n, a, lda, work )
Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h
Description
The function ?lanhs returns the value of the 1-norm, or the Frobenius norm, or the infinity norm, or the element of largest absolute value of a Hessenberg matrix A.
The value val returned by the function is:
val = max(abs(Aij)), if norm = 'M' or 'm'
= norm1(A), if norm = '1' or 'O' or 'o'
= normI(A), if norm = 'I' or 'i'
= normF(A), if norm = 'F', 'f', 'E' or 'e'
where norm1 denotes the 1-norm of a matrix (maximum column sum), normI denotes the infinity norm of a matrix (maximum row sum) and normF denotes the Frobenius norm of a matrix (square root of sum of squares). Note that max(abs(Aij)) is not a consistent matrix norm.
Input Parameters
norm CHARACTER*1. Specifies the value to be returned by the routine as described above.
n INTEGER. The order of the matrix A. n ≥ 0.
When n = 0, ?lanhs is set to zero.
a REAL for slanhs DOUBLE PRECISION for dlanhs COMPLEX for clanhs DOUBLE COMPLEX for zlanhs Array, DIMENSION (lda,n). The n-by-n upper Hessenberg matrix A; the part of A below the first sub-diagonal is not referenced.
lda INTEGER. The leading dimension of the array a. lda ≥ max(n,1).
work REAL for slanhs and clanhs. DOUBLE PRECISION for dlanhs and zlanhs. Workspace array, DIMENSION (max(1,lwork)), where lwork ≥ n when norm = 'I'; otherwise, work is not referenced.
Output Parameters
val REAL for slanhs/clanhs DOUBLE PRECISION for dlanhs/zlanhs Value returned by the function.
?lansb
Returns the value of the 1-norm, or the Frobenius norm, or the infinity norm, or the element of largest absolute value of a symmetric band matrix.
Syntax
val = slansb( norm, uplo, n, k, ab, ldab, work )
val = dlansb( norm, uplo, n, k, ab, ldab, work )
val = clansb( norm, uplo, n, k, ab, ldab, work )
val = zlansb( norm, uplo, n, k, ab, ldab, work )
Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h
Description
The function ?lansb returns the value of the 1-norm, or the Frobenius norm, or the infinity norm, or the element of largest absolute value of an n-by-n real/complex symmetric band matrix A, with k super-diagonals.
Input Parameters
norm CHARACTER*1. Specifies the value to be returned by the routine:
= 'M' or 'm': val = max(abs(Aij)), largest absolute value of the matrix A.
= '1' or 'O' or 'o': val = norm1(A), 1-norm of the matrix A (maximum column sum),
= 'I' or 'i': val = normI(A), infinity norm of the matrix A (maximum row sum),
= 'F', 'f', 'E' or 'e': val = normF(A), Frobenius norm of the matrix A (square root of sum of squares).
uplo CHARACTER*1. Specifies whether the upper or lower triangular part of the band matrix A is supplied. If uplo = 'U': upper triangular part is supplied; If uplo = 'L': lower triangular part is supplied.
n INTEGER. The order of the matrix A. n ≥ 0.
When n = 0, ?lansb is set to zero. k INTEGER. The number of super-diagonals or sub-diagonals of the band matrix A. k ≥ 0. ab REAL for slansb DOUBLE PRECISION for dlansb COMPLEX for clansb DOUBLE COMPLEX for zlansb Array, DIMENSION (ldab,n). The upper or lower triangle of the symmetric band matrix A, stored in the first k+1 rows of ab. The j-th column of A is stored in the j-th column of the array ab as follows: if uplo = 'U', ab(k+1+i-j,j) = a(i,j) for max(1,j-k) ≤ i ≤ j; if uplo = 'L', ab(1+i-j,j) = a(i,j) for j ≤ i ≤ min(n,j+k). ldab INTEGER. The leading dimension of the array ab. ldab ≥ k+1. work REAL for slansb and clansb. DOUBLE PRECISION for dlansb and zlansb. Workspace array, DIMENSION (max(1,lwork)), where lwork ≥ n when norm = 'I' or '1' or 'O'; otherwise, work is not referenced. Output Parameters val REAL for slansb/clansb DOUBLE PRECISION for dlansb/zlansb Value returned by the function. ?lanhb Returns the value of the 1-norm, or the Frobenius norm, or the infinity norm, or the element of largest absolute value of a Hermitian band matrix. Syntax val = clanhb( norm, uplo, n, k, ab, ldab, work ) val = zlanhb( norm, uplo, n, k, ab, ldab, work ) Include Files • FORTRAN 77: mkl_lapack.fi and mkl_lapack.h Description The routine returns the value of the 1-norm, or the Frobenius norm, or the infinity norm, or the element of largest absolute value of an n-by-n Hermitian band matrix A, with k super-diagonals. Input Parameters norm CHARACTER*1. Specifies the value to be returned by the routine: = 'M' or 'm': val = max(abs(Aij)), largest absolute value of the matrix A. = '1' or 'O' or 'o': val = norm1(A), 1-norm of the matrix A (maximum column sum), = 'I' or 'i': val = normI(A), infinity norm of the matrix A (maximum row sum), = 'F', 'f', 'E' or 'e': val = normF(A), Frobenius norm of the matrix A (square root of sum of squares). uplo CHARACTER*1.
Specifies whether the upper or lower triangular part of the band matrix A is supplied. If uplo = 'U': upper triangular part is supplied; If uplo = 'L': lower triangular part is supplied. n INTEGER. The order of the matrix A. n ≥ 0. When n = 0, ?lanhb is set to zero. k INTEGER. The number of super-diagonals or sub-diagonals of the band matrix A. k ≥ 0. ab COMPLEX for clanhb. DOUBLE COMPLEX for zlanhb. Array, DIMENSION (ldab,n). The upper or lower triangle of the Hermitian band matrix A, stored in the first k+1 rows of ab. The j-th column of A is stored in the j-th column of the array ab as follows: if uplo = 'U', ab(k+1+i-j,j) = a(i,j) for max(1,j-k) ≤ i ≤ j; if uplo = 'L', ab(1+i-j,j) = a(i,j) for j ≤ i ≤ min(n,j+k). Note that the imaginary parts of the diagonal elements need not be set and are assumed to be zero. ldab INTEGER. The leading dimension of the array ab. ldab ≥ k+1. work REAL for clanhb. DOUBLE PRECISION for zlanhb. Workspace array, DIMENSION (max(1,lwork)), where lwork ≥ n when norm = 'I' or '1' or 'O'; otherwise, work is not referenced. Output Parameters val REAL for clanhb. DOUBLE PRECISION for zlanhb. Value returned by the function. ?lansp Returns the value of the 1-norm, or the Frobenius norm, or the infinity norm, or the element of largest absolute value of a symmetric matrix supplied in packed form. Syntax val = slansp( norm, uplo, n, ap, work ) val = dlansp( norm, uplo, n, ap, work ) val = clansp( norm, uplo, n, ap, work ) val = zlansp( norm, uplo, n, ap, work ) Include Files • FORTRAN 77: mkl_lapack.fi and mkl_lapack.h Description The function ?lansp returns the value of the 1-norm, or the Frobenius norm, or the infinity norm, or the element of largest absolute value of a real/complex symmetric matrix A, supplied in packed form. Input Parameters norm CHARACTER*1.
Specifies the value to be returned by the routine: = 'M' or 'm': val = max(abs(Aij)), largest absolute value of the matrix A. = '1' or 'O' or 'o': val = norm1(A), 1-norm of the matrix A (maximum column sum), = 'I' or 'i': val = normI(A), infinity norm of the matrix A (maximum row sum), = 'F', 'f', 'E' or 'e': val = normF(A), Frobenius norm of the matrix A (square root of sum of squares). uplo CHARACTER*1. Specifies whether the upper or lower triangular part of the symmetric matrix A is supplied. If uplo = 'U': Upper triangular part of A is supplied. If uplo = 'L': Lower triangular part of A is supplied. n INTEGER. The order of the matrix A. n ≥ 0. When n = 0, ?lansp is set to zero. ap REAL for slansp DOUBLE PRECISION for dlansp COMPLEX for clansp DOUBLE COMPLEX for zlansp Array, DIMENSION (n(n+1)/2). The upper or lower triangle of the symmetric matrix A, packed columnwise in a linear array. The j-th column of A is stored in the array ap as follows: if uplo = 'U', ap(i + (j-1)j/2) = A(i,j) for 1 ≤ i ≤ j; if uplo = 'L', ap(i + (j-1)(2n-j)/2) = A(i,j) for j ≤ i ≤ n. work REAL for slansp and clansp. DOUBLE PRECISION for dlansp and zlansp. Workspace array, DIMENSION (max(1,lwork)), where lwork ≥ n when norm = 'I' or '1' or 'O'; otherwise, work is not referenced. Output Parameters val REAL for slansp/clansp DOUBLE PRECISION for dlansp/zlansp Value returned by the function. ?lanhp Returns the value of the 1-norm, or the Frobenius norm, or the infinity norm, or the element of largest absolute value of a complex Hermitian matrix supplied in packed form. Syntax val = clanhp( norm, uplo, n, ap, work ) val = zlanhp( norm, uplo, n, ap, work ) Include Files • FORTRAN 77: mkl_lapack.fi and mkl_lapack.h Description The function ?lanhp returns the value of the 1-norm, or the Frobenius norm, or the infinity norm, or the element of largest absolute value of a complex Hermitian matrix A, supplied in packed form.
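The packed layout shared by ?lansp and ?lanhp may be easier to follow with a small reference sketch. The Python below is not MKL code: the names packed_index and lansp_ref are hypothetical, the sketch handles only the real symmetric case, and it forms the norms directly from their definitions without the work array or any overflow safeguards the library may use.

```python
import math

def packed_index(i, j, n, uplo):
    """0-based offset in ap of A(i,j), with 1-based i and j as in the manual."""
    if uplo.upper() == 'U':
        assert 1 <= i <= j <= n
        return i + (j - 1) * j // 2 - 1
    assert 1 <= j <= i <= n
    return i + (j - 1) * (2 * n - j) // 2 - 1

def lansp_ref(norm, uplo, n, ap):
    """Illustrative norm of a real symmetric matrix held in packed form."""
    if n == 0:
        return 0.0
    upper = uplo.upper() == 'U'

    def a(i, j):
        # Only one triangle is stored; use symmetry for the other one.
        if upper and i > j:
            i, j = j, i
        elif not upper and i < j:
            i, j = j, i
        return ap[packed_index(i, j, n, uplo)]

    key = norm.upper()
    idx = range(1, n + 1)
    if key == 'M':
        return max(abs(v) for v in ap[: n * (n + 1) // 2])
    if key in ('1', 'O', 'I'):
        # For a symmetric matrix the 1-norm and infinity norm coincide.
        return max(sum(abs(a(i, j)) for i in idx) for j in idx)
    if key in ('F', 'E'):
        return math.sqrt(sum(a(i, j) ** 2 for i in idx for j in idx))
    raise ValueError("unsupported norm selector")
```

For the 3-by-3 matrix with rows (1, -2, 3), (-2, 4, 5), (3, 5, -6), the upper packed array is [1, -2, 4, 3, 5, -6] and the lower packed array is [1, -2, 3, 4, 5, -6]; both describe the same matrix.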
Input Parameters norm CHARACTER*1. Specifies the value to be returned by the routine: = 'M' or 'm': val = max(abs(Aij)), largest absolute value of the matrix A. = '1' or 'O' or 'o': val = norm1(A), 1-norm of the matrix A (maximum column sum), = 'I' or 'i': val = normI(A), infinity norm of the matrix A (maximum row sum), = 'F', 'f', 'E' or 'e': val = normF(A), Frobenius norm of the matrix A (square root of sum of squares). uplo CHARACTER*1. Specifies whether the upper or lower triangular part of the Hermitian matrix A is supplied. If uplo = 'U': Upper triangular part of A is supplied. If uplo = 'L': Lower triangular part of A is supplied. n INTEGER. The order of the matrix A. n ≥ 0. When n = 0, ?lanhp is set to zero. ap COMPLEX for clanhp. DOUBLE COMPLEX for zlanhp. Array, DIMENSION (n(n+1)/2). The upper or lower triangle of the Hermitian matrix A, packed columnwise in a linear array. The j-th column of A is stored in the array ap as follows: if uplo = 'U', ap(i + (j-1)j/2) = A(i,j) for 1 ≤ i ≤ j; if uplo = 'L', ap(i + (j-1)(2n-j)/2) = A(i,j) for j ≤ i ≤ n. work REAL for clanhp. DOUBLE PRECISION for zlanhp. Workspace array, DIMENSION (max(1,lwork)), where lwork ≥ n when norm = 'I' or '1' or 'O'; otherwise, work is not referenced. Output Parameters val REAL for clanhp. DOUBLE PRECISION for zlanhp. Value returned by the function. ?lanst/?lanht Returns the value of the 1-norm, or the Frobenius norm, or the infinity norm, or the element of largest absolute value of a real symmetric or complex Hermitian tridiagonal matrix. Syntax val = slanst( norm, n, d, e ) val = dlanst( norm, n, d, e ) val = clanht( norm, n, d, e ) val = zlanht( norm, n, d, e ) Include Files • FORTRAN 77: mkl_lapack.fi and mkl_lapack.h Description The functions ?lanst/?lanht return the value of the 1-norm, or the Frobenius norm, or the infinity norm, or the element of largest absolute value of a real symmetric or a complex Hermitian tridiagonal matrix A.
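As a rough illustration of what ?lanst/?lanht compute, the following Python sketch evaluates the four quantities for a tridiagonal matrix given by its diagonal d and off-diagonal e. The function name lanst_ref is hypothetical, and the sketch applies the textbook definitions directly; the library routine may accumulate the Frobenius sum differently to guard against overflow.

```python
import math

def lanst_ref(norm, n, d, e):
    """Illustrative norm of a symmetric/Hermitian tridiagonal matrix.

    d holds the n diagonal elements, e the n-1 off-diagonal elements.
    """
    if n == 0:
        return 0.0
    key = norm.upper()
    if key == 'M':
        return max(abs(v) for v in list(d[:n]) + list(e[: n - 1]))
    if key in ('1', 'O', 'I'):
        # Row sums equal column sums because |A(i,j)| = |A(j,i)|.
        best = 0.0
        for j in range(n):
            s = abs(d[j])
            if j > 0:
                s += abs(e[j - 1])
            if j < n - 1:
                s += abs(e[j])
            best = max(best, s)
        return best
    if key in ('F', 'E'):
        # Each off-diagonal element appears twice in the full matrix.
        diag = sum(abs(v) ** 2 for v in d[:n])
        off = sum(abs(v) ** 2 for v in e[: n - 1])
        return math.sqrt(diag + 2.0 * off)
    raise ValueError("unsupported norm selector")
```

With d = (2, -3, 1) and e = (1, -4), the matrix has rows (2, 1, 0), (1, -3, -4), (0, -4, 1), so the largest absolute element is 4 and the largest column sum is 8.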
Input Parameters norm CHARACTER*1. Specifies the value to be returned by the routine: = 'M' or 'm': val = max(abs(Aij)), largest absolute value of the matrix A. = '1' or 'O' or 'o': val = norm1(A), 1-norm of the matrix A (maximum column sum), = 'I' or 'i': val = normI(A), infinity norm of the matrix A (maximum row sum), = 'F', 'f', 'E' or 'e': val = normF(A), Frobenius norm of the matrix A (square root of sum of squares). n INTEGER. The order of the matrix A. n ≥ 0. When n = 0, ?lanst/?lanht is set to zero. d REAL for slanst/clanht DOUBLE PRECISION for dlanst/zlanht Array, DIMENSION (n). The diagonal elements of A. e REAL for slanst DOUBLE PRECISION for dlanst COMPLEX for clanht DOUBLE COMPLEX for zlanht Array, DIMENSION (n-1). The (n-1) sub-diagonal or super-diagonal elements of A. Output Parameters val REAL for slanst/clanht DOUBLE PRECISION for dlanst/zlanht Value returned by the function. ?lansy Returns the value of the 1-norm, or the Frobenius norm, or the infinity norm, or the element of largest absolute value of a real/complex symmetric matrix. Syntax val = slansy( norm, uplo, n, a, lda, work ) val = dlansy( norm, uplo, n, a, lda, work ) val = clansy( norm, uplo, n, a, lda, work ) val = zlansy( norm, uplo, n, a, lda, work ) Include Files • FORTRAN 77: mkl_lapack.fi and mkl_lapack.h Description The function ?lansy returns the value of the 1-norm, or the Frobenius norm, or the infinity norm, or the element of largest absolute value of a real/complex symmetric matrix A. Input Parameters norm CHARACTER*1. Specifies the value to be returned by the routine: = 'M' or 'm': val = max(abs(Aij)), largest absolute value of the matrix A. = '1' or 'O' or 'o': val = norm1(A), 1-norm of the matrix A (maximum column sum), = 'I' or 'i': val = normI(A), infinity norm of the matrix A (maximum row sum), = 'F', 'f', 'E' or 'e': val = normF(A), Frobenius norm of the matrix A (square root of sum of squares). uplo CHARACTER*1.
Specifies whether the upper or lower triangular part of the symmetric matrix A is to be referenced. = 'U': Upper triangular part of A is referenced. = 'L': Lower triangular part of A is referenced. n INTEGER. The order of the matrix A. n ≥ 0. When n = 0, ?lansy is set to zero. a REAL for slansy DOUBLE PRECISION for dlansy COMPLEX for clansy DOUBLE COMPLEX for zlansy Array, DIMENSION (lda,n). The symmetric matrix A. If uplo = 'U', the leading n-by-n upper triangular part of a contains the upper triangular part of the matrix A, and the strictly lower triangular part of a is not referenced. If uplo = 'L', the leading n-by-n lower triangular part of a contains the lower triangular part of the matrix A, and the strictly upper triangular part of a is not referenced. lda INTEGER. The leading dimension of the array a. lda ≥ max(n,1). work REAL for slansy and clansy. DOUBLE PRECISION for dlansy and zlansy. Workspace array, DIMENSION (max(1,lwork)), where lwork ≥ n when norm = 'I' or '1' or 'O'; otherwise, work is not referenced. Output Parameters val REAL for slansy/clansy DOUBLE PRECISION for dlansy/zlansy Value returned by the function. ?lanhe Returns the value of the 1-norm, or the Frobenius norm, or the infinity norm, or the element of largest absolute value of a complex Hermitian matrix. Syntax val = clanhe( norm, uplo, n, a, lda, work ) val = zlanhe( norm, uplo, n, a, lda, work ) Include Files • FORTRAN 77: mkl_lapack.fi and mkl_lapack.h Description The function ?lanhe returns the value of the 1-norm, or the Frobenius norm, or the infinity norm, or the element of largest absolute value of a complex Hermitian matrix A. Input Parameters norm CHARACTER*1. Specifies the value to be returned by the routine: = 'M' or 'm': val = max(abs(Aij)), largest absolute value of the matrix A.
= '1' or 'O' or 'o': val = norm1(A), 1-norm of the matrix A (maximum column sum), = 'I' or 'i': val = normI(A), infinity norm of the matrix A (maximum row sum), = 'F', 'f', 'E' or 'e': val = normF(A), Frobenius norm of the matrix A (square root of sum of squares). uplo CHARACTER*1. Specifies whether the upper or lower triangular part of the Hermitian matrix A is to be referenced. = 'U': Upper triangular part of A is referenced. = 'L': Lower triangular part of A is referenced. n INTEGER. The order of the matrix A. n ≥ 0. When n = 0, ?lanhe is set to zero. a COMPLEX for clanhe. DOUBLE COMPLEX for zlanhe. Array, DIMENSION (lda,n). The Hermitian matrix A. If uplo = 'U', the leading n-by-n upper triangular part of a contains the upper triangular part of the matrix A, and the strictly lower triangular part of a is not referenced. If uplo = 'L', the leading n-by-n lower triangular part of a contains the lower triangular part of the matrix A, and the strictly upper triangular part of a is not referenced. lda INTEGER. The leading dimension of the array a. lda ≥ max(n,1). work REAL for clanhe. DOUBLE PRECISION for zlanhe. Workspace array, DIMENSION (max(1,lwork)), where lwork ≥ n when norm = 'I' or '1' or 'O'; otherwise, work is not referenced. Output Parameters val REAL for clanhe. DOUBLE PRECISION for zlanhe. Value returned by the function. ?lantb Returns the value of the 1-norm, or the Frobenius norm, or the infinity norm, or the element of largest absolute value of a triangular band matrix.
Syntax val = slantb( norm, uplo, diag, n, k, ab, ldab, work ) val = dlantb( norm, uplo, diag, n, k, ab, ldab, work ) val = clantb( norm, uplo, diag, n, k, ab, ldab, work ) val = zlantb( norm, uplo, diag, n, k, ab, ldab, work ) Include Files • FORTRAN 77: mkl_lapack.fi and mkl_lapack.h Description The function ?lantb returns the value of the 1-norm, or the Frobenius norm, or the infinity norm, or the element of largest absolute value of an n-by-n triangular band matrix A, with (k + 1) diagonals. Input Parameters norm CHARACTER*1. Specifies the value to be returned by the routine: = 'M' or 'm': val = max(abs(Aij)), largest absolute value of the matrix A. = '1' or 'O' or 'o': val = norm1(A), 1-norm of the matrix A (maximum column sum), = 'I' or 'i': val = normI(A), infinity norm of the matrix A (maximum row sum), = 'F', 'f', 'E' or 'e': val = normF(A), Frobenius norm of the matrix A (square root of sum of squares). uplo CHARACTER*1. Specifies whether the matrix A is upper or lower triangular. = 'U': Upper triangular = 'L': Lower triangular. diag CHARACTER*1. Specifies whether or not the matrix A is unit triangular. = 'N': Non-unit triangular = 'U': Unit triangular. n INTEGER. The order of the matrix A. n ≥ 0. When n = 0, ?lantb is set to zero. k INTEGER. The number of super-diagonals of the matrix A if uplo = 'U', or the number of sub-diagonals of the matrix A if uplo = 'L'. k ≥ 0. ab REAL for slantb DOUBLE PRECISION for dlantb COMPLEX for clantb DOUBLE COMPLEX for zlantb Array, DIMENSION (ldab,n). The upper or lower triangular band matrix A, stored in the first k+1 rows of ab. The j-th column of A is stored in the j-th column of the array ab as follows: if uplo = 'U', ab(k+1+i-j,j) = a(i,j) for max(1,j-k) ≤ i ≤ j; if uplo = 'L', ab(1+i-j,j) = a(i,j) for j ≤ i ≤ min(n,j+k).
Note that when diag = 'U', the elements of the array ab corresponding to the diagonal elements of the matrix A are not referenced, but are assumed to be one. ldab INTEGER. The leading dimension of the array ab. ldab ≥ k+1. work REAL for slantb and clantb. DOUBLE PRECISION for dlantb and zlantb. Workspace array, DIMENSION (max(1,lwork)), where lwork ≥ n when norm = 'I'; otherwise, work is not referenced. Output Parameters val REAL for slantb/clantb. DOUBLE PRECISION for dlantb/zlantb. Value returned by the function. ?lantp Returns the value of the 1-norm, or the Frobenius norm, or the infinity norm, or the element of largest absolute value of a triangular matrix supplied in packed form. Syntax val = slantp( norm, uplo, diag, n, ap, work ) val = dlantp( norm, uplo, diag, n, ap, work ) val = clantp( norm, uplo, diag, n, ap, work ) val = zlantp( norm, uplo, diag, n, ap, work ) Include Files • FORTRAN 77: mkl_lapack.fi and mkl_lapack.h Description The function ?lantp returns the value of the 1-norm, or the Frobenius norm, or the infinity norm, or the element of largest absolute value of a triangular matrix A, supplied in packed form. Input Parameters norm CHARACTER*1. Specifies the value to be returned by the routine: = 'M' or 'm': val = max(abs(Aij)), largest absolute value of the matrix A. = '1' or 'O' or 'o': val = norm1(A), 1-norm of the matrix A (maximum column sum), = 'I' or 'i': val = normI(A), infinity norm of the matrix A (maximum row sum), = 'F', 'f', 'E' or 'e': val = normF(A), Frobenius norm of the matrix A (square root of sum of squares). uplo CHARACTER*1. Specifies whether the matrix A is upper or lower triangular. = 'U': Upper triangular = 'L': Lower triangular. diag CHARACTER*1. Specifies whether or not the matrix A is unit triangular. = 'N': Non-unit triangular = 'U': Unit triangular. n INTEGER. The order of the matrix A. n ≥ 0. When n = 0, ?lantp is set to zero.
ap REAL for slantp DOUBLE PRECISION for dlantp COMPLEX for clantp DOUBLE COMPLEX for zlantp Array, DIMENSION (n(n+1)/2). The upper or lower triangular matrix A, packed columnwise in a linear array. The j-th column of A is stored in the array ap as follows: if uplo = 'U', ap(i + (j-1)j/2) = a(i,j) for 1 ≤ i ≤ j; if uplo = 'L', ap(i + (j-1)(2n-j)/2) = a(i,j) for j ≤ i ≤ n. Note that when diag = 'U', the elements of the array ap corresponding to the diagonal elements of the matrix A are not referenced, but are assumed to be one. work REAL for slantp and clantp. DOUBLE PRECISION for dlantp and zlantp. Workspace array, DIMENSION (max(1,lwork)), where lwork ≥ n when norm = 'I'; otherwise, work is not referenced. Output Parameters val REAL for slantp/clantp. DOUBLE PRECISION for dlantp/zlantp. Value returned by the function. ?lantr Returns the value of the 1-norm, or the Frobenius norm, or the infinity norm, or the element of largest absolute value of a trapezoidal or triangular matrix. Syntax val = slantr( norm, uplo, diag, m, n, a, lda, work ) val = dlantr( norm, uplo, diag, m, n, a, lda, work ) val = clantr( norm, uplo, diag, m, n, a, lda, work ) val = zlantr( norm, uplo, diag, m, n, a, lda, work ) Include Files • FORTRAN 77: mkl_lapack.fi and mkl_lapack.h Description The function ?lantr returns the value of the 1-norm, or the Frobenius norm, or the infinity norm, or the element of largest absolute value of a trapezoidal or triangular matrix A. Input Parameters norm CHARACTER*1. Specifies the value to be returned by the routine: = 'M' or 'm': val = max(abs(Aij)), largest absolute value of the matrix A. = '1' or 'O' or 'o': val = norm1(A), 1-norm of the matrix A (maximum column sum), = 'I' or 'i': val = normI(A), infinity norm of the matrix A (maximum row sum), = 'F', 'f', 'E' or 'e': val = normF(A), Frobenius norm of the matrix A (square root of sum of squares). uplo CHARACTER*1.
Specifies whether the matrix A is upper or lower trapezoidal. = 'U': Upper trapezoidal = 'L': Lower trapezoidal. Note that A is triangular instead of trapezoidal if m = n. diag CHARACTER*1. Specifies whether or not the matrix A has unit diagonal. = 'N': Non-unit diagonal = 'U': Unit diagonal. m INTEGER. The number of rows of the matrix A. m ≥ 0, and if uplo = 'U', m ≤ n. When m = 0, ?lantr is set to zero. n INTEGER. The number of columns of the matrix A. n ≥ 0, and if uplo = 'L', n ≤ m. When n = 0, ?lantr is set to zero. a REAL for slantr DOUBLE PRECISION for dlantr COMPLEX for clantr DOUBLE COMPLEX for zlantr Array, DIMENSION (lda,n). The trapezoidal matrix A (A is triangular if m = n). If uplo = 'U', the leading m-by-n upper trapezoidal part of the array a contains the upper trapezoidal matrix, and the strictly lower triangular part of A is not referenced. If uplo = 'L', the leading m-by-n lower trapezoidal part of the array a contains the lower trapezoidal matrix, and the strictly upper triangular part of A is not referenced. Note that when diag = 'U', the diagonal elements of A are not referenced and are assumed to be one. lda INTEGER. The leading dimension of the array a. lda ≥ max(m,1). work REAL for slantr/clantr. DOUBLE PRECISION for dlantr/zlantr. Workspace array, DIMENSION (max(1,lwork)), where lwork ≥ m when norm = 'I'; otherwise, work is not referenced. Output Parameters val REAL for slantr/clantr. DOUBLE PRECISION for dlantr/zlantr. Value returned by the function. ?lanv2 Computes the Schur factorization of a real 2-by-2 nonsymmetric matrix in standard form. Syntax call slanv2( a, b, c, d, rt1r, rt1i, rt2r, rt2i, cs, sn ) call dlanv2( a, b, c, d, rt1r, rt1i, rt2r, rt2i, cs, sn ) Include Files • FORTRAN 77: mkl_lapack.fi and mkl_lapack.h Description The routine computes the Schur factorization of a real 2-by-2 nonsymmetric matrix in standard form (matrix rows separated by semicolons): [ a b ; c d ] = [ cs -sn ; sn cs ] * [ aa bb ; cc dd ] * [ cs sn ; -sn cs ], where either 1.
cc = 0 so that aa and dd are real eigenvalues of the matrix, or 2. aa = dd and bb*cc < 0, so that aa ± sqrt(bb*cc) are complex conjugate eigenvalues. The routine was adjusted to reduce the risk of cancellation errors, when computing real eigenvalues, and to ensure, if possible, that abs(rt1r) ≥ abs(rt2r). Input Parameters a, b, c, d REAL for slanv2 DOUBLE PRECISION for dlanv2. On entry, elements of the input matrix. Output Parameters a, b, c, d On exit, overwritten by the elements of the standardized Schur form. rt1r, rt1i, rt2r, rt2i REAL for slanv2 DOUBLE PRECISION for dlanv2. The real and imaginary parts of the eigenvalues. If the eigenvalues are a complex conjugate pair, rt1i > 0. cs, sn REAL for slanv2 DOUBLE PRECISION for dlanv2. Parameters of the rotation matrix. ?lapll Measures the linear dependence of two vectors. Syntax call slapll( n, x, incx, y, incy, ssmin ) call dlapll( n, x, incx, y, incy, ssmin ) call clapll( n, x, incx, y, incy, ssmin ) call zlapll( n, x, incx, y, incy, ssmin ) Include Files • FORTRAN 77: mkl_lapack.fi and mkl_lapack.h Description Given two column vectors x and y of length n, let A = (x y) be the n-by-2 matrix. The routine ?lapll first computes the QR factorization of A as A = Q*R and then computes the SVD of the 2-by-2 upper triangular matrix R. The smaller singular value of R is returned in ssmin, which is used as the measurement of the linear dependency of the vectors x and y. Input Parameters n INTEGER. The length of the vectors x and y. x REAL for slapll DOUBLE PRECISION for dlapll COMPLEX for clapll DOUBLE COMPLEX for zlapll Array, DIMENSION (1+(n-1)incx). On entry, x contains the n-vector x. y REAL for slapll DOUBLE PRECISION for dlapll COMPLEX for clapll DOUBLE COMPLEX for zlapll Array, DIMENSION (1+(n-1)incy). On entry, y contains the n-vector y. incx INTEGER. The increment between successive elements of x; incx > 0. incy INTEGER.
The increment between successive elements of y; incy > 0. Output Parameters x On exit, x is overwritten. y On exit, y is overwritten. ssmin REAL for slapll/clapll DOUBLE PRECISION for dlapll/zlapll The smallest singular value of the n-by-2 matrix A = (x y). ?lapmr Rearranges rows of a matrix as specified by a permutation vector. Syntax Fortran 77: call slapmr( forwrd, m, n, x, ldx, k ) call dlapmr( forwrd, m, n, x, ldx, k ) call clapmr( forwrd, m, n, x, ldx, k ) call zlapmr( forwrd, m, n, x, ldx, k ) Fortran 95: call lapmr( x,k[,forwrd] ) Include Files • FORTRAN 77: mkl_lapack.fi and mkl_lapack.h Description The ?lapmr routine rearranges the rows of the m-by-n matrix X as specified by the permutation k(1),k(2),...,k(m) of the integers 1,...,m. If forwrd = .TRUE., forward permutation: X(k(i),*) is moved to X(i,*) for i = 1,2,...,m. If forwrd = .FALSE., backward permutation: X(i,*) is moved to X(k(i),*) for i = 1,2,...,m. Input Parameters forwrd LOGICAL. If forwrd = .TRUE., forward permutation. If forwrd = .FALSE., backward permutation. m INTEGER. The number of rows of the matrix X. m ≥ 0. n INTEGER. The number of columns of the matrix X. n ≥ 0. x REAL for slapmr DOUBLE PRECISION for dlapmr COMPLEX for clapmr DOUBLE COMPLEX for zlapmr Array, DIMENSION (ldx,n). On entry, the m-by-n matrix X. ldx INTEGER. The leading dimension of the array x, ldx ≥ max(1,m). k INTEGER. Array, DIMENSION (m). On entry, k contains the permutation vector and is used as internal workspace. Output Parameters x On exit, x contains the permuted matrix X. k On exit, k is reset to its original value. Fortran 95 Interface Notes Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
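The difference between the forward and backward permutations applied by ?lapmr can be pictured with a short Python sketch. The function permute_rows is hypothetical and only mirrors the description: it returns a permuted copy, whereas the library routine works in place on x and restores k on exit. The permutation vector k is kept 1-based to match the manual.

```python
def permute_rows(x, k, forwrd=True):
    """Return a copy of the rows of x permuted as described for ?lapmr.

    x is a list of m rows; k is a 1-based permutation of 1..m.
    """
    m = len(x)
    if sorted(k) != list(range(1, m + 1)):
        raise ValueError("k must be a permutation of 1..m")
    out = [None] * m
    for i in range(m):
        if forwrd:
            # Forward: row k(i) of the input becomes row i of the result.
            out[i] = list(x[k[i] - 1])
        else:
            # Backward: row i of the input becomes row k(i) of the result.
            out[k[i] - 1] = list(x[i])
    return out
```

Applying the backward permutation to the result of the forward permutation, with the same k, recovers the original row order. The column variant follows the same pattern with the roles of rows and columns exchanged.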
Specific details for the routine ?lapmr interface are as follows: x Holds the matrix X of size (m, n). k Holds the vector of length m. forwrd Specifies the permutation. Must be '.TRUE.' or '.FALSE.'. See Also ?lapmt ?lapmt Performs a forward or backward permutation of the columns of a matrix. Syntax call slapmt( forwrd, m, n, x, ldx, k ) call dlapmt( forwrd, m, n, x, ldx, k ) call clapmt( forwrd, m, n, x, ldx, k ) call zlapmt( forwrd, m, n, x, ldx, k ) Include Files • FORTRAN 77: mkl_lapack.fi and mkl_lapack.h Description The routine ?lapmt rearranges the columns of the m-by-n matrix X as specified by the permutation k(1),k(2),...,k(n) of the integers 1,...,n. If forwrd = .TRUE., forward permutation: X(*,k(j)) is moved to X(*,j) for j = 1,2,...,n. If forwrd = .FALSE., backward permutation: X(*,j) is moved to X(*,k(j)) for j = 1,2,...,n. Input Parameters forwrd LOGICAL. If forwrd = .TRUE., forward permutation. If forwrd = .FALSE., backward permutation. m INTEGER. The number of rows of the matrix X. m ≥ 0. n INTEGER. The number of columns of the matrix X. n ≥ 0. x REAL for slapmt DOUBLE PRECISION for dlapmt COMPLEX for clapmt DOUBLE COMPLEX for zlapmt Array, DIMENSION (ldx,n). On entry, the m-by-n matrix X. ldx INTEGER. The leading dimension of the array x, ldx ≥ max(1,m). k INTEGER. Array, DIMENSION (n). On entry, k contains the permutation vector and is used as internal workspace. Output Parameters x On exit, x contains the permuted matrix X. k On exit, k is reset to its original value. See Also ?lapmr ?lapy2 Returns sqrt(x^2 + y^2). Syntax val = slapy2( x, y ) val = dlapy2( x, y ) Include Files • FORTRAN 77: mkl_lapack.fi and mkl_lapack.h Description The function ?lapy2 returns sqrt(x^2 + y^2), avoiding unnecessary overflow or harmful underflow. Input Parameters x, y REAL for slapy2 DOUBLE PRECISION for dlapy2 Specify the input values x and y.
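One common way to evaluate sqrt(x^2 + y^2) without unnecessary overflow is to factor out the larger magnitude before squaring. The Python sketch below shows that scaling idea; lapy2_ref is a hypothetical name, and the sketch is an assumption about the general technique rather than a statement of how MKL implements ?lapy2.

```python
import math

def lapy2_ref(x, y):
    """sqrt(x**2 + y**2) computed via scaling by the larger magnitude."""
    w = max(abs(x), abs(y))
    z = min(abs(x), abs(y))
    if z == 0.0:
        # Covers x = y = 0 and avoids a 0/0 division below.
        return w
    # z/w is at most 1, so squaring it cannot overflow.
    return w * math.sqrt(1.0 + (z / w) ** 2)
```

With x = y = 1e200 the squares themselves exceed the double-precision range, while the scaled form still returns a finite value close to 1.414e200. In Python, math.hypot provides a library alternative for the same quantity.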
Output Parameters val REAL for slapy2 DOUBLE PRECISION for dlapy2. Value returned by the function. ?lapy3 Returns sqrt(x^2 + y^2 + z^2). Syntax val = slapy3( x, y, z ) val = dlapy3( x, y, z ) Include Files • FORTRAN 77: mkl_lapack.fi and mkl_lapack.h Description The function ?lapy3 returns sqrt(x^2 + y^2 + z^2), avoiding unnecessary overflow or harmful underflow. Input Parameters x, y, z REAL for slapy3 DOUBLE PRECISION for dlapy3 Specify the input values x, y and z. Output Parameters val REAL for slapy3 DOUBLE PRECISION for dlapy3. Value returned by the function. ?laqgb Scales a general band matrix, using row and column scaling factors computed by ?gbequ. Syntax call slaqgb( m, n, kl, ku, ab, ldab, r, c, rowcnd, colcnd, amax, equed ) call dlaqgb( m, n, kl, ku, ab, ldab, r, c, rowcnd, colcnd, amax, equed ) call claqgb( m, n, kl, ku, ab, ldab, r, c, rowcnd, colcnd, amax, equed ) call zlaqgb( m, n, kl, ku, ab, ldab, r, c, rowcnd, colcnd, amax, equed ) Include Files • FORTRAN 77: mkl_lapack.fi and mkl_lapack.h Description The routine equilibrates a general m-by-n band matrix A with kl subdiagonals and ku superdiagonals using the row and column scaling factors in the vectors r and c. Input Parameters m INTEGER. The number of rows of the matrix A. m ≥ 0. n INTEGER. The number of columns of the matrix A. n ≥ 0. kl INTEGER. The number of subdiagonals within the band of A. kl ≥ 0. ku INTEGER. The number of superdiagonals within the band of A. ku ≥ 0. ab REAL for slaqgb DOUBLE PRECISION for dlaqgb COMPLEX for claqgb DOUBLE COMPLEX for zlaqgb Array, DIMENSION (ldab,n). On entry, the matrix A in band storage, in rows 1 to kl+ku+1. The j-th column of A is stored in the j-th column of the array ab as follows: ab(ku+1+i-j,j) = A(i,j) for max(1,j-ku) ≤ i ≤ min(m,j+kl). ldab INTEGER. The leading dimension of the array ab. ldab ≥ kl+ku+1. amax REAL for slaqgb/claqgb DOUBLE PRECISION for dlaqgb/zlaqgb Absolute value of largest matrix entry.
r, c REAL for slaqgb/claqgb DOUBLE PRECISION for dlaqgb/zlaqgb Arrays r (m), c (n). Contain the row and column scale factors for A, respectively. rowcnd REAL for slaqgb/claqgb DOUBLE PRECISION for dlaqgb/zlaqgb Ratio of the smallest r(i) to the largest r(i). colcnd REAL for slaqgb/claqgb DOUBLE PRECISION for dlaqgb/zlaqgb Ratio of the smallest c(i) to the largest c(i). Output Parameters ab On exit, the equilibrated matrix, in the same storage format as A. See equed for the form of the equilibrated matrix. equed CHARACTER*1. Specifies the form of equilibration that was done. If equed = 'N': No equilibration If equed = 'R': Row equilibration, that is, A has been premultiplied by diag(r). If equed = 'C': Column equilibration, that is, A has been postmultiplied by diag(c). If equed = 'B': Both row and column equilibration, that is, A has been replaced by diag(r)*A*diag(c). Application Notes The routine uses internal parameters thresh, large, and small, which have the following meaning. thresh is a threshold value used to decide if row or column scaling should be done based on the ratio of the row or column scaling factors. If rowcnd < thresh, row scaling is done, and if colcnd < thresh, column scaling is done. large and small are threshold values used to decide if row scaling should be done based on the absolute size of the largest matrix element. If amax > large or amax < small, row scaling is done. ?laqge Scales a general rectangular matrix, using row and column scaling factors computed by ?geequ.
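The equilibration decision described in the Application Notes can be sketched for a dense matrix, which is the case ?laqge handles. The Python below is illustrative only: laqge_ref is a hypothetical name, it returns a scaled copy instead of overwriting a, and the default values of thresh and small are assumptions chosen for the example, since the manual describes these parameters as internal and does not state their values.

```python
import sys

def laqge_ref(a, r, c, rowcnd, colcnd, amax, thresh=0.1,
              small=sys.float_info.min / sys.float_info.epsilon):
    """Return (scaled copy of a, equed) following the documented rules.

    a is a list of m rows of length n; r and c are the scale factors.
    thresh and small are assumed example values, not MKL's documented ones.
    """
    large = 1.0 / small
    m = len(a)
    n = len(a[0]) if m else 0
    if m == 0 or n == 0:
        return [list(row) for row in a], 'N'
    # Row scaling: poor row condition ratio or extreme matrix magnitude.
    do_row = rowcnd < thresh or amax > large or amax < small
    # Column scaling: poor column condition ratio.
    do_col = colcnd < thresh
    equed = {(False, False): 'N', (True, False): 'R',
             (False, True): 'C', (True, True): 'B'}[(do_row, do_col)]
    out = [[a[i][j] * (r[i] if do_row else 1.0) * (c[j] if do_col else 1.0)
            for j in range(n)] for i in range(m)]
    return out, equed
```

For example, with rowcnd = 0.05 and colcnd = 0.5 only the rows are scaled and equed is 'R'; when both ratios are below thresh the result is diag(r)*A*diag(c) and equed is 'B'.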
Syntax call slaqge( m, n, a, lda, r, c, rowcnd, colcnd, amax, equed ) call dlaqge( m, n, a, lda, r, c, rowcnd, colcnd, amax, equed ) call claqge( m, n, a, lda, r, c, rowcnd, colcnd, amax, equed ) call zlaqge( m, n, a, lda, r, c, rowcnd, colcnd, amax, equed ) Include Files • FORTRAN 77: mkl_lapack.fi and mkl_lapack.h Description The routine equilibrates a general m-by-n matrix A using the row and column scaling factors in the vectors r and c. Input Parameters m INTEGER. The number of rows of the matrix A. m ≥ 0. n INTEGER. The number of columns of the matrix A. n ≥ 0. a REAL for slaqge DOUBLE PRECISION for dlaqge COMPLEX for claqge DOUBLE COMPLEX for zlaqge Array, DIMENSION (lda,n). On entry, the m-by-n matrix A. lda INTEGER. The leading dimension of the array a. lda ≥ max(m,1). r REAL for slaqge/claqge DOUBLE PRECISION for dlaqge/zlaqge Array, DIMENSION (m). The row scale factors for A. c REAL for slaqge/claqge DOUBLE PRECISION for dlaqge/zlaqge Array, DIMENSION (n). The column scale factors for A. rowcnd REAL for slaqge/claqge DOUBLE PRECISION for dlaqge/zlaqge Ratio of the smallest r(i) to the largest r(i). colcnd REAL for slaqge/claqge DOUBLE PRECISION for dlaqge/zlaqge Ratio of the smallest c(i) to the largest c(i). amax REAL for slaqge/claqge DOUBLE PRECISION for dlaqge/zlaqge Absolute value of largest matrix entry. Output Parameters a On exit, the equilibrated matrix. See equed for the form of the equilibrated matrix. equed CHARACTER*1. Specifies the form of equilibration that was done. If equed = 'N': No equilibration If equed = 'R': Row equilibration, that is, A has been premultiplied by diag(r). If equed = 'C': Column equilibration, that is, A has been postmultiplied by diag(c). If equed = 'B': Both row and column equilibration, that is, A has been replaced by diag(r)*A*diag(c). Application Notes The routine uses internal parameters thresh, large, and small, which have the following meaning.
thresh is a threshold value used to decide whether row or column scaling should be done, based on the ratio of the row or column scaling factors. If rowcnd < thresh, row scaling is done, and if colcnd < thresh, column scaling is done.
large and small are threshold values used to decide whether row scaling should be done, based on the absolute size of the largest matrix element. If amax > large or amax < small, row scaling is done.

?laqhb
Scales a Hermitian band matrix, using scaling factors computed by ?pbequ.

Syntax
call claqhb( uplo, n, kd, ab, ldab, s, scond, amax, equed )
call zlaqhb( uplo, n, kd, ab, ldab, s, scond, amax, equed )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine equilibrates a Hermitian band matrix A using the scaling factors in the vector s.

Input Parameters
uplo CHARACTER*1. Specifies whether the upper or lower triangular part of the band matrix A is stored.
If uplo = 'U': upper triangular.
If uplo = 'L': lower triangular.
n INTEGER. The order of the matrix A. n ≥ 0.
kd INTEGER. The number of super-diagonals of the matrix A if uplo = 'U', or the number of sub-diagonals if uplo = 'L'. kd ≥ 0.
ab COMPLEX for claqhb, DOUBLE COMPLEX for zlaqhb. Array, DIMENSION (ldab,n). On entry, the upper or lower triangle of the band matrix A, stored in the first kd+1 rows of the array. The j-th column of A is stored in the j-th column of the array ab as follows:
if uplo = 'U', ab(kd+1+i-j,j) = A(i,j) for max(1,j-kd) ≤ i ≤ j;
if uplo = 'L', ab(1+i-j,j) = A(i,j) for j ≤ i ≤ min(n,j+kd).
ldab INTEGER. The leading dimension of the array ab. ldab ≥ kd+1.
scond REAL for claqhb, DOUBLE PRECISION for zlaqhb. Ratio of the smallest s(i) to the largest s(i).
amax REAL for claqhb, DOUBLE PRECISION for zlaqhb. Absolute value of the largest matrix entry.
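The band storage rule given for ab can be illustrated with a short sketch. This is plain Python written only to show the index mapping; the helper name pack_hermitian_band is hypothetical and is not part of Intel MKL. The loops use the manual's 1-based indices i and j and subtract 1 only when touching the Python lists.

```python
def pack_hermitian_band(a, kd, uplo="U"):
    """Pack one triangle of an n-by-n band matrix into a (kd+1)-by-n array.

    a    : full matrix as a list of rows (hypothetical input layout)
    kd   : number of super-diagonals (uplo='U') or sub-diagonals (uplo='L')
    uplo : 'U' to store the upper triangle, 'L' to store the lower triangle
    Entries of the result that the rule never assigns are left as 0.
    """
    n = len(a)
    ab = [[0] * n for _ in range(kd + 1)]
    for j in range(1, n + 1):                       # 1-based column index
        if uplo == "U":
            # ab(kd+1+i-j, j) = A(i,j) for max(1,j-kd) <= i <= j
            for i in range(max(1, j - kd), j + 1):
                ab[kd + 1 + i - j - 1][j - 1] = a[i - 1][j - 1]
        else:
            # ab(1+i-j, j) = A(i,j) for j <= i <= min(n,j+kd)
            for i in range(j, min(n, j + kd) + 1):
                ab[1 + i - j - 1][j - 1] = a[i - 1][j - 1]
    return ab


# A 4-by-4 Hermitian matrix with one off-diagonal on each side (kd = 1).
example = [
    [1,      2 + 1j, 0,   0],
    [2 - 1j, 3,      4j,  0],
    [0,      -4j,    5,   6],
    [0,      0,      6,   7],
]
upper = pack_hermitian_band(example, 1, "U")
lower = pack_hermitian_band(example, 1, "L")
```

With uplo = 'U' the main diagonal ends up in the last row of ab; with uplo = 'L' it ends up in the first row.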

Output Parameters
ab On exit, if equed = 'Y', the equilibrated matrix diag(s)*A*diag(s), in the same storage format as A.
s REAL for claqhb, DOUBLE PRECISION for zlaqhb. Array, DIMENSION (n). The scale factors for A.
equed CHARACTER*1. Specifies whether or not equilibration was done.
If equed = 'N': no equilibration.
If equed = 'Y': equilibration was done, that is, A has been replaced by diag(s)*A*diag(s).

Application Notes
The routine uses internal parameters thresh, large, and small, which have the following meaning.
thresh is a threshold value used to decide whether scaling should be done, based on the ratio of the scaling factors. If scond < thresh, scaling is done.
large and small are threshold values used to decide whether scaling should be done, based on the absolute size of the largest matrix element. If amax > large or amax < small, scaling is done.

?laqp2
Computes a QR factorization with column pivoting of the matrix block.

Syntax
call slaqp2( m, n, offset, a, lda, jpvt, tau, vn1, vn2, work )
call dlaqp2( m, n, offset, a, lda, jpvt, tau, vn1, vn2, work )
call claqp2( m, n, offset, a, lda, jpvt, tau, vn1, vn2, work )
call zlaqp2( m, n, offset, a, lda, jpvt, tau, vn1, vn2, work )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine computes a QR factorization with column pivoting of the block A(offset+1:m,1:n). The block A(1:offset,1:n) is accordingly pivoted, but not factorized.

Input Parameters
m INTEGER. The number of rows of the matrix A. m ≥ 0.
n INTEGER. The number of columns of the matrix A. n ≥ 0.
offset INTEGER. The number of rows of the matrix A that must be pivoted but not factorized. offset ≥ 0.
a REAL for slaqp2, DOUBLE PRECISION for dlaqp2, COMPLEX for claqp2, DOUBLE COMPLEX for zlaqp2. Array, DIMENSION (lda,n). On entry, the m-by-n matrix A.
lda INTEGER. The leading dimension of the array a. lda ≥ max(1,m).
jpvt INTEGER.
Array, DIMENSION (n). On entry, if jpvt(i) ≠ 0, the i-th column of A is permuted to the front of A*P (a leading column); if jpvt(i) = 0, the i-th column of A is a free column.
vn1, vn2 REAL for slaqp2/claqp2, DOUBLE PRECISION for dlaqp2/zlaqp2. Arrays, DIMENSION (n) each. Contain the vectors with the partial and exact column norms, respectively.
work REAL for slaqp2, DOUBLE PRECISION for dlaqp2, COMPLEX for claqp2, DOUBLE COMPLEX for zlaqp2. Workspace array, DIMENSION (n).

Output Parameters
a On exit, the upper triangle of block A(offset+1:m,1:n) is the triangular factor obtained; the elements in block A(offset+1:m,1:n) below the diagonal, together with the array tau, represent the orthogonal matrix Q as a product of elementary reflectors. Block A(1:offset,1:n) has been accordingly pivoted, but not factorized.
jpvt On exit, if jpvt(i) = k, then the i-th column of A*P was the k-th column of A.
tau REAL for slaqp2, DOUBLE PRECISION for dlaqp2, COMPLEX for claqp2, DOUBLE COMPLEX for zlaqp2. Array, DIMENSION (min(m,n)). The scalar factors of the elementary reflectors.
vn1, vn2 Contain the vectors with the partial and exact column norms, respectively.

?laqps
Computes a step of QR factorization with column pivoting of an m-by-n matrix A by using BLAS level 3.

Syntax
call slaqps( m, n, offset, nb, kb, a, lda, jpvt, tau, vn1, vn2, auxv, f, ldf )
call dlaqps( m, n, offset, nb, kb, a, lda, jpvt, tau, vn1, vn2, auxv, f, ldf )
call claqps( m, n, offset, nb, kb, a, lda, jpvt, tau, vn1, vn2, auxv, f, ldf )
call zlaqps( m, n, offset, nb, kb, a, lda, jpvt, tau, vn1, vn2, auxv, f, ldf )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine computes a step of QR factorization with column pivoting of an m-by-n matrix A by using BLAS level 3. The routine tries to factorize nb columns from A starting from the row offset+1, and updates all of the matrix with the BLAS level 3 routine ?gemm.
In some cases, due to catastrophic cancellations, ?laqps cannot factorize nb columns. Hence, the actual number of factorized columns is returned in kb. Block A(1:offset,1:n) is accordingly pivoted, but not factorized.

Input Parameters
m INTEGER. The number of rows of the matrix A. m ≥ 0.
n INTEGER. The number of columns of the matrix A. n ≥ 0.
offset INTEGER. The number of rows of A that have been factorized in previous steps.
nb INTEGER. The number of columns to factorize.
a REAL for slaqps, DOUBLE PRECISION for dlaqps, COMPLEX for claqps, DOUBLE COMPLEX for zlaqps. Array, DIMENSION (lda,n). On entry, the m-by-n matrix A.
lda INTEGER. The leading dimension of the array a. lda ≥ max(1,m).
jpvt INTEGER. Array, DIMENSION (n). If jpvt(i) = k, then column k of the full matrix A has been permuted into position i in A*P.
vn1, vn2 REAL for slaqps/claqps, DOUBLE PRECISION for dlaqps/zlaqps. Arrays, DIMENSION (n) each. Contain the vectors with the partial and exact column norms, respectively.
auxv REAL for slaqps, DOUBLE PRECISION for dlaqps, COMPLEX for claqps, DOUBLE COMPLEX for zlaqps. Array, DIMENSION (nb). Auxiliary vector.
f REAL for slaqps, DOUBLE PRECISION for dlaqps, COMPLEX for claqps, DOUBLE COMPLEX for zlaqps. Array, DIMENSION (ldf,nb). For real flavors, matrix F^T = L*Y^T*A. For complex flavors, matrix F^H = L*Y^H*A.
ldf INTEGER. The leading dimension of the array f. ldf ≥ max(1,n).

Output Parameters
kb INTEGER. The number of columns actually factorized.
a On exit, block A(offset+1:m,1:kb) is the triangular factor obtained and block A(1:offset,1:n) has been accordingly pivoted, but not factorized. The rest of the matrix, block A(offset+1:m,kb+1:n), has been updated.
jpvt INTEGER array, DIMENSION (n). If jpvt(i) = k, then column k of the full matrix A has been permuted into position i in A*P.
tau REAL for slaqps, DOUBLE PRECISION for dlaqps, COMPLEX for claqps, DOUBLE COMPLEX for zlaqps. Array, DIMENSION (kb).
The scalar factors of the elementary reflectors.
vn1, vn2 The vectors with the partial and exact column norms, respectively.
auxv Auxiliary vector.
f Matrix F^T = L*Y^T*A for real flavors, or F^H = L*Y^H*A for complex flavors.

?laqr0
Computes the eigenvalues of a Hessenberg matrix, and optionally the matrices from the Schur decomposition.

Syntax
call slaqr0( wantt, wantz, n, ilo, ihi, h, ldh, wr, wi, iloz, ihiz, z, ldz, work, lwork, info )
call dlaqr0( wantt, wantz, n, ilo, ihi, h, ldh, wr, wi, iloz, ihiz, z, ldz, work, lwork, info )
call claqr0( wantt, wantz, n, ilo, ihi, h, ldh, w, iloz, ihiz, z, ldz, work, lwork, info )
call zlaqr0( wantt, wantz, n, ilo, ihi, h, ldh, w, iloz, ihiz, z, ldz, work, lwork, info )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine computes the eigenvalues of a Hessenberg matrix H, and, optionally, the matrices T and Z from the Schur decomposition H = Z*T*Z^H, where T is an upper quasi-triangular/triangular matrix (the Schur form), and Z is the orthogonal/unitary matrix of Schur vectors.
Optionally, Z may be postmultiplied into an input orthogonal/unitary matrix Q so that this routine can give the Schur factorization of a matrix A which has been reduced to the Hessenberg form H by the orthogonal/unitary matrix Q: A = Q*H*Q^H = (QZ)*T*(QZ)^H.

Input Parameters
wantt LOGICAL.
If wantt = .TRUE., the full Schur form T is required;
If wantt = .FALSE., only eigenvalues are required.
wantz LOGICAL.
If wantz = .TRUE., the matrix of Schur vectors Z is required;
If wantz = .FALSE., Schur vectors are not required.
n INTEGER. The order of the Hessenberg matrix H. n ≥ 0.
ilo, ihi INTEGER. It is assumed that H is already upper triangular in rows and columns 1:ilo-1 and ihi+1:n, and if ilo > 1 then H(ilo, ilo-1) = 0. ilo and ihi are normally set by a previous call to ?gebal, and then passed to ?gehrd when the matrix output by ?gebal is reduced to Hessenberg form. Otherwise, ilo and ihi should be set to 1 and n, respectively.
If n > 0, then 1 ≤ ilo ≤ ihi ≤ n. If n = 0, then ilo = 1 and ihi = 0.
h REAL for slaqr0, DOUBLE PRECISION for dlaqr0, COMPLEX for claqr0, DOUBLE COMPLEX for zlaqr0. Array, DIMENSION (ldh, n), contains the upper Hessenberg matrix H.
ldh INTEGER. The leading dimension of the array h. ldh ≥ max(1, n).
iloz, ihiz INTEGER. Specify the rows of Z to which transformations must be applied if wantz is .TRUE.. 1 ≤ iloz ≤ ilo; ihi ≤ ihiz ≤ n.
z REAL for slaqr0, DOUBLE PRECISION for dlaqr0, COMPLEX for claqr0, DOUBLE COMPLEX for zlaqr0. Array, DIMENSION (ldz, ihi), contains the matrix Z if wantz is .TRUE.. If wantz is .FALSE., z is not referenced.
ldz INTEGER. The leading dimension of the array z. If wantz is .TRUE., then ldz ≥ max(1, ihiz). Otherwise, ldz ≥ 1.
work REAL for slaqr0, DOUBLE PRECISION for dlaqr0, COMPLEX for claqr0, DOUBLE COMPLEX for zlaqr0. Workspace array with dimension lwork.
lwork INTEGER. The dimension of the array work. lwork ≥ max(1,n) is sufficient, but for optimal performance a greater workspace may be required, typically as large as 6*n. It is recommended to use the workspace query to determine the optimal workspace size. If lwork = -1, then the routine performs a workspace query: it estimates the optimal workspace size for the given values of the input parameters n, ilo, and ihi. The estimate is returned in work(1). No error message related to lwork is issued by xerbla. Neither H nor Z is accessed.

Output Parameters
h If info = 0 and wantt is .TRUE., then h contains the upper quasi-triangular/triangular matrix T from the Schur decomposition (the Schur form). If info = 0 and wantt is .FALSE., then the contents of h are unspecified on exit. (The output values of h when info > 0 are given under the description of the info parameter below.) The routine may explicitly set h(i,j) for i > j and j = 1, 2, ..., ilo-1 or j = ihi+1, ihi+2, ..., n.
work(1) On exit, work(1) contains the minimum value of lwork required for optimum performance.
w COMPLEX for claqr0, DOUBLE COMPLEX for zlaqr0. Array, DIMENSION (n). The computed eigenvalues of h(ilo:ihi, ilo:ihi) are stored in w(ilo:ihi). If wantt is .TRUE., then the eigenvalues are stored in the same order as on the diagonal of the Schur form returned in h, with w(i) = h(i,i).
wr, wi REAL for slaqr0, DOUBLE PRECISION for dlaqr0. Arrays, DIMENSION (ihi) each. The real and imaginary parts, respectively, of the computed eigenvalues of h(ilo:ihi, ilo:ihi) are stored in wr(ilo:ihi) and wi(ilo:ihi). If two eigenvalues are computed as a complex conjugate pair, they are stored in consecutive elements of wr and wi, say the i-th and (i+1)-th, with wi(i) > 0 and wi(i+1) < 0. If wantt is .TRUE., then the eigenvalues are stored in the same order as on the diagonal of the Schur form returned in h, with wr(i) = h(i,i), and if h(i:i+1,i:i+1) is a 2-by-2 diagonal block, then wi(i) = sqrt(-h(i+1,i)*h(i,i+1)).
z If wantz is .TRUE., then z(ilo:ihi, iloz:ihiz) is replaced by z(ilo:ihi, iloz:ihiz)*U, where U is the orthogonal/unitary Schur factor of h(ilo:ihi, ilo:ihi). If wantz is .FALSE., z is not referenced. (The output values of z when info > 0 are given under the description of the info parameter below.)
info INTEGER.
= 0: the execution is successful.
> 0: if info = i, then the routine failed to compute all the eigenvalues. Elements 1:ilo-1 and i+1:n of wr and wi contain those eigenvalues which have been successfully computed.
> 0: if wantt is .FALSE., then the remaining unconverged eigenvalues are the eigenvalues of the upper Hessenberg matrix formed by rows and columns ilo through info of the final output value of h.
> 0: if wantt is .TRUE., then (initial value of h)*U = U*(final value of h), where U is an orthogonal/unitary matrix. The final value of h is upper Hessenberg and quasi-triangular/triangular in rows and columns info+1 through ihi.
> 0: if wantz is .TRUE., then (final value of z(ilo:ihi, iloz:ihiz)) = (initial value of z(ilo:ihi, iloz:ihiz))*U, where U is the orthogonal/unitary matrix in the previous expression (regardless of the value of wantt).
> 0: if wantz is .FALSE., then z is not accessed.

?laqr1
Sets a scalar multiple of the first column of the product of a 2-by-2 or 3-by-3 matrix H and specified shifts.

Syntax
call slaqr1( n, h, ldh, sr1, si1, sr2, si2, v )
call dlaqr1( n, h, ldh, sr1, si1, sr2, si2, v )
call claqr1( n, h, ldh, s1, s2, v )
call zlaqr1( n, h, ldh, s1, s2, v )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
Given a 2-by-2 or 3-by-3 matrix H, this routine sets v to a scalar multiple of the first column of the product
K = (H - s1*I)*(H - s2*I), or
K = (H - (sr1 + i*si1)*I)*(H - (sr2 + i*si2)*I),
scaling to avoid overflows and most underflows.
It is assumed that either 1) sr1 = sr2 and si1 = -si2, or 2) si1 = si2 = 0.
This is useful for starting double implicit shift bulges in the QR algorithm.

Input Parameters
n INTEGER. The order of the matrix H. n must be equal to 2 or 3.
sr1, si1, sr2, si2 REAL for slaqr1, DOUBLE PRECISION for dlaqr1. Shift values that define K in the formula above.
s1, s2 COMPLEX for claqr1, DOUBLE COMPLEX for zlaqr1. Shift values that define K in the formula above.
h REAL for slaqr1, DOUBLE PRECISION for dlaqr1, COMPLEX for claqr1, DOUBLE COMPLEX for zlaqr1. Array, DIMENSION (ldh, n), contains the 2-by-2 or 3-by-3 matrix H in the formula above.
ldh INTEGER. The leading dimension of the array h just as declared in the calling routine. ldh ≥ n.

Output Parameters
v REAL for slaqr1, DOUBLE PRECISION for dlaqr1, COMPLEX for claqr1, DOUBLE COMPLEX for zlaqr1. Array with dimension (n). A scalar multiple of the first column of the matrix K in the formula above.
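
The quantity that ?laqr1 returns can be sketched in a few lines of plain Python. This is an illustration of the formula for K only, not the MKL implementation: the function name first_column_of_k is hypothetical, and the choice to scale the first column of (H - s2*I) by the sum of its absolute values before multiplying is an assumption made here to show how scaling can precede the product.

```python
def first_column_of_k(h, s1, s2):
    """Return a scalar multiple of the first column of K = (H - s1*I)*(H - s2*I).

    h      : 2-by-2 or 3-by-3 matrix as a list of rows
    s1, s2 : shifts (real or complex Python numbers)
    """
    n = len(h)
    if n not in (2, 3):
        raise ValueError("n must be equal to 2 or 3")
    # First column of (H - s2*I): column 1 of H with s2 removed from h(1,1).
    x = [h[i][0] - (s2 if i == 0 else 0) for i in range(n)]
    # Scale before multiplying so that intermediate products stay moderate.
    s = sum(abs(value) for value in x)
    if s == 0:
        return [0.0] * n
    x = [value / s for value in x]
    # Apply (H - s1*I) to the scaled column.
    return [
        sum((h[i][j] - (s1 if i == j else 0)) * x[j] for j in range(n))
        for i in range(n)
    ]


# Real shifts on a 2-by-2 matrix.
v_real = first_column_of_k([[4, 1], [2, 3]], 1, 2)
# A complex conjugate pair of shifts on a 3-by-3 Hessenberg matrix.
v_pair = first_column_of_k([[1, 2, 3], [4, 5, 6], [0, 7, 8]], 1 + 1j, 1 - 1j)
```

For a complex conjugate pair of shifts applied to a real H, the imaginary parts of the result cancel up to rounding, which is why the real flavors can return a real vector v.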

?laqr2
Performs the orthogonal/unitary similarity transformation of a Hessenberg matrix to detect and deflate fully converged eigenvalues from a trailing principal submatrix (aggressive early deflation).

Syntax
call slaqr2( wantt, wantz, n, ktop, kbot, nw, h, ldh, iloz, ihiz, z, ldz, ns, nd, sr, si, v, ldv, nh, t, ldt, nv, wv, ldwv, work, lwork )
call dlaqr2( wantt, wantz, n, ktop, kbot, nw, h, ldh, iloz, ihiz, z, ldz, ns, nd, sr, si, v, ldv, nh, t, ldt, nv, wv, ldwv, work, lwork )
call claqr2( wantt, wantz, n, ktop, kbot, nw, h, ldh, iloz, ihiz, z, ldz, ns, nd, sh, v, ldv, nh, t, ldt, nv, wv, ldwv, work, lwork )
call zlaqr2( wantt, wantz, n, ktop, kbot, nw, h, ldh, iloz, ihiz, z, ldz, ns, nd, sh, v, ldv, nh, t, ldt, nv, wv, ldwv, work, lwork )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine accepts as input an upper Hessenberg matrix H and performs an orthogonal/unitary similarity transformation designed to detect and deflate fully converged eigenvalues from a trailing principal submatrix. On output, H has been overwritten by a new Hessenberg matrix that is a perturbation of an orthogonal/unitary similarity transformation of H. It is to be hoped that the final version of H has many zero subdiagonal entries.
This subroutine is identical to ?laqr3 except that it avoids recursion by calling ?lahqr instead of ?laqr4.

Input Parameters
wantt LOGICAL.
If wantt = .TRUE., then the Hessenberg matrix H is fully updated so that the quasi-triangular/triangular Schur factor may be computed (in cooperation with the calling subroutine).
If wantt = .FALSE., then only enough of H is updated to preserve the eigenvalues.
wantz LOGICAL.
If wantz = .TRUE., then the orthogonal/unitary matrix Z is updated so that the orthogonal/unitary Schur factor may be computed (in cooperation with the calling subroutine).
If wantz = .FALSE., then Z is not referenced.
n INTEGER.
The order of the Hessenberg matrix H and (if wantz = .TRUE.) the order of the orthogonal/unitary matrix Z.
ktop INTEGER. It is assumed that either ktop = 1 or h(ktop,ktop-1) = 0. ktop and kbot together determine an isolated block along the diagonal of the Hessenberg matrix.
kbot INTEGER. It is assumed without a check that either kbot = n or h(kbot+1,kbot) = 0. ktop and kbot together determine an isolated block along the diagonal of the Hessenberg matrix.
nw INTEGER. Size of the deflation window. 1 ≤ nw ≤ (kbot-ktop+1).
h REAL for slaqr2, DOUBLE PRECISION for dlaqr2, COMPLEX for claqr2, DOUBLE COMPLEX for zlaqr2. Array, DIMENSION (ldh, n). On input, the initial n-by-n section of h stores the Hessenberg matrix H undergoing aggressive early deflation.
ldh INTEGER. The leading dimension of the array h just as declared in the calling subroutine. ldh ≥ n.
iloz, ihiz INTEGER. Specify the rows of Z to which transformations must be applied if wantz is .TRUE.. 1 ≤ iloz ≤ ihiz ≤ n.
z REAL for slaqr2, DOUBLE PRECISION for dlaqr2, COMPLEX for claqr2, DOUBLE COMPLEX for zlaqr2. Array, DIMENSION (ldz, n), contains the matrix Z if wantz is .TRUE.. If wantz is .FALSE., then z is not referenced.
ldz INTEGER. The leading dimension of the array z just as declared in the calling subroutine. ldz ≥ 1.
v REAL for slaqr2, DOUBLE PRECISION for dlaqr2, COMPLEX for claqr2, DOUBLE COMPLEX for zlaqr2. Workspace array with dimension (ldv, nw). An nw-by-nw work array.
ldv INTEGER. The leading dimension of the array v just as declared in the calling subroutine. ldv ≥ nw.
nh INTEGER. The number of columns of t. nh ≥ nw.
t REAL for slaqr2, DOUBLE PRECISION for dlaqr2, COMPLEX for claqr2, DOUBLE COMPLEX for zlaqr2. Workspace array with dimension (ldt, nw).
ldt INTEGER. The leading dimension of the array t just as declared in the calling subroutine. ldt ≥ nw.
nv INTEGER. The number of rows of the work array wv available for workspace. nv ≥ nw.
wv REAL for slaqr2, DOUBLE PRECISION for dlaqr2, COMPLEX for claqr2, DOUBLE COMPLEX for zlaqr2. Workspace array with dimension (ldwv, nw).
ldwv INTEGER. The leading dimension of the array wv just as declared in the calling subroutine. ldwv ≥ nw.
work REAL for slaqr2, DOUBLE PRECISION for dlaqr2, COMPLEX for claqr2, DOUBLE COMPLEX for zlaqr2. Workspace array with dimension lwork.
lwork INTEGER. The dimension of the array work. lwork = 2*nw is sufficient, but for optimal performance a greater workspace may be required. If lwork = -1, then the routine performs a workspace query: it estimates the optimal workspace size for the given values of the input parameters n, nw, ktop, and kbot. The estimate is returned in work(1). No error message related to lwork is issued by xerbla. Neither H nor Z is accessed.

Output Parameters
h On output, h has been transformed by an orthogonal/unitary similarity transformation, perturbed, and then returned to Hessenberg form that (it is to be hoped) has some zero subdiagonal entries.
work(1) On exit, work(1) is set to an estimate of the optimal value of lwork for the given values of the input parameters n, nw, ktop, and kbot.
z If wantz is .TRUE., then the orthogonal/unitary similarity transformation is accumulated into z(iloz:ihiz, ilo:ihi) from the right. If wantz is .FALSE., then z is unreferenced.
nd INTEGER. The number of converged eigenvalues uncovered by the routine.
ns INTEGER. The number of unconverged (that is, approximate) eigenvalues returned in sr, si or in sh that may be used as shifts by the calling subroutine.
sh COMPLEX for claqr2, DOUBLE COMPLEX for zlaqr2. Array, DIMENSION (kbot). The approximate eigenvalues that may be used for shifts are stored in sh(kbot-nd-ns+1) through sh(kbot-nd). The converged eigenvalues are stored in sh(kbot-nd+1) through sh(kbot).
sr, si REAL for slaqr2, DOUBLE PRECISION for dlaqr2. Arrays, DIMENSION (kbot) each.
The real and imaginary parts of the approximate eigenvalues that may be used for shifts are stored in sr(kbot-nd-ns+1) through sr(kbot-nd), and si(kbot-nd-ns+1) through si(kbot-nd), respectively. The real and imaginary parts of the converged eigenvalues are stored in sr(kbot-nd+1) through sr(kbot), and si(kbot-nd+1) through si(kbot), respectively.

?laqr3
Performs the orthogonal/unitary similarity transformation of a Hessenberg matrix to detect and deflate fully converged eigenvalues from a trailing principal submatrix (aggressive early deflation).

Syntax
call slaqr3( wantt, wantz, n, ktop, kbot, nw, h, ldh, iloz, ihiz, z, ldz, ns, nd, sr, si, v, ldv, nh, t, ldt, nv, wv, ldwv, work, lwork )
call dlaqr3( wantt, wantz, n, ktop, kbot, nw, h, ldh, iloz, ihiz, z, ldz, ns, nd, sr, si, v, ldv, nh, t, ldt, nv, wv, ldwv, work, lwork )
call claqr3( wantt, wantz, n, ktop, kbot, nw, h, ldh, iloz, ihiz, z, ldz, ns, nd, sh, v, ldv, nh, t, ldt, nv, wv, ldwv, work, lwork )
call zlaqr3( wantt, wantz, n, ktop, kbot, nw, h, ldh, iloz, ihiz, z, ldz, ns, nd, sh, v, ldv, nh, t, ldt, nv, wv, ldwv, work, lwork )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine accepts as input an upper Hessenberg matrix H and performs an orthogonal/unitary similarity transformation designed to detect and deflate fully converged eigenvalues from a trailing principal submatrix. On output, H has been overwritten by a new Hessenberg matrix that is a perturbation of an orthogonal/unitary similarity transformation of H. It is to be hoped that the final version of H has many zero subdiagonal entries.

Input Parameters
wantt LOGICAL.
If wantt = .TRUE., then the Hessenberg matrix H is fully updated so that the quasi-triangular/triangular Schur factor may be computed (in cooperation with the calling subroutine).
If wantt = .FALSE., then only enough of H is updated to preserve the eigenvalues.
wantz LOGICAL.
If wantz = .TRUE., then the orthogonal/unitary matrix Z is updated so that the orthogonal/unitary Schur factor may be computed (in cooperation with the calling subroutine).
If wantz = .FALSE., then Z is not referenced.
n INTEGER. The order of the Hessenberg matrix H and (if wantz = .TRUE.) the order of the orthogonal/unitary matrix Z.
ktop INTEGER. It is assumed that either ktop = 1 or h(ktop,ktop-1) = 0. ktop and kbot together determine an isolated block along the diagonal of the Hessenberg matrix.
kbot INTEGER. It is assumed without a check that either kbot = n or h(kbot+1,kbot) = 0. ktop and kbot together determine an isolated block along the diagonal of the Hessenberg matrix.
nw INTEGER. Size of the deflation window. 1 ≤ nw ≤ (kbot-ktop+1).
h REAL for slaqr3, DOUBLE PRECISION for dlaqr3, COMPLEX for claqr3, DOUBLE COMPLEX for zlaqr3. Array, DIMENSION (ldh, n). On input, the initial n-by-n section of h stores the Hessenberg matrix H undergoing aggressive early deflation.
ldh INTEGER. The leading dimension of the array h just as declared in the calling subroutine. ldh ≥ n.
iloz, ihiz INTEGER. Specify the rows of Z to which transformations must be applied if wantz is .TRUE.. 1 ≤ iloz ≤ ihiz ≤ n.
z REAL for slaqr3, DOUBLE PRECISION for dlaqr3, COMPLEX for claqr3, DOUBLE COMPLEX for zlaqr3. Array, DIMENSION (ldz, n), contains the matrix Z if wantz is .TRUE.. If wantz is .FALSE., then z is not referenced.
ldz INTEGER. The leading dimension of the array z just as declared in the calling subroutine. ldz ≥ 1.
v REAL for slaqr3, DOUBLE PRECISION for dlaqr3, COMPLEX for claqr3, DOUBLE COMPLEX for zlaqr3. Workspace array with dimension (ldv, nw). An nw-by-nw work array.
ldv INTEGER. The leading dimension of the array v just as declared in the calling subroutine. ldv ≥ nw.
nh INTEGER. The number of columns of t. nh ≥ nw.
t REAL for slaqr3, DOUBLE PRECISION for dlaqr3, COMPLEX for claqr3, DOUBLE COMPLEX for zlaqr3. Workspace array with dimension (ldt, nw).
ldt INTEGER. The leading dimension of the array t just as declared in the calling subroutine. ldt ≥ nw.
nv INTEGER. The number of rows of the work array wv available for workspace. nv ≥ nw.
wv REAL for slaqr3, DOUBLE PRECISION for dlaqr3, COMPLEX for claqr3, DOUBLE COMPLEX for zlaqr3. Workspace array with dimension (ldwv, nw).
ldwv INTEGER. The leading dimension of the array wv just as declared in the calling subroutine. ldwv ≥ nw.
work REAL for slaqr3, DOUBLE PRECISION for dlaqr3, COMPLEX for claqr3, DOUBLE COMPLEX for zlaqr3. Workspace array with dimension lwork.
lwork INTEGER. The dimension of the array work. lwork = 2*nw is sufficient, but for optimal performance a greater workspace may be required. If lwork = -1, then the routine performs a workspace query: it estimates the optimal workspace size for the given values of the input parameters n, nw, ktop, and kbot. The estimate is returned in work(1). No error message related to lwork is issued by xerbla. Neither H nor Z is accessed.

Output Parameters
h On output, h has been transformed by an orthogonal/unitary similarity transformation, perturbed, and then returned to Hessenberg form that (it is to be hoped) has some zero subdiagonal entries.
work(1) On exit, work(1) is set to an estimate of the optimal value of lwork for the given values of the input parameters n, nw, ktop, and kbot.
z If wantz is .TRUE., then the orthogonal/unitary similarity transformation is accumulated into z(iloz:ihiz, ilo:ihi) from the right. If wantz is .FALSE., then z is unreferenced.
nd INTEGER. The number of converged eigenvalues uncovered by the routine.
ns INTEGER. The number of unconverged (that is, approximate) eigenvalues returned in sr, si or in sh that may be used as shifts by the calling subroutine.
sh COMPLEX for claqr3, DOUBLE COMPLEX for zlaqr3. Array, DIMENSION (kbot). The approximate eigenvalues that may be used for shifts are stored in sh(kbot-nd-ns+1) through sh(kbot-nd).
The converged eigenvalues are stored in sh(kbot-nd+1) through sh(kbot).
sr, si REAL for slaqr3, DOUBLE PRECISION for dlaqr3. Arrays, DIMENSION (kbot) each. The real and imaginary parts of the approximate eigenvalues that may be used for shifts are stored in sr(kbot-nd-ns+1) through sr(kbot-nd), and si(kbot-nd-ns+1) through si(kbot-nd), respectively. The real and imaginary parts of the converged eigenvalues are stored in sr(kbot-nd+1) through sr(kbot), and si(kbot-nd+1) through si(kbot), respectively.

?laqr4
Computes the eigenvalues of a Hessenberg matrix, and optionally the matrices from the Schur decomposition.

Syntax
call slaqr4( wantt, wantz, n, ilo, ihi, h, ldh, wr, wi, iloz, ihiz, z, ldz, work, lwork, info )
call dlaqr4( wantt, wantz, n, ilo, ihi, h, ldh, wr, wi, iloz, ihiz, z, ldz, work, lwork, info )
call claqr4( wantt, wantz, n, ilo, ihi, h, ldh, w, iloz, ihiz, z, ldz, work, lwork, info )
call zlaqr4( wantt, wantz, n, ilo, ihi, h, ldh, w, iloz, ihiz, z, ldz, work, lwork, info )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine computes the eigenvalues of a Hessenberg matrix H, and, optionally, the matrices T and Z from the Schur decomposition H = Z*T*Z^H, where T is an upper quasi-triangular/triangular matrix (the Schur form), and Z is the orthogonal/unitary matrix of Schur vectors.
Optionally, Z may be postmultiplied into an input orthogonal/unitary matrix Q so that this routine can give the Schur factorization of a matrix A which has been reduced to the Hessenberg form H by the orthogonal/unitary matrix Q: A = Q*H*Q^H = (QZ)*T*(QZ)^H.
This routine implements one level of recursion for ?laqr0. It is a complete implementation of the small bulge multi-shift QR algorithm. It may be called by ?laqr0 and, for a large enough deflation window size, it may be called by ?laqr3. This routine is identical to ?laqr0 except that it calls ?laqr2 instead of ?laqr3.
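
For the real flavors, the quasi-triangular Schur form T carries the eigenvalues in its 1-by-1 and 2-by-2 diagonal blocks. The sketch below, in plain Python, reads them off under the assumption that every 2-by-2 block is in standardized form (equal diagonal entries and off-diagonal entries of opposite sign); the function name eigenvalues_from_real_schur is hypothetical and the code is an illustration, not part of Intel MKL.

```python
import math


def eigenvalues_from_real_schur(t):
    """Read eigenvalues off a real upper quasi-triangular matrix t.

    Assumes each 2-by-2 diagonal block is standardized, so that the pair is
    t(i,i) +/- i*sqrt(-t(i+1,i)*t(i,i+1)). Returns the lists (wr, wi).
    """
    n = len(t)
    wr = [0.0] * n
    wi = [0.0] * n
    i = 0
    while i < n:
        if i + 1 < n and t[i + 1][i] != 0:
            # 2-by-2 block: complex conjugate pair, positive imaginary part first.
            imag = math.sqrt(-t[i + 1][i] * t[i][i + 1])
            wr[i] = wr[i + 1] = t[i][i]
            wi[i], wi[i + 1] = imag, -imag
            i += 2
        else:
            # 1-by-1 block: real eigenvalue on the diagonal.
            wr[i] = t[i][i]
            i += 1
    return wr, wi


# One 2-by-2 block (eigenvalues 1 +/- 4i) followed by a real eigenvalue 3.
schur_form = [
    [1, 2, 5],
    [-8, 1, 6],
    [0, 0, 3],
]
wr, wi = eigenvalues_from_real_schur(schur_form)
```

The ordering of each pair (positive imaginary part first) follows the convention stated for wr and wi in the output parameters of these routines.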

Input Parameters
wantt LOGICAL.
If wantt = .TRUE., the full Schur form T is required;
If wantt = .FALSE., only eigenvalues are required.
wantz LOGICAL.
If wantz = .TRUE., the matrix of Schur vectors Z is required;
If wantz = .FALSE., Schur vectors are not required.
n INTEGER. The order of the Hessenberg matrix H. n ≥ 0.
ilo, ihi INTEGER. It is assumed that H is already upper triangular in rows and columns 1:ilo-1 and ihi+1:n, and if ilo > 1 then h(ilo, ilo-1) = 0. ilo and ihi are normally set by a previous call to ?gebal, and then passed to ?gehrd when the matrix output by ?gebal is reduced to Hessenberg form. Otherwise, ilo and ihi should be set to 1 and n, respectively. If n > 0, then 1 ≤ ilo ≤ ihi ≤ n. If n = 0, then ilo = 1 and ihi = 0.
h REAL for slaqr4, DOUBLE PRECISION for dlaqr4, COMPLEX for claqr4, DOUBLE COMPLEX for zlaqr4. Array, DIMENSION (ldh, n), contains the upper Hessenberg matrix H.
ldh INTEGER. The leading dimension of the array h. ldh ≥ max(1, n).
iloz, ihiz INTEGER. Specify the rows of Z to which transformations must be applied if wantz is .TRUE.. 1 ≤ iloz ≤ ilo; ihi ≤ ihiz ≤ n.
z REAL for slaqr4, DOUBLE PRECISION for dlaqr4, COMPLEX for claqr4, DOUBLE COMPLEX for zlaqr4. Array, DIMENSION (ldz, ihi), contains the matrix Z if wantz is .TRUE.. If wantz is .FALSE., z is not referenced.
ldz INTEGER. The leading dimension of the array z. If wantz is .TRUE., then ldz ≥ max(1, ihiz). Otherwise, ldz ≥ 1.
work REAL for slaqr4, DOUBLE PRECISION for dlaqr4, COMPLEX for claqr4, DOUBLE COMPLEX for zlaqr4. Workspace array with dimension lwork.
lwork INTEGER. The dimension of the array work. lwork ≥ max(1,n) is sufficient, but for optimal performance a greater workspace may be required, typically as large as 6*n. It is recommended to use the workspace query to determine the optimal workspace size.
If lwork = -1, then the routine performs a workspace query: it estimates the optimal workspace size for the given values of the input parameters n, ilo, and ihi. The estimate is returned in work(1). No error message related to lwork is issued by xerbla. Neither H nor Z is accessed.

Output Parameters
h If info = 0 and wantt is .TRUE., then h contains the upper quasi-triangular/triangular matrix T from the Schur decomposition (the Schur form). If info = 0 and wantt is .FALSE., then the contents of h are unspecified on exit. (The output values of h when info > 0 are given under the description of the info parameter below.) The routine may explicitly set h(i,j) for i > j and j = 1, 2, ..., ilo-1 or j = ihi+1, ihi+2, ..., n.
work(1) On exit, work(1) contains the minimum value of lwork required for optimum performance.
w COMPLEX for claqr4, DOUBLE COMPLEX for zlaqr4. Array, DIMENSION (n). The computed eigenvalues of h(ilo:ihi, ilo:ihi) are stored in w(ilo:ihi). If wantt is .TRUE., then the eigenvalues are stored in the same order as on the diagonal of the Schur form returned in h, with w(i) = h(i,i).
wr, wi REAL for slaqr4, DOUBLE PRECISION for dlaqr4. Arrays, DIMENSION (ihi) each. The real and imaginary parts, respectively, of the computed eigenvalues of h(ilo:ihi, ilo:ihi) are stored in wr(ilo:ihi) and wi(ilo:ihi). If two eigenvalues are computed as a complex conjugate pair, they are stored in consecutive elements of wr and wi, say the i-th and (i+1)-th, with wi(i) > 0 and wi(i+1) < 0. If wantt is .TRUE., then the eigenvalues are stored in the same order as on the diagonal of the Schur form returned in h, with wr(i) = h(i,i), and if h(i:i+1,i:i+1) is a 2-by-2 diagonal block, then wi(i) = sqrt(-h(i+1,i)*h(i,i+1)).
z If wantz is .TRUE., then z(ilo:ihi, iloz:ihiz) is replaced by z(ilo:ihi, iloz:ihiz)*U, where U is the orthogonal/unitary Schur factor of h(ilo:ihi, ilo:ihi). If wantz is .FALSE., z is not referenced.
(The output values of z when info > 0 are given under the description of the info parameter below.)
info INTEGER.
= 0: the execution is successful.
> 0: if info = i, then the routine failed to compute all the eigenvalues. Elements 1:ilo-1 and i+1:n of wr and wi (w for complex flavors) contain those eigenvalues which have been successfully computed. In this case:
• If wantt is .FALSE., then the remaining unconverged eigenvalues are the eigenvalues of the upper Hessenberg matrix formed by rows and columns ilo through info of the final output value of h.
• If wantt is .TRUE., then (initial value of h)*U = U*(final value of h), where U is an orthogonal/unitary matrix. The final value of h is upper Hessenberg and quasi-triangular/triangular in rows and columns info+1 through ihi.
• If wantz is .TRUE., then (final value of z(ilo:ihi, iloz:ihiz)) = (initial value of z(ilo:ihi, iloz:ihiz))*U, where U is the orthogonal/unitary matrix in the previous expression (regardless of the value of wantt).
• If wantz is .FALSE., then z is not accessed.

?laqr5
Performs a single small-bulge multi-shift QR sweep.

Syntax
call slaqr5( wantt, wantz, kacc22, n, ktop, kbot, nshfts, sr, si, h, ldh, iloz, ihiz, z, ldz, v, ldv, u, ldu, nv, wv, ldwv, nh, wh, ldwh )
call dlaqr5( wantt, wantz, kacc22, n, ktop, kbot, nshfts, sr, si, h, ldh, iloz, ihiz, z, ldz, v, ldv, u, ldu, nv, wv, ldwv, nh, wh, ldwh )
call claqr5( wantt, wantz, kacc22, n, ktop, kbot, nshfts, s, h, ldh, iloz, ihiz, z, ldz, v, ldv, u, ldu, nv, wv, ldwv, nh, wh, ldwh )
call zlaqr5( wantt, wantz, kacc22, n, ktop, kbot, nshfts, s, h, ldh, iloz, ihiz, z, ldz, v, ldv, u, ldu, nv, wv, ldwv, nh, wh, ldwh )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
This auxiliary routine, called by ?laqr0, performs a single small-bulge multi-shift QR sweep.

Input Parameters
wantt LOGICAL. wantt = .TRUE. if the quasi-triangular/triangular Schur factor is computed. wantt is set to .FALSE. otherwise.
wantz LOGICAL.
wantz = .TRUE. if the orthogonal/unitary Schur factor is computed. wantz is set to .FALSE. otherwise.
kacc22 INTEGER. Possible values are 0, 1, or 2. Specifies the computation mode of far-from-diagonal orthogonal updates.
= 0: the routine does not accumulate reflections and does not use matrix-matrix multiply to update far-from-diagonal matrix entries.
= 1: the routine accumulates reflections and uses matrix-matrix multiply to update the far-from-diagonal matrix entries.
= 2: the routine accumulates reflections, uses matrix-matrix multiply to update the far-from-diagonal matrix entries, and takes advantage of 2-by-2 block structure during matrix multiplies.
n INTEGER. The order of the Hessenberg matrix H upon which the routine operates.
ktop, kbot INTEGER. It is assumed without a check that either ktop = 1 or h(ktop, ktop-1) = 0, and either kbot = n or h(kbot+1, kbot) = 0.
nshfts INTEGER. Number of simultaneous shifts; must be positive and even.
sr, si REAL for slaqr5; DOUBLE PRECISION for dlaqr5. Arrays, DIMENSION (nshfts) each. sr contains the real parts and si contains the imaginary parts of the nshfts shifts of origin that define the multi-shift QR sweep.
s COMPLEX for claqr5; DOUBLE COMPLEX for zlaqr5. Array, DIMENSION (nshfts). s contains the shifts of origin that define the multi-shift QR sweep.
h REAL for slaqr5; DOUBLE PRECISION for dlaqr5; COMPLEX for claqr5; DOUBLE COMPLEX for zlaqr5. Array, DIMENSION (ldh, n), on input contains the Hessenberg matrix.
ldh INTEGER. The leading dimension of the array h just as declared in the calling routine. ldh ≥ max(1, n).
iloz, ihiz INTEGER. Specify the rows of Z to which transformations must be applied if wantz is .TRUE. 1 ≤ iloz ≤ ihiz ≤ n.
z REAL for slaqr5; DOUBLE PRECISION for dlaqr5; COMPLEX for claqr5; DOUBLE COMPLEX for zlaqr5. Array, DIMENSION (ldz, ihiz), contains the matrix Z if wantz is .TRUE. If wantz is .FALSE., then z is not referenced.
ldz INTEGER.
The leading dimension of the array z just as declared in the calling routine. ldz ≥ n.
v REAL for slaqr5; DOUBLE PRECISION for dlaqr5; COMPLEX for claqr5; DOUBLE COMPLEX for zlaqr5. Workspace array with dimension (ldv, nshfts/2).
ldv INTEGER. The leading dimension of the array v just as declared in the calling routine. ldv ≥ 3.
u REAL for slaqr5; DOUBLE PRECISION for dlaqr5; COMPLEX for claqr5; DOUBLE COMPLEX for zlaqr5. Workspace array with dimension (ldu, 3*nshfts-3).
ldu INTEGER. The leading dimension of the array u just as declared in the calling routine. ldu ≥ 3*nshfts-3.
nh INTEGER. The number of columns in the array wh available for workspace. nh ≥ 1.
wh REAL for slaqr5; DOUBLE PRECISION for dlaqr5; COMPLEX for claqr5; DOUBLE COMPLEX for zlaqr5. Workspace array with dimension (ldwh, nh).
ldwh INTEGER. The leading dimension of the array wh just as declared in the calling routine. ldwh ≥ 3*nshfts-3.
nv INTEGER. The number of rows of the array wv available for workspace. nv ≥ 1.
wv REAL for slaqr5; DOUBLE PRECISION for dlaqr5; COMPLEX for claqr5; DOUBLE COMPLEX for zlaqr5. Workspace array with dimension (ldwv, 3*nshfts-3).
ldwv INTEGER. The leading dimension of the array wv just as declared in the calling routine. ldwv ≥ nv.

Output Parameters
sr, si On output, may be reordered.
h On output, a multi-shift QR sweep with shifts sr(j)+i*si(j) (real flavors) or s(j) (complex flavors) is applied to the isolated diagonal block in rows and columns ktop through kbot.
z If wantz is .TRUE., then the QR sweep orthogonal/unitary similarity transformation is accumulated into z(iloz:ihiz, iloz:ihiz) from the right. If wantz is .FALSE., then z is unreferenced.

?laqsb
Scales a symmetric band matrix, using scaling factors computed by ?pbequ.
Syntax
call slaqsb( uplo, n, kd, ab, ldab, s, scond, amax, equed )
call dlaqsb( uplo, n, kd, ab, ldab, s, scond, amax, equed )
call claqsb( uplo, n, kd, ab, ldab, s, scond, amax, equed )
call zlaqsb( uplo, n, kd, ab, ldab, s, scond, amax, equed )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine equilibrates a symmetric band matrix A using the scaling factors in the vector s.

Input Parameters
uplo CHARACTER*1. Specifies whether the upper or lower triangular part of the symmetric matrix A is stored. If uplo = 'U': upper triangular. If uplo = 'L': lower triangular.
n INTEGER. The order of the matrix A. n ≥ 0.
kd INTEGER. The number of super-diagonals of the matrix A if uplo = 'U', or the number of sub-diagonals if uplo = 'L'. kd ≥ 0.
ab REAL for slaqsb; DOUBLE PRECISION for dlaqsb; COMPLEX for claqsb; DOUBLE COMPLEX for zlaqsb. Array, DIMENSION (ldab, n). On entry, the upper or lower triangle of the symmetric band matrix A, stored in the first kd+1 rows of the array. The j-th column of A is stored in the j-th column of the array ab as follows: if uplo = 'U', ab(kd+1+i-j, j) = A(i,j) for max(1, j-kd) ≤ i ≤ j; if uplo = 'L', ab(1+i-j, j) = A(i,j) for j ≤ i ≤ min(n, j+kd).
ldab INTEGER. The leading dimension of the array ab. ldab ≥ kd+1.
s REAL for slaqsb/claqsb; DOUBLE PRECISION for dlaqsb/zlaqsb. Array, DIMENSION (n). The scale factors for A.
scond REAL for slaqsb/claqsb; DOUBLE PRECISION for dlaqsb/zlaqsb. Ratio of the smallest s(i) to the largest s(i).
amax REAL for slaqsb/claqsb; DOUBLE PRECISION for dlaqsb/zlaqsb. Absolute value of the largest matrix entry.

Output Parameters
ab On exit, if equed = 'Y', the equilibrated matrix diag(s)*A*diag(s), in the same storage format as A.
equed CHARACTER*1. Specifies whether or not equilibration was done. If equed = 'N': No equilibration.
If equed = 'Y': Equilibration was done, that is, A has been replaced by diag(s)*A*diag(s).

Application Notes
The routine uses internal parameters thresh, large, and small, which have the following meaning. thresh is a threshold value used to decide if scaling should be based on the ratio of the scaling factors. If scond < thresh, scaling is done. large and small are threshold values used to decide if scaling should be done based on the absolute size of the largest matrix element. If amax > large or amax < small, scaling is done.

?laqsp
Scales a symmetric/Hermitian matrix in packed storage, using scaling factors computed by ?ppequ.

Syntax
call slaqsp( uplo, n, ap, s, scond, amax, equed )
call dlaqsp( uplo, n, ap, s, scond, amax, equed )
call claqsp( uplo, n, ap, s, scond, amax, equed )
call zlaqsp( uplo, n, ap, s, scond, amax, equed )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine ?laqsp equilibrates a symmetric matrix A using the scaling factors in the vector s.

Input Parameters
uplo CHARACTER*1. Specifies whether the upper or lower triangular part of the symmetric matrix A is stored. If uplo = 'U': upper triangular. If uplo = 'L': lower triangular.
n INTEGER. The order of the matrix A. n ≥ 0.
ap REAL for slaqsp; DOUBLE PRECISION for dlaqsp; COMPLEX for claqsp; DOUBLE COMPLEX for zlaqsp. Array, DIMENSION (n*(n+1)/2). On entry, the upper or lower triangle of the symmetric matrix A, packed columnwise in a linear array. The j-th column of A is stored in the array ap as follows: if uplo = 'U', ap(i + (j-1)*j/2) = A(i,j) for 1 ≤ i ≤ j; if uplo = 'L', ap(i + (j-1)*(2*n-j)/2) = A(i,j) for j ≤ i ≤ n.
s REAL for slaqsp/claqsp; DOUBLE PRECISION for dlaqsp/zlaqsp. Array, DIMENSION (n). The scale factors for A.
scond REAL for slaqsp/claqsp; DOUBLE PRECISION for dlaqsp/zlaqsp. Ratio of the smallest s(i) to the largest s(i).
amax REAL for slaqsp/claqsp; DOUBLE PRECISION for dlaqsp/zlaqsp. Absolute value of the largest matrix entry.

Output Parameters
ap On exit, the equilibrated matrix diag(s)*A*diag(s), in the same storage format as A.
equed CHARACTER*1. Specifies whether or not equilibration was done. If equed = 'N': No equilibration. If equed = 'Y': Equilibration was done, that is, A has been replaced by diag(s)*A*diag(s).

Application Notes
The routine uses internal parameters thresh, large, and small, which have the following meaning. thresh is a threshold value used to decide if scaling should be based on the ratio of the scaling factors. If scond < thresh, scaling is done. large and small are threshold values used to decide if scaling should be done based on the absolute size of the largest matrix element. If amax > large or amax < small, scaling is done.

?laqsy
Scales a symmetric/Hermitian matrix, using scaling factors computed by ?poequ.

Syntax
call slaqsy( uplo, n, a, lda, s, scond, amax, equed )
call dlaqsy( uplo, n, a, lda, s, scond, amax, equed )
call claqsy( uplo, n, a, lda, s, scond, amax, equed )
call zlaqsy( uplo, n, a, lda, s, scond, amax, equed )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine equilibrates a symmetric matrix A using the scaling factors in the vector s.

Input Parameters
uplo CHARACTER*1. Specifies whether the upper or lower triangular part of the symmetric matrix A is stored. If uplo = 'U': upper triangular. If uplo = 'L': lower triangular.
n INTEGER. The order of the matrix A. n ≥ 0.
a REAL for slaqsy; DOUBLE PRECISION for dlaqsy; COMPLEX for claqsy; DOUBLE COMPLEX for zlaqsy. Array, DIMENSION (lda, n). On entry, the symmetric matrix A. If uplo = 'U', the leading n-by-n upper triangular part of a contains the upper triangular part of the matrix A, and the strictly lower triangular part of a is not referenced.
If uplo = 'L', the leading n-by-n lower triangular part of a contains the lower triangular part of the matrix A, and the strictly upper triangular part of a is not referenced.
lda INTEGER. The leading dimension of the array a. lda ≥ max(n,1).
s REAL for slaqsy/claqsy; DOUBLE PRECISION for dlaqsy/zlaqsy. Array, DIMENSION (n). The scale factors for A.
scond REAL for slaqsy/claqsy; DOUBLE PRECISION for dlaqsy/zlaqsy. Ratio of the smallest s(i) to the largest s(i).
amax REAL for slaqsy/claqsy; DOUBLE PRECISION for dlaqsy/zlaqsy. Absolute value of the largest matrix entry.

Output Parameters
a On exit, if equed = 'Y', the equilibrated matrix diag(s)*A*diag(s).
equed CHARACTER*1. Specifies whether or not equilibration was done. If equed = 'N': No equilibration. If equed = 'Y': Equilibration was done, that is, A has been replaced by diag(s)*A*diag(s).

Application Notes
The routine uses internal parameters thresh, large, and small, which have the following meaning. thresh is a threshold value used to decide if scaling should be based on the ratio of the scaling factors. If scond < thresh, scaling is done. large and small are threshold values used to decide if scaling should be done based on the absolute size of the largest matrix element. If amax > large or amax < small, scaling is done.

?laqtr
Solves a real quasi-triangular system of equations, or a complex quasi-triangular system of special form, in real arithmetic.

Syntax
call slaqtr( ltran, lreal, n, t, ldt, b, w, scale, x, work, info )
call dlaqtr( ltran, lreal, n, t, ldt, b, w, scale, x, work, info )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine ?laqtr solves the real quasi-triangular system op(T)*p = scale*c, if lreal = .TRUE., or the complex quasi-triangular system op(T + iB)*(p + iq) = scale*(c + id), if lreal = .FALSE., in real arithmetic, where T is upper quasi-triangular.
If lreal = .FALSE., then the first diagonal block of T must be 1-by-1, and B is the specially structured n-by-n matrix whose first row is (b(1), b(2), ..., b(n)), whose remaining diagonal elements B(j,j), j = 2, ..., n, are all equal to w, and whose other elements are zero. Here op(A) = A or A^T, where A^T denotes the transpose of matrix A. On input, x holds c in its first n elements, followed by d when lreal = .FALSE.; on output, they are overwritten by p and q, respectively. This routine is designed for the condition number estimation in routine ?trsna.

Input Parameters
ltran LOGICAL. On entry, ltran specifies the option of conjugate transpose:
= .FALSE., op(T + iB) = T + iB,
= .TRUE., op(T + iB) = (T + iB)^T.
lreal LOGICAL. On entry, lreal specifies the input matrix structure:
= .FALSE., the input is complex,
= .TRUE., the input is real.
n INTEGER. On entry, n specifies the order of T + iB. n ≥ 0.
t REAL for slaqtr; DOUBLE PRECISION for dlaqtr. Array, DIMENSION (ldt, n). On entry, t contains a matrix in Schur canonical form. If lreal = .FALSE., then the first diagonal block of t must be 1-by-1.
ldt INTEGER. The leading dimension of the array t. ldt ≥ max(1,n).
b REAL for slaqtr; DOUBLE PRECISION for dlaqtr. Array, DIMENSION (n). On entry, b contains the elements to form the matrix B as described above. If lreal = .TRUE., b is not referenced.
w REAL for slaqtr; DOUBLE PRECISION for dlaqtr. On entry, w is the diagonal element of the matrix B. If lreal = .TRUE., w is not referenced.
x REAL for slaqtr; DOUBLE PRECISION for dlaqtr. Array, DIMENSION (2*n). On entry, x contains the right-hand side of the system.
work REAL for slaqtr; DOUBLE PRECISION for dlaqtr. Workspace array, DIMENSION (n).

Output Parameters
scale REAL for slaqtr; DOUBLE PRECISION for dlaqtr. On exit, scale is the scale factor.
x On exit, x is overwritten by the solution.
info INTEGER.
If info = 0: successful exit.
If info = 1: some diagonal 1-by-1 block has been perturbed by a small number smin to keep nonsingularity.
If info = 2: some diagonal 2-by-2 block has been perturbed by a small number in ?laln2 to keep nonsingularity.

NOTE For higher speed, this routine does not check the inputs for errors.
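
As a rough illustration of the real case (lreal = .TRUE.) described for ?laqtr above, the following Python sketch solves op(T)*p = scale*c by substitution. It is not the MKL implementation and makes simplifying assumptions: T is plain upper triangular (only 1-by-1 diagonal blocks), all diagonal elements are nonzero, scale is fixed at 1, and none of the perturbation by smin or overflow protection is modeled. The helper name solve_triangular_sketch is hypothetical.

```python
def solve_triangular_sketch(t, c, ltran=False):
    """Solve op(T) * p = scale * c for an upper triangular T.

    Hypothetical helper, not an MKL routine. Assumptions: T has only
    1-by-1 diagonal blocks, every t[i][i] is nonzero, and scale is
    fixed at 1.0 (no overflow protection, no perturbation by smin).
    """
    n = len(c)
    scale = 1.0
    p = [0.0] * n
    if not ltran:
        # op(T) = T: back substitution from the last row upward.
        for i in range(n - 1, -1, -1):
            acc = scale * c[i]
            for j in range(i + 1, n):
                acc -= t[i][j] * p[j]
            p[i] = acc / t[i][i]
    else:
        # op(T) = T^T is lower triangular: forward substitution.
        for i in range(n):
            acc = scale * c[i]
            for j in range(i):
                acc -= t[j][i] * p[j]
            p[i] = acc / t[i][i]
    return scale, p


t = [[2.0, 1.0],
     [0.0, 4.0]]
c = [4.0, 8.0]
print(solve_triangular_sketch(t, c))              # op(T) = T
print(solve_triangular_sketch(t, c, ltran=True))  # op(T) = T^T
```

The two branches correspond to ltran = .FALSE. and ltran = .TRUE.; the actual routine additionally handles 2-by-2 diagonal blocks and chooses scale so that the solution does not overflow.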
?lar1v
Computes the (scaled) r-th column of the inverse of the submatrix in rows b1 through bn of a tridiagonal matrix.

Syntax
call slar1v( n, b1, bn, lambda, d, l, ld, lld, pivmin, gaptol, z, wantnc, negcnt, ztz, mingma, r, isuppz, nrminv, resid, rqcorr, work )
call dlar1v( n, b1, bn, lambda, d, l, ld, lld, pivmin, gaptol, z, wantnc, negcnt, ztz, mingma, r, isuppz, nrminv, resid, rqcorr, work )
call clar1v( n, b1, bn, lambda, d, l, ld, lld, pivmin, gaptol, z, wantnc, negcnt, ztz, mingma, r, isuppz, nrminv, resid, rqcorr, work )
call zlar1v( n, b1, bn, lambda, d, l, ld, lld, pivmin, gaptol, z, wantnc, negcnt, ztz, mingma, r, isuppz, nrminv, resid, rqcorr, work )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine ?lar1v computes the (scaled) r-th column of the inverse of the submatrix in rows b1 through bn of the tridiagonal matrix L*D*L^T - λ*I. When λ is close to an eigenvalue, the computed vector is an accurate eigenvector. Usually, r corresponds to the index where the eigenvector is largest in magnitude.
The following steps accomplish this computation:
• Stationary qd transform, L*D*L^T - λ*I = L(+)*D(+)*L(+)^T.
• Progressive qd transform, L*D*L^T - λ*I = U(-)*D(-)*U(-)^T.
• Computation of the diagonal elements of the inverse of L*D*L^T - λ*I by combining the above transforms, and choosing r as the index where the diagonal of the inverse is (one of the) largest in magnitude.
• Computation of the (scaled) r-th column of the inverse using the twisted factorization obtained by combining the top part of the stationary and the bottom part of the progressive transform.

Input Parameters
n INTEGER. The order of the matrix L*D*L^T.
b1 INTEGER. First index of the submatrix of L*D*L^T.
bn INTEGER. Last index of the submatrix of L*D*L^T.
lambda REAL for slar1v/clar1v; DOUBLE PRECISION for dlar1v/zlar1v. The shift.
To compute an accurate eigenvector, lambda should be a good approximation to an eigenvalue of L*D*L^T.
l REAL for slar1v/clar1v; DOUBLE PRECISION for dlar1v/zlar1v. Array, DIMENSION (n-1). The (n-1) subdiagonal elements of the unit bidiagonal matrix L, in elements 1 to n-1.
d REAL for slar1v/clar1v; DOUBLE PRECISION for dlar1v/zlar1v. Array, DIMENSION (n). The n diagonal elements of the diagonal matrix D.
ld REAL for slar1v/clar1v; DOUBLE PRECISION for dlar1v/zlar1v. Array, DIMENSION (n-1). The n-1 elements L(i)*D(i).
lld REAL for slar1v/clar1v; DOUBLE PRECISION for dlar1v/zlar1v. Array, DIMENSION (n-1). The n-1 elements L(i)*L(i)*D(i).
pivmin REAL for slar1v/clar1v; DOUBLE PRECISION for dlar1v/zlar1v. The minimum pivot in the Sturm sequence.
gaptol REAL for slar1v/clar1v; DOUBLE PRECISION for dlar1v/zlar1v. Tolerance that indicates when eigenvector entries are negligible with respect to their contribution to the residual.
z REAL for slar1v; DOUBLE PRECISION for dlar1v; COMPLEX for clar1v; DOUBLE COMPLEX for zlar1v. Array, DIMENSION (n). All entries of z must be set to 0.
wantnc LOGICAL. Specifies whether negcnt has to be computed.
r INTEGER. The twist index for the twisted factorization used to compute z. On input, 0 ≤ r ≤ n. If r is input as 0, r is set to the index where (L*D*L^T - lambda*I)^(-1) is largest in magnitude. If 1 ≤ r ≤ n, r is unchanged.
work REAL for slar1v/clar1v; DOUBLE PRECISION for dlar1v/zlar1v. Workspace array, DIMENSION (4*n).

Output Parameters
z REAL for slar1v; DOUBLE PRECISION for dlar1v; COMPLEX for clar1v; DOUBLE COMPLEX for zlar1v. Array, DIMENSION (n). The (scaled) r-th column of the inverse. z(r) is returned as 1.
negcnt INTEGER. If wantnc is .TRUE., then negcnt is the number of pivots < pivmin in the matrix factorization L*D*L^T; otherwise negcnt = -1.
ztz REAL for slar1v/clar1v; DOUBLE PRECISION for dlar1v/zlar1v. The square of the 2-norm of z.
mingma REAL for slar1v/clar1v; DOUBLE PRECISION for dlar1v/zlar1v. The reciprocal of the largest (in magnitude) diagonal element of the inverse of L*D*L^T - lambda*I.
r On output, r is the twist index used to compute z. Ideally, r designates the position of the maximum entry in the eigenvector.
isuppz INTEGER. Array, DIMENSION (2). The support of the vector in z, that is, the vector z is nonzero only in elements isuppz(1) through isuppz(2).
nrminv REAL for slar1v/clar1v; DOUBLE PRECISION for dlar1v/zlar1v. Equals 1/sqrt( ztz ).
resid REAL for slar1v/clar1v; DOUBLE PRECISION for dlar1v/zlar1v. The residual of the FP vector. resid = abs( mingma )/sqrt( ztz ).
rqcorr REAL for slar1v/clar1v; DOUBLE PRECISION for dlar1v/zlar1v. The Rayleigh Quotient correction to lambda. rqcorr = mingma/ztz.

?lar2v
Applies a vector of plane rotations with real cosines and real/complex sines from both sides to a sequence of 2-by-2 symmetric/Hermitian matrices.

Syntax
call slar2v( n, x, y, z, incx, c, s, incc )
call dlar2v( n, x, y, z, incx, c, s, incc )
call clar2v( n, x, y, z, incx, c, s, incc )
call zlar2v( n, x, y, z, incx, c, s, incc )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine ?lar2v applies a vector of real/complex plane rotations with real cosines from both sides to a sequence of 2-by-2 real symmetric or complex Hermitian matrices, defined by the elements of the vectors x, y and z. For i = 1, 2, ..., n (matrix rows are separated by semicolons):
for real flavors, [x(i) z(i); z(i) y(i)] := [c(i) s(i); -s(i) c(i)] * [x(i) z(i); z(i) y(i)] * [c(i) -s(i); s(i) c(i)];
for complex flavors, [x(i) z(i); conjg(z(i)) y(i)] := [c(i) conjg(s(i)); -s(i) c(i)] * [x(i) z(i); conjg(z(i)) y(i)] * [c(i) -conjg(s(i)); s(i) c(i)].

Input Parameters
n INTEGER. The number of plane rotations to be applied.
x, y, z REAL for slar2v; DOUBLE PRECISION for dlar2v; COMPLEX for clar2v; DOUBLE COMPLEX for zlar2v. Arrays, DIMENSION (1+(n-1)*incx) each. Contain the vectors x, y and z, respectively. For all flavors of ?lar2v, elements of x and y are assumed to be real.
incx INTEGER. The increment between elements of x, y, and z. incx > 0.
c REAL for slar2v/clar2v; DOUBLE PRECISION for dlar2v/zlar2v. Array, DIMENSION (1+(n-1)*incc).
The cosines of the plane rotations.
s REAL for slar2v; DOUBLE PRECISION for dlar2v; COMPLEX for clar2v; DOUBLE COMPLEX for zlar2v. Array, DIMENSION (1+(n-1)*incc). The sines of the plane rotations.
incc INTEGER. The increment between elements of c and s. incc > 0.

Output Parameters
x, y, z Vectors x, y and z, containing the results of the transform.

?larf
Applies an elementary reflector to a general rectangular matrix.

Syntax
call slarf( side, m, n, v, incv, tau, c, ldc, work )
call dlarf( side, m, n, v, incv, tau, c, ldc, work )
call clarf( side, m, n, v, incv, tau, c, ldc, work )
call zlarf( side, m, n, v, incv, tau, c, ldc, work )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine applies a real/complex elementary reflector H to a real/complex m-by-n matrix C, from either the left or the right. H is represented in one of the following forms:
• H = I - tau*v*v^T, where tau is a real scalar and v is a real vector. If tau = 0, then H is taken to be the unit matrix.
• H = I - tau*v*v^H, where tau is a complex scalar and v is a complex vector. If tau = 0, then H is taken to be the unit matrix. For clarf/zlarf, to apply H^H (the conjugate transpose of H), supply conjg(tau) instead of tau.

Input Parameters
side CHARACTER*1. If side = 'L': form H*C. If side = 'R': form C*H.
m INTEGER. The number of rows of the matrix C.
n INTEGER. The number of columns of the matrix C.
v REAL for slarf; DOUBLE PRECISION for dlarf; COMPLEX for clarf; DOUBLE COMPLEX for zlarf. Array, DIMENSION (1 + (m-1)*abs(incv)) if side = 'L' or (1 + (n-1)*abs(incv)) if side = 'R'. The vector v in the representation of H. v is not used if tau = 0.
incv INTEGER. The increment between elements of v. incv ≠ 0.
tau REAL for slarf; DOUBLE PRECISION for dlarf; COMPLEX for clarf; DOUBLE COMPLEX for zlarf. The value tau in the representation of H.
c REAL for slarf; DOUBLE PRECISION for dlarf; COMPLEX for clarf; DOUBLE COMPLEX for zlarf. Array, DIMENSION (ldc, n). On entry, the m-by-n matrix C.
ldc INTEGER. The leading dimension of the array c. ldc ≥ max(1,m).
work REAL for slarf; DOUBLE PRECISION for dlarf; COMPLEX for clarf; DOUBLE COMPLEX for zlarf. Workspace array, DIMENSION (n) if side = 'L' or (m) if side = 'R'.

Output Parameters
c On exit, C is overwritten by the matrix H*C if side = 'L', or C*H if side = 'R'.

?larfb
Applies a block reflector or its transpose/conjugate-transpose to a general rectangular matrix.

Syntax
call slarfb( side, trans, direct, storev, m, n, k, v, ldv, t, ldt, c, ldc, work, ldwork )
call dlarfb( side, trans, direct, storev, m, n, k, v, ldv, t, ldt, c, ldc, work, ldwork )
call clarfb( side, trans, direct, storev, m, n, k, v, ldv, t, ldt, c, ldc, work, ldwork )
call zlarfb( side, trans, direct, storev, m, n, k, v, ldv, t, ldt, c, ldc, work, ldwork )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The real flavors of the routine ?larfb apply a real block reflector H or its transpose H^T to a real m-by-n matrix C from either left or right.
The complex flavors of the routine ?larfb apply a complex block reflector H or its conjugate transpose H^H to a complex m-by-n matrix C from either left or right.

Input Parameters
side CHARACTER*1. If side = 'L': apply H or H^T for real flavors and H or H^H for complex flavors from the left. If side = 'R': apply H or H^T for real flavors and H or H^H for complex flavors from the right.
trans CHARACTER*1. If trans = 'N': apply H (No transpose). If trans = 'C': apply H^H (Conjugate transpose). If trans = 'T': apply H^T (Transpose).
direct CHARACTER*1. Indicates how H is formed from a product of elementary reflectors.
If direct = 'F': H = H(1)*H(2)*...*H(k) (forward).
If direct = 'B': H = H(k)*...*H(2)*H(1) (backward).
storev CHARACTER*1.
Indicates how the vectors which define the elementary reflectors are stored:
If storev = 'C': Column-wise.
If storev = 'R': Row-wise.
m INTEGER. The number of rows of the matrix C.
n INTEGER. The number of columns of the matrix C.
k INTEGER. The order of the matrix T (equal to the number of elementary reflectors whose product defines the block reflector).
v REAL for slarfb; DOUBLE PRECISION for dlarfb; COMPLEX for clarfb; DOUBLE COMPLEX for zlarfb. Array, DIMENSION (ldv, k) if storev = 'C'; (ldv, m) if storev = 'R' and side = 'L'; (ldv, n) if storev = 'R' and side = 'R'. The matrix V. See Application Notes below.
ldv INTEGER. The leading dimension of the array v. If storev = 'C' and side = 'L', ldv ≥ max(1,m); if storev = 'C' and side = 'R', ldv ≥ max(1,n); if storev = 'R', ldv ≥ k.
t REAL for slarfb; DOUBLE PRECISION for dlarfb; COMPLEX for clarfb; DOUBLE COMPLEX for zlarfb. Array, DIMENSION (ldt, k). Contains the triangular k-by-k matrix T in the representation of the block reflector.
ldt INTEGER. The leading dimension of the array t. ldt ≥ k.
c REAL for slarfb; DOUBLE PRECISION for dlarfb; COMPLEX for clarfb; DOUBLE COMPLEX for zlarfb. Array, DIMENSION (ldc, n). On entry, the m-by-n matrix C.
ldc INTEGER. The leading dimension of the array c. ldc ≥ max(1,m).
work REAL for slarfb; DOUBLE PRECISION for dlarfb; COMPLEX for clarfb; DOUBLE COMPLEX for zlarfb. Workspace array, DIMENSION (ldwork, k).
ldwork INTEGER. The leading dimension of the array work. If side = 'L', ldwork ≥ max(1, n); if side = 'R', ldwork ≥ max(1, m).

Output Parameters
c On exit, c is overwritten by one of the following products:
• H*C, or H^T*C, or C*H, or C*H^T for real flavors
• H*C, or H^H*C, or C*H, or C*H^H for complex flavors

Application Notes
The shape of the matrix V and the storage of the vectors which define the H(i) is best illustrated by the following example with n = 5 and k = 3.
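
The illustration for this example is not present in this copy. As a sketch, assuming the layout matches the one given in the comments of the reference LAPACK sources for ?larfb and ?larft, the four combinations of direct and storev look as follows, where v_i denotes an element of the vector that defines H(i):

```latex
\[
\text{direct = 'F', storev = 'C':}\quad
V = \begin{pmatrix}
1   &     &     \\
v_1 & 1   &     \\
v_1 & v_2 & 1   \\
v_1 & v_2 & v_3 \\
v_1 & v_2 & v_3
\end{pmatrix}
\qquad
\text{direct = 'F', storev = 'R':}\quad
V = \begin{pmatrix}
1 & v_1 & v_1 & v_1 & v_1 \\
  & 1   & v_2 & v_2 & v_2 \\
  &     & 1   & v_3 & v_3
\end{pmatrix}
\]
\[
\text{direct = 'B', storev = 'C':}\quad
V = \begin{pmatrix}
v_1 & v_2 & v_3 \\
v_1 & v_2 & v_3 \\
1   & v_2 & v_3 \\
    & 1   & v_3 \\
    &     & 1
\end{pmatrix}
\qquad
\text{direct = 'B', storev = 'R':}\quad
V = \begin{pmatrix}
v_1 & v_1 & 1   &     &   \\
v_2 & v_2 & v_2 & 1   &   \\
v_3 & v_3 & v_3 & v_3 & 1
\end{pmatrix}
\]
```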
The elements equal to 1 are not stored; the corresponding array elements are modified but restored on exit. The rest of the array is not used.

?larfg
Generates an elementary reflector (Householder matrix).

Syntax
call slarfg( n, alpha, x, incx, tau )
call dlarfg( n, alpha, x, incx, tau )
call clarfg( n, alpha, x, incx, tau )
call zlarfg( n, alpha, x, incx, tau )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine ?larfg generates a real/complex elementary reflector H of order n, such that (with [a; b] denoting a column vector with a stacked on top of b)
H*[alpha; x] = [beta; 0], H^T*H = I for real flavors and
H^H*[alpha; x] = [beta; 0], H^H*H = I for complex flavors,
where alpha and beta are scalars (with beta real for all flavors), and x is an (n-1)-element real/complex vector. H is represented in the form
H = I - tau*[1; v]*[1 v^T] for real flavors and
H = I - tau*[1; v]*[1 v^H] for complex flavors,
where tau is a real/complex scalar and v is a real/complex (n-1)-element vector, respectively. Note that for clarfg/zlarfg, H is not Hermitian.
If the elements of x are all zero (and, for complex flavors, alpha is real), then tau = 0 and H is taken to be the unit matrix.
Otherwise, 1 ≤ tau ≤ 2 (for real flavors), or 1 ≤ Re(tau) ≤ 2 and abs(tau-1) ≤ 1 (for complex flavors).

Input Parameters
n INTEGER. The order of the elementary reflector.
alpha REAL for slarfg; DOUBLE PRECISION for dlarfg; COMPLEX for clarfg; DOUBLE COMPLEX for zlarfg. On entry, the value alpha.
x REAL for slarfg; DOUBLE PRECISION for dlarfg; COMPLEX for clarfg; DOUBLE COMPLEX for zlarfg. Array, DIMENSION (1+(n-2)*abs(incx)). On entry, the vector x.
incx INTEGER. The increment between elements of x. incx > 0.

Output Parameters
alpha On exit, it is overwritten with the value beta.
x On exit, it is overwritten with the vector v.
tau REAL for slarfg; DOUBLE PRECISION for dlarfg; COMPLEX for clarfg; DOUBLE COMPLEX for zlarfg. The value tau.

?larfgp
Generates an elementary reflector (Householder matrix) with non-negative beta.
Syntax
call slarfgp( n, alpha, x, incx, tau )
call dlarfgp( n, alpha, x, incx, tau )
call clarfgp( n, alpha, x, incx, tau )
call zlarfgp( n, alpha, x, incx, tau )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine ?larfgp generates a real/complex elementary reflector H of order n, such that (with [a; b] denoting a column vector with a stacked on top of b)
H*[alpha; x] = [beta; 0], H^T*H = I for real flavors and
H^H*[alpha; x] = [beta; 0], H^H*H = I for complex flavors,
where alpha and beta are scalars (with beta real and non-negative for all flavors), and x is an (n-1)-element real/complex vector. H is represented in the form
H = I - tau*[1; v]*[1 v^T] for real flavors and
H = I - tau*[1; v]*[1 v^H] for complex flavors,
where tau is a real/complex scalar and v is a real/complex (n-1)-element vector. Note that for clarfgp/zlarfgp, H is not Hermitian.
If the elements of x are all zero (and, for complex flavors, alpha is real), then tau = 0 and H is taken to be the unit matrix.
Otherwise, 1 ≤ tau ≤ 2 (for real flavors), or 1 ≤ Re(tau) ≤ 2 and abs(tau-1) ≤ 1 (for complex flavors).

Input Parameters
n INTEGER. The order of the elementary reflector.
alpha REAL for slarfgp; DOUBLE PRECISION for dlarfgp; COMPLEX for clarfgp; DOUBLE COMPLEX for zlarfgp. On entry, the value alpha.
x REAL for slarfgp; DOUBLE PRECISION for dlarfgp; COMPLEX for clarfgp; DOUBLE COMPLEX for zlarfgp. Array, DIMENSION (1+(n-2)*abs(incx)). On entry, the vector x.
incx INTEGER. The increment between elements of x. incx > 0.

Output Parameters
alpha On exit, it is overwritten with the value beta.
x On exit, it is overwritten with the vector v.
tau REAL for slarfgp; DOUBLE PRECISION for dlarfgp; COMPLEX for clarfgp; DOUBLE COMPLEX for zlarfgp. The value tau.

?larft
Forms the triangular factor T of a block reflector H = I - V*T*V^H.
Syntax
call slarft( direct, storev, n, k, v, ldv, tau, t, ldt )
call dlarft( direct, storev, n, k, v, ldv, tau, t, ldt )
call clarft( direct, storev, n, k, v, ldv, tau, t, ldt )
call zlarft( direct, storev, n, k, v, ldv, tau, t, ldt )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine ?larft forms the triangular factor T of a real/complex block reflector H of order n, which is defined as a product of k elementary reflectors.
If direct = 'F', H = H(1)*H(2)*...*H(k) and T is upper triangular;
if direct = 'B', H = H(k)*...*H(2)*H(1) and T is lower triangular.
If storev = 'C', the vector which defines the elementary reflector H(i) is stored in the i-th column of the array v, and H = I - V*T*V^T (for real flavors) or H = I - V*T*V^H (for complex flavors).
If storev = 'R', the vector which defines the elementary reflector H(i) is stored in the i-th row of the array v, and H = I - V^T*T*V (for real flavors) or H = I - V^H*T*V (for complex flavors).

Input Parameters
direct CHARACTER*1. Specifies the order in which the elementary reflectors are multiplied to form the block reflector:
= 'F': H = H(1)*H(2)*...*H(k) (forward)
= 'B': H = H(k)*...*H(2)*H(1) (backward)
storev CHARACTER*1. Specifies how the vectors which define the elementary reflectors are stored (see also Application Notes below):
= 'C': column-wise
= 'R': row-wise.
n INTEGER. The order of the block reflector H. n ≥ 0.
k INTEGER. The order of the triangular factor T (equal to the number of elementary reflectors). k ≥ 1.
v REAL for slarft; DOUBLE PRECISION for dlarft; COMPLEX for clarft; DOUBLE COMPLEX for zlarft. Array, DIMENSION (ldv, k) if storev = 'C' or (ldv, n) if storev = 'R'. The matrix V.
ldv INTEGER. The leading dimension of the array v. If storev = 'C', ldv ≥ max(1,n); if storev = 'R', ldv ≥ k.
tau REAL for slarft; DOUBLE PRECISION for dlarft; COMPLEX for clarft; DOUBLE COMPLEX for zlarft. Array, DIMENSION (k).
    tau(i) must contain the scalar factor of the elementary reflector H(i).
ldt
    INTEGER. The leading dimension of the output array t. ldt ≥ k.

Output Parameters
t
    REAL for slarft
    DOUBLE PRECISION for dlarft
    COMPLEX for clarft
    DOUBLE COMPLEX for zlarft
    Array, DIMENSION (ldt,k). The k-by-k triangular factor T of the block reflector. If direct = 'F', T is upper triangular; if direct = 'B', T is lower triangular. The rest of the array is not used.
v
    The matrix V.

Application Notes
The shape of the matrix V and the storage of the vectors which define the H(i) is best illustrated by the following example with n = 5 and k = 3. The elements equal to 1 are not stored; the corresponding array elements are modified but restored on exit. The rest of the array is not used.

?larfx
Applies an elementary reflector to a general rectangular matrix, with loop unrolling when the reflector has order less than or equal to 10.

Syntax
call slarfx( side, m, n, v, tau, c, ldc, work )
call dlarfx( side, m, n, v, tau, c, ldc, work )
call clarfx( side, m, n, v, tau, c, ldc, work )
call zlarfx( side, m, n, v, tau, c, ldc, work )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine ?larfx applies a real/complex elementary reflector H to a real/complex m-by-n matrix C, from either the left or the right.
H is represented in the following forms:
• H = I - tau*v*v^T, where tau is a real scalar and v is a real vector.
• H = I - tau*v*v^H, where tau is a complex scalar and v is a complex vector.
If tau = 0, then H is taken to be the unit matrix.

Input Parameters
side
    CHARACTER*1.
    If side = 'L': form H*C
    If side = 'R': form C*H.
m
    INTEGER. The number of rows of the matrix C.
n
    INTEGER. The number of columns of the matrix C.
v
    REAL for slarfx
    DOUBLE PRECISION for dlarfx
    COMPLEX for clarfx
    DOUBLE COMPLEX for zlarfx
    Array, DIMENSION (m) if side = 'L' or (n) if side = 'R'.
    The vector v in the representation of H.
tau
    REAL for slarfx
    DOUBLE PRECISION for dlarfx
    COMPLEX for clarfx
    DOUBLE COMPLEX for zlarfx
    The value tau in the representation of H.
c
    REAL for slarfx
    DOUBLE PRECISION for dlarfx
    COMPLEX for clarfx
    DOUBLE COMPLEX for zlarfx
    Array, DIMENSION (ldc,n). On entry, the m-by-n matrix C.
ldc
    INTEGER. The leading dimension of the array c. ldc ≥ max(1,m).
work
    REAL for slarfx
    DOUBLE PRECISION for dlarfx
    COMPLEX for clarfx
    DOUBLE COMPLEX for zlarfx
    Workspace array, DIMENSION (n) if side = 'L' or (m) if side = 'R'.
    work is not referenced if H has order < 11.

Output Parameters
c
    On exit, C is overwritten by the matrix H*C if side = 'L', or C*H if side = 'R'.

?largv
Generates a vector of plane rotations with real cosines and real/complex sines.

Syntax
call slargv( n, x, incx, y, incy, c, incc )
call dlargv( n, x, incx, y, incy, c, incc )
call clargv( n, x, incx, y, incy, c, incc )
call zlargv( n, x, incx, y, incy, c, incc )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine generates a vector of real/complex plane rotations with real cosines, determined by elements of the real/complex vectors x and y.
For slargv/dlargv:
    [  c(i)  s(i) ] [ x(i) ]   [ a(i) ]
    [ -s(i)  c(i) ] [ y(i) ] = [  0   ]
For clargv/zlargv:
    [  c(i)         s(i) ] [ x(i) ]   [ r(i) ]
    [ -conjg(s(i))  c(i) ] [ y(i) ] = [  0   ]
where c(i)^2 + abs(s(i))^2 = 1 and the following conventions are used (these are the same as in clartg/zlartg but differ from the BLAS Level 1 routine crotg/zrotg):
If y(i) = 0, then c(i) = 1 and s(i) = 0;
If x(i) = 0, then c(i) = 0 and s(i) is chosen so that r(i) is real.

Input Parameters
n
    INTEGER. The number of plane rotations to be generated.
x, y
    REAL for slargv
    DOUBLE PRECISION for dlargv
    COMPLEX for clargv
    DOUBLE COMPLEX for zlargv
    Arrays, DIMENSION (1+(n-1)*incx) and (1+(n-1)*incy), respectively. On entry, the vectors x and y.
incx
    INTEGER. The increment between elements of x. incx > 0.
incy
    INTEGER. The increment between elements of y. incy > 0.
incc
    INTEGER.
    The increment between elements of the output array c. incc > 0.

Output Parameters
x
    On exit, x(i) is overwritten by a(i) (for real flavors), or by r(i) (for complex flavors), for i = 1,...,n.
y
    On exit, the sines s(i) of the plane rotations.
c
    REAL for slargv/clargv
    DOUBLE PRECISION for dlargv/zlargv
    Array, DIMENSION (1+(n-1)*incc). The cosines of the plane rotations.

?larnv
Returns a vector of random numbers from a uniform or normal distribution.

Syntax
call slarnv( idist, iseed, n, x )
call dlarnv( idist, iseed, n, x )
call clarnv( idist, iseed, n, x )
call zlarnv( idist, iseed, n, x )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine ?larnv returns a vector of n random real/complex numbers from a uniform or normal distribution.
This routine calls the auxiliary routine ?laruv to generate random real numbers from a uniform (0,1) distribution, in batches of up to 128 using vectorisable code. The Box-Muller method is used to transform numbers from a uniform to a normal distribution.

Input Parameters
idist
    INTEGER. Specifies the distribution of the random numbers:
    for slarnv and dlarnv:
    = 1: uniform (0,1)
    = 2: uniform (-1,1)
    = 3: normal (0,1).
    for clarnv and zlarnv:
    = 1: real and imaginary parts each uniform (0,1)
    = 2: real and imaginary parts each uniform (-1,1)
    = 3: real and imaginary parts each normal (0,1)
    = 4: uniformly distributed on the disc abs(z) < 1
    = 5: uniformly distributed on the circle abs(z) = 1
iseed
    INTEGER. Array, DIMENSION (4).
    On entry, the seed of the random number generator; the array elements must be between 0 and 4095, and iseed(4) must be odd.
n
    INTEGER. The number of random numbers to be generated.

Output Parameters
x
    REAL for slarnv
    DOUBLE PRECISION for dlarnv
    COMPLEX for clarnv
    DOUBLE COMPLEX for zlarnv
    Array, DIMENSION (n). The generated random numbers.
iseed
    On exit, the seed is updated.

?larra
Computes the splitting points with the specified threshold.
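As an informal illustration of what ?larra computes, the sketch below scans the off-diagonal entries of a symmetric tridiagonal matrix, zeroes the "small" ones, and records the splitting points. The two thresholds (an absolute test scaled by the matrix norm when spltol < 0, and a relative test against neighbouring diagonal entries when spltol > 0) are modeled on the reference LAPACK routine and should be read as an assumption about MKL's behavior. The function name split_tridiagonal is hypothetical, and e is taken to hold only the n-1 off-diagonal entries.

```python
import math


def split_tridiagonal(d, e, spltol, tnrm):
    """Return (e_out, isplit) for diagonal d and off-diagonal e.

    isplit holds 1-based block end indices; its last entry is always n,
    so len(isplit) plays the role of nsplit.
    """
    n = len(d)
    e_out = list(e)
    isplit = []
    for i in range(n - 1):
        if spltol < 0:
            # Criterion based on the absolute off-diagonal value.
            small = abs(e[i]) <= abs(spltol) * tnrm
        else:
            # Criterion intended to preserve relative accuracy.
            small = abs(e[i]) <= (spltol * math.sqrt(abs(d[i]))
                                  * math.sqrt(abs(d[i + 1])))
        if small:
            e_out[i] = 0.0
            isplit.append(i + 1)
    isplit.append(n)
    return e_out, isplit


d = [4.0, 3.0, 2.0, 1.0]
e = [1.0, 1e-20, 0.5]

e_rel, isplit_rel = split_tridiagonal(d, e, spltol=1e-12, tnrm=4.0)
e_abs, isplit_abs = split_tridiagonal(d, e, spltol=-1e-12, tnrm=4.0)
print(isplit_rel, isplit_abs)
```

In this example the matrix splits into two blocks, rows/columns 1 to 2 and 3 to 4, under either criterion.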
Syntax
call slarra( n, d, e, e2, spltol, tnrm, nsplit, isplit, info )
call dlarra( n, d, e, e2, spltol, tnrm, nsplit, isplit, info )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine computes the splitting points with the specified threshold and sets any "small" off-diagonal elements to zero.

Input Parameters
n
    INTEGER. The order of the matrix (n > 1).
d
    REAL for slarra
    DOUBLE PRECISION for dlarra
    Array, DIMENSION (n). Contains n diagonal elements of the tridiagonal matrix T.
e
    REAL for slarra
    DOUBLE PRECISION for dlarra
    Array, DIMENSION (n). First (n-1) entries contain the subdiagonal elements of the tridiagonal matrix T; e(n) need not be set.
e2
    REAL for slarra
    DOUBLE PRECISION for dlarra
    Array, DIMENSION (n). First (n-1) entries contain the squares of the subdiagonal elements of the tridiagonal matrix T; e2(n) need not be set.
spltol
    REAL for slarra
    DOUBLE PRECISION for dlarra
    The threshold for splitting. Two criteria can be used:
    spltol < 0: criterion based on absolute off-diagonal value;
    spltol > 0: criterion that preserves relative accuracy.
tnrm
    REAL for slarra
    DOUBLE PRECISION for dlarra
    The norm of the matrix.

Output Parameters
e
    On exit, the entries e(isplit(i)), 1 ≤ i ≤ nsplit, are set to zero; the other entries of e are untouched.
e2
    On exit, the entries e2(isplit(i)), 1 ≤ i ≤ nsplit, are set to zero.
nsplit
    INTEGER. The number of blocks the matrix T splits into. 1 ≤ nsplit ≤ n.
isplit
    INTEGER. Array, DIMENSION (n).
    The splitting points, at which T breaks up into blocks. The first block consists of rows/columns 1 to isplit(1), the second of rows/columns isplit(1)+1 through isplit(2), and so on, and the nsplit-th consists of rows/columns isplit(nsplit-1)+1 through isplit(nsplit)=n.
info
    INTEGER.
    = 0: successful exit.

?larrb
Provides limited bisection to locate eigenvalues for more accuracy.
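The bisection performed by ?larrb relies on a "negcount": the number of eigenvalues of L*D*L^T that lie below a trial point. A minimal sketch of one way to obtain it, using the stationary factorization L*D*L^T - sigma*I = L(+)*D(+)*L(+)^T and counting the negative entries of D(+), is shown here. It is not the MKL code: it omits the pivmin safeguard and the twisted variants, and the function name negcount_ldl is hypothetical. The input lld holds the n-1 products L(i)*L(i)*D(i).

```python
def negcount_ldl(d, lld, sigma):
    """Count the eigenvalues of L*D*L^T that are smaller than sigma.

    d   : the n diagonal entries of D
    lld : the n-1 values L(i)*L(i)*D(i)
    Assumes no pivot d_plus becomes exactly zero for the given sigma.
    """
    count = 0
    s = -sigma
    for i in range(len(d) - 1):
        d_plus = d[i] + s              # i-th diagonal entry of D(+)
        if d_plus < 0:
            count += 1
        s = s * (lld[i] / d_plus) - sigma
    if d[-1] + s < 0:                  # last diagonal entry of D(+)
        count += 1
    return count


# L has subdiagonal (0.5, -0.25) and D = diag(2, 1, 3).
d = [2.0, 1.0, 3.0]
l = [0.5, -0.25]
lld = [l[i] * l[i] * d[i] for i in range(2)]

counts = [negcount_ldl(d, lld, sigma) for sigma in (0.5, 1.0, 2.9, 4.0)]
print(counts)
```

Because the count is non-decreasing in sigma, an interval [left, right] whose endpoint counts differ by one isolates a single eigenvalue, which is what makes interval halving possible.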
Syntax
call slarrb( n, d, lld, ifirst, ilast, rtol1, rtol2, offset, w, wgap, werr, work, iwork, pivmin, spdiam, twist, info )
call dlarrb( n, d, lld, ifirst, ilast, rtol1, rtol2, offset, w, wgap, werr, work, iwork, pivmin, spdiam, twist, info )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
Given the relatively robust representation (RRR) L*D*L^T, the routine does "limited" bisection to refine the eigenvalues of L*D*L^T, w( ifirst-offset ) through w( ilast-offset ), to more accuracy. Initial guesses for these eigenvalues are input in w. The corresponding estimates of the error in these guesses and their gaps are input in werr and wgap, respectively. During bisection, intervals [left, right] are maintained by storing their mid-points and semi-widths in the arrays w and werr respectively.

Input Parameters
n
    INTEGER. The order of the matrix.
d
    REAL for slarrb
    DOUBLE PRECISION for dlarrb
    Array, DIMENSION (n). The n diagonal elements of the diagonal matrix D.
lld
    REAL for slarrb
    DOUBLE PRECISION for dlarrb
    Array, DIMENSION (n-1). The n-1 elements L(i)*L(i)*D(i).
ifirst
    INTEGER. The index of the first eigenvalue to be computed.
ilast
    INTEGER. The index of the last eigenvalue to be computed.
rtol1, rtol2
    REAL for slarrb
    DOUBLE PRECISION for dlarrb
    Tolerance for the convergence of the bisection intervals. An interval [left, right] has converged if RIGHT-LEFT.LT.MAX( rtol1*gap, rtol2*max(|left|,|right|) ), where gap is the (estimated) distance to the nearest eigenvalue.
offset
    INTEGER. Offset for the arrays w, wgap and werr, that is, the ifirst-offset through ilast-offset elements of these arrays are to be used.
w
    REAL for slarrb
    DOUBLE PRECISION for dlarrb
    Array, DIMENSION (n). On input, w( ifirst-offset ) through w( ilast-offset ) are estimates of the eigenvalues of L*D*L^T indexed ifirst through ilast.
wgap
    REAL for slarrb
    DOUBLE PRECISION for dlarrb
    Array, DIMENSION (n-1).
    The estimated gaps between consecutive eigenvalues of L*D*L^T, that is, wgap(i-offset) is the gap between eigenvalues i and i+1. Note that if IFIRST.EQ.ILAST then wgap(ifirst-offset) must be set to 0.
werr
    REAL for slarrb
    DOUBLE PRECISION for dlarrb
    Array, DIMENSION (n). On input, werr(ifirst-offset) through werr(ilast-offset) are the errors in the estimates of the corresponding elements in w.
work
    REAL for slarrb
    DOUBLE PRECISION for dlarrb
    Workspace array, DIMENSION (2*n).
iwork
    INTEGER. Workspace array, DIMENSION (2*n).
pivmin
    REAL for slarrb
    DOUBLE PRECISION for dlarrb
    The minimum pivot in the Sturm sequence.
spdiam
    REAL for slarrb
    DOUBLE PRECISION for dlarrb
    The spectral diameter of the matrix.
twist
    INTEGER. The twist index for the twisted factorization that is used for the negcount.
    twist = n: Compute negcount from L*D*L^T - lambda*I = L(+)*D(+)*L(+)^T
    twist = 1: Compute negcount from L*D*L^T - lambda*I = U(-)*D(-)*U(-)^T
    twist = r, 1 < r < n: Compute negcount from L*D*L^T - lambda*I = N(r)*D(r)*N(r)^T

Output Parameters
w
    On output, the estimates of the eigenvalues are "refined".
wgap
    On output, the gaps are refined.
werr
    On output, "refined" errors in the estimates of w.
info
    INTEGER. Error flag.

?larrc
Computes the number of eigenvalues of the symmetric tridiagonal matrix.

Syntax
call slarrc( jobt, n, vl, vu, d, e, pivmin, eigcnt, lcnt, rcnt, info )
call dlarrc( jobt, n, vl, vu, d, e, pivmin, eigcnt, lcnt, rcnt, info )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine finds the number of eigenvalues of the symmetric tridiagonal matrix T or of its factorization L*D*L^T in the specified interval.

Input Parameters
jobt
    CHARACTER*1.
    = 'T': computes Sturm count for matrix T.
    = 'L': computes Sturm count for matrix L*D*L^T.
n
    INTEGER. The order of the matrix (n > 1).
vl, vu
    REAL for slarrc
    DOUBLE PRECISION for dlarrc
    The lower and upper bounds for the eigenvalues.
d
    REAL for slarrc
    DOUBLE PRECISION for dlarrc
    Array, DIMENSION (n).
    If jobt = 'T': contains the n diagonal elements of the tridiagonal matrix T.
    If jobt = 'L': contains the n diagonal elements of the diagonal matrix D.
e
    REAL for slarrc
    DOUBLE PRECISION for dlarrc
    Array, DIMENSION (n).
    If jobt = 'T': contains the (n-1) off-diagonal elements of the matrix T.
    If jobt = 'L': contains the (n-1) off-diagonal elements of the matrix L.
pivmin
    REAL for slarrc
    DOUBLE PRECISION for dlarrc
    The minimum pivot in the Sturm sequence for the matrix T.

Output Parameters
eigcnt
    INTEGER. The number of eigenvalues of the symmetric tridiagonal matrix T that are in the half-open interval (vl,vu].
lcnt, rcnt
    INTEGER. The left and right negcounts of the interval.
info
    INTEGER. Currently not used; always set to 0.

?larrd
Computes the eigenvalues of a symmetric tridiagonal matrix to suitable accuracy.

Syntax
call slarrd( range, order, n, vl, vu, il, iu, gers, reltol, d, e, e2, pivmin, nsplit, isplit, m, w, werr, wl, wu, iblock, indexw, work, iwork, info )
call dlarrd( range, order, n, vl, vu, il, iu, gers, reltol, d, e, e2, pivmin, nsplit, isplit, m, w, werr, wl, wu, iblock, indexw, work, iwork, info )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine computes the eigenvalues of a symmetric tridiagonal matrix T to suitable accuracy. This is an auxiliary code to be called from ?stemr. The user may ask for all eigenvalues, all eigenvalues in the half-open interval (vl, vu], or the il-th through iu-th eigenvalues.
To avoid overflow, the matrix must be scaled so that its largest element is no greater than overflow^(1/2) * underflow^(1/4) in absolute value, and for greatest accuracy, it should not be much smaller than that. (For more details see [Kahan66].)

Input Parameters
range
    CHARACTER.
    = 'A': ("All") all eigenvalues will be found.
    = 'V': ("Value") all eigenvalues in the half-open interval (vl, vu] will be found.
    = 'I': ("Index") the il-th through iu-th eigenvalues will be found.
order
    CHARACTER.
    = 'B': ("By block") the eigenvalues will be grouped by split-off block (see iblock, isplit below) and ordered from smallest to largest within the block.
    = 'E': ("Entire matrix") the eigenvalues for the entire matrix will be ordered from smallest to largest.
n
    INTEGER. The order of the tridiagonal matrix T (n ≥ 1).
vl, vu
    REAL for slarrd
    DOUBLE PRECISION for dlarrd
    If range = 'V': the lower and upper bounds of the interval to be searched for eigenvalues. Eigenvalues less than or equal to vl, or greater than vu, will not be returned. vl < vu.
    If range = 'A' or 'I': not referenced.
il, iu
    INTEGER.
    If range = 'I': the indices (in ascending order) of the smallest and largest eigenvalues to be returned. 1 ≤ il ≤ iu ≤ n, if n > 0; il=1 and iu=0 if n=0.
    If range = 'A' or 'V': not referenced.
gers
    REAL for slarrd
    DOUBLE PRECISION for dlarrd
    Array, DIMENSION (2*n). The n Gerschgorin intervals (the i-th Gerschgorin interval is (gers(2*i-1), gers(2*i))).
reltol
    REAL for slarrd
    DOUBLE PRECISION for dlarrd
    The minimum relative width of an interval. When an interval is narrower than reltol times the larger (in magnitude) endpoint, it is considered to be sufficiently small, that is, converged.
    Note: this should always be at least radix*machine epsilon.
d
    REAL for slarrd
    DOUBLE PRECISION for dlarrd
    Array, DIMENSION (n). Contains n diagonal elements of the tridiagonal matrix T.
e
    REAL for slarrd
    DOUBLE PRECISION for dlarrd
    Array, DIMENSION (n-1). Contains (n-1) off-diagonal elements of the tridiagonal matrix T.
e2
    REAL for slarrd
    DOUBLE PRECISION for dlarrd
    Array, DIMENSION (n-1). Contains (n-1) squared off-diagonal elements of the tridiagonal matrix T.
pivmin
    REAL for slarrd
    DOUBLE PRECISION for dlarrd
    The minimum pivot in the Sturm sequence for the matrix T.
nsplit
    INTEGER. The number of diagonal blocks in the matrix T.
    1 ≤ nsplit ≤ n
isplit
    INTEGER. Array, DIMENSION (n).
    The splitting points, at which T breaks up into submatrices. The first submatrix consists of rows/columns 1 to isplit(1), the second of rows/columns isplit(1)+1 through isplit(2), and so on, and the nsplit-th consists of rows/columns isplit(nsplit-1)+1 through isplit(nsplit)=n.
    (Only the first nsplit elements are actually used, but since the user cannot know the value of nsplit a priori, n words must be reserved for isplit.)
work
    REAL for slarrd
    DOUBLE PRECISION for dlarrd
    Workspace array, DIMENSION (4*n).
iwork
    INTEGER. Workspace array, DIMENSION (4*n).

Output Parameters
m
    INTEGER. The actual number of eigenvalues found. 0 ≤ m ≤ n. (See also the description of info=2,3.)
w
    REAL for slarrd
    DOUBLE PRECISION for dlarrd
    Array, DIMENSION (n).
    The first m elements of w contain the eigenvalue approximations. ?larrd computes an interval I(j) = (a(j), b(j)] that includes eigenvalue j. The eigenvalue approximation is given as the interval midpoint w(j) = (a(j)+b(j))/2. The corresponding error is bounded by werr(j) = abs(a(j)-b(j))/2.
werr
    REAL for slarrd
    DOUBLE PRECISION for dlarrd
    Array, DIMENSION (n). The error bound on the corresponding eigenvalue approximation in w.
wl, wu
    REAL for slarrd
    DOUBLE PRECISION for dlarrd
    The interval (wl, wu] contains all the wanted eigenvalues.
    If range = 'V': then wl=vl and wu=vu.
    If range = 'A': then wl and wu are the global Gerschgorin bounds on the spectrum.
    If range = 'I': then wl and wu are computed by ?laebz from the index range specified.
iblock
    INTEGER. Array, DIMENSION (n).
    At each row/column j where e(j) is zero or small, the matrix T is considered to split into a block diagonal matrix. If info = 0, then iblock(i) specifies to which block (from 1 to the number of blocks) the eigenvalue w(i) belongs. (The routine may use the remaining n-m elements as workspace.)
indexw
    INTEGER. Array, DIMENSION (n).
    The indices of the eigenvalues within each block (submatrix); for example, indexw(i) = j and iblock(i) = k imply that the i-th eigenvalue w(i) is the j-th eigenvalue in block k.
info
    INTEGER.
    = 0: successful exit.
    < 0: if info = -i, the i-th argument has an illegal value.
    > 0: some or all of the eigenvalues fail to converge or are not computed:
    = 1 or 3: bisection failed to converge for some eigenvalues; these eigenvalues are flagged by a negative block number. The effect is that the eigenvalues may not be as accurate as the absolute and relative tolerances.
    = 2 or 3: range='I' only: not all of the eigenvalues il:iu are found.
    = 4: range='I', and the Gerschgorin interval initially used is too small. No eigenvalues are computed.

?larre
Given the tridiagonal matrix T, sets small off-diagonal elements to zero and for each unreduced block T(i), finds base representations and eigenvalues.

Syntax
call slarre( range, n, vl, vu, il, iu, d, e, e2, rtol1, rtol2, spltol, nsplit, isplit, m, w, werr, wgap, iblock, indexw, gers, pivmin, work, iwork, info )
call dlarre( range, n, vl, vu, il, iu, d, e, e2, rtol1, rtol2, spltol, nsplit, isplit, m, w, werr, wgap, iblock, indexw, gers, pivmin, work, iwork, info )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
To find the desired eigenvalues of a given real symmetric tridiagonal matrix T, the routine sets any "small" off-diagonal elements to zero, and for each unreduced block T(i), it finds
• a suitable shift at one end of the block spectrum
• the base representation, T(i) - sigma(i)*I = L(i)*D(i)*L(i)^T, and
• eigenvalues of each L(i)*D(i)*L(i)^T.
The representations and eigenvalues found are then used by ?stemr to compute the eigenvectors of a symmetric tridiagonal matrix. The accuracy varies depending on whether bisection is used to find a few eigenvalues or the dqds algorithm (subroutine ?lasq2) is used to compute all eigenvalues and discard any unwanted ones.
As an added benefit, ?larre also outputs the n Gerschgorin intervals for the matrices L(i)*D(i)*L(i)^T.

Input Parameters
range
    CHARACTER.
    = 'A': ("All") all eigenvalues will be found.
    = 'V': ("Value") all eigenvalues in the half-open interval (vl, vu] will be found.
    = 'I': ("Index") the il-th through iu-th eigenvalues of the entire matrix will be found.
n
    INTEGER. The order of the matrix. n > 0.
vl, vu
    REAL for slarre
    DOUBLE PRECISION for dlarre
    If range='V', the lower and upper bounds for the eigenvalues. Eigenvalues less than or equal to vl, or greater than vu, are not returned. vl < vu.
il, iu
    INTEGER.
    If range='I', the indices (in ascending order) of the smallest and largest eigenvalues to be returned. 1 ≤ il ≤ iu ≤ n.
d
    REAL for slarre
    DOUBLE PRECISION for dlarre
    Array, DIMENSION (n). The n diagonal elements of the tridiagonal matrix T.
e
    REAL for slarre
    DOUBLE PRECISION for dlarre
    Array, DIMENSION (n). The first (n-1) entries contain the subdiagonal elements of the tridiagonal matrix T; e(n) need not be set.
e2
    REAL for slarre
    DOUBLE PRECISION for dlarre
    Array, DIMENSION (n). The first (n-1) entries contain the squares of the subdiagonal elements of the tridiagonal matrix T; e2(n) need not be set.
rtol1, rtol2
    REAL for slarre
    DOUBLE PRECISION for dlarre
    Parameters for bisection. An interval [LEFT,RIGHT] has converged if RIGHT-LEFT.LT.MAX( rtol1*gap, rtol2*max(|LEFT|,|RIGHT|) ).
spltol
    REAL for slarre
    DOUBLE PRECISION for dlarre
    The threshold for splitting.
work
    REAL for slarre
    DOUBLE PRECISION for dlarre
    Workspace array, DIMENSION (6*n).
iwork
    INTEGER. Workspace array, DIMENSION (5*n).

Output Parameters
vl, vu
    On exit, if range='I' or ='A', contain the bounds on the desired part of the spectrum.
d
    On exit, the n diagonal elements of the diagonal matrices D(i).
e
    On exit, the subdiagonal elements of the unit bidiagonal matrices L(i). The entries e( isplit( i) ), 1 ≤ i ≤ nsplit, contain the base points sigma(i) on output.
e2
    On exit, the entries e2( isplit( i) ), 1 ≤ i ≤ nsplit, have been set to zero.
nsplit
    INTEGER. The number of blocks T splits into. 1 ≤ nsplit ≤ n.
isplit
    INTEGER. Array, DIMENSION (n).
    The splitting points, at which T breaks up into blocks. The first block consists of rows/columns 1 to isplit(1), the second of rows/columns isplit(1)+1 through isplit(2), etc., and the nsplit-th consists of rows/columns isplit(nsplit-1)+1 through isplit(nsplit)=n.
m
    INTEGER. The total number of eigenvalues (of all the L(i)*D(i)*L(i)^T) found.
w
    REAL for slarre
    DOUBLE PRECISION for dlarre
    Array, DIMENSION (n). The first m elements contain the eigenvalues. The eigenvalues of each of the blocks, L(i)*D(i)*L(i)^T, are sorted in ascending order. The routine may use the remaining n-m elements as workspace.
werr
    REAL for slarre
    DOUBLE PRECISION for dlarre
    Array, DIMENSION (n). The error bound on the corresponding eigenvalue in w.
wgap
    REAL for slarre
    DOUBLE PRECISION for dlarre
    Array, DIMENSION (n). The separation from the right neighbor eigenvalue in w. The gap is only with respect to the eigenvalues of the same block, as each block has its own representation tree. Exception: at the right end of a block the left gap is stored.
iblock
    INTEGER. Array, DIMENSION (n).
    The indices of the blocks (submatrices) associated with the corresponding eigenvalues in w; iblock(i)=1 if eigenvalue w(i) belongs to the first block from the top, =2 if w(i) belongs to the second block, etc.
indexw
    INTEGER. Array, DIMENSION (n).
    The indices of the eigenvalues within each block (submatrix); for example, indexw(i) = 10 and iblock(i) = 2 imply that the i-th eigenvalue w(i) is the 10th eigenvalue in the second block.
gers
    REAL for slarre
    DOUBLE PRECISION for dlarre
    Array, DIMENSION (2*n). The n Gerschgorin intervals (the i-th Gerschgorin interval is (gers(2*i-1), gers(2*i))).
pivmin
    REAL for slarre
    DOUBLE PRECISION for dlarre
    The minimum pivot in the Sturm sequence for T.
info
    INTEGER.
    If info = 0: successful exit.
    If info > 0: a problem occurred in ?larre. If info = 5, the Rayleigh Quotient Iteration failed to converge to full accuracy.
    If info < 0: one of the called subroutines signaled an internal problem. Inspection of the corresponding parameter info for further information is required.
    • If info = -1, there is a problem in ?larrd.
    • If info = -2, no base representation could be found in maxtry iterations. Increasing maxtry and recompilation might be a remedy.
    • If info = -3, there is a problem in ?larrb when computing the refined root representation for ?lasq2.
    • If info = -4, there is a problem in ?larrb when performing bisection on the desired part of the spectrum.
    • If info = -5, there is a problem in ?lasq2.
    • If info = -6, there is a problem in ?lasq2.

See Also
?stemr
?lasq2
?larrb
?larrd

?larrf
Finds a new relatively robust representation such that at least one of the eigenvalues is relatively isolated.

Syntax
call slarrf( n, d, l, ld, clstrt, clend, w, wgap, werr, spdiam, clgapl, clgapr, pivmin, sigma, dplus, lplus, work, info )
call dlarrf( n, d, l, ld, clstrt, clend, w, wgap, werr, spdiam, clgapl, clgapr, pivmin, sigma, dplus, lplus, work, info )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
Given the initial representation L*D*L^T and its cluster of close eigenvalues (in a relative measure), w(clstrt), w(clstrt+1), ... w(clend), the routine ?larrf finds a new relatively robust representation
    L*D*L^T - sigma*I = L(+)*D(+)*L(+)^T
such that at least one of the eigenvalues of L(+)*D(+)*L(+)^T is relatively isolated.

Input Parameters
n
    INTEGER. The order of the matrix (subblock, if the matrix is split).
d
    REAL for slarrf
    DOUBLE PRECISION for dlarrf
    Array, DIMENSION (n). The n diagonal elements of the diagonal matrix D.
l
    REAL for slarrf
    DOUBLE PRECISION for dlarrf
    Array, DIMENSION (n-1).
    The (n-1) subdiagonal elements of the unit bidiagonal matrix L.
ld
    REAL for slarrf
    DOUBLE PRECISION for dlarrf
    Array, DIMENSION (n-1).
    The n-1 elements L(i)*D(i).
clstrt
    INTEGER. The index of the first eigenvalue in the cluster.
clend
    INTEGER. The index of the last eigenvalue in the cluster.
w
    REAL for slarrf
    DOUBLE PRECISION for dlarrf
    Array, DIMENSION ≥ (clend-clstrt+1). The eigenvalue approximations of L*D*L^T in ascending order. w(clstrt) through w(clend) form the cluster of relatively close eigenvalues.
wgap
    REAL for slarrf
    DOUBLE PRECISION for dlarrf
    Array, DIMENSION ≥ (clend-clstrt+1). The separation from the right neighbor eigenvalue in w.
werr
    REAL for slarrf
    DOUBLE PRECISION for dlarrf
    Array, DIMENSION ≥ (clend-clstrt+1). On input, werr contains the semi-width of the uncertainty interval of the corresponding eigenvalue approximation in w.
spdiam
    REAL for slarrf
    DOUBLE PRECISION for dlarrf
    Estimate of the spectral diameter obtained from the Gerschgorin intervals.
clgapl, clgapr
    REAL for slarrf
    DOUBLE PRECISION for dlarrf
    Absolute gap on each end of the cluster. Set by the calling routine to protect against shifts too close to eigenvalues outside the cluster.
pivmin
    REAL for slarrf
    DOUBLE PRECISION for dlarrf
    The minimum pivot allowed in the Sturm sequence.
work
    REAL for slarrf
    DOUBLE PRECISION for dlarrf
    Workspace array, DIMENSION (2*n).

Output Parameters
wgap
    On output, the gaps are refined.
sigma
    REAL for slarrf
    DOUBLE PRECISION for dlarrf
    The shift used to form L(+)*D(+)*L(+)^T.
dplus
    REAL for slarrf
    DOUBLE PRECISION for dlarrf
    Array, DIMENSION (n). The n diagonal elements of the diagonal matrix D(+).
lplus
    REAL for slarrf
    DOUBLE PRECISION for dlarrf
    Array, DIMENSION (n). The first (n-1) elements of lplus contain the subdiagonal elements of the unit bidiagonal matrix L(+).

?larrj
Performs refinement of the initial estimates of the eigenvalues of the matrix T.
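The kind of refinement ?larrj performs can be pictured as interval bisection driven by a Sturm count on T. The sketch below is a simplified model under stated assumptions, not the MKL algorithm: it treats one eigenvalue at a time, assumes the starting interval [w - werr, w + werr] already brackets the requested eigenvalue, and uses hypothetical names (sturm_count, refine_eigenvalue, max_iter) together with an illustrative tiny default for pivmin.

```python
def sturm_count(d, e2, x, pivmin=1e-300):
    """Number of eigenvalues of the symmetric tridiagonal T that are < x.

    d  : the n diagonal entries of T
    e2 : the n-1 squared off-diagonal entries of T
    """
    count = 0
    q = d[0] - x
    if abs(q) < pivmin:
        q = -pivmin                    # guard against a zero pivot
    if q < 0:
        count += 1
    for i in range(1, len(d)):
        q = d[i] - x - e2[i - 1] / q
        if abs(q) < pivmin:
            q = -pivmin
        if q < 0:
            count += 1
    return count


def refine_eigenvalue(d, e2, index, w, werr, rtol, max_iter=200):
    """Bisect [w - werr, w + werr] for the eigenvalue with 1-based index.

    Returns the midpoint and semi-width of the final interval, mirroring
    how the arrays w and werr are used.
    """
    a, b = w - werr, w + werr
    for _ in range(max_iter):
        if (b - a) <= rtol * max(abs(a), abs(b)):
            break                      # interval has converged
        mid = 0.5 * (a + b)
        if sturm_count(d, e2, mid) >= index:
            b = mid                    # eigenvalue lies below mid
        else:
            a = mid                    # eigenvalue lies at or above mid
    return 0.5 * (a + b), 0.5 * (b - a)


# T = [[2, 1], [1, 2]] has eigenvalues 1 and 3.
d = [2.0, 2.0]
e2 = [1.0]

lam1, err1 = refine_eigenvalue(d, e2, index=1, w=0.9, werr=0.5, rtol=1e-12)
lam2, err2 = refine_eigenvalue(d, e2, index=2, w=3.2, werr=0.5, rtol=1e-12)
print(lam1, err1)
print(lam2, err2)
```

Storing the midpoint and semi-width rather than the two endpoints is what lets the same pair of arrays carry both the estimate and its error bound.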
Syntax
call slarrj( n, d, e2, ifirst, ilast, rtol, offset, w, werr, work, iwork, pivmin, spdiam, info )
call dlarrj( n, d, e2, ifirst, ilast, rtol, offset, w, werr, work, iwork, pivmin, spdiam, info )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
Given the initial eigenvalue approximations of T, this routine does bisection to refine the eigenvalues of T, w(ifirst-offset) through w(ilast-offset), to more accuracy. Initial guesses for these eigenvalues are input in w, and the corresponding estimates of the error in these guesses are input in werr. During bisection, intervals [a,b] are maintained by storing their mid-points and semi-widths in the arrays w and werr respectively.

Input Parameters
n
    INTEGER. The order of the matrix T.
d
    REAL for slarrj
    DOUBLE PRECISION for dlarrj
    Array, DIMENSION (n). Contains n diagonal elements of the matrix T.
e2
    REAL for slarrj
    DOUBLE PRECISION for dlarrj
    Array, DIMENSION (n-1). Contains (n-1) squared sub-diagonal elements of T.
ifirst
    INTEGER. The index of the first eigenvalue to be computed.
ilast
    INTEGER. The index of the last eigenvalue to be computed.
rtol
    REAL for slarrj
    DOUBLE PRECISION for dlarrj
    Tolerance for the convergence of the bisection intervals. An interval [a,b] is considered to be converged if (b-a) ≤ rtol*max(|a|,|b|).
offset
    INTEGER. Offset for the arrays w and werr, that is, the ifirst-offset through ilast-offset elements of these arrays are to be used.
w
    REAL for slarrj
    DOUBLE PRECISION for dlarrj
    Array, DIMENSION (n). On input, w(ifirst-offset) through w(ilast-offset) are estimates of the eigenvalues of L*D*L^T indexed ifirst through ilast.
werr
    REAL for slarrj
    DOUBLE PRECISION for dlarrj
    Array, DIMENSION (n). On input, werr(ifirst-offset) through werr(ilast-offset) are the errors in the estimates of the corresponding elements in w.
work
    REAL for slarrj
    DOUBLE PRECISION for dlarrj
    Workspace array, DIMENSION (2*n).
iwork
    INTEGER.
    Workspace array, DIMENSION (2*n).
pivmin
    REAL for slarrj
    DOUBLE PRECISION for dlarrj
    The minimum pivot in the Sturm sequence for the matrix T.
spdiam
    REAL for slarrj
    DOUBLE PRECISION for dlarrj
    The spectral diameter of the matrix T.

Output Parameters
w
    On exit, contains the refined estimates of the eigenvalues.
werr
    On exit, contains the refined errors in the estimates of the corresponding elements in w.
info
    INTEGER. Currently not used; always set to 0.

?larrk
Computes one eigenvalue of a symmetric tridiagonal matrix T to suitable accuracy.

Syntax
call slarrk( n, iw, gl, gu, d, e2, pivmin, reltol, w, werr, info )
call dlarrk( n, iw, gl, gu, d, e2, pivmin, reltol, w, werr, info )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine computes one eigenvalue of a symmetric tridiagonal matrix T to suitable accuracy. This is an auxiliary code to be called from ?stemr.
To avoid overflow, the matrix must be scaled so that its largest element is no greater than overflow^(1/2) * underflow^(1/4) in absolute value, and for greatest accuracy, it should not be much smaller than that. For more details see [Kahan66].

Input Parameters
n
    INTEGER. The order of the matrix T (n ≥ 1).
iw
    INTEGER. The index of the eigenvalue to be returned.
gl, gu
    REAL for slarrk
    DOUBLE PRECISION for dlarrk
    A lower and an upper bound on the eigenvalue.
d
    REAL for slarrk
    DOUBLE PRECISION for dlarrk
    Array, DIMENSION (n). Contains n diagonal elements of the matrix T.
e2
    REAL for slarrk
    DOUBLE PRECISION for dlarrk
    Array, DIMENSION (n-1). Contains (n-1) squared off-diagonal elements of T.
pivmin
    REAL for slarrk
    DOUBLE PRECISION for dlarrk
    The minimum pivot in the Sturm sequence for the matrix T.
reltol
    REAL for slarrk
    DOUBLE PRECISION for dlarrk
    The minimum relative width of an interval.
    When an interval is narrower than reltol times the larger (in magnitude) endpoint, it is considered to be sufficiently small, that is, converged.
    Note: this should always be at least radix*machine epsilon.

Output Parameters
w
    REAL for slarrk
    DOUBLE PRECISION for dlarrk
    Contains the eigenvalue approximation.
werr
    REAL for slarrk
    DOUBLE PRECISION for dlarrk
    Contains the error bound on the corresponding eigenvalue approximation in w.
info
    INTEGER.
    = 0: Eigenvalue converges
    = -1: Eigenvalue does not converge

?larrr
Performs tests to decide whether the symmetric tridiagonal matrix T warrants expensive computations which guarantee high relative accuracy in the eigenvalues.

Syntax
call slarrr( n, d, e, info )
call dlarrr( n, d, e, info )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine performs tests to decide whether the symmetric tridiagonal matrix T warrants expensive computations which guarantee high relative accuracy in the eigenvalues.

Input Parameters
n
    INTEGER. The order of the matrix T (n > 0).
d
    REAL for slarrr
    DOUBLE PRECISION for dlarrr
    Array, DIMENSION (n). Contains n diagonal elements of the matrix T.
e
    REAL for slarrr
    DOUBLE PRECISION for dlarrr
    Array, DIMENSION (n). The first (n-1) entries contain sub-diagonal elements of the tridiagonal matrix T; e(n) is set to 0.

Output Parameters
info
    INTEGER.
    = 0: the matrix warrants computations preserving relative accuracy (default value).
    = -1: the matrix warrants computations guaranteeing only absolute accuracy.

?larrv
Computes the eigenvectors of the tridiagonal matrix T = L*D*L^T given L, D and the eigenvalues of L*D*L^T.
Syntax
call slarrv( n, vl, vu, d, l, pivmin, isplit, m, dol, dou, minrgp, rtol1, rtol2, w, werr, wgap, iblock, indexw, gers, z, ldz, isuppz, work, iwork, info )
call dlarrv( n, vl, vu, d, l, pivmin, isplit, m, dol, dou, minrgp, rtol1, rtol2, w, werr, wgap, iblock, indexw, gers, z, ldz, isuppz, work, iwork, info )
call clarrv( n, vl, vu, d, l, pivmin, isplit, m, dol, dou, minrgp, rtol1, rtol2, w, werr, wgap, iblock, indexw, gers, z, ldz, isuppz, work, iwork, info )
call zlarrv( n, vl, vu, d, l, pivmin, isplit, m, dol, dou, minrgp, rtol1, rtol2, w, werr, wgap, iblock, indexw, gers, z, ldz, isuppz, work, iwork, info )
Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h
Description
The routine ?larrv computes the eigenvectors of the tridiagonal matrix T = L*D*L^T given L, D and approximations to the eigenvalues of L*D*L^T. The input eigenvalues should have been computed by slarre for single precision flavors (slarrv/clarrv) and by dlarre for double precision flavors (dlarrv/zlarrv).
Input Parameters
n INTEGER. The order of the matrix. n ≥ 0.
vl, vu REAL for slarrv/clarrv; DOUBLE PRECISION for dlarrv/zlarrv. Lower and upper bounds, respectively, of the interval that contains the desired eigenvalues. vl < vu. Needed to compute gaps on the left or right end of the extremal eigenvalues in the desired range.
d REAL for slarrv/clarrv; DOUBLE PRECISION for dlarrv/zlarrv. Array, DIMENSION (n). On entry, the n diagonal elements of the diagonal matrix D.
l REAL for slarrv/clarrv; DOUBLE PRECISION for dlarrv/zlarrv. Array, DIMENSION (n). On entry, the (n-1) subdiagonal elements of the unit bidiagonal matrix L are contained in elements 1 to n-1 of l if the matrix is not split. At the end of each block the corresponding shift is stored as given by slarre for single precision flavors and by dlarre for double precision flavors.
pivmin REAL for slarrv/clarrv; DOUBLE PRECISION for dlarrv/zlarrv. The minimum pivot allowed in the Sturm sequence.
isplit INTEGER.
Array, DIMENSION (n). The splitting points, at which T breaks up into blocks. The first block consists of rows/columns 1 to isplit(1), the second of rows/columns isplit(1)+1 through isplit(2), etc.
m INTEGER. The total number of eigenvalues found. 0 ≤ m ≤ n. If range = 'A', m = n, and if range = 'I', m = iu - il + 1.
dol, dou INTEGER. If you want to compute only selected eigenvectors from all the eigenvalues supplied, specify an index range dol:dou. Otherwise, set dol = 1 and dou = m. Note that dol and dou refer to the order in which the eigenvalues are stored in w. If you want to compute only selected eigenpairs, then the columns dol-1 to dou+1 of the eigenvector space Z contain the computed eigenvectors. All other columns of Z are set to zero.
minrgp, rtol1, rtol2 REAL for slarrv/clarrv; DOUBLE PRECISION for dlarrv/zlarrv. Parameters for bisection. An interval [LEFT,RIGHT] has converged if RIGHT - LEFT < max( rtol1*gap, rtol2*max(|LEFT|,|RIGHT|) ).
w REAL for slarrv/clarrv; DOUBLE PRECISION for dlarrv/zlarrv. Array, DIMENSION (n). The first m elements of w contain the approximate eigenvalues for which eigenvectors are to be computed. The eigenvalues should be grouped by split-off block and ordered from smallest to largest within the block (the output array w from ?larre is expected here). These eigenvalues are set with respect to the shift of the corresponding root representation for their block.
werr REAL for slarrv/clarrv; DOUBLE PRECISION for dlarrv/zlarrv. Array, DIMENSION (n). The first m elements contain the semiwidth of the uncertainty interval of the corresponding eigenvalue in w.
wgap REAL for slarrv/clarrv; DOUBLE PRECISION for dlarrv/zlarrv. Array, DIMENSION (n). The separation from the right neighbor eigenvalue in w.
iblock INTEGER. Array, DIMENSION (n).
The indices of the blocks (submatrices) associated with the corresponding eigenvalues in w; iblock(i) = 1 if eigenvalue w(i) belongs to the first block from the top, = 2 if w(i) belongs to the second block, etc.
indexw INTEGER. Array, DIMENSION (n). The indices of the eigenvalues within each block (submatrix); for example, indexw(i) = 10 and iblock(i) = 2 imply that the i-th eigenvalue w(i) is the 10-th eigenvalue in the second block.
gers REAL for slarrv/clarrv; DOUBLE PRECISION for dlarrv/zlarrv. Array, DIMENSION (2*n). The n Gerschgorin intervals (the i-th Gerschgorin interval is (gers(2*i-1), gers(2*i))). The Gerschgorin intervals should be computed from the original unshifted matrix.
ldz INTEGER. The leading dimension of the output array Z. ldz ≥ 1, and if jobz = 'V', ldz ≥ max(1,n).
work REAL for slarrv/clarrv; DOUBLE PRECISION for dlarrv/zlarrv. Workspace array, DIMENSION (12*n).
iwork INTEGER. Workspace array, DIMENSION (7*n).
Output Parameters
d On exit, d may be overwritten.
l On exit, l is overwritten.
w On exit, w holds the eigenvalues of the unshifted matrix.
werr On exit, werr contains refined values of its input approximations.
wgap On exit, wgap contains refined values of its input approximations. Very small gaps are changed.
z REAL for slarrv; DOUBLE PRECISION for dlarrv; COMPLEX for clarrv; DOUBLE COMPLEX for zlarrv. Array, DIMENSION (ldz, max(1,m)). If info = 0, the first m columns of z contain the orthonormal eigenvectors of the matrix T corresponding to the input eigenvalues, with the i-th column of z holding the eigenvector associated with w(i).
NOTE: You must ensure that at least max(1,m) columns are supplied in the array z.
isuppz INTEGER. Array, DIMENSION (2*max(1,m)). The support of the eigenvectors in z, that is, the indices indicating the nonzero elements in z. The i-th eigenvector is nonzero only in elements isuppz(2*i-1) through isuppz(2*i).
info INTEGER.
If info = 0: successful exit.
If info > 0: a problem occurred in ?larrv. If info = 5, the Rayleigh Quotient Iteration failed to converge to full accuracy.
If info < 0: one of the called subroutines signaled an internal problem. Inspect the corresponding parameter info for further information.
• If info = -1, there is a problem in ?larrb when refining a child eigenvalue;
• If info = -2, there is a problem in ?larrf when computing the relatively robust representation (RRR) of a child. When a child is inside a tight cluster, it can be difficult to find an RRR. A partial remedy from the user's point of view is to make the parameter minrgp smaller and recompile. However, as the orthogonality of the computed vectors is proportional to 1/minrgp, be aware that you might be trading away precision when you decrease minrgp.
• If info = -3, there is a problem in ?larrb when refining a single eigenvalue after the Rayleigh correction was rejected.
See Also
?larrb
?larre
?larrf
?lartg
Generates a plane rotation with real cosine and real/complex sine.
Syntax
call slartg( f, g, cs, sn, r )
call dlartg( f, g, cs, sn, r )
call clartg( f, g, cs, sn, r )
call zlartg( f, g, cs, sn, r )
Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h
Description
The routine generates a plane rotation so that
\[ \begin{pmatrix} cs & sn \\ -\mathrm{conjg}(sn) & cs \end{pmatrix} \begin{pmatrix} f \\ g \end{pmatrix} = \begin{pmatrix} r \\ 0 \end{pmatrix} \]
where cs^2 + |sn|^2 = 1 (for the real flavors, conjg(sn) = sn).
This is a slower, more accurate version of the BLAS Level 1 routine ?rotg, except for the following differences.
For slartg/dlartg: f and g are unchanged on return; if g = 0, then cs = 1 and sn = 0; if f = 0 and g ≠ 0, then cs = 0 and sn = 1 without doing any floating point operations (saves work in ?bdsqr when there are zeros on the diagonal); if f exceeds g in magnitude, cs will be positive.
For clartg/zlartg: f and g are unchanged on return; if g = 0, then cs = 1 and sn = 0; if f = 0, then cs = 0 and sn is chosen so that r is real.
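The special cases listed above for the real flavors can be illustrated with a short Python sketch. This is not the MKL or LAPACK implementation: the function name plane_rotation is hypothetical, and the sketch uses the textbook formula r = hypot(f, g) without the scaling that a production routine would apply to guard against overflow and underflow.

```python
import math


def plane_rotation(f, g):
    """Return (cs, sn, r) such that cs*f + sn*g = r and -sn*f + cs*g = 0.

    Illustrative sketch of the real-flavor rules only; no overflow scaling.
    """
    if g == 0.0:
        # No rotation needed: cs = 1, sn = 0.
        return 1.0, 0.0, f
    if f == 0.0:
        # Swap components without any floating point arithmetic.
        return 0.0, 1.0, g
    r = math.hypot(f, g)
    cs, sn = f / r, g / r
    if abs(f) > abs(g) and cs < 0.0:
        # Keep cs positive when f dominates g in magnitude.
        cs, sn, r = -cs, -sn, -r
    return cs, sn, r


if __name__ == "__main__":
    f, g = 3.0, 4.0
    cs, sn, r = plane_rotation(f, g)
    # The second component of the rotated vector should vanish.
    print(cs, sn, r, -sn * f + cs * g)
```

Flipping the signs of cs, sn and r together leaves both defining relations intact, which is why the sign convention can be applied after the rotation is computed.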
Input Parameters
f, g REAL for slartg; DOUBLE PRECISION for dlartg; COMPLEX for clartg; DOUBLE COMPLEX for zlartg. The first and second components of the vector to be rotated.
Output Parameters
cs REAL for slartg/clartg; DOUBLE PRECISION for dlartg/zlartg. The cosine of the rotation.
sn REAL for slartg; DOUBLE PRECISION for dlartg; COMPLEX for clartg; DOUBLE COMPLEX for zlartg. The sine of the rotation.
r REAL for slartg; DOUBLE PRECISION for dlartg; COMPLEX for clartg; DOUBLE COMPLEX for zlartg. The nonzero component of the rotated vector.
?lartgp
Generates a plane rotation so that the diagonal is nonnegative.
Syntax
Fortran 77:
call slartgp( f, g, cs, sn, r )
call dlartgp( f, g, cs, sn, r )
Fortran 95:
call lartgp( f,g,cs,sn,r )
Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h
Description
The routine generates a plane rotation so that
\[ \begin{pmatrix} cs & sn \\ -sn & cs \end{pmatrix} \begin{pmatrix} f \\ g \end{pmatrix} = \begin{pmatrix} r \\ 0 \end{pmatrix} \]
where cs^2 + sn^2 = 1.
This is a slower, more accurate version of the BLAS Level 1 routine ?rotg, except for the following differences:
• f and g are unchanged on return.
• If g = 0, then cs = (+/-)1 and sn = 0.
• If f = 0 and g ≠ 0, then cs = 0 and sn = (+/-)1.
The sign is chosen so that r ≥ 0.
Input Parameters
f, g REAL for slartgp; DOUBLE PRECISION for dlartgp. The first and second components of the vector to be rotated.
Output Parameters
cs REAL for slartgp; DOUBLE PRECISION for dlartgp. The cosine of the rotation.
sn REAL for slartgp; DOUBLE PRECISION for dlartgp. The sine of the rotation.
r REAL for slartgp; DOUBLE PRECISION for dlartgp. The nonzero component of the rotated vector.
Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions. Specific details for the routine ?lartgp interface are as follows:
f Holds the first component of the vector to be rotated.
g Holds the second component of the vector to be rotated.
cs Holds the cosine of the rotation.
sn Holds the sine of the rotation.
r Holds the nonzero component of the rotated vector.
See Also
?rotg
?lartg
?lartgs
?lartgs
Generates a plane rotation designed to introduce a bulge in implicit QR iteration for the bidiagonal SVD problem.
Syntax
Fortran 77:
call slartgs( x, y, sigma, cs, sn )
call dlartgs( x, y, sigma, cs, sn )
Fortran 95:
call lartgs( x,y,sigma,cs,sn )
Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h
Description
The routine generates a plane rotation designed to introduce a bulge in Golub-Reinsch-style implicit QR iteration for the bidiagonal SVD problem. x and y are the top-row entries, and sigma is the shift. The computed cs and sn define a plane rotation that satisfies the following:
\[ \begin{pmatrix} cs & sn \\ -sn & cs \end{pmatrix} \begin{pmatrix} x^2 - sigma \\ x \cdot y \end{pmatrix} = \begin{pmatrix} r \\ 0 \end{pmatrix} \]
with r nonnegative. If x^2 - sigma and x*y are both 0, the rotation is by π/2.
Input Parameters
x, y REAL for slartgs; DOUBLE PRECISION for dlartgs. The (1,1) and (1,2) entries of an upper bidiagonal matrix, respectively.
sigma REAL for slartgs; DOUBLE PRECISION for dlartgs. The shift.
Output Parameters
cs REAL for slartgs; DOUBLE PRECISION for dlartgs. The cosine of the rotation.
sn REAL for slartgs; DOUBLE PRECISION for dlartgs. The sine of the rotation.
Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions. Specific details for the routine ?lartgs interface are as follows:
x Holds the (1,1) entry of an upper bidiagonal matrix.
y Holds the (1,2) entry of an upper bidiagonal matrix.
sigma Holds the shift.
cs Holds the cosine of the rotation.
sn Holds the sine of the rotation.
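As a rough illustration of the ?lartgs description, the Python sketch below rotates the vector (x^2 - sigma, x*y) onto (r, 0) with r nonnegative, and falls back to a rotation by π/2 when both components are zero. The function name bulge_rotation is hypothetical, and the sketch forms x^2 - sigma directly, which is an assumption made for readability; a library routine would typically rescale the quantities to avoid overflow.

```python
import math


def bulge_rotation(x, y, sigma):
    """Return (cs, sn) rotating (x*x - sigma, x*y) onto (r, 0) with r >= 0.

    Illustrative sketch only; no scaling against overflow or underflow.
    """
    z = x * x - sigma  # first component of the vector to be rotated
    w = x * y          # second component of the vector to be rotated
    r = math.hypot(z, w)
    if r == 0.0:
        # Both components vanish: rotate by pi/2 (cs = 0, sn = 1).
        return 0.0, 1.0
    return z / r, w / r


if __name__ == "__main__":
    x, y, sigma = 2.0, 1.0, 1.0
    cs, sn = bulge_rotation(x, y, sigma)
    z, w = x * x - sigma, x * y
    # First entry is r (nonnegative), second entry should be close to zero.
    print(cs * z + sn * w, -sn * z + cs * w)
```

Because r is computed as a Euclidean norm, it is nonnegative by construction, so no sign correction is needed in this sketch.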
See Also
?lartg
?lartgp
?lartv
Applies a vector of plane rotations with real cosines and real/complex sines to the elements of a pair of vectors.
Syntax
call slartv( n, x, incx, y, incy, c, s, incc )
call dlartv( n, x, incx, y, incy, c, s, incc )
call clartv( n, x, incx, y, incy, c, s, incc )
call zlartv( n, x, incx, y, incy, c, s, incc )
Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h
Description
The routine applies a vector of real/complex plane rotations with real cosines to elements of the real/complex vectors x and y. For i = 1,2,...,n:
\[ \begin{pmatrix} x(i) \\ y(i) \end{pmatrix} := \begin{pmatrix} c(i) & s(i) \\ -\mathrm{conjg}(s(i)) & c(i) \end{pmatrix} \begin{pmatrix} x(i) \\ y(i) \end{pmatrix} \]
Input Parameters
n INTEGER. The number of plane rotations to be applied.
x, y REAL for slartv; DOUBLE PRECISION for dlartv; COMPLEX for clartv; DOUBLE COMPLEX for zlartv. Arrays, DIMENSION (1+(n-1)*incx) and (1+(n-1)*incy), respectively. The input vectors x and y.
incx INTEGER. The increment between elements of x. incx > 0.
incy INTEGER. The increment between elements of y. incy > 0.
c REAL for slartv/clartv; DOUBLE PRECISION for dlartv/zlartv. Array, DIMENSION (1+(n-1)*incc). The cosines of the plane rotations.
s REAL for slartv; DOUBLE PRECISION for dlartv; COMPLEX for clartv; DOUBLE COMPLEX for zlartv. Array, DIMENSION (1+(n-1)*incc). The sines of the plane rotations.
incc INTEGER. The increment between elements of c and s. incc > 0.
Output Parameters
x, y The rotated vectors x and y.
?laruv
Returns a vector of n random real numbers from a uniform distribution.
Syntax
call slaruv( iseed, n, x )
call dlaruv( iseed, n, x )
Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h
Description
The routine ?laruv returns a vector of n random real numbers from a uniform (0,1) distribution (n ≤ 128). This is an auxiliary routine called by ?larnv.
Input Parameters
iseed INTEGER. Array, DIMENSION (4). On entry, the seed of the random number generator; the array elements must be between 0 and 4095, and iseed(4) must be odd.
n INTEGER. The number of random numbers to be generated. n ≤ 128.
Output Parameters
x REAL for slaruv; DOUBLE PRECISION for dlaruv. Array, DIMENSION (n). The generated random numbers.
iseed On exit, the seed is updated.
?larz
Applies an elementary reflector (as returned by ?tzrzf) to a general matrix.
Syntax
call slarz( side, m, n, l, v, incv, tau, c, ldc, work )
call dlarz( side, m, n, l, v, incv, tau, c, ldc, work )
call clarz( side, m, n, l, v, incv, tau, c, ldc, work )
call zlarz( side, m, n, l, v, incv, tau, c, ldc, work )
Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h
Description
The routine ?larz applies a real/complex elementary reflector H to a real/complex m-by-n matrix C, from either the left or the right. H is represented in the forms H = I - tau*v*v^T for real flavors and H = I - tau*v*v^H for complex flavors, where tau is a real/complex scalar and v is a real/complex vector, respectively. If tau = 0, then H is taken to be the unit matrix. For complex flavors, to apply H^H (the conjugate transpose of H), supply conjg(tau) instead of tau. H is a product of k elementary reflectors as returned by ?tzrzf.
Input Parameters
side CHARACTER*1. If side = 'L': form H*C. If side = 'R': form C*H.
m INTEGER. The number of rows of the matrix C.
n INTEGER. The number of columns of the matrix C.
l INTEGER. The number of entries of the vector v containing the meaningful part of the Householder vectors. If side = 'L', m ≥ l ≥ 0; if side = 'R', n ≥ l ≥ 0.
v REAL for slarz; DOUBLE PRECISION for dlarz; COMPLEX for clarz; DOUBLE COMPLEX for zlarz. Array, DIMENSION (1+(l-1)*abs(incv)). The vector v in the representation of H as returned by ?tzrzf. v is not used if tau = 0.
incv INTEGER. The increment between elements of v. incv ≠ 0.
tau REAL for slarz; DOUBLE PRECISION for dlarz; COMPLEX for clarz; DOUBLE COMPLEX for zlarz. The value tau in the representation of H.
c REAL for slarz; DOUBLE PRECISION for dlarz; COMPLEX for clarz; DOUBLE COMPLEX for zlarz. Array, DIMENSION (ldc,n). On entry, the m-by-n matrix C.
ldc INTEGER. The leading dimension of the array c. ldc ≥ max(1,m).
work REAL for slarz; DOUBLE PRECISION for dlarz; COMPLEX for clarz; DOUBLE COMPLEX for zlarz. Workspace array, DIMENSION (n) if side = 'L' or (m) if side = 'R'.
Output Parameters
c On exit, C is overwritten by the matrix H*C if side = 'L', or C*H if side = 'R'.
?larzb
Applies a block reflector or its transpose/conjugate-transpose to a general matrix.
Syntax
call slarzb( side, trans, direct, storev, m, n, k, l, v, ldv, t, ldt, c, ldc, work, ldwork )
call dlarzb( side, trans, direct, storev, m, n, k, l, v, ldv, t, ldt, c, ldc, work, ldwork )
call clarzb( side, trans, direct, storev, m, n, k, l, v, ldv, t, ldt, c, ldc, work, ldwork )
call zlarzb( side, trans, direct, storev, m, n, k, l, v, ldv, t, ldt, c, ldc, work, ldwork )
Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h
Description
The routine applies a real/complex block reflector H or its transpose H^T (or the conjugate transpose H^H for complex flavors) to a real/complex distributed m-by-n matrix C from the left or the right. Currently, only storev = 'R' and direct = 'B' are supported.
Input Parameters
side CHARACTER*1. If side = 'L': apply H or H^T/H^H from the left. If side = 'R': apply H or H^T/H^H from the right.
trans CHARACTER*1. If trans = 'N': apply H (no transpose). If trans = 'C': apply H^H (conjugate transpose). If trans = 'T': apply H^T (transpose).
direct CHARACTER*1. Indicates how H is formed from a product of elementary reflectors. = 'F': H = H(1)*H(2)*...*H(k) (forward, not supported). = 'B': H = H(k)*...*H(2)*H(1) (backward).
storev CHARACTER*1. Indicates how the vectors which define the elementary reflectors are stored. = 'C': column-wise (not supported). = 'R': row-wise.
m INTEGER. The number of rows of the matrix C.
n INTEGER.
The number of columns of the matrix C.
k INTEGER. The order of the matrix T (equal to the number of elementary reflectors whose product defines the block reflector).
l INTEGER. The number of columns of the matrix V containing the meaningful part of the Householder reflectors. If side = 'L', m ≥ l ≥ 0; if side = 'R', n ≥ l ≥ 0.
v REAL for slarzb; DOUBLE PRECISION for dlarzb; COMPLEX for clarzb; DOUBLE COMPLEX for zlarzb. Array, DIMENSION (ldv, nv). If storev = 'C', nv = k; if storev = 'R', nv = l.
ldv INTEGER. The leading dimension of the array v. If storev = 'C', ldv ≥ l; if storev = 'R', ldv ≥ k.
t REAL for slarzb; DOUBLE PRECISION for dlarzb; COMPLEX for clarzb; DOUBLE COMPLEX for zlarzb. Array, DIMENSION (ldt,k). The triangular k-by-k matrix T in the representation of the block reflector.
ldt INTEGER. The leading dimension of the array t. ldt ≥ k.
c REAL for slarzb; DOUBLE PRECISION for dlarzb; COMPLEX for clarzb; DOUBLE COMPLEX for zlarzb. Array, DIMENSION (ldc,n). On entry, the m-by-n matrix C.
ldc INTEGER. The leading dimension of the array c. ldc ≥ max(1,m).
work REAL for slarzb; DOUBLE PRECISION for dlarzb; COMPLEX for clarzb; DOUBLE COMPLEX for zlarzb. Workspace array, DIMENSION (ldwork, k).
ldwork INTEGER. The leading dimension of the array work. If side = 'L', ldwork ≥ max(1, n); if side = 'R', ldwork ≥ max(1, m).
Output Parameters
c On exit, C is overwritten by H*C, or H^T/H^H*C, or C*H, or C*H^T/H^H.
?larzt
Forms the triangular factor T of a block reflector H = I - V*T*V^H.
Syntax
call slarzt( direct, storev, n, k, v, ldv, tau, t, ldt )
call dlarzt( direct, storev, n, k, v, ldv, tau, t, ldt )
call clarzt( direct, storev, n, k, v, ldv, tau, t, ldt )
call zlarzt( direct, storev, n, k, v, ldv, tau, t, ldt )
Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h
Description
The routine forms the triangular factor T of a real/complex block reflector H of order > n, which is defined as a product of k elementary reflectors.
If direct = 'F', H = H(1)*H(2)*...*H(k), and T is upper triangular. If direct = 'B', H = H(k)*...*H(2)*H(1), and T is lower triangular. If storev = 'C', the vector which defines the elementary reflector H(i) is stored in the i-th column of the array v, and H = I - V*T*V^T (for real flavors) or H = I - V*T*V^H (for complex flavors). If storev = 'R', the vector which defines the elementary reflector H(i) is stored in the i-th row of the array v, and H = I - V^T*T*V (for real flavors) or H = I - V^H*T*V (for complex flavors). Currently, only storev = 'R' and direct = 'B' are supported.
Input Parameters
direct CHARACTER*1. Specifies the order in which the elementary reflectors are multiplied to form the block reflector. If direct = 'F': H = H(1)*H(2)*...*H(k) (forward, not supported). If direct = 'B': H = H(k)*...*H(2)*H(1) (backward).
storev CHARACTER*1. Specifies how the vectors which define the elementary reflectors are stored (see also Application Notes below). If storev = 'C': column-wise (not supported). If storev = 'R': row-wise.
n INTEGER. The order of the block reflector H. n ≥ 0.
k INTEGER. The order of the triangular factor T (equal to the number of elementary reflectors). k ≥ 1.
v REAL for slarzt; DOUBLE PRECISION for dlarzt; COMPLEX for clarzt; DOUBLE COMPLEX for zlarzt. Array, DIMENSION (ldv, k) if storev = 'C', or (ldv, n) if storev = 'R'. The matrix V.
ldv INTEGER. The leading dimension of the array v. If storev = 'C', ldv ≥ max(1,n); if storev = 'R', ldv ≥ k.
tau REAL for slarzt; DOUBLE PRECISION for dlarzt; COMPLEX for clarzt; DOUBLE COMPLEX for zlarzt. Array, DIMENSION (k). tau(i) must contain the scalar factor of the elementary reflector H(i).
ldt INTEGER. The leading dimension of the output array t. ldt ≥ k.
Output Parameters
t REAL for slarzt; DOUBLE PRECISION for dlarzt; COMPLEX for clarzt; DOUBLE COMPLEX for zlarzt. Array, DIMENSION (ldt,k). The k-by-k triangular factor T of the block reflector.
If direct = 'F', T is upper triangular; if direct = 'B', T is lower triangular. The rest of the array is not used.
v The matrix V. See Application Notes below.
Application Notes
The shape of the matrix V and the storage of the vectors which define the H(i) is best illustrated by the following example with n = 5 and k = 3. The elements equal to 1 are not stored; the corresponding array elements are modified but restored on exit. The rest of the array is not used.
?las2
Computes singular values of a 2-by-2 triangular matrix.
Syntax
call slas2( f, g, h, ssmin, ssmax )
call dlas2( f, g, h, ssmin, ssmax )
Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h
Description
The routine ?las2 computes the singular values of the 2-by-2 matrix
\[ \begin{pmatrix} f & g \\ 0 & h \end{pmatrix} \]
On return, ssmin is the smaller singular value and ssmax is the larger singular value.
Input Parameters
f, g, h REAL for slas2; DOUBLE PRECISION for dlas2. The (1,1), (1,2) and (2,2) elements of the 2-by-2 matrix, respectively.
Output Parameters
ssmin, ssmax REAL for slas2; DOUBLE PRECISION for dlas2. The smaller and the larger singular values, respectively.
Application Notes
Barring over/underflow, all output quantities are correct to within a few units in the last place (ulps), even in the absence of a guard digit in addition/subtraction. In IEEE arithmetic, the code works correctly if one matrix element is infinite. Overflow will not occur unless the largest singular value itself overflows, or is within a few ulps of overflow. (On machines with partial overflow, like the Cray, overflow may occur if the largest singular value is within a factor of 2 of overflow.) Underflow is harmless if underflow is gradual. Otherwise, results may correspond to a matrix modified by perturbations of size near the underflow threshold.
?lascl
Multiplies a general rectangular matrix by a real scalar defined as cto/cfrom.
Syntax
call slascl( type, kl, ku, cfrom, cto, m, n, a, lda, info )
call dlascl( type, kl, ku, cfrom, cto, m, n, a, lda, info )
call clascl( type, kl, ku, cfrom, cto, m, n, a, lda, info )
call zlascl( type, kl, ku, cfrom, cto, m, n, a, lda, info )
Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h
Description
The routine ?lascl multiplies the m-by-n real/complex matrix A by the real scalar cto/cfrom. The operation is performed without over/underflow as long as the final result cto*A(i,j)/cfrom does not over/underflow. type specifies that A may be full, upper triangular, lower triangular, upper Hessenberg, or banded.
Input Parameters
type CHARACTER*1. This parameter specifies the storage type of the input matrix.
= 'G': A is a full matrix.
= 'L': A is a lower triangular matrix.
= 'U': A is an upper triangular matrix.
= 'H': A is an upper Hessenberg matrix.
= 'B': A is a symmetric band matrix with lower bandwidth kl and upper bandwidth ku and with only the lower half stored.
= 'Q': A is a symmetric band matrix with lower bandwidth kl and upper bandwidth ku and with only the upper half stored.
= 'Z': A is a band matrix with lower bandwidth kl and upper bandwidth ku. See description of the ?gbtrf function for storage details.
kl INTEGER. The lower bandwidth of A. Referenced only if type = 'B', 'Q' or 'Z'.
ku INTEGER. The upper bandwidth of A. Referenced only if type = 'B', 'Q' or 'Z'.
cfrom, cto REAL for slascl/clascl; DOUBLE PRECISION for dlascl/zlascl. The matrix A is multiplied by cto/cfrom. A(i,j) is computed without over/underflow if the final result cto*A(i,j)/cfrom can be represented without over/underflow. cfrom must be nonzero.
m INTEGER. The number of rows of the matrix A. m ≥ 0.
n INTEGER. The number of columns of the matrix A. n ≥ 0.
a REAL for slascl; DOUBLE PRECISION for dlascl; COMPLEX for clascl; DOUBLE COMPLEX for zlascl. Array, DIMENSION (lda, n). The matrix to be multiplied by cto/cfrom.
See type for the storage type.
lda INTEGER. The leading dimension of the array a. lda ≥ max(1,m).
Output Parameters
a The multiplied matrix A.
info INTEGER. If info = 0: successful exit. If info = -i < 0, the i-th argument had an illegal value.
See Also
?gbtrf
?lasd0
Computes the singular values of a real upper bidiagonal n-by-m matrix B with diagonal d and off-diagonal e. Used by ?bdsdc.
Syntax
call slasd0( n, sqre, d, e, u, ldu, vt, ldvt, smlsiz, iwork, work, info )
call dlasd0( n, sqre, d, e, u, ldu, vt, ldvt, smlsiz, iwork, work, info )
Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h
Description
Using a divide and conquer approach, the routine ?lasd0 computes the singular value decomposition (SVD) of a real upper bidiagonal n-by-m matrix B with diagonal d and off-diagonal e, where m = n + sqre. The algorithm computes orthogonal matrices U and VT such that B = U*S*VT. The singular values S are overwritten on d. The related subroutine ?lasda computes only the singular values, and optionally, the singular vectors in compact form.
Input Parameters
n INTEGER. On entry, the row dimension of the upper bidiagonal matrix. This is also the dimension of the main diagonal array d.
sqre INTEGER. Specifies the column dimension of the bidiagonal matrix. If sqre = 0: the bidiagonal matrix has column dimension m = n. If sqre = 1: the bidiagonal matrix has column dimension m = n+1.
d REAL for slasd0; DOUBLE PRECISION for dlasd0. Array, DIMENSION (n). On entry, d contains the main diagonal of the bidiagonal matrix.
e REAL for slasd0; DOUBLE PRECISION for dlasd0. Array, DIMENSION (m-1). Contains the subdiagonal entries of the bidiagonal matrix. On exit, e is destroyed.
ldu INTEGER. On entry, leading dimension of the output array u.
ldvt INTEGER. On entry, leading dimension of the output array vt.
smlsiz INTEGER. On entry, maximum size of the subproblems at the bottom of the computation tree.
iwork INTEGER.
Workspace array, dimension must be at least (8*n).
work REAL for slasd0; DOUBLE PRECISION for dlasd0. Workspace array, dimension must be at least (3*m^2 + 2*m).
Output Parameters
d On exit, if info = 0, d contains the singular values of the bidiagonal matrix.
u REAL for slasd0; DOUBLE PRECISION for dlasd0. Array, DIMENSION at least (ldu, n). On exit, u contains the left singular vectors.
vt REAL for slasd0; DOUBLE PRECISION for dlasd0. Array, DIMENSION at least (ldvt, m). On exit, vt^T contains the right singular vectors.
info INTEGER. If info = 0: successful exit. If info = -i < 0, the i-th argument had an illegal value. If info = 1, a singular value did not converge.
?lasd1
Computes the SVD of an upper bidiagonal matrix B of the specified size. Used by ?bdsdc.
Syntax
call slasd1( nl, nr, sqre, d, alpha, beta, u, ldu, vt, ldvt, idxq, iwork, work, info )
call dlasd1( nl, nr, sqre, d, alpha, beta, u, ldu, vt, ldvt, idxq, iwork, work, info )
Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h
Description
The routine computes the SVD of an upper bidiagonal n-by-m matrix B, where n = nl + nr + 1 and m = n + sqre. The routine ?lasd1 is called from ?lasd0. A related subroutine ?lasd7 handles the case in which the singular values (and the singular vectors in factored form) are desired. ?lasd1 computes the SVD as follows:
\[ B = U(\mathrm{in}) \begin{pmatrix} D1(\mathrm{in}) & 0 & 0 & 0 \\ Z1^T & a & Z2^T & b \\ 0 & 0 & D2(\mathrm{in}) & 0 \end{pmatrix} VT(\mathrm{in}) = U(\mathrm{out}) \begin{pmatrix} D(\mathrm{out}) & 0 \end{pmatrix} VT(\mathrm{out}) \]
where Z^T = (Z1^T a Z2^T b) = u^T*VT^T, and u is a vector of dimension m with alpha and beta in the (nl+1)-th and (nl+2)-th entries and zeros elsewhere; the entry b is empty if sqre = 0. The left singular vectors of the original matrix are stored in u, the transpose of the right singular vectors is stored in vt, and the singular values are in d. The algorithm consists of three stages:
1. The first stage consists of deflating the size of the problem when there are multiple singular values or when there are zeros in the Z vector.
For each such occurrence the dimension of the secular equation problem is reduced by one. This stage is performed by the routine ?lasd2.
2. The second stage consists of calculating the updated singular values. This is done by finding the square roots of the roots of the secular equation via the routine ?lasd4 (as called by ?lasd3). This routine also calculates the singular vectors of the current problem.
3. The final stage consists of computing the updated singular vectors directly using the updated singular values. The singular vectors for the current problem are multiplied with the singular vectors from the overall problem.
Input Parameters
nl INTEGER. The row dimension of the upper block. nl ≥ 1.
nr INTEGER. The row dimension of the lower block. nr ≥ 1.
sqre INTEGER. If sqre = 0: the lower block is an nr-by-nr square matrix. If sqre = 1: the lower block is an nr-by-(nr+1) rectangular matrix. The bidiagonal matrix has row dimension n = nl + nr + 1, and column dimension m = n + sqre.
d REAL for slasd1; DOUBLE PRECISION for dlasd1. Array, DIMENSION (n), where n = nl+nr+1. On entry d(1:nl) contains the singular values of the upper block, and d(nl+2:n) contains the singular values of the lower block.
alpha REAL for slasd1; DOUBLE PRECISION for dlasd1. Contains the diagonal element associated with the added row.
beta REAL for slasd1; DOUBLE PRECISION for dlasd1. Contains the off-diagonal element associated with the added row.
u REAL for slasd1; DOUBLE PRECISION for dlasd1. Array, DIMENSION (ldu, n). On entry u(1:nl, 1:nl) contains the left singular vectors of the upper block; u(nl+2:n, nl+2:n) contains the left singular vectors of the lower block.
ldu INTEGER. The leading dimension of the array u. ldu ≥ max(1, n).
vt REAL for slasd1; DOUBLE PRECISION for dlasd1. Array, DIMENSION (ldvt, m), where m = n + sqre.
On entry vt(1:nl+1, 1:nl+1)^T contains the right singular vectors of the upper block; vt(nl+2:m, nl+2:m)^T contains the right singular vectors of the lower block.
ldvt INTEGER. The leading dimension of the array vt. ldvt ≥ max(1, m).
iwork INTEGER. Workspace array, DIMENSION (4*n).
work REAL for slasd1; DOUBLE PRECISION for dlasd1. Workspace array, DIMENSION (3*m^2 + 2*m).
Output Parameters
d On exit d(1:n) contains the singular values of the modified matrix.
alpha On exit, the diagonal element associated with the added row deflated by max( abs( alpha ), abs( beta ), abs( d(i) ) ), i = 1,n.
beta On exit, the off-diagonal element associated with the added row deflated by max( abs( alpha ), abs( beta ), abs( d(i) ) ), i = 1,n.
u On exit u contains the left singular vectors of the bidiagonal matrix.
vt On exit vt^T contains the right singular vectors of the bidiagonal matrix.
idxq INTEGER. Array, DIMENSION (n). Contains the permutation which will reintegrate the subproblem just solved back into sorted order, that is, d(idxq( i = 1, n )) will be in ascending order.
info INTEGER. If info = 0: successful exit. If info = -i < 0, the i-th argument had an illegal value. If info = 1, a singular value did not converge.
?lasd2
Merges the two sets of singular values together into a single sorted set. Used by ?bdsdc.
Syntax
call slasd2( nl, nr, sqre, k, d, z, alpha, beta, u, ldu, vt, ldvt, dsigma, u2, ldu2, vt2, ldvt2, idxp, idx, idxc, idxq, coltyp, info )
call dlasd2( nl, nr, sqre, k, d, z, alpha, beta, u, ldu, vt, ldvt, dsigma, u2, ldu2, vt2, ldvt2, idxp, idx, idxc, idxq, coltyp, info )
Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h
Description
The routine ?lasd2 merges the two sets of singular values together into a single sorted set. Then it tries to deflate the size of the problem. There are two ways in which deflation can occur: when two or more singular values are close together or if there is a tiny entry in the Z vector.
For each such occurrence the order of the related secular equation problem is reduced by one. The routine ?lasd2 is called from ?lasd1.

Input Parameters
nl INTEGER. The row dimension of the upper block. nl ≥ 1.
nr INTEGER. The row dimension of the lower block. nr ≥ 1.
sqre INTEGER. If sqre = 0: the lower block is an nr-by-nr square matrix. If sqre = 1: the lower block is an nr-by-(nr+1) rectangular matrix. The bidiagonal matrix has n = nl + nr + 1 rows and m = n + sqre ≥ n columns.
d REAL for slasd2, DOUBLE PRECISION for dlasd2. Array, DIMENSION (n). On entry, d contains the singular values of the two submatrices to be combined.
alpha REAL for slasd2, DOUBLE PRECISION for dlasd2. Contains the diagonal element associated with the added row.
beta REAL for slasd2, DOUBLE PRECISION for dlasd2. Contains the off-diagonal element associated with the added row.
u REAL for slasd2, DOUBLE PRECISION for dlasd2. Array, DIMENSION (ldu, n). On entry, u contains the left singular vectors of two submatrices in the two square blocks with corners at (1,1), (nl, nl), and (nl+2, nl+2), (n, n).
ldu INTEGER. The leading dimension of the array u. ldu ≥ n.
ldu2 INTEGER. The leading dimension of the output array u2. ldu2 ≥ n.
vt REAL for slasd2, DOUBLE PRECISION for dlasd2. Array, DIMENSION (ldvt, m). On entry, vt^T contains the right singular vectors of two submatrices in the two square blocks with corners at (1,1), (nl+1, nl+1), and (nl+2, nl+2), (m, m).
ldvt INTEGER. The leading dimension of the array vt. ldvt ≥ m.
ldvt2 INTEGER. The leading dimension of the output array vt2. ldvt2 ≥ m.
idxp INTEGER. Workspace array, DIMENSION (n). This will contain the permutation used to place deflated values of d at the end of the array. On output, idxp(2:k) points to the nondeflated d-values and idxp(k+1:n) points to the deflated singular values.
idx INTEGER. Workspace array, DIMENSION (n).
This will contain the permutation used to sort the contents of d into ascending order.
coltyp INTEGER. Workspace array, DIMENSION (n). As workspace, this array contains a label that indicates which of the following types a column in the u2 matrix or a row in the vt2 matrix is:
1 : non-zero in the upper half only
2 : non-zero in the lower half only
3 : dense
4 : deflated.
idxq INTEGER. Array, DIMENSION (n). This parameter contains the permutation that separately sorts the two sub-problems in d into ascending order. Note that entries in the first half of this permutation must first be moved one position backwards and entries in the second half must have nl+1 added to their values.

Output Parameters
k INTEGER. Contains the dimension of the non-deflated matrix. This is the order of the related secular equation. 1 ≤ k ≤ n.
d On exit, d contains the trailing (n-k) updated singular values (those which were deflated) sorted into increasing order.
u On exit, u contains the trailing (n-k) updated left singular vectors (those which were deflated) in its last n-k columns.
z REAL for slasd2, DOUBLE PRECISION for dlasd2. Array, DIMENSION (n). On exit, z contains the updating row vector in the secular equation.
dsigma REAL for slasd2, DOUBLE PRECISION for dlasd2. Array, DIMENSION (n). Contains a copy of the diagonal elements (k-1 singular values and one zero) in the secular equation.
u2 REAL for slasd2, DOUBLE PRECISION for dlasd2. Array, DIMENSION (ldu2, n). Contains a copy of the first k-1 left singular vectors which will be used by ?lasd3 in a matrix multiply (?gemm) to solve for the new left singular vectors. u2 is arranged into four blocks. The first block contains a column with 1 at nl+1 and zero everywhere else; the second block contains non-zero entries only at and above nl; the third contains non-zero entries only below nl+1; and the fourth is dense.
vt On exit, vt^T contains the trailing (n-k) updated right singular vectors (those which were deflated) in its last n-k columns. In case sqre = 1, the last row of vt spans the right null space.
vt2 REAL for slasd2, DOUBLE PRECISION for dlasd2. Array, DIMENSION (ldvt2, n). vt2^T contains a copy of the first k right singular vectors which will be used by ?lasd3 in a matrix multiply (?gemm) to solve for the new right singular vectors. vt2 is arranged into three blocks. The first block contains a row that corresponds to the special 0 diagonal element in sigma; the second block contains non-zeros only at and before nl+1; the third block contains non-zeros only at and after nl+2.
idxc INTEGER. Array, DIMENSION (n). This will contain the permutation used to arrange the columns of the deflated u matrix into three groups: the first group contains non-zero entries only at and above nl, the second contains non-zero entries only below nl+2, and the third is dense.
coltyp On exit, it is an array of dimension 4, with coltyp(i) being the dimension of the i-th type columns.
info INTEGER. If info = 0: successful exit. If info = -i < 0: the i-th argument had an illegal value.

?lasd3
Finds all square roots of the roots of the secular equation, as defined by the values in d and z, and then updates the singular vectors by matrix multiplication. Used by ?bdsdc.

Syntax
call slasd3( nl, nr, sqre, k, d, q, ldq, dsigma, u, ldu, u2, ldu2, vt, ldvt, vt2, ldvt2, idxc, ctot, z, info )
call dlasd3( nl, nr, sqre, k, d, q, ldq, dsigma, u, ldu, u2, ldu2, vt, ldvt, vt2, ldvt2, idxc, ctot, z, info )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine ?lasd3 finds all the square roots of the roots of the secular equation, as defined by the values in d and z. It makes the appropriate calls to ?lasd4 and then updates the singular vectors by matrix multiplication. The routine ?lasd3 is called from ?lasd1.

Input Parameters
nl INTEGER. The row dimension of the upper block. nl ≥ 1.
nr INTEGER. The row dimension of the lower block. nr ≥ 1.
sqre INTEGER. If sqre = 0: the lower block is an nr-by-nr square matrix. If sqre = 1: the lower block is an nr-by-(nr+1) rectangular matrix. The bidiagonal matrix has n = nl + nr + 1 rows and m = n + sqre ≥ n columns.
k INTEGER. The size of the secular equation, 1 ≤ k ≤ n.
q REAL for slasd3, DOUBLE PRECISION for dlasd3. Workspace array, DIMENSION at least (ldq, k).
ldq INTEGER. The leading dimension of the array q. ldq ≥ k.
dsigma REAL for slasd3, DOUBLE PRECISION for dlasd3. Array, DIMENSION (k). The first k elements of this array contain the old roots of the deflated updating problem. These are the poles of the secular equation.
ldu INTEGER. The leading dimension of the array u. ldu ≥ n.
u2 REAL for slasd3, DOUBLE PRECISION for dlasd3. Array, DIMENSION (ldu2, n). The first k columns of this matrix contain the non-deflated left singular vectors for the split problem.
ldu2 INTEGER. The leading dimension of the array u2. ldu2 ≥ n.
ldvt INTEGER. The leading dimension of the array vt. ldvt ≥ n.
vt2 REAL for slasd3, DOUBLE PRECISION for dlasd3. Array, DIMENSION (ldvt2, n). The first k columns of vt2^T contain the non-deflated right singular vectors for the split problem.
ldvt2 INTEGER. The leading dimension of the array vt2. ldvt2 ≥ n.
idxc INTEGER. Array, DIMENSION (n). The permutation used to arrange the columns of u (and rows of vt) into three groups: the first group contains non-zero entries only at and above (or before) nl+1; the second contains non-zero entries only at and below (or after) nl+2; and the third is dense. The first column of u and the row of vt are treated separately, however. The rows of the singular vectors found by ?lasd4 must be likewise permuted before the matrix multiplies can take place.
ctot INTEGER. Array, DIMENSION (4).
A count of the total number of the various types of columns in u (or rows in vt), as described in idxc. The fourth column type is any column which has been deflated.
z REAL for slasd3, DOUBLE PRECISION for dlasd3. Array, DIMENSION (k). The first k elements of this array contain the components of the deflation-adjusted updating row vector.

Output Parameters
d REAL for slasd3, DOUBLE PRECISION for dlasd3. Array, DIMENSION (k). On exit, the square roots of the roots of the secular equation, in ascending order.
u REAL for slasd3, DOUBLE PRECISION for dlasd3. Array, DIMENSION (ldu, n). The last n - k columns of this matrix contain the deflated left singular vectors.
vt REAL for slasd3, DOUBLE PRECISION for dlasd3. Array, DIMENSION (ldvt, m). The last m - k columns of vt^T contain the deflated right singular vectors.
vt2 Destroyed on exit.
z Destroyed on exit.
info INTEGER. If info = 0: successful exit. If info = -i < 0: the i-th argument had an illegal value. If info = 1: a singular value did not converge.

Application Notes
This code makes very mild assumptions about floating point arithmetic. It will work on machines with a guard digit in add/subtract, or on those binary machines without guard digits which subtract like the Cray XMP, Cray YMP, Cray C 90, or Cray 2. It could conceivably fail on hexadecimal or decimal machines without guard digits, but we know of none.

?lasd4
Computes the square root of the i-th updated eigenvalue of a positive symmetric rank-one modification to a positive diagonal matrix. Used by ?bdsdc.

Syntax
call slasd4( n, i, d, z, delta, rho, sigma, work, info )
call dlasd4( n, i, d, z, delta, rho, sigma, work, info )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine computes the square root of the i-th updated eigenvalue of a positive symmetric rank-one modification to a positive diagonal matrix whose entries are given as the squares of the corresponding entries in the array d. It is assumed that 0 ≤ d(i) < d(j) for i < j and that rho > 0. This is arranged by the calling routine, and is no loss in generality. The rank-one modified system is thus diag(d)*diag(d) + rho*Z*Z^T, where the Euclidean norm of Z is equal to 1. The method consists of approximating the rational functions in the secular equation by simpler interpolating rational functions.

Input Parameters
n INTEGER. The length of all arrays.
i INTEGER. The index of the eigenvalue to be computed. 1 ≤ i ≤ n.
d REAL for slasd4, DOUBLE PRECISION for dlasd4. Array, DIMENSION (n). The original eigenvalues. They must be in order, 0 ≤ d(i) < d(j) for i < j.
z REAL for slasd4, DOUBLE PRECISION for dlasd4. Array, DIMENSION (n). The components of the updating vector.
rho REAL for slasd4, DOUBLE PRECISION for dlasd4. The scalar in the symmetric updating formula.
work REAL for slasd4, DOUBLE PRECISION for dlasd4. Workspace array, DIMENSION (n). If n ≠ 1, work contains (d(j) + sigma_i) in its j-th component. If n = 1, then work(1) = 1.

Output Parameters
delta REAL for slasd4, DOUBLE PRECISION for dlasd4. Array, DIMENSION (n). If n ≠ 1, delta contains (d(j) - sigma_i) in its j-th component. If n = 1, then delta(1) = 1. The vector delta contains the information necessary to construct the (singular) eigenvectors.
sigma REAL for slasd4, DOUBLE PRECISION for dlasd4. The computed sigma_i, the i-th updated eigenvalue.
info INTEGER. = 0: successful exit. > 0: if info = 1, the updating process failed.
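
The quantity that ?lasd4 returns can be illustrated with a small pure-Python model. This is only a sketch under stated assumptions, not the MKL implementation: it uses plain bisection instead of the interpolating rational functions described above, it assumes every z(j) is nonzero (deflation normally guarantees this), and the names secular and updated_sigma are hypothetical. The eigenvalues lam of diag(d)*diag(d) + rho*Z*Z^T are the roots of f(lam) = 1 + rho * sum_j z(j)^2 / (d(j)^2 - lam), and the i-th root lies between d(i)^2 and d(i+1)^2 (or, for i = n, between d(n)^2 and d(n)^2 + rho*||Z||^2).

```python
import math


def secular(d, z, rho, lam):
    """Secular function f(lam) = 1 + rho * sum_j z_j^2 / (d_j^2 - lam)."""
    return 1.0 + rho * sum(zj * zj / (dj * dj - lam) for dj, zj in zip(d, z))


def updated_sigma(d, z, rho, i, iters=200):
    """Square root of the i-th (1-based) root of the secular equation.

    Illustrative bisection only; assumes 0 <= d[0] < d[1] < ..., rho > 0,
    and all z[j] nonzero. Not the algorithm used by ?lasd4.
    """
    n = len(d)
    lo = d[i - 1] ** 2  # f -> -infinity just to the right of this pole
    if i < n:
        hi = d[i] ** 2  # f -> +infinity just to the left of the next pole
    else:
        hi = d[n - 1] ** 2 + rho * sum(zj * zj for zj in z)  # f(hi) >= 0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mid <= lo or mid >= hi:  # interval cannot be split any further
            break
        if secular(d, z, rho, mid) > 0.0:
            hi = mid  # f is increasing, so the root lies to the left
        else:
            lo = mid
    return math.sqrt(0.5 * (lo + hi))
```

For example, with d = (1, 2), z = (0.6, 0.8) and rho = 1, the squares of updated_sigma(d, z, rho, 1) and updated_sigma(d, z, rho, 2) sum to 6, the trace of the modified matrix.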

?lasd5
Computes the square root of the i-th eigenvalue of a positive symmetric rank-one modification of a 2-by-2 diagonal matrix. Used by ?bdsdc.

Syntax
call slasd5( i, d, z, delta, rho, dsigma, work )
call dlasd5( i, d, z, delta, rho, dsigma, work )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine computes the square root of the i-th eigenvalue of a positive symmetric rank-one modification of a 2-by-2 diagonal matrix diag(d)*diag(d) + rho*Z*Z^T. The diagonal entries in the array d must satisfy 0 ≤ d(i) < d(j) for i < j
[...]
> 0: if info = 1, a singular value did not converge.

?lasd7
Merges the two sets of singular values together into a single sorted set. Then it tries to deflate the size of the problem. Used by ?bdsdc.

Syntax
call slasd7( icompq, nl, nr, sqre, k, d, z, zw, vf, vfw, vl, vlw, alpha, beta, dsigma, idx, idxp, idxq, perm, givptr, givcol, ldgcol, givnum, ldgnum, c, s, info )
call dlasd7( icompq, nl, nr, sqre, k, d, z, zw, vf, vfw, vl, vlw, alpha, beta, dsigma, idx, idxp, idxq, perm, givptr, givcol, ldgcol, givnum, ldgnum, c, s, info )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine ?lasd7 merges the two sets of singular values together into a single sorted set. Then it tries to deflate the size of the problem. There are two ways in which deflation can occur: when two or more singular values are close together, or when there is a tiny entry in the z vector. For each such occurrence the order of the related secular equation problem is reduced by one. ?lasd7 is called from ?lasd6.

Input Parameters
icompq INTEGER. Specifies whether singular vectors are to be computed in compact form, as follows:
= 0: Compute singular values only.
= 1: Compute singular vectors of upper bidiagonal matrix in compact form.
nl INTEGER. The row dimension of the upper block. nl ≥ 1.
nr INTEGER. The row dimension of the lower block. nr ≥ 1.
sqre INTEGER.
= 0: the lower block is an nr-by-nr square matrix.
= 1: the lower block is an nr-by-(nr+1) rectangular matrix.
The bidiagonal matrix has n = nl + nr + 1 rows and m = n + sqre ≥ n columns.
d REAL for slasd7, DOUBLE PRECISION for dlasd7. Array, DIMENSION (n). On entry, d contains the singular values of the two submatrices to be combined.
zw REAL for slasd7, DOUBLE PRECISION for dlasd7. Array, DIMENSION (m). Workspace for z.
vf REAL for slasd7, DOUBLE PRECISION for dlasd7. Array, DIMENSION (m). On entry, vf(1:nl+1) contains the first components of all right singular vectors of the upper block; and vf(nl+2:m) contains the first components of all right singular vectors of the lower block.
vfw REAL for slasd7, DOUBLE PRECISION for dlasd7. Array, DIMENSION (m). Workspace for vf.
vl REAL for slasd7, DOUBLE PRECISION for dlasd7. Array, DIMENSION (m). On entry, vl(1:nl+1) contains the last components of all right singular vectors of the upper block; and vl(nl+2:m) contains the last components of all right singular vectors of the lower block.
vlw REAL for slasd7, DOUBLE PRECISION for dlasd7. Array, DIMENSION (m). Workspace for vl.
alpha REAL for slasd7, DOUBLE PRECISION for dlasd7. Contains the diagonal element associated with the added row.
beta REAL for slasd7, DOUBLE PRECISION for dlasd7. Contains the off-diagonal element associated with the added row.
idx INTEGER. Workspace array, DIMENSION (n). This will contain the permutation used to sort the contents of d into ascending order.
idxp INTEGER. Workspace array, DIMENSION (n). This will contain the permutation used to place deflated values of d at the end of the array.
idxq INTEGER. Array, DIMENSION (n). This contains the permutation which separately sorts the two sub-problems in d into ascending order. Note that entries in the first half of this permutation must first be moved one position backward; and entries in the second half must first have nl+1 added to their values.
ldgcol INTEGER. The leading dimension of the output array givcol, must be at least n.
ldgnum INTEGER. The leading dimension of the output array givnum, must be at least n.

Output Parameters
k INTEGER. Contains the dimension of the non-deflated matrix; this is the order of the related secular equation. 1 ≤ k ≤ n.
d On exit, d contains the trailing (n-k) updated singular values (those which were deflated) sorted into increasing order.
z REAL for slasd7, DOUBLE PRECISION for dlasd7. Array, DIMENSION (m). On exit, z contains the updating row vector in the secular equation.
vf On exit, vf contains the first components of all right singular vectors of the bidiagonal matrix.
vl On exit, vl contains the last components of all right singular vectors of the bidiagonal matrix.
dsigma REAL for slasd7, DOUBLE PRECISION for dlasd7. Array, DIMENSION (n). Contains a copy of the diagonal elements (k-1 singular values and one zero) in the secular equation.
idxp On output, idxp(2:k) points to the nondeflated d-values and idxp(k+1:n) points to the deflated singular values.
perm INTEGER. Array, DIMENSION (n). The permutations (from deflation and sorting) to be applied to each singular block. Not referenced if icompq = 0.
givptr INTEGER. The number of Givens rotations which took place in this subproblem. Not referenced if icompq = 0.
givcol INTEGER. Array, DIMENSION (ldgcol, 2). Each pair of numbers indicates a pair of columns to take place in a Givens rotation. Not referenced if icompq = 0.
givnum REAL for slasd7, DOUBLE PRECISION for dlasd7. Array, DIMENSION (ldgnum, 2). Each number indicates the C or S value to be used in the corresponding Givens rotation. Not referenced if icompq = 0.
c REAL for slasd7, DOUBLE PRECISION for dlasd7. If sqre = 0, then c contains garbage, and if sqre = 1, then c contains the C-value of a Givens rotation related to the right null space.
s REAL for slasd7, DOUBLE PRECISION for dlasd7.
If sqre = 0, then s contains garbage, and if sqre = 1, then s contains the S-value of a Givens rotation related to the right null space.
info INTEGER. = 0: successful exit. < 0: if info = -i, the i-th argument had an illegal value.

?lasd8
Finds the square roots of the roots of the secular equation, and stores, for each element in d, the distance to its two nearest poles. Used by ?bdsdc.

Syntax
call slasd8( icompq, k, d, z, vf, vl, difl, difr, lddifr, dsigma, work, info )
call dlasd8( icompq, k, d, z, vf, vl, difl, difr, lddifr, dsigma, work, info )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine ?lasd8 finds the square roots of the roots of the secular equation, as defined by the values in dsigma and z. It makes the appropriate calls to ?lasd4, and stores, for each element in d, the distance to its two nearest poles (elements in dsigma). It also updates the arrays vf and vl, the first and last components of all the right singular vectors of the original bidiagonal matrix. ?lasd8 is called from ?lasd6.

Input Parameters
icompq INTEGER. Specifies whether singular vectors are to be computed in factored form in the calling routine:
= 0: Compute singular values only.
= 1: Compute singular vectors in factored form as well.
k INTEGER. The number of terms in the rational function to be solved by ?lasd4. k ≥ 1.
z REAL for slasd8, DOUBLE PRECISION for dlasd8. Array, DIMENSION (k). The first k elements of this array contain the components of the deflation-adjusted updating row vector.
vf REAL for slasd8, DOUBLE PRECISION for dlasd8. Array, DIMENSION (k). On entry, vf contains information passed through dbede8.
vl REAL for slasd8, DOUBLE PRECISION for dlasd8. Array, DIMENSION (k). On entry, vl contains information passed through dbede8.
lddifr INTEGER. The leading dimension of the output array difr, must be at least k.
dsigma REAL for slasd8, DOUBLE PRECISION for dlasd8. Array, DIMENSION (k).
The first k elements of this array contain the old roots of the deflated updating problem. These are the poles of the secular equation.
work REAL for slasd8, DOUBLE PRECISION for dlasd8. Workspace array, DIMENSION at least (3*k).

Output Parameters
d REAL for slasd8, DOUBLE PRECISION for dlasd8. Array, DIMENSION (k). On output, d contains the updated singular values.
z Updated on exit.
vf On exit, vf contains the first k components of the first components of all right singular vectors of the bidiagonal matrix.
vl On exit, vl contains the first k components of the last components of all right singular vectors of the bidiagonal matrix.
difl REAL for slasd8, DOUBLE PRECISION for dlasd8. Array, DIMENSION (k). On exit, difl(i) = d(i) - dsigma(i).
difr REAL for slasd8, DOUBLE PRECISION for dlasd8. Array, DIMENSION (lddifr, 2) if icompq = 1 and DIMENSION (k) if icompq = 0. On exit, difr(i,1) = d(i) - dsigma(i+1); difr(k,1) is not defined and will not be referenced. If icompq = 1, difr(1:k,2) is an array containing the normalizing factors for the right singular vector matrix.
dsigma The elements of this array may be very slightly altered in value.
info INTEGER. = 0: successful exit. < 0: if info = -i, the i-th argument had an illegal value. > 0: if info = 1, a singular value did not converge.

?lasd9
Finds the square roots of the roots of the secular equation, and stores, for each element in d, the distance to its two nearest poles. Used by ?bdsdc.

Syntax
call slasd9( icompq, ldu, k, d, z, vf, vl, difl, difr, dsigma, work, info )
call dlasd9( icompq, ldu, k, d, z, vf, vl, difl, difr, dsigma, work, info )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine ?lasd9 finds the square roots of the roots of the secular equation, as defined by the values in dsigma and z.
It makes the appropriate calls to ?lasd4, and stores, for each element in d, the distance to its two nearest poles (elements in dsigma). It also updates the arrays vf and vl, the first and last components of all the right singular vectors of the original bidiagonal matrix. ?lasd9 is called from ?lasd7.

Input Parameters
icompq INTEGER. Specifies whether singular vectors are to be computed in factored form in the calling routine:
If icompq = 0, compute singular values only;
If icompq = 1, compute singular vector matrices in factored form also.
k INTEGER. The number of terms in the rational function to be solved by ?lasd4. k ≥ 1.
dsigma REAL for slasd9, DOUBLE PRECISION for dlasd9. Array, DIMENSION (k). The first k elements of this array contain the old roots of the deflated updating problem. These are the poles of the secular equation.
z REAL for slasd9, DOUBLE PRECISION for dlasd9. Array, DIMENSION (k). The first k elements of this array contain the components of the deflation-adjusted updating row vector.
vf REAL for slasd9, DOUBLE PRECISION for dlasd9. Array, DIMENSION (k). On entry, vf contains information passed through sbede8.
vl REAL for slasd9, DOUBLE PRECISION for dlasd9. Array, DIMENSION (k). On entry, vl contains information passed through sbede8.
work REAL for slasd9, DOUBLE PRECISION for dlasd9. Workspace array, DIMENSION at least (3*k).

Output Parameters
d REAL for slasd9, DOUBLE PRECISION for dlasd9. Array, DIMENSION (k). d(i) contains the updated singular values.
vf On exit, vf contains the first k components of the first components of all right singular vectors of the bidiagonal matrix.
vl On exit, vl contains the first k components of the last components of all right singular vectors of the bidiagonal matrix.
difl REAL for slasd9, DOUBLE PRECISION for dlasd9. Array, DIMENSION (k). On exit, difl(i) = d(i) - dsigma(i).
difr REAL for slasd9, DOUBLE PRECISION for dlasd9.
Array, DIMENSION (ldu, 2) if icompq = 1 and DIMENSION (k) if icompq = 0. On exit, difr(i, 1) = d(i) - dsigma(i+1); difr(k, 1) is not defined and will not be referenced. If icompq = 1, difr(1:k, 2) is an array containing the normalizing factors for the right singular vector matrix.
info INTEGER. = 0: successful exit. < 0: if info = -i, the i-th argument had an illegal value. > 0: if info = 1, a singular value did not converge.

?lasda
Computes the singular value decomposition (SVD) of a real upper bidiagonal matrix with diagonal d and off-diagonal e. Used by ?bdsdc.

Syntax
call slasda( icompq, smlsiz, n, sqre, d, e, u, ldu, vt, k, difl, difr, z, poles, givptr, givcol, ldgcol, perm, givnum, c, s, work, iwork, info )
call dlasda( icompq, smlsiz, n, sqre, d, e, u, ldu, vt, k, difl, difr, z, poles, givptr, givcol, ldgcol, perm, givnum, c, s, work, iwork, info )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
Using a divide and conquer approach, ?lasda computes the singular value decomposition (SVD) of a real upper bidiagonal n-by-m matrix B with diagonal d and off-diagonal e, where m = n + sqre. The algorithm computes the singular values in the SVD B = U*S*V^T. The orthogonal matrices U and V^T are optionally computed in compact form. A related subroutine ?lasd0 computes the singular values and the singular vectors in explicit form.

Input Parameters
icompq INTEGER. Specifies whether singular vectors are to be computed in compact form, as follows:
= 0: Compute singular values only.
= 1: Compute singular vectors of upper bidiagonal matrix in compact form.
smlsiz INTEGER. The maximum size of the subproblems at the bottom of the computation tree.
n INTEGER. The row dimension of the upper bidiagonal matrix. This is also the dimension of the main diagonal array d.
sqre INTEGER. Specifies the column dimension of the bidiagonal matrix.
If sqre = 0: the bidiagonal matrix has column dimension m = n. If sqre = 1: the bidiagonal matrix has column dimension m = n + 1.
d REAL for slasda, DOUBLE PRECISION for dlasda. Array, DIMENSION (n). On entry, d contains the main diagonal of the bidiagonal matrix.
e REAL for slasda, DOUBLE PRECISION for dlasda. Array, DIMENSION (m - 1). Contains the subdiagonal entries of the bidiagonal matrix. On exit, e is destroyed.
ldu INTEGER. The leading dimension of arrays u, vt, difl, difr, poles, givnum, and z. ldu ≥ n.
ldgcol INTEGER. The leading dimension of arrays givcol and perm. ldgcol ≥ n.
work REAL for slasda, DOUBLE PRECISION for dlasda. Workspace array, DIMENSION (6*n + (smlsiz+1)^2).
iwork INTEGER. Workspace array, DIMENSION must be at least (7*n).

Output Parameters
d On exit, if info = 0, d contains the singular values of the bidiagonal matrix.
u REAL for slasda, DOUBLE PRECISION for dlasda. Array, DIMENSION (ldu, smlsiz) if icompq = 1. Not referenced if icompq = 0. If icompq = 1, on exit, u contains the left singular vector matrices of all subproblems at the bottom level.
vt REAL for slasda, DOUBLE PRECISION for dlasda. Array, DIMENSION (ldu, smlsiz+1) if icompq = 1, and not referenced if icompq = 0. If icompq = 1, on exit, vt^T contains the right singular vector matrices of all subproblems at the bottom level.
k INTEGER. Array, DIMENSION (n) if icompq = 1 and DIMENSION (1) if icompq = 0. If icompq = 1, on exit, k(i) is the dimension of the i-th secular equation on the computation tree.
difl REAL for slasda, DOUBLE PRECISION for dlasda. Array, DIMENSION (ldu, nlvl), where nlvl = floor(log2(n/smlsiz)).
difr REAL for slasda, DOUBLE PRECISION for dlasda. Array, DIMENSION (ldu, 2*nlvl) if icompq = 1 and DIMENSION (n) if icompq = 0.
If icompq = 1, on exit, difl(1:n, i) and difr(1:n, 2*i - 1) record distances between singular values on the i-th level and singular values on the (i-1)-th level, and difr(1:n, 2*i) contains the normalizing factors for the right singular vector matrix. See ?lasd8 for details.
z REAL for slasda, DOUBLE PRECISION for dlasda. Array, DIMENSION (ldu, nlvl) if icompq = 1 and DIMENSION (n) if icompq = 0. The first k elements of z(1, i) contain the components of the deflation-adjusted updating row vector for subproblems on the i-th level.
poles REAL for slasda, DOUBLE PRECISION for dlasda. Array, DIMENSION (ldu, 2*nlvl) if icompq = 1, and not referenced if icompq = 0. If icompq = 1, on exit, poles(1, 2*i - 1) and poles(1, 2*i) contain the new and old singular values involved in the secular equations on the i-th level.
givptr INTEGER. Array, DIMENSION (n) if icompq = 1, and not referenced if icompq = 0. If icompq = 1, on exit, givptr(i) records the number of Givens rotations performed on the i-th problem on the computation tree.
givcol INTEGER. Array, DIMENSION (ldgcol, 2*nlvl) if icompq = 1, and not referenced if icompq = 0. If icompq = 1, on exit, for each i, givcol(1, 2*i - 1) and givcol(1, 2*i) record the locations of Givens rotations performed on the i-th level on the computation tree.
perm INTEGER. Array, DIMENSION (ldgcol, nlvl) if icompq = 1, and not referenced if icompq = 0. If icompq = 1, on exit, perm(1, i) records permutations done on the i-th level of the computation tree.
givnum REAL for slasda, DOUBLE PRECISION for dlasda. Array, DIMENSION (ldu, 2*nlvl) if icompq = 1, and not referenced if icompq = 0. If icompq = 1, on exit, for each i, givnum(1, 2*i - 1) and givnum(1, 2*i) record the C- and S-values of Givens rotations performed on the i-th level on the computation tree.
c REAL for slasda, DOUBLE PRECISION for dlasda. Array, DIMENSION (n) if icompq = 1, and DIMENSION (1) if icompq = 0.
If icompq = 1 and the i-th subproblem is not square, on exit, c(i) contains the C-value of a Givens rotation related to the right null space of the i-th subproblem.
s REAL for slasda, DOUBLE PRECISION for dlasda. Array, DIMENSION (n) if icompq = 1, and DIMENSION (1) if icompq = 0. If icompq = 1 and the i-th subproblem is not square, on exit, s(i) contains the S-value of a Givens rotation related to the right null space of the i-th subproblem.
info INTEGER. = 0: successful exit. < 0: if info = -i, the i-th argument had an illegal value. > 0: if info = 1, a singular value did not converge.

?lasdq
Computes the SVD of a real bidiagonal matrix with diagonal d and off-diagonal e. Used by ?bdsdc.

Syntax
call slasdq( uplo, sqre, n, ncvt, nru, ncc, d, e, vt, ldvt, u, ldu, c, ldc, work, info )
call dlasdq( uplo, sqre, n, ncvt, nru, ncc, d, e, vt, ldvt, u, ldu, c, ldc, work, info )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine ?lasdq computes the singular value decomposition (SVD) of a real (upper or lower) bidiagonal matrix with diagonal d and off-diagonal e, accumulating the transformations if desired. If B is the input bidiagonal matrix, the algorithm computes orthogonal matrices Q and P such that B = Q*S*P^T. The singular values S are overwritten on d. The input matrix U is changed to U*Q if desired. The input matrix VT is changed to P^T*VT if desired. The input matrix C is changed to Q^T*C if desired.

Input Parameters
uplo CHARACTER*1. On entry, uplo specifies whether the input bidiagonal matrix is upper or lower bidiagonal. If uplo = 'U' or 'u', B is upper bidiagonal; if uplo = 'L' or 'l', B is lower bidiagonal.
sqre INTEGER. = 0: the input matrix is n-by-n. = 1: the input matrix is n-by-(n+1) if uplo = 'U' and (n+1)-by-n if uplo = 'L'. The bidiagonal matrix has n = nl + nr + 1 rows and m = n + sqre ≥ n columns.
n INTEGER.
On entry, n specifies the number of rows and columns in the matrix. n must be at least 0.
ncvt INTEGER. On entry, ncvt specifies the number of columns of the matrix VT. ncvt must be at least 0.
nru INTEGER. On entry, nru specifies the number of rows of the matrix U. nru must be at least 0.
ncc INTEGER. On entry, ncc specifies the number of columns of the matrix C. ncc must be at least 0.
d REAL for slasdq, DOUBLE PRECISION for dlasdq. Array, DIMENSION (n). On entry, d contains the diagonal entries of the bidiagonal matrix.
e REAL for slasdq, DOUBLE PRECISION for dlasdq. Array, DIMENSION (n-1) if sqre = 0 and (n) if sqre = 1. On entry, the entries of e contain the off-diagonal entries of the bidiagonal matrix.
vt REAL for slasdq, DOUBLE PRECISION for dlasdq. Array, DIMENSION (ldvt, ncvt). On entry, contains a matrix which on exit has been premultiplied by P^T, dimension n-by-ncvt if sqre = 0 and (n+1)-by-ncvt if sqre = 1 (not referenced if ncvt = 0).
ldvt INTEGER. On entry, ldvt specifies the leading dimension of vt as declared in the calling (sub) program. ldvt must be at least 1. If ncvt is nonzero, ldvt must also be at least n.
u REAL for slasdq, DOUBLE PRECISION for dlasdq. Array, DIMENSION (ldu, n). On entry, contains a matrix which on exit has been postmultiplied by Q, dimension nru-by-n if sqre = 0 and nru-by-(n+1) if sqre = 1 (not referenced if nru = 0).
ldu INTEGER. On entry, ldu specifies the leading dimension of u as declared in the calling (sub) program. ldu must be at least max(1, nru).
c REAL for slasdq, DOUBLE PRECISION for dlasdq. Array, DIMENSION (ldc, ncc). On entry, contains an n-by-ncc matrix which on exit has been premultiplied by Q^T, dimension n-by-ncc if sqre = 0 and (n+1)-by-ncc if sqre = 1 (not referenced if ncc = 0).
ldc INTEGER. On entry, ldc specifies the leading dimension of c as declared in the calling (sub) program. ldc must be at least 1. If ncc is nonzero, ldc must also be at least n.
work REAL for slasdq; DOUBLE PRECISION for dlasdq. Array, DIMENSION (4n). This is a workspace array. Only referenced if one of ncvt, nru, or ncc is nonzero, and if n is at least 2.

Output Parameters
d On normal exit, d contains the singular values in ascending order.
e On normal exit, e will contain 0. If the algorithm does not converge, d and e will contain the diagonal and superdiagonal entries of a bidiagonal matrix orthogonally equivalent to the one given as input.
vt On exit, the matrix has been premultiplied by P'.
u On exit, the matrix has been postmultiplied by Q.
c On exit, the matrix has been premultiplied by Q'.
info INTEGER. On exit, a value of 0 indicates a successful exit. If info < 0, argument number -info is illegal. If info > 0, the algorithm did not converge, and info specifies how many superdiagonals did not converge.

?lasdt
Creates a tree of subproblems for bidiagonal divide and conquer. Used by ?bdsdc.

Syntax
call slasdt( n, lvl, nd, inode, ndiml, ndimr, msub )
call dlasdt( n, lvl, nd, inode, ndiml, ndimr, msub )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine creates a tree of subproblems for bidiagonal divide and conquer.

Input Parameters
n INTEGER. On entry, the number of diagonal elements of the bidiagonal matrix.
msub INTEGER. On entry, the maximum row dimension that each subproblem at the bottom of the tree can have.

Output Parameters
lvl INTEGER. On exit, the number of levels in the computation tree.
nd INTEGER. On exit, the number of nodes in the tree.
inode INTEGER. Array, DIMENSION (n). On exit, centers of subproblems.
ndiml INTEGER. Array, DIMENSION (n). On exit, row dimensions of left children.
ndimr INTEGER. Array, DIMENSION (n). On exit, row dimensions of right children.

?laset
Initializes the off-diagonal elements and the diagonal elements of a matrix to given values.
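As a rough illustration of what ?laset does, the following pure-Python sketch sets the off-diagonal entries of the selected part of a matrix to alpha and the diagonal to beta. It is not MKL code: the function name laset and the list-of-rows matrix layout are assumptions made for this example, and only the uplo, m, n, alpha, beta semantics are taken from the routine description.

```python
def laset(uplo, m, n, alpha, beta, a):
    """Illustrative stand-in for ?laset on a list-of-rows matrix (modified in place)."""
    part = uplo.upper()
    for i in range(m):
        for j in range(n):
            if i == j:
                continue                      # diagonal is handled below
            if part == 'U' and j < i:
                continue                      # strictly lower part is left unchanged
            if part == 'L' and j > i:
                continue                      # strictly upper part is left unchanged
            a[i][j] = alpha
    for i in range(min(m, n)):
        a[i][i] = beta                        # diagonal is set for every uplo
    return a


# Start from a matrix filled with 9.0 and write the upper triangle of the identity.
a = [[9.0] * 3 for _ in range(3)]
laset('U', 3, 3, 0.0, 1.0, a)
```

With uplo = 'U', the 9.0 entries below the diagonal survive, which matches the statement that the strictly lower triangular part is not changed.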
Syntax
call slaset( uplo, m, n, alpha, beta, a, lda )
call dlaset( uplo, m, n, alpha, beta, a, lda )
call claset( uplo, m, n, alpha, beta, a, lda )
call zlaset( uplo, m, n, alpha, beta, a, lda )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine initializes an m-by-n matrix A to beta on the diagonal and alpha on the off-diagonals.

Input Parameters
uplo CHARACTER*1. Specifies the part of the matrix A to be set.
If uplo = 'U', the upper triangular part is set; the strictly lower triangular part of A is not changed.
If uplo = 'L', the lower triangular part is set; the strictly upper triangular part of A is not changed.
Otherwise, all of the matrix A is set.
m INTEGER. The number of rows of the matrix A. m ≥ 0.
n INTEGER. The number of columns of the matrix A. n ≥ 0.
alpha, beta REAL for slaset; DOUBLE PRECISION for dlaset; COMPLEX for claset; DOUBLE COMPLEX for zlaset. The constants to which the off-diagonal and diagonal elements are to be set, respectively.
a REAL for slaset; DOUBLE PRECISION for dlaset; COMPLEX for claset; DOUBLE COMPLEX for zlaset. Array, DIMENSION (lda, n). On entry, the m-by-n matrix A.
lda INTEGER. The leading dimension of the array a. lda ≥ max(1,m).

Output Parameters
a On exit, the leading m-by-n submatrix of A is set as follows:
if uplo = 'U', A(i,j) = alpha, 1 ≤ i ≤ j-1, 1 ≤ j ≤ n,
if uplo = 'L', A(i,j) = alpha, j+1 ≤ i ≤ m, 1 ≤ j ≤ n,
otherwise, A(i,j) = alpha, 1 ≤ i ≤ m, 1 ≤ j ≤ n, i ≠ j,
and, for all uplo, A(i,i) = beta, 1 ≤ i ≤ min(m, n).

?lasq1
Computes the singular values of a real square bidiagonal matrix. Used by ?bdsqr.

Syntax
call slasq1( n, d, e, work, info )
call dlasq1( n, d, e, work, info )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine ?lasq1 computes the singular values of a real n-by-n bidiagonal matrix with diagonal d and off-diagonal e. The singular values are computed to high relative accuracy, in the absence of denormalization, underflow and overflow.

Input Parameters
n INTEGER. The number of rows and columns in the matrix. n ≥ 0.
d REAL for slasq1; DOUBLE PRECISION for dlasq1. Array, DIMENSION (n). On entry, d contains the diagonal elements of the bidiagonal matrix whose SVD is desired.
e REAL for slasq1; DOUBLE PRECISION for dlasq1. Array, DIMENSION (n). On entry, elements e(1:n-1) contain the off-diagonal elements of the bidiagonal matrix whose SVD is desired.
work REAL for slasq1; DOUBLE PRECISION for dlasq1. Workspace array, DIMENSION (4n).

Output Parameters
d On normal exit, d contains the singular values in decreasing order.
e On exit, e is overwritten.
info INTEGER.
= 0: successful exit;
< 0: if info = -i, the i-th argument had an illegal value;
> 0: the algorithm failed:
= 1, a split was marked by a positive value in e;
= 2, current block of z not diagonalized after 30n iterations (in inner while loop);
= 3, termination criterion of outer while loop not met (program created more than n unreduced blocks).

?lasq2
Computes all the eigenvalues of the symmetric positive definite tridiagonal matrix associated with the qd array z to high relative accuracy. Used by ?bdsqr and ?stegr.

Syntax
call slasq2( n, z, info )
call dlasq2( n, z, info )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine ?lasq2 computes all the eigenvalues of the symmetric positive definite tridiagonal matrix associated with the qd array z to high relative accuracy, in the absence of denormalization, underflow and overflow.
To see the relation of z to the tridiagonal matrix, let L be a unit lower bidiagonal matrix with subdiagonals z(2,4,6,...) and let U be an upper bidiagonal matrix with 1's above and diagonal z(1,3,5,...). The tridiagonal is LU or, if you prefer, the symmetric tridiagonal to which it is similar.

Input Parameters
n INTEGER. The number of rows and columns in the matrix. n ≥ 0.
z REAL for slasq2; DOUBLE PRECISION for dlasq2.
Array, DIMENSION (4*n). On entry, z holds the qd array.

Output Parameters
z On exit, entries 1 to n hold the eigenvalues in decreasing order, z(2n+1) holds the trace, and z(2n+2) holds the sum of the eigenvalues. If n > 2, then z(2n+3) holds the iteration count, z(2n+4) holds ndivs/nin^2, and z(2n+5) holds the percentage of shifts that failed.
info INTEGER.
= 0: successful exit;
< 0: if the i-th argument is a scalar and had an illegal value, then info = -i; if the i-th argument is an array and the j-th entry had an illegal value, then info = -(i*100 + j);
> 0: the algorithm failed:
= 1, a split was marked by a positive value in e;
= 2, current block of z not diagonalized after 30*n iterations (in inner while loop);
= 3, termination criterion of outer while loop not met (program created more than n unreduced blocks).

Application Notes
The routine ?lasq2 defines a logical variable, ieee, which is .TRUE. on machines that follow the IEEE 754 floating-point standard in their handling of infinities and NaNs, and .FALSE. otherwise. This variable is passed to ?lasq3.

?lasq3
Checks for deflation, computes a shift and calls dqds. Used by ?bdsqr.

Syntax
call slasq3( i0, n0, z, pp, dmin, sigma, desig, qmax, nfail, iter, ndiv, ieee, ttype, dmin1, dmin2, dn, dn1, dn2, g, tau )
call dlasq3( i0, n0, z, pp, dmin, sigma, desig, qmax, nfail, iter, ndiv, ieee, ttype, dmin1, dmin2, dn, dn1, dn2, g, tau )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine ?lasq3 checks for deflation, computes a shift tau, and calls dqds. In case of failure, it changes shifts, and tries again until output is positive.

Input Parameters
i0 INTEGER. First index.
n0 INTEGER. Last index.
z REAL for slasq3; DOUBLE PRECISION for dlasq3. Array, DIMENSION (4n). z holds the qd array.
pp INTEGER. pp=0 for ping, pp=1 for pong. pp=2 indicates that flipping was applied to the z array and that the initial tests for deflation should not be performed.
desig REAL for slasq3; DOUBLE PRECISION for dlasq3. Lower order part of sigma.
qmax REAL for slasq3; DOUBLE PRECISION for dlasq3. Maximum value of q.
ieee LOGICAL. Flag for IEEE or non-IEEE arithmetic (passed to ?lasq5).
ttype INTEGER. Shift type.
dmin1, dmin2, dn, dn1, dn2, g, tau REAL for slasq3; DOUBLE PRECISION for dlasq3. These scalars are passed as arguments in order to save their values between calls to ?lasq3.

Output Parameters
dmin REAL for slasq3; DOUBLE PRECISION for dlasq3. Minimum value of d.
pp INTEGER. pp=0 for ping, pp=1 for pong. pp=2 indicates that flipping was applied to the z array and that the initial tests for deflation should not be performed.
sigma REAL for slasq3; DOUBLE PRECISION for dlasq3. Sum of shifts used in the current segment.
desig Lower order part of sigma.
nfail INTEGER. Number of times shift was too big.
iter INTEGER. Number of iterations.
ndiv INTEGER. Number of divisions.
ttype INTEGER. Shift type.
dmin1, dmin2, dn, dn1, dn2, g, tau REAL for slasq3; DOUBLE PRECISION for dlasq3. These scalars are passed as arguments in order to save their values between calls to ?lasq3.

?lasq4
Computes an approximation to the smallest eigenvalue using values of d from the previous transform. Used by ?bdsqr.

Syntax
call slasq4( i0, n0, z, pp, n0in, dmin, dmin1, dmin2, dn, dn1, dn2, tau, ttype, g )
call dlasq4( i0, n0, z, pp, n0in, dmin, dmin1, dmin2, dn, dn1, dn2, tau, ttype, g )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine computes an approximation tau to the smallest eigenvalue using values of d from the previous transform.

Input Parameters
i0 INTEGER. First index.
n0 INTEGER. Last index.
z REAL for slasq4; DOUBLE PRECISION for dlasq4. Array, DIMENSION (4n). z holds the qd array.
pp INTEGER. pp=0 for ping, pp=1 for pong.
n0in INTEGER. The value of n0 at start of eigtest.
dmin REAL for slasq4; DOUBLE PRECISION for dlasq4. Minimum value of d.
dmin1 REAL for slasq4; DOUBLE PRECISION for dlasq4. Minimum value of d, excluding d(n0).
dmin2 REAL for slasq4; DOUBLE PRECISION for dlasq4. Minimum value of d, excluding d(n0) and d(n0-1).
dn REAL for slasq4; DOUBLE PRECISION for dlasq4. Contains d(n).
dn1 REAL for slasq4; DOUBLE PRECISION for dlasq4. Contains d(n-1).
dn2 REAL for slasq4; DOUBLE PRECISION for dlasq4. Contains d(n-2).
g REAL for slasq4; DOUBLE PRECISION for dlasq4. A scalar passed as an argument in order to save its value between calls to ?lasq4.

Output Parameters
tau REAL for slasq4; DOUBLE PRECISION for dlasq4. Shift.
ttype INTEGER. Shift type.
g REAL for slasq4; DOUBLE PRECISION for dlasq4. A scalar passed as an argument in order to save its value between calls to ?lasq4.

?lasq5
Computes one dqds transform in ping-pong form. Used by ?bdsqr and ?stegr.

Syntax
call slasq5( i0, n0, z, pp, tau, dmin, dmin1, dmin2, dn, dnm1, dnm2, ieee )
call dlasq5( i0, n0, z, pp, tau, dmin, dmin1, dmin2, dn, dnm1, dnm2, ieee )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine computes one dqds transform in ping-pong form: one version for IEEE machines, another for non-IEEE machines.

Input Parameters
i0 INTEGER. First index.
n0 INTEGER. Last index.
z REAL for slasq5; DOUBLE PRECISION for dlasq5. Array, DIMENSION (4n). z holds the qd array. emin is stored in z(4*n0) to avoid an extra argument.
pp INTEGER. pp=0 for ping, pp=1 for pong.
tau REAL for slasq5; DOUBLE PRECISION for dlasq5. This is the shift.
ieee LOGICAL. Flag for IEEE or non-IEEE arithmetic.

Output Parameters
dmin REAL for slasq5; DOUBLE PRECISION for dlasq5. Minimum value of d.
dmin1 REAL for slasq5; DOUBLE PRECISION for dlasq5. Minimum value of d, excluding d(n0).
dmin2 REAL for slasq5; DOUBLE PRECISION for dlasq5. Minimum value of d, excluding d(n0) and d(n0-1).
dn REAL for slasq5; DOUBLE PRECISION for dlasq5.
Contains d(n0), the last value of d.
dnm1 REAL for slasq5; DOUBLE PRECISION for dlasq5. Contains d(n0-1).
dnm2 REAL for slasq5; DOUBLE PRECISION for dlasq5. Contains d(n0-2).

?lasq6
Computes one dqd transform in ping-pong form. Used by ?bdsqr and ?stegr.

Syntax
call slasq6( i0, n0, z, pp, dmin, dmin1, dmin2, dn, dnm1, dnm2 )
call dlasq6( i0, n0, z, pp, dmin, dmin1, dmin2, dn, dnm1, dnm2 )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine ?lasq6 computes one dqd (shift equal to zero) transform in ping-pong form, with protection against underflow and overflow.

Input Parameters
i0 INTEGER. First index.
n0 INTEGER. Last index.
z REAL for slasq6; DOUBLE PRECISION for dlasq6. Array, DIMENSION (4n). z holds the qd array. emin is stored in z(4*n0) to avoid an extra argument.
pp INTEGER. pp=0 for ping, pp=1 for pong.

Output Parameters
dmin REAL for slasq6; DOUBLE PRECISION for dlasq6. Minimum value of d.
dmin1 REAL for slasq6; DOUBLE PRECISION for dlasq6. Minimum value of d, excluding d(n0).
dmin2 REAL for slasq6; DOUBLE PRECISION for dlasq6. Minimum value of d, excluding d(n0) and d(n0-1).
dn REAL for slasq6; DOUBLE PRECISION for dlasq6. Contains d(n0), the last value of d.
dnm1 REAL for slasq6; DOUBLE PRECISION for dlasq6. Contains d(n0-1).
dnm2 REAL for slasq6; DOUBLE PRECISION for dlasq6. Contains d(n0-2).

?lasr
Applies a sequence of plane rotations to a general rectangular matrix.

Syntax
call slasr( side, pivot, direct, m, n, c, s, a, lda )
call dlasr( side, pivot, direct, m, n, c, s, a, lda )
call clasr( side, pivot, direct, m, n, c, s, a, lda )
call zlasr( side, pivot, direct, m, n, c, s, a, lda )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine applies a sequence of plane rotations to a real/complex matrix A, from the left or the right.
A := P*A, when side = 'L' (left-hand side)
A := A*P', when side = 'R' (right-hand side)
where P is an orthogonal matrix consisting of a sequence of plane rotations, with z = m when side = 'L' and z = n when side = 'R'.
When direct = 'F' (forward sequence), P = P(z-1)*...*P(2)*P(1), and when direct = 'B' (backward sequence), P = P(1)*P(2)*...*P(z-1), where P(k) is a plane rotation matrix defined by the 2-by-2 plane rotation
R(k) = (  c(k)  s(k) )
       ( -s(k)  c(k) )
When pivot = 'V' (variable pivot), the rotation is performed for the plane (k, k+1), that is, P(k) is the identity matrix with R(k) embedded as a rank-2 modification in rows and columns k and k+1.
When pivot = 'T' (top pivot), the rotation is performed for the plane (1, k+1), so P(k) is the identity matrix with R(k) embedded in rows and columns 1 and k+1.
Similarly, when pivot = 'B' (bottom pivot), the rotation is performed for the plane (k, z), so P(k) is the identity matrix with R(k) embedded in rows and columns k and z.
The rotations are performed without ever forming P(k) explicitly.

Input Parameters
side CHARACTER*1. Specifies whether the plane rotation matrix P is applied to A on the left or the right.
= 'L': left, compute A := P*A
= 'R': right, compute A := A*P'
direct CHARACTER*1. Specifies whether P is a forward or backward sequence of plane rotations.
= 'F': forward, P = P(z-1)*...*P(2)*P(1)
= 'B': backward, P = P(1)*P(2)*...*P(z-1)
pivot CHARACTER*1. Specifies the plane for which P(k) is a plane rotation matrix.
= 'V': variable pivot, the plane (k, k+1)
= 'T': top pivot, the plane (1, k+1)
= 'B': bottom pivot, the plane (k, z)
m INTEGER. The number of rows of the matrix A. If m ≤ 1, an immediate return is effected.
n INTEGER. The number of columns of the matrix A. If n ≤ 1, an immediate return is effected.
c, s REAL for slasr/clasr; DOUBLE PRECISION for dlasr/zlasr. Arrays, DIMENSION (m-1) if side = 'L', (n-1) if side = 'R'.
c(k) and s(k) contain the cosine and sine, respectively, of the plane rotations that define the 2-by-2 plane rotation part R(k) of the P(k) matrix, as described above in Description.
a REAL for slasr; DOUBLE PRECISION for dlasr; COMPLEX for clasr; DOUBLE COMPLEX for zlasr. Array, DIMENSION (lda, n). The m-by-n matrix A.
lda INTEGER. The leading dimension of the array a. lda ≥ max(1,m).

Output Parameters
a On exit, A is overwritten by P*A if side = 'L', or by A*P' if side = 'R'.

?lasrt
Sorts numbers in increasing or decreasing order.

Syntax
call slasrt( id, n, d, info )
call dlasrt( id, n, d, info )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine ?lasrt sorts the numbers in d in increasing order (if id = 'I') or in decreasing order (if id = 'D'). It uses Quick Sort, reverting to Insertion Sort on arrays of size ≤ 20. The dimension of the stack limits n to about 2^32.

Input Parameters
id CHARACTER*1.
= 'I': sort d in increasing order;
= 'D': sort d in decreasing order.
n INTEGER. The length of the array d.
d REAL for slasrt; DOUBLE PRECISION for dlasrt. On entry, the array to be sorted.

Output Parameters
d On exit, d has been sorted into increasing order (d(1) ≤ ... ≤ d(n)) or into decreasing order (d(1) ≥ ... ≥ d(n)), depending on id.
info INTEGER.
= 0: successful exit
< 0: if info = -i, the i-th argument had an illegal value.

?lassq
Updates a sum of squares represented in scaled form.

Syntax
call slassq( n, x, incx, scale, sumsq )
call dlassq( n, x, incx, scale, sumsq )
call classq( n, x, incx, scale, sumsq )
call zlassq( n, x, incx, scale, sumsq )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The real routines slassq/dlassq return the values scl and smsq such that
scl^2 * smsq = x(1)^2 + ... + x(n)^2 + scale^2 * sumsq,
where x(i) = x(1 + (i-1)*incx).
The value of sumsq is assumed to be non-negative and scl returns the value
scl = max( scale, abs(x(i)) ).
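The real-valued update just described can be sketched in pure Python as follows. This is an illustration of the scaled sum-of-squares idea, not the MKL implementation: the function name lassq, the zero-based list x, and the return of a (scl, smsq) pair instead of overwriting arguments are assumptions of this example.

```python
import math


def lassq(n, x, incx, scale, sumsq):
    """One-pass scaled sum of squares: returns (scl, smsq) with
    scl**2 * smsq == x(1)**2 + ... + x(n)**2 + scale**2 * sumsq."""
    for i in range(n):
        absxi = abs(x[i * incx])              # x(1 + (i-1)*incx) in 1-based terms
        if absxi == 0.0:
            continue
        if scale < absxi:
            # Rescale the running sum to the new, larger scaling factor.
            sumsq = 1.0 + sumsq * (scale / absxi) ** 2
            scale = absxi
        else:
            sumsq += (absxi / scale) ** 2
    return scale, sumsq


# Euclidean norm of (3, 4), starting from the empty sum scale = 0, sumsq = 1.
scl, smsq = lassq(2, [3.0, 4.0], 1, 0.0, 1.0)
norm = scl * math.sqrt(smsq)
```

Keeping the sum in the form scl^2 * smsq means no individual square is formed at full magnitude, which is why this representation is used to avoid overflow and underflow when accumulating norms.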
Values scale and sumsq must be supplied in scale and sumsq, and scl and smsq are overwritten on scale and sumsq, respectively.
The complex routines classq/zlassq return the values scl and ssq such that
scl^2 * ssq = x(1)^2 + ... + x(n)^2 + scale^2 * sumsq,
where x(i) = abs(x(1 + (i-1)*incx)).
The value of sumsq is assumed to be at least unity and the value of ssq will then satisfy
1.0 ≤ ssq ≤ sumsq + 2n.
scale is assumed to be non-negative and scl returns the value
scl = max( scale, abs(real(x(i))), abs(aimag(x(i))) ).
Values scale and sumsq must be supplied in scale and sumsq, and scl and ssq are overwritten on scale and sumsq, respectively.
All routines ?lassq make only one pass through the vector x.

Input Parameters
n INTEGER. The number of elements to be used from the vector x.
x REAL for slassq; DOUBLE PRECISION for dlassq; COMPLEX for classq; DOUBLE COMPLEX for zlassq. The vector for which a scaled sum of squares is computed: x(i) = x(1 + (i-1)*incx), 1 ≤ i ≤ n.
incx INTEGER. The increment between successive values of the vector x. incx > 0.
scale REAL for slassq/classq; DOUBLE PRECISION for dlassq/zlassq. On entry, the value scale in the equation above.
sumsq REAL for slassq/classq; DOUBLE PRECISION for dlassq/zlassq. On entry, the value sumsq in the equation above.

Output Parameters
scale On exit, scale is overwritten with scl, the scaling factor for the sum of squares.
sumsq For real flavors: on exit, sumsq is overwritten with the value smsq in the equation above. For complex flavors: on exit, sumsq is overwritten with the value ssq in the equation above.

?lasv2
Computes the singular value decomposition of a 2-by-2 triangular matrix.
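For orientation, the magnitudes of the two singular values of the upper triangular matrix with entries f, g and h in positions (1,1), (1,2) and (2,2) can be cross-checked with a textbook closed form. The sketch below is only that check: it is not the algorithm used by ?lasv2, it does not guard against overflow or cancellation, and it returns neither the signs nor the singular vectors. The function name is hypothetical.

```python
import math


def singular_values_2x2_triangular(f, g, h):
    """Return (ssmin, ssmax) magnitudes for the matrix [[f, g], [0, h]]."""
    t = f * f + g * g + h * h                 # ssmax**2 + ssmin**2 (squared Frobenius norm)
    d = abs(f * h)                            # ssmax * ssmin (absolute determinant)
    disc = math.sqrt(max(t * t - 4.0 * d * d, 0.0))
    ssmax = math.sqrt((t + disc) / 2.0)
    ssmin = d / ssmax if ssmax > 0.0 else 0.0
    return ssmin, ssmax


# diag(1, 2) has singular values 1 and 2.
ssmin, ssmax = singular_values_2x2_triangular(1.0, 0.0, 2.0)
```

A naive formula like this loses accuracy when the entries differ greatly in magnitude, which is the situation the library routine is designed to handle carefully.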
Syntax
call slasv2( f, g, h, ssmin, ssmax, snr, csr, snl, csl )
call dlasv2( f, g, h, ssmin, ssmax, snr, csr, snl, csl )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine ?lasv2 computes the singular value decomposition of a 2-by-2 triangular matrix
( f  g )
( 0  h )
On return, abs(ssmax) is the larger singular value, abs(ssmin) is the smaller singular value, and (csl,snl) and (csr,snr) are the left and right singular vectors for abs(ssmax), giving the decomposition
(  csl  snl ) ( f  g ) ( csr  -snr )   ( ssmax    0   )
( -snl  csl ) ( 0  h ) ( snr   csr ) = (   0    ssmin )

Input Parameters
f, g, h REAL for slasv2; DOUBLE PRECISION for dlasv2. The (1,1), (1,2) and (2,2) elements of the 2-by-2 matrix, respectively.

Output Parameters
ssmin, ssmax REAL for slasv2; DOUBLE PRECISION for dlasv2. abs(ssmin) and abs(ssmax) are the smaller and the larger singular value, respectively.
snl, csl REAL for slasv2; DOUBLE PRECISION for dlasv2. The vector (csl, snl) is a unit left singular vector for the singular value abs(ssmax).
snr, csr REAL for slasv2; DOUBLE PRECISION for dlasv2. The vector (csr, snr) is a unit right singular vector for the singular value abs(ssmax).

Application Notes
Any input parameter may be aliased with any output parameter.
Barring over/underflow and assuming a guard digit in subtraction, all output quantities are correct to within a few units in the last place (ulps).
In IEEE arithmetic, the code works correctly if one matrix element is infinite.
Overflow will not occur unless the largest singular value itself overflows or is within a few ulps of overflow. (On machines with partial overflow, like the Cray, overflow may occur if the largest singular value is within a factor of 2 of overflow.)
Underflow is harmless if underflow is gradual. Otherwise, results may correspond to a matrix modified by perturbations of size near the underflow threshold.

?laswp
Performs a series of row interchanges on a general rectangular matrix.
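The row-interchange rule of ?laswp (row i is swapped with row ipiv(i) for i running from k1 to k2, with a negative incx reversing the order) can be sketched in pure Python. This is an illustration under assumptions, not MKL code: the function name laswp and the list-of-rows layout are invented for the example, indices passed in are 1-based as in the Fortran interface, and the way the starting position in ipiv is chosen for negative incx follows the reference LAPACK convention as best understood here.

```python
def laswp(a, k1, k2, ipiv, incx=1):
    """Swap rows of `a` (a list of rows) in place; k1, k2 and ipiv values are 1-based."""
    if incx == 0:
        return a
    if incx > 0:
        ix, rows = k1, range(k1, k2 + 1)             # forward: rows k1, k1+1, ..., k2
    else:
        ix, rows = k1 + (k1 - k2) * incx, range(k2, k1 - 1, -1)   # reverse order
    for i in rows:
        ip = ipiv[ix - 1]
        if ip != i:
            a[i - 1], a[ip - 1] = a[ip - 1], a[i - 1]
        ix += incx
    return a


a = [[1.0], [2.0], [3.0]]
ipiv = [3, 3, 3]
laswp(a, 1, 2, ipiv, 1)      # apply the interchanges
laswp(a, 1, 2, ipiv, -1)     # applying them in reverse order restores the original rows
```

Applying the same pivots with incx = -1 undoes the forward pass, because each interchange is its own inverse and the reverse pass replays them last to first.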
Syntax
call slaswp( n, a, lda, k1, k2, ipiv, incx )
call dlaswp( n, a, lda, k1, k2, ipiv, incx )
call claswp( n, a, lda, k1, k2, ipiv, incx )
call zlaswp( n, a, lda, k1, k2, ipiv, incx )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine performs a series of row interchanges on the matrix A. One row interchange is initiated for each of rows k1 through k2 of A.

Input Parameters
n INTEGER. The number of columns of the matrix A.
a REAL for slaswp; DOUBLE PRECISION for dlaswp; COMPLEX for claswp; DOUBLE COMPLEX for zlaswp. Array, DIMENSION (lda, n). On entry, the matrix of column dimension n to which the row interchanges will be applied.
lda INTEGER. The leading dimension of the array a.
k1 INTEGER. The first element of ipiv for which a row interchange will be done.
k2 INTEGER. The last element of ipiv for which a row interchange will be done.
ipiv INTEGER. Array, DIMENSION (k2*|incx|). The vector of pivot indices. Only the elements in positions k1 through k2 of ipiv are accessed. ipiv(k) = l implies rows k and l are to be interchanged.
incx INTEGER. The increment between successive values of ipiv. If incx is negative, the pivots are applied in reverse order.

Output Parameters
a On exit, the permuted matrix.

?lasy2
Solves the Sylvester matrix equation where the matrices are of order 1 or 2.

Syntax
call slasy2( ltranl, ltranr, isgn, n1, n2, tl, ldtl, tr, ldtr, b, ldb, scale, x, ldx, xnorm, info )
call dlasy2( ltranl, ltranr, isgn, n1, n2, tl, ldtl, tr, ldtr, b, ldb, scale, x, ldx, xnorm, info )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine solves for the n1-by-n2 matrix X, 1 ≤ n1, n2 ≤ 2, in
op(TL)*X + isgn*X*op(TR) = scale*B,
where TL is n1-by-n1, TR is n2-by-n2, B is n1-by-n2, and isgn = 1 or -1. op(T) = T or T^T, where T^T denotes the transpose of T.

Input Parameters
ltranl LOGICAL.
On entry, ltranl specifies the op(TL):
= .FALSE., op(TL) = TL,
= .TRUE., op(TL) = (TL)^T.
ltranr LOGICAL. On entry, ltranr specifies the op(TR):
= .FALSE., op(TR) = TR,
= .TRUE., op(TR) = (TR)^T.
isgn INTEGER. On entry, isgn specifies the sign of the equation as described before. isgn may only be 1 or -1.
n1 INTEGER. On entry, n1 specifies the order of matrix TL. n1 may only be 0, 1 or 2.
n2 INTEGER. On entry, n2 specifies the order of matrix TR. n2 may only be 0, 1 or 2.
tl REAL for slasy2; DOUBLE PRECISION for dlasy2. Array, DIMENSION (ldtl,2). On entry, tl contains an n1-by-n1 matrix TL.
ldtl INTEGER. The leading dimension of the matrix TL. ldtl ≥ max(1,n1).
tr REAL for slasy2; DOUBLE PRECISION for dlasy2. Array, DIMENSION (ldtr,2). On entry, tr contains an n2-by-n2 matrix TR.
ldtr INTEGER. The leading dimension of the matrix TR. ldtr ≥ max(1,n2).
b REAL for slasy2; DOUBLE PRECISION for dlasy2. Array, DIMENSION (ldb,2). On entry, the n1-by-n2 matrix B contains the right-hand side of the equation.
ldb INTEGER. The leading dimension of the matrix B. ldb ≥ max(1,n1).
ldx INTEGER. The leading dimension of the output matrix X. ldx ≥ max(1,n1).

Output Parameters
scale REAL for slasy2; DOUBLE PRECISION for dlasy2. On exit, scale contains the scale factor. scale is chosen less than or equal to 1 to prevent the solution from overflowing.
x REAL for slasy2; DOUBLE PRECISION for dlasy2. Array, DIMENSION (ldx,2). On exit, x contains the n1-by-n2 solution.
xnorm REAL for slasy2; DOUBLE PRECISION for dlasy2. On exit, xnorm is the infinity-norm of the solution.
info INTEGER. On exit, info is set to
0: successful exit.
1: TL and TR have eigenvalues that are too close, so TL or TR is perturbed to get a nonsingular equation.
NOTE For higher speed, this routine does not check the inputs for errors.

?lasyf
Computes a partial factorization of a real/complex symmetric matrix, using the diagonal pivoting method.
Syntax
call slasyf( uplo, n, nb, kb, a, lda, ipiv, w, ldw, info )
call dlasyf( uplo, n, nb, kb, a, lda, ipiv, w, ldw, info )
call clasyf( uplo, n, nb, kb, a, lda, ipiv, w, ldw, info )
call zlasyf( uplo, n, nb, kb, a, lda, ipiv, w, ldw, info )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine ?lasyf computes a partial factorization of a real/complex symmetric matrix A using the Bunch-Kaufman diagonal pivoting method. The partial factorization has the form:
A = ( I  U12 ) ( A11  0 ) (   I      0   )   if uplo = 'U', or
    ( 0  U22 ) (  0   D ) ( U12^T  U22^T )
A = ( L11  0 ) ( D   0  ) ( L11^T  L21^T )   if uplo = 'L',
    ( L21  I ) ( 0  A22 ) (   0      I   )
where the order of D is at most nb. The actual order is returned in the argument kb, and is either nb or nb-1, or n if n ≤ nb.
This is an auxiliary routine called by ?sytrf. It uses blocked code (calling Level 3 BLAS) to update the submatrix A11 (if uplo = 'U') or A22 (if uplo = 'L').

Input Parameters
uplo CHARACTER*1. Specifies whether the upper or lower triangular part of the symmetric matrix A is stored:
= 'U': upper triangular
= 'L': lower triangular
n INTEGER. The order of the matrix A. n ≥ 0.
nb INTEGER. The maximum number of columns of the matrix A that should be factored. nb should be at least 2 to allow for 2-by-2 pivot blocks.
a REAL for slasyf; DOUBLE PRECISION for dlasyf; COMPLEX for clasyf; DOUBLE COMPLEX for zlasyf. Array, DIMENSION (lda, n). If uplo = 'U', the leading n-by-n upper triangular part of a contains the upper triangular part of the matrix A, and the strictly lower triangular part of a is not referenced. If uplo = 'L', the leading n-by-n lower triangular part of a contains the lower triangular part of the matrix A, and the strictly upper triangular part of a is not referenced.
lda INTEGER. The leading dimension of the array a. lda ≥ max(1,n).
w REAL for slasyf; DOUBLE PRECISION for dlasyf; COMPLEX for clasyf; DOUBLE COMPLEX for zlasyf. Workspace array, DIMENSION (ldw, nb).
ldw INTEGER. The leading dimension of the array w. ldw ≥ max(1,n).

Output Parameters
kb INTEGER.
The number of columns of A that were actually factored. kb is either nb-1 or nb, or n if n ≤ nb.
a On exit, a contains details of the partial factorization.
ipiv INTEGER. Array, DIMENSION (n). Details of the interchanges and the block structure of D.
If uplo = 'U', only the last kb elements of ipiv are set; if uplo = 'L', only the first kb elements are set.
If ipiv(k) > 0, then rows and columns k and ipiv(k) were interchanged and D(k, k) is a 1-by-1 diagonal block.
If uplo = 'U' and ipiv(k) = ipiv(k-1) < 0, then rows and columns k-1 and -ipiv(k) were interchanged and D(k-1:k, k-1:k) is a 2-by-2 diagonal block.
If uplo = 'L' and ipiv(k) = ipiv(k+1) < 0, then rows and columns k+1 and -ipiv(k) were interchanged and D(k:k+1, k:k+1) is a 2-by-2 diagonal block.
info INTEGER.
= 0: successful exit
> 0: if info = k, D(k, k) is exactly zero. The factorization has been completed, but the block diagonal matrix D is exactly singular.

?lahef
Computes a partial factorization of a complex Hermitian indefinite matrix, using the diagonal pivoting method.

Syntax
call clahef( uplo, n, nb, kb, a, lda, ipiv, w, ldw, info )
call zlahef( uplo, n, nb, kb, a, lda, ipiv, w, ldw, info )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine ?lahef computes a partial factorization of a complex Hermitian matrix A, using the Bunch-Kaufman diagonal pivoting method. The partial factorization has the form:
A = ( I  U12 ) ( A11  0 ) (   I      0   )   if uplo = 'U', or
    ( 0  U22 ) (  0   D ) ( U12^H  U22^H )
A = ( L11  0 ) ( D   0  ) ( L11^H  L21^H )   if uplo = 'L',
    ( L21  I ) ( 0  A22 ) (   0      I   )
where the order of D is at most nb. The actual order is returned in the argument kb, and is either nb or nb-1, or n if n ≤ nb.
Note that U^H denotes the conjugate transpose of U.
This is an auxiliary routine called by ?hetrf. It uses blocked code (calling Level 3 BLAS) to update the submatrix A11 (if uplo = 'U') or A22 (if uplo = 'L').

Input Parameters
uplo CHARACTER*1. Specifies whether the upper or lower triangular part of the Hermitian matrix A is stored:
= 'U': upper triangular
= 'L': lower triangular
n INTEGER.
The order of the matrix A. n ≥ 0.
nb INTEGER. The maximum number of columns of the matrix A that should be factored. nb should be at least 2 to allow for 2-by-2 pivot blocks.
a COMPLEX for clahef; DOUBLE COMPLEX for zlahef. Array, DIMENSION (lda, n). On entry, the Hermitian matrix A. If uplo = 'U', the leading n-by-n upper triangular part of A contains the upper triangular part of the matrix A, and the strictly lower triangular part of A is not referenced. If uplo = 'L', the leading n-by-n lower triangular part of A contains the lower triangular part of the matrix A, and the strictly upper triangular part of A is not referenced.
lda INTEGER. The leading dimension of the array a. lda ≥ max(1,n).
w COMPLEX for clahef; DOUBLE COMPLEX for zlahef. Workspace array, DIMENSION (ldw, nb).
ldw INTEGER. The leading dimension of the array w. ldw ≥ max(1,n).

Output Parameters
kb INTEGER. The number of columns of A that were actually factored. kb is either nb-1 or nb, or n if n ≤ nb.
a On exit, A contains details of the partial factorization.
ipiv INTEGER. Array, DIMENSION (n). Details of the interchanges and the block structure of D.
If uplo = 'U', only the last kb elements of ipiv are set; if uplo = 'L', only the first kb elements are set.
If ipiv(k) > 0, then rows and columns k and ipiv(k) are interchanged and D(k, k) is a 1-by-1 diagonal block.
If uplo = 'U' and ipiv(k) = ipiv(k-1) < 0, then rows and columns k-1 and -ipiv(k) are interchanged and D(k-1:k, k-1:k) is a 2-by-2 diagonal block.
If uplo = 'L' and ipiv(k) = ipiv(k+1) < 0, then rows and columns k+1 and -ipiv(k) are interchanged and D(k:k+1, k:k+1) is a 2-by-2 diagonal block.
info INTEGER.
= 0: successful exit
> 0: if info = k, D(k, k) is exactly zero. The factorization has been completed, but the block diagonal matrix D is exactly singular.

?latbs
Solves a triangular banded system of equations.
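?latbs takes its triangular band matrix in the compact band layout, where A(i,j) is stored in ab(kd+1+i-j, j) for an upper triangular matrix and in ab(1+i-j, j) for a lower triangular one. The pure-Python sketch below packs a small dense triangular matrix into that layout to make the index mapping concrete. The function name to_band and the list-of-rows representation are assumptions of this example; nothing here calls MKL.

```python
def to_band(uplo, a, kd):
    """Pack a dense n-by-n triangular band matrix into a (kd+1)-by-n band array."""
    n = len(a)
    ab = [[0.0] * n for _ in range(kd + 1)]
    for j in range(1, n + 1):                         # 1-based column index, as in the manual
        if uplo.upper() == 'U':
            for i in range(max(1, j - kd), j + 1):
                ab[kd + i - j][j - 1] = a[i - 1][j - 1]     # ab(kd+1+i-j, j) = A(i,j)
        else:
            for i in range(j, min(n, j + kd) + 1):
                ab[i - j][j - 1] = a[i - 1][j - 1]          # ab(1+i-j, j) = A(i,j)
    return ab


# Upper triangular matrix with one superdiagonal (kd = 1).
upper = [[1.0, 2.0, 0.0],
         [0.0, 3.0, 4.0],
         [0.0, 0.0, 5.0]]
ab = to_band('U', upper, 1)
```

For the upper case the diagonal ends up in the last row of ab and the superdiagonal in the row above it; positions of ab that correspond to no matrix element are left at zero here and are not referenced.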
Syntax
call slatbs( uplo, trans, diag, normin, n, kd, ab, ldab, x, scale, cnorm, info )
call dlatbs( uplo, trans, diag, normin, n, kd, ab, ldab, x, scale, cnorm, info )
call clatbs( uplo, trans, diag, normin, n, kd, ab, ldab, x, scale, cnorm, info )
call zlatbs( uplo, trans, diag, normin, n, kd, ab, ldab, x, scale, cnorm, info )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine solves one of the triangular systems
A*x = s*b, or A^T*x = s*b, or A^H*x = s*b (for complex flavors)
with scaling to prevent overflow, where A is an upper or lower triangular band matrix. Here A^T denotes the transpose of A, A^H denotes the conjugate transpose of A, x and b are n-element vectors, and s is a scaling factor, usually less than or equal to 1, chosen so that the components of x will be less than the overflow threshold. If the unscaled problem will not cause overflow, the Level 2 BLAS routine ?tbsv is called. If the matrix A is singular (A(j, j) = 0 for some j), then s is set to 0 and a non-trivial solution to A*x = 0 is returned.

Input Parameters
uplo CHARACTER*1. Specifies whether the matrix A is upper or lower triangular.
= 'U': upper triangular
= 'L': lower triangular
trans CHARACTER*1. Specifies the operation applied to A.
= 'N': solve A*x = s*b (no transpose)
= 'T': solve A^T*x = s*b (transpose)
= 'C': solve A^H*x = s*b (conjugate transpose)
diag CHARACTER*1. Specifies whether the matrix A is unit triangular.
= 'N': non-unit triangular
= 'U': unit triangular
normin CHARACTER*1. Specifies whether cnorm is set.
= 'Y': cnorm contains the column norms on entry;
= 'N': cnorm is not set on entry. On exit, the norms are computed and stored in cnorm.
n INTEGER. The order of the matrix A. n ≥ 0.
kd INTEGER. The number of subdiagonals or superdiagonals in the triangular matrix A. kd ≥ 0.
ab REAL for slatbs; DOUBLE PRECISION for dlatbs; COMPLEX for clatbs; DOUBLE COMPLEX for zlatbs. Array, DIMENSION (ldab, n).
The upper or lower triangular band matrix A, stored in the first kd+1 rows of the array. The j-th column of A is stored in the j-th column of the array ab as follows:
   if uplo = 'U', ab(kd+1+i-j, j) = A(i,j) for max(1, j-kd) ≤ i ≤ j;
   if uplo = 'L', ab(1+i-j, j) = A(i,j) for j ≤ i ≤ min(n, j+kd).
ldab  INTEGER. The leading dimension of the array ab. ldab ≥ kd+1.
x  REAL for slatbs, DOUBLE PRECISION for dlatbs, COMPLEX for clatbs, DOUBLE COMPLEX for zlatbs. Array, DIMENSION (n). On entry, the right hand side b of the triangular system.
cnorm  REAL for slatbs/clatbs, DOUBLE PRECISION for dlatbs/zlatbs. Array, DIMENSION (n).
   If normin = 'Y', cnorm is an input argument and cnorm(j) contains the norm of the off-diagonal part of the j-th column of A. If trans = 'N', cnorm(j) must be greater than or equal to the infinity-norm, and if trans = 'T' or 'C', cnorm(j) must be greater than or equal to the 1-norm.

Output Parameters
scale  REAL for slatbs/clatbs, DOUBLE PRECISION for dlatbs/zlatbs. The scaling factor s for the triangular system as described above. If scale = 0, the matrix A is singular or badly scaled, and the vector x is an exact or approximate solution to A*x = 0.
cnorm  If normin = 'N', cnorm is an output argument and cnorm(j) returns the 1-norm of the off-diagonal part of the j-th column of A.
info  INTEGER.
   = 0: successful exit
   < 0: if info = -k, the k-th argument had an illegal value

?latdf
Uses the LU factorization of the n-by-n matrix computed by ?getc2 and computes a contribution to the reciprocal Dif-estimate.
Syntax
call slatdf( ijob, n, z, ldz, rhs, rdsum, rdscal, ipiv, jpiv )
call dlatdf( ijob, n, z, ldz, rhs, rdsum, rdscal, ipiv, jpiv )
call clatdf( ijob, n, z, ldz, rhs, rdsum, rdscal, ipiv, jpiv )
call zlatdf( ijob, n, z, ldz, rhs, rdsum, rdscal, ipiv, jpiv )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine ?latdf uses the LU factorization of the n-by-n matrix Z computed by ?getc2 and computes a contribution to the reciprocal Dif-estimate by solving Z*x = b for x, and choosing the right-hand side b such that the norm of x is as large as possible. On entry rhs = b holds the contribution from earlier solved subsystems, and on return rhs = x.
The factorization of Z returned by ?getc2 has the form Z = P*L*U*Q, where P and Q are permutation matrices, L is lower triangular with unit diagonal elements, and U is upper triangular.

Input Parameters
ijob  INTEGER.
   ijob = 2: First compute an approximate null-vector e of Z using ?gecon, e is normalized, and solve for Z*x = ±e - f with the sign giving the greater value of 2-norm(x). This option is about 5 times as expensive as the default.
   ijob ≠ 2 (default): Local look-ahead strategy where all entries of the right-hand side b are chosen as either +1 or -1.
n  INTEGER. The number of columns of the matrix Z.
z  REAL for slatdf, DOUBLE PRECISION for dlatdf, COMPLEX for clatdf, DOUBLE COMPLEX for zlatdf. Array, DIMENSION (ldz, n). On entry, the LU part of the factorization of the n-by-n matrix Z computed by ?getc2: Z = P*L*U*Q.
ldz  INTEGER. The leading dimension of the array z. ldz ≥ max(1, n).
rhs  REAL for slatdf, DOUBLE PRECISION for dlatdf, COMPLEX for clatdf, DOUBLE COMPLEX for zlatdf. Array, DIMENSION (n). On entry, rhs contains contributions from other subsystems.
rdsum  REAL for slatdf/clatdf, DOUBLE PRECISION for dlatdf/zlatdf. On entry, the sum of squares of computed contributions to the Dif-estimate under computation by ?tgsyl, where the scaling factor rdscal has been factored out.
If trans = 'T', rdsum is not touched. Note that rdsum only makes sense when ?tgsy2 is called by ?tgsyl.
rdscal  REAL for slatdf/clatdf, DOUBLE PRECISION for dlatdf/zlatdf. On entry, scaling factor used to prevent overflow in rdsum. If trans = 'T', rdscal is not touched. Note that rdscal only makes sense when ?tgsy2 is called by ?tgsyl.
ipiv  INTEGER. Array, DIMENSION (n). The pivot indices; for 1 ≤ i ≤ n, row i of the matrix has been interchanged with row ipiv(i).
jpiv  INTEGER. Array, DIMENSION (n). The pivot indices; for 1 ≤ j ≤ n, column j of the matrix has been interchanged with column jpiv(j).

Output Parameters
rhs  On exit, rhs contains the solution of the subsystem with entries according to the value of ijob.
rdsum  On exit, the corresponding sum of squares updated with the contributions from the current sub-system. If trans = 'T', rdsum is not touched.
rdscal  On exit, rdscal is updated with respect to the current contributions in rdsum. If trans = 'T', rdscal is not touched.

?latps
Solves a triangular system of equations with the matrix held in packed storage.

Syntax
call slatps( uplo, trans, diag, normin, n, ap, x, scale, cnorm, info )
call dlatps( uplo, trans, diag, normin, n, ap, x, scale, cnorm, info )
call clatps( uplo, trans, diag, normin, n, ap, x, scale, cnorm, info )
call zlatps( uplo, trans, diag, normin, n, ap, x, scale, cnorm, info )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine ?latps solves one of the triangular systems
A*x = s*b, or A^T*x = s*b, or A^H*x = s*b (for complex flavors)
with scaling to prevent overflow, where A is an upper or lower triangular matrix stored in packed form. Here A^T denotes the transpose of A, A^H denotes the conjugate transpose of A, x and b are n-element vectors, and s is a scaling factor, usually less than or equal to 1, chosen so that the components of x will be less than the overflow threshold.
If the unscaled problem does not cause overflow, the Level 2 BLAS routine ?tpsv is called. If the matrix A is singular (A(j, j) = 0 for some j), then s is set to 0 and a non-trivial solution to A*x = 0 is returned.

Input Parameters
uplo  CHARACTER*1. Specifies whether the matrix A is upper or lower triangular.
   = 'U': upper triangular
   = 'L': lower triangular
trans  CHARACTER*1. Specifies the operation applied to A.
   = 'N': solve A*x = s*b (no transpose)
   = 'T': solve A^T*x = s*b (transpose)
   = 'C': solve A^H*x = s*b (conjugate transpose)
diag  CHARACTER*1. Specifies whether the matrix A is unit triangular.
   = 'N': non-unit triangular
   = 'U': unit triangular
normin  CHARACTER*1. Specifies whether cnorm is set.
   = 'Y': cnorm contains the column norms on entry;
   = 'N': cnorm is not set on entry. On exit, the norms will be computed and stored in cnorm.
n  INTEGER. The order of the matrix A. n ≥ 0.
ap  REAL for slatps, DOUBLE PRECISION for dlatps, COMPLEX for clatps, DOUBLE COMPLEX for zlatps. Array, DIMENSION (n(n+1)/2). The upper or lower triangular matrix A, packed columnwise in a linear array. The j-th column of A is stored in the array ap as follows:
   if uplo = 'U', ap(i + (j-1)j/2) = A(i,j) for 1 ≤ i ≤ j;
   if uplo = 'L', ap(i + (j-1)(2n-j)/2) = A(i,j) for j ≤ i ≤ n.
x  REAL for slatps, DOUBLE PRECISION for dlatps, COMPLEX for clatps, DOUBLE COMPLEX for zlatps. Array, DIMENSION (n). On entry, the right hand side b of the triangular system.
cnorm  REAL for slatps/clatps, DOUBLE PRECISION for dlatps/zlatps. Array, DIMENSION (n).
   If normin = 'Y', cnorm is an input argument and cnorm(j) contains the norm of the off-diagonal part of the j-th column of A. If trans = 'N', cnorm(j) must be greater than or equal to the infinity-norm, and if trans = 'T' or 'C', cnorm(j) must be greater than or equal to the 1-norm.

Output Parameters
x  On exit, x is overwritten by the solution vector x.
scale  REAL for slatps/clatps, DOUBLE PRECISION for dlatps/zlatps.
The scaling factor s for the triangular system as described above. If scale = 0, the matrix A is singular or badly scaled, and the vector x is an exact or approximate solution to A*x = 0.
cnorm  If normin = 'N', cnorm is an output argument and cnorm(j) returns the 1-norm of the off-diagonal part of the j-th column of A.
info  INTEGER.
   = 0: successful exit
   < 0: if info = -k, the k-th argument had an illegal value

?latrd
Reduces the first nb rows and columns of a symmetric/Hermitian matrix A to real tridiagonal form by an orthogonal/unitary similarity transformation.

Syntax
call slatrd( uplo, n, nb, a, lda, e, tau, w, ldw )
call dlatrd( uplo, n, nb, a, lda, e, tau, w, ldw )
call clatrd( uplo, n, nb, a, lda, e, tau, w, ldw )
call zlatrd( uplo, n, nb, a, lda, e, tau, w, ldw )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine ?latrd reduces nb rows and columns of a real symmetric or complex Hermitian matrix A to symmetric/Hermitian tridiagonal form by an orthogonal/unitary similarity transformation Q^T*A*Q for real flavors, Q^H*A*Q for complex flavors, and returns the matrices V and W which are needed to apply the transformation to the unreduced part of A.
If uplo = 'U', ?latrd reduces the last nb rows and columns of a matrix, of which the upper triangle is supplied; if uplo = 'L', ?latrd reduces the first nb rows and columns of a matrix, of which the lower triangle is supplied.
This is an auxiliary routine called by ?sytrd/?hetrd.

Input Parameters
uplo  CHARACTER*1. Specifies whether the upper or lower triangular part of the symmetric/Hermitian matrix A is stored:
   = 'U': upper triangular
   = 'L': lower triangular
n  INTEGER. The order of the matrix A.
nb  INTEGER. The number of rows and columns to be reduced.
a  REAL for slatrd, DOUBLE PRECISION for dlatrd, COMPLEX for clatrd, DOUBLE COMPLEX for zlatrd. Array, DIMENSION (lda, n).
On entry, the symmetric/Hermitian matrix A.
   If uplo = 'U', the leading n-by-n upper triangular part of a contains the upper triangular part of the matrix A, and the strictly lower triangular part of a is not referenced.
   If uplo = 'L', the leading n-by-n lower triangular part of a contains the lower triangular part of the matrix A, and the strictly upper triangular part of a is not referenced.
lda  INTEGER. The leading dimension of the array a. lda ≥ max(1,n).
ldw  INTEGER. The leading dimension of the output array w. ldw ≥ max(1,n).

Output Parameters
a  On exit, if uplo = 'U', the last nb columns have been reduced to tridiagonal form, with the diagonal elements overwriting the diagonal elements of a; the elements above the diagonal, with the array tau, represent the orthogonal/unitary matrix Q as a product of elementary reflectors;
   if uplo = 'L', the first nb columns have been reduced to tridiagonal form, with the diagonal elements overwriting the diagonal elements of a; the elements below the diagonal, with the array tau, represent the orthogonal/unitary matrix Q as a product of elementary reflectors.
e  REAL for slatrd/clatrd, DOUBLE PRECISION for dlatrd/zlatrd.
   If uplo = 'U', e(n-nb:n-1) contains the superdiagonal elements of the last nb columns of the reduced matrix;
   if uplo = 'L', e(1:nb) contains the subdiagonal elements of the first nb columns of the reduced matrix.
tau  REAL for slatrd, DOUBLE PRECISION for dlatrd, COMPLEX for clatrd, DOUBLE COMPLEX for zlatrd. Array, DIMENSION (n-1). The scalar factors of the elementary reflectors, stored in tau(n-nb:n-1) if uplo = 'U', and in tau(1:nb) if uplo = 'L'.
w  REAL for slatrd, DOUBLE PRECISION for dlatrd, COMPLEX for clatrd, DOUBLE COMPLEX for zlatrd. Array, DIMENSION (ldw, nb). The n-by-nb matrix W required to update the unreduced part of A.
Application Notes
If uplo = 'U', the matrix Q is represented as a product of elementary reflectors
Q = H(n)*H(n-1)*...*H(n-nb+1)
Each H(i) has the form
H(i) = I - tau*v*v'
where tau is a real/complex scalar, and v is a real/complex vector with v(i:n) = 0 and v(i-1) = 1; v(1:i-1) is stored on exit in a(1:i-1, i), and tau in tau(i-1).
If uplo = 'L', the matrix Q is represented as a product of elementary reflectors
Q = H(1)*H(2)*...*H(nb)
Each H(i) has the form
H(i) = I - tau*v*v'
where tau is a real/complex scalar, and v is a real/complex vector with v(1:i) = 0 and v(i+1) = 1; v(i+1:n) is stored on exit in a(i+1:n, i), and tau in tau(i).
The elements of the vectors v together form the n-by-nb matrix V which is needed, with W, to apply the transformation to the unreduced part of the matrix, using a symmetric/Hermitian rank-2k update of the form:
A := A - VW' - WV'.
The contents of a on exit are illustrated by the following examples with n = 5 and nb = 2:

   if uplo = 'U':                  if uplo = 'L':
   ( a   a   a   v4  v5 )          ( d                  )
   (     a   a   v4  v5 )          ( 1   d              )
   (         a   1   v5 )          ( v1  1   a          )
   (             d   1  )          ( v1  v2  a   a      )
   (                 d  )          ( v1  v2  a   a   a  )

where d denotes a diagonal element of the reduced matrix, a denotes an element of the original matrix that is unchanged, and vi denotes an element of the vector defining H(i).

?latrs
Solves a triangular system of equations with the scale factor set to prevent overflow.

Syntax
call slatrs( uplo, trans, diag, normin, n, a, lda, x, scale, cnorm, info )
call dlatrs( uplo, trans, diag, normin, n, a, lda, x, scale, cnorm, info )
call clatrs( uplo, trans, diag, normin, n, a, lda, x, scale, cnorm, info )
call zlatrs( uplo, trans, diag, normin, n, a, lda, x, scale, cnorm, info )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine solves one of the triangular systems
A*x = s*b, or A^T*x = s*b, or A^H*x = s*b (for complex flavors)
with scaling to prevent overflow.
Here A is an upper or lower triangular matrix, A^T denotes the transpose of A, A^H denotes the conjugate transpose of A, x and b are n-element vectors, and s is a scaling factor, usually less than or equal to 1, chosen so that the components of x will be less than the overflow threshold. If the unscaled problem will not cause overflow, the Level 2 BLAS routine ?trsv is called. If the matrix A is singular (A(j,j) = 0 for some j), then s is set to 0 and a non-trivial solution to A*x = 0 is returned.

Input Parameters
uplo  CHARACTER*1. Specifies whether the matrix A is upper or lower triangular.
   = 'U': upper triangular
   = 'L': lower triangular
trans  CHARACTER*1. Specifies the operation applied to A.
   = 'N': solve A*x = s*b (no transpose)
   = 'T': solve A^T*x = s*b (transpose)
   = 'C': solve A^H*x = s*b (conjugate transpose)
diag  CHARACTER*1. Specifies whether or not the matrix A is unit triangular.
   = 'N': non-unit triangular
   = 'U': unit triangular
normin  CHARACTER*1. Specifies whether cnorm has been set or not.
   = 'Y': cnorm contains the column norms on entry;
   = 'N': cnorm is not set on entry. On exit, the norms will be computed and stored in cnorm.
n  INTEGER. The order of the matrix A. n ≥ 0.
a  REAL for slatrs, DOUBLE PRECISION for dlatrs, COMPLEX for clatrs, DOUBLE COMPLEX for zlatrs. Array, DIMENSION (lda, n). Contains the triangular matrix A.
   If uplo = 'U', the leading n-by-n upper triangular part of the array a contains the upper triangular matrix, and the strictly lower triangular part of a is not referenced.
   If uplo = 'L', the leading n-by-n lower triangular part of the array a contains the lower triangular matrix, and the strictly upper triangular part of a is not referenced.
   If diag = 'U', the diagonal elements of a are also not referenced and are assumed to be 1.
lda  INTEGER. The leading dimension of the array a. lda ≥ max(1, n).
x  REAL for slatrs, DOUBLE PRECISION for dlatrs, COMPLEX for clatrs, DOUBLE COMPLEX for zlatrs. Array, DIMENSION (n).
On entry, the right hand side b of the triangular system.
cnorm  REAL for slatrs/clatrs, DOUBLE PRECISION for dlatrs/zlatrs. Array, DIMENSION (n).
   If normin = 'Y', cnorm is an input argument and cnorm(j) contains the norm of the off-diagonal part of the j-th column of A. If trans = 'N', cnorm(j) must be greater than or equal to the infinity-norm, and if trans = 'T' or 'C', cnorm(j) must be greater than or equal to the 1-norm.

Output Parameters
x  On exit, x is overwritten by the solution vector x.
scale  REAL for slatrs/clatrs, DOUBLE PRECISION for dlatrs/zlatrs. The scaling factor s for the triangular system as described above. If scale = 0, the matrix A is singular or badly scaled, and the vector x is an exact or approximate solution to A*x = 0.
cnorm  If normin = 'N', cnorm is an output argument and cnorm(j) returns the 1-norm of the off-diagonal part of the j-th column of A.
info  INTEGER.
   = 0: successful exit
   < 0: if info = -k, the k-th argument had an illegal value

Application Notes
A rough bound on x is computed; if that is less than overflow, ?trsv is called, otherwise, specific code is used which checks for possible overflow or divide-by-zero at every operation.
A columnwise scheme is used for solving A*x = b. The basic algorithm if A is lower triangular is

   x[1:n] := b[1:n]
   for j = 1, ..., n
      x(j) := x(j) / A(j,j)
      x[j+1:n] := x[j+1:n] - x(j)*A[j+1:n,j]
   end

Define bounds on the components of x after j iterations of the loop:
   M(j) = bound on x[1:j]
   G(j) = bound on x[j+1:n]
Initially, let M(0) = 0 and G(0) = max{x(i), i=1,...,n}.
Then for iteration j+1 we have
   M(j+1) ≤ G(j) / |A(j+1,j+1)|
   G(j+1) ≤ G(j) + M(j+1)*|A[j+2:n,j+1]| ≤ G(j)*(1 + cnorm(j+1) / |A(j+1,j+1)|),
where cnorm(j+1) is greater than or equal to the infinity-norm of column j+1 of A, not counting the diagonal.
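To make the columnwise scheme and its growth bounds concrete, here is a minimal Python sketch for a real lower triangular A stored as a list of rows. The function names solve_lower_columnwise and growth_bounds are hypothetical, the sketch takes cnorm(j) to be exactly the infinity-norm of the off-diagonal part of column j, and it performs none of the scaling or overflow checks that ?latrs itself applies.

```python
def solve_lower_columnwise(a, b):
    """Columnwise forward substitution for a lower triangular system A*x = b."""
    n = len(b)
    x = list(b)                       # x[1:n] := b[1:n]
    for j in range(n):
        x[j] = x[j] / a[j][j]         # x(j) := x(j) / A(j,j)
        for i in range(j + 1, n):     # x[j+1:n] -= x(j) * A[j+1:n, j]
            x[i] -= x[j] * a[i][j]
    return x


def growth_bounds(a, b):
    """Evaluate the bound recurrences; returns (M, G) with M[j], G[j] for j = 0..n."""
    n = len(b)
    # Infinity-norm of each column below the diagonal (0.0 for the last column).
    cnorm = [max((abs(a[i][j]) for i in range(j + 1, n)), default=0.0)
             for j in range(n)]
    m_bounds = [0.0]                          # M(0) = 0
    g_bounds = [max(abs(v) for v in b)]       # G(0) = max |x(i)| on entry
    for j in range(n):
        diag = abs(a[j][j])
        m_bounds.append(g_bounds[-1] / diag)                     # M(j+1)
        g_bounds.append(g_bounds[-1] * (1.0 + cnorm[j] / diag))  # G(j+1)
    return m_bounds, g_bounds


a = [[2.0, 0.0, 0.0],
     [1.0, 4.0, 0.0],
     [3.0, -2.0, 5.0]]
b = [2.0, 5.0, 6.0]
x = solve_lower_columnwise(a, b)
m_bounds, g_bounds = growth_bounds(a, b)
```

For this small system every computed |x(j)| stays below the corresponding M(j), which is the property the routine relies on when deciding whether the unscaled solve is safe.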
Hence
$$G(j) \le G(0) \prod_{i=1}^{j} \left(1 + \frac{\mathrm{cnorm}(i)}{|A(i,i)|}\right)$$
and
$$|x(j)| \le \frac{G(0)}{|A(j,j)|} \prod_{i=1}^{j-1} \left(1 + \frac{\mathrm{cnorm}(i)}{|A(i,i)|}\right).$$
Since |x(j)| ≤ M(j), we use the Level 2 BLAS routine ?trsv if the reciprocal of the largest M(j), j=1,..,n, is larger than max(underflow, 1/overflow).
The bound on x(j) is also used to determine when a step in the columnwise method can be performed without fear of overflow. If the computed bound is greater than a large constant, x is scaled to prevent overflow, but if the bound overflows, x is set to 0, x(j) to 1, and scale to 0, and a non-trivial solution to A*x = 0 is found.
Similarly, a row-wise scheme is used to solve A^T*x = b or A^H*x = b. The basic algorithm for A upper triangular is

   for j = 1, ..., n
      x(j) := ( b(j) - A[1:j-1,j]' * x[1:j-1] ) / A(j,j)
   end

We simultaneously compute two bounds
   G(j) = bound on ( b(i) - A[1:i-1,i]'*x[1:i-1] ), 1 ≤ i ≤ j
   M(j) = bound on x(i), 1 ≤ i ≤ j
The initial values are G(0) = 0, M(0) = max{b(i), i=1,..,n}, and we add the constraint G(j) ≥ G(j-1) and M(j) ≥ M(j-1) for j ≥ 1.
Then the bound on x(j) is
   M(j) ≤ M(j-1)*(1 + cnorm(j)) / |A(j,j)|
and we can safely call ?trsv if 1/M(n) and 1/G(n) are both greater than max(underflow, 1/overflow).

?latrz
Factors an upper trapezoidal matrix by means of orthogonal/unitary transformations.

Syntax
call slatrz( m, n, l, a, lda, tau, work )
call dlatrz( m, n, l, a, lda, tau, work )
call clatrz( m, n, l, a, lda, tau, work )
call zlatrz( m, n, l, a, lda, tau, work )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine ?latrz factors the m-by-(m+l) real/complex upper trapezoidal matrix
[A1 A2] = [A(1:m,1:m) A(1:m, n-l+1:n)]
as ( R 0 )*Z, by means of orthogonal/unitary transformations. Z is an (m+l)-by-(m+l) orthogonal/unitary matrix and R and A1 are m-by-m upper triangular matrices.

Input Parameters
m  INTEGER. The number of rows of the matrix A. m ≥ 0.
n  INTEGER. The number of columns of the matrix A. n ≥ 0.
l  INTEGER.
The number of columns of the matrix A containing the meaningful part of the Householder vectors. n-m ≥ l ≥ 0.
a  REAL for slatrz, DOUBLE PRECISION for dlatrz, COMPLEX for clatrz, DOUBLE COMPLEX for zlatrz. Array, DIMENSION (lda, n). On entry, the leading m-by-n upper trapezoidal part of the array a must contain the matrix to be factorized.
lda  INTEGER. The leading dimension of the array a. lda ≥ max(1,m).
work  REAL for slatrz, DOUBLE PRECISION for dlatrz, COMPLEX for clatrz, DOUBLE COMPLEX for zlatrz. Workspace array, DIMENSION (m).

Output Parameters
a  On exit, the leading m-by-m upper triangular part of a contains the upper triangular matrix R, and elements n-l+1 to n of the first m rows of a, with the array tau, represent the orthogonal/unitary matrix Z as a product of m elementary reflectors.
tau  REAL for slatrz, DOUBLE PRECISION for dlatrz, COMPLEX for clatrz, DOUBLE COMPLEX for zlatrz. Array, DIMENSION (m). The scalar factors of the elementary reflectors.

Application Notes
The factorization is obtained by Householder's method. The k-th transformation matrix, Z(k), which is used to introduce zeros into the (m - k + 1)-th row of A, is given in the form
$$Z(k) = \begin{pmatrix} I & 0 \\ 0 & T(k) \end{pmatrix}$$
where for real flavors
$$T(k) = I - \tau\, u(k)\, u(k)^T, \qquad u(k) = \begin{pmatrix} 1 \\ 0 \\ z(k) \end{pmatrix}$$
and for complex flavors
$$T(k) = I - \tau\, u(k)\, u(k)^H, \qquad u(k) = \begin{pmatrix} 1 \\ 0 \\ z(k) \end{pmatrix}.$$
Here tau is a scalar and z(k) is an l-element vector. tau and z(k) are chosen to annihilate the elements of the k-th row of A2.
The scalar tau is returned in the k-th element of tau and the vector u(k) in the k-th row of A2, such that the elements of z(k) are in a(k, l+1), ..., a(k, n). The elements of R are returned in the upper triangular part of A1.
Z is given by
Z = Z(1)*Z(2)*...*Z(m).

?lauu2
Computes the product U*U^T (U*U^H) or L^T*L (L^H*L), where U and L are upper or lower triangular matrices (unblocked algorithm).
Syntax
call slauu2( uplo, n, a, lda, info )
call dlauu2( uplo, n, a, lda, info )
call clauu2( uplo, n, a, lda, info )
call zlauu2( uplo, n, a, lda, info )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine ?lauu2 computes the product U*U^T or L^T*L for real flavors, and U*U^H or L^H*L for complex flavors. Here the triangular factor U or L is stored in the upper or lower triangular part of the array a.
If uplo = 'U' or 'u', then the upper triangle of the result is stored, overwriting the factor U in a.
If uplo = 'L' or 'l', then the lower triangle of the result is stored, overwriting the factor L in a.
This is the unblocked form of the algorithm, calling BLAS Level 2 Routines.

Input Parameters
uplo  CHARACTER*1. Specifies whether the triangular factor stored in the array a is upper or lower triangular:
   = 'U': upper triangular
   = 'L': lower triangular
n  INTEGER. The order of the triangular factor U or L. n ≥ 0.
a  REAL for slauu2, DOUBLE PRECISION for dlauu2, COMPLEX for clauu2, DOUBLE COMPLEX for zlauu2. Array, DIMENSION (lda, n). On entry, the triangular factor U or L.
lda  INTEGER. The leading dimension of the array a. lda ≥ max(1,n).

Output Parameters
a  On exit, if uplo = 'U', then the upper triangle of a is overwritten with the upper triangle of the product U*U^T (U*U^H); if uplo = 'L', then the lower triangle of a is overwritten with the lower triangle of the product L^T*L (L^H*L).
info  INTEGER.
   = 0: successful exit
   < 0: if info = -k, the k-th argument had an illegal value

?lauum
Computes the product U*U^T (U*U^H) or L^T*L (L^H*L), where U and L are upper or lower triangular matrices (blocked algorithm).
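Both ?lauu2 and ?lauum produce the same result and differ only in how the work is organized. As a rough illustration of that result for a real matrix with uplo = 'U', the Python sketch below overwrites the upper triangle of a copy of a with the upper triangle of U*U^T and leaves the strictly lower triangle untouched. The helper name uut_upper and the list-of-rows storage are assumptions for this example; no blocking or BLAS calls are modeled.

```python
def uut_upper(a):
    """Return a copy of a whose upper triangle holds the upper triangle of U*U^T.

    U is the upper triangular factor stored in the upper triangle of a;
    entries below the diagonal are never read and are copied through unchanged.
    """
    n = len(a)
    out = [row[:] for row in a]
    for i in range(n):
        for j in range(i, n):
            # (U*U^T)(i,j) = sum over k of U(i,k)*U(j,k); U(j,k) = 0 for k < j,
            # so only k >= j contributes and only upper-triangle entries are used.
            out[i][j] = sum(a[i][k] * a[j][k] for k in range(j, n))
    return out


# The -1.0 below the diagonal stands for an unreferenced entry.
factor = [[1.0, 2.0],
          [-1.0, 3.0]]
product = uut_upper(factor)
```

Here U = [[1, 2], [0, 3]], so the stored upper triangle becomes 5, 6, 9 while the entry below the diagonal keeps its original value.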
Syntax
call slauum( uplo, n, a, lda, info )
call dlauum( uplo, n, a, lda, info )
call clauum( uplo, n, a, lda, info )
call zlauum( uplo, n, a, lda, info )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine ?lauum computes the product U*U^T or L^T*L for real flavors, and U*U^H or L^H*L for complex flavors. Here the triangular factor U or L is stored in the upper or lower triangular part of the array a.
If uplo = 'U' or 'u', then the upper triangle of the result is stored, overwriting the factor U in a.
If uplo = 'L' or 'l', then the lower triangle of the result is stored, overwriting the factor L in a.
This is the blocked form of the algorithm, calling BLAS Level 3 Routines.

Input Parameters
uplo  CHARACTER*1. Specifies whether the triangular factor stored in the array a is upper or lower triangular:
   = 'U': upper triangular
   = 'L': lower triangular
n  INTEGER. The order of the triangular factor U or L. n ≥ 0.
a  REAL for slauum, DOUBLE PRECISION for dlauum, COMPLEX for clauum, DOUBLE COMPLEX for zlauum. Array, DIMENSION (lda, n). On entry, the triangular factor U or L.
lda  INTEGER. The leading dimension of the array a. lda ≥ max(1,n).

Output Parameters
a  On exit, if uplo = 'U', then the upper triangle of a is overwritten with the upper triangle of the product U*U^T (U*U^H); if uplo = 'L', then the lower triangle of a is overwritten with the lower triangle of the product L^T*L (L^H*L).
info  INTEGER.
   = 0: successful exit
   < 0: if info = -k, the k-th argument had an illegal value

?org2l/?ung2l
Generates all or part of the orthogonal/unitary matrix Q from a QL factorization determined by ?geqlf (unblocked algorithm).
Syntax
call sorg2l( m, n, k, a, lda, tau, work, info )
call dorg2l( m, n, k, a, lda, tau, work, info )
call cung2l( m, n, k, a, lda, tau, work, info )
call zung2l( m, n, k, a, lda, tau, work, info )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine ?org2l/?ung2l generates an m-by-n real/complex matrix Q with orthonormal columns, which is defined as the last n columns of a product of k elementary reflectors of order m:
Q = H(k)*...*H(2)*H(1)
as returned by ?geqlf.

Input Parameters
m  INTEGER. The number of rows of the matrix Q. m ≥ 0.
n  INTEGER. The number of columns of the matrix Q. m ≥ n ≥ 0.
k  INTEGER. The number of elementary reflectors whose product defines the matrix Q. n ≥ k ≥ 0.
a  REAL for sorg2l, DOUBLE PRECISION for dorg2l, COMPLEX for cung2l, DOUBLE COMPLEX for zung2l. Array, DIMENSION (lda, n). On entry, the (n-k+i)-th column must contain the vector which defines the elementary reflector H(i), for i = 1,2,..., k, as returned by ?geqlf in the last k columns of its array argument a.
lda  INTEGER. The leading dimension of the array a. lda ≥ max(1,m).
tau  REAL for sorg2l, DOUBLE PRECISION for dorg2l, COMPLEX for cung2l, DOUBLE COMPLEX for zung2l. Array, DIMENSION (k). tau(i) must contain the scalar factor of the elementary reflector H(i), as returned by ?geqlf.
work  REAL for sorg2l, DOUBLE PRECISION for dorg2l, COMPLEX for cung2l, DOUBLE COMPLEX for zung2l. Workspace array, DIMENSION (n).

Output Parameters
a  On exit, the m-by-n matrix Q.
info  INTEGER.
   = 0: successful exit
   < 0: if info = -i, the i-th argument has an illegal value

?org2r/?ung2r
Generates all or part of the orthogonal/unitary matrix Q from a QR factorization determined by ?geqrf (unblocked algorithm).
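As an informal picture of what ?org2r produces, the Python sketch below forms the first n columns of Q = H(1)*H(2)*...*H(k) for real data, with H(i) = I - tau(i)*v*v^T. It assumes the usual ?geqrf storage, in which the reflector vector for H(i) has zeros above position i, an implicit 1 at position i, and its remaining entries stored below the diagonal in column i of a. The function name form_q_from_qr is hypothetical, and the sketch is not the library's implementation.

```python
def form_q_from_qr(a, tau, n):
    """Return the first n columns of Q = H(1)*...*H(k) as an m-by-n list of rows.

    a   : m-by-n list of rows; column i holds reflector i below the diagonal
    tau : list of the k scalar factors
    """
    m = len(a)
    k = len(tau)
    # Start from the first n columns of the m-by-m identity matrix.
    q = [[1.0 if r == c else 0.0 for c in range(n)] for r in range(m)]
    # Apply H(k) first and H(1) last so that q = H(1)*(...*(H(k)*E)).
    for i in reversed(range(k)):
        v = [0.0] * i + [1.0] + [a[r][i] for r in range(i + 1, m)]
        w = [tau[i] * sum(v[r] * q[r][c] for r in range(m)) for c in range(n)]
        for r in range(m):
            for c in range(n):
                q[r][c] -= v[r] * w[c]      # q := q - v * (tau * v^T * q)
    return q


# Two illustrative reflectors of order 3; entries on and above the diagonal
# of a are not read by the sketch.
a = [[0.0, 0.0],
     [1.0, 0.0],
     [0.0, 1.0]]
tau = [1.0, 1.0]
q = form_q_from_qr(a, tau, 2)
```

With these inputs each H(i) is a genuine reflector (tau = 2/(v^T*v)), so the two columns of q come out orthonormal.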
Syntax
call sorg2r( m, n, k, a, lda, tau, work, info )
call dorg2r( m, n, k, a, lda, tau, work, info )
call cung2r( m, n, k, a, lda, tau, work, info )
call zung2r( m, n, k, a, lda, tau, work, info )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine ?org2r/?ung2r generates an m-by-n real/complex matrix Q with orthonormal columns, which is defined as the first n columns of a product of k elementary reflectors of order m
Q = H(1)*H(2)*...*H(k)
as returned by ?geqrf.

Input Parameters
m  INTEGER. The number of rows of the matrix Q. m ≥ 0.
n  INTEGER. The number of columns of the matrix Q. m ≥ n ≥ 0.
k  INTEGER. The number of elementary reflectors whose product defines the matrix Q. n ≥ k ≥ 0.
a  REAL for sorg2r, DOUBLE PRECISION for dorg2r, COMPLEX for cung2r, DOUBLE COMPLEX for zung2r. Array, DIMENSION (lda, n). On entry, the i-th column must contain the vector which defines the elementary reflector H(i), for i = 1,2,..., k, as returned by ?geqrf in the first k columns of its array argument a.
lda  INTEGER. The leading dimension of the array a. lda ≥ max(1,m).
tau  REAL for sorg2r, DOUBLE PRECISION for dorg2r, COMPLEX for cung2r, DOUBLE COMPLEX for zung2r. Array, DIMENSION (k). tau(i) must contain the scalar factor of the elementary reflector H(i), as returned by ?geqrf.
work  REAL for sorg2r, DOUBLE PRECISION for dorg2r, COMPLEX for cung2r, DOUBLE COMPLEX for zung2r. Workspace array, DIMENSION (n).

Output Parameters
a  On exit, the m-by-n matrix Q.
info  INTEGER.
   = 0: successful exit
   < 0: if info = -i, the i-th argument has an illegal value

?orgl2/?ungl2
Generates all or part of the orthogonal/unitary matrix Q from an LQ factorization determined by ?gelqf (unblocked algorithm).
Syntax
call sorgl2( m, n, k, a, lda, tau, work, info )
call dorgl2( m, n, k, a, lda, tau, work, info )
call cungl2( m, n, k, a, lda, tau, work, info )
call zungl2( m, n, k, a, lda, tau, work, info )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine ?orgl2/?ungl2 generates an m-by-n real/complex matrix Q with orthonormal rows, which is defined as the first m rows of a product of k elementary reflectors of order n
Q = H(k)*...*H(2)*H(1) for real flavors, or Q = H(k)^H*...*H(2)^H*H(1)^H for complex flavors
as returned by ?gelqf.

Input Parameters
m  INTEGER. The number of rows of the matrix Q. m ≥ 0.
n  INTEGER. The number of columns of the matrix Q. n ≥ m.
k  INTEGER. The number of elementary reflectors whose product defines the matrix Q. m ≥ k ≥ 0.
a  REAL for sorgl2, DOUBLE PRECISION for dorgl2, COMPLEX for cungl2, DOUBLE COMPLEX for zungl2. Array, DIMENSION (lda, n). On entry, the i-th row must contain the vector which defines the elementary reflector H(i), for i = 1,2,..., k, as returned by ?gelqf in the first k rows of its array argument a.
lda  INTEGER. The leading dimension of the array a. lda ≥ max(1,m).
tau  REAL for sorgl2, DOUBLE PRECISION for dorgl2, COMPLEX for cungl2, DOUBLE COMPLEX for zungl2. Array, DIMENSION (k). tau(i) must contain the scalar factor of the elementary reflector H(i), as returned by ?gelqf.
work  REAL for sorgl2, DOUBLE PRECISION for dorgl2, COMPLEX for cungl2, DOUBLE COMPLEX for zungl2. Workspace array, DIMENSION (m).

Output Parameters
a  On exit, the m-by-n matrix Q.
info  INTEGER.
   = 0: successful exit
   < 0: if info = -i, the i-th argument has an illegal value.

?orgr2/?ungr2
Generates all or part of the orthogonal/unitary matrix Q from an RQ factorization determined by ?gerqf (unblocked algorithm).
Syntax
call sorgr2( m, n, k, a, lda, tau, work, info )
call dorgr2( m, n, k, a, lda, tau, work, info )
call cungr2( m, n, k, a, lda, tau, work, info )
call zungr2( m, n, k, a, lda, tau, work, info )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine ?orgr2/?ungr2 generates an m-by-n real/complex matrix Q with orthonormal rows, which is defined as the last m rows of a product of k elementary reflectors of order n
Q = H(1)*H(2)*...*H(k) for real flavors, or Q = H(1)^H*H(2)^H*...*H(k)^H for complex flavors
as returned by ?gerqf.

Input Parameters
m  INTEGER. The number of rows of the matrix Q. m ≥ 0.
n  INTEGER. The number of columns of the matrix Q. n ≥ m.
k  INTEGER. The number of elementary reflectors whose product defines the matrix Q. m ≥ k ≥ 0.
a  REAL for sorgr2, DOUBLE PRECISION for dorgr2, COMPLEX for cungr2, DOUBLE COMPLEX for zungr2. Array, DIMENSION (lda, n). On entry, the (m-k+i)-th row must contain the vector which defines the elementary reflector H(i), for i = 1,2,..., k, as returned by ?gerqf in the last k rows of its array argument a.
lda  INTEGER. The leading dimension of the array a. lda ≥ max(1,m).
tau  REAL for sorgr2, DOUBLE PRECISION for dorgr2, COMPLEX for cungr2, DOUBLE COMPLEX for zungr2. Array, DIMENSION (k). tau(i) must contain the scalar factor of the elementary reflector H(i), as returned by ?gerqf.
work  REAL for sorgr2, DOUBLE PRECISION for dorgr2, COMPLEX for cungr2, DOUBLE COMPLEX for zungr2. Workspace array, DIMENSION (m).

Output Parameters
a  On exit, the m-by-n matrix Q.
info  INTEGER.
   = 0: successful exit
   < 0: if info = -i, the i-th argument has an illegal value

?orm2l/?unm2l
Multiplies a general matrix by the orthogonal/unitary matrix from a QL factorization determined by ?geqlf (unblocked algorithm).
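The point of a routine such as ?orm2l is that Q*C can be computed by applying the elementary reflectors to C one at a time, without ever forming Q. The Python sketch below shows this for real data, side = 'L' and trans = 'N', with Q = H(k)*...*H(2)*H(1) and H(i) = I - tau(i)*v_i*v_i^T. As a simplifying assumption it takes each v_i as an explicit full-length vector, whereas the routine reads the vectors from the columns of a; apply_q_left is a hypothetical name.

```python
def apply_q_left(vs, taus, c):
    """Return Q*C for Q = H(k)*...*H(2)*H(1), H(i) = I - tau_i * v_i * v_i^T.

    vs   : list of k reflector vectors, each of length m
    taus : list of the k scalar factors
    c    : m-by-n list of rows (left unchanged; a new matrix is returned)
    """
    m = len(c)
    n = len(c[0])
    out = [row[:] for row in c]
    for v, tau in zip(vs, taus):          # H(1) acts on C first, H(k) last
        # w = tau * v^T * out, a row vector of length n
        w = [tau * sum(v[r] * out[r][j] for r in range(m)) for j in range(n)]
        for r in range(m):
            for j in range(n):
                out[r][j] -= v[r] * w[j]  # out := out - v * w
    return out


vs = [[1.0, 1.0, 0.0],
      [0.0, 1.0, 1.0]]
taus = [1.0, 1.0]
c = [[1.0, 2.0],
     [3.0, 4.0],
     [5.0, 6.0]]
qc = apply_q_left(vs, taus, c)
```

Each reflector costs one matrix-vector product and one rank-1 update on C, which is why the unblocked routines are described in terms of Level 2 BLAS operations.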
Syntax
call sorm2l( side, trans, m, n, k, a, lda, tau, c, ldc, work, info )
call dorm2l( side, trans, m, n, k, a, lda, tau, c, ldc, work, info )
call cunm2l( side, trans, m, n, k, a, lda, tau, c, ldc, work, info )
call zunm2l( side, trans, m, n, k, a, lda, tau, c, ldc, work, info )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine ?orm2l/?unm2l overwrites the general real/complex m-by-n matrix C with
Q*C if side = 'L' and trans = 'N', or
Q^T*C / Q^H*C if side = 'L' and trans = 'T' (for real flavors) or trans = 'C' (for complex flavors), or
C*Q if side = 'R' and trans = 'N', or
C*Q^T / C*Q^H if side = 'R' and trans = 'T' (for real flavors) or trans = 'C' (for complex flavors).
Here Q is a real orthogonal or complex unitary matrix defined as the product of k elementary reflectors
Q = H(k)*...*H(2)*H(1)
as returned by ?geqlf.
Q is of order m if side = 'L' and of order n if side = 'R'.

Input Parameters
side  CHARACTER*1.
   = 'L': apply Q or Q^T / Q^H from the left
   = 'R': apply Q or Q^T / Q^H from the right
trans  CHARACTER*1.
   = 'N': apply Q (no transpose)
   = 'T': apply Q^T (transpose, for real flavors)
   = 'C': apply Q^H (conjugate transpose, for complex flavors)
m  INTEGER. The number of rows of the matrix C. m ≥ 0.
n  INTEGER. The number of columns of the matrix C. n ≥ 0.
k  INTEGER. The number of elementary reflectors whose product defines the matrix Q.
   If side = 'L', m ≥ k ≥ 0;
   if side = 'R', n ≥ k ≥ 0.
a  REAL for sorm2l, DOUBLE PRECISION for dorm2l, COMPLEX for cunm2l, DOUBLE COMPLEX for zunm2l. Array, DIMENSION (lda, k). The i-th column must contain the vector which defines the elementary reflector H(i), for i = 1,2,..., k, as returned by ?geqlf in the last k columns of its array argument a. The array a is modified by the routine but restored on exit.
lda  INTEGER. The leading dimension of the array a.
   If side = 'L', lda ≥ max(1, m);
   if side = 'R', lda ≥ max(1, n).
tau REAL for sorm2l, DOUBLE PRECISION for dorm2l, COMPLEX for cunm2l, DOUBLE COMPLEX for zunm2l. Array, DIMENSION (k). tau(i) must contain the scalar factor of the elementary reflector H(i), as returned by ?geqlf.
c REAL for sorm2l, DOUBLE PRECISION for dorm2l, COMPLEX for cunm2l, DOUBLE COMPLEX for zunm2l. Array, DIMENSION (ldc, n). On entry, the m-by-n matrix C.
ldc INTEGER. The leading dimension of the array c. ldc ≥ max(1,m).
work REAL for sorm2l, DOUBLE PRECISION for dorm2l, COMPLEX for cunm2l, DOUBLE COMPLEX for zunm2l. Workspace array, DIMENSION (n) if side = 'L', (m) if side = 'R'.
Output Parameters
c On exit, c is overwritten by Q*C or Q^T*C / Q^H*C, or C*Q, or C*Q^T / C*Q^H.
info INTEGER.
= 0: successful exit
< 0: if info = -i, the i-th argument had an illegal value

?orm2r/?unm2r
Multiplies a general matrix by the orthogonal/unitary matrix from a QR factorization determined by ?geqrf (unblocked algorithm).
Syntax
call sorm2r( side, trans, m, n, k, a, lda, tau, c, ldc, work, info )
call dorm2r( side, trans, m, n, k, a, lda, tau, c, ldc, work, info )
call cunm2r( side, trans, m, n, k, a, lda, tau, c, ldc, work, info )
call zunm2r( side, trans, m, n, k, a, lda, tau, c, ldc, work, info )
Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h
Description
The routine ?orm2r/?unm2r overwrites the general real/complex m-by-n matrix C with
Q*C if side = 'L' and trans = 'N', or
Q^T*C / Q^H*C if side = 'L' and trans = 'T' (for real flavors) or trans = 'C' (for complex flavors), or
C*Q if side = 'R' and trans = 'N', or
C*Q^T / C*Q^H if side = 'R' and trans = 'T' (for real flavors) or trans = 'C' (for complex flavors).
Here Q is a real orthogonal or complex unitary matrix defined as the product of k elementary reflectors
Q = H(1)*H(2)*...*H(k)
as returned by ?geqrf. Q is of order m if side = 'L' and of order n if side = 'R'.
Input Parameters
side CHARACTER*1.
= 'L': apply Q or Q^T / Q^H from the left
= 'R': apply Q or Q^T / Q^H from the right
trans CHARACTER*1.
= 'N': apply Q (no transpose)
= 'T': apply Q^T (transpose, for real flavors)
= 'C': apply Q^H (conjugate transpose, for complex flavors)
m INTEGER. The number of rows of the matrix C. m ≥ 0.
n INTEGER. The number of columns of the matrix C. n ≥ 0.
k INTEGER. The number of elementary reflectors whose product defines the matrix Q. If side = 'L', m ≥ k ≥ 0; if side = 'R', n ≥ k ≥ 0.
a REAL for sorm2r, DOUBLE PRECISION for dorm2r, COMPLEX for cunm2r, DOUBLE COMPLEX for zunm2r. Array, DIMENSION (lda,k). The i-th column must contain the vector which defines the elementary reflector H(i), for i = 1,2,..., k, as returned by ?geqrf in the first k columns of its array argument a. The array a is modified by the routine but restored on exit.
lda INTEGER. The leading dimension of the array a. If side = 'L', lda ≥ max(1, m); if side = 'R', lda ≥ max(1, n).
tau REAL for sorm2r, DOUBLE PRECISION for dorm2r, COMPLEX for cunm2r, DOUBLE COMPLEX for zunm2r. Array, DIMENSION (k). tau(i) must contain the scalar factor of the elementary reflector H(i), as returned by ?geqrf.
c REAL for sorm2r, DOUBLE PRECISION for dorm2r, COMPLEX for cunm2r, DOUBLE COMPLEX for zunm2r. Array, DIMENSION (ldc, n). On entry, the m-by-n matrix C.
ldc INTEGER. The leading dimension of the array c. ldc ≥ max(1,m).
work REAL for sorm2r, DOUBLE PRECISION for dorm2r, COMPLEX for cunm2r, DOUBLE COMPLEX for zunm2r. Workspace array, DIMENSION (n) if side = 'L', (m) if side = 'R'.
Output Parameters
c On exit, c is overwritten by Q*C or Q^T*C / Q^H*C, or C*Q, or C*Q^T / C*Q^H.
info INTEGER.
= 0: successful exit
< 0: if info = -i, the i-th argument had an illegal value

?orml2/?unml2
Multiplies a general matrix by the orthogonal/unitary matrix from an LQ factorization determined by ?gelqf (unblocked algorithm).
Syntax
call sorml2( side, trans, m, n, k, a, lda, tau, c, ldc, work, info )
call dorml2( side, trans, m, n, k, a, lda, tau, c, ldc, work, info )
call cunml2( side, trans, m, n, k, a, lda, tau, c, ldc, work, info )
call zunml2( side, trans, m, n, k, a, lda, tau, c, ldc, work, info )
Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h
Description
The routine ?orml2/?unml2 overwrites the general real/complex m-by-n matrix C with
Q*C if side = 'L' and trans = 'N', or
Q^T*C / Q^H*C if side = 'L' and trans = 'T' (for real flavors) or trans = 'C' (for complex flavors), or
C*Q if side = 'R' and trans = 'N', or
C*Q^T / C*Q^H if side = 'R' and trans = 'T' (for real flavors) or trans = 'C' (for complex flavors).
Here Q is a real orthogonal or complex unitary matrix defined as the product of k elementary reflectors
Q = H(k)*...*H(2)*H(1) for real flavors, or
Q = (H(k))^H*...*(H(2))^H*(H(1))^H for complex flavors
as returned by ?gelqf. Q is of order m if side = 'L' and of order n if side = 'R'.
Input Parameters
side CHARACTER*1.
= 'L': apply Q or Q^T / Q^H from the left
= 'R': apply Q or Q^T / Q^H from the right
trans CHARACTER*1.
= 'N': apply Q (no transpose)
= 'T': apply Q^T (transpose, for real flavors)
= 'C': apply Q^H (conjugate transpose, for complex flavors)
m INTEGER. The number of rows of the matrix C. m ≥ 0.
n INTEGER. The number of columns of the matrix C. n ≥ 0.
k INTEGER. The number of elementary reflectors whose product defines the matrix Q. If side = 'L', m ≥ k ≥ 0; if side = 'R', n ≥ k ≥ 0.
a REAL for sorml2, DOUBLE PRECISION for dorml2, COMPLEX for cunml2, DOUBLE COMPLEX for zunml2. Array, DIMENSION (lda, m) if side = 'L', (lda, n) if side = 'R'. The i-th row must contain the vector which defines the elementary reflector H(i), for i = 1,2,..., k, as returned by ?gelqf in the first k rows of its array argument a. The array a is modified by the routine but restored on exit.
lda INTEGER. The leading dimension of the array a.
lda ≥ max(1,k).
tau REAL for sorml2, DOUBLE PRECISION for dorml2, COMPLEX for cunml2, DOUBLE COMPLEX for zunml2. Array, DIMENSION (k). tau(i) must contain the scalar factor of the elementary reflector H(i), as returned by ?gelqf.
c REAL for sorml2, DOUBLE PRECISION for dorml2, COMPLEX for cunml2, DOUBLE COMPLEX for zunml2. Array, DIMENSION (ldc, n). On entry, the m-by-n matrix C.
ldc INTEGER. The leading dimension of the array c. ldc ≥ max(1,m).
work REAL for sorml2, DOUBLE PRECISION for dorml2, COMPLEX for cunml2, DOUBLE COMPLEX for zunml2. Workspace array, DIMENSION (n) if side = 'L', (m) if side = 'R'.
Output Parameters
c On exit, c is overwritten by Q*C or Q^T*C / Q^H*C, or C*Q, or C*Q^T / C*Q^H.
info INTEGER.
= 0: successful exit
< 0: if info = -i, the i-th argument had an illegal value

?ormr2/?unmr2
Multiplies a general matrix by the orthogonal/unitary matrix from an RQ factorization determined by ?gerqf (unblocked algorithm).
Syntax
call sormr2( side, trans, m, n, k, a, lda, tau, c, ldc, work, info )
call dormr2( side, trans, m, n, k, a, lda, tau, c, ldc, work, info )
call cunmr2( side, trans, m, n, k, a, lda, tau, c, ldc, work, info )
call zunmr2( side, trans, m, n, k, a, lda, tau, c, ldc, work, info )
Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h
Description
The routine ?ormr2/?unmr2 overwrites the general real/complex m-by-n matrix C with
Q*C if side = 'L' and trans = 'N', or
Q^T*C / Q^H*C if side = 'L' and trans = 'T' (for real flavors) or trans = 'C' (for complex flavors), or
C*Q if side = 'R' and trans = 'N', or
C*Q^T / C*Q^H if side = 'R' and trans = 'T' (for real flavors) or trans = 'C' (for complex flavors).
Here Q is a real orthogonal or complex unitary matrix defined as the product of k elementary reflectors
Q = H(1)*H(2)*...*H(k) for real flavors, or
Q = (H(1))^H*(H(2))^H*...*(H(k))^H for complex flavors
as returned by ?gerqf. Q is of order m if side = 'L' and of order n if side = 'R'.
Input Parameters
side CHARACTER*1.
= 'L': apply Q or Q^T / Q^H from the left
= 'R': apply Q or Q^T / Q^H from the right
trans CHARACTER*1.
= 'N': apply Q (no transpose)
= 'T': apply Q^T (transpose, for real flavors)
= 'C': apply Q^H (conjugate transpose, for complex flavors)
m INTEGER. The number of rows of the matrix C. m ≥ 0.
n INTEGER. The number of columns of the matrix C. n ≥ 0.
k INTEGER. The number of elementary reflectors whose product defines the matrix Q. If side = 'L', m ≥ k ≥ 0; if side = 'R', n ≥ k ≥ 0.
a REAL for sormr2, DOUBLE PRECISION for dormr2, COMPLEX for cunmr2, DOUBLE COMPLEX for zunmr2. Array, DIMENSION (lda, m) if side = 'L', (lda, n) if side = 'R'. The i-th row must contain the vector which defines the elementary reflector H(i), for i = 1,2,...,k, as returned by ?gerqf in the last k rows of its array argument a. The array a is modified by the routine but restored on exit.
lda INTEGER. The leading dimension of the array a. lda ≥ max(1,k).
tau REAL for sormr2, DOUBLE PRECISION for dormr2, COMPLEX for cunmr2, DOUBLE COMPLEX for zunmr2. Array, DIMENSION (k). tau(i) must contain the scalar factor of the elementary reflector H(i), as returned by ?gerqf.
c REAL for sormr2, DOUBLE PRECISION for dormr2, COMPLEX for cunmr2, DOUBLE COMPLEX for zunmr2. Array, DIMENSION (ldc, n). On entry, the m-by-n matrix C.
ldc INTEGER. The leading dimension of the array c. ldc ≥ max(1,m).
work REAL for sormr2, DOUBLE PRECISION for dormr2, COMPLEX for cunmr2, DOUBLE COMPLEX for zunmr2. Workspace array, DIMENSION (n) if side = 'L', (m) if side = 'R'.
Output Parameters
c On exit, c is overwritten by Q*C or Q^T*C / Q^H*C, or C*Q, or C*Q^T / C*Q^H.
info INTEGER.
= 0: successful exit
< 0: if info = -i, the i-th argument had an illegal value

?ormr3/?unmr3
Multiplies a general matrix by the orthogonal/unitary matrix from an RZ factorization determined by ?tzrzf (unblocked algorithm).
Syntax
call sormr3( side, trans, m, n, k, l, a, lda, tau, c, ldc, work, info )
call dormr3( side, trans, m, n, k, l, a, lda, tau, c, ldc, work, info )
call cunmr3( side, trans, m, n, k, l, a, lda, tau, c, ldc, work, info )
call zunmr3( side, trans, m, n, k, l, a, lda, tau, c, ldc, work, info )
Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h
Description
The routine ?ormr3/?unmr3 overwrites the general real/complex m-by-n matrix C with
Q*C if side = 'L' and trans = 'N', or
Q^T*C / Q^H*C if side = 'L' and trans = 'T' (for real flavors) or trans = 'C' (for complex flavors), or
C*Q if side = 'R' and trans = 'N', or
C*Q^T / C*Q^H if side = 'R' and trans = 'T' (for real flavors) or trans = 'C' (for complex flavors).
Here Q is a real orthogonal or complex unitary matrix defined as the product of k elementary reflectors
Q = H(1)*H(2)*...*H(k)
as returned by ?tzrzf. Q is of order m if side = 'L' and of order n if side = 'R'.
Input Parameters
side CHARACTER*1.
= 'L': apply Q or Q^T / Q^H from the left
= 'R': apply Q or Q^T / Q^H from the right
trans CHARACTER*1.
= 'N': apply Q (no transpose)
= 'T': apply Q^T (transpose, for real flavors)
= 'C': apply Q^H (conjugate transpose, for complex flavors)
m INTEGER. The number of rows of the matrix C. m ≥ 0.
n INTEGER. The number of columns of the matrix C. n ≥ 0.
k INTEGER. The number of elementary reflectors whose product defines the matrix Q. If side = 'L', m ≥ k ≥ 0; if side = 'R', n ≥ k ≥ 0.
l INTEGER. The number of columns of the matrix A containing the meaningful part of the Householder reflectors. If side = 'L', m ≥ l ≥ 0; if side = 'R', n ≥ l ≥ 0.
a REAL for sormr3, DOUBLE PRECISION for dormr3, COMPLEX for cunmr3, DOUBLE COMPLEX for zunmr3. Array, DIMENSION (lda, m) if side = 'L', (lda, n) if side = 'R'. The i-th row must contain the vector which defines the elementary reflector H(i), for i = 1,2,...,k, as returned by ?tzrzf in the last k rows of its array argument a.
The array a is modified by the routine but restored on exit.
lda INTEGER. The leading dimension of the array a. lda ≥ max(1,k).
tau REAL for sormr3, DOUBLE PRECISION for dormr3, COMPLEX for cunmr3, DOUBLE COMPLEX for zunmr3. Array, DIMENSION (k). tau(i) must contain the scalar factor of the elementary reflector H(i), as returned by ?tzrzf.
c REAL for sormr3, DOUBLE PRECISION for dormr3, COMPLEX for cunmr3, DOUBLE COMPLEX for zunmr3. Array, DIMENSION (ldc, n). On entry, the m-by-n matrix C.
ldc INTEGER. The leading dimension of the array c. ldc ≥ max(1,m).
work REAL for sormr3, DOUBLE PRECISION for dormr3, COMPLEX for cunmr3, DOUBLE COMPLEX for zunmr3. Workspace array, DIMENSION (n) if side = 'L', (m) if side = 'R'.
Output Parameters
c On exit, c is overwritten by Q*C or Q^T*C / Q^H*C, or C*Q, or C*Q^T / C*Q^H.
info INTEGER.
= 0: successful exit
< 0: if info = -i, the i-th argument had an illegal value

?pbtf2
Computes the Cholesky factorization of a symmetric/Hermitian positive-definite band matrix (unblocked algorithm).
Syntax
call spbtf2( uplo, n, kd, ab, ldab, info )
call dpbtf2( uplo, n, kd, ab, ldab, info )
call cpbtf2( uplo, n, kd, ab, ldab, info )
call zpbtf2( uplo, n, kd, ab, ldab, info )
Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h
Description
The routine computes the Cholesky factorization of a real symmetric or complex Hermitian positive definite band matrix A. The factorization has the form
A = U^T*U for real flavors, A = U^H*U for complex flavors if uplo = 'U', or
A = L*L^T for real flavors, A = L*L^H for complex flavors if uplo = 'L',
where U is an upper triangular matrix, and L is lower triangular.
This is the unblocked version of the algorithm, calling BLAS Level 2 Routines.
Input Parameters
uplo CHARACTER*1.
Specifies whether the upper or lower triangular part of the symmetric/Hermitian matrix A is stored:
= 'U': upper triangular
= 'L': lower triangular
n INTEGER. The order of the matrix A. n ≥ 0.
kd INTEGER. The number of super-diagonals of the matrix A if uplo = 'U', or the number of sub-diagonals if uplo = 'L'. kd ≥ 0.
ab REAL for spbtf2, DOUBLE PRECISION for dpbtf2, COMPLEX for cpbtf2, DOUBLE COMPLEX for zpbtf2. Array, DIMENSION (ldab, n). On entry, the upper or lower triangle of the symmetric/Hermitian band matrix A, stored in the first kd+1 rows of the array. The j-th column of A is stored in the j-th column of the array ab as follows:
if uplo = 'U', ab(kd+1+i-j, j) = A(i, j) for max(1, j-kd) ≤ i ≤ j;
if uplo = 'L', ab(1+i-j, j) = A(i, j) for j ≤ i ≤ min(n, j+kd).
ldab INTEGER. The leading dimension of the array ab. ldab ≥ kd+1.
Output Parameters
ab On exit, if info = 0, the triangular factor U or L from the Cholesky factorization A = U^T*U (A = U^H*U), or A = L*L^T (A = L*L^H) of the band matrix A, in the same storage format as A.
info INTEGER.
= 0: successful exit
< 0: if info = -k, the k-th argument had an illegal value
> 0: if info = k, the leading minor of order k is not positive definite, and the factorization could not be completed.

?potf2
Computes the Cholesky factorization of a symmetric/Hermitian positive-definite matrix (unblocked algorithm).
Syntax
call spotf2( uplo, n, a, lda, info )
call dpotf2( uplo, n, a, lda, info )
call cpotf2( uplo, n, a, lda, info )
call zpotf2( uplo, n, a, lda, info )
Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h
Description
The routine ?potf2 computes the Cholesky factorization of a real symmetric or complex Hermitian positive definite matrix A.
The factorization has the form
A = U^T*U for real flavors, A = U^H*U for complex flavors if uplo = 'U', or
A = L*L^T for real flavors, A = L*L^H for complex flavors if uplo = 'L',
where U is an upper triangular matrix, and L is lower triangular.
This is the unblocked version of the algorithm, calling BLAS Level 2 Routines.
Input Parameters
uplo CHARACTER*1. Specifies whether the upper or lower triangular part of the symmetric/Hermitian matrix A is stored.
= 'U': upper triangular
= 'L': lower triangular
n INTEGER. The order of the matrix A. n ≥ 0.
a REAL for spotf2, DOUBLE PRECISION for dpotf2, COMPLEX for cpotf2, DOUBLE COMPLEX for zpotf2. Array, DIMENSION (lda, n). On entry, the symmetric/Hermitian matrix A.
If uplo = 'U', the leading n-by-n upper triangular part of a contains the upper triangular part of the matrix A, and the strictly lower triangular part of a is not referenced.
If uplo = 'L', the leading n-by-n lower triangular part of a contains the lower triangular part of the matrix A, and the strictly upper triangular part of a is not referenced.
lda INTEGER. The leading dimension of the array a. lda ≥ max(1,n).
Output Parameters
a On exit, if info = 0, the factor U or L from the Cholesky factorization A = U^T*U (A = U^H*U), or A = L*L^T (A = L*L^H).
info INTEGER.
= 0: successful exit
< 0: if info = -k, the k-th argument had an illegal value
> 0: if info = k, the leading minor of order k is not positive definite, and the factorization could not be completed.

?ptts2
Solves a tridiagonal system of the form A*X = B using the L*D*L^T / L*D*L^H factorization computed by ?pttrf.
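As a rough illustration of what the real flavors of ?ptts2 compute, the sketch below solves L*D*L^T*x = b for a single right-hand side: forward substitution with the unit bidiagonal factor L, then scaling by D combined with back substitution through L^T. It is a plain Python sketch, not MKL code; the function name ptts2_real and the list-based storage are assumptions made for this example only.

```python
def ptts2_real(d, e, b):
    """Solve (L*D*L^T) x = b for one right-hand side.

    d: the n diagonal entries of D.
    e: the n-1 subdiagonal entries of the unit bidiagonal factor L.
    b: right-hand side of length n; the solution is returned as a new list.
    Hypothetical helper for illustration; not part of MKL or LAPACK.
    """
    n = len(d)
    x = list(b)
    # Forward substitution: solve L*y = b.
    for i in range(1, n):
        x[i] -= x[i - 1] * e[i - 1]
    # Divide by D and back-substitute through L^T in one backward sweep.
    if n > 0:
        x[n - 1] /= d[n - 1]
    for i in range(n - 2, -1, -1):
        x[i] = x[i] / d[i] - x[i + 1] * e[i]
    return x


# With d = [4, 3] and e = [0.5], A = L*D*L^T is [[4, 2], [2, 4]],
# and b = A * [1, 2] = [8, 10].
solution = ptts2_real([4.0, 3.0], [0.5], [8.0, 10.0])
```

With several right-hand sides, the same two sweeps would be applied to each column of B in turn.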
Syntax
call sptts2( n, nrhs, d, e, b, ldb )
call dptts2( n, nrhs, d, e, b, ldb )
call cptts2( iuplo, n, nrhs, d, e, b, ldb )
call zptts2( iuplo, n, nrhs, d, e, b, ldb )
Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h
Description
The routine ?ptts2 solves a tridiagonal system of the form
A*X = B.
Real flavors sptts2/dptts2 use the L*D*L^T factorization of A computed by spttrf/dpttrf, and complex flavors cptts2/zptts2 use the U^H*D*U or L*D*L^H factorization of A computed by cpttrf/zpttrf.
D is a diagonal matrix specified in the vector d, U (or L) is a unit bidiagonal matrix whose superdiagonal (subdiagonal) is specified in the vector e, and X and B are n-by-nrhs matrices.
Input Parameters
iuplo INTEGER. Used with complex flavors only. Specifies the form of the factorization, and whether the vector e is the superdiagonal of the upper bidiagonal factor U or the subdiagonal of the lower bidiagonal factor L.
= 1: A = U^H*D*U, e is the superdiagonal of U;
= 0: A = L*D*L^H, e is the subdiagonal of L.
n INTEGER. The order of the tridiagonal matrix A. n ≥ 0.
nrhs INTEGER. The number of right-hand sides, that is, the number of columns of the matrix B. nrhs ≥ 0.
d REAL for sptts2/cptts2, DOUBLE PRECISION for dptts2/zptts2. Array, DIMENSION (n). The n diagonal elements of the diagonal matrix D from the factorization of A.
e REAL for sptts2, DOUBLE PRECISION for dptts2, COMPLEX for cptts2, DOUBLE COMPLEX for zptts2. Array, DIMENSION (n-1). Contains the (n-1) subdiagonal elements of the unit bidiagonal factor L from the L*D*L^T (for real flavors) or L*D*L^H (for complex flavors when iuplo = 0) factorization of A. For complex flavors when iuplo = 1, e contains the (n-1) superdiagonal elements of the unit bidiagonal factor U from the factorization A = U^H*D*U.
b REAL for sptts2, DOUBLE PRECISION for dptts2, COMPLEX for cptts2, DOUBLE COMPLEX for zptts2. Array, DIMENSION (ldb, nrhs). On entry, the right-hand side vectors B for the system of linear equations.
ldb INTEGER.
The leading dimension of the array b. ldb ≥ max(1,n).
Output Parameters
b On exit, the solution vectors, X.

?rscl
Multiplies a vector by the reciprocal of a real scalar.
Syntax
call srscl( n, sa, sx, incx )
call drscl( n, sa, sx, incx )
call csrscl( n, sa, sx, incx )
call zdrscl( n, sa, sx, incx )
Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h
Description
The routine ?rscl multiplies an n-element real/complex vector x by the real scalar 1/a. This is done without overflow or underflow as long as the final result x/a does not overflow or underflow.
Input Parameters
n INTEGER. The number of components of the vector x.
sa REAL for srscl/csrscl, DOUBLE PRECISION for drscl/zdrscl. The scalar a which is used to divide each component of the vector x. sa must be nonzero, or the subroutine will divide by zero.
sx REAL for srscl, DOUBLE PRECISION for drscl, COMPLEX for csrscl, DOUBLE COMPLEX for zdrscl. Array, DIMENSION (1+(n-1)*|incx|). The n-element vector x.
incx INTEGER. The increment between successive values of the vector sx. If incx > 0, sx(1) = x(1), and sx(1+(i-1)*incx) = x(i), 1 < i ≤ n.
[...]
If ipiv(k) > 0, then rows and columns k and ipiv(k) were interchanged and D(k,k) is a 1-by-1 diagonal block.
If uplo = 'U' and ipiv(k) = ipiv(k-1) < 0, then rows and columns k-1 and -ipiv(k) were interchanged and D(k-1:k, k-1:k) is a 2-by-2 diagonal block.
If uplo = 'L' and ipiv(k) = ipiv(k+1) < 0, then rows and columns k+1 and -ipiv(k) were interchanged and D(k:k+1, k:k+1) is a 2-by-2 diagonal block.
info INTEGER.
= 0: successful exit
< 0: if info = -k, the k-th argument has an illegal value
> 0: if info = k, D(k,k) is exactly zero. The factorization has been completed, but the block diagonal matrix D is exactly singular, and division by zero will occur if it is used to solve a system of equations.

?hetf2
Computes the factorization of a complex Hermitian matrix, using the diagonal pivoting method (unblocked algorithm).
Syntax
call chetf2( uplo, n, a, lda, ipiv, info )
call zhetf2( uplo, n, a, lda, ipiv, info )
Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h
Description
The routine computes the factorization of a complex Hermitian matrix A using the Bunch-Kaufman diagonal pivoting method:
A = U*D*U^H or A = L*D*L^H
where U (or L) is a product of permutation and unit upper (lower) triangular matrices, U^H is the conjugate transpose of U, and D is Hermitian and block diagonal with 1-by-1 and 2-by-2 diagonal blocks.
This is the unblocked version of the algorithm, calling BLAS Level 2 Routines.
Input Parameters
uplo CHARACTER*1. Specifies whether the upper or lower triangular part of the Hermitian matrix A is stored:
= 'U': upper triangular
= 'L': lower triangular
n INTEGER. The order of the matrix A. n ≥ 0.
a COMPLEX for chetf2, DOUBLE COMPLEX for zhetf2. Array, DIMENSION (lda, n). On entry, the Hermitian matrix A.
If uplo = 'U', the leading n-by-n upper triangular part of a contains the upper triangular part of the matrix A, and the strictly lower triangular part of a is not referenced.
If uplo = 'L', the leading n-by-n lower triangular part of a contains the lower triangular part of the matrix A, and the strictly upper triangular part of a is not referenced.
lda INTEGER. The leading dimension of the array a. lda ≥ max(1,n).
Output Parameters
a On exit, the block diagonal matrix D and the multipliers used to obtain the factor U or L.
ipiv INTEGER. Array, DIMENSION (n). Details of the interchanges and the block structure of D.
If ipiv(k) > 0, then rows and columns k and ipiv(k) were interchanged and D(k,k) is a 1-by-1 diagonal block.
If uplo = 'U' and ipiv(k) = ipiv(k-1) < 0, then rows and columns k-1 and -ipiv(k) were interchanged and D(k-1:k, k-1:k) is a 2-by-2 diagonal block.
If uplo = 'L' and ipiv(k) = ipiv(k+1) < 0, then rows and columns k+1 and -ipiv(k) were interchanged and D(k:k+1, k:k+1) is a 2-by-2 diagonal block.
info INTEGER.
= 0: successful exit
< 0: if info = -k, the k-th argument had an illegal value
> 0: if info = k, D(k,k) is exactly zero. The factorization has been completed, but the block diagonal matrix D is exactly singular, and division by zero will occur if it is used to solve a system of equations.

?tgex2
Swaps adjacent diagonal blocks in an upper (quasi) triangular matrix pair by an orthogonal/unitary equivalence transformation.
Syntax
call stgex2( wantq, wantz, n, a, lda, b, ldb, q, ldq, z, ldz, j1, n1, n2, work, lwork, info )
call dtgex2( wantq, wantz, n, a, lda, b, ldb, q, ldq, z, ldz, j1, n1, n2, work, lwork, info )
call ctgex2( wantq, wantz, n, a, lda, b, ldb, q, ldq, z, ldz, j1, info )
call ztgex2( wantq, wantz, n, a, lda, b, ldb, q, ldq, z, ldz, j1, info )
Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h
Description
The real routines stgex2/dtgex2 swap adjacent diagonal blocks (A11, B11) and (A22, B22) of size 1-by-1 or 2-by-2 in an upper (quasi) triangular matrix pair (A, B) by an orthogonal equivalence transformation. (A, B) must be in generalized real Schur canonical form (as returned by sgges/dgges), that is, A is block upper triangular with 1-by-1 and 2-by-2 diagonal blocks, and B is upper triangular.
The complex routines ctgex2/ztgex2 swap adjacent diagonal 1-by-1 blocks (A11, B11) and (A22, B22) in an upper triangular matrix pair (A, B) by a unitary equivalence transformation. (A, B) must be in generalized Schur canonical form, that is, A and B are both upper triangular.
All routines optionally update the matrices Q and Z of generalized Schur vectors:
For real flavors,
Q(in)*A(in)*Z(in)^T = Q(out)*A(out)*Z(out)^T
Q(in)*B(in)*Z(in)^T = Q(out)*B(out)*Z(out)^T.
For complex flavors,
Q(in)*A(in)*Z(in)^H = Q(out)*A(out)*Z(out)^H
Q(in)*B(in)*Z(in)^H = Q(out)*B(out)*Z(out)^H.
Input Parameters
wantq LOGICAL.
If wantq = .TRUE.: update the left transformation matrix Q;
If wantq = .FALSE.: do not update Q.
wantz LOGICAL.
If wantz = .TRUE.: update the right transformation matrix Z;
If wantz = .FALSE.: do not update Z.
n INTEGER. The order of the matrices A and B. n ≥ 0.
a, b REAL for stgex2, DOUBLE PRECISION for dtgex2, COMPLEX for ctgex2, DOUBLE COMPLEX for ztgex2. Arrays, DIMENSION (lda, n) and (ldb, n), respectively. On entry, the matrices A and B in the pair (A, B).
lda INTEGER. The leading dimension of the array a. lda ≥ max(1,n).
ldb INTEGER. The leading dimension of the array b. ldb ≥ max(1,n).
q, z REAL for stgex2, DOUBLE PRECISION for dtgex2, COMPLEX for ctgex2, DOUBLE COMPLEX for ztgex2. Arrays, DIMENSION (ldq, n) and (ldz, n), respectively. On entry, if wantq = .TRUE., q contains the orthogonal/unitary matrix Q, and if wantz = .TRUE., z contains the orthogonal/unitary matrix Z.
ldq INTEGER. The leading dimension of the array q. ldq ≥ 1. If wantq = .TRUE., ldq ≥ n.
ldz INTEGER. The leading dimension of the array z. ldz ≥ 1. If wantz = .TRUE., ldz ≥ n.
j1 INTEGER. The index to the first block (A11, B11). 1 ≤ j1 ≤ n.
n1 INTEGER. Used with real flavors only. The order of the first block (A11, B11). n1 = 0, 1 or 2.
n2 INTEGER. Used with real flavors only. The order of the second block (A22, B22). n2 = 0, 1 or 2.
work REAL for stgex2, DOUBLE PRECISION for dtgex2. Workspace array, DIMENSION (max(1,lwork)). Used with real flavors only.
lwork INTEGER. The dimension of the array work. lwork ≥ max(n*(n2+n1), 2*(n2+n1)^2).
Output Parameters
a On exit, the updated matrix A.
b On exit, the updated matrix B.
q On exit, the updated matrix Q. Not referenced if wantq = .FALSE.
z On exit, the updated matrix Z. Not referenced if wantz = .FALSE.
info INTEGER.
= 0: successful exit
For stgex2/dtgex2:
If info = 1, the transformed matrix (A, B) would be too far from generalized Schur form; the blocks are not swapped and (A, B) and (Q, Z) are unchanged. The problem of swapping is too ill-conditioned.
If info = -16: lwork is too small.
Appropriate value for lwork is returned in work(1).
For ctgex2/ztgex2:
If info = 1, the transformed matrix pair (A, B) would be too far from generalized Schur form; the problem is ill-conditioned.

?tgsy2
Solves the generalized Sylvester equation (unblocked algorithm).
Syntax
call stgsy2( trans, ijob, m, n, a, lda, b, ldb, c, ldc, d, ldd, e, lde, f, ldf, scale, rdsum, rdscal, iwork, pq, info )
call dtgsy2( trans, ijob, m, n, a, lda, b, ldb, c, ldc, d, ldd, e, lde, f, ldf, scale, rdsum, rdscal, iwork, pq, info )
call ctgsy2( trans, ijob, m, n, a, lda, b, ldb, c, ldc, d, ldd, e, lde, f, ldf, scale, rdsum, rdscal, info )
call ztgsy2( trans, ijob, m, n, a, lda, b, ldb, c, ldc, d, ldd, e, lde, f, ldf, scale, rdsum, rdscal, info )
Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h
Description
The routine ?tgsy2 solves the generalized Sylvester equation:
A*R - L*B = scale*C
D*R - L*E = scale*F   (1)
using Level 1 and 2 BLAS, where R and L are unknown m-by-n matrices, (A, D), (B, E) and (C, F) are given matrix pairs of size m-by-m, n-by-n and m-by-n, respectively. For stgsy2/dtgsy2, pairs (A, D) and (B, E) must be in generalized Schur canonical form, that is, A, B are upper quasi triangular and D, E are upper triangular. For ctgsy2/ztgsy2, matrices A, B, D and E are upper triangular (that is, (A, D) and (B, E) are in generalized Schur form).
The solution (R, L) overwrites (C, F). 0 ≤ scale ≤ 1 is an output scaling factor chosen to avoid overflow.
In matrix notation, solving equation (1) corresponds to solving
Z*x = scale*b
where Z is defined for real flavors as
Z = [ kron(In, A)  -kron(B^T, Im) ]
    [ kron(In, D)  -kron(E^T, Im) ]   (2)
and for complex flavors as
Z = [ kron(In, A)  -kron(B^H, Im) ]
    [ kron(In, D)  -kron(E^H, Im) ]   (3)
Here Ik is the identity matrix of size k and X^T (X^H) is the transpose (conjugate transpose) of X. kron(X, Y) denotes the Kronecker product between the matrices X and Y.
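To make the matrix notation concrete, the following sketch builds Z for the real case and checks that Z*x reproduces the left-hand sides of (1). It assumes the block form Z = [ kron(In, A), -kron(B^T, Im); kron(In, D), -kron(E^T, Im) ] and column-major stacking x = [vec(R); vec(L)], which is what vectorizing (1) column by column gives. All helper names are hypothetical and the matrices are arbitrary small integer examples, chosen so that the comparison is exact.

```python
def kron(x, y):
    """Kronecker product of two matrices stored as lists of rows."""
    return [[xv * yv for xv in xrow for yv in yrow]
            for xrow in x for yrow in y]

def identity(k):
    return [[1 if i == j else 0 for j in range(k)] for i in range(k)]

def transpose(x):
    return [list(col) for col in zip(*x)]

def matmul(x, y):
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*y)]
            for row in x]

def subtract(x, y):
    return [[p - q for p, q in zip(rx, ry)] for rx, ry in zip(x, y)]

def vec(x):
    """Stack the columns of x into one vector (column-major order)."""
    return [x[i][j] for j in range(len(x[0])) for i in range(len(x))]

def build_z(a, b, d, e):
    """Assemble the 2mn-by-2mn matrix Z for the real, non-transposed case."""
    m, n = len(a), len(b)

    def block_row(p, q):
        left = kron(identity(n), p)               # acts on vec(R)
        right = kron(transpose(q), identity(m))   # acts on vec(L)
        return [lrow + [-v for v in rrow] for lrow, rrow in zip(left, right)]

    return block_row(a, b) + block_row(d, e)

def apply_matrix(z, x):
    return [sum(zv * xv for zv, xv in zip(row, x)) for row in z]

# Example data with m = 2 and n = 3 (values are arbitrary).
a = [[1, 2], [0, 3]]
d = [[2, 1], [0, 1]]
b = [[1, 1, 0], [0, 2, 1], [0, 0, 3]]
e = [[1, 0, 2], [0, 1, 1], [0, 0, 2]]
r_mat = [[1, 2, 3], [4, 5, 6]]
l_mat = [[2, 0, 1], [1, 3, 2]]

z = build_z(a, b, d, e)
lhs = apply_matrix(z, vec(r_mat) + vec(l_mat))
rhs = (vec(subtract(matmul(a, r_mat), matmul(l_mat, b)))
       + vec(subtract(matmul(d, r_mat), matmul(l_mat, e))))
```

The routine itself never forms Z explicitly; the sketch only shows why solving (1) is a linear system in the 2*m*n unknowns of R and L.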
For real flavors, if trans = 'T', solve the transposed system
Z^T*y = scale*b
for y, which is equivalent to solving for R and L in
A^T*R + D^T*L = scale*C
R*B^T + L*E^T = scale*(-F)   (4)
For complex flavors, if trans = 'C', solve the conjugate transposed system
Z^H*y = scale*b
for y, which is equivalent to solving for R and L in
A^H*R + D^H*L = scale*C
R*B^H + L*E^H = scale*(-F)   (5)
These cases are used to compute an estimate of Dif[(A,D),(B,E)] = sigma_min(Z) using reverse communication with ?lacon.
?tgsy2 also (for ijob = 1) contributes to the computation in ?tgsyl of an upper bound on the separation between two matrix pairs. Then the input (A, D), (B, E) are sub-pencils of the matrix pair (two matrix pairs) in ?tgsyl. See ?tgsyl for details.
Input Parameters
trans CHARACTER*1.
If trans = 'N': solve the generalized Sylvester equation (1);
If trans = 'T': solve the transposed system (4);
If trans = 'C': solve the conjugate transposed system (5).
ijob INTEGER. Specifies what kind of functionality is to be performed.
If ijob = 0: solve (1) only.
If ijob = 1: a contribution from this subsystem to a Frobenius norm-based estimate of the separation between two matrix pairs is computed (look ahead strategy is used);
If ijob = 2: a contribution from this subsystem to a Frobenius norm-based estimate of the separation between two matrix pairs is computed (?gecon on sub-systems is used).
Not referenced if trans = 'T'.
m INTEGER. On entry, m specifies the order of A and D, and the row dimension of C, F, R and L.
n INTEGER. On entry, n specifies the order of B and E, and the column dimension of C, F, R and L.
a, b REAL for stgsy2, DOUBLE PRECISION for dtgsy2, COMPLEX for ctgsy2, DOUBLE COMPLEX for ztgsy2. Arrays, DIMENSION (lda, m) and (ldb, n), respectively. On entry, a contains an upper (quasi) triangular matrix A, and b contains an upper (quasi) triangular matrix B.
lda INTEGER. The leading dimension of the array a. lda ≥ max(1, m).
ldb INTEGER.
The leading dimension of the array b. ldb ≥ max(1, n).
c, f REAL for stgsy2, DOUBLE PRECISION for dtgsy2, COMPLEX for ctgsy2, DOUBLE COMPLEX for ztgsy2. Arrays, DIMENSION (ldc, n) and (ldf, n), respectively. On entry, c contains the right-hand side of the first matrix equation in (1), and f contains the right-hand side of the second matrix equation in (1).
ldc INTEGER. The leading dimension of the array c. ldc ≥ max(1, m).
d, e REAL for stgsy2, DOUBLE PRECISION for dtgsy2, COMPLEX for ctgsy2, DOUBLE COMPLEX for ztgsy2. Arrays, DIMENSION (ldd, m) and (lde, n), respectively. On entry, d contains an upper triangular matrix D, and e contains an upper triangular matrix E.
ldd INTEGER. The leading dimension of the array d. ldd ≥ max(1, m).
lde INTEGER. The leading dimension of the array e. lde ≥ max(1, n).
ldf INTEGER. The leading dimension of the array f. ldf ≥ max(1, m).
rdsum REAL for stgsy2/ctgsy2, DOUBLE PRECISION for dtgsy2/ztgsy2. On entry, the sum of squares of computed contributions to the Dif-estimate under computation by ?tgsyl, where the scaling factor rdscal has been factored out.
rdscal REAL for stgsy2/ctgsy2, DOUBLE PRECISION for dtgsy2/ztgsy2. On entry, scaling factor used to prevent overflow in rdsum.
iwork INTEGER. Used with real flavors only. Workspace array, DIMENSION (m+n+2).
Output Parameters
c On exit, if ijob = 0, c is overwritten by the solution R.
f On exit, if ijob = 0, f is overwritten by the solution L.
scale REAL for stgsy2/ctgsy2, DOUBLE PRECISION for dtgsy2/ztgsy2. On exit, 0 ≤ scale ≤ 1. If 0 < scale < 1, the solutions R and L (C and F on entry) hold the solutions to a slightly perturbed system, but the input matrices A, B, D and E are not changed. If scale = 0, R and L hold the solutions to the homogeneous system with C = F = 0. Normally scale = 1.
rdsum On exit, the corresponding sum of squares updated with the contributions from the current sub-system. If trans = 'T', rdsum is not touched.
Note that rdsum only makes sense when ?tgsy2 is called by ?tgsyl. rdscal On exit, rdscal is updated with respect to the current contributions in rdsum. If trans = 'T', rdscal is not touched. Note that rdscal only makes sense when ?tgsy2 is called by ?tgsyl. pq INTEGER. Used with real flavors only. On exit, the number of subsystems (of size 2-by-2, 4-by-4 and 8-by-8) solved by the routine stgsy2/dtgsy2. info INTEGER. On exit: if info = 0, successful exit; if info < 0 and info = -i, the i-th argument has an illegal value; if info > 0, the matrix pairs (A, D) and (B, E) have common or very close eigenvalues. ?trti2 Computes the inverse of a triangular matrix (unblocked algorithm). Syntax call strti2( uplo, diag, n, a, lda, info ) call dtrti2( uplo, diag, n, a, lda, info ) call ctrti2( uplo, diag, n, a, lda, info ) call ztrti2( uplo, diag, n, a, lda, info ) Include Files • FORTRAN 77: mkl_lapack.fi and mkl_lapack.h Description The routine ?trti2 computes the inverse of a real/complex upper or lower triangular matrix. This is the Level 2 BLAS version of the algorithm. Input Parameters uplo CHARACTER*1. Specifies whether the matrix A is upper or lower triangular. = 'U': upper triangular = 'L': lower triangular diag CHARACTER*1. Specifies whether or not the matrix A is unit triangular. = 'N': non-unit triangular = 'U': unit triangular n INTEGER. The order of the matrix A. n ≥ 0. a REAL for strti2 DOUBLE PRECISION for dtrti2 COMPLEX for ctrti2 DOUBLE COMPLEX for ztrti2. Array, DIMENSION (lda, n). On entry, the triangular matrix A. If uplo = 'U', the leading n-by-n upper triangular part of the array a contains the upper triangular matrix, and the strictly lower triangular part of a is not referenced. If uplo = 'L', the leading n-by-n lower triangular part of the array a contains the lower triangular matrix, and the strictly upper triangular part of a is not referenced.
If diag = 'U', the diagonal elements of a are also not referenced and are assumed to be 1. lda INTEGER. The leading dimension of the array a. lda ≥ max(1,n). Output Parameters a On exit, the (triangular) inverse of the original matrix, in the same storage format. info INTEGER. = 0: successful exit < 0: if info = -k, the k-th argument had an illegal value clag2z Converts a complex single precision matrix to a complex double precision matrix. Syntax call clag2z( m, n, sa, ldsa, a, lda, info ) Include Files • FORTRAN 77: mkl_lapack.fi and mkl_lapack.h Description This routine converts a complex single precision matrix SA to a complex double precision matrix A. Note that while it is possible to overflow while converting from double to single, it is not possible to overflow when converting from single to double. This is an auxiliary routine so there is no argument checking. Input Parameters m INTEGER. The number of rows of the matrix A (m ≥ 0). n INTEGER. The number of columns of the matrix A (n ≥ 0). sa COMPLEX array, DIMENSION (ldsa, n). On entry, contains the m-by-n coefficient matrix SA. ldsa INTEGER. The leading dimension of the array sa; ldsa ≥ max(1, m). lda INTEGER. The leading dimension of the array a; lda ≥ max(1, m). Output Parameters a DOUBLE COMPLEX array, DIMENSION (lda, n). On exit, contains the m-by-n coefficient matrix A. info INTEGER. If info = 0, the execution is successful. dlag2s Converts a double precision matrix to a single precision matrix. Syntax call dlag2s( m, n, a, lda, sa, ldsa, info ) Include Files • FORTRAN 77: mkl_lapack.fi and mkl_lapack.h Description This routine converts a double precision matrix A to a single precision matrix SA. RMAX is the overflow threshold for single precision arithmetic. dlag2s checks that all the entries of A are between -RMAX and RMAX. If not, the conversion is aborted and a flag is raised.
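The range check described for dlag2s can be sketched in a few lines of Python. This is only an illustration of the documented behavior, not the MKL implementation: the name lag2s_sketch, the list-of-rows matrix layout, and the choice of the largest finite IEEE single-precision value as RMAX are assumptions made for the example. Where the routine leaves sa unspecified on failure, the sketch simply returns None.

```python
import struct

# Assumption: RMAX is taken to be the largest finite IEEE single-precision value.
RMAX = (2.0 - 2.0**-23) * 2.0**127


def lag2s_sketch(a):
    """Return (sa, info) for a double-precision matrix given as a list of rows."""
    # Range check: any entry outside [-RMAX, RMAX] aborts the conversion (info = 1).
    for row in a:
        for x in row:
            if x < -RMAX or x > RMAX:
                return None, 1
    # Round every entry to single precision by packing it as a 4-byte float.
    sa = [[struct.unpack('f', struct.pack('f', x))[0] for x in row] for row in a]
    return sa, 0
```

For example, lag2s_sketch([[1.0e39]]) reports info = 1, while a matrix whose entries all lie in range comes back rounded to single precision with info = 0.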
This is an auxiliary routine so there is no argument checking. Input Parameters m INTEGER. The number of rows of the matrix A (m ≥ 0). n INTEGER. The number of columns of the matrix A (n ≥ 0). a DOUBLE PRECISION array, DIMENSION (lda, n). On entry, contains the m-by-n coefficient matrix A. lda INTEGER. The leading dimension of the array a; lda ≥ max(1, m). ldsa INTEGER. The leading dimension of the array sa; ldsa ≥ max(1, m). Output Parameters sa REAL array, DIMENSION (ldsa, n). On exit, if info = 0, contains the m-by-n coefficient matrix SA; if info > 0, the content of sa is unspecified. info INTEGER. If info = 0, the execution is successful. If info = 1, an entry of the matrix A exceeds the single precision overflow threshold in magnitude; in this case, the content of sa on exit is unspecified. slag2d Converts a single precision matrix to a double precision matrix. Syntax call slag2d( m, n, sa, ldsa, a, lda, info ) Include Files • FORTRAN 77: mkl_lapack.fi and mkl_lapack.h Description The routine converts a single precision matrix SA to a double precision matrix A. Note that while it is possible to overflow while converting from double to single, it is not possible to overflow when converting from single to double. This is an auxiliary routine so there is no argument checking. Input Parameters m INTEGER. The number of rows of the matrix A (m ≥ 0). n INTEGER. The number of columns of the matrix A (n ≥ 0). sa REAL array, DIMENSION (ldsa, n). On entry, contains the m-by-n coefficient matrix SA. ldsa INTEGER. The leading dimension of the array sa; ldsa ≥ max(1, m). lda INTEGER. The leading dimension of the array a; lda ≥ max(1, m). Output Parameters a DOUBLE PRECISION array, DIMENSION (lda, n). On exit, contains the m-by-n coefficient matrix A. info INTEGER. If info = 0, the execution is successful. zlag2c Converts a complex double precision matrix to a complex single precision matrix.
Syntax call zlag2c( m, n, a, lda, sa, ldsa, info ) Include Files • FORTRAN 77: mkl_lapack.fi and mkl_lapack.h Description The routine converts a double precision complex matrix A to a single precision complex matrix SA. RMAX is the overflow threshold for single precision arithmetic. zlag2c checks that the real and imaginary parts of all the entries of A are between -RMAX and RMAX. If not, the conversion is aborted and a flag is raised. This is an auxiliary routine so there is no argument checking. Input Parameters m INTEGER. The number of rows of the matrix A (m ≥ 0). n INTEGER. The number of columns of the matrix A (n ≥ 0). a DOUBLE COMPLEX array, DIMENSION (lda, n). On entry, contains the m-by-n coefficient matrix A. lda INTEGER. The leading dimension of the array a; lda ≥ max(1, m). ldsa INTEGER. The leading dimension of the array sa; ldsa ≥ max(1, m). Output Parameters sa COMPLEX array, DIMENSION (ldsa, n). On exit, if info = 0, contains the m-by-n coefficient matrix SA; if info > 0, the content of sa is unspecified. info INTEGER. If info = 0, the execution is successful. If info = 1, an entry of the matrix A exceeds the single precision overflow threshold in magnitude; in this case, the content of sa on exit is unspecified. ?larfp Generates a real or complex elementary reflector. Syntax Fortran 77: call slarfp(n, alpha, x, incx, tau) call dlarfp(n, alpha, x, incx, tau) call clarfp(n, alpha, x, incx, tau) call zlarfp(n, alpha, x, incx, tau) Include Files • FORTRAN 77: mkl_lapack.fi and mkl_lapack.h Description The ?larfp routines generate a real or complex elementary reflector H of order n, such that $H \begin{pmatrix} alpha \\ x \end{pmatrix} = \begin{pmatrix} beta \\ 0 \end{pmatrix}$ and H'*H = I for real flavors, conjg(H)'*H = I for complex flavors. Here alpha and beta are scalars, beta is real and non-negative, and x is an (n-1)-element vector. H is represented in the form $H = I - tau \begin{pmatrix} 1 \\ v \end{pmatrix} \begin{pmatrix} 1 & v' \end{pmatrix}$, where tau is a scalar and v is an (n-1)-element vector.
For real flavors, if the elements of x are all zero, then tau = 0 and H is taken to be the unit matrix. Otherwise 1 ≤ tau ≤ 2. For complex flavors, if the elements of x are all zero and alpha is real, then tau = 0 and H is taken to be the unit matrix. Otherwise 1 ≤ real(tau) ≤ 2, and abs(tau-1) ≤ 1. Input Parameters n INTEGER. Specifies the order of the elementary reflector. alpha REAL for slarfp DOUBLE PRECISION for dlarfp COMPLEX for clarfp DOUBLE COMPLEX for zlarfp Specifies the scalar alpha. x REAL for slarfp DOUBLE PRECISION for dlarfp COMPLEX for clarfp DOUBLE COMPLEX for zlarfp Array, DIMENSION at least (1 + (n - 1)*abs(incx)). It contains the vector x. incx INTEGER. Specifies the increment for the elements of x. The value of incx must not be zero. Output Parameters alpha Overwritten by the value beta. x Overwritten by the vector v. tau REAL for slarfp DOUBLE PRECISION for dlarfp COMPLEX for clarfp DOUBLE COMPLEX for zlarfp Contains the scalar tau. ila?lc Scans a matrix for its last non-zero column. Syntax Fortran 77: value = ilaslc(m, n, a, lda) value = iladlc(m, n, a, lda) value = ilaclc(m, n, a, lda) value = ilazlc(m, n, a, lda) Include Files • FORTRAN 77: mkl_lapack.fi and mkl_lapack.h Description The ila?lc routines scan a matrix A for its last non-zero column. Input Parameters m INTEGER. Specifies the number of rows in the matrix A. n INTEGER. Specifies the number of columns in the matrix A. a REAL for ilaslc DOUBLE PRECISION for iladlc COMPLEX for ilaclc DOUBLE COMPLEX for ilazlc Array, DIMENSION (lda, *). The second dimension of a must be at least max(1, n). Before entry the leading m-by-n part of the array a must contain the matrix A. lda INTEGER. Specifies the leading dimension of a as declared in the calling (sub)program. The value of lda must be at least max(1, m). Output Parameters value INTEGER Number of the last non-zero column.
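The scan performed by ila?lc can be illustrated with a small pure-Python sketch. It is not the library code: the function name last_nonzero_col and the flat column-major list standing in for the Fortran array are assumptions made here, and the return value 0 for an all-zero matrix is also assumed, since the description above does not state it.

```python
def last_nonzero_col(m, n, a, lda):
    """Return the 1-based index of the last non-zero column of an m-by-n matrix.

    `a` is a flat list in column-major order: A(i, j) is a[(i-1) + (j-1)*lda]
    for 1-based i and j, mirroring a Fortran array with leading dimension lda.
    """
    # Walk the columns from last to first and stop at the first one that
    # holds a non-zero entry within its leading m rows.
    for j in range(n, 0, -1):
        start = (j - 1) * lda
        if any(x != 0 for x in a[start:start + m]):
            return j
    # Assumption: 0 signals that every column is zero.
    return 0
```

Only the leading m rows of each column are inspected, so padding rows that exist because lda > m do not affect the result.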
ila?lr Scans a matrix for its last non-zero row. Syntax Fortran 77: value = ilaslr(m, n, a, lda) value = iladlr(m, n, a, lda) value = ilaclr(m, n, a, lda) value = ilazlr(m, n, a, lda) Include Files • FORTRAN 77: mkl_lapack.fi and mkl_lapack.h Description The ila?lr routines scan a matrix A for its last non-zero row. Input Parameters m INTEGER. Specifies the number of rows in the matrix A. n INTEGER. Specifies the number of columns in the matrix A. a REAL for ilaslr DOUBLE PRECISION for iladlr COMPLEX for ilaclr DOUBLE COMPLEX for ilazlr Array, DIMENSION (lda, *). The second dimension of a must be at least max(1, n). Before entry the leading m-by-n part of the array a must contain the matrix A. lda INTEGER. Specifies the leading dimension of a as declared in the calling (sub)program. The value of lda must be at least max(1, m). Output Parameters value INTEGER Number of the last non-zero row. ?gsvj0 Pre-processor for the routine ?gesvj. Syntax Fortran 77: call sgsvj0(jobv, m, n, a, lda, d, sva, mv, v, ldv, eps, sfmin, tol, nsweep, work, lwork, info) call dgsvj0(jobv, m, n, a, lda, d, sva, mv, v, ldv, eps, sfmin, tol, nsweep, work, lwork, info) Include Files • FORTRAN 77: mkl_lapack.fi and mkl_lapack.h Description This routine is called from ?gesvj as a pre-processor, and that is its main purpose. It applies Jacobi rotations in the same way as ?gesvj does, but it does not check convergence (stopping criterion). The routine ?gsvj0 enables ?gesvj to use a simplified version of itself to work on a submatrix of the original matrix. Input Parameters jobv CHARACTER*1. Must be 'V', 'A', or 'N'. Specifies whether the output from this routine is used to compute the matrix V. If jobv = 'V', the product of the Jacobi rotations is accumulated by postmultiplying the n-by-n array v. If jobv = 'A', the product of the Jacobi rotations is accumulated by postmultiplying the mv-by-n array v.
If jobv = 'N', the Jacobi rotations are not accumulated. m INTEGER. The number of rows of the input matrix A (m ≥ 0). n INTEGER. The number of columns of the input matrix A (m ≥ n ≥ 0). a REAL for sgsvj0 DOUBLE PRECISION for dgsvj0. Array, DIMENSION (lda, *). Contains the m-by-n matrix A, such that A*diag(D) represents the input matrix. The second dimension of a must be at least max(1, n). lda INTEGER. The leading dimension of a; at least max(1, m). d REAL for sgsvj0 DOUBLE PRECISION for dgsvj0. Array, DIMENSION (n). Contains the diagonal matrix D that accumulates the scaling factors from the fast scaled Jacobi rotations. On entry A*diag(D) represents the input matrix. sva REAL for sgsvj0 DOUBLE PRECISION for dgsvj0. Array, DIMENSION (n). Contains the Euclidean norms of the columns of the matrix A*diag(D). mv INTEGER. If jobv = 'A', then mv rows of v are post-multiplied by a sequence of Jacobi rotations. If jobv = 'N', then mv is not referenced. v REAL for sgsvj0 DOUBLE PRECISION for dgsvj0. Array, DIMENSION (ldv, *). The second dimension of v must be at least max(1, n). If jobv = 'V', then n rows of v are post-multiplied by a sequence of Jacobi rotations. If jobv = 'A', then mv rows of v are post-multiplied by a sequence of Jacobi rotations. If jobv = 'N', then v is not referenced. ldv INTEGER. The leading dimension of the array v; ldv ≥ 1; ldv ≥ n if jobv = 'V'; ldv ≥ mv if jobv = 'A'. eps REAL for sgsvj0 DOUBLE PRECISION for dgsvj0. The relative machine precision (epsilon) returned by the routine ?lamch. sfmin REAL for sgsvj0 DOUBLE PRECISION for dgsvj0. Value of safe minimum returned by the routine ?lamch. tol REAL for sgsvj0 DOUBLE PRECISION for dgsvj0. The threshold for Jacobi rotations. For a pair A(:,p), A(:,q) of pivot columns, the Jacobi rotation is applied only if abs(cos(angle(A(:,p),A(:,q)))) > tol. nsweep INTEGER.
The number of sweeps of Jacobi rotations to be performed. work REAL for sgsvj0 DOUBLE PRECISION for dgsvj0. Workspace array, DIMENSION (lwork). lwork INTEGER. The size of the array work; at least max(1, m). Output Parameters a, d On exit, A*diag(D) represents the input matrix post-multiplied by a sequence of Jacobi rotations, where the rotation threshold and the total number of sweeps are given in tol and nsweep, respectively. sva On exit, contains the Euclidean norms of the columns of the output matrix A*diag(D). v If jobv = 'V', then n rows of v are post-multiplied by a sequence of Jacobi rotations. If jobv = 'A', then mv rows of v are post-multiplied by a sequence of Jacobi rotations. If jobv = 'N', then v is not referenced. info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value. ?gsvj1 Pre-processor for the routine ?gesvj, applies Jacobi rotations targeting only particular pivots. Syntax Fortran 77: call sgsvj1(jobv, m, n, n1, a, lda, d, sva, mv, v, ldv, eps, sfmin, tol, nsweep, work, lwork, info) call dgsvj1(jobv, m, n, n1, a, lda, d, sva, mv, v, ldv, eps, sfmin, tol, nsweep, work, lwork, info) Include Files • FORTRAN 77: mkl_lapack.fi and mkl_lapack.h Description This routine is called from ?gesvj as a pre-processor, and that is its main purpose. It applies Jacobi rotations in the same way as ?gesvj does, but it targets only particular pivots and it does not check convergence (stopping criterion). The routine ?gsvj1 applies a few sweeps of Jacobi rotations in the column space of the input m-by-n matrix A. The pivot pairs are taken from the (1,2) off-diagonal block in the corresponding n-by-n Gram matrix A'*A.
The block-entries (tiles) of the (1,2) off-diagonal block are marked by the [x]'s in the following scheme:
|  *   *   *  [x] [x] [x] |
|  *   *   *  [x] [x] [x] |
|  *   *   *  [x] [x] [x] |
| [x] [x] [x]  *   *   *  |
| [x] [x] [x]  *   *   *  |
| [x] [x] [x]  *   *   *  |
Row-cycling is used in the nblr-by-nblc [x] blocks, and row-cyclic pivoting is used inside each [x] block. In terms of the columns of the matrix A, the first n1 columns are rotated 'against' the remaining n-n1 columns, trying to increase the angle between the corresponding subspaces. The off-diagonal block is n1-by-(n-n1) and it is tiled using square tiles. The number of sweeps is specified by nsweep, and the orthogonality threshold is set by tol. Input Parameters jobv CHARACTER*1. Must be 'V', 'A', or 'N'. Specifies whether the output from this routine is used to compute the matrix V. If jobv = 'V', the product of the Jacobi rotations is accumulated by postmultiplying the n-by-n array v. If jobv = 'A', the product of the Jacobi rotations is accumulated by postmultiplying the mv-by-n array v. If jobv = 'N', the Jacobi rotations are not accumulated. m INTEGER. The number of rows of the input matrix A (m ≥ 0). n INTEGER. The number of columns of the input matrix A (m ≥ n ≥ 0). n1 INTEGER. Specifies the 2-by-2 block partition. The first n1 columns are rotated 'against' the remaining n-n1 columns of the matrix A. a REAL for sgsvj1 DOUBLE PRECISION for dgsvj1. Array, DIMENSION (lda, *). Contains the m-by-n matrix A, such that A*diag(D) represents the input matrix. The second dimension of a must be at least max(1, n). lda INTEGER. The leading dimension of a; at least max(1, m). d REAL for sgsvj1 DOUBLE PRECISION for dgsvj1. Array, DIMENSION (n). Contains the diagonal matrix D that accumulates the scaling factors from the fast scaled Jacobi rotations. On entry A*diag(D) represents the input matrix. sva REAL for sgsvj1 DOUBLE PRECISION for dgsvj1. Array, DIMENSION (n).
Contains the Euclidean norms of the columns of the matrix A*diag(D). mv INTEGER. If jobv = 'A', then mv rows of v are post-multiplied by a sequence of Jacobi rotations. If jobv = 'N', then mv is not referenced. v REAL for sgsvj1 DOUBLE PRECISION for dgsvj1. Array, DIMENSION (ldv, *). The second dimension of v must be at least max(1, n). If jobv = 'V', then n rows of v are post-multiplied by a sequence of Jacobi rotations. If jobv = 'A', then mv rows of v are post-multiplied by a sequence of Jacobi rotations. If jobv = 'N', then v is not referenced. ldv INTEGER. The leading dimension of the array v; ldv ≥ 1; ldv ≥ n if jobv = 'V'; ldv ≥ mv if jobv = 'A'. eps REAL for sgsvj1 DOUBLE PRECISION for dgsvj1. The relative machine precision (epsilon) returned by the routine ?lamch. sfmin REAL for sgsvj1 DOUBLE PRECISION for dgsvj1. Value of safe minimum returned by the routine ?lamch. tol REAL for sgsvj1 DOUBLE PRECISION for dgsvj1. The threshold for Jacobi rotations. For a pair A(:,p), A(:,q) of pivot columns, the Jacobi rotation is applied only if abs(cos(angle(A(:,p),A(:,q)))) > tol. nsweep INTEGER. The number of sweeps of Jacobi rotations to be performed. work REAL for sgsvj1 DOUBLE PRECISION for dgsvj1. Workspace array, DIMENSION (lwork). lwork INTEGER. The size of the array work; at least max(1, m). Output Parameters a, d On exit, A*diag(D) represents the input matrix post-multiplied by a sequence of Jacobi rotations, where the rotation threshold and the total number of sweeps are given in tol and nsweep, respectively. sva On exit, contains the Euclidean norms of the columns of the output matrix A*diag(D). v If jobv = 'V', then n rows of v are post-multiplied by a sequence of Jacobi rotations.
If jobv = 'A', then mv rows of v are post-multiplied by a sequence of Jacobi rotations. If jobv = 'N', then v is not referenced. info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value. ?sfrk Performs a symmetric rank-k operation for a matrix in RFP format. Syntax Fortran 77: call ssfrk(transr, uplo, trans, n, k, alpha, a, lda, beta, c) call dsfrk(transr, uplo, trans, n, k, alpha, a, lda, beta, c) C: lapack_int LAPACKE_ssfrk( int matrix_order, char transr, char uplo, char trans, lapack_int n, lapack_int k, float alpha, const float* a, lapack_int lda, float beta, float* c ); lapack_int LAPACKE_dsfrk( int matrix_order, char transr, char uplo, char trans, lapack_int n, lapack_int k, double alpha, const double* a, lapack_int lda, double beta, double* c ); Include Files • FORTRAN 77: mkl_lapack.fi and mkl_lapack.h • C: mkl_lapacke.h Description The ?sfrk routines perform a matrix-matrix operation using symmetric matrices. The operation is defined as C := alpha*A*A^T + beta*C, or C := alpha*A^T*A + beta*C, where: alpha and beta are scalars, C is an n-by-n symmetric matrix in rectangular full packed (RFP) format, A is an n-by-k matrix in the first case and a k-by-n matrix in the second case. Input Parameters The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions. transr CHARACTER*1. If transr = 'N' or 'n', the normal form of RFP C is stored; if transr = 'T' or 't', the transpose form of RFP C is stored. uplo CHARACTER*1. Specifies whether the upper or lower triangular part of the array c is used. If uplo = 'U' or 'u', then the upper triangular part of the array c is used. If uplo = 'L' or 'l', then the lower triangular part of the array c is used. trans CHARACTER*1. Specifies the operation: if trans = 'N' or 'n', then C := alpha*A*A^T + beta*C; if trans = 'T' or 't', then C := alpha*A^T*A + beta*C. n INTEGER.
Specifies the order of the matrix C. The value of n must be at least zero. k INTEGER. On entry with trans = 'N' or 'n', k specifies the number of columns of the matrix A, and on entry with trans = 'T' or 't', k specifies the number of rows of the matrix A. The value of k must be at least zero. alpha REAL for ssfrk DOUBLE PRECISION for dsfrk Specifies the scalar alpha. a REAL for ssfrk DOUBLE PRECISION for dsfrk Array, DIMENSION (lda,ka), where ka is k when trans = 'N' or 'n', and is n otherwise. Before entry with trans = 'N' or 'n', the leading n-by-k part of the array a must contain the matrix A, otherwise the leading k-by-n part of the array a must contain the matrix A. lda INTEGER. Specifies the leading dimension of a as declared in the calling (sub)program. When trans = 'N' or 'n', lda must be at least max(1,n), otherwise lda must be at least max(1, k). beta REAL for ssfrk DOUBLE PRECISION for dsfrk Specifies the scalar beta. c REAL for ssfrk DOUBLE PRECISION for dsfrk Array, DIMENSION (n*(n+1)/2). Before entry contains the symmetric matrix C in RFP format. Output Parameters c If trans = 'N' or 'n', then c contains C := alpha*A*A^T + beta*C; if trans = 'T' or 't', then c contains C := alpha*A^T*A + beta*C. ?hfrk Performs a Hermitian rank-k operation for a matrix in RFP format.
Syntax Fortran 77: call chfrk(transr, uplo, trans, n, k, alpha, a, lda, beta, c) call zhfrk(transr, uplo, trans, n, k, alpha, a, lda, beta, c) C: lapack_int LAPACKE_chfrk( int matrix_order, char transr, char uplo, char trans, lapack_int n, lapack_int k, float alpha, const lapack_complex_float* a, lapack_int lda, float beta, lapack_complex_float* c ); lapack_int LAPACKE_zhfrk( int matrix_order, char transr, char uplo, char trans, lapack_int n, lapack_int k, double alpha, const lapack_complex_double* a, lapack_int lda, double beta, lapack_complex_double* c ); Include Files • FORTRAN 77: mkl_lapack.fi and mkl_lapack.h • C: mkl_lapacke.h Description The ?hfrk routines perform a matrix-matrix operation using Hermitian matrices. The operation is defined as C := alpha*A*A^H + beta*C, or C := alpha*A^H*A + beta*C, where: alpha and beta are real scalars, C is an n-by-n Hermitian matrix in RFP format, A is an n-by-k matrix in the first case and a k-by-n matrix in the second case. Input Parameters The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions. transr CHARACTER*1. If transr = 'N' or 'n', the normal form of RFP C is stored; if transr = 'C' or 'c', the conjugate-transpose form of RFP C is stored. uplo CHARACTER*1. Specifies whether the upper or lower triangular part of the array c is used. If uplo = 'U' or 'u', then the upper triangular part of the array c is used. If uplo = 'L' or 'l', then the lower triangular part of the array c is used. trans CHARACTER*1. Specifies the operation: if trans = 'N' or 'n', then C := alpha*A*A^H + beta*C; if trans = 'C' or 'c', then C := alpha*A^H*A + beta*C. n INTEGER. Specifies the order of the matrix C. The value of n must be at least zero. k INTEGER.
On entry with trans = 'N' or 'n', k specifies the number of columns of the matrix A, and on entry with trans = 'C' or 'c', k specifies the number of rows of the matrix A. The value of k must be at least zero. alpha REAL for chfrk DOUBLE PRECISION for zhfrk Specifies the scalar alpha. a COMPLEX for chfrk DOUBLE COMPLEX for zhfrk Array, DIMENSION (lda,ka), where ka is k when trans = 'N' or 'n', and is n otherwise. Before entry with trans = 'N' or 'n', the leading n-by-k part of the array a must contain the matrix A, otherwise the leading k-by-n part of the array a must contain the matrix A. lda INTEGER. Specifies the leading dimension of a as declared in the calling (sub)program. When trans = 'N' or 'n', lda must be at least max(1,n), otherwise lda must be at least max(1, k). beta REAL for chfrk DOUBLE PRECISION for zhfrk Specifies the scalar beta. c COMPLEX for chfrk DOUBLE COMPLEX for zhfrk Array, DIMENSION (n*(n+1)/2). Before entry contains the Hermitian matrix C in RFP format. Output Parameters c If trans = 'N' or 'n', then c contains C := alpha*A*A^H + beta*C; if trans = 'C' or 'c', then c contains C := alpha*A^H*A + beta*C. ?tfsm Solves a matrix equation (one operand is a triangular matrix in RFP format).
Syntax Fortran 77: call stfsm(transr, side, uplo, trans, diag, m, n, alpha, a, b, ldb) call dtfsm(transr, side, uplo, trans, diag, m, n, alpha, a, b, ldb) call ctfsm(transr, side, uplo, trans, diag, m, n, alpha, a, b, ldb) call ztfsm(transr, side, uplo, trans, diag, m, n, alpha, a, b, ldb) C: lapack_int LAPACKE_stfsm( int matrix_order, char transr, char side, char uplo, char trans, char diag, lapack_int m, lapack_int n, float alpha, const float* a, float* b, lapack_int ldb ); lapack_int LAPACKE_dtfsm( int matrix_order, char transr, char side, char uplo, char trans, char diag, lapack_int m, lapack_int n, double alpha, const double* a, double* b, lapack_int ldb ); lapack_int LAPACKE_ctfsm( int matrix_order, char transr, char side, char uplo, char trans, char diag, lapack_int m, lapack_int n, lapack_complex_float alpha, const lapack_complex_float* a, lapack_complex_float* b, lapack_int ldb ); lapack_int LAPACKE_ztfsm( int matrix_order, char transr, char side, char uplo, char trans, char diag, lapack_int m, lapack_int n, lapack_complex_double alpha, const lapack_complex_double* a, lapack_complex_double* b, lapack_int ldb ); Include Files • FORTRAN 77: mkl_lapack.fi and mkl_lapack.h • C: mkl_lapacke.h Description The ?tfsm routines solve one of the following matrix equations: op(A)*X = alpha*B, or X*op(A) = alpha*B, where: alpha is a scalar, X and B are m-by-n matrices, A is a unit, or non-unit, upper or lower triangular matrix in rectangular full packed (RFP) format. op(A) can be one of the following: • op(A) = A or op(A) = A^T for real flavors • op(A) = A or op(A) = A^H for complex flavors The matrix B is overwritten by the solution matrix X. Input Parameters The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions. transr CHARACTER*1. If transr = 'N' or 'n', the normal form of RFP A is stored; if transr = 'T' or 't', the transpose form of RFP A is stored; if transr = 'C' or 'c', the conjugate-transpose form of RFP A is stored. side CHARACTER*1. Specifies whether op(A) appears on the left or right of X in the equation: if side = 'L' or 'l', then op(A)*X = alpha*B; if side = 'R' or 'r', then X*op(A) = alpha*B. uplo CHARACTER*1. Specifies whether the RFP matrix A is upper or lower triangular: if uplo = 'U' or 'u', then the matrix is upper triangular; if uplo = 'L' or 'l', then the matrix is lower triangular. trans CHARACTER*1.
Specifies the form of op(A) used in the matrix equation: if trans = 'N' or 'n', then op(A) = A; if trans = 'T' or 't', then op(A) = A'; if trans = 'C' or 'c', then op(A) = conjg(A'). diag CHARACTER*1. Specifies whether the RFP matrix A is unit triangular: if diag = 'U' or 'u', then the matrix is unit triangular; if diag = 'N' or 'n', then the matrix is not unit triangular. m INTEGER. Specifies the number of rows of B. The value of m must be at least zero. n INTEGER. Specifies the number of columns of B. The value of n must be at least zero. alpha REAL for stfsm DOUBLE PRECISION for dtfsm COMPLEX for ctfsm DOUBLE COMPLEX for ztfsm Specifies the scalar alpha. When alpha is zero, a is not referenced and b need not be set before entry. a REAL for stfsm DOUBLE PRECISION for dtfsm COMPLEX for ctfsm DOUBLE COMPLEX for ztfsm Array, DIMENSION (n*(n+1)/2). Contains the matrix A in RFP format. b REAL for stfsm DOUBLE PRECISION for dtfsm COMPLEX for ctfsm DOUBLE COMPLEX for ztfsm Array, DIMENSION (ldb,n). Before entry, the leading m-by-n part of the array b must contain the right-hand side matrix B. ldb INTEGER. Specifies the leading dimension of b as declared in the calling (sub)program. The value of ldb must be at least max(1, m). Output Parameters b Overwritten by the solution matrix X. ?lansf Returns the value of the 1-norm, or the Frobenius norm, or the infinity norm, or the element of largest absolute value of a symmetric matrix in RFP format. Syntax val = slansf(norm, transr, uplo, n, a, work) val = dlansf(norm, transr, uplo, n, a, work) Include Files • FORTRAN 77: mkl_lapack.fi and mkl_lapack.h Description The function ?lansf returns the value of the 1-norm, or the Frobenius norm, or the infinity norm, or the element of largest absolute value of an n-by-n real symmetric matrix A in the rectangular full packed (RFP) format. Input Parameters norm CHARACTER*1.
Specifies the value to be returned by the routine: = 'M' or 'm': val = max(abs(Aij)), largest absolute value of the matrix A. = '1' or 'O' or 'o': val = norm1(A), 1-norm of the matrix A (maximum column sum), = 'I' or 'i': val = normI(A), infinity norm of the matrix A (maximum row sum), = 'F', 'f', 'E' or 'e': val = normF(A), Frobenius norm of the matrix A (square root of sum of squares). transr CHARACTER*1. Specifies whether the RFP format of matrix A is normal or transposed format. If transr = 'N': RFP format is normal; if transr = 'T': RFP format is transposed. uplo CHARACTER*1. Specifies whether the RFP matrix A came from an upper or lower triangular matrix. If uplo = 'U': RFP matrix A came from an upper triangular matrix; if uplo = 'L': RFP matrix A came from a lower triangular matrix. n INTEGER. The order of the matrix A. n ≥ 0. When n = 0, ?lansf is set to zero. a REAL for slansf DOUBLE PRECISION for dlansf Array, DIMENSION (n*(n+1)/2). The upper (if uplo = 'U') or lower (if uplo = 'L') part of the symmetric matrix A stored in RFP format. work REAL for slansf. DOUBLE PRECISION for dlansf. Workspace array, DIMENSION (max(1,lwork)), where lwork ≥ n when norm = 'I' or '1' or 'O'; otherwise, work is not referenced. Output Parameters val REAL for slansf DOUBLE PRECISION for dlansf Value returned by the function. ?lanhf Returns the value of the 1-norm, or the Frobenius norm, or the infinity norm, or the element of largest absolute value of a Hermitian matrix in RFP format. Syntax val = clanhf(norm, transr, uplo, n, a, work) val = zlanhf(norm, transr, uplo, n, a, work) Include Files • FORTRAN 77: mkl_lapack.fi and mkl_lapack.h Description The function ?lanhf returns the value of the 1-norm, or the Frobenius norm, or the infinity norm, or the element of largest absolute value of an n-by-n complex Hermitian matrix A in the rectangular full packed (RFP) format. Input Parameters norm CHARACTER*1.
Specifies the value to be returned by the routine: = 'M' or 'm': val = max(abs(Aij)), largest absolute value of the matrix A. = '1' or 'O' or 'o': val = norm1(A), 1-norm of the matrix A (maximum column sum), = 'I' or 'i': val = normI(A), infinity norm of the matrix A (maximum row sum), = 'F', 'f', 'E' or 'e': val = normF(A), Frobenius norm of the matrix A (square root of sum of squares). transr CHARACTER*1. Specifies whether the RFP format of matrix A is normal or conjugate-transposed format. If transr = 'N': RFP format is normal; if transr = 'C': RFP format is conjugate-transposed. uplo CHARACTER*1. Specifies whether the RFP matrix A came from an upper or lower triangular matrix. If uplo = 'U': RFP matrix A came from an upper triangular matrix; if uplo = 'L': RFP matrix A came from a lower triangular matrix. n INTEGER. The order of the matrix A. n ≥ 0. When n = 0, ?lanhf is set to zero. a COMPLEX for clanhf DOUBLE COMPLEX for zlanhf Array, DIMENSION (n*(n+1)/2). The upper (if uplo = 'U') or lower (if uplo = 'L') part of the Hermitian matrix A stored in RFP format. work REAL for clanhf. DOUBLE PRECISION for zlanhf. Workspace array, DIMENSION (max(1,lwork)), where lwork ≥ n when norm = 'I' or '1' or 'O'; otherwise, work is not referenced. Output Parameters val REAL for clanhf DOUBLE PRECISION for zlanhf Value returned by the function. ?tfttp Copies a triangular matrix from the rectangular full packed format (TF) to the standard packed format (TP).
Syntax
Fortran 77:
call stfttp( transr, uplo, n, arf, ap, info )
call dtfttp( transr, uplo, n, arf, ap, info )
call ctfttp( transr, uplo, n, arf, ap, info )
call ztfttp( transr, uplo, n, arf, ap, info )
C:
lapack_int LAPACKE_<?>tfttp( int matrix_order, char transr, char uplo, lapack_int n, const <datatype>* arf, <datatype>* ap );

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h
• C: mkl_lapacke.h

Description
The routine copies a triangular matrix A from the Rectangular Full Packed (RFP) format to the standard packed format. For the description of the RFP format, see Matrix Storage Schemes.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
transr CHARACTER*1.
= 'N': arf is in the Normal format,
= 'T': arf is in the Transpose format (for stfttp and dtfttp),
= 'C': arf is in the Conjugate-transpose format (for ctfttp and ztfttp).
uplo CHARACTER*1. Specifies whether A is upper or lower triangular:
= 'U': A is upper triangular,
= 'L': A is lower triangular.
n INTEGER. The order of the matrix A. n ≥ 0.
arf REAL for stfttp,
DOUBLE PRECISION for dtfttp,
COMPLEX for ctfttp,
DOUBLE COMPLEX for ztfttp.
Array, DIMENSION at least max (1, n*(n+1)/2). On entry, the upper or lower triangular matrix A stored in the RFP format.

Output Parameters
ap REAL for stfttp,
DOUBLE PRECISION for dtfttp,
COMPLEX for ctfttp,
DOUBLE COMPLEX for ztfttp.
Array, DIMENSION at least max (1, n*(n+1)/2). On exit, the upper or lower triangular matrix A, packed columnwise in a linear array. The j-th column of A is stored in the array ap as follows:
if uplo = 'U', ap(i + (j-1)*j/2) = A(i,j) for 1 ≤ i ≤ j,
if uplo = 'L', ap(i + (j-1)*(2n-j)/2) = A(i,j) for j ≤ i ≤ n.
info INTEGER.
= 0: successful exit,
< 0: if info = -i, the i-th parameter had an illegal value.

?tfttr
Copies a triangular matrix from the rectangular full packed format (TF) to the standard full format (TR).

Syntax
Fortran 77:
call stfttr( transr, uplo, n, arf, a, lda, info )
call dtfttr( transr, uplo, n, arf, a, lda, info )
call ctfttr( transr, uplo, n, arf, a, lda, info )
call ztfttr( transr, uplo, n, arf, a, lda, info )
C:
lapack_int LAPACKE_<?>tfttr( int matrix_order, char transr, char uplo, lapack_int n, const <datatype>* arf, <datatype>* a, lapack_int lda );

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h
• C: mkl_lapacke.h

Description
The routine copies a triangular matrix A from the Rectangular Full Packed (RFP) format to the standard full format. For the description of the RFP format, see Matrix Storage Schemes.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
transr CHARACTER*1.
= 'N': arf is in the Normal format,
= 'T': arf is in the Transpose format (for stfttr and dtfttr),
= 'C': arf is in the Conjugate-transpose format (for ctfttr and ztfttr).
uplo CHARACTER*1. Specifies whether A is upper or lower triangular:
= 'U': A is upper triangular,
= 'L': A is lower triangular.
n INTEGER. The order of the matrices arf and a. n ≥ 0.
arf REAL for stfttr,
DOUBLE PRECISION for dtfttr,
COMPLEX for ctfttr,
DOUBLE COMPLEX for ztfttr.
Array, DIMENSION at least max (1, n*(n+1)/2). On entry, the upper or lower triangular matrix A stored in the RFP format.
lda INTEGER. The leading dimension of the array a. lda ≥ max(1,n).

Output Parameters
a REAL for stfttr,
DOUBLE PRECISION for dtfttr,
COMPLEX for ctfttr,
DOUBLE COMPLEX for ztfttr.
Array, DIMENSION (lda, *). On exit, the triangular matrix A.
If uplo = 'U', the leading n-by-n upper triangular part of the array a contains the upper triangular matrix, and the strictly lower triangular part of a is not referenced.
If uplo = 'L', the leading n-by-n lower triangular part of the array a contains the lower triangular matrix, and the strictly upper triangular part of a is not referenced.
info INTEGER.
= 0: successful exit,
< 0: if info = -i, the i-th parameter had an illegal value.

?tpttf
Copies a triangular matrix from the standard packed format (TP) to the rectangular full packed format (TF).

Syntax
Fortran 77:
call stpttf( transr, uplo, n, ap, arf, info )
call dtpttf( transr, uplo, n, ap, arf, info )
call ctpttf( transr, uplo, n, ap, arf, info )
call ztpttf( transr, uplo, n, ap, arf, info )
C:
lapack_int LAPACKE_<?>tpttf( int matrix_order, char transr, char uplo, lapack_int n, const <datatype>* ap, <datatype>* arf );

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h
• C: mkl_lapacke.h

Description
The routine copies a triangular matrix A from the standard packed format to the Rectangular Full Packed (RFP) format. For the description of the RFP format, see Matrix Storage Schemes.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
transr CHARACTER*1.
= 'N': arf must be in the Normal format,
= 'T': arf must be in the Transpose format (for stpttf and dtpttf),
= 'C': arf must be in the Conjugate-transpose format (for ctpttf and ztpttf).
uplo CHARACTER*1. Specifies whether A is upper or lower triangular:
= 'U': A is upper triangular,
= 'L': A is lower triangular.
n INTEGER. The order of the matrix A. n ≥ 0.
ap REAL for stpttf,
DOUBLE PRECISION for dtpttf,
COMPLEX for ctpttf,
DOUBLE COMPLEX for ztpttf.
Array, DIMENSION at least max (1, n*(n+1)/2).
On entry, the upper or lower triangular matrix A, packed columnwise in a linear array. The j-th column of A is stored in the array ap as follows:
if uplo = 'U', ap(i + (j-1)*j/2) = A(i,j) for 1 ≤ i ≤ j,
if uplo = 'L', ap(i + (j-1)*(2n-j)/2) = A(i,j) for j ≤ i ≤ n.

Output Parameters
arf REAL for stpttf,
DOUBLE PRECISION for dtpttf,
COMPLEX for ctpttf,
DOUBLE COMPLEX for ztpttf.
Array, DIMENSION at least max (1, n*(n+1)/2). On exit, the upper or lower triangular matrix A stored in the RFP format.
info INTEGER.
= 0: successful exit,
< 0: if info = -i, the i-th parameter had an illegal value.

?tpttr
Copies a triangular matrix from the standard packed format (TP) to the standard full format (TR).

Syntax
Fortran 77:
call stpttr( uplo, n, ap, a, lda, info )
call dtpttr( uplo, n, ap, a, lda, info )
call ctpttr( uplo, n, ap, a, lda, info )
call ztpttr( uplo, n, ap, a, lda, info )
C:
lapack_int LAPACKE_<?>tpttr( int matrix_order, char uplo, lapack_int n, const <datatype>* ap, <datatype>* a, lapack_int lda );

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h
• C: mkl_lapacke.h

Description
The routine copies a triangular matrix A from the standard packed format to the standard full format.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Specifies whether A is upper or lower triangular:
= 'U': A is upper triangular,
= 'L': A is lower triangular.
n INTEGER. The order of the matrices ap and a. n ≥ 0.
ap REAL for stpttr,
DOUBLE PRECISION for dtpttr,
COMPLEX for ctpttr,
DOUBLE COMPLEX for ztpttr.
Array, DIMENSION at least max (1, n*(n+1)/2). On entry, the upper or lower triangular matrix A, packed columnwise in a linear array.
The j-th column of A is stored in the array ap as follows:
if uplo = 'U', ap(i + (j-1)*j/2) = A(i,j) for 1 ≤ i ≤ j,
if uplo = 'L', ap(i + (j-1)*(2n-j)/2) = A(i,j) for j ≤ i ≤ n.
lda INTEGER. The leading dimension of the array a. lda ≥ max(1,n).

Output Parameters
a REAL for stpttr,
DOUBLE PRECISION for dtpttr,
COMPLEX for ctpttr,
DOUBLE COMPLEX for ztpttr.
Array, DIMENSION (lda, *). On exit, the triangular matrix A.
If uplo = 'U', the leading n-by-n upper triangular part of the array a contains the upper triangular part of the matrix A, and the strictly lower triangular part of a is not referenced.
If uplo = 'L', the leading n-by-n lower triangular part of the array a contains the lower triangular part of the matrix A, and the strictly upper triangular part of a is not referenced.
info INTEGER.
= 0: successful exit,
< 0: if info = -i, the i-th parameter had an illegal value.

?trttf
Copies a triangular matrix from the standard full format (TR) to the rectangular full packed format (TF).

Syntax
Fortran 77:
call strttf( transr, uplo, n, a, lda, arf, info )
call dtrttf( transr, uplo, n, a, lda, arf, info )
call ctrttf( transr, uplo, n, a, lda, arf, info )
call ztrttf( transr, uplo, n, a, lda, arf, info )
C:
lapack_int LAPACKE_<?>trttf( int matrix_order, char transr, char uplo, lapack_int n, const <datatype>* a, lapack_int lda, <datatype>* arf );

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h
• C: mkl_lapacke.h

Description
The routine copies a triangular matrix A from the standard full format to the Rectangular Full Packed (RFP) format. For the description of the RFP format, see Matrix Storage Schemes.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
transr CHARACTER*1.
= 'N': arf must be in the Normal format,
= 'T': arf must be in the Transpose format (for strttf and dtrttf),
= 'C': arf must be in the Conjugate-transpose format (for ctrttf and ztrttf).
uplo CHARACTER*1. Specifies whether A is upper or lower triangular:
= 'U': A is upper triangular,
= 'L': A is lower triangular.
n INTEGER. The order of the matrix A. n ≥ 0.
a REAL for strttf,
DOUBLE PRECISION for dtrttf,
COMPLEX for ctrttf,
DOUBLE COMPLEX for ztrttf.
Array, DIMENSION (lda, *). On entry, the triangular matrix A.
If uplo = 'U', the leading n-by-n upper triangular part of the array a contains the upper triangular matrix, and the strictly lower triangular part of a is not referenced.
If uplo = 'L', the leading n-by-n lower triangular part of the array a contains the lower triangular matrix, and the strictly upper triangular part of a is not referenced.
lda INTEGER. The leading dimension of the array a. lda ≥ max(1,n).

Output Parameters
arf REAL for strttf,
DOUBLE PRECISION for dtrttf,
COMPLEX for ctrttf,
DOUBLE COMPLEX for ztrttf.
Array, DIMENSION at least max (1, n*(n+1)/2). On exit, the upper or lower triangular matrix A stored in the RFP format.
info INTEGER.
= 0: successful exit,
< 0: if info = -i, the i-th parameter had an illegal value.

?trttp
Copies a triangular matrix from the standard full format (TR) to the standard packed format (TP).

Syntax
Fortran 77:
call strttp( uplo, n, a, lda, ap, info )
call dtrttp( uplo, n, a, lda, ap, info )
call ctrttp( uplo, n, a, lda, ap, info )
call ztrttp( uplo, n, a, lda, ap, info )
C:
lapack_int LAPACKE_<?>trttp( int matrix_order, char uplo, lapack_int n, const <datatype>* a, lapack_int lda, <datatype>* ap );

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h
• C: mkl_lapacke.h

Description
The routine copies a triangular matrix A from the standard full format to the standard packed format.
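The copy from full to packed storage can be illustrated with a small C sketch. It is not the MKL implementation; packed_index and full_to_packed are hypothetical helper names, the data type is fixed to double, and column-major storage is assumed. The sketch uses the columnwise packed layout of this group of routines, ap(i + (j-1)*j/2) = A(i,j) for uplo = 'U' and ap(i + (j-1)*(2n-j)/2) = A(i,j) for uplo = 'L', with 1-based i and j, and mimics the info convention by returning -k when the k-th argument is rejected.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not an MKL routine): 1-based position in ap of
   A(i,j), using the columnwise packed layout described in this section. */
int packed_index(char uplo, int n, int i, int j)
{
    if (uplo == 'U' || uplo == 'u')
        return i + (j - 1) * j / 2;          /* valid for 1 <= i <= j */
    return i + (j - 1) * (2 * n - j) / 2;    /* valid for j <= i <= n */
}

/* Hypothetical sketch of a full-to-packed copy for a column-major array a
   with leading dimension lda. Returns 0 on success or -k if the k-th
   argument (uplo = 1, n = 2, lda = 4) is illegal. */
int full_to_packed(char uplo, int n, const double *a, int lda, double *ap)
{
    int upper = (uplo == 'U' || uplo == 'u');
    int lower = (uplo == 'L' || uplo == 'l');
    int i, j;

    if (!upper && !lower) return -1;
    if (n < 0) return -2;
    if (lda < (n > 1 ? n : 1)) return -4;
    assert(n == 0 || (a != NULL && ap != NULL));

    for (j = 1; j <= n; ++j) {
        int first = upper ? 1 : j;   /* first row of column j in the triangle */
        int last  = upper ? j : n;   /* last row of column j in the triangle  */
        for (i = first; i <= last; ++i)
            ap[packed_index(uplo, n, i, j) - 1] =
                a[(size_t)(i - 1) + (size_t)(j - 1) * (size_t)lda];
    }
    return 0;
}
```

Only the triangle selected by uplo is read, so the other triangle of a may hold arbitrary data, matching the "not referenced" wording used for these routines.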

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Specifies whether A is upper or lower triangular:
= 'U': A is upper triangular,
= 'L': A is lower triangular.
n INTEGER. The order of the matrices a and ap. n ≥ 0.
a REAL for strttp,
DOUBLE PRECISION for dtrttp,
COMPLEX for ctrttp,
DOUBLE COMPLEX for ztrttp.
Array, DIMENSION (lda, n). On entry, the triangular matrix A.
If uplo = 'U', the leading n-by-n upper triangular part of the array a contains the upper triangular matrix, and the strictly lower triangular part of a is not referenced.
If uplo = 'L', the leading n-by-n lower triangular part of the array a contains the lower triangular matrix, and the strictly upper triangular part of a is not referenced.
lda INTEGER. The leading dimension of the array a. lda ≥ max(1,n).

Output Parameters
ap REAL for strttp,
DOUBLE PRECISION for dtrttp,
COMPLEX for ctrttp,
DOUBLE COMPLEX for ztrttp.
Array, DIMENSION at least max (1, n*(n+1)/2). On exit, the upper or lower triangular matrix A, packed columnwise in a linear array. The j-th column of A is stored in the array ap as follows:
if uplo = 'U', ap(i + (j-1)*j/2) = A(i,j) for 1 ≤ i ≤ j,
if uplo = 'L', ap(i + (j-1)*(2n-j)/2) = A(i,j) for j ≤ i ≤ n.
info INTEGER.
= 0: successful exit,
< 0: if info = -i, the i-th parameter had an illegal value.

?pstf2
Computes the Cholesky factorization with complete pivoting of a real symmetric or complex Hermitian positive semi-definite matrix.
Syntax
call spstf2( uplo, n, a, lda, piv, rank, tol, work, info )
call dpstf2( uplo, n, a, lda, piv, rank, tol, work, info )
call cpstf2( uplo, n, a, lda, piv, rank, tol, work, info )
call zpstf2( uplo, n, a, lda, piv, rank, tol, work, info )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The real flavors spstf2 and dpstf2 compute the Cholesky factorization with complete pivoting of a real symmetric positive semi-definite matrix A. The complex flavors cpstf2 and zpstf2 compute the Cholesky factorization with complete pivoting of a complex Hermitian positive semi-definite matrix A. The factorization has the form:
P^T * A * P = U^T * U, if uplo = 'U' for real flavors,
P^T * A * P = U^H * U, if uplo = 'U' for complex flavors,
P^T * A * P = L * L^T, if uplo = 'L' for real flavors,
P^T * A * P = L * L^H, if uplo = 'L' for complex flavors,
where U is an upper triangular matrix, L is a lower triangular matrix, and the permutation matrix P is stored as the vector piv.
This algorithm does not check that A is positive semi-definite. This version of the algorithm calls level 2 BLAS.

Input Parameters
uplo CHARACTER*1. Specifies whether the upper or lower triangular part of the symmetric or Hermitian matrix A is stored:
= 'U': Upper triangular,
= 'L': Lower triangular.
n INTEGER. The order of the matrix A. n ≥ 0.
a REAL for spstf2,
DOUBLE PRECISION for dpstf2,
COMPLEX for cpstf2,
DOUBLE COMPLEX for zpstf2.
Array, DIMENSION (lda, *). On entry, the symmetric or Hermitian matrix A.
If uplo = 'U', the leading n-by-n upper triangular part of the array a contains the upper triangular part of the matrix A, and the strictly lower triangular part of a is not referenced.
If uplo = 'L', the leading n-by-n lower triangular part of the array a contains the lower triangular part of the matrix A, and the strictly upper triangular part of a is not referenced.
tol REAL for spstf2 and cpstf2,
DOUBLE PRECISION for dpstf2 and zpstf2.
A user-defined tolerance.
If tol < 0, n*ulp*max(A(k,k)) will be used (ulp is the Unit in the Last Place, or Unit of Least Precision). The algorithm terminates at the (k - 1)-st step if the pivot is not greater than tol.
lda INTEGER. The leading dimension of the array a. lda ≥ max(1,n).
work REAL for spstf2 and cpstf2,
DOUBLE PRECISION for dpstf2 and zpstf2.
Workspace array, DIMENSION at least max (1, 2*n).

Output Parameters
piv INTEGER. Array, DIMENSION at least max (1,n). piv is such that the non-zero entries of P are P( piv(k), k ) = 1.
a On exit, if info = 0, the factor U or L from the Cholesky factorization, stored the same way as the matrix A is stored on entry.
rank INTEGER. The rank of A, determined by the number of steps the algorithm completed.
info INTEGER.
< 0: if info = -k, the k-th parameter had an illegal value,
= 0: the algorithm completed successfully,
> 0: the matrix A is either rank-deficient, with the computed rank returned in rank, or indefinite.

dlat2s
Converts a double-precision triangular matrix to a single-precision triangular matrix.

Syntax
call dlat2s( uplo, n, a, lda, sa, ldsa, info )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
This routine converts a double-precision triangular matrix A to a single-precision triangular matrix SA. dlat2s checks that all the elements of A are between -RMAX and RMAX, where RMAX is the overflow threshold for single-precision arithmetic. If this condition is not met, the conversion is aborted and a flag is raised. The routine does no parameter checking.

Input Parameters
uplo CHARACTER*1. Specifies whether the matrix A is upper or lower triangular:
= 'U': A is upper triangular,
= 'L': A is lower triangular.
n INTEGER. The number of rows and columns of the matrix A. n ≥ 0.
a DOUBLE PRECISION. Array, DIMENSION (lda, *). On entry, the n-by-n triangular matrix A.
lda INTEGER. The leading dimension of the array a. lda ≥ max(1,n).
ldsa INTEGER.
The leading dimension of the array sa. ldsa ≥ max(1,n).

Output Parameters
sa REAL. Array, DIMENSION (ldsa, *). Only the part of sa determined by uplo is referenced. On exit,
• if info = 0, the n-by-n triangular matrix SA,
• if info > 0, the content of the part of sa determined by uplo is unspecified.
info INTEGER.
= 0: successful exit,
> 0: an element of the matrix A is greater than the single-precision overflow threshold; in this case, the content of the part of sa determined by uplo is unspecified on exit.

zlat2c
Converts a double complex triangular matrix to a complex triangular matrix.

Syntax
call zlat2c( uplo, n, a, lda, sa, ldsa, info )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
This routine is declared in mkl_lapack.fi for FORTRAN 77 interface and in mkl_lapack.h for C interface.
The routine converts a DOUBLE COMPLEX triangular matrix A to a COMPLEX triangular matrix SA. zlat2c checks that the real and imaginary parts of all the elements of A are between -RMAX and RMAX, where RMAX is the overflow threshold for single-precision arithmetic. If this condition is not met, the conversion is aborted and a flag is raised. The routine does no parameter checking.

Input Parameters
uplo CHARACTER*1. Specifies whether the matrix A is upper or lower triangular:
= 'U': A is upper triangular,
= 'L': A is lower triangular.
n INTEGER. The number of rows and columns in the matrix A. n ≥ 0.
a DOUBLE COMPLEX. Array, DIMENSION (lda, *). On entry, the n-by-n triangular matrix A.
lda INTEGER. The leading dimension of the array a. lda ≥ max(1,n).
ldsa INTEGER. The leading dimension of the array sa. ldsa ≥ max(1,n).

Output Parameters
sa COMPLEX. Array, DIMENSION (ldsa, *). Only the part of sa determined by uplo is referenced. On exit,
• if info = 0, the n-by-n triangular matrix SA,
• if info > 0, the content of the part of sa determined by uplo is unspecified.
info INTEGER.
= 0: successful exit,
> 0: the real or imaginary part of an element of the matrix A is greater than the single-precision overflow threshold; in this case, the content of the part of sa determined by uplo is unspecified on exit.

?lacp2
Copies all or part of a real two-dimensional array to a complex array.

Syntax
call clacp2( uplo, m, n, a, lda, b, ldb )
call zlacp2( uplo, m, n, a, lda, b, ldb )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine copies all or part of a real two-dimensional matrix A to a complex matrix B.

Input Parameters
uplo CHARACTER*1. Specifies the part of the matrix A to be copied to B.
If uplo = 'U', the upper triangular part of A is copied;
if uplo = 'L', the lower triangular part of A is copied.
Otherwise, all of the matrix A is copied.
m INTEGER. The number of rows in the matrix A (m ≥ 0).
n INTEGER. The number of columns in A (n ≥ 0).
a REAL for clacp2
DOUBLE PRECISION for zlacp2
Array a(lda,n), contains the m-by-n matrix A.
If uplo = 'U', only the upper triangle or trapezoid is accessed; if uplo = 'L', only the lower triangle or trapezoid is accessed.
lda INTEGER. The leading dimension of a; lda ≥ max(1, m).
ldb INTEGER. The leading dimension of the output array b; ldb ≥ max(1, m).

Output Parameters
b COMPLEX for clacp2
DOUBLE COMPLEX for zlacp2.
Array b(ldb,n), contains the m-by-n matrix B.
On exit, B = A in the locations specified by uplo.

?la_gbamv
Performs a matrix-vector operation to calculate error bounds.
Syntax
Fortran 77:
call sla_gbamv(trans, m, n, kl, ku, alpha, ab, ldab, x, incx, beta, y, incy)
call dla_gbamv(trans, m, n, kl, ku, alpha, ab, ldab, x, incx, beta, y, incy)
call cla_gbamv(trans, m, n, kl, ku, alpha, ab, ldab, x, incx, beta, y, incy)
call zla_gbamv(trans, m, n, kl, ku, alpha, ab, ldab, x, incx, beta, y, incy)

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The ?la_gbamv function performs one of the matrix-vector operations defined as
y := alpha*abs(A)*abs(x) + beta*abs(y),
or
y := alpha*abs(A)^T*abs(x) + beta*abs(y),
where:
alpha and beta are scalars,
x and y are vectors,
A is an m-by-n matrix, with kl sub-diagonals and ku super-diagonals.
This function is primarily used in calculating error bounds. To protect against underflow during evaluation, the function perturbs components in the resulting vector away from zero by (n + 1) times the underflow threshold. To prevent unnecessarily large errors for block structure embedded in general matrices, the function does not perturb symbolically zero components. A zero entry is considered symbolic if all multiplications involved in computing that entry have at least one zero multiplicand.

Input Parameters
trans INTEGER. Specifies the operation to be performed:
If trans = 'BLAS_NO_TRANS', then y := alpha*abs(A)*abs(x) + beta*abs(y)
If trans = 'BLAS_TRANS', then y := alpha*abs(A^T)*abs(x) + beta*abs(y)
If trans = 'BLAS_CONJ_TRANS', then y := alpha*abs(A^T)*abs(x) + beta*abs(y)
The parameter is unchanged on exit.
m INTEGER. Specifies the number of rows of the matrix A. The value of m must be at least zero. Unchanged on exit.
n INTEGER. Specifies the number of columns of the matrix A. The value of n must be at least zero. Unchanged on exit.
kl INTEGER. Specifies the number of sub-diagonals within the band of A. kl ≥ 0.
ku INTEGER. Specifies the number of super-diagonals within the band of A. ku ≥ 0.
alpha REAL for sla_gbamv and cla_gbamv
DOUBLE PRECISION for dla_gbamv and zla_gbamv
Specifies the scalar alpha. Unchanged on exit.
ab REAL for sla_gbamv
DOUBLE PRECISION for dla_gbamv
COMPLEX for cla_gbamv
DOUBLE COMPLEX for zla_gbamv
Array, DIMENSION (ldab, *).
Before entry, the leading m-by-n part of the array ab must contain the matrix of coefficients. The second dimension of ab must be at least max(1,n). Unchanged on exit.
ldab INTEGER. Specifies the leading dimension of ab as declared in the calling (sub)program. The value of ldab must be at least max(1, m). Unchanged on exit.
x REAL for sla_gbamv
DOUBLE PRECISION for dla_gbamv
COMPLEX for cla_gbamv
DOUBLE COMPLEX for zla_gbamv
Array, DIMENSION at least (1 + (n - 1)*abs(incx)) when trans = 'N' or 'n' and at least (1 + (m - 1)*abs(incx)) otherwise.
Before entry, the incremented array x must contain the vector x.
incx INTEGER. Specifies the increment for the elements of x. incx must not be zero.
beta REAL for sla_gbamv and cla_gbamv
DOUBLE PRECISION for dla_gbamv and zla_gbamv
Specifies the scalar beta. When beta is zero, you do not need to set y on input.
y REAL for sla_gbamv and cla_gbamv
DOUBLE PRECISION for dla_gbamv and zla_gbamv
Array, DIMENSION at least (1 + (m - 1)*abs(incy)) when trans = 'N' or 'n' and at least (1 + (n - 1)*abs(incy)) otherwise.
Before entry with beta non-zero, the incremented array y must contain the vector y.
incy INTEGER. Specifies the increment for the elements of y. The value of incy must not be zero. Unchanged on exit.

Output Parameters
y Updated vector y.

?la_gbrcond
Estimates the Skeel condition number for a general banded matrix.
Syntax
Fortran 77:
call sla_gbrcond( trans, n, kl, ku, ab, ldab, afb, ldafb, ipiv, cmode, c, info, work, iwork )
call dla_gbrcond( trans, n, kl, ku, ab, ldab, afb, ldafb, ipiv, cmode, c, info, work, iwork )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The function estimates the Skeel condition number of
op(A) * op2(C)
where the cmode parameter determines op2 as follows:

cmode Value    op2(C)
1              C
0              I
-1             inv(C)

The Skeel condition number
cond(A) = norminf(|inv(A)||A|)
is obtained by computing scaling factors R such that
diag(R)*A*op2(C)
is row equilibrated and then computing the standard infinity-norm condition number.

Input Parameters
trans CHARACTER*1. Must be 'N' or 'T' or 'C'.
Specifies the form of the system of equations:
If trans = 'N', the system has the form A*X = B.
If trans = 'T', the system has the form A^T*X = B.
If trans = 'C', the system has the form A^H*X = B.
n INTEGER. The number of linear equations, that is, the order of the matrix A; n ≥ 0.
kl INTEGER. The number of subdiagonals within the band of A; kl ≥ 0.
ku INTEGER. The number of superdiagonals within the band of A; ku ≥ 0.
ab, afb, c, work REAL for sla_gbrcond
DOUBLE PRECISION for dla_gbrcond
Arrays:
ab(ldab,*) contains the original band matrix A stored in rows from 1 to kl + ku + 1. The j-th column of A is stored in the j-th column of the array ab as follows:
ab(ku+1+i-j,j) = A(i,j) for max(1,j-ku) ≤ i ≤ min(n,j+kl)
afb(ldafb,*) contains details of the LU factorization of the band matrix A, as returned by ?gbtrf. U is stored as an upper triangular band matrix with kl+ku superdiagonals in rows 1 to kl+ku+1, and the multipliers used during the factorization are stored in rows kl+ku+2 to 2*kl+ku+1.
c, DIMENSION n. The vector C in the formula op(A) * op2(C).
work is a workspace array of DIMENSION (5*n).
The second dimension of ab and afb must be at least max(1, n).
ldab INTEGER. The leading dimension of the array ab. ldab ≥ kl+ku+1.
ldafb INTEGER. The leading dimension of afb. ldafb ≥ 2*kl+ku+1.
ipiv INTEGER.
Array with DIMENSION n. The pivot indices from the factorization A = P*L*U as computed by ?gbtrf. Row i of the matrix was interchanged with row ipiv(i).
cmode INTEGER. Determines op2(C) in the formula op(A) * op2(C) as follows:
If cmode = 1, op2(C) = C.
If cmode = 0, op2(C) = I.
If cmode = -1, op2(C) = inv(C).
iwork INTEGER. Workspace array with DIMENSION n.

Output Parameters
info INTEGER.
If info = 0, the execution is successful.
If i > 0, the i-th parameter is invalid.

See Also
?gbtrf

?la_gbrcond_c
Computes the infinity norm condition number of op(A)*inv(diag(c)) for general banded matrices.

Syntax
Fortran 77:
call cla_gbrcond_c( trans, n, kl, ku, ab, ldab, afb, ldafb, ipiv, c, capply, info, work, rwork )
call zla_gbrcond_c( trans, n, kl, ku, ab, ldab, afb, ldafb, ipiv, c, capply, info, work, rwork )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The function computes the infinity norm condition number of
op(A) * inv(diag(c))
where c is a REAL vector for cla_gbrcond_c and a DOUBLE PRECISION vector for zla_gbrcond_c.

Input Parameters
trans CHARACTER*1. Must be 'N' or 'T' or 'C'.
Specifies the form of the system of equations:
If trans = 'N', the system has the form A*X = B (No transpose)
If trans = 'T', the system has the form A^T*X = B (Transpose)
If trans = 'C', the system has the form A^H*X = B (Conjugate transpose)
n INTEGER. The number of linear equations, that is, the order of the matrix A; n ≥ 0.
kl INTEGER. The number of subdiagonals within the band of A; kl ≥ 0.
ku INTEGER. The number of superdiagonals within the band of A; ku ≥ 0.
ab, afb, work COMPLEX for cla_gbrcond_c
DOUBLE COMPLEX for zla_gbrcond_c
Arrays:
ab(ldab,*) contains the original band matrix A stored in rows from 1 to kl + ku + 1.
The j-th column of A is stored in the j-th column of the array ab as follows:
ab(ku+1+i-j,j) = A(i,j) for max(1,j-ku) ≤ i ≤ min(n,j+kl)
afb(ldafb,*) contains details of the LU factorization of the band matrix A, as returned by ?gbtrf. U is stored as an upper triangular band matrix with kl+ku superdiagonals in rows 1 to kl+ku+1, and the multipliers used during the factorization are stored in rows kl+ku+2 to 2*kl+ku+1.
work is a workspace array of DIMENSION (5*n).
The second dimension of ab and afb must be at least max(1, n).
ldab INTEGER. The leading dimension of the array ab. ldab ≥ kl+ku+1.
ldafb INTEGER. The leading dimension of afb. ldafb ≥ 2*kl+ku+1.
ipiv INTEGER.
Array with DIMENSION n. The pivot indices from the factorization A = P*L*U as computed by ?gbtrf. Row i of the matrix was interchanged with row ipiv(i).
c, rwork REAL for cla_gbrcond_c
DOUBLE PRECISION for zla_gbrcond_c
Array c with DIMENSION n. The vector c in the formula op(A) * inv(diag(c)).
Array rwork with DIMENSION n is a workspace.
capply LOGICAL. If .TRUE., then the function uses the vector c from the formula op(A) * inv(diag(c)).

Output Parameters
info INTEGER.
If info = 0, the execution is successful.
If i > 0, the i-th parameter is invalid.

See Also
?gbtrf

?la_gbrcond_x
Computes the infinity norm condition number of op(A)*diag(x) for general banded matrices.

Syntax
Fortran 77:
call cla_gbrcond_x( trans, n, kl, ku, ab, ldab, afb, ldafb, ipiv, x, info, work, rwork )
call zla_gbrcond_x( trans, n, kl, ku, ab, ldab, afb, ldafb, ipiv, x, info, work, rwork )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The function computes the infinity norm condition number of
op(A) * diag(x)
where x is a COMPLEX vector for cla_gbrcond_x and a DOUBLE COMPLEX vector for zla_gbrcond_x.

Input Parameters
trans CHARACTER*1. Must be 'N' or 'T' or 'C'.
Specifies the form of the system of equations:
If trans = 'N', the system has the form A*X = B (No transpose)
If trans = 'T', the system has the form A^T*X = B (Transpose)
If trans = 'C', the system has the form A^H*X = B (Conjugate transpose)
n INTEGER. The number of linear equations, that is, the order of the matrix A; n ≥ 0.
kl INTEGER. The number of subdiagonals within the band of A; kl ≥ 0.
ku INTEGER. The number of superdiagonals within the band of A; ku ≥ 0.
ab, afb, x, work COMPLEX for cla_gbrcond_x
DOUBLE COMPLEX for zla_gbrcond_x
Arrays:
ab(ldab,*) contains the original band matrix A stored in rows from 1 to kl + ku + 1. The j-th column of A is stored in the j-th column of the array ab as follows:
ab(ku+1+i-j,j) = A(i,j) for max(1,j-ku) ≤ i ≤ min(n,j+kl)
afb(ldafb,*) contains details of the LU factorization of the band matrix A, as returned by ?gbtrf. U is stored as an upper triangular band matrix with kl+ku superdiagonals in rows 1 to kl+ku+1, and the multipliers used during the factorization are stored in rows kl+ku+2 to 2*kl+ku+1.
x, DIMENSION n. The vector x in the formula op(A) * diag(x).
work is a workspace array of DIMENSION (2*n).
The second dimension of ab and afb must be at least max(1, n).
ldab INTEGER. The leading dimension of the array ab. ldab ≥ kl+ku+1.
ldafb INTEGER. The leading dimension of afb. ldafb ≥ 2*kl+ku+1.
ipiv INTEGER.
Array with DIMENSION n. The pivot indices from the factorization A = P*L*U as computed by ?gbtrf. Row i of the matrix was interchanged with row ipiv(i).
rwork REAL for cla_gbrcond_x
DOUBLE PRECISION for zla_gbrcond_x
Array rwork with DIMENSION n is a workspace.

Output Parameters
info INTEGER.
If info = 0, the execution is successful.
If i > 0, the i-th parameter is invalid.
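
The band storage expected in ab can be illustrated with a small C sketch. It is not part of MKL; dense_to_band is a hypothetical helper, the data type is fixed to double for brevity (the routines above take complex data), and column-major storage is assumed. The sketch applies the mapping ab(ku+1+i-j, j) = A(i,j) for max(1,j-ku) ≤ i ≤ min(n,j+kl), with 1-based i and j, to copy the band of a dense matrix into an array with kl+ku+1 rows.

```c
#include <stddef.h>

/* Hypothetical helper (not an MKL routine): copy the band of a dense
   column-major n-by-n matrix a (leading dimension lda) into band storage ab
   (leading dimension ldab >= kl+ku+1). Entries of ab outside the band are
   left untouched. Returns 0 on success, -1 on invalid sizes. */
int dense_to_band(int n, int kl, int ku,
                  const double *a, int lda, double *ab, int ldab)
{
    int i, j;

    if (n < 0 || kl < 0 || ku < 0) return -1;
    if (lda < (n > 1 ? n : 1)) return -1;
    if (ldab < kl + ku + 1) return -1;

    for (j = 1; j <= n; ++j) {
        int ilo = (j - ku > 1) ? j - ku : 1;   /* max(1, j-ku) */
        int ihi = (j + kl < n) ? j + kl : n;   /* min(n, j+kl) */
        for (i = ilo; i <= ihi; ++i) {
            /* 1-based row ku+1+i-j becomes 0-based row ku+i-j */
            ab[(size_t)(ku + i - j) + (size_t)(j - 1) * (size_t)ldab] =
                a[(size_t)(i - 1) + (size_t)(j - 1) * (size_t)lda];
        }
    }
    return 0;
}
```

For a tridiagonal matrix (kl = ku = 1), the superdiagonal lands in row 1 of ab, the main diagonal in row 2, and the subdiagonal in row 3.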
See Also
?gbtrf

?la_gbrfsx_extended
Improves the computed solution to a system of linear equations for general banded matrices by performing extra-precise iterative refinement and provides error bounds and backward error estimates for the solution.

Syntax
Fortran 77:
call sla_gbrfsx_extended( prec_type, trans_type, n, kl, ku, nrhs, ab, ldab, afb, ldafb, ipiv, colequ, c, b, ldb, y, ldy, berr_out, n_norms, err_bnds_norm, err_bnds_comp, res, ayb, dy, y_tail, rcond, ithresh, rthresh, dz_ub, ignore_cwise, info )
call dla_gbrfsx_extended( prec_type, trans_type, n, kl, ku, nrhs, ab, ldab, afb, ldafb, ipiv, colequ, c, b, ldb, y, ldy, berr_out, n_norms, err_bnds_norm, err_bnds_comp, res, ayb, dy, y_tail, rcond, ithresh, rthresh, dz_ub, ignore_cwise, info )
call cla_gbrfsx_extended( prec_type, trans_type, n, kl, ku, nrhs, ab, ldab, afb, ldafb, ipiv, colequ, c, b, ldb, y, ldy, berr_out, n_norms, err_bnds_norm, err_bnds_comp, res, ayb, dy, y_tail, rcond, ithresh, rthresh, dz_ub, ignore_cwise, info )
call zla_gbrfsx_extended( prec_type, trans_type, n, kl, ku, nrhs, ab, ldab, afb, ldafb, ipiv, colequ, c, b, ldb, y, ldy, berr_out, n_norms, err_bnds_norm, err_bnds_comp, res, ayb, dy, y_tail, rcond, ithresh, rthresh, dz_ub, ignore_cwise, info )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The ?la_gbrfsx_extended subroutine improves the computed solution to a system of linear equations by performing extra-precise iterative refinement and provides error bounds and backward error estimates for the solution. The ?gbrfsx routine calls ?la_gbrfsx_extended to perform iterative refinement.
In addition to the normwise error bound, the code provides the maximum componentwise error bound, if possible. See comments for err_bnds_norm and err_bnds_comp for details of the error bounds.
Use ?la_gbrfsx_extended to set only the second fields of err_bnds_norm and err_bnds_comp.

Input Parameters
prec_type INTEGER.
Specifies the intermediate precision to be used in refinement. The value is defined by ilaprec(p), where p is a CHARACTER and:
If p = 'S': Single.
If p = 'D': Double.
If p = 'I': Indigenous.
If p = 'X' or 'E': Extra.
trans_type INTEGER. Specifies the transposition operation on A. The value is defined by ilatrans(t), where t is a CHARACTER and:
If t = 'N': No transpose.
If t = 'T': Transpose.
If t = 'C': Conjugate Transpose.
n INTEGER. The number of linear equations; the order of the matrix A; n ≥ 0.
kl INTEGER. The number of subdiagonals within the band of A; kl ≥ 0.
ku INTEGER. The number of superdiagonals within the band of A; ku ≥ 0.
nrhs INTEGER. The number of right-hand sides; the number of columns of the matrix B.
ab, afb, b, y REAL for sla_gbrfsx_extended
DOUBLE PRECISION for dla_gbrfsx_extended
COMPLEX for cla_gbrfsx_extended
DOUBLE COMPLEX for zla_gbrfsx_extended.
Arrays: ab(ldab,*), afb(ldafb,*), b(ldb,*), y(ldy,*).
The array ab contains the original n-by-n matrix A. The second dimension of ab must be at least max(1,n).
The array afb contains the factors L and U from the factorization A = P*L*U as computed by ?gbtrf. The second dimension of afb must be at least max(1,n).
The array b contains the matrix B whose columns are the right-hand sides for the systems of equations. The second dimension of b must be at least max(1,nrhs).
The array y on entry contains the solution matrix X as computed by ?gbtrs. The second dimension of y must be at least max(1,nrhs).
ldab INTEGER. The leading dimension of the array ab; ldab ≥ max(1,n).
ldafb INTEGER. The leading dimension of the array afb; ldafb ≥ max(1,n).
ipiv INTEGER. Array, DIMENSION at least max(1, n). Contains the pivot indices from the factorization A = P*L*U as computed by ?gbtrf; row i of the matrix was interchanged with row ipiv(i).
colequ LOGICAL. If colequ = .TRUE., column equilibration was done to A before calling this routine.
This is needed to compute the solution and error bounds correctly.
c REAL for single precision flavors
DOUBLE PRECISION for double precision flavors.
c contains the column scale factors for A. If colequ = .FALSE., c is not accessed. If c is input, each element of c should be a power of the radix to ensure a reliable solution and error estimates. Scaling by a power of the radix does not cause rounding errors unless the result underflows or overflows. Rounding errors during scaling lead to refining with a matrix that is not equivalent to the input matrix, producing error estimates that may not be reliable.
ldb INTEGER. The leading dimension of the array b; ldb ≥ max(1, n).
ldy INTEGER. The leading dimension of the array y; ldy ≥ max(1, n).
n_norms INTEGER. Determines which error bounds to return. See err_bnds_norm and err_bnds_comp descriptions in Output Parameters section below.
If n_norms ≥ 1, returns normwise error bounds.
If n_norms ≥ 2, returns componentwise error bounds.
err_bnds_norm REAL for single precision flavors
DOUBLE PRECISION for double precision flavors.
Array, DIMENSION (nrhs,n_err_bnds). For each right-hand side, contains information about various error bounds and condition numbers corresponding to the normwise relative error, which is defined as follows:
Normwise relative error in the i-th solution vector: max_j abs(xtrue(j,i) - x(j,i)) / max_j abs(x(j,i)), where xtrue denotes the exact solution.
The array is indexed by the type of error information as described below. There are currently up to three pieces of information returned.
The first index in err_bnds_norm(i,:) corresponds to the i-th right-hand side.
The second index in err_bnds_norm(:,err) contains the following three fields:
err=1 "Trust/don't trust" boolean. Trust the answer if the reciprocal condition number is greater than the threshold sqrt(n)*slamch(e) for single precision flavors and sqrt(n)*dlamch(e) for double precision flavors.
err=2 "Guaranteed" error bound.
The estimated forward error, almost certainly within a factor of 10 of the true error so long as the next entry is greater than the threshold sqrt(n)*slamch(e) for single precision flavors and sqrt(n)*dlamch(e) for double precision flavors. This error bound should only be trusted if the previous boolean is true.
err=3 Reciprocal condition number. Estimated normwise reciprocal condition number. Compared with the threshold sqrt(n)*slamch(e) for single precision flavors and sqrt(n)*dlamch(e) for double precision flavors to determine if the error estimate is "guaranteed". These reciprocal condition numbers are 1/(norm(inv(z),inf)*norm(z,inf)) for some appropriately scaled matrix Z. Let z=s*a, where s scales each row by a power of the radix so all absolute row sums of z are approximately 1.
Use this subroutine to set only the second field above.
err_bnds_comp REAL for single precision flavors
DOUBLE PRECISION for double precision flavors.
Array, DIMENSION (nrhs,n_err_bnds). For each right-hand side, contains information about various error bounds and condition numbers corresponding to the componentwise relative error, which is defined as follows:
Componentwise relative error in the i-th solution vector: max_j abs(xtrue(j,i) - x(j,i)) / abs(x(j,i)), where xtrue denotes the exact solution.
The array is indexed by the right-hand side i, on which the componentwise relative error depends, and by the type of error information as described below. There are currently up to three pieces of information returned for each right-hand side. If componentwise accuracy is not requested (params(3) = 0.0), then err_bnds_comp is not accessed. If n_err_bnds < 3, then at most the first (:,n_err_bnds) entries are returned.
The first index in err_bnds_comp(i,:) corresponds to the i-th right-hand side.
The second index in err_bnds_comp(:,err) contains the following three fields:
err=1 "Trust/don't trust" boolean.
Trust the answer if the reciprocal condition number is greater than the threshold sqrt(n)*slamch(e) for single precision flavors and sqrt(n)*dlamch(e) for double precision flavors.
err=2 "Guaranteed" error bound. The estimated forward error, almost certainly within a factor of 10 of the true error so long as the next entry is greater than the threshold sqrt(n)*slamch(e) for single precision flavors and sqrt(n)*dlamch(e) for double precision flavors. This error bound should only be trusted if the previous boolean is true.
err=3 Reciprocal condition number. Estimated componentwise reciprocal condition number. Compared with the threshold sqrt(n)*slamch(e) for single precision flavors and sqrt(n)*dlamch(e) for double precision flavors to determine if the error estimate is "guaranteed". These reciprocal condition numbers are 1/(norm(inv(z),inf)*norm(z,inf)) for some appropriately scaled matrix Z. Let z=s*(a*diag(x)), where x is the solution for the current right-hand side and s scales each row of a*diag(x) by a power of the radix so all absolute row sums of z are approximately 1.
Use this subroutine to set only the second field above.
res, dy, y_tail REAL for sla_gbrfsx_extended
DOUBLE PRECISION for dla_gbrfsx_extended
COMPLEX for cla_gbrfsx_extended
DOUBLE COMPLEX for zla_gbrfsx_extended.
Workspace arrays of DIMENSION n.
res holds the intermediate residual.
dy holds the intermediate solution.
y_tail holds the trailing bits of the intermediate solution.
ayb REAL for single precision flavors
DOUBLE PRECISION for double precision flavors.
Workspace array, DIMENSION n.
rcond REAL for single precision flavors
DOUBLE PRECISION for double precision flavors.
Reciprocal scaled condition number. An estimate of the reciprocal Skeel condition number of the matrix A after equilibration (if done). If rcond is less than the machine precision, in particular, if rcond = 0, the matrix is singular to working precision.
Note that the error may still be small even if this number is very small and the matrix appears ill-conditioned.
ithresh INTEGER. The maximum number of residual computations allowed for refinement. The default is 10. For 'aggressive', set to 100 to permit convergence using approximate factorizations or factorizations other than LU. If the factorization uses a technique other than Gaussian elimination, the guarantees in err_bnds_norm and err_bnds_comp may no longer be trustworthy.
rthresh REAL for single precision flavors
DOUBLE PRECISION for double precision flavors.
Determines when to stop refinement if the error estimate stops decreasing. Refinement stops when the next solution no longer satisfies
norm(dx_{i+1}) < rthresh * norm(dx_i)
where norm(z) is the infinity norm of Z.
rthresh satisfies 0 < rthresh ≤ 1.
The default value is 0.5. For 'aggressive' set to 0.9 to permit convergence on extremely ill-conditioned matrices.
dz_ub REAL for single precision flavors
DOUBLE PRECISION for double precision flavors.
Determines when to start considering componentwise convergence. Componentwise convergence is only considered after each component of the solution y is stable, that is, the relative change in each component is less than dz_ub. The default value is 0.25, requiring the first bit to be stable.
ignore_cwise LOGICAL. If .TRUE., the function ignores componentwise convergence. Default value is .FALSE.

Output Parameters
y REAL for sla_gbrfsx_extended
DOUBLE PRECISION for dla_gbrfsx_extended
COMPLEX for cla_gbrfsx_extended
DOUBLE COMPLEX for zla_gbrfsx_extended.
The improved solution matrix Y.
berr_out REAL for single precision flavors
DOUBLE PRECISION for double precision flavors.
Array, DIMENSION at least max(1, nrhs). Contains the componentwise relative backward error for right-hand-side j from the formula
max(i) ( abs(res(i)) / ( abs(op(A))*abs(y) + abs(B) )(i) )
where abs(z) is the componentwise absolute value of the matrix or vector Z.
This is computed by ?la_lin_berr.
err_bnds_norm, err_bnds_comp Values of the corresponding input parameters improved after iterative refinement and stored in the second column of the array (1:nrhs, 2). The other elements are kept unchanged.
info INTEGER. If info = 0, the execution is successful. The solution to every right-hand side is guaranteed.
If info = -i, the i-th parameter had an illegal value.

See Also
?gbrfsx
?gbtrf
?gbtrs
?lamch
ilaprec
ilatrans
?la_lin_berr

?la_gbrpvgrw
Computes the reciprocal pivot growth factor norm(A)/norm(U) for a general band matrix.

Syntax
Fortran 77:
call sla_gbrpvgrw( n, kl, ku, ncols, ab, ldab, afb, ldafb )
call dla_gbrpvgrw( n, kl, ku, ncols, ab, ldab, afb, ldafb )
call cla_gbrpvgrw( n, kl, ku, ncols, ab, ldab, afb, ldafb )
call zla_gbrpvgrw( n, kl, ku, ncols, ab, ldab, afb, ldafb )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The ?la_gbrpvgrw routine computes the reciprocal pivot growth factor norm(A)/norm(U). The max absolute element norm is used. If this is much less than 1, the stability of the LU factorization of the equilibrated matrix A could be poor. This also means that the solution X, estimated condition numbers, and error bounds could be unreliable.

Input Parameters
n INTEGER. The number of linear equations, the order of the matrix A; n ≥ 0.
kl INTEGER. The number of subdiagonals within the band of A; kl ≥ 0.
ku INTEGER. The number of superdiagonals within the band of A; ku ≥ 0.
ncols INTEGER. The number of columns of the matrix A; ncols ≥ 0.
ab, afb REAL for sla_gbrpvgrw
DOUBLE PRECISION for dla_gbrpvgrw
COMPLEX for cla_gbrpvgrw
DOUBLE COMPLEX for zla_gbrpvgrw.
Arrays: ab(ldab,*), afb(ldafb,*).
ab contains the original band matrix A (see Matrix Storage Schemes) stored in rows from 1 to kl + ku + 1.
The j-th column of A is stored in the j-th column of the array ab as follows: ab(ku+1+i-j,j) = A(i,j) for max(1,j-ku) ≤ i ≤ min(n,j+kl).
afb contains details of the LU factorization of the band matrix A, as returned by ?gbtrf. U is stored as an upper triangular band matrix with kl+ku superdiagonals in rows 1 to kl+ku+1, and the multipliers used during the factorization are stored in rows kl+ku+2 to 2*kl+ku+1.
ldab INTEGER. The leading dimension of ab; ldab ≥ kl+ku+1.
ldafb INTEGER. The leading dimension of afb; ldafb ≥ 2*kl+ku+1.

See Also
?gbtrf

?la_geamv
Computes a matrix-vector product using a general matrix to calculate error bounds.

Syntax
Fortran 77:
call sla_geamv(trans, m, n, alpha, a, lda, x, incx, beta, y, incy)
call dla_geamv(trans, m, n, alpha, a, lda, x, incx, beta, y, incy)
call cla_geamv(trans, m, n, alpha, a, lda, x, incx, beta, y, incy)
call zla_geamv(trans, m, n, alpha, a, lda, x, incx, beta, y, incy)

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The ?la_geamv routines perform a matrix-vector operation defined as
y := alpha*abs(A)*abs(x) + beta*abs(y),
or
y := alpha*abs(A^T)*abs(x) + beta*abs(y),
where:
alpha and beta are scalars,
x and y are vectors,
A is an m-by-n matrix.
This function is primarily used in calculating error bounds. To protect against underflow during evaluation, the function perturbs components in the resulting vector away from zero by (n + 1) times the underflow threshold. To prevent unnecessarily large errors for block structure embedded in general matrices, the function does not perturb symbolically zero components. A zero entry is considered symbolic if all multiplications involved in computing that entry have at least one zero multiplicand.

Input Parameters
trans CHARACTER*1.
Specifies the operation:
if trans = BLAS_NO_TRANS, then y := alpha*abs(A)*abs(x) + beta*abs(y)
if trans = BLAS_TRANS, then y := alpha*abs(A^T)*abs(x) + beta*abs(y)
if trans = BLAS_CONJ_TRANS, then y := alpha*abs(A^T)*abs(x) + beta*abs(y).
m INTEGER. Specifies the number of rows of the matrix A. The value of m must be at least zero.
n INTEGER. Specifies the number of columns of the matrix A. The value of n must be at least zero.
alpha REAL for sla_geamv and for cla_geamv
DOUBLE PRECISION for dla_geamv and zla_geamv
Specifies the scalar alpha.
a REAL for sla_geamv
DOUBLE PRECISION for dla_geamv
COMPLEX for cla_geamv
DOUBLE COMPLEX for zla_geamv
Array, DIMENSION (lda, *). Before entry, the leading m-by-n part of the array a must contain the matrix of coefficients. The second dimension of a must be at least max(1,n).
lda INTEGER. Specifies the leading dimension of a as declared in the calling (sub)program. The value of lda must be at least max(1, m).
x REAL for sla_geamv
DOUBLE PRECISION for dla_geamv
COMPLEX for cla_geamv
DOUBLE COMPLEX for zla_geamv
Array, DIMENSION at least (1+(n-1)*abs(incx)) when trans = 'N' or 'n' and at least (1+(m-1)*abs(incx)) otherwise. Before entry, the incremented array x must contain the vector X.
incx INTEGER. Specifies the increment for the elements of x. The value of incx must be non-zero.
beta REAL for sla_geamv and for cla_geamv
DOUBLE PRECISION for dla_geamv and zla_geamv
Specifies the scalar beta. When beta is zero, you do not need to set y on input.
y REAL for sla_geamv and for cla_geamv
DOUBLE PRECISION for dla_geamv and zla_geamv
Array, DIMENSION at least (1+(m-1)*abs(incy)) when trans = 'N' or 'n' and at least (1+(n-1)*abs(incy)) otherwise. Before entry with non-zero beta, the incremented array y must contain the vector Y.
incy INTEGER. Specifies the increment for the elements of y. The value of incy must be non-zero.

Output Parameters
y Updated vector Y.
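
The operation y := alpha*abs(A)*abs(x) + beta*abs(y) and the handling of symbolic zeros can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Intel MKL implementation: the name la_geamv_sketch is hypothetical, only real data, the no-transpose case, and unit increments are covered, and sys.float_info.min (the smallest normalized double) stands in for the underflow threshold.

```python
import math
import sys


def la_geamv_sketch(alpha, a, x, beta, y):
    """Return alpha*abs(A)*abs(x) + beta*abs(y) for a real m-by-n matrix A.

    a is a list of m rows, x has n entries, y has m entries.
    Entries that are not symbolically zero are pushed away from zero
    by (n + 1) times the underflow threshold.
    """
    m, n = len(a), len(x)
    safe1 = (n + 1) * sys.float_info.min   # assumed underflow threshold
    out = []
    for i in range(m):
        if beta == 0.0 or y[i] == 0.0:
            # beta*abs(y(i)) has a zero multiplicand: symbolic zero so far.
            symb_zero = True
            yi = 0.0
        else:
            symb_zero = False
            yi = beta * abs(y[i])
        if alpha != 0.0:
            for j in range(n):
                temp = abs(a[i][j])
                # The entry stays symbolic only if every product has a zero factor.
                symb_zero = symb_zero and (x[j] == 0.0 or temp == 0.0)
                yi += alpha * abs(x[j]) * temp
        if not symb_zero:
            yi += math.copysign(safe1, yi)
        out.append(yi)
    return out


# Row 1 is an ordinary entry; row 2 is symbolically zero and stays exactly 0.
print(la_geamv_sketch(1.0, [[1.0, -2.0], [0.0, 0.0]], [3.0, -4.0], 0.0, [0.0, 0.0]))
# The product underflows to zero but is not symbolic, so it is perturbed.
print(la_geamv_sketch(1.0, [[1e-200]], [1e-200], 0.0, [0.0]))
```

The second call shows why the perturbation exists: a result that underflowed to zero is replaced by a tiny positive value, so later divisions by the entries of y in an error-bound computation do not divide by zero.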
?la_gercond
Estimates the Skeel condition number for a general matrix.

Syntax
Fortran 77:
call sla_gercond( trans, n, a, lda, af, ldaf, ipiv, cmode, c, info, work, iwork )
call dla_gercond( trans, n, a, lda, af, ldaf, ipiv, cmode, c, info, work, iwork )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The function estimates the Skeel condition number of op(A) * op2(C), where the cmode parameter determines op2 as follows:

cmode Value    op2(C)
 1             C
 0             I
-1             inv(C)

The Skeel condition number cond(A) = norminf(|inv(A)||A|) is computed by computing scaling factors R such that diag(R)*A*op2(C) is row equilibrated and by computing the standard infinity-norm condition number.

Input Parameters
trans CHARACTER*1. Must be 'N' or 'T' or 'C'.
Specifies the form of the system of equations:
If trans = 'N', the system has the form A*X = B (No transpose).
If trans = 'T', the system has the form A^T*X = B (Transpose).
If trans = 'C', the system has the form A^H*X = B (Conjugate Transpose = Transpose).
n INTEGER. The number of linear equations, that is, the order of the matrix A; n ≥ 0.
a, af, c, work REAL for sla_gercond
DOUBLE PRECISION for dla_gercond
Arrays:
a(lda,*) contains the original general n-by-n matrix A.
af(ldaf,*) contains factors L and U from the factorization of the general matrix A=P*L*U, as returned by ?getrf.
c, DIMENSION n. The vector C in the formula op(A) * op2(C).
work is a workspace array of DIMENSION (3*n).
The second dimension of a and af must be at least max(1, n).
lda INTEGER. The leading dimension of the array a. lda ≥ max(1,n).
ldaf INTEGER. The leading dimension of af. ldaf ≥ max(1, n).
ipiv INTEGER. Array with DIMENSION n. The pivot indices from the factorization A = P*L*U as computed by ?getrf. Row i of the matrix was interchanged with row ipiv(i).
cmode INTEGER.
Determines op2(C) in the formula op(A) * op2(C) as follows:
If cmode = 1, op2(C) = C.
If cmode = 0, op2(C) = I.
If cmode = -1, op2(C) = inv(C).
iwork INTEGER. Workspace array with DIMENSION n.

Output Parameters
info INTEGER. If info = 0, the execution is successful. If i > 0, the i-th parameter is invalid.

See Also
?getrf

?la_gercond_c
Computes the infinity norm condition number of op(A)*inv(diag(c)) for general matrices.

Syntax
Fortran 77:
call cla_gercond_c( trans, n, a, lda, af, ldaf, ipiv, c, capply, info, work, rwork )
call zla_gercond_c( trans, n, a, lda, af, ldaf, ipiv, c, capply, info, work, rwork )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The function computes the infinity norm condition number of op(A) * inv(diag(c)), where c is a REAL vector for cla_gercond_c and a DOUBLE PRECISION vector for zla_gercond_c.

Input Parameters
trans CHARACTER*1. Must be 'N' or 'T' or 'C'.
Specifies the form of the system of equations:
If trans = 'N', the system has the form A*X = B (No transpose).
If trans = 'T', the system has the form A^T*X = B (Transpose).
If trans = 'C', the system has the form A^H*X = B (Conjugate transpose).
n INTEGER. The number of linear equations, that is, the order of the matrix A; n ≥ 0.
a, af, work COMPLEX for cla_gercond_c
DOUBLE COMPLEX for zla_gercond_c
Arrays:
a(lda,*) contains the original general n-by-n matrix A.
af(ldaf,*) contains the factors L and U from the factorization A=P*L*U as returned by ?getrf.
work is a workspace array of DIMENSION (2*n).
The second dimension of a and af must be at least max(1, n).
lda INTEGER. The leading dimension of the array a. lda ≥ max(1,n).
ldaf INTEGER. The leading dimension of af. ldaf ≥ max(1,n).
ipiv INTEGER. Array with DIMENSION n. The pivot indices from the factorization A = P*L*U as computed by ?getrf. Row i of the matrix was interchanged with row ipiv(i).
c, rwork REAL for cla_gercond_c
DOUBLE PRECISION for zla_gercond_c
Array c with DIMENSION n. The vector c in the formula op(A) * inv(diag(c)).
Array rwork with DIMENSION n is a workspace.
capply LOGICAL. If capply=.TRUE., then the function uses the vector c from the formula op(A) * inv(diag(c)).

Output Parameters
info INTEGER. If info = 0, the execution is successful. If i > 0, the i-th parameter is invalid.

See Also
?getrf

?la_gercond_x
Computes the infinity norm condition number of op(A)*diag(x) for general matrices.

Syntax
Fortran 77:
call cla_gercond_x( trans, n, a, lda, af, ldaf, ipiv, x, info, work, rwork )
call zla_gercond_x( trans, n, a, lda, af, ldaf, ipiv, x, info, work, rwork )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The function computes the infinity norm condition number of op(A) * diag(x), where x is a COMPLEX vector for cla_gercond_x and a DOUBLE COMPLEX vector for zla_gercond_x.

Input Parameters
trans CHARACTER*1. Must be 'N' or 'T' or 'C'.
Specifies the form of the system of equations:
If trans = 'N', the system has the form A*X = B (No transpose).
If trans = 'T', the system has the form A^T*X = B (Transpose).
If trans = 'C', the system has the form A^H*X = B (Conjugate transpose).
n INTEGER. The number of linear equations, that is, the order of the matrix A; n ≥ 0.
a, af, x, work COMPLEX for cla_gercond_x
DOUBLE COMPLEX for zla_gercond_x
Arrays:
a(lda,*) contains the original general n-by-n matrix A.
af(ldaf,*) contains the factors L and U from the factorization A=P*L*U as returned by ?getrf.
x, DIMENSION n. The vector x in the formula op(A) * diag(x).
work is a workspace array of DIMENSION (2*n).
The second dimension of a and af must be at least max(1, n).
lda INTEGER. The leading dimension of the array a. lda ≥ max(1,n).
ldaf INTEGER. The leading dimension of af. ldaf ≥ max(1,n).
ipiv INTEGER. Array with DIMENSION n.
The pivot indices from the factorization A = P*L*U as computed by ?getrf. Row i of the matrix was interchanged with row ipiv(i).
rwork REAL for cla_gercond_x
DOUBLE PRECISION for zla_gercond_x
Array rwork with DIMENSION n is a workspace.

Output Parameters
info INTEGER. If info = 0, the execution is successful. If i > 0, the i-th parameter is invalid.

See Also
?getrf

?la_gerfsx_extended
Improves the computed solution to a system of linear equations for general matrices by performing extra-precise iterative refinement and provides error bounds and backward error estimates for the solution.

Syntax
Fortran 77:
call sla_gerfsx_extended( prec_type, trans_type, n, nrhs, a, lda, af, ldaf, ipiv, colequ, c, b, ldb, y, ldy, berr_out, n_norms, errs_n, errs_c, res, ayb, dy, y_tail, rcond, ithresh, rthresh, dz_ub, ignore_cwise, info )
call dla_gerfsx_extended( prec_type, trans_type, n, nrhs, a, lda, af, ldaf, ipiv, colequ, c, b, ldb, y, ldy, berr_out, n_norms, errs_n, errs_c, res, ayb, dy, y_tail, rcond, ithresh, rthresh, dz_ub, ignore_cwise, info )
call cla_gerfsx_extended( prec_type, trans_type, n, nrhs, a, lda, af, ldaf, ipiv, colequ, c, b, ldb, y, ldy, berr_out, n_norms, errs_n, errs_c, res, ayb, dy, y_tail, rcond, ithresh, rthresh, dz_ub, ignore_cwise, info )
call zla_gerfsx_extended( prec_type, trans_type, n, nrhs, a, lda, af, ldaf, ipiv, colequ, c, b, ldb, y, ldy, berr_out, n_norms, errs_n, errs_c, res, ayb, dy, y_tail, rcond, ithresh, rthresh, dz_ub, ignore_cwise, info )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The ?la_gerfsx_extended subroutine improves the computed solution to a system of linear equations for general matrices by performing extra-precise iterative refinement and provides error bounds and backward error estimates for the solution. The ?gerfsx routine calls ?la_gerfsx_extended to perform iterative refinement.
In addition to the normwise error bound, the code provides the maximum componentwise error bound, if possible. See comments for errs_n and errs_c for details of the error bounds.
Use ?la_gerfsx_extended to set only the second fields of errs_n and errs_c.

Input Parameters
prec_type INTEGER. Specifies the intermediate precision to be used in refinement. The value is defined by ilaprec(p), where p is a CHARACTER and:
If p = 'S': Single.
If p = 'D': Double.
If p = 'I': Indigenous.
If p = 'X' or 'E': Extra.
trans_type INTEGER. Specifies the transposition operation on A. The value is defined by ilatrans(t), where t is a CHARACTER and:
If t = 'N': No transpose.
If t = 'T': Transpose.
If t = 'C': Conjugate Transpose.
n INTEGER. The number of linear equations; the order of the matrix A; n ≥ 0.
nrhs INTEGER. The number of right-hand sides; the number of columns of the matrix B.
a, af, b, y REAL for sla_gerfsx_extended
DOUBLE PRECISION for dla_gerfsx_extended
COMPLEX for cla_gerfsx_extended
DOUBLE COMPLEX for zla_gerfsx_extended.
Arrays: a(lda,*), af(ldaf,*), b(ldb,*), y(ldy,*).
The array a contains the original n-by-n matrix A. The second dimension of a must be at least max(1,n).
The array af contains the factors L and U from the factorization A = P*L*U as computed by ?getrf. The second dimension of af must be at least max(1,n).
The array b contains the matrix B whose columns are the right-hand sides for the systems of equations. The second dimension of b must be at least max(1,nrhs).
The array y on entry contains the solution matrix X as computed by ?getrs. The second dimension of y must be at least max(1,nrhs).
lda INTEGER. The leading dimension of the array a; lda ≥ max(1,n).
ldaf INTEGER. The leading dimension of the array af; ldaf ≥ max(1,n).
ipiv INTEGER. Array, DIMENSION at least max(1, n).
Contains the pivot indices from the factorization A = P*L*U as computed by ?getrf; row i of the matrix was interchanged with row ipiv(i).
colequ LOGICAL. If colequ = .TRUE., column equilibration was done to A before calling this routine. This is needed to compute the solution and error bounds correctly.
c REAL for single precision flavors (sla_gerfsx_extended, cla_gerfsx_extended)
DOUBLE PRECISION for double precision flavors (dla_gerfsx_extended, zla_gerfsx_extended).
c contains the column scale factors for A. If colequ = .FALSE., c is not used. If c is input, each element of c should be a power of the radix to ensure a reliable solution and error estimates. Scaling by a power of the radix does not cause rounding errors unless the result underflows or overflows. Rounding errors during scaling lead to refining with a matrix that is not equivalent to the input matrix, producing error estimates that may not be reliable.
ldb INTEGER. The leading dimension of the array b; ldb ≥ max(1, n).
ldy INTEGER. The leading dimension of the array y; ldy ≥ max(1, n).
n_norms INTEGER. Determines which error bounds to return. See errs_n and errs_c descriptions in Output Parameters section below.
If n_norms ≥ 1, returns normwise error bounds.
If n_norms ≥ 2, returns componentwise error bounds.
errs_n REAL for single precision flavors
DOUBLE PRECISION for double precision flavors.
Array, DIMENSION (nrhs,n_err_bnds). For each right-hand side, contains information about various error bounds and condition numbers corresponding to the normwise relative error, which is defined as follows:
Normwise relative error in the i-th solution vector: max_j abs(xtrue(j,i) - x(j,i)) / max_j abs(x(j,i)), where xtrue denotes the exact solution.
The array is indexed by the type of error information as described below. There are currently up to three pieces of information returned.
The first index in errs_n(i,:) corresponds to the i-th right-hand side.
The second index in errs_n(:,err) contains the following three fields:
err=1 "Trust/don't trust" boolean.
Trust the answer if the reciprocal condition number is greater than the threshold sqrt(n)*slamch(e) for single precision flavors and sqrt(n)*dlamch(e) for double precision flavors.
err=2 "Guaranteed" error bound. The estimated forward error, almost certainly within a factor of 10 of the true error so long as the next entry is greater than the threshold sqrt(n)*slamch(e) for single precision flavors and sqrt(n)*dlamch(e) for double precision flavors. This error bound should only be trusted if the previous boolean is true.
err=3 Reciprocal condition number. Estimated normwise reciprocal condition number. Compared with the threshold sqrt(n)*slamch(e) for single precision flavors and sqrt(n)*dlamch(e) for double precision flavors to determine if the error estimate is "guaranteed". These reciprocal condition numbers are 1/(norm(inv(z),inf)*norm(z,inf)) for some appropriately scaled matrix Z. Let z=s*a, where s scales each row by a power of the radix so all absolute row sums of z are approximately 1.
Use this subroutine to set only the second field above.
errs_c REAL for single precision flavors
DOUBLE PRECISION for double precision flavors.
Array, DIMENSION (nrhs,n_err_bnds). For each right-hand side, contains information about various error bounds and condition numbers corresponding to the componentwise relative error, which is defined as follows:
Componentwise relative error in the i-th solution vector: max_j abs(xtrue(j,i) - x(j,i)) / abs(x(j,i)), where xtrue denotes the exact solution.
The array is indexed by the right-hand side i, on which the componentwise relative error depends, and by the type of error information as described below. There are currently up to three pieces of information returned for each right-hand side. If componentwise accuracy is not requested (params(3) = 0.0), then errs_c is not accessed. If n_err_bnds < 3, then at most the first (:,n_err_bnds) entries are returned.
The first index in errs_c(i,:) corresponds to the i-th right-hand side.
The second index in errs_c(:,err) contains the following three fields:
err=1 "Trust/don't trust" boolean. Trust the answer if the reciprocal condition number is greater than the threshold sqrt(n)*slamch(e) for single precision flavors and sqrt(n)*dlamch(e) for double precision flavors.
err=2 "Guaranteed" error bound. The estimated forward error, almost certainly within a factor of 10 of the true error so long as the next entry is greater than the threshold sqrt(n)*slamch(e) for single precision flavors and sqrt(n)*dlamch(e) for double precision flavors. This error bound should only be trusted if the previous boolean is true.
err=3 Reciprocal condition number. Estimated componentwise reciprocal condition number. Compared with the threshold sqrt(n)*slamch(e) for single precision flavors and sqrt(n)*dlamch(e) for double precision flavors to determine if the error estimate is "guaranteed". These reciprocal condition numbers are 1/(norm(inv(z),inf)*norm(z,inf)) for some appropriately scaled matrix Z. Let z=s*(a*diag(x)), where x is the solution for the current right-hand side and s scales each row of a*diag(x) by a power of the radix so all absolute row sums of z are approximately 1.
Use this subroutine to set only the second field above.
res, dy, y_tail REAL for sla_gerfsx_extended
DOUBLE PRECISION for dla_gerfsx_extended
COMPLEX for cla_gerfsx_extended
DOUBLE COMPLEX for zla_gerfsx_extended.
Workspace arrays of DIMENSION n.
res holds the intermediate residual.
dy holds the intermediate solution.
y_tail holds the trailing bits of the intermediate solution.
ayb REAL for single precision flavors
DOUBLE PRECISION for double precision flavors.
Workspace array, DIMENSION n.
rcond REAL for single precision flavors
DOUBLE PRECISION for double precision flavors.
Reciprocal scaled condition number. An estimate of the reciprocal Skeel condition number of the matrix A after equilibration (if done).
If rcond is less than the machine precision, in particular, if rcond = 0, the matrix is singular to working precision. Note that the error may still be small even if this number is very small and the matrix appears ill-conditioned.
ithresh INTEGER. The maximum number of residual computations allowed for refinement. The default is 10. For 'aggressive', set to 100 to permit convergence using approximate factorizations or factorizations other than LU. If the factorization uses a technique other than Gaussian elimination, the guarantees in errs_n and errs_c may no longer be trustworthy.
rthresh REAL for single precision flavors DOUBLE PRECISION for double precision flavors. Determines when to stop refinement if the error estimate stops decreasing. Refinement stops when the next solution no longer satisfies norm(dx_{i+1}) < rthresh * norm(dx_i) where norm(z) is the infinity norm of Z. rthresh satisfies 0 < rthresh ≤ 1. The default value is 0.5. For 'aggressive' set to 0.9 to permit convergence on extremely ill-conditioned matrices.
dz_ub REAL for single precision flavors DOUBLE PRECISION for double precision flavors. Determines when to start considering componentwise convergence. Componentwise convergence is only considered after each component of the solution y is stable, that is, the relative change in each component is less than dz_ub. The default value is 0.25, requiring the first bit to be stable.
ignore_cwise LOGICAL. If .TRUE., the function ignores componentwise convergence. Default value is .FALSE.
Output Parameters
y REAL for sla_gerfsx_extended DOUBLE PRECISION for dla_gerfsx_extended COMPLEX for cla_gerfsx_extended DOUBLE COMPLEX for zla_gerfsx_extended. The improved solution matrix Y.
berr_out REAL for single precision flavors DOUBLE PRECISION for double precision flavors. Array, DIMENSION at least max(1, nrhs).
Contains the componentwise relative backward error for right-hand-side j from the formula max(i) ( abs(res(i)) / ( abs(op(A))*abs(y) + abs(B) )(i) ) where abs(z) is the componentwise absolute value of the matrix or vector Z. This is computed by ?la_lin_berr.
errs_n, errs_c Values of the corresponding input parameters improved after iterative refinement and stored in the second column of the array ( 1:nrhs, 2 ). The other elements are kept unchanged.
info INTEGER. If info = 0, the execution is successful. The solution to every right-hand side is guaranteed. If info = -i, the i-th parameter had an illegal value.
See Also
?gerfsx ?getrf ?getrs ?lamch ilaprec ilatrans ?la_lin_berr
?la_heamv
Computes a matrix-vector product using a Hermitian indefinite matrix to calculate error bounds.
Syntax
Fortran 77:
call cla_heamv(uplo, n, alpha, a, lda, x, incx, beta, y, incy)
call zla_heamv(uplo, n, alpha, a, lda, x, incx, beta, y, incy)
Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h
Description
The ?la_heamv routines perform a matrix-vector operation defined as y := alpha*abs(A)*abs(x) + beta*abs(y), where: alpha and beta are scalars, x and y are vectors, A is an n-by-n Hermitian matrix. This function is primarily used in calculating error bounds. To protect against underflow during evaluation, the function perturbs components in the resulting vector away from zero by (n + 1) times the underflow threshold. To prevent unnecessarily large errors for block structure embedded in general matrices, the function does not perturb symbolically zero components. A zero entry is considered symbolic if all multiplications involved in computing that entry have at least one zero multiplicand.
Input Parameters
uplo CHARACTER*1.
Specifies whether the upper or lower triangular part of the array A is to be referenced: If uplo = 'BLAS_UPPER', only the upper triangular part of A is to be referenced, If uplo = 'BLAS_LOWER', only the lower triangular part of A is to be referenced.
n INTEGER. Specifies the number of rows and columns of the matrix A. The value of n must be at least zero.
alpha REAL for cla_heamv DOUBLE PRECISION for zla_heamv Specifies the scalar alpha.
a COMPLEX for cla_heamv DOUBLE COMPLEX for zla_heamv Array, DIMENSION (lda, *). Before entry, the leading n-by-n part of the array a must contain the matrix of coefficients. The second dimension of a must be at least max(1,n).
lda INTEGER. Specifies the leading dimension of a as declared in the calling (sub)program. The value of lda must be at least max(1, n).
x COMPLEX for cla_heamv DOUBLE COMPLEX for zla_heamv Array, DIMENSION at least (1+(n-1)*abs(incx)). Before entry, the incremented array x must contain the vector X.
incx INTEGER. Specifies the increment for the elements of x. The value of incx must be non-zero.
beta REAL for cla_heamv DOUBLE PRECISION for zla_heamv Specifies the scalar beta. When beta is zero, you do not need to set y on input.
y REAL for cla_heamv DOUBLE PRECISION for zla_heamv Array, DIMENSION at least (1+(n-1)*abs(incy)). Before entry with non-zero beta, the incremented array y must contain the vector Y.
incy INTEGER. Specifies the increment for the elements of y. The value of incy must be non-zero.
Output Parameters
y Updated vector Y.
?la_hercond_c
Computes the infinity norm condition number of op(A)*inv(diag(c)) for Hermitian indefinite matrices.
Syntax
Fortran 77:
call cla_hercond_c( uplo, n, a, lda, af, ldaf, ipiv, c, capply, info, work, rwork )
call zla_hercond_c( uplo, n, a, lda, af, ldaf, ipiv, c, capply, info, work, rwork )
Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h
Description
The function computes the infinity norm condition number of op(A) * inv(diag(c)), where c is a REAL vector for cla_hercond_c and a DOUBLE PRECISION vector for zla_hercond_c.
Input Parameters
uplo CHARACTER*1. Must be 'U' or 'L'. Specifies the triangle of A to store: If uplo = 'U', the upper triangle of A is stored, If uplo = 'L', the lower triangle of A is stored.
n INTEGER. The number of linear equations, that is, the order of the matrix A; n ≥ 0.
a COMPLEX for cla_hercond_c DOUBLE COMPLEX for zla_hercond_c Array, DIMENSION (lda, *). On entry, the n-by-n matrix A. The second dimension of a must be at least max(1,n).
lda INTEGER. The leading dimension of the array a. lda ≥ max(1,n).
af COMPLEX for cla_hercond_c DOUBLE COMPLEX for zla_hercond_c Array, DIMENSION (ldaf, *). The block diagonal matrix D and the multipliers used to obtain the factor U or L as computed by ?hetrf. The second dimension of af must be at least max(1,n).
ldaf INTEGER. The leading dimension of the array af. ldaf ≥ max(1,n).
ipiv INTEGER. Array with DIMENSION n. Details of the interchanges and the block structure of D as determined by ?hetrf.
c REAL for cla_hercond_c DOUBLE PRECISION for zla_hercond_c Array c with DIMENSION n. The vector c in the formula op(A) * inv(diag(c)).
capply LOGICAL. If .TRUE., then the function uses the vector c from the formula op(A) * inv(diag(c)).
work COMPLEX for cla_hercond_c DOUBLE COMPLEX for zla_hercond_c Array DIMENSION 2*n. Workspace.
rwork REAL for cla_hercond_c DOUBLE PRECISION for zla_hercond_c Array DIMENSION n. Workspace.
Output Parameters
info INTEGER. If info = 0, the execution is successful. If i > 0, the i-th parameter is invalid.
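As an illustration of the quantity named in the ?la_hercond_c description, the following Python sketch evaluates the textbook infinity-norm condition number norm(M,inf)*norm(inv(M),inf) for M = A*inv(diag(c)) on a small dense matrix. It is not the MKL implementation: the routine works from the ?hetrf factorization and its returned value may differ from this textbook definition (for example, through scaling or norm estimation). The helper names inf_norm, invert, and cond_inf_scaled are hypothetical, and treating capply = .FALSE. as "use A alone" is an assumption.

```python
def inf_norm(m):
    """Infinity norm of a square matrix: the largest absolute row sum."""
    return max(sum(abs(v) for v in row) for row in m)


def invert(m):
    """Invert a small dense matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(m)
    # Augment m with the identity matrix; complex arithmetic also covers real input.
    aug = [list(map(complex, row)) + [complex(i == j) for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        # Pick the row with the largest entry in this column as the pivot row.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if aug[pivot][col] == 0:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        # Eliminate the column from every other row.
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def cond_inf_scaled(a, c, capply=True):
    """Textbook cond_inf of A*inv(diag(c)); with capply False (assumed), of A alone."""
    if capply:
        # Right-multiplying by inv(diag(c)) divides column j of A by c[j].
        m = [[a_ij / c_j for a_ij, c_j in zip(row, c)] for row in a]
    else:
        m = [list(row) for row in a]
    return inf_norm(m) * inf_norm(invert(m))


a = [[2.0, 1.0], [1.0, 3.0]]   # real symmetric, hence Hermitian
c = [1.0, 2.0]                 # illustrative column scale factors
print(cond_inf_scaled(a, c, capply=True))
print(cond_inf_scaled(a, c, capply=False))
```

For this 2-by-2 example the scaled value works out by hand to 2.5*1.2 = 3 and the unscaled value to 4*0.8 = 3.2, up to rounding.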
See Also
?hetrf
?la_hercond_x
Computes the infinity norm condition number of op(A)*diag(x) for Hermitian indefinite matrices.
Syntax
Fortran 77:
call cla_hercond_x( uplo, n, a, lda, af, ldaf, ipiv, x, info, work, rwork )
call zla_hercond_x( uplo, n, a, lda, af, ldaf, ipiv, x, info, work, rwork )
Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h
Description
The function computes the infinity norm condition number of op(A) * diag(x), where x is a COMPLEX vector for cla_hercond_x and a DOUBLE COMPLEX vector for zla_hercond_x.
Input Parameters
uplo CHARACTER*1. Must be 'U' or 'L'. Specifies the triangle of A to store: If uplo = 'U', the upper triangle of A is stored, If uplo = 'L', the lower triangle of A is stored.
n INTEGER. The number of linear equations, that is, the order of the matrix A; n ≥ 0.
a COMPLEX for cla_hercond_x DOUBLE COMPLEX for zla_hercond_x Array, DIMENSION (lda, *). On entry, the n-by-n matrix A. The second dimension of a must be at least max(1,n).
lda INTEGER. The leading dimension of the array a. lda ≥ max(1,n).
af COMPLEX for cla_hercond_x DOUBLE COMPLEX for zla_hercond_x Array, DIMENSION (ldaf, *). The block diagonal matrix D and the multipliers used to obtain the factor U or L as computed by ?hetrf. The second dimension of af must be at least max(1,n).
ldaf INTEGER. The leading dimension of the array af. ldaf ≥ max(1,n).
ipiv INTEGER. Array with DIMENSION n. Details of the interchanges and the block structure of D as determined by ?hetrf.
x COMPLEX for cla_hercond_x DOUBLE COMPLEX for zla_hercond_x Array x with DIMENSION n. The vector x in the formula op(A) * diag(x).
work COMPLEX for cla_hercond_x DOUBLE COMPLEX for zla_hercond_x Array DIMENSION 2*n. Workspace.
rwork REAL for cla_hercond_x DOUBLE PRECISION for zla_hercond_x Array DIMENSION n. Workspace.
Output Parameters
info INTEGER. If info = 0, the execution is successful. If i > 0, the i-th parameter is invalid.
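A small worked case may help with the op(A)*diag(x) form used by ?la_hercond_x. The Python sketch below evaluates the textbook definition norm(M,inf)*norm(inv(M),inf) for M = A*diag(x) on a 2-by-2 complex Hermitian matrix, where the inverse has a closed form. Because diag(x) only rescales columns of A and rows of inv(A), this textbook value depends on x only through the moduli abs(x(j)), which the sketch checks. The function names are hypothetical, the data are made up for illustration, and the value MKL returns is computed from the ?hetrf factors and may not coincide with this definition.

```python
import math


def inf_norm(m):
    """Infinity norm: the largest absolute row sum."""
    return max(sum(abs(v) for v in row) for row in m)


def inverse_2x2(m):
    """Closed-form inverse of a 2-by-2 matrix."""
    (p, q), (r, s) = m
    det = p * s - q * r
    if det == 0:
        raise ValueError("matrix is singular")
    return [[s / det, -q / det], [-r / det, p / det]]


def cond_inf_diag_x(a, x):
    """Textbook cond_inf of A*diag(x) for a 2-by-2 matrix A."""
    # Right-multiplying by diag(x) multiplies column j of A by x[j].
    m = [[a_ij * x_j for a_ij, x_j in zip(row, x)] for row in a]
    return inf_norm(m) * inf_norm(inverse_2x2(m))


a = [[2, 1j], [-1j, 3]]        # Hermitian: a[1][0] is the conjugate of a[0][1]
x = [1 + 1j, 2]                # illustrative complex scaling vector

k = cond_inf_diag_x(a, x)
k_abs = cond_inf_diag_x(a, [abs(v) for v in x])
print(k, k_abs)                # the two values agree up to rounding
```

Working the example by hand gives (6 + sqrt(2)) * 0.8/sqrt(2) = 0.8*(3*sqrt(2) + 1) for both calls.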
See Also
?hetrf
?la_herfsx_extended
Improves the computed solution to a system of linear equations for Hermitian indefinite matrices by performing extra-precise iterative refinement and provides error bounds and backward error estimates for the solution.
Syntax
Fortran 77:
call cla_herfsx_extended( prec_type, uplo, n, nrhs, a, lda, af, ldaf, ipiv, colequ, c, b, ldb, y, ldy, berr_out, n_norms, err_bnds_norm, err_bnds_comp, res, ayb, dy, y_tail, rcond, ithresh, rthresh, dz_ub, ignore_cwise, info )
call zla_herfsx_extended( prec_type, uplo, n, nrhs, a, lda, af, ldaf, ipiv, colequ, c, b, ldb, y, ldy, berr_out, n_norms, err_bnds_norm, err_bnds_comp, res, ayb, dy, y_tail, rcond, ithresh, rthresh, dz_ub, ignore_cwise, info )
Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h
Description
The ?la_herfsx_extended subroutine improves the computed solution to a system of linear equations by performing extra-precise iterative refinement and provides error bounds and backward error estimates for the solution. The ?herfsx routine calls ?la_herfsx_extended to perform iterative refinement. In addition to normwise error bound, the code provides maximum componentwise error bound, if possible. See comments for err_bnds_norm and err_bnds_comp for details of the error bounds. Use ?la_herfsx_extended to set only the second fields of err_bnds_norm and err_bnds_comp.
Input Parameters
prec_type INTEGER. Specifies the intermediate precision to be used in refinement. The value is defined by ilaprec(p), where p is a CHARACTER and: If p = 'S': Single. If p = 'D': Double. If p = 'I': Indigenous. If p = 'X', 'E': Extra.
uplo CHARACTER*1. Must be 'U' or 'L'. Specifies the triangle of A to store: If uplo = 'U', the upper triangle of A is stored, If uplo = 'L', the lower triangle of A is stored.
n INTEGER. The number of linear equations; the order of the matrix A; n ≥ 0.
nrhs INTEGER.
The number of right-hand sides; the number of columns of the matrix B.
a, af, b, y COMPLEX for cla_herfsx_extended DOUBLE COMPLEX for zla_herfsx_extended. Arrays: a(lda,*), af(ldaf,*), b(ldb,*), y(ldy,*). The array a contains the original n-by-n matrix A. The second dimension of a must be at least max(1,n). The array af contains the block diagonal matrix D and the multipliers used to obtain the factor U or L as computed by ?hetrf. The second dimension of af must be at least max(1,n). The array b contains the right-hand-side matrix B. The second dimension of b must be at least max(1,nrhs). The array y on entry contains the solution matrix X as computed by ?hetrs. The second dimension of y must be at least max(1,nrhs).
lda INTEGER. The leading dimension of the array a; lda ≥ max(1,n).
ldaf INTEGER. The leading dimension of the array af; ldaf ≥ max(1,n).
ipiv INTEGER. Array, DIMENSION n. Details of the interchanges and the block structure of D as determined by ?hetrf.
colequ LOGICAL. If colequ = .TRUE., column equilibration was done to A before calling this routine. This is needed to compute the solution and error bounds correctly.
c REAL for cla_herfsx_extended DOUBLE PRECISION for zla_herfsx_extended. c contains the column scale factors for A. If colequ = .FALSE., c is not used. If c is input, each element of c should be a power of the radix to ensure a reliable solution and error estimates. Scaling by power of the radix does not cause rounding errors unless the result underflows or overflows. Rounding errors during scaling lead to refining with a matrix that is not equivalent to the input matrix, producing error estimates that may not be reliable.
ldb INTEGER. The leading dimension of the array b; ldb ≥ max(1, n).
ldy INTEGER. The leading dimension of the array y; ldy ≥ max(1, n).
n_norms INTEGER. Determines which error bounds to return.
See err_bnds_norm and err_bnds_comp descriptions in Output Arguments section below. If n_norms = 1, returns normwise error bounds. If n_norms = 2, returns componentwise error bounds.
err_bnds_norm REAL for cla_herfsx_extended DOUBLE PRECISION for zla_herfsx_extended. Array, DIMENSION (nrhs,n_err_bnds). For each right-hand side, contains information about various error bounds and condition numbers corresponding to the normwise relative error. Normwise relative error in the i-th solution vector is defined as follows: max_j abs(xtrue(j,i) - x(j,i)) / max_j abs(x(j,i)), where xtrue is the true solution and x is the computed solution. The array is indexed by the type of error information as described below. There are currently up to three pieces of information returned. The first index in err_bnds_norm(i,:) corresponds to the i-th right-hand side. The second index in err_bnds_norm(:,err) contains the following three fields:
err=1 "Trust/don't trust" boolean. Trust the answer if the reciprocal condition number is less than the threshold sqrt(n)*slamch(e) for cla_herfsx_extended and sqrt(n)*dlamch(e) for zla_herfsx_extended.
err=2 "Guaranteed" error bound. The estimated forward error, almost certainly within a factor of 10 of the true error so long as the next entry is greater than the threshold sqrt(n)*slamch(e) for cla_herfsx_extended and sqrt(n)*dlamch(e) for zla_herfsx_extended. This error bound should only be trusted if the previous boolean is true.
err=3 Reciprocal condition number. Estimated normwise reciprocal condition number. Compared with the threshold sqrt(n)*slamch(e) for cla_herfsx_extended and sqrt(n)*dlamch(e) for zla_herfsx_extended to determine if the error estimate is "guaranteed". These reciprocal condition numbers are 1/(norm(1/z,inf)*norm(z,inf)) for some appropriately scaled matrix Z. Let z=s*a, where s scales each row by a power of the radix so all absolute row sums of z are approximately 1. Use this subroutine to set only the second field above.
err_bnds_comp REAL for cla_herfsx_extended DOUBLE PRECISION for zla_herfsx_extended. Array, DIMENSION (nrhs,n_err_bnds). For each right-hand side, contains information about various error bounds and condition numbers corresponding to the componentwise relative error, which is defined as follows: Componentwise relative error in the i-th solution vector: max_j ( abs(xtrue(j,i) - x(j,i)) / abs(x(j,i)) ), where xtrue is the true solution and x is the computed solution. The array is indexed by the right-hand side i, on which the componentwise relative error depends, and by the type of error information as described below. There are currently up to three pieces of information returned for each right-hand side. If componentwise accuracy is not requested (params(3) = 0.0), then err_bnds_comp is not accessed. If n_err_bnds < 3, then at most the first (:,n_err_bnds) entries are returned. The first index in err_bnds_comp(i,:) corresponds to the i-th right-hand side. The second index in err_bnds_comp(:,err) contains the following three fields:
err=1 "Trust/don't trust" boolean. Trust the answer if the reciprocal condition number is less than the threshold sqrt(n)*slamch(e) for cla_herfsx_extended and sqrt(n)*dlamch(e) for zla_herfsx_extended.
err=2 "Guaranteed" error bound. The estimated forward error, almost certainly within a factor of 10 of the true error so long as the next entry is greater than the threshold sqrt(n)*slamch(e) for cla_herfsx_extended and sqrt(n)*dlamch(e) for zla_herfsx_extended. This error bound should only be trusted if the previous boolean is true.
err=3 Reciprocal condition number. Estimated componentwise reciprocal condition number. Compared with the threshold sqrt(n)*slamch(e) for cla_herfsx_extended and sqrt(n)*dlamch(e) for zla_herfsx_extended to determine if the error estimate is "guaranteed". These reciprocal condition numbers are 1/(norm(1/z,inf)*norm(z,inf)) for some appropriately scaled matrix Z.
Let z=s*(a*diag(x)), where x is the solution for the current right-hand side and s scales each row of a*diag(x) by a power of the radix so all absolute row sums of z are approximately 1. Use this subroutine to set only the second field above.
res, dy, y_tail COMPLEX for cla_herfsx_extended DOUBLE COMPLEX for zla_herfsx_extended. Workspace arrays of DIMENSION n. res holds the intermediate residual. dy holds the intermediate solution. y_tail holds the trailing bits of the intermediate solution.
ayb REAL for cla_herfsx_extended DOUBLE PRECISION for zla_herfsx_extended. Workspace array, DIMENSION n.
rcond REAL for cla_herfsx_extended DOUBLE PRECISION for zla_herfsx_extended. Reciprocal scaled condition number. An estimate of the reciprocal Skeel condition number of the matrix A after equilibration (if done). If rcond is less than the machine precision, in particular, if rcond = 0, the matrix is singular to working precision. Note that the error may still be small even if this number is very small and the matrix appears ill-conditioned.
ithresh INTEGER. The maximum number of residual computations allowed for refinement. The default is 10. For 'aggressive', set to 100 to permit convergence using approximate factorizations or factorizations other than LU. If the factorization uses a technique other than Gaussian elimination, the guarantees in err_bnds_norm and err_bnds_comp may no longer be trustworthy.
rthresh REAL for cla_herfsx_extended DOUBLE PRECISION for zla_herfsx_extended. Determines when to stop refinement if the error estimate stops decreasing. Refinement stops when the next solution no longer satisfies norm(dx_{i+1}) < rthresh * norm(dx_i) where norm(z) is the infinity norm of Z. rthresh satisfies 0 < rthresh ≤ 1. The default value is 0.5. For 'aggressive' set to 0.9 to permit convergence on extremely ill-conditioned matrices.
dz_ub REAL for cla_herfsx_extended DOUBLE PRECISION for zla_herfsx_extended.
Determines when to start considering componentwise convergence. Componentwise convergence is only considered after each component of the solution y is stable, that is, the relative change in each component is less than dz_ub. The default value is 0.25, requiring the first bit to be stable.
ignore_cwise LOGICAL. If .TRUE., the function ignores componentwise convergence. Default value is .FALSE.
Output Parameters
y COMPLEX for cla_herfsx_extended DOUBLE COMPLEX for zla_herfsx_extended. The improved solution matrix Y.
berr_out REAL for cla_herfsx_extended DOUBLE PRECISION for zla_herfsx_extended. Array, DIMENSION nrhs. berr_out(j) contains the componentwise relative backward error for right-hand-side j from the formula max(i) ( abs(res(i)) / ( abs(op(A))*abs(y) + abs(B) )(i) ) where abs(z) is the componentwise absolute value of the matrix or vector Z. This is computed by ?la_lin_berr.
err_bnds_norm, err_bnds_comp Values of the corresponding input parameters improved after iterative refinement and stored in the second column of the array ( 1:nrhs, 2 ). The other elements are kept unchanged.
info INTEGER. If info = 0, the execution is successful. The solution to every right-hand side is guaranteed. If info = -i, the i-th parameter had an illegal value.
See Also
?herfsx ?hetrf ?hetrs ?lamch ilaprec ilatrans ?la_lin_berr
?la_herpvgrw
Computes the reciprocal pivot growth factor norm(A)/norm(U) for a Hermitian indefinite matrix.
Syntax
Fortran 77:
call cla_herpvgrw( uplo, n, info, a, lda, af, ldaf, ipiv, work )
call zla_herpvgrw( uplo, n, info, a, lda, af, ldaf, ipiv, work )
Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h
Description
The ?la_herpvgrw routine computes the reciprocal pivot growth factor norm(A)/norm(U). The max absolute element norm is used. If this is much less than 1, the stability of the LU factorization of the equilibrated matrix A could be poor.
This also means that the solution X, estimated condition numbers, and error bounds could be unreliable.
Input Parameters
uplo CHARACTER*1. Must be 'U' or 'L'. Specifies the triangle of A to store: If uplo = 'U', the upper triangle of A is stored, If uplo = 'L', the lower triangle of A is stored.
n INTEGER. The number of linear equations, the order of the matrix A; n ≥ 0.
info INTEGER. The value of INFO returned from ?hetrf, that is, the pivot in column info is exactly 0.
a, af COMPLEX for cla_herpvgrw DOUBLE COMPLEX for zla_herpvgrw. Arrays: a(lda,*), af(ldaf,*). a contains the n-by-n matrix A. The second dimension of a must be at least max(1,n). af contains the block diagonal matrix D and the multipliers used to obtain the factor U or L as computed by ?hetrf. The second dimension of af must be at least max(1,n).
lda INTEGER. The leading dimension of array a; lda ≥ max(1,n).
ldaf INTEGER. The leading dimension of array af; ldaf ≥ max(1,n).
ipiv INTEGER. Array, DIMENSION n. Details of the interchanges and the block structure of D as determined by ?hetrf.
work REAL for cla_herpvgrw DOUBLE PRECISION for zla_herpvgrw. Array, DIMENSION 2*n. Workspace.
See Also
?hetrf
?la_lin_berr
Computes component-wise relative backward error.
Syntax
Fortran 77:
call sla_lin_berr(n, nz, nrhs, res, ayb, berr )
call dla_lin_berr(n, nz, nrhs, res, ayb, berr )
call cla_lin_berr(n, nz, nrhs, res, ayb, berr )
call zla_lin_berr(n, nz, nrhs, res, ayb, berr )
Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h
Description
The ?la_lin_berr routine computes a component-wise relative backward error from the formula: max(i) ( abs(R(i))/( abs(op(A_s))*abs(Y) + abs(B_s) )(i) ) where abs(Z) is the component-wise absolute value of the matrix or vector Z.
Input Parameters
n INTEGER. The number of linear equations, the order of the matrix A; n ≥ 0.
nz INTEGER.
The parameter for guarding against spuriously zero residuals. (nz+1)*slamch( 'Safe minimum' ) is added to R(i) in the numerator of the relative backward error formula. The default value is n.
nrhs INTEGER. Number of right-hand sides, the number of columns in the matrices AYB, RES, and BERR; nrhs ≥ 0.
res, ayb REAL for sla_lin_berr, cla_lin_berr DOUBLE PRECISION for dla_lin_berr, zla_lin_berr Arrays, DIMENSION (n,nrhs). res is the residual matrix, that is, the matrix R in the relative backward error formula. ayb is the denominator of that formula, that is, the matrix abs(op(A_s))*abs(Y) + abs(B_s). The matrices A, Y, and B are from iterative refinement. See description of ?la_gerfsx_extended.
Output Parameters
berr REAL for sla_lin_berr DOUBLE PRECISION for dla_lin_berr COMPLEX for cla_lin_berr DOUBLE COMPLEX for zla_lin_berr The component-wise relative backward error.
See Also
?lamch ?la_gerfsx_extended
?la_porcond
Estimates the Skeel condition number for a symmetric positive-definite matrix.
Syntax
Fortran 77:
call sla_porcond( uplo, n, a, lda, af, ldaf, cmode, c, info, work, iwork )
call dla_porcond( uplo, n, a, lda, af, ldaf, cmode, c, info, work, iwork )
Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h
Description
The function estimates the Skeel condition number of op(A) * op2(C) where the cmode parameter determines op2 as follows:
cmode Value    op2(C)
1              C
0              I
-1             inv(C)
The Skeel condition number cond(A) = norminf(|inv(A)||A|) is computed by computing scaling factors R such that diag(R)*A*op2(C) is row equilibrated and by computing the standard infinity-norm condition number.
Input Parameters
uplo CHARACTER*1. Must be 'U' or 'L'. Specifies the triangle of A to store: If uplo = 'U', the upper triangle of A is stored, If uplo = 'L', the lower triangle of A is stored.
n INTEGER. The number of linear equations, that is, the order of the matrix A; n ≥ 0.
a, af, c, work REAL for sla_porcond DOUBLE PRECISION for dla_porcond Arrays: a (lda,*) contains the n-by-n matrix A. af (ldaf,*) contains the triangular factor L or U from the Cholesky factorization A = U^T*U or A = L*L^T, as computed by ?potrf. c, DIMENSION n. The vector C in the formula op(A) * op2(C). work is a workspace array of DIMENSION (3*n). The second dimension of a and af must be at least max(1, n).
lda INTEGER. The leading dimension of the array a. lda ≥ max(1,n).
ldaf INTEGER. The leading dimension of af. ldaf ≥ max(1,n).
cmode INTEGER. Determines op2(C) in the formula op(A) * op2(C) as follows: If cmode = 1, op2(C) = C. If cmode = 0, op2(C) = I. If cmode = -1, op2(C) = inv(C).
iwork INTEGER. Workspace array with DIMENSION n.
Output Parameters
info INTEGER. If info = 0, the execution is successful. If i > 0, the i-th parameter is invalid.
See Also
?potrf
?la_porcond_c
Computes the infinity norm condition number of op(A)*inv(diag(c)) for Hermitian positive-definite matrices.
Syntax
Fortran 77:
call cla_porcond_c( uplo, n, a, lda, af, ldaf, c, capply, info, work, rwork )
call zla_porcond_c( uplo, n, a, lda, af, ldaf, c, capply, info, work, rwork )
Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h
Description
The function computes the infinity norm condition number of op(A) * inv(diag(c)), where c is a REAL vector for cla_porcond_c and a DOUBLE PRECISION vector for zla_porcond_c.
Input Parameters
uplo CHARACTER*1. Must be 'U' or 'L'. Specifies the triangle of A to store: If uplo = 'U', the upper triangle of A is stored, If uplo = 'L', the lower triangle of A is stored.
n INTEGER. The number of linear equations, that is, the order of the matrix A; n ≥ 0.
a COMPLEX for cla_porcond_c DOUBLE COMPLEX for zla_porcond_c Array, DIMENSION (lda, *). On entry, the n-by-n matrix A. The second dimension of a must be at least max(1,n).
lda INTEGER. The leading dimension of the array a. lda ≥ max(1,n).
af COMPLEX for cla_porcond_c DOUBLE COMPLEX for zla_porcond_c Array, DIMENSION (ldaf, *). The triangular factor L or U from the Cholesky factorization A = U^H*U or A = L*L^H, as computed by ?potrf. The second dimension of af must be at least max(1,n).
ldaf INTEGER. The leading dimension of the array af. ldaf ≥ max(1,n).
c REAL for cla_porcond_c DOUBLE PRECISION for zla_porcond_c Array c with DIMENSION n. The vector c in the formula op(A) * inv(diag(c)).
capply LOGICAL. If .TRUE., then the function uses the vector c from the formula op(A) * inv(diag(c)).
work COMPLEX for cla_porcond_c DOUBLE COMPLEX for zla_porcond_c Array DIMENSION 2*n. Workspace.
rwork REAL for cla_porcond_c DOUBLE PRECISION for zla_porcond_c Array DIMENSION n. Workspace.
Output Parameters
info INTEGER. If info = 0, the execution is successful. If i > 0, the i-th parameter is invalid.
See Also
?potrf
?la_porcond_x
Computes the infinity norm condition number of op(A)*diag(x) for Hermitian positive-definite matrices.
Syntax
Fortran 77:
call cla_porcond_x( uplo, n, a, lda, af, ldaf, x, info, work, rwork )
call zla_porcond_x( uplo, n, a, lda, af, ldaf, x, info, work, rwork )
Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h
Description
The function computes the infinity norm condition number of op(A) * diag(x), where x is a COMPLEX vector for cla_porcond_x and a DOUBLE COMPLEX vector for zla_porcond_x.
Input Parameters
uplo CHARACTER*1. Must be 'U' or 'L'. Specifies the triangle of A to store: If uplo = 'U', the upper triangle of A is stored, If uplo = 'L', the lower triangle of A is stored.
n INTEGER. The number of linear equations, that is, the order of the matrix A; n ≥ 0.
a COMPLEX for cla_porcond_x DOUBLE COMPLEX for zla_porcond_x Array, DIMENSION (lda, *). On entry, the n-by-n matrix A. The second dimension of a must be at least max(1,n).
lda INTEGER. The leading dimension of the array a. lda ≥ max(1,n).
af COMPLEX for cla_porcond_x DOUBLE COMPLEX for zla_porcond_x Array, DIMENSION (ldaf, *). The triangular factor L or U from the Cholesky factorization A = U^H*U or A = L*L^H, as computed by ?potrf. The second dimension of af must be at least max(1,n).
ldaf INTEGER. The leading dimension of the array af. ldaf ≥ max(1,n).
x COMPLEX for cla_porcond_x DOUBLE COMPLEX for zla_porcond_x Array x with DIMENSION n. The vector x in the formula op(A) * diag(x).
work COMPLEX for cla_porcond_x DOUBLE COMPLEX for zla_porcond_x Array DIMENSION 2*n. Workspace.
rwork REAL for cla_porcond_x DOUBLE PRECISION for zla_porcond_x Array DIMENSION n. Workspace.
Output Parameters
info INTEGER. If info = 0, the execution is successful. If i > 0, the i-th parameter is invalid.
See Also
?potrf
?la_porfsx_extended
Improves the computed solution to a system of linear equations for symmetric or Hermitian positive-definite matrices by performing extra-precise iterative refinement and provides error bounds and backward error estimates for the solution.
Syntax
Fortran 77:
call sla_porfsx_extended( prec_type, uplo, n, nrhs, a, lda, af, ldaf, colequ, c, b, ldb, y, ldy, berr_out, n_norms, err_bnds_norm, err_bnds_comp, res, ayb, dy, y_tail, rcond, ithresh, rthresh, dz_ub, ignore_cwise, info )
call dla_porfsx_extended( prec_type, uplo, n, nrhs, a, lda, af, ldaf, colequ, c, b, ldb, y, ldy, berr_out, n_norms, err_bnds_norm, err_bnds_comp, res, ayb, dy, y_tail, rcond, ithresh, rthresh, dz_ub, ignore_cwise, info )
call cla_porfsx_extended( prec_type, uplo, n, nrhs, a, lda, af, ldaf, colequ, c, b, ldb, y, ldy, berr_out, n_norms, err_bnds_norm, err_bnds_comp, res, ayb, dy, y_tail, rcond, ithresh, rthresh, dz_ub, ignore_cwise, info )
call zla_porfsx_extended( prec_type, uplo, n, nrhs, a, lda, af, ldaf, colequ, c, b, ldb, y, ldy, berr_out, n_norms, err_bnds_norm, err_bnds_comp, res, ayb, dy, y_tail, rcond, ithresh, rthresh, dz_ub, ignore_cwise, info )
Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h
Description
The ?la_porfsx_extended subroutine improves the computed solution to a system of linear equations by performing extra-precise iterative refinement and provides error bounds and backward error estimates for the solution. The ?porfsx routine calls ?la_porfsx_extended to perform iterative refinement. In addition to normwise error bound, the code provides maximum componentwise error bound, if possible. See comments for err_bnds_norm and err_bnds_comp for details of the error bounds. Use ?la_porfsx_extended to set only the second fields of err_bnds_norm and err_bnds_comp.
Input Parameters
prec_type INTEGER. Specifies the intermediate precision to be used in refinement. The value is defined by ilaprec(p), where p is a CHARACTER and: If p = 'S': Single. If p = 'D': Double. If p = 'I': Indigenous. If p = 'X', 'E': Extra.
uplo CHARACTER*1. Must be 'U' or 'L'.
Specifies the triangle of A to store: If uplo = 'U', the upper triangle of A is stored, If uplo = 'L', the lower triangle of A is stored.
n INTEGER. The number of linear equations; the order of the matrix A; n ≥ 0.
nrhs INTEGER. The number of right-hand sides; the number of columns of the matrix B.
a, af, b, y REAL for sla_porfsx_extended DOUBLE PRECISION for dla_porfsx_extended COMPLEX for cla_porfsx_extended DOUBLE COMPLEX for zla_porfsx_extended. Arrays: a(lda,*), af(ldaf,*), b(ldb,*), y(ldy,*). The array a contains the original n-by-n matrix A. The second dimension of a must be at least max(1,n). The array af contains the triangular factor L or U from the Cholesky factorization as computed by ?potrf: A = U^T*U or A = L*L^T for real flavors, A = U^H*U or A = L*L^H for complex flavors. The second dimension of af must be at least max(1,n). The array b contains the right-hand-side matrix B. The second dimension of b must be at least max(1,nrhs). The array y on entry contains the solution matrix X as computed by ?potrs. The second dimension of y must be at least max(1,nrhs).
lda INTEGER. The leading dimension of the array a; lda ≥ max(1,n).
ldaf INTEGER. The leading dimension of the array af; ldaf ≥ max(1,n).
colequ LOGICAL. If colequ = .TRUE., column equilibration was done to A before calling this routine. This is needed to compute the solution and error bounds correctly.
c REAL for sla_porfsx_extended and cla_porfsx_extended DOUBLE PRECISION for dla_porfsx_extended and zla_porfsx_extended. c contains the column scale factors for A. If colequ = .FALSE., c is not used. If c is input, each element of c should be a power of the radix to ensure a reliable solution and error estimates. Scaling by power of the radix does not cause rounding errors unless the result underflows or overflows. Rounding errors during scaling lead to refining with a matrix that is not equivalent to the input matrix, producing error estimates that may not be reliable.
ldb INTEGER.
The leading dimension of the array b; ldb ≥ max(1, n).
ldy INTEGER. The leading dimension of the array y; ldy ≥ max(1, n).
n_norms INTEGER. Determines which error bounds to return. See the err_bnds_norm and err_bnds_comp descriptions in the Output Parameters section below.
If n_norms ≥ 1, returns normwise error bounds.
If n_norms ≥ 2, returns componentwise error bounds.
err_bnds_norm REAL for sla_porfsx_extended and cla_porfsx_extended
DOUBLE PRECISION for dla_porfsx_extended and zla_porfsx_extended.
Array, DIMENSION (nrhs,n_err_bnds). For each right-hand side, contains information about various error bounds and condition numbers corresponding to the normwise relative error.
Normwise relative error in the i-th solution vector is defined as follows:
max_j abs(xtrue(j,i) - x(j,i)) / max_j abs(x(j,i))
The array is indexed by the type of error information as described below. There are currently up to three pieces of information returned.
The first index in err_bnds_norm(i,:) corresponds to the i-th right-hand side.
The second index in err_bnds_norm(:,err) contains the following three fields:
err=1 "Trust/don't trust" boolean. Trust the answer if the reciprocal condition number is less than the threshold sqrt(n)*slamch(e) for sla_porfsx_extended/cla_porfsx_extended and sqrt(n)*dlamch(e) for dla_porfsx_extended/zla_porfsx_extended.
err=2 "Guaranteed" error bound. The estimated forward error, almost certainly within a factor of 10 of the true error so long as the next entry is greater than the threshold sqrt(n)*slamch(e) for sla_porfsx_extended/cla_porfsx_extended and sqrt(n)*dlamch(e) for dla_porfsx_extended/zla_porfsx_extended. This error bound should only be trusted if the previous boolean is true.
err=3 Reciprocal condition number. Estimated normwise reciprocal condition number.
Compared with the threshold sqrt(n)*slamch(e) for sla_porfsx_extended/cla_porfsx_extended and sqrt(n)*dlamch(e) for dla_porfsx_extended/zla_porfsx_extended to determine if the error estimate is "guaranteed". These reciprocal condition numbers are 1/(norm(1/z,inf)*norm(z,inf)) for some appropriately scaled matrix Z.
Let z=s*a, where s scales each row by a power of the radix so all absolute row sums of z are approximately 1.
Use this subroutine to set only the second field above.
err_bnds_comp REAL for sla_porfsx_extended and cla_porfsx_extended
DOUBLE PRECISION for dla_porfsx_extended and zla_porfsx_extended.
Array, DIMENSION (nrhs,n_err_bnds). For each right-hand side, contains information about various error bounds and condition numbers corresponding to the componentwise relative error, which is defined as follows:
Componentwise relative error in the i-th solution vector:
max_j ( abs(xtrue(j,i) - x(j,i)) / abs(x(j,i)) )
The array is indexed by the right-hand side i, on which the componentwise relative error depends, and by the type of error information as described below. There are currently up to three pieces of information returned for each right-hand side. If componentwise accuracy is not requested (params(3) = 0.0), then err_bnds_comp is not accessed. If n_err_bnds < 3, then at most the first (:,n_err_bnds) entries are returned.
The first index in err_bnds_comp(i,:) corresponds to the i-th right-hand side.
The second index in err_bnds_comp(:,err) contains the following three fields:
err=1 "Trust/don't trust" boolean. Trust the answer if the reciprocal condition number is less than the threshold sqrt(n)*slamch(e) for sla_porfsx_extended/cla_porfsx_extended and sqrt(n)*dlamch(e) for dla_porfsx_extended/zla_porfsx_extended.
err=2 "Guaranteed" error bound.
The estimated forward error, almost certainly within a factor of 10 of the true error so long as the next entry is greater than the threshold sqrt(n)*slamch(e) for sla_porfsx_extended/cla_porfsx_extended and sqrt(n)*dlamch(e) for dla_porfsx_extended/zla_porfsx_extended. This error bound should only be trusted if the previous boolean is true.
err=3 Reciprocal condition number. Estimated componentwise reciprocal condition number. Compared with the threshold sqrt(n)*slamch(e) for sla_porfsx_extended/cla_porfsx_extended and sqrt(n)*dlamch(e) for dla_porfsx_extended/zla_porfsx_extended to determine if the error estimate is "guaranteed". These reciprocal condition numbers are 1/(norm(1/z,inf)*norm(z,inf)) for some appropriately scaled matrix Z.
Let z=s*(a*diag(x)), where x is the solution for the current right-hand side and s scales each row of a*diag(x) by a power of the radix so all absolute row sums of z are approximately 1.
Use this subroutine to set only the second field above.
res, dy, y_tail REAL for sla_porfsx_extended
DOUBLE PRECISION for dla_porfsx_extended
COMPLEX for cla_porfsx_extended
DOUBLE COMPLEX for zla_porfsx_extended.
Workspace arrays of DIMENSION n.
res holds the intermediate residual.
dy holds the intermediate solution.
y_tail holds the trailing bits of the intermediate solution.
ayb REAL for sla_porfsx_extended and cla_porfsx_extended
DOUBLE PRECISION for dla_porfsx_extended and zla_porfsx_extended.
Workspace array, DIMENSION n.
rcond REAL for sla_porfsx_extended and cla_porfsx_extended
DOUBLE PRECISION for dla_porfsx_extended and zla_porfsx_extended.
Reciprocal scaled condition number. An estimate of the reciprocal Skeel condition number of the matrix A after equilibration (if done). If rcond is less than the machine precision, in particular, if rcond = 0, the matrix is singular to working precision.
Note that the error may still be small even if this number is very small and the matrix appears ill-conditioned.
ithresh INTEGER. The maximum number of residual computations allowed for refinement. The default is 10. For 'aggressive', set to 100 to permit convergence using approximate factorizations or factorizations other than LU. If the factorization uses a technique other than Gaussian elimination, the guarantees in err_bnds_norm and err_bnds_comp may no longer be trustworthy.
rthresh REAL for sla_porfsx_extended and cla_porfsx_extended
DOUBLE PRECISION for dla_porfsx_extended and zla_porfsx_extended.
Determines when to stop refinement if the error estimate stops decreasing. Refinement stops when the next solution no longer satisfies
norm(dx_{i+1}) < rthresh * norm(dx_i)
where norm(z) is the infinity norm of Z.
rthresh satisfies 0 < rthresh ≤ 1.
The default value is 0.5. For 'aggressive', set to 0.9 to permit convergence on extremely ill-conditioned matrices.
dz_ub REAL for sla_porfsx_extended and cla_porfsx_extended
DOUBLE PRECISION for dla_porfsx_extended and zla_porfsx_extended.
Determines when to start considering componentwise convergence. Componentwise convergence is only considered after each component of the solution y is stable, that is, the relative change in each component is less than dz_ub. The default value is 0.25, requiring the first bit to be stable.
ignore_cwise LOGICAL. If .TRUE., the function ignores componentwise convergence. The default value is .FALSE.
Output Parameters
y REAL for sla_porfsx_extended
DOUBLE PRECISION for dla_porfsx_extended
COMPLEX for cla_porfsx_extended
DOUBLE COMPLEX for zla_porfsx_extended.
The improved solution matrix Y.
berr_out REAL for sla_porfsx_extended and cla_porfsx_extended
DOUBLE PRECISION for dla_porfsx_extended and zla_porfsx_extended.
Array, DIMENSION nrhs.
berr_out(j) contains the componentwise relative backward error for right-hand side j from the formula
max(i) ( abs(res(i)) / ( abs(op(A))*abs(y) + abs(B) )(i) )
where abs(z) is the componentwise absolute value of the matrix or vector Z. This is computed by ?la_lin_berr.
err_bnds_norm, err_bnds_comp Values of the corresponding input parameters improved after iterative refinement and stored in the second column of the array ( 1:nrhs, 2 ). The other elements are kept unchanged.
info INTEGER. If info = 0, the execution is successful. The solution to every right-hand side is guaranteed.
If info = -i, the i-th parameter had an illegal value.
See Also
?porfsx
?potrf
?potrs
?lamch
ilaprec
ilatrans
?la_lin_berr

?la_porpvgrw
Computes the reciprocal pivot growth factor norm(A)/norm(U) for a symmetric or Hermitian positive-definite matrix.
Syntax
Fortran 77:
call sla_porpvgrw( uplo, ncols, a, lda, af, ldaf, work )
call dla_porpvgrw( uplo, ncols, a, lda, af, ldaf, work )
call cla_porpvgrw( uplo, ncols, a, lda, af, ldaf, work )
call zla_porpvgrw( uplo, ncols, a, lda, af, ldaf, work )
Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h
Description
The ?la_porpvgrw routine computes the reciprocal pivot growth factor norm(A)/norm(U). The max absolute element norm is used. If this is much less than 1, the stability of the LU factorization of the equilibrated matrix A could be poor. This also means that the solution X, estimated condition numbers, and error bounds could be unreliable.
Input Parameters
uplo CHARACTER*1. Must be 'U' or 'L'.
Specifies which triangle of A is stored:
If uplo = 'U', the upper triangle of A is stored.
If uplo = 'L', the lower triangle of A is stored.
ncols INTEGER. The number of columns of the matrix A; ncols ≥ 0.
a, af REAL for sla_porpvgrw
DOUBLE PRECISION for dla_porpvgrw
COMPLEX for cla_porpvgrw
DOUBLE COMPLEX for zla_porpvgrw.
Arrays: a(lda,*), af(ldaf,*).
The array a contains the input n-by-n matrix A. The second dimension of a must be at least max(1,n).
The array af contains the triangular factor L or U from the Cholesky factorization as computed by ?potrf: A = U^T*U or A = L*L^T for real flavors, A = U^H*U or A = L*L^H for complex flavors. The second dimension of af must be at least max(1,n).
lda INTEGER. The leading dimension of a; lda ≥ max(1,n).
ldaf INTEGER. The leading dimension of af; ldaf ≥ max(1,n).
work REAL for sla_porpvgrw and cla_porpvgrw
DOUBLE PRECISION for dla_porpvgrw and zla_porpvgrw.
Workspace array, dimension 2*n.
See Also
?potrf

?laqhe
Scales a Hermitian matrix.
Syntax
call claqhe( uplo, n, a, lda, s, scond, amax, equed )
call zlaqhe( uplo, n, a, lda, s, scond, amax, equed )
Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h
Description
The routine equilibrates a Hermitian matrix A using the scaling factors in the vector s.
Input Parameters
uplo CHARACTER*1. Specifies whether the upper or lower triangular part of the Hermitian matrix A is stored.
If uplo = 'U', the upper triangular part of A is stored; if uplo = 'L', the lower triangular part of A is stored.
n INTEGER. The order of the matrix A. n ≥ 0.
a COMPLEX for claqhe
DOUBLE COMPLEX for zlaqhe
Array, DIMENSION (lda,n). On entry, the Hermitian matrix A. If uplo = 'U', the leading n-by-n upper triangular part of a contains the upper triangular part of matrix A and the strictly lower triangular part of a is not referenced. If uplo = 'L', the leading n-by-n lower triangular part of a contains the lower triangular part of matrix A and the strictly upper triangular part of a is not referenced.
lda INTEGER. The leading dimension of the array a. lda ≥ max(n,1).
s REAL for claqhe
DOUBLE PRECISION for zlaqhe
Array, DIMENSION (n). The scale factors for A.
scond REAL for claqhe
DOUBLE PRECISION for zlaqhe
Ratio of the smallest s(i) to the largest s(i).
amax REAL for claqhe
DOUBLE PRECISION for zlaqhe
Absolute value of the largest matrix entry.
Output Parameters
a If equed = 'Y', a contains the equilibrated matrix diag(s)*A*diag(s).
equed CHARACTER*1. Specifies whether or not equilibration was done.
If equed = 'N': No equilibration.
If equed = 'Y': Equilibration was done, that is, A has been replaced by diag(s)*A*diag(s).
Application Notes
The routine uses internal parameters thresh, large, and small. The parameter thresh is a threshold value used to decide if scaling should be done based on the ratio of the scaling factors. If scond < thresh, scaling is done.
The large and small parameters are threshold values used to decide if scaling should be done based on the absolute size of the largest matrix element. If amax > large or amax < small, scaling is done.

?laqhp
Scales a Hermitian matrix stored in packed form.
Syntax
call claqhp( uplo, n, ap, s, scond, amax, equed )
call zlaqhp( uplo, n, ap, s, scond, amax, equed )
Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h
Description
The routine equilibrates a Hermitian matrix A using the scaling factors in the vector s.
Input Parameters
uplo CHARACTER*1. Specifies whether the upper or lower triangular part of the Hermitian matrix A is stored.
If uplo = 'U', the upper triangular part of A is stored; if uplo = 'L', the lower triangular part of A is stored.
n INTEGER. The order of the matrix A. n ≥ 0.
ap COMPLEX for claqhp
DOUBLE COMPLEX for zlaqhp
Array, DIMENSION (n*(n+1)/2). The Hermitian matrix A.
• If uplo = 'U', the upper triangular part of the Hermitian matrix A is stored in the packed array ap as follows: ap(i+(j-1)*j/2) = A(i,j) for 1 ≤ i ≤ j.
• If uplo = 'L', the lower triangular part of the Hermitian matrix A is stored in the packed array ap as follows: ap(i+(j-1)*(2*n-j)/2) = A(i,j) for j ≤ i ≤ n.
s REAL for claqhp
DOUBLE PRECISION for zlaqhp
Array, DIMENSION (n). The scale factors for A.
scond REAL for claqhp
DOUBLE PRECISION for zlaqhp
Ratio of the smallest s(i) to the largest s(i).
amax REAL for claqhp
DOUBLE PRECISION for zlaqhp
Absolute value of the largest matrix entry.
Output Parameters
ap If equed = 'Y', ap contains the equilibrated matrix diag(s)*A*diag(s) in the same storage format as on input.
equed CHARACTER*1. Specifies whether or not equilibration was done.
If equed = 'N': No equilibration.
If equed = 'Y': Equilibration was done, that is, A has been replaced by diag(s)*A*diag(s).
Application Notes
The routine uses internal parameters thresh, large, and small. The parameter thresh is a threshold value used to decide if scaling should be done based on the ratio of the scaling factors. If scond < thresh, scaling is done.
The large and small parameters are threshold values used to decide if scaling should be done based on the absolute size of the largest matrix element. If amax > large or amax < small, scaling is done.

?larcm
Multiplies a square real matrix by a complex matrix.
Syntax
call clarcm( m, n, a, lda, b, ldb, c, ldc, rwork )
call zlarcm( m, n, a, lda, b, ldb, c, ldc, rwork )
Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h
Description
The routine performs a simple matrix-matrix multiplication of the form
C = A*B,
where A is m-by-m and real, B is m-by-n and complex, C is m-by-n and complex.
Input Parameters
m INTEGER. The number of rows and columns of the matrix A and the number of rows of the matrix C (m ≥ 0).
n INTEGER. The number of columns of the matrix B and the number of columns of the matrix C (n ≥ 0).
a REAL for clarcm
DOUBLE PRECISION for zlarcm
Array, DIMENSION (lda, m). Contains the m-by-m matrix A.
lda INTEGER. The leading dimension of the array a, lda ≥ max(1, m).
b COMPLEX for clarcm
DOUBLE COMPLEX for zlarcm
Array, DIMENSION (ldb, n). Contains the m-by-n matrix B.
ldb INTEGER. The leading dimension of the array b, ldb ≥ max(1, m).
ldc INTEGER.
The leading dimension of the output array c, ldc ≥ max(1, m).
rwork REAL for clarcm
DOUBLE PRECISION for zlarcm
Workspace array, DIMENSION (2*m*n).
Output Parameters
c COMPLEX for clarcm
DOUBLE COMPLEX for zlarcm
Array, DIMENSION (ldc, n). Contains the m-by-n matrix C.

?la_rpvgrw
Computes the reciprocal pivot growth factor norm(A)/norm(U) for a general matrix.
Syntax
Fortran 77:
call sla_rpvgrw( n, ncols, a, lda, af, ldaf )
call dla_rpvgrw( n, ncols, a, lda, af, ldaf )
call cla_rpvgrw( n, ncols, a, lda, af, ldaf )
call zla_rpvgrw( n, ncols, a, lda, af, ldaf )
Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h
Description
The ?la_rpvgrw routine computes the reciprocal pivot growth factor norm(A)/norm(U). The max absolute element norm is used. If this is much less than 1, the stability of the LU factorization of the equilibrated matrix A could be poor. This also means that the solution X, estimated condition numbers, and error bounds could be unreliable.
Input Parameters
n INTEGER. The number of linear equations, the order of the matrix A; n ≥ 0.
ncols INTEGER. The number of columns of the matrix A; ncols ≥ 0.
a, af REAL for sla_rpvgrw
DOUBLE PRECISION for dla_rpvgrw
COMPLEX for cla_rpvgrw
DOUBLE COMPLEX for zla_rpvgrw.
Arrays: a(lda,*), af(ldaf,*).
The array a contains the input n-by-n matrix A. The second dimension of a must be at least max(1,n).
The array af contains the factors L and U from the factorization A = P*L*U as computed by ?getrf. The second dimension of af must be at least max(1,n).
lda INTEGER. The leading dimension of a; lda ≥ max(1,n).
ldaf INTEGER. The leading dimension of af; ldaf ≥ max(1,n).
See Also
?getrf

?larscl2
Performs reciprocal diagonal scaling on a vector.
Syntax
Fortran 77:
call slarscl2(m, n, d, x, ldx)
call dlarscl2(m, n, d, x, ldx)
call clarscl2(m, n, d, x, ldx)
call zlarscl2(m, n, d, x, ldx)
Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h
Description
The ?larscl2 routines perform reciprocal diagonal scaling on a vector
x := D^(-1)*x,
where:
x is a vector, and
D is a diagonal matrix.
Input Parameters
m INTEGER. Specifies the number of rows of the matrix D and the number of elements of the vector x. The value of m must be at least zero.
n INTEGER. The number of columns of D and x. The value of n must be at least zero.
d REAL for slarscl2 and clarscl2.
DOUBLE PRECISION for dlarscl2 and zlarscl2.
Array, DIMENSION m. Diagonal matrix D stored as a vector of length m.
x REAL for slarscl2.
DOUBLE PRECISION for dlarscl2.
COMPLEX for clarscl2.
DOUBLE COMPLEX for zlarscl2.
Array, DIMENSION (ldx,n). The vector x to scale by D.
ldx INTEGER. The leading dimension of the vector x. The value of ldx must be at least zero.
Output Parameters
x Scaled vector x.

?lascl2
Performs diagonal scaling on a vector.
Syntax
Fortran 77:
call slascl2(m, n, d, x, ldx)
call dlascl2(m, n, d, x, ldx)
call clascl2(m, n, d, x, ldx)
call zlascl2(m, n, d, x, ldx)
Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h
Description
The ?lascl2 routines perform diagonal scaling on a vector
x := D*x,
where:
x is a vector, and
D is a diagonal matrix.
Input Parameters
m INTEGER. Specifies the number of rows of the matrix D and the number of elements of the vector x. The value of m must be at least zero.
n INTEGER. The number of columns of D and x. The value of n must be at least zero.
d REAL for slascl2 and clascl2.
DOUBLE PRECISION for dlascl2 and zlascl2.
Array, DIMENSION m. Diagonal matrix D stored as a vector of length m.
x REAL for slascl2.
DOUBLE PRECISION for dlascl2.
COMPLEX for clascl2.
DOUBLE COMPLEX for zlascl2.
Array, DIMENSION (ldx,n). The vector x to scale by D.
ldx INTEGER.
The leading dimension of the vector x. The value of ldx must be at least zero.
Output Parameters
x Scaled vector x.

?la_syamv
Computes a matrix-vector product using a symmetric indefinite matrix to calculate error bounds.
Syntax
Fortran 77:
call sla_syamv(uplo, n, alpha, a, lda, x, incx, beta, y, incy)
call dla_syamv(uplo, n, alpha, a, lda, x, incx, beta, y, incy)
call cla_syamv(uplo, n, alpha, a, lda, x, incx, beta, y, incy)
call zla_syamv(uplo, n, alpha, a, lda, x, incx, beta, y, incy)
Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h
Description
The ?la_syamv routines perform a matrix-vector operation defined as
y := alpha*abs(A)*abs(x) + beta*abs(y),
where:
alpha and beta are scalars,
x and y are vectors,
A is an n-by-n symmetric matrix.
This function is primarily used in calculating error bounds. To protect against underflow during evaluation, the function perturbs components in the resulting vector away from zero by (n + 1) times the underflow threshold. To prevent unnecessarily large errors for block structure embedded in general matrices, the function does not perturb symbolically zero components. A zero entry is considered symbolic if all multiplications involved in computing that entry have at least one zero multiplicand.
Input Parameters
uplo CHARACTER*1.
Specifies whether the upper or lower triangular part of the array A is to be referenced:
If uplo = 'BLAS_UPPER', only the upper triangular part of A is to be referenced.
If uplo = 'BLAS_LOWER', only the lower triangular part of A is to be referenced.
n INTEGER. Specifies the number of rows and columns of the matrix A. The value of n must be at least zero.
alpha REAL for sla_syamv and cla_syamv
DOUBLE PRECISION for dla_syamv and zla_syamv.
Specifies the scalar alpha.
a REAL for sla_syamv
DOUBLE PRECISION for dla_syamv
COMPLEX for cla_syamv
DOUBLE COMPLEX for zla_syamv.
Array, DIMENSION (lda, *).
Before entry, the leading n-by-n part of the array a must contain the matrix of coefficients. The second dimension of a must be at least max(1,n).
lda INTEGER. Specifies the leading dimension of a as declared in the calling (sub)program. The value of lda must be at least max(1, n).
x REAL for sla_syamv
DOUBLE PRECISION for dla_syamv
COMPLEX for cla_syamv
DOUBLE COMPLEX for zla_syamv.
Array, DIMENSION at least (1+(n-1)*abs(incx)). Before entry, the incremented array x must contain the vector X.
incx INTEGER. Specifies the increment for the elements of x. The value of incx must be non-zero.
beta REAL for sla_syamv and cla_syamv
DOUBLE PRECISION for dla_syamv and zla_syamv
Specifies the scalar beta. When beta is zero, you do not need to set y on input.
y REAL for sla_syamv and cla_syamv
DOUBLE PRECISION for dla_syamv and zla_syamv
Array, DIMENSION at least (1+(n-1)*abs(incy)). Before entry with non-zero beta, the incremented array y must contain the vector Y.
incy INTEGER. Specifies the increment for the elements of y. The value of incy must be non-zero.
Output Parameters
y Updated vector Y.

?la_syrcond
Estimates the Skeel condition number for a symmetric indefinite matrix.
Syntax
Fortran 77:
call sla_syrcond( uplo, n, a, lda, af, ldaf, ipiv, cmode, c, info, work, iwork )
call dla_syrcond( uplo, n, a, lda, af, ldaf, ipiv, cmode, c, info, work, iwork )
Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h
Description
The function estimates the Skeel condition number of
op(A) * op2(C)
where the cmode parameter determines op2 as follows:
cmode Value    op2(C)
1              C
0              I
-1             inv(C)
The Skeel condition number
cond(A) = norminf(|inv(A)||A|)
is computed by computing scaling factors R such that
diag(R)*A*op2(C)
is row equilibrated and by computing the standard infinity-norm condition number.
Input Parameters
uplo CHARACTER*1. Must be 'U' or 'L'.
Specifies which triangle of A is stored:
If uplo = 'U', the upper triangle of A is stored.
If uplo = 'L', the lower triangle of A is stored.
n INTEGER. The number of linear equations, that is, the order of the matrix A; n ≥ 0.
a, af, c, work REAL for sla_syrcond
DOUBLE PRECISION for dla_syrcond
Arrays:
a(lda,*) contains the n-by-n matrix A.
af(ldaf,*) contains the block diagonal matrix D and the multipliers used to obtain the factor L or U as computed by ?sytrf.
The second dimension of a and af must be at least max(1, n).
c, DIMENSION n. The vector C in the formula op(A) * op2(C).
work is a workspace array of DIMENSION (3*n).
lda INTEGER. The leading dimension of the array a. lda ≥ max(1,n).
ldaf INTEGER. The leading dimension of af. ldaf ≥ max(1,n).
ipiv INTEGER. Array with DIMENSION n. Details of the interchanges and the block structure of D as determined by ?sytrf.
cmode INTEGER. Determines op2(C) in the formula op(A) * op2(C) as follows:
If cmode = 1, op2(C) = C.
If cmode = 0, op2(C) = I.
If cmode = -1, op2(C) = inv(C).
iwork INTEGER. Workspace array with DIMENSION n.
Output Parameters
info INTEGER.
If info = 0, the execution is successful.
If i > 0, the i-th parameter is invalid.
See Also
?sytrf

?la_syrcond_c
Computes the infinity norm condition number of op(A)*inv(diag(c)) for symmetric indefinite matrices.
Syntax
Fortran 77:
call cla_syrcond_c( uplo, n, a, lda, af, ldaf, ipiv, c, capply, info, work, rwork )
call zla_syrcond_c( uplo, n, a, lda, af, ldaf, ipiv, c, capply, info, work, rwork )
Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h
Description
The function computes the infinity norm condition number of
op(A) * inv(diag(c))
where c is a REAL vector for cla_syrcond_c and a DOUBLE PRECISION vector for zla_syrcond_c.
Input Parameters
uplo CHARACTER*1. Must be 'U' or 'L'.
Specifies which triangle of A is stored:
If uplo = 'U', the upper triangle of A is stored.
If uplo = 'L', the lower triangle of A is stored.
n INTEGER. The number of linear equations, that is, the order of the matrix A; n ≥ 0.
a COMPLEX for cla_syrcond_c
DOUBLE COMPLEX for zla_syrcond_c
Array, DIMENSION (lda, *). On entry, the n-by-n matrix A. The second dimension of a must be at least max(1,n).
lda INTEGER. The leading dimension of the array a. lda ≥ max(1,n).
af COMPLEX for cla_syrcond_c
DOUBLE COMPLEX for zla_syrcond_c
Array, DIMENSION (ldaf, *). The block diagonal matrix D and the multipliers used to obtain the factor U or L as computed by ?sytrf. The second dimension of af must be at least max(1,n).
ldaf INTEGER. The leading dimension of the array af. ldaf ≥ max(1,n).
ipiv INTEGER. Array with DIMENSION n. Details of the interchanges and the block structure of D as determined by ?sytrf.
c REAL for cla_syrcond_c
DOUBLE PRECISION for zla_syrcond_c
Array c with DIMENSION n. The vector c in the formula op(A) * inv(diag(c)).
capply LOGICAL. If .TRUE., then the function uses the vector c from the formula op(A) * inv(diag(c)).
work COMPLEX for cla_syrcond_c
DOUBLE COMPLEX for zla_syrcond_c
Array DIMENSION 2*n. Workspace.
rwork REAL for cla_syrcond_c
DOUBLE PRECISION for zla_syrcond_c
Array DIMENSION n. Workspace.
Output Parameters
info INTEGER.
If info = 0, the execution is successful.
If i > 0, the i-th parameter is invalid.
See Also
?sytrf

?la_syrcond_x
Computes the infinity norm condition number of op(A)*diag(x) for symmetric indefinite matrices.
Syntax
Fortran 77:
call cla_syrcond_x( uplo, n, a, lda, af, ldaf, ipiv, x, info, work, rwork )
call zla_syrcond_x( uplo, n, a, lda, af, ldaf, ipiv, x, info, work, rwork )
Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h
Description
The function computes the infinity norm condition number of
op(A) * diag(x)
where x is a COMPLEX vector for cla_syrcond_x and a DOUBLE COMPLEX vector for zla_syrcond_x.
Input Parameters
uplo CHARACTER*1. Must be 'U' or 'L'.
Specifies which triangle of A is stored:
If uplo = 'U', the upper triangle of A is stored.
If uplo = 'L', the lower triangle of A is stored.
n INTEGER. The number of linear equations, that is, the order of the matrix A; n ≥ 0.
a COMPLEX for cla_syrcond_x
DOUBLE COMPLEX for zla_syrcond_x
Array, DIMENSION (lda, *). On entry, the n-by-n matrix A. The second dimension of a must be at least max(1,n).
lda INTEGER. The leading dimension of the array a. lda ≥ max(1,n).
af COMPLEX for cla_syrcond_x
DOUBLE COMPLEX for zla_syrcond_x
Array, DIMENSION (ldaf, *). The block diagonal matrix D and the multipliers used to obtain the factor U or L as computed by ?sytrf. The second dimension of af must be at least max(1,n).
ldaf INTEGER. The leading dimension of the array af. ldaf ≥ max(1,n).
ipiv INTEGER. Array with DIMENSION n. Details of the interchanges and the block structure of D as determined by ?sytrf.
x COMPLEX for cla_syrcond_x
DOUBLE COMPLEX for zla_syrcond_x
Array x with DIMENSION n. The vector x in the formula op(A) * diag(x).
work COMPLEX for cla_syrcond_x
DOUBLE COMPLEX for zla_syrcond_x
Array DIMENSION 2*n. Workspace.
rwork REAL for cla_syrcond_x
DOUBLE PRECISION for zla_syrcond_x
Array DIMENSION n. Workspace.
Output Parameters
info INTEGER.
If info = 0, the execution is successful.
If i > 0, the i-th parameter is invalid.
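As an illustration of the quantity that ?la_syrcond_x refers to, the following minimal sketch evaluates the textbook infinity-norm condition number norm(M,inf)*norm(inv(M),inf) for M = A*diag(x) on a tiny matrix. It is written in plain Python rather than Fortran so that it is self-contained, and it is a brute-force check with an explicit inverse, not the algorithm used by the library routine (which works from the ?sytrf factors in af and ipiv). The helper names scale_columns, norm_inf, inverse, and cond_inf_scaled are hypothetical and exist only for this example.

```python
def scale_columns(a, x):
    """Return A * diag(x): column j of A is multiplied by x[j]."""
    return [[a[i][j] * x[j] for j in range(len(x))] for i in range(len(a))]


def norm_inf(m):
    """Infinity norm of a matrix: the largest absolute row sum."""
    return max(sum(abs(v) for v in row) for row in m)


def inverse(m):
    """Explicit inverse by Gauss-Jordan elimination with partial pivoting.

    Only meant for tiny illustrative matrices.
    """
    n = len(m)
    # Augment M with the identity matrix: [M | I].
    aug = [[complex(v) for v in row] +
           [1.0 + 0j if i == j else 0j for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[piv][col]) == 0.0:
            raise ValueError("matrix is singular")
        aug[col], aug[piv] = aug[piv], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    # The right half of the augmented matrix now holds inv(M).
    return [row[n:] for row in aug]


def cond_inf_scaled(a, x):
    """Infinity-norm condition number of A * diag(x) (assumed definition)."""
    m = scale_columns(a, x)
    return norm_inf(m) * norm_inf(inverse(m))


if __name__ == "__main__":
    A = [[2, 1], [1, 3]]   # small symmetric example matrix
    x = [1, 2j]            # complex scaling vector
    print(cond_inf_scaled(A, x))
```

A value computed this way can serve as a sanity check on small test problems; for matrices of realistic size the explicit inverse is too costly and too inaccurate, which is why the library routine relies on the existing factorization instead.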
See Also
?sytrf

?la_syrfsx_extended
Improves the computed solution to a system of linear equations for symmetric indefinite matrices by performing extra-precise iterative refinement and provides error bounds and backward error estimates for the solution.
Syntax
Fortran 77:
call sla_syrfsx_extended( prec_type, uplo, n, nrhs, a, lda, af, ldaf, ipiv, colequ, c, b, ldb, y, ldy, berr_out, n_norms, err_bnds_norm, err_bnds_comp, res, ayb, dy, y_tail, rcond, ithresh, rthresh, dz_ub, ignore_cwise, info )
call dla_syrfsx_extended( prec_type, uplo, n, nrhs, a, lda, af, ldaf, ipiv, colequ, c, b, ldb, y, ldy, berr_out, n_norms, err_bnds_norm, err_bnds_comp, res, ayb, dy, y_tail, rcond, ithresh, rthresh, dz_ub, ignore_cwise, info )
call cla_syrfsx_extended( prec_type, uplo, n, nrhs, a, lda, af, ldaf, ipiv, colequ, c, b, ldb, y, ldy, berr_out, n_norms, err_bnds_norm, err_bnds_comp, res, ayb, dy, y_tail, rcond, ithresh, rthresh, dz_ub, ignore_cwise, info )
call zla_syrfsx_extended( prec_type, uplo, n, nrhs, a, lda, af, ldaf, ipiv, colequ, c, b, ldb, y, ldy, berr_out, n_norms, err_bnds_norm, err_bnds_comp, res, ayb, dy, y_tail, rcond, ithresh, rthresh, dz_ub, ignore_cwise, info )
Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h
Description
The ?la_syrfsx_extended subroutine improves the computed solution to a system of linear equations by performing extra-precise iterative refinement and provides error bounds and backward error estimates for the solution. The ?syrfsx routine calls ?la_syrfsx_extended to perform iterative refinement.
In addition to the normwise error bound, the code provides the maximum componentwise error bound, if possible. See the comments for err_bnds_norm and err_bnds_comp for details of the error bounds.
Use ?la_syrfsx_extended to set only the second fields of err_bnds_norm and err_bnds_comp.
Input Parameters
prec_type INTEGER. Specifies the intermediate precision to be used in refinement.
The value is defined by ilaprec(p), where p is a CHARACTER and:
If p = 'S': Single.
If p = 'D': Double.
If p = 'I': Indigenous.
If p = 'X', 'E': Extra.
uplo CHARACTER*1. Must be 'U' or 'L'.
Specifies which triangle of A is stored:
If uplo = 'U', the upper triangle of A is stored.
If uplo = 'L', the lower triangle of A is stored.
n INTEGER. The number of linear equations; the order of the matrix A; n ≥ 0.
nrhs INTEGER. The number of right-hand sides; the number of columns of the matrix B.
a, af, b, y REAL for sla_syrfsx_extended
DOUBLE PRECISION for dla_syrfsx_extended
COMPLEX for cla_syrfsx_extended
DOUBLE COMPLEX for zla_syrfsx_extended.
Arrays: a(lda,*), af(ldaf,*), b(ldb,*), y(ldy,*).
The array a contains the original n-by-n matrix A. The second dimension of a must be at least max(1,n).
The array af contains the block diagonal matrix D and the multipliers used to obtain the factor U or L as computed by ?sytrf. The second dimension of af must be at least max(1,n).
The array b contains the right-hand side matrix B. The second dimension of b must be at least max(1,nrhs).
The array y on entry contains the solution matrix X as computed by ?sytrs. The second dimension of y must be at least max(1,nrhs).
lda INTEGER. The leading dimension of the array a; lda ≥ max(1,n).
ldaf INTEGER. The leading dimension of the array af; ldaf ≥ max(1,n).
ipiv INTEGER. Array with DIMENSION n. Details of the interchanges and the block structure of D as determined by ?sytrf.
colequ LOGICAL. If colequ = .TRUE., column equilibration was done to A before calling this routine. This is needed to compute the solution and error bounds correctly.
c REAL for sla_syrfsx_extended and cla_syrfsx_extended
DOUBLE PRECISION for dla_syrfsx_extended and zla_syrfsx_extended.
c contains the column scale factors for A. If colequ = .FALSE., c is not used.
If c is input, each element of c should be a power of the radix to ensure a reliable solution and error estimates. Scaling by a power of the radix does not cause rounding errors unless the result underflows or overflows. Rounding errors during scaling lead to refining with a matrix that is not equivalent to the input matrix, producing error estimates that may not be reliable.
ldb   INTEGER. The leading dimension of the array b; ldb ≥ max(1, n).
ldy   INTEGER. The leading dimension of the array y; ldy ≥ max(1, n).
n_norms   INTEGER. Determines which error bounds to return. See err_bnds_norm and err_bnds_comp descriptions in the Output Parameters section below.
If n_norms ≥ 1, returns normwise error bounds.
If n_norms ≥ 2, returns componentwise error bounds.
err_bnds_norm   REAL for sla_syrfsx_extended and cla_syrfsx_extended
DOUBLE PRECISION for dla_syrfsx_extended and zla_syrfsx_extended.
Array, DIMENSION (nrhs,n_err_bnds). For each right-hand side, contains information about various error bounds and condition numbers corresponding to the normwise relative error.
Normwise relative error in the i-th solution vector is defined as follows:
$$\frac{\max_j \left| X_{\mathrm{true}}(j,i) - X(j,i) \right|}{\max_j \left| X(j,i) \right|}$$
The array is indexed by the type of error information as described below. There are currently up to three pieces of information returned.
The first index in err_bnds_norm(i,:) corresponds to the i-th right-hand side.
The second index in err_bnds_norm(:,err) contains the following three fields:
err=1   "Trust/don't trust" boolean. Trust the answer if the reciprocal condition number is greater than the threshold sqrt(n)*slamch('E') for sla_syrfsx_extended/cla_syrfsx_extended and sqrt(n)*dlamch('E') for dla_syrfsx_extended/zla_syrfsx_extended.
err=2   "Guaranteed" error bound.
The estimated forward error, almost certainly within a factor of 10 of the true error so long as the next entry is greater than the threshold sqrt(n)*slamch('E') for sla_syrfsx_extended/cla_syrfsx_extended and sqrt(n)*dlamch('E') for dla_syrfsx_extended/zla_syrfsx_extended. This error bound should only be trusted if the previous boolean is true.
err=3   Reciprocal condition number. Estimated normwise reciprocal condition number. Compared with the threshold sqrt(n)*slamch('E') for sla_syrfsx_extended/cla_syrfsx_extended and sqrt(n)*dlamch('E') for dla_syrfsx_extended/zla_syrfsx_extended to determine if the error estimate is "guaranteed". These reciprocal condition numbers are 1/(norm(1/z,inf)*norm(z,inf)) for some appropriately scaled matrix Z. Let z=s*a, where s scales each row by a power of the radix so all absolute row sums of z are approximately 1.
Use this subroutine to set only the second field above.
err_bnds_comp   REAL for sla_syrfsx_extended and cla_syrfsx_extended
DOUBLE PRECISION for dla_syrfsx_extended and zla_syrfsx_extended.
Array, DIMENSION (nrhs,n_err_bnds). For each right-hand side, contains information about various error bounds and condition numbers corresponding to the componentwise relative error, which is defined as follows:
Componentwise relative error in the i-th solution vector:
$$\max_j \frac{\left| X_{\mathrm{true}}(j,i) - X(j,i) \right|}{\left| X(j,i) \right|}$$
The array is indexed by the right-hand side i, on which the componentwise relative error depends, and by the type of error information as described below. There are currently up to three pieces of information returned for each right-hand side. If componentwise accuracy is not requested (params(3) = 0.0), then err_bnds_comp is not accessed. If n_err_bnds < 3, then at most the first (:,n_err_bnds) entries are returned.
The first index in err_bnds_comp(i,:) corresponds to the i-th right-hand side.
The second index in err_bnds_comp(:,err) contains the following three fields:
err=1   "Trust/don't trust" boolean.
Trust the answer if the reciprocal condition number is greater than the threshold sqrt(n)*slamch('E') for sla_syrfsx_extended/cla_syrfsx_extended and sqrt(n)*dlamch('E') for dla_syrfsx_extended/zla_syrfsx_extended.
err=2   "Guaranteed" error bound. The estimated forward error, almost certainly within a factor of 10 of the true error so long as the next entry is greater than the threshold sqrt(n)*slamch('E') for sla_syrfsx_extended/cla_syrfsx_extended and sqrt(n)*dlamch('E') for dla_syrfsx_extended/zla_syrfsx_extended. This error bound should only be trusted if the previous boolean is true.
err=3   Reciprocal condition number. Estimated componentwise reciprocal condition number. Compared with the threshold sqrt(n)*slamch('E') for sla_syrfsx_extended/cla_syrfsx_extended and sqrt(n)*dlamch('E') for dla_syrfsx_extended/zla_syrfsx_extended to determine if the error estimate is "guaranteed". These reciprocal condition numbers are 1/(norm(1/z,inf)*norm(z,inf)) for some appropriately scaled matrix Z. Let z=s*(a*diag(x)), where x is the solution for the current right-hand side and s scales each row of a*diag(x) by a power of the radix so all absolute row sums of z are approximately 1.
Use this subroutine to set only the second field above.
res, dy, y_tail   REAL for sla_syrfsx_extended
DOUBLE PRECISION for dla_syrfsx_extended
COMPLEX for cla_syrfsx_extended
DOUBLE COMPLEX for zla_syrfsx_extended.
Workspace arrays of DIMENSION n.
res holds the intermediate residual.
dy holds the intermediate solution.
y_tail holds the trailing bits of the intermediate solution.
ayb   REAL for sla_syrfsx_extended and cla_syrfsx_extended
DOUBLE PRECISION for dla_syrfsx_extended and zla_syrfsx_extended.
Workspace array, DIMENSION n.
rcond   REAL for sla_syrfsx_extended and cla_syrfsx_extended
DOUBLE PRECISION for dla_syrfsx_extended and zla_syrfsx_extended.
Reciprocal scaled condition number.
An estimate of the reciprocal Skeel condition number of the matrix A after equilibration (if done). If rcond is less than the machine precision, in particular, if rcond = 0, the matrix is singular to working precision. Note that the error may still be small even if this number is very small and the matrix appears ill-conditioned.
ithresh   INTEGER. The maximum number of residual computations allowed for refinement. The default is 10. For 'aggressive', set to 100 to permit convergence using approximate factorizations or factorizations other than LU. If the factorization uses a technique other than Gaussian elimination, the guarantees in err_bnds_norm and err_bnds_comp may no longer be trustworthy.
rthresh   REAL for sla_syrfsx_extended and cla_syrfsx_extended
DOUBLE PRECISION for dla_syrfsx_extended and zla_syrfsx_extended.
Determines when to stop refinement if the error estimate stops decreasing. Refinement stops when the next solution no longer satisfies
norm(dx_{i+1}) < rthresh * norm(dx_i)
where norm(z) is the infinity norm of Z.
rthresh satisfies 0 < rthresh ≤ 1.
The default value is 0.5. For 'aggressive', set to 0.9 to permit convergence on extremely ill-conditioned matrices.
dz_ub   REAL for sla_syrfsx_extended and cla_syrfsx_extended
DOUBLE PRECISION for dla_syrfsx_extended and zla_syrfsx_extended.
Determines when to start considering componentwise convergence. Componentwise convergence is only considered after each component of the solution y is stable, that is, the relative change in each component is less than dz_ub. The default value is 0.25, requiring the first bit to be stable.
ignore_cwise   LOGICAL. If .TRUE., the function ignores componentwise convergence. Default value is .FALSE.

Output Parameters
y   REAL for sla_syrfsx_extended
DOUBLE PRECISION for dla_syrfsx_extended
COMPLEX for cla_syrfsx_extended
DOUBLE COMPLEX for zla_syrfsx_extended.
The improved solution matrix Y.
berr_out   REAL for sla_syrfsx_extended and cla_syrfsx_extended
DOUBLE PRECISION for dla_syrfsx_extended and zla_syrfsx_extended.
Array, DIMENSION nrhs. berr_out(j) contains the componentwise relative backward error for right-hand-side j from the formula
max(i) ( abs(res(i)) / ( abs(op(A))*abs(y) + abs(B) )(i) )
where abs(z) is the componentwise absolute value of the matrix or vector Z. This is computed by ?la_lin_berr.
err_bnds_norm, err_bnds_comp   Values of the corresponding input parameters improved after iterative refinement and stored in the second column of the array ( 1:nrhs, 2 ). The other elements are kept unchanged.
info   INTEGER. If info = 0, the execution is successful. The solution to every right-hand side is guaranteed.
If info = -i, the i-th parameter had an illegal value.

See Also
?syrfsx
?sytrf
?sytrs
?lamch
ilaprec
ilatrans
?la_lin_berr

?la_syrpvgrw
Computes the reciprocal pivot growth factor norm(A)/norm(U) for a symmetric indefinite matrix.

Syntax
Fortran 77:
call sla_syrpvgrw( uplo, n, info, a, lda, af, ldaf, ipiv, work )
call dla_syrpvgrw( uplo, n, info, a, lda, af, ldaf, ipiv, work )
call cla_syrpvgrw( uplo, n, info, a, lda, af, ldaf, ipiv, work )
call zla_syrpvgrw( uplo, n, info, a, lda, af, ldaf, ipiv, work )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The ?la_syrpvgrw routine computes the reciprocal pivot growth factor norm(A)/norm(U). The max absolute element norm is used. If this is much less than 1, the stability of the LU factorization of the equilibrated matrix A could be poor. This also means that the solution X, estimated condition numbers, and error bounds could be unreliable.

Input Parameters
uplo   CHARACTER*1. Must be 'U' or 'L'. Specifies the triangle of A to store:
If uplo = 'U', the upper triangle of A is stored,
If uplo = 'L', the lower triangle of A is stored.
n   INTEGER. The number of linear equations, the order of the matrix A; n ≥ 0.
info   INTEGER.
The value of INFO returned from ?sytrf, that is, the pivot in column info is exactly 0.
a, af   REAL for sla_syrpvgrw
DOUBLE PRECISION for dla_syrpvgrw
COMPLEX for cla_syrpvgrw
DOUBLE COMPLEX for zla_syrpvgrw.
Arrays: a(lda,*), af(ldaf,*).
The array a contains the input n-by-n matrix A. The second dimension of a must be at least max(1,n).
The array af contains the block diagonal matrix D and the multipliers used to obtain the factor U or L as computed by ?sytrf. The second dimension of af must be at least max(1,n).
lda   INTEGER. The leading dimension of a; lda ≥ max(1,n).
ldaf   INTEGER. The leading dimension of af; ldaf ≥ max(1,n).
ipiv   INTEGER. Array, DIMENSION n. Details of the interchanges and the block structure of D as determined by ?sytrf.
work   REAL for sla_syrpvgrw and cla_syrpvgrw
DOUBLE PRECISION for dla_syrpvgrw and zla_syrpvgrw.
Workspace array, DIMENSION 2*n.

See Also
?sytrf

?la_wwaddw
Adds a vector into a doubled-single vector.

Syntax
Fortran 77:
call sla_wwaddw( n, x, y, w )
call dla_wwaddw( n, x, y, w )
call cla_wwaddw( n, x, y, w )
call zla_wwaddw( n, x, y, w )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The ?la_wwaddw routine adds a vector W into a doubled-single vector (X, Y). This works for all existing IBM hex and binary floating-point arithmetics, but not for decimal.

Input Parameters
n   INTEGER. The length of vectors X, Y, and W.
x, y, w   REAL for sla_wwaddw
DOUBLE PRECISION for dla_wwaddw
COMPLEX for cla_wwaddw
DOUBLE COMPLEX for zla_wwaddw.
Arrays, DIMENSION n.
x and y contain the first and second parts of the doubled-single accumulation vector, respectively.
w contains the vector W to be added.

Output Parameters
x, y   Contain the first and second parts of the doubled-single accumulation vector, respectively, after adding the vector W.

Utility Functions and Routines
This section describes LAPACK utility functions and routines.
Summary information about these routines is given in the following table:

LAPACK Utility Routines
Routine Name     Data Types   Description
ilaver                        Returns the version of the LAPACK library.
ilaenv                        Environmental enquiry function which returns values for tuning algorithmic performance.
iparmq                        Environmental enquiry function which returns values for tuning algorithmic performance.
ieeeck                        Checks if the infinity and NaN arithmetic is safe. Called by ilaenv.
lsame                         Tests two characters for equality regardless of case.
lsamen                        Tests two character strings for equality regardless of case.
?labad           s, d         Returns the square root of the underflow and overflow thresholds if the exponent-range is very large.
?lamch           s, d         Determines machine parameters for floating-point arithmetic.
?lamc1           s, d         Called from ?lamc2. Determines machine parameters given by beta, t, rnd, ieee1.
?lamc2           s, d         Used by ?lamch. Determines machine parameters specified in its arguments list.
?lamc3           s, d         Called from ?lamc1-?lamc5. Intended to force a and b to be stored prior to doing the addition of a and b.
?lamc4           s, d         This is a service routine for ?lamc2.
?lamc5           s, d         Called from ?lamc2. Attempts to compute the largest machine floating-point number, without overflow.
second/dsecnd                 Return user time for a process.
chla_transtype                Translates a BLAST-specified integer constant to the character string specifying a transposition operation.
iladiag                       Translates a character string specifying whether a matrix has a unit diagonal or not to the relevant BLAST-specified integer constant.
ilaprec                       Translates a character string specifying an intermediate precision to the relevant BLAST-specified integer constant.
ilatrans                      Translates a character string specifying a transposition operation to the BLAST-specified integer constant.
ilauplo                       Translates a character string specifying an upper- or lower-triangular matrix to the relevant BLAST-specified integer constant.
xerbla                        Error handling routine called by LAPACK routines.
xerbla_array                  Assists other languages in calling the xerbla function.

ilaver
Returns the version of the LAPACK library.

Syntax
call ilaver( vers_major, vers_minor, vers_patch )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
This routine returns the version of the LAPACK library.

Output Parameters
vers_major   INTEGER. Returns the major version of the LAPACK library.
vers_minor   INTEGER. Returns the minor version from the major version of the LAPACK library.
vers_patch   INTEGER. Returns the patch version from the minor version of the LAPACK library.

ilaenv
Environmental enquiry function that returns values for tuning algorithmic performance.

Syntax
value = ilaenv( ispec, name, opts, n1, n2, n3, n4 )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The enquiry function ilaenv is called from the LAPACK routines to choose problem-dependent parameters for the local environment. See ispec below for a description of the parameters.
This version provides a set of parameters that should give good, but not optimal, performance on many of the currently available computers.
This routine will not function correctly if it is converted to all lower case. Converting it to all upper case is allowed.

Optimization Notice
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors.
Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804

Input Parameters
ispec   INTEGER. Specifies the parameter to be returned as the value of ilaenv:
= 1: the optimal blocksize; if this value is 1, an unblocked algorithm will give the best performance.
= 2: the minimum block size for which the block routine should be used; if the usable block size is less than this value, an unblocked routine should be used.
= 3: the crossover point (in a block routine, for n less than this value, an unblocked routine should be used).
= 4: the number of shifts, used in the nonsymmetric eigenvalue routines (deprecated).
= 5: the minimum column dimension for blocking to be used; rectangular blocks must have dimension at least k-by-m, where k is given by ilaenv(2,...) and m by ilaenv(5,...).
= 6: the crossover point for the SVD (when reducing an m-by-n matrix to bidiagonal form, if max(m,n)/min(m,n) exceeds this value, a QR factorization is used first to reduce the matrix to a triangular form).
= 7: the number of processors.
= 8: the crossover point for the multishift QR and QZ methods for nonsymmetric eigenvalue problems (deprecated).
= 9: maximum size of the subproblems at the bottom of the computation tree in the divide-and-conquer algorithm (used by ?gelsd and ?gesdd).
=10: IEEE NaN arithmetic can be trusted not to trap.
=11: infinity arithmetic can be trusted not to trap.
12 ≤ ispec ≤ 16: ?hseqr or one of its subroutines, see iparmq for detailed explanation.
name   CHARACTER*(*). The name of the calling subroutine, in either upper case or lower case.
opts   CHARACTER*(*). The character options to the subroutine name, concatenated into a single character string.
For example, uplo = 'U', trans = 'T', and diag = 'N' for a triangular routine would be specified as opts = 'UTN'.
n1, n2, n3, n4   INTEGER. Problem dimensions for the subroutine name; these may not all be required.

Output Parameters
value   INTEGER.
If value ≥ 0: the value of the parameter specified by ispec;
If value = -k < 0: the k-th argument had an illegal value.

Application Notes
The following conventions have been used when calling ilaenv from the LAPACK routines:
1. opts is a concatenation of all of the character options to subroutine name, in the same order that they appear in the argument list for name, even if they are not used in determining the value of the parameter specified by ispec.
2. The problem dimensions n1, n2, n3, n4 are specified in the order that they appear in the argument list for name. n1 is used first, n2 second, and so on, and unused problem dimensions are passed a value of -1.
3. The parameter value returned by ilaenv is checked for validity in the calling subroutine. For example, ilaenv is used to retrieve the optimal blocksize for strtri as follows:
    nb = ilaenv( 1, 'strtri', uplo // diag, n, -1, -1, -1 )
    if( nb.le.1 ) nb = max( 1, n )

Below is an example of ilaenv usage in C language:
    #include <stdio.h>
    #include "mkl.h"

    int main(void)
    {
        int size = 1000;
        int ispec = 1;    /* request the optimal block size */
        int dummy = -1;   /* placeholder for unused problem dimensions */
        int blockSize1 = ilaenv(&ispec, "dsytrd", "U", &size, &dummy, &dummy, &dummy);
        int blockSize2 = ilaenv(&ispec, "dormtr", "LUN", &size, &size, &dummy, &dummy);
        printf("DSYTRD blocksize = %d\n", blockSize1);
        printf("DORMTR blocksize = %d\n", blockSize2);
        return 0;
    }

See Also
?hseqr
iparmq

iparmq
Environmental enquiry function which returns values for tuning algorithmic performance.

Syntax
value = iparmq( ispec, name, opts, n, ilo, ihi, lwork )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The function sets problem and machine dependent parameters useful for ?hseqr and its subroutines.
It is called whenever ilaenv is called with 12 ≤ ispec ≤ 16.

Input Parameters
ispec   INTEGER. Specifies the parameter to be returned as the value of iparmq:
= 12: (inmin) Matrices of order nmin or less are sent directly to ?lahqr, the implicit double shift QR algorithm. nmin must be at least 11.
= 13: (inwin) Size of the deflation window. This is best set greater than or equal to the number of simultaneous shifts ns. Larger matrices benefit from larger deflation windows.
= 14: (inibl) Determines when to stop nibbling and invest in an (expensive) multi-shift QR sweep. If the aggressive early deflation subroutine finds ld converged eigenvalues from an order nw deflation window and ld>(nw*nibble)/100, then the next QR sweep is skipped and early deflation is applied immediately to the remaining active diagonal block. Setting iparmq(ispec=14)=0 causes TTQRE to skip a multi-shift QR sweep whenever early deflation finds a converged eigenvalue. Setting iparmq(ispec=14) greater than or equal to 100 prevents TTQRE from skipping a multi-shift QR sweep.
= 15: (nshfts) The number of simultaneous shifts in a multi-shift QR iteration.
= 16: (iacc22) iparmq is set to 0, 1 or 2 with the following meanings.
0: During the multi-shift QR sweep, ?laqr5 does not accumulate reflections and does not use matrix-matrix multiply to update the far-from-diagonal matrix entries.
1: During the multi-shift QR sweep, ?laqr5 and/or ?laqr3 accumulates reflections and uses matrix-matrix multiply to update the far-from-diagonal matrix entries.
2: During the multi-shift QR sweep, ?laqr5 accumulates reflections and takes advantage of 2-by-2 block structure during matrix-matrix multiplies.
(If ?trmm is slower than ?gemm, then iparmq(ispec=16)=1 may be more efficient than iparmq(ispec=16)=2 despite the greater level of arithmetic work implied by the latter choice.)
name   CHARACTER*(*). The name of the calling subroutine.
opts   CHARACTER*(*). This is a concatenation of the string arguments to TTQRE.
n   INTEGER. n is the order of the Hessenberg matrix H.
ilo, ihi   INTEGER. It is assumed that H is already upper triangular in rows and columns 1:ilo-1 and ihi+1:n.
lwork   INTEGER. The amount of workspace available.

Output Parameters
value   INTEGER.
If value ≥ 0: the value of the parameter specified by ispec;
If value = -k < 0: the k-th argument had an illegal value.

Application Notes
The following conventions have been used when calling ilaenv from the LAPACK routines:
1. opts is a concatenation of all of the character options to subroutine name, in the same order that they appear in the argument list for name, even if they are not used in determining the value of the parameter specified by ispec.
2. The problem dimensions n1, n2, n3, n4 are specified in the order that they appear in the argument list for name. n1 is used first, n2 second, and so on, and unused problem dimensions are passed a value of -1.
3. The parameter value returned by ilaenv is checked for validity in the calling subroutine. For example, ilaenv is used to retrieve the optimal blocksize for strtri as follows:
    nb = ilaenv( 1, 'strtri', uplo // diag, n, -1, -1, -1 )
    if( nb.le.1 ) nb = max( 1, n )

ieeeck
Checks if the infinity and NaN arithmetic is safe. Called by ilaenv.

Syntax
ival = ieeeck( ispec, zero, one )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The function ieeeck is called from ilaenv to verify that infinity and possibly NaN arithmetic is safe, that is, will not trap.

Input Parameters
ispec   INTEGER. Specifies whether to test just for infinity arithmetic or both for infinity and NaN arithmetic:
If ispec = 0: Verify infinity arithmetic only.
If ispec = 1: Verify infinity and NaN arithmetic.
zero   REAL. Must contain the value 0.0. This is passed to prevent the compiler from optimizing away this code.
one   REAL.
Must contain the value 1.0. This is passed to prevent the compiler from optimizing away this code.

Output Parameters
ival   INTEGER.
If ival = 0: Arithmetic failed to produce the correct answers.
If ival = 1: Arithmetic produced the correct answers.

lsamen
Tests two character strings for equality regardless of case.

Syntax
val = lsamen( n, ca, cb )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
This logical function tests if the first n letters of the string ca are the same as the first n letters of cb, regardless of case. The function lsamen returns .TRUE. if ca and cb are equivalent except for case and .FALSE. otherwise. lsamen also returns .FALSE. if len(ca) or len(cb) is less than n.

Input Parameters
n   INTEGER. The number of characters in ca and cb to be compared.
ca, cb   CHARACTER*(*). Specify two character strings of length at least n to be compared. Only the first n characters of each string will be accessed.

Output Parameters
val   LOGICAL. Result of the comparison.

?labad
Returns the square root of the underflow and overflow thresholds if the exponent-range is very large.

Syntax
call slabad( small, large )
call dlabad( small, large )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine takes as input the values computed by slamch/dlamch for underflow and overflow, and returns the square root of each of these values if the log of large is sufficiently large. This subroutine is intended to identify machines with a large exponent range, such as the Crays, and redefine the underflow and overflow limits to be the square roots of the values computed by ?lamch. This subroutine is needed because ?lamch does not compensate for poor arithmetic in the upper half of the exponent range, as is found on a Cray.

Input Parameters
small   REAL for slabad
DOUBLE PRECISION for dlabad.
The underflow threshold as computed by ?lamch.
large   REAL for slabad
DOUBLE PRECISION for dlabad.
The overflow threshold as computed by ?lamch.

Output Parameters
small   On exit, if log10(large) is sufficiently large, the square root of small, otherwise unchanged.
large   On exit, if log10(large) is sufficiently large, the square root of large, otherwise unchanged.

?lamch
Determines machine parameters for floating-point arithmetic.

Syntax
val = slamch( cmach )
val = dlamch( cmach )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The function ?lamch determines single precision and double precision machine parameters.

Input Parameters
cmach   CHARACTER*1. Specifies the value to be returned by ?lamch:
= 'E' or 'e', val = eps
= 'S' or 's', val = sfmin
= 'B' or 'b', val = base
= 'P' or 'p', val = eps*base
= 'N' or 'n', val = t
= 'R' or 'r', val = rnd
= 'M' or 'm', val = emin
= 'U' or 'u', val = rmin
= 'L' or 'l', val = emax
= 'O' or 'o', val = rmax
where
eps = relative machine precision;
sfmin = safe minimum, such that 1/sfmin does not overflow;
base = base of the machine;
prec = eps*base;
t = number of (base) digits in the mantissa;
rnd = 1.0 when rounding occurs in addition, 0.0 otherwise;
emin = minimum exponent before (gradual) underflow;
rmin = underflow threshold, equal to base**(emin-1);
emax = largest exponent before overflow;
rmax = overflow threshold, equal to (base**emax)*(1-eps).
NOTE You can use a character string for cmach instead of a single character in order to make your code more readable. The first character of the string determines the value to be returned. For example, 'Precision' is interpreted as 'p'.

Output Parameters
val   REAL for slamch
DOUBLE PRECISION for dlamch
Value returned by the function.

?lamc1
Called from ?lamc2. Determines machine parameters given by beta, t, rnd, ieee1.
Syntax
call slamc1( beta, t, rnd, ieee1 )
call dlamc1( beta, t, rnd, ieee1 )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine ?lamc1 determines machine parameters given by beta, t, rnd, ieee1.

Output Parameters
beta   INTEGER. The base of the machine.
t   INTEGER. The number of (beta) digits in the mantissa.
rnd   LOGICAL. Specifies whether proper rounding ( rnd = .TRUE. ) or chopping ( rnd = .FALSE. ) occurs in addition. This may not be a reliable guide to the way in which the machine performs its arithmetic.
ieee1   LOGICAL. Specifies whether rounding appears to be done in the IEEE 'round to nearest' style.

?lamc2
Used by ?lamch. Determines machine parameters specified in its arguments list.

Syntax
call slamc2( beta, t, rnd, eps, emin, rmin, emax, rmax )
call dlamc2( beta, t, rnd, eps, emin, rmin, emax, rmax )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine ?lamc2 determines machine parameters specified in its arguments list.

Output Parameters
beta   INTEGER. The base of the machine.
t   INTEGER. The number of (beta) digits in the mantissa.
rnd   LOGICAL. Specifies whether proper rounding ( rnd = .TRUE. ) or chopping ( rnd = .FALSE. ) occurs in addition. This may not be a reliable guide to the way in which the machine performs its arithmetic.
eps   REAL for slamc2
DOUBLE PRECISION for dlamc2
The smallest positive number such that fl(1.0 - eps) < 1.0, where fl denotes the computed value.
emin   INTEGER. The minimum exponent before (gradual) underflow occurs.
rmin   REAL for slamc2
DOUBLE PRECISION for dlamc2
The smallest normalized number for the machine, given by base**(emin-1), where base is the floating point value of beta.
emax   INTEGER. The maximum exponent before overflow occurs.
rmax   REAL for slamc2
DOUBLE PRECISION for dlamc2
The largest positive number for the machine, given by (base**emax)*(1 - eps), where base is the floating point value of beta.
?lamc3
Called from ?lamc1-?lamc5. Intended to force a and b to be stored prior to doing the addition of a and b.

Syntax
val = slamc3( a, b )
val = dlamc3( a, b )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine is intended to force A and B to be stored prior to doing the addition of A and B, for use in situations where optimizers might hold one of these in a register.

Input Parameters
a, b   REAL for slamc3
DOUBLE PRECISION for dlamc3
The values a and b.

Output Parameters
val   REAL for slamc3
DOUBLE PRECISION for dlamc3
The result of adding values a and b.

?lamc4
This is a service routine for ?lamc2.

Syntax
call slamc4( emin, start, base )
call dlamc4( emin, start, base )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
This is a service routine for ?lamc2.

Input Parameters
start   REAL for slamc4
DOUBLE PRECISION for dlamc4
The starting point for determining emin.
base   INTEGER. The base of the machine.

Output Parameters
emin   INTEGER. The minimum exponent before (gradual) underflow, computed by setting a = start and dividing by base until the previous a cannot be recovered.

?lamc5
Called from ?lamc2. Attempts to compute the largest machine floating-point number, without overflow.

Syntax
call slamc5( beta, p, emin, ieee, emax, rmax )
call dlamc5( beta, p, emin, ieee, emax, rmax )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The routine ?lamc5 attempts to compute rmax, the largest machine floating-point number, without overflow. It assumes that emax + abs(emin) sum approximately to a power of 2. It will fail on machines where this assumption does not hold, for example, the Cyber 205 (emin = -28625, emax = 28718). It will also fail if the value supplied for emin is too large (that is, too close to zero), probably with overflow.

Input Parameters
beta   INTEGER. The base of floating-point arithmetic.
p   INTEGER. The number of base beta digits in the mantissa of a floating-point value.
emin   INTEGER. The minimum exponent before (gradual) underflow.
ieee   LOGICAL. A logical flag specifying whether or not the arithmetic system is thought to comply with the IEEE standard.

Output Parameters
emax   INTEGER. The largest exponent before overflow.
rmax   REAL for slamc5
DOUBLE PRECISION for dlamc5
The largest machine floating-point number.

second/dsecnd
Return user time for a process.

Syntax
val = second()
val = dsecnd()

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The functions second/dsecnd return the user time for a process in seconds. These versions get the time from the system function etime. The difference is that dsecnd returns the result with double precision.

Output Parameters
val   REAL for second
DOUBLE PRECISION for dsecnd
User time for a process.

chla_transtype
Translates a BLAST-specified integer constant to the character string specifying a transposition operation.

Syntax
val = chla_transtype( trans )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The chla_transtype function translates a BLAST-specified integer constant to the character string specifying a transposition operation.
The function returns a CHARACTER*1. If the input is not an integer indicating a transposition operator, then val is 'X'. Otherwise, the function returns the constant value corresponding to trans.

Input Parameters
trans   INTEGER. Specifies the form of the system of equations:
If trans = BLAS_NO_TRANS = 111: No transpose.
If trans = BLAS_TRANS = 112: Transpose.
If trans = BLAS_CONJ_TRANS = 113: Conjugate Transpose.

Output Parameters
val   CHARACTER*1. Character that specifies a transposition operation.

iladiag
Translates a character string specifying whether a matrix has a unit diagonal to the relevant BLAST-specified integer constant.
Syntax
val = iladiag( diag )
Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h
Description
The iladiag function translates a character string specifying whether a matrix has a unit diagonal or not to the relevant BLAST-specified integer constant.
The function returns an INTEGER. If val < 0, the input is not a character indicating a unit or non-unit diagonal. Otherwise, the function returns the constant value corresponding to diag.
Input Parameters
diag CHARACTER*1. Specifies whether the matrix A has a unit diagonal:
If diag = 'N': A is non-unit triangular.
If diag = 'U': A is unit triangular.
Output Parameters
val INTEGER. Value returned by the function.

ilaprec
Translates a character string specifying an intermediate precision to the relevant BLAST-specified integer constant.
Syntax
val = ilaprec( prec )
Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h
Description
The ilaprec function translates a character string specifying an intermediate precision to the relevant BLAST-specified integer constant.
The function returns an INTEGER. If val < 0, the input is not a character indicating a supported intermediate precision. Otherwise, the function returns the constant value corresponding to prec.
Input Parameters
prec CHARACTER*1. Specifies the intermediate precision:
If prec = 'S': Single.
If prec = 'D': Double.
If prec = 'I': Indigenous.
If prec = 'X', 'E': Extra.
Output Parameters
val INTEGER. Value returned by the function.

ilatrans
Translates a character string specifying a transposition operation to the BLAST-specified integer constant.
Syntax
val = ilatrans( trans )
Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h
Description
The ilatrans function translates a character string specifying a transposition operation to the BLAST-specified integer constant.
The function returns an INTEGER. If val < 0, the input is not a character indicating a transposition operator.
Otherwise, the function returns the constant value corresponding to trans.
Input Parameters
trans CHARACTER*1. Specifies the form of the system of equations:
If trans = 'N': No transpose.
If trans = 'T': Transpose.
If trans = 'C': Conjugate Transpose.
Output Parameters
val INTEGER. Value returned by the function.

ilauplo
Translates a character string specifying an upper- or lower-triangular matrix to the relevant BLAST-specified integer constant.
Syntax
val = ilauplo( uplo )
Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h
Description
The ilauplo function translates a character string specifying an upper- or lower-triangular matrix to the relevant BLAST-specified integer constant.
The function returns an INTEGER. If val < 0, the input is not a character indicating an upper- or lower-triangular matrix. Otherwise, the function returns the constant value corresponding to uplo.
Input Parameters
uplo CHARACTER. Specifies whether the matrix A is upper or lower triangular:
If uplo = 'U': A is upper triangular.
If uplo = 'L': A is lower triangular.
Output Parameters
val INTEGER. Value returned by the function.

xerbla_array
Assists other languages in calling the xerbla function.
Syntax
call xerbla_array( srname_array, srname_len, info )
Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h
Description
The routine assists other languages in calling the error handling xerbla function. Rather than taking a Fortran string argument as the function name, xerbla_array takes an array of single characters along with the array length. The routine then copies up to 32 characters of that array into a Fortran string and passes that to xerbla. If called with a non-positive srname_len, the routine will call xerbla with a string of all blank characters.
If some macro or other device makes xerbla_array available to C99 by a name lapack_xerbla and with a common Fortran calling convention, a C99 program could invoke xerbla via:
    {
        int flen = strlen(__func__);
        lapack_xerbla(__func__, &flen, &info);
    }
Providing xerbla_array is not necessary for intercepting LAPACK errors. xerbla_array calls xerbla.
Input Parameters
srname_array CHARACTER(1). Array, dimension (srname_len). The name of the routine that called xerbla_array.
srname_len INTEGER. The length of the name in srname_array.
info INTEGER. Position of the invalid parameter in the parameter list of the calling routine.

ScaLAPACK Routines 6
This chapter describes the Intel® Math Kernel Library implementation of routines from the ScaLAPACK package for distributed-memory architectures. Routines are supported for both real and complex dense and band matrices to perform the tasks of solving systems of linear equations, solving linear least-squares problems, eigenvalue and singular value problems, as well as performing a number of related computational tasks.
Intel MKL ScaLAPACK routines are written in FORTRAN 77, with the exception of a few utility routines written in C to exploit the IEEE arithmetic. All routines are available in all precision types: single precision, double precision, complex, and double complex precision. See the mkl_scalapack.h header file for C declarations of ScaLAPACK routines.
NOTE ScaLAPACK routines are provided only with Intel® MKL versions for Linux* and Windows* OSs.
Sections in this chapter include descriptions of ScaLAPACK computational routines that perform distinct computational tasks, as well as driver routines for solving standard types of problems in one call.
Generally, ScaLAPACK runs on a network of computers using MPI as a message-passing layer and a set of prebuilt communication subprograms (BLACS), as well as a set of BLAS optimized for the target architecture. The Intel MKL version of ScaLAPACK is optimized for Intel® processors. For the detailed system and environment requirements, see Intel® MKL Release Notes and Intel® MKL User's Guide.
For full reference on ScaLAPACK routines and related information, see [SLUG].

Optimization Notice
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804

Overview
The model of the computing environment for ScaLAPACK is represented as a one-dimensional array of processes (for operations on band or tridiagonal matrices) or a two-dimensional process grid (for operations on dense matrices). To use ScaLAPACK, all global matrices or vectors must be distributed on this array or grid before calling the ScaLAPACK routines.
ScaLAPACK uses the two-dimensional block-cyclic data distribution as a layout for dense matrix computations. This distribution balances the work well between the available processors and makes it possible to use BLAS Level 3 routines for optimal local computations.
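To make the block-cyclic layout concrete, the following is a minimal sketch of the index arithmetic along one dimension of the process grid. The function names, the 0-based indexing, and the `src` argument (the process that holds the first block) are assumptions made for illustration; they are not part of the Intel MKL interface. The counting function follows the same arithmetic as the ScaLAPACK tool routine numroc, and for a dense matrix the same mapping would be applied independently to rows and to columns.

```python
def owner_and_local_index(g, nb, nprocs, src=0):
    """Map a 0-based global index g to (owning process, 0-based local index).

    nb is the blocking factor, nprocs the number of processes along this
    dimension, and src the process that owns the first block (hypothetical
    names chosen for this sketch).
    """
    block = g // nb                      # which global block g falls into
    proc = (src + block) % nprocs        # blocks are dealt out cyclically
    local = (block // nprocs) * nb + g % nb
    return proc, local


def local_count(n, nb, iproc, nprocs, src=0):
    """Number of the n global indices that process iproc owns."""
    mydist = (nprocs + iproc - src) % nprocs   # distance from the source process
    nblocks = n // nb                          # number of complete blocks
    count = (nblocks // nprocs) * nb           # full rounds of complete blocks
    extra = nblocks % nprocs                   # complete blocks left over
    if mydist < extra:
        count += nb                            # one more complete block
    elif mydist == extra:
        count += n % nb                        # the trailing partial block
    return count


if __name__ == "__main__":
    n, nb, nprocs = 10, 2, 3
    for g in range(n):
        print(g, owner_and_local_index(g, nb, nprocs))
    print([local_count(n, nb, p, nprocs) for p in range(nprocs)])
```

With 10 indices, a blocking factor of 2, and 3 processes, the five blocks are dealt to processes 0, 1, 2, 0, 1, so the first two processes hold four indices each and the third holds two.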
Information about the data distribution that is required to establish the mapping between each global array and its corresponding process and memory location is contained in the so-called array descriptor associated with each global array. An example of an array descriptor structure is given in Table "Content of the array descriptor for dense matrices".

Content of the array descriptor for dense matrices
Array Element #   Name    Definition
1                 dtype   Descriptor type (=1 for dense matrices)
2                 ctxt    BLACS context handle for the process grid
3                 m       Number of rows in the global array
4                 n       Number of columns in the global array
5                 mb      Row blocking factor
6                 nb      Column blocking factor
7                 rsrc    Process row over which the first row of the global array is distributed
8                 csrc    Process column over which the first column of the global array is distributed
9                 lld     Leading dimension of the local array

The number of rows and columns of a global dense matrix that a particular process in a grid receives after the data is distributed is denoted by LOCr() and LOCc(), respectively. To compute these numbers, you can use the ScaLAPACK tool routine numroc.
After the block-cyclic distribution of global data is done, you may choose to perform an operation on a submatrix of the global matrix A, which is contained in the global subarray sub(A), defined by the following 6 values (for dense matrices):
m       The number of rows of sub(A)
n       The number of columns of sub(A)
a       A pointer to the local array containing the entire global array A
ia      The row index of sub(A) in the global array
ja      The column index of sub(A) in the global array
desca   The array descriptor for the global array

Routine Naming Conventions
For each routine introduced in this chapter, you can use the ScaLAPACK name. The naming convention for ScaLAPACK routines is similar to that used for LAPACK routines (see Routine Naming Conventions in Chapter 4). As a general rule, the name of each ScaLAPACK routine that has a LAPACK equivalent is simply the LAPACK name prefixed by the letter p.
ScaLAPACK names have the structure p?yyzzz or p?yyzz, which is described below.
The initial letter p is a distinctive prefix of ScaLAPACK routines and is present in each such routine.
The second symbol ? indicates the data type:
s    real, single precision
d    real, double precision
c    complex, single precision
z    complex, double precision
The third and fourth letters yy indicate the matrix type as:
ge   general
gb   general band
gg   a pair of general matrices (for a generalized problem)
dt   general tridiagonal (diagonally dominant-like)
db   general band (diagonally dominant-like)
po   symmetric or Hermitian positive-definite
pb   symmetric or Hermitian positive-definite band
pt   symmetric or Hermitian positive-definite tridiagonal
sy   symmetric
st   symmetric tridiagonal (real)
he   Hermitian
or   orthogonal
tr   triangular (or quasi-triangular)
tz   trapezoidal
un   unitary
For computational routines, the last three letters zzz indicate the computation performed and have the same meaning as for LAPACK routines.
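The p?yyzzz structure can be illustrated by splitting a routine name into its parts. The following is a minimal sketch; the function name decode_routine_name and the dictionary layout are assumptions made for illustration, and the separate `or` (orthogonal) entry assumes that the flattened row "he Hermitian or orthogonal" stands for two table rows.

```python
# Data type codes for the second character of a ScaLAPACK routine name.
DATA_TYPES = {
    "s": "real, single precision",
    "d": "real, double precision",
    "c": "complex, single precision",
    "z": "complex, double precision",
}

# Matrix type codes for the third and fourth characters.
MATRIX_TYPES = {
    "ge": "general",
    "gb": "general band",
    "gg": "a pair of general matrices (for a generalized problem)",
    "dt": "general tridiagonal (diagonally dominant-like)",
    "db": "general band (diagonally dominant-like)",
    "po": "symmetric or Hermitian positive-definite",
    "pb": "symmetric or Hermitian positive-definite band",
    "pt": "symmetric or Hermitian positive-definite tridiagonal",
    "sy": "symmetric",
    "st": "symmetric tridiagonal (real)",
    "he": "Hermitian",
    "or": "orthogonal",  # assumption: a separate row in the original table
    "tr": "triangular (or quasi-triangular)",
    "tz": "trapezoidal",
    "un": "unitary",
}


def decode_routine_name(name):
    """Split a name of the form p?yyzz or p?yyzzz into its described parts."""
    name = name.lower()
    if len(name) not in (6, 7) or name[0] != "p":
        raise ValueError("not of the form p?yyzz or p?yyzzz: " + name)
    if name[1] not in DATA_TYPES:
        raise ValueError("unknown data type code: " + name[1])
    if name[2:4] not in MATRIX_TYPES:
        raise ValueError("unknown matrix type code: " + name[2:4])
    return {
        "data_type": DATA_TYPES[name[1]],
        "matrix_type": MATRIX_TYPES[name[2:4]],
        "operation": name[4:],  # zz or zzz, same meaning as in LAPACK
    }


if __name__ == "__main__":
    print(decode_routine_name("pdgetrf"))
    print(decode_routine_name("pzpotrf"))
```

The sketch only covers names that follow the p?yyzzz pattern; auxiliary and tool routines with other naming schemes would be rejected by it.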
For driver routines, the last two letters zz or three letters zzz have the following meaning:
sv    a simple driver for solving a linear system
svx   an expert driver for solving a linear system
ls    a driver for solving a linear least squares problem
ev    a simple driver for solving a symmetric eigenvalue problem
evd   a simple driver for solving an eigenvalue problem using a divide and conquer algorithm
evx   an expert driver for solving a symmetric eigenvalue problem
svd   a driver for computing a singular value decomposition
gvx   an expert driver for solving a generalized symmetric definite eigenvalue problem
Simple driver here means that the driver just solves the general problem, whereas an expert driver is more versatile and can also optionally perform some related computations (for example, refining the solution and computing error bounds after the linear system is solved).

Computational Routines
In the sections that follow, the descriptions of ScaLAPACK computational routines are given. These routines perform distinct computational tasks that can be used for:
• Solving Systems of Linear Equations
• Orthogonal Factorizations and LLS Problems
• Symmetric Eigenproblems
• Nonsymmetric Eigenproblems
• Singular Value Decomposition
• Generalized Symmetric-Definite Eigenproblems
See also the respective driver routines.

Linear Equations
ScaLAPACK supports routines for the systems of equations with the following types of matrices:
• general
• general banded
• general diagonally dominant-like banded (including general tridiagonal)
• symmetric or Hermitian positive-definite
• symmetric or Hermitian positive-definite banded
• symmetric or Hermitian positive-definite tridiagonal
A diagonally dominant-like matrix is defined as a matrix for which it is known in advance that pivoting is not required in the LU factorization of this matrix.
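The definition above is stated in terms of prior knowledge rather than a test. One well-known sufficient condition is strict diagonal dominance by rows: if every diagonal entry is larger in magnitude than the sum of the magnitudes of the other entries in its row, LU factorization without pivoting does not break down. The following is a minimal sketch of that check; the function name is an assumption made for illustration, and the check is only one way a matrix can qualify, not the manual's definition.

```python
def is_strictly_diagonally_dominant(a):
    """Return True if the square matrix a (list of rows) is strictly
    diagonally dominant by rows: |a[i][i]| > sum of |a[i][j]| for j != i."""
    n = len(a)
    for i in range(n):
        off_diagonal = sum(abs(a[i][j]) for j in range(n) if j != i)
        if abs(a[i][i]) <= off_diagonal:
            return False
    return True


if __name__ == "__main__":
    tridiagonal = [[4.0, 1.0, 0.0],
                   [1.0, 4.0, 1.0],
                   [0.0, 1.0, 4.0]]
    print(is_strictly_diagonally_dominant(tridiagonal))
    print(is_strictly_diagonally_dominant([[1.0, 2.0], [3.0, 4.0]]))
```

A matrix that fails this check may still be safe to factor without pivoting, so a negative result here is not conclusive.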
For the above matrix types, the library includes routines for performing the following computations: factoring the matrix; equilibrating the matrix; solving a system of linear equations; estimating the condition number of a matrix; refining the solution of linear equations and computing its error bounds; inverting the matrix. Note that for some of the listed matrix types only part of the computational routines are provided (for example, routines that refine the solution are not provided for band or tridiagonal matrices). See Table “Computational Routines for Systems of Linear Equations” for the full list of available routines.
To solve a particular problem, you can either call two or more computational routines or call a corresponding driver routine that combines several tasks in one call. For example, to solve a system of linear equations with a general matrix, you can first call p?getrf (LU factorization) and then p?getrs (computing the solution). Then, you might wish to call p?gerfs to refine the solution and get the error bounds. Alternatively, you can just use the driver routine p?gesvx, which performs all these tasks in one call.
Table “Computational Routines for Systems of Linear Equations” lists the ScaLAPACK computational routines for factorizing, equilibrating, and inverting matrices, estimating their condition numbers, solving systems of equations with real matrices, refining the solution, and estimating its error.
Computational Routines for Systems of Linear Equations
Matrix type, storage scheme                          Factorize  Equilibrate  Solve    Condition  Estimate  Invert
                                                     matrix     matrix       system   number     error     matrix
general (partial pivoting)                           p?getrf    p?geequ      p?getrs  p?gecon    p?gerfs   p?getri
general band (partial pivoting)                      p?gbtrf                 p?gbtrs
general band (no pivoting)                           p?dbtrf                 p?dbtrs
general tridiagonal (no pivoting)                    p?dttrf                 p?dttrs
symmetric/Hermitian positive-definite                p?potrf    p?poequ      p?potrs  p?pocon    p?porfs   p?potri
symmetric/Hermitian positive-definite, band          p?pbtrf                 p?pbtrs
symmetric/Hermitian positive-definite, tridiagonal   p?pttrf                 p?pttrs
triangular                                                                   p?trtrs  p?trcon    p?trrfs   p?trtri
In this table ? stands for s (single precision real), d (double precision real), c (single precision complex), or z (double precision complex).

Routines for Matrix Factorization
This section describes the ScaLAPACK routines for matrix factorization. The following factorizations are supported:
• LU factorization of general matrices
• LU factorization of diagonally dominant-like matrices
• Cholesky factorization of real symmetric or complex Hermitian positive-definite matrices
You can compute the factorizations using full and band storage of matrices.

p?getrf
Computes the LU factorization of a general m-by-n distributed matrix.
Syntax
call psgetrf(m, n, a, ia, ja, desca, ipiv, info)
call pdgetrf(m, n, a, ia, ja, desca, ipiv, info)
call pcgetrf(m, n, a, ia, ja, desca, ipiv, info)
call pzgetrf(m, n, a, ia, ja, desca, ipiv, info)
Include Files
• C: mkl_scalapack.h
Description
The p?getrf routine forms the LU factorization of a general m-by-n distributed matrix sub(A) = A(ia:ia+m-1, ja:ja+n-1) as
A = P*L*U
where P is a permutation matrix, L is lower triangular with unit diagonal elements (lower trapezoidal if m > n) and U is upper triangular (upper trapezoidal if m < n). L and U are stored in sub(A).
The routine uses partial pivoting, with row interchanges.
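The form A = P*L*U with partial pivoting can be seen on a small example. The following is a serial, pure-Python sketch for a square matrix held in ordinary nested lists; it is not the distributed algorithm used by p?getrf and does not call Intel MKL. The function name and the representation of P as a row permutation list are assumptions made for illustration.

```python
def lu_partial_pivot(a):
    """Factor a square matrix with partial pivoting (row interchanges).

    Returns (perm, L, U) such that row i of L*U equals row perm[i] of a,
    which is the list form of A = P*L*U.
    """
    n = len(a)
    lu = [[float(x) for x in row] for row in a]   # working copy
    perm = list(range(n))
    for k in range(n):
        # Choose the row with the largest magnitude in column k as pivot.
        p = max(range(k, n), key=lambda i: abs(lu[i][k]))
        if lu[p][k] == 0.0:
            continue  # zero pivot: U is exactly singular in this column
        if p != k:
            lu[k], lu[p] = lu[p], lu[k]
            perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, n):
            lu[i][k] /= lu[k][k]                  # multiplier, stored in L's place
            for j in range(k + 1, n):
                lu[i][j] -= lu[i][k] * lu[k][j]
    # Unpack: unit lower triangular L and upper triangular U.
    L = [[lu[i][j] if j < i else (1.0 if j == i else 0.0) for j in range(n)]
         for i in range(n)]
    U = [[lu[i][j] if j >= i else 0.0 for j in range(n)] for i in range(n)]
    return perm, L, U


def matmul(x, y):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]


if __name__ == "__main__":
    A = [[2.0, 1.0, 1.0], [4.0, 3.0, 3.0], [8.0, 7.0, 9.0]]
    perm, L, U = lu_partial_pivot(A)
    print(perm)
    print(matmul(L, U))
```

As in the routine description, the unit diagonal of L is implicit in the working array and both factors share the storage of the input until they are unpacked.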
Input Parameters
m (global) INTEGER. The number of rows in the distributed submatrix sub(A); m ≥ 0.
n (global) INTEGER. The number of columns in the distributed submatrix sub(A); n ≥ 0.
a (local) REAL for psgetrf; DOUBLE PRECISION for pdgetrf; COMPLEX for pcgetrf; DOUBLE COMPLEX for pzgetrf. Pointer into the local memory to an array of local dimension (lld_a, LOCc(ja+n-1)). Contains the local pieces of the distributed matrix sub(A) to be factored.
ia, ja (global) INTEGER. The row and column indices in the global array A indicating the first row and the first column of the submatrix A(ia:ia+m-1, ja:ja+n-1), respectively.
desca (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix A.
Output Parameters
a Overwritten by local pieces of the factors L and U from the factorization A = P*L*U. The unit diagonal elements of L are not stored.
ipiv (local) INTEGER array. The dimension of ipiv is (LOCr(m_a) + mb_a). This array contains the pivoting information: local row i was interchanged with global row ipiv(i). This array is tied to the distributed matrix A.
info (global) INTEGER.
If info = 0, the execution is successful.
info < 0: if the i-th argument is an array and the j-th entry had an illegal value, then info = -(i*100+j); if the i-th argument is a scalar and had an illegal value, then info = -i.
info > 0: if info = i, u_ii is 0. The factorization has been completed, but the factor U is exactly singular. Division by zero will occur if you use the factor U for solving a system of linear equations.

p?gbtrf
Computes the LU factorization of a general n-by-n banded distributed matrix.
Syntax
call psgbtrf(n, bwl, bwu, a, ja, desca, ipiv, af, laf, work, lwork, info)
call pdgbtrf(n, bwl, bwu, a, ja, desca, ipiv, af, laf, work, lwork, info)
call pcgbtrf(n, bwl, bwu, a, ja, desca, ipiv, af, laf, work, lwork, info)
call pzgbtrf(n, bwl, bwu, a, ja, desca, ipiv, af, laf, work, lwork, info)
Include Files
• C: mkl_scalapack.h
Description
The p?gbtrf routine computes the LU factorization of a general n-by-n real/complex banded distributed matrix A(1:n, ja:ja+n-1) using partial pivoting with row interchanges.
The resulting factorization is not the same factorization as returned from the LAPACK routine ?gbtrf. Additional permutations are performed on the matrix for the sake of parallelism.
The factorization has the form
A(1:n, ja:ja+n-1) = P*L*U*Q
where P and Q are permutation matrices, and L and U are banded lower and upper triangular matrices, respectively. The matrix Q represents reordering of columns for the sake of parallelism, while P represents reordering of rows for numerical stability using classic partial pivoting.
Input Parameters
n (global) INTEGER. The number of rows and columns in the distributed submatrix A(1:n, ja:ja+n-1); n ≥ 0.
bwl (global) INTEGER. The number of sub-diagonals within the band of A (0 ≤ bwl ≤ n-1).
bwu (global) INTEGER. The number of super-diagonals within the band of A (0 ≤ bwu ≤ n-1).
a (local) REAL for psgbtrf; DOUBLE PRECISION for pdgbtrf; COMPLEX for pcgbtrf; DOUBLE COMPLEX for pzgbtrf. Pointer into the local memory to an array of local dimension (lld_a, LOCc(ja+n-1)), where lld_a ≥ 2*bwl + 2*bwu + 1. Contains the local pieces of the n-by-n distributed banded matrix A(1:n, ja:ja+n-1) to be factored.
ja (global) INTEGER. The index in the global array A that points to the start of the matrix to be operated on (which may be either all of A or a submatrix of A).
desca (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix A.
If desca(dtype_) = 501, then dlen_ ≥ 7; else if desca(dtype_) = 1, then dlen_ ≥ 9.
laf (local) INTEGER. The dimension of the array af. Must be laf ≥ (NB+bwu)*(bwl+bwu) + 6*(bwl+bwu)*(bwl+2*bwu). If laf is not large enough, an error code will be returned and the minimum acceptable size will be returned in af(1).
work (local) Same type as a. Workspace array of dimension lwork.
lwork (local or global) INTEGER. The size of the work array (lwork ≥ 1). If lwork is too small, the minimal acceptable size will be returned in work(1) and an error code is returned.
Output Parameters
a On exit, this array contains details of the factorization. Note that additional permutations are performed on the matrix, so that the factors returned are different from those returned by LAPACK.
ipiv (local) INTEGER array. The dimension of ipiv must be ≥ desca(NB). Contains pivot indices for local factorizations. Note that you should not alter the contents of this array between factorization and solve.
af (local) REAL for psgbtrf; DOUBLE PRECISION for pdgbtrf; COMPLEX for pcgbtrf; DOUBLE COMPLEX for pzgbtrf. Array, dimension (laf). Auxiliary fill-in space. Fill-in is created during the factorization routine p?gbtrf and is stored in af. Note that if a linear system is to be solved using p?gbtrs after the factorization routine, af must not be altered after the factorization.
work(1) On exit, work(1) contains the minimum value of lwork required for optimum performance.
info (global) INTEGER.
If info = 0, the execution is successful.
info < 0: If the i-th argument is an array and the j-th entry had an illegal value, then info = -(i*100+j); if the i-th argument is a scalar and had an illegal value, then info = -i.
info > 0: If info = k ≤ NPROCS, the submatrix stored on processor info and factored locally was not nonsingular, and the factorization was not completed.
If info = k > NPROCS, the submatrix stored on processor info-NPROCS representing interactions with other processors was not nonsingular, and the factorization was not completed.

p?dbtrf
Computes the LU factorization of an n-by-n diagonally dominant-like banded distributed matrix.
Syntax
call psdbtrf(n, bwl, bwu, a, ja, desca, af, laf, work, lwork, info)
call pddbtrf(n, bwl, bwu, a, ja, desca, af, laf, work, lwork, info)
call pcdbtrf(n, bwl, bwu, a, ja, desca, af, laf, work, lwork, info)
call pzdbtrf(n, bwl, bwu, a, ja, desca, af, laf, work, lwork, info)
Include Files
• C: mkl_scalapack.h
Description
The p?dbtrf routine computes the LU factorization of an n-by-n real/complex diagonally dominant-like banded distributed matrix A(1:n, ja:ja+n-1) without pivoting.
Note that the resulting factorization is not the same factorization as returned from LAPACK. Additional permutations are performed on the matrix for the sake of parallelism.
Input Parameters
n (global) INTEGER. The number of rows and columns in the distributed submatrix A(1:n, ja:ja+n-1); n ≥ 0.
bwl (global) INTEGER. The number of sub-diagonals within the band of A (0 ≤ bwl ≤ n-1).
bwu (global) INTEGER. The number of super-diagonals within the band of A (0 ≤ bwu ≤ n-1).
a (local) REAL for psdbtrf; DOUBLE PRECISION for pddbtrf; COMPLEX for pcdbtrf; DOUBLE COMPLEX for pzdbtrf. Pointer into the local memory to an array of local dimension (lld_a, LOCc(ja+n-1)). Contains the local pieces of the n-by-n distributed banded matrix A(1:n, ja:ja+n-1) to be factored.
ja (global) INTEGER. The index in the global array A that points to the start of the matrix to be operated on (which may be either all of A or a submatrix of A).
desca (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix A. If desca(dtype_) = 501, then dlen_ ≥ 7; else if desca(dtype_) = 1, then dlen_ ≥ 9.
laf (local) INTEGER. The dimension of the array af.
Must be laf ≥ NB*(bwl+bwu) + 6*(max(bwl,bwu))^2. If laf is not large enough, an error code will be returned and the minimum acceptable size will be returned in af(1).
work (local) Same type as a. Workspace array of dimension lwork.
lwork (local or global) INTEGER. The size of the work array, must be lwork ≥ (max(bwl,bwu))^2. If lwork is too small, the minimal acceptable size will be returned in work(1) and an error code is returned.
Output Parameters
a On exit, this array contains details of the factorization. Note that additional permutations are performed on the matrix, so that the factors returned are different from those returned by LAPACK.
af (local) REAL for psdbtrf; DOUBLE PRECISION for pddbtrf; COMPLEX for pcdbtrf; DOUBLE COMPLEX for pzdbtrf. Array, dimension (laf). Auxiliary fill-in space. Fill-in is created during the factorization routine p?dbtrf and is stored in af. Note that if a linear system is to be solved using p?dbtrs after the factorization routine, af must not be altered after the factorization.
work(1) On exit, work(1) contains the minimum value of lwork required for optimum performance.
info (global) INTEGER.
If info = 0, the execution is successful.
info < 0: If the i-th argument is an array and the j-th entry had an illegal value, then info = -(i*100+j); if the i-th argument is a scalar and had an illegal value, then info = -i.
info > 0: If info = k ≤ NPROCS, the submatrix stored on processor info and factored locally was not diagonally dominant-like, and the factorization was not completed. If info = k > NPROCS, the submatrix stored on processor info-NPROCS representing interactions with other processors was not nonsingular, and the factorization was not completed.

p?dttrf
Computes the LU factorization of a diagonally dominant-like tridiagonal distributed matrix.
Syntax
call psdttrf(n, dl, d, du, ja, desca, af, laf, work, lwork, info)
call pddttrf(n, dl, d, du, ja, desca, af, laf, work, lwork, info)
call pcdttrf(n, dl, d, du, ja, desca, af, laf, work, lwork, info)
call pzdttrf(n, dl, d, du, ja, desca, af, laf, work, lwork, info)
Include Files
• C: mkl_scalapack.h
Description
The p?dttrf routine computes the LU factorization of an n-by-n real/complex diagonally dominant-like tridiagonal distributed matrix A(1:n, ja:ja+n-1) without pivoting for stability.
The resulting factorization is not the same factorization as returned from LAPACK. Additional permutations are performed on the matrix for the sake of parallelism.
The factorization has the form:
A(1:n, ja:ja+n-1) = P*L*U*P^T,
where P is a permutation matrix, and L and U are banded lower and upper triangular matrices, respectively.
Input Parameters
n (global) INTEGER. The number of rows and columns to be operated on, that is, the order of the distributed submatrix A(1:n, ja:ja+n-1) (n ≥ 0).
dl, d, du (local) REAL for psdttrf; DOUBLE PRECISION for pddttrf; COMPLEX for pcdttrf; DOUBLE COMPLEX for pzdttrf. Pointers to the local arrays of dimension (desca(nb_)) each.
On entry, the array dl contains the local part of the global vector storing the subdiagonal elements of the matrix. Globally, dl(1) is not referenced, and dl must be aligned with d.
On entry, the array d contains the local part of the global vector storing the diagonal elements of the matrix.
On entry, the array du contains the local part of the global vector storing the super-diagonal elements of the matrix. du(n) is not referenced, and du must be aligned with d.
ja (global) INTEGER. The index in the global array A that points to the start of the matrix to be operated on (which may be either all of A or a submatrix of A).
desca (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix A.
If desca(dtype_) = 501, then dlen_ ≥ 7; else if desca(dtype_) = 1, then dlen_ ≥ 9.
laf (local) INTEGER. The dimension of the array af. Must be laf ≥ 2*(NB+2). If laf is not large enough, an error code will be returned and the minimum acceptable size will be returned in af(1).
work (local) Same type as d. Workspace array of dimension lwork.
lwork (local or global) INTEGER. The size of the work array, must be at least lwork ≥ 8*NPCOL.
Output Parameters
dl, d, du On exit, overwritten by the information containing the factors of the matrix.
af (local) REAL for psdttrf; DOUBLE PRECISION for pddttrf; COMPLEX for pcdttrf; DOUBLE COMPLEX for pzdttrf. Array, dimension (laf). Auxiliary fill-in space. Fill-in is created during the factorization routine p?dttrf and is stored in af. Note that if a linear system is to be solved using p?dttrs after the factorization routine, af must not be altered.
work(1) On exit, work(1) contains the minimum value of lwork required for optimum performance.
info (global) INTEGER.
If info = 0, the execution is successful.
info < 0: If the i-th argument is an array and the j-th entry had an illegal value, then info = -(i*100+j); if the i-th argument is a scalar and had an illegal value, then info = -i.
info > 0: If info = k ≤ NPROCS, the submatrix stored on processor info and factored locally was not diagonally dominant-like, and the factorization was not completed. If info = k > NPROCS, the submatrix stored on processor info-NPROCS representing interactions with other processors was not nonsingular, and the factorization was not completed.

p?potrf
Computes the Cholesky factorization of a symmetric (Hermitian) positive-definite distributed matrix.
Syntax
call pspotrf(uplo, n, a, ia, ja, desca, info)
call pdpotrf(uplo, n, a, ia, ja, desca, info)
call pcpotrf(uplo, n, a, ia, ja, desca, info)
call pzpotrf(uplo, n, a, ia, ja, desca, info)
Include Files
• C: mkl_scalapack.h
Description
The p?potrf routine computes the Cholesky factorization of a real symmetric or complex Hermitian positive-definite distributed n-by-n matrix A(ia:ia+n-1, ja:ja+n-1), denoted below as sub(A).
The factorization has the form
sub(A) = U^H*U if uplo = 'U', or
sub(A) = L*L^H if uplo = 'L'
where L is a lower triangular matrix and U is upper triangular.
Input Parameters
uplo (global) CHARACTER*1. Indicates whether the upper or lower triangular part of sub(A) is stored. Must be 'U' or 'L'.
If uplo = 'U', the array a stores the upper triangular part of the matrix sub(A) that is factored as U^H*U.
If uplo = 'L', the array a stores the lower triangular part of the matrix sub(A) that is factored as L*L^H.
n (global) INTEGER. The order of the distributed submatrix sub(A) (n ≥ 0).
a (local) REAL for pspotrf; DOUBLE PRECISION for pdpotrf; COMPLEX for pcpotrf; DOUBLE COMPLEX for pzpotrf. Pointer into the local memory to an array of dimension (lld_a, LOCc(ja+n-1)). On entry, this array contains the local pieces of the n-by-n symmetric/Hermitian distributed matrix sub(A) to be factored. Depending on uplo, the array a contains either the upper or the lower triangular part of the matrix sub(A) (see uplo).
ia, ja (global) INTEGER. The row and column indices in the global array A indicating the first row and the first column of the submatrix sub(A), respectively.
desca (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix A.
Output Parameters
a The upper or lower triangular part of a is overwritten by the Cholesky factor U or L, as specified by uplo.
info (global) INTEGER.
If info = 0, the execution is successful.
info < 0: if the i-th argument is an array and the j-th entry had an illegal value, then info = -(i*100+j); if the i-th argument is a scalar and had an illegal value, then info = -i.
If info = k > 0, the leading minor of order k, A(ia:ia+k-1, ja:ja+k-1), is not positive-definite, and the factorization could not be completed.

p?pbtrf
Computes the Cholesky factorization of a symmetric (Hermitian) positive-definite banded distributed matrix.
Syntax
call pspbtrf(uplo, n, bw, a, ja, desca, af, laf, work, lwork, info)
call pdpbtrf(uplo, n, bw, a, ja, desca, af, laf, work, lwork, info)
call pcpbtrf(uplo, n, bw, a, ja, desca, af, laf, work, lwork, info)
call pzpbtrf(uplo, n, bw, a, ja, desca, af, laf, work, lwork, info)
Include Files
• C: mkl_scalapack.h
Description
The p?pbtrf routine computes the Cholesky factorization of an n-by-n real symmetric or complex Hermitian positive-definite banded distributed matrix A(1:n, ja:ja+n-1).
The resulting factorization is not the same factorization as returned from LAPACK. Additional permutations are performed on the matrix for the sake of parallelism.
The factorization has the form:
A(1:n, ja:ja+n-1) = P*U^H*U*P^T, if uplo = 'U', or
A(1:n, ja:ja+n-1) = P*L*L^H*P^T, if uplo = 'L',
where P is a permutation matrix and U and L are banded upper and lower triangular matrices, respectively.
Input Parameters
uplo (global) CHARACTER*1. Must be 'U' or 'L'.
If uplo = 'U', upper triangle of A(1:n, ja:ja+n-1) is stored;
If uplo = 'L', lower triangle of A(1:n, ja:ja+n-1) is stored.
n (global) INTEGER. The order of the distributed submatrix A(1:n, ja:ja+n-1) (n ≥ 0).
bw (global) INTEGER. The number of superdiagonals of the distributed matrix if uplo = 'U', or the number of subdiagonals if uplo = 'L' (bw ≥ 0).
a (local) REAL for pspbtrf; DOUBLE PRECISION for pdpbtrf; COMPLEX for pcpbtrf; DOUBLE COMPLEX for pzpbtrf.
Pointer into the local memory to an array of dimension (lld_a, LOCc(ja+n-1)). On entry, this array contains the local pieces of the upper or lower triangle of the symmetric/Hermitian band distributed matrix A(1:n, ja:ja+n-1) to be factored.
ja (global) INTEGER. The index in the global array A that points to the start of the matrix to be operated on (which may be either all of A or a submatrix of A).
desca (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix A. If desca(dtype_) = 501, then dlen_ = 7; else if desca(dtype_) = 1, then dlen_ = 9.
laf (local) INTEGER. The dimension of the array af. Must be laf ≥ (NB+2*bw)*bw. If laf is not large enough, an error code will be returned and the minimum acceptable size will be returned in af(1).
work (local) Same type as a. Workspace array of dimension lwork.
lwork (local or global) INTEGER. The size of the work array, must be lwork ≥ bw^2.

Output Parameters
a On exit, if info=0, contains the permuted triangular factor U or L from the Cholesky factorization of the band matrix A(1:n, ja:ja+n-1), as specified by uplo.
af (local) REAL for pspbtrf, DOUBLE PRECISION for pdpbtrf, COMPLEX for pcpbtrf, DOUBLE COMPLEX for pzpbtrf. Array, dimension (laf). Auxiliary fill-in space. Fill-in is created during the factorization routine p?pbtrf and is stored in af. Note that if a linear system is to be solved using p?pbtrs after the factorization routine, af must not be altered.
work(1) On exit, work(1) contains the minimum value of lwork required for optimum performance.
info (global) INTEGER. If info=0, the execution is successful. info < 0: if the i-th argument is an array and the j-th entry had an illegal value, then info = -(i*100+j); if the i-th argument is a scalar and had an illegal value, then info = -i. info > 0: if info = k ≤ NPROCS, the submatrix stored on processor info and factored locally was not positive definite, and the factorization was not completed.
If info = k > NPROCS, the submatrix stored on processor info-NPROCS representing interactions with other processors was not nonsingular, and the factorization was not completed.

p?pttrf
Computes the Cholesky factorization of a symmetric (Hermitian) positive-definite tridiagonal distributed matrix.

Syntax
call pspttrf(n, d, e, ja, desca, af, laf, work, lwork, info)
call pdpttrf(n, d, e, ja, desca, af, laf, work, lwork, info)
call pcpttrf(n, d, e, ja, desca, af, laf, work, lwork, info)
call pzpttrf(n, d, e, ja, desca, af, laf, work, lwork, info)

Include Files
• C: mkl_scalapack.h

Description
The p?pttrf routine computes the Cholesky factorization of an n-by-n real symmetric or complex Hermitian positive-definite tridiagonal distributed matrix A(1:n, ja:ja+n-1). The resulting factorization is not the same factorization as returned from LAPACK. Additional permutations are performed on the matrix for the sake of parallelism. The factorization has the form:
A(1:n, ja:ja+n-1) = P*L*D*L^H*P^T, or
A(1:n, ja:ja+n-1) = P*U^H*D*U*P^T,
where P is a permutation matrix, and U and L are tridiagonal upper and lower triangular matrices, respectively.

Input Parameters
n (global) INTEGER. The order of the distributed submatrix A(1:n, ja:ja+n-1) (n ≥ 0).
d, e (local) REAL for pspttrf, DOUBLE PRECISION for pdpttrf, COMPLEX for pcpttrf, DOUBLE COMPLEX for pzpttrf. Pointers into the local memory to arrays of dimension (desca(nb_)) each. On entry, the array d contains the local part of the global vector storing the main diagonal of the distributed matrix A. On entry, the array e contains the local part of the global vector storing the upper diagonal of the distributed matrix A.
ja (global) INTEGER. The index in the global array A that points to the start of the matrix to be operated on (which may be either all of A or a submatrix of A).
desca (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix A.
If desca(dtype_) = 501, then dlen_ = 7; else if desca(dtype_) = 1, then dlen_ = 9.
laf (local) INTEGER. The dimension of the array af. Must be laf ≥ NB+2. If laf is not large enough, an error code will be returned and the minimum acceptable size will be returned in af(1).
work (local) Same type as d and e. Workspace array of dimension lwork.
lwork (local or global) INTEGER. The size of the work array, must be at least lwork ≥ 8*NPCOL.

Output Parameters
d, e On exit, overwritten by the details of the factorization.
af (local) REAL for pspttrf, DOUBLE PRECISION for pdpttrf, COMPLEX for pcpttrf, DOUBLE COMPLEX for pzpttrf. Array, dimension (laf). Auxiliary fill-in space. Fill-in is created during the factorization routine p?pttrf and is stored in af. Note that if a linear system is to be solved using p?pttrs after the factorization routine, af must not be altered.
work(1) On exit, work(1) contains the minimum value of lwork required for optimum performance.
info (global) INTEGER. If info=0, the execution is successful. info < 0: if the i-th argument is an array and the j-th entry had an illegal value, then info = -(i*100+j); if the i-th argument is a scalar and had an illegal value, then info = -i. info > 0: if info = k ≤ NPROCS, the submatrix stored on processor info and factored locally was not positive definite, and the factorization was not completed. If info = k > NPROCS, the submatrix stored on processor info-NPROCS representing interactions with other processors was not nonsingular, and the factorization was not completed.

Routines for Solving Systems of Linear Equations
This section describes the ScaLAPACK routines for solving systems of linear equations. Before calling most of these routines, you need to factorize the matrix of your system of equations (see Routines for Matrix Factorization in this chapter). However, the factorization is not necessary if your system of equations has a triangular matrix.
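As a serial, non-distributed illustration of the factor-then-solve workflow described above, the following minimal Python sketch factors a small real symmetric positive-definite matrix as L*L^T and then solves the system with two triangular substitutions. The function names (cholesky_lower, forward_substitution, backward_substitution_transposed, solve_spd) are hypothetical helpers for this example and are not ScaLAPACK or Intel MKL routines; the sketch assumes a dense real matrix held as a list of rows.

```python
import math


def cholesky_lower(a):
    """Return L such that a = L * L^T for a real symmetric positive-definite a.

    Raises ValueError naming the order k of the first leading minor that is
    not positive-definite, which is the situation reported as info = k > 0.
    """
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for j in range(n):
        s = a[j][j] - sum(l[j][k] ** 2 for k in range(j))
        if s <= 0.0:
            raise ValueError(
                f"leading minor of order {j + 1} is not positive-definite"
            )
        l[j][j] = math.sqrt(s)
        for i in range(j + 1, n):
            t = a[i][j] - sum(l[i][k] * l[j][k] for k in range(j))
            l[i][j] = t / l[j][j]
    return l


def forward_substitution(l, b):
    """Solve L * y = b for lower triangular L (no factorization needed)."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(l[i][k] * y[k] for k in range(i))) / l[i][i]
    return y


def backward_substitution_transposed(l, y):
    """Solve L^T * x = y using the entries of lower triangular L."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(l[k][i] * x[k] for k in range(i + 1, n))
        x[i] = (y[i] - s) / l[i][i]
    return x


def solve_spd(a, b):
    """Factor once, then solve a * x = b with two triangular solves."""
    l = cholesky_lower(a)
    return backward_substitution_transposed(l, forward_substitution(l, b))
```

The two substitution helpers need no factorization of their own, which is the sense in which a system with a triangular matrix can be solved directly.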
p?getrs
Solves a system of distributed linear equations with a general square matrix, using the LU factorization computed by p?getrf.

Syntax
call psgetrs(trans, n, nrhs, a, ia, ja, desca, ipiv, b, ib, jb, descb, info)
call pdgetrs(trans, n, nrhs, a, ia, ja, desca, ipiv, b, ib, jb, descb, info)
call pcgetrs(trans, n, nrhs, a, ia, ja, desca, ipiv, b, ib, jb, descb, info)
call pzgetrs(trans, n, nrhs, a, ia, ja, desca, ipiv, b, ib, jb, descb, info)

Include Files
• C: mkl_scalapack.h

Description
The p?getrs routine solves a system of distributed linear equations with a general n-by-n distributed matrix sub(A) = A(ia:ia+n-1, ja:ja+n-1) using the LU factorization computed by p?getrf. The system has one of the following forms specified by trans:
sub(A)*X = sub(B) (no transpose),
sub(A)^T*X = sub(B) (transpose),
sub(A)^H*X = sub(B) (conjugate transpose),
where sub(B) = B(ib:ib+n-1, jb:jb+nrhs-1). Before calling this routine, you must call p?getrf to compute the LU factorization of sub(A).

Input Parameters
trans (global) CHARACTER*1. Must be 'N' or 'T' or 'C'. Indicates the form of the equations: If trans = 'N', then sub(A)*X = sub(B) is solved for X. If trans = 'T', then sub(A)^T*X = sub(B) is solved for X. If trans = 'C', then sub(A)^H*X = sub(B) is solved for X.
n (global) INTEGER. The number of linear equations; the order of the submatrix sub(A) (n ≥ 0).
nrhs (global) INTEGER. The number of right hand sides; the number of columns of the distributed submatrix sub(B) (nrhs ≥ 0).
a, b (local) REAL for psgetrs, DOUBLE PRECISION for pdgetrs, COMPLEX for pcgetrs, DOUBLE COMPLEX for pzgetrs. Pointers into the local memory to arrays of local dimension a(lld_a, LOCc(ja+n-1)) and b(lld_b, LOCc(jb+nrhs-1)), respectively. On entry, the array a contains the local pieces of the factors L and U from the factorization sub(A) = P*L*U; the unit diagonal elements of L are not stored. On entry, the array b contains the right hand sides sub(B).
ia, ja (global) INTEGER. The row and column indices in the global array A indicating the first row and the first column of the submatrix sub(A), respectively.
desca (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix A.
ipiv (local) INTEGER array. The dimension of ipiv is (LOCr(m_a) + mb_a). This array contains the pivoting information: local row i of the matrix was interchanged with the global row ipiv(i). This array is tied to the distributed matrix A.
ib, jb (global) INTEGER. The row and column indices in the global array B indicating the first row and the first column of the submatrix sub(B), respectively.
descb (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix B.

Output Parameters
b On exit, overwritten by the solution distributed matrix X.
info INTEGER. If info=0, the execution is successful. info < 0: if the i-th argument is an array and the j-th entry had an illegal value, then info = -(i*100+j); if the i-th argument is a scalar and had an illegal value, then info = -i.

p?gbtrs
Solves a system of distributed linear equations with a general band matrix, using the LU factorization computed by p?gbtrf.

Syntax
call psgbtrs(trans, n, bwl, bwu, nrhs, a, ja, desca, ipiv, b, ib, descb, af, laf, work, lwork, info)
call pdgbtrs(trans, n, bwl, bwu, nrhs, a, ja, desca, ipiv, b, ib, descb, af, laf, work, lwork, info)
call pcgbtrs(trans, n, bwl, bwu, nrhs, a, ja, desca, ipiv, b, ib, descb, af, laf, work, lwork, info)
call pzgbtrs(trans, n, bwl, bwu, nrhs, a, ja, desca, ipiv, b, ib, descb, af, laf, work, lwork, info)

Include Files
• C: mkl_scalapack.h

Description
The p?gbtrs routine solves a system of distributed linear equations with a general band distributed matrix sub(A) = A(1:n, ja:ja+n-1) using the LU factorization computed by p?gbtrf.
The system has one of the following forms specified by trans:
sub(A)*X = sub(B) (no transpose),
sub(A)^T*X = sub(B) (transpose),
sub(A)^H*X = sub(B) (conjugate transpose),
where sub(B) = B(ib:ib+n-1, 1:nrhs). Before calling this routine, you must call p?gbtrf to compute the LU factorization of sub(A).

Input Parameters
trans (global) CHARACTER*1. Must be 'N' or 'T' or 'C'. Indicates the form of the equations: If trans = 'N', then sub(A)*X = sub(B) is solved for X. If trans = 'T', then sub(A)^T*X = sub(B) is solved for X. If trans = 'C', then sub(A)^H*X = sub(B) is solved for X.
n (global) INTEGER. The number of linear equations; the order of the distributed submatrix sub(A) (n ≥ 0).
bwl (global) INTEGER. The number of sub-diagonals within the band of A (0 ≤ bwl ≤ n-1).
bwu (global) INTEGER. The number of super-diagonals within the band of A (0 ≤ bwu ≤ n-1).
nrhs (global) INTEGER. The number of right hand sides; the number of columns of the distributed submatrix sub(B) (nrhs ≥ 0).
a, b (local) REAL for psgbtrs, DOUBLE PRECISION for pdgbtrs, COMPLEX for pcgbtrs, DOUBLE COMPLEX for pzgbtrs. Pointers into the local memory to arrays of local dimension a(lld_a, LOCc(ja+n-1)) and b(lld_b, LOCc(nrhs)), respectively. The array a contains details of the LU factorization of the distributed band matrix A. On entry, the array b contains the local pieces of the right hand sides B(ib:ib+n-1, 1:nrhs).
ja (global) INTEGER. The index in the global array A that points to the start of the matrix to be operated on (which may be either all of A or a submatrix of A).
desca (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix A. If desca(dtype_) = 501, then dlen_ = 7; else if desca(dtype_) = 1, then dlen_ = 9.
ib (global) INTEGER. The row index in the global array B that points to the first row of the matrix to be operated on (which may be either all of B or a submatrix of B).
descb (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix B. If descb(dtype_) = 502, then dlen_ = 7; else if descb(dtype_) = 1, then dlen_ = 9.
laf (local) INTEGER. The dimension of the array af. Must be laf ≥ NB*(bwl+bwu)+6*(bwl+bwu)*(bwl+2*bwu). If laf is not large enough, an error code will be returned and the minimum acceptable size will be returned in af(1).
work (local) Same type as a. Workspace array of dimension lwork.
lwork (local or global) INTEGER. The size of the work array, must be at least lwork ≥ nrhs*(NB+2*bwl+4*bwu).

Output Parameters
ipiv (local) INTEGER array. The dimension of ipiv must be ≥ desca(NB). Contains pivot indices for local factorizations. Note that you should not alter the contents of this array between factorization and solve.
b On exit, overwritten by the local pieces of the solution distributed matrix X.
af (local) REAL for psgbtrs, DOUBLE PRECISION for pdgbtrs, COMPLEX for pcgbtrs, DOUBLE COMPLEX for pzgbtrs. Array, dimension (laf). Auxiliary fill-in space. Fill-in is created during the factorization routine p?gbtrf and is stored in af. Note that if a linear system is to be solved using p?gbtrs after the factorization routine, af must not be altered after the factorization.
work(1) On exit, work(1) contains the minimum value of lwork required for optimum performance.
info INTEGER. If info=0, the execution is successful. info < 0: if the i-th argument is an array and the j-th entry had an illegal value, then info = -(i*100+j); if the i-th argument is a scalar and had an illegal value, then info = -i.

p?dbtrs
Solves a system of linear equations with a diagonally dominant-like banded distributed matrix using the factorization computed by p?dbtrf.
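The workspace bounds stated for p?gbtrs in the preceding routine description are plain integer formulas, so they can be evaluated before af and work are allocated. The sketch below assumes that the stated expressions are minimum sizes and that NB denotes the block size taken from the descriptor; gbtrs_min_laf and gbtrs_min_lwork are hypothetical helper names, not library functions.

```python
def gbtrs_min_laf(nb, bwl, bwu):
    """Minimum laf quoted for p?gbtrs: NB*(bwl+bwu) + 6*(bwl+bwu)*(bwl+2*bwu)."""
    if min(nb, bwl, bwu) < 0:
        raise ValueError("nb, bwl and bwu must be non-negative")
    band = bwl + bwu
    return nb * band + 6 * band * (bwl + 2 * bwu)


def gbtrs_min_lwork(nb, bwl, bwu, nrhs):
    """Minimum lwork quoted for p?gbtrs: nrhs*(NB + 2*bwl + 4*bwu)."""
    if min(nb, bwl, bwu, nrhs) < 0:
        raise ValueError("all arguments must be non-negative")
    return nrhs * (nb + 2 * bwl + 4 * bwu)
```

For example, with nb = 64, bwl = 2, bwu = 3 and nrhs = 4, the helpers give laf = 560 and lwork = 320. If a routine reports that laf is too small, the value it returns in af(1) takes precedence over this estimate.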
Syntax
call psdbtrs(trans, n, bwl, bwu, nrhs, a, ja, desca, b, ib, descb, af, laf, work, lwork, info)
call pddbtrs(trans, n, bwl, bwu, nrhs, a, ja, desca, b, ib, descb, af, laf, work, lwork, info)
call pcdbtrs(trans, n, bwl, bwu, nrhs, a, ja, desca, b, ib, descb, af, laf, work, lwork, info)
call pzdbtrs(trans, n, bwl, bwu, nrhs, a, ja, desca, b, ib, descb, af, laf, work, lwork, info)

Include Files
• C: mkl_scalapack.h

Description
The p?dbtrs routine solves for X one of the systems of equations:
sub(A)*X = sub(B),
(sub(A))^T*X = sub(B), or
(sub(A))^H*X = sub(B),
where sub(A) = A(1:n, ja:ja+n-1) is a diagonally dominant-like banded distributed matrix, and sub(B) denotes the distributed matrix B(ib:ib+n-1, 1:nrhs). This routine uses the LU factorization computed by p?dbtrf.

Input Parameters
trans (global) CHARACTER*1. Must be 'N' or 'T' or 'C'. Indicates the form of the equations: If trans = 'N', then sub(A)*X = sub(B) is solved for X. If trans = 'T', then (sub(A))^T*X = sub(B) is solved for X. If trans = 'C', then (sub(A))^H*X = sub(B) is solved for X.
n (global) INTEGER. The order of the distributed submatrix sub(A) (n ≥ 0).
bwl (global) INTEGER. The number of subdiagonals within the band of A (0 ≤ bwl ≤ n-1).
bwu (global) INTEGER. The number of superdiagonals within the band of A (0 ≤ bwu ≤ n-1).
nrhs (global) INTEGER. The number of right hand sides; the number of columns of the distributed submatrix sub(B) (nrhs ≥ 0).
a, b (local) REAL for psdbtrs, DOUBLE PRECISION for pddbtrs, COMPLEX for pcdbtrs, DOUBLE COMPLEX for pzdbtrs. Pointers into the local memory to arrays of local dimension a(lld_a, LOCc(ja+n-1)) and b(lld_b, LOCc(nrhs)), respectively. On entry, the array a contains details of the LU factorization of the band matrix A, as computed by p?dbtrf. On entry, the array b contains the local pieces of the right hand side distributed matrix sub(B).
ja (global) INTEGER.
The index in the global array A that points to the start of the matrix to be operated on (which may be either all of A or a submatrix of A).
desca (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix A. If desca(dtype_) = 501, then dlen_ = 7; else if desca(dtype_) = 1, then dlen_ = 9.
ib (global) INTEGER. The row index in the global array B that points to the first row of the matrix to be operated on (which may be either all of B or a submatrix of B).
descb (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix B. If descb(dtype_) = 502, then dlen_ = 7; else if descb(dtype_) = 1, then dlen_ = 9.
af, work (local) REAL for psdbtrs, DOUBLE PRECISION for pddbtrs, COMPLEX for pcdbtrs, DOUBLE COMPLEX for pzdbtrs. Arrays of dimension (laf) and (lwork), respectively. The array af contains auxiliary fill-in space. Fill-in is created during the factorization routine p?dbtrf and is stored in af. The array work is a workspace array.
laf (local) INTEGER. The dimension of the array af. Must be laf ≥ NB*(bwl+bwu)+6*(max(bwl,bwu))^2. If laf is not large enough, an error code will be returned and the minimum acceptable size will be returned in af(1).
lwork (local or global) INTEGER. The size of the array work, must be at least lwork ≥ (max(bwl,bwu))^2.

Output Parameters
b On exit, this array contains the local pieces of the solution distributed matrix X.
work(1) On exit, work(1) contains the minimum value of lwork required for optimum performance.
info INTEGER. If info=0, the execution is successful. info < 0: if the i-th argument is an array and the j-th entry had an illegal value, then info = -(i*100+j); if the i-th argument is a scalar and had an illegal value, then info = -i.

p?dttrs
Solves a system of linear equations with a diagonally dominant-like tridiagonal distributed matrix using the factorization computed by p?dttrf.
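For orientation only, a serial analogue of solving a tridiagonal system without pivoting is the Thomas algorithm, which is safe to apply when the matrix is diagonally dominant. The sketch below is not the distributed algorithm used by p?dttrf and p?dttrs; solve_tridiagonal is a hypothetical helper, and it assumes three aligned lists dl, d and du of length n in which dl[0] and du[n-1] are never referenced.

```python
def solve_tridiagonal(dl, d, du, b):
    """Solve a tridiagonal system without pivoting (Thomas algorithm).

    Row i of the matrix holds dl[i] (subdiagonal), d[i] (diagonal) and
    du[i] (superdiagonal). dl[0] and du[n-1] are not referenced.
    """
    n = len(d)
    if not (len(dl) == len(du) == len(b) == n) or n == 0:
        raise ValueError("dl, d, du and b must share one non-zero length")

    c = [0.0] * n  # eliminated superdiagonal
    y = [0.0] * n  # eliminated right-hand side

    # Forward elimination: remove the subdiagonal row by row.
    pivot = d[0]
    if pivot == 0.0:
        raise ZeroDivisionError("zero pivot in row 1")
    if n > 1:
        c[0] = du[0] / pivot
    y[0] = b[0] / pivot
    for i in range(1, n):
        pivot = d[i] - dl[i] * c[i - 1]
        if pivot == 0.0:
            raise ZeroDivisionError(f"zero pivot in row {i + 1}")
        if i < n - 1:
            c[i] = du[i] / pivot
        y[i] = (b[i] - dl[i] * y[i - 1]) / pivot

    # Back substitution on the resulting unit upper bidiagonal system.
    x = [0.0] * n
    x[n - 1] = y[n - 1]
    for i in range(n - 2, -1, -1):
        x[i] = y[i] - c[i] * x[i + 1]
    return x
```

The sketch combines factorization and solve in one pass, whereas the ScaLAPACK routines keep them separate so that one factorization can serve several right-hand sides.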
Syntax
call psdttrs(trans, n, nrhs, dl, d, du, ja, desca, b, ib, descb, af, laf, work, lwork, info)
call pddttrs(trans, n, nrhs, dl, d, du, ja, desca, b, ib, descb, af, laf, work, lwork, info)
call pcdttrs(trans, n, nrhs, dl, d, du, ja, desca, b, ib, descb, af, laf, work, lwork, info)
call pzdttrs(trans, n, nrhs, dl, d, du, ja, desca, b, ib, descb, af, laf, work, lwork, info)

Include Files
• C: mkl_scalapack.h

Description
The p?dttrs routine solves for X one of the systems of equations:
sub(A)*X = sub(B),
(sub(A))^T*X = sub(B), or
(sub(A))^H*X = sub(B),
where sub(A) = A(1:n, ja:ja+n-1) is a diagonally dominant-like tridiagonal distributed matrix, and sub(B) denotes the distributed matrix B(ib:ib+n-1, 1:nrhs). This routine uses the LU factorization computed by p?dttrf.

Input Parameters
trans (global) CHARACTER*1. Must be 'N' or 'T' or 'C'. Indicates the form of the equations: If trans = 'N', then sub(A)*X = sub(B) is solved for X. If trans = 'T', then (sub(A))^T*X = sub(B) is solved for X. If trans = 'C', then (sub(A))^H*X = sub(B) is solved for X.
n (global) INTEGER. The order of the distributed submatrix sub(A) (n ≥ 0).
nrhs (global) INTEGER. The number of right hand sides; the number of columns of the distributed submatrix sub(B) (nrhs ≥ 0).
dl, d, du (local) REAL for psdttrs, DOUBLE PRECISION for pddttrs, COMPLEX for pcdttrs, DOUBLE COMPLEX for pzdttrs. Pointers to the local arrays of dimension (desca(nb_)) each. On entry, these arrays contain details of the factorization. Globally, dl(1) and du(n) are not referenced; dl and du must be aligned with d.
ja (global) INTEGER. The index in the global array A that points to the start of the matrix to be operated on (which may be either all of A or a submatrix of A).
desca (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix A. If desca(dtype_) = 501 or 502, then dlen_ = 7; else if desca(dtype_) = 1, then dlen_ = 9.
b (local) Same type as d.
Pointer into the local memory to an array of local dimension b(lld_b, LOCc(nrhs)). On entry, the array b contains the local pieces of the n-by-nrhs right hand side distributed matrix sub(B).
ib (global) INTEGER. The row index in the global array B that points to the first row of the matrix to be operated on (which may be either all of B or a submatrix of B).
descb (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix B. If descb(dtype_) = 502, then dlen_ = 7; else if descb(dtype_) = 1, then dlen_ = 9.
af, work (local) REAL for psdttrs, DOUBLE PRECISION for pddttrs, COMPLEX for pcdttrs, DOUBLE COMPLEX for pzdttrs. Arrays of dimension (laf) and (lwork), respectively. The array af contains auxiliary fill-in space. Fill-in is created during the factorization routine p?dttrf and is stored in af. If a linear system is to be solved using p?dttrs after the factorization routine, af must not be altered. The array work is a workspace array.
laf (local) INTEGER. The dimension of the array af. Must be laf ≥ NB*(bwl+bwu)+6*(bwl+bwu)*(bwl+2*bwu). If laf is not large enough, an error code will be returned and the minimum acceptable size will be returned in af(1).
lwork (local or global) INTEGER. The size of the array work, must be at least lwork ≥ 10*NPCOL+4*nrhs.

Output Parameters
b On exit, this array contains the local pieces of the solution distributed matrix X.
work(1) On exit, work(1) contains the minimum value of lwork required for optimum performance.
info INTEGER. If info=0, the execution is successful. info < 0: if the i-th argument is an array and the j-th entry had an illegal value, then info = -(i*100+j); if the i-th argument is a scalar and had an illegal value, then info = -i.

p?potrs
Solves a system of linear equations with a Cholesky-factored symmetric/Hermitian distributed positive-definite matrix.
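The negative info values used by the routines in this chapter pack an argument position and, for array arguments, an entry index into a single integer. A minimal sketch of unpacking that encoding follows. decode_info is a hypothetical helper, and it assumes that no routine has 100 or more arguments, so that -(i*100+j) and -i cannot collide; positive values are passed through unchanged because their meaning is specific to each routine.

```python
def decode_info(info):
    """Unpack an info value into (kind, argument_or_code, entry).

    info == 0          -> ("success", None, None)
    info == -(100*i+j) -> ("illegal array entry", i, j)
    info == -i         -> ("illegal scalar argument", i, None)
    info > 0           -> ("routine-specific failure", info, None)
    """
    if info == 0:
        return ("success", None, None)
    if info > 0:
        return ("routine-specific failure", info, None)
    code = -info
    if code >= 100:
        # Array argument: hundreds give the argument position,
        # the remainder gives the offending entry.
        return ("illegal array entry", code // 100, code % 100)
    return ("illegal scalar argument", code, None)
```

For instance, info = -702 decodes to entry 2 of the 7th argument, and info = -3 to the 3rd (scalar) argument.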
Syntax
call pspotrs(uplo, n, nrhs, a, ia, ja, desca, b, ib, jb, descb, info)
call pdpotrs(uplo, n, nrhs, a, ia, ja, desca, b, ib, jb, descb, info)
call pcpotrs(uplo, n, nrhs, a, ia, ja, desca, b, ib, jb, descb, info)
call pzpotrs(uplo, n, nrhs, a, ia, ja, desca, b, ib, jb, descb, info)

Include Files
• C: mkl_scalapack.h

Description
The p?potrs routine solves for X a system of distributed linear equations in the form:
sub(A)*X = sub(B),
where sub(A) = A(ia:ia+n-1, ja:ja+n-1) is an n-by-n real symmetric or complex Hermitian positive definite distributed matrix, and sub(B) denotes the distributed matrix B(ib:ib+n-1, jb:jb+nrhs-1). This routine uses the Cholesky factorization
sub(A) = U^H*U, or sub(A) = L*L^H
computed by p?potrf.

Input Parameters
uplo (global) CHARACTER*1. Must be 'U' or 'L'. If uplo = 'U', the upper triangle of sub(A) is stored; if uplo = 'L', the lower triangle of sub(A) is stored.
n (global) INTEGER. The order of the distributed submatrix sub(A) (n ≥ 0).
nrhs (global) INTEGER. The number of right hand sides; the number of columns of the distributed submatrix sub(B) (nrhs ≥ 0).
a, b (local) REAL for pspotrs, DOUBLE PRECISION for pdpotrs, COMPLEX for pcpotrs, DOUBLE COMPLEX for pzpotrs. Pointers into the local memory to arrays of local dimension a(lld_a, LOCc(ja+n-1)) and b(lld_b, LOCc(jb+nrhs-1)), respectively. The array a contains the factors L or U from the Cholesky factorization sub(A) = L*L^H or sub(A) = U^H*U, as computed by p?potrf. On entry, the array b contains the local pieces of the right hand sides sub(B).
ia, ja (global) INTEGER. The row and column indices in the global array A indicating the first row and the first column of the submatrix sub(A), respectively.
desca (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix A.
ib, jb (global) INTEGER. The row and column indices in the global array B indicating the first row and the first column of the submatrix sub(B), respectively.
descb (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix B.

Output Parameters
b Overwritten by the local pieces of the solution matrix X.
info INTEGER. If info=0, the execution is successful. info < 0: if the i-th argument is an array and the j-th entry had an illegal value, then info = -(i*100+j); if the i-th argument is a scalar and had an illegal value, then info = -i.

p?pbtrs
Solves a system of linear equations with a Cholesky-factored symmetric/Hermitian positive-definite band matrix.

Syntax
call pspbtrs(uplo, n, bw, nrhs, a, ja, desca, b, ib, descb, af, laf, work, lwork, info)
call pdpbtrs(uplo, n, bw, nrhs, a, ja, desca, b, ib, descb, af, laf, work, lwork, info)
call pcpbtrs(uplo, n, bw, nrhs, a, ja, desca, b, ib, descb, af, laf, work, lwork, info)
call pzpbtrs(uplo, n, bw, nrhs, a, ja, desca, b, ib, descb, af, laf, work, lwork, info)

Include Files
• C: mkl_scalapack.h

Description
The p?pbtrs routine solves for X a system of distributed linear equations in the form:
sub(A)*X = sub(B),
where sub(A) = A(1:n, ja:ja+n-1) is an n-by-n real symmetric or complex Hermitian positive definite distributed band matrix, and sub(B) denotes the distributed matrix B(ib:ib+n-1, 1:nrhs). This routine uses the Cholesky factorization
sub(A) = P*U^H*U*P^T, or sub(A) = P*L*L^H*P^T
computed by p?pbtrf.

Input Parameters
uplo (global) CHARACTER*1. Must be 'U' or 'L'. If uplo = 'U', the upper triangle of sub(A) is stored; if uplo = 'L', the lower triangle of sub(A) is stored.
n (global) INTEGER. The order of the distributed submatrix sub(A) (n ≥ 0).
bw (global) INTEGER. The number of superdiagonals of the distributed matrix if uplo = 'U', or the number of subdiagonals if uplo = 'L' (bw ≥ 0).
nrhs (global) INTEGER. The number of right hand sides; the number of columns of the distributed submatrix sub(B) (nrhs ≥ 0).
a, b (local) REAL for pspbtrs, DOUBLE PRECISION for pdpbtrs, COMPLEX for pcpbtrs, DOUBLE COMPLEX for pzpbtrs. Pointers into the local memory to arrays of local dimension a(lld_a, LOCc(ja+n-1)) and b(lld_b, LOCc(nrhs)), respectively. The array a contains the permuted triangular factor U or L from the Cholesky factorization sub(A) = P*U^H*U*P^T, or sub(A) = P*L*L^H*P^T of the band matrix A, as returned by p?pbtrf. On entry, the array b contains the local pieces of the n-by-nrhs right hand side distributed matrix sub(B).
ja (global) INTEGER. The index in the global array A that points to the start of the matrix to be operated on (which may be either all of A or a submatrix of A).
desca (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix A. If desca(dtype_) = 501, then dlen_ = 7; else if desca(dtype_) = 1, then dlen_ = 9.
ib (global) INTEGER. The row index in the global array B indicating the first row of the submatrix sub(B).
descb (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix B. If descb(dtype_) = 502, then dlen_ = 7; else if descb(dtype_) = 1, then dlen_ = 9.
af, work (local) Arrays, same type as a. The array af is of dimension (laf). It contains auxiliary fill-in space. Fill-in is created during the factorization routine p?pbtrf and is stored in af. The array work is a workspace array of dimension lwork.
laf (local) INTEGER. The dimension of the array af. Must be laf ≥ nrhs*bw. If laf is not large enough, an error code will be returned and the minimum acceptable size will be returned in af(1).
lwork (local or global) INTEGER. The size of the array work, must be at least lwork ≥ bw^2.

Output Parameters
b On exit, if info=0, this array contains the local pieces of the n-by-nrhs solution distributed matrix X.
work(1) On exit, work(1) contains the minimum value of lwork required for optimum performance.
info INTEGER.
If info=0, the execution is successful. info < 0: if the i-th argument is an array and the j-th entry had an illegal value, then info = -(i*100+j); if the i-th argument is a scalar and had an illegal value, then info = -i.

p?pttrs
Solves a system of linear equations with a symmetric (Hermitian) positive-definite tridiagonal distributed matrix using the factorization computed by p?pttrf.

Syntax
call pspttrs(n, nrhs, d, e, ja, desca, b, ib, descb, af, laf, work, lwork, info)
call pdpttrs(n, nrhs, d, e, ja, desca, b, ib, descb, af, laf, work, lwork, info)
call pcpttrs(uplo, n, nrhs, d, e, ja, desca, b, ib, descb, af, laf, work, lwork, info)
call pzpttrs(uplo, n, nrhs, d, e, ja, desca, b, ib, descb, af, laf, work, lwork, info)

Include Files
• C: mkl_scalapack.h

Description
The p?pttrs routine solves for X a system of distributed linear equations in the form:
sub(A)*X = sub(B),
where sub(A) = A(1:n, ja:ja+n-1) is an n-by-n real symmetric or complex Hermitian positive definite tridiagonal distributed matrix, and sub(B) denotes the distributed matrix B(ib:ib+n-1, 1:nrhs). This routine uses the factorization
sub(A) = P*L*D*L^H*P^T, or sub(A) = P*U^H*D*U*P^T
computed by p?pttrf.

Input Parameters
uplo (global, used in complex flavors only) CHARACTER*1. Must be 'U' or 'L'. If uplo = 'U', the upper triangle of sub(A) is stored; if uplo = 'L', the lower triangle of sub(A) is stored.
n (global) INTEGER. The order of the distributed submatrix sub(A) (n ≥ 0).
nrhs (global) INTEGER. The number of right hand sides; the number of columns of the distributed submatrix sub(B) (nrhs ≥ 0).
d, e (local) REAL for pspttrs, DOUBLE PRECISION for pdpttrs, COMPLEX for pcpttrs, DOUBLE COMPLEX for pzpttrs. Pointers into the local memory to arrays of dimension (desca(nb_)) each. These arrays contain details of the factorization as returned by p?pttrf.
ja (global) INTEGER.
The index in the global array A that points to the start of the matrix to be operated on (which may be either all of A or a submatrix of A).
desca (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix A. If desca(dtype_) = 501 or 502, then dlen_ = 7; else if desca(dtype_) = 1, then dlen_ = 9.
b (local) Same type as d, e. Pointer into the local memory to an array of local dimension b(lld_b, LOCc(nrhs)). On entry, the array b contains the local pieces of the n-by-nrhs right hand side distributed matrix sub(B).
ib (global) INTEGER. The row index in the global array B that points to the first row of the matrix to be operated on (which may be either all of B or a submatrix of B).
descb (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix B. If descb(dtype_) = 502, then dlen_ = 7; else if descb(dtype_) = 1, then dlen_ = 9.
af, work (local) REAL for pspttrs, DOUBLE PRECISION for pdpttrs, COMPLEX for pcpttrs, DOUBLE COMPLEX for pzpttrs. Arrays of dimension (laf) and (lwork), respectively. The array af contains auxiliary fill-in space. Fill-in is created during the factorization routine p?pttrf and is stored in af. The array work is a workspace array.
laf (local) INTEGER. The dimension of the array af. Must be laf ≥ NB+2. If laf is not large enough, an error code is returned and the minimum acceptable size is returned in af(1).
lwork (local or global) INTEGER. The size of the array work, must be at least lwork ≥ (10+2*min(100,nrhs))*NPCOL+4*nrhs.

Output Parameters
b On exit, this array contains the local pieces of the solution distributed matrix X.
work(1) On exit, work(1) contains the minimum value of lwork required for optimum performance.
info INTEGER. If info=0, the execution is successful.
info < 0: if the i-th argument is an array and the j-th entry had an illegal value, then info = -(i*100+j); if the i-th argument is a scalar and had an illegal value, then info = -i.

p?trtrs
Solves a system of linear equations with a triangular distributed matrix.

Syntax
call pstrtrs(uplo, trans, diag, n, nrhs, a, ia, ja, desca, b, ib, jb, descb, info)
call pdtrtrs(uplo, trans, diag, n, nrhs, a, ia, ja, desca, b, ib, jb, descb, info)
call pctrtrs(uplo, trans, diag, n, nrhs, a, ia, ja, desca, b, ib, jb, descb, info)
call pztrtrs(uplo, trans, diag, n, nrhs, a, ia, ja, desca, b, ib, jb, descb, info)

Include Files
• C: mkl_scalapack.h

Description
The p?trtrs routine solves for X one of the following systems of linear equations:
sub(A)*X = sub(B),
(sub(A))^T*X = sub(B), or
(sub(A))^H*X = sub(B),
where sub(A) = A(ia:ia+n-1, ja:ja+n-1) is a triangular distributed matrix of order n, and sub(B) denotes the distributed matrix B(ib:ib+n-1, jb:jb+nrhs-1). A check is made to verify that sub(A) is nonsingular.

Input Parameters
uplo (global) CHARACTER*1. Must be 'U' or 'L'. Indicates whether sub(A) is upper or lower triangular: If uplo = 'U', then sub(A) is upper triangular. If uplo = 'L', then sub(A) is lower triangular.
trans (global) CHARACTER*1. Must be 'N' or 'T' or 'C'. Indicates the form of the equations: If trans = 'N', then sub(A)*X = sub(B) is solved for X. If trans = 'T', then sub(A)^T*X = sub(B) is solved for X. If trans = 'C', then sub(A)^H*X = sub(B) is solved for X.
diag (global) CHARACTER*1. Must be 'N' or 'U'. If diag = 'N', then sub(A) is not a unit triangular matrix. If diag = 'U', then sub(A) is unit triangular.
n (global) INTEGER. The order of the distributed submatrix sub(A) (n ≥ 0).
nrhs (global) INTEGER. The number of right-hand sides; that is, the number of columns of the distributed matrix sub(B) (nrhs ≥ 0).
a, b (local) REAL for pstrtrs, DOUBLE PRECISION for pdtrtrs, COMPLEX for pctrtrs, DOUBLE COMPLEX for pztrtrs.
Pointers into the local memory to arrays of local dimension a(lld_a,LOCc(ja+n-1)) and b(lld_b,LOCc(jb+nrhs-1)), respectively. The array a contains the local pieces of the distributed triangular matrix sub(A). If uplo = 'U', the leading n-by-n upper triangular part of sub(A) contains the upper triangular matrix, and the strictly lower triangular part of sub(A) is not referenced. If uplo = 'L', the leading n-by-n lower triangular part of sub(A) contains the lower triangular matrix, and the strictly upper triangular part of sub(A) is not referenced. If diag = 'U', the diagonal elements of sub(A) are also not referenced and are assumed to be 1. On entry, the array b contains the local pieces of the right-hand side distributed matrix sub(B).
ia, ja (global) INTEGER. The row and column indices in the global array A indicating the first row and the first column of the submatrix sub(A), respectively.
desca (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix A.
ib, jb (global) INTEGER. The row and column indices in the global array B indicating the first row and the first column of the submatrix sub(B), respectively.
descb (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix B.

Output Parameters
b On exit, if info=0, sub(B) is overwritten by the solution matrix X.
info INTEGER. If info=0, the execution is successful. info < 0: if the i-th argument is an array and the j-th entry had an illegal value, then info = -(i*100+j); if the i-th argument is a scalar and had an illegal value, then info = -i; info > 0: if info = i, the i-th diagonal element of sub(A) is zero, indicating that the submatrix is singular and the solutions X have not been computed.

Routines for Estimating the Condition Number
This section describes the ScaLAPACK routines for estimating the condition number of a matrix.
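As a reminder of the quantity being estimated, the following is a minimal sketch in plain Python, not a ScaLAPACK call. It forms the 1-norm condition number of a small dense 2-by-2 matrix from an explicit inverse and returns its reciprocal. The helper names one_norm, inverse_2x2, and rcond_one_norm are hypothetical and exist only for this illustration; the library routines estimate the norm of the inverse from a factorization rather than forming the inverse.

```python
# Illustrative only: reciprocal 1-norm condition number of a 2-by-2 matrix.
# These helper names are hypothetical; they are not ScaLAPACK or Intel MKL APIs.

def one_norm(m):
    """Matrix 1-norm: the largest absolute column sum."""
    return max(sum(abs(row[j]) for row in m) for j in range(len(m[0])))

def inverse_2x2(m):
    """Explicit inverse of a 2-by-2 matrix (raises if the matrix is singular)."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ZeroDivisionError("matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]

def rcond_one_norm(m):
    """Reciprocal condition number 1 / (||A|| * ||A^-1||) in the 1-norm."""
    return 1.0 / (one_norm(m) * one_norm(inverse_2x2(m)))

well = [[1.0, 2.0], [3.0, 4.0]]
nearly_singular = [[1.0, 1.0], [1.0, 1.0 + 1e-10]]

print(rcond_one_norm(well))             # moderate value
print(rcond_one_norm(nearly_singular))  # very close to zero
```

A value near zero signals a nearly singular matrix, whereas the condition number itself would grow without bound.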
The condition number is used for analyzing the errors in the solution of a system of linear equations. Since the condition number may be arbitrarily large when the matrix is nearly singular, the routines actually compute the reciprocal condition number.

p?gecon
Estimates the reciprocal of the condition number of a general distributed matrix in either the 1-norm or the infinity-norm.

Syntax
call psgecon(norm, n, a, ia, ja, desca, anorm, rcond, work, lwork, iwork, liwork, info)
call pdgecon(norm, n, a, ia, ja, desca, anorm, rcond, work, lwork, iwork, liwork, info)
call pcgecon(norm, n, a, ia, ja, desca, anorm, rcond, work, lwork, rwork, lrwork, info)
call pzgecon(norm, n, a, ia, ja, desca, anorm, rcond, work, lwork, rwork, lrwork, info)

Include Files
• C: mkl_scalapack.h

Description
The p?gecon routine estimates the reciprocal of the condition number of a general distributed real/complex matrix sub(A) = A(ia:ia+n-1, ja:ja+n-1) in either the 1-norm or infinity-norm, using the LU factorization computed by p?getrf. An estimate is obtained for ||(sub(A))^-1||, and the reciprocal of the condition number is computed as
rcond = 1 / (||sub(A)|| * ||(sub(A))^-1||).

Input Parameters
norm (global) CHARACTER*1. Must be '1' or 'O' or 'I'. Specifies whether the 1-norm condition number or the infinity-norm condition number is required. If norm = '1' or 'O', then the 1-norm is used; If norm = 'I', then the infinity-norm is used.
n (global) INTEGER. The order of the distributed submatrix sub(A) (n ≥ 0).
a (local) REAL for psgecon, DOUBLE PRECISION for pdgecon, COMPLEX for pcgecon, DOUBLE COMPLEX for pzgecon. Pointer into the local memory to an array of dimension a(lld_a,LOCc(ja+n-1)). The array a contains the local pieces of the factors L and U from the factorization sub(A) = P*L*U; the unit diagonal elements of L are not stored.
ia, ja (global) INTEGER.
The row and column indices in the global array A indicating the first row and the first column of the submatrix sub(A), respectively.
desca (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix A.
anorm (global) REAL for single precision flavors, DOUBLE PRECISION for double precision flavors. If norm = '1' or 'O', the 1-norm of the original distributed matrix sub(A); If norm = 'I', the infinity-norm of the original distributed matrix sub(A).
work (local) REAL for psgecon, DOUBLE PRECISION for pdgecon, COMPLEX for pcgecon, DOUBLE COMPLEX for pzgecon. The array work of dimension (lwork) is a workspace array.
lwork (local or global) INTEGER. The dimension of the array work.
For real flavors: lwork must be at least lwork = 2*LOCr(n+mod(ia-1,mb_a)) + 2*LOCc(n+mod(ja-1,nb_a)) + max(2, max(nb_a*max(1, iceil(NPROW-1, NPCOL)), LOCc(n+mod(ja-1,nb_a)) + nb_a*max(1, iceil(NPCOL-1, NPROW)))).
For complex flavors: lwork must be at least lwork = 2*LOCr(n+mod(ia-1,mb_a)) + max(2, max(nb_a*iceil(NPROW-1, NPCOL), LOCc(n+mod(ja-1,nb_a)) + nb_a*iceil(NPCOL-1, NPROW))).
LOCr and LOCc values can be computed using the ScaLAPACK tool function numroc; NPROW and NPCOL can be determined by calling the subroutine blacs_gridinfo.
iwork (local) INTEGER. Workspace array, DIMENSION (liwork). Used in real flavors only.
liwork (local or global) INTEGER. The dimension of the array iwork; used in real flavors only. Must be at least liwork = LOCr(n+mod(ia-1,mb_a)).
rwork (local) REAL for pcgecon, DOUBLE PRECISION for pzgecon. Workspace array, DIMENSION (lrwork). Used in complex flavors only.
lrwork (local or global) INTEGER. The dimension of the array rwork; used in complex flavors only. Must be at least lrwork = max(1, 2*LOCc(n+mod(ja-1,nb_a))).

Output Parameters
rcond (global) REAL for single precision flavors. DOUBLE PRECISION for double precision flavors. The reciprocal of the condition number of the distributed matrix sub(A). See Description.
work(1) On exit, work(1) contains the minimum value of lwork required for optimum performance.
iwork(1) On exit, iwork(1) contains the minimum value of liwork required for optimum performance (for real flavors).
rwork(1) On exit, rwork(1) contains the minimum value of lrwork required for optimum performance (for complex flavors).
info (global) INTEGER. If info=0, the execution is successful. info < 0: If the i-th argument is an array and the j-th entry had an illegal value, then info = -(i*100+j); if the i-th argument is a scalar and had an illegal value, then info = -i.

p?pocon
Estimates the reciprocal of the condition number (in the 1-norm) of a symmetric/Hermitian positive-definite distributed matrix.

Syntax
call pspocon(uplo, n, a, ia, ja, desca, anorm, rcond, work, lwork, iwork, liwork, info)
call pdpocon(uplo, n, a, ia, ja, desca, anorm, rcond, work, lwork, iwork, liwork, info)
call pcpocon(uplo, n, a, ia, ja, desca, anorm, rcond, work, lwork, rwork, lrwork, info)
call pzpocon(uplo, n, a, ia, ja, desca, anorm, rcond, work, lwork, rwork, lrwork, info)

Include Files
• C: mkl_scalapack.h

Description
The p?pocon routine estimates the reciprocal of the condition number (in the 1-norm) of a real symmetric or complex Hermitian positive-definite distributed matrix sub(A) = A(ia:ia+n-1, ja:ja+n-1), using the Cholesky factorization sub(A) = U^H*U or sub(A) = L*L^H computed by p?potrf. An estimate is obtained for ||(sub(A))^-1||, and the reciprocal of the condition number is computed as
rcond = 1 / (||sub(A)|| * ||(sub(A))^-1||).

Input Parameters
uplo (global) CHARACTER*1. Must be 'U' or 'L'. Specifies whether the factor stored in sub(A) is upper or lower triangular. If uplo = 'U', sub(A) stores the upper triangular factor U of the Cholesky factorization sub(A) = U^H*U. If uplo = 'L', sub(A) stores the lower triangular factor L of the Cholesky factorization sub(A) = L*L^H.
n (global) INTEGER. The order of the distributed submatrix sub(A) (n ≥ 0).
a (local) REAL for pspocon, DOUBLE PRECISION for pdpocon, COMPLEX for pcpocon, DOUBLE COMPLEX for pzpocon. Pointer into the local memory to an array of dimension a(lld_a,LOCc(ja+n-1)). The array a contains the local pieces of the factors L or U from the Cholesky factorization sub(A) = U^H*U, or sub(A) = L*L^H, as computed by p?potrf.
ia, ja (global) INTEGER. The row and column indices in the global array A indicating the first row and the first column of the submatrix sub(A), respectively.
desca (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix A.
anorm (global) REAL for single precision flavors, DOUBLE PRECISION for double precision flavors. The 1-norm of the symmetric/Hermitian distributed matrix sub(A).
work (local) REAL for pspocon, DOUBLE PRECISION for pdpocon, COMPLEX for pcpocon, DOUBLE COMPLEX for pzpocon. The array work of dimension (lwork) is a workspace array.
lwork (local or global) INTEGER. The dimension of the array work.
For real flavors: lwork must be at least lwork = 2*LOCr(n+mod(ia-1,mb_a)) + 2*LOCc(n+mod(ja-1,nb_a)) + max(2, max(nb_a*iceil(NPROW-1, NPCOL), LOCc(n+mod(ja-1,nb_a)) + nb_a*iceil(NPCOL-1, NPROW))).
For complex flavors: lwork must be at least lwork = 2*LOCr(n+mod(ia-1,mb_a)) + max(2, max(nb_a*max(1,iceil(NPROW-1, NPCOL)), LOCc(n+mod(ja-1,nb_a)) + nb_a*max(1,iceil(NPCOL-1, NPROW)))).
iwork (local) INTEGER. Workspace array, DIMENSION (liwork). Used in real flavors only.
liwork (local or global) INTEGER. The dimension of the array iwork; used in real flavors only. Must be at least liwork = LOCr(n+mod(ia-1,mb_a)).
rwork (local) REAL for pcpocon, DOUBLE PRECISION for pzpocon. Workspace array, DIMENSION (lrwork). Used in complex flavors only.
lrwork (local or global) INTEGER. The dimension of the array rwork; used in complex flavors only. Must be at least lrwork = 2*LOCc(n+mod(ja-1,nb_a)).
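The workspace formulas for p?pocon above can be evaluated ahead of the call. The sketch below is a plain-Python illustration rather than library code. The functions iceil, indxg2p, and numroc are hypothetical re-implementations of the ScaLAPACK tool functions of the same names (with a simplified argument list for indxg2p), and pocon_workspace is an invented helper. The source process coordinates rsrc_a and csrc_a are assumed to come from the descriptor desca and default to 0 here.

```python
# Illustrative only: evaluate the p?pocon minimum workspace sizes for one process.
# iceil, indxg2p and numroc mimic the ScaLAPACK tool functions; pocon_workspace
# is a hypothetical helper that does not exist in ScaLAPACK or Intel MKL.

def iceil(inum, idenom):
    """Ceiling of inum / idenom for non-negative inum and positive idenom."""
    return -(-inum // idenom)

def indxg2p(indxglob, nb, isrcproc, nprocs):
    """Process coordinate owning the 1-based global index indxglob."""
    return (isrcproc + (indxglob - 1) // nb) % nprocs

def numroc(n, nb, iproc, isrcproc, nprocs):
    """Rows or columns of an n-long block-cyclic dimension stored on process iproc."""
    mydist = (nprocs + iproc - isrcproc) % nprocs
    nblocks = n // nb
    count = (nblocks // nprocs) * nb
    extra = nblocks % nprocs
    if mydist < extra:
        count += nb
    elif mydist == extra:
        count += n % nb
    return count

def pocon_workspace(n, ia, ja, mb_a, nb_a, myrow, mycol, nprow, npcol,
                    rsrc_a=0, csrc_a=0, complex_flavor=False):
    """Return the minimum lwork and liwork (real) or lrwork (complex)."""
    iarow = indxg2p(ia, mb_a, rsrc_a, nprow)
    iacol = indxg2p(ja, nb_a, csrc_a, npcol)
    locr = numroc(n + (ia - 1) % mb_a, mb_a, myrow, iarow, nprow)
    locc = numroc(n + (ja - 1) % nb_a, nb_a, mycol, iacol, npcol)
    if complex_flavor:
        lwork = 2 * locr + max(2, max(nb_a * max(1, iceil(nprow - 1, npcol)),
                                      locc + nb_a * max(1, iceil(npcol - 1, nprow))))
        return {"lwork": lwork, "lrwork": 2 * locc}
    lwork = 2 * locr + 2 * locc + max(2, max(nb_a * iceil(nprow - 1, npcol),
                                             locc + nb_a * iceil(npcol - 1, nprow)))
    return {"lwork": lwork, "liwork": locr}

# Example: n = 10, 2-by-2 blocks, 2-by-2 process grid, process (0, 0).
real_sizes = pocon_workspace(10, 1, 1, 2, 2, 0, 0, 2, 2)
complex_sizes = pocon_workspace(10, 1, 1, 2, 2, 0, 0, 2, 2, complex_flavor=True)
print(real_sizes, complex_sizes)
```

In a real program the grid shape and process coordinates would come from blacs_gridinfo, and the block sizes and source coordinates from the array descriptor.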
Output Parameters
rcond (global) REAL for single precision flavors. DOUBLE PRECISION for double precision flavors. The reciprocal of the condition number of the distributed matrix sub(A).
work(1) On exit, work(1) contains the minimum value of lwork required for optimum performance.
iwork(1) On exit, iwork(1) contains the minimum value of liwork required for optimum performance (for real flavors).
rwork(1) On exit, rwork(1) contains the minimum value of lrwork required for optimum performance (for complex flavors).
info (global) INTEGER. If info=0, the execution is successful. info < 0: If the i-th argument is an array and the j-th entry had an illegal value, then info = -(i*100+j); if the i-th argument is a scalar and had an illegal value, then info = -i.

p?trcon
Estimates the reciprocal of the condition number of a triangular distributed matrix in either 1-norm or infinity-norm.

Syntax
call pstrcon(norm, uplo, diag, n, a, ia, ja, desca, rcond, work, lwork, iwork, liwork, info)
call pdtrcon(norm, uplo, diag, n, a, ia, ja, desca, rcond, work, lwork, iwork, liwork, info)
call pctrcon(norm, uplo, diag, n, a, ia, ja, desca, rcond, work, lwork, rwork, lrwork, info)
call pztrcon(norm, uplo, diag, n, a, ia, ja, desca, rcond, work, lwork, rwork, lrwork, info)

Include Files
• C: mkl_scalapack.h

Description
The p?trcon routine estimates the reciprocal of the condition number of a triangular distributed matrix sub(A) = A(ia:ia+n-1, ja:ja+n-1), in either the 1-norm or the infinity-norm. The norm of sub(A) is computed and an estimate is obtained for ||(sub(A))^-1||, then the reciprocal of the condition number is computed as
rcond = 1 / (||sub(A)|| * ||(sub(A))^-1||).

Input Parameters
norm (global) CHARACTER*1. Must be '1' or 'O' or 'I'. Specifies whether the 1-norm condition number or the infinity-norm condition number is required. If norm = '1' or 'O', then the 1-norm is used; If norm = 'I', then the infinity-norm is used.
uplo (global) CHARACTER*1. Must be 'U' or 'L'.
If uplo = 'U', sub(A) is upper triangular. If uplo = 'L', sub(A) is lower triangular.
diag (global) CHARACTER*1. Must be 'N' or 'U'. If diag = 'N', sub(A) is non-unit triangular. If diag = 'U', sub(A) is unit triangular.
n (global) INTEGER. The order of the distributed submatrix sub(A) (n ≥ 0).
a (local) REAL for pstrcon, DOUBLE PRECISION for pdtrcon, COMPLEX for pctrcon, DOUBLE COMPLEX for pztrcon. Pointer into the local memory to an array of dimension a(lld_a,LOCc(ja+n-1)). The array a contains the local pieces of the triangular distributed matrix sub(A). If uplo = 'U', the leading n-by-n upper triangular part of this distributed matrix contains the upper triangular matrix, and its strictly lower triangular part is not referenced. If uplo = 'L', the leading n-by-n lower triangular part of this distributed matrix contains the lower triangular matrix, and its strictly upper triangular part is not referenced. If diag = 'U', the diagonal elements of sub(A) are also not referenced and are assumed to be 1.
ia, ja (global) INTEGER. The row and column indices in the global array A indicating the first row and the first column of the submatrix sub(A), respectively.
desca (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix A.
work (local) REAL for pstrcon, DOUBLE PRECISION for pdtrcon, COMPLEX for pctrcon, DOUBLE COMPLEX for pztrcon. The array work of dimension (lwork) is a workspace array.
lwork (local or global) INTEGER. The dimension of the array work.
For real flavors: lwork must be at least lwork = 2*LOCr(n+mod(ia-1,mb_a)) + LOCc(n+mod(ja-1,nb_a)) + max(2, max(nb_a*max(1,iceil(NPROW-1, NPCOL)), LOCc(n+mod(ja-1,nb_a)) + nb_a*max(1,iceil(NPCOL-1, NPROW)))).
For complex flavors: lwork must be at least lwork = 2*LOCr(n+mod(ia-1,mb_a)) + max(2, max(nb_a*iceil(NPROW-1, NPCOL), LOCc(n+mod(ja-1,nb_a)) + nb_a*iceil(NPCOL-1, NPROW))).
iwork (local) INTEGER.
Workspace array, DIMENSION (liwork). Used in real flavors only.
liwork (local or global) INTEGER. The dimension of the array iwork; used in real flavors only. Must be at least liwork = LOCr(n+mod(ia-1,mb_a)).
rwork (local) REAL for pctrcon, DOUBLE PRECISION for pztrcon. Workspace array, DIMENSION (lrwork). Used in complex flavors only.
lrwork (local or global) INTEGER. The dimension of the array rwork; used in complex flavors only. Must be at least lrwork = LOCc(n+mod(ja-1,nb_a)).

Output Parameters
rcond (global) REAL for single precision flavors. DOUBLE PRECISION for double precision flavors. The reciprocal of the condition number of the distributed matrix sub(A).
work(1) On exit, work(1) contains the minimum value of lwork required for optimum performance.
iwork(1) On exit, iwork(1) contains the minimum value of liwork required for optimum performance (for real flavors).
rwork(1) On exit, rwork(1) contains the minimum value of lrwork required for optimum performance (for complex flavors).
info (global) INTEGER. If info=0, the execution is successful. info < 0: If the i-th argument is an array and the j-th entry had an illegal value, then info = -(i*100+j); if the i-th argument is a scalar and had an illegal value, then info = -i.

Refining the Solution and Estimating Its Error
This section describes the ScaLAPACK routines for refining the computed solution of a system of linear equations and estimating the solution error. You can call these routines after factorizing the matrix of the system of equations and computing the solution (see Routines for Matrix Factorization and Solving Systems of Linear Equations).

p?gerfs
Improves the computed solution to a system of linear equations and provides error bounds and backward error estimates for the solution.
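The two error measures named in this summary can be illustrated on a tiny dense system. The sketch below is plain Python with hypothetical helper names (matvec, forward_error, backward_error); it does not call ScaLAPACK. The forward error is computed here against a known true solution, which the library routine can only bound, and the backward error uses the usual component-wise definition max_i |b - A*x|_i / (|A|*|x| + |b|)_i, which is assumed here rather than quoted from this manual.

```python
# Illustrative only: forward and component-wise backward error of an
# approximate solution. All names are hypothetical, not library routines.

def matvec(a, x):
    """Dense matrix-vector product for lists of lists."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]

def forward_error(x_computed, x_true):
    """Largest |x_computed - x_true| divided by largest |x_computed|.

    Assumes x_computed is not the zero vector.
    """
    num = max(abs(xc - xt) for xc, xt in zip(x_computed, x_true))
    den = max(abs(xc) for xc in x_computed)
    return num / den

def backward_error(a, b, x_computed):
    """Component-wise relative backward error of x_computed for a*x = b.

    Assumes every component of |a|*|x| + |b| is nonzero.
    """
    ax = matvec(a, x_computed)
    abs_a = [[abs(v) for v in row] for row in a]
    scale = matvec(abs_a, [abs(v) for v in x_computed])
    return max(abs(bi - axi) / (si + abs(bi))
               for bi, axi, si in zip(b, ax, scale))

a = [[2.0, 1.0], [1.0, 3.0]]
x_true = [1.0, 2.0]
b = matvec(a, x_true)      # right-hand side consistent with x_true
x_hat = [1.5, 2.0]         # deliberately inaccurate solution

print(forward_error(x_hat, x_true))
print(backward_error(a, b, x_hat))
```

A small backward error means the computed vector solves a nearby system exactly; the forward error additionally depends on how well conditioned the matrix is.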
Syntax
call psgerfs(trans, n, nrhs, a, ia, ja, desca, af, iaf, jaf, descaf, ipiv, b, ib, jb, descb, x, ix, jx, descx, ferr, berr, work, lwork, iwork, liwork, info)
call pdgerfs(trans, n, nrhs, a, ia, ja, desca, af, iaf, jaf, descaf, ipiv, b, ib, jb, descb, x, ix, jx, descx, ferr, berr, work, lwork, iwork, liwork, info)
call pcgerfs(trans, n, nrhs, a, ia, ja, desca, af, iaf, jaf, descaf, ipiv, b, ib, jb, descb, x, ix, jx, descx, ferr, berr, work, lwork, rwork, lrwork, info)
call pzgerfs(trans, n, nrhs, a, ia, ja, desca, af, iaf, jaf, descaf, ipiv, b, ib, jb, descb, x, ix, jx, descx, ferr, berr, work, lwork, rwork, lrwork, info)

Include Files
• C: mkl_scalapack.h

Description
The p?gerfs routine improves the computed solution to one of the systems of linear equations sub(A)*sub(X) = sub(B), sub(A)^T*sub(X) = sub(B), or sub(A)^H*sub(X) = sub(B) and provides error bounds and backward error estimates for the solution. Here sub(A) = A(ia:ia+n-1, ja:ja+n-1), sub(B) = B(ib:ib+n-1, jb:jb+nrhs-1), and sub(X) = X(ix:ix+n-1, jx:jx+nrhs-1).

Input Parameters
trans (global) CHARACTER*1. Must be 'N' or 'T' or 'C'. Specifies the form of the system of equations: If trans = 'N', the system has the form sub(A)*sub(X) = sub(B) (No transpose); If trans = 'T', the system has the form sub(A)^T*sub(X) = sub(B) (Transpose); If trans = 'C', the system has the form sub(A)^H*sub(X) = sub(B) (Conjugate transpose).
n (global) INTEGER. The order of the distributed submatrix sub(A) (n ≥ 0).
nrhs (global) INTEGER. The number of right-hand sides, i.e., the number of columns of the matrices sub(B) and sub(X) (nrhs ≥ 0).
a, af, b, x (local) REAL for psgerfs, DOUBLE PRECISION for pdgerfs, COMPLEX for pcgerfs, DOUBLE COMPLEX for pzgerfs. Pointers into the local memory to arrays of local dimension a(lld_a,LOCc(ja+n-1)), af(lld_af,LOCc(jaf+n-1)), b(lld_b,LOCc(jb+nrhs-1)), and x(lld_x,LOCc(jx+nrhs-1)), respectively.
The array a contains the local pieces of the distributed matrix sub(A). The array af contains the local pieces of the distributed factors of the matrix sub(A) = P*L*U as computed by p?getrf. The array b contains the local pieces of the distributed matrix of right-hand sides sub(B). On entry, the array x contains the local pieces of the distributed solution matrix sub(X).
ia, ja (global) INTEGER. The row and column indices in the global array A indicating the first row and the first column of the submatrix sub(A), respectively.
desca (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix A.
iaf, jaf (global) INTEGER. The row and column indices in the global array AF indicating the first row and the first column of the submatrix sub(AF), respectively.
descaf (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix AF.
ib, jb (global) INTEGER. The row and column indices in the global array B indicating the first row and the first column of the submatrix sub(B), respectively.
descb (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix B.
ix, jx (global) INTEGER. The row and column indices in the global array X indicating the first row and the first column of the submatrix sub(X), respectively.
descx (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix X.
ipiv (local) INTEGER. Array, dimension LOCr(m_af) + mb_af. This array contains pivoting information as computed by p?getrf. If ipiv(i)=j, then the local row i was swapped with the global row j. This array is tied to the distributed matrix A.
work (local) REAL for psgerfs, DOUBLE PRECISION for pdgerfs, COMPLEX for pcgerfs, DOUBLE COMPLEX for pzgerfs. The array work of dimension (lwork) is a workspace array.
lwork (local or global) INTEGER. The dimension of the array work.
For real flavors: lwork must be at least lwork = 3*LOCr(n+mod(ia-1,mb_a)).
For complex flavors: lwork must be at least lwork = 2*LOCr(n+mod(ia-1,mb_a)).
iwork (local) INTEGER. Workspace array, DIMENSION (liwork). Used in real flavors only.
liwork (local or global) INTEGER. The dimension of the array iwork; used in real flavors only. Must be at least liwork = LOCr(n+mod(ib-1,mb_b)).
rwork (local) REAL for pcgerfs, DOUBLE PRECISION for pzgerfs. Workspace array, DIMENSION (lrwork). Used in complex flavors only.
lrwork (local or global) INTEGER. The dimension of the array rwork; used in complex flavors only. Must be at least lrwork = LOCr(n+mod(ib-1,mb_b)).

Output Parameters
x On exit, contains the improved solution vectors.
ferr, berr REAL for single precision flavors. DOUBLE PRECISION for double precision flavors. Arrays, dimension LOCc(jb+nrhs-1) each.
The array ferr contains the estimated forward error bound for each solution vector of sub(X). If XTRUE is the true solution corresponding to sub(X), ferr is an estimated upper bound for the magnitude of the largest element in (sub(X) - XTRUE) divided by the magnitude of the largest element in sub(X). The estimate is as reliable as the estimate for rcond, and is almost always a slight overestimate of the true error. This array is tied to the distributed matrix X.
The array berr contains the component-wise relative backward error of each solution vector (that is, the smallest relative change in any entry of sub(A) or sub(B) that makes sub(X) an exact solution). This array is tied to the distributed matrix X.
work(1) On exit, work(1) contains the minimum value of lwork required for optimum performance.
iwork(1) On exit, iwork(1) contains the minimum value of liwork required for optimum performance (for real flavors).
rwork(1) On exit, rwork(1) contains the minimum value of lrwork required for optimum performance (for complex flavors).
info (global) INTEGER.
If info=0, the execution is successful. info < 0: If the i-th argument is an array and the j-th entry had an illegal value, then info = -(i*100+j); if the i-th argument is a scalar and had an illegal value, then info = -i.

p?porfs
Improves the computed solution to a system of linear equations with symmetric/Hermitian positive definite distributed matrix and provides error bounds and backward error estimates for the solution.

Syntax
call psporfs(uplo, n, nrhs, a, ia, ja, desca, af, iaf, jaf, descaf, b, ib, jb, descb, x, ix, jx, descx, ferr, berr, work, lwork, iwork, liwork, info)
call pdporfs(uplo, n, nrhs, a, ia, ja, desca, af, iaf, jaf, descaf, b, ib, jb, descb, x, ix, jx, descx, ferr, berr, work, lwork, iwork, liwork, info)
call pcporfs(uplo, n, nrhs, a, ia, ja, desca, af, iaf, jaf, descaf, b, ib, jb, descb, x, ix, jx, descx, ferr, berr, work, lwork, rwork, lrwork, info)
call pzporfs(uplo, n, nrhs, a, ia, ja, desca, af, iaf, jaf, descaf, b, ib, jb, descb, x, ix, jx, descx, ferr, berr, work, lwork, rwork, lrwork, info)

Include Files
• C: mkl_scalapack.h

Description
The p?porfs routine improves the computed solution to the system of linear equations sub(A)*sub(X) = sub(B), where sub(A) = A(ia:ia+n-1, ja:ja+n-1) is a real symmetric or complex Hermitian positive definite distributed matrix and sub(B) = B(ib:ib+n-1, jb:jb+nrhs-1), sub(X) = X(ix:ix+n-1, jx:jx+nrhs-1) are right-hand side and solution submatrices, respectively. This routine also provides error bounds and backward error estimates for the solution.

Input Parameters
uplo (global) CHARACTER*1. Must be 'U' or 'L'. Specifies whether the upper or lower triangular part of the symmetric/Hermitian matrix sub(A) is stored. If uplo = 'U', sub(A) is upper triangular. If uplo = 'L', sub(A) is lower triangular.
n (global) INTEGER. The order of the distributed matrix sub(A) (n ≥ 0).
nrhs (global) INTEGER.
The number of right-hand sides, i.e., the number of columns of the matrices sub(B) and sub(X) (nrhs ≥ 0).
a, af, b, x (local) REAL for psporfs, DOUBLE PRECISION for pdporfs, COMPLEX for pcporfs, DOUBLE COMPLEX for pzporfs. Pointers into the local memory to arrays of local dimension a(lld_a,LOCc(ja+n-1)), af(lld_af,LOCc(ja+n-1)), b(lld_b,LOCc(jb+nrhs-1)), and x(lld_x,LOCc(jx+nrhs-1)), respectively. The array a contains the local pieces of the n-by-n symmetric/Hermitian distributed matrix sub(A). If uplo = 'U', the leading n-by-n upper triangular part of sub(A) contains the upper triangular part of the matrix, and its strictly lower triangular part is not referenced. If uplo = 'L', the leading n-by-n lower triangular part of sub(A) contains the lower triangular part of the distributed matrix, and its strictly upper triangular part is not referenced. The array af contains the factors L or U from the Cholesky factorization sub(A) = L*L^H or sub(A) = U^H*U, as computed by p?potrf. On entry, the array b contains the local pieces of the distributed matrix of right-hand sides sub(B). On entry, the array x contains the local pieces of the solution vectors sub(X).
ia, ja (global) INTEGER. The row and column indices in the global array A indicating the first row and the first column of the submatrix sub(A), respectively.
desca (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix A.
iaf, jaf (global) INTEGER. The row and column indices in the global array AF indicating the first row and the first column of the submatrix sub(AF), respectively.
descaf (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix AF.
ib, jb (global) INTEGER. The row and column indices in the global array B indicating the first row and the first column of the submatrix sub(B), respectively.
descb (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix B.
ix, jx (global) INTEGER. The row and column indices in the global array X indicating the first row and the first column of the submatrix sub(X), respectively.
descx (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix X.
work (local) REAL for psporfs, DOUBLE PRECISION for pdporfs, COMPLEX for pcporfs, DOUBLE COMPLEX for pzporfs. The array work of dimension (lwork) is a workspace array.
lwork (local) INTEGER. The dimension of the array work.
For real flavors: lwork must be at least lwork = 3*LOCr(n+mod(ia-1,mb_a)).
For complex flavors: lwork must be at least lwork = 2*LOCr(n+mod(ia-1,mb_a)).
iwork (local) INTEGER. Workspace array, DIMENSION (liwork). Used in real flavors only.
liwork (local or global) INTEGER. The dimension of the array iwork; used in real flavors only. Must be at least liwork = LOCr(n+mod(ib-1,mb_b)).
rwork (local) REAL for pcporfs, DOUBLE PRECISION for pzporfs. Workspace array, DIMENSION (lrwork). Used in complex flavors only.
lrwork (local or global) INTEGER. The dimension of the array rwork; used in complex flavors only. Must be at least lrwork = LOCr(n+mod(ib-1,mb_b)).

Output Parameters
x On exit, contains the improved solution vectors.
ferr, berr REAL for single precision flavors. DOUBLE PRECISION for double precision flavors. Arrays, dimension LOCc(jb+nrhs-1) each.
The array ferr contains the estimated forward error bound for each solution vector of sub(X). If XTRUE is the true solution corresponding to sub(X), ferr is an estimated upper bound for the magnitude of the largest element in (sub(X) - XTRUE) divided by the magnitude of the largest element in sub(X). The estimate is as reliable as the estimate for rcond, and is almost always a slight overestimate of the true error. This array is tied to the distributed matrix X.
The array berr contains the component-wise relative backward error of each solution vector (that is, the smallest relative change in any entry of sub(A) or sub(B) that makes sub(X) an exact solution). This array is tied to the distributed matrix X.
work(1) On exit, work(1) contains the minimum value of lwork required for optimum performance.
iwork(1) On exit, iwork(1) contains the minimum value of liwork required for optimum performance (for real flavors).
rwork(1) On exit, rwork(1) contains the minimum value of lrwork required for optimum performance (for complex flavors).
info (global) INTEGER. If info=0, the execution is successful. info < 0: If the i-th argument is an array and the j-th entry had an illegal value, then info = -(i*100+j); if the i-th argument is a scalar and had an illegal value, then info = -i.

p?trrfs
Provides error bounds and backward error estimates for the solution to a system of linear equations with a distributed triangular coefficient matrix.

Syntax
call pstrrfs(uplo, trans, diag, n, nrhs, a, ia, ja, desca, b, ib, jb, descb, x, ix, jx, descx, ferr, berr, work, lwork, iwork, liwork, info)
call pdtrrfs(uplo, trans, diag, n, nrhs, a, ia, ja, desca, b, ib, jb, descb, x, ix, jx, descx, ferr, berr, work, lwork, iwork, liwork, info)
call pctrrfs(uplo, trans, diag, n, nrhs, a, ia, ja, desca, b, ib, jb, descb, x, ix, jx, descx, ferr, berr, work, lwork, rwork, lrwork, info)
call pztrrfs(uplo, trans, diag, n, nrhs, a, ia, ja, desca, b, ib, jb, descb, x, ix, jx, descx, ferr, berr, work, lwork, rwork, lrwork, info)

Include Files
• C: mkl_scalapack.h

Description
The p?trrfs routine provides error bounds and backward error estimates for the solution to one of the systems of linear equations sub(A)*sub(X) = sub(B), sub(A)^T*sub(X) = sub(B), or sub(A)^H*sub(X) = sub(B), where sub(A) = A(ia:ia+n-1, ja:ja+n-1) is a triangular matrix, sub(B) = B(ib:ib+n-1, jb:jb+nrhs-1), and sub(X) = X(ix:ix+n-1, jx:jx+nrhs-1).
The solution matrix X must be computed by p?trtrs or some other means before entering this routine. The routine p?trrfs does not do iterative refinement because doing so cannot improve the backward error.

Input Parameters
uplo (global) CHARACTER*1. Must be 'U' or 'L'. If uplo = 'U', sub(A) is upper triangular. If uplo = 'L', sub(A) is lower triangular.
trans (global) CHARACTER*1. Must be 'N' or 'T' or 'C'. Specifies the form of the system of equations: If trans = 'N', the system has the form sub(A)*sub(X) = sub(B) (No transpose); If trans = 'T', the system has the form sub(A)^T*sub(X) = sub(B) (Transpose); If trans = 'C', the system has the form sub(A)^H*sub(X) = sub(B) (Conjugate transpose).
diag (global) CHARACTER*1. Must be 'N' or 'U'. If diag = 'N', then sub(A) is non-unit triangular. If diag = 'U', then sub(A) is unit triangular.
n (global) INTEGER. The order of the distributed matrix sub(A) (n ≥ 0).
nrhs (global) INTEGER. The number of right-hand sides, that is, the number of columns of the matrices sub(B) and sub(X) (nrhs ≥ 0).
a, b, x (local) REAL for pstrrfs, DOUBLE PRECISION for pdtrrfs, COMPLEX for pctrrfs, DOUBLE COMPLEX for pztrrfs. Pointers into the local memory to arrays of local dimension a(lld_a,LOCc(ja+n-1)), b(lld_b,LOCc(jb+nrhs-1)), and x(lld_x,LOCc(jx+nrhs-1)), respectively. The array a contains the local pieces of the original triangular distributed matrix sub(A). If uplo = 'U', the leading n-by-n upper triangular part of sub(A) contains the upper triangular part of the matrix, and its strictly lower triangular part is not referenced. If uplo = 'L', the leading n-by-n lower triangular part of sub(A) contains the lower triangular part of the distributed matrix, and its strictly upper triangular part is not referenced. If diag = 'U', the diagonal elements of sub(A) are also not referenced and are assumed to be 1. On entry, the array b contains the local pieces of the distributed matrix of right-hand sides sub(B).
On entry, the array x contains the local pieces of the solution vectors sub(X).
ia, ja (global) INTEGER. The row and column indices in the global array A indicating the first row and the first column of the submatrix sub(A), respectively.
desca (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix A.
ib, jb (global) INTEGER. The row and column indices in the global array B indicating the first row and the first column of the submatrix sub(B), respectively.
descb (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix B.
ix, jx (global) INTEGER. The row and column indices in the global array X indicating the first row and the first column of the submatrix sub(X), respectively.
descx (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix X.
work (local) REAL for pstrrfs, DOUBLE PRECISION for pdtrrfs, COMPLEX for pctrrfs, DOUBLE COMPLEX for pztrrfs. The array work of dimension (lwork) is a workspace array.
lwork (local) INTEGER. The dimension of the array work.
For real flavors: lwork must be at least lwork = 3*LOCr(n+mod(ia-1,mb_a)).
For complex flavors: lwork must be at least lwork = 2*LOCr(n+mod(ia-1,mb_a)).
iwork (local) INTEGER. Workspace array, DIMENSION (liwork). Used in real flavors only.
liwork (local or global) INTEGER. The dimension of the array iwork; used in real flavors only. Must be at least liwork = LOCr(n+mod(ib-1,mb_b)).
rwork (local) REAL for pctrrfs, DOUBLE PRECISION for pztrrfs. Workspace array, DIMENSION (lrwork). Used in complex flavors only.
lrwork (local or global) INTEGER. The dimension of the array rwork; used in complex flavors only. Must be at least lrwork = LOCr(n+mod(ib-1,mb_b)).

Output Parameters
ferr, berr REAL for single precision flavors. DOUBLE PRECISION for double precision flavors. Arrays, dimension LOCc(jb+nrhs-1) each.
The array ferr contains the estimated forward error bound for each solution vector of sub(X). If XTRUE is the true solution corresponding to sub(X), ferr is an estimated upper bound for the magnitude of the largest element in (sub(X) - XTRUE) divided by the magnitude of the largest element in sub(X). The estimate is as reliable as the estimate for rcond, and is almost always a slight overestimate of the true error. This array is tied to the distributed matrix X.
The array berr contains the component-wise relative backward error of each solution vector (that is, the smallest relative change in any entry of sub(A) or sub(B) that makes sub(X) an exact solution). This array is tied to the distributed matrix X.

work(1) On exit, work(1) contains the minimum value of lwork required for optimum performance.

iwork(1) On exit, iwork(1) contains the minimum value of liwork required for optimum performance (for real flavors).

rwork(1) On exit, rwork(1) contains the minimum value of lrwork required for optimum performance (for complex flavors).

info (global) INTEGER. If info=0, the execution is successful.
info < 0: If the i-th argument is an array and the j-th entry had an illegal value, then info = -(i*100+j); if the i-th argument is a scalar and had an illegal value, then info = -i.

Routines for Matrix Inversion

This section describes ScaLAPACK routines that compute the inverse of a matrix based on the previously obtained factorization. Note that it is not recommended to solve a system of equations Ax = b by first computing A^(-1) and then forming the matrix-vector product x = A^(-1)*b. Call a solver routine instead (see Solving Systems of Linear Equations); this is more efficient and more accurate.

p?getri
Computes the inverse of an LU-factored distributed matrix.
Syntax
call psgetri(n, a, ia, ja, desca, ipiv, work, lwork, iwork, liwork, info)
call pdgetri(n, a, ia, ja, desca, ipiv, work, lwork, iwork, liwork, info)
call pcgetri(n, a, ia, ja, desca, ipiv, work, lwork, iwork, liwork, info)
call pzgetri(n, a, ia, ja, desca, ipiv, work, lwork, iwork, liwork, info)

Include Files
• C: mkl_scalapack.h

Description
The p?getri routine computes the inverse of a general distributed matrix sub(A) = A(ia:ia+n-1, ja:ja+n-1) using the LU factorization computed by p?getrf. This method inverts U and then computes the inverse of sub(A) by solving the system inv(sub(A))*L = inv(U) for inv(sub(A)).

Input Parameters

n (global) INTEGER. The number of rows and columns to be operated on, that is, the order of the distributed submatrix sub(A) (n ≥ 0).

a (local) REAL for psgetri, DOUBLE PRECISION for pdgetri, COMPLEX for pcgetri, DOUBLE COMPLEX for pzgetri.
Pointer into the local memory to an array of local dimension a(lld_a,LOCc(ja+n-1)).
On entry, the array a contains the local pieces of the L and U obtained by the factorization sub(A) = P*L*U computed by p?getrf.

ia, ja (global) INTEGER. The row and column indices in the global array A indicating the first row and the first column of the submatrix sub(A), respectively.

desca (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix A.

work (local) REAL for psgetri, DOUBLE PRECISION for pdgetri, COMPLEX for pcgetri, DOUBLE COMPLEX for pzgetri.
The array work of dimension (lwork) is a workspace array.

lwork (local) INTEGER. The dimension of the array work. lwork must be at least LOCr(n+mod(ia-1,mb_a))*nb_a.
The array work is used to keep at most an entire column block of sub(A).

iwork (local) INTEGER. Workspace array used for physically transposing the pivots, DIMENSION (liwork).

liwork (local or global) INTEGER. The dimension of the array iwork.
The minimal value of liwork is determined by the following code:
if NPROW == NPCOL then
    liwork = LOCc(n_a + mod(ja-1,nb_a)) + nb_a
else
    liwork = LOCc(n_a + mod(ja-1,nb_a)) + max(ceil(ceil(LOCr(m_a)/mb_a)/(lcm/NPROW)), nb_a)
end if
where lcm is the least common multiple of process rows and columns (NPROW and NPCOL).

Output Parameters

ipiv (local) INTEGER. Array, dimension (LOCr(m_a) + mb_a). This array contains the pivoting information. If ipiv(i)=j, then the local row i was swapped with the global row j. This array is tied to the distributed matrix A.

work(1) On exit, work(1) contains the minimum value of lwork required for optimum performance.

iwork(1) On exit, iwork(1) contains the minimum value of liwork required for optimum performance.

info (global) INTEGER. If info=0, the execution is successful.
info < 0: If the i-th argument is an array and the j-th entry had an illegal value, then info = -(i*100+j); if the i-th argument is a scalar and had an illegal value, then info = -i.
info > 0: If info = i, U(i,i) is exactly zero. The factorization has been completed, but the factor U is exactly singular, and division by zero will occur if it is used to solve a system of equations.

p?potri
Computes the inverse of a symmetric/Hermitian positive definite distributed matrix.

Syntax
call pspotri(uplo, n, a, ia, ja, desca, info)
call pdpotri(uplo, n, a, ia, ja, desca, info)
call pcpotri(uplo, n, a, ia, ja, desca, info)
call pzpotri(uplo, n, a, ia, ja, desca, info)

Include Files
• C: mkl_scalapack.h

Description
The p?potri routine computes the inverse of a real symmetric or complex Hermitian positive definite distributed matrix sub(A) = A(ia:ia+n-1, ja:ja+n-1) using the Cholesky factorization sub(A) = U^H*U or sub(A) = L*L^H computed by p?potrf.

Input Parameters

uplo (global) CHARACTER*1. Must be 'U' or 'L'.
Specifies whether the upper or lower triangular part of the symmetric/Hermitian matrix sub(A) is stored.
If uplo = 'U', upper triangle of sub(A) is stored.
If uplo = 'L', lower triangle of sub(A) is stored.

n (global) INTEGER. The number of rows and columns to be operated on, that is, the order of the distributed submatrix sub(A) (n ≥ 0).

a (local) REAL for pspotri, DOUBLE PRECISION for pdpotri, COMPLEX for pcpotri, DOUBLE COMPLEX for pzpotri.
Pointer into the local memory to an array of local dimension a(lld_a,LOCc(ja+n-1)).
On entry, the array a contains the local pieces of the triangular factor U or L from the Cholesky factorization sub(A) = U^H*U, or sub(A) = L*L^H, as computed by p?potrf.

ia, ja (global) INTEGER. The row and column indices in the global array A indicating the first row and the first column of the submatrix sub(A), respectively.

desca (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix A.

Output Parameters

a On exit, overwritten by the local pieces of the upper or lower triangle of the (symmetric/Hermitian) inverse of sub(A).

info (global) INTEGER. If info=0, the execution is successful.
info < 0: If the i-th argument is an array and the j-th entry had an illegal value, then info = -(i*100+j); if the i-th argument is a scalar and had an illegal value, then info = -i.
info > 0: If info = i, the (i, i) element of the factor U or L is zero, and the inverse could not be computed.

p?trtri
Computes the inverse of a triangular distributed matrix.

Syntax
call pstrtri(uplo, diag, n, a, ia, ja, desca, info)
call pdtrtri(uplo, diag, n, a, ia, ja, desca, info)
call pctrtri(uplo, diag, n, a, ia, ja, desca, info)
call pztrtri(uplo, diag, n, a, ia, ja, desca, info)

Include Files
• C: mkl_scalapack.h

Description
The p?trtri routine computes the inverse of a real or complex upper or lower triangular distributed matrix sub(A) = A(ia:ia+n-1, ja:ja+n-1).
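
As a serial illustration of the operation that p?trtri performs, the following Python sketch inverts a small upper triangular matrix by back substitution. It is not the distributed algorithm used by ScaLAPACK; the function name invert_upper_triangular, the list-of-rows storage, and the unit_diag flag are assumptions made only for this example. The returned code imitates the info convention documented for this routine (0 on success, k when the k-th diagonal element is exactly zero).

```python
def invert_upper_triangular(a, unit_diag=False):
    """Invert an upper triangular matrix stored as a list of rows.

    Returns (inverse, info). info = 0 on success; info = k (1-based) if the
    k-th diagonal element is exactly zero. Entries below the diagonal are
    never read, and the diagonal is never read when unit_diag is True.
    """
    n = len(a)
    if not unit_diag:
        for k in range(n):
            if a[k][k] == 0:
                return None, k + 1

    inv = [[0.0] * n for _ in range(n)]
    # Column j of the inverse solves U * x = e_j; x is zero below row j,
    # so back substitution runs from row j up to row 0.
    for j in range(n):
        for i in range(j, -1, -1):
            rhs = 1.0 if i == j else 0.0
            partial = sum(a[i][k] * inv[k][j] for k in range(i + 1, j + 1))
            diag = 1.0 if unit_diag else a[i][i]
            inv[i][j] = (rhs - partial) / diag
    return inv, 0


# The value 99.0 below the diagonal is ignored, as for uplo = 'U'.
u = [[2.0, 1.0],
     [99.0, 4.0]]
u_inv, info = invert_upper_triangular(u)
```

The same loop structure, run from the first row downward, would handle a lower triangular matrix.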
Input Parameters

uplo (global) CHARACTER*1. Must be 'U' or 'L'.
Specifies whether the distributed matrix sub(A) is upper or lower triangular.
If uplo = 'U', sub(A) is upper triangular.
If uplo = 'L', sub(A) is lower triangular.

diag CHARACTER*1. Must be 'N' or 'U'.
Specifies whether or not the distributed matrix sub(A) is unit triangular.
If diag = 'N', then sub(A) is non-unit triangular.
If diag = 'U', then sub(A) is unit triangular.

n (global) INTEGER. The number of rows and columns to be operated on, that is, the order of the distributed submatrix sub(A) (n ≥ 0).

a (local) REAL for pstrtri, DOUBLE PRECISION for pdtrtri, COMPLEX for pctrtri, DOUBLE COMPLEX for pztrtri.
Pointer into the local memory to an array of local dimension a(lld_a,LOCc(ja+n-1)).
The array a contains the local pieces of the triangular distributed matrix sub(A).
If uplo = 'U', the leading n-by-n upper triangular part of sub(A) contains the upper triangular matrix to be inverted, and the strictly lower triangular part of sub(A) is not referenced.
If uplo = 'L', the leading n-by-n lower triangular part of sub(A) contains the lower triangular matrix, and the strictly upper triangular part of sub(A) is not referenced.

ia, ja (global) INTEGER. The row and column indices in the global array A indicating the first row and the first column of the submatrix sub(A), respectively.

desca (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix A.

Output Parameters

a On exit, overwritten by the (triangular) inverse of the original matrix.

info (global) INTEGER. If info=0, the execution is successful.
info < 0: If the i-th argument is an array and the j-th entry had an illegal value, then info = -(i*100+j); if the i-th argument is a scalar and had an illegal value, then info = -i.
info > 0: If info = k, A(ia+k-1, ja+k-1) is exactly zero. The triangular matrix sub(A) is singular and its inverse cannot be computed.
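
The info encoding used by the routines in this chapter, info = -(i*100+j) for an illegal array entry and info = -i for an illegal scalar, can be unpacked with a few lines of code. The Python sketch below is only an illustration: the function name decode_info and the returned tuples are hypothetical, and it assumes that the entry index j is always between 1 and 99, which the text does not state explicitly.

```python
def decode_info(info):
    """Classify a ScaLAPACK-style info value.

    Returns a tuple (kind, i, j):
      ("success", None, None)             info == 0
      ("routine_specific", info, None)    info > 0, meaning depends on the routine
      ("illegal_array_entry", i, j)       info == -(i*100 + j)
      ("illegal_scalar", i, None)         info == -i
    Assumes 1 <= j <= 99 so that the two negative encodings do not collide.
    """
    if info == 0:
        return ("success", None, None)
    if info > 0:
        return ("routine_specific", info, None)
    code = -info
    if code >= 100:
        i, j = divmod(code, 100)  # i = argument position, j = entry in that array
        return ("illegal_array_entry", i, j)
    return ("illegal_scalar", code, None)


# In the p?trtri call syntax, desca is the 7th argument and n is the 3rd.
bad_descriptor = decode_info(-702)
bad_order = decode_info(-3)
```

For a positive value, the caller still has to consult the description of the specific routine, for example the singular diagonal element reported by p?trtri.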
Routines for Matrix Equilibration

ScaLAPACK routines described in this section are used to compute scaling factors needed to equilibrate a matrix. Note that these routines do not actually scale the matrices.

p?geequ
Computes row and column scaling factors intended to equilibrate a general rectangular distributed matrix and reduce its condition number.

Syntax
call psgeequ(m, n, a, ia, ja, desca, r, c, rowcnd, colcnd, amax, info)
call pdgeequ(m, n, a, ia, ja, desca, r, c, rowcnd, colcnd, amax, info)
call pcgeequ(m, n, a, ia, ja, desca, r, c, rowcnd, colcnd, amax, info)
call pzgeequ(m, n, a, ia, ja, desca, r, c, rowcnd, colcnd, amax, info)

Include Files
• C: mkl_scalapack.h

Description
The p?geequ routine computes row and column scalings intended to equilibrate an m-by-n distributed matrix sub(A) = A(ia:ia+m-1, ja:ja+n-1) and reduce its condition number. The output array r returns the row scale factors and the array c the column scale factors. These factors are chosen to try to make the largest element in each row and column of the matrix B with elements b_ij = r(i)*a_ij*c(j) have absolute value 1.
r(i) and c(j) are restricted to be between SMLNUM = smallest safe number and BIGNUM = largest safe number. Use of these scaling factors is not guaranteed to reduce the condition number of sub(A) but works well in practice.
SMLNUM and BIGNUM are parameters representing machine precision. You can use the ?lamch routines to compute them. For example, compute single precision (real and complex) values of SMLNUM and BIGNUM as follows:
SMLNUM = slamch ('s')
BIGNUM = 1 / SMLNUM
The auxiliary function p?laqge uses scaling factors computed by p?geequ to scale a general rectangular matrix.

Input Parameters

m (global) INTEGER. The number of rows to be operated on, that is, the number of rows of the distributed submatrix sub(A) (m ≥ 0).

n (global) INTEGER.
The number of columns to be operated on, that is, the number of columns of the distributed submatrix sub(A) (n ≥ 0).

a (local) REAL for psgeequ, DOUBLE PRECISION for pdgeequ, COMPLEX for pcgeequ, DOUBLE COMPLEX for pzgeequ.
Pointer into the local memory to an array of local dimension a(lld_a,LOCc(ja+n-1)).
The array a contains the local pieces of the m-by-n distributed matrix whose equilibration factors are to be computed.

ia, ja (global) INTEGER. The row and column indices in the global array A indicating the first row and the first column of the submatrix sub(A), respectively.

desca (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix A.

Output Parameters

r, c (local) REAL for single precision flavors; DOUBLE PRECISION for double precision flavors.
Arrays, dimension LOCr(m_a) and LOCc(n_a), respectively.
If info = 0, or info > ia+m-1, the array r(ia:ia+m-1) contains the row scale factors for sub(A). r is aligned with the distributed matrix A, and replicated across every process column. r is tied to the distributed matrix A.
If info = 0, the array c(ja:ja+n-1) contains the column scale factors for sub(A). c is aligned with the distributed matrix A, and replicated down every process row. c is tied to the distributed matrix A.

rowcnd, colcnd (global) REAL for single precision flavors; DOUBLE PRECISION for double precision flavors.
If info = 0 or info > ia+m-1, rowcnd contains the ratio of the smallest r(i) to the largest r(i) (ia ≤ i ≤ ia+m-1). If rowcnd ≥ 0.1 and amax is neither too large nor too small, it is not worth scaling by r(ia:ia+m-1).
If info = 0, colcnd contains the ratio of the smallest c(j) to the largest c(j) (ja ≤ j ≤ ja+n-1). If colcnd ≥ 0.1, it is not worth scaling by c(ja:ja+n-1).

amax (global) REAL for single precision flavors; DOUBLE PRECISION for double precision flavors.
Absolute value of the largest matrix element.
If amax is very close to overflow or very close to underflow, the matrix should be scaled.

info (global) INTEGER. If info=0, the execution is successful.
info < 0: If the i-th argument is an array and the j-th entry had an illegal value, then info = -(i*100+j); if the i-th argument is a scalar and had an illegal value, then info = -i.
info > 0: If info = i and
i ≤ m, the i-th row of the distributed matrix sub(A) is exactly zero;
i > m, the (i-m)-th column of the distributed matrix sub(A) is exactly zero.

p?poequ
Computes row and column scaling factors intended to equilibrate a symmetric (Hermitian) positive definite distributed matrix and reduce its condition number.

Syntax
call pspoequ(n, a, ia, ja, desca, sr, sc, scond, amax, info)
call pdpoequ(n, a, ia, ja, desca, sr, sc, scond, amax, info)
call pcpoequ(n, a, ia, ja, desca, sr, sc, scond, amax, info)
call pzpoequ(n, a, ia, ja, desca, sr, sc, scond, amax, info)

Include Files
• C: mkl_scalapack.h

Description
The p?poequ routine computes row and column scalings intended to equilibrate a real symmetric or complex Hermitian positive definite distributed matrix sub(A) = A(ia:ia+n-1, ja:ja+n-1) and reduce its condition number (with respect to the two-norm). The output arrays sr and sc return the row and column scale factors. These factors are chosen so that the scaled distributed matrix B with elements b_ij = s(i)*a_ij*s(j) has ones on the diagonal.
This choice of sr and sc puts the condition number of B within a factor n of the smallest possible condition number over all possible diagonal scalings.
The auxiliary function p?laqsy uses scaling factors computed by p?poequ to scale a symmetric/Hermitian matrix.

Input Parameters

n (global) INTEGER. The number of rows and columns to be operated on, that is, the order of the distributed submatrix sub(A) (n ≥ 0).

a (local) REAL for pspoequ, DOUBLE PRECISION for pdpoequ, COMPLEX for pcpoequ, DOUBLE COMPLEX for pzpoequ.
Pointer into the local memory to an array of local dimension a(lld_a,LOCc(ja+n-1)).
The array a contains the n-by-n symmetric/Hermitian positive definite distributed matrix sub(A) whose scaling factors are to be computed. Only the diagonal elements of sub(A) are referenced.

ia, ja (global) INTEGER. The row and column indices in the global array A indicating the first row and the first column of the submatrix sub(A), respectively.

desca (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix A.

Output Parameters

sr, sc (local) REAL for single precision flavors; DOUBLE PRECISION for double precision flavors.
Arrays, dimension LOCr(m_a) and LOCc(n_a), respectively.
If info = 0, the array sr(ia:ia+n-1) contains the row scale factors for sub(A). sr is aligned with the distributed matrix A, and replicated across every process column. sr is tied to the distributed matrix A.
If info = 0, the array sc(ja:ja+n-1) contains the column scale factors for sub(A). sc is aligned with the distributed matrix A, and replicated down every process row. sc is tied to the distributed matrix A.

scond (global) REAL for single precision flavors; DOUBLE PRECISION for double precision flavors.
If info = 0, scond contains the ratio of the smallest sr(i) (or sc(j)) to the largest sr(i) (or sc(j)), with ia ≤ i ≤ ia+n-1 and ja ≤ j ≤ ja+n-1.
If scond ≥ 0.1 and amax is neither too large nor too small, it is not worth scaling by sr (or sc).

amax (global) REAL for single precision flavors; DOUBLE PRECISION for double precision flavors.
Absolute value of the largest matrix element. If amax is very close to overflow or very close to underflow, the matrix should be scaled.

info (global) INTEGER. If info=0, the execution is successful.
info < 0: If the i-th argument is an array and the j-th entry had an illegal value, then info = -(i*100+j); if the i-th argument is a scalar and had an illegal value, then info = -i.
info > 0: If info = k, the k-th diagonal entry of sub(A) is nonpositive.

Orthogonal Factorizations

This section describes the ScaLAPACK routines for the QR (RQ) and LQ (QL) factorization of matrices. Routines for the RZ factorization as well as for generalized QR and RQ factorizations are also included. For the mathematical definition of the factorizations, see the respective LAPACK sections or refer to [SLUG].
Table "Computational Routines for Orthogonal Factorizations" lists ScaLAPACK routines that perform orthogonal factorization of matrices.

Computational Routines for Orthogonal Factorizations

Matrix type, factorization                        Factorize without pivoting   Factorize with pivoting   Generate matrix Q    Apply matrix Q
general matrices, QR factorization                p?geqrf                      p?geqpf                   p?orgqr, p?ungqr     p?ormqr, p?unmqr
general matrices, RQ factorization                p?gerqf                      -                         p?orgrq, p?ungrq     p?ormrq, p?unmrq
general matrices, LQ factorization                p?gelqf                      -                         p?orglq, p?unglq     p?ormlq, p?unmlq
general matrices, QL factorization                p?geqlf                      -                         p?orgql, p?ungql     p?ormql, p?unmql
trapezoidal matrices, RZ factorization            p?tzrzf                      -                         -                    p?ormrz, p?unmrz
pair of matrices, generalized QR factorization    p?ggqrf                      -                         -                    -
pair of matrices, generalized RQ factorization    p?ggrqf                      -                         -                    -

p?geqrf
Computes the QR factorization of a general m-by-n matrix.

Syntax
call psgeqrf(m, n, a, ia, ja, desca, tau, work, lwork, info)
call pdgeqrf(m, n, a, ia, ja, desca, tau, work, lwork, info)
call pcgeqrf(m, n, a, ia, ja, desca, tau, work, lwork, info)
call pzgeqrf(m, n, a, ia, ja, desca, tau, work, lwork, info)

Include Files
• C: mkl_scalapack.h

Description
The p?geqrf routine forms the QR factorization of a general m-by-n distributed matrix sub(A) = A(ia:ia+m-1, ja:ja+n-1) as
sub(A) = Q*R

Input Parameters

m (global) INTEGER. The number of rows in the distributed submatrix sub(A); (m ≥ 0).

n (global) INTEGER.
The number of columns in the distributed submatrix sub(A); (n ≥ 0).

a (local) REAL for psgeqrf, DOUBLE PRECISION for pdgeqrf, COMPLEX for pcgeqrf, DOUBLE COMPLEX for pzgeqrf.
Pointer into the local memory to an array of local dimension (lld_a, LOCc(ja+n-1)).
Contains the local pieces of the distributed matrix sub(A) to be factored.

ia, ja (global) INTEGER. The row and column indices in the global array a indicating the first row and the first column of the submatrix A(ia:ia+m-1, ja:ja+n-1), respectively.

desca (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix A.

work (local) REAL for psgeqrf, DOUBLE PRECISION for pdgeqrf, COMPLEX for pcgeqrf, DOUBLE COMPLEX for pzgeqrf.
Workspace array of dimension lwork.

lwork (local or global) INTEGER, dimension of work, must be at least nb_a * (mp0+nq0+nb_a), where
iroff = mod(ia-1, mb_a), icoff = mod(ja-1, nb_a),
iarow = indxg2p(ia, mb_a, MYROW, rsrc_a, NPROW),
iacol = indxg2p(ja, nb_a, MYCOL, csrc_a, NPCOL),
mp0 = numroc(m+iroff, mb_a, MYROW, iarow, NPROW),
nq0 = numroc(n+icoff, nb_a, MYCOL, iacol, NPCOL),
and numroc, indxg2p are ScaLAPACK tool functions; MYROW, MYCOL, NPROW and NPCOL can be determined by calling the subroutine blacs_gridinfo.
If lwork = -1, then lwork is global input and a workspace query is assumed; the routine only calculates the minimum and optimal size for all work arrays. Each of these values is returned in the first entry of the corresponding work array, and no error message is issued by pxerbla.

Output Parameters

a The elements on and above the diagonal of sub(A) contain the min(m,n)-by-n upper trapezoidal matrix R (R is upper triangular if m ≥ n); the elements below the diagonal, with the array tau, represent the orthogonal/unitary matrix Q as a product of elementary reflectors (see Application Notes below).

tau (local) REAL for psgeqrf, DOUBLE PRECISION for pdgeqrf, COMPLEX for pcgeqrf, DOUBLE COMPLEX for pzgeqrf.
Array, DIMENSION LOCc(ja+min(m,n)-1). Contains the scalar factor tau of elementary reflectors. tau is tied to the distributed matrix A.

work(1) On exit, work(1) contains the minimum value of lwork required for optimum performance.

info (global) INTEGER.
= 0, the execution is successful.
< 0, if the i-th argument is an array and the j-th entry had an illegal value, then info = -(i*100+j); if the i-th argument is a scalar and had an illegal value, then info = -i.

Application Notes
The matrix Q is represented as a product of elementary reflectors
Q = H(ja)*H(ja+1)*...*H(ja+k-1),
where k = min(m,n).
Each H(i) has the form
H(i) = I - tau*v*v'
where tau is a real/complex scalar, and v is a real/complex vector with v(1:i-1) = 0 and v(i) = 1; v(i+1:m) is stored on exit in A(ia+i:ia+m-1, ja+i-1), and tau in tau(ja+i-1).

p?geqpf
Computes the QR factorization of a general m-by-n matrix with pivoting.

Syntax
call psgeqpf(m, n, a, ia, ja, desca, ipiv, tau, work, lwork, info)
call pdgeqpf(m, n, a, ia, ja, desca, ipiv, tau, work, lwork, info)
call pcgeqpf(m, n, a, ia, ja, desca, ipiv, tau, work, lwork, info)
call pzgeqpf(m, n, a, ia, ja, desca, ipiv, tau, work, lwork, info)

Include Files
• C: mkl_scalapack.h

Description
The p?geqpf routine forms the QR factorization with column pivoting of a general m-by-n distributed matrix sub(A) = A(ia:ia+m-1, ja:ja+n-1) as
sub(A)*P = Q*R

Input Parameters

m (global) INTEGER. The number of rows in the submatrix sub(A) (m ≥ 0).

n (global) INTEGER. The number of columns in the submatrix sub(A) (n ≥ 0).

a (local) REAL for psgeqpf, DOUBLE PRECISION for pdgeqpf, COMPLEX for pcgeqpf, DOUBLE COMPLEX for pzgeqpf.
Pointer into the local memory to an array of local dimension (lld_a, LOCc(ja+n-1)).
Contains the local pieces of the distributed matrix sub(A) to be factored.

ia, ja (global) INTEGER.
The row and column indices in the global array a indicating the first row and the first column of the submatrix A(ia:ia+m-1, ja:ja+n-1), respectively.

desca (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix A.

work (local) REAL for psgeqpf, DOUBLE PRECISION for pdgeqpf, COMPLEX for pcgeqpf, DOUBLE COMPLEX for pzgeqpf.
Workspace array of dimension lwork.

lwork (local or global) INTEGER, dimension of work, must be at least
For real flavors: lwork = max(3,mp0+nq0) + LOCc(ja+n-1) + nq0.
For complex flavors: lwork = max(3,mp0+nq0).
Here
iroff = mod(ia-1, mb_a), icoff = mod(ja-1, nb_a),
iarow = indxg2p(ia, mb_a, MYROW, rsrc_a, NPROW),
iacol = indxg2p(ja, nb_a, MYCOL, csrc_a, NPCOL),
mp0 = numroc(m+iroff, mb_a, MYROW, iarow, NPROW),
nq0 = numroc(n+icoff, nb_a, MYCOL, iacol, NPCOL),
LOCc(ja+n-1) = numroc(ja+n-1, nb_a, MYCOL, csrc_a, NPCOL),
and numroc, indxg2p are ScaLAPACK tool functions.
You can determine MYROW, MYCOL, NPROW and NPCOL by calling the blacs_gridinfo subroutine.
If lwork = -1, then lwork is global input and a workspace query is assumed; the routine only calculates the minimum and optimal size for all work arrays. Each of these values is returned in the first entry of the corresponding work array, and no error message is issued by pxerbla.

Output Parameters

a The elements on and above the diagonal of sub(A) contain the min(m, n)-by-n upper trapezoidal matrix R (R is upper triangular if m ≥ n); the elements below the diagonal, with the array tau, represent the orthogonal/unitary matrix Q as a product of elementary reflectors (see Application Notes below).

ipiv (local) INTEGER. Array, DIMENSION LOCc(ja+n-1).
If ipiv(i) = k, the local i-th column of sub(A)*P was the global k-th column of sub(A). ipiv is tied to the distributed matrix A.

tau (local) REAL for psgeqpf, DOUBLE PRECISION for pdgeqpf, COMPLEX for pcgeqpf, DOUBLE COMPLEX for pzgeqpf.
Array, DIMENSION LOCc(ja+min(m, n)-1).
Contains the scalar factor tau of elementary reflectors. tau is tied to the distributed matrix A.

work(1) On exit, work(1) contains the minimum value of lwork required for optimum performance.

info (global) INTEGER.
= 0, the execution is successful.
< 0, if the i-th argument is an array and the j-th entry had an illegal value, then info = -(i*100+j); if the i-th argument is a scalar and had an illegal value, then info = -i.

Application Notes
The matrix Q is represented as a product of elementary reflectors
Q = H(1)*H(2)*...*H(k)
where k = min(m,n).
Each H(i) has the form
H(i) = I - tau*v*v'
where tau is a real/complex scalar, and v is a real/complex vector with v(1:i-1) = 0 and v(i) = 1; v(i+1:m) is stored on exit in A(ia+i:ia+m-1, ja+i-1).
The matrix P is represented in ipiv as follows: if ipiv(j) = i, then the j-th column of P is the i-th canonical unit vector.

p?orgqr
Generates the orthogonal matrix Q of the QR factorization formed by p?geqrf.

Syntax
call psorgqr(m, n, k, a, ia, ja, desca, tau, work, lwork, info)
call pdorgqr(m, n, k, a, ia, ja, desca, tau, work, lwork, info)

Include Files
• C: mkl_scalapack.h

Description
The p?orgqr routine generates the whole or part of the m-by-n real distributed matrix Q denoting A(ia:ia+m-1, ja:ja+n-1) with orthonormal columns, which is defined as the first n columns of a product of k elementary reflectors of order m
Q = H(1)*H(2)*...*H(k)
as returned by p?geqrf.

Input Parameters

m (global) INTEGER. The number of rows in the submatrix sub(Q) (m ≥ 0).

n (global) INTEGER. The number of columns in the submatrix sub(Q) (m ≥ n ≥ 0).

k (global) INTEGER. The number of elementary reflectors whose product defines the matrix Q (n ≥ k ≥ 0).

a (local) REAL for psorgqr, DOUBLE PRECISION for pdorgqr.
Pointer into the local memory to an array of local dimension (lld_a, LOCc(ja+n-1)).
The j-th column must contain the vector which defines the elementary reflector H(j), ja ≤ j ≤ ja+k-1, as returned by p?geqrf in the k columns of its distributed matrix argument A(ia:*, ja:ja+k-1).

ia, ja (global) INTEGER. The row and column indices in the global array a indicating the first row and the first column of the submatrix A(ia:ia+m-1, ja:ja+n-1), respectively.

desca (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix A.

tau (local) REAL for psorgqr, DOUBLE PRECISION for pdorgqr.
Array, DIMENSION LOCc(ja+k-1).
Contains the scalar factor tau(j) of elementary reflectors H(j) as returned by p?geqrf. tau is tied to the distributed matrix A.

work (local) REAL for psorgqr, DOUBLE PRECISION for pdorgqr.
Workspace array of dimension lwork.

lwork (local or global) INTEGER, dimension of work. Must be at least nb_a*(nqa0 + mpa0 + nb_a), where
iroffa = mod(ia-1, mb_a), icoffa = mod(ja-1, nb_a),
iarow = indxg2p(ia, mb_a, MYROW, rsrc_a, NPROW),
iacol = indxg2p(ja, nb_a, MYCOL, csrc_a, NPCOL),
mpa0 = numroc(m+iroffa, mb_a, MYROW, iarow, NPROW),
nqa0 = numroc(n+icoffa, nb_a, MYCOL, iacol, NPCOL);
indxg2p and numroc are ScaLAPACK tool functions; MYROW, MYCOL, NPROW and NPCOL can be determined by calling the subroutine blacs_gridinfo.
If lwork = -1, then lwork is global input and a workspace query is assumed; the routine only calculates the minimum and optimal size for all work arrays. Each of these values is returned in the first entry of the corresponding work array, and no error message is issued by pxerbla.

Output Parameters

a Contains the local pieces of the m-by-n distributed matrix Q.

work(1) On exit, work(1) contains the minimum value of lwork required for optimum performance.

info (global) INTEGER.
= 0: the execution is successful.
< 0: if the i-th argument is an array and the j-th entry had an illegal value, then info = -(i*100+j); if the i-th argument is a scalar and had an illegal value, then info = -i.

p?ungqr
Generates the complex unitary matrix Q of the QR factorization formed by p?geqrf.

Syntax
call pcungqr(m, n, k, a, ia, ja, desca, tau, work, lwork, info)
call pzungqr(m, n, k, a, ia, ja, desca, tau, work, lwork, info)

Include Files
• C: mkl_scalapack.h

Description
This routine generates the whole or part of the m-by-n complex distributed matrix Q denoting A(ia:ia+m-1, ja:ja+n-1) with orthonormal columns, which is defined as the first n columns of a product of k elementary reflectors of order m
Q = H(1)*H(2)*...*H(k)
as returned by p?geqrf.

Input Parameters

m (global) INTEGER. The number of rows in the submatrix sub(Q); (m ≥ 0).

n (global) INTEGER. The number of columns in the submatrix sub(Q) (m ≥ n ≥ 0).

k (global) INTEGER. The number of elementary reflectors whose product defines the matrix Q (n ≥ k ≥ 0).

a (local) COMPLEX for pcungqr, DOUBLE COMPLEX for pzungqr.
Pointer into the local memory to an array of dimension (lld_a, LOCc(ja+n-1)). The j-th column must contain the vector which defines the elementary reflector H(j), ja ≤ j ≤ ja+k-1, as returned by p?geqrf in the k columns of its distributed matrix argument A(ia:*, ja:ja+k-1).

ia, ja (global) INTEGER. The row and column indices in the global array a indicating the first row and the first column of the submatrix A, respectively.

desca (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix A.

tau (local) COMPLEX for pcungqr, DOUBLE COMPLEX for pzungqr.
Array, DIMENSION LOCc(ja+k-1).
Contains the scalar factor tau(j) of elementary reflectors H(j) as returned by p?geqrf. tau is tied to the distributed matrix A.

work (local) COMPLEX for pcungqr, DOUBLE COMPLEX for pzungqr.
Workspace array of dimension lwork.
lwork (local or global) INTEGER, dimension of work, must be at least nb_a*(nqa0 + mpa0 + nb_a), where
iroffa = mod(ia-1, mb_a), icoffa = mod(ja-1, nb_a),
iarow = indxg2p(ia, mb_a, MYROW, rsrc_a, NPROW),
iacol = indxg2p(ja, nb_a, MYCOL, csrc_a, NPCOL),
mpa0 = numroc(m+iroffa, mb_a, MYROW, iarow, NPROW),
nqa0 = numroc(n+icoffa, nb_a, MYCOL, iacol, NPCOL);
indxg2p and numroc are ScaLAPACK tool functions; MYROW, MYCOL, NPROW and NPCOL can be determined by calling the subroutine blacs_gridinfo.
If lwork = -1, then lwork is global input and a workspace query is assumed; the routine only calculates the minimum and optimal size for all work arrays. Each of these values is returned in the first entry of the corresponding work array, and no error message is issued by pxerbla.

Output Parameters

a Contains the local pieces of the m-by-n distributed matrix Q.

work(1) On exit, work(1) contains the minimum value of lwork required for optimum performance.

info (global) INTEGER.
= 0: the execution is successful.
< 0: if the i-th argument is an array and the j-th entry had an illegal value, then info = -(i*100+j); if the i-th argument is a scalar and had an illegal value, then info = -i.

p?ormqr
Multiplies a general matrix by the orthogonal matrix Q of the QR factorization formed by p?geqrf.

Syntax
call psormqr(side, trans, m, n, k, a, ia, ja, desca, tau, c, ic, jc, descc, work, lwork, info)
call pdormqr(side, trans, m, n, k, a, ia, ja, desca, tau, c, ic, jc, descc, work, lwork, info)

Include Files
• C: mkl_scalapack.h

Description
The p?ormqr routine overwrites the general real m-by-n distributed matrix sub(C) = C(ic:ic+m-1, jc:jc+n-1) with

                 side = 'L'      side = 'R'
trans = 'N':     Q*sub(C)        sub(C)*Q
trans = 'T':     Q^T*sub(C)      sub(C)*Q^T

where Q is a real orthogonal distributed matrix defined as the product of k elementary reflectors
Q = H(1) H(2)... H(k)
as returned by p?geqrf. Q is of order m if side = 'L' and of order n if side = 'R'.
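
The four products in the p?ormqr description can be formed without ever building Q explicitly, by applying the reflectors H(i) = I - tau*v*v' one at a time. The serial Python sketch below illustrates only the ordering of the reflectors for each side/trans combination in the real case, where each H(i) is symmetric. It is not the blocked, distributed algorithm of ScaLAPACK, and the names apply_q, apply_reflector_left, apply_reflector_right and the (tau, v) list representation are assumptions made for this example.

```python
def apply_reflector_left(tau, v, c):
    """Overwrite c (list of rows) with (I - tau*v*v^T) * c."""
    for j in range(len(c[0])):
        w = tau * sum(v[i] * c[i][j] for i in range(len(v)))
        for i in range(len(v)):
            c[i][j] -= w * v[i]


def apply_reflector_right(tau, v, c):
    """Overwrite c (list of rows) with c * (I - tau*v*v^T)."""
    for row in c:
        w = tau * sum(row[j] * v[j] for j in range(len(v)))
        for j in range(len(v)):
            row[j] -= w * v[j]


def apply_q(side, trans, reflectors, c):
    """Apply Q = H(1)*H(2)*...*H(k) or its transpose to c in place.

    reflectors is [(tau_1, v_1), ..., (tau_k, v_k)]; real data only.
    """
    if side not in ("L", "R") or trans not in ("N", "T"):
        raise ValueError("side must be 'L' or 'R', trans must be 'N' or 'T'")
    # Q*C and C*Q^T need H(k) applied first; Q^T*C and C*Q need H(1) first.
    last_first = (side == "L") == (trans == "N")
    ordered = reversed(reflectors) if last_first else reflectors
    apply_one = apply_reflector_left if side == "L" else apply_reflector_right
    for tau, v in ordered:
        apply_one(tau, v, c)
    return c


# Two reflectors of order 3 with tau = 2/(v^T v), so each H(i) is orthogonal.
reflectors = [(2.0 / 9.0, [1.0, 2.0, 2.0]), (1.0, [0.0, 1.0, 1.0])]
c_original = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
c_work = [row[:] for row in c_original]
apply_q("L", "N", reflectors, c_work)   # c_work now holds Q*C
apply_q("L", "T", reflectors, c_work)   # Q^T*(Q*C) recovers C up to rounding
```

Applying Q and then Q^T returns the original matrix up to rounding, which gives a quick consistency check on the reflector ordering.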
Input Parameters
side (global) CHARACTER.
= 'L': Q or Q^T is applied from the left.
= 'R': Q or Q^T is applied from the right.
trans (global) CHARACTER.
= 'N': no transpose, Q is applied.
= 'T': transpose, Q^T is applied.
m (global) INTEGER. The number of rows in the distributed matrix sub(C) (m ≥ 0).
n (global) INTEGER. The number of columns in the distributed matrix sub(C) (n ≥ 0).
k (global) INTEGER. The number of elementary reflectors whose product defines the matrix Q. Constraints:
If side = 'L', m ≥ k ≥ 0.
If side = 'R', n ≥ k ≥ 0.
a (local) REAL for psormqr, DOUBLE PRECISION for pdormqr. Pointer into the local memory to an array of dimension (lld_a, LOCc(ja+k-1)). The j-th column must contain the vector which defines the elementary reflector H(j), ja ≤ j ≤ ja+k-1, as returned by p?geqrf in the k columns of its distributed matrix argument A(ia:*, ja:ja+k-1). A(ia:*, ja:ja+k-1) is modified by the routine but restored on exit.
If side = 'L', lld_a ≥ max(1, LOCr(ia+m-1)).
If side = 'R', lld_a ≥ max(1, LOCr(ia+n-1)).
ia, ja (global) INTEGER. The row and column indices in the global array a indicating the first row and the first column of the submatrix A, respectively.
desca (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix A.
tau (local) REAL for psormqr, DOUBLE PRECISION for pdormqr. Array, DIMENSION LOCc(ja+k-1). Contains the scalar factors tau(j) of the elementary reflectors H(j) as returned by p?geqrf. tau is tied to the distributed matrix A.
c (local) REAL for psormqr, DOUBLE PRECISION for pdormqr. Pointer into the local memory to an array of local dimension (lld_c, LOCc(jc+n-1)). Contains the local pieces of the distributed matrix sub(C).
ic, jc (global) INTEGER. The row and column indices in the global array c indicating the first row and the first column of the submatrix C, respectively.
descc (global and local) INTEGER array, dimension (dlen_).
The array descriptor for the distributed matrix C.
work (local) REAL for psormqr, DOUBLE PRECISION for pdormqr. Workspace array of dimension lwork.
lwork (local or global) INTEGER, dimension of work. Must be at least:
if side = 'L',
lwork = max((nb_a*(nb_a-1))/2, (nqc0+mpc0)*nb_a) + nb_a*nb_a
else if side = 'R',
lwork = max((nb_a*(nb_a-1))/2, (nqc0+max(npa0+numroc(numroc(n+icoffc, nb_a, 0, 0, NPCOL), nb_a, 0, 0, lcmq), mpc0))*nb_a) + nb_a*nb_a
end if
where
lcmq = lcm/NPCOL with lcm = ilcm(NPROW, NPCOL),
iroffa = mod(ia-1, mb_a),
icoffa = mod(ja-1, nb_a),
iarow = indxg2p(ia, mb_a, MYROW, rsrc_a, NPROW),
npa0 = numroc(n+iroffa, mb_a, MYROW, iarow, NPROW),
iroffc = mod(ic-1, mb_c),
icoffc = mod(jc-1, nb_c),
icrow = indxg2p(ic, mb_c, MYROW, rsrc_c, NPROW),
iccol = indxg2p(jc, nb_c, MYCOL, csrc_c, NPCOL),
mpc0 = numroc(m+iroffc, mb_c, MYROW, icrow, NPROW),
nqc0 = numroc(n+icoffc, nb_c, MYCOL, iccol, NPCOL).
ilcm, indxg2p and numroc are ScaLAPACK tool functions; MYROW, MYCOL, NPROW and NPCOL can be determined by calling the subroutine blacs_gridinfo.
If lwork = -1, then lwork is global input and a workspace query is assumed; the routine only calculates the minimum and optimal size for all work arrays. Each of these values is returned in the first entry of the corresponding work array, and no error message is issued by pxerbla.

Output Parameters
c Overwritten by the product Q*sub(C), Q^T*sub(C), sub(C)*Q^T, or sub(C)*Q.
work(1) On exit, work(1) contains the minimum value of lwork required for optimum performance.
info (global) INTEGER.
= 0: the execution is successful.
< 0: if the i-th argument is an array and the j-th entry had an illegal value, then info = -(i*100+j); if the i-th argument is a scalar and had an illegal value, then info = -i.

p?unmqr
Multiplies a complex matrix by the unitary matrix Q of the QR factorization formed by p?geqrf.
Syntax
call pcunmqr(side, trans, m, n, k, a, ia, ja, desca, tau, c, ic, jc, descc, work, lwork, info)
call pzunmqr(side, trans, m, n, k, a, ia, ja, desca, tau, c, ic, jc, descc, work, lwork, info)

Include Files
• C: mkl_scalapack.h

Description
This routine overwrites the general complex m-by-n distributed matrix sub(C) = C(ic:ic+m-1, jc:jc+n-1) with

               side = 'L'      side = 'R'
trans = 'N':   Q*sub(C)        sub(C)*Q
trans = 'C':   Q^H*sub(C)      sub(C)*Q^H

where Q is a complex unitary distributed matrix defined as the product of k elementary reflectors
Q = H(1) H(2)... H(k)
as returned by p?geqrf. Q is of order m if side = 'L' and of order n if side = 'R'.

Input Parameters
side (global) CHARACTER.
= 'L': Q or Q^H is applied from the left.
= 'R': Q or Q^H is applied from the right.
trans (global) CHARACTER.
= 'N': no transpose, Q is applied.
= 'C': conjugate transpose, Q^H is applied.
m (global) INTEGER. The number of rows in the distributed matrix sub(C) (m ≥ 0).
n (global) INTEGER. The number of columns in the distributed matrix sub(C) (n ≥ 0).
k (global) INTEGER. The number of elementary reflectors whose product defines the matrix Q. Constraints:
If side = 'L', m ≥ k ≥ 0.
If side = 'R', n ≥ k ≥ 0.
a (local) COMPLEX for pcunmqr, DOUBLE COMPLEX for pzunmqr. Pointer into the local memory to an array of dimension (lld_a, LOCc(ja+k-1)). The j-th column must contain the vector which defines the elementary reflector H(j), ja ≤ j ≤ ja+k-1, as returned by p?geqrf in the k columns of its distributed matrix argument A(ia:*, ja:ja+k-1). A(ia:*, ja:ja+k-1) is modified by the routine but restored on exit.
If side = 'L', lld_a ≥ max(1, LOCr(ia+m-1)).
If side = 'R', lld_a ≥ max(1, LOCr(ia+n-1)).
ia, ja (global) INTEGER. The row and column indices in the global array a indicating the first row and the first column of the submatrix A, respectively.
desca (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix A.
tau (local) COMPLEX for pcunmqr, DOUBLE COMPLEX for pzunmqr. Array, DIMENSION LOCc(ja+k-1). Contains the scalar factors tau(j) of the elementary reflectors H(j) as returned by p?geqrf. tau is tied to the distributed matrix A.
c (local) COMPLEX for pcunmqr, DOUBLE COMPLEX for pzunmqr. Pointer into the local memory to an array of local dimension (lld_c, LOCc(jc+n-1)). Contains the local pieces of the distributed matrix sub(C).
ic, jc (global) INTEGER. The row and column indices in the global array c indicating the first row and the first column of the submatrix C, respectively.
descc (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix C.
work (local) COMPLEX for pcunmqr, DOUBLE COMPLEX for pzunmqr. Workspace array of dimension lwork.
lwork (local or global) INTEGER, dimension of work. Must be at least:
if side = 'L',
lwork = max((nb_a*(nb_a-1))/2, (nqc0 + mpc0)*nb_a) + nb_a*nb_a
else if side = 'R',
lwork = max((nb_a*(nb_a-1))/2, (nqc0 + max(npa0 + numroc(numroc(n+icoffc, nb_a, 0, 0, NPCOL), nb_a, 0, 0, lcmq), mpc0))*nb_a) + nb_a*nb_a
end if
where
lcmq = lcm/NPCOL with lcm = ilcm(NPROW, NPCOL),
iroffa = mod(ia-1, mb_a),
icoffa = mod(ja-1, nb_a),
iarow = indxg2p(ia, mb_a, MYROW, rsrc_a, NPROW),
npa0 = numroc(n+iroffa, mb_a, MYROW, iarow, NPROW),
iroffc = mod(ic-1, mb_c),
icoffc = mod(jc-1, nb_c),
icrow = indxg2p(ic, mb_c, MYROW, rsrc_c, NPROW),
iccol = indxg2p(jc, nb_c, MYCOL, csrc_c, NPCOL),
mpc0 = numroc(m+iroffc, mb_c, MYROW, icrow, NPROW),
nqc0 = numroc(n+icoffc, nb_c, MYCOL, iccol, NPCOL).
ilcm, indxg2p and numroc are ScaLAPACK tool functions; MYROW, MYCOL, NPROW and NPCOL can be determined by calling the subroutine blacs_gridinfo.
If lwork = -1, then lwork is global input and a workspace query is assumed; the routine only calculates the minimum and optimal size for all work arrays.
Each of these values is returned in the first entry of the corresponding work array, and no error message is issued by pxerbla.

Output Parameters
c Overwritten by the product Q*sub(C), Q^H*sub(C), sub(C)*Q^H, or sub(C)*Q.
work(1) On exit, work(1) contains the minimum value of lwork required for optimum performance.
info (global) INTEGER.
= 0: the execution is successful.
< 0: if the i-th argument is an array and the j-th entry had an illegal value, then info = -(i*100+j); if the i-th argument is a scalar and had an illegal value, then info = -i.

p?gelqf
Computes the LQ factorization of a general rectangular matrix.

Syntax
call psgelqf(m, n, a, ia, ja, desca, tau, work, lwork, info)
call pdgelqf(m, n, a, ia, ja, desca, tau, work, lwork, info)
call pcgelqf(m, n, a, ia, ja, desca, tau, work, lwork, info)
call pzgelqf(m, n, a, ia, ja, desca, tau, work, lwork, info)

Include Files
• C: mkl_scalapack.h

Description
The p?gelqf routine computes the LQ factorization of a real/complex distributed m-by-n matrix sub(A) = A(ia:ia+m-1, ja:ja+n-1) = L*Q.

Input Parameters
m (global) INTEGER. The number of rows in the submatrix sub(A) (m ≥ 0).
n (global) INTEGER. The number of columns in the submatrix sub(A) (n ≥ 0).
a (local) REAL for psgelqf, DOUBLE PRECISION for pdgelqf, COMPLEX for pcgelqf, DOUBLE COMPLEX for pzgelqf. Pointer into the local memory to an array of local dimension (lld_a, LOCc(ja+n-1)). Contains the local pieces of the distributed matrix sub(A) to be factored.
ia, ja (global) INTEGER. The row and column indices in the global array a indicating the first row and the first column of the submatrix A(ia:ia+m-1, ja:ja+n-1), respectively.
desca (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix A.
work (local) REAL for psgelqf, DOUBLE PRECISION for pdgelqf, COMPLEX for pcgelqf, DOUBLE COMPLEX for pzgelqf. Workspace array of dimension lwork.
lwork (local or global) INTEGER, dimension of work. Must be at least
lwork = mb_a*(mp0 + nq0 + mb_a), where
iroff = mod(ia-1, mb_a),
icoff = mod(ja-1, nb_a),
iarow = indxg2p(ia, mb_a, MYROW, rsrc_a, NPROW),
iacol = indxg2p(ja, nb_a, MYCOL, csrc_a, NPCOL),
mp0 = numroc(m+iroff, mb_a, MYROW, iarow, NPROW),
nq0 = numroc(n+icoff, nb_a, MYCOL, iacol, NPCOL).
indxg2p and numroc are ScaLAPACK tool functions; MYROW, MYCOL, NPROW and NPCOL can be determined by calling the subroutine blacs_gridinfo.
If lwork = -1, then lwork is global input and a workspace query is assumed; the routine only calculates the minimum and optimal size for all work arrays. Each of these values is returned in the first entry of the corresponding work array, and no error message is issued by pxerbla.

Output Parameters
a The elements on and below the diagonal of sub(A) contain the m-by-min(m,n) lower trapezoidal matrix L (L is lower triangular if m ≤ n); the elements above the diagonal, with the array tau, represent the orthogonal/unitary matrix Q as a product of elementary reflectors (see Application Notes below).
tau (local) REAL for psgelqf, DOUBLE PRECISION for pdgelqf, COMPLEX for pcgelqf, DOUBLE COMPLEX for pzgelqf. Array, DIMENSION LOCr(ia+min(m, n)-1). Contains the scalar factors of elementary reflectors. tau is tied to the distributed matrix A.
work(1) On exit, work(1) contains the minimum value of lwork required for optimum performance.
info (global) INTEGER.
= 0: the execution is successful.
< 0: if the i-th argument is an array and the j-th entry had an illegal value, then info = -(i*100+j); if the i-th argument is a scalar and had an illegal value, then info = -i.
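To make the lwork bound for p?gelqf concrete, here is a minimal serial sketch in Python that evaluates lwork = mb_a*(mp0 + nq0 + mb_a) for one process of the grid. The functions numroc and indxg2p below are plain-Python re-implementations written for this example, assuming the usual block-cyclic conventions (0-based process coordinates, 1-based global indices); gelqf_min_lwork and its argument order are hypothetical. A real program would call the ScaLAPACK tool functions, or simply issue a workspace query with lwork = -1.

```python
def numroc(n, nb, iproc, isrcproc, nprocs):
    """Rows or columns of a block-cyclically distributed dimension of global
    size n (block size nb) that land on process iproc."""
    mydist = (nprocs + iproc - isrcproc) % nprocs   # distance from the source process
    nblocks = n // nb                               # number of whole blocks
    count = (nblocks // nprocs) * nb                # blocks every process receives
    extra = nblocks % nprocs                        # leftover whole blocks
    if mydist < extra:
        count += nb                                 # one extra whole block
    elif mydist == extra:
        count += n % nb                             # the trailing partial block
    return count

def indxg2p(indxglob, nb, iproc, isrcproc, nprocs):
    """Process coordinate owning the 1-based global index indxglob
    (iproc is unused and kept only to mirror the argument list)."""
    return (isrcproc + (indxglob - 1) // nb) % nprocs

def gelqf_min_lwork(m, n, ia, ja, mb_a, nb_a, rsrc_a, csrc_a,
                    myrow, mycol, nprow, npcol):
    """Evaluate lwork = mb_a*(mp0 + nq0 + mb_a) for the calling process."""
    iroff = (ia - 1) % mb_a
    icoff = (ja - 1) % nb_a
    iarow = indxg2p(ia, mb_a, myrow, rsrc_a, nprow)
    iacol = indxg2p(ja, nb_a, mycol, csrc_a, npcol)
    mp0 = numroc(m + iroff, mb_a, myrow, iarow, nprow)
    nq0 = numroc(n + icoff, nb_a, mycol, iacol, npcol)
    return mb_a * (mp0 + nq0 + mb_a)

# 1000-by-800 matrix, 64-by-64 blocks, 2-by-2 process grid, ia = ja = 1.
lwork_00 = gelqf_min_lwork(1000, 800, 1, 1, 64, 64, 0, 0, 0, 0, 2, 2)
lwork_11 = gelqf_min_lwork(1000, 800, 1, 1, 64, 64, 0, 0, 1, 1, 2, 2)
```

For this hypothetical configuration, process (0,0) owns 512 local rows and 416 local columns, so the bound evaluates to 64*(512 + 416 + 64) = 63488, while process (1,1) gets 64*(488 + 384 + 64) = 59904. The bound is therefore process-dependent, which is one reason a workspace query is often the more convenient route.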
Application Notes
The matrix Q is represented as a product of elementary reflectors
Q = H(ia+k-1)*H(ia+k-2)*...*H(ia), where k = min(m,n).
Each H(i) has the form
H(i) = I - tau*v*v'
where tau is a real/complex scalar, and v is a real/complex vector with v(1:i-1) = 0 and v(i) = 1; v(i+1:n) is stored on exit in A(ia+i-1, ja+i:ja+n-1), and tau in tau(ia+i-1).

p?orglq
Generates the real orthogonal matrix Q of the LQ factorization formed by p?gelqf.

Syntax
call psorglq(m, n, k, a, ia, ja, desca, tau, work, lwork, info)
call pdorglq(m, n, k, a, ia, ja, desca, tau, work, lwork, info)

Include Files
• C: mkl_scalapack.h

Description
The p?orglq routine generates the whole or part of the m-by-n real distributed matrix Q denoting A(ia:ia+m-1, ja:ja+n-1) with orthonormal rows, which is defined as the first m rows of a product of k elementary reflectors of order n
Q = H(k)*...*H(2)*H(1)
as returned by p?gelqf.

Input Parameters
m (global) INTEGER. The number of rows in the submatrix sub(Q) (m ≥ 0).
n (global) INTEGER. The number of columns in the submatrix sub(Q) (n ≥ m ≥ 0).
k (global) INTEGER. The number of elementary reflectors whose product defines the matrix Q (m ≥ k ≥ 0).
a (local) REAL for psorglq, DOUBLE PRECISION for pdorglq. Pointer into the local memory to an array of local dimension (lld_a, LOCc(ja+n-1)). On entry, the i-th row must contain the vector which defines the elementary reflector H(i), ia ≤ i ≤ ia+k-1, as returned by p?gelqf in the k rows of its distributed matrix argument A(ia:ia+k-1, ja:*).
ia, ja (global) INTEGER. The row and column indices in the global array a indicating the first row and the first column of the submatrix A(ia:ia+m-1, ja:ja+n-1), respectively.
desca (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix A.
work (local) REAL for psorglq, DOUBLE PRECISION for pdorglq. Workspace array of dimension lwork.
lwork (local or global) INTEGER, dimension of work. Must be at least
lwork = mb_a*(mpa0 + nqa0 + mb_a), where
iroffa = mod(ia-1, mb_a),
icoffa = mod(ja-1, nb_a),
iarow = indxg2p(ia, mb_a, MYROW, rsrc_a, NPROW),
iacol = indxg2p(ja, nb_a, MYCOL, csrc_a, NPCOL),
mpa0 = numroc(m+iroffa, mb_a, MYROW, iarow, NPROW),
nqa0 = numroc(n+icoffa, nb_a, MYCOL, iacol, NPCOL).
indxg2p and numroc are ScaLAPACK tool functions; MYROW, MYCOL, NPROW and NPCOL can be determined by calling the subroutine blacs_gridinfo.
If lwork = -1, then lwork is global input and a workspace query is assumed; the routine only calculates the minimum and optimal size for all work arrays. Each of these values is returned in the first entry of the corresponding work array, and no error message is issued by pxerbla.

Output Parameters
a Contains the local pieces of the m-by-n distributed matrix Q.
tau (local) REAL for psorglq, DOUBLE PRECISION for pdorglq. Array, DIMENSION LOCr(ia+k-1). Contains the scalar factors tau of the elementary reflectors H(i). tau is tied to the distributed matrix A.
work(1) On exit, work(1) contains the minimum value of lwork required for optimum performance.
info (global) INTEGER.
= 0: the execution is successful.
< 0: if the i-th argument is an array and the j-th entry had an illegal value, then info = -(i*100+j); if the i-th argument is a scalar and had an illegal value, then info = -i.

p?unglq
Generates the unitary matrix Q of the LQ factorization formed by p?gelqf.

Syntax
call pcunglq(m, n, k, a, ia, ja, desca, tau, work, lwork, info)
call pzunglq(m, n, k, a, ia, ja, desca, tau, work, lwork, info)

Include Files
• C: mkl_scalapack.h

Description
This routine generates the whole or part of the m-by-n complex distributed matrix Q denoting A(ia:ia+m-1, ja:ja+n-1) with orthonormal rows, which is defined as the first m rows of a product of k elementary reflectors of order n
Q = (H(k))^H*...*(H(2))^H*(H(1))^H
as returned by p?gelqf.
Input Parameters
m (global) INTEGER. The number of rows in the submatrix sub(Q) (m ≥ 0).
n (global) INTEGER. The number of columns in the submatrix sub(Q) (n ≥ m ≥ 0).
k (global) INTEGER. The number of elementary reflectors whose product defines the matrix Q (m ≥ k ≥ 0).
a (local) COMPLEX for pcunglq, DOUBLE COMPLEX for pzunglq. Pointer into the local memory to an array of local dimension (lld_a, LOCc(ja+n-1)). On entry, the i-th row must contain the vector which defines the elementary reflector H(i), ia ≤ i ≤ ia+k-1, as returned by p?gelqf in the k rows of its distributed matrix argument A(ia:ia+k-1, ja:*).
ia, ja (global) INTEGER. The row and column indices in the global array a indicating the first row and the first column of the submatrix A(ia:ia+m-1, ja:ja+n-1), respectively.
desca (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix A.
tau (local) COMPLEX for pcunglq, DOUBLE COMPLEX for pzunglq. Array, DIMENSION LOCr(ia+k-1). Contains the scalar factors tau of the elementary reflectors H(i). tau is tied to the distributed matrix A.
work (local) COMPLEX for pcunglq, DOUBLE COMPLEX for pzunglq. Workspace array of dimension lwork.
lwork (local or global) INTEGER, dimension of work. Must be at least
lwork = mb_a*(mpa0 + nqa0 + mb_a), where
iroffa = mod(ia-1, mb_a),
icoffa = mod(ja-1, nb_a),
iarow = indxg2p(ia, mb_a, MYROW, rsrc_a, NPROW),
iacol = indxg2p(ja, nb_a, MYCOL, csrc_a, NPCOL),
mpa0 = numroc(m+iroffa, mb_a, MYROW, iarow, NPROW),
nqa0 = numroc(n+icoffa, nb_a, MYCOL, iacol, NPCOL).
indxg2p and numroc are ScaLAPACK tool functions; MYROW, MYCOL, NPROW and NPCOL can be determined by calling the subroutine blacs_gridinfo.
If lwork = -1, then lwork is global input and a workspace query is assumed; the routine only calculates the minimum and optimal size for all work arrays.
Each of these values is returned in the first entry of the corresponding work array, and no error message is issued by pxerbla.

Output Parameters
a Contains the local pieces of the m-by-n distributed matrix Q.
work(1) On exit, work(1) contains the minimum value of lwork required for optimum performance.
info (global) INTEGER.
= 0: the execution is successful.
< 0: if the i-th argument is an array and the j-th entry had an illegal value, then info = -(i*100+j); if the i-th argument is a scalar and had an illegal value, then info = -i.

p?ormlq
Multiplies a general matrix by the orthogonal matrix Q of the LQ factorization formed by p?gelqf.

Syntax
call psormlq(side, trans, m, n, k, a, ia, ja, desca, tau, c, ic, jc, descc, work, lwork, info)
call pdormlq(side, trans, m, n, k, a, ia, ja, desca, tau, c, ic, jc, descc, work, lwork, info)

Include Files
• C: mkl_scalapack.h

Description
The p?ormlq routine overwrites the general real m-by-n distributed matrix sub(C) = C(ic:ic+m-1, jc:jc+n-1) with

               side = 'L'      side = 'R'
trans = 'N':   Q*sub(C)        sub(C)*Q
trans = 'T':   Q^T*sub(C)      sub(C)*Q^T

where Q is a real orthogonal distributed matrix defined as the product of k elementary reflectors
Q = H(k)...H(2) H(1)
as returned by p?gelqf. Q is of order m if side = 'L' and of order n if side = 'R'.

Input Parameters
side (global) CHARACTER.
= 'L': Q or Q^T is applied from the left.
= 'R': Q or Q^T is applied from the right.
trans (global) CHARACTER.
= 'N': no transpose, Q is applied.
= 'T': transpose, Q^T is applied.
m (global) INTEGER. The number of rows in the distributed matrix sub(C) (m ≥ 0).
n (global) INTEGER. The number of columns in the distributed matrix sub(C) (n ≥ 0).
k (global) INTEGER. The number of elementary reflectors whose product defines the matrix Q. Constraints:
If side = 'L', m ≥ k ≥ 0.
If side = 'R', n ≥ k ≥ 0.
a (local) REAL for psormlq, DOUBLE PRECISION for pdormlq.
Pointer into the local memory to an array of dimension (lld_a, LOCc(ja+m-1)) if side = 'L', and (lld_a, LOCc(ja+n-1)) if side = 'R'. The i-th row must contain the vector which defines the elementary reflector H(i), ia ≤ i ≤ ia+k-1, as returned by p?gelqf in the k rows of its distributed matrix argument A(ia:ia+k-1, ja:*). A(ia:ia+k-1, ja:*) is modified by the routine but restored on exit.
ia, ja (global) INTEGER. The row and column indices in the global array a indicating the first row and the first column of the submatrix A, respectively.
desca (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix A.
tau (local) REAL for psormlq, DOUBLE PRECISION for pdormlq. Array, DIMENSION LOCc(ja+k-1). Contains the scalar factors tau(i) of the elementary reflectors H(i) as returned by p?gelqf. tau is tied to the distributed matrix A.
c (local) REAL for psormlq, DOUBLE PRECISION for pdormlq. Pointer into the local memory to an array of local dimension (lld_c, LOCc(jc+n-1)). Contains the local pieces of the distributed matrix sub(C).
ic, jc (global) INTEGER. The row and column indices in the global array c indicating the first row and the first column of the submatrix C, respectively.
descc (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix C.
work (local) REAL for psormlq, DOUBLE PRECISION for pdormlq. Workspace array of dimension lwork.
lwork (local or global) INTEGER, dimension of the array work. Must be at least:
if side = 'L',
lwork = max((mb_a*(mb_a-1))/2, (mpc0 + max(mqa0 + numroc(numroc(m+iroffc, mb_a, 0, 0, NPROW), mb_a, 0, 0, lcmp), nqc0))*mb_a) + mb_a*mb_a
else if side = 'R',
lwork = max((mb_a*(mb_a-1))/2, (mpc0 + nqc0)*mb_a) + mb_a*mb_a
end if
where
lcmp = lcm/NPROW with lcm = ilcm(NPROW, NPCOL),
iroffa = mod(ia-1, mb_a),
icoffa = mod(ja-1, nb_a),
iacol = indxg2p(ja, nb_a, MYCOL, csrc_a, NPCOL),
mqa0 = numroc(m+icoffa, nb_a, MYCOL, iacol, NPCOL),
iroffc = mod(ic-1, mb_c),
icoffc = mod(jc-1, nb_c),
icrow = indxg2p(ic, mb_c, MYROW, rsrc_c, NPROW),
iccol = indxg2p(jc, nb_c, MYCOL, csrc_c, NPCOL),
mpc0 = numroc(m+iroffc, mb_c, MYROW, icrow, NPROW),
nqc0 = numroc(n+icoffc, nb_c, MYCOL, iccol, NPCOL).
ilcm, indxg2p and numroc are ScaLAPACK tool functions; MYROW, MYCOL, NPROW and NPCOL can be determined by calling the subroutine blacs_gridinfo.
If lwork = -1, then lwork is global input and a workspace query is assumed; the routine only calculates the minimum and optimal size for all work arrays. Each of these values is returned in the first entry of the corresponding work array, and no error message is issued by pxerbla.

Output Parameters
c Overwritten by the product Q*sub(C), Q^T*sub(C), sub(C)*Q^T, or sub(C)*Q.
work(1) On exit, work(1) contains the minimum value of lwork required for optimum performance.
info (global) INTEGER.
= 0: the execution is successful.
< 0: if the i-th argument is an array and the j-th entry had an illegal value, then info = -(i*100+j); if the i-th argument is a scalar and had an illegal value, then info = -i.

p?unmlq
Multiplies a general matrix by the unitary matrix Q of the LQ factorization formed by p?gelqf.
Syntax
call pcunmlq(side, trans, m, n, k, a, ia, ja, desca, tau, c, ic, jc, descc, work, lwork, info)
call pzunmlq(side, trans, m, n, k, a, ia, ja, desca, tau, c, ic, jc, descc, work, lwork, info)

Include Files
• C: mkl_scalapack.h

Description
This routine overwrites the general complex m-by-n distributed matrix sub(C) = C(ic:ic+m-1, jc:jc+n-1) with

               side = 'L'      side = 'R'
trans = 'N':   Q*sub(C)        sub(C)*Q
trans = 'C':   Q^H*sub(C)      sub(C)*Q^H

where Q is a complex unitary distributed matrix defined as the product of k elementary reflectors
Q = (H(k))^H ... (H(2))^H (H(1))^H
as returned by p?gelqf. Q is of order m if side = 'L' and of order n if side = 'R'.

Input Parameters
side (global) CHARACTER.
= 'L': Q or Q^H is applied from the left.
= 'R': Q or Q^H is applied from the right.
trans (global) CHARACTER.
= 'N': no transpose, Q is applied.
= 'C': conjugate transpose, Q^H is applied.
m (global) INTEGER. The number of rows in the distributed matrix sub(C) (m ≥ 0).
n (global) INTEGER. The number of columns in the distributed matrix sub(C) (n ≥ 0).
k (global) INTEGER. The number of elementary reflectors whose product defines the matrix Q. Constraints:
If side = 'L', m ≥ k ≥ 0.
If side = 'R', n ≥ k ≥ 0.
a (local) COMPLEX for pcunmlq, DOUBLE COMPLEX for pzunmlq. Pointer into the local memory to an array of dimension (lld_a, LOCc(ja+m-1)) if side = 'L', and (lld_a, LOCc(ja+n-1)) if side = 'R', where lld_a ≥ max(1, LOCr(ia+k-1)). The i-th row must contain the vector which defines the elementary reflector H(i), ia ≤ i ≤ ia+k-1, as returned by p?gelqf in the k rows of its distributed matrix argument A(ia:ia+k-1, ja:*). A(ia:ia+k-1, ja:*) is modified by the routine but restored on exit.
ia, ja (global) INTEGER. The row and column indices in the global array a indicating the first row and the first column of the submatrix A, respectively.
desca (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix A.
tau (local) COMPLEX for pcunmlq, DOUBLE COMPLEX for pzunmlq. Array, DIMENSION LOCc(ia+k-1). Contains the scalar factors tau(i) of the elementary reflectors H(i) as returned by p?gelqf. tau is tied to the distributed matrix A.
c (local) COMPLEX for pcunmlq, DOUBLE COMPLEX for pzunmlq. Pointer into the local memory to an array of local dimension (lld_c, LOCc(jc+n-1)). Contains the local pieces of the distributed matrix sub(C).
ic, jc (global) INTEGER. The row and column indices in the global array c indicating the first row and the first column of the submatrix C, respectively.
descc (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix C.
work (local) COMPLEX for pcunmlq, DOUBLE COMPLEX for pzunmlq. Workspace array of dimension lwork.
lwork (local or global) INTEGER, dimension of the array work. Must be at least:
if side = 'L',
lwork = max((mb_a*(mb_a-1))/2, (mpc0 + max(mqa0 + numroc(numroc(m+iroffc, mb_a, 0, 0, NPROW), mb_a, 0, 0, lcmp), nqc0))*mb_a) + mb_a*mb_a
else if side = 'R',
lwork = max((mb_a*(mb_a-1))/2, (mpc0 + nqc0)*mb_a) + mb_a*mb_a
end if
where
lcmp = lcm/NPROW with lcm = ilcm(NPROW, NPCOL),
iroffa = mod(ia-1, mb_a),
icoffa = mod(ja-1, nb_a),
iacol = indxg2p(ja, nb_a, MYCOL, csrc_a, NPCOL),
mqa0 = numroc(m+icoffa, nb_a, MYCOL, iacol, NPCOL),
iroffc = mod(ic-1, mb_c),
icoffc = mod(jc-1, nb_c),
icrow = indxg2p(ic, mb_c, MYROW, rsrc_c, NPROW),
iccol = indxg2p(jc, nb_c, MYCOL, csrc_c, NPCOL),
mpc0 = numroc(m+iroffc, mb_c, MYROW, icrow, NPROW),
nqc0 = numroc(n+icoffc, nb_c, MYCOL, iccol, NPCOL).
ilcm, indxg2p and numroc are ScaLAPACK tool functions; MYROW, MYCOL, NPROW and NPCOL can be determined by calling the subroutine blacs_gridinfo.
If lwork = -1, then lwork is global input and a workspace query is assumed; the routine only calculates the minimum and optimal size for all work arrays.
Each of these values is returned in the first entry of the corresponding work array, and no error message is issued by pxerbla.

Output Parameters
c Overwritten by the product Q*sub(C), Q^H*sub(C), sub(C)*Q^H, or sub(C)*Q.
work(1) On exit, work(1) contains the minimum value of lwork required for optimum performance.
info (global) INTEGER.
= 0: the execution is successful.
< 0: if the i-th argument is an array and the j-th entry had an illegal value, then info = -(i*100+j); if the i-th argument is a scalar and had an illegal value, then info = -i.

p?geqlf
Computes the QL factorization of a general matrix.

Syntax
call psgeqlf(m, n, a, ia, ja, desca, tau, work, lwork, info)
call pdgeqlf(m, n, a, ia, ja, desca, tau, work, lwork, info)
call pcgeqlf(m, n, a, ia, ja, desca, tau, work, lwork, info)
call pzgeqlf(m, n, a, ia, ja, desca, tau, work, lwork, info)

Include Files
• C: mkl_scalapack.h

Description
The p?geqlf routine forms the QL factorization of a real/complex distributed m-by-n matrix sub(A) = A(ia:ia+m-1, ja:ja+n-1) = Q*L.

Input Parameters
m (global) INTEGER. The number of rows in the submatrix sub(A) (m ≥ 0).
n (global) INTEGER. The number of columns in the submatrix sub(A) (n ≥ 0).
a (local) REAL for psgeqlf, DOUBLE PRECISION for pdgeqlf, COMPLEX for pcgeqlf, DOUBLE COMPLEX for pzgeqlf. Pointer into the local memory to an array of local dimension (lld_a, LOCc(ja+n-1)). Contains the local pieces of the distributed matrix sub(A) to be factored.
ia, ja (global) INTEGER. The row and column indices in the global array a indicating the first row and the first column of the submatrix A(ia:ia+m-1, ja:ja+n-1), respectively.
desca (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix A.
work (local) REAL for psgeqlf, DOUBLE PRECISION for pdgeqlf, COMPLEX for pcgeqlf, DOUBLE COMPLEX for pzgeqlf. Workspace array of dimension lwork.
lwork (local or global) INTEGER, dimension of work. Must be at least
lwork = nb_a*(mp0 + nq0 + nb_a), where
iroff = mod(ia-1, mb_a),
icoff = mod(ja-1, nb_a),
iarow = indxg2p(ia, mb_a, MYROW, rsrc_a, NPROW),
iacol = indxg2p(ja, nb_a, MYCOL, csrc_a, NPCOL),
mp0 = numroc(m+iroff, mb_a, MYROW, iarow, NPROW),
nq0 = numroc(n+icoff, nb_a, MYCOL, iacol, NPCOL).
numroc and indxg2p are ScaLAPACK tool functions; MYROW, MYCOL, NPROW and NPCOL can be determined by calling the subroutine blacs_gridinfo.
If lwork = -1, then lwork is global input and a workspace query is assumed; the routine only calculates the minimum and optimal size for all work arrays. Each of these values is returned in the first entry of the corresponding work array, and no error message is issued by pxerbla.

Output Parameters
a On exit, if m ≥ n, the lower triangle of the distributed submatrix A(ia+m-n:ia+m-1, ja:ja+n-1) contains the n-by-n lower triangular matrix L; if m ≤ n, the elements on and below the (n-m)-th superdiagonal contain the m-by-n lower trapezoidal matrix L; the remaining elements, with the array tau, represent the orthogonal/unitary matrix Q as a product of elementary reflectors (see Application Notes below).
tau (local) REAL for psgeqlf, DOUBLE PRECISION for pdgeqlf, COMPLEX for pcgeqlf, DOUBLE COMPLEX for pzgeqlf. Array, DIMENSION LOCc(ja+n-1). Contains the scalar factors of elementary reflectors. tau is tied to the distributed matrix A.
work(1) On exit, work(1) contains the minimum value of lwork required for optimum performance.
info (global) INTEGER.
= 0: the execution is successful.
< 0: if the i-th argument is an array and the j-th entry had an illegal value, then info = -(i*100+j); if the i-th argument is a scalar and had an illegal value, then info = -i.
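The info encoding described above can be unpacked mechanically. The Python sketch below is an illustration only: decode_info is a hypothetical helper, and it assumes that scalar argument positions are always below 100, so any magnitude of 100 or more is read as the array form -(i*100+j). Positive values are not described for this routine and are reported as 'unknown'.

```python
def decode_info(info):
    """Split a ScaLAPACK-style info value into (kind, argument, entry).

    kind is 'success', 'scalar', 'array', or 'unknown'; argument is the
    1-based position i in the call list; entry is the 1-based index j
    inside an array argument (None when not applicable).
    """
    if info == 0:
        return ('success', None, None)
    if info > 0:
        # Not covered by the description of this routine.
        return ('unknown', None, None)
    code = -info
    if code >= 100:
        # Array argument: info = -(i*100 + j)
        return ('array', code // 100, code % 100)
    # Scalar argument: info = -i
    return ('scalar', code, None)

# In the p?geqlf call list (m, n, a, ia, ja, desca, tau, work, lwork, info),
# desca is argument 6 and lwork is argument 9.
bad_descriptor = decode_info(-605)   # entry 5 of argument 6
bad_lwork = decode_info(-9)          # scalar argument 9
```

Here bad_descriptor is ('array', 6, 5) and bad_lwork is ('scalar', 9, None), which a caller could map back to parameter names from the Syntax line of the routine that was called.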
Application Notes
The matrix Q is represented as a product of elementary reflectors
Q = H(ja+k-1)*...*H(ja+1)*H(ja), where k = min(m,n).
Each H(i) has the form
H(i) = I - tau*v*v'
where tau is a real/complex scalar, and v is a real/complex vector with v(m-k+i+1:m) = 0 and v(m-k+i) = 1; v(1:m-k+i-1) is stored on exit in A(ia:ia+m-k+i-2, ja+n-k+i-1), and tau in tau(ja+n-k+i-1).

p?orgql
Generates the orthogonal matrix Q of the QL factorization formed by p?geqlf.

Syntax
call psorgql(m, n, k, a, ia, ja, desca, tau, work, lwork, info)
call pdorgql(m, n, k, a, ia, ja, desca, tau, work, lwork, info)

Include Files
• C: mkl_scalapack.h

Description
The p?orgql routine generates the whole or part of the m-by-n real distributed matrix Q denoting A(ia:ia+m-1, ja:ja+n-1) with orthonormal columns, which is defined as the last n columns of a product of k elementary reflectors of order m
Q = H(k)*...*H(2)*H(1)
as returned by p?geqlf.

Input Parameters
m (global) INTEGER. The number of rows in the submatrix sub(Q) (m ≥ 0).
n (global) INTEGER. The number of columns in the submatrix sub(Q) (m ≥ n ≥ 0).
k (global) INTEGER. The number of elementary reflectors whose product defines the matrix Q (n ≥ k ≥ 0).
a (local) REAL for psorgql, DOUBLE PRECISION for pdorgql. Pointer into the local memory to an array of local dimension (lld_a, LOCc(ja+n-1)). On entry, the j-th column must contain the vector which defines the elementary reflector H(j), ja+n-k ≤ j ≤ ja+n-1, as returned by p?geqlf in the k columns of its distributed matrix argument A(ia:*, ja+n-k:ja+n-1).
ia, ja (global) INTEGER. The row and column indices in the global array a indicating the first row and the first column of the submatrix A(ia:ia+m-1, ja:ja+n-1), respectively.
desca (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix A.
tau (local) REAL for psorgql, DOUBLE PRECISION for pdorgql. Array, DIMENSION LOCc(ja+n-1).
Contains the scalar factors tau(j) of elementary reflectors H(j). tau is tied to the distributed matrix A.
work (local)
REAL for psorgql
DOUBLE PRECISION for pdorgql
Workspace array of dimension lwork.
lwork (local or global) INTEGER, dimension of work, must be at least lwork = nb_a*(nqa0 + mpa0 + nb_a), where
iroffa = mod(ia-1, mb_a),
icoffa = mod(ja-1, nb_a),
iarow = indxg2p(ia, mb_a, MYROW, rsrc_a, NPROW),
iacol = indxg2p(ja, nb_a, MYCOL, csrc_a, NPCOL),
mpa0 = numroc(m+iroffa, mb_a, MYROW, iarow, NPROW),
nqa0 = numroc(n+icoffa, nb_a, MYCOL, iacol, NPCOL).
indxg2p and numroc are ScaLAPACK tool functions; MYROW, MYCOL, NPROW and NPCOL can be determined by calling the subroutine blacs_gridinfo.
If lwork = -1, then lwork is global input and a workspace query is assumed; the routine only calculates the minimum and optimal size for all work arrays. Each of these values is returned in the first entry of the corresponding work array, and no error message is issued by pxerbla.
Output Parameters
a Contains the local pieces of the m-by-n distributed matrix Q.
work(1) On exit, work(1) contains the minimum value of lwork required for optimum performance.
info (global) INTEGER.
= 0: the execution is successful.
< 0: if the i-th argument is an array and the j-th entry had an illegal value, then info = -(i*100+j); if the i-th argument is a scalar and had an illegal value, then info = -i.

p?ungql
Generates the unitary matrix Q of the QL factorization formed by p?geqlf.
Syntax
call pcungql(m, n, k, a, ia, ja, desca, tau, work, lwork, info)
call pzungql(m, n, k, a, ia, ja, desca, tau, work, lwork, info)
Include Files
• C: mkl_scalapack.h
Description
This routine generates the whole or part of the m-by-n complex distributed matrix Q denoting A(ia:ia+m-1, ja:ja+n-1) with orthonormal columns, which is defined as the last n columns of a product of k elementary reflectors of order m
Q = (H(k))^H*...*(H(2))^H*(H(1))^H
as returned by p?geqlf.
Input Parameters
m (global) INTEGER. The number of rows in the submatrix sub(Q) (m ≥ 0).
n (global) INTEGER. The number of columns in the submatrix sub(Q) (m ≥ n ≥ 0).
k (global) INTEGER. The number of elementary reflectors whose product defines the matrix Q (n ≥ k ≥ 0).
a (local)
COMPLEX for pcungql
DOUBLE COMPLEX for pzungql
Pointer into the local memory to an array of local dimension (lld_a, LOCc(ja+n-1)). On entry, the j-th column must contain the vector which defines the elementary reflector H(j), ja+n-k ≤ j ≤ ja+n-1, as returned by p?geqlf in the k columns of its distributed matrix argument A(ia:*, ja+n-k:ja+n-1).
ia, ja (global) INTEGER. The row and column indices in the global array a indicating the first row and the first column of the submatrix A(ia:ia+m-1, ja:ja+n-1), respectively.
desca (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix A.
tau (local)
COMPLEX for pcungql
DOUBLE COMPLEX for pzungql
Array, DIMENSION LOCc(ja+n-1). Contains the scalar factors tau(j) of elementary reflectors H(j). tau is tied to the distributed matrix A.
work (local)
COMPLEX for pcungql
DOUBLE COMPLEX for pzungql
Workspace array of dimension lwork.
lwork (local or global) INTEGER, dimension of work, must be at least lwork = nb_a*(nqa0 + mpa0 + nb_a), where
iroffa = mod(ia-1, mb_a),
icoffa = mod(ja-1, nb_a),
iarow = indxg2p(ia, mb_a, MYROW, rsrc_a, NPROW),
iacol = indxg2p(ja, nb_a, MYCOL, csrc_a, NPCOL),
mpa0 = numroc(m+iroffa, mb_a, MYROW, iarow, NPROW),
nqa0 = numroc(n+icoffa, nb_a, MYCOL, iacol, NPCOL).
indxg2p and numroc are ScaLAPACK tool functions; MYROW, MYCOL, NPROW and NPCOL can be determined by calling the subroutine blacs_gridinfo.
If lwork = -1, then lwork is global input and a workspace query is assumed; the routine only calculates the minimum and optimal size for all work arrays. Each of these values is returned in the first entry of the corresponding work array, and no error message is issued by pxerbla.
Output Parameters
a Contains the local pieces of the m-by-n distributed matrix Q.
work(1) On exit, work(1) contains the minimum value of lwork required for optimum performance.
info (global) INTEGER.
= 0: the execution is successful.
< 0: if the i-th argument is an array and the j-th entry had an illegal value, then info = -(i*100+j); if the i-th argument is a scalar and had an illegal value, then info = -i.

p?ormql
Multiplies a general matrix by the orthogonal matrix Q of the QL factorization formed by p?geqlf.
Syntax
call psormql(side, trans, m, n, k, a, ia, ja, desca, tau, c, ic, jc, descc, work, lwork, info)
call pdormql(side, trans, m, n, k, a, ia, ja, desca, tau, c, ic, jc, descc, work, lwork, info)
Include Files
• C: mkl_scalapack.h
Description
The p?ormql routine overwrites the general real m-by-n distributed matrix sub(C) = C(ic:ic+m-1, jc:jc+n-1) with
                side = 'L'      side = 'R'
trans = 'N':    Q*sub(C)        sub(C)*Q
trans = 'T':    Q^T*sub(C)      sub(C)*Q^T
where Q is a real orthogonal distributed matrix defined as the product of k elementary reflectors
Q = H(k)' ... H(2)' H(1)'
as returned by p?geqlf.
Q is of order m if side = 'L' and of order n if side = 'R'.
Input Parameters
side (global) CHARACTER
= 'L': Q or Q^T is applied from the left.
= 'R': Q or Q^T is applied from the right.
trans (global) CHARACTER
= 'N': no transpose, Q is applied.
= 'T': transpose, Q^T is applied.
m (global) INTEGER. The number of rows in the distributed matrix sub(C) (m ≥ 0).
n (global) INTEGER. The number of columns in the distributed matrix sub(C) (n ≥ 0).
k (global) INTEGER. The number of elementary reflectors whose product defines the matrix Q. Constraints:
If side = 'L', m ≥ k ≥ 0
If side = 'R', n ≥ k ≥ 0.
a (local)
REAL for psormql
DOUBLE PRECISION for pdormql.
Pointer into the local memory to an array of dimension (lld_a, LOCc(ja+k-1)). The j-th column must contain the vector which defines the elementary reflector H(j), ja ≤ j ≤ ja+k-1, as returned by p?geqlf in the k columns of its distributed matrix argument A(ia:*, ja:ja+k-1). A(ia:*, ja:ja+k-1) is modified by the routine but restored on exit.
If side = 'L', lld_a ≥ max(1, LOCr(ia+m-1)),
If side = 'R', lld_a ≥ max(1, LOCr(ia+n-1)).
ia, ja (global) INTEGER. The row and column indices in the global array a indicating the first row and the first column of the submatrix A, respectively.
desca (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix A.
tau (local)
REAL for psormql
DOUBLE PRECISION for pdormql.
Array, DIMENSION LOCc(ja+n-1). Contains the scalar factor tau(j) of elementary reflectors H(j) as returned by p?geqlf. tau is tied to the distributed matrix A.
c (local)
REAL for psormql
DOUBLE PRECISION for pdormql.
Pointer into the local memory to an array of local dimension (lld_c, LOCc(jc+n-1)). Contains the local pieces of the distributed matrix sub(C).
ic, jc (global) INTEGER. The row and column indices in the global array c indicating the first row and the first column of the submatrix C, respectively.
descc (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix C.
work (local)
REAL for psormql
DOUBLE PRECISION for pdormql.
Workspace array of dimension lwork.
lwork (local or global) INTEGER, dimension of work, must be at least:
If side = 'L',
lwork = max((nb_a*(nb_a-1))/2, (nqc0 + mpc0)*nb_a) + nb_a*nb_a
else if side = 'R',
lwork = max((nb_a*(nb_a-1))/2, (nqc0 + max(npa0 + numroc(numroc(n+icoffc, nb_a, 0, 0, NPCOL), nb_a, 0, 0, lcmq), mpc0))*nb_a) + nb_a*nb_a
end if
where
lcmq = lcm/NPCOL with lcm = ilcm(NPROW, NPCOL),
iroffa = mod(ia-1, mb_a),
icoffa = mod(ja-1, nb_a),
iarow = indxg2p(ia, mb_a, MYROW, rsrc_a, NPROW),
npa0 = numroc(n+iroffa, mb_a, MYROW, iarow, NPROW),
iroffc = mod(ic-1, mb_c),
icoffc = mod(jc-1, nb_c),
icrow = indxg2p(ic, mb_c, MYROW, rsrc_c, NPROW),
iccol = indxg2p(jc, nb_c, MYCOL, csrc_c, NPCOL),
mpc0 = numroc(m+iroffc, mb_c, MYROW, icrow, NPROW),
nqc0 = numroc(n+icoffc, nb_c, MYCOL, iccol, NPCOL).
ilcm, indxg2p and numroc are ScaLAPACK tool functions; MYROW, MYCOL, NPROW and NPCOL can be determined by calling the subroutine blacs_gridinfo.
If lwork = -1, then lwork is global input and a workspace query is assumed; the routine only calculates the minimum and optimal size for all work arrays. Each of these values is returned in the first entry of the corresponding work array, and no error message is issued by pxerbla.
Output Parameters
c Overwritten by the product Q*sub(C), or Q'*sub(C), or sub(C)*Q', or sub(C)*Q.
work(1) On exit, work(1) contains the minimum value of lwork required for optimum performance.
info (global) INTEGER.
= 0: the execution is successful.
< 0: if the i-th argument is an array and the j-th entry had an illegal value, then info = -(i*100+j); if the i-th argument is a scalar and had an illegal value, then info = -i.

p?unmql
Multiplies a general matrix by the unitary matrix Q of the QL factorization formed by p?geqlf.
Syntax
call pcunmql(side, trans, m, n, k, a, ia, ja, desca, tau, c, ic, jc, descc, work, lwork, info)
call pzunmql(side, trans, m, n, k, a, ia, ja, desca, tau, c, ic, jc, descc, work, lwork, info)
Include Files
• C: mkl_scalapack.h
Description
This routine overwrites the general complex m-by-n distributed matrix sub(C) = C(ic:ic+m-1, jc:jc+n-1) with
                side = 'L'      side = 'R'
trans = 'N':    Q*sub(C)        sub(C)*Q
trans = 'C':    Q^H*sub(C)      sub(C)*Q^H
where Q is a complex unitary distributed matrix defined as the product of k elementary reflectors
Q = H(k)' ... H(2)' H(1)'
as returned by p?geqlf. Q is of order m if side = 'L' and of order n if side = 'R'.
Input Parameters
side (global) CHARACTER
= 'L': Q or Q^H is applied from the left.
= 'R': Q or Q^H is applied from the right.
trans (global) CHARACTER
= 'N': no transpose, Q is applied.
= 'C': conjugate transpose, Q^H is applied.
m (global) INTEGER. The number of rows in the distributed matrix sub(C) (m ≥ 0).
n (global) INTEGER. The number of columns in the distributed matrix sub(C) (n ≥ 0).
k (global) INTEGER. The number of elementary reflectors whose product defines the matrix Q. Constraints:
If side = 'L', m ≥ k ≥ 0
If side = 'R', n ≥ k ≥ 0.
a (local)
COMPLEX for pcunmql
DOUBLE COMPLEX for pzunmql.
Pointer into the local memory to an array of dimension (lld_a, LOCc(ja+k-1)). The j-th column must contain the vector which defines the elementary reflector H(j), ja ≤ j ≤ ja+k-1, as returned by p?geqlf in the k columns of its distributed matrix argument A(ia:*, ja:ja+k-1). A(ia:*, ja:ja+k-1) is modified by the routine but restored on exit.
If side = 'L', lld_a ≥ max(1, LOCr(ia+m-1)),
If side = 'R', lld_a ≥ max(1, LOCr(ia+n-1)).
ia, ja (global) INTEGER. The row and column indices in the global array a indicating the first row and the first column of the submatrix A, respectively.
desca (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix A.
tau (local)
COMPLEX for pcunmql
DOUBLE COMPLEX for pzunmql
Array, DIMENSION LOCc(ja+n-1). Contains the scalar factor tau(j) of elementary reflectors H(j) as returned by p?geqlf. tau is tied to the distributed matrix A.
c (local)
COMPLEX for pcunmql
DOUBLE COMPLEX for pzunmql.
Pointer into the local memory to an array of local dimension (lld_c, LOCc(jc+n-1)). Contains the local pieces of the distributed matrix sub(C).
ic, jc (global) INTEGER. The row and column indices in the global array c indicating the first row and the first column of the submatrix C, respectively.
descc (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix C.
work (local)
COMPLEX for pcunmql
DOUBLE COMPLEX for pzunmql.
Workspace array of dimension lwork.
lwork (local or global) INTEGER, dimension of work, must be at least:
If side = 'L',
lwork = max((nb_a*(nb_a-1))/2, (nqc0 + mpc0)*nb_a) + nb_a*nb_a
else if side = 'R',
lwork = max((nb_a*(nb_a-1))/2, (nqc0 + max(npa0 + numroc(numroc(n+icoffc, nb_a, 0, 0, NPCOL), nb_a, 0, 0, lcmq), mpc0))*nb_a) + nb_a*nb_a
end if
where
lcmq = lcm/NPCOL with lcm = ilcm(NPROW, NPCOL),
iroffa = mod(ia-1, mb_a),
icoffa = mod(ja-1, nb_a),
iarow = indxg2p(ia, mb_a, MYROW, rsrc_a, NPROW),
npa0 = numroc(n+iroffa, mb_a, MYROW, iarow, NPROW),
iroffc = mod(ic-1, mb_c),
icoffc = mod(jc-1, nb_c),
icrow = indxg2p(ic, mb_c, MYROW, rsrc_c, NPROW),
iccol = indxg2p(jc, nb_c, MYCOL, csrc_c, NPCOL),
mpc0 = numroc(m+iroffc, mb_c, MYROW, icrow, NPROW),
nqc0 = numroc(n+icoffc, nb_c, MYCOL, iccol, NPCOL).
ilcm, indxg2p and numroc are ScaLAPACK tool functions; MYROW, MYCOL, NPROW and NPCOL can be determined by calling the subroutine blacs_gridinfo.
If lwork = -1, then lwork is global input and a workspace query is assumed; the routine only calculates the minimum and optimal size for all work arrays.
Each of these values is returned in the first entry of the corresponding work array, and no error message is issued by pxerbla.
Output Parameters
c Overwritten by the product Q*sub(C), or Q'*sub(C), or sub(C)*Q', or sub(C)*Q.
work(1) On exit, work(1) contains the minimum value of lwork required for optimum performance.
info (global) INTEGER.
= 0: the execution is successful.
< 0: if the i-th argument is an array and the j-th entry had an illegal value, then info = -(i*100+j); if the i-th argument is a scalar and had an illegal value, then info = -i.

p?gerqf
Computes the RQ factorization of a general rectangular matrix.
Syntax
call psgerqf(m, n, a, ia, ja, desca, tau, work, lwork, info)
call pdgerqf(m, n, a, ia, ja, desca, tau, work, lwork, info)
call pcgerqf(m, n, a, ia, ja, desca, tau, work, lwork, info)
call pzgerqf(m, n, a, ia, ja, desca, tau, work, lwork, info)
Include Files
• C: mkl_scalapack.h
Description
The p?gerqf routine forms the RQ factorization of a general m-by-n distributed matrix sub(A) = A(ia:ia+m-1, ja:ja+n-1) as
A = R*Q
Input Parameters
m (global) INTEGER. The number of rows in the distributed submatrix sub(A) (m ≥ 0).
n (global) INTEGER. The number of columns in the distributed submatrix sub(A) (n ≥ 0).
a (local)
REAL for psgerqf
DOUBLE PRECISION for pdgerqf
COMPLEX for pcgerqf
DOUBLE COMPLEX for pzgerqf.
Pointer into the local memory to an array of local dimension (lld_a, LOCc(ja+n-1)). Contains the local pieces of the distributed matrix sub(A) to be factored.
ia, ja (global) INTEGER. The row and column indices in the global array a indicating the first row and the first column of the submatrix A(ia:ia+m-1, ja:ja+n-1), respectively.
desca (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix A.
work (local)
REAL for psgerqf
DOUBLE PRECISION for pdgerqf
COMPLEX for pcgerqf
DOUBLE COMPLEX for pzgerqf.
Workspace array of dimension lwork.
lwork (local or global) INTEGER, dimension of work, must be at least lwork = mb_a*(mp0 + nq0 + mb_a), where
iroff = mod(ia-1, mb_a),
icoff = mod(ja-1, nb_a),
iarow = indxg2p(ia, mb_a, MYROW, rsrc_a, NPROW),
iacol = indxg2p(ja, nb_a, MYCOL, csrc_a, NPCOL),
mp0 = numroc(m+iroff, mb_a, MYROW, iarow, NPROW),
nq0 = numroc(n+icoff, nb_a, MYCOL, iacol, NPCOL).
numroc and indxg2p are ScaLAPACK tool functions; MYROW, MYCOL, NPROW and NPCOL can be determined by calling the subroutine blacs_gridinfo.
If lwork = -1, then lwork is global input and a workspace query is assumed; the routine only calculates the minimum and optimal size for all work arrays. Each of these values is returned in the first entry of the corresponding work array, and no error message is issued by pxerbla.
Output Parameters
a On exit, if m ≤ n, the upper triangle of A(ia:ia+m-1, ja+n-m:ja+n-1) contains the m-by-m upper triangular matrix R; if m ≥ n, the elements on and above the (m-n)-th subdiagonal contain the m-by-n upper trapezoidal matrix R; the remaining elements, with the array tau, represent the orthogonal/unitary matrix Q as a product of elementary reflectors (see Application Notes below).
tau (local)
REAL for psgerqf
DOUBLE PRECISION for pdgerqf
COMPLEX for pcgerqf
DOUBLE COMPLEX for pzgerqf.
Array, DIMENSION LOCr(ia+m-1). Contains the scalar factors of elementary reflectors. tau is tied to the distributed matrix A.
work(1) On exit, work(1) contains the minimum value of lwork required for optimum performance.
info (global) INTEGER.
= 0: the execution is successful.
< 0: if the i-th argument is an array and the j-th entry had an illegal value, then info = -(i*100+j); if the i-th argument is a scalar and had an illegal value, then info = -i.
Application Notes
The matrix Q is represented as a product of elementary reflectors
Q = H(ia)*H(ia+1)*...*H(ia+k-1), where k = min(m,n).
Each H(i) has the form
H(i) = I - tau*v*v'
where tau is a real/complex scalar, and v is a real/complex vector with v(n-k+i+1:n) = 0 and v(n-k+i) = 1; v(1:n-k+i-1) is stored on exit in A(ia+m-k+i-1, ja:ja+n-k+i-2), and tau in tau(ia+m-k+i-1).

p?orgrq
Generates the orthogonal matrix Q of the RQ factorization formed by p?gerqf.
Syntax
call psorgrq(m, n, k, a, ia, ja, desca, tau, work, lwork, info)
call pdorgrq(m, n, k, a, ia, ja, desca, tau, work, lwork, info)
Include Files
• C: mkl_scalapack.h
Description
The p?orgrq routine generates the whole or part of the m-by-n real distributed matrix Q denoting A(ia:ia+m-1, ja:ja+n-1) with orthonormal rows, which is defined as the last m rows of a product of k elementary reflectors of order n
Q = H(1)*H(2)*...*H(k)
as returned by p?gerqf.
Input Parameters
m (global) INTEGER. The number of rows in the submatrix sub(Q) (m ≥ 0).
n (global) INTEGER. The number of columns in the submatrix sub(Q) (n ≥ m ≥ 0).
k (global) INTEGER. The number of elementary reflectors whose product defines the matrix Q (m ≥ k ≥ 0).
a (local)
REAL for psorgrq
DOUBLE PRECISION for pdorgrq
Pointer into the local memory to an array of local dimension (lld_a, LOCc(ja+n-1)). The i-th row must contain the vector which defines the elementary reflector H(i), ia+m-k ≤ i ≤ ia+m-1, as returned by p?gerqf in the k rows of its distributed matrix argument A(ia+m-k:ia+m-1, ja:*).
ia, ja (global) INTEGER. The row and column indices in the global array a indicating the first row and the first column of the submatrix A(ia:ia+m-1, ja:ja+n-1), respectively.
desca (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix A.
tau (local)
REAL for psorgrq
DOUBLE PRECISION for pdorgrq
Array, DIMENSION LOCr(ia+m-1). Contains the scalar factor tau(i) of elementary reflectors H(i) as returned by p?gerqf. tau is tied to the distributed matrix A.
work (local)
REAL for psorgrq
DOUBLE PRECISION for pdorgrq
Workspace array of dimension lwork.
lwork (local or global) INTEGER, dimension of work, must be at least lwork = mb_a*(mpa0 + nqa0 + mb_a), where
iroffa = mod(ia-1, mb_a),
icoffa = mod(ja-1, nb_a),
iarow = indxg2p(ia, mb_a, MYROW, rsrc_a, NPROW),
iacol = indxg2p(ja, nb_a, MYCOL, csrc_a, NPCOL),
mpa0 = numroc(m+iroffa, mb_a, MYROW, iarow, NPROW),
nqa0 = numroc(n+icoffa, nb_a, MYCOL, iacol, NPCOL).
indxg2p and numroc are ScaLAPACK tool functions; MYROW, MYCOL, NPROW and NPCOL can be determined by calling the subroutine blacs_gridinfo.
If lwork = -1, then lwork is global input and a workspace query is assumed; the routine only calculates the minimum and optimal size for all work arrays. Each of these values is returned in the first entry of the corresponding work array, and no error message is issued by pxerbla.
Output Parameters
a Contains the local pieces of the m-by-n distributed matrix Q.
work(1) On exit, work(1) contains the minimum value of lwork required for optimum performance.
info (global) INTEGER.
= 0: the execution is successful.
< 0: if the i-th argument is an array and the j-th entry had an illegal value, then info = -(i*100+j); if the i-th argument is a scalar and had an illegal value, then info = -i.

p?ungrq
Generates the unitary matrix Q of the RQ factorization formed by p?gerqf.
Syntax
call pcungrq(m, n, k, a, ia, ja, desca, tau, work, lwork, info)
call pzungrq(m, n, k, a, ia, ja, desca, tau, work, lwork, info)
Include Files
• C: mkl_scalapack.h
Description
This routine generates the m-by-n complex distributed matrix Q denoting A(ia:ia+m-1, ja:ja+n-1) with orthonormal rows, which is defined as the last m rows of a product of k elementary reflectors of order n
Q = (H(1))^H*(H(2))^H*...*(H(k))^H
as returned by p?gerqf.
Input Parameters
m (global) INTEGER. The number of rows in the submatrix sub(Q) (m ≥ 0).
n (global) INTEGER.
The number of columns in the submatrix sub(Q) (n ≥ m ≥ 0).
k (global) INTEGER. The number of elementary reflectors whose product defines the matrix Q (m ≥ k ≥ 0).
a (local)
COMPLEX for pcungrq
DOUBLE COMPLEX for pzungrq
Pointer into the local memory to an array of dimension (lld_a, LOCc(ja+n-1)). The i-th row must contain the vector which defines the elementary reflector H(i), ia+m-k ≤ i ≤ ia+m-1, as returned by p?gerqf in the k rows of its distributed matrix argument A(ia+m-k:ia+m-1, ja:*).
ia, ja (global) INTEGER. The row and column indices in the global array a indicating the first row and the first column of the submatrix A, respectively.
desca (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix A.
tau (local)
COMPLEX for pcungrq
DOUBLE COMPLEX for pzungrq
Array, DIMENSION LOCr(ia+m-1). Contains the scalar factor tau(i) of elementary reflectors H(i) as returned by p?gerqf. tau is tied to the distributed matrix A.
work (local)
COMPLEX for pcungrq
DOUBLE COMPLEX for pzungrq
Workspace array of dimension lwork.
lwork (local or global) INTEGER, dimension of work, must be at least lwork = mb_a*(mpa0 + nqa0 + mb_a), where
iroffa = mod(ia-1, mb_a),
icoffa = mod(ja-1, nb_a),
iarow = indxg2p(ia, mb_a, MYROW, rsrc_a, NPROW),
iacol = indxg2p(ja, nb_a, MYCOL, csrc_a, NPCOL),
mpa0 = numroc(m+iroffa, mb_a, MYROW, iarow, NPROW),
nqa0 = numroc(n+icoffa, nb_a, MYCOL, iacol, NPCOL).
indxg2p and numroc are ScaLAPACK tool functions; MYROW, MYCOL, NPROW and NPCOL can be determined by calling the subroutine blacs_gridinfo.
If lwork = -1, then lwork is global input and a workspace query is assumed; the routine only calculates the minimum and optimal size for all work arrays. Each of these values is returned in the first entry of the corresponding work array, and no error message is issued by pxerbla.
Output Parameters
a Contains the local pieces of the m-by-n distributed matrix Q.
work(1) On exit, work(1) contains the minimum value of lwork required for optimum performance.
info (global) INTEGER.
= 0: the execution is successful.
< 0: if the i-th argument is an array and the j-th entry had an illegal value, then info = -(i*100+j); if the i-th argument is a scalar and had an illegal value, then info = -i.

p?ormrq
Multiplies a general matrix by the orthogonal matrix Q of the RQ factorization formed by p?gerqf.
Syntax
call psormrq(side, trans, m, n, k, a, ia, ja, desca, tau, c, ic, jc, descc, work, lwork, info)
call pdormrq(side, trans, m, n, k, a, ia, ja, desca, tau, c, ic, jc, descc, work, lwork, info)
Include Files
• C: mkl_scalapack.h
Description
The p?ormrq routine overwrites the general real m-by-n distributed matrix sub(C) = C(ic:ic+m-1, jc:jc+n-1) with
                side = 'L'      side = 'R'
trans = 'N':    Q*sub(C)        sub(C)*Q
trans = 'T':    Q^T*sub(C)      sub(C)*Q^T
where Q is a real orthogonal distributed matrix defined as the product of k elementary reflectors
Q = H(1) H(2) ... H(k)
as returned by p?gerqf. Q is of order m if side = 'L' and of order n if side = 'R'.
Input Parameters
side (global) CHARACTER
= 'L': Q or Q^T is applied from the left.
= 'R': Q or Q^T is applied from the right.
trans (global) CHARACTER
= 'N': no transpose, Q is applied.
= 'T': transpose, Q^T is applied.
m (global) INTEGER. The number of rows in the distributed matrix sub(C) (m ≥ 0).
n (global) INTEGER. The number of columns in the distributed matrix sub(C) (n ≥ 0).
k (global) INTEGER. The number of elementary reflectors whose product defines the matrix Q. Constraints:
If side = 'L', m ≥ k ≥ 0
If side = 'R', n ≥ k ≥ 0.
a (local)
REAL for psormrq
DOUBLE PRECISION for pdormrq.
Pointer into the local memory to an array of dimension (lld_a, LOCc(ja+m-1)) if side = 'L', and (lld_a, LOCc(ja+n-1)) if side = 'R'.
The i-th row must contain the vector which defines the elementary reflector H(i), ia ≤ i ≤ ia+k-1, as returned by p?gerqf in the k rows of its distributed matrix argument A(ia:ia+k-1, ja:*). A(ia:ia+k-1, ja:*) is modified by the routine but restored on exit.
ia, ja (global) INTEGER. The row and column indices in the global array a indicating the first row and the first column of the submatrix A, respectively.
desca (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix A.
tau (local)
REAL for psormrq
DOUBLE PRECISION for pdormrq
Array, DIMENSION LOCc(ja+k-1). Contains the scalar factor tau(i) of elementary reflectors H(i) as returned by p?gerqf. tau is tied to the distributed matrix A.
c (local)
REAL for psormrq
DOUBLE PRECISION for pdormrq
Pointer into the local memory to an array of local dimension (lld_c, LOCc(jc+n-1)). Contains the local pieces of the distributed matrix sub(C).
ic, jc (global) INTEGER. The row and column indices in the global array c indicating the first row and the first column of the submatrix C, respectively.
descc (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix C.
work (local)
REAL for psormrq
DOUBLE PRECISION for pdormrq.
Workspace array of dimension lwork.
lwork (local or global) INTEGER, dimension of work, must be at least:
If side = 'L',
lwork = max((mb_a*(mb_a-1))/2, (mpc0 + max(mqa0 + numroc(numroc(n+iroffc, mb_a, 0, 0, NPROW), mb_a, 0, 0, lcmp), nqc0))*mb_a) + mb_a*mb_a
else if side = 'R',
lwork = max((mb_a*(mb_a-1))/2, (mpc0 + nqc0)*mb_a) + mb_a*mb_a
end if
where
lcmp = lcm/NPROW with lcm = ilcm(NPROW, NPCOL),
iroffa = mod(ia-1, mb_a),
icoffa = mod(ja-1, nb_a),
iacol = indxg2p(ja, nb_a, MYCOL, csrc_a, NPCOL),
mqa0 = numroc(m+icoffa, nb_a, MYCOL, iacol, NPCOL),
iroffc = mod(ic-1, mb_c),
icoffc = mod(jc-1, nb_c),
icrow = indxg2p(ic, mb_c, MYROW, rsrc_c, NPROW),
iccol = indxg2p(jc, nb_c, MYCOL, csrc_c, NPCOL),
mpc0 = numroc(m+iroffc, mb_c, MYROW, icrow, NPROW),
nqc0 = numroc(n+icoffc, nb_c, MYCOL, iccol, NPCOL).
ilcm, indxg2p and numroc are ScaLAPACK tool functions; MYROW, MYCOL, NPROW and NPCOL can be determined by calling the subroutine blacs_gridinfo.
If lwork = -1, then lwork is global input and a workspace query is assumed; the routine only calculates the minimum and optimal size for all work arrays. Each of these values is returned in the first entry of the corresponding work array, and no error message is issued by pxerbla.
Output Parameters
c Overwritten by the product Q*sub(C), or Q'*sub(C), or sub(C)*Q', or sub(C)*Q.
work(1) On exit, work(1) contains the minimum value of lwork required for optimum performance.
info (global) INTEGER.
= 0: the execution is successful.
< 0: if the i-th argument is an array and the j-th entry had an illegal value, then info = -(i*100+j); if the i-th argument is a scalar and had an illegal value, then info = -i.

p?unmrq
Multiplies a general matrix by the unitary matrix Q of the RQ factorization formed by p?gerqf.
Syntax
call pcunmrq(side, trans, m, n, k, a, ia, ja, desca, tau, c, ic, jc, descc, work, lwork, info)
call pzunmrq(side, trans, m, n, k, a, ia, ja, desca, tau, c, ic, jc, descc, work, lwork, info)
Include Files
• C: mkl_scalapack.h
Description
This routine overwrites the general complex m-by-n distributed matrix sub(C) = C(ic:ic+m-1, jc:jc+n-1) with
                side = 'L'      side = 'R'
trans = 'N':    Q*sub(C)        sub(C)*Q
trans = 'C':    Q^H*sub(C)      sub(C)*Q^H
where Q is a complex unitary distributed matrix defined as the product of k elementary reflectors
Q = H(1)' H(2)' ... H(k)'
as returned by p?gerqf. Q is of order m if side = 'L' and of order n if side = 'R'.
Input Parameters
side (global) CHARACTER
= 'L': Q or Q^H is applied from the left.
= 'R': Q or Q^H is applied from the right.
trans (global) CHARACTER
= 'N': no transpose, Q is applied.
= 'C': conjugate transpose, Q^H is applied.
m (global) INTEGER. The number of rows in the distributed matrix sub(C) (m ≥ 0).
n (global) INTEGER. The number of columns in the distributed matrix sub(C) (n ≥ 0).
k (global) INTEGER. The number of elementary reflectors whose product defines the matrix Q. Constraints:
If side = 'L', m ≥ k ≥ 0
If side = 'R', n ≥ k ≥ 0.
a (local)
COMPLEX for pcunmrq
DOUBLE COMPLEX for pzunmrq.
Pointer into the local memory to an array of dimension (lld_a, LOCc(ja+m-1)) if side = 'L', and (lld_a, LOCc(ja+n-1)) if side = 'R'. The i-th row must contain the vector which defines the elementary reflector H(i), ia ≤ i ≤ ia+k-1, as returned by p?gerqf in the k rows of its distributed matrix argument A(ia:ia+k-1, ja:*). A(ia:ia+k-1, ja:*) is modified by the routine but restored on exit.
ia, ja (global) INTEGER. The row and column indices in the global array a indicating the first row and the first column of the submatrix A, respectively.
desca (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix A.
tau (local)
COMPLEX for pcunmrq
DOUBLE COMPLEX for pzunmrq
Array, DIMENSION LOCc(ja+k-1). Contains the scalar factor tau(i) of elementary reflectors H(i) as returned by p?gerqf. tau is tied to the distributed matrix A.
c (local)
COMPLEX for pcunmrq
DOUBLE COMPLEX for pzunmrq.
Pointer into the local memory to an array of local dimension (lld_c, LOCc(jc+n-1)). Contains the local pieces of the distributed matrix sub(C).
ic, jc (global) INTEGER. The row and column indices in the global array c indicating the first row and the first column of the submatrix C, respectively.
descc (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix C.
work (local)
COMPLEX for pcunmrq
DOUBLE COMPLEX for pzunmrq.
Workspace array of dimension lwork.
lwork (local or global) INTEGER, dimension of work, must be at least:
If side = 'L',
lwork = max((mb_a*(mb_a-1))/2, (mpc0 + max(mqa0 + numroc(numroc(n+iroffc, mb_a, 0, 0, NPROW), mb_a, 0, 0, lcmp), nqc0))*mb_a) + mb_a*mb_a
else if side = 'R',
lwork = max((mb_a*(mb_a-1))/2, (mpc0 + nqc0)*mb_a) + mb_a*mb_a
end if
where
lcmp = lcm/NPROW with lcm = ilcm(NPROW, NPCOL),
iroffa = mod(ia-1, mb_a),
icoffa = mod(ja-1, nb_a),
iacol = indxg2p(ja, nb_a, MYCOL, csrc_a, NPCOL),
mqa0 = numroc(m+icoffa, nb_a, MYCOL, iacol, NPCOL),
iroffc = mod(ic-1, mb_c),
icoffc = mod(jc-1, nb_c),
icrow = indxg2p(ic, mb_c, MYROW, rsrc_c, NPROW),
iccol = indxg2p(jc, nb_c, MYCOL, csrc_c, NPCOL),
mpc0 = numroc(m+iroffc, mb_c, MYROW, icrow, NPROW),
nqc0 = numroc(n+icoffc, nb_c, MYCOL, iccol, NPCOL).
ilcm, indxg2p and numroc are ScaLAPACK tool functions; MYROW, MYCOL, NPROW and NPCOL can be determined by calling the subroutine blacs_gridinfo.
If lwork = -1, then lwork is global input and a workspace query is assumed; the routine only calculates the minimum and optimal size for all work arrays.
Each of these values is returned in the first entry of the corresponding work array, and no error message is issued by pxerbla.
Output Parameters
c Overwritten by the product Q*sub(C), or Q'*sub(C), or sub(C)*Q', or sub(C)*Q.
work(1) On exit, work(1) contains the minimum value of lwork required for optimum performance.
info (global) INTEGER.
= 0: the execution is successful.
< 0: if the i-th argument is an array and the j-th entry had an illegal value, then info = -(i*100+j); if the i-th argument is a scalar and had an illegal value, then info = -i.

p?tzrzf
Reduces the upper trapezoidal matrix A to upper triangular form.
Syntax
call pstzrzf(m, n, a, ia, ja, desca, tau, work, lwork, info)
call pdtzrzf(m, n, a, ia, ja, desca, tau, work, lwork, info)
call pctzrzf(m, n, a, ia, ja, desca, tau, work, lwork, info)
call pztzrzf(m, n, a, ia, ja, desca, tau, work, lwork, info)
Include Files
• C: mkl_scalapack.h
Description
The p?tzrzf routine reduces the m-by-n (m ≤ n) real/complex upper trapezoidal matrix sub(A) = A(ia:ia+m-1, ja:ja+n-1) to upper triangular form by means of orthogonal/unitary transformations. The upper trapezoidal matrix A is factored as
A = (R 0)*Z,
where Z is an n-by-n orthogonal/unitary matrix and R is an m-by-m upper triangular matrix.
Input Parameters
m (global) INTEGER. The number of rows in the submatrix sub(A) (m ≥ 0).
n (global) INTEGER. The number of columns in the submatrix sub(A) (n ≥ 0).
a (local)
REAL for pstzrzf
DOUBLE PRECISION for pdtzrzf
COMPLEX for pctzrzf
DOUBLE COMPLEX for pztzrzf.
Pointer into the local memory to an array of dimension (lld_a, LOCc(ja+n-1)). Contains the local pieces of the m-by-n distributed matrix sub(A) to be factored.
ia, ja (global) INTEGER. The row and column indices in the global array a indicating the first row and the first column of the submatrix A, respectively.
desca (global and local) INTEGER array, dimension (dlen_).
The array descriptor for the distributed matrix A.
work (local) REAL for pstzrzf
DOUBLE PRECISION for pdtzrzf
COMPLEX for pctzrzf
DOUBLE COMPLEX for pztzrzf.
Workspace array of dimension lwork.
lwork (local or global) INTEGER, dimension of work, must be at least
lwork = mb_a*(mp0+nq0+mb_a),
where
iroff = mod(ia-1, mb_a),
icoff = mod(ja-1, nb_a),
iarow = indxg2p(ia, mb_a, MYROW, rsrc_a, NPROW),
iacol = indxg2p(ja, nb_a, MYCOL, csrc_a, NPCOL),
mp0 = numroc(m+iroff, mb_a, MYROW, iarow, NPROW),
nq0 = numroc(n+icoff, nb_a, MYCOL, iacol, NPCOL),
indxg2p and numroc are ScaLAPACK tool functions; MYROW, MYCOL, NPROW and NPCOL can be determined by calling the subroutine blacs_gridinfo.
If lwork = -1, then lwork is global input and a workspace query is assumed; the routine only calculates the minimum and optimal size for all work arrays. Each of these values is returned in the first entry of the corresponding work array, and no error message is issued by pxerbla.
Output Parameters
a On exit, the leading m-by-m upper triangular part of sub(A) contains the upper triangular matrix R, and elements m+1 to n of the first m rows of sub(A), with the array tau, represent the orthogonal/unitary matrix Z as a product of m elementary reflectors.
work(1) On exit, work(1) contains the minimum value of lwork required for optimum performance.
tau (local) REAL for pstzrzf
DOUBLE PRECISION for pdtzrzf
COMPLEX for pctzrzf
DOUBLE COMPLEX for pztzrzf.
Array, DIMENSION LOCr(ia+m-1). Contains the scalar factors of the elementary reflectors. tau is tied to the distributed matrix A.
info (global) INTEGER.
= 0: the execution is successful.
< 0: if the i-th argument is an array and the j-th entry had an illegal value, then info = -(i*100+j); if the i-th argument is a scalar and had an illegal value, then info = -i.
Application Notes
The factorization is obtained by Householder's method.
The k-th transformation matrix, Z(k), which (or whose conjugate transpose) is used to introduce zeros into the (m - k + 1)-th row of sub(A), is given in the block form (block rows separated by semicolons)
Z(k) = [ I 0 ; 0 T(k) ],
where T(k) = I - tau*u(k)*u(k)', u(k) = [ 1 ; 0 ; z(k) ], tau is a scalar and z(k) is an (n - m) element vector. tau and z(k) are chosen to annihilate the elements of the k-th row of sub(A). The scalar tau is returned in the k-th element of tau and the vector u(k) in the k-th row of sub(A), such that the elements of z(k) are in a(k, m + 1),..., a(k, n). The elements of R are returned in the upper triangular part of sub(A). Z is given by
Z = Z(1) * Z(2) *... * Z(m).
p?ormrz
Multiplies a general matrix by the orthogonal matrix from a reduction to upper triangular form formed by p?tzrzf.
Syntax
call psormrz(side, trans, m, n, k, l, a, ia, ja, desca, tau, c, ic, jc, descc, work, lwork, info)
call pdormrz(side, trans, m, n, k, l, a, ia, ja, desca, tau, c, ic, jc, descc, work, lwork, info)
Include Files
• C: mkl_scalapack.h
Description
This routine overwrites the general real m-by-n distributed matrix sub(C) = C(ic:ic+m-1, jc:jc+n-1) with
               side = 'L'      side = 'R'
trans = 'N':   Q*sub(C)        sub(C)*Q
trans = 'T':   Q^T*sub(C)      sub(C)*Q^T
where Q is a real orthogonal distributed matrix defined as the product of k elementary reflectors
Q = H(1) H(2)... H(k)
as returned by p?tzrzf. Q is of order m if side = 'L' and of order n if side = 'R'.
Input Parameters
side (global) CHARACTER
= 'L': Q or Q^T is applied from the left.
= 'R': Q or Q^T is applied from the right.
trans (global) CHARACTER
= 'N': no transpose, Q is applied.
= 'T': transpose, Q^T is applied.
m (global) INTEGER. The number of rows in the distributed matrix sub(C) (m ≥ 0).
n (global) INTEGER. The number of columns in the distributed matrix sub(C) (n ≥ 0).
k (global) INTEGER. The number of elementary reflectors whose product defines the matrix Q. Constraints:
If side = 'L', m ≥ k ≥ 0
If side = 'R', n ≥ k ≥ 0.
l (global) INTEGER. The columns of the distributed submatrix sub(A) containing the meaningful part of the Householder reflectors.
If side = 'L', m ≥ l ≥ 0
If side = 'R', n ≥ l ≥ 0.
a (local) REAL for psormrz
DOUBLE PRECISION for pdormrz.
Pointer into the local memory to an array of dimension (lld_a, LOCc(ja+m-1)) if side = 'L', and (lld_a, LOCc(ja+n-1)) if side = 'R', where lld_a ≥ max(1, LOCr(ia+k-1)).
The i-th row must contain the vector which defines the elementary reflector H(i), ia ≤ i ≤ ia+k-1, as returned by p?tzrzf in the k rows of its distributed matrix argument A(ia:ia+k-1, ja:*). A(ia:ia+k-1, ja:*) is modified by the routine but restored on exit.
ia, ja (global) INTEGER. The row and column indices in the global array a indicating the first row and the first column of the submatrix A, respectively.
desca (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix A.
tau (local) REAL for psormrz
DOUBLE PRECISION for pdormrz.
Array, DIMENSION LOCc(ia+k-1). Contains the scalar factors tau(i) of the elementary reflectors H(i) as returned by p?tzrzf. tau is tied to the distributed matrix A.
c (local) REAL for psormrz
DOUBLE PRECISION for pdormrz.
Pointer into the local memory to an array of local dimension (lld_c, LOCc(jc+n-1)). Contains the local pieces of the distributed matrix sub(C) to be factored.
ic, jc (global) INTEGER. The row and column indices in the global array c indicating the first row and the first column of the submatrix C, respectively.
descc (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix C.
work (local) REAL for psormrz
DOUBLE PRECISION for pdormrz.
Workspace array of dimension lwork.
lwork (local or global) INTEGER, dimension of work, must be at least:
If side = 'L',
lwork = max((mb_a*(mb_a-1))/2, (mpc0 + max(mqa0 + numroc(numroc(n+iroffc, mb_a, 0, 0, NPROW), mb_a, 0, 0, lcmp), nqc0))*mb_a) + mb_a*mb_a
else if side = 'R',
lwork = max((mb_a*(mb_a-1))/2, (mpc0 + nqc0)*mb_a) + mb_a*mb_a
end if
where
lcmp = lcm/NPROW with lcm = ilcm(NPROW, NPCOL),
iroffa = mod(ia-1, mb_a),
icoffa = mod(ja-1, nb_a),
iacol = indxg2p(ja, nb_a, MYCOL, csrc_a, NPCOL),
mqa0 = numroc(n+icoffa, nb_a, MYCOL, iacol, NPCOL),
iroffc = mod(ic-1, mb_c),
icoffc = mod(jc-1, nb_c),
icrow = indxg2p(ic, mb_c, MYROW, rsrc_c, NPROW),
iccol = indxg2p(jc, nb_c, MYCOL, csrc_c, NPCOL),
mpc0 = numroc(m+iroffc, mb_c, MYROW, icrow, NPROW),
nqc0 = numroc(n+icoffc, nb_c, MYCOL, iccol, NPCOL),
ilcm, indxg2p and numroc are ScaLAPACK tool functions; MYROW, MYCOL, NPROW and NPCOL can be determined by calling the subroutine blacs_gridinfo.
If lwork = -1, then lwork is global input and a workspace query is assumed; the routine only calculates the minimum and optimal size for all work arrays. Each of these values is returned in the first entry of the corresponding work array, and no error message is issued by pxerbla.
Output Parameters
c Overwritten by the product Q*sub(C), or Q'*sub(C), or sub(C)*Q', or sub(C)*Q.
work(1) On exit, work(1) contains the minimum value of lwork required for optimum performance.
info (global) INTEGER.
= 0: the execution is successful.
< 0: if the i-th argument is an array and the j-th entry had an illegal value, then info = -(i*100+j); if the i-th argument is a scalar and had an illegal value, then info = -i.
p?unmrz
Multiplies a general matrix by the unitary transformation matrix from a reduction to upper triangular form determined by p?tzrzf.
Syntax
call pcunmrz(side, trans, m, n, k, l, a, ia, ja, desca, tau, c, ic, jc, descc, work, lwork, info)
call pzunmrz(side, trans, m, n, k, l, a, ia, ja, desca, tau, c, ic, jc, descc, work, lwork, info)
Include Files
• C: mkl_scalapack.h
Description
This routine overwrites the general complex m-by-n distributed matrix sub(C) = C(ic:ic+m-1, jc:jc+n-1) with
               side = 'L'      side = 'R'
trans = 'N':   Q*sub(C)        sub(C)*Q
trans = 'C':   Q^H*sub(C)      sub(C)*Q^H
where Q is a complex unitary distributed matrix defined as the product of k elementary reflectors
Q = H(1)' H(2)'... H(k)'
as returned by pctzrzf/pztzrzf. Q is of order m if side = 'L' and of order n if side = 'R'.
Input Parameters
side (global) CHARACTER
= 'L': Q or Q^H is applied from the left.
= 'R': Q or Q^H is applied from the right.
trans (global) CHARACTER
= 'N': no transpose, Q is applied.
= 'C': conjugate transpose, Q^H is applied.
m (global) INTEGER. The number of rows in the distributed matrix sub(C) (m ≥ 0).
n (global) INTEGER. The number of columns in the distributed matrix sub(C) (n ≥ 0).
k (global) INTEGER. The number of elementary reflectors whose product defines the matrix Q. Constraints:
If side = 'L', m ≥ k ≥ 0
If side = 'R', n ≥ k ≥ 0.
l (global) INTEGER. The columns of the distributed submatrix sub(A) containing the meaningful part of the Householder reflectors.
If side = 'L', m ≥ l ≥ 0
If side = 'R', n ≥ l ≥ 0.
a (local) COMPLEX for pcunmrz
DOUBLE COMPLEX for pzunmrz.
Pointer into the local memory to an array of dimension (lld_a, LOCc(ja+m-1)) if side = 'L', and (lld_a, LOCc(ja+n-1)) if side = 'R', where lld_a ≥ max(1, LOCr(ia+k-1)). The i-th row must contain the vector which defines the elementary reflector H(i), ia ≤ i ≤ ia+k-1, as returned by p?tzrzf in the k rows of its distributed matrix argument A(ia:ia+k-1, ja:*). A(ia:ia+k-1, ja:*) is modified by the routine but restored on exit.
ia, ja (global) INTEGER.
The row and column indices in the global array a indicating the first row and the first column of the submatrix A, respectively.
desca (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix A.
tau (local) COMPLEX for pcunmrz
DOUBLE COMPLEX for pzunmrz.
Array, DIMENSION LOCc(ia+k-1). Contains the scalar factors tau(i) of the elementary reflectors H(i) as returned by p?tzrzf. tau is tied to the distributed matrix A.
c (local) COMPLEX for pcunmrz
DOUBLE COMPLEX for pzunmrz.
Pointer into the local memory to an array of local dimension (lld_c, LOCc(jc+n-1)). Contains the local pieces of the distributed matrix sub(C) to be factored.
ic, jc (global) INTEGER. The row and column indices in the global array c indicating the first row and the first column of the submatrix C, respectively.
descc (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix C.
work (local) COMPLEX for pcunmrz
DOUBLE COMPLEX for pzunmrz.
Workspace array of dimension lwork.
lwork (local or global) INTEGER, dimension of work, must be at least:
If side = 'L',
lwork = max((mb_a*(mb_a-1))/2, (mpc0 + max(mqa0 + numroc(numroc(n+iroffc, mb_a, 0, 0, NPROW), mb_a, 0, 0, lcmp), nqc0))*mb_a) + mb_a*mb_a
else if side = 'R',
lwork = max((mb_a*(mb_a-1))/2, (mpc0 + nqc0)*mb_a) + mb_a*mb_a
end if
where
lcmp = lcm/NPROW with lcm = ilcm(NPROW, NPCOL),
iroffa = mod(ia-1, mb_a),
icoffa = mod(ja-1, nb_a),
iacol = indxg2p(ja, nb_a, MYCOL, csrc_a, NPCOL),
mqa0 = numroc(m+icoffa, nb_a, MYCOL, iacol, NPCOL),
iroffc = mod(ic-1, mb_c),
icoffc = mod(jc-1, nb_c),
icrow = indxg2p(ic, mb_c, MYROW, rsrc_c, NPROW),
iccol = indxg2p(jc, nb_c, MYCOL, csrc_c, NPCOL),
mpc0 = numroc(m+iroffc, mb_c, MYROW, icrow, NPROW),
nqc0 = numroc(n+icoffc, nb_c, MYCOL, iccol, NPCOL),
ilcm, indxg2p and numroc are ScaLAPACK tool functions; MYROW, MYCOL, NPROW and NPCOL can be determined by calling the subroutine blacs_gridinfo.
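The lwork bound for p?unmrz can be evaluated ahead of the call. The following is a minimal Python sketch, not Intel MKL code: numroc, indxg2p and ilcm are plain re-implementations written here on the assumption that they behave like the ScaLAPACK tool functions of the same names, and unmrz_min_lwork with its keyword arguments is a hypothetical helper. The formula is transcribed as printed above; iroffa is defined in the text but does not enter the bound, so ia is not an argument.

```python
from math import gcd

def numroc(n, nb, iproc, isrcproc, nprocs):
    """Count of rows/columns of a block-cyclic dimension owned by process iproc."""
    mydist = (nprocs + iproc - isrcproc) % nprocs
    nblocks = n // nb
    count = (nblocks // nprocs) * nb
    extra = nblocks % nprocs
    if mydist < extra:
        count += nb          # this process holds one more full block
    elif mydist == extra:
        count += n % nb      # this process holds the trailing partial block
    return count

def indxg2p(indxglob, nb, iproc, isrcproc, nprocs):
    """Process coordinate owning global index indxglob (iproc is unused)."""
    return (isrcproc + (indxglob - 1) // nb) % nprocs

def ilcm(m, n):
    """Least common multiple of two positive integers."""
    return m * n // gcd(m, n)

def unmrz_min_lwork(side, m, n, ja, ic, jc, mb_a, nb_a, mb_c, nb_c,
                    csrc_a=0, rsrc_c=0, csrc_c=0,
                    myrow=0, mycol=0, nprow=1, npcol=1):
    """Minimum lwork for p?unmrz, following the formula in the text."""
    lcmp = ilcm(nprow, npcol) // nprow
    icoffa = (ja - 1) % nb_a
    iacol = indxg2p(ja, nb_a, mycol, csrc_a, npcol)
    mqa0 = numroc(m + icoffa, nb_a, mycol, iacol, npcol)
    iroffc = (ic - 1) % mb_c
    icoffc = (jc - 1) % nb_c
    icrow = indxg2p(ic, mb_c, myrow, rsrc_c, nprow)
    iccol = indxg2p(jc, nb_c, mycol, csrc_c, npcol)
    mpc0 = numroc(m + iroffc, mb_c, myrow, icrow, nprow)
    nqc0 = numroc(n + icoffc, nb_c, mycol, iccol, npcol)
    tri = (mb_a * (mb_a - 1)) // 2
    if side == 'L':
        inner = numroc(numroc(n + iroffc, mb_a, 0, 0, nprow), mb_a, 0, 0, lcmp)
        body = (mpc0 + max(mqa0 + inner, nqc0)) * mb_a
    elif side == 'R':
        body = (mpc0 + nqc0) * mb_a
    else:
        raise ValueError("side must be 'L' or 'R'")
    return max(tri, body) + mb_a * mb_a

# Example: an 8-by-8 sub(C) with 4-by-4 blocks on a 1-by-1 process grid.
lwork_left = unmrz_min_lwork('L', 8, 8, 1, 1, 1, 4, 4, 4, 4)
lwork_right = unmrz_min_lwork('R', 8, 8, 1, 1, 1, 4, 4, 4, 4)
```

In a real program the grid values would come from blacs_gridinfo and the source-process entries from the array descriptors; a workspace query with lwork = -1 remains the authoritative way to size work.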
If lwork = -1, then lwork is global input and a workspace query is assumed; the routine only calculates the minimum and optimal size for all work arrays. Each of these values is returned in the first entry of the corresponding work array, and no error message is issued by pxerbla.
Output Parameters
c Overwritten by the product Q*sub(C), or Q'*sub(C), or sub(C)*Q', or sub(C)*Q.
work(1) On exit, work(1) contains the minimum value of lwork required for optimum performance.
info (global) INTEGER.
= 0: the execution is successful.
< 0: if the i-th argument is an array and the j-th entry had an illegal value, then info = -(i*100+j); if the i-th argument is a scalar and had an illegal value, then info = -i.
p?ggqrf
Computes the generalized QR factorization.
Syntax
call psggqrf(n, m, p, a, ia, ja, desca, taua, b, ib, jb, descb, taub, work, lwork, info)
call pdggqrf(n, m, p, a, ia, ja, desca, taua, b, ib, jb, descb, taub, work, lwork, info)
call pcggqrf(n, m, p, a, ia, ja, desca, taua, b, ib, jb, descb, taub, work, lwork, info)
call pzggqrf(n, m, p, a, ia, ja, desca, taua, b, ib, jb, descb, taub, work, lwork, info)
Include Files
• C: mkl_scalapack.h
Description
The p?ggqrf routine forms the generalized QR factorization of an n-by-m matrix sub(A) = A(ia:ia+n-1, ja:ja+m-1) and an n-by-p matrix sub(B) = B(ib:ib+n-1, jb:jb+p-1) as
sub(A) = Q*R, sub(B) = Q*T*Z,
where Q is an n-by-n orthogonal/unitary matrix, Z is a p-by-p orthogonal/unitary matrix, and R and T assume one of the forms:
If n ≥ m or if n [...] 0, some or all of the eigenvalues fail to converge or are not computed.
If info = 1, bisection fails to converge for some eigenvalues; these eigenvalues are flagged by a negative block number. The effect is that the eigenvalues may not be as accurate as the absolute and relative tolerances.
If info = 2, there is a mismatch between the number of eigenvalues output and the number desired.
If info = 3, range = 'i' and the Gershgorin interval initially used is incorrect.
No eigenvalues are computed. Probable cause: the machine has sloppy floating-point arithmetic. Increase the fudge parameter, recompile, and try again.
p?stein
Computes the eigenvectors of a tridiagonal matrix using inverse iteration.
Syntax
call psstein(n, d, e, m, w, iblock, isplit, orfac, z, iz, jz, descz, work, lwork, iwork, liwork, ifail, iclustr, gap, info)
call pdstein(n, d, e, m, w, iblock, isplit, orfac, z, iz, jz, descz, work, lwork, iwork, liwork, ifail, iclustr, gap, info)
call pcstein(n, d, e, m, w, iblock, isplit, orfac, z, iz, jz, descz, work, lwork, iwork, liwork, ifail, iclustr, gap, info)
call pzstein(n, d, e, m, w, iblock, isplit, orfac, z, iz, jz, descz, work, lwork, iwork, liwork, ifail, iclustr, gap, info)
Include Files
• C: mkl_scalapack.h
Description
The p?stein routine computes the eigenvectors of a symmetric tridiagonal matrix T corresponding to specified eigenvalues, by inverse iteration. p?stein does not orthogonalize vectors that are on different processes. The extent of orthogonalization is controlled by the input parameter lwork. Eigenvectors that are to be orthogonalized are computed by the same process. p?stein decides on the allocation of work among the processes and then calls ?stein2 (modified LAPACK routine) on each individual process. If insufficient workspace is allocated, the expected orthogonalization may not be done.
NOTE If the eigenvectors obtained are not orthogonal, increase lwork and run the code again.
p = NPROW*NPCOL is the total number of processes.
Input Parameters
n (global) INTEGER. The order of the matrix T (n ≥ 0).
m (global) INTEGER. The number of eigenvectors to be returned.
d, e, w (global) REAL for single-precision flavors
DOUBLE PRECISION for double-precision flavors.
Arrays:
d(*) contains the diagonal elements of T. DIMENSION (n).
e(*) contains the off-diagonal elements of T. DIMENSION (n-1).
w(*) contains all the eigenvalues grouped by split-off block. The eigenvalues are supplied from smallest to largest within the block. (Here the output array w from p?stebz with order = 'B' is expected. The array should be replicated in all processes.) DIMENSION (m).
iblock (global) INTEGER. Array, DIMENSION (n). The submatrix indices associated with the corresponding eigenvalues in w: 1 for eigenvalues belonging to the first submatrix from the top, 2 for those belonging to the second submatrix, etc. (The output array iblock from p?stebz is expected here.)
isplit (global) INTEGER. Array, DIMENSION (n). The splitting points, at which T breaks up into submatrices. The first submatrix consists of rows/columns 1 to isplit(1), the second of rows/columns isplit(1)+1 through isplit(2), etc., and the nsplit-th consists of rows/columns isplit(nsplit-1)+1 through isplit(nsplit) = n. (The output array isplit from p?stebz is expected here.)
orfac (global) REAL for single-precision flavors
DOUBLE PRECISION for double-precision flavors.
orfac specifies which eigenvectors should be orthogonalized. Eigenvectors that correspond to eigenvalues within orfac*||T|| of each other are to be orthogonalized. However, if the workspace is insufficient (see lwork), this tolerance may be decreased until all eigenvectors can be stored in one process. No orthogonalization is done if orfac is equal to zero. A default value of 1000 is used if orfac is negative. orfac should be identical on all processes.
iz, jz (global) INTEGER. The row and column indices in the global array z indicating the first row and the first column of the submatrix Z, respectively.
descz (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix Z.
work (local) REAL for single-precision flavors
DOUBLE PRECISION for double-precision flavors.
Workspace array, DIMENSION (lwork).
lwork (local) INTEGER.
lwork controls the extent of orthogonalization which can be done. The number of eigenvectors for which storage is allocated on each process is
nvec = floor((lwork - max(5*n, np00*mq00))/n).
Eigenvectors corresponding to eigenvalue clusters of size nvec - ceil(m/p) + 1 are guaranteed to be orthogonal (the orthogonality is similar to that obtained from ?stein2).
NOTE lwork must be no smaller than max(5*n, np00*mq00) + ceil(m/p)*n and should have the same input value on all processes. It is the minimum value of lwork input on different processes that is significant.
If lwork = -1, then lwork is global input and a workspace query is assumed; the routine only calculates the minimum and optimal size for all work arrays. Each of these values is returned in the first entry of the corresponding work array, and no error message is issued by pxerbla.
iwork (local) INTEGER. Workspace array, DIMENSION (3*n+p+1).
liwork (local) INTEGER. The size of the array iwork. It must be greater than (3*n+p+1).
If liwork = -1, then liwork is global input and a workspace query is assumed; the routine only calculates the minimum and optimal size for all work arrays. Each of these values is returned in the first entry of the corresponding work array, and no error message is issued by pxerbla.
Output Parameters
z (local) REAL for psstein
DOUBLE PRECISION for pdstein
COMPLEX for pcstein
DOUBLE COMPLEX for pzstein.
Array, DIMENSION (descz(dlen_), n/NPCOL + NB). z contains the computed eigenvectors associated with the specified eigenvalues. Any vector which fails to converge is set to its current iterate after MAXIT iterations (see ?stein2). On output, z is distributed across the p processes in block cyclic format.
work(1) On exit, work(1) gives a lower bound on the workspace (lwork) that guarantees the user desired orthogonalization (see orfac). Note that this may overestimate the minimum workspace needed.
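The relations between lwork, nvec and the guaranteed cluster size stated for p?stein can be checked with a few lines of arithmetic. The Python sketch below is illustrative only: stein_workspace is a hypothetical helper, and np00 and mq00 are not defined in this section, so they are treated as caller-supplied inputs.

```python
def stein_workspace(n, m, p, np00, mq00, lwork=None):
    """Evaluate the p?stein workspace relations quoted in the text.

    Returns the minimum lwork and, if lwork is given, nvec and the
    largest cluster size for which orthogonality is guaranteed.
    """
    base = max(5 * n, np00 * mq00)
    per_process = -(-m // p)              # ceil(m/p) in integer arithmetic
    result = {"min_lwork": base + per_process * n}
    if lwork is not None:
        if lwork < result["min_lwork"]:
            raise ValueError("lwork is below the documented minimum")
        nvec = (lwork - base) // n        # floor((lwork - base)/n)
        result["nvec"] = nvec
        result["guaranteed_cluster_size"] = nvec - per_process + 1
    return result

# Example: n = 100, m = 20 eigenvectors, p = 4 processes, lwork = 1500.
info = stein_workspace(n=100, m=20, p=4, np00=50, mq00=10, lwork=1500)
```

With these illustrative numbers the minimum lwork is 1000, and raising lwork to 1500 allocates storage for 10 eigenvectors per process, which guarantees orthogonality within clusters of up to 6 eigenvalues.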
iwork On exit, iwork(1) contains the amount of integer workspace required. On exit, iwork(2) through iwork(p+2) indicate the eigenvectors computed by each process. Process i computes eigenvectors indexed iwork(i+2)+1 through iwork(i+3).
ifail (global) INTEGER. Array, DIMENSION (m). On normal exit, all elements of ifail are zero. If one or more eigenvectors fail to converge after MAXIT iterations (as in ?stein), then info > 0 is returned. If mod(info, m+1) > 0, then for i = 1 to mod(info, m+1), the eigenvector corresponding to the eigenvalue w(ifail(i)) failed to converge (w refers to the array of eigenvalues on output).
iclustr (global) INTEGER. Array, DIMENSION (2*p). This output array contains indices of eigenvectors corresponding to a cluster of eigenvalues that could not be orthogonalized due to insufficient workspace (see lwork, orfac and info). Eigenvectors corresponding to clusters of eigenvalues indexed iclustr(2*i-1) to iclustr(2*i), i = 1 to info/(m+1), could not be orthogonalized due to lack of workspace. Hence the eigenvectors corresponding to these clusters may not be orthogonal. iclustr is a zero-terminated array: (iclustr(2*k).ne.0 .and. iclustr(2*k+1).eq.0) if and only if k is the number of clusters.
gap (global) REAL for single-precision flavors
DOUBLE PRECISION for double-precision flavors.
This output array contains the gap between eigenvalues whose eigenvectors could not be orthogonalized. The info/m output values in this array correspond to the info/(m+1) clusters indicated by the array iclustr. As a result, the dot product between eigenvectors corresponding to the i-th cluster may be as high as (O(n)*macheps)/gap(i).
info (global) INTEGER.
If info = 0, the execution is successful.
If info < 0: if the i-th argument is an array and the j-th entry had an illegal value, then info = -(i*100+j); if the i-th argument is a scalar and had an illegal value, then info = -i.
If info > 0: if mod(info, m+1) = i, then i eigenvectors failed to converge in MAXIT iterations. Their indices are stored in the array ifail. If info/(m+1) = i, then eigenvectors corresponding to i clusters of eigenvalues could not be orthogonalized due to insufficient workspace. The indices of the clusters are stored in the array iclustr.
Nonsymmetric Eigenvalue Problems
This section describes ScaLAPACK routines for solving nonsymmetric eigenvalue problems, computing the Schur factorization of general matrices, as well as performing a number of related computational tasks.
To solve a nonsymmetric eigenvalue problem with ScaLAPACK, you usually need to reduce the matrix to the upper Hessenberg form and then solve the eigenvalue problem with the Hessenberg matrix obtained. Table "Computational Routines for Solving Nonsymmetric Eigenproblems" lists ScaLAPACK routines for reducing the matrix to the upper Hessenberg form by an orthogonal (or unitary) similarity transformation A = Q*H*Q^H, as well as routines for solving eigenproblems with Hessenberg matrices, and multiplying the matrix after reduction.
Computational Routines for Solving Nonsymmetric Eigenproblems
Operation performed                         General matrix   Orthogonal/Unitary matrix   Hessenberg matrix
Reduce to Hessenberg form A = Q*H*Q^H       p?gehrd
Multiply the matrix after reduction                          p?ormhr/p?unmhr
Find eigenvalues and Schur factorization                                                 p?lahqr
p?gehrd
Reduces a general matrix to upper Hessenberg form.
Syntax
call psgehrd(n, ilo, ihi, a, ia, ja, desca, tau, work, lwork, info)
call pdgehrd(n, ilo, ihi, a, ia, ja, desca, tau, work, lwork, info)
call pcgehrd(n, ilo, ihi, a, ia, ja, desca, tau, work, lwork, info)
call pzgehrd(n, ilo, ihi, a, ia, ja, desca, tau, work, lwork, info)
Include Files
• C: mkl_scalapack.h
Description
The p?gehrd routine reduces a real/complex general distributed matrix sub(A) to upper Hessenberg form H by an orthogonal or unitary similarity transformation
Q'*sub(A)*Q = H,
where sub(A) = A(ia:ia+n-1, ja:ja+n-1).
Input Parameters
n (global) INTEGER. The order of the distributed matrix sub(A) (n ≥ 0).
ilo, ihi (global) INTEGER. It is assumed that sub(A) is already upper triangular in rows ia:ia+ilo-2 and ia+ihi:ia+n-1 and columns ja:ja+ilo-2 and ja+ihi:ja+n-1. (See Application Notes below.) If n > 0, 1 ≤ ilo ≤ ihi ≤ n; otherwise set ilo = 1, ihi = n.
a (local) REAL for psgehrd
DOUBLE PRECISION for pdgehrd
COMPLEX for pcgehrd
DOUBLE COMPLEX for pzgehrd.
Pointer into the local memory to an array of dimension (lld_a, LOCc(ja+n-1)). On entry, this array contains the local pieces of the n-by-n general distributed matrix sub(A) to be reduced.
ia, ja (global) INTEGER. The row and column indices in the global array a indicating the first row and the first column of the submatrix A, respectively.
desca (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix A.
work (local) REAL for psgehrd
DOUBLE PRECISION for pdgehrd
COMPLEX for pcgehrd
DOUBLE COMPLEX for pzgehrd.
Workspace array of dimension lwork.
lwork (local or global) INTEGER, dimension of the array work.
lwork is local input and must be at least
lwork = NB*NB + NB*max(ihip+1, ihlp+inlq),
where
NB = mb_a = nb_a,
iroffa = mod(ia-1, NB),
icoffa = mod(ja-1, NB),
ioff = mod(ia+ilo-2, NB),
iarow = indxg2p(ia, NB, MYROW, rsrc_a, NPROW),
ihip = numroc(ihi+iroffa, NB, MYROW, iarow, NPROW),
ilrow = indxg2p(ia+ilo-1, NB, MYROW, rsrc_a, NPROW),
ihlp = numroc(ihi-ilo+ioff+1, NB, MYROW, ilrow, NPROW),
ilcol = indxg2p(ja+ilo-1, NB, MYCOL, csrc_a, NPCOL),
inlq = numroc(n-ilo+ioff+1, NB, MYCOL, ilcol, NPCOL),
indxg2p and numroc are ScaLAPACK tool functions; MYROW, MYCOL, NPROW and NPCOL can be determined by calling the subroutine blacs_gridinfo.
If lwork = -1, then lwork is global input and a workspace query is assumed; the routine only calculates the minimum and optimal size for all work arrays. Each of these values is returned in the first entry of the corresponding work array, and no error message is issued by pxerbla.
Output Parameters
a On exit, the upper triangle and the first subdiagonal of sub(A) are overwritten with the upper Hessenberg matrix H, and the elements below the first subdiagonal, with the array tau, represent the orthogonal/unitary matrix Q as a product of elementary reflectors (see Application Notes below).
tau (local) REAL for psgehrd
DOUBLE PRECISION for pdgehrd
COMPLEX for pcgehrd
DOUBLE COMPLEX for pzgehrd.
Array, DIMENSION at least max(ja+n-2). The scalar factors of the elementary reflectors (see Application Notes below). Elements ja:ja+ilo-2 and ja+ihi:ja+n-2 of tau are set to zero. tau is tied to the distributed matrix A.
work(1) On exit, work(1) contains the minimum value of lwork required for optimum performance.
info (global) INTEGER.
= 0: the execution is successful.
< 0: if the i-th argument is an array and the j-th entry had an illegal value, then info = -(i*100+j); if the i-th argument is a scalar and had an illegal value, then info = -i.
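The negative info encoding used by these routines can be unpacked mechanically. The Python sketch below is a hypothetical helper, not part of Intel MKL; it assumes that scalar argument positions stay below 100, so any magnitude of 100 or more is read as the array form -(i*100+j).

```python
def decode_info(info):
    """Split a ScaLAPACK-style info value into (kind, argument, entry).

    kind is 'ok', 'scalar', 'array' or 'routine-specific'; argument is the
    1-based position i in the call; entry is the index j within an array
    argument (None otherwise).
    """
    if info == 0:
        return ("ok", None, None)
    if info > 0:
        # Positive values have a routine-dependent meaning; see each routine.
        return ("routine-specific", info, None)
    code = -info
    if code >= 100:
        i, j = divmod(code, 100)   # info = -(i*100 + j)
        return ("array", i, j)
    return ("scalar", code, None)

# Example: for p?gehrd the 7th argument is desca, so info = -702 points at
# the second entry of the descriptor, while info = -3 points at ihi.
bad_descriptor = decode_info(-702)
bad_scalar = decode_info(-3)
```

A caller would typically map the returned position back to the argument names in the routine's call syntax before reporting the error.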
Application Notes
The matrix Q is represented as a product of (ihi-ilo) elementary reflectors
Q = H(ilo)*H(ilo+1)*...*H(ihi-1).
Each H(i) has the form
H(i) = I - tau*v*v'
where tau is a real/complex scalar, and v is a real/complex vector with v(1:i) = 0, v(i+1) = 1 and v(ihi+1:n) = 0; v(i+2:ihi) is stored on exit in a(ia+ilo+i:ia+ihi-1, ja+ilo+i-2), and tau in tau(ja+ilo+i-2).
The contents of a(ia:ia+n-1, ja:ja+n-1) are illustrated by an example with n = 7, ilo = 2 and ihi = 6 (matrix diagrams on entry and on exit), where a denotes an element of the original matrix sub(A), H denotes a modified element of the upper Hessenberg matrix H, and vi denotes an element of the vector defining H(ja+ilo+i-2).
p?ormhr
Multiplies a general matrix by the orthogonal transformation matrix from a reduction to Hessenberg form determined by p?gehrd.
Syntax
call psormhr(side, trans, m, n, ilo, ihi, a, ia, ja, desca, tau, c, ic, jc, descc, work, lwork, info)
call pdormhr(side, trans, m, n, ilo, ihi, a, ia, ja, desca, tau, c, ic, jc, descc, work, lwork, info)
Include Files
• C: mkl_scalapack.h
Description
The p?ormhr routine overwrites the general real distributed m-by-n matrix sub(C) = C(ic:ic+m-1, jc:jc+n-1) with
               side = 'L'      side = 'R'
trans = 'N':   Q*sub(C)        sub(C)*Q
trans = 'T':   Q^T*sub(C)      sub(C)*Q^T
where Q is a real orthogonal distributed matrix of order nq, with nq = m if side = 'L' and nq = n if side = 'R'.
Q is defined as the product of ihi-ilo elementary reflectors, as returned by p?gehrd:
Q = H(ilo) H(ilo+1)... H(ihi-1).
Input Parameters
side (global) CHARACTER
= 'L': Q or Q^T is applied from the left.
= 'R': Q or Q^T is applied from the right.
trans (global) CHARACTER
= 'N': no transpose, Q is applied.
= 'T': transpose, Q^T is applied.
m (global) INTEGER. The number of rows in the distributed matrix sub(C) (m ≥ 0).
n (global) INTEGER. The number of columns in the distributed matrix sub(C) (n ≥ 0).
ilo, ihi (global) INTEGER.
ilo and ihi must have the same values as in the previous call of p?gehrd. Q is equal to the unit matrix except for the distributed submatrix Q(ia+ilo:ia+ihi-1, ia+ilo:ja+ihi-1).
If side = 'L', 1 ≤ ilo ≤ ihi ≤ max(1,m);
If side = 'R', 1 ≤ ilo ≤ ihi ≤ max(1,n);
ilo and ihi are relative indexes.
a (local) REAL for psormhr
DOUBLE PRECISION for pdormhr.
Pointer into the local memory to an array of dimension (lld_a, LOCc(ja+m-1)) if side = 'L', and (lld_a, LOCc(ja+n-1)) if side = 'R'. Contains the vectors which define the elementary reflectors, as returned by p?gehrd.
ia, ja (global) INTEGER. The row and column indices in the global array a indicating the first row and the first column of the submatrix A, respectively.
desca (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix A.
tau (local) REAL for psormhr
DOUBLE PRECISION for pdormhr.
Array, DIMENSION LOCc(ja+m-2) if side = 'L', and LOCc(ja+n-2) if side = 'R'. This array contains the scalar factors tau(j) of the elementary reflectors H(j) as returned by p?gehrd. tau is tied to the distributed matrix A.
c (local) REAL for psormhr
DOUBLE PRECISION for pdormhr.
Pointer into the local memory to an array of dimension (lld_c, LOCc(jc+n-1)). Contains the local pieces of the distributed matrix sub(C).
ic, jc (global) INTEGER. The row and column indices in the global array c indicating the first row and the first column of the submatrix C, respectively.
descc (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix C.
work (local) REAL for psormhr
DOUBLE PRECISION for pdormhr.
Workspace array with dimension lwork.
lwork (local or global) INTEGER. The dimension of the array work.
lwork must be at least
iaa = ia + ilo; jaa = ja+ilo-1;
If side = 'L',
mi = ihi-ilo; ni = n; icc = ic + ilo; jcc = jc;
lwork = max((nb_a*(nb_a-1))/2, (nqc0+mpc0)*nb_a) + nb_a*nb_a
else if side = 'R',
mi = m; ni = ihi-ilo; icc = ic; jcc = jc + ilo;
lwork = max((nb_a*(nb_a-1))/2, (nqc0 + max(npa0 + numroc(numroc(ni+icoffc, nb_a, 0, 0, NPCOL), nb_a, 0, 0, lcmq), mpc0))*nb_a) + nb_a*nb_a
end if
where
lcmq = lcm/NPCOL with lcm = ilcm(NPROW, NPCOL),
iroffa = mod(iaa-1, mb_a),
icoffa = mod(jaa-1, nb_a),
iarow = indxg2p(iaa, mb_a, MYROW, rsrc_a, NPROW),
npa0 = numroc(ni+iroffa, mb_a, MYROW, iarow, NPROW),
iroffc = mod(icc-1, mb_c),
icoffc = mod(jcc-1, nb_c),
icrow = indxg2p(icc, mb_c, MYROW, rsrc_c, NPROW),
iccol = indxg2p(jcc, nb_c, MYCOL, csrc_c, NPCOL),
mpc0 = numroc(mi+iroffc, mb_c, MYROW, icrow, NPROW),
nqc0 = numroc(ni+icoffc, nb_c, MYCOL, iccol, NPCOL),
ilcm, indxg2p and numroc are ScaLAPACK tool functions; MYROW, MYCOL, NPROW and NPCOL can be determined by calling the subroutine blacs_gridinfo.
If lwork = -1, then lwork is global input and a workspace query is assumed; the routine only calculates the minimum and optimal size for all work arrays. Each of these values is returned in the first entry of the corresponding work array, and no error message is issued by pxerbla.
Output Parameters
c sub(C) is overwritten by Q*sub(C), or Q'*sub(C), or sub(C)*Q', or sub(C)*Q.
work(1) On exit, work(1) contains the minimum value of lwork required for optimum performance.
info (global) INTEGER.
= 0: the execution is successful.
< 0: if the i-th argument is an array and the j-th entry had an illegal value, then info = -(i*100+j); if the i-th argument is a scalar and had an illegal value, then info = -i.
p?unmhr
Multiplies a general matrix by the unitary transformation matrix from a reduction to Hessenberg form determined by p?gehrd.
Syntax
call pcunmhr(side, trans, m, n, ilo, ihi, a, ia, ja, desca, tau, c, ic, jc, descc, work, lwork, info)
call pzunmhr(side, trans, m, n, ilo, ihi, a, ia, ja, desca, tau, c, ic, jc, descc, work, lwork, info)
Include Files
• C: mkl_scalapack.h
Description
This routine overwrites the general complex distributed m-by-n matrix sub(C) = C(ic:ic+m-1, jc:jc+n-1) with
               side = 'L'      side = 'R'
trans = 'N':   Q*sub(C)        sub(C)*Q
trans = 'C':   Q^H*sub(C)      sub(C)*Q^H
where Q is a complex unitary distributed matrix of order nq, with nq = m if side = 'L' and nq = n if side = 'R'.
Q is defined as the product of ihi-ilo elementary reflectors, as returned by p?gehrd:
Q = H(ilo) H(ilo+1)... H(ihi-1).
Input Parameters
side (global) CHARACTER
= 'L': Q or Q^H is applied from the left.
= 'R': Q or Q^H is applied from the right.
trans (global) CHARACTER
= 'N': no transpose, Q is applied.
= 'C': conjugate transpose, Q^H is applied.
m (global) INTEGER. The number of rows in the distributed submatrix sub(C) (m ≥ 0).
n (global) INTEGER. The number of columns in the distributed submatrix sub(C) (n ≥ 0).
ilo, ihi (global) INTEGER. These must be the same parameters ilo and ihi, respectively, as supplied to p?gehrd. Q is equal to the unit matrix except in the distributed submatrix Q(ia+ilo:ia+ihi-1, ia+ilo:ja+ihi-1).
If side = 'L', then 1 ≤ ilo ≤ ihi ≤ max(1,m).
If side = 'R', then 1 ≤ ilo ≤ ihi ≤ max(1,n).
ilo and ihi are relative indexes.
a (local) COMPLEX for pcunmhr
DOUBLE COMPLEX for pzunmhr.
Pointer into the local memory to an array of dimension (lld_a, LOCc(ja+m-1)) if side = 'L', and (lld_a, LOCc(ja+n-1)) if side = 'R'. Contains the vectors which define the elementary reflectors, as returned by p?gehrd.
ia, ja (global) INTEGER. The row and column indices in the global array a indicating the first row and the first column of the submatrix A, respectively.
desca (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix A.
tau (local) COMPLEX for pcunmhr, DOUBLE COMPLEX for pzunmhr. Array, DIMENSION LOCc(ja+m-2) if side = 'L', and LOCc(ja+n-2) if side = 'R'. This array contains the scalar factors tau(j) of the elementary reflectors H(j) as returned by p?gehrd. tau is tied to the distributed matrix A.
c (local) COMPLEX for pcunmhr, DOUBLE COMPLEX for pzunmhr. Pointer into the local memory to an array of dimension (lld_c, LOCc(jc+n-1)). Contains the local pieces of the distributed matrix sub(C).
ic, jc (global) INTEGER. The row and column indices in the global array c indicating the first row and the first column of the submatrix C, respectively.
descc (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix C.
work (local) COMPLEX for pcunmhr, DOUBLE COMPLEX for pzunmhr. Workspace array with dimension lwork.
lwork (local or global) INTEGER. The dimension of the array work. lwork must be at least
iaa = ia + ilo; jaa = ja+ilo-1;
If side = 'L',
  mi = ihi-ilo; ni = n; icc = ic + ilo; jcc = jc;
  lwork = max((nb_a*(nb_a-1))/2, (nqc0+mpc0)*nb_a) + nb_a*nb_a
else if side = 'R',
  mi = m; ni = ihi-ilo; icc = ic; jcc = jc + ilo;
  lwork = max((nb_a*(nb_a-1))/2, (nqc0 + max(npa0+numroc(numroc(ni+icoffc, nb_a, 0, 0, NPCOL), nb_a, 0, 0, lcmq), mpc0))*nb_a) + nb_a*nb_a
end if
where lcmq = lcm/NPCOL with lcm = ilcm(NPROW, NPCOL),
iroffa = mod(iaa-1, mb_a), icoffa = mod(jaa-1, nb_a),
iarow = indxg2p(iaa, mb_a, MYROW, rsrc_a, NPROW),
npa0 = numroc(ni+iroffa, mb_a, MYROW, iarow, NPROW),
iroffc = mod(icc-1, mb_c), icoffc = mod(jcc-1, nb_c),
icrow = indxg2p(icc, mb_c, MYROW, rsrc_c, NPROW),
iccol = indxg2p(jcc, nb_c, MYCOL, csrc_c, NPCOL),
mpc0 = numroc(mi+iroffc, mb_c, MYROW, icrow, NPROW),
nqc0 = numroc(ni+icoffc, nb_c, MYCOL, iccol, NPCOL),
ilcm, indxg2p and numroc are ScaLAPACK tool functions; MYROW, MYCOL, NPROW and NPCOL can be determined by calling the subroutine blacs_gridinfo.
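The lower bound on lwork given above can be evaluated before the call. The following is a minimal sketch in Python, not part of the library: numroc, indxg2p and ilcm are re-implemented here only to make the arithmetic visible (in a real program you would call the ScaLAPACK tool functions themselves), and unmhr_min_lwork is a hypothetical helper name. The sketch omits jaa and icoffa because the formula defines them but does not use them.

```python
from math import gcd


def numroc(n, nb, iproc, isrcproc, nprocs):
    """Rows or columns of an n-long block-cyclic dimension owned by process iproc."""
    mydist = (nprocs + iproc - isrcproc) % nprocs
    nblocks = n // nb                      # number of complete blocks
    count = (nblocks // nprocs) * nb       # blocks every process receives
    extrablks = nblocks % nprocs
    if mydist < extrablks:
        count += nb                        # one extra complete block
    elif mydist == extrablks:
        count += n % nb                    # the trailing partial block
    return count


def indxg2p(indxglob, nb, iproc, isrcproc, nprocs):
    """Process coordinate owning global index indxglob (1-based); iproc is unused."""
    return (isrcproc + (indxglob - 1) // nb) % nprocs


def ilcm(m, n):
    """Least common multiple of two positive integers."""
    return m * n // gcd(m, n)


def unmhr_min_lwork(side, m, n, ilo, ihi, ia, ic, jc,
                    mb_a, nb_a, rsrc_a, mb_c, nb_c, rsrc_c, csrc_c,
                    myrow, mycol, nprow, npcol):
    """Hypothetical helper: minimum lwork for p?unmhr, following the formula above."""
    iaa = ia + ilo
    if side == 'L':
        mi, ni, icc, jcc = ihi - ilo, n, ic + ilo, jc
    else:
        mi, ni, icc, jcc = m, ihi - ilo, ic, jc + ilo
    iroffc = (icc - 1) % mb_c
    icoffc = (jcc - 1) % nb_c
    icrow = indxg2p(icc, mb_c, myrow, rsrc_c, nprow)
    iccol = indxg2p(jcc, nb_c, mycol, csrc_c, npcol)
    mpc0 = numroc(mi + iroffc, mb_c, myrow, icrow, nprow)
    nqc0 = numroc(ni + icoffc, nb_c, mycol, iccol, npcol)
    tri = (nb_a * (nb_a - 1)) // 2
    if side == 'L':
        return max(tri, (nqc0 + mpc0) * nb_a) + nb_a * nb_a
    lcmq = ilcm(nprow, npcol) // npcol
    iroffa = (iaa - 1) % mb_a
    iarow = indxg2p(iaa, mb_a, myrow, rsrc_a, nprow)
    npa0 = numroc(ni + iroffa, mb_a, myrow, iarow, nprow)
    inner = numroc(numroc(ni + icoffc, nb_a, 0, 0, npcol), nb_a, 0, 0, lcmq)
    return max(tri, (nqc0 + max(npa0 + inner, mpc0)) * nb_a) + nb_a * nb_a
```

The result depends on MYROW and MYCOL, so each process generally obtains its own value.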
If lwork = -1, then lwork is global input and a workspace query is assumed; the routine only calculates the minimum and optimal size for all work arrays. Each of these values is returned in the first entry of the corresponding work array, and no error message is issued by pxerbla.

Output Parameters
c sub(C) is overwritten by Q*sub(C), or Q'*sub(C), or sub(C)*Q', or sub(C)*Q.
work(1) On exit work(1) contains the minimum value of lwork required for optimum performance.
info (global) INTEGER.
= 0: the execution is successful.
< 0: if the i-th argument is an array and the j-th entry had an illegal value, then info = -(i*100+j); if the i-th argument is a scalar and had an illegal value, then info = -i.

p?lahqr
Computes the Schur decomposition and/or eigenvalues of a matrix already in Hessenberg form.

Syntax
call pslahqr(wantt, wantz, n, ilo, ihi, a, desca, wr, wi, iloz, ihiz, z, descz, work, lwork, iwork, ilwork, info)
call pdlahqr(wantt, wantz, n, ilo, ihi, a, desca, wr, wi, iloz, ihiz, z, descz, work, lwork, iwork, ilwork, info)

Include Files
• C: mkl_scalapack.h

Description
This is an auxiliary routine used to find the Schur decomposition and/or eigenvalues of a matrix already in Hessenberg form from columns ilo to ihi.

Input Parameters
wantt (global) LOGICAL. If wantt = .TRUE., the full Schur form T is required; if wantt = .FALSE., only eigenvalues are required.
wantz (global) LOGICAL. If wantz = .TRUE., the matrix of Schur vectors z is required; if wantz = .FALSE., Schur vectors are not required.
n (global) INTEGER. The order of the Hessenberg matrix A (and z if wantz) (n ≥ 0).
ilo, ihi (global) INTEGER. It is assumed that A is already upper quasi-triangular in rows and columns ihi+1:n, and that A(ilo, ilo-1) = 0 (unless ilo = 1). p?lahqr works primarily with the Hessenberg submatrix in rows and columns ilo to ihi, but applies transformations to all of A if wantt is .TRUE.. 1 ≤ ilo ≤ max(1,ihi); ihi ≤ n.
a (global) REAL for pslahqr, DOUBLE PRECISION for pdlahqr. Array, DIMENSION (desca(lld_),*). On entry, the upper Hessenberg matrix A.
desca (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix A.
iloz, ihiz (global) INTEGER. Specify the rows of z to which transformations must be applied if wantz is .TRUE.. 1 ≤ iloz ≤ ilo; ihi ≤ ihiz ≤ n.
z (global) REAL for pslahqr, DOUBLE PRECISION for pdlahqr. Array. If wantz is .TRUE., on entry z must contain the current matrix Z of transformations accumulated by pdhseqr. If wantz is .FALSE., z is not referenced.
descz (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix Z.
work (local) REAL for pslahqr, DOUBLE PRECISION for pdlahqr. Workspace array with dimension lwork.
lwork (local) INTEGER. The dimension of work. lwork is assumed big enough so that
lwork ≥ 3*n + max(2*max(descz(lld_), desca(lld_)) + 2*LOCq(n), 7*ceil(n/hbl)/lcm(NPROW,NPCOL)).
If lwork = -1, then work(1) gets set to the above number and the code returns immediately.
iwork (global and local) INTEGER array of size ilwork.
ilwork (local) INTEGER. This holds some of the iblk integer arrays.

Output Parameters
a On exit, if wantt is .TRUE., A is upper quasi-triangular in rows and columns ilo:ihi, with any 2-by-2 or larger diagonal blocks not yet in standard form. If wantt is .FALSE., the contents of A are unspecified on exit.
work(1) On exit work(1) contains the minimum value of lwork required for optimum performance.
wr, wi (global replicated output) REAL for pslahqr, DOUBLE PRECISION for pdlahqr. Arrays, DIMENSION(n) each. The real and imaginary parts, respectively, of the computed eigenvalues ilo to ihi are stored in the corresponding elements of wr and wi. If two eigenvalues are computed as a complex conjugate pair, they are stored in consecutive elements of wr and wi, say the i-th and (i+1)-th, with wi(i) > 0 and wi(i+1) < 0. If wantt is .TRUE., the eigenvalues are stored in the same order as on the diagonal of the Schur form returned in A. A may be returned with larger diagonal blocks until the next release.
z On exit z has been updated; transformations are applied only to the submatrix z(iloz:ihiz, ilo:ihi).
info (global) INTEGER.
= 0: the execution is successful.
< 0: parameter number -info is incorrect or inconsistent.
> 0: p?lahqr failed to compute all the eigenvalues ilo to ihi in a total of 30*(ihi-ilo+1) iterations; if info = i, elements i+1:ihi of wr and wi contain those eigenvalues which have been successfully computed.

Singular Value Decomposition
This section describes ScaLAPACK routines for computing the singular value decomposition (SVD) of a general m-by-n matrix A (see "Singular Value Decomposition" in LAPACK chapter).
To find the SVD of a general matrix A, this matrix is first reduced to a bidiagonal matrix B by a unitary (orthogonal) transformation, and then the SVD of the bidiagonal matrix is computed. Note that the SVD of B is computed using the LAPACK routine ?bdsqr.
Table "Computational Routines for Singular Value Decomposition (SVD)" lists ScaLAPACK computational routines for performing this decomposition.

Computational Routines for Singular Value Decomposition (SVD)
Operation                          General matrix   Orthogonal/unitary matrix
Reduce A to a bidiagonal matrix    p?gebrd
Multiply matrix after reduction                     p?ormbr/p?unmbr

p?gebrd
Reduces a general matrix to bidiagonal form.
Syntax
call psgebrd(m, n, a, ia, ja, desca, d, e, tauq, taup, work, lwork, info)
call pdgebrd(m, n, a, ia, ja, desca, d, e, tauq, taup, work, lwork, info)
call pcgebrd(m, n, a, ia, ja, desca, d, e, tauq, taup, work, lwork, info)
call pzgebrd(m, n, a, ia, ja, desca, d, e, tauq, taup, work, lwork, info)

Include Files
• C: mkl_scalapack.h

Description
The p?gebrd routine reduces a real/complex general m-by-n distributed matrix sub(A) = A(ia:ia+m-1, ja:ja+n-1) to upper or lower bidiagonal form B by an orthogonal/unitary transformation:
Q'*sub(A)*P = B.
If m ≥ n, B is upper bidiagonal; if m < n, B is lower bidiagonal.

Input Parameters
m (global) INTEGER. The number of rows in the distributed matrix sub(A) (m ≥ 0).
n (global) INTEGER. The number of columns in the distributed matrix sub(A) (n ≥ 0).
a (local) REAL for psgebrd, DOUBLE PRECISION for pdgebrd, COMPLEX for pcgebrd, DOUBLE COMPLEX for pzgebrd. Pointer into the local memory to an array of dimension (lld_a, LOCc(ja+n-1)). On entry, this array contains the distributed matrix sub(A).
ia, ja (global) INTEGER. The row and column indices in the global array a indicating the first row and the first column of the submatrix A, respectively.
desca (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix A.
work (local) REAL for psgebrd, DOUBLE PRECISION for pdgebrd, COMPLEX for pcgebrd, DOUBLE COMPLEX for pzgebrd. Workspace array of dimension lwork.
lwork (local or global) INTEGER, dimension of work, must be at least:
lwork = nb*(mpa0 + nqa0+1) + nqa0
where nb = mb_a = nb_a,
iroffa = mod(ia-1, nb), icoffa = mod(ja-1, nb),
iarow = indxg2p(ia, nb, MYROW, rsrc_a, NPROW),
iacol = indxg2p(ja, nb, MYCOL, csrc_a, NPCOL),
mpa0 = numroc(m+iroffa, nb, MYROW, iarow, NPROW),
nqa0 = numroc(n+icoffa, nb, MYCOL, iacol, NPCOL),
indxg2p and numroc are ScaLAPACK tool functions; MYROW, MYCOL, NPROW and NPCOL can be determined by calling the subroutine blacs_gridinfo.
If lwork = -1, then lwork is global input and a workspace query is assumed; the routine only calculates the minimum and optimal size for all work arrays. Each of these values is returned in the first entry of the corresponding work array, and no error message is issued by pxerbla.

Output Parameters
a On exit, if m ≥ n, the diagonal and the first superdiagonal of sub(A) are overwritten with the upper bidiagonal matrix B; the elements below the diagonal, with the array tauq, represent the orthogonal/unitary matrix Q as a product of elementary reflectors, and the elements above the first superdiagonal, with the array taup, represent the orthogonal/unitary matrix P as a product of elementary reflectors. If m < n, the diagonal and the first subdiagonal are overwritten with the lower bidiagonal matrix B; the elements below the first subdiagonal, with the array tauq, represent the orthogonal/unitary matrix Q as a product of elementary reflectors, and the elements above the diagonal, with the array taup, represent the orthogonal/unitary matrix P as a product of elementary reflectors. See Application Notes below.
d (local) REAL for single-precision flavors, DOUBLE PRECISION for double-precision flavors. Array, DIMENSION LOCc(ja+min(m,n)-1) if m ≥ n; LOCr(ia+min(m,n)-1) otherwise. The distributed diagonal elements of the bidiagonal matrix B: d(i) = a(i,i). d is tied to the distributed matrix A.
e (local) REAL for single-precision flavors, DOUBLE PRECISION for double-precision flavors. Array, DIMENSION LOCr(ia+min(m,n)-1) if m ≥ n; LOCc(ja+min(m,n)-2) otherwise. The distributed off-diagonal elements of the bidiagonal distributed matrix B: if m ≥ n, e(i) = a(i,i+1) for i = 1,2,..., n-1; if m < n, e(i) = a(i+1, i) for i = 1,2,...,m-1. e is tied to the distributed matrix A.
tauq, taup (local) REAL for psgebrd, DOUBLE PRECISION for pdgebrd, COMPLEX for pcgebrd, DOUBLE COMPLEX for pzgebrd. Arrays, DIMENSION LOCc(ja+min(m,n)-1) for tauq and LOCr(ia+min(m,n)-1) for taup. Contain the scalar factors of the elementary reflectors which represent the orthogonal/unitary matrices Q and P, respectively. tauq and taup are tied to the distributed matrix A. See Application Notes below.
work(1) On exit work(1) contains the minimum value of lwork required for optimum performance.
info (global) INTEGER.
= 0: the execution is successful.
< 0: if the i-th argument is an array and the j-th entry had an illegal value, then info = -(i*100+j); if the i-th argument is a scalar and had an illegal value, then info = -i.

Application Notes
The matrices Q and P are represented as products of elementary reflectors:
If m ≥ n,
Q = H(1)*H(2)*...*H(n), and P = G(1)*G(2)*...*G(n-1).
Each H(i) and G(i) has the form:
H(i) = I - tauq*v*v' and G(i) = I - taup*u*u'
where tauq and taup are real/complex scalars, and v and u are real/complex vectors;
v(1:i-1) = 0, v(i) = 1, and v(i+1:m) is stored on exit in A(ia+i:ia+m-1, ja+i-1);
u(1:i) = 0, u(i+1) = 1, and u(i+2:n) is stored on exit in A(ia+i-1, ja+i+1:ja+n-1);
tauq is stored in tauq(ja+i-1) and taup in taup(ia+i-1).
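The reflector form H(i) = I - tauq*v*v' and the storage rule for v in the m ≥ n case can be illustrated with a small real-valued sketch in plain Python. Everything here is illustrative: reflector, matmul and q_reflector_vector are hypothetical helper names, the data are made up, and the choice tau = 2/(v'*v) is simply the value for which a real reflector of this form is orthogonal. It is not a statement about the particular values p?gebrd returns.

```python
def reflector(tau, v):
    """Return H = I - tau * v * v^T as a list of rows (real case only)."""
    n = len(v)
    return [[(1.0 if r == c else 0.0) - tau * v[r] * v[c] for c in range(n)]
            for r in range(n)]


def matmul(a, b):
    """Plain dense matrix product of two lists of rows."""
    return [[sum(a[r][k] * b[k][c] for k in range(len(b)))
             for c in range(len(b[0]))]
            for r in range(len(a))]


def q_reflector_vector(stored, i, m):
    """Rebuild v for H(i) when m >= n: v(1:i-1) = 0, v(i) = 1, v(i+1:m) = stored.

    'stored' stands for the m-i entries kept below the diagonal in column i.
    """
    assert len(stored) == m - i
    return [0.0] * (i - 1) + [1.0] + list(stored)


v = q_reflector_vector([2.0], 1, 2)     # v = [1, 2]
tau = 2.0 / sum(x * x for x in v)       # assumed value that makes H orthogonal
h = reflector(tau, v)
hth = matmul(h, h)                      # H is symmetric, so H^T * H = H * H
```

In this sketch hth agrees with the 2-by-2 identity up to rounding, which is the property that lets Q and P be applied as products of such factors.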
If m < n,
Q = H(1)*H(2)*...*H(m-1), and P = G(1)*G(2)*...*G(m).
Each H(i) and G(i) has the form:
H(i) = I - tauq*v*v' and G(i) = I - taup*u*u'
where tauq and taup are real/complex scalars, and v and u are real/complex vectors;
v(1:i) = 0, v(i+1) = 1, and v(i+2:m) is stored on exit in A(ia+i+1:ia+m-1, ja+i-1);
u(1:i-1) = 0, u(i) = 1, and u(i+1:n) is stored on exit in A(ia+i-1, ja+i:ja+n-1);
tauq is stored in tauq(ja+i-1) and taup in taup(ia+i-1).
The contents of sub(A) on exit are illustrated by the following examples:

m = 6 and n = 5 (m > n):
( d   e   u1  u1  u1 )
( v1  d   e   u2  u2 )
( v1  v2  d   e   u3 )
( v1  v2  v3  d   e  )
( v1  v2  v3  v4  d  )
( v1  v2  v3  v4  v5 )

m = 5 and n = 6 (m < n):
( d   u1  u1  u1  u1  u1 )
( e   d   u2  u2  u2  u2 )
( v1  e   d   u3  u3  u3 )
( v1  v2  e   d   u4  u4 )
( v1  v2  v3  e   d   u5 )

where d and e denote diagonal and off-diagonal elements of B, vi denotes an element of the vector defining H(i), and ui an element of the vector defining G(i).

p?ormbr
Multiplies a general matrix by one of the orthogonal matrices from a reduction to bidiagonal form determined by p?gebrd.

Syntax
call psormbr(vect, side, trans, m, n, k, a, ia, ja, desca, tau, c, ic, jc, descc, work, lwork, info)
call pdormbr(vect, side, trans, m, n, k, a, ia, ja, desca, tau, c, ic, jc, descc, work, lwork, info)

Include Files
• C: mkl_scalapack.h

Description
If vect = 'Q', the p?ormbr routine overwrites the general real distributed m-by-n matrix sub(C) = C(ic:ic+m-1, jc:jc+n-1) with

               side = 'L'      side = 'R'
trans = 'N':   Q*sub(C)        sub(C)*Q
trans = 'T':   Q^T*sub(C)      sub(C)*Q^T

If vect = 'P', the routine overwrites sub(C) with

               side = 'L'      side = 'R'
trans = 'N':   P*sub(C)        sub(C)*P
trans = 'T':   P^T*sub(C)      sub(C)*P^T

Here Q and P^T are the orthogonal distributed matrices determined by p?gebrd when reducing a real distributed matrix A(ia:*, ja:*) to bidiagonal form: A(ia:*, ja:*) = Q*B*P^T. Q and P^T are defined as products of elementary reflectors H(i) and G(i) respectively.
Let nq = m if side = 'L' and nq = n if side = 'R'. Thus nq is the order of the orthogonal matrix Q or P^T that is applied.
If vect = 'Q', A(ia:*, ja:*) is assumed to have been an nq-by-k matrix:
If nq ≥ k, Q = H(1) H(2)...H(k);
If nq < k, Q = H(1) H(2)...H(nq-1).
If vect = 'P', A(ia:*, ja:*) is assumed to have been a k-by-nq matrix:
If k < nq, P = G(1) G(2)...G(k);
If k ≥ nq, P = G(1) G(2)...G(nq-1).

Input Parameters
vect (global) CHARACTER. If vect = 'Q', then Q or Q^T is applied. If vect = 'P', then P or P^T is applied.
side (global) CHARACTER. If side = 'L', then Q or Q^T, P or P^T is applied from the left. If side = 'R', then Q or Q^T, P or P^T is applied from the right.
trans (global) CHARACTER. If trans = 'N', no transpose, Q or P is applied. If trans = 'T', then Q^T or P^T is applied.
m (global) INTEGER. The number of rows in the distributed matrix sub(C).
n (global) INTEGER. The number of columns in the distributed matrix sub(C).
k (global) INTEGER. If vect = 'Q', the number of columns in the original distributed matrix reduced by p?gebrd; if vect = 'P', the number of rows in the original distributed matrix reduced by p?gebrd. Constraints: k ≥ 0.
a (local) REAL for psormbr, DOUBLE PRECISION for pdormbr. Pointer into the local memory to an array of dimension (lld_a, LOCc(ja+min(nq,k)-1)) if vect = 'Q', and (lld_a, LOCc(ja+nq-1)) if vect = 'P'. nq = m if side = 'L', and nq = n otherwise. The vectors which define the elementary reflectors H(i) and G(i), whose products determine the matrices Q and P, as returned by p?gebrd. If vect = 'Q', lld_a ≥ max(1, LOCr(ia+nq-1)); if vect = 'P', lld_a ≥ max(1, LOCr(ia+min(nq, k)-1)).
ia, ja (global) INTEGER. The row and column indices in the global array a indicating the first row and the first column of the submatrix A, respectively.
desca (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix A.
tau (local) REAL for psormbr, DOUBLE PRECISION for pdormbr. Array, DIMENSION LOCc(ja+min(nq, k)-1) if vect = 'Q', and LOCr(ia+min(nq, k)-1) if vect = 'P'.
tau(i) must contain the scalar factor of the elementary reflector H(i) or G(i), which determines Q or P, as returned by p?gebrd in its array argument tauq or taup. tau is tied to the distributed matrix A.
c (local) REAL for psormbr, DOUBLE PRECISION for pdormbr. Pointer into the local memory to an array of dimension (lld_c, LOCc(jc+n-1)). Contains the local pieces of the distributed matrix sub(C).
ic, jc (global) INTEGER. The row and column indices in the global array c indicating the first row and the first column of the submatrix C, respectively.
descc (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix C.
work (local) REAL for psormbr, DOUBLE PRECISION for pdormbr. Workspace array of dimension lwork.
lwork (local or global) INTEGER, dimension of work, must be at least:
If side = 'L',
  nq = m;
  if ((vect = 'Q' and nq ≥ k) or (vect is not equal to 'Q' and nq > k)),
    iaa = ia; jaa = ja; mi = m; ni = n; icc = ic; jcc = jc;
  else
    iaa = ia+1; jaa = ja; mi = m-1; ni = n; icc = ic+1; jcc = jc;
  end if
else if side = 'R',
  nq = n;
  if ((vect = 'Q' and nq ≥ k) or (vect is not equal to 'Q' and nq > k)),
    iaa = ia; jaa = ja; mi = m; ni = n; icc = ic; jcc = jc;
  else
    iaa = ia; jaa = ja+1; mi = m; ni = n-1; icc = ic; jcc = jc+1;
  end if
end if
If vect = 'Q',
  if side = 'L',
    lwork = max((nb_a*(nb_a-1))/2, (nqc0 + mpc0)*nb_a) + nb_a*nb_a
  else if side = 'R',
    lwork = max((nb_a*(nb_a-1))/2, (nqc0 + max(npa0 + numroc(numroc(ni+icoffc, nb_a, 0, 0, NPCOL), nb_a, 0, 0, lcmq), mpc0))*nb_a) + nb_a*nb_a
  end if
else if vect is not equal to 'Q',
  if side = 'L',
    lwork = max((mb_a*(mb_a-1))/2, (mpc0 + max(mqa0 + numroc(numroc(mi+iroffc, mb_a, 0, 0, NPROW), mb_a, 0, 0, lcmp), nqc0))*mb_a) + mb_a*mb_a
  else if side = 'R',
    lwork = max((mb_a*(mb_a-1))/2, (mpc0 + nqc0)*mb_a) + mb_a*mb_a
  end if
end if
where lcmp = lcm/NPROW, lcmq = lcm/NPCOL, with lcm = ilcm(NPROW, NPCOL),
iroffa = mod(iaa-1, mb_a), icoffa = mod(jaa-1, nb_a),
iarow = indxg2p(iaa, mb_a, MYROW, rsrc_a, NPROW),
iacol = indxg2p(jaa, nb_a, MYCOL, csrc_a, NPCOL),
mqa0 = numroc(mi+icoffa, nb_a, MYCOL, iacol, NPCOL),
npa0 = numroc(ni+iroffa, mb_a, MYROW, iarow, NPROW),
iroffc = mod(icc-1, mb_c), icoffc = mod(jcc-1, nb_c),
icrow = indxg2p(icc, mb_c, MYROW, rsrc_c, NPROW),
iccol = indxg2p(jcc, nb_c, MYCOL, csrc_c, NPCOL),
mpc0 = numroc(mi+iroffc, mb_c, MYROW, icrow, NPROW),
nqc0 = numroc(ni+icoffc, nb_c, MYCOL, iccol, NPCOL),
ilcm, indxg2p and numroc are ScaLAPACK tool functions; MYROW, MYCOL, NPROW and NPCOL can be determined by calling the subroutine blacs_gridinfo.
If lwork = -1, then lwork is global input and a workspace query is assumed; the routine only calculates the minimum and optimal size for all work arrays. Each of these values is returned in the first entry of the corresponding work array, and no error message is issued by pxerbla.

Output Parameters
c On exit, if vect = 'Q', sub(C) is overwritten by Q*sub(C), or Q'*sub(C), or sub(C)*Q', or sub(C)*Q; if vect = 'P', sub(C) is overwritten by P*sub(C), or P'*sub(C), or sub(C)*P, or sub(C)*P'.
work(1) On exit work(1) contains the minimum value of lwork required for optimum performance.
info (global) INTEGER.
= 0: the execution is successful.
< 0: if the i-th argument is an array and the j-th entry had an illegal value, then info = -(i*100+j); if the i-th argument is a scalar and had an illegal value, then info = -i.

p?unmbr
Multiplies a general matrix by one of the unitary transformation matrices from a reduction to bidiagonal form determined by p?gebrd.
Syntax
call pcunmbr(vect, side, trans, m, n, k, a, ia, ja, desca, tau, c, ic, jc, descc, work, lwork, info)
call pzunmbr(vect, side, trans, m, n, k, a, ia, ja, desca, tau, c, ic, jc, descc, work, lwork, info)

Include Files
• C: mkl_scalapack.h

Description
If vect = 'Q', the p?unmbr routine overwrites the general complex distributed m-by-n matrix sub(C) = C(ic:ic+m-1, jc:jc+n-1) with

               side = 'L'      side = 'R'
trans = 'N':   Q*sub(C)        sub(C)*Q
trans = 'C':   Q^H*sub(C)      sub(C)*Q^H

If vect = 'P', the routine overwrites sub(C) with

               side = 'L'      side = 'R'
trans = 'N':   P*sub(C)        sub(C)*P
trans = 'C':   P^H*sub(C)      sub(C)*P^H

Here Q and P^H are the unitary distributed matrices determined by p?gebrd when reducing a complex distributed matrix A(ia:*, ja:*) to bidiagonal form: A(ia:*, ja:*) = Q*B*P^H. Q and P^H are defined as products of elementary reflectors H(i) and G(i) respectively.
Let nq = m if side = 'L' and nq = n if side = 'R'. Thus nq is the order of the unitary matrix Q or P^H that is applied.
If vect = 'Q', A(ia:*, ja:*) is assumed to have been an nq-by-k matrix:
If nq ≥ k, Q = H(1) H(2)... H(k);
If nq < k, Q = H(1) H(2)... H(nq-1).
If vect = 'P', A(ia:*, ja:*) is assumed to have been a k-by-nq matrix:
If k < nq, P = G(1) G(2)... G(k);
If k ≥ nq, P = G(1) G(2)... G(nq-1).

Input Parameters
vect (global) CHARACTER. If vect = 'Q', then Q or Q^H is applied. If vect = 'P', then P or P^H is applied.
side (global) CHARACTER. If side = 'L', then Q or Q^H, P or P^H is applied from the left. If side = 'R', then Q or Q^H, P or P^H is applied from the right.
trans (global) CHARACTER. If trans = 'N', no transpose, Q or P is applied. If trans = 'C', conjugate transpose, Q^H or P^H is applied.
m (global) INTEGER. The number of rows in the distributed matrix sub(C) (m ≥ 0).
n (global) INTEGER. The number of columns in the distributed matrix sub(C) (n ≥ 0).
k (global) INTEGER.
If vect = 'Q', the number of columns in the original distributed matrix reduced by p?gebrd; if vect = 'P', the number of rows in the original distributed matrix reduced by p?gebrd. Constraints: k ≥ 0.
a (local) COMPLEX for pcunmbr, DOUBLE COMPLEX for pzunmbr. Pointer into the local memory to an array of dimension (lld_a, LOCc(ja+min(nq,k)-1)) if vect = 'Q', and (lld_a, LOCc(ja+nq-1)) if vect = 'P'. nq = m if side = 'L', and nq = n otherwise. The vectors which define the elementary reflectors H(i) and G(i), whose products determine the matrices Q and P, as returned by p?gebrd. If vect = 'Q', lld_a ≥ max(1, LOCr(ia+nq-1)); if vect = 'P', lld_a ≥ max(1, LOCr(ia+min(nq, k)-1)).
ia, ja (global) INTEGER. The row and column indices in the global array a indicating the first row and the first column of the submatrix A, respectively.
desca (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix A.
tau (local) COMPLEX for pcunmbr, DOUBLE COMPLEX for pzunmbr. Array, DIMENSION LOCc(ja+min(nq, k)-1) if vect = 'Q', and LOCr(ia+min(nq, k)-1) if vect = 'P'. tau(i) must contain the scalar factor of the elementary reflector H(i) or G(i), which determines Q or P, as returned by p?gebrd in its array argument tauq or taup. tau is tied to the distributed matrix A.
c (local) COMPLEX for pcunmbr, DOUBLE COMPLEX for pzunmbr. Pointer into the local memory to an array of dimension (lld_c, LOCc(jc+n-1)). Contains the local pieces of the distributed matrix sub(C).
ic, jc (global) INTEGER. The row and column indices in the global array c indicating the first row and the first column of the submatrix C, respectively.
descc (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix C.
work (local) COMPLEX for pcunmbr, DOUBLE COMPLEX for pzunmbr. Workspace array of dimension lwork.
lwork (local or global) INTEGER, dimension of work, must be at least:
If side = 'L',
  nq = m;
  if ((vect = 'Q' and nq ≥ k) or (vect is not equal to 'Q' and nq > k)),
    iaa = ia; jaa = ja; mi = m; ni = n; icc = ic; jcc = jc;
  else
    iaa = ia+1; jaa = ja; mi = m-1; ni = n; icc = ic+1; jcc = jc;
  end if
else if side = 'R',
  nq = n;
  if ((vect = 'Q' and nq ≥ k) or (vect is not equal to 'Q' and nq > k)),
    iaa = ia; jaa = ja; mi = m; ni = n; icc = ic; jcc = jc;
  else
    iaa = ia; jaa = ja+1; mi = m; ni = n-1; icc = ic; jcc = jc+1;
  end if
end if
If vect = 'Q',
  if side = 'L',
    lwork = max((nb_a*(nb_a-1))/2, (nqc0+mpc0)*nb_a) + nb_a*nb_a
  else if side = 'R',
    lwork = max((nb_a*(nb_a-1))/2, (nqc0 + max(npa0+numroc(numroc(ni+icoffc, nb_a, 0, 0, NPCOL), nb_a, 0, 0, lcmq), mpc0))*nb_a) + nb_a*nb_a
  end if
else if vect is not equal to 'Q',
  if side = 'L',
    lwork = max((mb_a*(mb_a-1))/2, (mpc0 + max(mqa0+numroc(numroc(mi+iroffc, mb_a, 0, 0, NPROW), mb_a, 0, 0, lcmp), nqc0))*mb_a) + mb_a*mb_a
  else if side = 'R',
    lwork = max((mb_a*(mb_a-1))/2, (mpc0 + nqc0)*mb_a) + mb_a*mb_a
  end if
end if
where lcmp = lcm/NPROW, lcmq = lcm/NPCOL, with lcm = ilcm(NPROW, NPCOL),
iroffa = mod(iaa-1, mb_a), icoffa = mod(jaa-1, nb_a),
iarow = indxg2p(iaa, mb_a, MYROW, rsrc_a, NPROW),
iacol = indxg2p(jaa, nb_a, MYCOL, csrc_a, NPCOL),
mqa0 = numroc(mi+icoffa, nb_a, MYCOL, iacol, NPCOL),
npa0 = numroc(ni+iroffa, mb_a, MYROW, iarow, NPROW),
iroffc = mod(icc-1, mb_c), icoffc = mod(jcc-1, nb_c),
icrow = indxg2p(icc, mb_c, MYROW, rsrc_c, NPROW),
iccol = indxg2p(jcc, nb_c, MYCOL, csrc_c, NPCOL),
mpc0 = numroc(mi+iroffc, mb_c, MYROW, icrow, NPROW),
nqc0 = numroc(ni+icoffc, nb_c, MYCOL, iccol, NPCOL),
ilcm, indxg2p and numroc are ScaLAPACK tool functions; MYROW, MYCOL, NPROW and NPCOL can be determined by calling the subroutine blacs_gridinfo.
If lwork = -1, then lwork is global input and a workspace query is assumed; the routine only calculates the minimum and optimal size for all work arrays.
Each of these values is returned in the first entry of the corresponding work array, and no error message is issued by pxerbla.

Output Parameters
c On exit, if vect = 'Q', sub(C) is overwritten by Q*sub(C), or Q'*sub(C), or sub(C)*Q', or sub(C)*Q; if vect = 'P', sub(C) is overwritten by P*sub(C), or P'*sub(C), or sub(C)*P, or sub(C)*P'.
work(1) On exit work(1) contains the minimum value of lwork required for optimum performance.
info (global) INTEGER.
= 0: the execution is successful.
< 0: if the i-th argument is an array and the j-th entry had an illegal value, then info = -(i*100+j); if the i-th argument is a scalar and had an illegal value, then info = -i.

Generalized Symmetric-Definite Eigen Problems
This section describes ScaLAPACK routines that allow you to reduce the generalized symmetric-definite eigenvalue problems (see Generalized Symmetric-Definite Eigenvalue Problems in LAPACK chapters) to the standard symmetric eigenvalue problem Cy = λy, which you can solve by calling ScaLAPACK routines described earlier in this chapter (see Symmetric Eigenproblems).
Table "Computational Routines for Reducing Generalized Eigenproblems to Standard Problems" lists these routines.

Computational Routines for Reducing Generalized Eigenproblems to Standard Problems
Operation                     Real symmetric matrices   Complex Hermitian matrices
Reduce to standard problems   p?sygst                   p?hegst

p?sygst
Reduces a real symmetric-definite generalized eigenvalue problem to the standard form.

Syntax
call pssygst(ibtype, uplo, n, a, ia, ja, desca, b, ib, jb, descb, scale, info)
call pdsygst(ibtype, uplo, n, a, ia, ja, desca, b, ib, jb, descb, scale, info)

Include Files
• C: mkl_scalapack.h

Description
The p?sygst routine reduces real symmetric-definite generalized eigenproblems to the standard form.
In the following, sub(A) denotes A(ia:ia+n-1, ja:ja+n-1) and sub(B) denotes B(ib:ib+n-1, jb:jb+n-1).
If ibtype = 1, the problem is
sub(A)*x = λ*sub(B)*x,
and sub(A) is overwritten by inv(U^T)*sub(A)*inv(U), or inv(L)*sub(A)*inv(L^T).
If ibtype = 2 or 3, the problem is
sub(A)*sub(B)*x = λ*x, or sub(B)*sub(A)*x = λ*x,
and sub(A) is overwritten by U*sub(A)*U^T, or L^T*sub(A)*L.
sub(B) must have been previously factorized as U^T*U or L*L^T by p?potrf.

Input Parameters
ibtype (global) INTEGER. Must be 1 or 2 or 3. If ibtype = 1, compute inv(U^T)*sub(A)*inv(U), or inv(L)*sub(A)*inv(L^T); if ibtype = 2 or 3, compute U*sub(A)*U^T, or L^T*sub(A)*L.
uplo (global) CHARACTER. Must be 'U' or 'L'. If uplo = 'U', the upper triangle of sub(A) is stored and sub(B) is factored as U^T*U. If uplo = 'L', the lower triangle of sub(A) is stored and sub(B) is factored as L*L^T.
n (global) INTEGER. The order of the matrices sub(A) and sub(B) (n ≥ 0).
a (local) REAL for pssygst, DOUBLE PRECISION for pdsygst. Pointer into the local memory to an array of dimension (lld_a, LOCc(ja+n-1)). On entry, the array contains the local pieces of the n-by-n symmetric distributed matrix sub(A). If uplo = 'U', the leading n-by-n upper triangular part of sub(A) contains the upper triangular part of the matrix, and its strictly lower triangular part is not referenced. If uplo = 'L', the leading n-by-n lower triangular part of sub(A) contains the lower triangular part of the matrix, and its strictly upper triangular part is not referenced.
ia, ja (global) INTEGER. The row and column indices in the global array a indicating the first row and the first column of the submatrix A, respectively.
desca (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix A.
b (local) REAL for pssygst, DOUBLE PRECISION for pdsygst. Pointer into the local memory to an array of dimension (lld_b, LOCc(jb+n-1)). On entry, the array contains the local pieces of the triangular factor from the Cholesky factorization of sub(B) as returned by p?potrf.
ib, jb (global) INTEGER. The row and column indices in the global array b indicating the first row and the first column of the submatrix B, respectively.
descb (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix B.

Output Parameters
a On exit, if info = 0, the transformed matrix, stored in the same format as sub(A).
scale (global) REAL for pssygst, DOUBLE PRECISION for pdsygst. Amount by which the eigenvalues should be scaled to compensate for the scaling performed in this routine. At present, scale is always returned as 1.0; it is returned here to allow for future enhancement.
info (global) INTEGER.
If info = 0, the execution is successful.
If info < 0: if the i-th argument is an array and the j-th entry had an illegal value, then info = -(i*100+j); if the i-th argument is a scalar and had an illegal value, then info = -i.

p?hegst
Reduces a Hermitian-definite generalized eigenvalue problem to the standard form.

Syntax
call pchegst(ibtype, uplo, n, a, ia, ja, desca, b, ib, jb, descb, scale, info)
call pzhegst(ibtype, uplo, n, a, ia, ja, desca, b, ib, jb, descb, scale, info)

Include Files
• C: mkl_scalapack.h

Description
The p?hegst routine reduces complex Hermitian-definite generalized eigenproblems to the standard form.
In the following, sub(A) denotes A(ia:ia+n-1, ja:ja+n-1) and sub(B) denotes B(ib:ib+n-1, jb:jb+n-1).
If ibtype = 1, the problem is
sub(A)*x = λ*sub(B)*x,
and sub(A) is overwritten by inv(U^H)*sub(A)*inv(U), or inv(L)*sub(A)*inv(L^H).
If ibtype = 2 or 3, the problem is
sub(A)*sub(B)*x = λ*x, or sub(B)*sub(A)*x = λ*x,
and sub(A) is overwritten by U*sub(A)*U^H, or L^H*sub(A)*L.
sub(B) must have been previously factorized as U^H*U or L*L^H by p?potrf.

Input Parameters
ibtype (global) INTEGER. Must be 1 or 2 or 3. If ibtype = 1, compute inv(U^H)*sub(A)*inv(U), or inv(L)*sub(A)*inv(L^H); if ibtype = 2 or 3, compute U*sub(A)*U^H, or L^H*sub(A)*L.
uplo (global) CHARACTER. Must be 'U' or 'L'.
If uplo = 'U', the upper triangle of sub(A) is stored and sub(B) is factored as U^H*U. If uplo = 'L', the lower triangle of sub(A) is stored and sub(B) is factored as L*L^H.
n (global) INTEGER. The order of the matrices sub(A) and sub(B) (n ≥ 0).
a (local) COMPLEX for pchegst, DOUBLE COMPLEX for pzhegst. Pointer into the local memory to an array of dimension (lld_a, LOCc(ja+n-1)). On entry, the array contains the local pieces of the n-by-n Hermitian distributed matrix sub(A). If uplo = 'U', the leading n-by-n upper triangular part of sub(A) contains the upper triangular part of the matrix, and its strictly lower triangular part is not referenced. If uplo = 'L', the leading n-by-n lower triangular part of sub(A) contains the lower triangular part of the matrix, and its strictly upper triangular part is not referenced.
ia, ja (global) INTEGER. The row and column indices in the global array a indicating the first row and the first column of the submatrix A, respectively.
desca (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix A.
b (local) COMPLEX for pchegst, DOUBLE COMPLEX for pzhegst. Pointer into the local memory to an array of dimension (lld_b, LOCc(jb+n-1)). On entry, the array contains the local pieces of the triangular factor from the Cholesky factorization of sub(B) as returned by p?potrf.
ib, jb (global) INTEGER. The row and column indices in the global array b indicating the first row and the first column of the submatrix B, respectively.
descb (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix B.

Output Parameters
a On exit, if info = 0, the transformed matrix, stored in the same format as sub(A).
scale (global) REAL for pchegst, DOUBLE PRECISION for pzhegst. Amount by which the eigenvalues should be scaled to compensate for the scaling performed in this routine.
At present, scale is always returned as 1.0; it is returned here to allow for future enhancement.
info (global) INTEGER.
If info = 0, the execution is successful.
If info < 0: if the i-th argument is an array and the j-th entry had an illegal value, then info = -(i*100+j); if the i-th argument is a scalar and had an illegal value, then info = -i.

Driver Routines
Table "ScaLAPACK Driver Routines" lists ScaLAPACK driver routines available for solving systems of linear equations, linear least-squares problems, standard eigenvalue and singular value problems, and generalized symmetric definite eigenproblems.

ScaLAPACK Driver Routines
(Type of problem, then one row per matrix type and storage scheme, followed by the driver)

Linear equations
  general (partial pivoting): p?gesv (simple driver), p?gesvx (expert driver)
  general band (partial pivoting): p?gbsv (simple driver)
  general band (no pivoting): p?dbsv (simple driver)
  general tridiagonal (no pivoting): p?dtsv (simple driver)
  symmetric/Hermitian positive-definite: p?posv (simple driver), p?posvx (expert driver)
  symmetric/Hermitian positive-definite, band: p?pbsv (simple driver)
  symmetric/Hermitian positive-definite, tridiagonal: p?ptsv (simple driver)
Linear least squares problem
  general m-by-n: p?gels
Symmetric eigenvalue problem
  symmetric/Hermitian: p?syev / p?heev (simple driver); p?syevd / p?heevd (simple driver with a divide and conquer algorithm); p?syevx / p?heevx (expert driver)
Singular value decomposition
  general m-by-n: p?gesvd
Generalized symmetric definite eigenvalue problem
  symmetric/Hermitian, one matrix also positive-definite: p?sygvx / p?hegvx (expert driver)

p?gesv
Computes the solution to the system of linear equations with a square distributed matrix and multiple right-hand sides.

Syntax
call psgesv(n, nrhs, a, ia, ja, desca, ipiv, b, ib, jb, descb, info)
call pdgesv(n, nrhs, a, ia, ja, desca, ipiv, b, ib, jb, descb, info)
call pcgesv(n, nrhs, a, ia, ja, desca, ipiv, b, ib, jb, descb, info)
call pzgesv(n, nrhs, a, ia, ja, desca, ipiv, b, ib, jb, descb, info)

Include Files
• C: mkl_scalapack.h

Description
The p?gesv routine computes the solution to a real or complex system of linear equations sub(A)*X = sub(B), where sub(A) = A(ia:ia+n-1, ja:ja+n-1) is an n-by-n distributed matrix and X and sub(B) = B(ib:ib+n-1, jb:jb+nrhs-1) are n-by-nrhs distributed matrices.
The LU decomposition with partial pivoting and row interchanges is used to factor sub(A) as sub(A) = P*L*U, where P is a permutation matrix, L is unit lower triangular, and U is upper triangular. L and U are stored in sub(A). The factored form of sub(A) is then used to solve the system of equations sub(A)*X = sub(B).

Input Parameters
n (global) INTEGER. The number of rows and columns to be operated on, that is, the order of the distributed submatrix sub(A) (n ≥ 0).
nrhs (global) INTEGER. The number of right-hand sides, that is, the number of columns of the distributed submatrices B and X (nrhs ≥ 0).
a, b (local) REAL for psgesv
DOUBLE PRECISION for pdgesv
COMPLEX for pcgesv
DOUBLE COMPLEX for pzgesv.
Pointers into the local memory to arrays of local dimension a(lld_a,LOCc(ja+n-1)) and b(lld_b,LOCc(jb+nrhs-1)), respectively.
On entry, the array a contains the local pieces of the n-by-n distributed matrix sub(A) to be factored.
On entry, the array b contains the right-hand side distributed matrix sub(B).
ia, ja (global) INTEGER. The row and column indices in the global array A indicating the first row and the first column of sub(A), respectively.
desca (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix A.
ib, jb (global) INTEGER.
The row and column indices in the global array B indicating the first row and the first column of sub(B), respectively.
descb (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix B.

Output Parameters
a Overwritten by the factors L and U from the factorization sub(A) = P*L*U; the unit diagonal elements of L are not stored.
b Overwritten by the solution distributed matrix X.
ipiv (local) INTEGER array. The dimension of ipiv is (LOCr(m_a)+mb_a). This array contains the pivoting information. The (local) row i of the matrix was interchanged with the (global) row ipiv(i). This array is tied to the distributed matrix A.
info (global) INTEGER.
If info = 0, the execution is successful.
info < 0: if the i-th argument is an array and the j-th entry had an illegal value, then info = -(i*100+j); if the i-th argument is a scalar and had an illegal value, then info = -i.
info > 0: if info = k, U(ia+k-1,ja+k-1) is exactly zero. The factorization has been completed, but the factor U is exactly singular, so the solution could not be computed.

p?gesvx
Uses the LU factorization to compute the solution to the system of linear equations with a square matrix A and multiple right-hand sides, and provides error bounds on the solution.
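
The info convention described for p?gesv above is shared by p?gesvx and the other drivers in this section: zero means success, a negative value identifies an illegal argument, and a positive value reports a routine-specific computational failure. The following Python sketch shows one way a caller might decode such a value. The function name decode_info and the returned labels are hypothetical and are not part of Intel MKL; the sketch also assumes the array entry index j is below 100, so that -(i*100+j) can be split unambiguously.

```python
def decode_info(info):
    """Split a ScaLAPACK-style info value into a (status, detail) pair.

    Hypothetical helper for illustration only; it is not an MKL routine.
    """
    if info == 0:
        return ("success", None)
    if info > 0:
        # Positive values are routine-specific. For p?gesv, info = k means
        # that U(ia+k-1, ja+k-1) is exactly zero.
        return ("computational_failure", info)
    code = -info
    if code >= 100:
        # Array argument: info = -(i*100 + j), argument i, entry j.
        return ("illegal_array_entry", (code // 100, code % 100))
    # Scalar argument: info = -i.
    return ("illegal_scalar", code)


if __name__ == "__main__":
    for value in (0, -3, -602, 4):
        print(value, decode_info(value))
```

For example, info = -602 decodes to argument 6, entry 2; in the p?gesv calling sequence the sixth argument is desca, so this would point at the second entry of the descriptor.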

Syntax
call psgesvx(fact, trans, n, nrhs, a, ia, ja, desca, af, iaf, jaf, descaf, ipiv, equed, r, c, b, ib, jb, descb, x, ix, jx, descx, rcond, ferr, berr, work, lwork, iwork, liwork, info)
call pdgesvx(fact, trans, n, nrhs, a, ia, ja, desca, af, iaf, jaf, descaf, ipiv, equed, r, c, b, ib, jb, descb, x, ix, jx, descx, rcond, ferr, berr, work, lwork, iwork, liwork, info)
call pcgesvx(fact, trans, n, nrhs, a, ia, ja, desca, af, iaf, jaf, descaf, ipiv, equed, r, c, b, ib, jb, descb, x, ix, jx, descx, rcond, ferr, berr, work, lwork, rwork, lrwork, info)
call pzgesvx(fact, trans, n, nrhs, a, ia, ja, desca, af, iaf, jaf, descaf, ipiv, equed, r, c, b, ib, jb, descb, x, ix, jx, descx, rcond, ferr, berr, work, lwork, rwork, lrwork, info)

Include Files
• C: mkl_scalapack.h

Description
The p?gesvx routine uses the LU factorization to compute the solution to a real or complex system of linear equations A*X = B, where A denotes the n-by-n submatrix A(ia:ia+n-1, ja:ja+n-1), B denotes the n-by-nrhs submatrix B(ib:ib+n-1, jb:jb+nrhs-1), and X denotes the n-by-nrhs submatrix X(ix:ix+n-1, jx:jx+nrhs-1).
Error bounds on the solution and a condition estimate are also provided.
In the following description, af stands for the subarray af(iaf:iaf+n-1, jaf:jaf+n-1).
The routine p?gesvx performs the following steps:
1. If fact = 'E', real scaling factors R and C are computed to equilibrate the system:
trans = 'N': diag(R)*A*diag(C)*inv(diag(C))*X = diag(R)*B
trans = 'T': (diag(R)*A*diag(C))^T*inv(diag(R))*X = diag(C)*B
trans = 'C': (diag(R)*A*diag(C))^H*inv(diag(R))*X = diag(C)*B
Whether or not the system will be equilibrated depends on the scaling of the matrix A, but if equilibration is used, A is overwritten by diag(R)*A*diag(C) and B by diag(R)*B (if trans = 'N') or diag(C)*B (if trans = 'T' or 'C').
2.
If fact = 'N' or 'E', the LU decomposition is used to factor the matrix A (after equilibration if fact = 'E') as A = P*L*U, where P is a permutation matrix, L is a unit lower triangular matrix, and U is upper triangular.
3. The factored form of A is used to estimate the condition number of the matrix A. If the reciprocal of the condition number is less than relative machine precision, steps 4 - 6 are skipped.
4. The system of equations is solved for X using the factored form of A.
5. Iterative refinement is applied to improve the computed solution matrix and calculate error bounds and backward error estimates for it.
6. If equilibration was used, the matrix X is premultiplied by diag(C) (if trans = 'N') or diag(R) (if trans = 'T' or 'C') so that it solves the original system before equilibration.

Input Parameters
fact (global) CHARACTER*1. Must be 'F', 'N', or 'E'.
Specifies whether or not the factored form of the matrix A is supplied on entry, and if not, whether the matrix A should be equilibrated before it is factored.
If fact = 'F' then, on entry, af and ipiv contain the factored form of A. If equed is not 'N', the matrix A has been equilibrated with scaling factors given by r and c. Arrays a, af, and ipiv are not modified.
If fact = 'N', the matrix A is copied to af and factored.
If fact = 'E', the matrix A is equilibrated if necessary, then copied to af and factored.
trans (global) CHARACTER*1. Must be 'N', 'T', or 'C'.
Specifies the form of the system of equations:
If trans = 'N', the system has the form A*X = B (No transpose);
If trans = 'T', the system has the form A^T*X = B (Transpose);
If trans = 'C', the system has the form A^H*X = B (Conjugate transpose).
n (global) INTEGER. The number of linear equations; the order of the submatrix A (n ≥ 0).
nrhs (global) INTEGER. The number of right-hand sides; the number of columns of the distributed submatrices B and X (nrhs ≥ 0).
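
The equilibrate, factor, solve, and unscale sequence in steps 1, 2, 4, and 6 of the p?gesvx description can be illustrated on a small dense system for the trans = 'N' case. The Python sketch below is serial and is not the MKL implementation: the function names are hypothetical, the scale factors are simply reciprocals of the largest absolute entries (an assumption made for illustration, requiring that A has no zero row or column), and the condition estimate of step 3 and the iterative refinement of step 5 are omitted.

```python
def solve_lu(a, b):
    """Solve a*x = b by Gaussian elimination with partial pivoting.

    Dense, serial stand-in for the factor-and-solve steps; illustrative only.
    """
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented working copy
    for k in range(n):
        # Partial pivoting: bring the largest entry of column k to the diagonal.
        p = max(range(k, n), key=lambda i: abs(m[i][k]))
        if m[p][k] == 0.0:
            raise ZeroDivisionError("U(%d,%d) is exactly zero" % (k + 1, k + 1))
        m[k], m[p] = m[p], m[k]
        for i in range(k + 1, n):
            f = m[i][k] / m[k][k]
            for j in range(k, n + 1):
                m[i][j] -= f * m[k][j]
    # Back substitution on the upper triangular factor.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def solve_equilibrated(a, b):
    """Solve a*x = b through the scaled system (trans = 'N' case)."""
    n = len(a)
    # Step 1: row factors r, then column factors c of the row-scaled matrix.
    r = [1.0 / max(abs(v) for v in row) for row in a]
    c = [1.0 / max(abs(r[i] * a[i][j]) for i in range(n)) for j in range(n)]
    a_eq = [[r[i] * a[i][j] * c[j] for j in range(n)] for i in range(n)]
    b_eq = [r[i] * b[i] for i in range(n)]
    # Steps 2 and 4: factor and solve (diag(r)*A*diag(c)) * y = diag(r)*b.
    y = solve_lu(a_eq, b_eq)
    # Step 6: x = diag(c) * y solves the original system.
    return [c[j] * y[j] for j in range(n)]


if __name__ == "__main__":
    a = [[2.0e6, 1.0e6], [1.0, 3.0]]  # badly row-scaled matrix
    b = [4.0e6, 7.0]                  # chosen so that the solution is x = (1, 2)
    print(solve_equilibrated(a, b))
```

The badly scaled first row is brought to the same magnitude as the second before elimination, which is the purpose of equilibration; the final multiplication by the column factors maps the solution of the scaled system back to the original unknowns.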
a, af, b, work (local) REAL for psgesvx
DOUBLE PRECISION for pdgesvx
COMPLEX for pcgesvx
DOUBLE COMPLEX for pzgesvx.
Pointers into the local memory to arrays of local dimension a(lld_a,LOCc(ja+n-1)), af(lld_af,LOCc(ja+n-1)), b(lld_b,LOCc(jb+nrhs-1)), work(lwork), respectively.
The array a contains the matrix A. If fact = 'F' and equed is not 'N', then A must have been equilibrated by the scaling factors in r and/or c.
The array af is an input argument if fact = 'F'. In this case it contains on entry the factored form of the matrix A, that is, the factors L and U from the factorization A = P*L*U as computed by p?getrf. If equed is not 'N', then af is the factored form of the equilibrated matrix A.
The array b contains on entry the matrix B whose columns are the right-hand sides for the systems of equations.
work(*) is a workspace array. The dimension of work is (lwork).
ia, ja (global) INTEGER. The row and column indices in the global array A indicating the first row and the first column of the submatrix A(ia:ia+n-1, ja:ja+n-1), respectively.
desca (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix A.
iaf, jaf (global) INTEGER. The row and column indices in the global array af indicating the first row and the first column of the subarray af(iaf:iaf+n-1, jaf:jaf+n-1), respectively.
descaf (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix AF.
ib, jb (global) INTEGER. The row and column indices in the global array B indicating the first row and the first column of the submatrix B(ib:ib+n-1, jb:jb+nrhs-1), respectively.
descb (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix B.
ipiv (local) INTEGER array. The dimension of ipiv is (LOCr(m_a)+mb_a).
The array ipiv is an input argument if fact = 'F'.
On entry, it contains the pivot indices from the factorization A = P*L*U as computed by p?getrf; (local) row i of the matrix was interchanged with the (global) row ipiv(i).
This array must be aligned with A(ia:ia+n-1, *).
equed (global) CHARACTER*1. Must be 'N', 'R', 'C', or 'B'.
equed is an input argument if fact = 'F'. It specifies the form of equilibration that was done:
If equed = 'N', no equilibration was done (always true if fact = 'N');
If equed = 'R', row equilibration was done, that is, A has been premultiplied by diag(r);
If equed = 'C', column equilibration was done, that is, A has been postmultiplied by diag(c);
If equed = 'B', both row and column equilibration was done; A has been replaced by diag(r)*A*diag(c).
r, c (local) REAL for single precision flavors;
DOUBLE PRECISION for double precision flavors.
Arrays, dimension LOCr(m_a) and LOCc(n_a), respectively.
The array r contains the row scale factors for A, and the array c contains the column scale factors for A. These arrays are input arguments if fact = 'F' only; otherwise they are output arguments.
If equed = 'R' or 'B', A is multiplied on the left by diag(r); if equed = 'N' or 'C', r is not accessed.
If fact = 'F' and equed = 'R' or 'B', each element of r must be positive.
If equed = 'C' or 'B', A is multiplied on the right by diag(c); if equed = 'N' or 'R', c is not accessed.
If fact = 'F' and equed = 'C' or 'B', each element of c must be positive.
Array r is replicated in every process column, and is aligned with the distributed matrix A. Array c is replicated in every process row, and is aligned with the distributed matrix A.
ix, jx (global) INTEGER. The row and column indices in the global array X indicating the first row and the first column of the submatrix X(ix:ix+n-1, jx:jx+nrhs-1), respectively.
descx (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix X.
lwork (local or global) INTEGER.
The dimension of the array work; must be at least max(p?gecon(lwork), p?gerfs(lwork))+LOCr(n_a).
iwork (local, psgesvx/pdgesvx only) INTEGER. Workspace array. The dimension of iwork is (liwork).
liwork (local, psgesvx/pdgesvx only) INTEGER. The dimension of the array iwork; must be at least LOCr(n_a).
rwork (local) REAL for pcgesvx
DOUBLE PRECISION for pzgesvx.
Workspace array, used in complex flavors only. The dimension of rwork is (lrwork).
lrwork (local or global, pcgesvx/pzgesvx only) INTEGER. The dimension of the array rwork; must be at least 2*LOCc(n_a).

Output Parameters
x (local) REAL for psgesvx
DOUBLE PRECISION for pdgesvx
COMPLEX for pcgesvx
DOUBLE COMPLEX for pzgesvx.
Pointer into the local memory to an array of local dimension x(lld_x,LOCc(jx+nrhs-1)).
If info = 0, the array x contains the solution matrix X to the original system of equations. Note that A and B are modified on exit if equed ≠ 'N', and the solution to the equilibrated system is:
inv(diag(C))*X, if trans = 'N' and equed = 'C' or 'B'; and
inv(diag(R))*X, if trans = 'T' or 'C' and equed = 'R' or 'B'.
a Array a is not modified on exit if fact = 'F' or 'N', or if fact = 'E' and equed = 'N'.
If equed ≠ 'N', A is scaled on exit as follows:
equed = 'R': A = diag(R)*A
equed = 'C': A = A*diag(C)
equed = 'B': A = diag(R)*A*diag(C)
af If fact = 'N' or 'E', then af is an output argument and on exit returns the factors L and U from the factorization A = P*L*U of the original matrix A (if fact = 'N') or of the equilibrated matrix A (if fact = 'E'). See the description of a for the form of the equilibrated matrix.
b Overwritten by diag(R)*B if trans = 'N' and equed = 'R' or 'B';
overwritten by diag(C)*B if trans = 'T' or 'C' and equed = 'C' or 'B';
not changed if equed = 'N'.
r, c These arrays are output arguments if fact ≠ 'F'. See the description of r, c in Input Arguments section.
rcond (global) REAL for single precision flavors.
DOUBLE PRECISION for double precision flavors.
An estimate of the reciprocal condition number of the matrix A after equilibration (if done). The routine sets rcond = 0 if the estimate underflows; in this case the matrix is singular (to working precision). However, anytime rcond is small compared to 1.0, for the working precision, the matrix may be poorly conditioned or even singular.
ferr, berr (local) REAL for single precision flavors
DOUBLE PRECISION for double precision flavors.
Arrays, DIMENSION LOCc(n_b) each. Contain the component-wise forward and relative backward errors, respectively, for each solution vector.
Arrays ferr and berr are both replicated in every process row, and are aligned with the matrices B and X.
ipiv If fact = 'N' or 'E', then ipiv is an output argument and on exit contains the pivot indices from the factorization A = P*L*U of the original matrix A (if fact = 'N') or of the equilibrated matrix A (if fact = 'E').
equed If fact ≠ 'F', then equed is an output argument. It specifies the form of equilibration that was done (see the description of equed in Input Arguments section).
work(1) If info = 0, on exit work(1) returns the minimum value of lwork required for optimum performance.
iwork(1) If info = 0, on exit iwork(1) returns the minimum value of liwork required for optimum performance.
rwork(1) If info = 0, on exit rwork(1) returns the minimum value of lrwork required for optimum performance.
info INTEGER.
If info = 0, the execution is successful.
info < 0: if the i-th argument is an array and the j-th entry had an illegal value, then info = -(i*100+j); if the i-th argument is a scalar and had an illegal value, then info = -i.
If info = i, and i ≤ n, then U(i,i) is exactly zero. The factorization has been completed, but the factor U is exactly singular, so the solution and error bounds could not be computed.
If info = i, and i = n+1, then U is nonsingular, but rcond is less than machine precision.
The factorization has been completed, but the matrix is singular to working precision and the solution and error bounds have not been computed.

p?gbsv
Computes the solution to the system of linear equations with a general banded distributed matrix and multiple right-hand sides.

Syntax
call psgbsv(n, bwl, bwu, nrhs, a, ja, desca, ipiv, b, ib, descb, work, lwork, info)
call pdgbsv(n, bwl, bwu, nrhs, a, ja, desca, ipiv, b, ib, descb, work, lwork, info)
call pcgbsv(n, bwl, bwu, nrhs, a, ja, desca, ipiv, b, ib, descb, work, lwork, info)
call pzgbsv(n, bwl, bwu, nrhs, a, ja, desca, ipiv, b, ib, descb, work, lwork, info)

Include Files
• C: mkl_scalapack.h

Description
The p?gbsv routine computes the solution to a real or complex system of linear equations sub(A)*X = sub(B), where sub(A) = A(1:n, ja:ja+n-1) is an n-by-n real/complex general banded distributed matrix with bwl subdiagonals and bwu superdiagonals, and X and sub(B) = B(ib:ib+n-1, 1:nrhs) are n-by-nrhs distributed matrices.
The LU decomposition with partial pivoting and row interchanges is used to factor sub(A) as sub(A) = P*L*U*Q, where P and Q are permutation matrices, and L and U are banded lower and upper triangular matrices, respectively. The matrix Q represents reordering of columns for the sake of parallelism, while P represents reordering of rows for numerical stability using classic partial pivoting.

Input Parameters
n (global) INTEGER. The number of rows and columns to be operated on, that is, the order of the distributed submatrix sub(A) (n ≥ 0).
bwl (global) INTEGER. The number of subdiagonals within the band of A (0 ≤ bwl ≤ n-1).
bwu (global) INTEGER. The number of superdiagonals within the band of A (0 ≤ bwu ≤ n-1).
nrhs (global) INTEGER. The number of right-hand sides; the number of columns of the distributed submatrix sub(B) (nrhs ≥ 0).
a, b (local) REAL for psgbsv
DOUBLE PRECISION for pdgbsv
COMPLEX for pcgbsv
DOUBLE COMPLEX for pzgbsv.
Pointers into the local memory to arrays of local dimension a(lld_a,LOCc(ja+n-1)) and b(lld_b,LOCc(nrhs)), respectively.
On entry, the array a contains the local pieces of the global array A.
On entry, the array b contains the right-hand side distributed matrix sub(B).
ja (global) INTEGER. The index in the global array A that points to the start of the matrix to be operated on (which may be either all of A or a submatrix of A).
desca (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix A.
If desca(dtype_) = 501, then dlen_ = 7;
else if desca(dtype_) = 1, then dlen_ = 9.
ib (global) INTEGER. The row index in the global array B that points to the first row of the matrix to be operated on (which may be either all of B or a submatrix of B).
descb (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix B.
If descb(dtype_) = 502, then dlen_ = 7;
else if descb(dtype_) = 1, then dlen_ = 9.
work (local) REAL for psgbsv
DOUBLE PRECISION for pdgbsv
COMPLEX for pcgbsv
DOUBLE COMPLEX for pzgbsv.
Workspace array of dimension (lwork).
lwork (local or global) INTEGER. The size of the array work; must be at least
(NB+bwu)*(bwl+bwu) + 6*(bwl+bwu)*(bwl+2*bwu) + max(nrhs*(NB+2*bwl+4*bwu), 1).

Output Parameters
a On exit, contains details of the factorization. Note that the resulting factorization is not the same factorization as returned from LAPACK. Additional permutations are performed on the matrix for the sake of parallelism.
b On exit, this array contains the local pieces of the solution distributed matrix X.
ipiv (local) INTEGER array. The dimension of ipiv must be at least desca(NB). This array contains pivot indices for local factorizations. You should not alter the contents between factorization and solve.
work(1) On exit, work(1) contains the minimum value of lwork required for optimum performance.
info INTEGER.
If info = 0, the execution is successful.
info < 0: if the i-th argument is an array and the j-th entry had an illegal value, then info = -(i*100+j); if the i-th argument is a scalar and had an illegal value, then info = -i.
info > 0: if info = k ≤ NPROCS, the submatrix stored on processor info and factored locally was singular, and the factorization was not completed.
If info = k > NPROCS, the submatrix stored on processor info-NPROCS representing interactions with other processors was singular, and the factorization was not completed.

p?dbsv
Solves a general band system of linear equations.

Syntax
call psdbsv(n, bwl, bwu, nrhs, a, ja, desca, b, ib, descb, work, lwork, info)
call pddbsv(n, bwl, bwu, nrhs, a, ja, desca, b, ib, descb, work, lwork, info)
call pcdbsv(n, bwl, bwu, nrhs, a, ja, desca, b, ib, descb, work, lwork, info)
call pzdbsv(n, bwl, bwu, nrhs, a, ja, desca, b, ib, descb, work, lwork, info)

Include Files
• C: mkl_scalapack.h

Description
The p?dbsv routine solves the following system of linear equations:
A(1:n, ja:ja+n-1)*X = B(ib:ib+n-1, 1:nrhs),
where A(1:n, ja:ja+n-1) is an n-by-n real/complex banded diagonally dominant-like distributed matrix with bandwidth bwl, bwu.
Gaussian elimination without pivoting is used to factor a reordering of the matrix into L*U.

Input Parameters
n (global) INTEGER. The order of the distributed submatrix A (n ≥ 0).
bwl (global) INTEGER. Number of subdiagonals. 0 ≤ bwl ≤ n-1.
bwu (global) INTEGER. Number of superdiagonals. 0 ≤ bwu ≤ n-1.
nrhs (global) INTEGER. The number of right-hand sides; the number of columns of the distributed submatrix B (nrhs ≥ 0).
a (local) REAL for psdbsv
DOUBLE PRECISION for pddbsv
COMPLEX for pcdbsv
DOUBLE COMPLEX for pzdbsv.
Pointer into the local memory to an array with leading dimension lld_a ≥ (bwl+bwu+1) (stored in desca). On entry, this array contains the local pieces of the distributed matrix.
ja (global) INTEGER.
The index in the global array a that points to the start of the matrix to be operated on (which may be either all of A or a submatrix of A).
desca (global and local) INTEGER array of dimension dlen.
If 1d type (dtype_a = 501 or 502), dlen ≥ 7;
If 2d type (dtype_a = 1), dlen ≥ 9.
The array descriptor for the distributed matrix A. Contains information of mapping of A to memory.
b (local) REAL for psdbsv
DOUBLE PRECISION for pddbsv
COMPLEX for pcdbsv
DOUBLE COMPLEX for pzdbsv.
Pointer into the local memory to an array of local lead dimension lld_b ≥ nb. On entry, this array contains the local pieces of the right-hand sides B(ib:ib+n-1, 1:nrhs).
ib (global) INTEGER. The row index in the global array b that points to the first row of the matrix to be operated on (which may be either all of b or a submatrix of B).
descb (global and local) INTEGER array of dimension dlen.
If 1d type (dtype_b = 502), dlen ≥ 7;
If 2d type (dtype_b = 1), dlen ≥ 9.
The array descriptor for the distributed matrix B. Contains information of mapping of B to memory.
work (local) REAL for psdbsv
DOUBLE PRECISION for pddbsv
COMPLEX for pcdbsv
DOUBLE COMPLEX for pzdbsv.
Temporary workspace. This space may be overwritten in between calls to routines. work must be the size given in lwork.
lwork (local or global) INTEGER. Size of user-input workspace work. If lwork is too small, the minimal acceptable size will be returned in work(1) and an error code is returned.
lwork ≥ nb*(bwl+bwu) + 6*max(bwl,bwu)*max(bwl,bwu) + max(max(bwl,bwu)*nrhs, max(bwl,bwu)*max(bwl,bwu))

Output Parameters
a On exit, this array contains details of the factorization. Note that permutations are performed on the matrix, so that the factors returned are different from those returned by LAPACK.
b On exit, this contains the local piece of the solutions distributed matrix X.
work On exit, work(1) contains the minimal lwork.
info (local) INTEGER.
If info = 0, the execution is successful.
info < 0: if the i-th argument is an array and the j-th entry had an illegal value, then info = -(i*100+j); if the i-th argument is a scalar and had an illegal value, then info = -i.
info > 0: if info = k ≤ NPROCS, the submatrix stored on processor info and factored locally was not positive definite, and the factorization was not completed.
If info = k > NPROCS, the submatrix stored on processor info-NPROCS representing interactions with other processors was not positive definite, and the factorization was not completed.

p?dtsv
Solves a general tridiagonal system of linear equations.

Syntax
call psdtsv(n, nrhs, dl, d, du, ja, desca, b, ib, descb, work, lwork, info)
call pddtsv(n, nrhs, dl, d, du, ja, desca, b, ib, descb, work, lwork, info)
call pcdtsv(n, nrhs, dl, d, du, ja, desca, b, ib, descb, work, lwork, info)
call pzdtsv(n, nrhs, dl, d, du, ja, desca, b, ib, descb, work, lwork, info)

Include Files
• C: mkl_scalapack.h

Description
The routine solves a system of linear equations
A(1:n, ja:ja+n-1)*X = B(ib:ib+n-1, 1:nrhs),
where A(1:n, ja:ja+n-1) is an n-by-n real/complex tridiagonal diagonally dominant-like distributed matrix.
Gaussian elimination without pivoting is used to factor a reordering of the matrix into L*U.

Input Parameters
n (global) INTEGER. The order of the distributed submatrix A (n ≥ 0).
nrhs INTEGER. The number of right-hand sides; the number of columns of the distributed matrix B (nrhs ≥ 0).
dl (local) REAL for psdtsv
DOUBLE PRECISION for pddtsv
COMPLEX for pcdtsv
DOUBLE COMPLEX for pzdtsv.
Pointer to local part of global vector storing the lower diagonal of the matrix. Globally, dl(1) is not referenced, and dl must be aligned with d. Must be of size ≥ desca(nb_).
d (local) REAL for psdtsv
DOUBLE PRECISION for pddtsv
COMPLEX for pcdtsv
DOUBLE COMPLEX for pzdtsv.
Pointer to local part of global vector storing the main diagonal of the matrix.
du (local).
REAL for psdtsv
DOUBLE PRECISION for pddtsv
COMPLEX for pcdtsv
DOUBLE COMPLEX for pzdtsv.
Pointer to local part of global vector storing the upper diagonal of the matrix. Globally, du(n) is not referenced, and du must be aligned with d.
ja (global) INTEGER. The index in the global array a that points to the start of the matrix to be operated on (which may be either all of A or a submatrix of A).
desca (global and local) INTEGER array of dimension dlen.
If 1d type (dtype_a = 501 or 502), dlen ≥ 7;
If 2d type (dtype_a = 1), dlen ≥ 9.
The array descriptor for the distributed matrix A. Contains information of mapping of A to memory.
b (local) REAL for psdtsv
DOUBLE PRECISION for pddtsv
COMPLEX for pcdtsv
DOUBLE COMPLEX for pzdtsv.
Pointer into the local memory to an array of local lead dimension lld_b ≥ nb. On entry, this array contains the local pieces of the right-hand sides B(ib:ib+n-1, 1:nrhs).
ib (global) INTEGER. The row index in the global array b that points to the first row of the matrix to be operated on (which may be either all of b or a submatrix of B).
descb (global and local) INTEGER array of dimension dlen.
If 1d type (dtype_b = 502), dlen ≥ 7;
If 2d type (dtype_b = 1), dlen ≥ 9.
The array descriptor for the distributed matrix B. Contains information of mapping of B to memory.
work (local) REAL for psdtsv
DOUBLE PRECISION for pddtsv
COMPLEX for pcdtsv
DOUBLE COMPLEX for pzdtsv.
Temporary workspace. This space may be overwritten in between calls to routines. work must be the size given in lwork.
lwork (local or global) INTEGER. Size of user-input workspace work. If lwork is too small, the minimal acceptable size will be returned in work(1) and an error code is returned.
lwork ≥ (12*NPCOL+3*nb) + max((10+2*min(100, nrhs))*NPCOL+4*nrhs, 8*NPCOL)

Output Parameters
dl On exit, this array contains information on the factors of the matrix.
d On exit, this array contains information on the factors of the matrix. Must be of size ≥ desca(nb_).
du On exit, this array contains information on the factors of the matrix. Must be of size ≥ desca(nb_).
b On exit, this contains the local piece of the solutions distributed matrix X.
work On exit, work(1) contains the minimal lwork.
info (local) INTEGER.
If info = 0, the execution is successful.
info < 0: if the i-th argument is an array and the j-th entry had an illegal value, then info = -(i*100+j); if the i-th argument is a scalar and had an illegal value, then info = -i.
info > 0: if info = k ≤ NPROCS, the submatrix stored on processor info and factored locally was not positive definite, and the factorization was not completed.
If info = k > NPROCS, the submatrix stored on processor info-NPROCS representing interactions with other processors was not positive definite, and the factorization was not completed.

p?posv
Solves a symmetric positive definite system of linear equations.

Syntax
call psposv(uplo, n, nrhs, a, ia, ja, desca, b, ib, jb, descb, info)
call pdposv(uplo, n, nrhs, a, ia, ja, desca, b, ib, jb, descb, info)
call pcposv(uplo, n, nrhs, a, ia, ja, desca, b, ib, jb, descb, info)
call pzposv(uplo, n, nrhs, a, ia, ja, desca, b, ib, jb, descb, info)

Include Files
• C: mkl_scalapack.h

Description
The p?posv routine computes the solution to a real/complex system of linear equations
sub(A)*X = sub(B),
where sub(A) denotes A(ia:ia+n-1,ja:ja+n-1) and is an n-by-n symmetric/Hermitian distributed positive definite matrix, and X and sub(B), denoting B(ib:ib+n-1,jb:jb+nrhs-1), are n-by-nrhs distributed matrices.
The Cholesky decomposition is used to factor sub(A) as
sub(A) = U^T*U, if uplo = 'U', or
sub(A) = L*L^T, if uplo = 'L',
where U is an upper triangular matrix and L is a lower triangular matrix. The factored form of sub(A) is then used to solve the system of equations.

Input Parameters
uplo (global). CHARACTER.
Must be 'U' or 'L'. Indicates whether the upper or lower triangular part of sub(A) is stored.
n (global) INTEGER. The order of the distributed submatrix sub(A) (n ≥ 0).
nrhs INTEGER. The number of right-hand sides; the number of columns of the distributed submatrix sub(B) (nrhs ≥ 0).
a (local) REAL for psposv
DOUBLE PRECISION for pdposv
COMPLEX for pcposv
COMPLEX*16 for pzposv.
Pointer into the local memory to an array of dimension (lld_a, LOCc(ja+n-1)). On entry, this array contains the local pieces of the n-by-n symmetric distributed matrix sub(A) to be factored.
If uplo = 'U', the leading n-by-n upper triangular part of sub(A) contains the upper triangular part of the matrix, and its strictly lower triangular part is not referenced.
If uplo = 'L', the leading n-by-n lower triangular part of sub(A) contains the lower triangular part of the distributed matrix, and its strictly upper triangular part is not referenced.
ia, ja (global) INTEGER. The row and column indices in the global array a indicating the first row and the first column of the submatrix A, respectively.
desca (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix A.
b (local) REAL for psposv
DOUBLE PRECISION for pdposv
COMPLEX for pcposv
COMPLEX*16 for pzposv.
Pointer into the local memory to an array of dimension (lld_b, LOCc(jb+nrhs-1)). On entry, the local pieces of the right-hand sides distributed matrix sub(B).
ib, jb (global) INTEGER. The row and column indices in the global array b indicating the first row and the first column of the submatrix B, respectively.
descb (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix B.

Output Parameters
a On exit, if info = 0, this array contains the local pieces of the factor U or L from the Cholesky factorization sub(A) = U^H*U, or L*L^H.
b On exit, if info = 0, sub(B) is overwritten by the solution distributed matrix X.
info (global) INTEGER.
If info = 0, the execution is successful.
If info < 0: If the i-th argument is an array and the j-th entry had an illegal value, then info = -(i*100+j); if the i-th argument is a scalar and had an illegal value, then info = -i.
If info > 0: If info = k, the leading minor of order k, A(ia:ia+k-1, ja:ja+k-1), is not positive definite; the factorization could not be completed, and the solution has not been computed.

p?posvx
Solves a symmetric or Hermitian positive definite system of linear equations.

Syntax
call psposvx(fact, uplo, n, nrhs, a, ia, ja, desca, af, iaf, jaf, descaf, equed, sr, sc, b, ib, jb, descb, x, ix, jx, descx, rcond, ferr, berr, work, lwork, iwork, liwork, info)
call pdposvx(fact, uplo, n, nrhs, a, ia, ja, desca, af, iaf, jaf, descaf, equed, sr, sc, b, ib, jb, descb, x, ix, jx, descx, rcond, ferr, berr, work, lwork, iwork, liwork, info)
call pcposvx(fact, uplo, n, nrhs, a, ia, ja, desca, af, iaf, jaf, descaf, equed, sr, sc, b, ib, jb, descb, x, ix, jx, descx, rcond, ferr, berr, work, lwork, iwork, liwork, info)
call pzposvx(fact, uplo, n, nrhs, a, ia, ja, desca, af, iaf, jaf, descaf, equed, sr, sc, b, ib, jb, descb, x, ix, jx, descx, rcond, ferr, berr, work, lwork, iwork, liwork, info)

Include Files
• C: mkl_scalapack.h

Description
The p?posvx routine uses the Cholesky factorization A = U^T*U or A = L*L^T to compute the solution to a real or complex system of linear equations
A(ia:ia+n-1, ja:ja+n-1)*X = B(ib:ib+n-1, jb:jb+nrhs-1),
where A(ia:ia+n-1, ja:ja+n-1) is an n-by-n matrix and X and B(ib:ib+n-1, jb:jb+nrhs-1) are n-by-nrhs matrices.
Error bounds on the solution and a condition estimate are also provided.
In the following comments, y denotes Y(iy:iy+m-1, jy:jy+k-1), an m-by-k matrix, where y can be a, af, b, and x.
The routine p?posvx performs the following steps:
1.
If fact = 'E', real scaling factors sr and sc are computed to equilibrate the system:
diag(sr)*A*diag(sc)*inv(diag(sc))*X = diag(sr)*B
Whether or not the system will be equilibrated depends on the scaling of the matrix A, but if equilibration is used, A is overwritten by diag(sr)*A*diag(sc) and B by diag(sr)*B.
2. If fact = 'N' or 'E', the Cholesky decomposition is used to factor the matrix A (after equilibration if fact = 'E') as
A = U^T*U, if uplo = 'U', or
A = L*L^T, if uplo = 'L',
where U is an upper triangular matrix and L is a lower triangular matrix.
3. The factored form of A is used to estimate the condition number of the matrix A. If the reciprocal of the condition number is less than machine precision, steps 4-6 are skipped.
4. The system of equations is solved for X using the factored form of A.
5. Iterative refinement is applied to improve the computed solution matrix and calculate error bounds and backward error estimates for it.
6. If equilibration was used, the matrix X is premultiplied by diag(sr) so that it solves the original system before equilibration.

Input Parameters
fact (global) CHARACTER. Must be 'F', 'N', or 'E'. Specifies whether or not the factored form of the matrix A is supplied on entry, and if not, whether the matrix A should be equilibrated before it is factored. If fact = 'F': on entry, af contains the factored form of A. If equed = 'Y', the matrix A has been equilibrated with scaling factors given by sr and sc. a and af will not be modified. If fact = 'N', the matrix A will be copied to af and factored. If fact = 'E', the matrix A will be equilibrated if necessary, then copied to af and factored.
uplo (global) CHARACTER. Must be 'U' or 'L'. Indicates whether the upper or lower triangular part of A is stored.
n (global) INTEGER. The order of the distributed submatrix sub(A) (n ≥ 0).
nrhs (global) INTEGER. The number of right-hand sides; the number of columns of the distributed submatrices B and X (nrhs ≥ 0).
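The equilibrate, factor, solve, and rescale sequence from the step list of p?posvx (steps 1, 2, 4, and 6) can be illustrated serially. The following is a minimal plain-Python sketch, not the distributed algorithm: the choice of scale factors s(i) = 1/sqrt(a(i,i)) with sr = sc = s is an assumption of the sketch (this section does not state how the factors are chosen), condition estimation and iterative refinement (steps 3 and 5) are omitted, and the function names are hypothetical.

```python
import math


def cholesky_lower(a):
    """Return L such that a = L*L^T for a symmetric positive definite matrix."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for j in range(n):
        d = a[j][j] - sum(l[j][k] ** 2 for k in range(j))
        if d <= 0.0:
            # Mirrors the meaning of info = k > 0: leading minor k is not positive definite.
            raise ValueError(f"leading minor of order {j + 1} is not positive definite")
        l[j][j] = math.sqrt(d)
        for i in range(j + 1, n):
            l[i][j] = (a[i][j] - sum(l[i][k] * l[j][k] for k in range(j))) / l[j][j]
    return l


def solve_spd_equilibrated(a, b):
    """Solve a*x = b for one right-hand side using symmetric scaling plus Cholesky."""
    n = len(a)
    # Step 1 (assumed scaling): s(i) = 1/sqrt(a(i,i)), used as both sr and sc.
    s = [1.0 / math.sqrt(a[i][i]) for i in range(n)]
    a_eq = [[s[i] * a[i][j] * s[j] for j in range(n)] for i in range(n)]
    b_eq = [s[i] * b[i] for i in range(n)]
    # Step 2: factor the equilibrated matrix as L*L^T.
    l = cholesky_lower(a_eq)
    # Step 4: forward substitution (L*y = b_eq), then back substitution (L^T*z = y).
    y = [0.0] * n
    for i in range(n):
        y[i] = (b_eq[i] - sum(l[i][k] * y[k] for k in range(i))) / l[i][i]
    z = [0.0] * n
    for i in reversed(range(n)):
        z[i] = (y[i] - sum(l[k][i] * z[k] for k in range(i + 1, n))) / l[i][i]
    # Step 6: premultiply by diag(s) to recover the solution of the original system.
    return [s[i] * z[i] for i in range(n)]


if __name__ == "__main__":
    # Badly scaled 2-by-2 example whose exact solution is x = (1, 1).
    print(solve_spd_equilibrated([[4.0, 2.0], [2.0, 10000.0]], [6.0, 10002.0]))
```

The scaling brings the diagonal of the equilibrated matrix to 1, which is why the final multiplication by diag(s) is needed to return to the unscaled unknowns.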
a (local) REAL for psposvx, DOUBLE PRECISION for pdposvx, COMPLEX for pcposvx, DOUBLE COMPLEX for pzposvx. Pointer into the local memory to an array of local dimension (lld_a, LOCc(ja+n-1)). On entry, the symmetric/Hermitian matrix A; if fact = 'F' and equed = 'Y', A must instead contain the equilibrated matrix diag(sr)*A*diag(sc). If uplo = 'U', the leading n-by-n upper triangular part of A contains the upper triangular part of the matrix A, and the strictly lower triangular part of A is not referenced. If uplo = 'L', the leading n-by-n lower triangular part of A contains the lower triangular part of the matrix A, and the strictly upper triangular part of A is not referenced. A is not modified if fact = 'F' or 'N', or if fact = 'E' and equed = 'N' on exit.
ia, ja (global) INTEGER. The row and column indices in the global array a indicating the first row and the first column of the submatrix A, respectively.
desca (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix A.
af (local) REAL for psposvx, DOUBLE PRECISION for pdposvx, COMPLEX for pcposvx, DOUBLE COMPLEX for pzposvx. Pointer into the local memory to an array of local dimension (lld_af, LOCc(ja+n-1)). If fact = 'F', then af is an input argument and on entry contains the triangular factor U or L from the Cholesky factorization A = U^T*U or A = L*L^T, in the same storage format as A. If equed ≠ 'N', then af is the factored form of the equilibrated matrix diag(sr)*A*diag(sc).
iaf, jaf (global) INTEGER. The row and column indices in the global array af indicating the first row and the first column of the submatrix AF, respectively.
descaf (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix AF.
equed (global) CHARACTER. Must be 'N' or 'Y'. equed is an input argument if fact = 'F'.
It specifies the form of equilibration that was done: If equed = 'N', no equilibration was done (always true if fact = 'N'); if equed = 'Y', equilibration was done and A has been replaced by diag(sr)*A*diag(sc).
sr (local) REAL for psposvx, DOUBLE PRECISION for pdposvx, COMPLEX for pcposvx, DOUBLE COMPLEX for pzposvx. Array, DIMENSION (lld_a). The array sr contains the scale factors for A. This array is an input argument if fact = 'F' only; otherwise it is an output argument. If equed = 'N', sr is not accessed. If fact = 'F' and equed = 'Y', each element of sr must be positive.
b (local) REAL for psposvx, DOUBLE PRECISION for pdposvx, COMPLEX for pcposvx, DOUBLE COMPLEX for pzposvx. Pointer into the local memory to an array of local dimension (lld_b, LOCc(jb+nrhs-1)). On entry, the n-by-nrhs right-hand side matrix B.
ib, jb (global) INTEGER. The row and column indices in the global array b indicating the first row and the first column of the submatrix B, respectively.
descb (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix B.
x (local) REAL for psposvx, DOUBLE PRECISION for pdposvx, COMPLEX for pcposvx, DOUBLE COMPLEX for pzposvx. Pointer into the local memory to an array of local dimension (lld_x, LOCc(jx+nrhs-1)).
ix, jx (global) INTEGER. The row and column indices in the global array x indicating the first row and the first column of the submatrix X, respectively.
descx (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix X.
work (local) REAL for psposvx, DOUBLE PRECISION for pdposvx, COMPLEX for pcposvx, DOUBLE COMPLEX for pzposvx. Workspace array, DIMENSION (lwork).
lwork (local or global) INTEGER. The dimension of the array work. lwork is local input and must be at least
lwork = max(p?pocon(lwork), p?porfs(lwork)) + LOCr(n_a).
lwork = 3*desca(lld_).
If lwork = -1, then lwork is global input and a workspace query is assumed; the routine only calculates the minimum and optimal size for all work arrays. Each of these values is returned in the first entry of the corresponding work array, and no error message is issued by pxerbla.
iwork (local) INTEGER. Workspace array, dimension (liwork).
liwork (local or global) INTEGER. The dimension of the array iwork. liwork is local input and must be at least
liwork = desca(lld_)
liwork = LOCr(n_a).
If liwork = -1, then liwork is global input and a workspace query is assumed; the routine only calculates the minimum and optimal size for all work arrays. Each of these values is returned in the first entry of the corresponding work array, and no error message is issued by pxerbla.

Output Parameters
a On exit, if fact = 'E' and equed = 'Y', a is overwritten by diag(sr)*a*diag(sc).
af If fact = 'N', then af is an output argument and on exit returns the triangular factor U or L from the Cholesky factorization A = U^T*U or A = L*L^T of the original matrix A. If fact = 'E', then af is an output argument and on exit returns the triangular factor U or L from the Cholesky factorization A = U^T*U or A = L*L^T of the equilibrated matrix A (see the description of A for the form of the equilibrated matrix).
equed If fact ≠ 'F', then equed is an output argument. It specifies the form of equilibration that was done (see the description of equed in the Input Parameters section).
sr This array is an output argument if fact ≠ 'F'. See the description of sr in the Input Parameters section.
sc This array is an output argument if fact ≠ 'F'. See the description of sc in the Input Parameters section.
b On exit, if equed = 'N', b is not modified; if trans = 'N' and equed = 'R' or 'B', b is overwritten by diag(r)*b; if trans = 'T' or 'C' and equed = 'C' or 'B', b is overwritten by diag(c)*b.
x (local) REAL for psposvx, DOUBLE PRECISION for pdposvx, COMPLEX for pcposvx, DOUBLE COMPLEX for pzposvx.
If info = 0, the n-by-nrhs solution matrix X to the original system of equations. Note that A and B are modified on exit if equed ≠ 'N', and the solution to the equilibrated system is inv(diag(sc))*X if trans = 'N' and equed = 'C' or 'B', or inv(diag(sr))*X if trans = 'T' or 'C' and equed = 'R' or 'B'.
rcond (global) REAL for single precision flavors, DOUBLE PRECISION for double precision flavors. An estimate of the reciprocal condition number of the matrix A after equilibration (if done). If rcond is less than the machine precision (in particular, if rcond = 0), the matrix is singular to working precision. This condition is indicated by a return code of info > 0.
ferr REAL for single precision flavors, DOUBLE PRECISION for double precision flavors. Arrays, DIMENSION at least max(LOC,n_b). The estimated forward error bounds for each solution vector X(j) (the j-th column of the solution matrix X). If xtrue is the true solution, ferr(j) bounds the magnitude of the largest entry in (X(j) - xtrue) divided by the magnitude of the largest entry in X(j). The quality of the error bound depends on the quality of the estimate of norm(inv(A)) computed in the code; if the estimate of norm(inv(A)) is accurate, the error bound is guaranteed.
berr (local) REAL for single precision flavors, DOUBLE PRECISION for double precision flavors. Arrays, DIMENSION at least max(LOC,n_b). The componentwise relative backward error of each solution vector X(j) (the smallest relative change in any entry of A or B that makes X(j) an exact solution).
work(1) (local) On exit, work(1) returns the minimal and optimal liwork.
info (global) INTEGER. If info = 0, the execution is successful.
< 0: if info = -i, the i-th argument had an illegal value.
> 0: if info = i, and i is
≤ n: the leading minor of order i of a is not positive definite, so the factorization could not be completed, and the solution and error bounds could not be computed.
= n+1: rcond is less than machine precision. The factorization has been completed, but the matrix is singular to working precision, and the solution and error bounds have not been computed.

p?pbsv
Solves a symmetric/Hermitian positive definite banded system of linear equations.

Syntax
call pspbsv(uplo, n, bw, nrhs, a, ja, desca, b, ib, descb, work, lwork, info)
call pdpbsv(uplo, n, bw, nrhs, a, ja, desca, b, ib, descb, work, lwork, info)
call pcpbsv(uplo, n, bw, nrhs, a, ja, desca, b, ib, descb, work, lwork, info)
call pzpbsv(uplo, n, bw, nrhs, a, ja, desca, b, ib, descb, work, lwork, info)

Include Files
• C: mkl_scalapack.h

Description
The p?pbsv routine solves a system of linear equations
A(1:n, ja:ja+n-1)*X = B(ib:ib+n-1, 1:nrhs),
where A(1:n, ja:ja+n-1) is an n-by-n real/complex banded symmetric positive definite distributed matrix with bandwidth bw.
Cholesky factorization is used to factor a reordering of the matrix into L*L'.

Input Parameters
uplo (global) CHARACTER. Must be 'U' or 'L'. Indicates whether the upper or lower triangular part of A is stored. If uplo = 'U', the upper triangular part of A is stored. If uplo = 'L', the lower triangular part of A is stored.
n (global) INTEGER. The order of the distributed matrix A (n ≥ 0).
bw (global) INTEGER. The number of subdiagonals in L or U. 0 ≤ bw ≤ n-1.
nrhs (global) INTEGER. The number of right-hand sides; the number of columns in B (nrhs ≥ 0).
a (local) REAL for pspbsv, DOUBLE PRECISION for pdpbsv, COMPLEX for pcpbsv, DOUBLE COMPLEX for pzpbsv. Pointer into the local memory to an array with leading dimension lld_a ≥ (bw+1) (stored in desca). On entry, this array contains the local pieces of the distributed matrix sub(A) to be factored.
ja (global) INTEGER. The index in the global array a that points to the start of the matrix to be operated on (which may be either all of A or a submatrix of A).
desca (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix A.
b (local) REAL for pspbsv, DOUBLE PRECISION for pdpbsv, COMPLEX for pcpbsv, DOUBLE COMPLEX for pzpbsv. Pointer into the local memory to an array of local lead dimension lld_b ≥ nb. On entry, this array contains the local pieces of the right-hand sides B(ib:ib+n-1, 1:nrhs).
ib (global) INTEGER. The row index in the global array b that points to the first row of the matrix to be operated on (which may be either all of b or a submatrix of B).
descb (global and local) INTEGER array of dimension dlen. If 1D type (dtype_b = 502), dlen = 7; if 2D type (dtype_b = 1), dlen = 9. The array descriptor for the distributed matrix B. Contains information of mapping of B to memory.
work (local) REAL for pspbsv, DOUBLE PRECISION for pdpbsv, COMPLEX for pcpbsv, DOUBLE COMPLEX for pzpbsv. Temporary workspace. This space may be overwritten in between calls to routines. work must be the size given in lwork.
lwork (local or global) INTEGER. Size of user-input workspace work. If lwork is too small, the minimal acceptable size will be returned in work(1) and an error code is returned.
lwork ≥ (nb+2*bw)*bw + max((bw*nrhs), bw*bw)

Output Parameters
a On exit, this array contains information containing details of the factorization. Note that permutations are performed on the matrix, so that the factors returned are different from those returned by LAPACK.
b On exit, contains the local piece of the solutions distributed matrix X.
work On exit, work(1) contains the minimal lwork.
info (global) INTEGER. If info = 0, the execution is successful.
< 0: If the i-th argument is an array and the j-th entry had an illegal value, then info = -(i*100+j); if the i-th argument is a scalar and had an illegal value, then info = -i.
> 0: If info = k ≤ NPROCS, the submatrix stored on processor info and factored locally was not positive definite, and the factorization was not completed.
If info = k > NPROCS, the submatrix stored on processor info-NPROCS representing interactions with other processors was not positive definite, and the factorization was not completed.

p?ptsv
Solves a symmetric or Hermitian positive definite tridiagonal system of linear equations.

Syntax
call psptsv(n, nrhs, d, e, ja, desca, b, ib, descb, work, lwork, info)
call pdptsv(n, nrhs, d, e, ja, desca, b, ib, descb, work, lwork, info)
call pcptsv(n, nrhs, d, e, ja, desca, b, ib, descb, work, lwork, info)
call pzptsv(n, nrhs, d, e, ja, desca, b, ib, descb, work, lwork, info)

Include Files
• C: mkl_scalapack.h

Description
The p?ptsv routine solves a system of linear equations
A(1:n, ja:ja+n-1)*X = B(ib:ib+n-1, 1:nrhs),
where A(1:n, ja:ja+n-1) is an n-by-n real tridiagonal symmetric positive definite distributed matrix.
Cholesky factorization is used to factor a reordering of the matrix into L*L'.

Input Parameters
n (global) INTEGER. The order of matrix A (n ≥ 0).
nrhs (global) INTEGER. The number of right-hand sides; the number of columns of the distributed submatrix B (nrhs ≥ 0).
d (local) REAL for psptsv, DOUBLE PRECISION for pdptsv, COMPLEX for pcptsv, DOUBLE COMPLEX for pzptsv. Pointer to local part of global vector storing the main diagonal of the matrix.
e (local) REAL for psptsv, DOUBLE PRECISION for pdptsv, COMPLEX for pcptsv, DOUBLE COMPLEX for pzptsv. Pointer to local part of global vector storing the upper diagonal of the matrix. Globally, e(n) is not referenced, and e must be aligned with d.
ja (global) INTEGER. The index in the global array A that points to the start of the matrix to be operated on (which may be either all of A or a submatrix of A).
desca (global and local) INTEGER array of dimension dlen. If 1D type (dtype_a = 501 or 502), dlen = 7; if 2D type (dtype_a = 1), dlen = 9. The array descriptor for the distributed matrix A. Contains information of mapping of A to memory.
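The descriptor-length rule stated for desca can be written as a small lookup. The helper below is an illustrative Python sketch (the function name is hypothetical and not part of MKL); it encodes only the two cases given in the desca description and rejects every other descriptor type.

```python
def descriptor_length(dtype):
    """Return dlen for a descriptor type, per the desca description of p?ptsv."""
    if dtype in (501, 502):   # 1D descriptor types
        return 7
    if dtype == 1:            # 2D descriptor type
        return 9
    raise ValueError(f"descriptor type {dtype} is not covered by this rule")


if __name__ == "__main__":
    for dtype in (501, 502, 1):
        print(dtype, descriptor_length(dtype))
```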
b (local) REAL for psptsv, DOUBLE PRECISION for pdptsv, COMPLEX for pcptsv, DOUBLE COMPLEX for pzptsv. Pointer into the local memory to an array of local lead dimension lld_b ≥ nb. On entry, this array contains the local pieces of the right-hand sides B(ib:ib+n-1, 1:nrhs).
ib (global) INTEGER. The row index in the global array b that points to the first row of the matrix to be operated on (which may be either all of b or a submatrix of B).
descb (global and local) INTEGER array of dimension dlen. If 1D type (dtype_b = 502), dlen = 7; if 2D type (dtype_b = 1), dlen = 9. The array descriptor for the distributed matrix B. Contains information of mapping of B to memory.
work (local) REAL for psptsv, DOUBLE PRECISION for pdptsv, COMPLEX for pcptsv, DOUBLE COMPLEX for pzptsv. Temporary workspace. This space may be overwritten in between calls to routines. work must be the size given in lwork.
lwork (local or global) INTEGER. Size of user-input workspace work. If lwork is too small, the minimal acceptable size will be returned in work(1) and an error code is returned.
lwork ≥ (12*NPCOL+3*nb) + max((10+2*min(100, nrhs))*NPCOL+4*nrhs, 8*NPCOL).

Output Parameters
d On exit, this array contains information containing the factors of the matrix. Must be of size greater than or equal to desca(nb_).
e On exit, this array contains information containing the factors of the matrix. Must be of size greater than or equal to desca(nb_).
b On exit, this contains the local piece of the solutions distributed matrix X.
work On exit, work(1) contains the minimal lwork.
info (local) INTEGER. If info = 0, the execution is successful.
< 0: If the i-th argument is an array and the j-th entry had an illegal value, then info = -(i*100+j); if the i-th argument is a scalar and had an illegal value, then info = -i.
> 0: If info = k ≤ NPROCS, the submatrix stored on processor info and factored locally was not positive definite, and the factorization was not completed. If info = k > NPROCS, the submatrix stored on processor info-NPROCS representing interactions with other processors was not positive definite, and the factorization was not completed.

p?gels
Solves overdetermined or underdetermined linear systems involving a matrix of full rank.

Syntax
call psgels(trans, m, n, nrhs, a, ia, ja, desca, b, ib, jb, descb, work, lwork, info)
call pdgels(trans, m, n, nrhs, a, ia, ja, desca, b, ib, jb, descb, work, lwork, info)
call pcgels(trans, m, n, nrhs, a, ia, ja, desca, b, ib, jb, descb, work, lwork, info)
call pzgels(trans, m, n, nrhs, a, ia, ja, desca, b, ib, jb, descb, work, lwork, info)

Include Files
• C: mkl_scalapack.h

Description
The p?gels routine solves overdetermined or underdetermined real/complex linear systems involving an m-by-n matrix sub(A) = A(ia:ia+m-1, ja:ja+n-1), or its transpose/conjugate-transpose, using a QR or LQ factorization of sub(A). It is assumed that sub(A) has full rank.
The following options are provided:
1. If trans = 'N' and m ≥ n: find the least squares solution of an overdetermined system, that is, solve the least squares problem
minimize ||sub(B) - sub(A)*X||
2. If trans = 'N' and m < n: find the minimum norm solution of an underdetermined system sub(A)*X = sub(B).
3. If trans = 'T' and m ≥ n: find the minimum norm solution of an underdetermined system sub(A)^T*X = sub(B).
4. If trans = 'T' and m < n: find the least squares solution of an overdetermined system, that is, solve the least squares problem
minimize ||sub(B) - sub(A)^T*X||,
where sub(B) denotes B(ib:ib+m-1, jb:jb+nrhs-1) when trans = 'N' and B(ib:ib+n-1, jb:jb+nrhs-1) otherwise.
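The four options of p?gels can be summarized as a function of trans, m, and n. The sketch below is an illustrative Python helper (the function name and the returned tuple layout are hypothetical, not an MKL API). It treats the first and third cases as covering m ≥ n so that the four cases are exhaustive; that reading is an assumption of this sketch.

```python
def gels_problem(trans, m, n):
    """Classify the problem p?gels solves for an m-by-n sub(A).

    Returns (kind, operator, rows_of_sub_B_on_entry, rows_of_X).
    """
    trans = trans.upper()
    if trans not in ("N", "T"):
        raise ValueError("trans must be 'N' or 'T'")
    if m < 0 or n < 0:
        raise ValueError("m and n must be non-negative")
    tall = m >= n  # sub(A) has at least as many rows as columns
    if trans == "N":
        # The system uses sub(A): sub(B) has m rows, X has n rows.
        kind = "least squares" if tall else "minimum norm"
        return kind, "sub(A)", m, n
    # The system uses sub(A)^T, which is n-by-m: sub(B) has n rows, X has m rows.
    kind = "minimum norm" if tall else "least squares"
    return kind, "sub(A)^T", n, m


if __name__ == "__main__":
    for args in (("N", 5, 3), ("N", 3, 5), ("T", 5, 3), ("T", 3, 5)):
        print(args, gels_problem(*args))
```

The operator applied to X is overdetermined exactly when it has more rows than columns, which is why the least squares and minimum norm labels swap between trans = 'N' and trans = 'T'.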
Several right-hand side vectors b and solution vectors x can be handled in a single call; the solution vectors are stored as the columns of the n-by-nrhs right-hand side matrix sub(B) when trans = 'N', and of the m-by-nrhs right-hand side matrix sub(B) otherwise.

Input Parameters
trans (global) CHARACTER. Must be 'N' or 'T'. If trans = 'N', the linear system involves matrix sub(A); if trans = 'T', the linear system involves the transposed matrix A^T (for real flavors only).
m (global) INTEGER. The number of rows in the distributed submatrix sub(A) (m ≥ 0).
n (global) INTEGER. The number of columns in the distributed submatrix sub(A) (n ≥ 0).
nrhs (global) INTEGER. The number of right-hand sides; the number of columns in the distributed submatrices sub(B) and X (nrhs ≥ 0).
a (local) REAL for psgels, DOUBLE PRECISION for pdgels, COMPLEX for pcgels, DOUBLE COMPLEX for pzgels. Pointer into the local memory to an array of dimension (lld_a, LOCc(ja+n-1)). On entry, contains the m-by-n matrix A.
ia, ja (global) INTEGER. The row and column indices in the global array a indicating the first row and the first column of the submatrix A, respectively.
desca (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix A.
b (local) REAL for psgels, DOUBLE PRECISION for pdgels, COMPLEX for pcgels, DOUBLE COMPLEX for pzgels. Pointer into the local memory to an array of local dimension (lld_b, LOCc(jb+nrhs-1)). On entry, this array contains the local pieces of the distributed matrix B of right-hand side vectors, stored columnwise; sub(B) is m-by-nrhs if trans = 'N', and n-by-nrhs otherwise.
ib, jb (global) INTEGER. The row and column indices in the global array b indicating the first row and the first column of the submatrix B, respectively.
descb (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix B.
work (local) REAL for psgels, DOUBLE PRECISION for pdgels, COMPLEX for pcgels, DOUBLE COMPLEX for pzgels. Workspace array with dimension lwork.
lwork (local or global) INTEGER. The dimension of the array work. lwork is local input and must be at least lwork ≥ ltau + max(lwf, lws), where
if m ≥ n, then
ltau = numroc(ja+min(m,n)-1, nb_a, MYCOL, csrc_a, NPCOL),
lwf = nb_a*(mpa0 + nqa0 + nb_a)
lws = max((nb_a*(nb_a-1))/2, (nrhsqb0 + mpb0)*nb_a) + nb_a*nb_a
else
ltau = numroc(ia+min(m,n)-1, mb_a, MYROW, rsrc_a, NPROW),
lwf = mb_a*(mpa0 + nqa0 + mb_a)
lws = max((mb_a*(mb_a-1))/2, (npb0 + max(nqa0 + numroc(numroc(n+iroffb, mb_a, 0, 0, NPROW), mb_a, 0, 0, lcmp), nrhsqb0))*mb_a) + mb_a*mb_a
end if,
where lcmp = lcm/NPROW with lcm = ilcm(NPROW, NPCOL),
iroffa = mod(ia-1, mb_a),
icoffa = mod(ja-1, nb_a),
iarow = indxg2p(ia, mb_a, MYROW, rsrc_a, NPROW),
iacol = indxg2p(ja, nb_a, MYCOL, csrc_a, NPCOL),
mpa0 = numroc(m+iroffa, mb_a, MYROW, iarow, NPROW),
nqa0 = numroc(n+icoffa, nb_a, MYCOL, iacol, NPCOL),
iroffb = mod(ib-1, mb_b),
icoffb = mod(jb-1, nb_b),
ibrow = indxg2p(ib, mb_b, MYROW, rsrc_b, NPROW),
ibcol = indxg2p(jb, nb_b, MYCOL, csrc_b, NPCOL),
mpb0 = numroc(m+iroffb, mb_b, MYROW, ibrow, NPROW),
nqb0 = numroc(n+icoffb, nb_b, MYCOL, ibcol, NPCOL),
ilcm, indxg2p, and numroc are ScaLAPACK tool functions; MYROW, MYCOL, NPROW, and NPCOL can be determined by calling the subroutine blacs_gridinfo.
If lwork = -1, then lwork is global input and a workspace query is assumed; the routine only calculates the minimum and optimal size for all work arrays. Each of these values is returned in the first entry of the corresponding work array, and no error message is issued by pxerbla.

Output Parameters
a On exit, if m ≥ n, sub(A) is overwritten by the details of its QR factorization as returned by p?geqrf; if m < n, sub(A) is overwritten by details of its LQ factorization as returned by p?gelqf.
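The lwork expression for p?gels is built from the ScaLAPACK tool function numroc, which counts how many rows or columns of a block-cyclically distributed dimension are stored on one process. The following plain-Python sketch follows the logic of the reference ScaLAPACK NUMROC routine; the Python function itself is illustrative and not part of MKL, and process coordinates are zero-based as in BLACS.

```python
def numroc(n, nb, iproc, isrcproc, nprocs):
    """Number of rows/columns of an n-long dimension owned by process iproc.

    n        -- global number of rows or columns
    nb       -- block size
    iproc    -- coordinate of the process being asked about
    isrcproc -- coordinate of the process that owns the first block
    nprocs   -- number of processes over which the dimension is distributed
    """
    # Distance of this process from the one holding the first block.
    mydist = (nprocs + iproc - isrcproc) % nprocs
    # Number of complete blocks in the dimension.
    nblocks = n // nb
    # Every process gets at least this many full blocks.
    count = (nblocks // nprocs) * nb
    # Leftover full blocks go to the first `extrablks` processes.
    extrablks = nblocks % nprocs
    if mydist < extrablks:
        count += nb
    elif mydist == extrablks:
        # This process receives the trailing partial block, if any.
        count += n % nb
    return count


if __name__ == "__main__":
    # Distribute 9 columns in blocks of 2 over 3 process columns.
    print([numroc(9, 2, p, 0, 3) for p in range(3)])
```

Summing the result over all processes returns n, which is a convenient sanity check when assembling quantities such as mpa0 and nqa0.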
b On exit, sub(B) is overwritten by the solution vectors, stored columnwise: if trans = 'N' and m ≥ n, rows 1 to n of sub(B) contain the least squares solution vectors; the residual sum of squares for the solution in each column is given by the sum of squares of elements n+1 to m in that column; if trans = 'N' and m < n, rows 1 to n of sub(B) contain the minimum norm solution vectors; if trans = 'T' and m ≥ n, rows 1 to m of sub(B) contain the minimum norm solution vectors; if trans = 'T' and m < n, rows 1 to m of sub(B) contain the least squares solution vectors; the residual sum of squares for the solution in each column is given by the sum of squares of elements m+1 to n in that column.
work(1) On exit, work(1) contains the minimum value of lwork required for optimum performance.
info (global) INTEGER.
= 0: the execution is successful.
< 0: if the i-th argument is an array and the j-th entry had an illegal value, then info = -(i*100+j); if the i-th argument is a scalar and had an illegal value, then info = -i.

p?syev
Computes all eigenvalues and, optionally, eigenvectors of a real symmetric matrix.

Syntax
call pssyev(jobz, uplo, n, a, ia, ja, desca, w, z, iz, jz, descz, work, lwork, info)
call pdsyev(jobz, uplo, n, a, ia, ja, desca, w, z, iz, jz, descz, work, lwork, info)

Include Files
• C: mkl_scalapack.h

Description
The p?syev routine computes all eigenvalues and, optionally, eigenvectors of a real symmetric matrix A by calling the recommended sequence of ScaLAPACK routines.
In its present form, the routine assumes a homogeneous system and makes no checks for consistency of the eigenvalues or eigenvectors across the different processes. Because of this, it is possible that a heterogeneous system may return incorrect results without any error messages.

Input Parameters
np = the number of rows local to a given process.
nq = the number of columns local to a given process.
jobz (global) CHARACTER. Must be 'N' or 'V'.
Specifies if it is necessary to compute the eigenvectors: If jobz = 'N', then only eigenvalues are computed. If jobz = 'V', then eigenvalues and eigenvectors are computed.
uplo (global) CHARACTER. Must be 'U' or 'L'. Specifies whether the upper or lower triangular part of the symmetric matrix A is stored: If uplo = 'U', a stores the upper triangular part of A. If uplo = 'L', a stores the lower triangular part of A.
n (global) INTEGER. The number of rows and columns of the matrix A (n ≥ 0).
a (local) REAL for pssyev, DOUBLE PRECISION for pdsyev. Block cyclic array of global dimension (n, n) and local dimension (lld_a, LOCc(ja+n-1)). On entry, the symmetric matrix A. If uplo = 'U', only the upper triangular part of A is used to define the elements of the symmetric matrix. If uplo = 'L', only the lower triangular part of A is used to define the elements of the symmetric matrix.
ia, ja (global) INTEGER. The row and column indices in the global array a indicating the first row and the first column of the submatrix A, respectively.
desca (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix A.
iz, jz (global) INTEGER. The row and column indices in the global array z indicating the first row and the first column of the submatrix Z, respectively.
descz (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix Z.
work (local) REAL for pssyev, DOUBLE PRECISION for pdsyev. Array, DIMENSION (lwork).
lwork (local) INTEGER. See below for definitions of variables used to define lwork. If no eigenvectors are requested (jobz = 'N'), then lwork ≥ 5*n + sizesytrd + 1, where sizesytrd is the workspace for p?sytrd and is max(NB*(np+1), 3*NB).
If eigenvectors are requested (jobz = 'V'), then the amount of workspace required to guarantee that all eigenvectors are computed is:
qrmem = 2*n-2
lwmin = 5*n + n*ldc + max(sizemqrleft, qrmem) + 1
Variable definitions:
nb = desca(mb_) = desca(nb_) = descz(mb_) = descz(nb_);
nn = max(n, nb, 2);
desca(rsrc_) = desca(csrc_) = descz(rsrc_) = descz(csrc_) = 0
np = numroc(nn, nb, 0, 0, NPROW)
nq = numroc(max(n, nb, 2), nb, 0, 0, NPCOL)
nrc = numroc(n, nb, myprowc, 0, NPROCS)
ldc = max(1, nrc)
sizemqrleft is the workspace for p?ormtr when its side argument is 'L'.
myprowc is defined when a new context is created as follows:
call blacs_get(desca(ctxt_), 0, contextc)
call blacs_gridinit(contextc, 'R', NPROCS, 1)
call blacs_gridinfo(contextc, nprowc, npcolc, myprowc, mypcolc)
If lwork = -1, then lwork is global input and a workspace query is assumed; the routine only calculates the minimum and optimal size for all work arrays. Each of these values is returned in the first entry of the corresponding work array, and no error message is issued by pxerbla.

Output Parameters
a On exit, the lower triangle (if uplo = 'L') or the upper triangle (if uplo = 'U') of A, including the diagonal, is destroyed.
w (global) REAL for pssyev, DOUBLE PRECISION for pdsyev. Array, DIMENSION (n). On normal exit, the first m entries contain the selected eigenvalues in ascending order.
z (local) REAL for pssyev, DOUBLE PRECISION for pdsyev. Array, global dimension (n, n), local dimension (lld_z, LOCc(jz+n-1)). If jobz = 'V', then on normal exit the first m columns of z contain the orthonormal eigenvectors of the matrix corresponding to the selected eigenvalues. If jobz = 'N', then z is not referenced.
work(1) On output, work(1) returns the workspace needed to guarantee completion. If the input parameters are incorrect, work(1) may also be incorrect.
If jobz = 'N', work(1) = minimal (optimal) amount of workspace. If jobz = 'V', work(1) = minimal workspace required to generate all the eigenvectors.
info (global) INTEGER. If info = 0, the execution is successful.
If info < 0: If the i-th argument is an array and the j-th entry had an illegal value, then info = -(i*100+j); if the i-th argument is a scalar and had an illegal value, then info = -i.
If info > 0: If info = 1 through n, the i-th eigenvalue did not converge in ?steqr2 after a total of 30*n iterations. If info = n+1, then p?syev has detected heterogeneity by finding that eigenvalues were not identical across the process grid. In this case, the accuracy of the results from p?syev cannot be guaranteed.

p?syevd
Computes all eigenvalues and eigenvectors of a real symmetric matrix by using a divide and conquer algorithm.

Syntax
call pssyevd(jobz, uplo, n, a, ia, ja, desca, w, z, iz, jz, descz, work, lwork, iwork, liwork, info)
call pdsyevd(jobz, uplo, n, a, ia, ja, desca, w, z, iz, jz, descz, work, lwork, iwork, liwork, info)

Include Files
• C: mkl_scalapack.h

Description
The p?syevd routine computes all eigenvalues and eigenvectors of a real symmetric matrix A by using a divide and conquer algorithm.

Input Parameters
np = the number of rows local to a given process.
nq = the number of columns local to a given process.
jobz (global) CHARACTER*1. Must be 'N' or 'V'. Specifies if it is necessary to compute the eigenvectors: If jobz = 'N', then only eigenvalues are computed. If jobz = 'V', then eigenvalues and eigenvectors are computed.
uplo (global) CHARACTER*1. Must be 'U' or 'L'. Specifies whether the upper or lower triangular part of the symmetric matrix A is stored: If uplo = 'U', a stores the upper triangular part of A. If uplo = 'L', a stores the lower triangular part of A.
n (global) INTEGER. The number of rows and columns of the matrix A (n ≥ 0).
a (local) REAL for pssyevd, DOUBLE PRECISION for pdsyevd.
Block cyclic array of global dimension (n, n) and local dimension (lld_a, LOCc(ja+n-1)). On entry, the symmetric matrix A. If uplo = 'U', only the upper triangular part of A is used to define the elements of the symmetric matrix. If uplo = 'L', only the lower triangular part of A is used to define the elements of the symmetric matrix. ia, ja (global) INTEGER. The row and column indices in the global array a indicating the first row and the first column of the submatrix A, respectively. desca (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix A. If desca(ctxt_) is incorrect, p?syevd cannot guarantee correct error reporting. iz, jz (global) INTEGER. The row and column indices in the global array z indicating the first row and the first column of the submatrix Z, respectively. descz (global and local) INTEGER array, dimension dlen_. The array descriptor for the distributed matrix Z. descz(ctxt_) must equal desca(ctxt_). work (local). REAL for pssyevd DOUBLE PRECISION for pdsyevd. Array, DIMENSION lwork. lwork (local). INTEGER. The dimension of the array work. If eigenvalues are requested: lwork = max( 1+6*n + 2*np*nq, trilwmin) + 2*n with trilwmin = 3*n + max( nb*( np + 1), 3*nb ) np = numroc( n, nb, myrow, iarow, NPROW) nq = numroc( n, nb, mycol, iacol, NPCOL) If lwork = -1, then lwork is global input and a workspace query is assumed; the routine only calculates the size required for optimal performance for all work arrays. The required workspace is returned as the first element of the corresponding work arrays, and no error message is issued by pxerbla. iwork (local) INTEGER. Workspace array, dimension liwork. liwork (local) INTEGER, dimension of iwork. liwork = 7*n + 8*npcol + 2. Output Parameters a On exit, the lower triangle (if uplo = 'L'), or the upper triangle (if uplo = 'U') of A, including the diagonal, is overwritten. w (global). REAL for pssyevd DOUBLE PRECISION for pdsyevd. 
Array, DIMENSION n. If info = 0, w contains the eigenvalues in ascending order. z (local). REAL for pssyevd DOUBLE PRECISION for pdsyevd. Array, global dimension (n, n), local dimension (lld_z, LOCc(jz+n-1)). The z parameter contains the orthonormal eigenvectors of the matrix A. work(1) On exit, returns adequate workspace to allow optimal performance. iwork(1) (local). On exit, if liwork > 0, iwork(1) returns the optimal liwork. info (global) INTEGER. If info = 0, the execution is successful. If info < 0: If the i-th argument is an array and the j-entry had an illegal value, then info = -(i*100+j). If the i-th argument is a scalar and had an illegal value, then info = -i. If info > 0: The algorithm failed to compute the info/(n+1)-th eigenvalue while working on the submatrix lying in global rows and columns mod(info, n+1).
p?syevx Computes selected eigenvalues and, optionally, eigenvectors of a symmetric matrix. Syntax call pssyevx(jobz, range, uplo, n, a, ia, ja, desca, vl, vu, il, iu, abstol, m, nz, w, orfac, z, iz, jz, descz, work, lwork, iwork, liwork, ifail, iclustr, gap, info) call pdsyevx(jobz, range, uplo, n, a, ia, ja, desca, vl, vu, il, iu, abstol, m, nz, w, orfac, z, iz, jz, descz, work, lwork, iwork, liwork, ifail, iclustr, gap, info) Include Files • C: mkl_scalapack.h Description The p?syevx routine computes selected eigenvalues and, optionally, eigenvectors of a real symmetric matrix A by calling the recommended sequence of ScaLAPACK routines. Eigenvalues and eigenvectors can be selected by specifying either a range of values or a range of indices for the desired eigenvalues. Input Parameters np = the number of rows local to a given process. nq = the number of columns local to a given process. jobz (global). CHARACTER*1. Must be 'N' or 'V'. Specifies if it is necessary to compute the eigenvectors: If jobz = 'N', then only eigenvalues are computed. If jobz = 'V', then eigenvalues and eigenvectors are computed.
range (global). CHARACTER*1. Must be 'A', 'V', or 'I'. If range = 'A', all eigenvalues will be found. If range = 'V', all eigenvalues in the half-open interval [vl, vu] will be found. If range = 'I', the eigenvalues with indices il through iu will be found. uplo (global). CHARACTER*1. Must be 'U' or 'L'. Specifies whether the upper or lower triangular part of the symmetric matrix A is stored: If uplo = 'U', a stores the upper triangular part of A. If uplo = 'L', a stores the lower triangular part of A. n (global) INTEGER. The number of rows and columns of the matrix A (n ≥ 0). a (local). REAL for pssyevx DOUBLE PRECISION for pdsyevx. Block cyclic array of global dimension (n, n) and local dimension (lld_a, LOCc(ja+n-1)). On entry, the symmetric matrix A. If uplo = 'U', only the upper triangular part of A is used to define the elements of the symmetric matrix. If uplo = 'L', only the lower triangular part of A is used to define the elements of the symmetric matrix. ia, ja (global) INTEGER. The row and column indices in the global array a indicating the first row and the first column of the submatrix A, respectively. desca (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix A. vl, vu (global) REAL for pssyevx DOUBLE PRECISION for pdsyevx. If range = 'V', the lower and upper bounds of the interval to be searched for eigenvalues; vl < vu. Not referenced if range = 'A' or 'I'. il, iu (global) INTEGER. If range = 'I', the indices of the smallest and largest eigenvalues to be returned. Constraints: il ≥ 1; min(il, n) ≤ iu ≤ n. Not referenced if range = 'A' or 'V'. abstol (global). REAL for pssyevx DOUBLE PRECISION for pdsyevx. The absolute error tolerance for the eigenvalues. If jobz = 'V', setting abstol to p?lamch(context, 'U') yields the most orthogonal eigenvectors.
An approximate eigenvalue is accepted as converged when it is determined to lie in an interval [a, b] of width less than or equal to abstol + eps * max(|a|,|b|), where eps is the machine precision. If abstol is less than or equal to zero, then eps*norm(T) will be used in its place, where norm(T) is the 1-norm of the tridiagonal matrix obtained by reducing A to tridiagonal form. Eigenvalues will be computed most accurately when abstol is set to twice the underflow threshold 2*p?lamch('S'), not zero. If this routine returns with ((mod(info,2).ne.0).or.(mod(info/8,2).ne.0)), indicating that some eigenvalues or eigenvectors did not converge, try setting abstol to 2*p?lamch('S'). orfac (global). REAL for pssyevx DOUBLE PRECISION for pdsyevx. Specifies which eigenvectors should be reorthogonalized. Eigenvectors that correspond to eigenvalues which are within tol=orfac*norm(A) of each other are to be reorthogonalized. However, if the workspace is insufficient (see lwork), tol may be decreased until all eigenvectors to be reorthogonalized can be stored in one process. No reorthogonalization will be done if orfac equals zero. A default value of 1.0e-3 is used if orfac is negative. orfac should be identical on all processes. iz, jz (global) INTEGER. The row and column indices in the global array z indicating the first row and the first column of the submatrix Z, respectively. descz (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix Z. descz(ctxt_) must equal desca(ctxt_). work (local) REAL for pssyevx. DOUBLE PRECISION for pdsyevx. Array, DIMENSION (lwork). lwork (local) INTEGER. The dimension of the array work. See below for definitions of variables used to define lwork. If no eigenvectors are requested (jobz = 'N'), then lwork = 5*n + max(5*nn, nb*(np0 + 1)).
If eigenvectors are requested (jobz = 'V'), then the amount of workspace required to guarantee that all eigenvectors are computed is: lwork = 5*n + max(5*nn, np0*mq0 + 2*nb*nb) + iceil(neig, NPROW*NPCOL)*nn. The computed eigenvectors may not be orthogonal if the minimal workspace is supplied and orfac is too small. If you want to guarantee orthogonality (at the cost of potentially poor performance) you should add the following to lwork: (clustersize-1)*n, where clustersize is the number of eigenvalues in the largest cluster, where a cluster is defined as a set of close eigenvalues: {w(k),..., w(k+clustersize-1) | w(j+1) ≤ w(j) + orfac*2*norm(A)}, where neig = number of eigenvectors requested; nb = desca(mb_) = desca(nb_) = descz(mb_) = descz(nb_); nn = max(n, nb, 2); desca(rsrc_) = desca(csrc_) = descz(rsrc_) = descz(csrc_) = 0; np0 = numroc(nn, nb, 0, 0, NPROW); mq0 = numroc(max(neig, nb, 2), nb, 0, 0, NPCOL); iceil(x, y) is a ScaLAPACK function returning ceiling(x/y). If lwork is too small to guarantee orthogonality, p?syevx attempts to maintain orthogonality in the clusters with the smallest spacing between the eigenvalues. If lwork is too small to compute all the eigenvectors requested, no computation is performed and info = -23 is returned. Note that when range = 'V', the number of requested eigenvectors is not known until the eigenvalues are computed. In this case, if lwork is large enough to compute the eigenvalues, p?syevx computes the eigenvalues and as many eigenvectors as possible. Relationship between workspace, orthogonality and performance: Greater performance can be achieved if adequate workspace is provided.
In some situations, performance can decrease as the provided workspace increases above the workspace amount shown below: lwork = max(lwork, 5*n + nsytrd_lwopt), where lwork, as defined previously, depends upon the number of eigenvectors requested, and nsytrd_lwopt = n + 2*(anb+1)*(4*nps+2) + (nps + 3)*nps; anb = pjlaenv(desca(ctxt_), 3, 'p?syttrd', 'L', 0, 0, 0, 0); sqnpc = int(sqrt(dble(NPROW * NPCOL))); nps = max(numroc(n, 1, 0, 0, sqnpc), 2*anb); numroc is a ScaLAPACK tool function; pjlaenv is a ScaLAPACK environmental inquiry function. MYROW, MYCOL, NPROW and NPCOL can be determined by calling the subroutine blacs_gridinfo. For large n, no extra workspace is needed; however, the biggest boost in performance comes for small n, so it is wise to provide the extra workspace (typically less than a megabyte per process). If clustersize > n/sqrt(NPROW*NPCOL), then providing enough space to compute all the eigenvectors orthogonally will cause serious degradation in performance. At the limit (that is, clustersize = n-1) p?stein will perform no better than ?stein on a single processor. For clustersize = n/sqrt(NPROW*NPCOL) reorthogonalizing all eigenvectors will increase the total execution time by a factor of 2 or more. For clustersize > n/sqrt(NPROW*NPCOL) execution time will grow as the square of the cluster size, all other factors remaining equal and assuming enough workspace. Less workspace means less reorthogonalization but faster execution. If lwork = -1, then lwork is global input and a workspace query is assumed; the routine only calculates the size required for optimal performance for all work arrays. Each of these values is returned in the first entry of the corresponding work arrays, and no error message is issued by pxerbla. iwork (local) INTEGER. Workspace array. liwork (local) INTEGER, dimension of iwork.
liwork = 6*nnp, where nnp = max(n, NPROW*NPCOL + 1, 4). If liwork = -1, then liwork is global input and a workspace query is assumed; the routine only calculates the minimum and optimal size for all work arrays. Each of these values is returned in the first entry of the corresponding work array, and no error message is issued by pxerbla. Output Parameters a On exit, the lower triangle (if uplo = 'L') or the upper triangle (if uplo = 'U') of A, including the diagonal, is overwritten. m (global) INTEGER. The total number of eigenvalues found; 0 ≤ m ≤ n. nz (global) INTEGER. Total number of eigenvectors computed. 0 ≤ nz ≤ m. The number of columns of z that are filled. If jobz ≠ 'V', nz is not referenced. If jobz = 'V', nz = m unless the user supplies insufficient space and p?syevx is not able to detect this before beginning computation. To get all the eigenvectors requested, the user must supply both sufficient space to hold the eigenvectors in z (m.le.descz(n_)) and sufficient workspace to compute them. (See lwork). p?syevx is always able to detect insufficient space without computation unless range.eq.'V'. w (global). REAL for pssyevx DOUBLE PRECISION for pdsyevx. Array, DIMENSION (n). The first m elements contain the selected eigenvalues in ascending order. z (local). REAL for pssyevx DOUBLE PRECISION for pdsyevx. Array, global dimension (n, n), local dimension (lld_z, LOCc(jz+n-1)). If jobz = 'V', then on normal exit the first m columns of z contain the orthonormal eigenvectors of the matrix corresponding to the selected eigenvalues. If an eigenvector fails to converge, then that column of z contains the latest approximation to the eigenvector, and the index of the eigenvector is returned in ifail. If jobz = 'N', then z is not referenced. work(1) On exit, returns adequate workspace to allow optimal performance. iwork(1) On return, iwork(1) contains the amount of integer workspace required. ifail (global) INTEGER.
Array, DIMENSION (n). If jobz = 'V', then on normal exit, the first m elements of ifail are zero. If (mod(info,2).ne.0) on exit, then ifail contains the indices of the eigenvectors that failed to converge. If jobz = 'N', then ifail is not referenced. iclustr (global) INTEGER. Array, DIMENSION (2*NPROW*NPCOL). This array contains indices of eigenvectors corresponding to a cluster of eigenvalues that could not be reorthogonalized due to insufficient workspace (see lwork, orfac and info). Eigenvectors corresponding to clusters of eigenvalues indexed iclustr(2*i-1) to iclustr(2*i) could not be reorthogonalized due to lack of workspace. Hence the eigenvectors corresponding to these clusters may not be orthogonal. iclustr() is a zero-terminated array: (iclustr(2*k).ne.0.and.iclustr(2*k+1).eq.0) if and only if k is the number of clusters. iclustr is not referenced if jobz = 'N'. gap (global) REAL for pssyevx DOUBLE PRECISION for pdsyevx. Array, DIMENSION (NPROW*NPCOL). This array contains the gap between eigenvalues whose eigenvectors could not be reorthogonalized. The output values in this array correspond to the clusters indicated by the array iclustr. As a result, the dot product between eigenvectors corresponding to the i-th cluster may be as high as (C*n)/gap(i), where C is a small constant. info (global) INTEGER. If info = 0, the execution is successful. If info < 0: If the i-th argument is an array and the j-entry had an illegal value, then info = -(i*100+j); if the i-th argument is a scalar and had an illegal value, then info = -i. If info > 0: If (mod(info,2).ne.0), then one or more eigenvectors failed to converge. Their indices are stored in ifail. Ensure abstol=2.0*p?lamch('U'). If (mod(info/2,2).ne.0), then eigenvectors corresponding to one or more clusters of eigenvalues could not be reorthogonalized because of insufficient workspace. The indices of the clusters are stored in the array iclustr.
If (mod(info/4,2).ne.0), then space limit prevented p?syevx from computing all of the eigenvectors between vl and vu. The number of eigenvectors computed is returned in nz. If (mod(info/8,2).ne.0), then p?stebz failed to compute eigenvalues. Ensure abstol=2.0*p?lamch('U').
p?heev Computes all eigenvalues and, optionally, eigenvectors of a complex Hermitian matrix. Syntax call pcheev(jobz, uplo, n, a, ia, ja, desca, w, z, iz, jz, descz, work, lwork, rwork, lrwork, info) call pzheev(jobz, uplo, n, a, ia, ja, desca, w, z, iz, jz, descz, work, lwork, rwork, lrwork, info) Include Files • C: mkl_scalapack.h Description The p?heev routine computes all eigenvalues and, optionally, eigenvectors of a complex Hermitian matrix A by calling the recommended sequence of ScaLAPACK routines. The routine assumes a homogeneous system and makes spot checks of the consistency of the eigenvalues across the different processes. A heterogeneous system may return incorrect results without any error messages. Input Parameters np = the number of rows local to a given process. nq = the number of columns local to a given process. jobz (global). CHARACTER*1. Must be 'N' or 'V'. Specifies if it is necessary to compute the eigenvectors: If jobz = 'N', then only eigenvalues are computed. If jobz = 'V', then eigenvalues and eigenvectors are computed. uplo (global). CHARACTER*1. Must be 'U' or 'L'. Specifies whether the upper or lower triangular part of the Hermitian matrix A is stored: If uplo = 'U', a stores the upper triangular part of A. If uplo = 'L', a stores the lower triangular part of A. n (global) INTEGER. The number of rows and columns of the matrix A (n ≥ 0). a (local). COMPLEX for pcheev DOUBLE COMPLEX for pzheev. Block cyclic array of global dimension (n, n) and local dimension (lld_a, LOCc(ja+n-1)). On entry, the Hermitian matrix A. If uplo = 'U', only the upper triangular part of A is used to define the elements of the Hermitian matrix.
If uplo = 'L', only the lower triangular part of A is used to define the elements of the Hermitian matrix. ia, ja (global) INTEGER. The row and column indices in the global array a indicating the first row and the first column of the submatrix A, respectively. desca (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix A. If desca(ctxt_) is incorrect, p?heev cannot guarantee correct error reporting. iz, jz (global) INTEGER. The row and column indices in the global array z indicating the first row and the first column of the submatrix Z, respectively. descz (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix Z. descz(ctxt_) must equal desca(ctxt_). work (local). COMPLEX for pcheev DOUBLE COMPLEX for pzheev. Array, DIMENSION lwork. lwork (local). INTEGER. The dimension of the array work. If only eigenvalues are requested (jobz = 'N'): lwork = max(nb*(np0 + 1), 3) + 3*n. If eigenvectors are requested (jobz = 'V'), then the amount of workspace required is: lwork = (np0+nq0+nb)*nb + 3*n + n*n, with nb = desca(mb_) = desca(nb_) = descz(mb_) = descz(nb_); nn = max(n, nb, 2); np0 = numroc(nn, nb, 0, 0, NPROW); nq0 = numroc(max(n, nb, 2), nb, 0, 0, NPCOL). If lwork = -1, then lwork is global input and a workspace query is assumed; the routine only calculates the size required for optimal performance for all work arrays. The required workspace is returned as the first element of the corresponding work arrays, and no error message is issued by pxerbla. rwork (local). REAL for pcheev DOUBLE PRECISION for pzheev. Workspace array, DIMENSION lrwork. lrwork (local) INTEGER. The dimension of the array rwork. See below for definitions of variables used to define lrwork. If no eigenvectors are requested (jobz = 'N'), then lrwork = 2*n. If eigenvectors are requested (jobz = 'V'), then lrwork = 2*n + 2*n-2.
If lrwork = -1, then lrwork is global input and a workspace query is assumed; the routine only calculates the minimum size required for the rwork array. The required workspace is returned as the first element of rwork, and no error message is issued by pxerbla. Output Parameters a On exit, the lower triangle (if uplo = 'L'), or the upper triangle (if uplo = 'U') of A, including the diagonal, is overwritten. w (global). REAL for pcheev DOUBLE PRECISION for pzheev. Array, DIMENSION n. The first m elements contain the selected eigenvalues in ascending order. z (local). COMPLEX for pcheev DOUBLE COMPLEX for pzheev. Array, global dimension (n, n), local dimension (lld_z, LOCc(jz+n-1)). If jobz = 'V', then on normal exit the first m columns of z contain the orthonormal eigenvectors of the matrix corresponding to the selected eigenvalues. If an eigenvector fails to converge, then that column of z contains the latest approximation to the eigenvector, and the index of the eigenvector is returned in ifail. If jobz = 'N', then z is not referenced. work(1) On exit, returns adequate workspace to allow optimal performance. If jobz = 'N', then work(1) = minimal workspace only for eigenvalues. If jobz = 'V', then work(1) = minimal workspace required to generate all the eigenvectors. rwork(1) (local) REAL for pcheev DOUBLE PRECISION for pzheev. On output, rwork(1) returns workspace required to guarantee completion. info (global) INTEGER. If info = 0, the execution is successful. If info < 0: If the i-th argument is an array and the j-entry had an illegal value, then info = -(i*100+j). If the i-th argument is a scalar and had an illegal value, then info = -i. If info > 0: If info = 1 through n, the i-th eigenvalue did not converge in ?steqr2 after a total of 30*n iterations. If info = n+1, then p?heev detected heterogeneity, and the accuracy of the results cannot be guaranteed.
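The info convention described above for p?heev can be decoded mechanically. The following C sketch is an illustration only: the type bad_arg and the functions decode_negative_info and pheev_unconverged_index are hypothetical helper names, not part of Intel MKL or ScaLAPACK. The array/scalar distinction assumes that the routine has fewer than 100 arguments (p?heev has 17), so any magnitude of 100 or more is read as an array code.

```c
#include <assert.h>

/* Hypothetical helper type: which argument was rejected, and which entry. */
typedef struct {
    int arg;    /* 1-based position of the offending argument */
    int entry;  /* 1-based entry inside an array argument, 0 for a scalar */
} bad_arg;

/* Decode info < 0 using the documented convention:
 *   array argument i, entry j  ->  info = -(i*100 + j)
 *   scalar argument i          ->  info = -i
 * Assumption: fewer than 100 arguments, so |info| >= 100 means an array. */
bad_arg decode_negative_info(int info)
{
    bad_arg r;
    int code;

    assert(info < 0);
    code = -info;
    if (code >= 100) {
        r.arg = code / 100;
        r.entry = code % 100;
    } else {
        r.arg = code;
        r.entry = 0;
    }
    return r;
}

/* Interpret info > 0 as documented for p?heev:
 *   1..n  -> index of the eigenvalue that did not converge
 *   n+1   -> heterogeneity detected; reported here as 0 */
int pheev_unconverged_index(int info, int n)
{
    assert(info > 0 && info <= n + 1);
    return (info <= n) ? info : 0;
}
```

For example, info = -702 decodes to entry 2 of the seventh argument (desca in the p?heev calling sequence), while info = -3 decodes to the third argument, the scalar n.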
p?heevd Computes all eigenvalues and eigenvectors of a complex Hermitian matrix by using a divide and conquer algorithm. Syntax call pcheevd(jobz, uplo, n, a, ia, ja, desca, w, z, iz, jz, descz, work, lwork, rwork, lrwork, iwork, liwork, info) call pzheevd(jobz, uplo, n, a, ia, ja, desca, w, z, iz, jz, descz, work, lwork, rwork, lrwork, iwork, liwork, info) Include Files • C: mkl_scalapack.h Description The p?heevd routine computes all eigenvalues and eigenvectors of a complex Hermitian matrix A by using a divide and conquer algorithm. Input Parameters np = the number of rows local to a given process. nq = the number of columns local to a given process. jobz (global). CHARACTER*1. Must be 'N' or 'V'. Specifies if it is necessary to compute the eigenvectors: If jobz = 'N', then only eigenvalues are computed. If jobz = 'V', then eigenvalues and eigenvectors are computed. uplo (global). CHARACTER*1. Must be 'U' or 'L'. Specifies whether the upper or lower triangular part of the Hermitian matrix A is stored: If uplo = 'U', a stores the upper triangular part of A. If uplo = 'L', a stores the lower triangular part of A. n (global) INTEGER. The number of rows and columns of the matrix A (n ≥ 0). a (local). COMPLEX for pcheevd DOUBLE COMPLEX for pzheevd. Block cyclic array of global dimension (n, n) and local dimension (lld_a, LOCc(ja+n-1)). On entry, the Hermitian matrix A. If uplo = 'U', only the upper triangular part of A is used to define the elements of the Hermitian matrix. If uplo = 'L', only the lower triangular part of A is used to define the elements of the Hermitian matrix. ia, ja (global) INTEGER. The row and column indices in the global array a indicating the first row and the first column of the submatrix A, respectively. desca (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix A. If desca(ctxt_) is incorrect, p?heevd cannot guarantee correct error reporting. iz, jz (global) INTEGER.
The row and column indices in the global array z indicating the first row and the first column of the submatrix Z, respectively. descz (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix Z. descz(ctxt_) must equal desca(ctxt_). work (local). COMPLEX for pcheevd DOUBLE COMPLEX for pzheevd. Array, DIMENSION lwork. lwork (local). INTEGER. The dimension of the array work. If eigenvalues are requested: lwork = n + (np0 + mq0 + nb)*nb, with np0 = numroc(max(n, nb, 2), nb, 0, 0, NPROW); mq0 = numroc(max(n, nb, 2), nb, 0, 0, NPCOL). If lwork = -1, then lwork is global input and a workspace query is assumed; the routine only calculates the size required for optimal performance for all work arrays. The required workspace is returned as the first element of the corresponding work arrays, and no error message is issued by pxerbla. rwork (local). REAL for pcheevd DOUBLE PRECISION for pzheevd. Workspace array, DIMENSION lrwork. lrwork (local) INTEGER. The dimension of the array rwork. lrwork = 1 + 9*n + 3*np*nq, with np = numroc(n, nb, myrow, iarow, NPROW); nq = numroc(n, nb, mycol, iacol, NPCOL). iwork (local) INTEGER. Workspace array, dimension liwork. liwork (local) INTEGER, dimension of iwork. liwork = 7*n + 8*npcol + 2. Output Parameters a On exit, the lower triangle (if uplo = 'L'), or the upper triangle (if uplo = 'U') of A, including the diagonal, is overwritten. w (global). REAL for pcheevd DOUBLE PRECISION for pzheevd. Array, DIMENSION n. If info = 0, w contains the eigenvalues in ascending order. z (local). COMPLEX for pcheevd DOUBLE COMPLEX for pzheevd. Array, global dimension (n, n), local dimension (lld_z, LOCc(jz+n-1)). The z parameter contains the orthonormal eigenvectors of the matrix A. work(1) On exit, returns adequate workspace to allow optimal performance. rwork(1) (local) REAL for pcheevd DOUBLE PRECISION for pzheevd.
On output, rwork(1) returns workspace required to guarantee completion. iwork(1) (local). On return, iwork(1) contains the amount of integer workspace required. info (global) INTEGER. If info = 0, the execution is successful. If info < 0: If the i-th argument is an array and the j-entry had an illegal value, then info = -(i*100+j). If the i-th argument is a scalar and had an illegal value, then info = -i. If info > 0: If info = 1 through n, the i-th eigenvalue did not converge.
p?heevx Computes selected eigenvalues and, optionally, eigenvectors of a Hermitian matrix. Syntax call pcheevx(jobz, range, uplo, n, a, ia, ja, desca, vl, vu, il, iu, abstol, m, nz, w, orfac, z, iz, jz, descz, work, lwork, rwork, lrwork, iwork, liwork, ifail, iclustr, gap, info) call pzheevx(jobz, range, uplo, n, a, ia, ja, desca, vl, vu, il, iu, abstol, m, nz, w, orfac, z, iz, jz, descz, work, lwork, rwork, lrwork, iwork, liwork, ifail, iclustr, gap, info) Include Files • C: mkl_scalapack.h Description The p?heevx routine computes selected eigenvalues and, optionally, eigenvectors of a complex Hermitian matrix A by calling the recommended sequence of ScaLAPACK routines. Eigenvalues and eigenvectors can be selected by specifying either a range of values or a range of indices for the desired eigenvalues. Input Parameters np = the number of rows local to a given process. nq = the number of columns local to a given process. jobz (global). CHARACTER*1. Must be 'N' or 'V'. Specifies if it is necessary to compute the eigenvectors: If jobz = 'N', then only eigenvalues are computed. If jobz = 'V', then eigenvalues and eigenvectors are computed. range (global). CHARACTER*1. Must be 'A', 'V', or 'I'. If range = 'A', all eigenvalues will be found. If range = 'V', all eigenvalues in the half-open interval [vl, vu] will be found. If range = 'I', the eigenvalues with indices il through iu will be found. uplo (global). CHARACTER*1. Must be 'U' or 'L'.
Specifies whether the upper or lower triangular part of the Hermitian matrix A is stored: If uplo = 'U', a stores the upper triangular part of A. If uplo = 'L', a stores the lower triangular part of A. n (global) INTEGER. The number of rows and columns of the matrix A (n ≥ 0). a (local). COMPLEX for pcheevx DOUBLE COMPLEX for pzheevx. Block cyclic array of global dimension (n, n) and local dimension (lld_a, LOCc(ja+n-1)). On entry, the Hermitian matrix A. If uplo = 'U', only the upper triangular part of A is used to define the elements of the Hermitian matrix. If uplo = 'L', only the lower triangular part of A is used to define the elements of the Hermitian matrix. ia, ja (global) INTEGER. The row and column indices in the global array a indicating the first row and the first column of the submatrix A, respectively. desca (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix A. If desca(ctxt_) is incorrect, p?heevx cannot guarantee correct error reporting. vl, vu (global) REAL for pcheevx DOUBLE PRECISION for pzheevx. If range = 'V', the lower and upper bounds of the interval to be searched for eigenvalues; not referenced if range = 'A' or 'I'. il, iu (global) INTEGER. If range = 'I', the indices of the smallest and largest eigenvalues to be returned. Constraints: il ≥ 1; min(il, n) ≤ iu ≤ n. Not referenced if range = 'A' or 'V'. abstol (global). REAL for pcheevx DOUBLE PRECISION for pzheevx. The absolute error tolerance for the eigenvalues. If jobz = 'V', setting abstol to p?lamch(context, 'U') yields the most orthogonal eigenvectors. An approximate eigenvalue is accepted as converged when it is determined to lie in an interval [a, b] of width less than or equal to abstol+eps*max(|a|,|b|), where eps is the machine precision.
If abstol is less than or equal to zero, then eps*norm(T) will be used in its place, where norm(T) is the 1-norm of the tridiagonal matrix obtained by reducing A to tridiagonal form. Eigenvalues are computed most accurately when abstol is set to twice the underflow threshold 2*p?lamch('S'), not zero. If this routine returns with ((mod(info,2).ne.0).or.(mod(info/8,2).ne.0)), indicating that some eigenvalues or eigenvectors did not converge, try setting abstol to 2*p?lamch('S'). orfac (global). REAL for pcheevx DOUBLE PRECISION for pzheevx. Specifies which eigenvectors should be reorthogonalized. Eigenvectors that correspond to eigenvalues which are within tol=orfac*norm(A) of each other are to be reorthogonalized. However, if the workspace is insufficient (see lwork), tol may be decreased until all eigenvectors to be reorthogonalized can be stored in one process. No reorthogonalization will be done if orfac equals zero. A default value of 1.0e-3 is used if orfac is negative. orfac should be identical on all processes. iz, jz (global) INTEGER. The row and column indices in the global array z indicating the first row and the first column of the submatrix Z, respectively. descz (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix Z. descz( ctxt_ ) must equal desca( ctxt_ ). work (local). COMPLEX for pcheevx DOUBLE COMPLEX for pzheevx. Array, DIMENSION lwork. lwork (local). INTEGER. The dimension of the array work. If only eigenvalues are requested: lwork = n + max(nb*(np0 + 1), 3) If eigenvectors are requested: lwork = n + (np0+mq0+nb)*nb with nq0 = numroc(nn, nb, 0, 0, NPCOL). 
lwork = 5*n + max(5*nn, np0*mq0+2*nb*nb) + iceil(neig, NPROW*NPCOL)*nn. For optimal performance, greater workspace is needed, that is, lwork = max(lwork, nhetrd_lwork), where lwork is as defined above, and nhetrd_lwork = n + 2*(anb+1)*(4*nps+2) + (nps+1)*nps; ictxt = desca(ctxt_); anb = pjlaenv(ictxt, 3, 'pchettrd', 'L', 0, 0, 0, 0); sqnpc = sqrt(dble(NPROW * NPCOL)); nps = max(numroc(n, 1, 0, 0, sqnpc), 2*anb). If lwork = -1, then lwork is global input and a workspace query is assumed; the routine only calculates the size required for optimal performance for all work arrays. Each of these values is returned in the first entry of the corresponding work arrays, and no error message is issued by pxerbla. rwork (local) REAL for pcheevx DOUBLE PRECISION for pzheevx. Workspace array, DIMENSION lrwork. lrwork (local) INTEGER. The dimension of the array rwork. See below for definitions of variables used to define lrwork. If no eigenvectors are requested (jobz = 'N'), then lrwork = 5*nn + 4*n. If eigenvectors are requested (jobz = 'V'), then the amount of workspace required to guarantee that all eigenvectors are computed is: lrwork = 4*n + max(5*nn, np0*mq0+2*nb*nb) + iceil(neig, NPROW*NPCOL)*nn. The computed eigenvectors may not be orthogonal if the minimal workspace is supplied and orfac is too small. If you want to guarantee orthogonality (at the cost of potentially poor performance) you should add the following value to lrwork: (clustersize-1)*n, where clustersize is the number of eigenvalues in the largest cluster, where a cluster is defined as a set of close eigenvalues: {w(k),..., w(k+clustersize-1) | w(j+1) ≤ w(j) + orfac*2*norm(A)}.
Variable definitions:
neig = number of eigenvectors requested;
nb = desca(mb_) = desca(nb_) = descz(mb_) = descz(nb_);
nn = max(n, nb, 2);
desca(rsrc_) = desca(csrc_) = descz(rsrc_) = descz(csrc_) = 0;
np0 = numroc(nn, nb, 0, 0, NPROW);
mq0 = numroc(max(neig, nb, 2), nb, 0, 0, NPCOL);
iceil(x, y) is a ScaLAPACK function returning ceiling(x/y).

When lrwork is too small:
If lwork is too small to guarantee orthogonality, p?heevx attempts to maintain orthogonality in the clusters with the smallest spacing between the eigenvalues. If lwork is too small to compute all the eigenvectors requested, no computation is performed and info = -23 is returned. Note that when range = 'V', p?heevx does not know how many eigenvectors are requested until the eigenvalues are computed. Therefore, when range = 'V' and as long as lwork is large enough to allow p?heevx to compute the eigenvalues, p?heevx will compute the eigenvalues and as many eigenvectors as it can.

Relationship between workspace, orthogonality and performance:
If clustersize ≥ n/sqrt(NPROW*NPCOL), then providing enough space to compute all the eigenvectors orthogonally will cause serious degradation in performance. In the limit (that is, clustersize = n-1) p?stein will perform no better than ?stein on 1 processor.
For clustersize = n/sqrt(NPROW*NPCOL), reorthogonalizing all eigenvectors will increase the total execution time by a factor of 2 or more.
For clustersize > n/sqrt(NPROW*NPCOL), execution time will grow as the square of the cluster size, all other factors remaining equal and assuming enough workspace. Less workspace means less reorthogonalization but faster execution.
If lwork = -1, then lwork is global input and a workspace query is assumed; the routine only calculates the size required for optimal performance for all work arrays. Each of these values is returned in the first entry of the corresponding work arrays, and no error message is issued by pxerbla.
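The minimum lrwork for p?heevx can be evaluated directly from the variable definitions above. The following Python sketch is an illustration only, not MKL code: numroc and iceil are pure-Python re-implementations of the ScaLAPACK tool functions of the same names, while heevx_min_lrwork and the sample problem sizes are hypothetical. In a real program, the workspace query (lwork = -1) remains the more reliable way to obtain these sizes.

```python
def numroc(n, nb, iproc, isrcproc, nprocs):
    """Rows or columns of a block-cyclically distributed dimension of
    global size n (block size nb) that are owned by process iproc."""
    mydist = (nprocs + iproc - isrcproc) % nprocs
    nblocks = n // nb                      # number of whole blocks
    count = (nblocks // nprocs) * nb       # blocks every process receives
    extra = nblocks % nprocs               # leftover whole blocks
    if mydist < extra:
        count += nb                        # one additional whole block
    elif mydist == extra:
        count += n % nb                    # the trailing partial block
    return count


def iceil(x, y):
    """ceiling(x / y) for positive integers."""
    return -(-x // y)


def heevx_min_lrwork(n, nb, neig, nprow, npcol, jobz="V"):
    """Minimum lrwork from the formulas in the text (hypothetical helper)."""
    nn = max(n, nb, 2)
    if jobz == "N":
        return 5 * nn + 4 * n
    np0 = numroc(nn, nb, 0, 0, nprow)
    mq0 = numroc(max(neig, nb, 2), nb, 0, 0, npcol)
    return (4 * n + max(5 * nn, np0 * mq0 + 2 * nb * nb)
            + iceil(neig, nprow * npcol) * nn)


if __name__ == "__main__":
    # Assumed example: n = 1000, 64x64 blocks, all eigenvectors, 2x2 grid.
    print(heevx_min_lrwork(1000, 64, 1000, 2, 2, jobz="N"))
    print(heevx_min_lrwork(1000, 64, 1000, 2, 2, jobz="V"))
```

To guarantee orthogonality, the text asks for an additional (clustersize-1)*n elements on top of the jobz = 'V' value.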
iwork (local) INTEGER. Workspace array.

liwork (local) INTEGER, dimension of iwork.
liwork = 6*nnp
where nnp = max(n, NPROW*NPCOL+1, 4).
If liwork = -1, then liwork is global input and a workspace query is assumed; the routine only calculates the minimum and optimal size for all work arrays. Each of these values is returned in the first entry of the corresponding work array, and no error message is issued by pxerbla.

Output Parameters

a On exit, the lower triangle (if uplo = 'L'), or the upper triangle (if uplo = 'U') of A, including the diagonal, is overwritten.

m (global) INTEGER. The total number of eigenvalues found; 0 ≤ m ≤ n.

nz (global) INTEGER. Total number of eigenvectors computed. 0 ≤ nz ≤ m. The number of columns of z that are filled.
If jobz ≠ 'V', nz is not referenced.
If jobz = 'V', nz = m unless the user supplies insufficient space and p?heevx is not able to detect this before beginning computation. To get all the eigenvectors requested, the user must supply both sufficient space to hold the eigenvectors in z (m.le.descz(n_)) and sufficient workspace to compute them (see lwork). p?heevx is always able to detect insufficient space without computation unless range.eq.'V'.

w (global). REAL for pcheevx, DOUBLE PRECISION for pzheevx. Array, DIMENSION (n). The first m elements contain the selected eigenvalues in ascending order.

z (local). COMPLEX for pcheevx, DOUBLE COMPLEX for pzheevx. Array, global dimension (n, n), local dimension (lld_z, LOCc(jz+n-1)).
If jobz = 'V', then on normal exit the first m columns of z contain the orthonormal eigenvectors of the matrix corresponding to the selected eigenvalues. If an eigenvector fails to converge, then that column of z contains the latest approximation to the eigenvector, and the index of the eigenvector is returned in ifail.
If jobz = 'N', then z is not referenced.

work(1) On exit, returns adequate workspace to allow optimal performance.

rwork (local).
REAL for pcheevx, DOUBLE PRECISION for pzheevx. Array, DIMENSION (lrwork). On return, rwork(1) contains the optimal amount of workspace required for efficient execution.
If jobz = 'N', rwork(1) = optimal amount of workspace required to compute eigenvalues efficiently.
If jobz = 'V', rwork(1) = optimal amount of workspace required to compute eigenvalues and eigenvectors efficiently with no guarantee on orthogonality.
If range = 'V', it is assumed that all eigenvectors may be required.

iwork(1) (local) On return, iwork(1) contains the amount of integer workspace required.

ifail (global) INTEGER. Array, DIMENSION (n).
If jobz = 'V', then on normal exit, the first m elements of ifail are zero. If (mod(info,2).ne.0) on exit, then ifail contains the indices of the eigenvectors that failed to converge.
If jobz = 'N', then ifail is not referenced.

iclustr (global) INTEGER. Array, DIMENSION (2*NPROW*NPCOL). This array contains indices of eigenvectors corresponding to a cluster of eigenvalues that could not be reorthogonalized due to insufficient workspace (see lwork, orfac and info). Eigenvectors corresponding to clusters of eigenvalues indexed iclustr(2*i-1) to iclustr(2*i) could not be reorthogonalized due to lack of workspace. Hence the eigenvectors corresponding to these clusters may not be orthogonal. iclustr() is a zero terminated array: (iclustr(2*k).ne.0.and.iclustr(2*k+1).eq.0) if and only if k is the number of clusters. iclustr is not referenced if jobz = 'N'.

gap (global) REAL for pcheevx, DOUBLE PRECISION for pzheevx. Array, DIMENSION (NPROW*NPCOL). This array contains the gap between eigenvalues whose eigenvectors could not be reorthogonalized. The output values in this array correspond to the clusters indicated by the array iclustr. As a result, the dot product between eigenvectors corresponding to the i-th cluster may be as high as (C*n)/gap(i), where C is a small constant.

info (global) INTEGER. If info = 0, the execution is successful.
If info < 0: If the i-th argument is an array and the j-th entry had an illegal value, then info = -(i*100+j). If the i-th argument is a scalar and had an illegal value, then info = -i.
If info > 0:
If (mod(info,2).ne.0), then one or more eigenvectors failed to converge. Their indices are stored in ifail. Ensure abstol = 2.0*p?lamch('U').
If (mod(info/2,2).ne.0), then eigenvectors corresponding to one or more clusters of eigenvalues could not be reorthogonalized because of insufficient workspace. The indices of the clusters are stored in the array iclustr.
If (mod(info/4,2).ne.0), then space limit prevented p?heevx from computing all of the eigenvectors between vl and vu. The number of eigenvectors computed is returned in nz.
If (mod(info/8,2).ne.0), then p?stebz failed to compute eigenvalues. Ensure abstol = 2.0*p?lamch('U').

p?gesvd
Computes the singular value decomposition of a general matrix, optionally computing the left and/or right singular vectors.

Syntax
call psgesvd(jobu, jobvt, m, n, a, ia, ja, desca, s, u, iu, ju, descu, vt, ivt, jvt, descvt, work, lwork, info)
call pdgesvd(jobu, jobvt, m, n, a, ia, ja, desca, s, u, iu, ju, descu, vt, ivt, jvt, descvt, work, lwork, info)
call pcgesvd(jobu, jobvt, m, n, a, ia, ja, desca, s, u, iu, ju, descu, vt, ivt, jvt, descvt, work, lwork, rwork, info)
call pzgesvd(jobu, jobvt, m, n, a, ia, ja, desca, s, u, iu, ju, descu, vt, ivt, jvt, descvt, work, lwork, rwork, info)

Include Files
• C: mkl_scalapack.h

Description
The p?gesvd routine computes the singular value decomposition (SVD) of an m-by-n matrix A, optionally computing the left and/or right singular vectors. The SVD is written A = U*S*V^T, where S is an m-by-n matrix that is zero except for its min(m, n) diagonal elements, U is an m-by-m orthogonal matrix, and V is an n-by-n orthogonal matrix.
The diagonal elements of S are the singular values of A, and the columns of U and V are the corresponding left and right singular vectors, respectively. The singular values are returned in array s in decreasing order, and only the first min(m,n) columns of U and rows of vt = V^T are computed.

Input Parameters
mp = number of local rows in A and U
nq = number of local columns in A and VT
size = min(m, n)
sizeq = number of local columns in U
sizep = number of local rows in VT

jobu (global). CHARACTER*1. Specifies options for computing all or part of the matrix U.
If jobu = 'V', the first size columns of U (the left singular vectors) are returned in the array u;
If jobu = 'N', no columns of U (no left singular vectors) are computed.

jobvt (global) CHARACTER*1. Specifies options for computing all or part of the matrix VT.
If jobvt = 'V', the first size rows of VT (the right singular vectors) are returned in the array vt;
If jobvt = 'N', no rows of VT (no right singular vectors) are computed.

m (global) INTEGER. The number of rows of the matrix A (m ≥ 0).

n (global) INTEGER. The number of columns in A (n ≥ 0).

a (local). REAL for psgesvd, DOUBLE PRECISION for pdgesvd, COMPLEX for pcgesvd, COMPLEX*16 for pzgesvd. Block cyclic array, global dimension (m, n), local dimension (mp, nq). work(lwork) is a workspace array.

ia, ja (global) INTEGER. The row and column indices in the global array a indicating the first row and the first column of the submatrix A, respectively.

desca (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix A.

iu, ju (global) INTEGER. The row and column indices in the global array u indicating the first row and the first column of the submatrix U, respectively.

descu (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix U.

ivt, jvt (global) INTEGER.
The row and column indices in the global array vt indicating the first row and the first column of the submatrix VT, respectively.

descvt (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix VT.

work (local). REAL for psgesvd, DOUBLE PRECISION for pdgesvd, COMPLEX for pcgesvd, COMPLEX*16 for pzgesvd. Workspace array, dimension (lwork).

lwork (local) INTEGER. The dimension of the array work;
lwork > 2 + 6*sizeb + max(watobd, wbdtosvd),
where sizeb = max(m, n), and watobd and wbdtosvd refer, respectively, to the workspace required to bidiagonalize the matrix A and to go from the bidiagonal matrix to the singular value decomposition U*S*V^T.
For watobd, the following holds:
watobd = max(max(wp?lange, wp?gebrd), max(wp?lared2d, wp?lared1d)),
where wp?lange, wp?lared1d, wp?lared2d, wp?gebrd are the workspaces required respectively for the subprograms p?lange, p?lared1d, p?lared2d, p?gebrd. Using the standard notation
mp = numroc(m, mb, MYROW, desca(ctxt_), NPROW),
nq = numroc(n, nb, MYCOL, desca(lld_), NPCOL),
the workspaces required for the above subprograms are
wp?lange = mp,
wp?lared1d = nq0,
wp?lared2d = mp0,
wp?gebrd = nb*(mp + nq + 1) + nq,
where nq0 and mp0 refer, respectively, to the values obtained at MYCOL = 0 and MYROW = 0. In general, the upper limit for the workspace is given by the workspace required on processor (0,0):
watobd = nb*(mp0 + nq0 + 1) + nq0.
In case of a homogeneous process grid this upper limit can be used as an estimate of the minimum workspace for every processor.
For wbdtosvd, the following holds:
wbdtosvd = size*(wantu*nru + wantvt*ncvt) + max(w?bdsqr, max(wantu*wp?ormbrqln, wantvt*wp?ormbrprt)),
where wantu (wantvt) = 1 if left (right) singular vectors are wanted, and wantu (wantvt) = 0 otherwise.
w?bdsqr, wp?ormbrqln, and wp?ormbrprt refer respectively to the workspace required for the subprograms ?bdsqr, p?ormbr(qln), and p?ormbr(prt), where qln and prt are the values of the arguments vect, side, and trans in the call to p?ormbr. nru is equal to the local number of rows of the matrix U when distributed across a 1-dimensional "column" of processes. Analogously, ncvt is equal to the local number of columns of the matrix VT when distributed across a 1-dimensional "row" of processes. Calling the LAPACK procedure ?bdsqr requires
w?bdsqr = max(1, 2*size + (2*size - 4)*max(wantu, wantvt))
on every processor. Finally,
wp?ormbrqln = max((nb*(nb-1))/2, (sizeq+mp)*nb) + nb*nb,
wp?ormbrprt = max((mb*(mb-1))/2, (sizep+nq)*mb) + mb*mb.
If lwork = -1, then lwork is global input and a workspace query is assumed; the routine only calculates the minimum size for the work array. The required workspace is returned as the first element of work and no error message is issued by pxerbla.

rwork REAL for pcgesvd, DOUBLE PRECISION for pzgesvd. Workspace array, dimension (1 + 4*sizeb). Present only in the calling sequences of the complex flavors.

Output Parameters

a On exit, the contents of a are destroyed.

s (global). REAL for psgesvd and pcgesvd, DOUBLE PRECISION for pdgesvd and pzgesvd. Array, DIMENSION (size). Contains the singular values of A sorted so that s(i) ≥ s(i+1).

u (local). REAL for psgesvd, DOUBLE PRECISION for pdgesvd, COMPLEX for pcgesvd, COMPLEX*16 for pzgesvd. Local dimension (mp, sizeq), global dimension (m, size).
If jobu = 'V', u contains the first min(m, n) columns of U.
If jobu = 'N' or 'O', u is not referenced.

vt (local). REAL for psgesvd, DOUBLE PRECISION for pdgesvd, COMPLEX for pcgesvd, COMPLEX*16 for pzgesvd. Local dimension (sizep, nq), global dimension (size, n).
If jobvt = 'V', vt contains the first size rows of V^T.
If jobvt = 'N', vt is not referenced.

work On exit, if info = 0, then work(1) returns the required minimal size of lwork.
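The lower bound on lwork for p?gesvd combines several sub-formulas, which can be easier to follow when evaluated step by step. The Python sketch below is illustrative only: the function name gesvd_lwork_bound and its keyword arguments are hypothetical, every local size (mp, nq, mp0, nq0, sizeq, sizep, nru, ncvt) is assumed to be supplied by the caller, and watobd uses the processor (0,0) upper limit quoted in the text. In practice the workspace query with lwork = -1 is the simpler route.

```python
def gesvd_lwork_bound(*, m, n, nb, mb, mp, nq, mp0, nq0,
                      sizeq, sizep, nru, ncvt, wantu, wantvt):
    """Evaluate 2 + 6*sizeb + max(watobd, wbdtosvd) from the text.

    wantu / wantvt are 1 when left / right singular vectors are wanted,
    0 otherwise. Per the text, lwork must exceed the returned value.
    """
    size = min(m, n)
    sizeb = max(m, n)

    # Bidiagonalization workspace: upper limit taken on processor (0,0).
    watobd = nb * (mp0 + nq0 + 1) + nq0

    # Workspace of the LAPACK routine ?bdsqr.
    w_bdsqr = max(1, 2 * size + (2 * size - 4) * max(wantu, wantvt))

    # Workspaces of the two p?ormbr calls (integer division as in Fortran).
    wp_ormbrqln = max((nb * (nb - 1)) // 2, (sizeq + mp) * nb) + nb * nb
    wp_ormbrprt = max((mb * (mb - 1)) // 2, (sizep + nq) * mb) + mb * mb

    wbdtosvd = (size * (wantu * nru + wantvt * ncvt)
                + max(w_bdsqr, max(wantu * wp_ormbrqln,
                                   wantvt * wp_ormbrprt)))
    return 2 + 6 * sizeb + max(watobd, wbdtosvd)


if __name__ == "__main__":
    # Assumed example: a 4-by-4 matrix on a single process, 2x2 blocks.
    sizes = dict(m=4, n=4, nb=2, mb=2, mp=4, nq=4, mp0=4, nq0=4,
                 sizeq=4, sizep=4, nru=4, ncvt=4)
    print(gesvd_lwork_bound(wantu=1, wantvt=1, **sizes))
    print(gesvd_lwork_bound(wantu=0, wantvt=0, **sizes))
```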
rwork On exit, if info = 0, then rwork(1) returns the required size of rwork.

info (global) INTEGER.
If info = 0, the execution is successful.
If info < 0: if info = -i, the i-th parameter had an illegal value.
If info > 0, then ?bdsqr did not converge.
If info = min(m,n) + 1, then p?gesvd has detected heterogeneity by finding that eigenvalues were not identical across the process grid. In this case, the accuracy of the results from p?gesvd cannot be guaranteed.

See Also
?bdsqr
p?ormbr
pxerbla

p?sygvx
Computes selected eigenvalues and, optionally, eigenvectors of a real generalized symmetric definite eigenproblem.

Syntax
call pssygvx(ibtype, jobz, range, uplo, n, a, ia, ja, desca, b, ib, jb, descb, vl, vu, il, iu, abstol, m, nz, w, orfac, z, iz, jz, descz, work, lwork, iwork, liwork, ifail, iclustr, gap, info)
call pdsygvx(ibtype, jobz, range, uplo, n, a, ia, ja, desca, b, ib, jb, descb, vl, vu, il, iu, abstol, m, nz, w, orfac, z, iz, jz, descz, work, lwork, iwork, liwork, ifail, iclustr, gap, info)

Include Files
• C: mkl_scalapack.h

Description
The p?sygvx routine computes all the eigenvalues, and optionally, the eigenvectors of a real generalized symmetric-definite eigenproblem, of the form
sub(A)*x = λ*sub(B)*x, sub(A)*sub(B)*x = λ*x, or sub(B)*sub(A)*x = λ*x.
Here x denotes eigenvectors, λ (lambda) denotes eigenvalues, sub(A), denoting A(ia:ia+n-1, ja:ja+n-1), is assumed to be symmetric, and sub(B), denoting B(ib:ib+n-1, jb:jb+n-1), is also positive definite.

Input Parameters

ibtype (global) INTEGER. Must be 1 or 2 or 3. Specifies the problem type to be solved:
If ibtype = 1, the problem type is sub(A)*x = lambda*sub(B)*x;
If ibtype = 2, the problem type is sub(A)*sub(B)*x = lambda*x;
If ibtype = 3, the problem type is sub(B)*sub(A)*x = lambda*x.

jobz (global). CHARACTER*1. Must be 'N' or 'V'.
If jobz = 'N', then compute eigenvalues only.
If jobz = 'V', then compute eigenvalues and eigenvectors.

range (global).
CHARACTER*1. Must be 'A' or 'V' or 'I'.
If range = 'A', the routine computes all eigenvalues.
If range = 'V', the routine computes eigenvalues in the interval [vl, vu].
If range = 'I', the routine computes eigenvalues with indices il through iu.

uplo (global). CHARACTER*1. Must be 'U' or 'L'.
If uplo = 'U', arrays a and b store the upper triangles of sub(A) and sub(B);
If uplo = 'L', arrays a and b store the lower triangles of sub(A) and sub(B).

n (global). INTEGER. The order of the matrices sub(A) and sub(B), n ≥ 0.

a (local) REAL for pssygvx, DOUBLE PRECISION for pdsygvx. Pointer into the local memory to an array of dimension (lld_a, LOCc(ja+n-1)). On entry, this array contains the local pieces of the n-by-n symmetric distributed matrix sub(A).
If uplo = 'U', the leading n-by-n upper triangular part of sub(A) contains the upper triangular part of the matrix.
If uplo = 'L', the leading n-by-n lower triangular part of sub(A) contains the lower triangular part of the matrix.

ia, ja (global) INTEGER. The row and column indices in the global array a indicating the first row and the first column of the submatrix A, respectively.

desca (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix A. If desca(ctxt_) is incorrect, p?sygvx cannot guarantee correct error reporting.

b (local). REAL for pssygvx, DOUBLE PRECISION for pdsygvx. Pointer into the local memory to an array of dimension (lld_b, LOCc(jb+n-1)). On entry, this array contains the local pieces of the n-by-n symmetric distributed matrix sub(B).
If uplo = 'U', the leading n-by-n upper triangular part of sub(B) contains the upper triangular part of the matrix.
If uplo = 'L', the leading n-by-n lower triangular part of sub(B) contains the lower triangular part of the matrix.

ib, jb (global) INTEGER. The row and column indices in the global array b indicating the first row and the first column of the submatrix B, respectively.
descb (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix B. descb(ctxt_) must be equal to desca(ctxt_).

vl, vu (global) REAL for pssygvx, DOUBLE PRECISION for pdsygvx.
If range = 'V', the lower and upper bounds of the interval to be searched for eigenvalues.
If range = 'A' or 'I', vl and vu are not referenced.

il, iu (global) INTEGER.
If range = 'I', the indices in ascending order of the smallest and largest eigenvalues to be returned. Constraint: il ≥ 1, min(il, n) ≤ iu ≤ n.
If range = 'A' or 'V', il and iu are not referenced.

abstol (global) REAL for pssygvx, DOUBLE PRECISION for pdsygvx. The absolute error tolerance for the eigenvalues. If jobz = 'V', setting abstol to p?lamch(context, 'U') yields the most orthogonal eigenvectors. An approximate eigenvalue is accepted as converged when it is determined to lie in an interval [a,b] of width less than or equal to abstol + eps*max(|a|,|b|), where eps is the machine precision. If abstol is less than or equal to zero, then eps*norm(T) will be used in its place, where norm(T) is the 1-norm of the tridiagonal matrix obtained by reducing A to tridiagonal form. Eigenvalues will be computed most accurately when abstol is set to twice the underflow threshold, 2*p?lamch('S'), not zero. If this routine returns with ((mod(info,2).ne.0).or.(mod(info/8,2).ne.0)), indicating that some eigenvalues or eigenvectors did not converge, try setting abstol to 2*p?lamch('S').

orfac (global). REAL for pssygvx, DOUBLE PRECISION for pdsygvx. Specifies which eigenvectors should be reorthogonalized. Eigenvectors that correspond to eigenvalues which are within tol = orfac*norm(A) of each other are to be reorthogonalized. However, if the workspace is insufficient (see lwork), tol may be decreased until all eigenvectors to be reorthogonalized can be stored in one process. No reorthogonalization will be done if orfac equals zero.
A default value of 1.0e-3 is used if orfac is negative. orfac should be identical on all processes.

iz, jz (global) INTEGER. The row and column indices in the global array z indicating the first row and the first column of the submatrix Z, respectively.

descz (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix Z. descz(ctxt_) must equal desca(ctxt_).

work (local) REAL for pssygvx, DOUBLE PRECISION for pdsygvx. Workspace array, dimension (lwork).

lwork (local) INTEGER. Dimension of the array work. See below for definitions of variables used to define lwork.
If no eigenvectors are requested (jobz = 'N'), then lwork = 5*n + max(5*nn, nb*(np0 + 1)).
If eigenvectors are requested (jobz = 'V'), then the amount of workspace required to guarantee that all eigenvectors are computed is:
lwork = 5*n + max(5*nn, np0*mq0 + 2*nb*nb) + iceil(neig, NPROW*NPCOL)*nn.
The computed eigenvectors may not be orthogonal if the minimal workspace is supplied and orfac is too small. If you want to guarantee orthogonality at the cost of potentially poor performance you should add the following to lwork: (clustersize-1)*n, where clustersize is the number of eigenvalues in the largest cluster, and a cluster is defined as a set of close eigenvalues:
{w(k),..., w(k+clustersize-1) | w(j+1) ≤ w(j) + orfac*2*norm(A)}
Variable definitions:
neig = number of eigenvectors requested,
nb = desca(mb_) = desca(nb_) = descz(mb_) = descz(nb_),
nn = max(n, nb, 2),
desca(rsrc_) = desca(csrc_) = descz(rsrc_) = descz(csrc_) = 0,
np0 = numroc(nn, nb, 0, 0, NPROW),
mq0 = numroc(max(neig, nb, 2), nb, 0, 0, NPCOL),
iceil(x, y) is a ScaLAPACK function returning ceiling(x/y).
If lwork is too small to guarantee orthogonality, p?syevx attempts to maintain orthogonality in the clusters with the smallest spacing between the eigenvalues.
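The cluster definition above determines how much extra workspace is needed to guarantee orthogonality. As a rough illustration, the following Python sketch finds the largest cluster in a sorted list of eigenvalues and the corresponding padding (clustersize-1)*n. The function names and sample values are hypothetical, and the sketch assumes that eigenvalue estimates are already available (for example from an earlier eigenvalues-only run), since clustersize is generally not known before the routine is called.

```python
def largest_cluster(w, orfac, norm_a):
    """Size of the largest run of eigenvalues in w (sorted ascending)
    with w[j+1] <= w[j] + orfac*2*norm_a, as in the cluster definition."""
    if not w:
        return 0
    tol = orfac * 2 * norm_a
    best = run = 1
    for prev, cur in zip(w, w[1:]):
        # Extend the current cluster or start a new one.
        run = run + 1 if cur <= prev + tol else 1
        best = max(best, run)
    return best


def orthogonality_padding(w, orfac, norm_a, n):
    """Extra workspace (clustersize-1)*n suggested by the text."""
    return max(largest_cluster(w, orfac, norm_a) - 1, 0) * n


if __name__ == "__main__":
    # Assumed example: three close eigenvalues near 1.0, orfac = 1.0e-3.
    w = [0.0, 1.0, 1.0005, 1.001, 5.0]
    print(largest_cluster(w, 1.0e-3, 1.0))
    print(orthogonality_padding(w, 1.0e-3, 1.0, n=5))
```

A large padding value is a hint that full reorthogonalization may be expensive, which is the trade-off the performance remarks in this section describe.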
If lwork is too small to compute all the eigenvectors requested, no computation is performed and info = -23 is returned. Note that when range = 'V', the number of requested eigenvectors is not known until the eigenvalues are computed. In this case, and if lwork is large enough to compute the eigenvalues, p?sygvx computes the eigenvalues and as many eigenvectors as possible.
Greater performance can be achieved if adequate workspace is provided. In some situations, performance can decrease as the provided workspace increases above the workspace amount shown below:
lwork = max(lwork, 5*n + nsytrd_lwopt, nsygst_lwopt),
where lwork, as defined previously, depends upon the number of eigenvectors requested, and
nsytrd_lwopt = n + 2*(anb+1)*(4*nps+2) + (nps+3)*nps
nsygst_lwopt = 2*np0*nb + nq0*nb + nb*nb
anb = pjlaenv(desca(ctxt_), 3, 'p?syttrd', 'L', 0, 0, 0, 0)
sqnpc = int(sqrt(dble(NPROW * NPCOL)))
nps = max(numroc(n, 1, 0, 0, sqnpc), 2*anb)
nb = desca(mb_)
np0 = numroc(n, nb, 0, 0, NPROW)
nq0 = numroc(n, nb, 0, 0, NPCOL)
numroc is a ScaLAPACK tool function; pjlaenv is a ScaLAPACK environmental inquiry function. MYROW, MYCOL, NPROW and NPCOL can be determined by calling the subroutine blacs_gridinfo.
For large n, no extra workspace is needed; however, the biggest boost in performance comes for small n, so it is wise to provide the extra workspace (typically less than a megabyte per process).
If clustersize ≥ n/sqrt(NPROW*NPCOL), then providing enough space to compute all the eigenvectors orthogonally will cause serious degradation in performance. At the limit (that is, clustersize = n-1) p?stein will perform no better than ?stein on a single processor.
For clustersize = n/sqrt(NPROW*NPCOL), reorthogonalizing all eigenvectors will increase the total execution time by a factor of 2 or more.
For clustersize > n/sqrt(NPROW*NPCOL), execution time will grow as the square of the cluster size, all other factors remaining equal and assuming enough workspace.
Less workspace means less reorthogonalization but faster execution.
If lwork = -1, then lwork is global input and a workspace query is assumed; the routine only calculates the size required for optimal performance for all work arrays. Each of these values is returned in the first entry of the corresponding work arrays, and no error message is issued by pxerbla.

iwork (local) INTEGER. Workspace array.

liwork (local) INTEGER, dimension of iwork.
liwork = 6*nnp
where nnp = max(n, NPROW*NPCOL + 1, 4).
If liwork = -1, then liwork is global input and a workspace query is assumed; the routine only calculates the minimum and optimal size for all work arrays. Each of these values is returned in the first entry of the corresponding work array, and no error message is issued by pxerbla.

Output Parameters

a On exit:
If jobz = 'V', and if info = 0, sub(A) contains the distributed matrix Z of eigenvectors. The eigenvectors are normalized as follows:
for ibtype = 1 or 2, Z^T*sub(B)*Z = I;
for ibtype = 3, Z^T*inv(sub(B))*Z = I.
If jobz = 'N', then on exit the upper triangle (if uplo = 'U') or the lower triangle (if uplo = 'L') of sub(A), including the diagonal, is destroyed.

b On exit, if info ≤ n, the part of sub(B) containing the matrix is overwritten by the triangular factor U or L from the Cholesky factorization sub(B) = U^T*U or sub(B) = L*L^T.

m (global) INTEGER. The total number of eigenvalues found, 0 ≤ m ≤ n.

nz (global) INTEGER. Total number of eigenvectors computed. 0 ≤ nz ≤ m. The number of columns of z that are filled.
If jobz ≠ 'V', nz is not referenced.
If jobz = 'V', nz = m unless the user supplies insufficient space and p?sygvx is not able to detect this before beginning computation. To get all the eigenvectors requested, the user must supply both sufficient space to hold the eigenvectors in z (m.le.descz(n_)) and sufficient workspace to compute them (see lwork). p?sygvx is always able to detect insufficient space without computation unless range.eq.'V'.
w (global) REAL for pssygvx, DOUBLE PRECISION for pdsygvx. Array, DIMENSION (n). On normal exit, the first m entries contain the selected eigenvalues in ascending order.

z (local). REAL for pssygvx, DOUBLE PRECISION for pdsygvx. Global dimension (n, n), local dimension (lld_z, LOCc(jz+n-1)).
If jobz = 'V', then on normal exit the first m columns of z contain the orthonormal eigenvectors of the matrix corresponding to the selected eigenvalues. If an eigenvector fails to converge, then that column of z contains the latest approximation to the eigenvector, and the index of the eigenvector is returned in ifail.
If jobz = 'N', then z is not referenced.

work If jobz = 'N', work(1) = optimal amount of workspace required to compute eigenvalues efficiently.
If jobz = 'V', work(1) = optimal amount of workspace required to compute eigenvalues and eigenvectors efficiently with no guarantee on orthogonality.
If range = 'V', it is assumed that all eigenvectors may be required.

ifail (global) INTEGER. Array, DIMENSION (n). ifail provides additional information when info.ne.0.
If (mod(info/16,2).ne.0), then ifail(1) indicates the order of the smallest minor which is not positive definite.
If (mod(info,2).ne.0) on exit, then ifail contains the indices of the eigenvectors that failed to converge.
If neither of the above error conditions holds and jobz = 'V', then the first m elements of ifail are set to zero.

iclustr (global) INTEGER. Array, DIMENSION (2*NPROW*NPCOL). This array contains indices of eigenvectors corresponding to a cluster of eigenvalues that could not be reorthogonalized due to insufficient workspace (see lwork, orfac and info). Eigenvectors corresponding to clusters of eigenvalues indexed iclustr(2*i-1) to iclustr(2*i) could not be reorthogonalized due to lack of workspace. Hence the eigenvectors corresponding to these clusters may not be orthogonal. iclustr() is a zero terminated array. (iclustr(2*k).ne.0.and.
iclustr(2*k+1).eq.0) if and only if k is the number of clusters. iclustr is not referenced if jobz = 'N'.

gap (global) REAL for pssygvx, DOUBLE PRECISION for pdsygvx. Array, DIMENSION (NPROW*NPCOL). This array contains the gap between eigenvalues whose eigenvectors could not be reorthogonalized. The output values in this array correspond to the clusters indicated by the array iclustr. As a result, the dot product between eigenvectors corresponding to the i-th cluster may be as high as (C*n)/gap(i), where C is a small constant.

info (global) INTEGER.
If info = 0, the execution is successful.
If info < 0: if the i-th argument is an array and the j-th entry had an illegal value, then info = -(i*100+j); if the i-th argument is a scalar and had an illegal value, then info = -i.
If info > 0:
If (mod(info,2).ne.0), then one or more eigenvectors failed to converge. Their indices are stored in ifail.
If (mod(info/2,2).ne.0), then eigenvectors corresponding to one or more clusters of eigenvalues could not be reorthogonalized because of insufficient workspace. The indices of the clusters are stored in the array iclustr.
If (mod(info/4,2).ne.0), then space limit prevented p?sygvx from computing all of the eigenvectors between vl and vu. The number of eigenvectors computed is returned in nz.
If (mod(info/8,2).ne.0), then p?stebz failed to compute eigenvalues.
If (mod(info/16,2).ne.0), then B was not positive definite. ifail(1) indicates the order of the smallest minor which is not positive definite.

p?hegvx
Computes selected eigenvalues and, optionally, eigenvectors of a complex generalized Hermitian definite eigenproblem.
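The info and iclustr outputs described above for p?sygvx are compact encodings: a positive info packs several conditions into bit flags, and iclustr is a zero-terminated list of index pairs. The Python sketch below shows one way to unpack them on the caller's side. It is a minimal illustration with hypothetical helper names (decode_info, parse_iclustr), not part of any MKL interface; iclustr is assumed to have been copied into a Python list whose values keep their 1-based Fortran meaning.

```python
# Bit position -> condition, mirroring the mod(info/2**bit, 2).ne.0 tests.
INFO_FLAGS = {
    0: "one or more eigenvectors failed to converge (see ifail)",
    1: "some clusters could not be reorthogonalized (see iclustr)",
    2: "space limit prevented computing all eigenvectors (see nz)",
    3: "p?stebz failed to compute eigenvalues",
    4: "B was not positive definite (see ifail(1))",
}


def decode_info(info):
    """Translate an info value into a list of human-readable messages."""
    if info == 0:
        return ["success"]
    if info < 0:
        code = -info
        if code >= 100:
            # info = -(i*100 + j): entry j of array argument i.
            return ["entry %d of array argument %d had an illegal value"
                    % (code % 100, code // 100)]
        return ["scalar argument %d had an illegal value" % code]
    return [text for bit, text in INFO_FLAGS.items() if (info >> bit) & 1]


def parse_iclustr(iclustr):
    """Return (first, last) eigenvector index pairs up to the zero terminator."""
    pairs = []
    for i in range(0, len(iclustr) - 1, 2):
        if iclustr[i] == 0:
            break
        pairs.append((iclustr[i], iclustr[i + 1]))
    return pairs


if __name__ == "__main__":
    for message in decode_info(5):      # bits 0 and 2 are set
        print(message)
    print(parse_iclustr([3, 5, 9, 10, 0, 0, 0, 0]))
```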
Syntax
call pchegvx(ibtype, jobz, range, uplo, n, a, ia, ja, desca, b, ib, jb, descb, vl, vu, il, iu, abstol, m, nz, w, orfac, z, iz, jz, descz, work, lwork, rwork, lrwork, iwork, liwork, ifail, iclustr, gap, info)
call pzhegvx(ibtype, jobz, range, uplo, n, a, ia, ja, desca, b, ib, jb, descb, vl, vu, il, iu, abstol, m, nz, w, orfac, z, iz, jz, descz, work, lwork, rwork, lrwork, iwork, liwork, ifail, iclustr, gap, info)

Include Files
• C: mkl_scalapack.h

Description
The p?hegvx routine computes all the eigenvalues, and optionally, the eigenvectors of a complex generalized Hermitian-definite eigenproblem, of the form
sub(A)*x = λ*sub(B)*x, sub(A)*sub(B)*x = λ*x, or sub(B)*sub(A)*x = λ*x.
Here sub(A), denoting A(ia:ia+n-1, ja:ja+n-1), and sub(B), denoting B(ib:ib+n-1, jb:jb+n-1), are assumed to be Hermitian, and sub(B) is also positive definite.

Input Parameters

ibtype (global) INTEGER. Must be 1 or 2 or 3. Specifies the problem type to be solved:
If ibtype = 1, the problem type is sub(A)*x = lambda*sub(B)*x;
If ibtype = 2, the problem type is sub(A)*sub(B)*x = lambda*x;
If ibtype = 3, the problem type is sub(B)*sub(A)*x = lambda*x.

jobz (global). CHARACTER*1. Must be 'N' or 'V'.
If jobz = 'N', then compute eigenvalues only.
If jobz = 'V', then compute eigenvalues and eigenvectors.

range (global). CHARACTER*1. Must be 'A' or 'V' or 'I'.
If range = 'A', the routine computes all eigenvalues.
If range = 'V', the routine computes eigenvalues in the interval [vl, vu].
If range = 'I', the routine computes eigenvalues with indices il through iu.

uplo (global). CHARACTER*1. Must be 'U' or 'L'.
If uplo = 'U', arrays a and b store the upper triangles of sub(A) and sub(B);
If uplo = 'L', arrays a and b store the lower triangles of sub(A) and sub(B).

n (global). INTEGER. The order of the matrices sub(A) and sub(B) (n ≥ 0).

a (local) COMPLEX for pchegvx, DOUBLE COMPLEX for pzhegvx.
Pointer into the local memory to an array of dimension (lld_a, LOCc(ja+n-1)). On entry, this array contains the local pieces of the n-by-n Hermitian distributed matrix sub(A).
If uplo = 'U', the leading n-by-n upper triangular part of sub(A) contains the upper triangular part of the matrix.
If uplo = 'L', the leading n-by-n lower triangular part of sub(A) contains the lower triangular part of the matrix.

ia, ja (global) INTEGER. The row and column indices in the global array a indicating the first row and the first column of the submatrix A, respectively.

desca (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix A. If desca(ctxt_) is incorrect, p?hegvx cannot guarantee correct error reporting.

b (local). COMPLEX for pchegvx, DOUBLE COMPLEX for pzhegvx. Pointer into the local memory to an array of dimension (lld_b, LOCc(jb+n-1)). On entry, this array contains the local pieces of the n-by-n Hermitian distributed matrix sub(B).
If uplo = 'U', the leading n-by-n upper triangular part of sub(B) contains the upper triangular part of the matrix.
If uplo = 'L', the leading n-by-n lower triangular part of sub(B) contains the lower triangular part of the matrix.

ib, jb (global) INTEGER. The row and column indices in the global array b indicating the first row and the first column of the submatrix B, respectively.

descb (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix B. descb(ctxt_) must be equal to desca(ctxt_).

vl, vu (global) REAL for pchegvx, DOUBLE PRECISION for pzhegvx.
If range = 'V', the lower and upper bounds of the interval to be searched for eigenvalues.
If range = 'A' or 'I', vl and vu are not referenced.

il, iu (global) INTEGER.
If range = 'I', the indices in ascending order of the smallest and largest eigenvalues to be returned. Constraint: il ≥ 1, min(il, n) ≤ iu ≤ n.
If range = 'A' or 'V', il and iu are not referenced.
abstol (global) REAL for pchegvx; DOUBLE PRECISION for pzhegvx. The absolute error tolerance for the eigenvalues. If jobz = 'V', setting abstol to p?lamch(context, 'U') yields the most orthogonal eigenvectors. An approximate eigenvalue is accepted as converged when it is determined to lie in an interval [a,b] of width less than or equal to abstol + eps*max(|a|,|b|), where eps is the machine precision. If abstol is less than or equal to zero, then eps*norm(T) will be used in its place, where norm(T) is the 1-norm of the tridiagonal matrix obtained by reducing A to tridiagonal form. Eigenvalues will be computed most accurately when abstol is set to twice the underflow threshold, 2*p?lamch('S'), not zero. If this routine returns with ((mod(info,2).ne.0).or.(mod(info/8,2).ne.0)), indicating that some eigenvalues or eigenvectors did not converge, try setting abstol to 2*p?lamch('S').
orfac (global) REAL for pchegvx; DOUBLE PRECISION for pzhegvx. Specifies which eigenvectors should be reorthogonalized. Eigenvectors that correspond to eigenvalues which are within tol = orfac*norm(A) of each other are to be reorthogonalized. However, if the workspace is insufficient (see lwork), tol may be decreased until all eigenvectors to be reorthogonalized can be stored in one process. No reorthogonalization will be done if orfac equals zero. A default value of 1.0E-3 is used if orfac is negative. orfac should be identical on all processes.
iz, jz (global) INTEGER. The row and column indices in the global array z indicating the first row and the first column of the submatrix Z, respectively.
descz (global and local) INTEGER array, dimension (dlen_). The array descriptor for the distributed matrix Z. descz(ctxt_) must equal desca(ctxt_).
work (local) COMPLEX for pchegvx; DOUBLE COMPLEX for pzhegvx. Workspace array, dimension (lwork).
lwork (local) INTEGER. The dimension of the array work.
If only eigenvalues are requested: lwork ≥ n + max(NB*(np0 + 1), 3)
If eigenvectors are requested: lwork ≥ n + (np0 + mq0 + NB)*NB
with nq0 = numroc(nn, NB, 0, 0, NPCOL).
For optimal performance, greater workspace is needed, that is
lwork ≥ max(lwork, n, nhetrd_lwopt, nhegst_lwopt)
where lwork is as defined above, and
nhetrd_lwopt = 2*(anb+1)*(4*nps+2) + (nps + 1)*nps;
nhegst_lwopt = 2*np0*nb + nq0*nb + nb*nb
nb = desca(mb_)
np0 = numroc(n, nb, 0, 0, NPROW)
nq0 = numroc(n, nb, 0, 0, NPCOL)
ictxt = desca(ctxt_)
anb = pjlaenv(ictxt, 3, 'p?hettrd', 'L', 0, 0, 0, 0)
sqnpc = sqrt(dble(NPROW * NPCOL))
nps = max(numroc(n, 1, 0, 0, sqnpc), 2*anb)
numroc is a ScaLAPACK tool function; pjlaenv is a ScaLAPACK environmental inquiry function. MYROW, MYCOL, NPROW and NPCOL can be determined by calling the subroutine blacs_gridinfo.
If lwork = -1, then lwork is global input and a workspace query is assumed; the routine only calculates the size required for optimal performance for all work arrays. Each of these values is returned in the first entry of the corresponding work arrays, and no error message is issued by pxerbla.
rwork (local) REAL for pchegvx; DOUBLE PRECISION for pzhegvx. Workspace array, DIMENSION (lrwork).
lrwork (local) INTEGER. The dimension of the array rwork. See below for definitions of variables used to define lrwork.
If no eigenvectors are requested (jobz = 'N'), then lrwork ≥ 5*nn + 4*n
If eigenvectors are requested (jobz = 'V'), then the amount of workspace required to guarantee that all eigenvectors are computed is:
lrwork ≥ 4*n + max(5*nn, np0*mq0) + iceil(neig, NPROW*NPCOL)*nn
The computed eigenvectors may not be orthogonal if the minimal workspace is supplied and orfac is too small.
If you want to guarantee orthogonality (at the cost of potentially poor performance), add the following value to lrwork: (clustersize-1)*n, where clustersize is the number of eigenvalues in the largest cluster, and a cluster is defined as a set of close eigenvalues:
{w(k),..., w(k+clustersize-1) | w(j+1) ≤ w(j) + orfac*2*norm(A)}
Variable definitions:
neig = number of eigenvectors requested;
nb = desca(mb_) = desca(nb_) = descz(mb_) = descz(nb_);
nn = max(n, nb, 2);
desca(rsrc_) = desca(csrc_) = descz(rsrc_) = descz(csrc_) = 0;
np0 = numroc(nn, nb, 0, 0, NPROW);
mq0 = numroc(max(neig, nb, 2), nb, 0, 0, NPCOL);
iceil(x, y) is a ScaLAPACK function returning ceiling(x/y).
When lrwork is too small: If lrwork is too small to guarantee orthogonality, p?hegvx attempts to maintain orthogonality in the clusters with the smallest spacing between the eigenvalues. If lrwork is too small to compute all the eigenvectors requested, no computation is performed and info = -25 is returned. Note that when range = 'V', p?hegvx does not know how many eigenvectors are requested until the eigenvalues are computed. Therefore, when range = 'V' and as long as lrwork is large enough to allow p?hegvx to compute the eigenvalues, p?hegvx will compute the eigenvalues and as many eigenvectors as it can.
Relationship between workspace, orthogonality & performance: If clustersize > n/sqrt(NPROW*NPCOL), then providing enough space to compute all the eigenvectors orthogonally will cause serious degradation in performance. In the limit (that is, clustersize = n-1) p?stein will perform no better than ?stein on 1 processor. For clustersize = n/sqrt(NPROW*NPCOL), reorthogonalizing all eigenvectors will increase the total execution time by a factor of 2 or more. For clustersize > n/sqrt(NPROW*NPCOL), execution time will grow as the square of the cluster size, all other factors remaining equal and assuming enough workspace.
Less workspace means less reorthogonalization but faster execution.
If lrwork = -1, then lrwork is global input and a workspace query is assumed; the routine only calculates the size required for optimal performance for all work arrays. Each of these values is returned in the first entry of the corresponding work arrays, and no error message is issued by pxerbla.
iwork (local) INTEGER. Workspace array.
liwork (local) INTEGER, dimension of iwork. liwork ≥ 6*nnp, where nnp = max(n, NPROW*NPCOL + 1, 4). If liwork = -1, then liwork is global input and a workspace query is assumed; the routine only calculates the minimum and optimal size for all work arrays. Each of these values is returned in the first entry of the corresponding work array, and no error message is issued by pxerbla.
Output Parameters
a On exit, if jobz = 'V', then if info = 0, sub(A) contains the distributed matrix Z of eigenvectors. The eigenvectors are normalized as follows: if ibtype = 1 or 2, then Z^H*sub(B)*Z = I; if ibtype = 3, then Z^H*inv(sub(B))*Z = I. If jobz = 'N', then on exit the upper triangle (if uplo = 'U') or the lower triangle (if uplo = 'L') of sub(A), including the diagonal, is destroyed.
b On exit, if info ≤ n, the part of sub(B) containing the matrix is overwritten by the triangular factor U or L from the Cholesky factorization sub(B) = U^H*U, or sub(B) = L*L^H.
m (global) INTEGER. The total number of eigenvalues found, 0 ≤ m ≤ n.
nz (global) INTEGER. Total number of eigenvectors computed, 0 ≤ nz ≤ m. The number of columns of z that are filled. If jobz ≠ 'V', nz is not referenced. If jobz = 'V', nz = m unless the user supplies insufficient space and p?hegvx is not able to detect this before beginning computation. To get all the eigenvectors requested, the user must supply both sufficient space to hold the eigenvectors in z (m .le. descz(n_)) and sufficient workspace to compute them (see lwork).
The routine p?hegvx is always able to detect insufficient space without computation unless range = 'V'.
w (global) REAL for pchegvx; DOUBLE PRECISION for pzhegvx. Array, DIMENSION (n). On normal exit, the first m entries contain the selected eigenvalues in ascending order.
z (local) COMPLEX for pchegvx; DOUBLE COMPLEX for pzhegvx. Global dimension (n, n), local dimension (lld_z, LOCc(jz+n-1)). If jobz = 'V', then on normal exit the first m columns of z contain the orthonormal eigenvectors of the matrix corresponding to the selected eigenvalues. If an eigenvector fails to converge, then that column of z contains the latest approximation to the eigenvector, and the index of the eigenvector is returned in ifail. If jobz = 'N', then z is not referenced.
work On exit, work(1) returns the optimal amount of workspace.
rwork On exit, rwork(1) contains the amount of workspace required for optimal efficiency. If jobz = 'N', rwork(1) = optimal amount of workspace required to compute eigenvalues efficiently. If jobz = 'V', rwork(1) = optimal amount of workspace required to compute eigenvalues and eigenvectors efficiently with no guarantee on orthogonality. If range = 'V', it is assumed that all eigenvectors may be required when computing optimal workspace.
ifail (global) INTEGER. Array, DIMENSION (n). ifail provides additional information when info.ne.0. If (mod(info/16,2).ne.0), then ifail(1) indicates the order of the smallest minor which is not positive definite. If (mod(info,2).ne.0) on exit, then ifail contains the indices of the eigenvectors that failed to converge. If neither of the above error conditions holds, and jobz = 'V', then the first m elements of ifail are set to zero.
iclustr (global) INTEGER. Array, DIMENSION (2*NPROW*NPCOL). This array contains indices of eigenvectors corresponding to a cluster of eigenvalues that could not be reorthogonalized due to insufficient workspace (see lwork, orfac and info).
Eigenvectors corresponding to clusters of eigenvalues indexed iclustr(2*i-1) to iclustr(2*i) could not be reorthogonalized due to lack of workspace. Hence the eigenvectors corresponding to these clusters may not be orthogonal. iclustr() is a zero-terminated array: (iclustr(2*k).ne.0 .and. iclustr(2*k+1).eq.0) if and only if k is the number of clusters. iclustr is not referenced if jobz = 'N'.
gap (global) REAL for pchegvx; DOUBLE PRECISION for pzhegvx. Array, DIMENSION (NPROW*NPCOL). This array contains the gap between eigenvalues whose eigenvectors could not be reorthogonalized. The output values in this array correspond to the clusters indicated by the array iclustr. As a result, the dot product between eigenvectors corresponding to the i-th cluster may be as high as (C*n)/gap(i), where C is a small constant.
info (global) INTEGER.
If info = 0, the execution is successful.
If info < 0: if the i-th argument is an array and the j-th entry had an illegal value, then info = -(i*100+j); if the i-th argument is a scalar and had an illegal value, then info = -i.
If info > 0:
If (mod(info,2).ne.0), then one or more eigenvectors failed to converge. Their indices are stored in ifail.
If (mod(info/2,2).ne.0), then eigenvectors corresponding to one or more clusters of eigenvalues could not be reorthogonalized because of insufficient workspace. The indices of the clusters are stored in the array iclustr.
If (mod(info/4,2).ne.0), then space limit prevented p?hegvx from computing all of the eigenvectors between vl and vu. The number of eigenvectors computed is returned in nz.
If (mod(info/8,2).ne.0), then p?stebz failed to compute eigenvalues.
If (mod(info/16,2).ne.0), then B was not positive definite. ifail(1) indicates the order of the smallest minor which is not positive definite.
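The minimal workspace formulas and the bit-encoded info value described above can be sketched in Python as follows. This is only an illustration and not part of Intel MKL: the function names min_workspace and decode_info are hypothetical, numroc and iceil are re-implemented here from the standard ScaLAPACK tool-function logic, and the sketch assumes that np0 and mq0 from the lrwork variable definitions also apply to the minimal lwork formulas. In practice, a workspace query (lwork = -1) remains the reliable way to obtain the sizes.

```python
# Illustrative helpers for p?hegvx; names are hypothetical, not MKL APIs.

def numroc(n, nb, iproc, isrcproc, nprocs):
    """Number of rows/columns of an n-long dimension, split into blocks of
    size nb, owned by process iproc (mirrors the ScaLAPACK NUMROC tool)."""
    mydist = (nprocs + iproc - isrcproc) % nprocs
    nblocks = n // nb
    count = (nblocks // nprocs) * nb
    extra = nblocks % nprocs
    if mydist < extra:
        count += nb          # this process holds one more full block
    elif mydist == extra:
        count += n % nb      # this process holds the trailing partial block
    return count


def iceil(x, y):
    """ceiling(x/y) for positive integers."""
    return (x + y - 1) // y


def min_workspace(n, nb, nprow, npcol, neig, jobz):
    """Minimal (lwork, lrwork, liwork) from the formulas in the text.
    Assumption: np0 and mq0 follow the lrwork variable definitions."""
    nn = max(n, nb, 2)
    np0 = numroc(nn, nb, 0, 0, nprow)
    mq0 = numroc(max(neig, nb, 2), nb, 0, 0, npcol)
    if jobz == 'V':
        lwork = n + (np0 + mq0 + nb) * nb
        lrwork = (4 * n + max(5 * nn, np0 * mq0)
                  + iceil(neig, nprow * npcol) * nn)
    else:
        lwork = n + max(nb * (np0 + 1), 3)
        lrwork = 5 * nn + 4 * n
    liwork = 6 * max(n, nprow * npcol + 1, 4)
    return lwork, lrwork, liwork


def decode_info(info):
    """Translate the info return value into a list of messages."""
    if info == 0:
        return ["success"]
    if info < 0:
        code = -info
        if code >= 100:  # array argument: info = -(i*100 + j)
            return ["illegal entry %d of array argument %d"
                    % (code % 100, code // 100)]
        return ["illegal scalar argument %d" % code]
    messages = [
        "eigenvectors failed to converge (see ifail)",         # mod(info,2)
        "clusters not reorthogonalized (see iclustr)",         # mod(info/2,2)
        "space limit: not all eigenvectors computed (see nz)", # mod(info/4,2)
        "p?stebz failed to compute eigenvalues",               # mod(info/8,2)
        "B is not positive definite (see ifail(1))",           # mod(info/16,2)
    ]
    return [msg for bit, msg in enumerate(messages) if (info >> bit) & 1]


if __name__ == "__main__":
    print(min_workspace(1000, 64, 2, 2, 1000, 'V'))
    print(decode_info(5))
```

Because several conditions can be reported at once, decode_info returns a list; for example, info = 5 sets both the convergence bit and the space-limit bit.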
ScaLAPACK Auxiliary and Utility Routines 7
This chapter describes the Intel® Math Kernel Library implementation of ScaLAPACK Auxiliary Routines and Utility Functions and Routines. The library includes routines for both real and complex data.
NOTE ScaLAPACK routines are provided only with Intel® MKL versions for Linux* and Windows* OSs.
Routine naming conventions, mathematical notation, and matrix storage schemes used for ScaLAPACK auxiliary and utility routines are the same as described in previous chapters. Some routines and functions may have combined character codes, such as sc or dz. For example, the routine pscsum1 uses a complex input array and returns a real value.
Auxiliary Routines
ScaLAPACK Auxiliary Routines
Routine Name | Data Types | Description
p?lacgv | c,z | Conjugates a complex vector.
p?max1 | c,z | Finds the index of the element whose real part has maximum absolute value (similar to the Level 1 PBLAS p?amax, but using the absolute value of the real part).
?combamax1 | c,z | Finds the element with maximum real part absolute value and its corresponding global index.
p?sum1 | sc,dz | Forms the 1-norm of a complex vector similar to Level 1 PBLAS p?asum, but using the true absolute value.
p?dbtrsv | s,d,c,z | Solves a banded triangular system of linear equations, using the factorization computed by p?dbtrf. The routine is called by p?dbtrs.
p?dttrsv | s,d,c,z | Solves a tridiagonal triangular system of linear equations, using the factorization computed by p?dttrf. The routine is called by p?dttrs.
p?gebd2 | s,d,c,z | Reduces a general rectangular matrix to real bidiagonal form by an orthogonal/unitary transformation (unblocked algorithm).
p?gehd2 | s,d,c,z | Reduces a general matrix to upper Hessenberg form by an orthogonal/unitary similarity transformation (unblocked algorithm).
p?gelq2 | s,d,c,z | Computes an LQ factorization of a general rectangular matrix (unblocked algorithm).
p?geql2 | s,d,c,z | Computes a QL factorization of a general rectangular matrix (unblocked algorithm).
p?geqr2 | s,d,c,z | Computes a QR factorization of a general rectangular matrix (unblocked algorithm).
p?gerq2 | s,d,c,z | Computes an RQ factorization of a general rectangular matrix (unblocked algorithm).
p?getf2 | s,d,c,z | Computes an LU factorization of a general matrix, using partial pivoting with row interchanges (local blocked algorithm).
p?labrd | s,d,c,z | Reduces the first nb rows and columns of a general rectangular matrix A to real bidiagonal form by an orthogonal/unitary transformation, and returns auxiliary matrices that are needed to apply the transformation to the unreduced part of A.
p?lacon | s,d,c,z | Estimates the 1-norm of a square matrix, using the reverse communication for evaluating matrix-vector products.
p?laconsb | s,d | Looks for two consecutive small subdiagonal elements.
p?lacp2 | s,d,c,z | Copies all or part of a distributed matrix to another distributed matrix.
p?lacp3 | s,d | Copies from a global parallel array into a local replicated array or vice versa.
p?lacpy | s,d,c,z | Copies all or part of one two-dimensional array to another.
p?laevswp | s,d,c,z | Moves the eigenvectors from where they are computed to ScaLAPACK standard block cyclic array.
p?lahrd | s,d,c,z | Reduces the first nb columns of a general rectangular matrix A so that elements below the k-th subdiagonal are zero, by an orthogonal/unitary transformation, and returns auxiliary matrices that are needed to apply the transformation to the unreduced part of A.
p?laiect | s,d,c,z | Exploits IEEE arithmetic to accelerate the computations of eigenvalues (C interface function).
p?lange | s,d,c,z | Returns the value of the 1-norm, Frobenius norm, infinity-norm, or the largest absolute value of any element, of a general rectangular matrix.
p?lanhs | s,d,c,z | Returns the value of the 1-norm, Frobenius norm, infinity-norm, or the largest absolute value of any element, of an upper Hessenberg matrix.
p?lansy, p?lanhe | s,d,c,z/c,z | Returns the value of the 1-norm, Frobenius norm, infinity-norm, or the largest absolute value of any element, of a real symmetric or complex Hermitian matrix.
p?lantr | s,d,c,z | Returns the value of the 1-norm, Frobenius norm, infinity-norm, or the largest absolute value of any element, of a triangular matrix.
p?lapiv | s,d,c,z | Applies a permutation matrix to a general distributed matrix, resulting in row or column pivoting.
p?laqge | s,d,c,z | Scales a general rectangular matrix, using row and column scaling factors computed by p?geequ.
p?laqsy | s,d,c,z | Scales a symmetric/Hermitian matrix, using scaling factors computed by p?poequ.
p?lared1d | s,d | Redistributes an array assuming that the input array bycol is distributed across rows and that all process columns contain the same copy of bycol.
p?lared2d | s,d | Redistributes an array assuming that the input array byrow is distributed across columns and that all process rows contain the same copy of byrow.
p?larf | s,d,c,z | Applies an elementary reflector to a general rectangular matrix.
p?larfb | s,d,c,z | Applies a block reflector or its transpose/conjugate-transpose to a general rectangular matrix.
p?larfc | c,z | Applies the conjugate transpose of an elementary reflector to a general matrix.
p?larfg | s,d,c,z | Generates an elementary reflector (Householder matrix).
p?larft | s,d,c,z | Forms the triangular factor T of a block reflector H = I - V*T*V^H.
p?larz | s,d,c,z | Applies an elementary reflector as returned by p?tzrzf to a general matrix.
p?larzb | s,d,c,z | Applies a block reflector or its transpose/conjugate-transpose as returned by p?tzrzf to a general matrix.
p?larzc | c,z | Applies (multiplies by) the conjugate transpose of an elementary reflector as returned by p?tzrzf to a general matrix.
p?larzt | s,d,c,z | Forms the triangular factor T of a block reflector H = I - V*T*V^H as returned by p?tzrzf.
p?lascl | s,d,c,z | Multiplies a general rectangular matrix by a real scalar defined as cto/cfrom.
p?laset | s,d,c,z | Initializes the off-diagonal elements of a matrix to alpha and the diagonal elements to beta.
p?lasmsub | s,d | Looks for a small subdiagonal element from the bottom of the matrix that it can safely set to zero.
p?lassq | s,d,c,z | Updates a sum of squares represented in scaled form.
p?laswp | s,d,c,z | Performs a series of row interchanges on a general rectangular matrix.
p?latra | s,d,c,z | Computes the trace of a general square distributed matrix.
p?latrd | s,d,c,z | Reduces the first nb rows and columns of a symmetric/Hermitian matrix A to real tridiagonal form by an orthogonal/unitary similarity transformation.
p?latrz | s,d,c,z | Reduces an upper trapezoidal matrix to upper triangular form by means of orthogonal/unitary transformations.
p?lauu2 | s,d,c,z | Computes the product U*U^H or L^H*L, where U and L are upper or lower triangular matrices (local unblocked algorithm).
p?lauum | s,d,c,z | Computes the product U*U^H or L^H*L, where U and L are upper or lower triangular matrices.
p?lawil | s,d | Forms the Wilkinson transform.
p?org2l/p?ung2l | s,d,c,z | Generates all or part of the orthogonal/unitary matrix Q from a QL factorization determined by p?geqlf (unblocked algorithm).
p?org2r/p?ung2r | s,d,c,z | Generates all or part of the orthogonal/unitary matrix Q from a QR factorization determined by p?geqrf (unblocked algorithm).
p?orgl2/p?ungl2 | s,d,c,z | Generates all or part of the orthogonal/unitary matrix Q from an LQ factorization determined by p?gelqf (unblocked algorithm).
p?orgr2/p?ungr2 | s,d,c,z | Generates all or part of the orthogonal/unitary matrix Q from an RQ factorization determined by p?gerqf (unblocked algorithm).
p?orm2l/p?unm2l | s,d,c,z | Multiplies a general matrix by the orthogonal/unitary matrix from a QL factorization determined by p?geqlf (unblocked algorithm).
p?orm2r/p?unm2r | s,d,c,z | Multiplies a general matrix by the orthogonal/unitary matrix from a QR factorization determined by p?geqrf (unblocked algorithm).
p?orml2/p?unml2 | s,d,c,z | Multiplies a general matrix by the orthogonal/unitary matrix from an LQ factorization determined by p?gelqf (unblocked algorithm).
p?ormr2/p?unmr2 | s,d,c,z | Multiplies a general matrix by the orthogonal/unitary matrix from an RQ factorization determined by p?gerqf (unblocked algorithm).
p?pbtrsv | s,d,c,z | Solves a single triangular linear system via frontsolve or backsolve where the triangular matrix is a factor of a banded matrix computed by p?pbtrf.
p?pttrsv | s,d,c,z | Solves a single triangular linear system via frontsolve or backsolve where the triangular matrix is a factor of a tridiagonal matrix computed by p?pttrf.
p?potf2 | s,d,c,z | Computes the Cholesky factorization of a symmetric/Hermitian positive definite matrix (local unblocked algorithm).
p?rscl | s,d,cs,zd | Multiplies a vector by the reciprocal of a real scalar.
p?sygs2/p?hegs2 | s,d,c,z | Reduces a symmetric/Hermitian definite generalized eigenproblem to standard form, using the factorization results obtained from p?potrf (local unblocked algorithm).
p?sytd2/p?hetd2 | s,d,c,z | Reduces a symmetric/Hermitian matrix to real symmetric tridiagonal form by an orthogonal/unitary similarity transformation (local unblocked algorithm).
p?trti2 | s,d,c,z | Computes the inverse of a triangular matrix (local unblocked algorithm).
?lamsh | s,d | Sends multiple shifts through a small (single node) matrix to maximize the number of bulges that can be sent through.
?laref | s,d | Applies Householder reflectors to matrices on either their rows or columns.
?lasorte | s,d | Sorts eigenpairs by real and complex data types.
?lasrt2 | s,d | Sorts numbers in increasing or decreasing order.
?stein2 | s,d | Computes the eigenvectors corresponding to specified eigenvalues of a real symmetric tridiagonal matrix, using inverse iteration.
?dbtf2 | s,d,c,z | Computes an LU factorization of a general band matrix with no pivoting (local unblocked algorithm).
?dbtrf | s,d,c,z | Computes an LU factorization of a general band matrix with no pivoting (local blocked algorithm).
?dttrf | s,d,c,z | Computes an LU factorization of a general tridiagonal matrix with no pivoting (local blocked algorithm).
?dttrsv | s,d,c,z | Solves a general tridiagonal system of linear equations using the LU factorization computed by ?dttrf.
?pttrsv | s,d,c,z | Solves a symmetric (Hermitian) positive-definite tridiagonal system of linear equations, using the L*D*L^H factorization computed by ?pttrf.
?steqr2 | s,d | Computes all eigenvalues and, optionally, eigenvectors of a symmetric tridiagonal matrix using the implicit QL or QR method.
p?lacgv
Conjugates a complex vector.
Syntax
call pclacgv(n, x, ix, jx, descx, incx)
call pzlacgv(n, x, ix, jx, descx, incx)
Include Files
• C: mkl_scalapack.h
Description
The p?lacgv routine conjugates a complex vector sub(x) of length n, where sub(x) denotes X(ix, jx:jx+n-1) if incx = m_x, and X(ix:ix+n-1, jx) if incx = 1.
Input Parameters
n (global) INTEGER. The length of the distributed vector sub(x).
x (local) COMPLEX for pclacgv; COMPLEX*16 for pzlacgv. Pointer into the local memory to an array of DIMENSION (lld_x,*). On entry, the vector to be conjugated: x(i) = X(ix+(jx-1)*m_x+(i-1)*incx), 1 ≤ i ≤ n.
ix (global) INTEGER. The row index in the global array x indicating the first row of sub(x).
jx (global) INTEGER. The column index in the global array x indicating the first column of sub(x).
descx (global and local) INTEGER. Array, DIMENSION (dlen_).
The array descriptor for the distributed matrix X.
incx (global) INTEGER. The global increment for the elements of X. Only two values of incx are supported in this version, namely 1 and m_x. incx must not be zero.
Output Parameters
x (local). On exit, the conjugated vector.
p?max1
Finds the index of the element whose real part has maximum absolute value (similar to the Level 1 PBLAS p?amax, but using the absolute value of the real part).
Syntax
call pcmax1(n, amax, indx, x, ix, jx, descx, incx)
call pzmax1(n, amax, indx, x, ix, jx, descx, incx)
Include Files
• C: mkl_scalapack.h
Description
The p?max1 routine computes the global index of the maximum element in absolute value of a distributed vector sub(x). The global index is returned in indx and the value is returned in amax, where sub(x) denotes X(ix:ix+n-1, jx) if incx = 1, and X(ix, jx:jx+n-1) if incx = m_x.
Input Parameters
n (global) pointer to INTEGER. The number of components of the distributed vector sub(x). n ≥ 0.
x (local) COMPLEX for pcmax1; COMPLEX*16 for pzmax1. Array containing the local pieces of a distributed matrix of dimension of at least ((jx-1)*m_x+ix+(n-1)*abs(incx)). This array contains the entries of the distributed vector sub(x).
ix (global) INTEGER. The row index in the global array X indicating the first row of sub(x).
jx (global) INTEGER. The column index in the global array X indicating the first column of sub(x).
descx (global and local) INTEGER. Array, DIMENSION (dlen_). The array descriptor for the distributed matrix X.
incx (global) INTEGER. The global increment for the elements of X. Only two values of incx are supported in this version, namely 1 and m_x. incx must not be zero.
Output Parameters
amax (global output) pointer to REAL. The absolute value of the largest entry of the distributed vector sub(x) only in the scope of sub(x).
indx (global output) pointer to INTEGER. The global index of the element of the distributed vector sub(x) whose real part has maximum absolute value.
?combamax1
Finds the element with maximum real part absolute value and its corresponding global index.
Syntax
call ccombamax1(v1, v2)
call zcombamax1(v1, v2)
Include Files
• C: mkl_scalapack.h
Description
The ?combamax1 routine finds the element having maximum real part absolute value as well as its corresponding global index.
Input Parameters
v1 (local) COMPLEX for ccombamax1; COMPLEX*16 for zcombamax1. Array, DIMENSION 2. The first maximum absolute value element and its global index. v1(1) = amax, v1(2) = indx.
v2 (local) COMPLEX for ccombamax1; COMPLEX*16 for zcombamax1. Array, DIMENSION 2. The second maximum absolute value element and its global index. v2(1) = amax, v2(2) = indx.
Output Parameters
v1 (local). The first maximum absolute value element and its global index. v1(1) = amax, v1(2) = indx.
p?sum1
Forms the 1-norm of a complex vector similar to Level 1 PBLAS p?asum, but using the true absolute value.
Syntax
call pscsum1(n, asum, x, ix, jx, descx, incx)
call pdzsum1(n, asum, x, ix, jx, descx, incx)
Include Files
• C: mkl_scalapack.h
Description
The p?sum1 routine returns the sum of absolute values of a complex distributed vector sub(x) in asum, where sub(x) denotes X(ix:ix+n-1, jx:jx) if incx = 1, and X(ix:ix, jx:jx+n-1) if incx = m_x. The routine is based on p?asum from the Level 1 PBLAS; the change is to use the 'genuine' absolute value.
Input Parameters
n (global) pointer to INTEGER. The number of components of the distributed vector sub(x). n ≥ 0.
x (local) COMPLEX for pscsum1; COMPLEX*16 for pdzsum1. Array containing the local pieces of a distributed matrix of dimension of at least ((jx-1)*m_x+ix+(n-1)*abs(incx)). This array contains the entries of the distributed vector sub(x).
ix (global) INTEGER. The row index in the global array X indicating the first row of sub(x).
jx (global) INTEGER. The column index in the global array X indicating the first column of sub(x).
descx (global and local) INTEGER. Array, DIMENSION 8. The array descriptor for the distributed matrix X.
incx (global) INTEGER. The global increment for the elements of X. Only two values of incx are supported in this version, namely 1 and m_x.
Output Parameters
asum (local) Pointer to REAL. The sum of absolute values of the distributed vector sub(x) only in its scope.
p?dbtrsv
Solves a banded triangular system of linear equations, using the factorization computed by p?dbtrf. The routine is called by p?dbtrs.
Syntax
call psdbtrsv(uplo, trans, n, bwl, bwu, nrhs, a, ja, desca, b, ib, descb, af, laf, work, lwork, info)
call pddbtrsv(uplo, trans, n, bwl, bwu, nrhs, a, ja, desca, b, ib, descb, af, laf, work, lwork, info)
call pcdbtrsv(uplo, trans, n, bwl, bwu, nrhs, a, ja, desca, b, ib, descb, af, laf, work, lwork, info)
call pzdbtrsv(uplo, trans, n, bwl, bwu, nrhs, a, ja, desca, b, ib, descb, af, laf, work, lwork, info)
Include Files
• C: mkl_scalapack.h
Description
The p?dbtrsv routine solves a banded triangular system of linear equations
A(1:n, ja:ja+n-1) * X = B(ib:ib+n-1, 1:nrhs) or
A(1:n, ja:ja+n-1)^T * X = B(ib:ib+n-1, 1:nrhs) (for real flavors);
A(1:n, ja:ja+n-1)^H * X = B(ib:ib+n-1, 1:nrhs) (for complex flavors),
where A(1:n, ja:ja+n-1) is a banded triangular matrix factor produced by the Gaussian elimination code p?dbtrf and is stored in A(1:n, ja:ja+n-1) and af. The matrix stored in A(1:n, ja:ja+n-1) is either upper or lower triangular according to uplo, and the choice of solving A(1:n, ja:ja+n-1) or A(1:n, ja:ja+n-1)^T is dictated by the user by the parameter trans. Routine p?dbtrf must be called first.
Input Parameters
uplo (global) CHARACTER.
If uplo = 'U', the upper triangle of A(1:n, ja:ja+n-1) is stored; if uplo = 'L', the lower triangle of A(1:n, ja:ja+n-1) is stored.
trans (global) CHARACTER. If trans = 'N', solve with A(1:n, ja:ja+n-1); if trans = 'C', solve with the conjugate transpose of A(1:n, ja:ja+n-1).
n (global) INTEGER. The order of the distributed submatrix A (n ≥ 0).
bwl (global) INTEGER. Number of subdiagonals. 0 ≤ bwl ≤ n-1.
bwu (global) INTEGER. Number of superdiagonals. 0 ≤ bwu ≤ n-1.
nrhs (global) INTEGER. The number of right-hand sides; the number of columns of the distributed submatrix B (nrhs ≥ 0).
a (local) REAL for psdbtrsv; DOUBLE PRECISION for pddbtrsv; COMPLEX for pcdbtrsv; COMPLEX*16 for pzdbtrsv. Pointer into the local memory to an array with first DIMENSION lld_a ≥ (bwl+bwu+1) (stored in desca). On entry, this array contains the local pieces of the n-by-n unsymmetric banded distributed Cholesky factor L or L^T A(1:n, ja:ja+n-1). This local portion is stored in the packed banded format used in LAPACK. See the Application Notes below and the ScaLAPACK manual for more detail on the format of distributed matrices.
ja (global) INTEGER. The index in the global array a that points to the start of the matrix to be operated on (which may be either all of A or a submatrix of A).
desca (global and local) INTEGER array of DIMENSION (dlen_). If 1d type (dtype_a = 501 or 502), dlen ≥ 7; if 2d type (dtype_a = 1), dlen ≥ 9. The array descriptor for the distributed matrix A. Contains information on the mapping of A to memory.
b (local) REAL for psdbtrsv; DOUBLE PRECISION for pddbtrsv; COMPLEX for pcdbtrsv; COMPLEX*16 for pzdbtrsv. Pointer into the local memory to an array of local lead DIMENSION lld_b ≥ nb. On entry, this array contains the local pieces of the right-hand sides B(ib:ib+n-1, 1:nrhs).
ib (global) INTEGER. The row index in the global array b that points to the first row of the matrix to be operated on (which may be either all of b or a submatrix of B).
descb (global and local) INTEGER array of DIMENSION (dlen_). If 1d type (dtype_b = 502), dlen ≥ 7; if 2d type (dtype_b = 1), dlen ≥ 9. The array descriptor for the distributed matrix B. Contains information on the mapping of B to memory.
laf (local) INTEGER. Size of user-input Auxiliary Filling space af. laf must be ≥ nb*(bwl+bwu)+6*max(bwl, bwu)*max(bwl, bwu). If laf is not large enough, an error code is returned and the minimum acceptable size will be returned in af(1).
work (local) REAL for psdbtrsv; DOUBLE PRECISION for pddbtrsv; COMPLEX for pcdbtrsv; COMPLEX*16 for pzdbtrsv. Temporary workspace. This space may be overwritten in between calls to routines. work must be the size given in lwork.
lwork (local or global) INTEGER. Size of user-input workspace work. If lwork is too small, the minimal acceptable size will be returned in work(1) and an error code is returned. lwork ≥ max(bwl, bwu)*nrhs.
Output Parameters
a (local). This local portion is stored in the packed banded format used in LAPACK. See the ScaLAPACK manual for more detail on the format of distributed matrices.
b On exit, this contains the local piece of the solutions distributed matrix X.
af (local) REAL for psdbtrsv; DOUBLE PRECISION for pddbtrsv; COMPLEX for pcdbtrsv; COMPLEX*16 for pzdbtrsv. Auxiliary Filling Space. Filling is created during the factorization routine p?dbtrf and this is stored in af. If a linear system is to be solved using p?dbtrs after the factorization routine, af must not be altered after the factorization.
work On exit, work(1) contains the minimal lwork.
info (local) INTEGER. If info = 0, the execution is successful. If info < 0: if the i-th argument is an array and the j-th entry had an illegal value, then info = -(i*100+j); if the i-th argument is a scalar and had an illegal value, then info = -i.
p?dttrsv
Solves a tridiagonal triangular system of linear equations, using the factorization computed by p?dttrf. The routine is called by p?dttrs.

Syntax
call psdttrsv(uplo, trans, n, nrhs, dl, d, du, ja, desca, b, ib, descb, af, laf, work, lwork, info)
call pddttrsv(uplo, trans, n, nrhs, dl, d, du, ja, desca, b, ib, descb, af, laf, work, lwork, info)
call pcdttrsv(uplo, trans, n, nrhs, dl, d, du, ja, desca, b, ib, descb, af, laf, work, lwork, info)
call pzdttrsv(uplo, trans, n, nrhs, dl, d, du, ja, desca, b, ib, descb, af, laf, work, lwork, info)

Include Files
• C: mkl_scalapack.h

Description
The p?dttrsv routine solves a tridiagonal triangular system of linear equations
A(1:n, ja:ja+n-1)*X = B(ib:ib+n-1, 1:nrhs) or
A(1:n, ja:ja+n-1)^T*X = B(ib:ib+n-1, 1:nrhs) for real flavors;
A(1:n, ja:ja+n-1)^H*X = B(ib:ib+n-1, 1:nrhs) for complex flavors,
where A(1:n, ja:ja+n-1) is a tridiagonal matrix factor produced by the Gaussian elimination routine p?dttrf and is stored in A(1:n, ja:ja+n-1) and af. The matrix stored in A(1:n, ja:ja+n-1) is either upper or lower triangular according to uplo, and the parameter trans selects whether the system is solved with A(1:n, ja:ja+n-1) or with A(1:n, ja:ja+n-1)^T. Routine p?dttrf must be called first.

Input Parameters
uplo (global) CHARACTER. If uplo = 'U', the upper triangle of A(1:n, ja:ja+n-1) is stored; if uplo = 'L', the lower triangle of A(1:n, ja:ja+n-1) is stored.
trans (global) CHARACTER. If trans = 'N', solve with A(1:n, ja:ja+n-1); if trans = 'C', solve with the conjugate transpose of A(1:n, ja:ja+n-1).
n (global) INTEGER. The order of the distributed submatrix A (n ≥ 0).
nrhs (global) INTEGER. The number of right-hand sides; the number of columns of the distributed submatrix B(ib:ib+n-1, 1:nrhs) (nrhs ≥ 0).
dl (local) REAL for psdttrsv; DOUBLE PRECISION for pddttrsv; COMPLEX for pcdttrsv; COMPLEX*16 for pzdttrsv. Pointer to the local part of the global vector storing the lower diagonal of the matrix. Globally, dl(1) is not referenced, and dl must be aligned with d. Must be of size ≥ desca(nb_).
d (local).
REAL for psdttrsv; DOUBLE PRECISION for pddttrsv; COMPLEX for pcdttrsv; COMPLEX*16 for pzdttrsv. Pointer to the local part of the global vector storing the main diagonal of the matrix.
du (local) REAL for psdttrsv; DOUBLE PRECISION for pddttrsv; COMPLEX for pcdttrsv; COMPLEX*16 for pzdttrsv. Pointer to the local part of the global vector storing the upper diagonal of the matrix. Globally, du(n) is not referenced, and du must be aligned with d.
ja (global) INTEGER. The index in the global array a that points to the start of the matrix to be operated on (which may be either all of A or a submatrix of A).
desca (global and local) INTEGER array of DIMENSION (dlen_). If 1D type (dtype_a = 501 or 502), dlen ≥ 7; if 2D type (dtype_a = 1), dlen ≥ 9. The array descriptor for the distributed matrix A. Contains information on the mapping of A to memory.
b (local) REAL for psdttrsv; DOUBLE PRECISION for pddttrsv; COMPLEX for pcdttrsv; COMPLEX*16 for pzdttrsv. Pointer into the local memory to an array of local leading DIMENSION lld_b ≥ nb. On entry, this array contains the local pieces of the right-hand sides B(ib:ib+n-1, 1:nrhs).
ib (global) INTEGER. The row index in the global array b that points to the first row of the matrix to be operated on (which may be either all of B or a submatrix of B).
descb (global and local) INTEGER array of DIMENSION (dlen_). If 1D type (dtype_b = 502), dlen ≥ 7; if 2D type (dtype_b = 1), dlen ≥ 9. The array descriptor for the distributed matrix B. Contains information on the mapping of B to memory.
laf (local) INTEGER. Size of the user-input auxiliary fill-in space af. laf must be ≥ 2*(nb+2). If laf is not large enough, an error code is returned and the minimum acceptable size is returned in af(1).
work (local) REAL for psdttrsv; DOUBLE PRECISION for pddttrsv; COMPLEX for pcdttrsv; COMPLEX*16 for pzdttrsv. Temporary workspace. This space may be overwritten between calls to routines. work must be the size given in lwork.
lwork (local or global) INTEGER. Size of the user-input workspace work. If lwork is too small, the minimal acceptable size is returned in work(1) and an error code is returned. lwork ≥ 10*npcol+4*nrhs.

Output Parameters
dl (local) On exit, this array contains information on the factors of the matrix.
d On exit, this array contains information on the factors of the matrix. Must be of size ≥ desca(nb_).
b On exit, this array contains the local piece of the distributed solution matrix X.
af (local) REAL for psdttrsv; DOUBLE PRECISION for pddttrsv; COMPLEX for pcdttrsv; COMPLEX*16 for pzdttrsv. Auxiliary fill-in space. Fill-in is created during the factorization routine p?dttrf and is stored in af. If a linear system is to be solved using p?dttrs after the factorization routine, af must not be altered after the factorization.
work On exit, work(1) contains the minimal lwork.
info (local) INTEGER. If info = 0, the execution is successful. If info < 0: if the i-th argument is an array and the j-th entry had an illegal value, then info = -(i*100+j); if the i-th argument is a scalar and had an illegal value, then info = -i.

p?gebd2
Reduces a general rectangular matrix to real bidiagonal form by an orthogonal/unitary transformation (unblocked algorithm).

Syntax
call psgebd2(m, n, a, ia, ja, desca, d, e, tauq, taup, work, lwork, info)
call pdgebd2(m, n, a, ia, ja, desca, d, e, tauq, taup, work, lwork, info)
call pcgebd2(m, n, a, ia, ja, desca, d, e, tauq, taup, work, lwork, info)
call pzgebd2(m, n, a, ia, ja, desca, d, e, tauq, taup, work, lwork, info)

Include Files
• C: mkl_scalapack.h

Description
The p?gebd2 routine reduces a real/complex general m-by-n distributed matrix sub(A) = A(ia:ia+m-1, ja:ja+n-1) to upper or lower bidiagonal form B by an orthogonal/unitary transformation: Q'*sub(A)*P = B. If m ≥ n, B is upper bidiagonal; if m < [...] > n), and U is upper triangular (upper trapezoidal if m < n).
This is the right-looking Parallel Level 2 BLAS version of the algorithm.

Input Parameters
m (global) INTEGER. The number of rows to be operated on, that is, the number of rows of the distributed submatrix sub(A) (m ≥ 0).
n (global) INTEGER. The number of columns to be operated on, that is, the number of columns of the distributed submatrix sub(A) (nb_a - mod(ja-1, nb_a) ≥ n ≥ 0).
a (local) REAL for psgetf2; DOUBLE PRECISION for pdgetf2; COMPLEX for pcgetf2; COMPLEX*16 for pzgetf2. Pointer into the local memory to an array of DIMENSION(lld_a, LOCc(ja+n-1)). On entry, this array contains the local pieces of the m-by-n distributed matrix sub(A).
ia, ja (global) INTEGER. The row and column indices in the global array a indicating the first row and the first column of the submatrix sub(A), respectively.
desca (global and local) INTEGER array, DIMENSION (dlen_). The array descriptor for the distributed matrix A.

Output Parameters
ipiv (local) INTEGER. Array, DIMENSION(LOCr(m_a) + mb_a). This array contains the pivoting information: ipiv(i) is the global row that local row i was swapped with. This array is tied to the distributed matrix A.
info (local) INTEGER.
If info = 0: successful exit.
If info < 0:
• if the i-th argument is an array and the j-th entry had an illegal value, then info = -(i*100+j),
• if the i-th argument is a scalar and had an illegal value, then info = -i.
If info > 0: if info = k, u(ia+k-1, ja+k-1) is exactly zero. The factorization has been completed, but the factor u is exactly singular, and division by zero will occur if it is used to solve a system of equations.

p?labrd
Reduces the first nb rows and columns of a general rectangular matrix A to real bidiagonal form by an orthogonal/unitary transformation, and returns auxiliary matrices that are needed to apply the transformation to the unreduced part of A.
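For orientation, the unblocked right-looking LU scheme with partial pivoting that p?getf2 refers to, together with its ipiv and info > 0 conventions, can be illustrated on a single process. This is a serial Python sketch under stated assumptions, not the distributed MKL routine: the name getf2_sketch is hypothetical, the matrix is a plain list of rows, and pivot indices are 1-based row numbers of that local matrix rather than global rows of a distributed array.

```python
def getf2_sketch(a):
    """Serial, unblocked, right-looking LU with partial pivoting (in place).

    Returns (ipiv, info). ipiv[j] is the 1-based row swapped with row j+1;
    info = k > 0 reports the first column k whose pivot is exactly zero.
    Hypothetical illustration only, not an MKL or ScaLAPACK entry point.
    """
    m, n = len(a), len(a[0])
    ipiv, info = [], 0
    for j in range(min(m, n)):
        # Pick the entry of largest magnitude in column j, rows j..m-1.
        p = max(range(j, m), key=lambda i: abs(a[i][j]))
        ipiv.append(p + 1)
        if a[p][j] != 0:
            a[j], a[p] = a[p], a[j]          # row interchange
            for i in range(j + 1, m):
                a[i][j] /= a[j][j]           # multipliers (column of L)
        elif info == 0:
            info = j + 1                     # exactly singular factor U
        # Rank-1 update of the trailing submatrix (the right-looking step).
        for i in range(j + 1, m):
            for c in range(j + 1, n):
                a[i][c] -= a[i][j] * a[j][c]
    return ipiv, info


if __name__ == "__main__":
    mat = [[0.0, 1.0], [2.0, 3.0]]
    print(getf2_sketch(mat), mat)
```

As in the routine description, a positive info does not stop the factorization; it only flags that U has an exactly zero diagonal entry.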

Syntax
call pslabrd(m, n, nb, a, ia, ja, desca, d, e, tauq, taup, x, ix, jx, descx, y, iy, jy, descy, work)
call pdlabrd(m, n, nb, a, ia, ja, desca, d, e, tauq, taup, x, ix, jx, descx, y, iy, jy, descy, work)
call pclabrd(m, n, nb, a, ia, ja, desca, d, e, tauq, taup, x, ix, jx, descx, y, iy, jy, descy, work)
call pzlabrd(m, n, nb, a, ia, ja, desca, d, e, tauq, taup, x, ix, jx, descx, y, iy, jy, descy, work)

Include Files
• C: mkl_scalapack.h

Description
The p?labrd routine reduces the first nb rows and columns of a real/complex general m-by-n distributed matrix sub(A) = A(ia:ia+m-1, ja:ja+n-1) to upper or lower bidiagonal form by an orthogonal/unitary transformation Q'*A*P, and returns the matrices X and Y necessary to apply the transformation to the unreduced part of sub(A). If m ≥ n, sub(A) is reduced to upper bidiagonal form; if m < n, sub(A) is reduced to lower bidiagonal form. This is an auxiliary routine called by p?gebrd.

Input Parameters
m (global) INTEGER. The number of rows to be operated on, that is, the number of rows of the distributed submatrix sub(A) (m ≥ 0).
n (global) INTEGER. The number of columns to be operated on, that is, the number of columns of the distributed submatrix sub(A) (n ≥ 0).
nb (global) INTEGER. The number of leading rows and columns of sub(A) to be reduced.
a (local) REAL for pslabrd; DOUBLE PRECISION for pdlabrd; COMPLEX for pclabrd; COMPLEX*16 for pzlabrd. Pointer into the local memory to an array of DIMENSION(lld_a, LOCc(ja+n-1)). On entry, this array contains the local pieces of the general distributed matrix sub(A).
ia, ja (global) INTEGER. The row and column indices in the global array a indicating the first row and the first column of the submatrix sub(A), respectively.
desca (global and local) INTEGER array, DIMENSION (dlen_). The array descriptor for the distributed matrix A.
ix, jx (global) INTEGER.
The row and column indices in the global array x indicating the first row and the first column of the submatrix sub(X), respectively.
descx (global and local) INTEGER array, DIMENSION (dlen_). The array descriptor for the distributed matrix X.
iy, jy (global) INTEGER. The row and column indices in the global array y indicating the first row and the first column of the submatrix sub(Y), respectively.
descy (global and local) INTEGER array, DIMENSION (dlen_). The array descriptor for the distributed matrix Y.
work (local) REAL for pslabrd; DOUBLE PRECISION for pdlabrd; COMPLEX for pclabrd; COMPLEX*16 for pzlabrd. Workspace array, DIMENSION(lwork).
lwork ≥ nb_a + nq, with
nq = numroc(n+mod(ia-1, nb_y), nb_y, mycol, iacol, npcol),
iacol = indxg2p(ja, nb_a, mycol, csrc_a, npcol).
indxg2p and numroc are ScaLAPACK tool functions; myrow, mycol, nprow, and npcol can be determined by calling the subroutine blacs_gridinfo.

Output Parameters
a (local) On exit, the first nb rows and columns of the matrix are overwritten; the rest of the distributed matrix sub(A) is unchanged.
If m ≥ n, elements on and below the diagonal in the first nb columns, with the array tauq, represent the orthogonal/unitary matrix Q as a product of elementary reflectors; and elements above the diagonal in the first nb rows, with the array taup, represent the orthogonal/unitary matrix P as a product of elementary reflectors.
If m < n, elements below the diagonal in the first nb columns, with the array tauq, represent the orthogonal/unitary matrix Q as a product of elementary reflectors, and elements on and above the diagonal in the first nb rows, with the array taup, represent the orthogonal/unitary matrix P as a product of elementary reflectors. See Application Notes below.
d (local) REAL for pslabrd; DOUBLE PRECISION for pdlabrd; COMPLEX for pclabrd; COMPLEX*16 for pzlabrd. Array, DIMENSION LOCr(ia+min(m,n)-1) if m ≥ n; LOCc(ja+min(m,n)-1) otherwise.
The distributed diagonal elements of the bidiagonal distributed matrix B: d(i) = A(ia+i-1, ja+i-1). d is tied to the distributed matrix A.
e (local) REAL for pslabrd; DOUBLE PRECISION for pdlabrd; COMPLEX for pclabrd; COMPLEX*16 for pzlabrd. Array, DIMENSION LOCr(ia+min(m,n)-1) if m ≥ n; LOCc(ja+min(m,n)-2) otherwise. The distributed off-diagonal elements of the bidiagonal distributed matrix B: if m ≥ n, e(i) = A(ia+i-1, ja+i) for i = 1, 2, ..., n-1; if m < [...] > 0.
a (local) REAL for pslatrz; DOUBLE PRECISION for pdlatrz; COMPLEX for pclatrz; COMPLEX*16 for pzlatrz. Pointer into the local memory to an array of DIMENSION(lld_a, LOCc(ja+n-1)). On entry, the local pieces of the m-by-n distributed matrix sub(A), which is to be factored.
ia (global) INTEGER. The row index in the global array A indicating the first row of sub(A).
ja (global) INTEGER. The column index in the global array A indicating the first column of sub(A).
desca (global and local) INTEGER array of DIMENSION (dlen_). The array descriptor for the distributed matrix A.
work (local) REAL for pslatrz; DOUBLE PRECISION for pdlatrz; COMPLEX for pclatrz; COMPLEX*16 for pzlatrz. Workspace array, DIMENSION (lwork).
lwork ≥ nq0 + max(1, mp0), where
iroff = mod(ia-1, mb_a),
icoff = mod(ja-1, nb_a),
iarow = indxg2p(ia, mb_a, myrow, rsrc_a, nprow),
iacol = indxg2p(ja, nb_a, mycol, csrc_a, npcol),
mp0 = numroc(m+iroff, mb_a, myrow, iarow, nprow),
nq0 = numroc(n+icoff, nb_a, mycol, iacol, npcol).
numroc and indxg2p are ScaLAPACK tool functions; myrow, mycol, nprow, and npcol can be determined by calling the subroutine blacs_gridinfo.

Output Parameters
a On exit, the leading m-by-m upper triangular part of sub(A) contains the upper triangular matrix R, and elements n-l+1 to n of the first m rows of sub(A), with the array tau, represent the orthogonal/unitary matrix Z as a product of m elementary reflectors.
tau (local) REAL for pslatrz; DOUBLE PRECISION for pdlatrz; COMPLEX for pclatrz; COMPLEX*16 for pzlatrz. Array, DIMENSION(LOCr(ja+m-1)). This array contains the scalar factors of the elementary reflectors. tau is tied to the distributed matrix A.

Application Notes
The factorization is obtained by Householder's method. The k-th transformation matrix, Z(k), which is used (or, in case of complex routines, whose conjugate transpose is used) to introduce zeros into the (m - k + 1)-th row of sub(A), is given in the form [...], where tau is a scalar and z(k) is an (n-m)-element vector. tau and z(k) are chosen to annihilate the elements of the k-th row of sub(A). The scalar tau is returned in the k-th element of tau and the vector z(k) in the k-th row of sub(A), such that the elements of z(k) are in a(k, m+1), ..., a(k, n). The elements of R are returned in the upper triangular part of sub(A). Z is given by Z = Z(1)Z(2)...Z(m).

p?lauu2
Computes the product U*U' or L'*L, where U and L are upper or lower triangular matrices (local unblocked algorithm).

Syntax
call pslauu2(uplo, n, a, ia, ja, desca)
call pdlauu2(uplo, n, a, ia, ja, desca)
call pclauu2(uplo, n, a, ia, ja, desca)
call pzlauu2(uplo, n, a, ia, ja, desca)

Include Files
• C: mkl_scalapack.h

Description
The p?lauu2 routine computes the product U*U' or L'*L, where the triangular factor U or L is stored in the upper or lower triangular part of the distributed matrix sub(A) = A(ia:ia+n-1, ja:ja+n-1).
If uplo = 'U' or 'u', then the upper triangle of the result is stored, overwriting the factor U in sub(A).
If uplo = 'L' or 'l', then the lower triangle of the result is stored, overwriting the factor L in sub(A).
This is the unblocked form of the algorithm, calling BLAS Level 2 routines. This routine performs no communication, so the matrix to operate on should be strictly local to one process.

Input Parameters
uplo (global) CHARACTER*1.
Specifies whether the triangular factor stored in the matrix sub(A) is upper or lower triangular: = 'U': upper triangular; = 'L': lower triangular.
n (global) INTEGER. The number of rows and columns to be operated on, that is, the order of the triangular factor U or L. n ≥ 0.
a (local) REAL for pslauu2; DOUBLE PRECISION for pdlauu2; COMPLEX for pclauu2; COMPLEX*16 for pzlauu2. Pointer into the local memory to an array of DIMENSION(lld_a, LOCc(ja+n-1)). On entry, the local pieces of the triangular factor U or L.
ia (global) INTEGER. The row index in the global array A indicating the first row of sub(A).
ja (global) INTEGER. The column index in the global array A indicating the first column of sub(A).
desca (global and local) INTEGER array of DIMENSION (dlen_). The array descriptor for the distributed matrix A.

Output Parameters
a (local) On exit, if uplo = 'U', the upper triangle of the distributed matrix sub(A) is overwritten with the upper triangle of the product U*U'; if uplo = 'L', the lower triangle of sub(A) is overwritten with the lower triangle of the product L'*L.

p?lauum
Computes the product U*U' or L'*L, where U and L are upper or lower triangular matrices.

Syntax
call pslauum(uplo, n, a, ia, ja, desca)
call pdlauum(uplo, n, a, ia, ja, desca)
call pclauum(uplo, n, a, ia, ja, desca)
call pzlauum(uplo, n, a, ia, ja, desca)

Include Files
• C: mkl_scalapack.h

Description
The p?lauum routine computes the product U*U' or L'*L, where the triangular factor U or L is stored in the upper or lower triangular part of the matrix sub(A) = A(ia:ia+n-1, ja:ja+n-1).
If uplo = 'U' or 'u', then the upper triangle of the result is stored, overwriting the factor U in sub(A).
If uplo = 'L' or 'l', then the lower triangle of the result is stored, overwriting the factor L in sub(A).
This is the blocked form of the algorithm, calling Level 3 PBLAS.

Input Parameters
uplo (global) CHARACTER*1.
Specifies whether the triangular factor stored in the matrix sub(A) is upper or lower triangular: = 'U': upper triangular; = 'L': lower triangular.
n (global) INTEGER. The number of rows and columns to be operated on, that is, the order of the triangular factor U or L. n ≥ 0.
a (local) REAL for pslauum; DOUBLE PRECISION for pdlauum; COMPLEX for pclauum; COMPLEX*16 for pzlauum. Pointer into the local memory to an array of DIMENSION(lld_a, LOCc(ja+n-1)). On entry, the local pieces of the triangular factor U or L.
ia (global) INTEGER. The row index in the global array A indicating the first row of sub(A).
ja (global) INTEGER. The column index in the global array A indicating the first column of sub(A).
desca (global and local) INTEGER array of DIMENSION (dlen_). The array descriptor for the distributed matrix A.

Output Parameters
a (local) On exit, if uplo = 'U', the upper triangle of the distributed matrix sub(A) is overwritten with the upper triangle of the product U*U'; if uplo = 'L', the lower triangle of sub(A) is overwritten with the lower triangle of the product L'*L.

p?lawil
Forms the Wilkinson transform.

Syntax
call pslawil(ii, jj, m, a, desca, h44, h33, h43h34, v)
call pdlawil(ii, jj, m, a, desca, h44, h33, h43h34, v)

Include Files
• C: mkl_scalapack.h

Description
The p?lawil routine gets the transform given by h44, h33, and h43h34 into v starting at row m.

Input Parameters
ii (global) INTEGER. Row owner of h(m+2, m+2).
jj (global) INTEGER. Column owner of h(m+2, m+2).
m (global) INTEGER. On entry, the location from where the transform starts (row m). Unchanged on exit.
a (global) REAL for pslawil; DOUBLE PRECISION for pdlawil. Array, DIMENSION (desca(lld_),*). On entry, the Hessenberg matrix. Unchanged on exit.
desca (global and local) INTEGER array of DIMENSION (dlen_). The array descriptor for the distributed matrix A. Unchanged on exit.
h44, h33, h43h34 (global) REAL for pslawil; DOUBLE PRECISION for pdlawil. These three values are for the double shift QR iteration. Unchanged on exit.

Output Parameters
v (global) REAL for pslawil; DOUBLE PRECISION for pdlawil. Array of size 3 that contains the transform on output.

p?org2l/p?ung2l
Generates all or part of the orthogonal/unitary matrix Q from a QL factorization determined by p?geqlf (unblocked algorithm).

Syntax
call psorg2l(m, n, k, a, ia, ja, desca, tau, work, lwork, info)
call pdorg2l(m, n, k, a, ia, ja, desca, tau, work, lwork, info)
call pcung2l(m, n, k, a, ia, ja, desca, tau, work, lwork, info)
call pzung2l(m, n, k, a, ia, ja, desca, tau, work, lwork, info)

Include Files
• C: mkl_scalapack.h

Description
The p?org2l/p?ung2l routine generates an m-by-n real/complex distributed matrix Q denoting A(ia:ia+m-1, ja:ja+n-1) with orthonormal columns, which is defined as the last n columns of a product of k elementary reflectors of order m:
Q = H(k)*...*H(2)*H(1)
as returned by p?geqlf.

Input Parameters
m (global) INTEGER. The number of rows to be operated on, that is, the number of rows of the distributed submatrix Q. m ≥ 0.
n (global) INTEGER. The number of columns to be operated on, that is, the number of columns of the distributed submatrix Q. m ≥ n ≥ 0.
k (global) INTEGER. The number of elementary reflectors whose product defines the matrix Q. n ≥ k ≥ 0.
a REAL for psorg2l; DOUBLE PRECISION for pdorg2l; COMPLEX for pcung2l; COMPLEX*16 for pzung2l. Pointer into the local memory to an array, DIMENSION (lld_a, LOCc(ja+n-1)). On entry, the j-th column must contain the vector that defines the elementary reflector H(j), ja+n-k ≤ j ≤ ja+n-1, as returned by p?geqlf in the k columns of its distributed matrix argument A(ia:*, ja+n-k:ja+n-1).
ia (global) INTEGER. The row index in the global array A indicating the first row of sub(A).
ja (global) INTEGER.
The column index in the global array A indicating the first column of sub(A).
desca (global and local) INTEGER array of DIMENSION (dlen_). The array descriptor for the distributed matrix A.
tau (local) REAL for psorg2l; DOUBLE PRECISION for pdorg2l; COMPLEX for pcung2l; COMPLEX*16 for pzung2l. Array, DIMENSION LOCc(ja+n-1). This array contains the scalar factor tau(j) of the elementary reflector H(j), as returned by p?geqlf.
work (local) REAL for psorg2l; DOUBLE PRECISION for pdorg2l; COMPLEX for pcung2l; COMPLEX*16 for pzung2l. Workspace array, DIMENSION (lwork).
lwork (local or global) INTEGER. The dimension of the array work. lwork is local input and must satisfy lwork ≥ mpa0 + max(1, nqa0), where
iroffa = mod(ia-1, mb_a),
icoffa = mod(ja-1, nb_a),
iarow = indxg2p(ia, mb_a, myrow, rsrc_a, nprow),
iacol = indxg2p(ja, nb_a, mycol, csrc_a, npcol),
mpa0 = numroc(m+iroffa, mb_a, myrow, iarow, nprow),
nqa0 = numroc(n+icoffa, nb_a, mycol, iacol, npcol).
indxg2p and numroc are ScaLAPACK tool functions; myrow, mycol, nprow, and npcol can be determined by calling the subroutine blacs_gridinfo.
If lwork = -1, then lwork is global input and a workspace query is assumed; the routine only calculates the minimum and optimal size for all work arrays. Each of these values is returned in the first entry of the corresponding work array, and no error message is issued by pxerbla.

Output Parameters
a On exit, this array contains the local pieces of the m-by-n distributed matrix Q.
work On exit, work(1) returns the minimal and optimal lwork.
info (local) INTEGER.
= 0: successful exit.
< 0: if the i-th argument is an array and the j-th entry had an illegal value, then info = -(i*100+j); if the i-th argument is a scalar and had an illegal value, then info = -i.

p?org2r/p?ung2r
Generates all or part of the orthogonal/unitary matrix Q from a QR factorization determined by p?geqrf (unblocked algorithm).
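The minimum workspace expression mpa0 + max(1, nqa0) that recurs for the p?org2l/p?ung2l and p?org2r/p?ung2r routines depends on the process grid through numroc and indxg2p. The sketch below evaluates it in plain Python. The two helpers are re-implementations that follow the reference ScaLAPACK tool functions of the same names (an assumption of this example; a real program would call the library versions), and org2_min_lwork is a hypothetical wrapper name. Process coordinates are 0-based, as returned by blacs_gridinfo.

```python
def numroc(n, nb, iproc, isrcproc, nprocs):
    """Count the rows or columns of a block-cyclic dimension owned by iproc."""
    mydist = (nprocs + iproc - isrcproc) % nprocs
    nblocks = n // nb
    count = (nblocks // nprocs) * nb
    extra = nblocks % nprocs
    if mydist < extra:
        count += nb            # this process holds one more full block
    elif mydist == extra:
        count += n % nb        # this process holds the trailing partial block
    return count


def indxg2p(indxglob, nb, iproc, isrcproc, nprocs):
    """Process coordinate that owns global (1-based) index indxglob."""
    return (isrcproc + (indxglob - 1) // nb) % nprocs


def org2_min_lwork(m, n, ia, ja, mb_a, nb_a, rsrc_a, csrc_a,
                   myrow, mycol, nprow, npcol):
    """Evaluate mpa0 + max(1, nqa0) for one process (hypothetical helper)."""
    iroffa = (ia - 1) % mb_a
    icoffa = (ja - 1) % nb_a
    iarow = indxg2p(ia, mb_a, myrow, rsrc_a, nprow)
    iacol = indxg2p(ja, nb_a, mycol, csrc_a, npcol)
    mpa0 = numroc(m + iroffa, mb_a, myrow, iarow, nprow)
    nqa0 = numroc(n + icoffa, nb_a, mycol, iacol, npcol)
    return mpa0 + max(1, nqa0)


if __name__ == "__main__":
    # 10-by-6 matrix, 2-by-2 blocks, 2-by-2 process grid, source process (0, 0).
    for myrow in range(2):
        for mycol in range(2):
            print((myrow, mycol),
                  org2_min_lwork(10, 6, 1, 1, 2, 2, 0, 0, myrow, mycol, 2, 2))
```

Because the bound differs from process to process, a workspace query with lwork = -1 is usually the more convenient way to size work in practice.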

Syntax
call psorg2r(m, n, k, a, ia, ja, desca, tau, work, lwork, info)
call pdorg2r(m, n, k, a, ia, ja, desca, tau, work, lwork, info)
call pcung2r(m, n, k, a, ia, ja, desca, tau, work, lwork, info)
call pzung2r(m, n, k, a, ia, ja, desca, tau, work, lwork, info)

Include Files
• C: mkl_scalapack.h

Description
The p?org2r/p?ung2r routine generates an m-by-n real/complex matrix Q denoting A(ia:ia+m-1, ja:ja+n-1) with orthonormal columns, which is defined as the first n columns of a product of k elementary reflectors of order m:
Q = H(1)*H(2)*...*H(k)
as returned by p?geqrf.

Input Parameters
m (global) INTEGER. The number of rows to be operated on, that is, the number of rows of the distributed submatrix Q. m ≥ 0.
n (global) INTEGER. The number of columns to be operated on, that is, the number of columns of the distributed submatrix Q. m ≥ n ≥ 0.
k (global) INTEGER. The number of elementary reflectors whose product defines the matrix Q. n ≥ k ≥ 0.
a REAL for psorg2r; DOUBLE PRECISION for pdorg2r; COMPLEX for pcung2r; COMPLEX*16 for pzung2r. Pointer into the local memory to an array, DIMENSION(lld_a, LOCc(ja+n-1)). On entry, the j-th column must contain the vector that defines the elementary reflector H(j), ja ≤ j ≤ ja+k-1, as returned by p?geqrf in the k columns of its distributed matrix argument A(ia:*, ja:ja+k-1).
ia (global) INTEGER. The row index in the global array A indicating the first row of sub(A).
ja (global) INTEGER. The column index in the global array A indicating the first column of sub(A).
desca (global and local) INTEGER array of DIMENSION (dlen_). The array descriptor for the distributed matrix A.
tau (local) REAL for psorg2r; DOUBLE PRECISION for pdorg2r; COMPLEX for pcung2r; COMPLEX*16 for pzung2r. Array, DIMENSION LOCc(ja+k-1). This array contains the scalar factor tau(j) of the elementary reflector H(j), as returned by p?geqrf. This array is tied to the distributed matrix A.
work (local) REAL for psorg2r; DOUBLE PRECISION for pdorg2r; COMPLEX for pcung2r; COMPLEX*16 for pzung2r. Workspace array, DIMENSION (lwork).
lwork (local or global) INTEGER. The dimension of the array work. lwork is local input and must satisfy lwork ≥ mpa0 + max(1, nqa0), where
iroffa = mod(ia-1, mb_a),
icoffa = mod(ja-1, nb_a),
iarow = indxg2p(ia, mb_a, myrow, rsrc_a, nprow),
iacol = indxg2p(ja, nb_a, mycol, csrc_a, npcol),
mpa0 = numroc(m+iroffa, mb_a, myrow, iarow, nprow),
nqa0 = numroc(n+icoffa, nb_a, mycol, iacol, npcol).
indxg2p and numroc are ScaLAPACK tool functions; myrow, mycol, nprow, and npcol can be determined by calling the subroutine blacs_gridinfo.
If lwork = -1, then lwork is global input and a workspace query is assumed; the routine only calculates the minimum and optimal size for all work arrays. Each of these values is returned in the first entry of the corresponding work array, and no error message is issued by pxerbla.

Output Parameters
a On exit, this array contains the local pieces of the m-by-n distributed matrix Q.
work On exit, work(1) returns the minimal and optimal lwork.
info (local) INTEGER.
= 0: successful exit.
< 0: if the i-th argument is an array and the j-th entry had an illegal value, then info = -(i*100+j); if the i-th argument is a scalar and had an illegal value, then info = -i.

p?orgl2/p?ungl2
Generates all or part of the orthogonal/unitary matrix Q from an LQ factorization determined by p?gelqf (unblocked algorithm).
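What "the first n columns of a product of k elementary reflectors" means can be seen in a small serial example. The sketch below forms Q = H(1)*H(2)*...*H(k) applied to the first n columns of the identity, in real arithmetic and on one process. It assumes the usual LAPACK form H(j) = I - tau(j)*v(j)*v(j)', which is not spelled out in this section; the function name form_q_sketch and the explicit list-of-vectors input are illustrative assumptions, not how the MKL routines receive their data.

```python
def form_q_sketch(m, n, vs, taus):
    """Return the first n columns of H(1)*H(2)*...*H(k) as an m-by-n list of rows.

    Each reflector is assumed to be H(j) = I - taus[j] * v * v^T with v = vs[j]
    (a length-m list). Hypothetical serial illustration, not an MKL routine.
    """
    # Start from the first n columns of the m-by-m identity.
    q = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(m)]
    # Apply H(k) first and H(1) last, so the result is H(1)*...*H(k)*E.
    for v, tau in zip(reversed(vs), reversed(taus)):
        for c in range(n):
            dot = sum(v[i] * q[i][c] for i in range(m))
            for i in range(m):
                q[i][c] -= tau * v[i] * dot
    return q


if __name__ == "__main__":
    # Two reflectors of order 3; tau = 2 / (v^T v) makes each H orthogonal.
    vs = [[1.0, 2.0, 2.0], [0.0, 1.0, 1.0]]
    taus = [2.0 / 9.0, 1.0]
    for row in form_q_sketch(3, 2, vs, taus):
        print(row)
```

The columns of the result are orthonormal up to rounding, which is the property the routine descriptions refer to.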

Syntax
call psorgl2(m, n, k, a, ia, ja, desca, tau, work, lwork, info)
call pdorgl2(m, n, k, a, ia, ja, desca, tau, work, lwork, info)
call pcungl2(m, n, k, a, ia, ja, desca, tau, work, lwork, info)
call pzungl2(m, n, k, a, ia, ja, desca, tau, work, lwork, info)

Include Files
• C: mkl_scalapack.h

Description
The p?orgl2/p?ungl2 routine generates an m-by-n real/complex matrix Q denoting A(ia:ia+m-1, ja:ja+n-1) with orthonormal rows, which is defined as the first m rows of a product of k elementary reflectors of order n
Q = H(k)*...*H(2)*H(1) (for real flavors),
Q = (H(k))^H*...*(H(2))^H*(H(1))^H (for complex flavors)
as returned by p?gelqf.

Input Parameters
m (global) INTEGER. The number of rows to be operated on, that is, the number of rows of the distributed submatrix Q. m ≥ 0.
n (global) INTEGER. The number of columns to be operated on, that is, the number of columns of the distributed submatrix Q. n ≥ m ≥ 0.
k (global) INTEGER. The number of elementary reflectors whose product defines the matrix Q. m ≥ k ≥ 0.
a REAL for psorgl2; DOUBLE PRECISION for pdorgl2; COMPLEX for pcungl2; COMPLEX*16 for pzungl2. Pointer into the local memory to an array, DIMENSION (lld_a, LOCc(ja+n-1)). On entry, the i-th row must contain the vector that defines the elementary reflector H(i), ia ≤ i ≤ ia+k-1, as returned by p?gelqf in the k rows of its distributed matrix argument A(ia:ia+k-1, ja:*).
ia (global) INTEGER. The row index in the global array A indicating the first row of sub(A).
ja (global) INTEGER. The column index in the global array A indicating the first column of sub(A).
desca (global and local) INTEGER array of DIMENSION (dlen_). The array descriptor for the distributed matrix A.
tau (local) REAL for psorgl2; DOUBLE PRECISION for pdorgl2; COMPLEX for pcungl2; COMPLEX*16 for pzungl2. Array, DIMENSION LOCr(ja+k-1).
This array contains the scalar factors tau(i) of the elementary reflectors H(i), as returned by p?gelqf. This array is tied to the distributed matrix A.
work (local) REAL for psorgl2; DOUBLE PRECISION for pdorgl2; COMPLEX for pcungl2; COMPLEX*16 for pzungl2. Workspace array, DIMENSION (lwork).
lwork (local or global) INTEGER. The dimension of the array work. lwork is local input and must satisfy lwork ≥ nqa0 + max(1, mpa0), where
iroffa = mod(ia-1, mb_a),
icoffa = mod(ja-1, nb_a),
iarow = indxg2p(ia, mb_a, myrow, rsrc_a, nprow),
iacol = indxg2p(ja, nb_a, mycol, csrc_a, npcol),
mpa0 = numroc(m+iroffa, mb_a, myrow, iarow, nprow),
nqa0 = numroc(n+icoffa, nb_a, mycol, iacol, npcol).
indxg2p and numroc are ScaLAPACK tool functions; myrow, mycol, nprow, and npcol can be determined by calling the subroutine blacs_gridinfo.
If lwork = -1, then lwork is global input and a workspace query is assumed; the routine only calculates the minimum and optimal size for all work arrays. Each of these values is returned in the first entry of the corresponding work array, and no error message is issued by pxerbla.

Output Parameters
a On exit, this array contains the local pieces of the m-by-n distributed matrix Q.
work On exit, work(1) returns the minimal and optimal lwork.
info (local) INTEGER.
= 0: successful exit.
< 0: if the i-th argument is an array and the j-th entry had an illegal value, then info = -(i*100+j); if the i-th argument is a scalar and had an illegal value, then info = -i.

p?orgr2/p?ungr2
Generates all or part of the orthogonal/unitary matrix Q from an RQ factorization determined by p?gerqf (unblocked algorithm).

Syntax
call psorgr2(m, n, k, a, ia, ja, desca, tau, work, lwork, info)
call pdorgr2(m, n, k, a, ia, ja, desca, tau, work, lwork, info)
call pcungr2(m, n, k, a, ia, ja, desca, tau, work, lwork, info)
call pzungr2(m, n, k, a, ia, ja, desca, tau, work, lwork, info)

Include Files
• C: mkl_scalapack.h

Description
The p?orgr2/p?ungr2 routine generates an m-by-n real/complex matrix Q denoting A(ia:ia+m-1, ja:ja+n-1) with orthonormal rows, which is defined as the last m rows of a product of k elementary reflectors of order n
Q = H(1)*H(2)*...*H(k) (for real flavors);
Q = (H(1))^H*(H(2))^H*...*(H(k))^H (for complex flavors)
as returned by p?gerqf.

Input Parameters
m (global) INTEGER. The number of rows to be operated on, that is, the number of rows of the distributed submatrix Q. m ≥ 0.
n (global) INTEGER. The number of columns to be operated on, that is, the number of columns of the distributed submatrix Q. n ≥ m ≥ 0.
k (global) INTEGER. The number of elementary reflectors whose product defines the matrix Q. m ≥ k ≥ 0.
a REAL for psorgr2; DOUBLE PRECISION for pdorgr2; COMPLEX for pcungr2; COMPLEX*16 for pzungr2. Pointer into the local memory to an array, DIMENSION(lld_a, LOCc(ja+n-1)). On entry, the i-th row must contain the vector that defines the elementary reflector H(i), ia+m-k ≤ i ≤ ia+m-1, as returned by p?gerqf in the k rows of its distributed matrix argument A(ia+m-k:ia+m-1, ja:*).
ia (global) INTEGER. The row index in the global array A indicating the first row of sub(A).
ja (global) INTEGER. The column index in the global array A indicating the first column of sub(A).
desca (global and local) INTEGER array of DIMENSION (dlen_). The array descriptor for the distributed matrix A.
tau (local) REAL for psorgr2; DOUBLE PRECISION for pdorgr2; COMPLEX for pcungr2; COMPLEX*16 for pzungr2. Array, DIMENSION LOCr(ja+m-1). This array contains the scalar factors tau(i) of the elementary reflectors H(i), as returned by p?gerqf.
This array is tied to the distributed matrix A.
work (local) REAL for psorgr2 DOUBLE PRECISION for pdorgr2 COMPLEX for pcungr2 COMPLEX*16 for pzungr2. Workspace array, DIMENSION (lwork).
lwork (local or global) INTEGER. The dimension of the array work. lwork is local input and must be at least nqa0 + max(1, mpa0), where
iroffa = mod(ia-1, mb_a), icoffa = mod(ja-1, nb_a),
iarow = indxg2p(ia, mb_a, myrow, rsrc_a, nprow),
iacol = indxg2p(ja, nb_a, mycol, csrc_a, npcol),
mpa0 = numroc(m+iroffa, mb_a, myrow, iarow, nprow),
nqa0 = numroc(n+icoffa, nb_a, mycol, iacol, npcol).
indxg2p and numroc are ScaLAPACK tool functions; myrow, mycol, nprow, and npcol can be determined by calling the subroutine blacs_gridinfo.
If lwork = -1, then lwork is global input and a workspace query is assumed; the routine only calculates the minimum and optimal size for all work arrays. Each of these values is returned in the first entry of the corresponding work array, and no error message is issued by pxerbla.
Output Parameters
a On exit, this array contains the local pieces of the m-by-n distributed matrix Q.
work On exit, work(1) returns the minimal and optimal lwork.
info (local) INTEGER.
= 0: successful exit
< 0: if the i-th argument is an array and the j-th entry had an illegal value, then info = -(i*100+j); if the i-th argument is a scalar and had an illegal value, then info = -i.

p?orm2l/p?unm2l
Multiplies a general matrix by the orthogonal/unitary matrix from a QL factorization determined by p?geqlf (unblocked algorithm).
Syntax
call psorm2l(side, trans, m, n, k, a, ia, ja, desca, tau, c, ic, jc, descc, work, lwork, info)
call pdorm2l(side, trans, m, n, k, a, ia, ja, desca, tau, c, ic, jc, descc, work, lwork, info)
call pcunm2l(side, trans, m, n, k, a, ia, ja, desca, tau, c, ic, jc, descc, work, lwork, info)
call pzunm2l(side, trans, m, n, k, a, ia, ja, desca, tau, c, ic, jc, descc, work, lwork, info)
Include Files
• C: mkl_scalapack.h
Description
The p?orm2l/p?unm2l routine overwrites the general real/complex m-by-n distributed matrix sub(C) = C(ic:ic+m-1, jc:jc+n-1) with
Q*sub(C) if side = 'L' and trans = 'N', or
Q^T*sub(C) / Q^H*sub(C) if side = 'L' and trans = 'T' (for real flavors) or trans = 'C' (for complex flavors), or
sub(C)*Q if side = 'R' and trans = 'N', or
sub(C)*Q^T / sub(C)*Q^H if side = 'R' and trans = 'T' (for real flavors) or trans = 'C' (for complex flavors),
where Q is a real orthogonal or complex unitary distributed matrix defined as the product of k elementary reflectors
Q = H(k)*...*H(2)*H(1)
as returned by p?geqlf. Q is of order m if side = 'L' and of order n if side = 'R'.
Input Parameters
side (global) CHARACTER.
= 'L': apply Q or Q^T for real flavors (Q^H for complex flavors) from the left,
= 'R': apply Q or Q^T for real flavors (Q^H for complex flavors) from the right.
trans (global) CHARACTER.
= 'N': apply Q (no transpose)
= 'T': apply Q^T (transpose, for real flavors)
= 'C': apply Q^H (conjugate transpose, for complex flavors)
m (global) INTEGER. The number of rows to be operated on, that is, the number of rows of the distributed submatrix sub(C). m ≥ 0.
n (global) INTEGER. The number of columns to be operated on, that is, the number of columns of the distributed submatrix sub(C). n ≥ 0.
k (global) INTEGER. The number of elementary reflectors whose product defines the matrix Q. If side = 'L', m ≥ k ≥ 0; if side = 'R', n ≥ k ≥ 0.
a (local) REAL for psorm2l DOUBLE PRECISION for pdorm2l COMPLEX for pcunm2l COMPLEX*16 for pzunm2l. Pointer into the local memory to an array, DIMENSION (lld_a, LOCc(ja+k-1)). On entry, the j-th column must contain the vector that defines the elementary reflector H(j), ja ≤ j ≤ ja+k-1, as returned by p?geqlf in the k columns of its distributed matrix argument A(ia:*, ja:ja+k-1). The argument A(ia:*, ja:ja+k-1) is modified by the routine but restored on exit. If side = 'L', lld_a ≥ max(1, LOCr(ia+m-1)); if side = 'R', lld_a ≥ max(1, LOCr(ia+n-1)).
ia (global) INTEGER. The row index in the global array A indicating the first row of sub(A).
ja (global) INTEGER. The column index in the global array A indicating the first column of sub(A).
desca (global and local) INTEGER array of DIMENSION (dlen_). The array descriptor for the distributed matrix A.
tau (local) REAL for psorm2l DOUBLE PRECISION for pdorm2l COMPLEX for pcunm2l COMPLEX*16 for pzunm2l. Array, DIMENSION LOCc(ja+n-1). This array contains the scalar factors tau(j) of the elementary reflectors H(j), as returned by p?geqlf. This array is tied to the distributed matrix A.
c (local) REAL for psorm2l DOUBLE PRECISION for pdorm2l COMPLEX for pcunm2l COMPLEX*16 for pzunm2l. Pointer into the local memory to an array, DIMENSION (lld_c, LOCc(jc+n-1)). On entry, the local pieces of the distributed matrix sub(C).
ic (global) INTEGER. The row index in the global array C indicating the first row of sub(C).
jc (global) INTEGER. The column index in the global array C indicating the first column of sub(C).
descc (global and local) INTEGER array of DIMENSION (dlen_). The array descriptor for the distributed matrix C.
work (local) REAL for psorm2l DOUBLE PRECISION for pdorm2l COMPLEX for pcunm2l COMPLEX*16 for pzunm2l. Workspace array, DIMENSION (lwork). On exit, work(1) returns the minimal and optimal lwork.
lwork (local or global) INTEGER. The dimension of the array work.
lwork is local input and must satisfy:
if side = 'L', lwork ≥ mpc0 + max(1, nqc0),
if side = 'R', lwork ≥ nqc0 + max(max(1, mpc0), numroc(numroc(n+icoffc, nb_a, 0, 0, npcol), nb_a, 0, 0, lcmq)),
where
lcmq = lcm/npcol, lcm = ilcm(nprow, npcol),
iroffc = mod(ic-1, mb_c), icoffc = mod(jc-1, nb_c),
icrow = indxg2p(ic, mb_c, myrow, rsrc_c, nprow),
iccol = indxg2p(jc, nb_c, mycol, csrc_c, npcol),
mpc0 = numroc(m+iroffc, mb_c, myrow, icrow, nprow),
nqc0 = numroc(n+icoffc, nb_c, mycol, iccol, npcol).
ilcm, indxg2p, and numroc are ScaLAPACK tool functions; myrow, mycol, nprow, and npcol can be determined by calling the subroutine blacs_gridinfo.
If lwork = -1, then lwork is global input and a workspace query is assumed; the routine only calculates the minimum and optimal size for all work arrays. Each of these values is returned in the first entry of the corresponding work array, and no error message is issued by pxerbla.
Output Parameters
c On exit, c is overwritten by Q*sub(C), or Q^T*sub(C) / Q^H*sub(C), or sub(C)*Q, or sub(C)*Q^T / sub(C)*Q^H.
work On exit, work(1) returns the minimal and optimal lwork.
info (local) INTEGER.
= 0: successful exit
< 0: if the i-th argument is an array and the j-th entry had an illegal value, then info = -(i*100+j); if the i-th argument is a scalar and had an illegal value, then info = -i.
NOTE The distributed submatrices A(ia:*, ja:*) and C(ic:ic+m-1, jc:jc+n-1) must satisfy some alignment properties, namely the following expressions should be true:
If side = 'L', (mb_a.eq.mb_c .AND. iroffa.eq.iroffc .AND. iarow.eq.icrow)
If side = 'R', (mb_a.eq.nb_c .AND. iroffa.eq.iroffc).

p?orm2r/p?unm2r
Multiplies a general matrix by the orthogonal/unitary matrix from a QR factorization determined by p?geqrf (unblocked algorithm).
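The workspace bound shared by p?orm2l and p?orm2r is built from the tool functions numroc, indxg2p, and ilcm. The following Python sketch re-implements those helpers serially to show how the bound could be evaluated for a single process. It assumes the usual ScaLAPACK definitions mpc0 = numroc(m+iroffc, mb_c, myrow, icrow, nprow) and nqc0 = numroc(n+icoffc, nb_c, mycol, iccol, npcol); the function name lwork_orm2 and its argument order are hypothetical and not part of the library. In a real program, a workspace query with lwork = -1 remains the reliable way to obtain the value.

```python
from math import gcd


def numroc(n, nb, iproc, isrcproc, nprocs):
    """Number of rows or columns of a block-cyclic dimension owned by iproc."""
    mydist = (nprocs + iproc - isrcproc) % nprocs
    nblocks = n // nb
    count = (nblocks // nprocs) * nb
    extra = nblocks % nprocs
    if mydist < extra:
        count += nb          # this process owns one more full block
    elif mydist == extra:
        count += n % nb      # this process owns the trailing partial block
    return count


def indxg2p(indxglob, nb, iproc, isrcproc, nprocs):
    """Process coordinate that owns global index indxglob (1-based).

    iproc is unused; it is kept only to mirror the Fortran argument list.
    """
    return (isrcproc + (indxglob - 1) // nb) % nprocs


def ilcm(m, n):
    """Least common multiple of two positive integers."""
    return m * n // gcd(m, n)


def lwork_orm2(side, m, n, ic, jc, mb_c, nb_c, rsrc_c, csrc_c,
               nb_a, myrow, mycol, nprow, npcol):
    """Hypothetical helper: minimal lwork for p?orm2l / p?orm2r on one process."""
    iroffc = (ic - 1) % mb_c
    icoffc = (jc - 1) % nb_c
    icrow = indxg2p(ic, mb_c, myrow, rsrc_c, nprow)
    iccol = indxg2p(jc, nb_c, mycol, csrc_c, npcol)
    mpc0 = numroc(m + iroffc, mb_c, myrow, icrow, nprow)
    nqc0 = numroc(n + icoffc, nb_c, mycol, iccol, npcol)
    if side == 'L':
        return mpc0 + max(1, nqc0)
    lcmq = ilcm(nprow, npcol) // npcol
    inner = numroc(numroc(n + icoffc, nb_a, 0, 0, npcol), nb_a, 0, 0, lcmq)
    return nqc0 + max(max(1, mpc0), inner)


# Process (0, 0) of a 2-by-3 grid, 10-by-8 sub(C) starting at C(1, 1), 2-by-2 blocks.
lwork_left = lwork_orm2('L', 10, 8, 1, 1, 2, 2, 0, 0, 2, 0, 0, 2, 3)
lwork_right = lwork_orm2('R', 10, 8, 1, 1, 2, 2, 0, 0, 2, 0, 0, 2, 3)
```

Each process evaluates the bound with its own myrow and mycol, so the values may differ across the grid.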
Syntax
call psorm2r(side, trans, m, n, k, a, ia, ja, desca, tau, c, ic, jc, descc, work, lwork, info)
call pdorm2r(side, trans, m, n, k, a, ia, ja, desca, tau, c, ic, jc, descc, work, lwork, info)
call pcunm2r(side, trans, m, n, k, a, ia, ja, desca, tau, c, ic, jc, descc, work, lwork, info)
call pzunm2r(side, trans, m, n, k, a, ia, ja, desca, tau, c, ic, jc, descc, work, lwork, info)
Include Files
• C: mkl_scalapack.h
Description
The p?orm2r/p?unm2r routine overwrites the general real/complex m-by-n distributed matrix sub(C) = C(ic:ic+m-1, jc:jc+n-1) with
Q*sub(C) if side = 'L' and trans = 'N', or
Q^T*sub(C) / Q^H*sub(C) if side = 'L' and trans = 'T' (for real flavors) or trans = 'C' (for complex flavors), or
sub(C)*Q if side = 'R' and trans = 'N', or
sub(C)*Q^T / sub(C)*Q^H if side = 'R' and trans = 'T' (for real flavors) or trans = 'C' (for complex flavors),
where Q is a real orthogonal or complex unitary matrix defined as the product of k elementary reflectors
Q = H(k)*...*H(2)*H(1)
as returned by p?geqrf. Q is of order m if side = 'L' and of order n if side = 'R'.
Input Parameters
side (global) CHARACTER.
= 'L': apply Q or Q^T for real flavors (Q^H for complex flavors) from the left,
= 'R': apply Q or Q^T for real flavors (Q^H for complex flavors) from the right.
trans (global) CHARACTER.
= 'N': apply Q (no transpose)
= 'T': apply Q^T (transpose, for real flavors)
= 'C': apply Q^H (conjugate transpose, for complex flavors)
m (global) INTEGER. The number of rows to be operated on, that is, the number of rows of the distributed submatrix sub(C). m ≥ 0.
n (global) INTEGER. The number of columns to be operated on, that is, the number of columns of the distributed submatrix sub(C). n ≥ 0.
k (global) INTEGER. The number of elementary reflectors whose product defines the matrix Q. If side = 'L', m ≥ k ≥ 0; if side = 'R', n ≥ k ≥ 0.
a (local) REAL for psorm2r DOUBLE PRECISION for pdorm2r COMPLEX for pcunm2r COMPLEX*16 for pzunm2r.
Pointer into the local memory to an array, DIMENSION (lld_a, LOCc(ja+k-1)). On entry, the j-th column must contain the vector that defines the elementary reflector H(j), ja ≤ j ≤ ja+k-1, as returned by p?geqrf in the k columns of its distributed matrix argument A(ia:*, ja:ja+k-1). The argument A(ia:*, ja:ja+k-1) is modified by the routine but restored on exit. If side = 'L', lld_a ≥ max(1, LOCr(ia+m-1)); if side = 'R', lld_a ≥ max(1, LOCr(ia+n-1)).
ia (global) INTEGER. The row index in the global array A indicating the first row of sub(A).
ja (global) INTEGER. The column index in the global array A indicating the first column of sub(A).
desca (global and local) INTEGER array of DIMENSION (dlen_). The array descriptor for the distributed matrix A.
tau (local) REAL for psorm2r DOUBLE PRECISION for pdorm2r COMPLEX for pcunm2r COMPLEX*16 for pzunm2r. Array, DIMENSION LOCc(ja+k-1). This array contains the scalar factors tau(j) of the elementary reflectors H(j), as returned by p?geqrf. This array is tied to the distributed matrix A.
c (local) REAL for psorm2r DOUBLE PRECISION for pdorm2r COMPLEX for pcunm2r COMPLEX*16 for pzunm2r. Pointer into the local memory to an array, DIMENSION (lld_c, LOCc(jc+n-1)). On entry, the local pieces of the distributed matrix sub(C).
ic (global) INTEGER. The row index in the global array C indicating the first row of sub(C).
jc (global) INTEGER. The column index in the global array C indicating the first column of sub(C).
descc (global and local) INTEGER array of DIMENSION (dlen_). The array descriptor for the distributed matrix C.
work (local) REAL for psorm2r DOUBLE PRECISION for pdorm2r COMPLEX for pcunm2r COMPLEX*16 for pzunm2r. Workspace array, DIMENSION (lwork).
lwork (local or global) INTEGER. The dimension of the array work.
lwork is local input and must satisfy:
if side = 'L', lwork ≥ mpc0 + max(1, nqc0),
if side = 'R', lwork ≥ nqc0 + max(max(1, mpc0), numroc(numroc(n+icoffc, nb_a, 0, 0, npcol), nb_a, 0, 0, lcmq)),
where
lcmq = lcm/npcol, lcm = ilcm(nprow, npcol),
iroffc = mod(ic-1, mb_c), icoffc = mod(jc-1, nb_c),
icrow = indxg2p(ic, mb_c, myrow, rsrc_c, nprow),
iccol = indxg2p(jc, nb_c, mycol, csrc_c, npcol),
mpc0 = numroc(m+iroffc, mb_c, myrow, icrow, nprow),
nqc0 = numroc(n+icoffc, nb_c, mycol, iccol, npcol).
ilcm, indxg2p, and numroc are ScaLAPACK tool functions; myrow, mycol, nprow, and npcol can be determined by calling the subroutine blacs_gridinfo.
If lwork = -1, then lwork is global input and a workspace query is assumed; the routine only calculates the minimum and optimal size for all work arrays. Each of these values is returned in the first entry of the corresponding work array, and no error message is issued by pxerbla.
Output Parameters
c On exit, c is overwritten by Q*sub(C), or Q^T*sub(C) / Q^H*sub(C), or sub(C)*Q, or sub(C)*Q^T / sub(C)*Q^H.
work On exit, work(1) returns the minimal and optimal lwork.
info (local) INTEGER.
= 0: successful exit
< 0: if the i-th argument is an array and the j-th entry had an illegal value, then info = -(i*100+j); if the i-th argument is a scalar and had an illegal value, then info = -i.
NOTE The distributed submatrices A(ia:*, ja:*) and C(ic:ic+m-1, jc:jc+n-1) must satisfy some alignment properties, namely the following expressions should be true:
If side = 'L', (mb_a.eq.mb_c .AND. iroffa.eq.iroffc .AND. iarow.eq.icrow)
If side = 'R', (mb_a.eq.nb_c .AND. iroffa.eq.iroffc).

p?orml2/p?unml2
Multiplies a general matrix by the orthogonal/unitary matrix from an LQ factorization determined by p?gelqf (unblocked algorithm).
Syntax
call psorml2(side, trans, m, n, k, a, ia, ja, desca, tau, c, ic, jc, descc, work, lwork, info)
call pdorml2(side, trans, m, n, k, a, ia, ja, desca, tau, c, ic, jc, descc, work, lwork, info)
call pcunml2(side, trans, m, n, k, a, ia, ja, desca, tau, c, ic, jc, descc, work, lwork, info)
call pzunml2(side, trans, m, n, k, a, ia, ja, desca, tau, c, ic, jc, descc, work, lwork, info)
Include Files
• C: mkl_scalapack.h
Description
The p?orml2/p?unml2 routine overwrites the general real/complex m-by-n distributed matrix sub(C) = C(ic:ic+m-1, jc:jc+n-1) with
Q*sub(C) if side = 'L' and trans = 'N', or
Q^T*sub(C) / Q^H*sub(C) if side = 'L' and trans = 'T' (for real flavors) or trans = 'C' (for complex flavors), or
sub(C)*Q if side = 'R' and trans = 'N', or
sub(C)*Q^T / sub(C)*Q^H if side = 'R' and trans = 'T' (for real flavors) or trans = 'C' (for complex flavors),
where Q is a real orthogonal or complex unitary distributed matrix defined as the product of k elementary reflectors
Q = H(k)*...*H(2)*H(1) (for real flavors)
Q = (H(k))^H*...*(H(2))^H*(H(1))^H (for complex flavors)
as returned by p?gelqf. Q is of order m if side = 'L' and of order n if side = 'R'.
Input Parameters
side (global) CHARACTER.
= 'L': apply Q or Q^T for real flavors (Q^H for complex flavors) from the left,
= 'R': apply Q or Q^T for real flavors (Q^H for complex flavors) from the right.
trans (global) CHARACTER.
= 'N': apply Q (no transpose)
= 'T': apply Q^T (transpose, for real flavors)
= 'C': apply Q^H (conjugate transpose, for complex flavors)
m (global) INTEGER. The number of rows to be operated on, that is, the number of rows of the distributed submatrix sub(C). m ≥ 0.
n (global) INTEGER. The number of columns to be operated on, that is, the number of columns of the distributed submatrix sub(C). n ≥ 0.
k (global) INTEGER. The number of elementary reflectors whose product defines the matrix Q. If side = 'L', m ≥ k ≥ 0; if side = 'R', n ≥ k ≥ 0.
a (local) REAL for psorml2 DOUBLE PRECISION for pdorml2 COMPLEX for pcunml2 COMPLEX*16 for pzunml2. Pointer into the local memory to an array, DIMENSION (lld_a, LOCc(ja+m-1)) if side = 'L', (lld_a, LOCc(ja+n-1)) if side = 'R', where lld_a ≥ max(1, LOCr(ia+k-1)). On entry, the i-th row must contain the vector that defines the elementary reflector H(i), ia ≤ i ≤ ia+k-1, as returned by p?gelqf in the k rows of its distributed matrix argument A(ia:ia+k-1, ja:*). The argument A(ia:ia+k-1, ja:*) is modified by the routine but restored on exit.
ia (global) INTEGER. The row index in the global array A indicating the first row of sub(A).
ja (global) INTEGER. The column index in the global array A indicating the first column of sub(A).
desca (global and local) INTEGER array of DIMENSION (dlen_). The array descriptor for the distributed matrix A.
tau (local) REAL for psorml2 DOUBLE PRECISION for pdorml2 COMPLEX for pcunml2 COMPLEX*16 for pzunml2. Array, DIMENSION LOCc(ia+k-1). This array contains the scalar factors tau(i) of the elementary reflectors H(i), as returned by p?gelqf. This array is tied to the distributed matrix A.
c (local) REAL for psorml2 DOUBLE PRECISION for pdorml2 COMPLEX for pcunml2 COMPLEX*16 for pzunml2. Pointer into the local memory to an array, DIMENSION (lld_c, LOCc(jc+n-1)). On entry, the local pieces of the distributed matrix sub(C).
ic (global) INTEGER. The row index in the global array C indicating the first row of sub(C).
jc (global) INTEGER. The column index in the global array C indicating the first column of sub(C).
descc (global and local) INTEGER array of DIMENSION (dlen_). The array descriptor for the distributed matrix C.
work (local) REAL for psorml2 DOUBLE PRECISION for pdorml2 COMPLEX for pcunml2 COMPLEX*16 for pzunml2. Workspace array, DIMENSION (lwork).
lwork (local or global) INTEGER. The dimension of the array work.
lwork is local input and must satisfy:
if side = 'L', lwork ≥ mqc0 + max(max(1, npc0), numroc(numroc(m+icoffc, mb_a, 0, 0, nprow), mb_a, 0, 0, lcmp)),
if side = 'R', lwork ≥ npc0 + max(1, mqc0),
where
lcmp = lcm/nprow, lcm = ilcm(nprow, npcol),
iroffc = mod(ic-1, mb_c), icoffc = mod(jc-1, nb_c),
icrow = indxg2p(ic, mb_c, myrow, rsrc_c, nprow),
iccol = indxg2p(jc, nb_c, mycol, csrc_c, npcol),
Mpc0 = numroc(m+icoffc, mb_c, mycol, icrow, nprow),
Nqc0 = numroc(n+iroffc, nb_c, myrow, iccol, npcol).
ilcm, indxg2p, and numroc are ScaLAPACK tool functions; myrow, mycol, nprow, and npcol can be determined by calling the subroutine blacs_gridinfo.
If lwork = -1, then lwork is global input and a workspace query is assumed; the routine only calculates the minimum and optimal size for all work arrays. Each of these values is returned in the first entry of the corresponding work array, and no error message is issued by pxerbla.
Output Parameters
c On exit, c is overwritten by Q*sub(C), or Q^T*sub(C) / Q^H*sub(C), or sub(C)*Q, or sub(C)*Q^T / sub(C)*Q^H.
work On exit, work(1) returns the minimal and optimal lwork.
info (local) INTEGER.
= 0: successful exit
< 0: if the i-th argument is an array and the j-th entry had an illegal value, then info = -(i*100+j); if the i-th argument is a scalar and had an illegal value, then info = -i.
NOTE The distributed submatrices A(ia:*, ja:*) and C(ic:ic+m-1, jc:jc+n-1) must satisfy some alignment properties, namely the following expressions should be true:
If side = 'L', (nb_a.eq.mb_c .AND. icoffa.eq.iroffc)
If side = 'R', (nb_a.eq.nb_c .AND. icoffa.eq.icoffc .AND. iacol.eq.iccol).

p?ormr2/p?unmr2
Multiplies a general matrix by the orthogonal/unitary matrix from an RQ factorization determined by p?gerqf (unblocked algorithm).
Syntax
call psormr2(side, trans, m, n, k, a, ia, ja, desca, tau, c, ic, jc, descc, work, lwork, info)
call pdormr2(side, trans, m, n, k, a, ia, ja, desca, tau, c, ic, jc, descc, work, lwork, info)
call pcunmr2(side, trans, m, n, k, a, ia, ja, desca, tau, c, ic, jc, descc, work, lwork, info)
call pzunmr2(side, trans, m, n, k, a, ia, ja, desca, tau, c, ic, jc, descc, work, lwork, info)
Include Files
• C: mkl_scalapack.h
Description
The p?ormr2/p?unmr2 routine overwrites the general real/complex m-by-n distributed matrix sub(C) = C(ic:ic+m-1, jc:jc+n-1) with
Q*sub(C) if side = 'L' and trans = 'N', or
Q^T*sub(C) / Q^H*sub(C) if side = 'L' and trans = 'T' (for real flavors) or trans = 'C' (for complex flavors), or
sub(C)*Q if side = 'R' and trans = 'N', or
sub(C)*Q^T / sub(C)*Q^H if side = 'R' and trans = 'T' (for real flavors) or trans = 'C' (for complex flavors),
where Q is a real orthogonal or complex unitary distributed matrix defined as the product of k elementary reflectors
Q = H(1)*H(2)*...*H(k) (for real flavors)
Q = (H(1))^H*(H(2))^H*...*(H(k))^H (for complex flavors)
as returned by p?gerqf. Q is of order m if side = 'L' and of order n if side = 'R'.
Input Parameters
side (global) CHARACTER.
= 'L': apply Q or Q^T for real flavors (Q^H for complex flavors) from the left,
= 'R': apply Q or Q^T for real flavors (Q^H for complex flavors) from the right.
trans (global) CHARACTER.
= 'N': apply Q (no transpose)
= 'T': apply Q^T (transpose, for real flavors)
= 'C': apply Q^H (conjugate transpose, for complex flavors)
m (global) INTEGER. The number of rows to be operated on, that is, the number of rows of the distributed submatrix sub(C). m ≥ 0.
n (global) INTEGER. The number of columns to be operated on, that is, the number of columns of the distributed submatrix sub(C). n ≥ 0.
k (global) INTEGER. The number of elementary reflectors whose product defines the matrix Q. If side = 'L', m ≥ k ≥ 0; if side = 'R', n ≥ k ≥ 0.
a (local) REAL for psormr2 DOUBLE PRECISION for pdormr2 COMPLEX for pcunmr2 COMPLEX*16 for pzunmr2. Pointer into the local memory to an array, DIMENSION (lld_a, LOCc(ja+m-1)) if side = 'L', (lld_a, LOCc(ja+n-1)) if side = 'R', where lld_a ≥ max(1, LOCr(ia+k-1)). On entry, the i-th row must contain the vector that defines the elementary reflector H(i), ia ≤ i ≤ ia+k-1, as returned by p?gerqf in the k rows of its distributed matrix argument A(ia:ia+k-1, ja:*). The argument A(ia:ia+k-1, ja:*) is modified by the routine but restored on exit.
ia (global) INTEGER. The row index in the global array A indicating the first row of sub(A).
ja (global) INTEGER. The column index in the global array A indicating the first column of sub(A).
desca (global and local) INTEGER array of DIMENSION (dlen_). The array descriptor for the distributed matrix A.
tau (local) REAL for psormr2 DOUBLE PRECISION for pdormr2 COMPLEX for pcunmr2 COMPLEX*16 for pzunmr2. Array, DIMENSION LOCc(ia+k-1). This array contains the scalar factors tau(i) of the elementary reflectors H(i), as returned by p?gerqf. This array is tied to the distributed matrix A.
c (local) REAL for psormr2 DOUBLE PRECISION for pdormr2 COMPLEX for pcunmr2 COMPLEX*16 for pzunmr2. Pointer into the local memory to an array, DIMENSION (lld_c, LOCc(jc+n-1)). On entry, the local pieces of the distributed matrix sub(C).
ic (global) INTEGER. The row index in the global array C indicating the first row of sub(C).
jc (global) INTEGER. The column index in the global array C indicating the first column of sub(C).
descc (global and local) INTEGER array of DIMENSION (dlen_). The array descriptor for the distributed matrix C.
work (local) REAL for psormr2 DOUBLE PRECISION for pdormr2 COMPLEX for pcunmr2 COMPLEX*16 for pzunmr2. Workspace array, DIMENSION (lwork).
lwork (local or global) INTEGER. The dimension of the array work.
lwork is local input and must satisfy:
if side = 'L', lwork ≥ mpc0 + max(max(1, nqc0), numroc(numroc(m+iroffc, mb_a, 0, 0, nprow), mb_a, 0, 0, lcmp)),
if side = 'R', lwork ≥ nqc0 + max(1, mpc0),
where
lcmp = lcm/nprow, lcm = ilcm(nprow, npcol),
iroffc = mod(ic-1, mb_c), icoffc = mod(jc-1, nb_c),
icrow = indxg2p(ic, mb_c, myrow, rsrc_c, nprow),
iccol = indxg2p(jc, nb_c, mycol, csrc_c, npcol),
mpc0 = numroc(m+iroffc, mb_c, myrow, icrow, nprow),
nqc0 = numroc(n+icoffc, nb_c, mycol, iccol, npcol).
ilcm, indxg2p, and numroc are ScaLAPACK tool functions; myrow, mycol, nprow, and npcol can be determined by calling the subroutine blacs_gridinfo.
If lwork = -1, then lwork is global input and a workspace query is assumed; the routine only calculates the minimum and optimal size for all work arrays. Each of these values is returned in the first entry of the corresponding work array, and no error message is issued by pxerbla.
Output Parameters
c On exit, c is overwritten by Q*sub(C), or Q^T*sub(C) / Q^H*sub(C), or sub(C)*Q, or sub(C)*Q^T / sub(C)*Q^H.
work On exit, work(1) returns the minimal and optimal lwork.
info (local) INTEGER.
= 0: successful exit
< 0: if the i-th argument is an array and the j-th entry had an illegal value, then info = -(i*100+j); if the i-th argument is a scalar and had an illegal value, then info = -i.
NOTE The distributed submatrices A(ia:*, ja:*) and C(ic:ic+m-1, jc:jc+n-1) must satisfy some alignment properties, namely the following expressions should be true:
If side = 'L', (nb_a.eq.mb_c .AND. icoffa.eq.iroffc)
If side = 'R', (nb_a.eq.nb_c .AND. icoffa.eq.icoffc .AND. iacol.eq.iccol).

p?pbtrsv
Solves a single triangular linear system via frontsolve or backsolve where the triangular matrix is a factor of a banded matrix computed by p?pbtrf.
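As a serial illustration of what a frontsolve with a banded triangular factor amounts to, the Python sketch below solves L*x = b for a lower triangular L with bw subdiagonals. It is not the distributed algorithm used by p?pbtrsv. The storage layout is an assumption modeled on LAPACK lower band storage (row d of ab holds the d-th subdiagonal), and the function name band_front_solve is hypothetical.

```python
def band_front_solve(ab, b, bw):
    """Solve L x = b by forward substitution.

    L is lower triangular with bw subdiagonals. ab[d][j] holds L[j+d][j]
    for d = 0..bw (assumed LAPACK-style lower band storage, 0-based).
    """
    n = len(b)
    x = [float(v) for v in b]
    for j in range(n):
        x[j] /= ab[0][j]                      # divide by the diagonal entry
        # Eliminate x[j] from the at most bw rows below it.
        for d in range(1, min(bw, n - 1 - j) + 1):
            x[j + d] -= ab[d][j] * x[j]
    return x


# L = [[2, 0, 0],
#      [1, 3, 0],
#      [0, 4, 5]]  with one subdiagonal (bw = 1)
ab = [[2.0, 3.0, 5.0],   # main diagonal
      [1.0, 4.0, 0.0]]   # first subdiagonal (last entry unused)
x = band_front_solve(ab, [2.0, 7.0, 23.0], 1)
```

Because each column touches only bw rows below the diagonal, the cost is proportional to n*bw rather than n*n, which is why narrow bands are the favorable case.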
Syntax
call pspbtrsv(uplo, trans, n, bw, nrhs, a, ja, desca, b, ib, descb, af, laf, work, lwork, info)
call pdpbtrsv(uplo, trans, n, bw, nrhs, a, ja, desca, b, ib, descb, af, laf, work, lwork, info)
call pcpbtrsv(uplo, trans, n, bw, nrhs, a, ja, desca, b, ib, descb, af, laf, work, lwork, info)
call pzpbtrsv(uplo, trans, n, bw, nrhs, a, ja, desca, b, ib, descb, af, laf, work, lwork, info)
Include Files
• C: mkl_scalapack.h
Description
The p?pbtrsv routine solves a banded triangular system of linear equations
A(1:n, ja:ja+n-1)*X = B(jb:jb+n-1, 1:nrhs)
or
A(1:n, ja:ja+n-1)^T*X = B(jb:jb+n-1, 1:nrhs) for real flavors,
A(1:n, ja:ja+n-1)^H*X = B(jb:jb+n-1, 1:nrhs) for complex flavors,
where A(1:n, ja:ja+n-1) is a banded triangular matrix factor produced by the Cholesky factorization code p?pbtrf and is stored in A(1:n, ja:ja+n-1) and af. The matrix stored in A(1:n, ja:ja+n-1) is either upper or lower triangular according to uplo.
Routine p?pbtrf must be called first.
Input Parameters
uplo (global) CHARACTER. Must be 'U' or 'L'.
If uplo = 'U', the upper triangle of A(1:n, ja:ja+n-1) is stored;
if uplo = 'L', the lower triangle of A(1:n, ja:ja+n-1) is stored.
trans (global) CHARACTER. Must be 'N' or 'T' or 'C'.
If trans = 'N', solve with A(1:n, ja:ja+n-1);
if trans = 'T' or 'C' for real flavors, solve with A(1:n, ja:ja+n-1)^T;
if trans = 'C' for complex flavors, solve with the conjugate transpose (A(1:n, ja:ja+n-1))^H.
n (global) INTEGER. The number of rows and columns to be operated on, that is, the order of the distributed submatrix A(1:n, ja:ja+n-1). n ≥ 0.
bw (global) INTEGER. The number of subdiagonals in 'L' or 'U', 0 ≤ bw ≤ n-1.
nrhs (global) INTEGER. The number of right-hand sides; the number of columns of the distributed submatrix B(jb:jb+n-1, 1:nrhs); nrhs ≥ 0.
a (local) REAL for pspbtrsv DOUBLE PRECISION for pdpbtrsv COMPLEX for pcpbtrsv COMPLEX*16 for pzpbtrsv.
Pointer into the local memory to an array with the first DIMENSION lld_a ≥ (bw+1), stored in desca. On entry, this array contains the local pieces of the n-by-n symmetric banded distributed Cholesky factor L or L^T*A(1:n, ja:ja+n-1). This local portion is stored in the packed banded format used in LAPACK. See the Application Notes below and the ScaLAPACK manual for more detail on the format of distributed matrices.
ja (global) INTEGER. The index in the global array A that points to the start of the matrix to be operated on (which may be either all of A or a submatrix of A).
desca (global and local) INTEGER array, DIMENSION (dlen_). The array descriptor for the distributed matrix A. If 1D type (dtype_a = 501), then dlen ≥ 7; if 2D type (dtype_a = 1), then dlen ≥ 9. Contains information on the mapping of A to memory. See the ScaLAPACK manual for a full description and options.
b (local) REAL for pspbtrsv DOUBLE PRECISION for pdpbtrsv COMPLEX for pcpbtrsv COMPLEX*16 for pzpbtrsv. Pointer into the local memory to an array of local leading DIMENSION lld_b ≥ nb. On entry, this array contains the local pieces of the right-hand sides B(jb:jb+n-1, 1:nrhs).
ib (global) INTEGER. The row index in the global array B that points to the first row of the matrix to be operated on (which may be either all of B or a submatrix of B).
descb (global and local) INTEGER array, DIMENSION (dlen_). The array descriptor for the distributed matrix B. If 1D type (dtype_b = 502), then dlen ≥ 7; if 2D type (dtype_b = 1), then dlen ≥ 9. Contains information on the mapping of B to memory. See the ScaLAPACK manual for a full description and options.
laf (local) INTEGER. The size of the user-input auxiliary Fillin space af. Must be laf ≥ (nb+2*bw)*bw. If laf is not large enough, an error code will be returned and the minimum acceptable size will be returned in af(1).
work (local) REAL for pspbtrsv DOUBLE PRECISION for pdpbtrsv COMPLEX for pcpbtrsv COMPLEX*16 for pzpbtrsv. The array work is a temporary workspace array of DIMENSION lwork. This space may be overwritten in between calls to routines.
lwork (local or global) INTEGER. The size of the user-input workspace work; must be at least bw*nrhs. If lwork is too small, the minimal acceptable size will be returned in work(1) and an error code is returned.
Output Parameters
af (local) REAL for pspbtrsv DOUBLE PRECISION for pdpbtrsv COMPLEX for pcpbtrsv COMPLEX*16 for pzpbtrsv. The array af is of DIMENSION laf. It contains auxiliary Fillin space. Fillin is created during the factorization routine p?pbtrf and this is stored in af. If a linear system is to be solved using p?pbtrs after the factorization routine, af must not be altered after the factorization.
b On exit, this array contains the local piece of the solutions distributed matrix X.
work(1) On exit, work(1) contains the minimum value of lwork.
info (local) INTEGER.
= 0: successful exit
< 0: if the i-th argument is an array and the j-th entry had an illegal value, then info = -(i*100+j); if the i-th argument is a scalar and had an illegal value, then info = -i.
Application Notes
If the factorization routine and the solve routine are to be called separately to solve various sets of right-hand sides using the same coefficient matrix, the auxiliary space af must not be altered between calls to the factorization routine and the solve routine.
The best algorithm for solving banded and tridiagonal linear systems depends on a variety of parameters, especially the bandwidth. Currently, only algorithms designed for the case N/P >> bw are implemented. These algorithms go by many names, including Divide and Conquer, Partitioning, domain decomposition-type, etc.
The Divide and Conquer algorithm assumes the matrix is narrowly banded compared with the number of equations.
In this situation, it is best to distribute the input matrix A one-dimensionally, with columns atomic and rows divided amongst the processes. The basic algorithm divides the banded matrix up into P pieces with one stored on each processor, and then proceeds in 2 phases for the factorization or 3 for the solution of a linear system.
1. Local Phase: The individual pieces are factored independently and in parallel. These factors are applied to the matrix creating fill-in, which is stored in a non-inspectable way in auxiliary space af. Mathematically, this is equivalent to reordering the matrix A as P*A*P^T and then factoring the principal leading submatrix of size equal to the sum of the sizes of the matrices factored on each processor. The factors of these submatrices overwrite the corresponding parts of A in memory.
2. Reduced System Phase: A small (bw*(P-1)) system is formed representing interaction of the larger blocks and is stored (as are its factors) in the space af. A parallel Block Cyclic Reduction algorithm is used. For a linear system, a parallel front solve followed by an analogous backsolve, both using the structure of the factored matrix, are performed.
3. Back Substitution Phase: For a linear system, a local back substitution is performed on each processor in parallel.

p?pttrsv
Solves a single triangular linear system via frontsolve or backsolve where the triangular matrix is a factor of a tridiagonal matrix computed by p?pttrf.
Syntax
call pspttrsv(uplo, n, nrhs, d, e, ja, desca, b, ib, descb, af, laf, work, lwork, info)
call pdpttrsv(uplo, n, nrhs, d, e, ja, desca, b, ib, descb, af, laf, work, lwork, info)
call pcpttrsv(uplo, trans, n, nrhs, d, e, ja, desca, b, ib, descb, af, laf, work, lwork, info)
call pzpttrsv(uplo, trans, n, nrhs, d, e, ja, desca, b, ib, descb, af, laf, work, lwork, info)
Include Files
• C: mkl_scalapack.h
Description
The p?pttrsv routine solves a tridiagonal triangular system of linear equations
A(1:n, ja:ja+n-1)*X = B(jb:jb+n-1, 1:nrhs)
or
A(1:n, ja:ja+n-1)^T*X = B(jb:jb+n-1, 1:nrhs) for real flavors,
A(1:n, ja:ja+n-1)^H*X = B(jb:jb+n-1, 1:nrhs) for complex flavors,
where A(1:n, ja:ja+n-1) is a tridiagonal triangular matrix factor produced by the Cholesky factorization code p?pttrf and is stored in A(1:n, ja:ja+n-1) and af. The matrix stored in A(1:n, ja:ja+n-1) is either upper or lower triangular according to uplo.
Routine p?pttrf must be called first.
Input Parameters
uplo (global) CHARACTER. Must be 'U' or 'L'.
If uplo = 'U', the upper triangle of A(1:n, ja:ja+n-1) is stored;
if uplo = 'L', the lower triangle of A(1:n, ja:ja+n-1) is stored.
trans (global) CHARACTER. Must be 'N' or 'C'.
If trans = 'N', solve with A(1:n, ja:ja+n-1);
if trans = 'C' (for complex flavors), solve with the conjugate transpose (A(1:n, ja:ja+n-1))^H.
n (global) INTEGER. The number of rows and columns to be operated on, that is, the order of the distributed submatrix A(1:n, ja:ja+n-1). n ≥ 0.
nrhs (global) INTEGER. The number of right-hand sides; the number of columns of the distributed submatrix B(jb:jb+n-1, 1:nrhs); nrhs ≥ 0.
d (local) REAL for pspttrsv DOUBLE PRECISION for pdpttrsv COMPLEX for pcpttrsv COMPLEX*16 for pzpttrsv. Pointer to the local part of the global vector storing the main diagonal of the matrix; must be of size ≥ desca(nb_).
e (local) REAL for pspttrsv, DOUBLE PRECISION for pdpttrsv, COMPLEX for pcpttrsv, COMPLEX*16 for pzpttrsv. Pointer to the local part of the global vector storing the upper diagonal of the matrix; must be of size ≥ desca(nb_). Globally, e(n) is not referenced, and e must be aligned with d.
ja (global) INTEGER. The index in the global array A that points to the start of the matrix to be operated on (which may be either all of A or a submatrix of A).
desca (global and local) INTEGER array, DIMENSION (dlen_). The array descriptor for the distributed matrix A.
If 1D type (dtype_a = 501 or 502), then dlen ≥ 7;
if 2D type (dtype_a = 1), then dlen ≥ 9.
Contains information on the mapping of A to memory. See the ScaLAPACK manual for a full description and options.
b (local) REAL for pspttrsv, DOUBLE PRECISION for pdpttrsv, COMPLEX for pcpttrsv, COMPLEX*16 for pzpttrsv. Pointer into the local memory to an array of local leading dimension lld_b ≥ nb. On entry, this array contains the local pieces of the right-hand sides B(ib:ib+n-1, 1:nrhs).
ib (global) INTEGER. The row index in the global array B that points to the first row of the matrix to be operated on (which may be either all of B or a submatrix of B).
descb (global and local) INTEGER array, DIMENSION (dlen_). The array descriptor for the distributed matrix B.
If 1D type (dtype_b = 502), then dlen ≥ 7;
if 2D type (dtype_b = 1), then dlen ≥ 9.
Contains information on the mapping of B to memory. See the ScaLAPACK manual for a full description and options.
laf (local) INTEGER. The size of the user-input auxiliary fill-in space af. Must be laf ≥ (nb+2*bw)*bw. If laf is not large enough, an error code is returned and the minimum acceptable size is returned in af(1).
work (local) REAL for pspttrsv, DOUBLE PRECISION for pdpttrsv, COMPLEX for pcpttrsv, COMPLEX*16 for pzpttrsv. The array work is a temporary workspace array of DIMENSION lwork. This space may be overwritten in between calls to routines.
lwork (local or global) INTEGER. The size of the user-input workspace work; must be at least lwork ≥ (10+2*min(100, nrhs))*npcol+4*nrhs. If lwork is too small, the minimal acceptable size is returned in work(1) and an error code is returned.

Output Parameters
d, e (local) REAL for pspttrsv, DOUBLE PRECISION for pdpttrsv, COMPLEX for pcpttrsv, COMPLEX*16 for pzpttrsv. On exit, these arrays contain information on the factors of the matrix.
af (local) REAL for pspttrsv, DOUBLE PRECISION for pdpttrsv, COMPLEX for pcpttrsv, COMPLEX*16 for pzpttrsv. The array af is of DIMENSION laf. It contains auxiliary fill-in space. Fill-in is created during the factorization routine p?pttrf and is stored in af. If a linear system is to be solved using p?pttrs after the factorization routine, af must not be altered after the factorization.
b On exit, this array contains the local piece of the solutions distributed matrix X.
work(1) On exit, work(1) contains the minimum value of lwork.
info (local) INTEGER.
= 0: successful exit
< 0: if the i-th argument is an array and the j-th entry had an illegal value, then info = -(i*100+j); if the i-th argument is a scalar and had an illegal value, then info = -i.

p?potf2
Computes the Cholesky factorization of a symmetric/Hermitian positive definite matrix (local unblocked algorithm).

Syntax
call pspotf2(uplo, n, a, ia, ja, desca, info)
call pdpotf2(uplo, n, a, ia, ja, desca, info)
call pcpotf2(uplo, n, a, ia, ja, desca, info)
call pzpotf2(uplo, n, a, ia, ja, desca, info)

Include Files
• C: mkl_scalapack.h

Description
The p?potf2 routine computes the Cholesky factorization of a real symmetric or complex Hermitian positive definite distributed matrix sub(A) = A(ia:ia+n-1, ja:ja+n-1).
The factorization has the form
sub(A) = U'*U, if uplo = 'U', or
sub(A) = L*L', if uplo = 'L',
where U is an upper triangular matrix and L is lower triangular. X' denotes the transpose (conjugate transpose) of X.
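The factorization sub(A) = L*L' can be illustrated with a small serial sketch. This is not the distributed p?potf2 implementation and does not call Intel MKL; the function name potf2_lower is hypothetical, the matrix is a plain Python list of rows, and only the real, uplo = 'L' case is shown. The returned info value imitates the convention used by the routine: 0 on success, k if the leading minor of order k is not positive definite.

```python
import math


def potf2_lower(a):
    """Unblocked Cholesky factorization a = L * L^T (real, lower variant).

    Hypothetical serial helper for illustration only. Reads only the lower
    triangle of `a`. Returns (L, info), where info == 0 on success and
    info == k (1-based) if the leading minor of order k is not positive
    definite, in which case the factorization stops early.
    """
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Diagonal entry: a(j,j) minus the squares already computed in row j.
        d = a[j][j] - sum(L[j][k] ** 2 for k in range(j))
        if d <= 0.0:
            return L, j + 1
        L[j][j] = math.sqrt(d)
        # Entries below the diagonal in column j.
        for i in range(j + 1, n):
            s = a[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = s / L[j][j]
    return L, 0


L, info = potf2_lower([[4.0, 2.0], [2.0, 2.0]])
print(L, info)
```

For a matrix that is not positive definite, such as [[1, 2], [2, 1]], the sketch stops at the second column and reports info = 2.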
Input Parameters
uplo (global) CHARACTER. Specifies whether the upper or lower triangular part of the symmetric/Hermitian matrix A is stored.
= 'U': upper triangle of sub(A) is stored;
= 'L': lower triangle of sub(A) is stored.
n (global) INTEGER. The number of rows and columns to be operated on, that is, the order of the distributed submatrix sub(A). n ≥ 0.
a (local) REAL for pspotf2, DOUBLE PRECISION for pdpotf2, COMPLEX for pcpotf2, COMPLEX*16 for pzpotf2. Pointer into the local memory to an array of DIMENSION(lld_a, LOCc(ja+n-1)) containing the local pieces of the n-by-n symmetric distributed matrix sub(A) to be factored.
If uplo = 'U', the leading n-by-n upper triangular part of sub(A) contains the upper triangular matrix, and the strictly lower triangular part of this matrix is not referenced.
If uplo = 'L', the leading n-by-n lower triangular part of sub(A) contains the lower triangular matrix, and the strictly upper triangular part of sub(A) is not referenced.
ia, ja (global) INTEGER. The row and column indices in the global array A indicating the first row and the first column of sub(A), respectively.
desca (global and local) INTEGER array, DIMENSION (dlen_). The array descriptor for the distributed matrix A.

Output Parameters
a (local) On exit, if uplo = 'U', the upper triangular part of the distributed matrix contains the Cholesky factor U; if uplo = 'L', the lower triangular part of the distributed matrix contains the Cholesky factor L.
info (local) INTEGER.
= 0: successful exit
< 0: if the i-th argument is an array and the j-th entry had an illegal value, then info = -(i*100+j); if the i-th argument is a scalar and had an illegal value, then info = -i.
> 0: if info = k, the leading minor of order k is not positive definite, and the factorization could not be completed.

p?rscl
Multiplies a vector by the reciprocal of a real scalar.
Syntax
call psrscl(n, sa, sx, ix, jx, descx, incx)
call pdrscl(n, sa, sx, ix, jx, descx, incx)
call pcsrscl(n, sa, sx, ix, jx, descx, incx)
call pzdrscl(n, sa, sx, ix, jx, descx, incx)

Include Files
• C: mkl_scalapack.h

Description
The p?rscl routine multiplies an n-element real/complex vector sub(x) by the real scalar 1/a. This is done without overflow or underflow as long as the final result sub(x)/a does not overflow or underflow.
sub(x) denotes x(ix:ix+n-1, jx:jx) if incx = 1, and x(ix:ix, jx:jx+n-1) if incx = m_x.

Input Parameters
n (global) INTEGER. The number of components of the distributed vector sub(x). n ≥ 0.
sa REAL for psrscl/pcsrscl, DOUBLE PRECISION for pdrscl/pzdrscl. The scalar a that is used to divide each component of the vector x. This parameter must be ≥ 0.
sx REAL for psrscl, DOUBLE PRECISION for pdrscl, COMPLEX for pcsrscl, COMPLEX*16 for pzdrscl. Array containing the local pieces of a distributed matrix of DIMENSION of at least ((jx-1)*m_x + ix + (n-1)*abs(incx)). This array contains the entries of the distributed vector sub(x).
ix (global) INTEGER. The row index of the submatrix of the distributed matrix X to operate on.
jx (global) INTEGER. The column index of the submatrix of the distributed matrix X to operate on.
descx (global and local) INTEGER. Array of DIMENSION 8. The array descriptor for the distributed matrix X.
incx (global) INTEGER. The increment for the elements of X. This version supports only two values of incx, namely 1 and m_x.

Output Parameters
sx On exit, the result x/a.

p?sygs2/p?hegs2
Reduces a symmetric/Hermitian definite generalized eigenproblem to standard form, using the factorization results obtained from p?potrf (local unblocked algorithm).
Syntax
call pssygs2(ibtype, uplo, n, a, ia, ja, desca, b, ib, jb, descb, info)
call pdsygs2(ibtype, uplo, n, a, ia, ja, desca, b, ib, jb, descb, info)
call pchegs2(ibtype, uplo, n, a, ia, ja, desca, b, ib, jb, descb, info)
call pzhegs2(ibtype, uplo, n, a, ia, ja, desca, b, ib, jb, descb, info)

Include Files
• C: mkl_scalapack.h

Description
The p?sygs2/p?hegs2 routine reduces a real symmetric-definite or a complex Hermitian-definite generalized eigenproblem to standard form.
Here sub(A) denotes A(ia:ia+n-1, ja:ja+n-1), and sub(B) denotes B(ib:ib+n-1, jb:jb+n-1).
If ibtype = 1, the problem is
sub(A)*x = λ*sub(B)*x
and sub(A) is overwritten by
inv(U^T)*sub(A)*inv(U) or inv(L)*sub(A)*inv(L^T) for real flavors, and
inv(U^H)*sub(A)*inv(U) or inv(L)*sub(A)*inv(L^H) for complex flavors.
If ibtype = 2 or 3, the problem is
sub(A)*sub(B)*x = λ*x or sub(B)*sub(A)*x = λ*x
and sub(A) is overwritten by
U*sub(A)*U^T or L^T*sub(A)*L for real flavors, and
U*sub(A)*U^H or L^H*sub(A)*L for complex flavors.
The matrix sub(B) must have been previously factorized as U^T*U or L*L^T (for real flavors), or as U^H*U or L*L^H (for complex flavors) by p?potrf.

Input Parameters
ibtype (global) INTEGER.
= 1: compute inv(U^T)*sub(A)*inv(U) or inv(L)*sub(A)*inv(L^T) for real subroutines, and inv(U^H)*sub(A)*inv(U) or inv(L)*sub(A)*inv(L^H) for complex subroutines;
= 2 or 3: compute U*sub(A)*U^T or L^T*sub(A)*L for real subroutines, and U*sub(A)*U^H or L^H*sub(A)*L for complex subroutines.
uplo (global) CHARACTER. Specifies whether the upper or lower triangular part of the symmetric/Hermitian matrix sub(A) is stored, and how sub(B) is factorized.
= 'U': the upper triangular part of sub(A) is stored and sub(B) is factorized as U^T*U (for real subroutines) or as U^H*U (for complex subroutines).
= 'L': the lower triangular part of sub(A) is stored and sub(B) is factorized as L*L^T (for real subroutines) or as L*L^H (for complex subroutines).
n (global) INTEGER.
The order of the matrices sub(A) and sub(B). n ≥ 0.
a (local) REAL for pssygs2, DOUBLE PRECISION for pdsygs2, COMPLEX for pchegs2, COMPLEX*16 for pzhegs2. Pointer into the local memory to an array, DIMENSION(lld_a, LOCc(ja+n-1)). On entry, this array contains the local pieces of the n-by-n symmetric/Hermitian distributed matrix sub(A).
If uplo = 'U', the leading n-by-n upper triangular part of sub(A) contains the upper triangular part of the matrix, and the strictly lower triangular part of sub(A) is not referenced.
If uplo = 'L', the leading n-by-n lower triangular part of sub(A) contains the lower triangular part of the matrix, and the strictly upper triangular part of sub(A) is not referenced.
ia, ja (global) INTEGER. The row and column indices in the global array A indicating the first row and the first column of sub(A), respectively.
desca (global and local) INTEGER array, DIMENSION (dlen_). The array descriptor for the distributed matrix A.
b (local) REAL for pssygs2, DOUBLE PRECISION for pdsygs2, COMPLEX for pchegs2, COMPLEX*16 for pzhegs2. Pointer into the local memory to an array, DIMENSION(lld_b, LOCc(jb+n-1)). On entry, this array contains the local pieces of the triangular factor from the Cholesky factorization of sub(B) as returned by p?potrf.
ib, jb (global) INTEGER. The row and column indices in the global array B indicating the first row and the first column of sub(B), respectively.
descb (global and local) INTEGER array, DIMENSION (dlen_). The array descriptor for the distributed matrix B.

Output Parameters
a (local) On exit, if info = 0, the transformed matrix is stored in the same format as sub(A).
info INTEGER.
= 0: successful exit.
< 0: if the i-th argument is an array and the j-th entry had an illegal value, then info = -(i*100+j); if the i-th argument is a scalar and had an illegal value, then info = -i.
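The ibtype = 1, uplo = 'L' case of this reduction can be sketched serially as two sweeps of triangular solves. The code below is only an illustration of the formula inv(L)*sub(A)*inv(L^T) for real data; it is not the distributed routine, and the helper names forward_solve and reduce_to_standard are hypothetical. The factor L is assumed to be already available, as it would be after a Cholesky factorization of B.

```python
def forward_solve(L, b):
    """Solve L * y = b for lower triangular L by forward substitution."""
    n = len(L)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    return y


def reduce_to_standard(A, L):
    """Return C = inv(L) * A * inv(L^T) for symmetric A, lower triangular L.

    Hypothetical serial helper: with B = L * L^T, the generalized problem
    A*x = lambda*B*x has the same eigenvalues as C*y = lambda*y, y = L^T*x.
    """
    n = len(A)
    # Step 1: W = inv(L) * A, obtained by solving with each column of A.
    cols = [forward_solve(L, [A[i][j] for i in range(n)]) for j in range(n)]
    W = [[cols[j][i] for j in range(n)] for i in range(n)]
    # Step 2: C = W * inv(L^T). Row j of C solves L * c = (row j of W).
    return [forward_solve(L, W[j]) for j in range(n)]


A = [[4.0, 2.0], [2.0, 5.0]]
L = [[2.0, 0.0], [1.0, 1.0]]   # Cholesky factor of B = [[4, 2], [2, 2]]
C = reduce_to_standard(A, L)
print(C)
```

In this small example C comes out diagonal, so its diagonal entries 1 and 4 are the eigenvalues of the original generalized problem; they are also the roots of det(A - λB) = 4(1 - λ)(4 - λ).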
p?sytd2/p?hetd2
Reduces a symmetric/Hermitian matrix to real symmetric tridiagonal form by an orthogonal/unitary similarity transformation (local unblocked algorithm).

Syntax
call pssytd2(uplo, n, a, ia, ja, desca, d, e, tau, work, lwork, info)
call pdsytd2(uplo, n, a, ia, ja, desca, d, e, tau, work, lwork, info)
call pchetd2(uplo, n, a, ia, ja, desca, d, e, tau, work, lwork, info)
call pzhetd2(uplo, n, a, ia, ja, desca, d, e, tau, work, lwork, info)

Include Files
• C: mkl_scalapack.h

Description
The p?sytd2/p?hetd2 routine reduces a real symmetric/complex Hermitian matrix sub(A) to symmetric/Hermitian tridiagonal form T by an orthogonal/unitary similarity transformation:
Q'*sub(A)*Q = T,
where sub(A) = A(ia:ia+n-1, ja:ja+n-1).

Input Parameters
uplo (global) CHARACTER. Specifies whether the upper or lower triangular part of the symmetric/Hermitian matrix sub(A) is stored:
= 'U': upper triangular
= 'L': lower triangular
n (global) INTEGER. The number of rows and columns to be operated on, that is, the order of the distributed submatrix sub(A). n ≥ 0.
a (local) REAL for pssytd2, DOUBLE PRECISION for pdsytd2, COMPLEX for pchetd2, COMPLEX*16 for pzhetd2. Pointer into the local memory to an array, DIMENSION(lld_a, LOCc(ja+n-1)). On entry, this array contains the local pieces of the n-by-n symmetric/Hermitian distributed matrix sub(A).
If uplo = 'U', the leading n-by-n upper triangular part of sub(A) contains the upper triangular part of the matrix, and the strictly lower triangular part of sub(A) is not referenced.
If uplo = 'L', the leading n-by-n lower triangular part of sub(A) contains the lower triangular part of the matrix, and the strictly upper triangular part of sub(A) is not referenced.
ia, ja (global) INTEGER. The row and column indices in the global array A indicating the first row and the first column of sub(A), respectively.
desca (global and local) INTEGER array, DIMENSION (dlen_).
The array descriptor for the distributed matrix A.
work (local) REAL for pssytd2, DOUBLE PRECISION for pdsytd2, COMPLEX for pchetd2, COMPLEX*16 for pzhetd2. The array work is a temporary workspace array of DIMENSION lwork.

Output Parameters
a On exit, if uplo = 'U', the diagonal and first superdiagonal of sub(A) are overwritten by the corresponding elements of the tridiagonal matrix T, and the elements above the first superdiagonal, with the array tau, represent the orthogonal/unitary matrix Q as a product of elementary reflectors; if uplo = 'L', the diagonal and first subdiagonal of A are overwritten by the corresponding elements of the tridiagonal matrix T, and the elements below the first subdiagonal, with the array tau, represent the orthogonal/unitary matrix Q as a product of elementary reflectors. See the Application Notes below.
d (local) REAL for pssytd2/pchetd2, DOUBLE PRECISION for pdsytd2/pzhetd2. Array, DIMENSION(LOCc(ja+n-1)). The diagonal elements of the tridiagonal matrix T: d(i) = a(i,i); d is tied to the distributed matrix A.
e (local) REAL for pssytd2/pchetd2, DOUBLE PRECISION for pdsytd2/pzhetd2. Array, DIMENSION(LOCc(ja+n-1)) if uplo = 'U', LOCc(ja+n-2) otherwise. The off-diagonal elements of the tridiagonal matrix T: e(i) = a(i,i+1) if uplo = 'U', e(i) = a(i+1,i) if uplo = 'L'. e is tied to the distributed matrix A.
tau (local) REAL for pssytd2, DOUBLE PRECISION for pdsytd2, COMPLEX for pchetd2, COMPLEX*16 for pzhetd2. Array, DIMENSION(LOCc(ja+n-1)). The scalar factors of the elementary reflectors. tau is tied to the distributed matrix A.
work(1) On exit, work(1) returns the minimal and optimal value of lwork.
lwork (local or global) INTEGER. The dimension of the workspace array work. lwork is local input and must be at least lwork ≥ 3n. If lwork = -1, then lwork is global input and a workspace query is assumed; the routine only calculates the minimum and optimal size for all work arrays.
Each of these values is returned in the first entry of the corresponding work array, and no error message is issued by pxerbla.
info (local) INTEGER.
= 0: successful exit
< 0: if the i-th argument is an array and the j-th entry had an illegal value, then info = -(i*100+j); if the i-th argument is a scalar and had an illegal value, then info = -i.

Application Notes
If uplo = 'U', the matrix Q is represented as a product of elementary reflectors
Q = H(n-1)*...*H(2)*H(1).
Each H(i) has the form
H(i) = I - tau*v*v',
where tau is a real/complex scalar, and v is a real/complex vector with v(i+1:n) = 0 and v(i) = 1; v(1:i-1) is stored on exit in A(ia:ia+i-2, ja+i), and tau in TAU(ja+i-1).
If uplo = 'L', the matrix Q is represented as a product of elementary reflectors
Q = H(1)*H(2)*...*H(n-1).
Each H(i) has the form
H(i) = I - tau*v*v',
where tau is a real/complex scalar, and v is a real/complex vector with v(1:i) = 0 and v(i+1) = 1; v(i+2:n) is stored on exit in A(ia+i+1:ia+n-1, ja+i-1), and tau in TAU(ja+i-1).
The contents of sub(A) on exit are illustrated by the following examples with n = 5:

if uplo = 'U':
  d   e   v2  v3  v4
      d   e   v3  v4
          d   e   v4
              d   e
                  d

if uplo = 'L':
  d
  e   d
  v1  e   d
  v1  v2  e   d
  v1  v2  v3  e   d

where d and e denote diagonal and off-diagonal elements of T, and vi denotes an element of the vector defining H(i).

NOTE
The distributed submatrix sub(A) must satisfy some alignment properties, namely the following expression should be true:
( mb_a.eq.nb_a .AND. iroffa.eq.icoffa ) with iroffa = mod(ia - 1, mb_a) and icoffa = mod(ja - 1, nb_a).

p?trti2
Computes the inverse of a triangular matrix (local unblocked algorithm).
Syntax
call pstrti2(uplo, diag, n, a, ia, ja, desca, info)
call pdtrti2(uplo, diag, n, a, ia, ja, desca, info)
call pctrti2(uplo, diag, n, a, ia, ja, desca, info)
call pztrti2(uplo, diag, n, a, ia, ja, desca, info)

Include Files
• C: mkl_scalapack.h

Description
The p?trti2 routine computes the inverse of a real/complex upper or lower triangular block matrix sub(A) = A(ia:ia+n-1, ja:ja+n-1).
This matrix should be contained in one and only one process memory space (local operation).

Input Parameters
uplo (global) CHARACTER*1. Specifies whether the matrix sub(A) is upper or lower triangular.
= 'U': sub(A) is upper triangular
= 'L': sub(A) is lower triangular.
diag (global) CHARACTER*1. Specifies whether or not the matrix A is unit triangular.
= 'N': sub(A) is non-unit triangular
= 'U': sub(A) is unit triangular.
n (global) INTEGER. The number of rows and columns to be operated on, that is, the order of the distributed submatrix sub(A). n ≥ 0.
a (local) REAL for pstrti2, DOUBLE PRECISION for pdtrti2, COMPLEX for pctrti2, COMPLEX*16 for pztrti2. Pointer into the local memory to an array, DIMENSION(lld_a, LOCc(ja+n-1)). On entry, this array contains the local pieces of the triangular matrix sub(A).
If uplo = 'U', the leading n-by-n upper triangular part of the matrix sub(A) contains the upper triangular part of the matrix, and the strictly lower triangular part of sub(A) is not referenced.
If uplo = 'L', the leading n-by-n lower triangular part of the matrix sub(A) contains the lower triangular part of the matrix, and the strictly upper triangular part of sub(A) is not referenced.
If diag = 'U', the diagonal elements of sub(A) are not referenced either and are assumed to be 1.
ia, ja (global) INTEGER. The row and column indices in the global array A indicating the first row and the first column of sub(A), respectively.
desca (global and local) INTEGER array, DIMENSION (dlen_).
The array descriptor for the distributed matrix A.

Output Parameters
a On exit, the (triangular) inverse of the original matrix, in the same storage format.
info INTEGER.
= 0: successful exit
< 0: if the i-th argument is an array and the j-th entry had an illegal value, then info = -(i*100+j); if the i-th argument is a scalar and had an illegal value, then info = -i.

?lamsh
Sends multiple shifts through a small (single node) matrix to maximize the number of bulges that can be sent through.

Syntax
call slamsh(s, lds, nbulge, jblk, h, ldh, n, ulp)
call dlamsh(s, lds, nbulge, jblk, h, ldh, n, ulp)

Include Files
• C: mkl_scalapack.h

Description
The ?lamsh routine sends multiple shifts through a small (single node) matrix to see how small consecutive subdiagonal elements are modified by subsequent shifts, in an effort to maximize the number of bulges that can be sent through. The subroutine should only be called when there are multiple shifts/bulges (nbulge > 1) and the first shift is starting in the middle of an unreduced Hessenberg matrix because of two or more small consecutive subdiagonal elements.

Input Parameters
s (local) REAL for slamsh, DOUBLE PRECISION for dlamsh. Array, DIMENSION (lds,*). On entry, the matrix of shifts. Only the 2x2 diagonal of s is referenced. It is assumed that s has jblk double shifts (size 2).
lds (local) INTEGER. On entry, the leading dimension of s; unchanged on exit.
11). nbulge should be less than the maximum determined (jblk). 1 0: if info = i, then i eigenvectors failed to converge in maxits iterations. Their indices are stored in the array ifail.

?dbtf2
Computes an LU factorization of a general band matrix with no pivoting (local unblocked algorithm).
Syntax
call sdbtf2(m, n, kl, ku, ab, ldab, info)
call ddbtf2(m, n, kl, ku, ab, ldab, info)
call cdbtf2(m, n, kl, ku, ab, ldab, info)
call zdbtf2(m, n, kl, ku, ab, ldab, info)

Include Files
• C: mkl_scalapack.h

Description
The ?dbtf2 routine computes an LU factorization of a general real/complex m-by-n band matrix A without using partial pivoting with row interchanges.
This is the unblocked version of the algorithm, calling BLAS Routines and Functions.

Input Parameters
m INTEGER. The number of rows of the matrix A (m ≥ 0).
n INTEGER. The number of columns in A (n ≥ 0).
kl INTEGER. The number of sub-diagonals within the band of A (kl ≥ 0).
ku INTEGER. The number of super-diagonals within the band of A (ku ≥ 0).
ab REAL for sdbtf2, DOUBLE PRECISION for ddbtf2, COMPLEX for cdbtf2, COMPLEX*16 for zdbtf2. Array, DIMENSION (ldab, n). The matrix A in band storage, in rows kl+1 to 2kl+ku+1; rows 1 to kl of the array need not be set. The j-th column of A is stored in the j-th column of the array ab as follows:
ab(kl+ku+1+i-j, j) = A(i,j) for max(1, j-ku) ≤ i ≤ min(m, j+kl).
ldab INTEGER. The leading dimension of the array ab (ldab ≥ 2kl + ku + 1).

Output Parameters
ab On exit, details of the factorization: U is stored as an upper triangular band matrix with kl+ku superdiagonals in rows 1 to kl+ku+1, and the multipliers used during the factorization are stored in rows kl+ku+2 to 2*kl+ku+1. See the Application Notes below for further details.
info INTEGER.
= 0: successful exit
< 0: if info = -i, the i-th argument had an illegal value,
> 0: if info = +i, u(i,i) is 0. The factorization has been completed, but the factor U is exactly singular. Division by 0 will occur if you use the factor U for solving a system of linear equations.
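The index mapping ab(kl+ku+1+i-j, j) = A(i,j) can be made concrete with a short sketch that packs a dense matrix into the band layout described for the ab argument. This is an illustration only: pack_band is a hypothetical helper, it does not call Intel MKL, and it uses 0-based Python indices, so the row index becomes kl+ku+i-j. The array is allocated with 2*kl+ku+1 rows, matching the minimum ldab stated above, and the first kl rows are left as zeros because they need not be set on entry.

```python
def pack_band(A, kl, ku):
    """Pack a dense m-by-n matrix into band storage with 2*kl+ku+1 rows.

    Hypothetical helper for illustration. In 1-based terms the mapping is
    ab(kl+ku+1+i-j, j) = A(i, j) for max(1, j-ku) <= i <= min(m, j+kl);
    with 0-based indices the target row is kl + ku + i - j.
    """
    m, n = len(A), len(A[0])
    ldab = 2 * kl + ku + 1
    ab = [[0.0] * n for _ in range(ldab)]
    for j in range(n):
        for i in range(max(0, j - ku), min(m - 1, j + kl) + 1):
            ab[kl + ku + i - j][j] = A[i][j]
    return ab


# A 4-by-4 tridiagonal matrix: kl = 1 sub-diagonal, ku = 1 super-diagonal.
A = [
    [1.0, 2.0, 0.0, 0.0],
    [3.0, 4.0, 5.0, 0.0],
    [0.0, 6.0, 7.0, 8.0],
    [0.0, 0.0, 9.0, 10.0],
]
for row in pack_band(A, kl=1, ku=1):
    print(row)
```

With kl = ku = 1 the packed array has four rows: an unused first row, then the super-diagonal, the main diagonal, and the sub-diagonal, each aligned by column.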
Application Notes
The band storage scheme is illustrated by the following example, when m = n = 6, kl = 2, ku = 1:
The routine does not use array elements marked *; elements marked + need not be set on entry, but the routine requires them to store elements of U, because of fill-in resulting from the row interchanges.

?dbtrf
Computes an LU factorization of a general band matrix with no pivoting (local blocked algorithm).

Syntax
call sdbtrf(m, n, kl, ku, ab, ldab, info)
call ddbtrf(m, n, kl, ku, ab, ldab, info)
call cdbtrf(m, n, kl, ku, ab, ldab, info)
call zdbtrf(m, n, kl, ku, ab, ldab, info)

Include Files
• C: mkl_scalapack.h

Description
This routine computes an LU factorization of a real m-by-n band matrix A without using partial pivoting or row interchanges.
This is the blocked version of the algorithm, calling BLAS Routines and Functions.

Input Parameters
m INTEGER. The number of rows of the matrix A (m ≥ 0).
n INTEGER. The number of columns in A (n ≥ 0).
kl INTEGER. The number of sub-diagonals within the band of A (kl ≥ 0).
ku INTEGER. The number of super-diagonals within the band of A (ku ≥ 0).
ab REAL for sdbtrf, DOUBLE PRECISION for ddbtrf, COMPLEX for cdbtrf, COMPLEX*16 for zdbtrf. Array, DIMENSION (ldab, n). The matrix A in band storage, in rows kl+1 to 2kl+ku+1; rows 1 to kl of the array need not be set. The j-th column of A is stored in the j-th column of the array ab as follows:
ab(kl+ku+1+i-j, j) = A(i,j) for max(1, j-ku) ≤ i ≤ min(m, j+kl).
ldab INTEGER. The leading dimension of the array ab (ldab ≥ 2kl + ku + 1).

Output Parameters
ab On exit, details of the factorization: U is stored as an upper triangular band matrix with kl+ku superdiagonals in rows 1 to kl+ku+1, and the multipliers used during the factorization are stored in rows kl+ku+2 to 2*kl+ku+1. See the Application Notes below for further details.
info INTEGER.
= 0: successful exit
< 0: if info = -i, the i-th argument had an illegal value,
> 0: if info = +i, u(i,i) is 0. The factorization has been completed, but the factor U is exactly singular. Division by 0 will occur if you use the factor U for solving a system of linear equations.

Application Notes
The band storage scheme is illustrated by the following example, when m = n = 6, kl = 2, ku = 1:
The routine does not use array elements marked *.

?dttrf
Computes an LU factorization of a general tridiagonal matrix with no pivoting (local blocked algorithm).

Syntax
call sdttrf(n, dl, d, du, info)
call ddttrf(n, dl, d, du, info)
call cdttrf(n, dl, d, du, info)
call zdttrf(n, dl, d, du, info)

Include Files
• C: mkl_scalapack.h

Description
The ?dttrf routine computes an LU factorization of a real or complex tridiagonal matrix A using elimination without partial pivoting.
The factorization has the form A = L*U, where L is a product of unit lower bidiagonal matrices and U is upper triangular with nonzeros only in the main diagonal and first superdiagonal.

Input Parameters
n INTEGER. The order of the matrix A (n ≥ 0).
dl, d, du REAL for sdttrf, DOUBLE PRECISION for ddttrf, COMPLEX for cdttrf, COMPLEX*16 for zdttrf. Arrays containing elements of A.
The array dl of DIMENSION(n - 1) contains the sub-diagonal elements of A.
The array d of DIMENSION n contains the diagonal elements of A.
The array du of DIMENSION(n - 1) contains the super-diagonal elements of A.

Output Parameters
dl Overwritten by the (n-1) multipliers that define the matrix L from the LU factorization of A.
d Overwritten by the n diagonal elements of the upper triangular matrix U from the LU factorization of A.
du Overwritten by the (n-1) elements of the first super-diagonal of U.
info INTEGER.
= 0: successful exit
< 0: if info = -i, the i-th argument had an illegal value,
> 0: if info = i, u(i,i) is exactly 0.
The factorization has been completed, but the factor U is exactly singular. Division by 0 will occur if you use the factor U for solving a system of linear equations.

?dttrsv
Solves a general tridiagonal system of linear equations using the LU factorization computed by ?dttrf.

Syntax
call sdttrsv(uplo, trans, n, nrhs, dl, d, du, b, ldb, info)
call ddttrsv(uplo, trans, n, nrhs, dl, d, du, b, ldb, info)
call cdttrsv(uplo, trans, n, nrhs, dl, d, du, b, ldb, info)
call zdttrsv(uplo, trans, n, nrhs, dl, d, du, b, ldb, info)

Include Files
• C: mkl_scalapack.h

Description
The ?dttrsv routine solves one of the following systems of linear equations:
L*X = B, L^T*X = B, or L^H*X = B,
U*X = B, U^T*X = B, or U^H*X = B
with factors of the tridiagonal matrix A from the LU factorization computed by ?dttrf.

Input Parameters
uplo CHARACTER*1. Specifies whether to solve with L or U.
trans CHARACTER. Must be 'N' or 'T' or 'C'. Indicates the form of the equations:
If trans = 'N', then A*X = B is solved for X (no transpose).
If trans = 'T', then A^T*X = B is solved for X (transpose).
If trans = 'C', then A^H*X = B is solved for X (conjugate transpose).
n INTEGER. The order of the matrix A (n ≥ 0).
nrhs INTEGER. The number of right-hand sides, that is, the number of columns in the matrix B (nrhs ≥ 0).
dl, d, du, b REAL for sdttrsv, DOUBLE PRECISION for ddttrsv, COMPLEX for cdttrsv, COMPLEX*16 for zdttrsv. Arrays of DIMENSIONs: dl(n-1), d(n), du(n-1), b(ldb, nrhs).
The array dl contains the (n - 1) multipliers that define the matrix L from the LU factorization of A.
The array d contains n diagonal elements of the upper triangular matrix U from the LU factorization of A.
The array du contains the (n - 1) elements of the first super-diagonal of U.
On entry, the array b contains the right-hand side matrix B.
ldb INTEGER. The leading dimension of the array b; ldb ≥ max(1, n).

Output Parameters
b Overwritten by the solution matrix X.
info INTEGER.
If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.

?pttrsv
Solves a symmetric (Hermitian) positive-definite tridiagonal system of linear equations, using the L*D*L^H factorization computed by ?pttrf.

Syntax
call spttrsv(trans, n, nrhs, d, e, b, ldb, info)
call dpttrsv(trans, n, nrhs, d, e, b, ldb, info)
call cpttrsv(uplo, trans, n, nrhs, d, e, b, ldb, info)
call zpttrsv(uplo, trans, n, nrhs, d, e, b, ldb, info)

Include Files
• C: mkl_scalapack.h

Description
The ?pttrsv routine solves one of the triangular systems:
L^T*X = B, or L*X = B for real flavors,
or
L*X = B, or L^H*X = B,
U*X = B, or U^H*X = B for complex flavors,
where L (or U for complex flavors) is the Cholesky factor of a Hermitian positive-definite tridiagonal matrix A such that
A = L*D*L^H (computed by spttrf/dpttrf)
or
A = U^H*D*U or A = L*D*L^H (computed by cpttrf/zpttrf).

Input Parameters
uplo CHARACTER*1. Must be 'U' or 'L'. Specifies whether the superdiagonal or the subdiagonal of the tridiagonal matrix A is stored and the form of the factorization:
If uplo = 'U', e is the superdiagonal of U, and A = U^H*D*U or A = L*D*L^H;
if uplo = 'L', e is the subdiagonal of L, and A = L*D*L^H.
The two forms are equivalent if A is real.
trans CHARACTER. Specifies the form of the system of equations:
for real flavors:
if trans = 'N': L*X = B (no transpose)
if trans = 'T': L^T*X = B (transpose)
for complex flavors:
if trans = 'N': U*X = B or L*X = B (no transpose)
if trans = 'C': U^H*X = B or L^H*X = B (conjugate transpose).
n INTEGER. The order of the tridiagonal matrix A. n ≥ 0.
nrhs INTEGER. The number of right-hand sides, that is, the number of columns of the matrix B. nrhs ≥ 0.
d REAL array, DIMENSION (n). The n diagonal elements of the diagonal matrix D from the factorization computed by ?pttrf.
e COMPLEX array, DIMENSION(n-1).
The (n-1) off-diagonal elements of the unit bidiagonal factor U or L from the factorization computed by ?pttrf. See uplo.
b COMPLEX array, DIMENSION (ldb, nrhs). On entry, the right-hand side matrix B.
ldb INTEGER. The leading dimension of the array b. ldb ≥ max(1, n).

Output Parameters
b On exit, the solution matrix X.
info INTEGER.
= 0: successful exit
< 0: if info = -i, the i-th argument had an illegal value.

?steqr2
Computes all eigenvalues and, optionally, eigenvectors of a symmetric tridiagonal matrix using the implicit QL or QR method.

Syntax
call ssteqr2(compz, n, d, e, z, ldz, nr, work, info)
call dsteqr2(compz, n, d, e, z, ldz, nr, work, info)

Include Files
• C: mkl_scalapack.h

Description
The ?steqr2 routine is a modified version of the LAPACK routine ?steqr. It computes all eigenvalues and, optionally, eigenvectors of a symmetric tridiagonal matrix using the implicit QL or QR method. ?steqr2 is modified from ?steqr to allow each ScaLAPACK process running ?steqr2 to perform updates on a distributed matrix Q. Proper usage of ?steqr2 can be gleaned from examination of the ScaLAPACK routine p?syev.

Input Parameters
compz CHARACTER*1. Must be 'N' or 'I'.
If compz = 'N', the routine computes eigenvalues only.
If compz = 'I', the routine computes the eigenvalues and eigenvectors of the tridiagonal matrix T. z must be initialized to the identity matrix by p?laset or ?laset prior to entering this subroutine.
n INTEGER. The order of the matrix T (n ≥ 0).
d, e, work REAL for single-precision flavors, DOUBLE PRECISION for double-precision flavors. Arrays:
d contains the diagonal elements of T. The dimension of d must be at least max(1, n).
e contains the (n-1) subdiagonal elements of T. The dimension of e must be at least max(1, n-1).
work is a workspace array. The dimension of work is max(1, 2*n-2). If compz = 'N', then work is not referenced.
z (local) REAL for ssteqr2
DOUBLE PRECISION for dsteqr2
Array, global DIMENSION (n, n), local DIMENSION (ldz, nr). If compz = 'V', then z contains the orthogonal matrix used in the reduction to tridiagonal form.

ldz INTEGER. The leading dimension of the array z. Constraints: ldz >= 1, and ldz >= max(1, n) if eigenvectors are desired.

nr INTEGER. nr = max(1, numroc(n, nb, myprow, 0, nprocs)). If compz = 'N', then nr is not referenced.

Output Parameters
d REAL array, DIMENSION (n), for ssteqr2.
DOUBLE PRECISION array, DIMENSION (n), for dsteqr2.
On exit, the eigenvalues in ascending order, if info = 0. See also info.

e REAL array, DIMENSION (n-1), for ssteqr2.
DOUBLE PRECISION array, DIMENSION (n-1), for dsteqr2.
On exit, e has been destroyed.

z (local) REAL for ssteqr2
DOUBLE PRECISION for dsteqr2
Array, global DIMENSION (n, n), local DIMENSION (ldz, nr). On exit, if info = 0, then, if compz = 'V', z contains the orthonormal eigenvectors of the original symmetric matrix, and if compz = 'I', z contains the orthonormal eigenvectors of the symmetric tridiagonal matrix. If compz = 'N', then z is not referenced.

info INTEGER.
info = 0: the exit is successful.
info < 0: if info = -i, the i-th argument had an illegal value.
info > 0: the algorithm has failed to find all the eigenvalues in a total of 30*n iterations; if info = i, then i elements of e have not converged to zero; on exit, d and e contain the elements of a symmetric tridiagonal matrix that is orthogonally similar to the original matrix.

Utility Functions and Routines
This section describes ScaLAPACK utility functions and routines. Summary information about these routines is given in the following table:

ScaLAPACK Utility Functions and Routines
Routine Name    Data Types    Description
p?labad         s,d           Returns the square root of the underflow and overflow thresholds if the exponent-range is very large.
p?lachkieee     s,d           Performs a simple check for the features of the IEEE standard.
(C interface function).
p?lamch         s,d           Determines machine parameters for floating-point arithmetic.
p?lasnbt        s,d           Computes the position of the sign bit of a floating-point number. (C interface function).
pxerbla                       Error handling routine called by ScaLAPACK routines.

p?labad
Returns the square root of the underflow and overflow thresholds if the exponent-range is very large.

Syntax
call pslabad(ictxt, small, large)
call pdlabad(ictxt, small, large)

Include Files
• C: mkl_scalapack.h

Description
The p?labad routine takes as input the values computed by p?lamch for underflow and overflow, and returns the square root of each of these values if the log of large is sufficiently large. This subroutine is intended to identify machines with a large exponent range, such as the Crays, and redefine the underflow and overflow limits to be the square roots of the values computed by p?lamch. This subroutine is needed because p?lamch does not compensate for poor arithmetic in the upper half of the exponent range, as is found on a Cray.
In addition, this routine performs a global minimization and maximization on these values, to support heterogeneous computing networks.

Input Parameters
ictxt (global) INTEGER. The BLACS context handle in which the computation takes place.

small (local). REAL for pslabad.
DOUBLE PRECISION for pdlabad.
On entry, the underflow threshold as computed by p?lamch.

large (local). REAL for pslabad.
DOUBLE PRECISION for pdlabad.
On entry, the overflow threshold as computed by p?lamch.

Output Parameters
small (local). On exit, if log10(large) is sufficiently large, the square root of small, otherwise unchanged.

large (local). On exit, if log10(large) is sufficiently large, the square root of large, otherwise unchanged.

p?lachkieee
Performs a simple check for the features of the IEEE standard. (C interface function).
Syntax
void pslachkieee(int *isieee, float *rmax, float *rmin);
void pdlachkieee(int *isieee, float *rmax, float *rmin);

Include Files
• C: mkl_scalapack.h

Description
The p?lachkieee routine performs a simple check to make sure that the features of the IEEE standard are implemented. In some implementations, p?lachkieee may not return.
Note that all arguments are call-by-reference so that this routine can be directly called from Fortran code.
This is a ScaLAPACK internal subroutine and arguments are not checked for unreasonable values.

Input Parameters
rmax (local). REAL for pslachkieee
DOUBLE PRECISION for pdlachkieee
The overflow threshold (= ?lamch('O')).

rmin (local). REAL for pslachkieee
DOUBLE PRECISION for pdlachkieee
The underflow threshold (= ?lamch('U')).

Output Parameters
isieee (local). INTEGER.
On exit, isieee = 1 implies that all the features of the IEEE standard that we rely on are implemented.
On exit, isieee = 0 implies that some of the features of the IEEE standard that we rely on are missing.

p?lamch
Determines machine parameters for floating-point arithmetic.

Syntax
val = pslamch(ictxt, cmach)
val = pdlamch(ictxt, cmach)

Include Files
• C: mkl_scalapack.h

Description
The p?lamch routine determines machine parameters for single precision (pslamch) or double precision (pdlamch) floating-point arithmetic.

Input Parameters
ictxt (global). INTEGER. The BLACS context handle in which the computation takes place.

cmach (global) CHARACTER*1.
Specifies the value to be returned by p?lamch:
= 'E' or 'e', p?lamch := eps
= 'S' or 's', p?lamch := sfmin
= 'B' or 'b', p?lamch := base
= 'P' or 'p', p?lamch := eps*base
= 'N' or 'n', p?lamch := t
= 'R' or 'r', p?lamch := rnd
= 'M' or 'm', p?lamch := emin
= 'U' or 'u', p?lamch := rmin
= 'L' or 'l', p?lamch := emax
= 'O' or 'o', p?lamch := rmax,
where
eps = relative machine precision
sfmin = safe minimum, such that 1/sfmin does not overflow
base = base of the machine
prec = eps*base
t = number of (base) digits in the mantissa
rnd = 1.0 when rounding occurs in addition, 0.0 otherwise
emin = minimum exponent before (gradual) underflow
rmin = underflow threshold, base**(emin-1)
emax = largest exponent before overflow
rmax = overflow threshold, (base**emax)*(1-eps)

Output Parameters
val Value returned by the routine.

p?lasnbt
Computes the position of the sign bit of a floating-point number. (C interface function).

Syntax
void pslasnbt(int *ieflag);
void pdlasnbt(int *ieflag);

Include Files
• C: mkl_scalapack.h

Description
The p?lasnbt routine finds the position of the sign bit of a single/double precision floating-point number. This routine assumes IEEE arithmetic and hence tests only the 32nd bit (for single precision) or the 32nd and 64th bits (for double precision) as a possibility for the sign bit. sizeof(int) is assumed equal to 4 bytes.
If a compile-time flag (NO_IEEE) indicates that the machine does not have IEEE arithmetic, ieflag = 0 is returned.

Output Parameters
ieflag INTEGER.
This flag indicates the position of the sign bit of any single/double precision floating-point number.
ieflag = 0, if the compile-time flag NO_IEEE indicates that the machine does not have IEEE arithmetic, or if sizeof(int) is different from 4 bytes.
ieflag = 1 indicates that the sign bit is the 32nd bit for a single precision routine.
In the case of a double precision routine:
ieflag = 1 indicates that the sign bit is the 32nd bit (Big Endian).
ieflag = 2 indicates that the sign bit is the 64th bit (Little Endian).

pxerbla
Error handling routine called by ScaLAPACK routines.

Syntax
call pxerbla(ictxt, srname, info)

Include Files
• C: mkl_scalapack.h

Description
This routine is an error handler for the ScaLAPACK routines. It is called by a ScaLAPACK routine if an input parameter has an invalid value. A message is printed. Program execution is not terminated. For the ScaLAPACK driver and computational routines, a RETURN statement is issued following the call to pxerbla.
Control returns to the higher-level calling routine, and it is left to the user to determine how the program should proceed. However, in the specialized low-level ScaLAPACK routines (auxiliary routines that are Level 2 equivalents of computational routines), the call to pxerbla() is immediately followed by a call to BLACS_ABORT() to terminate program execution since recovery from an error at this level in the computation is not possible.
It is always good practice to check for a nonzero value of info on return from a ScaLAPACK routine. Installers may consider modifying this routine in order to call system-specific exception-handling facilities.

Input Parameters
ictxt (global) INTEGER
The BLACS context handle, indicating the global context of the operation. The context itself is global.

srname (global) CHARACTER*6
The name of the routine which called pxerbla.

info (global) INTEGER.
The position of the invalid parameter in the parameter list of the calling routine.
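The info convention used throughout this chapter (0 for success, -i for an illegal i-th argument, a positive value for a routine-specific failure) can be sketched as a small helper. This is only an illustration in Python, not part of Intel MKL or ScaLAPACK; the function name describe_info and the message wording are assumptions made for the example.

```python
def describe_info(info: int, routine: str = "routine") -> str:
    """Translate an info return code into a readable message.

    Hypothetical helper: it mirrors the convention stated in the routine
    descriptions (info = 0, info < 0, info > 0) and nothing more.
    """
    if info == 0:
        return f"{routine}: successful exit"
    if info < 0:
        # info = -i means the i-th argument had an illegal value.
        return f"{routine}: argument {-info} had an illegal value"
    # The meaning of a positive value depends on the routine; see its description.
    return f"{routine}: routine-specific failure, info = {info}"
```

A caller would apply such a check immediately after each routine returns, since the driver and computational routines return control instead of terminating the program.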
Sparse Solver Routines

Intel® Math Kernel Library (Intel® MKL) provides user-callable sparse solver software to solve real or complex, symmetric, structurally symmetric or non-symmetric, positive definite, indefinite or Hermitian sparse linear systems of equations.
The terms and concepts required to understand the use of the Intel MKL sparse solver routines are discussed in Appendix A "Linear Solvers Basics". If you are familiar with linear sparse solvers and sparse matrix storage schemes, you can skip these sections and go directly to the interface descriptions.
This chapter describes the direct sparse solver PARDISO* and the alternative interface for the direct sparse solver, referred to here as the DSS interface; iterative sparse solvers (ISS) based on the reverse communication interface (RCI); and two preconditioners based on the incomplete LU factorization technique.

PARDISO* - Parallel Direct Sparse Solver Interface
This section describes the interface to the shared-memory multiprocessing parallel direct sparse solver known as the PARDISO* solver. The interface is Fortran, but it can be called from C programs by observing the Fortran parameter passing and naming conventions used by the supported compilers and operating systems. A discussion of the algorithms used in the PARDISO* software and more information on the solver can be found at http://www.pardiso-project.org.
The current implementation of the PARDISO solver additionally supports an out-of-core (OOC) version.
The PARDISO package is a high-performance, robust, memory-efficient, and easy-to-use software package for solving large sparse symmetric and unsymmetric linear systems of equations on shared-memory multiprocessors. The solver uses a combination of left- and right-looking Level-3 BLAS supernode techniques [Schenk00-2].
To improve sequential and parallel sparse numerical factorization performance, the algorithms are based on a Level-3 BLAS update, and pipelining parallelism is used with a combination of left- and right-looking supernode techniques [Schenk00, Schenk01, Schenk02, Schenk03]. The parallel pivoting methods allow complete supernode pivoting to balance numerical stability and scalability during the factorization process. For sufficiently large problem sizes, numerical experiments demonstrate that the scalability of the parallel algorithm is nearly independent of the shared-memory multiprocessing architecture.
The following table lists the names of the PARDISO routines and describes their general use.

PARDISO Routines
Routine           Description
pardiso           Calculates the solution of a set of sparse linear equations with multiple right-hand sides.
pardisoinit       Initializes PARDISO with default parameters depending on the matrix type.
pardiso_64        Calculates the solution of a set of sparse linear equations with multiple right-hand sides, 64-bit integer version.
pardiso_getenv    Retrieves additional values from the PARDISO handle.
pardiso_setenv    Sets additional values in the PARDISO handle.

The PARDISO solver supports a wide range of sparse matrix types (see the figure below) and computes the solution of real or complex sparse linear systems of equations on shared-memory multiprocessing architectures.

[Figure: Sparse Matrices That Can Be Solved with the PARDISO* Solver]

The PARDISO solver performs four tasks:
• analysis and symbolic factorization
• numerical factorization
• forward and backward substitution including iterative refinement
• termination to release all internal solver memory.
You can find a code example that uses the PARDISO interface routine to solve systems of linear equations in the examples\solver\source folder of your Intel MKL directory.

pardiso
Calculates the solution of a set of sparse linear equations with multiple right-hand sides.
Syntax
Fortran:
call pardiso (pt, maxfct, mnum, mtype, phase, n, a, ia, ja, perm, nrhs, iparm, msglvl, b, x, error)
C:
pardiso (pt, &maxfct, &mnum, &mtype, &phase, &n, a, ia, ja, perm, &nrhs, iparm, &msglvl, b, x, &error);

Include Files
• FORTRAN 77: mkl_pardiso.f77
• Fortran 90: mkl_pardiso.f90
• C: mkl_pardiso.h

Description
The routine pardiso calculates the solution of a set of sparse linear equations
A*X = B
with multiple right-hand sides, using a parallel LU, LDL or LL^T factorization, where A is an n-by-n matrix, and X and B are n-by-nrhs matrices.

Supported Matrix Types.
The analysis steps performed by pardiso depend on the structure of the input matrix A.

Symmetric Matrices: The solver first computes a symmetric fill-in reducing permutation P based on either the minimum degree algorithm [Liu85] or the nested dissection algorithm from the METIS package [Karypis98] (both included with Intel MKL), followed by the parallel left-right looking numerical Cholesky factorization [Schenk00-2] of P*A*P^T = L*L^T for symmetric positive-definite matrices, or P*A*P^T = L*D*L^T for symmetric indefinite matrices. The solver uses diagonal pivoting, or 1x1 and 2x2 Bunch and Kaufman pivoting for symmetric indefinite matrices, and an approximation of X is found by forward and backward substitution and iterative refinements.
Whenever numerically acceptable 1x1 and 2x2 pivots cannot be found within the diagonal supernode block, the coefficient matrix is perturbed. One or two passes of iterative refinements may be required to correct the effect of the perturbations. This restricting notion of pivoting with iterative refinements is effective for highly indefinite symmetric systems. Furthermore, for a large set of matrices from different application areas, this method is as accurate as a direct factorization method that uses complete sparse pivoting techniques [Schenk04].
Another method of improving the pivoting accuracy is to use symmetric weighted matching algorithms. These algorithms identify large entries in the coefficient matrix A that, if permuted close to the diagonal, permit the factorization process to identify more acceptable pivots and proceed with fewer pivot perturbations. These algorithms are based on maximum weighted matchings and improve the quality of the factor in a complementary way to the alternative idea of using more complete pivoting techniques.
The inertia is also computed for real symmetric indefinite matrices.

Structurally Symmetric Matrices: The solver first computes a symmetric fill-in reducing permutation P followed by the parallel numerical factorization of P*A*P^T = Q*L*U^T. The solver uses partial pivoting in the supernodes, and an approximation of X is found by forward and backward substitution and iterative refinements.

Unsymmetric Matrices: The solver first computes a non-symmetric permutation P_MPS and scaling matrices D_r and D_c with the aim of placing large entries on the diagonal to enhance the reliability of the numerical factorization process [Duff99]. In the next step the solver computes a fill-in reducing permutation P based on the matrix P_MPS*A + (P_MPS*A)^T, followed by the parallel numerical factorization
Q*L*U*R = P*P_MPS*D_r*A*D_c*P
with supernode pivoting matrices Q and R. When the factorization algorithm reaches a point where it cannot factor the supernodes with this pivoting strategy, it uses a pivoting perturbation strategy similar to [Li99]. The magnitude of the potential pivot is tested against a constant threshold of
alpha = eps*||A2||_inf,
where eps is the machine precision, A2 = P*P_MPS*D_r*A*D_c*P, and ||A2||_inf is the infinity norm of the scaled and permuted matrix A. Any tiny pivots encountered during elimination are set to sign(l_ii)*eps*||A2||_inf, which trades off some numerical stability for the ability to keep pivots from getting too small.
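The threshold test just described can be illustrated on dense data. The following Python sketch is not the PARDISO implementation; it only restates the rule "replace a pivot smaller than alpha = eps*||A2||_inf by sign(pivot)*alpha". The function names inf_norm, pivot_threshold, and perturb_pivot are hypothetical, and the matrix is assumed to be already scaled and permuted.

```python
import math


def inf_norm(matrix):
    """Infinity norm of a dense matrix: the largest absolute row sum."""
    return max(sum(abs(value) for value in row) for row in matrix)


def pivot_threshold(a2, eps):
    """alpha = eps * ||A2||_inf for an already scaled and permuted matrix A2."""
    return eps * inf_norm(a2)


def perturb_pivot(pivot, alpha):
    """Return the pivot unchanged if it is large enough, else sign(pivot)*alpha.

    Assumption of this sketch: a pivot of exactly +0.0 is treated as positive.
    """
    if abs(pivot) >= alpha:
        return pivot
    return math.copysign(alpha, pivot)
```

With eps = 1.0E-13 and a matrix whose infinity norm is 5, for example, any pivot smaller in magnitude than 5.0E-13 would be replaced, keeping its sign.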
Although many failures could render the factorization well-defined but essentially useless, in practice the diagonal elements are rarely modified for a large class of matrices. The result of this pivoting approach is that the factorization is, in general, not exact and iterative refinement may be needed.

Direct-Iterative Preconditioning for Unsymmetric Linear Systems.
The solver lets you use a combination of direct and iterative methods [Sonn89] to accelerate the linear solution process for transient simulation. Most applications of sparse solvers require solutions of systems with gradually changing values of the nonzero coefficient matrix, but with an identical sparsity pattern. In these applications, the analysis phase of the solver has to be performed only once, and the numerical factorizations are the important time-consuming steps during the simulation. PARDISO uses a numerical factorization A = LU for the first system and applies the factors L and U for the next steps in a preconditioned Krylov subspace iteration. If the iteration does not converge, the solver automatically switches back to the numerical factorization. This method can be applied to unsymmetric matrices in PARDISO. You can select the method using only one input parameter. For further details see the parameter description (iparm(4), iparm(20)).

Single and Double Precision Computations.
PARDISO solves tasks using single or double precision. Each precision has its benefits and drawbacks. Double precision variables have more digits to store values, so the solver uses more memory for keeping data. But this mode solves matrices with better accuracy, and input matrices can have large condition numbers.
Single precision variables have fewer digits to store values, so the solver uses less memory than in the double precision mode. Additionally this mode usually takes less time.
But because the computations are less precise, only numerically stable processes can use single precision.

Separate Forward and Backward Substitution.
The solver execution step (see parameter phase = 33 below) can be divided into two or three separate substitutions: forward, backward, and possibly diagonal. This separation can be explained by the examples of solving systems with different matrix types.
A real symmetric positive definite matrix A (mtype = 2) is factored by PARDISO as A = L*L^T. In this case the solution of the system A*x = b can be found as a sequence of substitutions: L*y = b (forward substitution, phase = 331) and L^T*x = y (backward substitution, phase = 333).
A real unsymmetric matrix A (mtype = 11) is factored by PARDISO as A = L*U. In this case the solution of the system A*x = b can be found by the following sequence: L*y = b (forward substitution, phase = 331) and U*x = y (backward substitution, phase = 333).
Solving a system with a real symmetric indefinite matrix A (mtype = -2) is slightly different from the cases above. PARDISO factors this matrix as A = L*D*L^T, and the solution of the system A*x = b can be calculated as the following sequence of substitutions: L*y = b (forward substitution, phase = 331), D*v = y (diagonal substitution, phase = 332), and, finally, L^T*x = v (backward substitution, phase = 333). Diagonal substitution makes sense only for indefinite matrices (mtype = -2, -4, 6). For matrices of other types a solution can be found as described in the first two examples.

NOTE The number of refinement steps (iparm(8)) must be set to zero if a solution is calculated with separate substitutions (phase = 331, 332, 333); otherwise PARDISO produces the wrong result.

NOTE Different pivoting (iparm(21)) produces a different L*D*L^T factorization. Therefore results of forward, diagonal and backward substitutions with diagonal pivoting can differ from results of the same steps with Bunch and Kaufman pivoting.
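The three-step sequence for a symmetric indefinite matrix can be illustrated with a dense toy example. The Python sketch below is not PARDISO code and does not call Intel MKL; it assumes a strictly diagonal D (no 2x2 Bunch and Kaufman blocks), a dense lower triangular L stored as nested lists, and hypothetical function names.

```python
def forward_substitution(lower, b):
    """Solve L*y = b for a dense lower triangular L (the phase = 331 analogue)."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        partial = sum(lower[i][k] * y[k] for k in range(i))
        y[i] = (b[i] - partial) / lower[i][i]
    return y


def diagonal_substitution(diag, y):
    """Solve D*v = y for a diagonal D given as a list (the phase = 332 analogue)."""
    return [y_i / d_i for y_i, d_i in zip(y, diag)]


def backward_substitution(lower, v):
    """Solve L^T*x = v using the same L (the phase = 333 analogue)."""
    n = len(v)
    x = [0.0] * n
    for i in reversed(range(n)):
        # Entry (i, k) of L^T is lower[k][i].
        partial = sum(lower[k][i] * x[k] for k in range(i + 1, n))
        x[i] = (v[i] - partial) / lower[i][i]
    return x


# A = L*D*L^T = [[3, 6], [6, 11]] is indefinite because D has mixed signs.
L = [[1.0, 0.0], [2.0, 1.0]]
D = [3.0, -1.0]
b = [15.0, 28.0]

y = forward_substitution(L, b)
v = diagonal_substitution(D, y)
x = backward_substitution(L, v)  # solves A*x = b
```

Chaining the three calls reproduces what a single combined solve step would return for this small system.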
The final results of sequential execution of forward, diagonal and backward substitution are, however, equal to the results of the full solving step (phase = 33) regardless of the pivoting used.

Sparse Data Storage.
Sparse data storage in PARDISO follows the scheme described in Sparse Matrix Storage Format, with ja standing for columns, ia for rowIndex, and a for values. The algorithms in PARDISO require that the column indices ja be in increasing order per row and that the diagonal element in each row be present for any structurally symmetric matrix. For symmetric or unsymmetric matrices the diagonal elements are not necessary: they may be present or not.

NOTE The presence of diagonal elements for symmetric matrices is not mandatory starting from the Intel MKL 10.3 beta release.

CAUTION It is recommended to store zero diagonal elements explicitly for symmetric matrices. Otherwise PARDISO creates internal copies of the arrays ia, ja and a that include the diagonal elements, which requires additional memory and computational time. In general, however, the memory and time overheads are not significant compared to the memory and time needed to factor and solve the matrix.

Input Parameters

NOTE Parameter types in this section are specified in FORTRAN 77 notation. See PARDISO Parameters in Tabular Form for a detailed description of the types of PARDISO parameters in C/Fortran 90 notations.

pt INTEGER
Array, DIMENSION (64)
Pointer to the address of solver internal data. These addresses are passed to the solver and all related internal memory management is organized through this pointer.

NOTE pt is an integer array with 64 entries. It is very important that the pointer is initialized with zero at the first call of pardiso. After the first call, do not modify the pointer, as a serious memory leak can occur. The integer length must be 4 bytes on 32-bit operating systems and 8 bytes on 64-bit operating systems.
maxfct INTEGER
Maximum number of factors with identical nonzero sparsity structure that must be kept in memory at the same time. In most applications this value is equal to 1. It is possible to store several different factorizations with the same nonzero structure at the same time in the internal data management of the solver.
pardiso can process several matrices with an identical matrix sparsity pattern and it can store the factors of these matrices at the same time. Matrices with a different sparsity structure can be kept in memory with different memory address pointers pt.

mnum INTEGER
Indicates the actual matrix for the solution phase. With this scalar you can define which matrix to factorize. The value must be: 1 <= mnum <= maxfct.
In most applications this value is 1.

mtype INTEGER
Defines the matrix type, which influences the pivoting method. The PARDISO solver supports the following matrices:
1     real and structurally symmetric
2     real and symmetric positive definite
-2    real and symmetric indefinite
3     complex and structurally symmetric
4     complex and Hermitian positive definite
-4    complex and Hermitian indefinite
6     complex and symmetric
11    real and unsymmetric
13    complex and unsymmetric

phase INTEGER
Controls the execution of the solver. Usually it is a two- or three-digit integer ij (10*i + j, 1 <= i <= 3, i <= j <= 3 for normal execution modes). The i digit indicates the starting phase of execution, j indicates the ending phase. PARDISO has the following phases of execution:
• Phase 1: Fill-reduction analysis and symbolic factorization
• Phase 2: Numerical factorization
• Phase 3: Forward and backward solve including iterative refinements
This phase can be divided into two or three separate substitutions: forward, backward, and diagonal (see above).
• Termination and Memory Release Phase (phase = 0)
If a previous call to the routine has computed information from previous phases, execution may start at any phase.
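The two-digit encoding phase = 10*i + j for the normal execution modes can be sketched as a small decoder. This Python fragment is only an illustration of the encoding; the name decode_phase and the short phase labels are hypothetical, and the sketch deliberately rejects the special values (0, -1, and the three-digit substitution phases) instead of interpreting them.

```python
PHASE_NAMES = {
    1: "analysis",
    2: "numerical factorization",
    3: "solve",
}


def decode_phase(phase):
    """Split a normal-mode phase value 10*i + j into the list of executed steps.

    Requires 1 <= i <= 3 and i <= j <= 3; any other value raises ValueError.
    """
    start, end = divmod(phase, 10)
    if not (1 <= start <= 3 and start <= end <= 3):
        raise ValueError(f"phase {phase} is not a normal execution mode")
    return [PHASE_NAMES[step] for step in range(start, end + 1)]
```

For instance, decode_phase(23) lists numerical factorization followed by the solve step, which matches a call that reuses an earlier analysis.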
The phase parameter can have the following values:
phase    Solver Execution Steps
11       Analysis
12       Analysis, numerical factorization
13       Analysis, numerical factorization, solve, iterative refinement
22       Numerical factorization
23       Numerical factorization, solve, iterative refinement
33       Solve, iterative refinement
331      like phase=33, but only forward substitution
332      like phase=33, but only diagonal substitution
333      like phase=33, but only backward substitution
0        Release internal memory for L and U matrix number mnum
-1       Release all internal memory for all matrices

n INTEGER
Number of equations in the sparse linear system of equations A*X = B. Constraint: n > 0.

a DOUBLE PRECISION - for real types of matrices (mtype=1, 2, -2 and 11) and for double precision PARDISO (iparm(28)=0)
REAL - for real types of matrices (mtype=1, 2, -2 and 11) and for single precision PARDISO (iparm(28)=1)
DOUBLE COMPLEX - for complex types of matrices (mtype=3, 4, -4, 6 and 13) and for double precision PARDISO (iparm(28)=0)
COMPLEX - for complex types of matrices (mtype=3, 4, -4, 6 and 13) and for single precision PARDISO (iparm(28)=1)
Array. Contains the non-zero elements of the coefficient matrix A corresponding to the indices in ja. The size of a is the same as that of ja and the coefficient matrix can be either real or complex. The matrix must be stored in compressed sparse row format with increasing values of ja for each row. Refer to the values array description in Storage Formats for the Direct Sparse Solvers for more details.

NOTE The non-zero elements of each row of the matrix A must be stored in increasing order. For symmetric or structurally symmetric matrices, it is also important that the diagonal elements are available and stored in the matrix. If the matrix is symmetric, the array a is only accessed in the factorization phase, in the triangular solution and iterative refinement phase.
Unsymmetric matrices are accessed in all phases of the solution process.

ia INTEGER
Array, dimension (n+1). For i <= n, ia(i) points to the first column index of row i in the array ja in compressed sparse row format. That is, ia(i) gives the index of the element in array a that contains the first non-zero element from row i of A. The last element ia(n+1) is taken to be equal to the number of non-zero elements in A, plus one. Refer to the rowIndex array description in Storage Formats for the Direct Sparse Solvers for more details. The array ia is also accessed in all phases of the solution process.
Indexing of ia is one-based by default, but it can be changed to zero-based by setting the appropriate value to the parameter iparm(35).

ja INTEGER
Array ja(*) contains column indices of the sparse matrix A stored in compressed sparse row format. The indices in each row must be sorted in increasing order. The array ja is also accessed in all phases of the solution process. For structurally symmetric matrices it is assumed that diagonal elements, which are zero, are also stored in the list of non-zero elements in a and ja. For symmetric matrices, the solver needs only the upper triangular part of the system, as is shown for the columns array in Storage Formats for the Direct Sparse Solvers.
Indexing of ja is one-based by default, but it can be changed to zero-based by setting the appropriate value to the parameter iparm(35).

perm INTEGER
Array, dimension (n). Holds the permutation vector of size n. You can use it to apply your own fill-in reducing ordering to the solver. The array perm is defined as follows. Let A be the original matrix and B = P*A*P^T be the permuted matrix. Row (column) i of A is the perm(i) row (column) of B. The permutation vector perm is used by the solver if iparm(5) = 1.
The array perm is also used to return the permutation vector calculated during the fill-in reducing ordering stage. The permutation vector is returned into the perm array if iparm(5) = 2.
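The a, ia, and ja arrays described in the entries above can be illustrated for a small symmetric matrix. The Python sketch below is a hand-written illustration rather than an Intel MKL utility: the name dense_to_csr_upper is hypothetical, the input is assumed to be a dense symmetric matrix given as nested lists, and the output uses the default one-based indexing. Only the upper triangle is kept, and the diagonal entry of every row is stored even when it is zero.

```python
def dense_to_csr_upper(matrix):
    """Build one-based CSR arrays (a, ia, ja) from the upper triangle of a
    dense symmetric matrix, always storing the diagonal entry of each row."""
    n = len(matrix)
    a, ja = [], []
    ia = [1]  # ia(1) = 1: the first row starts at the first stored element
    for i in range(n):
        for j in range(i, n):  # upper triangle only, columns in increasing order
            value = matrix[i][j]
            if value != 0 or i == j:
                a.append(value)
                ja.append(j + 1)  # one-based column index
        ia.append(len(a) + 1)  # start of the next row; the last entry is nnz + 1
    return a, ia, ja
```

For the 3-by-3 matrix with rows (4, 1, 0), (1, 3, 2), (0, 2, 0), this gives five stored elements, including the explicit zero on the last diagonal position, and ia ends with 6, the number of stored elements plus one.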
Indexing of perm is one-based by default, but it can be changed to zero-based by setting the appropriate value to the parameter iparm(35).

NOTE The first elements of row, column and permutation are numbered as array elements 1 by default (Fortran style, or one-based array indexing), but these first elements can be numbered as array elements 0 (C style, or zero-based array indexing) by setting the appropriate value to the parameter iparm(35).

nrhs INTEGER
Number of right-hand sides that need to be solved for.

iparm INTEGER
Array, dimension (64). This array is used to pass various parameters to PARDISO and to return some useful information after execution of the solver. If iparm(1) = 0, PARDISO uses default values for iparm(2) through iparm(64).
The individual components of the iparm array are described below (some of them are described in the Output Parameters section).

iparm(1) - use default values.
If iparm(1) = 0, iparm(2) through iparm(64) are filled with default values; otherwise you must set all values in iparm from iparm(2) to iparm(64).

iparm(2) - fill-in reducing ordering.
iparm(2) controls the fill-in reducing ordering for the input matrix.
If iparm(2) = 0, the minimum degree algorithm is applied [Li99].
If iparm(2) = 2, the solver uses the nested dissection algorithm from the METIS package [Karypis98].
If iparm(2) = 3, the parallel (OpenMP) version of the nested dissection algorithm is used. It can decrease the time of computations on multi-core computers, especially when PARDISO Phase 1 takes significant time.
The default value of iparm(2) is 2.

CAUTION You can control the parallel execution of the solver by explicitly setting the environment variable MKL_NUM_THREADS. If fewer processors are available than specified, the execution may slow down instead of speeding up. If the variable MKL_NUM_THREADS is not defined, then the solver uses all available processors.

iparm(3) - currently is not used.

iparm(4) - preconditioned CGS.
This parameter controls preconditioned CGS [Sonn89] for unsymmetric or structurally symmetric matrices and Conjugate-Gradients for symmetric matrices. iparm(4) has the form iparm(4) = 10*L + K.
The K and L values have the following meanings.

Value of K    Description
0             The factorization is always computed as required by phase.
1             CGS iteration replaces the computation of LU. The preconditioner is LU that was computed at a previous step (the first step or last step with a failure) in a sequence of solutions needed for identical sparsity patterns.
2             CGS iteration for symmetric matrices replaces the computation of LU. The preconditioner is LU that was computed at a previous step (the first step or last step with a failure) in a sequence of solutions needed for identical sparsity patterns.

Value L:
The value L controls the stopping criterion of the Krylov subspace iteration:
epsCGS = 10^(-L) is used in the stopping criterion
||dx_i|| / ||dx_i|| < epsCGS
with ||dx_i|| = ||inv(L*U)*r_i||, where r_i is the residue at iteration i of the preconditioned Krylov subspace iteration.

Strategy: A maximum number of 150 iterations is fixed with the assumption that the iteration will converge before consuming half the factorization time. Intermediate convergence rates and residue excursions are checked and can terminate the iteration process. If phase = 23, then the factorization for a given A is automatically recomputed in cases where the Krylov subspace iteration failed, and the corresponding direct solution is returned. Otherwise the solution from the preconditioned Krylov subspace iteration is returned. Using phase = 33 results in an error message (error = -4) if the stopping criteria for the Krylov subspace iteration cannot be reached. More information on the failure can be obtained from iparm(20).
The default is iparm(4) = 0, and other values are only recommended for an advanced user. iparm(4) must be greater than or equal to zero.
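The encoding iparm(4) = 10*L + K can be unpacked with integer division. The following Python sketch only illustrates that arithmetic; the name decode_iparm4 is hypothetical, and the function is not part of the PARDISO interface.

```python
def decode_iparm4(value):
    """Split iparm(4) = 10*L + K into (L, K, epsCGS), where epsCGS = 10**(-L).

    K selects the method (0, 1, or 2) and L sets the stopping tolerance.
    """
    if value < 0:
        raise ValueError("iparm(4) must be greater than or equal to zero")
    tolerance_exponent, method = divmod(value, 10)
    if method not in (0, 1, 2):
        raise ValueError(f"K = {method} is not one of the documented values 0, 1, 2")
    eps_cgs = 10.0 ** (-tolerance_exponent)
    return tolerance_exponent, method, eps_cgs
```

Under this reading, decode_iparm4(62) yields L = 6 and K = 2, that is, the symmetric variant with a tolerance of 1.0E-6.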
Examples:

iparm(4)   Description
31         LU-preconditioned CGS iteration with a stopping criterion of 1.0E-3 for unsymmetric matrices
61         LU-preconditioned CGS iteration with a stopping criterion of 1.0E-6 for unsymmetric matrices
62         LU-preconditioned CGS iteration with a stopping criterion of 1.0E-6 for symmetric matrices

iparm(5) - user permutation.
This parameter controls whether a user-supplied fill-in reducing permutation is used instead of the integrated multiple-minimum degree or nested dissection algorithms. Another possible use of this parameter is to obtain the fill-in reducing permutation vector calculated during the reordering stage of PARDISO. This option is useful for testing reordering algorithms, adapting the code to special application problems (for instance, to move zero diagonal elements to the end of P*A*P^T), or for using the permutation vector more than once for equal or similar matrices. For the definition of the permutation, see the description of the perm parameter.
If iparm(5) = 0 (default value), then the array perm is not used by PARDISO; if iparm(5) = 1, then the user-supplied fill-in reducing permutation in the array perm is used; if iparm(5) = 2, then PARDISO returns the permutation vector in the array perm.

iparm(6) - write solution on x.
If iparm(6) = 0 (default value), then the array x contains the solution and the value of b is not changed. If iparm(6) = 1, then the solver stores the solution in the right-hand side b. Note that the array x is always used.

iparm(8) - iterative refinement step.
On entry to the solve and iterative refinement step, iparm(8) must be set to the maximum number of iterative refinement steps that the solver performs. The solver does not perform more than the absolute value of iparm(8) steps of iterative refinement and stops the process if a satisfactory level of accuracy of the solution in terms of backward error is achieved.
If iparm(8) < 0, the accumulation of the residue uses extended precision real and complex data types. Perturbed pivots result in iterative refinement (independent of iparm(8) = 0), and the number of executed iterations is reported in iparm(7). The solver automatically performs two steps of iterative refinement when perturbed pivots are obtained during the numerical factorization and iparm(8) = 0. The default value for iparm(8) is 0.

iparm(9)
This parameter is reserved for future use. Its value must be set to 0.

iparm(10) - pivoting perturbation.
This parameter instructs PARDISO how to handle small pivots or zero pivots for unsymmetric matrices (mtype = 11 or mtype = 13) and symmetric matrices (mtype = -2, mtype = -4, or mtype = 6). For these matrices the solver uses a complete supernode pivoting approach. When the factorization algorithm reaches a point where it cannot factor the supernodes with this pivoting strategy, it uses a pivoting perturbation strategy similar to [Li99], [Schenk04].
The magnitude of the potential pivot is tested against a constant threshold of alpha = eps*||A2||_inf, where eps = 10^(-iparm(10)), A2 = P*P_MPS*D_r*A*D_c*P, and ||A2||_inf is the infinity norm of the scaled and permuted matrix A. Any tiny pivots encountered during elimination are set to sign(l_ii)*eps*||A2||_inf. This trades off some numerical stability for the ability to keep pivots from getting too small. Small pivots are therefore perturbed with eps = 10^(-iparm(10)).
For unsymmetric matrices (mtype = 11 or mtype = 13) the default value of iparm(10) is 13, and therefore eps = 1.0E-13. For symmetric indefinite matrices (mtype = -2, mtype = -4, or mtype = 6) the default value of iparm(10) is 8, and therefore eps = 1.0E-8.

iparm(11) - scaling vectors.
PARDISO uses a maximum weight matching algorithm to permute large elements onto the diagonal and to scale the matrix so that the diagonal elements are equal to 1 and the absolute values of the off-diagonal entries are less than or equal to 1. This scaling method is applied only to unsymmetric matrices (mtype = 11 or mtype = 13). The scaling can also be used for symmetric indefinite matrices (mtype = -2, mtype = -4, or mtype = 6) when the symmetric weighted matchings are applied (iparm(13) = 1).
Use iparm(11) = 1 (scaling) and iparm(13) = 1 (matching) for highly indefinite symmetric matrices, for example, from interior point optimizations or saddle point problems. Note that in the analysis phase (phase = 11) you must provide the numerical values of the matrix A in case of scaling and symmetric weighted matching.
The default value of iparm(11) is 1 for unsymmetric matrices (mtype = 11 or mtype = 13). The default value of iparm(11) is 0 for symmetric indefinite matrices (mtype = -2, mtype = -4, or mtype = 6).

iparm(12) - solving with transposed or conjugate transposed matrix.
If iparm(12) = 0, PARDISO solves a linear system A*x = b (default value). If iparm(12) = 1, PARDISO solves a conjugate transposed system A^H*x = b based on the factorization of the matrix A. If iparm(12) = 2, PARDISO solves a transposed system A^T*x = b based on the factorization of the matrix A.

NOTE For real matrices the terms conjugate transposed and transposed are equivalent.

iparm(13) - improved accuracy using (non-)symmetric weighted matchings.
PARDISO can use a maximum weighted matching algorithm to permute large elements close to the diagonal. This strategy adds an additional level of reliability to the factorization methods and can be seen as a complement to the alternative idea of using more complete pivoting techniques during the numerical factorization.
Use iparm(11) = 1 (scalings) and iparm(13) = 1 (matchings) for highly indefinite symmetric matrices, for example from interior point optimizations or saddle point problems. Note that in the analysis phase (phase = 11) you must provide the numerical values of the matrix A in the case of scalings and symmetric weighted matchings.
The default value of iparm(13) is 1 for unsymmetric matrices (mtype = 11 or mtype = 13). The default value of iparm(13) is 0 for symmetric matrices (mtype = -2, mtype = -4, or mtype = 6).

iparm(18) - numbers of non-zero elements in the factors.
If iparm(18) < 0 on entry, the solver reports the numbers of non-zero elements in the factors. The default value of iparm(18) is -1.

iparm(19) - MFLOPS of factorization.
If iparm(19) < 0 on entry, the solver reports the number of MFLOPS (1.0E6) that are necessary to factor the matrix A. Reporting this number increases the reordering time. The default value of iparm(19) is 0.

iparm(21) - pivoting for symmetric indefinite matrices.
iparm(21) controls the pivoting method for sparse symmetric indefinite matrices. If iparm(21) = 0, then 1x1 diagonal pivoting is used. If iparm(21) = 1, then 1x1 and 2x2 Bunch and Kaufman pivoting is used in the factorization process.

NOTE Use iparm(11) = 1 (scaling) and iparm(13) = 1 (matchings) for highly indefinite symmetric matrices, for example from interior point optimizations or saddle point problems.

The default value of iparm(21) is 1. Bunch and Kaufman pivoting is available for matrices: mtype = -2, mtype = -4, or mtype = 6.

iparm(24) - parallel factorization control.
This parameter selects the scheduling method for the parallel numerical factorization. If iparm(24) = 0 (default value), then PARDISO uses the previous parallel factorization. If iparm(24) = 1, then PARDISO uses the new two-level scheduling algorithm. This algorithm generally improves scalability in case of parallel factorization on many threads (more than eight).
The two-level scheduling factorization algorithm is enabled by default in previous MKL releases for matrices mtype = 11. If you see performance degradation for such matrices with the default value, manually set iparm(24) = 1.

iparm(25) - parallel forward/backward solve control.
If iparm(25) = 0 (default value), then PARDISO uses a parallel algorithm for the solve step. If iparm(25) = 1, then PARDISO uses sequential forward and backward solve. This feature is available only for the in-core version.

iparm(27) - matrix checker.
If iparm(27) = 0 (default value), PARDISO does not check the sparse matrix representation. If iparm(27) = 1, then PARDISO checks the integer arrays ia and ja. In particular, PARDISO checks whether column indices are sorted in increasing order within each row.

iparm(28) - sets single or double precision of PARDISO.
If iparm(28) = 0, then the input arrays (matrix a, vectors x and b) and all internal arrays must be presented in double precision. If iparm(28) = 1, then the input arrays must be presented in single precision. In this case all internal computations are performed in single precision.
Depending on the sign of iparm(8), refinement steps can be calculated in quad or double precision for double precision accuracy, and in double or single precision for single precision accuracy.
The default value of iparm(28) is 0 (double precision).

Important The iparm(28) value is stored in the PARDISO handle between PARDISO calls, so the precision mode can be changed only during the solver's phase 1.

iparm(31) - partial solution for sparse right-hand sides and sparse solution.
This parameter controls the solution method if the right-hand side contains only a few nonzero components. It can also be used if only a few components of the solution vector are needed, or if you want to reduce the computation cost at the solver step.
To use this option, define the input permutation vector perm so that perm(i) = 1 means that the i-th component in the right-hand side is nonzero or the i-th component in the solution vector is computed.
If iparm(31) = 0 (default value), this option is disabled.
If iparm(31) = 1, the right-hand side must be sparse, and the i-th component in the solution vector is computed if perm(i) = 1. You can set perm(i) = 1 only if the i-th component of the right-hand side is nonzero.
If iparm(31) = 2, the right-hand side must be sparse, and all components of the solution vector are computed. perm(i) = 1 means that the i-th component of the right-hand side is nonzero. In this case the computation cost at the solver step is reduced due to the reduced forward solver step. To use iparm(31) = 2, you must set the i-th component of the right-hand side to zero explicitly if perm(i) is not equal to 1.
If iparm(31) = 3, the right-hand side can be of any type, and you must set perm(i) = 1 to compute the i-th component in the solution vector.
The permutation vector perm must be present in all phases of Intel MKL PARDISO software. At the reordering step, the software overwrites the input vector perm with a permutation vector used by the software at the factorization and solver step. If m is the number of components such that perm(i) = 1, then the last m components of the output vector perm are a set of the indices i satisfying the condition perm(i) = 1 on input.

NOTE Turning on this option often increases the time used by PARDISO for the factorization and reordering steps, but it enables time to be reduced for the solver step.

Important This feature is available only for the in-core version, so to use it you must set iparm(60) = 0. Set the parameters iparm(8) (iterative refinement steps), iparm(4) (preconditioned CGS), and iparm(5) (user permutation) to 0 as well.

iparm(32) - iparm(34) - these parameters are reserved for future use. Their values must be set to 0.

iparm(35) - C or Fortran style array indexing.
iparm(35) determines the indexing base for input matrices. If iparm(35) = 0 (default value), then PARDISO uses Fortran style indexing: the first value is referenced as array element 1. Otherwise PARDISO uses C style indexing: the first value is referenced as array element 0.

iparm(36) - iparm(59) - these parameters are reserved for future use. Their values must be set to 0.

iparm(60) - version of PARDISO.
iparm(60) controls which version of PARDISO is used: the out-of-core (OOC) version or the in-core (IC) version. OOC PARDISO can solve very large problems by holding the matrix factors in files on the disk. Because of that, the amount of main memory required by OOC PARDISO is significantly reduced.
If iparm(60) = 0 (default value), then IC PARDISO is used.
If iparm(60) = 1, then IC PARDISO is used if the total amount of RAM (in megabytes) needed for storing the matrix factors is less than the sum of the values of two environment variables: MKL_PARDISO_OOC_MAX_CORE_SIZE (its default value is 2000 MB) and MKL_PARDISO_OOC_MAX_SWAP_SIZE (its default value is 0 MB); otherwise OOC PARDISO is used. In this case the amount of RAM used by OOC PARDISO cannot exceed the value of MKL_PARDISO_OOC_MAX_CORE_SIZE.
If iparm(60) = 2, then OOC PARDISO is used.
If iparm(60) is equal to 1 or 2, and the total peak memory needed for storing the local arrays is more than MKL_PARDISO_OOC_MAX_CORE_SIZE, the program stops with error -9. In this case, increase MKL_PARDISO_OOC_MAX_CORE_SIZE.

OOC parameters can be set in a configuration file. You can set the path to this file and its name using the environment variables MKL_PARDISO_OOC_CFG_PATH and MKL_PARDISO_OOC_CFG_FILE_NAME. Path and name are as follows:
<MKL_PARDISO_OOC_CFG_PATH>/<MKL_PARDISO_OOC_CFG_FILE_NAME> for Linux* OS, and
<MKL_PARDISO_OOC_CFG_PATH>\<MKL_PARDISO_OOC_CFG_FILE_NAME> for Windows* OS.
By default, the name of the file is pardiso_ooc.cfg and it is placed in the current directory.
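The iparm(60) selection rule described above can be sketched in C as follows. This is only an illustration of the decision logic as stated in this section; the enum pardiso_mode and the function select_pardiso_mode are hypothetical names, not Intel MKL API, and the memory figures are passed in as plain arguments instead of being read from the environment.

```c
/* Hypothetical names for illustration; not part of Intel MKL. */
typedef enum { MODE_IN_CORE, MODE_OUT_OF_CORE } pardiso_mode;

/* Returns the solver variant implied by iparm(60).
   factors_mb  : RAM in megabytes needed to store the matrix factors
   max_core_mb : value of MKL_PARDISO_OOC_MAX_CORE_SIZE (default 2000)
   max_swap_mb : value of MKL_PARDISO_OOC_MAX_SWAP_SIZE (default 0)   */
pardiso_mode select_pardiso_mode(int iparm60, long factors_mb,
                                 long max_core_mb, long max_swap_mb)
{
    if (iparm60 == 2)
        return MODE_OUT_OF_CORE;            /* OOC is always used */
    if (iparm60 == 1)                       /* IC only if the factors fit */
        return (factors_mb < max_core_mb + max_swap_mb)
                   ? MODE_IN_CORE : MODE_OUT_OF_CORE;
    return MODE_IN_CORE;                    /* iparm(60) = 0, the default */
}
```

With the default limits (2000 MB core, 0 MB swap) and iparm(60) = 1, factors that need 1500 MB stay in core, while factors that need 2500 MB switch the solver to the out-of-core variant.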
All temporary data files can be deleted or stored when the calculations are completed, in accordance with the value of the environment variable MKL_PARDISO_OOC_KEEP_FILE. If it is set to 1 (default value), then all files are deleted; if it is set to 0, then all files are stored.
By default, OOC PARDISO uses the current directory for storing data, and all work arrays associated with the matrix factors are stored in files named ooc_temp with different extensions. These default values can be changed by using the environment variable MKL_PARDISO_OOC_PATH.
To set the environment variables MKL_PARDISO_OOC_MAX_CORE_SIZE, MKL_PARDISO_OOC_MAX_SWAP_SIZE, MKL_PARDISO_OOC_KEEP_FILE, and MKL_PARDISO_OOC_PATH, create the configuration file with the following lines:

    MKL_PARDISO_OOC_PATH = <path>\ooc_file
    MKL_PARDISO_OOC_MAX_CORE_SIZE = N
    MKL_PARDISO_OOC_MAX_SWAP_SIZE = K
    MKL_PARDISO_OOC_KEEP_FILE = 0 (or 1)

where <path> is the directory for storing data, ooc_file is the file name without any extension, N is the maximum size of RAM in megabytes available for PARDISO (default value is 2000 MB), and K is the maximum swap size in megabytes available for PARDISO (default value is 0 MB). Do not set N greater than the size of the RAM or K greater than the size of the swap.

WARNING The maximum length of the path lines in the configuration files is 1000 characters.

Alternatively the environment variables can be set via command line.
For Linux* OS:

    export MKL_PARDISO_OOC_PATH=<path>/ooc_file
    export MKL_PARDISO_OOC_MAX_CORE_SIZE=N
    export MKL_PARDISO_OOC_MAX_SWAP_SIZE=K
    export MKL_PARDISO_OOC_KEEP_FILE=0 (or 1)

For Windows* OS:

    set MKL_PARDISO_OOC_PATH=<path>\ooc_file
    set MKL_PARDISO_OOC_MAX_CORE_SIZE=N
    set MKL_PARDISO_OOC_MAX_SWAP_SIZE=K
    set MKL_PARDISO_OOC_KEEP_FILE=0 (or 1)

NOTE The values specified in a command line have higher priority: if a variable is changed both in the configuration file and in the command line, OOC PARDISO uses only the value defined in the command line.

NOTE You can switch between IC and OOC modes after the reordering phase. There are some recommendations and limitations:
• Set iparm(60) before the reordering phase to get better PARDISO performance.
• The two-level factorization algorithm is not supported in the OOC mode. If you set the two-level algorithm in the OOC mode, then PARDISO returns error -1.
• Switching between IC and OOC modes after the reordering phase is not available in sequential mode. The program returns error -1.

iparm(61), iparm(62), iparm(64) - these parameters are reserved for future use. Their values must be set to 0.

msglvl INTEGER
Message level information. If msglvl = 0 then PARDISO generates no output; if msglvl = 1 the solver prints statistical information to the screen.

b
DOUBLE PRECISION - for real types of matrices (mtype=1, 2, -2 and 11) and for double precision PARDISO (iparm(28)=0)
REAL - for real types of matrices (mtype=1, 2, -2 and 11) and for single precision PARDISO (iparm(28)=1)
DOUBLE COMPLEX - for complex types of matrices (mtype=3, 4, -4, 6 and 13) and for double precision PARDISO (iparm(28)=0)
COMPLEX - for complex types of matrices (mtype=3, 4, -4, 6 and 13) and for single precision PARDISO (iparm(28)=1)
Array, dimension (n, nrhs). On entry, contains the right-hand side vector/matrix B, which is placed in memory contiguously. The b(i+(k-1)×n) element must hold the i-th component of the k-th right-hand side vector.
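The contiguous layout of b can be sketched in C as follows, assuming the column-major storage implied by the declared dimension (n, nrhs), so that consecutive right-hand sides are n elements apart. The function names rhs_offset and set_rhs_component are hypothetical helpers for illustration, and the indices here are zero-based as is usual in C.

```c
#include <stddef.h>

/* Zero-based offset of component i (0 .. n-1) of right-hand side k
   (0 .. nrhs-1) in a contiguous column-major n-by-nrhs array.
   This is the zero-based form of the one-based index i + (k-1)*n. */
size_t rhs_offset(size_t i, size_t k, size_t n)
{
    return i + k * n;
}

/* Store one component of one right-hand side in the flat array b. */
void set_rhs_component(double *b, size_t n, size_t i, size_t k, double value)
{
    b[rhs_offset(i, k, n)] = value;
}
```

For a system with n = 4 and nrhs = 3, the third component of the third right-hand side (i = 2, k = 2 in zero-based terms) lands at offset 10 of the flat array.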
Note that b is only accessed in the solution phase.

Output Parameters
(See also PARDISO Parameters in Tabular Form.)

pt
This parameter contains internal address pointers.

iparm
On output, some iparm values report information such as the numbers of non-zero elements in the factors.

iparm(7) - number of performed iterative refinement steps.
The number of iterative refinement steps that are actually performed during the solve step.

iparm(14) - number of perturbed pivots.
After factorization, iparm(14) contains the number of perturbed pivots during the elimination process for mtype = 11, mtype = 13, mtype = -2, mtype = -4, or mtype = 6.

iparm(15) - peak memory symbolic factorization.
The parameter iparm(15) reports the total peak memory in kilobytes that the solver needs during the analysis and symbolic factorization phase. This value is only computed in phase 1.

iparm(16) - permanent memory symbolic factorization.
The parameter iparm(16) reports the permanent memory in kilobytes from the analysis and symbolic factorization phase that the solver needs in the factorization and solve phases. This value is only computed in phase 1.

iparm(17) - size of factors/memory numerical factorization and solution.
The parameter iparm(17) provides the size in kilobytes of the total memory consumed by IC PARDISO for internal floating-point arrays. This parameter is computed in phase 1. See iparm(63) for the OOC mode.
The total peak memory consumption of the solver for all phases is max(iparm(15), iparm(16)+iparm(17)).

iparm(18) - number of non-zero elements in factors.
The solver reports the numbers of non-zero elements in the factors if iparm(18) < 0 on entry.

iparm(19) - MFLOPS of factorization.
The solver reports the number of operations in MFLOPS (1.0E6 operations) that are necessary to factor the matrix A if iparm(19) < 0 on entry.

iparm(20) - CG/CGS diagnostics.
The value is used to give CG/CGS diagnostics (for example, the number of iterations and the cause of failure):
If iparm(20) > 0, CGS succeeded, and the number of iterations executed is reported in iparm(20).
If iparm(20) < 0, iterations were executed, but CG/CGS failed. The error report details in iparm(20) are of the form: iparm(20) = - it_cgs*10 - cgs_error.
If phase = 23, then the factors L and U are recomputed for the matrix A and the error flag is error = 0 in case of a successful factorization. If phase = 33, then error = -4 signals failure.
The description of cgs_error is given in the table below:

cgs_error   Description
1           fluctuations of the residue are too large
2           ||dx_(max_it_cgs/2)|| is too large (slow convergence)
3           stopping criterion is not reached at max_it_cgs
4           perturbed pivots cause iterative refinement
5           factorization is too fast for this matrix. It is better to use the factorization method with iparm(4)=0

iparm(22) - inertia: number of positive eigenvalues.
The parameter iparm(22) reports the number of positive eigenvalues for symmetric indefinite matrices.

iparm(23) - inertia: number of negative eigenvalues.
The parameter iparm(23) reports the number of negative eigenvalues for symmetric indefinite matrices.

iparm(30) - the number of the equation where PARDISO detects a zero or negative pivot.
If the solver detects a zero or negative pivot for matrix types mtype = 2 (real positive definite matrix) and mtype = 4 (complex and Hermitian positive definite matrices), the factorization is stopped, PARDISO returns immediately with an error (error = -4), and iparm(30) contains the number of the equation where the first zero or negative pivot is detected.

iparm(63) - size of the minimum OOC memory for numerical factorization and solution.
The parameter iparm(63) provides the size in kilobytes of the minimum memory required by OOC PARDISO for internal floating-point arrays. This parameter is computed in phase 1.
Total peak memory consumption of OOC PARDISO can be estimated as max(iparm(15), iparm(16)+iparm(63)).

b
On output, the array is replaced with the solution if iparm(6) = 1.

x
DOUBLE PRECISION - for real types of matrices (mtype=1, 2, -2 and 11) and for double precision PARDISO (iparm(28)=0)
REAL - for real types of matrices (mtype=1, 2, -2 and 11) and for single precision PARDISO (iparm(28)=1)
DOUBLE COMPLEX - for complex types of matrices (mtype=3, 4, -4, 6 and 13) and for double precision PARDISO (iparm(28)=0)
COMPLEX - for complex types of matrices (mtype=3, 4, -4, 6 and 13) and for single precision PARDISO (iparm(28)=1)
Array, dimension (n, nrhs). If iparm(6) = 0, it contains the solution vector/matrix X, which is placed contiguously in memory. The x(i+(k-1)×n) element must hold the i-th component of the k-th solution vector. Note that x is only accessed in the solution phase.

error INTEGER
The error indicator according to the table below:

error   Information
0       no error
-1      input inconsistent
-2      not enough memory
-3      reordering problem
-4      zero pivot, numerical factorization or iterative refinement problem
-5      unclassified (internal) error
-6      reordering failed (matrix types 11 and 13 only)
-7      diagonal matrix is singular
-8      32-bit integer overflow problem
-9      not enough memory for OOC
-10     problems with opening OOC temporary files
-11     read/write problems with the OOC data file

See Also
mkl_progress

pardisoinit
Initialize PARDISO with default parameters in accordance with the matrix type.

Syntax
Fortran:
call pardisoinit (pt, mtype, iparm)
C:
pardisoinit (pt, &mtype, iparm);

Include Files
• FORTRAN 77: mkl_pardiso.f77
• Fortran 90: mkl_pardiso.f90
• C: mkl_pardiso.h

Description
This function initializes the PARDISO internal address pointer pt with zero values (as needed for the very first call of PARDISO) and sets default iparm values in accordance with the matrix type.
Intel MKL supplies the pardisoinit routine to be compatible with PARDISO 3.2 or lower distributed by the University of Basel.

NOTE An alternative way to set default PARDISO iparm values is to call pardiso with iparm(1)=0. In this case you must initialize the internal address pointer pt with zero values manually.

NOTE The pardisoinit routine initializes only the in-core version of PARDISO. Switching on the out-of-core version of PARDISO as well as changing default iparm values can be done after the call to pardisoinit but before the first call to pardiso.

Input Parameters

NOTE Parameter types in this section are specified in FORTRAN 77 notation. See the PARDISO Parameters in Tabular Form section for a detailed description of the types of PARDISO parameters in C/Fortran 90 notations.

mtype INTEGER
This scalar value defines the matrix type. Based on this value pardisoinit sets default values for the iparm array. Refer to the section PARDISO Parameters in Tabular Form for more details about the default values of PARDISO.

Output Parameters

pt
INTEGER for 32-bit architectures
INTEGER*8 for 64-bit architectures
Array, DIMENSION (64)
Solver internal data address pointer. These addresses are passed to the solver, and all related internal memory management is organized through this array. The pardisoinit routine nullifies the array pt.

NOTE It is very important that the pointer pt is initialized with zero before the first call of PARDISO. After that first call you should never modify the pointer, as a serious memory leak can occur.

iparm INTEGER
Array, dimension (64). This array is used to pass various parameters to PARDISO and to return some useful information after execution of the solver. The pardisoinit routine fills in the iparm array with the default values. Refer to the section PARDISO Parameters in Tabular Form for more details about the default values of PARDISO.
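The alternative to pardisoinit mentioned in the NOTE above (zeroing pt by hand and requesting defaults through iparm(1)=0) can be sketched in C as follows. This is an assumption-laden illustration: request_pardiso_defaults is a hypothetical helper, not an Intel MKL routine, and plain int stands in for the integer type that the MKL headers actually declare for iparm.

```c
#include <stddef.h>

/* Hypothetical helper (not part of Intel MKL): prepares the arguments for a
   first call to pardiso without using pardisoinit.
   pt    : 64 internal address pointers, all of which must start out as zero
   iparm : 64 control values; iparm[0] is the Fortran element iparm(1)     */
void request_pardiso_defaults(void *pt[64], int iparm[64])
{
    size_t i;

    for (i = 0; i < 64; ++i) {
        pt[i] = NULL;   /* the handle must be zero before the first call */
        iparm[i] = 0;   /* iparm(1) = 0 asks the solver to fill in defaults */
    }
}
```

After the first call to pardiso, pt must be left untouched, so a helper like this should run exactly once per handle.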
pardiso_64
Calculates the solution of a set of sparse linear equations with multiple right-hand sides, 64-bit integer version.

Syntax
Fortran:
call pardiso_64 (pt, maxfct, mnum, mtype, phase, n, a, ia, ja, perm, nrhs, iparm, msglvl, b, x, error)
C:
pardiso_64 (pt, &maxfct, &mnum, &mtype, &phase, &n, a, ia, ja, perm, &nrhs, iparm, &msglvl, b, x, &error);

Include Files
• FORTRAN 77: mkl_pardiso.f77
• Fortran 90: mkl_pardiso.f90
• C: mkl_pardiso.h

Description
pardiso_64 is an alternative ILP64 (64-bit integer) version of the pardiso routine (see Description section for more details). The interface of pardiso_64 is the same as the interface of pardiso, but it accepts and returns all INTEGER data as INTEGER*8.
Use the pardiso_64 interface for solving large matrices (with the number of non-zero elements on the order of 500 million or more). You can use it together with the usual LP64 interfaces for the rest of Intel MKL functionality. In other words, if you use the 64-bit integer version (pardiso_64), you do not need to re-link your applications with ILP64 libraries. Take into account that pardiso_64 may perform slower than regular pardiso in the reordering and symbolic factorization phase.

NOTE pardiso_64 is supported only in the 64-bit libraries. If pardiso_64 is called from the 32-bit libraries, it returns error = -12.

Input Parameters
The input parameters of pardiso_64 are the same as the input parameters of pardiso, but pardiso_64 accepts all INTEGER data as INTEGER*8.

Output Parameters
The output parameters of pardiso_64 are the same as the output parameters of pardiso, but pardiso_64 returns all INTEGER data as INTEGER*8.

See Also
mkl_progress

pardiso_getenv, pardiso_setenv
Retrieves additional values from the PARDISO handle or sets them in it.
Syntax
error = pardiso_getenv(handle, param, value)
error = pardiso_setenv(handle, param, value)
C:
_INTEGER_t pardiso_getenv (const _MKL_DSS_HANDLE_t handle, const enum PARDISO_ENV_PARAM* param, char* value);
_INTEGER_t pardiso_setenv (_MKL_DSS_HANDLE_t handle, const enum PARDISO_ENV_PARAM* param, const char* value);

Include Files
• FORTRAN 77: mkl_pardiso.f77
• Fortran 90: mkl_pardiso.f90
• C: mkl_pardiso.h

NOTE pardiso_setenv requires the value parameter to be converted to a string in C notation if it is called from Fortran. You can do this using the mkl_cvt_to_null_terminated_str subroutine declared in the mkl_dss.f77 or mkl_dss.f90 include files (see example below).

Description
These functions operate on the PARDISO handle. The pardiso_getenv routine retrieves additional values from the PARDISO handle, and pardiso_setenv sets specified values in the PARDISO handle.
These functions enable retrieving and setting the name of the PARDISO OOC file.
To retrieve the PARDISO OOC file name, you can apply this function to any non-empty handle.
To set the PARDISO OOC file name in the handle, you must apply the function before the reordering stage. That is, you must apply the function only to an empty handle. This is because the OOC file name is stored in the handle after the reordering stage and it is not changed during further computations.

NOTE A 1024-byte internal buffer is used inside PARDISO for storing the OOC file name. Allocate a 1024-byte buffer (value parameter) for passing it to the pardiso_getenv function.

Input Parameters

handle
Input parameter for pardiso_getenv. Data object of the MKL_DSS_HANDLE type (see DSS Interface Description).

param
INTEGER. Specifies the required parameter. The only value is PARDISO_OOC_FILE_NAME, defined in the corresponding include file.

value
Input parameter for pardiso_setenv. STRING. Contains the name of the OOC file that must be used in the handle.
Output Parameters value Output parameter for pardiso_getenv. STRING. Contains the name of the OOC file that is used in the handle. handle Output parameter for pardiso_setenv. Data object of the MKL_DSS_HANDLE type (see DSS Interface Description). Example (FORTRAN 90) INCLUDE 'mkl_pardiso.f90' INCLUDE 'mkl_dss.f90' PROGRAM pardiso_sym_f90 USE mkl_pardiso USE mkl_dss INTEGER*8 pt(64) CHARACTER*1024 file_name INTEGER buff(256), bufLen, error pt(1:64) = 0 file_name = 'pardiso_ooc_file' bufLen = len_trim(file_name) call mkl_cvt_to_null_terminated_str(buff, bufLen, trim(file_name)) error = pardiso_setenv(pt, PARDISO_OOC_FILE_NAME, buff) ! call pardiso() here END PROGRAM PARDISO Parameters in Tabular Form The following table lists all parameters of PARDISO and gives their brief descriptions. Sparse Solver Routines 8 1905 Paramet er Type Description Values Comments In/ Out pt(64) FORTRAN 77: INTEGER on 32-bit architectures, INTEGER*8 on 64- bit architectures Fortran 90: TYPE(MKL_PARDISO _HANDLE), INTENT(INOUT) C: void* Solver internal data address pointer 0 Must be initialized by zeros and never be modified later in/ out maxfct FORTRAN 77: INTEGER Fortran 90: INTEGER, INTENT(IN) C: _INTEGER_t* Maximal number of factors in memory >0 Generally used value is 1 in mnum FORTRAN 77: INTEGER Fortran 90: INTEGER, INTENT(IN) C: _INTEGER_t* The number of matrix (from 1 to maxfct) to solve [1; maxfct] Generally used value is 1 in mtype FORTRAN 77: INTEGER Fortran 90: INTEGER, INTENT(IN) C: _INTEGER_t* Matrix type 1 Real and structurally symmetric in 2 Real and symmetric positive definite -2 Real and symmetric indefinite 3 Complex and structurally symmetric 4 Complex and Hermitian positive definite -4 Complex and Hermitian indefinite 6 Complex and symmetric matrix 11 Real and unsymmetric matrix 13 Complex and unsymmetric matrix phase FORTRAN 77: INTEGER Fortran 90: INTEGER, INTENT(IN) C: _INTEGER_t* Controls the execution of the solver 11 Analysis in 12 Analysis, numerical 
factorization 13 Analysis, numerical factorization, solve 22 Numerical factorization 8 Intel® Math Kernel Library Reference Manual 1906 Paramet er Type Description Values Comments In/ Out 23 Numerical factorization, solve 33 Solve, iterative refinement 331 phase=33, but only forward substitution 332 phase=33, but only diagonal substitution 333 phase=33, but only backward substitution 0 Release internal memory for L and U of the matrix number mnum -1 Release all internal memory for all matrices n FORTRAN 77: INTEGER Fortran 90: INTEGER, INTENT(IN) C: _INTEGER_t* Number of equations in the sparse linear system A*X = B >0 in a(*) FORTRAN 77: PARDISO_DATA_TYP E1) Fortran 90: PARDISO_DATA_TYP E1), INTENT(IN) C: void* Contains the nonzero elements of the coefficient matrix A * The size of a is the same as that of ja, and the coefficient matrix can be either real or complex. The matrix must be stored in CSR format with increasing values of ja for each row in ia(n +1) FORTRAN 77: INTEGER Fortran 90: INTEGER, INTENT(IN) C: _INTEGER_t* rowIndex array in CSR format >=0 0 < ia(i) <= ia(i+1) ia(i) gives the index of the element in array a that contains the first non-zero element from row i of A. The last element ia(n+1) is taken to be equal to the number of non-zero elements in A, plus one. Note: iparm(35) indicates whether row/column indexing starts from 1 or 0. in ja(*) FORTRAN 77: INTEGER Fortran 90: INTEGER, INTENT(IN) C: _INTEGER_t* columns array in CSR format >=0 The indices in each row must be sorted in increasing order. For symmetric and structurally symmetric matrices zero diagonal elements are also stored in a and ja. For symmetric in Sparse Solver Routines 8 1907 Paramet er Type Description Values Comments In/ Out matrices, the solver needs only the upper triangular part of the system. Note: iparm(35) indicates whether row/column indexing starts from 1 or 0. 

perm(n)
    Type: FORTRAN 77: INTEGER; Fortran 90: INTEGER, INTENT(INOUT); C: _INTEGER_t*
    Description: Holds the permutation vector of size n.
    Values: >=0. Let B = P*A*P^T be the permuted matrix. Row (column) i of A is the perm(i) row (column) of B. The numbering of the array must describe a permutation. You can apply your own fill-in reducing ordering (iparm(5)=1) or return the permutation from the solver (iparm(5)=2). Note: iparm(35) indicates whether row/column indexing starts from 1 or 0.
    In/Out: in/out

nrhs
    Type: FORTRAN 77: INTEGER; Fortran 90: INTEGER, INTENT(IN); C: _INTEGER_t*
    Description: Number of right-hand sides that need to be solved for.
    Values: >=0. Generally used value is 1.
    In/Out: in

iparm(64)
    Type: FORTRAN 77: INTEGER; Fortran 90: INTEGER, INTENT(INOUT); C: _INTEGER_t*
    Description: This array is used to pass various parameters to PARDISO and to return some useful information after execution of the solver.
    Values: *. If iparm(1)=0, PARDISO fills iparm(2) through iparm(64) with default values and uses them.
    In/Out: in/out

msglvl
    Type: FORTRAN 77: INTEGER; Fortran 90: INTEGER, INTENT(IN); C: _INTEGER_t*
    Description: Message level information.
    Values:
        0     PARDISO generates no output
        1     PARDISO prints statistical information
    In/Out: in

b(n*nrhs)
    Type: FORTRAN 77: PARDISO_DATA_TYPE 1); Fortran 90: PARDISO_DATA_TYPE 1), INTENT(INOUT); C: void*
    Description: Right-hand side vectors.
    Values: *. On entry, contains the right-hand side vector/matrix B, which is placed contiguously in memory. The b(i+(k-1)×n) element must hold the i-th component of the k-th right-hand side vector. Note that b is only accessed in the solution phase. On output, the array is replaced with the solution if iparm(6)=1.
    In/Out: in/out

x(n*nrhs)
    Type: FORTRAN 77: PARDISO_DATA_TYPE 1); Fortran 90: PARDISO_DATA_TYPE 1), INTENT(OUT); C: void*
    Description: Solution vectors.
    Values: *. On output, if iparm(6)=0, contains the solution vector/matrix X, which is placed contiguously in memory. The x(i+(k-1)×n) element holds the i-th component of the k-th solution vector. Note that x is only accessed in the solution phase.
    In/Out: out

error
    Type: FORTRAN 77: INTEGER; Fortran 90: INTEGER, INTENT(OUT); C: _INTEGER_t*
    Description: Error indicator.
    Values:
        0     No error
        -1    Input inconsistent
        -2    Not enough memory
        -3    Reordering problem
        -4    Zero pivot, numerical factorization or iterative refinement problem
        -5    Unclassified (internal) error
        -6    Reordering failed (matrix types 11 and 13 only)
        -7    Diagonal matrix is singular
        -8    32-bit integer overflow problem
        -9    Not enough memory for OOC
        -10   Problems with opening OOC temporary files
        -11   Read/write problems with the OOC data file
    In/Out: out

1) See description of PARDISO_DATA_TYPE in the table below.

The following table lists the values of PARDISO_DATA_TYPE depending on the matrix types and values of the parameter iparm(28).

    Data type value     Matrix type mtype    iparm(28)    Comments
    DOUBLE PRECISION    1, 2, -2, 11         0            Real matrices, double precision
    REAL                1, 2, -2, 11         1            Real matrices, single precision
    DOUBLE COMPLEX      3, 6, 13, 4, -4      0            Complex matrices, double precision
    COMPLEX             3, 6, 13, 4, -4      1            Complex matrices, single precision

The following table lists all individual components of the PARDISO iparm() parameter and their brief descriptions. Components not listed in the table must be initialized with 0. Default values are marked with an asterisk (for example, 0*).

INPUT, INPUT/OUTPUT PARAMETERS

iparm(1): Use default values
    0       iparm(2) - iparm(64) are filled with default values.
    !=0     You must supply all values in components iparm(2) - iparm(64).
    In/Out: in

iparm(2): Fill-in reducing ordering for the input matrix
    0       The minimum degree algorithm.
    2*      The nested dissection algorithm from the METIS package.
    3       The parallel (OpenMP) version of the nested dissection algorithm.
    In/Out: in

iparm(4): Preconditioned CGS/CG
    In/Out: in
    0*      Do not perform preconditioned Krylov subspace iterations.
    10*L+1  CGS iteration replaces the computation of LU.
            The preconditioner is LU that is computed at the previous step (the first step or last step with a failure) in a sequence of solutions needed for identical sparsity patterns. L controls the stopping criterion of the Krylov subspace iteration: epsCGS = 10^(-L) is used in the stopping criterion ||dxi|| / ||dx0|| < epsCGS, where ||dxi|| = ||inv(L*U)*ri|| and ri is the residual at iteration i of the preconditioned Krylov subspace iteration.
    10*L+2  Same as above, but CG iteration replaces the computation of LU. Designed for symmetric positive definite matrices.

iparm(5): User permutation
    0*      User permutation in perm array is ignored.
    1       PARDISO uses the user-supplied fill-in reducing permutation from perm array. iparm(2) is ignored.
    2       PARDISO returns the permutation vector computed at phase 1 into perm array.
    In/Out: in

iparm(6): Write solution on x
    0*      The array x contains the solution; right-hand side vector b is kept unchanged.
    1       The solver stores the solution on the right-hand side b. Note that the array x is always used.
    In/Out: in

iparm(8): Iterative refinement step
    0*      The solver automatically performs two steps of iterative refinement when perturbed pivots are obtained during the numerical factorization.
    >0      Maximum number of iterative refinement steps that the solver performs. The solver performs not more than the absolute value of iparm(8) steps of iterative refinement and stops the process if a satisfactory level of accuracy of the solution in terms of backward error is achieved. The number of executed iterations is reported in iparm(7).
    <0      Same as above, but the accumulation of the residual uses extended precision real and complex data types.
    In/Out: in

iparm(10): Pivoting perturbation
    *       This parameter instructs PARDISO how to handle small pivots or zero pivots for unsymmetric matrices (mtype=11, mtype=13) and symmetric matrices (mtype=-2, mtype=-4, mtype=6). Small pivots are perturbed with eps = 10^(-iparm(10)).
    13*     The default value for unsymmetric matrices (mtype=11, mtype=13), eps = 10^(-13).
    8*      The default value for symmetric indefinite matrices (mtype=-2, mtype=-4, mtype=6), eps = 10^(-8).
    In/Out: in

iparm(11): Scaling
    0*      Disable scaling. Default for symmetric indefinite matrices.
    1*      Enable scaling. Default for unsymmetric matrices. Scale the matrix so that the diagonal elements are equal to 1 and the absolute values of the off-diagonal entries are less than or equal to 1. This scaling method is applied to unsymmetric matrices (mtype=11, mtype=13). The scaling can also be used for symmetric indefinite matrices (mtype=-2, mtype=-4, mtype=6) when the symmetric weighted matchings are applied (iparm(13)=1). Note that in the analysis phase (phase=11) you must provide the numerical values of the matrix A in case of scaling.
    In/Out: in

iparm(12): Solving with transposed or conjugate transposed matrix A
    0*      Solve a linear system A*x = b.
    1       Solve a conjugate transposed system A^H*x = b based on the factorization of the matrix A.
    2       Solve a transposed system A^T*x = b based on the factorization of the matrix A.

iparm(13): Improved accuracy using (non-)symmetric weighted matching
    0*      Disable matching. Default for symmetric indefinite matrices.
    1*      Enable matching. Default for unsymmetric matrices. Maximum weighted matching algorithm to permute large elements close to the diagonal. It is recommended to use iparm(11)=1 (scaling) and iparm(13)=1 (matching) for highly indefinite symmetric matrices, for example from interior point optimizations or saddle point problems. Note that in the analysis phase (phase=11) you must provide the numerical values of the matrix A in case of symmetric weighted matching.
    In/Out: in

iparm(18): Report the number of non-zero elements in the factors
    <0      Enable reporting if iparm(18) < 0 on entry. The default value is -1.
    >=0     Disable reporting.
    In/Out: in/out

iparm(19): Report Mflops that are necessary to factor the matrix A
    <0      Enable report if iparm(19) < 0 on entry. This increases the reordering time.
    >=0     Disable report. 0 is the default value.
    In/Out: in/out

iparm(21): Pivoting for symmetric indefinite matrices
    0       Apply 1x1 diagonal pivoting during the factorization process.
    1*      Apply 1x1 and 2x2 Bunch and Kaufman pivoting during the factorization process.
    In/Out: in

iparm(24): Parallel factorization control
    0*      PARDISO uses the previous algorithm for factorization. Default value.
    1       PARDISO uses the new two-level factorization algorithm.
    In/Out: in

iparm(25): Parallel forward/backward solve control
    0*      PARDISO uses the parallel algorithm for the solve step. Default value.
    1       PARDISO uses the sequential forward and backward solve.
    In/Out: in

iparm(27): Matrix checker
    0*      PARDISO does not check the sparse matrix representation.
    1       PARDISO checks integer arrays ia and ja. In particular, PARDISO checks whether column indices are sorted in increasing order within each row.
    In/Out: in

iparm(28): Single or double precision of PARDISO
    0*      Input arrays (a, x and b) and all internal arrays must be in double precision.
    1       Input arrays (a, x and b) must be in single precision. In this case all internal computations are performed in single precision.
    In/Out: in

iparm(31): Partial solve for sparse right-hand sides and sparse solution
    0*      Disables this option. Default value.
    1       The right-hand side is assumed to be sparse, perm(i)=1 means that the i-th component of the right-hand side is nonzero, and this component of the solution vector is computed.
    2       The right-hand side is assumed to be sparse, perm(i)=1 means that the i-th component of the right-hand side is nonzero, and all components of the solution vector are computed.
    3       The right-hand side can be of any type. If perm(i)=1, the i-th component of the solution vector is computed.
    In/Out: in

iparm(35): One- or zero-based indexing of columns and rows
    0*      One-based indexing: columns and rows indexing in arrays ia, ja, and perm starts from 1. Default value.
    1       Zero-based indexing: columns and rows indexing in arrays ia, ja, and perm starts from 0.
    In/Out: in

iparm(60): PARDISO mode
    0*      In-core PARDISO.
    1       In-core PARDISO is used if the total memory needed for storing the matrix factors is less than the value of the environment variable MKL_PARDISO_OOC_MAX_CORE_SIZE. Otherwise out-of-core (OOC) PARDISO is used.
    2       Out-of-core (OOC) PARDISO. The OOC PARDISO can solve very large problems by holding the matrix factors in files on the disk. Hence the amount of RAM required by OOC PARDISO is significantly reduced.
    In/Out: in

OUTPUT PARAMETERS

iparm(7): Number of performed iterative refinement steps
    >=0     Reports the number of iterative refinement steps that were actually performed during the solve step.
    In/Out: out

iparm(14): Number of perturbed pivots
    >=0     After factorization, contains the number of perturbed pivots for the matrix types 11, 13, -2, -4 and 6.
    In/Out: out

iparm(15): Peak memory on symbolic factorization
    >0 KB   The total peak memory in kilobytes that the solver needs during the analysis and symbolic factorization phase. This value is only computed in phase 1.
    In/Out: out

iparm(16): Permanent memory on symbolic factorization
    >0 KB   Permanent memory from the analysis and symbolic factorization phase in kilobytes that the solver needs in the factorization and solve phases. This value is only computed in phase 1.
    In/Out: out

iparm(17): Size of factors/Peak memory on numerical factorization and solution
    >0 KB   This parameter provides the size in kilobytes of the total memory consumed by in-core PARDISO for internal floating-point arrays. This parameter is computed in phase 1. See iparm(63) for the OOC mode. The total peak memory consumed by PARDISO is max(iparm(15), iparm(16)+iparm(17)).
    In/Out: out

iparm(20): CG/CGS diagnostics
    >0      CGS succeeded, reports the number of completed iterations.
    <0      CG/CGS failed (error=-4 after the solution phase). iparm(20) = - it_cgs*10 - cgs_error. Possible values of cgs_error:
            1 - fluctuations of the residual are too large
            2 - ||dx at max_it_cgs/2|| is too large (slow convergence)
            3 - stopping criterion is not reached at max_it_cgs
            4 - perturbed pivots cause iterative refinement
    In/Out: out

iparm(22): Inertia: number of positive eigenvalues
    >=0     PARDISO reports the number of positive eigenvalues for symmetric indefinite matrices.
    In/Out: out

iparm(23): Inertia: number of negative eigenvalues
    >=0     PARDISO reports the number of negative eigenvalues for symmetric indefinite matrices.
    In/Out: out

iparm(30): Number of zero or negative pivots
    >=0     If PARDISO detects a zero or negative pivot for mtype=2 or mtype=4, the factorization is stopped, PARDISO returns immediately with error = -4, and iparm(30) reports the number of the equation where the first zero or negative pivot is detected.
    In/Out: out

iparm(63): Size of the minimum OOC memory for numerical factorization and solution
    >0 KB   This parameter provides the size in kilobytes of the minimum memory required by OOC PARDISO for internal floating-point arrays. This parameter is computed in phase 1. Total peak memory consumption of OOC PARDISO can be estimated as max(iparm(15), iparm(16)+iparm(63)).

Direct Sparse Solver (DSS) Interface Routines

Intel MKL supports the DSS interface, an alternative to the PARDISO* interface for the direct sparse solver. The DSS interface implements a group of user-callable routines that are used in the step-by-step solving process and follows the general scheme described in Appendix A Linear Solvers Basics for solving sparse systems of linear equations. This interface also includes one routine for gathering statistics related to the solving process and an auxiliary routine for passing character strings from Fortran routines to C routines.
The current implementation of the DSS interface additionally supports the out-of-core (OOC) mode.

Table "DSS Interface Routines" lists the names of the routines and describes their general use.

DSS Interface Routines

dss_create
    Initializes the solver and creates the basic data structures necessary for the solver. This routine must be called before any other DSS routine.
dss_define_structure
    Informs the solver of the locations of the non-zero elements of the matrix.
dss_reorder
    Based on the non-zero structure of the matrix, computes a permutation vector to reduce fill-in during the factoring process.
dss_factor_real, dss_factor_complex
    Computes the LU, LDL^T or LL^T factorization of a real or complex matrix.
dss_solve_real, dss_solve_complex
    Computes the solution vector for a system of equations based on the factorization computed in the previous phase.
dss_delete
    Deletes all data structures created during the solving process.
dss_statistics
    Returns statistics about various phases of the solving process.
mkl_cvt_to_null_terminated_str
    Passes character strings from Fortran routines to C routines.

To find a single solution vector for a single system of equations with a single right-hand side, invoke the Intel MKL DSS interface routines in this order:

1. dss_create
2. dss_define_structure
3. dss_reorder
4. dss_factor_real, dss_factor_complex
5. dss_solve_real, dss_solve_complex
6. dss_delete

However, in certain applications it is necessary to produce solution vectors for multiple right-hand sides for a given factorization and/or factor several matrices with the same non-zero structure. Consequently, it is sometimes necessary to invoke the Intel MKL sparse routines in an order other than that listed, which is possible using the DSS interface. The solving process is conceptually divided into six phases.
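The ordering rule above can be made concrete with a small model. The following C sketch is not part of Intel MKL and calls no MKL routine. It encodes one reading of the text (create, define structure, reorder, factor, solve, delete, with repeated solves and re-factorization allowed once a reordering exists) as a state machine. All names in it (dss_call, dss_state, next_state, sequence_is_valid) are hypothetical, and the library itself may accept or reject other orders.

```c
#include <assert.h>

/* Hypothetical names for the six DSS phases; these are NOT MKL identifiers. */
enum dss_call {
    CALL_CREATE, CALL_DEFINE_STRUCTURE, CALL_REORDER,
    CALL_FACTOR, CALL_SOLVE, CALL_DELETE
};

enum dss_state {
    ST_NONE, ST_CREATED, ST_STRUCTURED, ST_REORDERED,
    ST_FACTORED, ST_SOLVED, ST_STATE_ERROR
};

/* Return the state reached by issuing call c in state s (assumed rules). */
enum dss_state next_state(enum dss_state s, enum dss_call c)
{
    switch (c) {
    case CALL_CREATE:
        return s == ST_NONE ? ST_CREATED : ST_STATE_ERROR;
    case CALL_DEFINE_STRUCTURE:
        return s == ST_CREATED ? ST_STRUCTURED : ST_STATE_ERROR;
    case CALL_REORDER:
        return s == ST_STRUCTURED ? ST_REORDERED : ST_STATE_ERROR;
    case CALL_FACTOR:
        /* A new matrix with the same non-zero structure may be factored again. */
        return (s == ST_REORDERED || s == ST_FACTORED || s == ST_SOLVED)
                   ? ST_FACTORED : ST_STATE_ERROR;
    case CALL_SOLVE:
        /* Several right-hand sides may reuse one factorization. */
        return (s == ST_FACTORED || s == ST_SOLVED) ? ST_SOLVED : ST_STATE_ERROR;
    case CALL_DELETE:
        return (s != ST_NONE && s != ST_STATE_ERROR) ? ST_NONE : ST_STATE_ERROR;
    }
    return ST_STATE_ERROR;
}

/* Return 1 if every call in the sequence is accepted by the model, else 0. */
int sequence_is_valid(const enum dss_call *calls, int n)
{
    enum dss_state s = ST_NONE;
    assert(calls != 0 || n == 0);
    for (int i = 0; i < n; ++i) {
        s = next_state(s, calls[i]);
        if (s == ST_STATE_ERROR)
            return 0;
    }
    return 1;
}
```

Passing the six calls in the listed order to sequence_is_valid returns 1. A sequence that skips CALL_REORDER returns 0, which in this model plays the role of the MKL_DSS_STATE_ERR return value.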
Figure "Typical order for invoking DSS interface routines" indicates the typical order in which the DSS interface routines can be invoked.

[Figure: Typical order for invoking DSS interface routines]

See the code examples that use the DSS interface routines to solve systems of linear equations in the examples\solver\source folder of your Intel MKL directory (dss_sym_f.f, dss_sym_c.c, dss_sym_f90.f90).

DSS Interface Description

Each DSS routine reads from or writes to a data object called a handle. Refer to Memory Allocation and Handles to determine the correct method for declaring a handle argument for each language. For simplicity, the descriptions in DSS routines refer to the data type as MKL_DSS_HANDLE. C and C++ programmers should refer to Calling Sparse Solver and Preconditioner Routines from C/C++ for information on mapping Fortran types to C/C++ types.

Routine Options

The DSS routines have an integer argument (referred to below as opt) for passing various options to the routines. The permissible values for opt should be specified using only the symbolic constants defined in the language-specific header files (see Implementation Details). The routines accept options for setting the message and termination levels as described in Table "Symbolic Names for the Message and Termination Levels Options". Additionally, each routine accepts the option MKL_DSS_DEFAULTS, which sets the default values (as documented) for opt.

Symbolic Names for the Message and Termination Levels Options

    Message Level               Termination Level
    MKL_DSS_MSG_LVL_SUCCESS     MKL_DSS_TERM_LVL_SUCCESS
    MKL_DSS_MSG_LVL_INFO        MKL_DSS_TERM_LVL_INFO
    MKL_DSS_MSG_LVL_WARNING     MKL_DSS_TERM_LVL_WARNING
    MKL_DSS_MSG_LVL_ERROR       MKL_DSS_TERM_LVL_ERROR
    MKL_DSS_MSG_LVL_FATAL       MKL_DSS_TERM_LVL_FATAL

The settings for message and termination levels can be set on any call to a DSS routine.
However, once set to a particular level, they remain at that level until they are changed in another call to a DSS routine.

You can specify both message and termination level for a DSS routine by adding the options together. For example, to set the message level to info and the termination level to error for all the DSS routines, use the following call:

    CALL dss_create(handle, MKL_DSS_MSG_LVL_INFO + MKL_DSS_TERM_LVL_ERROR)

User Data Arrays

Many of the DSS routines take arrays of user data as input. For example, user arrays are passed to the routine dss_define_structure to describe the location of the non-zero entries in the matrix. To minimize storage requirements and improve overall run-time efficiency, the Intel MKL DSS routines do not make copies of the user input arrays.

WARNING Do not modify the contents of these arrays after they are passed to one of the solver routines.

DSS Routines

dss_create
Initializes the solver.

Syntax
C: dss_create(handle, opt)
Fortran: call dss_create(handle, opt)

Include Files
• FORTRAN 77: mkl_dss.f77
• Fortran 90: mkl_dss.f90
• C: mkl_dss.h

Description
The dss_create routine initializes the solver. After the call to dss_create, all subsequent invocations of the Intel MKL DSS routines must use the value of the handle returned by dss_create.

WARNING Do not write the value of handle directly.

The default value of the parameter opt is MKL_DSS_MSG_LVL_WARNING + MKL_DSS_TERM_LVL_ERROR.

By default, the DSS routines use double precision for solving systems of linear equations. The precision used by the DSS routines can be set to single mode by adding the following value to the opt parameter: MKL_DSS_SINGLE_PRECISION. As for PARDISO, input data and internal arrays are then required to have single precision.

By default, the DSS routines use Fortran style indexing for input arrays of integer types (the first value is referenced as array element 1).
The indexing can be set to C style (the first value is referenced as array element 0) by adding the following value to the opt parameter: MKL_DSS_ZERO_BASED_INDEXING.

This parameter can also control the number of refinement steps used on the solution stage by specifying one of the two following values:
• MKL_DSS_REFINEMENT_OFF: the maximum number of refinement steps is set to zero.
• MKL_DSS_REFINEMENT_ON (default value): the maximum number of refinement steps is set to 2.

By default, DSS uses in-core computations. To launch the out-of-core version of DSS (OOC DSS) you can add to this parameter one of two possible values:
• MKL_DSS_OOC_STRONG: OOC DSS is used.
• MKL_DSS_OOC_VARIABLE: if the memory needed for the matrix factors is less than the value of the environment variable MKL_PARDISO_OOC_MAX_CORE_SIZE, then the OOC DSS uses the in-core kernels of PARDISO, otherwise it uses the OOC computations.

The variable MKL_PARDISO_OOC_MAX_CORE_SIZE defines the maximum size of RAM allowed for storing work arrays associated with the matrix factors. It is ignored if MKL_DSS_OOC_STRONG is set. The default value of MKL_PARDISO_OOC_MAX_CORE_SIZE is 2000 MB. This value and the default path and file name for storing temporary data can be changed using the configuration file pardiso_ooc.cfg or the command line (see more details in the pardiso description above).

WARNING Do not change the OOC DSS settings after they are specified in the routine dss_create.

Input Parameters

opt
    Type: FORTRAN 77: INTEGER; Fortran 90: INTEGER, INTENT(IN); C: _INTEGER_t const*
    Description: Parameter to pass the DSS options. The default value is MKL_DSS_MSG_LVL_WARNING + MKL_DSS_TERM_LVL_ERROR.

Output Parameters

handle
    Type: FORTRAN 77: INTEGER*8; Fortran 90: TYPE (MKL_DSS_HANDLE), INTENT(OUT); C: _MKL_DSS_HANDLE_t*
    Description: Pointer to the data structure storing intermediate DSS results (MKL_DSS_HANDLE).
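As a rough illustration of the MKL_DSS_OOC_VARIABLE rule given in the Description above, the sketch below compares the memory needed for the matrix factors with the core-size limit. It is a model under stated assumptions, not MKL code: the names dss_mode_sketch, sketch_max_core_mb and sketch_choose_mode are hypothetical, the limit is assumed to be a whole number of megabytes, and falling back to the default on an unparsable value is a guess about behavior the text does not specify.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical model of the in-core / out-of-core choice; not an MKL type. */
enum dss_mode_sketch { SKETCH_IN_CORE, SKETCH_OUT_OF_CORE };

/* Default limit quoted in the text: 2000 MB. */
#define SKETCH_DEFAULT_MAX_CORE_MB 2000L

/* Parse the limit from a string such as the value of the environment
   variable MKL_PARDISO_OOC_MAX_CORE_SIZE. Assumption: a missing or
   unparsable value falls back to the documented default. */
long sketch_max_core_mb(const char *env_value)
{
    if (env_value != NULL) {
        char *end = NULL;
        long mb = strtol(env_value, &end, 10);
        if (end != env_value && *end == '\0' && mb > 0)
            return mb;
    }
    return SKETCH_DEFAULT_MAX_CORE_MB;
}

/* ooc_strong != 0 models MKL_DSS_OOC_STRONG (the limit is ignored);
   otherwise the MKL_DSS_OOC_VARIABLE comparison is applied. */
enum dss_mode_sketch sketch_choose_mode(int ooc_strong, long factor_mb,
                                        long max_core_mb)
{
    assert(factor_mb >= 0 && max_core_mb > 0);
    if (ooc_strong)
        return SKETCH_OUT_OF_CORE;
    return factor_mb < max_core_mb ? SKETCH_IN_CORE : SKETCH_OUT_OF_CORE;
}
```

A caller could obtain the limit with sketch_max_core_mb(getenv("MKL_PARDISO_OOC_MAX_CORE_SIZE")). The configuration file pardiso_ooc.cfg mentioned in the text is not modeled here.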

Return Values
MKL_DSS_SUCCESS
MKL_DSS_INVALID_OPTION
MKL_DSS_OUT_OF_MEMORY
MKL_DSS_MSG_LVL_ERR
MKL_DSS_TERM_LVL_ERR

dss_define_structure
Communicates locations of non-zero elements in the matrix to the solver.

Syntax
C: dss_define_structure(handle, opt, rowIndex, nRows, nCols, columns, nNonZeros)
Fortran: call dss_define_structure(handle, opt, rowIndex, nRows, nCols, columns, nNonZeros)

Include Files
• FORTRAN 77: mkl_dss.f77
• Fortran 90: mkl_dss.f90
• C: mkl_dss.h

Description
The routine dss_define_structure communicates the locations of the nNonZeros non-zero elements in a matrix of size nRows * nCols to the solver. The Intel MKL DSS software operates only on square matrices, so nRows must be equal to nCols.

To communicate the locations of non-zero elements in the matrix, do the following:

1. Define the general non-zero structure of the matrix by specifying the value for the options argument opt. You can set the following values for real matrices:
   • MKL_DSS_SYMMETRIC_STRUCTURE
   • MKL_DSS_SYMMETRIC
   • MKL_DSS_NON_SYMMETRIC
   and for complex matrices:
   • MKL_DSS_SYMMETRIC_STRUCTURE_COMPLEX
   • MKL_DSS_SYMMETRIC_COMPLEX
   • MKL_DSS_NON_SYMMETRIC_COMPLEX
   The information about the matrix type must be defined in dss_define_structure.
2. Provide the actual locations of the non-zeros by means of the arrays rowIndex and columns (see Sparse Matrix Storage Format).

Input Parameters

opt
    Type: FORTRAN 77: INTEGER; Fortran 90: INTEGER, INTENT(IN); C: _INTEGER_t const*
    Description: Parameter to pass the DSS options. The default value for the matrix structure is MKL_DSS_SYMMETRIC.
rowIndex
    Type: FORTRAN 77: INTEGER; Fortran 90: INTEGER, INTENT(IN); C: _INTEGER_t const*
    Description: Array of size min(nRows, nCols)+1. Defines the location of non-zero entries in the matrix.
nRows
    Type: FORTRAN 77: INTEGER; Fortran 90: INTEGER, INTENT(IN); C: _INTEGER_t const*
    Description: Number of rows in the matrix.
nCols
    Type: FORTRAN 77: INTEGER; Fortran 90: INTEGER, INTENT(IN); C: _INTEGER_t const*
    Description: Number of columns in the matrix.
columns
    Type: FORTRAN 77: INTEGER; Fortran 90: INTEGER, INTENT(IN); C: _INTEGER_t const*
    Description: Array of size nNonZeros. Defines the location of non-zero entries in the matrix.
nNonZeros
    Type: FORTRAN 77: INTEGER; Fortran 90: INTEGER, INTENT(IN); C: _INTEGER_t const*
    Description: Number of non-zero elements in the matrix.

Output Parameters

handle
    Type: FORTRAN 77: INTEGER*8; Fortran 90: TYPE (MKL_DSS_HANDLE), INTENT(INOUT); C: _MKL_DSS_HANDLE_t*
    Description: Pointer to the data structure storing intermediate DSS results (MKL_DSS_HANDLE).

Return Values
MKL_DSS_SUCCESS
MKL_DSS_STATE_ERR
MKL_DSS_INVALID_OPTION
MKL_DSS_STRUCTURE_ERR
MKL_DSS_ROW_ERR
MKL_DSS_COL_ERR
MKL_DSS_NOT_SQUARE
MKL_DSS_TOO_FEW_VALUES
MKL_DSS_TOO_MANY_VALUES
MKL_DSS_OUT_OF_MEMORY
MKL_DSS_MSG_LVL_ERR
MKL_DSS_TERM_LVL_ERR

dss_reorder
Computes or sets a permutation vector that minimizes the fill-in during the factorization phase.

Syntax
C: dss_reorder(handle, opt, perm)
Fortran: call dss_reorder(handle, opt, perm)

Include Files
• FORTRAN 77: mkl_dss.f77
• Fortran 90: mkl_dss.f90
• C: mkl_dss.h

Description
If opt contains the option MKL_DSS_AUTO_ORDER, then the routine dss_reorder computes a permutation vector that minimizes the fill-in during the factorization phase. For this option, the routine ignores the contents of the perm array.

If opt contains the option MKL_DSS_METIS_OPENMP_ORDER, then the routine dss_reorder computes the permutation vector using the parallel (OpenMP) nested dissection algorithm to minimize the fill-in during the factorization phase. This option can be used to decrease the time of the dss_reorder call on multi-core computers. For this option, the routine ignores the contents of the perm array.

If opt contains the option MKL_DSS_MY_ORDER, then you must supply a permutation vector in the array perm.
In this case, the array perm is of length nRows, where nRows is the number of rows in the matrix as defined by the previous call to dss_define_structure.

If opt contains the option MKL_DSS_GET_ORDER, then the permutation vector computed during the dss_reorder call is copied to the array perm. In this case you must allocate the array perm beforehand. The permutation vector is computed in the same way as if the option MKL_DSS_AUTO_ORDER is set.

Input Parameters

opt
    Type: FORTRAN 77: INTEGER; Fortran 90: INTEGER, INTENT(IN); C: _INTEGER_t const*
    Description: Parameter to pass the DSS options. The default value for the permutation type is MKL_DSS_AUTO_ORDER.
perm
    Type: FORTRAN 77: INTEGER; Fortran 90: INTEGER, INTENT(IN); C: _INTEGER_t const*
    Description: Array of length nRows. Contains a user-defined permutation vector (accessed only if opt contains MKL_DSS_MY_ORDER or MKL_DSS_GET_ORDER).

Output Parameters

handle
    Type: FORTRAN 77: INTEGER*8; Fortran 90: TYPE (MKL_DSS_HANDLE), INTENT(INOUT); C: _MKL_DSS_HANDLE_t*
    Description: Pointer to the data structure storing intermediate DSS results (MKL_DSS_HANDLE).

Return Values
MKL_DSS_SUCCESS
MKL_DSS_STATE_ERR
MKL_DSS_INVALID_OPTION
MKL_DSS_REORDER_ERR
MKL_DSS_REORDER1_ERR
MKL_DSS_I32BIT_ERR
MKL_DSS_FAILURE
MKL_DSS_OUT_OF_MEMORY
MKL_DSS_MSG_LVL_ERR
MKL_DSS_TERM_LVL_ERR

dss_factor_real, dss_factor_complex
Compute factorization of the matrix with previously specified location of non-zero elements.

Syntax
C:
    dss_factor_real(handle, opt, rValues)
    dss_factor_complex(handle, opt, cValues)
Fortran 77:
    call dss_factor_real(handle, opt, rValues)
    call dss_factor_complex(handle, opt, cValues)
Fortran 90 (unified Fortran 90 interface):
    call dss_factor(handle, opt, Values)
Fortran 90 (or FORTRAN 77 like interface):
    call dss_factor_real(handle, opt, rValues)
    call dss_factor_complex(handle, opt, cValues)

Include Files
• FORTRAN 77: mkl_dss.f77
• Fortran 90: mkl_dss.f90
• C: mkl_dss.h

Description
These routines compute factorization of the matrix whose non-zero locations were previously specified by a call to dss_define_structure and whose non-zero values are given in the array rValues, cValues or Values. These arrays must be of length nNonZeros as defined in a previous call to dss_define_structure.

NOTE The data type (single or double precision) of rValues, cValues, Values must correspond to the precision specified by the parameter opt in the routine dss_create.

The opt argument can contain one of the following options, depending on your matrix's type:
• MKL_DSS_POSITIVE_DEFINITE
• MKL_DSS_INDEFINITE
• MKL_DSS_HERMITIAN_POSITIVE_DEFINITE
• MKL_DSS_HERMITIAN_INDEFINITE

NOTE This routine supports the Progress Routine feature. See Progress Function section for details.

Input Parameters

handle
    Type: FORTRAN 77: INTEGER*8; Fortran 90: TYPE (MKL_DSS_HANDLE), INTENT(INOUT); C: _MKL_DSS_HANDLE_t*
    Description: Pointer to the data structure storing intermediate DSS results (MKL_DSS_HANDLE).
opt
    Type: FORTRAN 77: INTEGER; Fortran 90: INTEGER, INTENT(IN); C: _INTEGER_t const*
    Description: Parameter to pass the DSS options. The default value is MKL_DSS_POSITIVE_DEFINITE.
rValues
    Type: FORTRAN 77: REAL*4 or REAL*8; Fortran 90: REAL(KIND=4), INTENT(IN) or REAL(KIND=8), INTENT(IN); C: VOID const*
    Description: Array of elements of the matrix A. Real data, single or double precision as specified by the parameter opt in the routine dss_create.
cValues
    Type: FORTRAN 77: COMPLEX*8 or COMPLEX*16; Fortran 90: COMPLEX(KIND=4), INTENT(IN) or COMPLEX(KIND=8), INTENT(IN); C: VOID const*
    Description: Array of elements of the matrix A. Complex data, single or double precision as specified by the parameter opt in the routine dss_create.
Values
    Type: Fortran 90: REAL(KIND=4), INTENT(IN), or REAL(KIND=8), INTENT(IN), or COMPLEX(KIND=4), INTENT(IN), or COMPLEX(KIND=8), INTENT(IN)
    Description: Array of elements of the matrix A. Real or complex data, single or double precision as specified by the parameter opt in the routine dss_create.

Return Values
MKL_DSS_SUCCESS
MKL_DSS_STATE_ERR
MKL_DSS_INVALID_OPTION
MKL_DSS_OPTION_CONFLICT
MKL_DSS_VALUES_ERR
MKL_DSS_OUT_OF_MEMORY
MKL_DSS_ZERO_PIVOT
MKL_DSS_FAILURE
MKL_DSS_MSG_LVL_ERR
MKL_DSS_TERM_LVL_ERR
MKL_DSS_OOC_MEM_ERR
MKL_DSS_OOC_OC_ERR
MKL_DSS_OOC_RW_ERR

See Also
mkl_progress

dss_solve_real, dss_solve_complex
Compute the corresponding solution vector and place it in the output array.

Syntax
C:
    dss_solve_real(handle, opt, rRhsValues, nRhs, rSolValues)
    dss_solve_complex(handle, opt, cRhsValues, nRhs, cSolValues)
Fortran 77:
    call dss_solve_real(handle, opt, rRhsValues, nRhs, rSolValues)
    call dss_solve_complex(handle, opt, cRhsValues, nRhs, cSolValues)
Fortran 90 (unified Fortran 90 interface):
    call dss_solve(handle, opt, RhsValues, nRhs, SolValues)
Fortran 90 (or FORTRAN 77 like interface):
    call dss_solve_real(handle, opt, rRhsValues, nRhs, rSolValues)
    call dss_solve_complex(handle, opt, cRhsValues, nRhs, cSolValues)

Include Files
• FORTRAN 77: mkl_dss.f77
• Fortran 90: mkl_dss.f90
• C: mkl_dss.h

Description
For each right-hand side column vector defined in the arrays rRhsValues, cRhsValues, or RhsValues, these routines compute the corresponding solution vector and place it in the arrays rSolValues, cSolValues, or SolValues respectively.

NOTE The data type (single or double precision) of all arrays must correspond to the precision specified by the parameter opt in the routine dss_create.

The lengths of the right-hand side and solution vectors, nCols and nRows respectively, must be defined in a previous call to dss_define_structure.

By default, both routines perform the full solution step (it corresponds to phase = 33 in PARDISO). The parameter opt enables you to calculate the final solution step-by-step, calling forward and backward substitutions:
• If it is set to MKL_DSS_FORWARD_SOLVE, the forward substitution (corresponding to phase = 331 in PARDISO) is performed.
• If it is set to MKL_DSS_DIAGONAL_SOLVE, the diagonal substitution (corresponding to phase = 332 in PARDISO) is performed.
• If it is set to MKL_DSS_BACKWARD_SOLVE, the backward substitution (corresponding to phase = 333 in PARDISO) is performed.

For more details about using these substitutions for different types of matrices, see the description of the PARDISO solver.
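To make the three substitution steps concrete, here is a minimal dense sketch, assuming a factorization A = L*D*L^T with a unit lower triangular L and a diagonal D made of 1x1 pivots only. It does not call MKL and says nothing about how the library stores or applies its sparse factors. The function names forward_solve, diagonal_solve and backward_solve are hypothetical. Applying the three steps in sequence to a right-hand side b yields the solution of A*x = b.

```c
#include <assert.h>
#include <stddef.h>

/* L is an n x n unit lower triangular matrix stored row-major;
   d holds the n diagonal entries of D. Each step overwrites v. */

/* Step 1, forward substitution: solve L*y = v. */
void forward_solve(size_t n, const double *L, double *v)
{
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < i; ++j)
            v[i] -= L[i * n + j] * v[j];
}

/* Step 2, diagonal substitution: solve D*z = v. */
void diagonal_solve(size_t n, const double *d, double *v)
{
    for (size_t i = 0; i < n; ++i) {
        assert(d[i] != 0.0); /* a zero pivot cannot be divided out */
        v[i] /= d[i];
    }
}

/* Step 3, backward substitution: solve L^T*x = v. */
void backward_solve(size_t n, const double *L, double *v)
{
    for (size_t i = n; i-- > 0; )
        for (size_t j = i + 1; j < n; ++j)
            v[i] -= L[j * n + i] * v[j]; /* (L^T)(i,j) = L(j,i) */
}
```

For example, the matrix with rows (4, 2) and (2, 3) factors as L with rows (1, 0) and (0.5, 1) and D = diag(4, 2). Starting from b = (8, 7), the forward step gives (8, 3), the diagonal step gives (2, 1.5), and the backward step gives x = (1.25, 1.5).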
This parameter also can control the number of refinement steps that is used on the solution stage: if it is set to MKL_DSS_REFINEMENT_OFF, the maximum number of refinement steps equal to zero, and if it is set to MKL_DSS_REFINEMENT_ON (default value), the maximum number of refinement steps is equal to 2. 8 Intel® Math Kernel Library Reference Manual 1924 MKL_DSS_CONJUGATE_SOLVE option added to the parameter opt enables solving a conjugate transposed system AHx = b based on the factorization of the matrix A. This option is equivalent to the parameter iparm(12)= 1 in PARDISO. MKL_DSS_TRANSPOSE_SOLVE option added to the parameter opt enables solving a transposed system ATx = b based on the factorization of the matrix A. This option is equivalent to the parameter iparm(12)= 2 in PARDISO. Input Parameters Name Type Description handle FORTRAN 77: INTEGER*8 Fortran 90: TYPE (MKL_DSS_HANDLE), INTENT(INOUT) C: _MKL_DSS_HANDLE_t* Pointer to the data structure storing intermediate DSS results (MKL_DSS_HANDLE). opt FORTRAN 77: INTEGER Fortran 90: INTEGER, INTENT(IN) C: _INTEGER_t const* Parameter to pass the DSS options. nRhs FORTRAN 77: INTEGER Fortran 90: INTEGER, INTENT(IN) C: _INTEGER_t const* Number of the right-hand sides in the linear equation. rRhsValues FORTRAN 77: REAL*4 or REAL*8 Fortran 90: REAL(KIND=4), INTENT(IN) or REAL(KIND=8), INTENT(IN) C: VOID const* Array of size nRows * nRhs. Contains real righthand side vectors. Real data, single or double precision as it is specified by the parameter opt in the routine dss_create. cRhsValues FORTRAN 77: COMPLEX*8 or COMPLEX*16 Fortran 90: COMPLEX(KIND=4), INTENT(IN) or COMPLEX(KIND=8), INTENT(IN) C: VOID const* Array of size nRows * nRhs. Contains complex right-hand side vectors. Complex data, single or double precision as it is specified by the parameter opt in the routine dss_create. 
RhsValues
  Type: Fortran 90: REAL(KIND=4), INTENT(IN), or REAL(KIND=8), INTENT(IN), or COMPLEX(KIND=4), INTENT(IN), or COMPLEX(KIND=8), INTENT(IN)
  Array of size nRows * nRhs. Contains right-hand side vectors. Real or complex data, single or double precision as specified by the parameter opt in the routine dss_create.

Output Parameters

rSolValues
  Type: FORTRAN 77: REAL*4 or REAL*8; Fortran 90: REAL(KIND=4), INTENT(OUT) or REAL(KIND=8), INTENT(OUT); C: VOID const*
  Array of size nCols * nRhs. Contains real solution vectors. Real data, single or double precision as specified by the parameter opt in the routine dss_create.

cSolValues
  Type: FORTRAN 77: COMPLEX*8 or COMPLEX*16; Fortran 90: COMPLEX(KIND=4), INTENT(OUT) or COMPLEX(KIND=8), INTENT(OUT); C: VOID const*
  Array of size nCols * nRhs. Contains complex solution vectors. Complex data, single or double precision as specified by the parameter opt in the routine dss_create.

SolValues
  Type: Fortran 90: REAL(KIND=4), INTENT(OUT), or REAL(KIND=8), INTENT(OUT), or COMPLEX(KIND=4), INTENT(OUT), or COMPLEX(KIND=8), INTENT(OUT)
  Array of size nCols * nRhs. Contains solution vectors. Real or complex data, single or double precision as specified by the parameter opt in the routine dss_create.

Return Values
MKL_DSS_SUCCESS
MKL_DSS_STATE_ERR
MKL_DSS_INVALID_OPTION
MKL_DSS_OUT_OF_MEMORY
MKL_DSS_DIAG_ERR
MKL_DSS_FAILURE
MKL_DSS_MSG_LVL_ERR
MKL_DSS_TERM_LVL_ERR
MKL_DSS_OOC_MEM_ERR
MKL_DSS_OOC_OC_ERR
MKL_DSS_OOC_RW_ERR

dss_delete
Deletes all of the data structures created during the solving process.

Syntax

C:
dss_delete(handle, opt)

Fortran:
call dss_delete(handle, opt)

Include Files
• FORTRAN 77: mkl_dss.f77
• Fortran 90: mkl_dss.f90
• C: mkl_dss.h

Description

The routine dss_delete deletes all data structures created during the solving process.
Input Parameters

opt
  Type: FORTRAN 77: INTEGER; Fortran 90: INTEGER, INTENT(IN); C: _INTEGER_t const*
  Parameter to pass the DSS options. The default value is MKL_DSS_MSG_LVL_WARNING + MKL_DSS_TERM_LVL_ERROR.

Output Parameters

handle
  Type: FORTRAN 77: INTEGER*8; Fortran 90: TYPE (MKL_DSS_HANDLE), INTENT(INOUT); C: _MKL_DSS_HANDLE_t*
  Pointer to the data structure storing intermediate DSS results (MKL_DSS_HANDLE).

Return Values
MKL_DSS_SUCCESS
MKL_DSS_STATE_ERR
MKL_DSS_INVALID_OPTION
MKL_DSS_OUT_OF_MEMORY
MKL_DSS_MSG_LVL_ERR
MKL_DSS_TERM_LVL_ERR

dss_statistics
Returns statistics about various phases of the solving process.

Syntax

C:
dss_statistics(handle, opt, statArr, retValues)

Fortran:
call dss_statistics(handle, opt, statArr, retValues)

Include Files
• FORTRAN 77: mkl_dss.f77
• Fortran 90: mkl_dss.f90
• C: mkl_dss.h

Description

The dss_statistics routine returns statistics about various phases of the solving process. This routine gathers the following statistics:
– time taken to do reordering,
– time taken to do factorization,
– duration of problem solving,
– determinant of the input matrix,
– inertia of the input matrix,
– number of floating point operations taken during factorization,
– total peak memory needed during the analysis and symbolic factorization,
– permanent memory needed from the analysis and symbolic factorization,
– memory consumption for the factorization and solve phases.

Statistics are returned in accordance with the input string specified by the parameter statArr. The value of the statistics is returned in double precision in a return array, which you must allocate beforehand. For multiple statistics, multiple string constants separated by commas can be used as input. Return values are put into the return array in the same order as specified in the input string.

Statistics can only be requested at the appropriate stages of the solving process.
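Because the return array must be allocated before the call and is filled in the order of the names in the input string, it can help to compute the required length from that string. The sketch below is a hypothetical helper, not part of Intel MKL. It assumes the slot counts given in the statArr parameter description of this section (two values for the determinant of a real matrix, three for a complex matrix, three for the inertia) and, as a further assumption, one value for every other statistic. It does not trim spaces around the names.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Case-insensitive comparison of the token [s, s+len) with a name. */
static int token_is(const char *s, size_t len, const char *name)
{
    size_t i;
    for (i = 0; i < len; ++i) {
        if (name[i] == '\0' ||
            tolower((unsigned char)s[i]) != tolower((unsigned char)name[i]))
            return 0;
    }
    return name[len] == '\0';
}

/* Hypothetical helper: number of doubles the return array needs for a
   comma-separated statistics string, or -1 if a name is not recognized. */
int stat_slots_needed(const char *statArr, int is_complex)
{
    static const char *const single[] = {
        "ReorderTime", "FactorTime", "SolveTime", "Flops",
        "Peakmem", "Factormem", "Solvemem"
    };
    const size_t n_single = sizeof single / sizeof single[0];
    const char *p = statArr;
    int total = 0;

    assert(statArr != NULL);
    while (*p != '\0') {
        const char *end = p;
        size_t len, k;
        int slots = -1;

        while (*end != '\0' && *end != ',') ++end;   /* find end of token */
        len = (size_t)(end - p);

        if (token_is(p, len, "Determinant")) {
            slots = is_complex ? 3 : 2;   /* det_pow plus base, or re and im */
        } else if (token_is(p, len, "Inertia")) {
            slots = 3;                    /* p, n, z */
        } else {
            for (k = 0; k < n_single; ++k) {
                if (token_is(p, len, single[k])) { slots = 1; break; }
            }
        }
        if (slots < 0) return -1;
        total += slots;
        p = (*end == ',') ? end + 1 : end;           /* skip the comma */
    }
    return total;
}
```

With the string "ReorderTime,Inertia" the helper reports four values: one for the reordering time followed by three for the inertia.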
Requesting a statistic too early leads to errors; for example, FactorTime cannot be requested before the matrix is factored. The following table shows the point at which each individual statistics item can be requested:

Statistics Calling Sequences

Type of Statistics: When to Call
ReorderTime: After dss_reorder is completed successfully.
FactorTime: After dss_factor_real or dss_factor_complex is completed successfully.
SolveTime: After dss_solve_real or dss_solve_complex is completed successfully.
Determinant: After dss_factor_real or dss_factor_complex is completed successfully.
Inertia: After dss_factor_real is completed successfully and the matrix is real, symmetric, and indefinite.
Flops: After dss_factor_real or dss_factor_complex is completed successfully.
Peakmem: After dss_reorder is completed successfully.
Factormem: After dss_reorder is completed successfully.
Solvemem: After dss_factor_real or dss_factor_complex is completed successfully.

Input Parameters

handle
  Type: FORTRAN 77: INTEGER*8; Fortran 90: TYPE (MKL_DSS_HANDLE), INTENT(IN); C: _MKL_DSS_HANDLE_t*
  Pointer to the data structure storing intermediate DSS results (MKL_DSS_HANDLE).

opt
  Type: FORTRAN 77: INTEGER; Fortran 90: INTEGER, INTENT(IN); C: _INTEGER_t const*
  Parameter to pass the DSS options.

statArr
  Type: FORTRAN 77: INTEGER; Fortran 90: INTEGER, INTENT(IN); C: char const*
  Input string that defines the type of the returned statistics. The parameter can include one or more of the following string constants (case of the input string has no effect):

  ReorderTime
    Amount of time taken to do the reordering.
  FactorTime
    Amount of time taken to do the factorization.
  SolveTime
    Amount of time taken to solve the problem after factorization.
  Determinant
    Determinant of the matrix A.
    For real matrices: the determinant is returned as det_pow, det_base in two consecutive return array locations, where 1.0 <= abs(det_base) < 10.0 and determinant = det_base*10^(det_pow).
    For complex matrices: the determinant is returned as det_pow, det_re, det_im in three consecutive return array locations, where 1.0 <= abs(det_re) + abs(det_im) < 10.0 and determinant = (det_re, det_im)*10^(det_pow).
  Inertia
    Inertia of a real symmetric matrix is defined as a triplet of nonnegative integers (p,n,z), where p is the number of positive eigenvalues, n is the number of negative eigenvalues, and z is the number of zero eigenvalues. Inertia is returned as three consecutive return array locations p, n, z. Computing inertia is recommended only for stable matrices. Unstable matrices can lead to incorrect results. Inertia of a k-by-k real symmetric positive definite matrix is always (k, 0, 0). Therefore Inertia is returned only in cases of real symmetric indefinite matrices. For all other matrix types, an error message is returned.
  Flops
    Number of floating point operations performed during the factorization.
  Peakmem
    Total peak memory in kilobytes that the solver needs during the analysis and symbolic factorization phase.
  Factormem
    Permanent memory in kilobytes that the solver needs from the analysis and symbolic factorization phase in the factorization and solve phases.
  Solvemem
    Total double precision memory consumption (kilobytes) of the solver for the factorization and solve phases.

NOTE To avoid problems in passing strings from Fortran to C, Fortran users must call the mkl_cvt_to_null_terminated_str routine before calling dss_statistics. Refer to the description of mkl_cvt_to_null_terminated_str for details.

Output Parameters

retValues
  Type: FORTRAN 77: REAL*8; Fortran 90: REAL(KIND=8), INTENT(OUT); C: VOID const*
  Value of the statistics returned.

Finding 'time used to reorder' and 'inertia' of a matrix

The example below illustrates the use of the dss_statistics routine.
To find the above values, call dss_statistics(handle, opt, statArr, retValues), where statArr is "ReorderTime,Inertia". In this example, retValues has the following values:

retValues[0]  Time to reorder.
retValues[1]  Positive Eigenvalues.
retValues[2]  Negative Eigenvalues.
retValues[3]  Zero Eigenvalues.

Return Values
MKL_DSS_SUCCESS
MKL_DSS_INVALID_OPTION
MKL_DSS_STATISTICS_INVALID_MATRIX
MKL_DSS_STATISTICS_INVALID_STATE
MKL_DSS_STATISTICS_INVALID_STRING
MKL_DSS_MSG_LVL_ERR
MKL_DSS_TERM_LVL_ERR

mkl_cvt_to_null_terminated_str
Passes character strings from Fortran routines to C routines.

Syntax

mkl_cvt_to_null_terminated_str (destStr, destLen, srcStr)

Include Files
• FORTRAN 77: mkl_dss.f77
• Fortran 90: mkl_dss.f90

Description

The routine mkl_cvt_to_null_terminated_str passes character strings from Fortran routines to C routines. The strings are converted into integer arrays before being passed to C. Using this routine avoids the problems that can occur on some platforms when passing strings from Fortran to C. The use of this routine is highly recommended.

Input Parameters

destLen  INTEGER. Length of the output array destStr.
srcStr   STRING. Input string.

Output Parameters

destStr  INTEGER. One-dimensional array of integers.

Implementation Details

Several aspects of the Intel MKL DSS interface are platform-specific and language-specific. To promote portability across platforms and ease of use across different languages, one of the following Intel MKL DSS language-specific header files can be included:
• mkl_dss.f77 for F77 programs
• mkl_dss.f90 for F90 programs
• mkl_dss.h for C programs

These header files define symbolic constants for returned error values, function options, certain defined data types, and function prototypes.

NOTE Constants for options, returned error values, and message severities must be referred to only by the symbolic names that are defined in these header files.
Use of the Intel MKL DSS software without including one of the above header files is not supported.

Memory Allocation and Handles

To simplify the use of the Intel MKL DSS routines, they do not require you to allocate any temporary working storage. The solver itself allocates any required storage. To enable multiple users to access the solver simultaneously, the solver keeps track of the storage allocated for a particular application by using a data object called a handle.

Each of the Intel MKL DSS routines creates, uses, or deletes a handle. Consequently, each program must be able to allocate storage for a handle. The exact syntax for allocating storage for a handle varies from language to language. To standardize the handle declarations, the language-specific header files declare constants and defined data types that must be used when declaring a handle object in the user code.

Fortran 90 programmers must declare a handle as:

    INCLUDE "mkl_dss.f90"
    TYPE(MKL_DSS_HANDLE) handle

C and C++ programmers must declare a handle as:

    #include "mkl_dss.h"
    _MKL_DSS_HANDLE_t handle;

FORTRAN 77 programmers using compilers that support eight-byte integers must declare a handle as:

    INCLUDE "mkl_dss.f77"
    INTEGER*8 handle

Otherwise they can replace the INTEGER*8 data type with the DOUBLE PRECISION data type.

In addition to the definition for the correct declaration of a handle, the include file also defines the following:
• function prototypes for languages that support prototypes
• symbolic constants that are used for the returned error values
• user options for the solver routines
• constants indicating message severity.

Iterative Sparse Solvers based on Reverse Communication Interface (RCI ISS)

Intel MKL supports the iterative sparse solvers (ISS) based on the reverse communication interface (RCI), referred to here as the RCI ISS interface.
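As a rough illustration of how a reverse communication solver looks from the caller's side, the sketch below implements a toy conjugate gradient solver in plain C that returns to the caller whenever it needs a matrix-vector product. It is not the Intel MKL implementation and does not use the Intel MKL routines: the names toy_cg_state, toy_cg_init, toy_cg_step, and toy_cg_solve_dense, the request codes, and the layout of the work array are all hypothetical, and preconditioning and user-defined stopping tests are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical request codes for this toy solver (not MKL's values). */
enum { TOY_DONE = 0, TOY_NEED_MATVEC = 1 };

typedef struct {
    size_t n;       /* problem size */
    int stage;      /* 0: start, 1: initial residual pending, 2: iterating */
    int iter;       /* completed iterations */
    int max_iter;   /* iteration limit */
    double tol2;    /* stop when the squared residual norm <= tol2 */
    double rr;      /* squared norm of the current residual */
} toy_cg_state;

static double dot(size_t n, const double *a, const double *b)
{
    double s = 0.0;
    size_t i;
    for (i = 0; i < n; ++i) s += a[i] * b[i];
    return s;
}

void toy_cg_init(toy_cg_state *st, size_t n, int max_iter, double tol2)
{
    assert(st != NULL && n > 0 && max_iter > 0);
    st->n = n; st->stage = 0; st->iter = 0;
    st->max_iter = max_iter; st->tol2 = tol2; st->rr = 0.0;
}

/* tmp holds three vectors of length n: p = tmp, q = tmp + n, r = tmp + 2n.
   When TOY_NEED_MATVEC is returned, the caller must store A*p in q and
   call toy_cg_step again. */
int toy_cg_step(toy_cg_state *st, double *x, const double *b, double *tmp)
{
    size_t n = st->n, i;
    double *p = tmp, *q = tmp + n, *r = tmp + 2 * n;

    if (st->stage == 0) {           /* request A*x0 */
        for (i = 0; i < n; ++i) p[i] = x[i];
        st->stage = 1;
        return TOY_NEED_MATVEC;
    }
    if (st->stage == 1) {           /* q = A*x0: r0 = b - A*x0, p0 = r0 */
        for (i = 0; i < n; ++i) { r[i] = b[i] - q[i]; p[i] = r[i]; }
        st->rr = dot(n, r, r);
        if (st->rr <= st->tol2) return TOY_DONE;
        st->stage = 2;
        return TOY_NEED_MATVEC;
    }
    {                               /* q = A*p: one CG update */
        double alpha = st->rr / dot(n, p, q);
        double rr_new, beta;
        for (i = 0; i < n; ++i) { x[i] += alpha * p[i]; r[i] -= alpha * q[i]; }
        rr_new = dot(n, r, r);
        st->iter += 1;
        if (rr_new <= st->tol2 || st->iter >= st->max_iter) return TOY_DONE;
        beta = rr_new / st->rr;
        for (i = 0; i < n; ++i) p[i] = r[i] + beta * p[i];
        st->rr = rr_new;
        return TOY_NEED_MATVEC;
    }
}

/* Caller side: answer every request with a dense row-major product.
   Returns the number of iterations used. */
int toy_cg_solve_dense(size_t n, const double *A, const double *b,
                       double *x, double *tmp, int max_iter, double tol2)
{
    toy_cg_state st;
    size_t i, j;
    toy_cg_init(&st, n, max_iter, tol2);
    while (toy_cg_step(&st, x, b, tmp) == TOY_NEED_MATVEC) {
        for (i = 0; i < n; ++i) {
            double s = 0.0;
            for (j = 0; j < n; ++j) s += A[i * n + j] * tmp[j];
            tmp[n + i] = s;
        }
    }
    return st.iter;
}
```

The solver never sees the matrix: toy_cg_solve_dense could be replaced by a loop that uses any sparse format or a matrix-free operator. That is the property reverse communication is designed to provide.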
The RCI ISS interface implements a group of user-callable routines that are used in the step-by-step solving process of a symmetric positive definite system (RCI conjugate gradient solver, or RCI CG), and of a non-symmetric indefinite (non-degenerate) system (RCI flexible generalized minimal residual solver, or RCI FGMRES) of linear algebraic equations. This interface uses the general RCI scheme described in [Dong95]. See the Appendix A Linear Solvers Basics for discussion of terms and concepts related to the ISS routines.

RCI means that when the solver needs the results of certain operations (for example, matrix-vector multiplications), the user must perform them and pass the result to the solver. This makes the solver universal, because it is independent of the specific implementation of operations such as the matrix-vector multiplication. To perform such operations, the user can use the built-in sparse matrix-vector multiplication and triangular solver routines (see Sparse BLAS Level 2 and Level 3 Routines).

NOTE The RCI CG solver is implemented in two versions: for a system of equations with a single right-hand side, and for a system of equations with multiple right-hand sides. The CG method may fail to compute the solution or may compute a wrong solution if the matrix of the system is not symmetric and positive definite. The FGMRES method may fail if the matrix is degenerate.

Table "RCI ISS Interface Routines" lists the names of the routines and describes their general use.

RCI ISS Interface Routines

Routine: Description
dcg_init, dcgmrhs_init, dfgmres_init: Initializes the solver.
dcg_check, dcgmrhs_check, dfgmres_check: Checks the consistency and correctness of the user-defined data.
dcg, dcgmrhs, dfgmres: Computes the approximate solution vector.
dcg_get, dcgmrhs_get, dfgmres_get: Retrieves the number of the current iteration.

The Intel MKL RCI ISS interface routines are normally invoked in this order:
1. the _init routine (dcg_init, dcgmrhs_init, or dfgmres_init)
2. the _check routine (dcg_check, dcgmrhs_check, or dfgmres_check)
3. the solver routine (dcg, dcgmrhs, or dfgmres)
4. the _get routine (dcg_get, dcgmrhs_get, or dfgmres_get)

Advanced users can change that order if they need it. Others should follow the above order of calls.

The diagram in Figure "Typical Order for Invoking RCI ISS Interface Routines" indicates the typical order in which the RCI ISS interface routines can be invoked.

[Figure: Typical Order for Invoking RCI ISS Interface Routines]

Code examples that use the RCI ISS interface routines to solve systems of linear equations can be found in the examples\solver\source folder of your Intel MKL directory (cg_no_precon.f, cg_no_precon_c.c, cg_mrhs.f, cg_mrhs_precond.f, cg_mrhs_stop_crt.f, fgmres_no_precon_f.f, fgmres_no_precon_c.c).

CG Interface Description

All types in this documentation refer to the common Fortran types, INTEGER and DOUBLE PRECISION. C and C++ programmers should refer to the section Calling Sparse Solver and Preconditioner Routines from C/C++ for information on mapping Fortran types to C/C++ types.

Each routine for the RCI CG solver is implemented in two versions: for a system of equations with a single right-hand side (SRHS), and for a system of equations with multiple right-hand sides (MRHS). The names of routines for a system with MRHS contain the suffix mrhs.

Routine Options

All of the RCI CG routines have common parameters for passing various options to the routines (see CG Common Parameters). The values for these parameters can be changed during computations.

User Data Arrays

Many of the RCI CG routines take arrays of user data as input. For example, user arrays are passed to the routine dcg to compute the solution of a system of linear algebraic equations. To minimize storage requirements and improve overall run-time efficiency, the Intel MKL RCI CG routines do not make copies of the user input arrays.

CG Common Parameters

NOTE The default and initial values listed below are assigned to the parameters by calling the dcg_init/dcgmrhs_init routine.
n
  INTEGER, this parameter sets the size of the problem in the dcg_init/dcgmrhs_init routine. All the other routines use the ipar(1) parameter instead. Note that the coefficient matrix A is a square matrix of size n*n.

x
  DOUBLE PRECISION array of size n for SRHS, or matrix of size (n*nrhs) for MRHS. This parameter contains the current approximation to the solution. Before the first call to the dcg/dcgmrhs routine, it contains the initial approximation to the solution.

nrhs
  INTEGER, this parameter sets the number of right-hand sides for MRHS routines.

b
  DOUBLE PRECISION array containing a single right-hand side vector, or matrix of size (nrhs*n) containing right-hand side vectors.

RCI_request
  INTEGER, this parameter gives information about the result of work of the RCI CG routines. Negative values of the parameter indicate that the routine completed with errors or warnings. The 0 value indicates successful completion of the task. Positive values mean that you must perform specific actions:

  RCI_request= 1
    multiply the matrix by tmp(1:n,1), put the result in tmp(1:n,2), and return the control to the dcg/dcgmrhs routine;
  RCI_request= 2
    perform the stopping tests. If they fail, return the control to the dcg/dcgmrhs routine. If the stopping tests succeed, the solution is found and stored in the x array;
  RCI_request= 3
    for SRHS: apply the preconditioner to tmp(1:n,3), put the result in tmp(1:n,4), and return the control to the dcg routine;
    for MRHS: apply the preconditioner to tmp(:,3+ipar(3)), put the result in tmp(:,3), and return the control to the dcgmrhs routine.

  Note that the dcg_get/dcgmrhs_get routine does not change the parameter RCI_request. This enables use of this routine inside the reverse communication computations.

ipar
  INTEGER array, of size 128 for SRHS, and of size (128+2*nrhs) for MRHS. This parameter specifies the integer set of data for the RCI CG computations:

  ipar(1)
    specifies the size of the problem.
    The dcg_init/dcgmrhs_init routine assigns ipar(1)=n. All the other routines use this parameter instead of n. There is no default value for this parameter.
  ipar(2)
    specifies the type of output for error and warning messages generated by the RCI CG routines. The default value 6 means that all messages are displayed on the screen. Otherwise, the error and warning messages are written to the newly created files dcg_errors.txt and dcg_check_warnings.txt, respectively. Note that if ipar(6) and ipar(7) parameters are set to 0, error and warning messages are not generated at all.
  ipar(3)
    for SRHS: contains the current stage of the RCI CG computations. The initial value is 1;
    for MRHS: contains the right-hand side for which the calculations are currently performed.
    WARNING Avoid altering this variable during computations.
  ipar(4)
    contains the current iteration number. The initial value is 0.
  ipar(5)
    specifies the maximum number of iterations. The default value is min(150, n).
  ipar(6)
    if the value is not equal to 0, the routines output error messages in accordance with the parameter ipar(2). Otherwise, the routines do not output error messages at all, but return a negative value of the parameter RCI_request. The default value is 1.
  ipar(7)
    if the value is not equal to 0, the routines output warning messages in accordance with the parameter ipar(2). Otherwise, the routines do not output warning messages at all, but they return a negative value of the parameter RCI_request. The default value is 1.
  ipar(8)
    if the value is not equal to 0, the dcg/dcgmrhs routine performs the stopping test for the maximum number of iterations: ipar(4) <= ipar(5). Otherwise, the method is stopped and the corresponding value is assigned to RCI_request. If the value is 0, the routine does not perform this stopping test. The default value is 1.
  ipar(9)
    if the value is not equal to 0, the dcg/dcgmrhs routine performs the residual stopping test: dpar(5) <= dpar(4) = dpar(1)*dpar(3)+dpar(2). If the test stops the method, the corresponding value is assigned to RCI_request. If the value is 0, the routine does not perform this stopping test. The default value is 0.
  ipar(10)
    if the value is not equal to 0, the dcg/dcgmrhs routine requests a user-defined stopping test by setting the output parameter RCI_request=2. If the value is 0, the routine does not perform the user-defined stopping test. The default value is 1.

    NOTE At least one of the parameters ipar(8)-ipar(10) must be set to 1.
  ipar(11)
    if the value is equal to 0, the dcg/dcgmrhs routine runs the non-preconditioned version of the corresponding CG method. Otherwise, the routine runs the preconditioned version of the CG method, and by setting the output parameter RCI_request=3, indicates that you must perform the preconditioning step. The default value is 0.
  ipar(12:128), ipar(12:128+2*nrhs)
    are reserved and not used in the current RCI CG SRHS and MRHS routines.

  NOTE You must declare the array ipar with length 128. While defining the array in the code using RCI CG SRHS as INTEGER ipar(11) works, there is no guarantee of future compatibility with Intel MKL.

dpar
  DOUBLE PRECISION array, for SRHS of size 128, for MRHS of size (128+2*nrhs); this parameter is used to specify the double precision set of data for the RCI CG computations, specifically:

  dpar(1)
    specifies the relative tolerance. The default value is 1.0D-6.
  dpar(2)
    specifies the absolute tolerance. The default value is 0.0D-0.
  dpar(3)
    specifies the square norm of the initial residual (if it is computed in the dcg/dcgmrhs routine). The initial value is 0.
  dpar(4)
    service variable equal to dpar(1)*dpar(3)+dpar(2) (if it is computed in the dcg/dcgmrhs routine). The initial value is 0.
  dpar(5)
    specifies the square norm of the current residual.
    The initial value is 0.0.
  dpar(6)
    specifies the square norm of residual from the previous iteration step (if available). The initial value is 0.0.
  dpar(7)
    contains the alpha parameter of the CG method. The initial value is 0.0.
  dpar(8)
    contains the beta parameter of the CG method; it is equal to dpar(5)/dpar(6). The initial value is 0.0.
  dpar(9:128), dpar(9:128+2*nrhs)
    are reserved and not used in the current RCI CG SRHS and MRHS routines respectively.

  NOTE You must declare the array dpar with length 128. While defining the array in the code using RCI CG SRHS as DOUBLE PRECISION dpar(8) works, there is no guarantee of future compatibility with Intel MKL.

tmp
  DOUBLE PRECISION array of size (n,4) for SRHS, and (n,3+nrhs) for MRHS. This parameter is used to supply the double precision temporary space for the RCI CG computations, specifically:

  tmp(:,1)
    specifies the current search direction. The initial value is 0.0.
  tmp(:,2)
    contains the matrix multiplied by the current search direction. The initial value is 0.0.
  tmp(:,3)
    contains the current residual. The initial value is 0.0.
  tmp(:,4)
    contains the inverse of the preconditioner applied to the current residual. There is no initial value for this parameter.

  NOTE You can define this array in the code using RCI CG SRHS as DOUBLE PRECISION tmp(n,3) if you run only non-preconditioned CG iterations.

Schemes of Using the RCI CG Routines

The following pseudocode shows the general schemes of using the RCI CG routines.

...
      generate matrix A
      generate preconditioner C (optional)
      call dcg_init(n, x, b, RCI_request, ipar, dpar, tmp)
      change parameters in ipar, dpar if necessary
      call dcg_check(n, x, b, RCI_request, ipar, dpar, tmp)
1     call dcg(n, x, b, RCI_request, ipar, dpar, tmp)
      if (RCI_request.eq.1) then
         multiply the matrix A by tmp(1:n,1) and put the result in tmp(1:n,2)
         It is possible to use MKL Sparse BLAS Level 2 subroutines for this operation
c        proceed with CG iterations
         goto 1
      endif
      if (RCI_request.eq.2) then
         do the stopping test
         if (test not passed) then
c           proceed with CG iterations
            goto 1
         else
c           stop CG iterations
            goto 2
         endif
      endif
      if (RCI_request.eq.3) then (optional)
         apply the preconditioner C inverse to tmp(1:n,3) and put the result in tmp(1:n,4)
c        proceed with CG iterations
         goto 1
      endif
2     call dcg_get(n, x, b, RCI_request, ipar, dpar, tmp, itercount)
      current iteration number is in itercount
      the computed approximation is in the array x

FGMRES Interface Description

All types in this documentation refer to the common Fortran types, INTEGER and DOUBLE PRECISION. C and C++ programmers should refer to the section Calling Sparse Solver and Preconditioner Routines from C/C++ for information on mapping Fortran types to C/C++ types.

Routine Options

All of the RCI FGMRES routines have common parameters for passing various options to the routines (see FGMRES Common Parameters). The values for these parameters can be changed during computations.

User Data Arrays

Many of the RCI FGMRES routines take arrays of user data as input. For example, user arrays are passed to the routine dfgmres to compute the solution of a system of linear algebraic equations. To minimize storage requirements and improve overall run-time efficiency, the Intel MKL RCI FGMRES routines do not make copies of the user input arrays.

FGMRES Common Parameters

NOTE The default and initial values listed below are assigned to the parameters by calling the dfgmres_init routine.
n
  INTEGER, this parameter sets the size of the problem in the dfgmres_init routine. All the other routines use the ipar(1) parameter instead. Note that the coefficient matrix A is a square matrix of size n*n.

x
  DOUBLE PRECISION array, this parameter contains the current approximation to the solution vector. Before the first call to the dfgmres routine, it contains the initial approximation to the solution vector.

b
  DOUBLE PRECISION array, this parameter contains the right-hand side vector. Depending on user requests (see the parameter ipar(13)), it may later contain the approximate solution.

RCI_request
  INTEGER, this parameter gives information about the result of work of the RCI FGMRES routines. Negative values of the parameter indicate that the routine completed with errors or warnings. The 0 value indicates successful completion of the task. Positive values mean that you must perform specific actions:

  RCI_request= 1
    multiply the matrix by tmp(ipar(22)), put the result in tmp(ipar(23)), and return the control to the dfgmres routine;
  RCI_request= 2
    perform the stopping tests. If they fail, return the control to the dfgmres routine. Otherwise, the solution can be updated by a subsequent call to the dfgmres_get routine;
  RCI_request= 3
    apply the preconditioner to tmp(ipar(22)), put the result in tmp(ipar(23)), and return the control to the dfgmres routine;
  RCI_request= 4
    check if the norm of the current orthogonal vector is zero, within the rounding or computational errors. Return the control to the dfgmres routine if it is not zero, otherwise complete the solution process by calling the dfgmres_get routine.

ipar(128)
  INTEGER array, this parameter specifies the integer set of data for the RCI FGMRES computations:

  ipar(1)
    specifies the size of the problem. The dfgmres_init routine assigns ipar(1)=n. All the other routines use this parameter instead of n. There is no default value for this parameter.
  ipar(2)
    specifies the type of output for error and warning messages that are generated by the RCI FGMRES routines. The default value 6 means that all messages are displayed on the screen. Otherwise the error and warning messages are written to the newly created file MKL_RCI_FGMRES_Log.txt. Note that if ipar(6) and ipar(7) parameters are set to 0, error and warning messages are not generated at all.
  ipar(3)
    contains the current stage of the RCI FGMRES computations. The initial value is 1.
    WARNING Avoid altering this variable during computations.
  ipar(4)
    contains the current iteration number. The initial value is 0.
  ipar(5)
    specifies the maximum number of iterations. The default value is min(150, n).
  ipar(6)
    if the value is not 0, the routines output error messages in accordance with the parameter ipar(2). If it is 0, the routines do not output error messages at all, but return a negative value of the parameter RCI_request. The default value is 1.
  ipar(7)
    if the value is not 0, the routines output warning messages in accordance with the parameter ipar(2). Otherwise, the routines do not output warning messages at all, but they return a negative value of the parameter RCI_request. The default value is 1.
  ipar(8)
    if the value is not equal to 0, the dfgmres routine performs the stopping test for the maximum number of iterations: ipar(4) <= ipar(5). If the value is 0, the dfgmres routine does not perform this stopping test. The default value is 1.
  ipar(9)
    if the value is not 0, the dfgmres routine performs the residual stopping test: dpar(5) <= dpar(4). If the value is 0, the dfgmres routine does not perform this stopping test. The default value is 0.
  ipar(10)
    if the value is not 0, the dfgmres routine requests that the user-defined stopping test be performed by setting RCI_request=2. If the value is 0, the dfgmres routine does not perform the user-defined stopping test. The default value is 1.
    NOTE At least one of the parameters ipar(8)-ipar(10) must be set to 1.
  ipar(11)
    if the value is 0, the dfgmres routine runs the non-preconditioned version of the FGMRES method. Otherwise, the routine runs the preconditioned version of the FGMRES method, and requests that you perform the preconditioning step by setting the output parameter RCI_request=3. The default value is 0.
  ipar(12)
    if the value is not equal to 0, the dfgmres routine performs the automatic test for zero norm of the currently generated vector: dpar(7) <= dpar(8), where dpar(8) contains the tolerance value. Otherwise, the routine indicates that you must perform this check by setting the output parameter RCI_request=4. The default value is 0.
  ipar(13)
    if the value is equal to 0, the dfgmres_get routine updates the solution to the vector x according to the computations done by the dfgmres routine. If the value is positive, the routine writes the solution to the right-hand side vector b. If the value is negative, the routine returns only the number of the current iteration, and does not update the solution. The default value is 0.

    NOTE It is possible to call the dfgmres_get routine at any place in the code, but you must pay special attention to the parameter ipar(13). The RCI FGMRES iterations can be continued after the call to the dfgmres_get routine only if the parameter ipar(13) is not equal to zero. If ipar(13) is positive, then the updated solution overwrites the right-hand side in the vector b. If you want to run the restarted version of FGMRES with the same right-hand side, then it must be saved in a different memory location before the first call to the dfgmres_get routine with positive ipar(13).
  ipar(14)
    contains the internal iteration counter that counts the number of iterations before the restart takes place. The initial value is 0.
    WARNING Do not alter this variable during computations.
  ipar(15)
    specifies the number of the non-restarted FGMRES iterations. To run the restarted version of the FGMRES method, assign the number of iterations to ipar(15) before the restart. The default value is min(150, n), which means that by default the non-restarted version of the FGMRES method is used.
  ipar(16)
    service variable specifying the location in the tmp array at which the rotated Hessenberg matrix starts; the matrix is stored in the packed format (see Matrix Arguments in the Appendix B for details).
  ipar(17)
    service variable specifying the location in the tmp array at which the vector of rotation cosines starts.
  ipar(18)
    service variable specifying the location in the tmp array at which the vector of rotation sines starts.
  ipar(19)
    service variable specifying the location in the tmp array at which the rotated residual vector starts.
  ipar(20)
    service variable specifying the location in the tmp array at which the least squares solution vector starts.
  ipar(21)
    service variable specifying the location in the tmp array at which the set of preconditioned vectors starts. The memory locations in the tmp array starting from ipar(21) are used only for the preconditioned FGMRES method.
  ipar(22)
    specifies the location in the tmp array at which the first vector (source) used in operations requested via RCI_request starts.
  ipar(23)
    specifies the location in the tmp array at which the second vector (output) used in operations requested via RCI_request starts.
  ipar(24:128)
    are reserved and not used in the current RCI FGMRES routines.

  NOTE You must declare the array ipar with length 128. While defining the array in the code as INTEGER ipar(23) works, there is no guarantee of future compatibility with Intel MKL.
dpar(128) DOUBLE PRECISION array, this parameter specifies the double precision set of data for the RCI FGMRES computations, specifically:
dpar(1) specifies the relative tolerance. The default value is 1.0D-6.
dpar(2) specifies the absolute tolerance. The default value is 0.0D-0.
dpar(3) specifies the Euclidean norm of the initial residual (if it is computed in the dfgmres routine). The initial value is 0.0.
dpar(4) service variable equal to dpar(1)*dpar(3)+dpar(2) (if it is computed in the dfgmres routine). The initial value is 0.0.
dpar(5) specifies the Euclidean norm of the current residual. The initial value is 0.0.
dpar(6) specifies the Euclidean norm of the residual from the previous iteration step (if available). The initial value is 0.0.
dpar(7) contains the norm of the generated vector. The initial value is 0.0.
NOTE In terms of [Saad03] this parameter is the coefficient h_{k+1,k} of the Hessenberg matrix.
dpar(8) contains the tolerance for the zero norm of the currently generated vector. The default value is 1.0D-12.
dpar(9:128) are reserved and not used in the current RCI FGMRES routines.
NOTE You must declare the array dpar with length 128. While defining the array in the code as DOUBLE PRECISION dpar(8) works, there is no guarantee of future compatibility with Intel MKL.
tmp DOUBLE PRECISION array of size ((2*ipar(15)+1)*n + ipar(15)*(ipar(15)+9)/2 + 1) used to supply the double precision temporary space for the RCI FGMRES computations, specifically:
tmp(1:ipar(16)-1) contains the sequence of vectors generated by the FGMRES method. The initial value is 0.0.
tmp(ipar(16):ipar(17)-1) contains the rotated Hessenberg matrix generated by the FGMRES method; the matrix is stored in the packed format. There is no initial value for this part of the tmp array.
tmp(ipar(17):ipar(18)-1) contains the rotation cosines vector generated by the FGMRES method. There is no initial value for this part of the tmp array.
tmp(ipar(18):ipar(19)-1) contains the rotation sines vector generated by the FGMRES method. There is no initial value for this part of the tmp array.
tmp(ipar(19):ipar(20)-1) contains the rotated residual vector generated by the FGMRES method. There is no initial value for this part of the tmp array.
tmp(ipar(20):ipar(21)-1) contains the solution vector to the least squares problem generated by the FGMRES method. There is no initial value for this part of the tmp array.
tmp(ipar(21):) contains the set of preconditioned vectors generated for the FGMRES method by the user. This part of the tmp array is not used if the non-preconditioned version of the FGMRES method is called. There is no initial value for this part of the tmp array.
NOTE You can define this array in the code as DOUBLE PRECISION tmp((2*ipar(15)+1)*n + ipar(15)*(ipar(15)+9)/2 + 1) if you run only non-preconditioned FGMRES iterations.

Schemes of Using the RCI FGMRES Routines
The following pseudocode shows the general scheme of using the RCI FGMRES routines.
...
generate matrix A
generate preconditioner C (optional)
   call dfgmres_init(n, x, b, RCI_request, ipar, dpar, tmp)
   change parameters in ipar, dpar if necessary
   call dfgmres_check(n, x, b, RCI_request, ipar, dpar, tmp)
1  call dfgmres(n, x, b, RCI_request, ipar, dpar, tmp)
   if (RCI_request.eq.1) then
     multiply the matrix A by tmp(ipar(22)) and put the result in tmp(ipar(23))
     It is possible to use MKL Sparse BLAS Level 2 subroutines for this operation
c    proceed with FGMRES iterations
     goto 1
   endif
   if (RCI_request.eq.2) then
     do the stopping test
     if (test not passed) then
c      proceed with FGMRES iterations
       goto 1
     else
c      stop FGMRES iterations
       goto 2
     endif
   endif
   if (RCI_request.eq.3) then (optional)
     apply the preconditioner C inverse to tmp(ipar(22)) and put the result in tmp(ipar(23))
c    proceed with FGMRES iterations
     goto 1
   endif
   if (RCI_request.eq.4) then
     check the norm of the next orthogonal vector, it is contained in dpar(7)
     if (the norm is not zero up to rounding/computational errors) then
c      proceed with FGMRES iterations
       goto 1
     else
c      stop FGMRES iterations
       goto 2
     endif
   endif
2  call dfgmres_get(n, x, b, RCI_request, ipar, dpar, tmp, itercount)
   current iteration number is in itercount
   the computed approximation is in the array x

For the FGMRES method, the array x initially contains the current approximation to the solution. It can be updated only by calling the routine dfgmres_get, which updates the solution in accordance with the computations performed by the routine dfgmres.
The above pseudocode demonstrates two main differences between the RCI FGMRES interface and the RCI CG interface (see CG Interface Description). The first difference relates to RCI_request=3: it uses different locations in the tmp array, which is two-dimensional for CG and one-dimensional for FGMRES. The second difference relates to RCI_request=4: the RCI CG interface never produces RCI_request=4.
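The RCI_request=1 step in the scheme above leaves the matrix-vector product to the caller. As a minimal sketch, assuming the matrix is held in the three-array CSR format with one-based indices, the product could be written in plain C as follows. The function name csr_matvec_1based is hypothetical; an Intel MKL Sparse BLAS Level 2 routine can be used for this step instead.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not an Intel MKL routine): y = A*x for an
   n-by-n matrix A in three-array CSR format with one-based indices.
   a  - nonzero values, stored row by row
   ia - ia[i] is the one-based position in a of the first nonzero
        of row i; ia has n+1 entries
   ja - one-based column index of each nonzero */
static void csr_matvec_1based(int n, const double *a, const int *ia,
                              const int *ja, const double *x, double *y)
{
    assert(n >= 0);
    assert(a != NULL && ia != NULL && ja != NULL);
    assert(x != NULL && y != NULL);
    for (int i = 0; i < n; ++i) {
        double sum = 0.0;
        /* Shift the one-based row pointers to zero-based C indices. */
        for (int k = ia[i] - 1; k < ia[i + 1] - 1; ++k)
            sum += a[k] * x[ja[k] - 1];
        y[i] = sum;
    }
}
```

In a C driver loop, and again assuming one-based locations in ipar, the call for RCI_request=1 might look like csr_matvec_1based(n, a, ia, ja, &tmp[ipar[21] - 1], &tmp[ipar[22] - 1]), after which control returns to dfgmres.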
RCI ISS Routines

dcg_init
Initializes the solver.
Syntax
dcg_init(n, x, b, RCI_request, ipar, dpar, tmp)
Include Files
• Fortran: mkl_rci.fi
• C: mkl_rci.h
Description
The routine dcg_init initializes the solver. After initialization, all subsequent invocations of the Intel MKL RCI CG routines use the values of all parameters returned by the routine dcg_init. Advanced users can skip this step and set these parameters directly in the appropriate routines.
WARNING You can modify the contents of these arrays after they are passed to the solver routine only if you are sure that the values are correct and consistent. You can perform a basic check for correctness and consistency by calling the dcg_check routine, but it does not guarantee that the method will work correctly.
Input Parameters
n INTEGER. Sets the size of the problem.
x DOUBLE PRECISION array of size n. Contains the initial approximation to the solution vector. Normally it is equal to 0 or to b.
b DOUBLE PRECISION array of size n. Contains the right-hand side vector.
Output Parameters
RCI_request INTEGER. Gives information about the result of the routine.
ipar INTEGER array of size 128. Refer to the CG Common Parameters.
dpar DOUBLE PRECISION array of size 128. Refer to the CG Common Parameters.
tmp DOUBLE PRECISION array of size (n,4). Refer to the CG Common Parameters.
Return Values
RCI_request= 0 Indicates that the task completed normally.
RCI_request= -10000 Indicates failure to complete the task.

dcg_check
Checks consistency and correctness of the user defined data.
Syntax
dcg_check(n, x, b, RCI_request, ipar, dpar, tmp)
Include Files
• Fortran: mkl_rci.fi
• C: mkl_rci.h
Description
The routine dcg_check checks consistency and correctness of the parameters to be passed to the solver routine dcg. However, this operation does not guarantee that the solver returns the correct result. It only reduces the chance of making a mistake in the parameters of the method.
Skip this operation only if you are sure that the correct data is specified in the solver parameters. The lengths of all vectors must be defined in a previous call to the dcg_init routine.
Input Parameters
n INTEGER. Sets the size of the problem.
x DOUBLE PRECISION array of size n. Contains the initial approximation to the solution vector. Normally it is equal to 0 or to b.
b DOUBLE PRECISION array of size n. Contains the right-hand side vector.
Output Parameters
RCI_request INTEGER. Gives information about the result of the routine.
ipar INTEGER array of size 128. Refer to the CG Common Parameters.
dpar DOUBLE PRECISION array of size 128. Refer to the CG Common Parameters.
tmp DOUBLE PRECISION array of size (n,4). Refer to the CG Common Parameters.
Return Values
RCI_request= 0 Indicates that the task completed normally.
RCI_request= -1100 Indicates that the task was interrupted because errors occurred.
RCI_request= -1001 Indicates that there are some warning messages.
RCI_request= -1010 Indicates that the routine changed some parameters to make them consistent or correct.
RCI_request= -1011 Indicates that there are some warning messages and that the routine changed some parameters.

dcg
Computes the approximate solution vector.
Syntax
dcg(n, x, b, RCI_request, ipar, dpar, tmp)
Include Files
• Fortran: mkl_rci.fi
• C: mkl_rci.h
Description
The dcg routine computes the approximate solution vector using the CG method [Young71]. The routine dcg uses the vector in the array x before the first call as an initial approximation to the solution. The parameter RCI_request gives you information about the task completion and requests results of certain operations that are required by the solver.
Note that lengths of all vectors must be defined in a previous call to the dcg_init routine.
Input Parameters
n INTEGER. Sets the size of the problem.
x DOUBLE PRECISION array of size n.
Contains the initial approximation to the solution vector.
b DOUBLE PRECISION array of size n. Contains the right-hand side vector.
tmp DOUBLE PRECISION array of size (n,4). Refer to the CG Common Parameters.
Output Parameters
RCI_request INTEGER. Gives information about the result of the routine.
x DOUBLE PRECISION array of size n. Contains the updated approximation to the solution vector.
ipar INTEGER array of size 128. Refer to the CG Common Parameters.
dpar DOUBLE PRECISION array of size 128. Refer to the CG Common Parameters.
tmp DOUBLE PRECISION array of size (n,4). Refer to the CG Common Parameters.
Return Values
RCI_request= 0 Indicates that the task completed normally and the solution is found and stored in the vector x. This occurs only if the stopping tests are fully automatic. For the user defined stopping tests, see the description of RCI_request= 2.
RCI_request= -1 Indicates that the routine was interrupted because the maximum number of iterations was reached, but the relative stopping criterion was not met. This situation occurs only if you request both tests.
RCI_request= -2 Indicates that the routine was interrupted because of an attempt to divide by zero. This situation happens if the matrix is non-positive definite or almost non-positive definite.
RCI_request= -10 Indicates that the routine was interrupted because the residual norm is invalid. This usually happens because the value dpar(6) was altered outside of the routine, or the dcg_check routine was not called.
RCI_request= -11 Indicates that the routine was interrupted because it entered an infinite cycle. This usually happens because the values ipar(8), ipar(9), ipar(10) were altered outside of the routine, or the dcg_check routine was not called.
RCI_request= 1 Indicates that you must multiply the matrix by tmp(1:n,1), put the result in tmp(1:n,2), and return control to the routine dcg.
RCI_request= 2 Indicates that you must perform the stopping tests. If they fail, return control to the dcg routine. Otherwise, the solution is found and stored in the vector x.
RCI_request= 3 Indicates that you must apply the preconditioner to tmp(:,3), put the result in tmp(:,4), and return control to the routine dcg.

dcg_get
Retrieves the number of the current iteration.
Syntax
dcg_get(n, x, b, RCI_request, ipar, dpar, tmp, itercount)
Include Files
• Fortran: mkl_rci.fi
• C: mkl_rci.h
Description
The routine dcg_get retrieves the current iteration number of the solution process.
Input Parameters
n INTEGER. Sets the size of the problem.
x DOUBLE PRECISION array of size n. Contains the initial approximation vector to the solution.
b DOUBLE PRECISION array of size n. Contains the right-hand side vector.
RCI_request INTEGER. This parameter is not used.
ipar INTEGER array of size 128. Refer to the CG Common Parameters.
dpar DOUBLE PRECISION array of size 128. Refer to the CG Common Parameters.
tmp DOUBLE PRECISION array of size (n,4). Refer to the CG Common Parameters.
Output Parameters
itercount INTEGER argument. Returns the current iteration number.
Return Values
The routine dcg_get has no return values.

dcgmrhs_init
Initializes the RCI CG solver with MRHS.
Syntax
dcgmrhs_init(n, x, nrhs, b, method, RCI_request, ipar, dpar, tmp)
Include Files
• Fortran: mkl_rci.fi
• C: mkl_rci.h
Description
The routine dcgmrhs_init initializes the solver. After initialization, all subsequent invocations of the Intel MKL RCI CG with multiple right hand sides (MRHS) routines use the values of all parameters that are returned by dcgmrhs_init. Advanced users may skip this step and set the values of these parameters directly in the appropriate routines.
WARNING You can modify the contents of these arrays after they are passed to the solver routine only if you are sure that the values are correct and consistent.
You can perform a basic check for correctness and consistency by calling the dcgmrhs_check routine, but it does not guarantee that the method will work correctly.
Input Parameters
n INTEGER. Sets the size of the problem.
x DOUBLE PRECISION matrix of size n*nrhs. Contains the initial approximation to the solution vectors. Normally it is equal to 0 or to b.
nrhs INTEGER. Sets the number of right-hand sides.
b DOUBLE PRECISION matrix of size nrhs*n. Contains the right-hand side vectors.
method INTEGER. Specifies the method of solution: a value of 1 indicates CG with multiple right hand sides (default value).
Output Parameters
RCI_request INTEGER. Gives information about the result of the routine.
ipar INTEGER array of size (128+2*nrhs). Refer to the CG Common Parameters.
dpar DOUBLE PRECISION array of size (128+2*nrhs). Refer to the CG Common Parameters.
tmp DOUBLE PRECISION array of size (n, 3+nrhs). Refer to the CG Common Parameters.
Return Values
RCI_request= 0 Indicates that the task completed normally.
RCI_request= -10000 Indicates failure to complete the task.

dcgmrhs_check
Checks consistency and correctness of the user defined data.
Syntax
dcgmrhs_check(n, x, nrhs, b, RCI_request, ipar, dpar, tmp)
Include Files
• Fortran: mkl_rci.fi
• C: mkl_rci.h
Description
The routine dcgmrhs_check checks the consistency and correctness of the parameters to be passed to the solver routine dcgmrhs. While this operation reduces the chance of making a mistake in the parameters, it does not guarantee that the solver returns the correct result.
If you are sure that the correct data is specified in the solver parameters, you can skip this operation.
The lengths of all vectors must be defined in a previous call to the dcgmrhs_init routine.
Input Parameters
n INTEGER. Sets the size of the problem.
x DOUBLE PRECISION matrix of size n*nrhs. Contains the initial approximation to the solution vectors. Normally it is equal to 0 or to b.
nrhs INTEGER. Sets the number of right-hand sides.
b DOUBLE PRECISION matrix of size (nrhs,n). Contains the right-hand side vectors.
Output Parameters
RCI_request INTEGER. Gives information about the result of the routine.
ipar INTEGER array of size (128+2*nrhs). Refer to the CG Common Parameters.
dpar DOUBLE PRECISION array of size (128+2*nrhs). Refer to the CG Common Parameters.
tmp DOUBLE PRECISION array of size (n, 3+nrhs). Refer to the CG Common Parameters.
Return Values
RCI_request= 0 Indicates that the task completed normally.
RCI_request= -1100 Indicates that the task was interrupted because errors occurred.
RCI_request= -1001 Indicates that there are some warning messages.
RCI_request= -1010 Indicates that the routine changed some parameters to make them consistent or correct.
RCI_request= -1011 Indicates that there are some warning messages and that the routine changed some parameters.

dcgmrhs
Computes the approximate solution vectors.
Syntax
dcgmrhs(n, x, nrhs, b, RCI_request, ipar, dpar, tmp)
Include Files
• Fortran: mkl_rci.fi
• C: mkl_rci.h
Description
The routine dcgmrhs computes approximate solution vectors using the CG with multiple right hand sides (MRHS) method [Young71]. The routine dcgmrhs uses the value that was in x before the first call as an initial approximation to the solution. The parameter RCI_request gives information about task completion status and requests results of certain operations that are required by the solver.
Note that lengths of all vectors are assumed to have been defined in a previous call to the dcgmrhs_init routine.
Input Parameters
n INTEGER. Sets the size of the problem, and the sizes of arrays x and b.
x DOUBLE PRECISION matrix of size n*nrhs. Contains the initial approximation to the solution vectors.
nrhs INTEGER. Sets the number of right-hand sides.
b DOUBLE PRECISION matrix of size nrhs*n. Contains the right-hand side vectors.
tmp DOUBLE PRECISION array of size (n, 3+nrhs). Refer to the CG Common Parameters.
Output Parameters
RCI_request INTEGER. Gives information about the result of the routine.
x DOUBLE PRECISION matrix of size n-by-nrhs. Contains the updated approximation to the solution vectors.
ipar INTEGER array of size (128+2*nrhs). Refer to the CG Common Parameters.
dpar DOUBLE PRECISION array of size (128+2*nrhs). Refer to the CG Common Parameters.
tmp DOUBLE PRECISION array of size (n, 3+nrhs). Refer to the CG Common Parameters.
Return Values
RCI_request= 0 Indicates that the task completed normally and the solution is found and stored in x. This occurs only if the stopping tests are fully automatic. For the user defined stopping tests, see the description of RCI_request= 2.
RCI_request= -1 Indicates that the routine was interrupted because the maximum number of iterations was reached, but the relative stopping criterion was not met. This situation occurs only if you request both tests.
RCI_request= -2 Indicates that the routine was interrupted because of an attempt to divide by zero. This situation happens if the matrix is non-positive definite or almost non-positive definite.
RCI_request= -10 Indicates that the routine was interrupted because the residual norm is invalid. This usually happens because the value dpar(6) was altered outside of the routine, or the dcgmrhs_check routine was not called.
RCI_request= -11 Indicates that the routine was interrupted because it entered an infinite cycle. This usually happens because the values ipar(8), ipar(9), ipar(10) were altered outside of the routine, or the dcgmrhs_check routine was not called.
RCI_request= 1 Indicates that you must multiply the matrix by tmp(1:n,1), put the result in tmp(1:n,2), and return control to the routine dcgmrhs.
RCI_request= 2 Indicates that you must perform the stopping tests. If they fail, return control to the dcgmrhs routine.
Otherwise, the solution is found and stored in the vector x.
RCI_request= 3 Indicates that you must apply the preconditioner to tmp(:,3), put the result in tmp(:,4), and return control to the solver routine.

dcgmrhs_get
Retrieves the number of the current iteration.
Syntax
dcgmrhs_get(n, x, nrhs, b, RCI_request, ipar, dpar, tmp, itercount)
Include Files
• Fortran: mkl_rci.fi
• C: mkl_rci.h
Description
The routine dcgmrhs_get retrieves the current iteration number of the solving process.
Input Parameters
n INTEGER. Sets the size of the problem.
x DOUBLE PRECISION matrix of size n*nrhs. Contains the initial approximation to the solution vectors.
nrhs INTEGER. Sets the number of right-hand sides.
b DOUBLE PRECISION matrix of size (nrhs,n). Contains the right-hand side vectors.
RCI_request INTEGER. This parameter is not used.
ipar INTEGER array of size (128+2*nrhs). Refer to the CG Common Parameters.
dpar DOUBLE PRECISION array of size (128+2*nrhs). Refer to the CG Common Parameters.
tmp DOUBLE PRECISION array of size (n, 3+nrhs). Refer to the CG Common Parameters.
Output Parameters
itercount INTEGER argument. Returns the current iteration number.
Return Values
The routine dcgmrhs_get has no return values.

dfgmres_init
Initializes the solver.
Syntax
dfgmres_init(n, x, b, RCI_request, ipar, dpar, tmp)
Include Files
• Fortran: mkl_rci.fi
• C: mkl_rci.h
Description
The routine dfgmres_init initializes the solver. After initialization, all subsequent invocations of Intel MKL RCI FGMRES routines use the values of all parameters that are returned by dfgmres_init. Advanced users may skip this step and set the values of these parameters directly in the appropriate routines.
WARNING You can modify the contents of these arrays after they are passed to the solver routine only if you are sure that the values are correct and consistent.
You can perform a basic check for correctness and consistency by calling the dfgmres_check routine, but it does not guarantee that the method will work correctly.
Input Parameters
n INTEGER. Sets the size of the problem.
x DOUBLE PRECISION array of size n. Contains the initial approximation to the solution vector. Normally it is equal to 0 or to b.
b DOUBLE PRECISION array of size n. Contains the right-hand side vector.
Output Parameters
RCI_request INTEGER. Gives information about the result of the routine.
ipar INTEGER array of size 128. Refer to the FGMRES Common Parameters.
dpar DOUBLE PRECISION array of size 128. Refer to the FGMRES Common Parameters.
tmp DOUBLE PRECISION array of size ((2*ipar(15)+1)*n + ipar(15)*(ipar(15)+9)/2 + 1). Refer to the FGMRES Common Parameters.
Return Values
RCI_request= 0 Indicates that the task completed normally.
RCI_request= -10000 Indicates failure to complete the task.

dfgmres_check
Checks consistency and correctness of the user defined data.
Syntax
dfgmres_check(n, x, b, RCI_request, ipar, dpar, tmp)
Include Files
• Fortran: mkl_rci.fi
• C: mkl_rci.h
Description
The routine dfgmres_check checks consistency and correctness of the parameters to be passed to the solver routine dfgmres. However, this operation does not guarantee that the method gives the correct result. It only reduces the chance of making a mistake in the parameters of the routine. Skip this operation only if you are sure that the correct data is specified in the solver parameters.
The lengths of all vectors are assumed to have been defined in a previous call to the dfgmres_init routine.
Input Parameters
n INTEGER. Sets the size of the problem.
x DOUBLE PRECISION array of size n. Contains the initial approximation to the solution vector. Normally it is equal to 0 or to b.
b DOUBLE PRECISION array of size n. Contains the right-hand side vector.
Output Parameters
RCI_request INTEGER. Gives information about the result of the routine.
ipar INTEGER array of size 128. Refer to the FGMRES Common Parameters.
dpar DOUBLE PRECISION array of size 128. Refer to the FGMRES Common Parameters.
tmp DOUBLE PRECISION array of size ((2*ipar(15)+1)*n + ipar(15)*(ipar(15)+9)/2 + 1). Refer to the FGMRES Common Parameters.
Return Values
RCI_request= 0 Indicates that the task completed normally.
RCI_request= -1100 Indicates that the task was interrupted because errors occurred.
RCI_request= -1001 Indicates that there are some warning messages.
RCI_request= -1010 Indicates that the routine changed some parameters to make them consistent or correct.
RCI_request= -1011 Indicates that there are some warning messages and that the routine changed some parameters.

dfgmres
Performs the FGMRES iterations.
Syntax
dfgmres(n, x, b, RCI_request, ipar, dpar, tmp)
Include Files
• Fortran: mkl_rci.fi
• C: mkl_rci.h
Description
The routine dfgmres performs the FGMRES iterations [Saad03], using the value that was in the array x before the first call as an initial approximation of the solution vector. To update the current approximation to the solution, the dfgmres_get routine must be called. The RCI FGMRES iterations can be continued after the call to the dfgmres_get routine only if the value of the parameter ipar(13) is not equal to 0 (the default value is 0). Note that the updated solution overwrites the right hand side in the vector b if the parameter ipar(13) is positive, and the restarted version of the FGMRES method cannot then be run. If you want to keep the right hand side, you must save it in a different memory location before the first call to the dfgmres_get routine with a positive ipar(13).
The parameter RCI_request gives information about the task completion and requests results of certain operations that the solver requires.
The lengths of all the vectors must be defined in a previous call to the dfgmres_init routine.
Input Parameters
n INTEGER. Sets the size of the problem.
x DOUBLE PRECISION array of size n. Contains the initial approximation to the solution vector.
b DOUBLE PRECISION array of size n. Contains the right-hand side vector.
tmp DOUBLE PRECISION array of size ((2*ipar(15)+1)*n + ipar(15)*(ipar(15)+9)/2 + 1). Refer to the FGMRES Common Parameters.
Output Parameters
RCI_request INTEGER. Gives information about the result of the routine.
ipar INTEGER array of size 128. Refer to the FGMRES Common Parameters.
dpar DOUBLE PRECISION array of size 128. Refer to the FGMRES Common Parameters.
tmp DOUBLE PRECISION array of size ((2*ipar(15)+1)*n + ipar(15)*(ipar(15)+9)/2 + 1). Refer to the FGMRES Common Parameters.
Return Values
RCI_request= 0 Indicates that the task completed normally and the solution is found and stored in the vector x. This occurs only if the stopping tests are fully automatic. For the user defined stopping tests, see the description of RCI_request= 2 or 4.
RCI_request= -1 Indicates that the routine was interrupted because the maximum number of iterations was reached, but the relative stopping criterion was not met. This situation occurs only if you request both tests.
RCI_request= -10 Indicates that the routine was interrupted because of an attempt to divide by zero. Usually this happens if the matrix is degenerate or almost degenerate. However, it may happen if the parameter dpar is altered, or if the method is not stopped when the solution is found.
RCI_request= -11 Indicates that the routine was interrupted because it entered an infinite cycle. Usually this happens because the values ipar(8), ipar(9), ipar(10) were altered outside of the routine, or the dfgmres_check routine was not called.
RCI_request= -12 Indicates that the routine was interrupted because errors were found in the method parameters. Usually this happens if the parameters ipar and dpar were altered by mistake outside the routine.
RCI_request= 1 Indicates that you must multiply the matrix by tmp(ipar(22)), put the result in tmp(ipar(23)), and return control to the routine dfgmres.
RCI_request= 2 Indicates that you must perform the stopping tests. If they fail, return control to the dfgmres routine. Otherwise, the FGMRES solution is found, and you can run the dfgmres_get routine to update the computed solution in the vector x.
RCI_request= 3 Indicates that you must apply the inverse preconditioner to tmp(ipar(22)), put the result in tmp(ipar(23)), and return control to the routine dfgmres.
RCI_request= 4 Indicates that you must check the norm of the currently generated vector. If it is not zero within the computational/rounding errors, return control to the dfgmres routine. Otherwise, the FGMRES solution is found, and you can run the dfgmres_get routine to update the computed solution in the vector x.

dfgmres_get
Retrieves the number of the current iteration and updates the solution.
Syntax
dfgmres_get(n, x, b, RCI_request, ipar, dpar, tmp, itercount)
Include Files
• Fortran: mkl_rci.fi
• C: mkl_rci.h
Description
The routine dfgmres_get retrieves the current iteration number of the solution process and updates the solution according to the computations performed by the dfgmres routine. To retrieve the current iteration number only, set the parameter ipar(13)= -1 beforehand. Normally, you should do this before proceeding further with the computations. If the intermediate solution is needed, the method parameters must be set properly. For details see FGMRES Common Parameters and the Iterative Sparse Solver code examples in the examples\solver\source folder of your Intel MKL directory (cg_no_precon.f, cg_no_precon_c.c, cg_mrhs.f, cg_mrhs_precond.f, cg_mrhs_stop_crt.f, fgmres_no_precon_f.f, fgmres_no_precon_c.c).
Input Parameters
n INTEGER. Sets the size of the problem.
ipar INTEGER array of size 128. Refer to the FGMRES Common Parameters.
dpar DOUBLE PRECISION array of size 128. Refer to the FGMRES Common Parameters.
tmp DOUBLE PRECISION array of size ((2*ipar(15)+1)*n + ipar(15)*(ipar(15)+9)/2 + 1). Refer to the FGMRES Common Parameters.
Output Parameters
x DOUBLE PRECISION array of size n. If ipar(13)= 0, it contains the updated approximation to the solution according to the computations done in the dfgmres routine. Otherwise, it is not changed.
b DOUBLE PRECISION array of size n. If ipar(13)> 0, it contains the updated approximation to the solution according to the computations done in the dfgmres routine. Otherwise, it is not changed.
RCI_request INTEGER. Gives information about the result of the routine.
itercount INTEGER argument. Contains the value of the current iteration number.
Return Values
RCI_request= 0 Indicates that the task completed normally.
RCI_request= -12 Indicates that the routine was interrupted because errors were found in the routine parameters. Usually this happens if the parameters ipar and dpar are altered by mistake outside of the routine.
RCI_request= -10000 Indicates that the routine failed to complete the task.

Implementation Details
Several aspects of the Intel MKL RCI ISS interface are platform-specific and language-specific. To promote portability across platforms and ease of use across different languages, include one of the Intel MKL RCI ISS language-specific header files.
The C-language header file defines these function prototypes:
void dcg_init(int *n, double *x, double *b, int *rci_request, int *ipar, double *dpar, double *tmp);
void dcg_check(int *n, double *x, double *b, int *rci_request, int *ipar, double *dpar, double *tmp);
void dcg(int *n, double *x, double *b, int *rci_request, int *ipar, double *dpar, double *tmp);
void dcg_get(int *n, double *x, double *b, int *rci_request, int *ipar, double *dpar, double *tmp, int *itercount);
void dcgmrhs_init(int *n, double *x, int *nRhs, double *b, int *method, int *rci_request, int *ipar, double *dpar, double *tmp);
void dcgmrhs_check(int *n, double *x, int *nRhs, double *b, int *rci_request, int *ipar, double *dpar, double *tmp);
void dcgmrhs(int *n, double *x, int *nRhs, double *b, int *rci_request, int *ipar, double *dpar, double *tmp);
void dcgmrhs_get(int *n, double *x, int *nRhs, double *b, int *rci_request, int *ipar, double *dpar, double *tmp, int *itercount);
void dfgmres_init(int *n, double *x, double *b, int *rci_request, int *ipar, double *dpar, double *tmp);
void dfgmres_check(int *n, double *x, double *b, int *rci_request, int *ipar, double *dpar, double *tmp);
void dfgmres(int *n, double *x, double *b, int *rci_request, int *ipar, double *dpar, double *tmp);
void dfgmres_get(int *n, double *x, double *b, int *rci_request, int *ipar, double *dpar, double *tmp, int *itercount);
NOTE Intel MKL does not support the RCI ISS interface unless you include the language-specific header file.

Preconditioners based on Incomplete LU Factorization Technique
Preconditioners, or accelerators, are used to accelerate an iterative solution process. In some cases, their use can reduce the number of iterations dramatically and thus lead to better solver performance. Although the terms preconditioner and accelerator are synonyms, hereafter only preconditioner is used.
Intel MKL provides two preconditioners, ILU0 and ILUT, for sparse matrices presented in the format accepted by the Intel MKL direct sparse solvers (the three-array variation of the CSR storage format described in Sparse Matrix Storage Format). The algorithms used are described in [Saad03].
The ILU0 preconditioner is based on a well-known factorization of the original matrix into a product of two triangular matrices: lower and upper triangular matrices. Usually, such a decomposition leads to some fill-in in the resulting matrix structure in comparison with the original matrix. The distinctive feature of the ILU0 preconditioner is that it preserves the structure of the original matrix in the result.
Unlike the ILU0 preconditioner, the ILUT preconditioner preserves some resulting fill-in in the preconditioner matrix structure. The distinctive feature of the ILUT algorithm is that it calculates each element of the preconditioner and saves each one if it satisfies two conditions simultaneously: its value is greater than the product of the given tolerance and matrix row norm, and its value is in the given bandwidth of the resulting preconditioner matrix.
Both ILU0 and ILUT preconditioners can apply to any non-degenerate matrix. They can be used alone or together with the Intel MKL RCI FGMRES solver (see Sparse Solver Routines). Avoid using these preconditioners with the MKL RCI CG solver because, in general, they produce a non-symmetric resulting matrix even if the original matrix is symmetric. Usually, an inverse of the preconditioner is required in this case. To apply it, call the Intel MKL triangular solver routine mkl_dcsrtrsv twice: first for the lower triangular part of the preconditioner, and then for its upper triangular part.
NOTE Although ILU0 and ILUT preconditioners apply to any non-degenerate matrix, in some cases the algorithm may fail to ensure successful termination and the required result.
Whether or not the preconditioner produces an acceptable result can only be determined in practice. A preconditioner may increase the number of iterations for an arbitrary case of the system and the initial solution, and even ruin the convergence. It is your responsibility as a user to choose a suitable preconditioner.

General Scheme of Using ILUT and RCI FGMRES Routines

The general scheme for use is the same for both preconditioners. Some differences exist in the calling parameters of the preconditioners and in the subsequent call of two triangular solvers. You can see all these differences in the code examples for both preconditioners in the examples\solver\source folder of your Intel MKL directory (dcsrilu0_exampl1.c, dcsrilu0_exampl2.f, dcsrilut_exampl1.c, dcsrilut_exampl2.f).

The following pseudocode shows the general scheme of using the ILUT preconditioner in the RCI FGMRES context.

    ...
    generate matrix A
    generate preconditioner C (optional)
          call dfgmres_init(n, x, b, RCI_request, ipar, dpar, tmp)
          change parameters in ipar, dpar if necessary
          call dcsrilut(n, a, ia, ja, bilut, ibilut, jbilut, tol, maxfil, ipar, dpar, ierr)
          call dfgmres_check(n, x, b, RCI_request, ipar, dpar, tmp)
    1     call dfgmres(n, x, b, RCI_request, ipar, dpar, tmp)
          if (RCI_request.eq.1) then
            multiply the matrix A by tmp(ipar(22)) and put the result in tmp(ipar(23))
    c       proceed with FGMRES iterations
            goto 1
          endif
          if (RCI_request.eq.2) then
            do the stopping test
            if (test not passed) then
    c         proceed with FGMRES iterations
              goto 1
            else
    c         stop FGMRES iterations
              goto 2
            endif
          endif
          if (RCI_request.eq.3) then
    c       Below, trvec is an intermediate vector of length at least n.
    c       Here is the recommended use of the result produced by the ILUT routine
    c       via standard Intel MKL Sparse BLAS solver routine mkl_dcsrtrsv.
            call mkl_dcsrtrsv('L','N','U', n, bilut, ibilut, jbilut, tmp(ipar(22)), trvec)
            call mkl_dcsrtrsv('U','N','N', n, bilut, ibilut, jbilut, trvec, tmp(ipar(23)))
    c       proceed with FGMRES iterations
            goto 1
          endif
          if (RCI_request.eq.4) then
            check the norm of the next orthogonal vector, it is contained in dpar(7)
            if (the norm is not zero up to rounding/computational errors) then
    c         proceed with FGMRES iterations
              goto 1
            else
    c         stop FGMRES iterations
              goto 2
            endif
          endif
    2     call dfgmres_get(n, x, b, RCI_request, ipar, dpar, tmp, itercount)
          current iteration number is in itercount
          the computed approximation is in the array x

ILU0 and ILUT Preconditioners Interface Description

The concepts required to understand the use of the Intel MKL preconditioner routines are discussed in Appendix A, Linear Solvers Basics.

In this section the FORTRAN style notations are used. All types refer to the standard Fortran types, INTEGER and DOUBLE PRECISION. C and C++ programmers must refer to the section Calling Sparse Solver and Preconditioner Routines from C/C++ for information on mapping Fortran types to C/C++ types.

User Data Arrays

The preconditioner routines take arrays of user data as input. To minimize storage requirements and improve overall run-time efficiency, the Intel MKL preconditioner routines do not make copies of the user input arrays.

Common Parameters

Some parameters of the preconditioners are common with the FGMRES Common Parameters. The routine dfgmres_init specifies their default and initial values. However, some parameters can be redefined with other values. These parameters are listed below.

For the ILU0 preconditioner:

ipar(2) - specifies the destination of error messages generated by the ILU0 routine. The default value 6 means that all error messages are displayed on the screen. Otherwise, the routine creates a log file called MKL_PREC_log.txt and writes error messages to it.
Note that if the parameter ipar(6) is set to 0, then error messages are not generated at all.

ipar(6) - specifies whether error messages are generated. If its value is not equal to 0, the ILU0 routine returns error messages as specified by the parameter ipar(2). Otherwise, the routine does not generate error messages at all, but returns a negative value for the parameter ierr. The default value is 1.

For the ILUT preconditioner:

ipar(2) - specifies the destination of error messages generated by the ILUT routine. The default value 6 means that all messages are displayed on the screen. Otherwise, the routine creates a log file called MKL_PREC_log.txt and writes error messages to it. Note that if the parameter ipar(6) is set to 0, then error messages are not generated at all.

ipar(6) - specifies whether error messages are generated. If its value is not equal to 0, the ILUT routine returns error messages as specified by the parameter ipar(2). Otherwise, the routine does not generate error messages at all, but returns a negative value for the parameter ierr. The default value is 1.

ipar(7) - if its value is greater than 0, the ILUT routine generates warning messages as specified by the parameter ipar(2) and continues calculations. If its value is equal to 0, the routine returns a positive value of the parameter ierr. If its value is less than 0, the routine generates a warning message as specified by the parameter ipar(2) and returns a positive value of the parameter ierr. The default value is 1.

dcsrilu0

ILU0 preconditioner based on incomplete LU factorization of a sparse matrix.
Syntax

Fortran:
    call dcsrilu0(n, a, ia, ja, bilu0, ipar, dpar, ierr)
C:
    dcsrilu0(&n, a, ia, ja, bilu0, ipar, dpar, &ierr);

Include Files
• Fortran: mkl_rci.fi
• C: mkl_rci.h

Description

The routine dcsrilu0 computes a preconditioner B [Saad03] of a given sparse matrix A stored in the format accepted in the direct sparse solvers: A~B=L*U, where L is a lower triangular matrix with a unit diagonal, U is an upper triangular matrix with a non-unit diagonal, and the portrait of the original matrix A is used to store the incomplete factors L and U.

Input Parameters

n       INTEGER. Size (number of rows or columns) of the original square n-by-n matrix A.

a       DOUBLE PRECISION. Array containing the set of elements of the matrix A. Its length is equal to the number of non-zero elements in the matrix A. Refer to the values array description in the Sparse Matrix Storage Format for more details.

ia      INTEGER. Array of size (n+1) containing begin indices of rows of the matrix A such that ia(i) is the index in the array a of the first non-zero element from the row i. The value of the last element ia(n+1) is equal to the number of non-zero elements in the matrix A plus one. Refer to the rowIndex array description in the Sparse Matrix Storage Format for more details.

ja      INTEGER. Array containing the column indices for each non-zero element of the matrix A. Its size is equal to the size of the array a. Refer to the columns array description in the Sparse Matrix Storage Format for more details.
        NOTE Column indices must be in ascending order for each row of the matrix.

ipar    INTEGER array of size 128. This parameter specifies the integer set of data for both the ILU0 and RCI FGMRES computations. Refer to the ipar array description in the FGMRES Common Parameters for more details on FGMRES parameter entries. The entries that are specific to ILU0 are listed below.
        ipar(31) specifies how the routine operates when a zero diagonal element occurs during calculation.
        If this parameter is set to 0 (the default value set by the routine dfgmres_init), then the calculations are stopped and the routine returns a non-zero error value. Otherwise, the diagonal element is set to the value of dpar(32) and the calculations continue.
        NOTE You must declare the array ipar with length 128. While defining the array in the code as INTEGER ipar(31) works, there is no guarantee of future compatibility with Intel MKL.

dpar    DOUBLE PRECISION array of size 128. This parameter specifies the double precision set of data for both the ILU0 and RCI FGMRES computations. Refer to the dpar array description in the FGMRES Common Parameters for more details on FGMRES parameter entries. The entries specific to ILU0 are listed below.
        dpar(31) specifies a small value, which is compared with the computed diagonal elements. When ipar(31) is not 0, diagonal elements less than dpar(31) are set to dpar(32). The default value is 1.0D-16.
        NOTE This parameter can be set to a negative value, because the calculation uses its absolute value. If this parameter is set to 0, the comparison with the diagonal element is not performed.
        dpar(32) specifies the value that is assigned to the diagonal element if its value is less than dpar(31) (see above). The default value is 1.0D-10.
        NOTE You must declare the array dpar with length 128. While defining the array in the code as DOUBLE PRECISION dpar(32) works, there is no guarantee of future compatibility with Intel MKL.

Output Parameters

bilu0   DOUBLE PRECISION. Array containing non-zero elements of the resulting preconditioning matrix B, stored in the format accepted in direct sparse solvers. Its size is equal to the number of non-zero elements in the matrix A. Refer to the values array description in the Sparse Matrix Storage Format section for more details.

ierr    INTEGER. Error flag, gives information about the routine completion.
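As an illustration of the factorization described above, the following is a minimal sketch of a zero fill-in incomplete LU on a CSR matrix, written in plain C without calling Intel MKL. It is not the dcsrilu0 implementation. The names find_diag, ilu0_sketch, and apply_ilu0 are hypothetical, indices are zero-based (C convention) rather than one-based, and the ipar(31)/dpar(31)/dpar(32) replacement logic is reduced to returning -1 on a missing or zero diagonal. The second routine applies the factors by a forward and a backward substitution, which is the role mkl_dcsrtrsv plays in the general scheme.

```c
#include <string.h>

/* Return the position of the diagonal entry of `row` in ja, or -1 if absent. */
static int find_diag(const int *ia, const int *ja, int row)
{
    for (int p = ia[row]; p < ia[row + 1]; ++p)
        if (ja[p] == row) return p;
    return -1;
}

/* Hypothetical ILU(0) sketch for an n-by-n CSR matrix (zero-based indices,
   column indices ascending within each row). On return, b holds L (strictly
   lower part, unit diagonal implied) and U (diagonal and upper part) in the
   sparsity pattern of A, so ia and ja describe the factors as well.
   Returns 0 on success, -1 if a diagonal entry is missing or zero. */
static int ilu0_sketch(int n, const double *a, const int *ia, const int *ja,
                       double *b)
{
    memcpy(b, a, (size_t)ia[n] * sizeof *b);
    for (int i = 0; i < n; ++i) {
        /* eliminate entries left of the diagonal in row i */
        for (int kk = ia[i]; kk < ia[i + 1] && ja[kk] < i; ++kk) {
            int k = ja[kk];
            int dk = find_diag(ia, ja, k);   /* valid: row k was checked earlier */
            b[kk] /= b[dk];                  /* multiplier L(i,k) */
            int p = dk + 1;                  /* walks the upper part of row k */
            for (int jj = kk + 1; jj < ia[i + 1]; ++jj) {
                while (p < ia[k + 1] && ja[p] < ja[jj]) ++p;
                /* update only positions already present in row i: no fill-in */
                if (p < ia[k + 1] && ja[p] == ja[jj])
                    b[jj] -= b[kk] * b[p];
            }
        }
        int di = find_diag(ia, ja, i);
        if (di < 0 || b[di] == 0.0) return -1;
    }
    return 0;
}

/* Solve (L*U) y = x: forward substitution with unit-diagonal L,
   then backward substitution with U. */
static void apply_ilu0(int n, const double *b, const int *ia, const int *ja,
                       const double *x, double *y)
{
    for (int i = 0; i < n; ++i) {
        double s = x[i];
        for (int p = ia[i]; p < ia[i + 1] && ja[p] < i; ++p)
            s -= b[p] * y[ja[p]];
        y[i] = s;
    }
    for (int i = n - 1; i >= 0; --i) {
        double s = y[i], d = 1.0;
        for (int p = ia[i]; p < ia[i + 1]; ++p) {
            if (ja[p] > i) s -= b[p] * y[ja[p]];
            else if (ja[p] == i) d = b[p];
        }
        y[i] = s / d;
    }
}
```

For a tridiagonal matrix the exact LU factors create no fill-in, so in that special case the sketch reproduces the exact solve; for general sparsity patterns the result is only an approximation of A.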
NOTE To present the resulting preconditioning matrix in the CSR format the arrays ia (row indices) and ja (column indices) of the input matrix must be used.

Return Values

ierr=0      Indicates that the task completed normally.
ierr=-101   Indicates that the routine was interrupted and that an error occurred: at least one diagonal element is omitted from the matrix in CSR format (see Sparse Matrix Storage Format).
ierr=-102   Indicates that the routine was interrupted because the matrix contains a diagonal element with the value of zero.
ierr=-103   Indicates that the routine was interrupted because the matrix contains a diagonal element which is so small that it could cause an overflow, or that it would cause a bad approximation to ILU0.
ierr=-104   Indicates that the routine was interrupted because the memory is insufficient for the internal work array.
ierr=-105   Indicates that the routine was interrupted because the input matrix size n is less than or equal to 0.
ierr=-106   Indicates that the routine was interrupted because the column indices ja are not in ascending order.

Interfaces

FORTRAN 77 and Fortran 95:
    SUBROUTINE dcsrilu0 (n, a, ia, ja, bilu0, ipar, dpar, ierr)
    INTEGER n, ierr, ipar(128)
    INTEGER ia(*), ja(*)
    DOUBLE PRECISION a(*), bilu0(*), dpar(128)
C:
    void dcsrilu0 (int *n, double *a, int *ia, int *ja, double *bilu0, int *ipar, double *dpar, int *ierr);

dcsrilut

ILUT preconditioner based on the incomplete LU factorization with a threshold of a sparse matrix.
Syntax

Fortran:
    call dcsrilut(n, a, ia, ja, bilut, ibilut, jbilut, tol, maxfil, ipar, dpar, ierr)
C:
    dcsrilut(&n, a, ia, ja, bilut, ibilut, jbilut, &tol, &maxfil, ipar, dpar, &ierr);

Include Files
• Fortran: mkl_rci.fi
• C: mkl_rci.h

Description

The routine dcsrilut computes a preconditioner B [Saad03] of a given sparse matrix A stored in the format accepted in the direct sparse solvers: A~B=L*U, where L is a lower triangular matrix with unit diagonal and U is an upper triangular matrix with non-unit diagonal.

The following threshold criteria are used to generate the incomplete factors L and U:
1) the resulting entry must be greater than the matrix current row norm multiplied by the parameter tol, and
2) the number of the non-zero elements in each row of the resulting L and U factors must not be greater than the value of the parameter maxfil.

Input Parameters

n       INTEGER. Size (number of rows or columns) of the original square n-by-n matrix A.

a       DOUBLE PRECISION. Array containing all non-zero elements of the matrix A. The length of the array is equal to their number. Refer to the values array description in the Sparse Matrix Storage Format section for more details.

ia      INTEGER. Array of size (n+1) containing indices of non-zero elements in the array a. ia(i) is the index of the first non-zero element from the row i. The value of the last element ia(n+1) is equal to the number of non-zeros in the matrix A plus one. Refer to the rowIndex array description in the Sparse Matrix Storage Format for more details.

ja      INTEGER. Array of size equal to the size of the array a. This array contains the column numbers for each non-zero element of the matrix A. Refer to the columns array description in the Sparse Matrix Storage Format for more details.
        NOTE Column numbers must be in ascending order for each row of the matrix.

tol     DOUBLE PRECISION. Tolerance for threshold criterion for the resulting entries of the preconditioner.

maxfil  INTEGER.
        Maximum fill-in, which is half of the preconditioner bandwidth. The number of non-zero elements in the rows of the preconditioner cannot exceed (2*maxfil+1).

ipar    INTEGER array of size 128. This parameter is used to specify the integer set of data for both the ILUT and RCI FGMRES computations. Refer to the ipar array description in the FGMRES Common Parameters for more details on FGMRES parameter entries. The entries specific to ILUT are listed below.
        ipar(31) specifies how the routine operates if the value of the computed diagonal element is less than the current matrix row norm multiplied by the value of the parameter tol. If ipar(31) = 0, then the calculation is stopped and the routine returns a non-zero error value. Otherwise, the value of the diagonal element is set to a value determined by dpar(31) (see its description below), and the calculations continue.
        NOTE There is no default value for ipar(31) even if the preconditioner is used within the RCI ISS context. Always set the value of this entry.
        NOTE You must declare the array ipar with length 128. While defining the array in the code as INTEGER ipar(31) works, there is no guarantee of future compatibility with Intel MKL.

dpar    DOUBLE PRECISION array of size 128. This parameter specifies the double precision set of data for both ILUT and RCI FGMRES computations. Refer to the dpar array description in the FGMRES Common Parameters for more details on FGMRES parameter entries. The entries that are specific to ILUT are listed below.
        dpar(31) is used to adjust the value of small diagonal elements. Diagonal elements with a value less than the current matrix row norm multiplied by tol are replaced with the value of dpar(31) multiplied by the matrix row norm.
        NOTE There is no default value for the dpar(31) entry even if the preconditioner is used within the RCI ISS context. Always set the value of this entry.
        NOTE You must declare the array dpar with length 128.
        While defining the array in the code as DOUBLE PRECISION dpar(31) works, there is no guarantee of future compatibility with Intel MKL.

Output Parameters

bilut   DOUBLE PRECISION. Array containing non-zero elements of the resulting preconditioning matrix B, stored in the format accepted in the direct sparse solvers. Refer to the values array description in the Sparse Matrix Storage Format for more details. The size of the array is equal to (2*maxfil+1)*n-maxfil*(maxfil+1)+1.
        NOTE Provide enough memory for this array before calling the routine. Otherwise, the routine may fail to complete successfully with a correct result.

ibilut  INTEGER. Array of size (n+1) containing indices of non-zero elements in the array bilut. ibilut(i) is the index of the first non-zero element from the row i. The value of the last element ibilut(n+1) is equal to the number of non-zeros in the matrix B plus one. Refer to the rowIndex array description in the Sparse Matrix Storage Format for more details.

jbilut  INTEGER. Array whose size is equal to the size of the array bilut. This array contains the column numbers for each non-zero element of the matrix B. Refer to the columns array description in the Sparse Matrix Storage Format for more details.

ierr    INTEGER. Error flag, informs about the routine completion.

Return Values

ierr=0      Indicates that the task completed normally.
ierr=-101   Indicates that the routine was interrupted because of an error: the number of elements in some matrix row specified in the sparse format is equal to or less than 0.
ierr=-102   Indicates that the routine was interrupted because the value of the computed diagonal element is less than the product of the given tolerance and the current matrix row norm, and it cannot be replaced as ipar(31)=0.
ierr=-103   Indicates that the routine was interrupted because the element ia(i+1) is less than or equal to the element ia(i) (see Sparse Matrix Storage Format).
ierr=-104   Indicates that the routine was interrupted because the memory is insufficient for the internal work arrays.
ierr=-105   Indicates that the routine was interrupted because the input value of maxfil is less than 0.
ierr=-106   Indicates that the routine was interrupted because the size n of the input matrix is less than 0.
ierr=-107   Indicates that the routine was interrupted because an element of the array ja is less than 0, or greater than n (see Sparse Matrix Storage Format).
ierr=101    The value of maxfil is greater than or equal to n. The calculation is performed with the value of maxfil set to (n-1).
ierr=102    The value of tol is less than 0. The calculation is performed with the value of the parameter set to (-tol).
ierr=103    The absolute value of tol is greater than the value of dpar(31); it can result in instability of the calculation.
ierr=104    The value of dpar(31) is equal to 0. It can cause calculations to fail.

Interfaces

FORTRAN 77 and Fortran 95:
    SUBROUTINE dcsrilut (n, a, ia, ja, bilut, ibilut, jbilut, tol, maxfil, ipar, dpar, ierr)
    INTEGER n, ierr, ipar(*), maxfil
    INTEGER ia(*), ja(*), ibilut(*), jbilut(*)
    DOUBLE PRECISION a(*), bilut(*), dpar(*), tol
C:
    void dcsrilut (int *n, double *a, int *ia, int *ja, double *bilut, int *ibilut, int *jbilut, double *tol, int *maxfil, int *ipar, double *dpar, int *ierr);

Calling Sparse Solver and Preconditioner Routines from C/C++

All of the Intel MKL sparse solver and preconditioner routines are designed to be called easily from FORTRAN 77 or Fortran 90. However, any of these routines can be invoked directly from C or C++ if you are familiar with the inter-language calling conventions of your platforms. These conventions include, but are not limited to, the argument passing mechanisms for the language, the data type mappings from Fortran to C/C++, and the platform specific method of decoration for Fortran external names.
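The three conventions named above can be made concrete with a small sketch. It is self-contained C, so the "Fortran" routine is only a stand-in written in C, and every choice in it is an assumption for illustration: the trailing underscore in foo_ (a common but not universal decoration), the mapping of INTEGER to int and REAL to float, and the behavior of the routine itself (counting positive entries), which is invented.

```c
/* Stand-in for a Fortran "INTEGER FUNCTION foo(x, n)", written in C for this
   sketch. Assumptions: the external name is decorated with a trailing
   underscore, INTEGER maps to int, REAL maps to float. Every argument
   arrives by reference, as a Fortran compiler would pass it. */
int foo_(const float *x, const int *n)
{
    int count = 0;
    for (int i = 0; i < *n; ++i)
        if (x[i] > 0.0f)
            ++count;            /* invented behavior: count positive entries */
    return count;
}

/* Hypothetical C-side wrapper with a natural C signature. The by-value
   parameter n is an addressable object inside the wrapper, so taking its
   address is always valid, even when the caller passed a literal. */
static int foo(const float *x, int n)
{
    return foo_(x, &n);
}
```

With this wrapper a caller can write foo(x, 3) and never see the pointer-based interface. A function wrapper is only one possible way to hide the convention; it costs one extra call unless the compiler inlines it.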
To promote portability, the C header files provide a set of macros and type definitions intended to hide the inter-language calling conventions and provide an interface to the Intel MKL sparse solver routines that appears natural for C/C++.

For example, consider a hypothetical library routine foo that takes a real vector of length n, and returns an integer status. Fortran users would access such a function as:

    INTEGER n, status, foo
    REAL x(*)
    status = foo(x, n)

As noted above, to invoke foo, C users need to know what C data types correspond to Fortran types INTEGER and REAL; what argument passing mechanism the Fortran compiler uses; and what, if any, name decoration the Fortran compiler performs when generating the external symbol foo.

However, by using the C specific header file, for example mkl_solver.h, the invocation of foo within a C program would look as follows:

    #include "mkl_solver.h"
    _INTEGER_t i, status;
    _REAL_t x[];
    status = foo( x, i );

Note that in the above example, the header file mkl_solver.h provides definitions for the types _INTEGER_t and _REAL_t that correspond to the Fortran types INTEGER and REAL.

To simplify calling of the Intel MKL sparse solver routines from C and C++, the following approach of providing C definitions of Fortran types is used: if an argument or a result from a sparse solver is documented as having the Fortran language specific type XXX, then the C and C++ header files provide an appropriate C language type definition for the name _XXX_t.

Caveat for C Users

One of the key differences between C/C++ and Fortran is the argument passing mechanisms for the languages: Fortran programs pass arguments by reference and C/C++ programs pass arguments by value. In the above example, the header file mkl_solver.h attempts to hide this difference by defining a macro foo, which takes the address of the appropriate arguments.
For example, on the Tru64 UNIX* operating system mkl_solver.h defines the macro as follows:

    #define foo(a,b) foo_((a), &(b))

Note how constants are treated when using the macro form of foo. foo( x, 10 ) is converted into foo_( x, &10 ). In a strictly ANSI compliant C compiler, taking the address of a constant is not permitted, so a strictly conforming program would look like:

    _INTEGER_t iTen = 10;
    _REAL_t * x;
    status = foo( x, iTen );

However, some C compilers in a non-ANSI compliant mode enable taking the address of a constant for ease of use with Fortran programs. The form foo( x, 10 ) is acceptable for such compilers.

Vector Mathematical Functions 9

This chapter describes Intel® MKL Vector Mathematical Functions Library (VML), which computes a mathematical function of each of the vector elements. VML includes a set of highly optimized functions (arithmetic, power, trigonometric, exponential, hyperbolic, special, and rounding) that operate on vectors of real and complex numbers.

Application programs that improve performance with VML include nonlinear programming software, computation of integrals, financial calculations, computer graphics, and many others.

VML functions fall into the following groups according to the operations they perform:
• VML Mathematical Functions compute values of mathematical functions, such as sine, cosine, exponential, or logarithm, on vectors stored contiguously in memory.
• VML Pack/Unpack Functions convert to and from vectors with positive increment indexing, vector indexing, and mask indexing (see Appendix B for details on vector indexing methods).
• VML Service Functions set/get the accuracy modes and the error codes.

The VML mathematical functions take an input vector as an argument, compute values of the respective function element-wise, and return the results in an output vector.
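The division of labor between the pack/unpack group and the mathematical group can be sketched with plain C loops. These are reference loops only, not VML code: the names pack_inc, pack_idx, pack_mask, sqr_contig, and unpack_inc are hypothetical, and the exact semantics shown for each indexing method (element i*inca for a positive increment, element ia[i] for an index vector, elements with a non-zero mask entry for a mask vector) are assumptions to be checked against Appendix B.

```c
#include <stddef.h>

/* Positive increment: gather every inca-th element of a into contiguous y. */
static void pack_inc(int n, const double *a, int inca, double *y)
{
    for (int i = 0; i < n; ++i)
        y[i] = a[(size_t)i * (size_t)inca];
}

/* Index vector: gather the elements of a selected by ia (zero-based here). */
static void pack_idx(int n, const double *a, const int *ia, double *y)
{
    for (int i = 0; i < n; ++i)
        y[i] = a[ia[i]];
}

/* Mask vector: gather a[i] wherever ma[i] is non-zero; returns the count. */
static int pack_mask(int n, const double *a, const int *ma, double *y)
{
    int k = 0;
    for (int i = 0; i < n; ++i)
        if (ma[i])
            y[k++] = a[i];
    return k;
}

/* A unit-stride element-wise function (squaring), standing in for any
   mathematical function that works on contiguous vectors. */
static void sqr_contig(int n, const double *a, double *y)
{
    for (int i = 0; i < n; ++i)
        y[i] = a[i] * a[i];
}

/* Positive increment: scatter contiguous a back into every incy-th slot of y. */
static void unpack_inc(int n, const double *a, double *y, int incy)
{
    for (int i = 0; i < n; ++i)
        y[(size_t)i * (size_t)incy] = a[i];
}
```

A typical use is pack_inc into a temporary buffer, sqr_contig on that buffer, then unpack_inc back into the strided array, which leaves the skipped elements untouched.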
All the VML mathematical functions can perform in-place operations, where the input and output arrays are at the same memory locations.

The Intel MKL interfaces are given in the following include files:
• mkl_vml.f77, which declares the FORTRAN 77 interfaces
• mkl_vml.f90, which declares the Fortran 90 interfaces; the mkl_vml.fi include file available in the previous versions of Intel MKL is retained for backward compatibility
• mkl_vml_functions.h, which declares the C interfaces

The following directories provide examples that demonstrate how to use the VML functions:
    ${MKL}/examples/vmlc/source
    ${MKL}/examples/vmlf/source

See VML performance and accuracy data in the online VML Performance and Accuracy Data document available at http://software.intel.com/en-us/articles/intel-math-kernel-library-documentation/

Optimization Notice
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804

Data Types, Accuracy Modes, and Performance Tips

VML includes mathematical and pack/unpack vector functions for single and double precision vector arguments of real and complex types. The library provides Fortran- and C-interfaces for all functions, including the associated service functions.
The Function Naming Conventions section below shows how to call these functions from different languages.

Performance depends on a number of factors, including vectorization and threading overhead. The recommended usage is as follows:
• Use VML for vector lengths larger than 40 elements.
• Use the Intel® Compiler for vector lengths less than 40 elements.

All VML vector functions support the following accuracy modes:
• High Accuracy (HA), the default mode
• Low Accuracy (LA), which improves performance by reducing accuracy of the two least significant bits
• Enhanced Performance (EP), which provides better performance at the cost of significantly reduced accuracy. Approximately half of the bits in the mantissa are correct.

Note that using the EP mode does not guarantee accurate processing of corner cases and special values. Although the default accuracy is HA, LA is sufficient in most cases. For applications that require less accuracy (for example, media applications, some Monte Carlo simulations, etc.), the EP mode may be sufficient.

VML handles special values in accordance with the C99 standard [C99].

Use the vmlSetMode(mode) function (see Table "Values of the mode Parameter") to switch between the HA, LA, and EP modes. The vmlGetMode() function returns the current mode.

See Also
Function Naming Conventions

Function Naming Conventions

The VML function names are lowercase for Fortran (vsabs) and of mixed (lower and upper) case for C (vsAbs).

The VML mathematical and pack/unpack function names have the following structure:

    v[m]<?><name><mod>

where
• v is a prefix indicating vector operations.
• [m] is an optional prefix for mathematical functions that indicates an additional argument to specify a VML mode for a given function call (see vmlSetMode for possible values and their description).
• <?> is a precision prefix that indicates one of the following data types:
    s   REAL for the Fortran interface, or float for the C interface
    d   DOUBLE PRECISION for the Fortran interface, or double for the C interface
    c   COMPLEX for the Fortran interface, or MKL_Complex8 for the C interface
    z   DOUBLE COMPLEX for the Fortran interface, or MKL_Complex16 for the C interface
• <name> indicates the function short name, with some of its letters in uppercase for the C interface. See examples in Table "VML Mathematical Functions".
• <mod> field (written in uppercase for the C interface) is present only in the pack/unpack functions and indicates the indexing method used:
    i   indexing with a positive increment
    v   indexing with an index vector
    m   indexing with a mask vector

The VML service function names have the following structure:

    vml<name>

where <name> indicates the function short name, with some of its letters in uppercase for the C interface. See examples in Table "VML Service Functions".

To call VML functions from an application program, use conventional function calls.
For example, call the vector single precision real exponential function as

    call vsexp ( n, a, y )

for the Fortran interface, or

    call vmsexp ( n, a, y, mode )

for the Fortran interface with a specified mode, or

    vsExp ( n, a, y );

for the C interface.

Function Interfaces

VML interfaces include the function names and argument lists. The following sections describe the Fortran and C interfaces for the VML functions. Note that some of the functions have multiple input and output arguments. Some VML functions may also take scalar arguments as input. See the function description for the naming conventions of such arguments.

VML Mathematical Functions

Fortran:
    call v<?><name>( n, a, [scalar input arguments,] y )
    call v<?><name>( n, a, b, [scalar input arguments,] y )
    call v<?><name>( n, a, y, z )
    call vm<?><name>( n, a, [scalar input arguments,] y, mode )
    call vm<?><name>( n, a, b, [scalar input arguments,] y, mode )
    call vm<?><name>( n, a, y, z, mode )
C:
    v<?><name>( n, a, [scalar input arguments,] y );
    v<?><name>( n, a, b, [scalar input arguments,] y );
    v<?><name>( n, a, y, z );
    vm<?><name>( n, a, [scalar input arguments,] y, mode );
    vm<?><name>( n, a, b, [scalar input arguments,] y, mode );
    vm<?><name>( n, a, y, z, mode );

Pack Functions

Fortran:
    call v<?>packi( n, a, inca, y )
    call v<?>packv( n, a, ia, y )
    call v<?>packm( n, a, ma, y )
C:
    v<?>PackI( n, a, inca, y );
    v<?>PackV( n, a, ia, y );
    v<?>PackM( n, a, ma, y );

Unpack Functions

Fortran:
    call v<?>unpacki( n, a, y, incy )
    call v<?>unpackv( n, a, y, iy )
    call v<?>unpackm( n, a, y, my )
C:
    v<?>UnpackI( n, a, y, incy );
    v<?>UnpackV( n, a, y, iy );
    v<?>UnpackM( n, a, y, my );

Service Functions

Fortran:
    oldmode = vmlsetmode( mode )
    mode = vmlgetmode( )
    olderr = vmlseterrstatus ( err )
    err = vmlgeterrstatus( )
    olderr = vmlclearerrstatus( )
    oldcallback = vmlseterrorcallback( callback )
    callback = vmlgeterrorcallback( )
    oldcallback = vmlclearerrorcallback( )
C:
    oldmode = vmlSetMode( mode );
    mode = vmlGetMode( void );
    olderr = vmlSetErrStatus ( err );
    err = vmlGetErrStatus( void );
    olderr = vmlClearErrStatus( void );
    oldcallback = vmlSetErrorCallBack( callback );
    callback = vmlGetErrorCallBack( void );
    oldcallback = vmlClearErrorCallBack( void );

Note that oldmode, olderr, and oldcallback refer to settings prior to the call.

Input Parameters

n           number of elements to be calculated
a           first input vector
b           second input vector
inca        vector increment for the input vector a
ia          index vector for the input vector a
ma          mask vector for the input vector a
incy        vector increment for the output vector y
iy          index vector for the output vector y
my          mask vector for the output vector y
err         error code
mode        VML mode
callback    address of the callback function

Output Parameters

y           first output vector
z           second output vector
err         error code
mode        VML mode
olderr      former error code
oldmode     former VML mode
callback    address of the callback function
oldcallback address of the former callback function

See the data types of the parameters used in each function in the respective function description section. All the Intel MKL VML mathematical functions can perform in-place operations.

Vector Indexing Methods

VML mathematical functions work only with unit stride. To accommodate arrays with other increments, or more complicated indexing, you can gather the elements into a contiguous vector and then scatter them after the computation is complete.

VML uses the three following indexing methods to do this task:
• positive increment
• index vector
• mask vector

The indexing method used in a particular function is indicated by the indexing modifier (see the description of the <mod> field in Function Naming Conventions). For more information on the indexing methods, see Vector Arguments in VML in Appendix B.

Error Diagnostics

The VML mathematical functions incorporate the error handling mechanism, which is controlled by the following service functions:

vmlGetErrStatus, vmlSetErrStatus, vmlClearErrStatus
    These functions operate with a global variable called VML Error Status.
The VML Error Status flags an error, a warning, or a successful execution of a VML function.

vmlGetErrorCallBack, vmlSetErrorCallBack, vmlClearErrorCallBack
These functions enable you to customize the error handling. For example, you can identify a particular argument in a vector where an error occurred or that caused a warning.

vmlSetMode, vmlGetMode
These functions set and get a VML mode. If you set a new VML mode using the vmlSetMode function, you can store the previous VML mode returned by the routine and restore it at any point of your application.

If both an error and a warning situation occur during the function call, the VML Error Status variable keeps only the value of the error code. See Table "Values of the VML Error Status" for possible values. If a VML function does not encounter errors or warnings, it sets the VML Error Status to VML_STATUS_OK.

If you use the Fortran interface, call the error reporting function XERBLA to receive information about correctness of input arguments (VML_STATUS_BADSIZE and VML_STATUS_BADMEM). See Table "Values of the VML Error Status" for details.

You can use the vmlSetMode and vmlGetMode functions to modify error handling behavior. Depending on the VML mode, the error handling behavior includes the following operations:
• setting the VML Error Status to a value corresponding to the observed error or warning
• setting the errno variable to one of the values described in Table "Set Values of the errno Variable"
• writing error text information to the stderr stream
• raising the appropriate exception on an error, if necessary
• calling the additional error handler callback function that is set by vmlSetErrorCallBack.

Set Values of the errno Variable

Value of errno   Description
0                No errors are detected.
EINVAL           The array dimension is not positive.
EACCES           NULL pointer is passed.
EDOM             At least one of the array values is outside the range of definition.
ERANGE At least one of array values caused a singularity, overflow or underflow. See Also vmlGetErrStatus vmlSetErrStatus vmlClearErrStatus vmlSetErrorCallBack vmlGetErrorCallBack vmlClearErrorCallBack vmlGetMode vmlSetMode VML Mathematical Functions This section describes VML functions that compute values of mathematical functions on real and complex vector arguments with unit increment. Each function is introduced by its short name, a brief description of its purpose, and the calling sequence for each type of data both for Fortran- and C-interfaces, as well as a description of the input/output arguments. The input range of parameters is equal to the mathematical range of the input data type, unless the function description specifies input threshold values, which mark off the precision overflow, as follows: • FLT_MAX denotes the maximum number representable in single precision real data type • DBL_MAX denotes the maximum number representable in double precision real data type Table "VML Mathematical Functions" lists available mathematical functions and associated data types. 
VML Mathematical Functions Function Data Types Description Arithmetic Functions v?Add s, d, c, z Addition of vector elements v?Sub s, d, c, z Subtraction of vector elements v?Sqr s, d Squaring of vector elements v?Mul s, d, c, z Multiplication of vector elements v?MulByConj c, z Multiplication of elements of one vector by conjugated elements of the second vector v?Conj c, z Conjugation of vector elements v?Abs s, d, c, z Computation of the absolute value of vector elements v?Arg c, z Computation of the argument of vector elements v?LinearFrac s, d Linear fraction transformation of vectors Power and Root Functions v?Inv s, d Inversion of vector elements 9 Intel® Math Kernel Library Reference Manual 1974 Function Data Types Description v?Div s, d, c, z Division of elements of one vector by elements of the second vector v?Sqrt s, d, c, z Computation of the square root of vector elements v?InvSqrt s, d Computation of the inverse square root of vector elements v?Cbrt s, d Computation of the cube root of vector elements v?InvCbrt s, d Computation of the inverse cube root of vector elements v?Pow2o3 s, d Raising each vector element to the power of 2/3 v?Pow3o2 s, d Raising each vector element to the power of 3/2 v?Pow s, d, c, z Raising each vector element to the specified power v?Powx s, d, c, z Raising each vector element to the constant power v?Hypot s, d Computation of the square root of sum of squares Exponential and Logarithmic Functions v?Exp s, d, c, z Computation of the exponential of vector elements v?Expm1 s, d Computation of the exponential of vector elements decreased by 1 v?Ln s, d, c, z Computation of the natural logarithm of vector elements v?Log10 s, d, c, z Computation of the denary logarithm of vector elements v?Log1p s, d Computation of the natural logarithm of vector elements that are increased by 1 Trigonometric Functions v?Cos s, d, c, z Computation of the cosine of vector elements v?Sin s, d, c, z Computation of the sine of vector elements v?SinCos 
s, d Computation of the sine and cosine of vector elements v?CIS c, z Computation of the complex exponent of vector elements (cosine and sine combined to complex value) v?Tan s, d, c, z Computation of the tangent of vector elements v?Acos s, d, c, z Computation of the inverse cosine of vector elements v?Asin s, d, c, z Computation of the inverse sine of vector elements v?Atan s, d, c, z Computation of the inverse tangent of vector elements v?Atan2 s, d Computation of the four-quadrant inverse tangent of elements of two vectors Hyperbolic Functions v?Cosh s, d, c, z Computation of the hyperbolic cosine of vector elements v?Sinh s, d, c, z Computation of the hyperbolic sine of vector elements v?Tanh s, d, c, z Computation of the hyperbolic tangent of vector elements v?Acosh s, d, c, z Computation of the inverse hyperbolic cosine of vector elements v?Asinh s, d, c, z Computation of the inverse hyperbolic sine of vector elements v?Atanh s, d, c, z Computation of the inverse hyperbolic tangent of vector elements. 
Special Functions
v?Erf          s, d    Computation of the error function value of vector elements
v?Erfc         s, d    Computation of the complementary error function value of vector elements
v?CdfNorm      s, d    Computation of the cumulative normal distribution function value of vector elements
v?ErfInv       s, d    Computation of the inverse error function value of vector elements
v?ErfcInv      s, d    Computation of the inverse complementary error function value of vector elements
v?CdfNormInv   s, d    Computation of the inverse cumulative normal distribution function value of vector elements
v?LGamma       s, d    Computation of the natural logarithm for the absolute value of the gamma function of vector elements
v?TGamma       s, d    Computation of the gamma function of vector elements

Rounding Functions
v?Floor        s, d    Rounding towards minus infinity
v?Ceil         s, d    Rounding towards plus infinity
v?Trunc        s, d    Rounding towards zero
v?Round        s, d    Rounding to nearest integer
v?NearbyInt    s, d    Rounding according to current mode
v?Rint         s, d    Rounding according to current mode and raising inexact result exception
v?Modf         s, d    Computation of the integer and fraction parts

Special Value Notations

This section defines notations of special values for complex functions. The definitions are provided in text, tables, or formulas.
• z, z1, z2, etc. denote complex numbers.
• i, where i² = -1, is the imaginary unit.
• x, X, x1, x2, etc. denote real parts.
• y, Y, y1, y2, etc. denote imaginary parts.
• X and Y represent any finite positive IEEE-754 floating point values, if not stated otherwise.
• Quiet NaN and signaling NaN are denoted with QNAN and SNAN, respectively.
• The IEEE-754 positive infinities or floating-point numbers are denoted with a + sign before X, Y, etc.
• The IEEE-754 negative infinities or floating-point numbers are denoted with a - sign before X, Y, etc.

CONJ(z) and CIS(z) are defined as follows:
CONJ(x+i·y)=x-i·y
CIS(y)=cos(y)+i·sin(y).
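As a quick sanity check of the CONJ and CIS definitions, the following worked values may help. The sample arguments 3+4i and π/2 are chosen here purely for illustration and do not come from the manual; the last line follows directly from cos being even and sin being odd.

```latex
\begin{align*}
\mathrm{CONJ}(3 + 4i) &= 3 - 4i \\
\mathrm{CIS}(\pi/2) &= \cos(\pi/2) + i\,\sin(\pi/2) = 0 + i \cdot 1 = i \\
\mathrm{CONJ}(\mathrm{CIS}(y)) &= \cos(y) - i\,\sin(y) = \mathrm{CIS}(-y)
\end{align*}
```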
The special value tables show the result of the function for the z argument at the intersection of the RE(z) column and the i*IM(z) row. If the function raises an exception on the argument z, the lower part of this cell shows the raised exception and the VML Error Status. An empty cell indicates that this argument is normal and the result is defined mathematically. Arithmetic Functions Arithmetic functions perform the basic mathematical operations like addition, subtraction, multiplication or computation of the absolute value of the vector elements. v?Add Performs element by element addition of vector a and vector b. Syntax Fortran: call vsadd( n, a, b, y ) call vmsadd( n, a, b, y, mode ) call vdadd( n, a, b, y ) call vmdadd( n, a, b, y, mode ) call vcadd( n, a, b, y ) call vmcadd( n, a, b, y, mode ) call vzadd( n, a, b, y ) call vmzadd( n, a, b, y, mode ) 9 Intel® Math Kernel Library Reference Manual 1976 C: vsAdd( n, a, b, y ); vmsAdd( n, a, b, y, mode ); vdAdd( n, a, b, y ); vmdAdd( n, a, b, y, mode ); vcAdd( n, a, b, y ); vmcAdd( n, a, b, y, mode ); vzAdd( n, a, b, y ); vmzAdd( n, a, b, y, mode ); Include Files • FORTRAN 77: mkl_vml.f77 • Fortran 90: mkl_vml.f90 • C: mkl_vml_functions.h Input Parameters Name Type Description n FORTRAN 77: INTEGER Fortran 90: INTEGER, INTENT(IN) C: const int Specifies the number of elements to be calculated. a, b FORTRAN 77: REAL for vsadd, vmsadd DOUBLE PRECISION for vdadd, vmdadd COMPLEX for vcadd, vmcadd DOUBLE COMPLEX for vzadd, vmzadd Fortran 90: REAL, INTENT(IN) for vsadd, vmsadd DOUBLE PRECISION, INTENT(IN) for vdadd, vmdadd COMPLEX, INTENT(IN) for vcadd, vmcadd DOUBLE COMPLEX, INTENT(IN) for vzadd, vmzadd C: const float* for vsAdd, vmsadd const double* for vdAdd, vmdadd const MKL_Complex8* for vcAdd, vmcadd const MKL_Complex16* for vzAdd, vmzadd FORTRAN: Arrays that specify the input vectors a and b. C: Pointers to arrays that contain the input vectors a and b. 
mode
    FORTRAN 77: INTEGER*8
    Fortran 90: INTEGER(KIND=8), INTENT(IN)
    C:          const MKL_INT64
    Overrides global VML mode setting for this function call. See vmlSetMode for possible values and their description.

Output Parameters

Name  Type  Description

y
    FORTRAN 77: REAL for vsadd, vmsadd
                DOUBLE PRECISION for vdadd, vmdadd
                COMPLEX for vcadd, vmcadd
                DOUBLE COMPLEX for vzadd, vmzadd
    Fortran 90: REAL, INTENT(OUT) for vsadd, vmsadd
                DOUBLE PRECISION, INTENT(OUT) for vdadd, vmdadd
                COMPLEX, INTENT(OUT) for vcadd, vmcadd
                DOUBLE COMPLEX, INTENT(OUT) for vzadd, vmzadd
    C:          float* for vsAdd, vmsAdd
                double* for vdAdd, vmdAdd
                MKL_Complex8* for vcAdd, vmcAdd
                MKL_Complex16* for vzAdd, vmzAdd
    FORTRAN: Array that specifies the output vector y.
    C: Pointer to an array that contains the output vector y.

Description

The v?Add function performs element by element addition of vector a and vector b.

Special values for Real Function v?Add(x)

Argument 1   Argument 2   Result   Exception
+0           +0           +0
+0           -0           +0
-0           +0           +0
-0           -0           -0
+∞           +∞           +∞
+∞           -∞           QNAN     INVALID
-∞           +∞           QNAN     INVALID
-∞           -∞           -∞
SNAN         any value    QNAN     INVALID
any value    SNAN         QNAN     INVALID
QNAN         non-SNAN     QNAN
non-SNAN     QNAN         QNAN

Specifications for special values of the complex functions are defined according to the following formula:
Add(x1+i*y1,x2+i*y2) = (x1+x2) + i*(y1+y2)

Overflow in a complex function occurs (supported in the HA/LA accuracy modes only) when x1, x2, y1, y2 are finite numbers, but the real or imaginary part of the exact result is so large that it does not fit the target precision. In this case, the function returns ∞ in that part of the result, raises the OVERFLOW exception, and sets the VML Error Status to VML_STATUS_OVERFLOW.

v?Sub
Performs element by element subtraction of vector b from vector a.
Syntax Fortran: call vssub( n, a, b, y ) call vmssub( n, a, b, y, mode ) call vdsub( n, a, b, y ) call vmdsub( n, a, b, y, mode ) call vcsub( n, a, b, y ) call vmcsub( n, a, b, y, mode ) call vzsub( n, a, b, y ) call vmzsub( n, a, b, y, mode ) C: vsSub( n, a, b, y ); vmsSub( n, a, b, y, mode ); vdSub( n, a, b, y ); vmdSub( n, a, b, y, mode ); vcSub( n, a, b, y ); vmcSub( n, a, b, y, mode ); vzSub( n, a, b, y ); vmzSub( n, a, b, y, mode ); Include Files • FORTRAN 77: mkl_vml.f77 • Fortran 90: mkl_vml.f90 • C: mkl_vml_functions.h Vector Mathematical Functions 9 1979 Input Parameters Name Type Description n FORTRAN 77: INTEGER Fortran 90: INTEGER, INTENT(IN) C: const int Specifies the number of elements to be calculated. a, b FORTRAN 77: REAL for vssub, vmssub DOUBLE PRECISION for vdsub, vmdsub COMPLEX for vcsub, vmcsub DOUBLE COMPLEX for vzsub, vmzsub Fortran 90: REAL, INTENT(IN) for vssub, vmssub DOUBLE PRECISION, INTENT(IN) for vdsub, vmdsub COMPLEX, INTENT(IN) for vcsub, vmcsub DOUBLE COMPLEX, INTENT(IN) for vzsub, vmzsub C: const float* for vsSub, vmssub const double* for vdSub, vmdsub const MKL_Complex8* for vcSub, vmcsub const MKL_Complex16* for vzSub, vmzsub FORTRAN: Arrays that specify the input vectors a and b. C: Pointers to arrays that contain the input vectors a and b. mode FORTRAN 77: INTEGER*8 Fortran 90: INTEGER(KIND=8), INTENT(IN) C: const MKL_INT64 Overrides global VML mode setting for this function call. See vmlSetMode for possible values and their description. Output Parameters Name Type Description y FORTRAN 77: REAL for vssub, vmssub DOUBLE PRECISION for vdsub, vmdsub COMPLEX for vcsub, vmcsub DOUBLE COMPLEX for vzsub, vmzsub Fortran 90: REALINTENT(OUT) for vssub, vmssub FORTRAN: Array that specifies the output vector y. C: Pointer to an array that contains the output vector y. 
9 Intel® Math Kernel Library Reference Manual 1980 Name Type Description DOUBLE PRECISION, INTENT(OUT) for vdsub, vmdsub COMPLEX, INTENT(OUT) for vcsub, vmcsub DOUBLE COMPLEX, INTENT(OUT) for vzsub, vmzsub C: float* for vsSub, vmssub double* for vdSub, vmdsub MKL_Complex8* for vcSub, vmcsub MKL_Complex16* for vzSub, vmzsub Description The v?Sub function performs element by element subtraction of vector b from vector a. Special values for Real Function v?Sub(x) Argument 1 Argument 2 Result Exception +0 +0 +0 +0 -0 +0 -0 +0 -0 -0 -0 +0 +8 +8 QNAN INVALID +8 -8 +8 -8 +8 -8 -8 -8 QNAN INVALID SNAN any value QNAN INVALID any value SNAN QNAN INVALID QNAN non-SNAN QNAN non-SNAN QNAN QNAN Specifications for special values of the complex functions are defined according to the following formula Sub(x1+i*y1,x2+i*y2) = (x1-x2) + i*(y1-y2). Overflow in a complex function occurs (supported in the HA/LA accuracy modes only) when x1, x2, y1, y2 are finite numbers, but the real or imaginary part of the exact result is so large that it does not fit the target precision. In this case, the function returns 8 in that part of the result, raises the OVERFLOW exception, and sets the VML Error Status to VML_STATUS_OVERFLOW. v?Sqr Performs element by element squaring of the vector. Syntax Fortran: call vssqr( n, a, y ) call vmssqr( n, a, y, mode ) call vdsqr( n, a, y ) call vmdsqr( n, a, y, mode ) Vector Mathematical Functions 9 1981 C: vsSqr( n, a, y ); vmsSqr( n, a, y, mode ); vdSqr( n, a, y ); vmdSqr( n, a, y, mode ); Include Files • FORTRAN 77: mkl_vml.f77 • Fortran 90: mkl_vml.f90 • C: mkl_vml_functions.h Input Parameters Name Type Description n FORTRAN 77: INTEGER Fortran 90: INTEGER, INTENT(IN) C: const int Specifies the number of elements to be calculated. 
a
    FORTRAN 77: REAL for vssqr, vmssqr
                DOUBLE PRECISION for vdsqr, vmdsqr
    Fortran 90: REAL, INTENT(IN) for vssqr, vmssqr
                DOUBLE PRECISION, INTENT(IN) for vdsqr, vmdsqr
    C:          const float* for vsSqr, vmsSqr
                const double* for vdSqr, vmdSqr
    FORTRAN: Array that specifies the input vector a.
    C: Pointer to an array that contains the input vector a.

mode
    FORTRAN 77: INTEGER*8
    Fortran 90: INTEGER(KIND=8), INTENT(IN)
    C:          const MKL_INT64
    Overrides global VML mode setting for this function call. See vmlSetMode for possible values and their description.

Output Parameters

Name  Type  Description

y
    FORTRAN 77: REAL for vssqr, vmssqr
                DOUBLE PRECISION for vdsqr, vmdsqr
    Fortran 90: REAL, INTENT(OUT) for vssqr, vmssqr
                DOUBLE PRECISION, INTENT(OUT) for vdsqr, vmdsqr
    C:          float* for vsSqr, vmsSqr
                double* for vdSqr, vmdSqr
    FORTRAN: Array that specifies the output vector y.
    C: Pointer to an array that contains the output vector y.

Description

The v?Sqr function performs element by element squaring of the vector.

Special Values for Real Function v?Sqr(x)

Argument   Result   Exception
+0         +0
-0         +0
+∞         +∞
-∞         +∞
QNAN       QNAN
SNAN       QNAN     INVALID

v?Mul
Performs element by element multiplication of vector a and vector b.

Syntax

Fortran:
call vsmul( n, a, b, y )
call vmsmul( n, a, b, y, mode )
call vdmul( n, a, b, y )
call vmdmul( n, a, b, y, mode )
call vcmul( n, a, b, y )
call vmcmul( n, a, b, y, mode )
call vzmul( n, a, b, y )
call vmzmul( n, a, b, y, mode )

C:
vsMul( n, a, b, y );
vmsMul( n, a, b, y, mode );
vdMul( n, a, b, y );
vmdMul( n, a, b, y, mode );
vcMul( n, a, b, y );
vmcMul( n, a, b, y, mode );
vzMul( n, a, b, y );
vmzMul( n, a, b, y, mode );

Include Files
• FORTRAN 77: mkl_vml.f77
• Fortran 90: mkl_vml.f90
• C: mkl_vml_functions.h

Input Parameters

Name  Type  Description

n
    FORTRAN 77: INTEGER
    Fortran 90: INTEGER, INTENT(IN)
    C:          const int
    Specifies the number of elements to be calculated.
a, b FORTRAN 77: REAL for vsmul, vmsmul DOUBLE PRECISION for vdmul, vmdmul COMPLEX for vcmul, vmcmul DOUBLE COMPLEX for vzmul, vmzmul Fortran 90: REAL, INTENT(IN) for vsmul, vmsmul DOUBLE PRECISION, INTENT(IN) for vdmul, vmdmul COMPLEX, INTENT(IN) for vcmul, vmcmul DOUBLE COMPLEX, INTENT(IN) for vzmul, vmzmul C: const float* for vsMul, vmsmul const double* for vdMul, vmdmul const MKL_Complex8* for vcMul, vmcMul const MKL_Complex16* for vzMul, vmzMul FORTRAN: Arrays that specify the input vectors a and b. C: Pointers to arrays that contain the input vectors a and b. mode FORTRAN 77: INTEGER*8 Fortran 90: INTEGER(KIND=8), INTENT(IN) C: const MKL_INT64 Overrides global VML mode setting for this function call. See vmlSetMode for possible values and their description. Output Parameters Name Type Description y FORTRAN 77: REAL for vsmul, vmsmul FORTRAN: Array that specifies the output vector y. 9 Intel® Math Kernel Library Reference Manual 1984 Name Type Description DOUBLE PRECISION for vdmul, vmdmul COMPLEX, for vcmul, vmcmul DOUBLE COMPLEX for vzmul, vmzmul Fortran 90: REAL, INTENT(OUT) for vsmul, vmsmul DOUBLE PRECISION, INTENT(OUT) for vdmul, vmdmul COMPLEX, INTENT(OUT) for vcmul, vmcmul DOUBLE COMPLEX, INTENT(OUT) for vzmul, vmzmul C: float* for vsMul, vmsmul double* for vdMul, vmdmul MKL_Complex8* for vcMul, vmcMul MKL_Complex16* for vzMul, vmzMul C: Pointer to an array that contains the output vector y. Description The v?Mul function performs element by element multiplication of vector a and vector b. 
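To make the element by element semantics concrete, here is a minimal plain-C sketch of what a single precision real multiplication computes. The helper name ref_mul is hypothetical and is not part of Intel MKL; it only mirrors the ( n, a, b, y ) argument order of vsMul and says nothing about accuracy modes, error status handling, or how the library is actually implemented.

```c
#include <stddef.h>

/* Hypothetical reference helper (not an Intel MKL function):
 * computes y[i] = a[i] * b[i] for i = 0 .. n-1.
 * Each element is read before the matching output element is written,
 * so y may be the same array as a or b (in-place use). */
static void ref_mul(size_t n, const float *a, const float *b, float *y)
{
    for (size_t i = 0; i < n; ++i) {
        y[i] = a[i] * b[i];
    }
}
```

Calling ref_mul(3, a, b, a) overwrites a with the products, which corresponds to the in-place usage the manual permits for VML mathematical functions.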
Special values for Real Function v?Mul(x) Argument 1 Argument 2 Result Exception +0 +0 +0 +0 -0 -0 -0 +0 -0 -0 -0 +0 +0 +8 QNAN INVALID +0 -8 QNAN INVALID -0 +8 QNAN INVALID -0 -8 QNAN INVALID +8 +0 QNAN INVALID +8 -0 QNAN INVALID -8 +0 QNAN INVALID -8 -0 QNAN INVALID +8 +8 +8 +8 -8 -8 -8 +8 -8 -8 -8 +8 SNAN any value QNAN INVALID any value SNAN QNAN INVALID QNAN non-SNAN QNAN non-SNAN QNAN QNAN Specifications for special values of the complex functions are defined according to the following formula Mul(x1+i*y1,x2+i*y2) = (x1*x2-y1*y2) + i*(x1*y2+y1*x2). Vector Mathematical Functions 9 1985 Overflow in a complex function occurs (supported in the HA/LA accuracy modes only) when x1, x2, y1, y2 are finite numbers, but the real or imaginary part of the exact result is so large that it does not fit the target precision. In this case, the function returns 8 in that part of the result, raises the OVERFLOW exception, and sets the VML Error Status to VML_STATUS_OVERFLOW. v?MulByConj Performs element by element multiplication of vector a element and conjugated vector b element. Syntax Fortran: call vcmulbyconj( n, a, b, y ) call vmcmulbyconj( n, a, b, y, mode ) call vzmulbyconj( n, a, b, y ) call vmzmulbyconj( n, a, b, y, mode ) C: vcMulByConj( n, a, b, y ); vmcMulByConj( n, a, b, y, mode ); vzMulByConj( n, a, b, y ); vmzMulByConj( n, a, b, y, mode ); Include Files • FORTRAN 77: mkl_vml.f77 • Fortran 90: mkl_vml.f90 • C: mkl_vml_functions.h Input Parameters Name Type Description n FORTRAN 77: INTEGER Fortran 90: INTEGER, INTENT(IN) C: const int Specifies the number of elements to be calculated. a, b FORTRAN 77: COMPLEX for vcmulbyconj, vmcmulbyconj DOUBLE COMPLEX for vzmulbyconj, vmzmulbyconj Fortran 90: COMPLEX, INTENT(IN) for vcmulbyconj, vmcmulbyconj DOUBLE COMPLEX, INTENT(IN) for vzmulbyconj, vmzmulbyconj C: const MKL_Complex8* for vcMulByConj, vmcMulByConj FORTRAN: Arrays that specify the input vectors a and b. 
C: Pointers to arrays that contain the input vectors a and b. 9 Intel® Math Kernel Library Reference Manual 1986 Name Type Description const MKL_Complex16* for vzMulByConj, vmzMulByConj mode FORTRAN 77: INTEGER*8 Fortran 90: INTEGER(KIND=8), INTENT(IN) C: const MKL_INT64 Overrides global VML mode setting for this function call. See vmlSetMode for possible values and their description. Output Parameters Name Type Description y FORTRAN 77: COMPLEX for vcmulbyconj, vmcmulbyconj DOUBLE COMPLEX for vzmulbyconj, vmzmulbyconj Fortran 90: COMPLEX, INTENT(OUT) for vcmulbyconj, vmcmulbyconj DOUBLE COMPLEX, INTENT(OUT) for vzmulbyconj, vmzmulbyconj C: MKL_Complex8* for vcMulByConj, vmcMulByConj MKL_Complex16* for vzMulByConj, vmzMulByConj FORTRAN: Array that specifies the output vector y. C: Pointer to an array that contains the output vector y. Description The v?MulByConj function performs element by element multiplication of vector a element and conjugated vector b element. Specifications for special values of the functions are found according to the formula MulByConj(x1+i*y1,x2+i*y2) = Mul(x1+i*y1,x2-i*y2). Overflow in a complex function occurs (supported in the HA/LA accuracy modes only) when x1, x2, y1, y2 are finite numbers, but the real or imaginary part of the exact result is so large that it does not fit the target precision. In this case, the function returns 8 in that part of the result, raises the OVERFLOW exception, and sets the VML Error Status to VML_STATUS_OVERFLOW. v?Conj Performs element by element conjugation of the vector. 
Syntax Fortran: call vcconj( n, a, y ) call vmcconj( n, a, y, mode ) call vzconj( n, a, y ) call vmzconj( n, a, y, mode ) Vector Mathematical Functions 9 1987 C: vcConj( n, a, y ); vmcConj( n, a, y, mode ); vzConj( n, a, y ); vmzConj( n, a, y, mode ); Include Files • FORTRAN 77: mkl_vml.f77 • Fortran 90: mkl_vml.f90 • C: mkl_vml_functions.h Input Parameters Name Type Description n FORTRAN 77: INTEGER Fortran 90: INTEGER, INTENT(IN) C: const int Specifies the number of elements to be calculated. a FORTRAN 77: COMPLEX, INTENT(IN) for vcconj, vmcconj DOUBLE COMPLEX, INTENT(IN) for vzconj, vmzconj Fortran 90: COMPLEX, INTENT(IN) for vcconj, vmcconj DOUBLE COMPLEX, INTENT(IN) for vzconj, vmzconj C: const MKL_Complex8* for vcConj, vmcconj const MKL_Complex16* for vzConj, vmzconj FORTRAN: Array that specifies the input vector a. C: Pointer to an array that contains the input vector a. mode FORTRAN 77: INTEGER*8 Fortran 90: INTEGER(KIND=8), INTENT(IN) C: const MKL_INT64 Overrides global VML mode setting for this function call. See vmlSetMode for possible values and their description. Output Parameters Name Type Description y FORTRAN 77: COMPLEX, for vcconj, vmcconj DOUBLE COMPLEX for vzconj, vmzconj Fortran 90: COMPLEX, INTENT(OUT) for vcconj, vmcconj FORTRAN: Array that specifies the output vector y. C: Pointer to an array that contains the output vector y. 9 Intel® Math Kernel Library Reference Manual 1988 Name Type Description DOUBLE COMPLEX, INTENT(OUT) for vzconj, vmzconj C: MKL_Complex8* for vcConj, vmcconj MKL_Complex16* for vzConj, vmzconj Description The v?Conj function performs element by element conjugation of the vector. No special values are specified. The function does not raise floating-point exceptions. v?Abs Computes absolute value of vector elements. 
Syntax Fortran: call vsabs( n, a, y ) call vmsabs( n, a, y, mode ) call vdabs( n, a, y ) call vmdabs( n, a, y, mode ) call vcabs( n, a, y ) call vmcabs( n, a, y, mode ) call vzabs( n, a, y ) call vmzabs( n, a, y, mode ) C: vsAbs( n, a, y ); vmsAbs( n, a, y, mode ); vdAbs( n, a, y ); vmdAbs( n, a, y, mode ); vcAbs( n, a, y ); vmcAbs( n, a, y, mode ); vzAbs( n, a, y ); vmzAbs( n, a, y, mode ); Include Files • FORTRAN 77: mkl_vml.f77 • Fortran 90: mkl_vml.f90 • C: mkl_vml_functions.h Vector Mathematical Functions 9 1989 Input Parameters Name Type Description n FORTRAN 77: INTEGER Fortran 90: INTEGER, INTENT(IN) C: const int Specifies the number of elements to be calculated. a FORTRAN 77: REAL for vsabs, vmsabs DOUBLE PRECISION for vdabs, vmdabs COMPLEX for vcabs, vmcabs DOUBLE COMPLEX for vzabs, vmzabs Fortran 90: REAL, INTENT(IN) for vsabs, vmsabs DOUBLE PRECISION, INTENT(IN) for vdabs, vmdabs COMPLEX, INTENT(IN) for vcabs, vmcabs DOUBLE COMPLEX, INTENT(IN) for vzabs, vmzabs C: const float* for vsabs, vmsabs const double* for vdabs, vmdabs const MKL_Complex8* for vcAbs, vmcAbs const MKL_Complex16* for vzAbs, vmzAbs FORTRAN: Array that specifies the input vector a. C: Pointer to an array that contains the input vector a. mode FORTRAN 77: INTEGER*8 Fortran 90: INTEGER(KIND=8), INTENT(IN) C: const MKL_INT64 Overrides global VML mode setting for this function call. See vmlSetMode for possible values and their description. Output Parameters Name Type Description y FORTRAN 77: REAL for vsabs, vmsabs, vcabs, vmcabs DOUBLE PRECISION for vdabs, vmdabs, vzabs, vmzabs Fortran 90: REAL, INTENT(OUT) for vsabs, vmsabs, vcabs, vmcabs DOUBLE PRECISION, INTENT(OUT) for vdabs, vmdabs, vzabs, vmzabs FORTRAN: Array that specifies the output vector y. C: Pointer to an array that contains the output vector y. 
9 Intel® Math Kernel Library Reference Manual 1990 Name Type Description C: float* for vsabs, vmsabs, vcAbs, vmcAbs double* for vdabs, vmdabs, vzAbs, vmzAbs Description The v?Abs function computes an absolute value of vector elements. Special Values for Real Function v?Abs(x) Argument Result Exception +0 +0 -0 +0 +8 +8 -8 +8 QNAN QNAN SNAN QNAN INVALID Specifications for special values of the complex functions are defined according to the following formula Abs(z) = Hypot(RE(z),IM(z)). v?Arg Computes argument of vector elements. Syntax Fortran: call vcarg( n, a, y ) call vmcarg( n, a, y, mode ) call vzarg( n, a, y ) call vmzarg( n, a, y, mode ) C: vcArg( n, a, y ); vmcArg( n, a, y, mode ); vzArg( n, a, y ); vmzArg( n, a, y, mode ); Include Files • FORTRAN 77: mkl_vml.f77 • Fortran 90: mkl_vml.f90 • C: mkl_vml_functions.h Vector Mathematical Functions 9 1991 Input Parameters Name Type Description n FORTRAN 77: INTEGER Fortran 90: INTEGER, INTENT(IN) C: const int Specifies the number of elements to be calculated. a FORTRAN77: COMPLEX for vcarg, vmcarg DOUBLE COMPLEX for vzarg, vmzarg Fortran 90: COMPLEX, INTENT(IN) for vcarg, vmcarg DOUBLE COMPLEX, INTENT(IN) for vzarg, vmzarg C: const MKL_Complex8* for vcArg, vmcArg const MKL_Complex16* for vzArg, vmcArg FORTRAN: Array that specifies the input vector a. C: Pointer to an array that contains the input vector a. mode FORTRAN 77: INTEGER*8 Fortran 90: INTEGER(KIND=8), INTENT(IN) C: const MKL_INT64 Overrides global VML mode setting for this function call. See vmlSetMode for possible values and their description. Output Parameters Name Type Description y FORTRAN 77: REAL for vcarg, vmcarg DOUBLE PRECISION for vzarg, vmzarg Fortran 90: REAL, INTENT(OUT) for vcarg, vmcarg DOUBLE PRECISION, INTENT(OUT) for vzarg, vmzarg C: float* for vcArg, vmcArg double* for vzArg, vmcArg FORTRAN: Array that specifies the output vector y. C: Pointer to an array that contains the output vector y. 
Description

The v?Arg function computes the argument of vector elements.

See the Special Value Notations section for the conventions used in the table below.

Special Values for Complex Function v?Arg(z)

RE(z)      -∞        -X       -0       +0       +X       +∞       NAN
i·IM(z)
+i·∞       +3·π/4    +π/2     +π/2     +π/2     +π/2     +π/4     NAN
+i·Y       +π                 +π/2     +π/2              +0       NAN
+i·0       +π        +π       +π       +0       +0       +0       NAN
-i·0       -π        -π       -π       -0       -0       -0       NAN
-i·Y       -π                 -π/2     -π/2              -0       NAN
-i·∞       -3·π/4    -π/2     -π/2     -π/2     -π/2     -π/4     NAN
+i·NAN     NAN       NAN      NAN      NAN      NAN      NAN      NAN

Notes:
• raises INVALID exception when real or imaginary part of the argument is SNAN
• Arg(z)=Atan2(IM(z),RE(z)).

v?LinearFrac
Performs linear fraction transformation of vectors a and b with scalar parameters.

Syntax

Fortran:
call vslinearfrac( n, a, b, scalea, shifta, scaleb, shiftb, y )
call vmslinearfrac( n, a, b, scalea, shifta, scaleb, shiftb, y, mode )
call vdlinearfrac( n, a, b, scalea, shifta, scaleb, shiftb, y )
call vmdlinearfrac( n, a, b, scalea, shifta, scaleb, shiftb, y, mode )

C:
vsLinearFrac( n, a, b, scalea, shifta, scaleb, shiftb, y );
vmsLinearFrac( n, a, b, scalea, shifta, scaleb, shiftb, y, mode );
vdLinearFrac( n, a, b, scalea, shifta, scaleb, shiftb, y );
vmdLinearFrac( n, a, b, scalea, shifta, scaleb, shiftb, y, mode );

Include Files
• FORTRAN 77: mkl_vml.f77
• Fortran 90: mkl_vml.f90
• C: mkl_vml_functions.h

Input Parameters

Name  Type  Description

n
    FORTRAN 77: INTEGER
    Fortran 90: INTEGER, INTENT(IN)
    C:          const int
    Specifies the number of elements to be calculated.

a, b
    FORTRAN 77: REAL for vslinearfrac
                DOUBLE PRECISION for vdlinearfrac
    Fortran 90: REAL, INTENT(IN) for vslinearfrac
                DOUBLE PRECISION, INTENT(IN) for vdlinearfrac
    C:          const float* for vsLinearFrac
                const double* for vdLinearFrac
    FORTRAN: Arrays that specify the input vectors a and b.
    C: Pointers to arrays that contain the input vectors a and b.
scalea, scaleb
    FORTRAN 77: REAL for vslinearfrac
                DOUBLE PRECISION for vdlinearfrac
    Fortran 90: REAL, INTENT(IN) for vslinearfrac
                DOUBLE PRECISION, INTENT(IN) for vdlinearfrac
    C:          const float for vsLinearFrac
                const double for vdLinearFrac
    Constant values for scaling multipliers of vectors a and b.

shifta, shiftb
    FORTRAN 77: REAL for vslinearfrac
                DOUBLE PRECISION for vdlinearfrac
    Fortran 90: REAL, INTENT(IN) for vslinearfrac
                DOUBLE PRECISION, INTENT(IN) for vdlinearfrac
    C:          const float for vsLinearFrac
                const double for vdLinearFrac
    Constant values for shifting addends of vectors a and b.

mode
    FORTRAN 77: INTEGER*8
    Fortran 90: INTEGER(KIND=8), INTENT(IN)
    C:          const MKL_INT64
    Overrides global VML mode setting for this function call. See vmlSetMode for possible values and their description.

Output Parameters

Name  Type  Description

y
    FORTRAN 77: REAL for vslinearfrac
                DOUBLE PRECISION for vdlinearfrac
    Fortran 90: REAL, INTENT(OUT) for vslinearfrac
                DOUBLE PRECISION, INTENT(OUT) for vdlinearfrac
    C:          float* for vsLinearFrac
                double* for vdLinearFrac
    FORTRAN: Array that specifies the output vector y.
    C: Pointer to an array that contains the output vector y.

Description

The v?LinearFrac function performs linear fraction transformation of vector a by vector b with scalar parameters: scaling multipliers scalea, scaleb and shifting addends shifta, shiftb:

y[i]=(scalea·a[i]+shifta)/(scaleb·b[i]+shiftb), i=1,2 … n

The v?LinearFrac function is implemented in the EP accuracy mode only, therefore no special values are defined for this function. Correctness is guaranteed within the threshold limitations defined for each input parameter (see the table below); otherwise, the behavior is unspecified.

Threshold Limitations on Input Parameters

2^(EMIN/2) ≤ |scalea| ≤ 2^((EMAX-2)/2)
2^(EMIN/2) ≤ |scaleb| ≤ 2^((EMAX-2)/2)
|shifta| ≤ 2^(EMAX-2)
|shiftb| ≤ 2^(EMAX-2)
2^(EMIN/2) ≤ a[i] ≤ 2^((EMAX-2)/2)
2^(EMIN/2) ≤ b[i] ≤ 2^((EMAX-2)/2)
a[i] ≠ - (shifta/scalea)*(1-δ1), |δ1| ≤ 2^(1-(p-1)/2)
b[i] ≠ - (shiftb/scaleb)*(1-δ2), |δ2| ≤ 2^(1-(p-1)/2)

EMIN and EMAX are the minimum and maximum exponents and p is the number of significant bits (precision) for the corresponding data type according to the ANSI/IEEE Std 754-2008 standard ([IEEE754]):
• for single precision EMIN = -126, EMAX = 127, p = 24
• for double precision EMIN = -1022, EMAX = 1023, p = 53

The thresholds become less strict for common cases with scalea=0 and/or scaleb=0:
• if scalea=0, there are no limitations for the values of a[i] and shifta
• if scaleb=0, there are no limitations for the values of b[i] and shiftb

Power and Root Functions

v?Inv
Performs element by element inversion of the vector.

Syntax

Fortran:
call vsinv( n, a, y )
call vmsinv( n, a, y, mode )
call vdinv( n, a, y )
call vmdinv( n, a, y, mode )

C:
vsInv( n, a, y );
vmsInv( n, a, y, mode );
vdInv( n, a, y );
vmdInv( n, a, y, mode );

Include Files
• FORTRAN 77: mkl_vml.f77
• Fortran 90: mkl_vml.f90
• C: mkl_vml_functions.h

Input Parameters

Name  Type  Description

n
    FORTRAN 77: INTEGER
    Fortran 90: INTEGER, INTENT(IN)
    C:          const int
    Specifies the number of elements to be calculated.

a
    FORTRAN 77: REAL for vsinv, vmsinv
                DOUBLE PRECISION for vdinv, vmdinv
    Fortran 90: REAL, INTENT(IN) for vsinv, vmsinv
                DOUBLE PRECISION, INTENT(IN) for vdinv, vmdinv
    C:          const float* for vsInv, vmsInv
                const double* for vdInv, vmdInv
    FORTRAN: Array that specifies the input vector a.
    C: Pointer to an array that contains the input vector a.

mode
    FORTRAN 77: INTEGER*8
    Fortran 90: INTEGER(KIND=8), INTENT(IN)
    C:          const MKL_INT64
    Overrides global VML mode setting for this function call. See vmlSetMode for possible values and their description.

Output Parameters

Name  Type  Description

y
    FORTRAN 77: REAL for vsinv, vmsinv
                DOUBLE PRECISION for vdinv, vmdinv
    FORTRAN: Array that specifies the output vector y.
    C: Pointer to an array that contains the output vector y.
9 Intel® Math Kernel Library Reference Manual 1996 Name Type Description Fortran 90: REAL, INTENT(OUT) for vsinv, vmsinv DOUBLE PRECISION, INTENT(OUT) for vdinv, vmdinv C: float* for vsInv, vmsInv double* for vdInv, vmdInv Description The v?Inv function performs element by element inversion of the vector. Special Values for Real Function v?Inv(x) Argument Result VML Error Status Exception +0 +8 VML_STATUS_SING ZERODIVIDE -0 -8 VML_STATUS_SING ZERODIVIDE +8 +0 -8 -0 QNAN QNAN SNAN QNAN INVALID v?Div Performs element by element division of vector a by vector b Syntax Fortran: call vsdiv( n, a, b, y ) call vmsdiv( n, a, b, y, mode ) call vddiv( n, a, b, y ) call vmddiv( n, a, b, y, mode ) call vcdiv( n, a, b, y ) call vmcdiv( n, a, b, y, mode ) call vzdiv( n, a, b, y ) call vmzdiv( n, a, b, y, mode ) C: vsDiv( n, a, b, y ); vmsDiv( n, a, b, y, mode ); vdDiv( n, a, b, y ); vmdDiv( n, a, b, y, mode ); vcDiv( n, a, b, y ); vmcDiv( n, a, b, y, mode ); vzDiv( n, a, b, y ); Vector Mathematical Functions 9 1997 vmzDiv( n, a, b, y, mode ); Include Files • FORTRAN 77: mkl_vml.f77 • Fortran 90: mkl_vml.f90 • C: mkl_vml_functions.h Input Parameters Name Type Description n FORTRAN 77: INTEGER Fortran 90: INTEGER, INTENT(IN) C: const int Specifies the number of elements to be calculated. a, b FORTRAN 77: REAL for vsdiv, vmsdiv DOUBLE PRECISION for vddiv, vmddiv COMPLEX for vcdiv, vmcdiv DOUBLE COMPLEX for vzdiv, vmzdiv Fortran 90: REAL, INTENT(IN) for vsdiv, vmsdiv DOUBLE PRECISION, INTENT(IN) for vddiv, vmddiv COMPLEX, INTENT(IN) for vcdiv, vmcdiv DOUBLE COMPLEX, INTENT(IN) for vzdiv, vmzdiv C: const float* for vsDiv, vmsDiv const double* for vdDiv, vmdDiv const MKL_Complex8* for vcDiv, vmcDiv const MKL_Complex16* for vzDiv, vmzDiv FORTRAN: Arrays that specify the input vectors a and b. C: Pointers to arrays that contain the input vectors a and b. 
mode
    Type: FORTRAN 77: INTEGER*8
          Fortran 90: INTEGER(KIND=8), INTENT(IN)
          C: const MKL_INT64
    Description: Overrides global VML mode setting for this function call. See vmlSetMode for possible values and their description.

Precision Overflow Thresholds for Real v?Div Function

Data Type          Threshold Limitations on Input Parameters
single precision   abs(a[i]) < abs(b[i]) * FLT_MAX
double precision   abs(a[i]) < abs(b[i]) * DBL_MAX

Precision overflow thresholds for the complex v?Div function are beyond the scope of this document.

Output Parameters

y
    Type: FORTRAN 77: REAL for vsdiv, vmsdiv; DOUBLE PRECISION for vddiv, vmddiv; COMPLEX for vcdiv, vmcdiv; DOUBLE COMPLEX for vzdiv, vmzdiv
          Fortran 90: REAL, INTENT(OUT) for vsdiv, vmsdiv; DOUBLE PRECISION, INTENT(OUT) for vddiv, vmddiv; COMPLEX, INTENT(OUT) for vcdiv, vmcdiv; DOUBLE COMPLEX, INTENT(OUT) for vzdiv, vmzdiv
          C: float* for vsDiv, vmsDiv; double* for vdDiv, vmdDiv; MKL_Complex8* for vcDiv, vmcDiv; MKL_Complex16* for vzDiv, vmzDiv
    Description: FORTRAN: Array that specifies the output vector y.
                 C: Pointer to an array that contains the output vector y.

Description

The v?Div function performs element by element division of vector a by vector b.

Special Values for Real Function v?Div(x)

Argument 1   Argument 2   Result   VML Error Status       Exception
X > +0       +0           +∞       VML_STATUS_SING        ZERODIVIDE
X > +0       -0           -∞       VML_STATUS_SING        ZERODIVIDE
X < +0       +0           -∞       VML_STATUS_SING        ZERODIVIDE
X < +0       -0           +∞       VML_STATUS_SING        ZERODIVIDE
+0           +0           QNAN     VML_STATUS_SING
-0           -0           QNAN     VML_STATUS_SING
X > +0       +∞           +0
X > +0       -∞           -0
+∞           +∞           QNAN
-∞           -∞           QNAN
QNAN         QNAN         QNAN
SNAN         SNAN         QNAN                            INVALID

Specifications for special values of the complex functions are defined according to the following formula:

Div(x1+i*y1,x2+i*y2) = (x1+i*y1)*(x2-i*y2)/(x2*x2+y2*y2).

Overflow in a complex function occurs when x2+i*y2 is not zero, x1, x2, y1, y2 are finite numbers, but the real or imaginary part of the exact result is so large that it does not fit the target precision. In that case, the function returns ∞ in that part of the result, raises the OVERFLOW exception, and sets the VML Error Status to VML_STATUS_OVERFLOW.

v?Sqrt
Computes a square root of vector elements.

Syntax

Fortran:
call vssqrt( n, a, y )
call vmssqrt( n, a, y, mode )
call vdsqrt( n, a, y )
call vmdsqrt( n, a, y, mode )
call vcsqrt( n, a, y )
call vmcsqrt( n, a, y, mode )
call vzsqrt( n, a, y )
call vmzsqrt( n, a, y, mode )

C:
vsSqrt( n, a, y );
vmsSqrt( n, a, y, mode );
vdSqrt( n, a, y );
vmdSqrt( n, a, y, mode );
vcSqrt( n, a, y );
vmcSqrt( n, a, y, mode );
vzSqrt( n, a, y );
vmzSqrt( n, a, y, mode );

Include Files
• FORTRAN 77: mkl_vml.f77
• Fortran 90: mkl_vml.f90
• C: mkl_vml_functions.h

Input Parameters

n
    Type: FORTRAN 77: INTEGER
          Fortran 90: INTEGER, INTENT(IN)
          C: const int
    Description: Specifies the number of elements to be calculated.

a
    Type: FORTRAN 77: REAL for vssqrt, vmssqrt; DOUBLE PRECISION for vdsqrt, vmdsqrt; COMPLEX for vcsqrt, vmcsqrt; DOUBLE COMPLEX for vzsqrt, vmzsqrt
          Fortran 90: REAL, INTENT(IN) for vssqrt, vmssqrt; DOUBLE PRECISION, INTENT(IN) for vdsqrt, vmdsqrt; COMPLEX, INTENT(IN) for vcsqrt, vmcsqrt; DOUBLE COMPLEX, INTENT(IN) for vzsqrt, vmzsqrt
          C: const float* for vsSqrt, vmsSqrt; const double* for vdSqrt, vmdSqrt; const MKL_Complex8* for vcSqrt, vmcSqrt; const MKL_Complex16* for vzSqrt, vmzSqrt
    Description: FORTRAN: Array that specifies the input vector a.
                 C: Pointer to an array that contains the input vector a.

mode
    Type: FORTRAN 77: INTEGER*8
          Fortran 90: INTEGER(KIND=8), INTENT(IN)
          C: const MKL_INT64
    Description: Overrides global VML mode setting for this function call. See vmlSetMode for possible values and their description.
Output Parameters

y
    Type: FORTRAN 77: REAL for vssqrt, vmssqrt; DOUBLE PRECISION for vdsqrt, vmdsqrt; COMPLEX for vcsqrt, vmcsqrt; DOUBLE COMPLEX for vzsqrt, vmzsqrt
          Fortran 90: REAL, INTENT(OUT) for vssqrt, vmssqrt; DOUBLE PRECISION, INTENT(OUT) for vdsqrt, vmdsqrt; COMPLEX, INTENT(OUT) for vcsqrt, vmcsqrt; DOUBLE COMPLEX, INTENT(OUT) for vzsqrt, vmzsqrt
          C: float* for vsSqrt, vmsSqrt; double* for vdSqrt, vmdSqrt; MKL_Complex8* for vcSqrt, vmcSqrt; MKL_Complex16* for vzSqrt, vmzSqrt
    Description: FORTRAN: Array that specifies the output vector y.
                 C: Pointer to an array that contains the output vector y.

Description

The v?Sqrt function computes a square root of vector elements.

Special Values for Real Function v?Sqrt(x)

Argument        Result   VML Error Status       Exception
X < +0          QNAN     VML_STATUS_ERRDOM      INVALID
+0              +0
-0              -0
-∞              QNAN     VML_STATUS_ERRDOM      INVALID
+∞              +∞
QNAN            QNAN
SNAN            QNAN                            INVALID

See the Special Value Notations section for the conventions used in the table below.

Special Values for Complex Function v?Sqrt(z)

          RE(z)
i·IM(z)   -∞           -X           -0           +0           +X           +∞           NAN
+i·∞      +∞+i·∞       +∞+i·∞       +∞+i·∞       +∞+i·∞       +∞+i·∞       +∞+i·∞       +∞+i·∞
+i·Y      +0+i·∞                                                           +∞+i·0       QNAN+i·QNAN
+i·0      +0+i·∞                    +0+i·0       +0+i·0                    +∞+i·0       QNAN+i·QNAN
-i·0      +0-i·∞                    +0-i·0       +0-i·0                    +∞-i·0       QNAN+i·QNAN
-i·Y      +0-i·∞                                                           +∞-i·0       QNAN+i·QNAN
-i·∞      +∞-i·∞       +∞-i·∞       +∞-i·∞       +∞-i·∞       +∞-i·∞       +∞-i·∞       +∞-i·∞
+i·NAN    QNAN+i·QNAN  QNAN+i·QNAN  QNAN+i·QNAN  QNAN+i·QNAN  QNAN+i·QNAN  +∞+i·QNAN    QNAN+i·QNAN

Notes:
• raises INVALID exception when the real or imaginary part of the argument is SNAN
• Sqrt(CONJ(z))=CONJ(Sqrt(z)).

v?InvSqrt
Computes an inverse square root of vector elements.
Syntax

Fortran:
call vsinvsqrt( n, a, y )
call vmsinvsqrt( n, a, y, mode )
call vdinvsqrt( n, a, y )
call vmdinvsqrt( n, a, y, mode )

C:
vsInvSqrt( n, a, y );
vmsInvSqrt( n, a, y, mode );
vdInvSqrt( n, a, y );
vmdInvSqrt( n, a, y, mode );

Include Files
• FORTRAN 77: mkl_vml.f77
• Fortran 90: mkl_vml.f90
• C: mkl_vml_functions.h

Input Parameters

n
    Type: FORTRAN 77: INTEGER
          Fortran 90: INTEGER, INTENT(IN)
          C: const int
    Description: Specifies the number of elements to be calculated.

a
    Type: FORTRAN 77: REAL for vsinvsqrt, vmsinvsqrt; DOUBLE PRECISION for vdinvsqrt, vmdinvsqrt
          Fortran 90: REAL, INTENT(IN) for vsinvsqrt, vmsinvsqrt; DOUBLE PRECISION, INTENT(IN) for vdinvsqrt, vmdinvsqrt
          C: const float* for vsInvSqrt, vmsInvSqrt; const double* for vdInvSqrt, vmdInvSqrt
    Description: FORTRAN: Array that specifies the input vector a.
                 C: Pointer to an array that contains the input vector a.

mode
    Type: FORTRAN 77: INTEGER*8
          Fortran 90: INTEGER(KIND=8), INTENT(IN)
          C: const MKL_INT64
    Description: Overrides global VML mode setting for this function call. See vmlSetMode for possible values and their description.

Output Parameters

y
    Type: FORTRAN 77: REAL for vsinvsqrt, vmsinvsqrt; DOUBLE PRECISION for vdinvsqrt, vmdinvsqrt
          Fortran 90: REAL, INTENT(OUT) for vsinvsqrt, vmsinvsqrt; DOUBLE PRECISION, INTENT(OUT) for vdinvsqrt, vmdinvsqrt
          C: float* for vsInvSqrt, vmsInvSqrt; double* for vdInvSqrt, vmdInvSqrt
    Description: FORTRAN: Array that specifies the output vector y.
                 C: Pointer to an array that contains the output vector y.

Description

The v?InvSqrt function computes an inverse square root of vector elements.
Special Values for Real Function v?InvSqrt(x)

Argument        Result   VML Error Status       Exception
X < +0          QNAN     VML_STATUS_ERRDOM      INVALID
+0              +∞       VML_STATUS_SING        ZERODIVIDE
-0              -∞       VML_STATUS_SING        ZERODIVIDE
-∞              QNAN     VML_STATUS_ERRDOM      INVALID
+∞              +0
QNAN            QNAN
SNAN            QNAN                            INVALID

v?Cbrt
Computes a cube root of vector elements.

Syntax

Fortran:
call vscbrt( n, a, y )
call vmscbrt( n, a, y, mode )
call vdcbrt( n, a, y )
call vmdcbrt( n, a, y, mode )

C:
vsCbrt( n, a, y );
vmsCbrt( n, a, y, mode );
vdCbrt( n, a, y );
vmdCbrt( n, a, y, mode );

Include Files
• FORTRAN 77: mkl_vml.f77
• Fortran 90: mkl_vml.f90
• C: mkl_vml_functions.h

Input Parameters

n
    Type: FORTRAN 77: INTEGER
          Fortran 90: INTEGER, INTENT(IN)
          C: const int
    Description: Specifies the number of elements to be calculated.

a
    Type: FORTRAN 77: REAL for vscbrt, vmscbrt; DOUBLE PRECISION for vdcbrt, vmdcbrt
          Fortran 90: REAL, INTENT(IN) for vscbrt, vmscbrt; DOUBLE PRECISION, INTENT(IN) for vdcbrt, vmdcbrt
          C: const float* for vsCbrt, vmsCbrt; const double* for vdCbrt, vmdCbrt
    Description: FORTRAN: Array that specifies the input vector a.
                 C: Pointer to an array that contains the input vector a.

mode
    Type: FORTRAN 77: INTEGER*8
          Fortran 90: INTEGER(KIND=8), INTENT(IN)
          C: const MKL_INT64
    Description: Overrides global VML mode setting for this function call. See vmlSetMode for possible values and their description.

Output Parameters

y
    Type: FORTRAN 77: REAL for vscbrt, vmscbrt; DOUBLE PRECISION for vdcbrt, vmdcbrt
          Fortran 90: REAL, INTENT(OUT) for vscbrt, vmscbrt; DOUBLE PRECISION, INTENT(OUT) for vdcbrt, vmdcbrt
          C: float* for vsCbrt, vmsCbrt; double* for vdCbrt, vmdCbrt
    Description: FORTRAN: Array that specifies the output vector y.
                 C: Pointer to an array that contains the output vector y.

Description

The v?Cbrt function computes a cube root of vector elements.
Special Values for Real Function v?Cbrt(x)

Argument        Result   Exception
+0              +0
-0              -0
+∞              +∞
-∞              -∞
QNAN            QNAN
SNAN            QNAN     INVALID

v?InvCbrt
Computes an inverse cube root of vector elements.

Syntax

Fortran:
call vsinvcbrt( n, a, y )
call vmsinvcbrt( n, a, y, mode )
call vdinvcbrt( n, a, y )
call vmdinvcbrt( n, a, y, mode )

C:
vsInvCbrt( n, a, y );
vmsInvCbrt( n, a, y, mode );
vdInvCbrt( n, a, y );
vmdInvCbrt( n, a, y, mode );

Include Files
• FORTRAN 77: mkl_vml.f77
• Fortran 90: mkl_vml.f90
• C: mkl_vml_functions.h

Input Parameters

n
    Type: FORTRAN 77: INTEGER
          Fortran 90: INTEGER, INTENT(IN)
          C: const int
    Description: Specifies the number of elements to be calculated.

a
    Type: FORTRAN 77: REAL for vsinvcbrt, vmsinvcbrt; DOUBLE PRECISION for vdinvcbrt, vmdinvcbrt
          Fortran 90: REAL, INTENT(IN) for vsinvcbrt, vmsinvcbrt; DOUBLE PRECISION, INTENT(IN) for vdinvcbrt, vmdinvcbrt
          C: const float* for vsInvCbrt, vmsInvCbrt; const double* for vdInvCbrt, vmdInvCbrt
    Description: FORTRAN: Array that specifies the input vector a.
                 C: Pointer to an array that contains the input vector a.

mode
    Type: FORTRAN 77: INTEGER*8
          Fortran 90: INTEGER(KIND=8), INTENT(IN)
          C: const MKL_INT64
    Description: Overrides global VML mode setting for this function call. See vmlSetMode for possible values and their description.

Output Parameters

y
    Type: FORTRAN 77: REAL for vsinvcbrt, vmsinvcbrt; DOUBLE PRECISION for vdinvcbrt, vmdinvcbrt
          Fortran 90: REAL, INTENT(OUT) for vsinvcbrt, vmsinvcbrt; DOUBLE PRECISION, INTENT(OUT) for vdinvcbrt, vmdinvcbrt
          C: float* for vsInvCbrt, vmsInvCbrt; double* for vdInvCbrt, vmdInvCbrt
    Description: FORTRAN: Array that specifies the output vector y.
                 C: Pointer to an array that contains the output vector y.

Description

The v?InvCbrt function computes an inverse cube root of vector elements.
Special Values for Real Function v?InvCbrt(x)

Argument        Result   VML Error Status       Exception
+0              +∞       VML_STATUS_SING        ZERODIVIDE
-0              -∞       VML_STATUS_SING        ZERODIVIDE
+∞              +0
-∞              -0
QNAN            QNAN
SNAN            QNAN                            INVALID

v?Pow2o3
Raises each element of a vector to the constant power 2/3.

Syntax

Fortran:
call vspow2o3( n, a, y )
call vmspow2o3( n, a, y, mode )
call vdpow2o3( n, a, y )
call vmdpow2o3( n, a, y, mode )

C:
vsPow2o3( n, a, y );
vmsPow2o3( n, a, y, mode );
vdPow2o3( n, a, y );
vmdPow2o3( n, a, y, mode );

Include Files
• FORTRAN 77: mkl_vml.f77
• Fortran 90: mkl_vml.f90
• C: mkl_vml_functions.h

Input Parameters

n
    Type: FORTRAN 77: INTEGER
          Fortran 90: INTEGER, INTENT(IN)
          C: const int
    Description: Specifies the number of elements to be calculated.

a
    Type: FORTRAN 77: REAL for vspow2o3, vmspow2o3; DOUBLE PRECISION for vdpow2o3, vmdpow2o3
          Fortran 90: REAL, INTENT(IN) for vspow2o3, vmspow2o3; DOUBLE PRECISION, INTENT(IN) for vdpow2o3, vmdpow2o3
          C: const float* for vsPow2o3, vmsPow2o3; const double* for vdPow2o3, vmdPow2o3
    Description: FORTRAN: Array that specifies the input vector a.
                 C: Pointer to an array that contains the input vector a.

mode
    Type: FORTRAN 77: INTEGER*8
          Fortran 90: INTEGER(KIND=8), INTENT(IN)
          C: const MKL_INT64
    Description: Overrides global VML mode setting for this function call. See vmlSetMode for possible values and their description.

Output Parameters

y
    Type: FORTRAN 77: REAL for vspow2o3, vmspow2o3; DOUBLE PRECISION for vdpow2o3, vmdpow2o3
          Fortran 90: REAL, INTENT(OUT) for vspow2o3, vmspow2o3; DOUBLE PRECISION, INTENT(OUT) for vdpow2o3, vmdpow2o3
          C: float* for vsPow2o3, vmsPow2o3; double* for vdPow2o3, vmdPow2o3
    Description: FORTRAN: Array that specifies the output vector y.
                 C: Pointer to an array that contains the output vector y.

Description

The v?Pow2o3 function raises each element of a vector to the constant power 2/3.
Special Values for Real Function v?Pow2o3(x)

Argument        Result   Exception
+0              +0
-0              +0
+∞              +∞
-∞              +∞
QNAN            QNAN
SNAN            QNAN     INVALID

v?Pow3o2
Raises each element of a vector to the constant power 3/2.

Syntax

Fortran:
call vspow3o2( n, a, y )
call vmspow3o2( n, a, y, mode )
call vdpow3o2( n, a, y )
call vmdpow3o2( n, a, y, mode )

C:
vsPow3o2( n, a, y );
vmsPow3o2( n, a, y, mode );
vdPow3o2( n, a, y );
vmdPow3o2( n, a, y, mode );

Include Files
• FORTRAN 77: mkl_vml.f77
• Fortran 90: mkl_vml.f90
• C: mkl_vml_functions.h

Input Parameters

n
    Type: FORTRAN 77: INTEGER
          Fortran 90: INTEGER, INTENT(IN)
          C: const int
    Description: Specifies the number of elements to be calculated.

a
    Type: FORTRAN 77: REAL for vspow3o2, vmspow3o2; DOUBLE PRECISION for vdpow3o2, vmdpow3o2
          Fortran 90: REAL, INTENT(IN) for vspow3o2, vmspow3o2; DOUBLE PRECISION, INTENT(IN) for vdpow3o2, vmdpow3o2
          C: const float* for vsPow3o2, vmsPow3o2; const double* for vdPow3o2, vmdPow3o2
    Description: FORTRAN: Array that specifies the input vector a.
                 C: Pointer to an array that contains the input vector a.

mode
    Type: FORTRAN 77: INTEGER*8
          Fortran 90: INTEGER(KIND=8), INTENT(IN)
          C: const MKL_INT64
    Description: Overrides global VML mode setting for this function call. See vmlSetMode for possible values and their description.

Precision Overflow Thresholds for Pow3o2 Function

Data Type          Threshold Limitations on Input Parameters
single precision   abs(a[i]) < (FLT_MAX)^(2/3)
double precision   abs(a[i]) < (DBL_MAX)^(2/3)

Output Parameters

y
    Type: FORTRAN 77: REAL for vspow3o2, vmspow3o2; DOUBLE PRECISION for vdpow3o2, vmdpow3o2
          Fortran 90: REAL, INTENT(OUT) for vspow3o2, vmspow3o2; DOUBLE PRECISION, INTENT(OUT) for vdpow3o2, vmdpow3o2
          C: float* for vsPow3o2, vmsPow3o2; double* for vdPow3o2, vmdPow3o2
    Description: FORTRAN: Array that specifies the output vector y.
                 C: Pointer to an array that contains the output vector y.

Description

The v?Pow3o2 function raises each element of a vector to the constant power 3/2.
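The single-precision overflow threshold for v?Pow3o2, abs(a[i]) less than FLT_MAX raised to the power 2/3, can be checked before a call with a small pre-scan. The sketch below is one way to do that in plain C; count_above_pow3o2_threshold is a hypothetical helper, not an Intel MKL function, and it evaluates the threshold in double precision as a simplifying assumption.

```c
#include <assert.h>
#include <float.h>
#include <math.h>
#include <stddef.h>

/* Hypothetical helper (not an MKL routine): counts the elements of a
 * that do NOT satisfy abs(a[i]) < (FLT_MAX)^(2/3), the single-precision
 * overflow threshold for raising to the power 3/2.
 * NaN elements fail the comparison and are counted as well. */
static size_t count_above_pow3o2_threshold(size_t n, const float *a)
{
    const double limit = pow((double)FLT_MAX, 2.0 / 3.0);
    size_t count = 0;

    assert(a != NULL);
    for (size_t i = 0; i < n; ++i) {
        if (!(fabs((double)a[i]) < limit)) {
            ++count;
        }
    }
    return count;
}
```

A nonzero return value indicates that at least one element lies outside the documented threshold, so its result may overflow.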
Special Values for Real Function v?Pow3o2(x)

Argument        Result   VML Error Status       Exception
X < +0          QNAN     VML_STATUS_ERRDOM      INVALID
+0              +0
-0              -0
-∞              QNAN     VML_STATUS_ERRDOM      INVALID
+∞              +∞
QNAN            QNAN
SNAN            QNAN                            INVALID

v?Pow
Computes a to the power b for elements of two vectors.

Syntax

Fortran:
call vspow( n, a, b, y )
call vmspow( n, a, b, y, mode )
call vdpow( n, a, b, y )
call vmdpow( n, a, b, y, mode )
call vcpow( n, a, b, y )
call vmcpow( n, a, b, y, mode )
call vzpow( n, a, b, y )
call vmzpow( n, a, b, y, mode )

C:
vsPow( n, a, b, y );
vmsPow( n, a, b, y, mode );
vdPow( n, a, b, y );
vmdPow( n, a, b, y, mode );
vcPow( n, a, b, y );
vmcPow( n, a, b, y, mode );
vzPow( n, a, b, y );
vmzPow( n, a, b, y, mode );

Include Files
• FORTRAN 77: mkl_vml.f77
• Fortran 90: mkl_vml.f90
• C: mkl_vml_functions.h

Input Parameters

n
    Type: FORTRAN 77: INTEGER
          Fortran 90: INTEGER, INTENT(IN)
          C: const int
    Description: Specifies the number of elements to be calculated.

a, b
    Type: FORTRAN 77: REAL for vspow, vmspow; DOUBLE PRECISION for vdpow, vmdpow; COMPLEX for vcpow, vmcpow; DOUBLE COMPLEX for vzpow, vmzpow
          Fortran 90: REAL, INTENT(IN) for vspow, vmspow; DOUBLE PRECISION, INTENT(IN) for vdpow, vmdpow; COMPLEX, INTENT(IN) for vcpow, vmcpow; DOUBLE COMPLEX, INTENT(IN) for vzpow, vmzpow
          C: const float* for vsPow, vmsPow; const double* for vdPow, vmdPow; const MKL_Complex8* for vcPow, vmcPow; const MKL_Complex16* for vzPow, vmzPow
    Description: FORTRAN: Arrays that specify the input vectors a and b.
                 C: Pointers to arrays that contain the input vectors a and b.

mode
    Type: FORTRAN 77: INTEGER*8
          Fortran 90: INTEGER(KIND=8), INTENT(IN)
          C: const MKL_INT64
    Description: Overrides global VML mode setting for this function call. See vmlSetMode for possible values and their description.
Precision Overflow Thresholds for Real v?Pow Function

Data Type          Threshold Limitations on Input Parameters
single precision   abs(a[i]) < (FLT_MAX)^(1/b[i])
double precision   abs(a[i]) < (DBL_MAX)^(1/b[i])

Precision overflow thresholds for the complex v?Pow function are beyond the scope of this document.

Output Parameters

y
    Type: FORTRAN 77: REAL for vspow, vmspow; DOUBLE PRECISION for vdpow, vmdpow; COMPLEX for vcpow, vmcpow; DOUBLE COMPLEX for vzpow, vmzpow
          Fortran 90: REAL, INTENT(OUT) for vspow, vmspow; DOUBLE PRECISION, INTENT(OUT) for vdpow, vmdpow; COMPLEX, INTENT(OUT) for vcpow, vmcpow; DOUBLE COMPLEX, INTENT(OUT) for vzpow, vmzpow
          C: float* for vsPow, vmsPow; double* for vdPow, vmdPow; MKL_Complex8* for vcPow, vmcPow; MKL_Complex16* for vzPow, vmzPow
    Description: FORTRAN: Array that specifies the output vector y.
                 C: Pointer to an array that contains the output vector y.

Description

The v?Pow function computes a to the power b for elements of two vectors.

The real function v(s/d)Pow has certain limitations on the input range of a and b parameters. Specifically, if a[i] is positive, then b[i] may be arbitrary. For negative a[i], the value of b[i] must be an integer (either positive or negative).

The complex function v(c/z)Pow has no input range limitations.

Special Values for Real Function v?Pow(x)

Argument 1   Argument 2          Result   VML Error Status    Exception
+0           neg. odd integer    +∞       VML_STATUS_ERRDOM   ZERODIVIDE
-0           neg. odd integer    -∞       VML_STATUS_ERRDOM   ZERODIVIDE
+0           neg. even integer   +∞       VML_STATUS_ERRDOM   ZERODIVIDE
-0           neg. even integer   +∞       VML_STATUS_ERRDOM   ZERODIVIDE
+0           neg. non-integer    +∞       VML_STATUS_ERRDOM   ZERODIVIDE
-0           neg. non-integer    +∞       VML_STATUS_ERRDOM   ZERODIVIDE
+0           pos. odd integer    +0
-0           pos. odd integer    -0
+0           pos. even integer   +0
-0           pos. even integer   +0
+0           pos. non-integer    +0
-0           pos. non-integer    +0
-1           +∞                  +1
-1           -∞                  +1
+1           any value           +1
+1           +0                  +1
+1           -0                  +1
+1           +∞                  +1
+1           -∞                  +1
+1           QNAN                +1
any value    +0                  +1
+0           +0                  +1
-0           +0                  +1
+∞           +0                  +1
-∞           +0                  +1
QNAN         +0                  +1
any value    -0                  +1
+0           -0                  +1
-0           -0                  +1
+∞           -0                  +1
-∞           -0                  +1
QNAN         -0                  +1
X < +0       non-integer         QNAN     VML_STATUS_ERRDOM   INVALID
|X| < 1      -∞                  +∞
+0           -∞                  +∞       VML_STATUS_ERRDOM   ZERODIVIDE
-0           -∞                  +∞       VML_STATUS_ERRDOM   ZERODIVIDE
|X| > 1      -∞                  +0
+∞           -∞                  +0
-∞           -∞                  +0
|X| < 1      +∞                  +0
+0           +∞                  +0
-0           +∞                  +0
|X| > 1      +∞                  +∞
+∞           +∞                  +∞
-∞           +∞                  +∞
-∞           neg. odd integer    -0
-∞           neg. even integer   +0
-∞           neg. non-integer    +0
-∞           pos. odd integer    -∞
-∞           pos. even integer   +∞
-∞           pos. non-integer    +∞
+∞           X < +0              +0
+∞           X > +0              +∞
QNAN         QNAN                QNAN
QNAN         SNAN                QNAN                         INVALID
SNAN         QNAN                QNAN                         INVALID
SNAN         SNAN                QNAN                         INVALID

Overflow in a complex function occurs (supported in the HA/LA accuracy modes only) when x1, x2, y1, y2 are finite numbers, but the real or imaginary part of the exact result is so large that it does not fit the target precision. In this case, the function returns ∞ in that part of the result, raises the OVERFLOW exception, and sets the VML Error Status to VML_STATUS_OVERFLOW.

v?Powx
Raises each element of a vector to the constant power.

Syntax

Fortran:
call vspowx( n, a, b, y )
call vmspowx( n, a, b, y, mode )
call vdpowx( n, a, b, y )
call vmdpowx( n, a, b, y, mode )
call vcpowx( n, a, b, y )
call vmcpowx( n, a, b, y, mode )
call vzpowx( n, a, b, y )
call vmzpowx( n, a, b, y, mode )

C:
vsPowx( n, a, b, y );
vmsPowx( n, a, b, y, mode );
vdPowx( n, a, b, y );
vmdPowx( n, a, b, y, mode );
vcPowx( n, a, b, y );
vmcPowx( n, a, b, y, mode );
vzPowx( n, a, b, y );
vmzPowx( n, a, b, y, mode );

Include Files
• FORTRAN 77: mkl_vml.f77
• Fortran 90: mkl_vml.f90
• C: mkl_vml_functions.h

Input Parameters

n
    Type: FORTRAN 77: INTEGER
          Fortran 90: INTEGER, INTENT(IN)
          C: const int
    Description: Number of elements to be calculated.
a
    Type: FORTRAN 77: REAL for vspowx, vmspowx; DOUBLE PRECISION for vdpowx, vmdpowx; COMPLEX for vcpowx, vmcpowx; DOUBLE COMPLEX for vzpowx, vmzpowx
          Fortran 90: REAL, INTENT(IN) for vspowx, vmspowx; DOUBLE PRECISION, INTENT(IN) for vdpowx, vmdpowx; COMPLEX, INTENT(IN) for vcpowx, vmcpowx; DOUBLE COMPLEX, INTENT(IN) for vzpowx, vmzpowx
          C: const float* for vsPowx, vmsPowx; const double* for vdPowx, vmdPowx; const MKL_Complex8* for vcPowx, vmcPowx; const MKL_Complex16* for vzPowx, vmzPowx
    Description: FORTRAN: Array a that specifies the input vector.
                 C: Pointer to an array that contains the input vector a.

b
    Type: FORTRAN 77: REAL for vspowx, vmspowx; DOUBLE PRECISION for vdpowx, vmdpowx; COMPLEX for vcpowx, vmcpowx; DOUBLE COMPLEX for vzpowx, vmzpowx
          Fortran 90: REAL, INTENT(IN) for vspowx, vmspowx; DOUBLE PRECISION, INTENT(IN) for vdpowx, vmdpowx; COMPLEX, INTENT(IN) for vcpowx, vmcpowx; DOUBLE COMPLEX, INTENT(IN) for vzpowx, vmzpowx
          C: const float* for vsPowx, vmsPowx; const double* for vdPowx, vmdPowx; const MKL_Complex8* for vcPowx, vmcPowx; const MKL_Complex16* for vzPowx, vmzPowx
    Description: FORTRAN: Scalar value b that is the constant power.
                 C: Constant value for power b.

mode
    Type: FORTRAN 77: INTEGER*8
          Fortran 90: INTEGER(KIND=8), INTENT(IN)
          C: const MKL_INT64
    Description: Overrides global VML mode setting for this function call. See vmlSetMode for possible values and their description.

Precision Overflow Thresholds for Real v?Powx Function

Data Type          Threshold Limitations on Input Parameters
single precision   abs(a[i]) < (FLT_MAX)^(1/b)
double precision   abs(a[i]) < (DBL_MAX)^(1/b)

Precision overflow thresholds for the complex v?Powx function are beyond the scope of this document.
Output Parameters

y
    Type: FORTRAN 77: REAL for vspowx, vmspowx; DOUBLE PRECISION for vdpowx, vmdpowx; COMPLEX for vcpowx, vmcpowx; DOUBLE COMPLEX for vzpowx, vmzpowx
          Fortran 90: REAL, INTENT(OUT) for vspowx, vmspowx; DOUBLE PRECISION, INTENT(OUT) for vdpowx, vmdpowx; COMPLEX, INTENT(OUT) for vcpowx, vmcpowx; DOUBLE COMPLEX, INTENT(OUT) for vzpowx, vmzpowx
          C: float* for vsPowx, vmsPowx; double* for vdPowx, vmdPowx; MKL_Complex8* for vcPowx, vmcPowx; MKL_Complex16* for vzPowx, vmzPowx
    Description: FORTRAN: Array that specifies the output vector y.
                 C: Pointer to an array that contains the output vector y.

Description

The v?Powx function raises each element of a vector to the constant power.

The real function v(s/d)Powx has certain limitations on the input range of a and b parameters. Specifically, if a[i] is positive, then b may be arbitrary. For negative a[i], the value of b must be an integer (either positive or negative).

The complex function v(c/z)Powx has no input range limitations.

Special values are the same as for the v?Pow function.

v?Hypot
Computes a square root of sum of two squared elements.

Syntax

Fortran:
call vshypot( n, a, b, y )
call vmshypot( n, a, b, y, mode )
call vdhypot( n, a, b, y )
call vmdhypot( n, a, b, y, mode )

C:
vsHypot( n, a, b, y );
vmsHypot( n, a, b, y, mode );
vdHypot( n, a, b, y );
vmdHypot( n, a, b, y, mode );

Include Files
• FORTRAN 77: mkl_vml.f77
• Fortran 90: mkl_vml.f90
• C: mkl_vml_functions.h

Input Parameters

n
    Type: FORTRAN 77: INTEGER
          Fortran 90: INTEGER, INTENT(IN)
          C: const int
    Description: Number of elements to be calculated.
a, b
    Type: FORTRAN 77: REAL for vshypot, vmshypot; DOUBLE PRECISION for vdhypot, vmdhypot
          Fortran 90: REAL, INTENT(IN) for vshypot, vmshypot; DOUBLE PRECISION, INTENT(IN) for vdhypot, vmdhypot
          C: const float* for vsHypot, vmsHypot; const double* for vdHypot, vmdHypot
    Description: FORTRAN: Arrays that specify the input vectors a and b.
                 C: Pointers to arrays that contain the input vectors a and b.

mode
    Type: FORTRAN 77: INTEGER*8
          Fortran 90: INTEGER(KIND=8), INTENT(IN)
          C: const MKL_INT64
    Description: Overrides global VML mode setting for this function call. See vmlSetMode for possible values and their description.

Precision Overflow Thresholds for Hypot Function

Data Type          Threshold Limitations on Input Parameters
single precision   abs(a[i]) < sqrt(FLT_MAX)
                   abs(b[i]) < sqrt(FLT_MAX)
double precision   abs(a[i]) < sqrt(DBL_MAX)
                   abs(b[i]) < sqrt(DBL_MAX)

Output Parameters

y
    Type: FORTRAN 77: REAL for vshypot, vmshypot; DOUBLE PRECISION for vdhypot, vmdhypot
          Fortran 90: REAL, INTENT(OUT) for vshypot, vmshypot; DOUBLE PRECISION, INTENT(OUT) for vdhypot, vmdhypot
          C: float* for vsHypot, vmsHypot; double* for vdHypot, vmdHypot
    Description: FORTRAN: Array that specifies the output vector y.
                 C: Pointer to an array that contains the output vector y.

Description

The function v?Hypot computes a square root of sum of two squared elements.

Special Values for Real Function v?Hypot(x)

Argument 1   Argument 2   Result   Exception
+0           +0           +0
-0           -0           +0
+∞           any value    +∞
any value    +∞           +∞
SNAN         any value    QNAN     INVALID
any value    SNAN         QNAN     INVALID
QNAN         any value    QNAN
any value    QNAN         QNAN

Exponential and Logarithmic Functions

v?Exp
Computes an exponential of vector elements.
Syntax

Fortran:
call vsexp( n, a, y )
call vmsexp( n, a, y, mode )
call vdexp( n, a, y )
call vmdexp( n, a, y, mode )
call vcexp( n, a, y )
call vmcexp( n, a, y, mode )
call vzexp( n, a, y )
call vmzexp( n, a, y, mode )

C:
vsExp( n, a, y );
vmsExp( n, a, y, mode );
vdExp( n, a, y );
vmdExp( n, a, y, mode );
vcExp( n, a, y );
vmcExp( n, a, y, mode );
vzExp( n, a, y );
vmzExp( n, a, y, mode );

Include Files
• FORTRAN 77: mkl_vml.f77
• Fortran 90: mkl_vml.f90
• C: mkl_vml_functions.h

Input Parameters

n
    Type: FORTRAN 77: INTEGER
          Fortran 90: INTEGER, INTENT(IN)
          C: const int
    Description: Specifies the number of elements to be calculated.

a
    Type: FORTRAN 77: REAL for vsexp, vmsexp; DOUBLE PRECISION for vdexp, vmdexp; COMPLEX for vcexp, vmcexp; DOUBLE COMPLEX for vzexp, vmzexp
          Fortran 90: REAL, INTENT(IN) for vsexp, vmsexp; DOUBLE PRECISION, INTENT(IN) for vdexp, vmdexp; COMPLEX, INTENT(IN) for vcexp, vmcexp; DOUBLE COMPLEX, INTENT(IN) for vzexp, vmzexp
          C: const float* for vsExp, vmsExp; const double* for vdExp, vmdExp; const MKL_Complex8* for vcExp, vmcExp; const MKL_Complex16* for vzExp, vmzExp
    Description: FORTRAN: Array that specifies the input vector a.
                 C: Pointer to an array that contains the input vector a.

mode
    Type: FORTRAN 77: INTEGER*8
          Fortran 90: INTEGER(KIND=8), INTENT(IN)
          C: const MKL_INT64
    Description: Overrides global VML mode setting for this function call. See vmlSetMode for possible values and their description.

Precision Overflow Thresholds for Real v?Exp Function

Data Type          Threshold Limitations on Input Parameters
single precision   a[i] < Ln( FLT_MAX )
double precision   a[i] < Ln( DBL_MAX )

Precision overflow thresholds for the complex v?Exp function are beyond the scope of this document.
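For single precision, the threshold a[i] < Ln( FLT_MAX ) works out to roughly 88.72, so inputs of 89 or more overflow a float result. The sketch below scans a vector against that limit in plain C; count_exp_overflow_f is a hypothetical helper, not an Intel MKL function, and it evaluates the limit in double precision as a simplifying assumption.

```c
#include <assert.h>
#include <float.h>
#include <math.h>
#include <stddef.h>

/* Hypothetical helper (not an MKL routine): counts the elements of a
 * that do NOT satisfy a[i] < ln(FLT_MAX), the single-precision
 * overflow threshold for the real exponential.
 * NaN elements fail the comparison and are counted as well. */
static size_t count_exp_overflow_f(size_t n, const float *a)
{
    const double limit = log((double)FLT_MAX); /* about 88.72 */
    size_t count = 0;

    assert(a != NULL);
    for (size_t i = 0; i < n; ++i) {
        if (!((double)a[i] < limit)) {
            ++count;
        }
    }
    return count;
}
```

The same comparison with DBL_MAX gives the double-precision limit.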
Output Parameters

y
    Type: FORTRAN 77: REAL for vsexp, vmsexp; DOUBLE PRECISION for vdexp, vmdexp; COMPLEX for vcexp, vmcexp; DOUBLE COMPLEX for vzexp, vmzexp
          Fortran 90: REAL, INTENT(OUT) for vsexp, vmsexp; DOUBLE PRECISION, INTENT(OUT) for vdexp, vmdexp; COMPLEX, INTENT(OUT) for vcexp, vmcexp; DOUBLE COMPLEX, INTENT(OUT) for vzexp, vmzexp
          C: float* for vsExp, vmsExp; double* for vdExp, vmdExp; MKL_Complex8* for vcExp, vmcExp; MKL_Complex16* for vzExp, vmzExp
    Description: FORTRAN: Array that specifies the output vector y.
                 C: Pointer to an array that contains the output vector y.

Description

The v?Exp function computes an exponential of vector elements.

Special Values for Real Function v?Exp(x)

Argument        Result   VML Error Status       Exception
+0              +1
-0              +1
X > overflow    +∞       VML_STATUS_OVERFLOW    OVERFLOW
X < underflow   +0       VML_STATUS_UNDERFLOW   UNDERFLOW
+∞              +∞
-∞              +0
QNAN            QNAN
SNAN            QNAN                            INVALID

See the Special Value Notations section for the conventions used in the table below.
Special Values for Complex Function v?Exp(z)

          RE(z)
i·IM(z)   -∞           -X           -0           +0           +X           +∞           NAN
+i·∞      +0+i·0       QNAN+i·QNAN  QNAN+i·QNAN  QNAN+i·QNAN  QNAN+i·QNAN  +∞+i·QNAN    QNAN+i·QNAN
                       INVALID      INVALID      INVALID      INVALID      INVALID      INVALID
+i·Y      +0·CIS(Y)                                                        +∞·CIS(Y)    QNAN+i·QNAN
+i·0      +0·CIS(0)                 +1+i·0       +1+i·0                    +∞+i·0       QNAN+i·0
-i·0      +0·CIS(0)                 +1-i·0       +1-i·0                    +∞-i·0       QNAN-i·0
-i·Y      +0·CIS(Y)                                                        +∞·CIS(Y)    QNAN+i·QNAN
-i·∞      +0-i·0       QNAN+i·QNAN  QNAN+i·QNAN  QNAN+i·QNAN  QNAN+i·QNAN  +∞+i·QNAN    QNAN+i·QNAN
                       INVALID      INVALID      INVALID      INVALID      INVALID
+i·NAN    +0+i·0       QNAN+i·QNAN  QNAN+i·QNAN  QNAN+i·QNAN  QNAN+i·QNAN  +∞+i·QNAN    QNAN+i·QNAN
                       INVALID      INVALID      INVALID      INVALID

Notes:
• raises the INVALID exception when the real or imaginary part of the argument is SNAN
• raises the INVALID exception on argument z=-∞+i·QNAN
• raises the OVERFLOW exception and sets the VML Error Status to VML_STATUS_OVERFLOW in the case of overflow, that is, when RE(z), IM(z) are finite non-zero numbers, but the real or imaginary part of the exact result is so large that it does not meet the target precision.

v?Expm1
Computes an exponential of vector elements decreased by 1.

Syntax

Fortran:
call vsexpm1( n, a, y )
call vmsexpm1( n, a, y, mode )
call vdexpm1( n, a, y )
call vmdexpm1( n, a, y, mode )

C:
vsExpm1( n, a, y );
vmsExpm1( n, a, y, mode );
vdExpm1( n, a, y );
vmdExpm1( n, a, y, mode );

Include Files
• FORTRAN 77: mkl_vml.f77
• Fortran 90: mkl_vml.f90
• C: mkl_vml_functions.h

Input Parameters

n
    Type: FORTRAN 77: INTEGER
          Fortran 90: INTEGER, INTENT(IN)
          C: const int
    Description: Specifies the number of elements to be calculated.

a
    Type: FORTRAN 77: REAL for vsexpm1, vmsexpm1; DOUBLE PRECISION for vdexpm1, vmdexpm1
          Fortran 90: REAL, INTENT(IN) for vsexpm1, vmsexpm1; DOUBLE PRECISION, INTENT(IN) for vdexpm1, vmdexpm1
          C: const float* for vsExpm1, vmsExpm1; const double* for vdExpm1, vmdExpm1
    Description: FORTRAN: Array that specifies the input vector a.
                 C: Pointer to an array that contains the input vector a.

Name: mode
Type:
  FORTRAN 77: INTEGER*8
  Fortran 90: INTEGER(KIND=8), INTENT(IN)
  C: const MKL_INT64
Description: Overrides global VML mode setting for this function call. See vmlSetMode for possible values and their description.

Precision Overflow Thresholds for Expm1 Function

Data Type          Threshold Limitations on Input Parameters
single precision   a[i] < Ln( FLT_MAX )
double precision   a[i] < Ln( DBL_MAX )

Output Parameters

Name: y
Type:
  FORTRAN 77: REAL for vsexpm1, vmsexpm1; DOUBLE PRECISION for vdexpm1, vmdexpm1
  Fortran 90: REAL, INTENT(OUT) for vsexpm1, vmsexpm1; DOUBLE PRECISION, INTENT(OUT) for vdexpm1, vmdexpm1
  C: float* for vsExpm1, vmsExpm1; double* for vdExpm1, vmdExpm1
Description:
  FORTRAN: Array that specifies the output vector y.
  C: Pointer to an array that contains the output vector y.

Description

The v?Expm1 function computes an exponential of vector elements decreased by 1.

Special Values for Real Function v?Expm1(x)

Argument        Result   VML Error Status        Exception
+0              +0
-0              +0
X > overflow    +∞       VML_STATUS_OVERFLOW     OVERFLOW
+∞              +∞
-∞              -1
QNAN            QNAN
SNAN            QNAN                             INVALID

v?Ln
Computes natural logarithm of vector elements.

Syntax

Fortran:
call vsln( n, a, y )
call vmsln( n, a, y, mode )
call vdln( n, a, y )
call vmdln( n, a, y, mode )
call vcln( n, a, y )
call vmcln( n, a, y, mode )
call vzln( n, a, y )
call vmzln( n, a, y, mode )

C:
vsLn( n, a, y );
vmsLn( n, a, y, mode );
vdLn( n, a, y );
vmdLn( n, a, y, mode );
vcLn( n, a, y );
vmcLn( n, a, y, mode );
vzLn( n, a, y );
vmzLn( n, a, y, mode );

Include Files
• FORTRAN 77: mkl_vml.f77
• Fortran 90: mkl_vml.f90
• C: mkl_vml_functions.h

Input Parameters

Name: n
Type:
  FORTRAN 77: INTEGER
  Fortran 90: INTEGER, INTENT(IN)
  C: const int
Description: Specifies the number of elements to be calculated.

Name: a
Type:
  FORTRAN 77: REAL for vsln, vmsln; DOUBLE PRECISION for vdln, vmdln; COMPLEX for vcln, vmcln; DOUBLE COMPLEX for vzln, vmzln
  Fortran 90: REAL, INTENT(IN) for vsln, vmsln; DOUBLE PRECISION, INTENT(IN) for vdln, vmdln; COMPLEX, INTENT(IN) for vcln, vmcln; DOUBLE COMPLEX, INTENT(IN) for vzln, vmzln
  C: const float* for vsLn, vmsLn; const double* for vdLn, vmdLn; const MKL_Complex8* for vcLn, vmcLn; const MKL_Complex16* for vzLn, vmzLn
Description:
  FORTRAN: Array that specifies the input vector a.
  C: Pointer to an array that contains the input vector a.

Name: mode
Type:
  FORTRAN 77: INTEGER*8
  Fortran 90: INTEGER(KIND=8), INTENT(IN)
  C: const MKL_INT64
Description: Overrides global VML mode setting for this function call. See vmlSetMode for possible values and their description.

Output Parameters

Name: y
Type:
  FORTRAN 77: REAL for vsln, vmsln; DOUBLE PRECISION for vdln, vmdln; COMPLEX for vcln, vmcln; DOUBLE COMPLEX for vzln, vmzln
  Fortran 90: REAL, INTENT(OUT) for vsln, vmsln; DOUBLE PRECISION, INTENT(OUT) for vdln, vmdln; COMPLEX, INTENT(OUT) for vcln, vmcln; DOUBLE COMPLEX, INTENT(OUT) for vzln, vmzln
  C: float* for vsLn, vmsLn; double* for vdLn, vmdLn; MKL_Complex8* for vcLn, vmcLn; MKL_Complex16* for vzLn, vmzLn
Description:
  FORTRAN: Array that specifies the output vector y.
  C: Pointer to an array that contains the output vector y.

Description

The v?Ln function computes natural logarithm of vector elements.

Special Values for Real Function v?Ln(x)

Argument        Result   VML Error Status        Exception
+1              +0
X < +0          QNAN     VML_STATUS_ERRDOM       INVALID
+0              -∞       VML_STATUS_SING         ZERODIVIDE
-0              -∞       VML_STATUS_SING         ZERODIVIDE
-∞              QNAN     VML_STATUS_ERRDOM       INVALID
+∞              +∞
QNAN            QNAN
SNAN            QNAN                             INVALID

See the Special Value Notations section for the conventions used in the table below.
Special Values for Complex Function v?Ln(z)

Each block below gives one value of i·IM(z), followed by the result for each listed value of RE(z).

i·IM(z) = +i·∞
  RE(z) = -X, -0, +0, +X: +∞+i·π/2
  RE(z) = +∞: +∞+i·π/4
  RE(z) = NAN: +∞+i·QNAN
i·IM(z) = +i·Y
  RE(z) = -∞: +∞+i·π
  RE(z) = +∞: +∞+i·0
  RE(z) = NAN: QNAN+i·QNAN, INVALID
i·IM(z) = +i·0
  RE(z) = -∞: +∞+i·π
  RE(z) = -0: -∞+i·π, ZERODIVIDE
  RE(z) = +0: -∞+i·0, ZERODIVIDE
  RE(z) = +∞: +∞+i·0
  RE(z) = NAN: QNAN+i·QNAN, INVALID
i·IM(z) = -i·0
  RE(z) = -∞: +∞-i·π
  RE(z) = -0: -∞-i·π, ZERODIVIDE
  RE(z) = +0: -∞-i·0, ZERODIVIDE
  RE(z) = +∞: +∞-i·0
  RE(z) = NAN: QNAN+i·QNAN, INVALID
i·IM(z) = -i·Y
  RE(z) = -∞: +∞-i·π
  RE(z) = +∞: +∞-i·0
  RE(z) = NAN: QNAN+i·QNAN, INVALID
i·IM(z) = -i·∞
  RE(z) = -X, -0, +0, +X: +∞-i·π/2
  RE(z) = +∞: +∞-i·π/4
  RE(z) = NAN: +∞+i·QNAN
i·IM(z) = +i·NAN
  RE(z) = -∞: +∞+i·QNAN
  RE(z) = -X, -0, +0, +X: QNAN+i·QNAN, INVALID
  RE(z) = +∞: +∞+i·QNAN
  RE(z) = NAN: QNAN+i·QNAN, INVALID

Notes:
• raises INVALID exception when real or imaginary part of the argument is SNAN

v?Log10
Computes the denary (base-10) logarithm of vector elements.

Syntax

Fortran:
call vslog10( n, a, y )
call vmslog10( n, a, y, mode )
call vdlog10( n, a, y )
call vmdlog10( n, a, y, mode )
call vclog10( n, a, y )
call vmclog10( n, a, y, mode )
call vzlog10( n, a, y )
call vmzlog10( n, a, y, mode )

C:
vsLog10( n, a, y );
vmsLog10( n, a, y, mode );
vdLog10( n, a, y );
vmdLog10( n, a, y, mode );
vcLog10( n, a, y );
vmcLog10( n, a, y, mode );
vzLog10( n, a, y );
vmzLog10( n, a, y, mode );

Include Files
• FORTRAN 77: mkl_vml.f77
• Fortran 90: mkl_vml.f90
• C: mkl_vml_functions.h

Input Parameters

Name: n
Type:
  FORTRAN 77: INTEGER
  Fortran 90: INTEGER, INTENT(IN)
  C: const int
Description: Specifies the number of elements to be calculated.

Name: a
Type:
  FORTRAN 77: REAL for vslog10, vmslog10; DOUBLE PRECISION for vdlog10, vmdlog10; COMPLEX for vclog10, vmclog10; DOUBLE COMPLEX for vzlog10, vmzlog10
  Fortran 90: REAL, INTENT(IN) for vslog10, vmslog10; DOUBLE PRECISION, INTENT(IN) for vdlog10, vmdlog10; COMPLEX, INTENT(IN) for vclog10, vmclog10; DOUBLE COMPLEX, INTENT(IN) for vzlog10, vmzlog10
  C: const float* for vsLog10, vmsLog10; const double* for vdLog10, vmdLog10; const MKL_Complex8* for vcLog10, vmcLog10; const MKL_Complex16* for vzLog10, vmzLog10
Description:
  FORTRAN: Array that specifies the input vector a.
  C: Pointer to an array that contains the input vector a.

Name: mode
Type:
  FORTRAN 77: INTEGER*8
  Fortran 90: INTEGER(KIND=8), INTENT(IN)
  C: const MKL_INT64
Description: Overrides global VML mode setting for this function call. See vmlSetMode for possible values and their description.

Output Parameters

Name: y
Type:
  FORTRAN 77: REAL for vslog10, vmslog10; DOUBLE PRECISION for vdlog10, vmdlog10; COMPLEX for vclog10, vmclog10; DOUBLE COMPLEX for vzlog10, vmzlog10
  Fortran 90: REAL, INTENT(OUT) for vslog10, vmslog10; DOUBLE PRECISION, INTENT(OUT) for vdlog10, vmdlog10; COMPLEX, INTENT(OUT) for vclog10, vmclog10; DOUBLE COMPLEX, INTENT(OUT) for vzlog10, vmzlog10
  C: float* for vsLog10, vmsLog10; double* for vdLog10, vmdLog10; MKL_Complex8* for vcLog10, vmcLog10; MKL_Complex16* for vzLog10, vmzLog10
Description:
  FORTRAN: Array that specifies the output vector y.
  C: Pointer to an array that contains the output vector y.

Description

The v?Log10 function computes the denary (base-10) logarithm of vector elements.

Special Values for Real Function v?Log10(x)

Argument        Result   VML Error Status        Exception
+1              +0
X < +0          QNAN     VML_STATUS_ERRDOM       INVALID
+0              -∞       VML_STATUS_SING         ZERODIVIDE
-0              -∞       VML_STATUS_SING         ZERODIVIDE
-∞              QNAN     VML_STATUS_ERRDOM       INVALID
+∞              +∞
QNAN            QNAN
SNAN            QNAN                             INVALID

See the Special Value Notations section for the conventions used in the table below.
Special Values for Complex Function v?Log10(z)

RE(z) columns: -∞, -X, -0, +0, +X, +∞, NAN
i·IM(z) rows:
+i·∞: +∞+i·QNAN INVALID
+i·Y: +∞+i·0; QNAN+i·QNAN INVALID
+i·0: ZERODIVIDE; -∞+i·0 ZERODIVIDE; +∞+i·0; QNAN+i·QNAN INVALID
-i·0: ZERODIVIDE; -∞-i·0 ZERODIVIDE; +∞-i·0; QNAN-i·QNAN INVALID
-i·Y: +∞-i·0; QNAN+i·QNAN INVALID
-i·∞: +∞+i·QNAN
+i·NAN: +∞+i·QNAN; QNAN+i·QNAN INVALID; QNAN+i·QNAN INVALID; QNAN+i·QNAN INVALID; QNAN+i·QNAN INVALID; +∞+i·QNAN; QNAN+i·QNAN INVALID

Notes:
• raises INVALID exception when real or imaginary part of the argument is SNAN

v?Log1p
Computes a natural logarithm of vector elements that are increased by 1.

Syntax

Fortran:
call vslog1p( n, a, y )
call vmslog1p( n, a, y, mode )
call vdlog1p( n, a, y )
call vmdlog1p( n, a, y, mode )

C:
vsLog1p( n, a, y );
vmsLog1p( n, a, y, mode );
vdLog1p( n, a, y );
vmdLog1p( n, a, y, mode );

Include Files
• FORTRAN 77: mkl_vml.f77
• Fortran 90: mkl_vml.f90
• C: mkl_vml_functions.h

Input Parameters

Name: n
Type:
  FORTRAN 77: INTEGER
  Fortran 90: INTEGER, INTENT(IN)
  C: const int
Description: Specifies the number of elements to be calculated.

Name: a
Type:
  FORTRAN 77: REAL for vslog1p, vmslog1p; DOUBLE PRECISION for vdlog1p, vmdlog1p
  Fortran 90: REAL, INTENT(IN) for vslog1p, vmslog1p; DOUBLE PRECISION, INTENT(IN) for vdlog1p, vmdlog1p
  C: const float* for vsLog1p, vmsLog1p; const double* for vdLog1p, vmdLog1p
Description:
  FORTRAN: Array that specifies the input vector a.
  C: Pointer to an array that contains the input vector a.

Name: mode
Type:
  FORTRAN 77: INTEGER*8
  Fortran 90: INTEGER(KIND=8), INTENT(IN)
  C: const MKL_INT64
Description: Overrides global VML mode setting for this function call. See vmlSetMode for possible values and their description.

Output Parameters

Name: y
Type:
  FORTRAN 77: REAL for vslog1p, vmslog1p; DOUBLE PRECISION for vdlog1p, vmdlog1p
  Fortran 90: REAL, INTENT(OUT) for vslog1p, vmslog1p; DOUBLE PRECISION, INTENT(OUT) for vdlog1p, vmdlog1p
  C: float* for vsLog1p, vmsLog1p; double* for vdLog1p, vmdLog1p
Description:
  FORTRAN: Array that specifies the output vector y.
  C: Pointer to an array that contains the output vector y.

Description

The v?Log1p function computes a natural logarithm of vector elements that are increased by 1.

Special Values for Real Function v?Log1p(x)

Argument        Result   VML Error Status        Exception
-1              -∞       VML_STATUS_SING         ZERODIVIDE
X < -1          QNAN     VML_STATUS_ERRDOM       INVALID
+0              +0
-0              -0
-∞              QNAN     VML_STATUS_ERRDOM       INVALID
+∞              +∞
QNAN            QNAN
SNAN            QNAN                             INVALID

Trigonometric Functions

v?Cos
Computes cosine of vector elements.

Syntax

Fortran:
call vscos( n, a, y )
call vmscos( n, a, y, mode )
call vdcos( n, a, y )
call vmdcos( n, a, y, mode )
call vccos( n, a, y )
call vmccos( n, a, y, mode )
call vzcos( n, a, y )
call vmzcos( n, a, y, mode )

C:
vsCos( n, a, y );
vmsCos( n, a, y, mode );
vdCos( n, a, y );
vmdCos( n, a, y, mode );
vcCos( n, a, y );
vmcCos( n, a, y, mode );
vzCos( n, a, y );
vmzCos( n, a, y, mode );

Include Files
• FORTRAN 77: mkl_vml.f77
• Fortran 90: mkl_vml.f90
• C: mkl_vml_functions.h

Input Parameters

Name: n
Type:
  FORTRAN 77: INTEGER
  Fortran 90: INTEGER, INTENT(IN)
  C: const int
Description: Specifies the number of elements to be calculated.

Name: a
Type:
  FORTRAN 77: REAL for vscos, vmscos; DOUBLE PRECISION for vdcos, vmdcos; COMPLEX for vccos, vmccos; DOUBLE COMPLEX for vzcos, vmzcos
  Fortran 90: REAL, INTENT(IN) for vscos, vmscos; DOUBLE PRECISION, INTENT(IN) for vdcos, vmdcos; COMPLEX, INTENT(IN) for vccos, vmccos; DOUBLE COMPLEX, INTENT(IN) for vzcos, vmzcos
  C: const float* for vsCos, vmsCos; const double* for vdCos, vmdCos; const MKL_Complex8* for vcCos, vmcCos; const MKL_Complex16* for vzCos, vmzCos
Description:
  FORTRAN: Array that specifies the input vector a.
  C: Pointer to an array that contains the input vector a.

Name: mode
Type:
  FORTRAN 77: INTEGER*8
  Fortran 90: INTEGER(KIND=8), INTENT(IN)
  C: const MKL_INT64
Description: Overrides global VML mode setting for this function call. See vmlSetMode for possible values and their description.

Output Parameters

Name: y
Type:
  FORTRAN 77: REAL for vscos, vmscos; DOUBLE PRECISION for vdcos, vmdcos; COMPLEX for vccos, vmccos; DOUBLE COMPLEX for vzcos, vmzcos
  Fortran 90: REAL, INTENT(OUT) for vscos, vmscos; DOUBLE PRECISION, INTENT(OUT) for vdcos, vmdcos; COMPLEX, INTENT(OUT) for vccos, vmccos; DOUBLE COMPLEX, INTENT(OUT) for vzcos, vmzcos
  C: float* for vsCos, vmsCos; double* for vdCos, vmdCos; MKL_Complex8* for vcCos, vmcCos; MKL_Complex16* for vzCos, vmzCos
Description:
  FORTRAN: Array that specifies the output vector y.
  C: Pointer to an array that contains the output vector y.

Description

The v?Cos function computes cosine of vector elements.

Arguments with abs(a[i]) ≤ 2^13 (single precision) and abs(a[i]) ≤ 2^16 (double precision) belong to the fast computational path. These are trigonometric function arguments for which VML provides the best possible performance. Avoid arguments that do not belong to the fast computational path in the VML High Accuracy (HA) and Low Accuracy (LA) functions. Alternatively, you can use VML Enhanced Performance (EP) functions that are fast on the entire function domain.
However, these functions provide less accuracy.

Special Values for Real Function v?Cos(x)

Argument        Result   VML Error Status        Exception
+0              +1
-0              +1
+∞              QNAN     VML_STATUS_ERRDOM       INVALID
-∞              QNAN     VML_STATUS_ERRDOM       INVALID
QNAN            QNAN
SNAN            QNAN                             INVALID

Specifications for special values of the complex functions are defined according to the following formula:
Cos(z) = Cosh(i*z).

v?Sin
Computes sine of vector elements.

Syntax

Fortran:
call vssin( n, a, y )
call vmssin( n, a, y, mode )
call vdsin( n, a, y )
call vmdsin( n, a, y, mode )
call vcsin( n, a, y )
call vmcsin( n, a, y, mode )
call vzsin( n, a, y )
call vmzsin( n, a, y, mode )

C:
vsSin( n, a, y );
vmsSin( n, a, y, mode );
vdSin( n, a, y );
vmdSin( n, a, y, mode );
vcSin( n, a, y );
vmcSin( n, a, y, mode );
vzSin( n, a, y );
vmzSin( n, a, y, mode );

Include Files
• FORTRAN 77: mkl_vml.f77
• Fortran 90: mkl_vml.f90
• C: mkl_vml_functions.h

Input Parameters

Name: n
Type:
  FORTRAN 77: INTEGER
  Fortran 90: INTEGER, INTENT(IN)
  C: const int
Description: Specifies the number of elements to be calculated.

Name: a
Type:
  FORTRAN 77: REAL for vssin, vmssin; DOUBLE PRECISION for vdsin, vmdsin; COMPLEX for vcsin, vmcsin; DOUBLE COMPLEX for vzsin, vmzsin
  Fortran 90: REAL, INTENT(IN) for vssin, vmssin; DOUBLE PRECISION, INTENT(IN) for vdsin, vmdsin; COMPLEX, INTENT(IN) for vcsin, vmcsin; DOUBLE COMPLEX, INTENT(IN) for vzsin, vmzsin
  C: const float* for vsSin, vmsSin; const double* for vdSin, vmdSin; const MKL_Complex8* for vcSin, vmcSin; const MKL_Complex16* for vzSin, vmzSin
Description:
  FORTRAN: Array that specifies the input vector a.
  C: Pointer to an array that contains the input vector a.

Name: mode
Type:
  FORTRAN 77: INTEGER*8
  Fortran 90: INTEGER(KIND=8), INTENT(IN)
  C: const MKL_INT64
Description: Overrides global VML mode setting for this function call. See vmlSetMode for possible values and their description.

Output Parameters

Name: y
Type:
  FORTRAN 77: REAL for vssin, vmssin; DOUBLE PRECISION for vdsin, vmdsin; COMPLEX for vcsin, vmcsin; DOUBLE COMPLEX for vzsin, vmzsin
  Fortran 90: REAL, INTENT(OUT) for vssin, vmssin; DOUBLE PRECISION, INTENT(OUT) for vdsin, vmdsin; COMPLEX, INTENT(OUT) for vcsin, vmcsin; DOUBLE COMPLEX, INTENT(OUT) for vzsin, vmzsin
  C: float* for vsSin, vmsSin; double* for vdSin, vmdSin; MKL_Complex8* for vcSin, vmcSin; MKL_Complex16* for vzSin, vmzSin
Description:
  FORTRAN: Array that specifies the output vector y.
  C: Pointer to an array that contains the output vector y.

Description

This function is declared in mkl_vml.f77 for FORTRAN 77 interface, in mkl_vml.f90 for Fortran 90 interface, and in mkl_vml_functions.h for C interface.

The function computes sine of vector elements.

Arguments with abs(a[i]) ≤ 2^13 (single precision) and abs(a[i]) ≤ 2^16 (double precision) belong to the fast computational path. These are trigonometric function arguments for which VML provides the best possible performance. Avoid arguments that do not belong to the fast computational path in the VML High Accuracy (HA) and Low Accuracy (LA) functions. Alternatively, you can use VML Enhanced Performance (EP) functions that are fast on the entire function domain. However, these functions provide less accuracy.

Special Values for Real Function v?Sin(x)

Argument        Result   VML Error Status        Exception
+0              +0
-0              -0
+∞              QNAN     VML_STATUS_ERRDOM       INVALID
-∞              QNAN     VML_STATUS_ERRDOM       INVALID
QNAN            QNAN
SNAN            QNAN                             INVALID

Specifications for special values of the complex functions are defined according to the following formula:
Sin(z) = -i*Sinh(i*z).

v?SinCos
Computes sine and cosine of vector elements.

Syntax

Fortran:
call vssincos( n, a, y, z )
call vmssincos( n, a, y, z, mode )
call vdsincos( n, a, y, z )
call vmdsincos( n, a, y, z, mode )

C:
vsSinCos( n, a, y, z );
vmsSinCos( n, a, y, z, mode );
vdSinCos( n, a, y, z );
vmdSinCos( n, a, y, z, mode );

Include Files
• FORTRAN 77: mkl_vml.f77
• Fortran 90: mkl_vml.f90
• C: mkl_vml_functions.h

Input Parameters

Name: n
Type:
  FORTRAN 77: INTEGER
  Fortran 90: INTEGER, INTENT(IN)
  C: const int
Description: Specifies the number of elements to be calculated.

Name: a
Type:
  FORTRAN 77: REAL for vssincos, vmssincos; DOUBLE PRECISION for vdsincos, vmdsincos
  Fortran 90: REAL, INTENT(IN) for vssincos, vmssincos; DOUBLE PRECISION, INTENT(IN) for vdsincos, vmdsincos
  C: const float* for vsSinCos, vmsSinCos; const double* for vdSinCos, vmdSinCos
Description:
  FORTRAN: Array that specifies the input vector a.
  C: Pointer to an array that contains the input vector a.

Name: mode
Type:
  FORTRAN 77: INTEGER*8
  Fortran 90: INTEGER(KIND=8), INTENT(IN)
  C: const MKL_INT64
Description: Overrides global VML mode setting for this function call. See vmlSetMode for possible values and their description.

Output Parameters

Name: y, z
Type:
  FORTRAN 77: REAL for vssincos, vmssincos; DOUBLE PRECISION for vdsincos, vmdsincos
  Fortran 90: REAL, INTENT(OUT) for vssincos, vmssincos; DOUBLE PRECISION, INTENT(OUT) for vdsincos, vmdsincos
  C: float* for vsSinCos, vmsSinCos; double* for vdSinCos, vmdSinCos
Description:
  FORTRAN: Arrays that specify the output vectors y (for sine values) and z (for cosine values).
  C: Pointers to arrays that contain the output vectors y (for sine values) and z (for cosine values).

Description

This function is declared in mkl_vml.f77 for FORTRAN 77 interface, in mkl_vml.f90 for Fortran 90 interface, and in mkl_vml_functions.h for C interface.

The function computes sine and cosine of vector elements.

Arguments with abs(a[i]) ≤ 2^13 (single precision) and abs(a[i]) ≤ 2^16 (double precision) belong to the fast computational path. These are trigonometric function arguments for which VML provides the best possible performance. Avoid arguments that do not belong to the fast computational path in the VML High Accuracy (HA) and Low Accuracy (LA) functions. Alternatively, you can use VML Enhanced Performance (EP) functions that are fast on the entire function domain. However, these functions provide less accuracy.

Special Values for Real Function v?SinCos(x)

Argument   Result 1   Result 2   VML Error Status        Exception
+0         +0         +1
-0         -0         +1
+∞         QNAN       QNAN       VML_STATUS_ERRDOM       INVALID
-∞         QNAN       QNAN       VML_STATUS_ERRDOM       INVALID
QNAN       QNAN       QNAN
SNAN       QNAN       QNAN                               INVALID

Specifications for special values of the complex functions are defined according to the following formula:
Sin(z) = -i*Sinh(i*z).

v?CIS
Computes complex exponent of real vector elements (cosine and sine of real vector elements combined to complex value).

Syntax

Fortran:
call vccis( n, a, y )
call vmccis( n, a, y, mode )
call vzcis( n, a, y )
call vmzcis( n, a, y, mode )

C:
vcCIS( n, a, y );
vmcCIS( n, a, y, mode );
vzCIS( n, a, y );
vmzCIS( n, a, y, mode );

Include Files
• FORTRAN 77: mkl_vml.f77
• Fortran 90: mkl_vml.f90
• C: mkl_vml_functions.h

Input Parameters

Name: n
Type:
  FORTRAN 77: INTEGER
  Fortran 90: INTEGER, INTENT(IN)
  C: const int
Description: Specifies the number of elements to be calculated.

Name: a
Type:
  FORTRAN 77: REAL for vccis, vmccis; DOUBLE PRECISION for vzcis, vmzcis
  Fortran 90: REAL, INTENT(IN) for vccis, vmccis; DOUBLE PRECISION, INTENT(IN) for vzcis, vmzcis
  C: const float* for vcCIS, vmcCIS; const double* for vzCIS, vmzCIS
Description:
  FORTRAN: Array that specifies the input vector a.
  C: Pointer to an array that contains the input vector a.

Name: mode
Type:
  FORTRAN 77: INTEGER*8
  Fortran 90: INTEGER(KIND=8), INTENT(IN)
  C: const MKL_INT64
Description: Overrides global VML mode setting for this function call.
See vmlSetMode for possible values and their description.

Output Parameters

Name: y
Type:
  FORTRAN 77: COMPLEX for vccis, vmccis; DOUBLE COMPLEX for vzcis, vmzcis
  Fortran 90: COMPLEX, INTENT(OUT) for vccis, vmccis; DOUBLE COMPLEX, INTENT(OUT) for vzcis, vmzcis
  C: MKL_Complex8* for vcCIS, vmcCIS; MKL_Complex16* for vzCIS, vmzCIS
Description:
  FORTRAN: Array that specifies the output vector y.
  C: Pointer to an array that contains the output vector y.

Description

The v?CIS function computes complex exponent of real vector elements (cosine and sine of real vector elements combined to complex value).

See the Special Value Notations section for the conventions used in the table below.

Special Values for Complex Function v?CIS(x)

x       CIS(x)
+∞      QNAN+i·QNAN, INVALID
+0      +1+i·0
-0      +1-i·0
-∞      QNAN+i·QNAN, INVALID
NAN     QNAN+i·QNAN

Notes:
• raises INVALID exception when the argument is SNAN
• raises INVALID exception and sets the VML Error Status to VML_STATUS_ERRDOM for x=+∞, x=-∞

v?Tan
Computes tangent of vector elements.

Syntax

Fortran:
call vstan( n, a, y )
call vmstan( n, a, y, mode )
call vdtan( n, a, y )
call vmdtan( n, a, y, mode )
call vctan( n, a, y )
call vmctan( n, a, y, mode )
call vztan( n, a, y )
call vmztan( n, a, y, mode )

C:
vsTan( n, a, y );
vmsTan( n, a, y, mode );
vdTan( n, a, y );
vmdTan( n, a, y, mode );
vcTan( n, a, y );
vmcTan( n, a, y, mode );
vzTan( n, a, y );
vmzTan( n, a, y, mode );

Include Files
• FORTRAN 77: mkl_vml.f77
• Fortran 90: mkl_vml.f90
• C: mkl_vml_functions.h

Input Parameters

Name: n
Type:
  FORTRAN 77: INTEGER
  Fortran 90: INTEGER, INTENT(IN)
  C: const int
Description: Specifies the number of elements to be calculated.

Name: a
Type:
  FORTRAN 77: REAL for vstan, vmstan; DOUBLE PRECISION for vdtan, vmdtan; COMPLEX for vctan, vmctan; DOUBLE COMPLEX for vztan, vmztan
  Fortran 90: REAL, INTENT(IN) for vstan, vmstan; DOUBLE PRECISION, INTENT(IN) for vdtan, vmdtan; COMPLEX, INTENT(IN) for vctan, vmctan; DOUBLE COMPLEX, INTENT(IN) for vztan, vmztan
  C: const float* for vsTan, vmsTan; const double* for vdTan, vmdTan; const MKL_Complex8* for vcTan, vmcTan; const MKL_Complex16* for vzTan, vmzTan
Description:
  FORTRAN: Array that specifies the input vector a.
  C: Pointer to an array that contains the input vector a.

Name: mode
Type:
  FORTRAN 77: INTEGER*8
  Fortran 90: INTEGER(KIND=8), INTENT(IN)
  C: const MKL_INT64
Description: Overrides global VML mode setting for this function call. See vmlSetMode for possible values and their description.

Output Parameters

Name: y
Type:
  FORTRAN 77: REAL for vstan, vmstan; DOUBLE PRECISION for vdtan, vmdtan; COMPLEX for vctan, vmctan; DOUBLE COMPLEX for vztan, vmztan
  Fortran 90: REAL, INTENT(OUT) for vstan, vmstan; DOUBLE PRECISION, INTENT(OUT) for vdtan, vmdtan; COMPLEX, INTENT(OUT) for vctan, vmctan; DOUBLE COMPLEX, INTENT(OUT) for vztan, vmztan
  C: float* for vsTan, vmsTan; double* for vdTan, vmdTan; MKL_Complex8* for vcTan, vmcTan; MKL_Complex16* for vzTan, vmzTan
Description:
  FORTRAN: Array that specifies the output vector y.
  C: Pointer to an array that contains the output vector y.

Description

The v?Tan function computes tangent of vector elements.

Arguments with abs(a[i]) ≤ 2^13 (single precision) and abs(a[i]) ≤ 2^16 (double precision) belong to the fast computational path. These are trigonometric function arguments for which VML provides the best possible performance. Avoid arguments that do not belong to the fast computational path in the VML High Accuracy (HA) and Low Accuracy (LA) functions. Alternatively, you can use VML Enhanced Performance (EP) functions that are fast on the entire function domain. However, these functions provide less accuracy.
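
One way to act on the advice about the fast computational path is to scan the input before choosing an accuracy mode. The sketch below is illustrative only: the helper name and the macro are hypothetical, and the double-precision limit is assumed to be 2^16 = 65536, reading the limits in the paragraph above as powers of two (2^13 and 2^16).

```c
#include <stddef.h>

/* Assumed double-precision fast-path limit: 2^16. */
#define FAST_PATH_LIMIT_D 65536.0

/* Hypothetical helper, not an MKL routine: returns the number of
   elements of a[0..n-1] whose magnitude exceeds the assumed limit.
   A NaN fails both comparisons, so it is counted as off the path. */
size_t count_off_fast_path_d(size_t n, const double *a)
{
    size_t count = 0;
    for (size_t i = 0; i < n; ++i) {
        if (!(a[i] >= -FAST_PATH_LIMIT_D && a[i] <= FAST_PATH_LIMIT_D))
            ++count;
    }
    return count;
}
```

If the count is nonzero, a caller might reduce the arguments beforehand or consider the EP variants, accepting their lower accuracy.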

Special Values for Real Function v?Tan(x)

Argument        Result   VML Error Status        Exception
+0              +0
-0              -0
+∞              QNAN     VML_STATUS_ERRDOM       INVALID
-∞              QNAN     VML_STATUS_ERRDOM       INVALID
QNAN            QNAN
SNAN            QNAN                             INVALID

Specifications for special values of the complex functions are defined according to the following formula:
Tan(z) = -i*Tanh(i*z).

v?Acos
Computes inverse cosine of vector elements.

Syntax

Fortran:
call vsacos( n, a, y )
call vmsacos( n, a, y, mode )
call vdacos( n, a, y )
call vmdacos( n, a, y, mode )
call vcacos( n, a, y )
call vmcacos( n, a, y, mode )
call vzacos( n, a, y )
call vmzacos( n, a, y, mode )

C:
vsAcos( n, a, y );
vmsAcos( n, a, y, mode );
vdAcos( n, a, y );
vmdAcos( n, a, y, mode );
vcAcos( n, a, y );
vmcAcos( n, a, y, mode );
vzAcos( n, a, y );
vmzAcos( n, a, y, mode );

Include Files
• FORTRAN 77: mkl_vml.f77
• Fortran 90: mkl_vml.f90
• C: mkl_vml_functions.h

Input Parameters

Name: n
Type:
  FORTRAN 77: INTEGER
  Fortran 90: INTEGER, INTENT(IN)
  C: const int
Description: Specifies the number of elements to be calculated.

Name: a
Type:
  FORTRAN 77: REAL for vsacos, vmsacos; DOUBLE PRECISION for vdacos, vmdacos; COMPLEX for vcacos, vmcacos; DOUBLE COMPLEX for vzacos, vmzacos
  Fortran 90: REAL, INTENT(IN) for vsacos, vmsacos; DOUBLE PRECISION, INTENT(IN) for vdacos, vmdacos; COMPLEX, INTENT(IN) for vcacos, vmcacos; DOUBLE COMPLEX, INTENT(IN) for vzacos, vmzacos
  C: const float* for vsAcos, vmsAcos; const double* for vdAcos, vmdAcos; const MKL_Complex8* for vcAcos, vmcAcos; const MKL_Complex16* for vzAcos, vmzAcos
Description:
  FORTRAN: Array that specifies the input vector a.
  C: Pointer to an array that contains the input vector a.

Name: mode
Type:
  FORTRAN 77: INTEGER*8
  Fortran 90: INTEGER(KIND=8), INTENT(IN)
  C: const MKL_INT64
Description: Overrides global VML mode setting for this function call. See vmlSetMode for possible values and their description.

Output Parameters

Name: y
Type:
  FORTRAN 77: REAL for vsacos, vmsacos; DOUBLE PRECISION for vdacos, vmdacos; COMPLEX for vcacos, vmcacos; DOUBLE COMPLEX for vzacos, vmzacos
  Fortran 90: REAL, INTENT(OUT) for vsacos, vmsacos; DOUBLE PRECISION, INTENT(OUT) for vdacos, vmdacos; COMPLEX, INTENT(OUT) for vcacos, vmcacos; DOUBLE COMPLEX, INTENT(OUT) for vzacos, vmzacos
  C: float* for vsAcos, vmsAcos; double* for vdAcos, vmdAcos; MKL_Complex8* for vcAcos, vmcAcos; MKL_Complex16* for vzAcos, vmzAcos
Description:
  FORTRAN: Array that specifies the output vector y.
  C: Pointer to an array that contains the output vector y.

Description

The v?Acos function computes inverse cosine of vector elements.

Special Values for Real Function v?Acos(x)

Argument        Result   VML Error Status        Exception
+0              +π/2
-0              +π/2
+1              +0
-1              +π
|X| > 1         QNAN     VML_STATUS_ERRDOM       INVALID
+∞              QNAN     VML_STATUS_ERRDOM       INVALID
-∞              QNAN     VML_STATUS_ERRDOM       INVALID
QNAN            QNAN
SNAN            QNAN                             INVALID

See the Special Value Notations section for the conventions used in the table below.

Special Values for Complex Function v?Acos(z)

RE(z) columns: -∞, -X, -0, +0, +X, +∞, NAN
i·IM(z) rows:
+i·∞: QNAN-i·∞
+i·Y: +π-i·∞; +0-i·∞; QNAN+i·QNAN
+i·0: +π-i·∞; +0-i·∞; QNAN+i·QNAN
-i·0: +π+i·∞; +0+i·∞; QNAN+i·QNAN
-i·Y: +π+i·∞; +0+i·∞; QNAN+i·QNAN
-i·∞: QNAN+i·∞
+i·NAN: QNAN+i·∞; QNAN+i·QNAN; QNAN+i·QNAN; QNAN+i·∞; QNAN+i·QNAN

Notes:
• raises INVALID exception when real or imaginary part of the argument is SNAN
• Acos(CONJ(z))=CONJ(Acos(z)).

v?Asin
Computes inverse sine of vector elements.

Syntax

Fortran:
call vsasin( n, a, y )
call vmsasin( n, a, y, mode )
call vdasin( n, a, y )
call vmdasin( n, a, y, mode )
call vcasin( n, a, y )
call vmcasin( n, a, y, mode )
call vzasin( n, a, y )
call vmzasin( n, a, y, mode )

C:
vsAsin( n, a, y );
vmsAsin( n, a, y, mode );
vdAsin( n, a, y );
vmdAsin( n, a, y, mode );
vcAsin( n, a, y );
vmcAsin( n, a, y, mode );
vzAsin( n, a, y );
vmzAsin( n, a, y, mode );

Include Files
• FORTRAN 77: mkl_vml.f77
• Fortran 90: mkl_vml.f90
• C: mkl_vml_functions.h

Input Parameters

Name: n
Type:
  FORTRAN 77: INTEGER
  Fortran 90: INTEGER, INTENT(IN)
  C: const int
Description: Specifies the number of elements to be calculated.

Name: a
Type:
  FORTRAN 77: REAL for vsasin, vmsasin; DOUBLE PRECISION for vdasin, vmdasin; COMPLEX for vcasin, vmcasin; DOUBLE COMPLEX for vzasin, vmzasin
  Fortran 90: REAL, INTENT(IN) for vsasin, vmsasin; DOUBLE PRECISION, INTENT(IN) for vdasin, vmdasin; COMPLEX, INTENT(IN) for vcasin, vmcasin; DOUBLE COMPLEX, INTENT(IN) for vzasin, vmzasin
  C: const float* for vsAsin, vmsAsin; const double* for vdAsin, vmdAsin; const MKL_Complex8* for vcAsin, vmcAsin; const MKL_Complex16* for vzAsin, vmzAsin
Description:
  FORTRAN: Array that specifies the input vector a.
  C: Pointer to an array that contains the input vector a.

Name: mode
Type:
  FORTRAN 77: INTEGER*8
  Fortran 90: INTEGER(KIND=8), INTENT(IN)
  C: const MKL_INT64
Description: Overrides global VML mode setting for this function call. See vmlSetMode for possible values and their description.

Output Parameters

Name: y
Type:
  FORTRAN 77: REAL for vsasin, vmsasin; DOUBLE PRECISION for vdasin, vmdasin; COMPLEX for vcasin, vmcasin; DOUBLE COMPLEX for vzasin, vmzasin
  Fortran 90: REAL, INTENT(OUT) for vsasin, vmsasin; DOUBLE PRECISION, INTENT(OUT) for vdasin, vmdasin; COMPLEX, INTENT(OUT) for vcasin, vmcasin; DOUBLE COMPLEX, INTENT(OUT) for vzasin, vmzasin
  C: float* for vsAsin, vmsAsin; double* for vdAsin, vmdAsin; MKL_Complex8* for vcAsin, vmcAsin; MKL_Complex16* for vzAsin, vmzAsin
Description:
  FORTRAN: Array that specifies the output vector y.
  C: Pointer to an array that contains the output vector y.

Description

The v?Asin function computes inverse sine of vector elements.

Special Values for Real Function v?Asin(x)

Argument        Result   VML Error Status        Exception
+0              +0
-0              -0
+1              +π/2
-1              -π/2
|X| > 1         QNAN     VML_STATUS_ERRDOM       INVALID
+∞              QNAN     VML_STATUS_ERRDOM       INVALID
-∞              QNAN     VML_STATUS_ERRDOM       INVALID
QNAN            QNAN
SNAN            QNAN                             INVALID

Specifications for special values of the complex functions are defined according to the following formula:
Asin(z) = -i*Asinh(i*z).

v?Atan
Computes inverse tangent of vector elements.

Syntax

Fortran:
call vsatan( n, a, y )
call vmsatan( n, a, y, mode )
call vdatan( n, a, y )
call vmdatan( n, a, y, mode )
call vcatan( n, a, y )
call vmcatan( n, a, y, mode )
call vzatan( n, a, y )
call vmzatan( n, a, y, mode )

C:
vsAtan( n, a, y );
vmsAtan( n, a, y, mode );
vdAtan( n, a, y );
vmdAtan( n, a, y, mode );
vcAtan( n, a, y );
vmcAtan( n, a, y, mode );
vzAtan( n, a, y );
vmzAtan( n, a, y, mode );

Include Files
• FORTRAN 77: mkl_vml.f77
• Fortran 90: mkl_vml.f90
• C: mkl_vml_functions.h

Input Parameters

Name: n
Type:
  FORTRAN 77: INTEGER
  Fortran 90: INTEGER, INTENT(IN)
  C: const int
Description: Specifies the number of elements to be calculated.

Name: a
Type:
  FORTRAN 77: REAL for vsatan, vmsatan; DOUBLE PRECISION for vdatan, vmdatan; COMPLEX for vcatan, vmcatan; DOUBLE COMPLEX for vzatan, vmzatan
  Fortran 90: REAL, INTENT(IN) for vsatan, vmsatan; DOUBLE PRECISION, INTENT(IN) for vdatan, vmdatan; COMPLEX, INTENT(IN) for vcatan, vmcatan; DOUBLE COMPLEX, INTENT(IN) for vzatan, vmzatan
  C: const float* for vsAtan, vmsAtan; const double* for vdAtan, vmdAtan; const MKL_Complex8* for vcAtan, vmcAtan; const MKL_Complex16* for vzAtan, vmzAtan
Description:
  FORTRAN: Array that specifies the input vector a.
  C: Pointer to an array that contains the input vector a.

Name: mode
Type:
  FORTRAN 77: INTEGER*8
  Fortran 90: INTEGER(KIND=8), INTENT(IN)
  C: const MKL_INT64
Description: Overrides global VML mode setting for this function call. See vmlSetMode for possible values and their description.

Output Parameters

Name: y
Type:
  FORTRAN 77: REAL for vsatan, vmsatan; DOUBLE PRECISION for vdatan, vmdatan; COMPLEX for vcatan, vmcatan; DOUBLE COMPLEX for vzatan, vmzatan
  Fortran 90: REAL, INTENT(OUT) for vsatan, vmsatan; DOUBLE PRECISION, INTENT(OUT) for vdatan, vmdatan; COMPLEX, INTENT(OUT) for vcatan, vmcatan; DOUBLE COMPLEX, INTENT(OUT) for vzatan, vmzatan
  C: float* for vsAtan, vmsAtan; double* for vdAtan, vmdAtan; MKL_Complex8* for vcAtan, vmcAtan; MKL_Complex16* for vzAtan, vmzAtan
Description:
  FORTRAN: Array that specifies the output vector y.
  C: Pointer to an array that contains the output vector y.

Description

The v?Atan function computes inverse tangent of vector elements.

Special Values for Real Function v?Atan(x)

Argument        Result   VML Error Status        Exception
+0              +0
-0              -0
+∞              +π/2
-∞              -π/2
QNAN            QNAN
SNAN            QNAN                             INVALID

Specifications for special values of the complex functions are defined according to the following formula:
Atan(z) = -i*Atanh(i*z).

v?Atan2
Computes four-quadrant inverse tangent of elements of two vectors.
Syntax

Fortran:
call vsatan2( n, a, b, y )
call vmsatan2( n, a, b, y, mode )
call vdatan2( n, a, b, y )
call vmdatan2( n, a, b, y, mode )

C:
vsAtan2( n, a, b, y );
vmsAtan2( n, a, b, y, mode );
vdAtan2( n, a, b, y );
vmdAtan2( n, a, b, y, mode );

Include Files
• FORTRAN 77: mkl_vml.f77
• Fortran 90: mkl_vml.f90
• C: mkl_vml_functions.h

Input Parameters

n
    Type: FORTRAN 77: INTEGER; Fortran 90: INTEGER, INTENT(IN); C: const int
    Description: Specifies the number of elements to be calculated.

a, b
    Type:
        FORTRAN 77: REAL for vsatan2, vmsatan2; DOUBLE PRECISION for vdatan2, vmdatan2
        Fortran 90: REAL, INTENT(IN) for vsatan2, vmsatan2; DOUBLE PRECISION, INTENT(IN) for vdatan2, vmdatan2
        C: const float* for vsAtan2, vmsAtan2; const double* for vdAtan2, vmdAtan2
    Description:
        FORTRAN: Arrays that specify the input vectors a and b.
        C: Pointers to arrays that contain the input vectors a and b.

mode
    Type: FORTRAN 77: INTEGER*8; Fortran 90: INTEGER(KIND=8), INTENT(IN); C: const MKL_INT64
    Description: Overrides global VML mode setting for this function call. See vmlSetMode for possible values and their description.

Output Parameters

y
    Type:
        FORTRAN 77: REAL for vsatan2, vmsatan2; DOUBLE PRECISION for vdatan2, vmdatan2
        Fortran 90: REAL, INTENT(OUT) for vsatan2, vmsatan2; DOUBLE PRECISION, INTENT(OUT) for vdatan2, vmdatan2
        C: float* for vsAtan2, vmsAtan2; double* for vdAtan2, vmdAtan2
    Description:
        FORTRAN: Array that specifies the output vector y.
        C: Pointer to an array that contains the output vector y.

Description

The v?Atan2 function computes four-quadrant inverse tangent of elements of two vectors. The elements of the output vector y are computed as the four-quadrant arctangent of a[i] / b[i].
Special Values for Real Function v?Atan2(x)

Argument 1    Argument 2    Result    Exception
-∞            -∞            -3*π/4
-∞            X < +0        -π/2
-∞            -0            -π/2
-∞            +0            -π/2
-∞            X > +0        -π/2
-∞            +∞            -π/4
X < +0        -∞            -π
X < +0        -0            -π/2
X < +0        +0            -π/2
X < +0        +∞            -0
-0            -∞            -π
-0            X < +0        -π
-0            -0            -π
-0            +0            -0
-0            X > +0        -0
-0            +∞            -0
+0            -∞            +π
+0            X < +0        +π
+0            -0            +π
+0            +0            +0
+0            X > +0        +0
+0            +∞            +0
X > +0        -∞            +π
X > +0        -0            +π/2
X > +0        +0            +π/2
X > +0        +∞            +0
+∞            -∞            +3*π/4
+∞            X < +0        +π/2
+∞            -0            +π/2
+∞            +0            +π/2
+∞            X > +0        +π/2
+∞            +∞            +π/4
X > +0        QNAN          QNAN
X > +0        SNAN          QNAN      INVALID
QNAN          X > +0        QNAN
SNAN          X > +0        QNAN      INVALID
QNAN          QNAN          QNAN
QNAN          SNAN          QNAN      INVALID
SNAN          QNAN          QNAN      INVALID
SNAN          SNAN          QNAN      INVALID

Hyperbolic Functions

v?Cosh
Computes hyperbolic cosine of vector elements.

Syntax

Fortran:
call vscosh( n, a, y )
call vmscosh( n, a, y, mode )
call vdcosh( n, a, y )
call vmdcosh( n, a, y, mode )
call vccosh( n, a, y )
call vmccosh( n, a, y, mode )
call vzcosh( n, a, y )
call vmzcosh( n, a, y, mode )

C:
vsCosh( n, a, y );
vmsCosh( n, a, y, mode );
vdCosh( n, a, y );
vmdCosh( n, a, y, mode );
vcCosh( n, a, y );
vmcCosh( n, a, y, mode );
vzCosh( n, a, y );
vmzCosh( n, a, y, mode );

Include Files
• FORTRAN 77: mkl_vml.f77
• Fortran 90: mkl_vml.f90
• C: mkl_vml_functions.h

Input Parameters

n
    Type: FORTRAN 77: INTEGER; Fortran 90: INTEGER, INTENT(IN); C: const int
    Description: Specifies the number of elements to be calculated.

a
    Type:
        FORTRAN 77: REAL for vscosh, vmscosh; DOUBLE PRECISION for vdcosh, vmdcosh; COMPLEX for vccosh, vmccosh; DOUBLE COMPLEX for vzcosh, vmzcosh
        Fortran 90: REAL, INTENT(IN) for vscosh, vmscosh; DOUBLE PRECISION, INTENT(IN) for vdcosh, vmdcosh; COMPLEX, INTENT(IN) for vccosh, vmccosh; DOUBLE COMPLEX, INTENT(IN) for vzcosh, vmzcosh
        C: const float* for vsCosh, vmsCosh; const double* for vdCosh, vmdCosh; const MKL_Complex8* for vcCosh, vmcCosh; const MKL_Complex16* for vzCosh, vmzCosh
    Description:
        FORTRAN: Array that specifies the input vector a.
        C: Pointer to an array that contains the input vector a.

mode
    Type: FORTRAN 77: INTEGER*8; Fortran 90: INTEGER(KIND=8), INTENT(IN); C: const MKL_INT64
    Description: Overrides global VML mode setting for this function call. See vmlSetMode for possible values and their description.

Precision Overflow Thresholds for Real v?Cosh Function

Data Type           Threshold Limitations on Input Parameters
single precision    -Ln(FLT_MAX)-Ln2

Special Values for Real Function v?Cosh(x)

Argument        Result    VML Error Status        Exception
X > overflow    +∞        VML_STATUS_OVERFLOW     OVERFLOW
X < -overflow   +∞        VML_STATUS_OVERFLOW     OVERFLOW
+∞              +∞
-∞              +∞
QNAN            QNAN
SNAN            QNAN                              INVALID

See the Special Value Notations section for the conventions used in the table below.

Special Values for Complex Function v?Cosh(z)

i·IM(z) \ RE(z)  -∞                    -X                    -0                    +0                    +X                    +∞                    NAN
+i·∞             +∞+i·QNAN INVALID     QNAN+i·QNAN INVALID   QNAN-i·0 INVALID      QNAN+i·0 INVALID      QNAN+i·QNAN INVALID   +∞+i·QNAN INVALID     QNAN+i·QNAN
+i·Y             +∞·Cos(Y)-i·∞·Sin(Y)                                                                                          +∞·CIS(Y)             QNAN+i·QNAN
+i·0             +∞-i·0                                      +1-i·0                +1+i·0                                      +∞+i·0                QNAN+i·0
-i·0             +∞+i·0                                      +1+i·0                +1-i·0                                      +∞-i·0                QNAN-i·0
-i·Y             +∞·Cos(Y)-i·∞·Sin(Y)                                                                                          +∞·CIS(Y)             QNAN+i·QNAN
-i·∞             +∞+i·QNAN INVALID     QNAN+i·QNAN INVALID   QNAN+i·0 INVALID      QNAN-i·0 INVALID      QNAN+i·QNAN INVALID   +∞+i·QNAN INVALID     QNAN+i·QNAN
+i·NAN           +∞+i·QNAN             QNAN+i·QNAN           QNAN+i·QNAN           QNAN-i·QNAN           QNAN+i·QNAN           +∞+i·QNAN             QNAN+i·QNAN

Notes:
• raises the INVALID exception when the real or imaginary part of the argument is SNAN
• raises the OVERFLOW exception and sets the VML Error Status to VML_STATUS_OVERFLOW in the case of overflow, that is, when RE(z), IM(z) are finite non-zero numbers, but the real or imaginary part of the exact result is so large that it does not meet the target precision.
• Cosh(CONJ(z))=CONJ(Cosh(z))
• Cosh(-z)=Cosh(z).

v?Sinh
Computes hyperbolic sine of vector elements.
Syntax

Fortran:
call vssinh( n, a, y )
call vmssinh( n, a, y, mode )
call vdsinh( n, a, y )
call vmdsinh( n, a, y, mode )
call vcsinh( n, a, y )
call vmcsinh( n, a, y, mode )
call vzsinh( n, a, y )
call vmzsinh( n, a, y, mode )

C:
vsSinh( n, a, y );
vmsSinh( n, a, y, mode );
vdSinh( n, a, y );
vmdSinh( n, a, y, mode );
vcSinh( n, a, y );
vmcSinh( n, a, y, mode );
vzSinh( n, a, y );
vmzSinh( n, a, y, mode );

Include Files
• FORTRAN 77: mkl_vml.f77
• Fortran 90: mkl_vml.f90
• C: mkl_vml_functions.h

Input Parameters

n
    Type: FORTRAN 77: INTEGER; Fortran 90: INTEGER, INTENT(IN); C: const int
    Description: Specifies the number of elements to be calculated.

a
    Type:
        FORTRAN 77: REAL for vssinh, vmssinh; DOUBLE PRECISION for vdsinh, vmdsinh; COMPLEX for vcsinh, vmcsinh; DOUBLE COMPLEX for vzsinh, vmzsinh
        Fortran 90: REAL, INTENT(IN) for vssinh, vmssinh; DOUBLE PRECISION, INTENT(IN) for vdsinh, vmdsinh; COMPLEX, INTENT(IN) for vcsinh, vmcsinh; DOUBLE COMPLEX, INTENT(IN) for vzsinh, vmzsinh
        C: const float* for vsSinh, vmsSinh; const double* for vdSinh, vmdSinh; const MKL_Complex8* for vcSinh, vmcSinh; const MKL_Complex16* for vzSinh, vmzSinh
    Description:
        FORTRAN: Array that specifies the input vector a.
        C: Pointer to an array that contains the input vector a.

mode
    Type: FORTRAN 77: INTEGER*8; Fortran 90: INTEGER(KIND=8), INTENT(IN); C: const MKL_INT64
    Description: Overrides global VML mode setting for this function call. See vmlSetMode for possible values and their description.

Precision Overflow Thresholds for Real v?Sinh Function

Data Type           Threshold Limitations on Input Parameters
single precision    -Ln(FLT_MAX)-Ln2

Special Values for Real Function v?Sinh(x)

Argument        Result    VML Error Status        Exception
X > overflow    +∞        VML_STATUS_OVERFLOW     OVERFLOW
X < -overflow   -∞        VML_STATUS_OVERFLOW     OVERFLOW
+∞              +∞
-∞              -∞
QNAN            QNAN
SNAN            QNAN                              INVALID

See the Special Value Notations section for the conventions used in the table below.
Special Values for Complex Function v?Sinh(z)

i·IM(z) \ RE(z)  -∞                    -X                    -0                    +0                    +X                    +∞                    NAN
+i·∞             -∞+i·QNAN INVALID     QNAN+i·QNAN INVALID   -0+i·QNAN INVALID     +0+i·QNAN INVALID     QNAN+i·QNAN INVALID   +∞+i·QNAN INVALID     QNAN+i·QNAN
+i·Y             -∞·Cos(Y)+i·∞·Sin(Y)                                                                                          +∞·CIS(Y)             QNAN+i·QNAN
+i·0             -∞+i·0                                      -0+i·0                +0+i·0                                      +∞+i·0                QNAN+i·0
-i·0             -∞-i·0                                      -0-i·0                +0-i·0                                      +∞-i·0                QNAN-i·0
-i·Y             -∞·Cos(Y)+i·∞·Sin(Y)                                                                                          +∞·CIS(Y)             QNAN+i·QNAN
-i·∞             -∞+i·QNAN INVALID     QNAN+i·QNAN INVALID   -0+i·QNAN INVALID     +0+i·QNAN INVALID     QNAN+i·QNAN INVALID   +∞+i·QNAN INVALID     QNAN+i·QNAN
+i·NAN           -∞+i·QNAN             QNAN+i·QNAN           -0+i·QNAN             +0+i·QNAN             QNAN+i·QNAN           +∞+i·QNAN             QNAN+i·QNAN

Notes:
• raises the INVALID exception when the real or imaginary part of the argument is SNAN
• raises the OVERFLOW exception and sets the VML Error Status to VML_STATUS_OVERFLOW in the case of overflow, that is, when RE(z), IM(z) are finite non-zero numbers, but the real or imaginary part of the exact result is so large that it does not meet the target precision.
• Sinh(CONJ(z))=CONJ(Sinh(z))
• Sinh(-z)=-Sinh(z).

v?Tanh
Computes hyperbolic tangent of vector elements.

Syntax

Fortran:
call vstanh( n, a, y )
call vmstanh( n, a, y, mode )
call vdtanh( n, a, y )
call vmdtanh( n, a, y, mode )
call vctanh( n, a, y )
call vmctanh( n, a, y, mode )
call vztanh( n, a, y )
call vmztanh( n, a, y, mode )

C:
vsTanh( n, a, y );
vmsTanh( n, a, y, mode );
vdTanh( n, a, y );
vmdTanh( n, a, y, mode );
vcTanh( n, a, y );
vmcTanh( n, a, y, mode );
vzTanh( n, a, y );
vmzTanh( n, a, y, mode );

Include Files
• FORTRAN 77: mkl_vml.f77
• Fortran 90: mkl_vml.f90
• C: mkl_vml_functions.h

Input Parameters

n
    Type: FORTRAN 77: INTEGER; Fortran 90: INTEGER, INTENT(IN); C: const int
    Description: Specifies the number of elements to be calculated.
a
    Type:
        FORTRAN 77: REAL for vstanh, vmstanh; DOUBLE PRECISION for vdtanh, vmdtanh; COMPLEX for vctanh, vmctanh; DOUBLE COMPLEX for vztanh, vmztanh
        Fortran 90: REAL, INTENT(IN) for vstanh, vmstanh; DOUBLE PRECISION, INTENT(IN) for vdtanh, vmdtanh; COMPLEX, INTENT(IN) for vctanh, vmctanh; DOUBLE COMPLEX, INTENT(IN) for vztanh, vmztanh
        C: const float* for vsTanh, vmsTanh; const double* for vdTanh, vmdTanh; const MKL_Complex8* for vcTanh, vmcTanh; const MKL_Complex16* for vzTanh, vmzTanh
    Description:
        FORTRAN: Array that specifies the input vector a.
        C: Pointer to an array that contains the input vector a.

mode
    Type: FORTRAN 77: INTEGER*8; Fortran 90: INTEGER(KIND=8), INTENT(IN); C: const MKL_INT64
    Description: Overrides global VML mode setting for this function call. See vmlSetMode for possible values and their description.

Output Parameters

y
    Type:
        FORTRAN 77: REAL for vstanh, vmstanh; DOUBLE PRECISION for vdtanh, vmdtanh; COMPLEX for vctanh, vmctanh; DOUBLE COMPLEX for vztanh, vmztanh
        Fortran 90: REAL, INTENT(OUT) for vstanh, vmstanh; DOUBLE PRECISION, INTENT(OUT) for vdtanh, vmdtanh; COMPLEX, INTENT(OUT) for vctanh, vmctanh; DOUBLE COMPLEX, INTENT(OUT) for vztanh, vmztanh
        C: float* for vsTanh, vmsTanh; double* for vdTanh, vmdTanh; MKL_Complex8* for vcTanh, vmcTanh; MKL_Complex16* for vzTanh, vmzTanh
    Description:
        FORTRAN: Array that specifies the output vector y.
        C: Pointer to an array that contains the output vector y.

Description

The v?Tanh function computes hyperbolic tangent of vector elements.

Special Values for Real Function v?Tanh(x)

Argument        Result    Exception
+0              +0
-0              -0
+∞              +1
-∞              -1
QNAN            QNAN
SNAN            QNAN      INVALID

See the Special Value Notations section for the conventions used in the table below.
Special Values for Complex Function v?Tanh(z)

i·IM(z) \ RE(z)  -∞                    -X                    -0                    +0                    +X                    +∞                    NAN
+i·∞             -1+i·0                QNAN+i·QNAN INVALID   QNAN+i·QNAN INVALID   QNAN+i·QNAN INVALID   QNAN+i·QNAN INVALID   +1+i·0                QNAN+i·QNAN
+i·Y             -1+i·0·Tan(Y)                                                                                                 +1+i·0·Tan(Y)         QNAN+i·QNAN
+i·0             -1+i·0                                      -0+i·0                +0+i·0                                      +1+i·0                QNAN+i·0
-i·0             -1-i·0                                      -0-i·0                +0-i·0                                      +1-i·0                QNAN-i·0
-i·Y             -1+i·0·Tan(Y)                                                                                                 +1+i·0·Tan(Y)         QNAN+i·QNAN
-i·∞             -1-i·0                QNAN+i·QNAN INVALID   QNAN+i·QNAN INVALID   QNAN+i·QNAN INVALID   QNAN+i·QNAN INVALID   +1-i·0                QNAN+i·QNAN
+i·NAN           -1+i·0                QNAN+i·QNAN           QNAN+i·QNAN           QNAN+i·QNAN           QNAN+i·QNAN           +1+i·0                QNAN+i·QNAN

Notes:
• raises INVALID exception when real or imaginary part of the argument is SNAN
• Tanh(CONJ(z))=CONJ(Tanh(z))
• Tanh(-z)=-Tanh(z).

v?Acosh
Computes inverse hyperbolic cosine (nonnegative) of vector elements.

Syntax

Fortran:
call vsacosh( n, a, y )
call vmsacosh( n, a, y, mode )
call vdacosh( n, a, y )
call vmdacosh( n, a, y, mode )
call vcacosh( n, a, y )
call vmcacosh( n, a, y, mode )
call vzacosh( n, a, y )
call vmzacosh( n, a, y, mode )

C:
vsAcosh( n, a, y );
vmsAcosh( n, a, y, mode );
vdAcosh( n, a, y );
vmdAcosh( n, a, y, mode );
vcAcosh( n, a, y );
vmcAcosh( n, a, y, mode );
vzAcosh( n, a, y );
vmzAcosh( n, a, y, mode );

Include Files
• FORTRAN 77: mkl_vml.f77
• Fortran 90: mkl_vml.f90
• C: mkl_vml_functions.h

Input Parameters

n
    Type: FORTRAN 77: INTEGER; Fortran 90: INTEGER, INTENT(IN); C: const int
    Description: Specifies the number of elements to be calculated.
a
    Type:
        FORTRAN 77: REAL for vsacosh, vmsacosh; DOUBLE PRECISION for vdacosh, vmdacosh; COMPLEX for vcacosh, vmcacosh; DOUBLE COMPLEX for vzacosh, vmzacosh
        Fortran 90: REAL, INTENT(IN) for vsacosh, vmsacosh; DOUBLE PRECISION, INTENT(IN) for vdacosh, vmdacosh; COMPLEX, INTENT(IN) for vcacosh, vmcacosh; DOUBLE COMPLEX, INTENT(IN) for vzacosh, vmzacosh
        C: const float* for vsAcosh, vmsAcosh; const double* for vdAcosh, vmdAcosh; const MKL_Complex8* for vcAcosh, vmcAcosh; const MKL_Complex16* for vzAcosh, vmzAcosh
    Description:
        FORTRAN: Array that specifies the input vector a.
        C: Pointer to an array that contains the input vector a.

mode
    Type: FORTRAN 77: INTEGER*8; Fortran 90: INTEGER(KIND=8), INTENT(IN); C: const MKL_INT64
    Description: Overrides global VML mode setting for this function call. See vmlSetMode for possible values and their description.

Output Parameters

y
    Type:
        FORTRAN 77: REAL for vsacosh, vmsacosh; DOUBLE PRECISION for vdacosh, vmdacosh; COMPLEX for vcacosh, vmcacosh; DOUBLE COMPLEX for vzacosh, vmzacosh
        Fortran 90: REAL, INTENT(OUT) for vsacosh, vmsacosh; DOUBLE PRECISION, INTENT(OUT) for vdacosh, vmdacosh; COMPLEX, INTENT(OUT) for vcacosh, vmcacosh; DOUBLE COMPLEX, INTENT(OUT) for vzacosh, vmzacosh
        C: float* for vsAcosh, vmsAcosh; double* for vdAcosh, vmdAcosh; MKL_Complex8* for vcAcosh, vmcAcosh; MKL_Complex16* for vzAcosh, vmzAcosh
    Description:
        FORTRAN: Array that specifies the output vector y.
        C: Pointer to an array that contains the output vector y.

Description

The v?Acosh function computes inverse hyperbolic cosine (nonnegative) of vector elements.

Special Values for Real Function v?Acosh(x)

Argument        Result    VML Error Status        Exception
+1              +0
X < +1          QNAN      VML_STATUS_ERRDOM       INVALID
-∞              QNAN      VML_STATUS_ERRDOM       INVALID
+∞              +∞
QNAN            QNAN
SNAN            QNAN                              INVALID

See the Special Value Notations section for the conventions used in the table below.
Special Values for Complex Function v?Acosh(z)

i·IM(z) \ RE(z)  -∞                    -X                    -0                    +0                    +X                    +∞                    NAN
+i·∞                                   +∞+i·π/2              +∞+i·π/2              +∞+i·π/2              +∞+i·π/2              +∞+i·π/4              +∞+i·QNAN
+i·Y             +∞+i·π                                                                                                        +∞+i·0                QNAN+i·QNAN
+i·0             +∞+i·π                                      +0+i·π/2              +0+i·π/2                                    +∞+i·0                QNAN+i·QNAN
-i·0             +∞-i·π                                      +0-i·π/2              +0-i·π/2                                    +∞-i·0                QNAN+i·QNAN
-i·Y             +∞-i·π                                                                                                        +∞-i·0                QNAN+i·QNAN
-i·∞                                   +∞-i·π/2              +∞-i·π/2              +∞-i·π/2              +∞-i·π/2              +∞-i·π/4              +∞+i·QNAN
+i·NAN           +∞+i·QNAN             QNAN+i·QNAN           QNAN+i·QNAN           QNAN+i·QNAN           QNAN+i·QNAN           +∞+i·QNAN             QNAN+i·QNAN

Notes:
• raises INVALID exception when real or imaginary part of the argument is SNAN
• Acosh(CONJ(z))=CONJ(Acosh(z)).

v?Asinh
Computes inverse hyperbolic sine of vector elements.

Syntax

Fortran:
call vsasinh( n, a, y )
call vmsasinh( n, a, y, mode )
call vdasinh( n, a, y )
call vmdasinh( n, a, y, mode )
call vcasinh( n, a, y )
call vmcasinh( n, a, y, mode )
call vzasinh( n, a, y )
call vmzasinh( n, a, y, mode )

C:
vsAsinh( n, a, y );
vmsAsinh( n, a, y, mode );
vdAsinh( n, a, y );
vmdAsinh( n, a, y, mode );
vcAsinh( n, a, y );
vmcAsinh( n, a, y, mode );
vzAsinh( n, a, y );
vmzAsinh( n, a, y, mode );

Include Files
• FORTRAN 77: mkl_vml.f77
• Fortran 90: mkl_vml.f90
• C: mkl_vml_functions.h

Input Parameters

n
    Type: FORTRAN 77: INTEGER; Fortran 90: INTEGER, INTENT(IN); C: const int
    Description: Specifies the number of elements to be calculated.

a
    Type:
        FORTRAN 77: REAL for vsasinh, vmsasinh; DOUBLE PRECISION for vdasinh, vmdasinh; COMPLEX for vcasinh, vmcasinh; DOUBLE COMPLEX for vzasinh, vmzasinh
        Fortran 90: REAL, INTENT(IN) for vsasinh, vmsasinh; DOUBLE PRECISION, INTENT(IN) for vdasinh, vmdasinh; COMPLEX, INTENT(IN) for vcasinh, vmcasinh; DOUBLE COMPLEX, INTENT(IN) for vzasinh, vmzasinh
        C: const float* for vsAsinh, vmsAsinh; const double* for vdAsinh, vmdAsinh; const MKL_Complex8* for vcAsinh, vmcAsinh; const MKL_Complex16* for vzAsinh, vmzAsinh
    Description:
        FORTRAN: Array that specifies the input vector a.
        C: Pointer to an array that contains the input vector a.

mode
    Type: FORTRAN 77: INTEGER*8; Fortran 90: INTEGER(KIND=8), INTENT(IN); C: const MKL_INT64
    Description: Overrides global VML mode setting for this function call. See vmlSetMode for possible values and their description.

Output Parameters

y
    Type:
        FORTRAN 77: REAL for vsasinh, vmsasinh; DOUBLE PRECISION for vdasinh, vmdasinh; COMPLEX for vcasinh, vmcasinh; DOUBLE COMPLEX for vzasinh, vmzasinh
        Fortran 90: REAL, INTENT(OUT) for vsasinh, vmsasinh; DOUBLE PRECISION, INTENT(OUT) for vdasinh, vmdasinh; COMPLEX, INTENT(OUT) for vcasinh, vmcasinh; DOUBLE COMPLEX, INTENT(OUT) for vzasinh, vmzasinh
        C: float* for vsAsinh, vmsAsinh; double* for vdAsinh, vmdAsinh; MKL_Complex8* for vcAsinh, vmcAsinh; MKL_Complex16* for vzAsinh, vmzAsinh
    Description:
        FORTRAN: Array that specifies the output vector y.
        C: Pointer to an array that contains the output vector y.

Description

The v?Asinh function computes inverse hyperbolic sine of vector elements.

Special Values for Real Function v?Asinh(x)

Argument        Result    Exception
+0              +0
-0              -0
+∞              +∞
-∞              -∞
QNAN            QNAN
SNAN            QNAN      INVALID

See the Special Value Notations section for the conventions used in the table below.

Special Values for Complex Function v?Asinh(z)

i·IM(z) \ RE(z)  -∞                    -X                    -0                    +0                    +X                    +∞                    NAN
+i·∞             -∞+i·π/4              -∞+i·π/2              -∞+i·π/2              +∞+i·π/2              +∞+i·π/2              +∞+i·π/4              +∞+i·QNAN
+i·Y             -∞+i·0                                                                                                        +∞+i·0                QNAN+i·QNAN
+i·0             -∞+i·0                                      -0+i·0                +0+i·0                                      +∞+i·0                QNAN+i·QNAN
-i·0             -∞-i·0                                      -0-i·0                +0-i·0                                      +∞-i·0                QNAN-i·QNAN
-i·Y             -∞-i·0                                                                                                        +∞-i·0                QNAN+i·QNAN
-i·∞             -∞-i·π/4              -∞-i·π/2              -∞-i·π/2              +∞-i·π/2              +∞-i·π/2              +∞-i·π/4              +∞+i·QNAN
+i·NAN           -∞+i·QNAN             QNAN+i·QNAN           QNAN+i·QNAN           QNAN+i·QNAN           QNAN+i·QNAN           +∞+i·QNAN             QNAN+i·QNAN

Notes:
• raises INVALID exception when real or imaginary part of the argument is SNAN
• Asinh(CONJ(z))=CONJ(Asinh(z))
• Asinh(-z)=-Asinh(z).

v?Atanh
Computes inverse hyperbolic tangent of vector elements.
Syntax

Fortran:
call vsatanh( n, a, y )
call vmsatanh( n, a, y, mode )
call vdatanh( n, a, y )
call vmdatanh( n, a, y, mode )
call vcatanh( n, a, y )
call vmcatanh( n, a, y, mode )
call vzatanh( n, a, y )
call vmzatanh( n, a, y, mode )

C:
vsAtanh( n, a, y );
vmsAtanh( n, a, y, mode );
vdAtanh( n, a, y );
vmdAtanh( n, a, y, mode );
vcAtanh( n, a, y );
vmcAtanh( n, a, y, mode );
vzAtanh( n, a, y );
vmzAtanh( n, a, y, mode );

Include Files
• FORTRAN 77: mkl_vml.f77
• Fortran 90: mkl_vml.f90
• C: mkl_vml_functions.h

Input Parameters

n
    Type: FORTRAN 77: INTEGER; Fortran 90: INTEGER, INTENT(IN); C: const int
    Description: Specifies the number of elements to be calculated.

a
    Type:
        FORTRAN 77: REAL for vsatanh, vmsatanh; DOUBLE PRECISION for vdatanh, vmdatanh; COMPLEX for vcatanh, vmcatanh; DOUBLE COMPLEX for vzatanh, vmzatanh
        Fortran 90: REAL, INTENT(IN) for vsatanh, vmsatanh; DOUBLE PRECISION, INTENT(IN) for vdatanh, vmdatanh; COMPLEX, INTENT(IN) for vcatanh, vmcatanh; DOUBLE COMPLEX, INTENT(IN) for vzatanh, vmzatanh
        C: const float* for vsAtanh, vmsAtanh; const double* for vdAtanh, vmdAtanh; const MKL_Complex8* for vcAtanh, vmcAtanh; const MKL_Complex16* for vzAtanh, vmzAtanh
    Description:
        FORTRAN: Array that specifies the input vector a.
        C: Pointer to an array that contains the input vector a.

mode
    Type: FORTRAN 77: INTEGER*8; Fortran 90: INTEGER(KIND=8), INTENT(IN); C: const MKL_INT64
    Description: Overrides global VML mode setting for this function call. See vmlSetMode for possible values and their description.

Output Parameters

y
    Type:
        FORTRAN 77: REAL for vsatanh, vmsatanh; DOUBLE PRECISION for vdatanh, vmdatanh; COMPLEX for vcatanh, vmcatanh; DOUBLE COMPLEX for vzatanh, vmzatanh
        Fortran 90: REAL, INTENT(OUT) for vsatanh, vmsatanh; DOUBLE PRECISION, INTENT(OUT) for vdatanh, vmdatanh; COMPLEX, INTENT(OUT) for vcatanh, vmcatanh; DOUBLE COMPLEX, INTENT(OUT) for vzatanh, vmzatanh
        C: float* for vsAtanh, vmsAtanh; double* for vdAtanh, vmdAtanh; MKL_Complex8* for vcAtanh, vmcAtanh; MKL_Complex16* for vzAtanh, vmzAtanh
    Description:
        FORTRAN: Array that specifies the output vector y.
        C: Pointer to an array that contains the output vector y.

Description

The v?Atanh function computes inverse hyperbolic tangent of vector elements.

Special Values for Real Function v?Atanh(x)

Argument        Result    VML Error Status        Exception
+1              +∞        VML_STATUS_SING         ZERODIVIDE
-1              -∞        VML_STATUS_SING         ZERODIVIDE
|X| > 1         QNAN      VML_STATUS_ERRDOM       INVALID
+∞              QNAN      VML_STATUS_ERRDOM       INVALID
-∞              QNAN      VML_STATUS_ERRDOM       INVALID
QNAN            QNAN
SNAN            QNAN                              INVALID

See the Special Value Notations section for the conventions used in the table below.

Special Values for Complex Function v?Atanh(z)

i·IM(z) \ RE(z)  -∞                    -X                    -0                    +0                    +X                    +∞                    NAN
+i·∞             -0+i·π/2              -0+i·π/2              -0+i·π/2              +0+i·π/2              +0+i·π/2              +0+i·π/2              +0+i·π/2
+i·Y             -0+i·π/2                                                                                                      +0+i·π/2              QNAN+i·QNAN
+i·0             -0+i·π/2                                    -0+i·0                +0+i·0                                      +0+i·π/2              QNAN+i·QNAN
-i·0             -0-i·π/2                                    -0-i·0                +0-i·0                                      +0-i·π/2              QNAN-i·QNAN
-i·Y             -0-i·π/2                                                                                                      +0-i·π/2              QNAN+i·QNAN
-i·∞             -0-i·π/2              -0-i·π/2              -0-i·π/2              +0-i·π/2              +0-i·π/2              +0-i·π/2              +0-i·π/2
+i·NAN           -0+i·QNAN             QNAN+i·QNAN           -0+i·QNAN             +0+i·QNAN             QNAN+i·QNAN           +0+i·QNAN             QNAN+i·QNAN

Notes:
• Atanh(+-1+-i*0)=+-∞+-i*0, and ZERODIVIDE exception is raised
• raises INVALID exception when real or imaginary part of the argument is SNAN
• Atanh(CONJ(z))=CONJ(Atanh(z))
• Atanh(-z)=-Atanh(z).

Special Functions

v?Erf
Computes the error function value of vector elements.

Syntax

Fortran:
call vserf( n, a, y )
call vmserf( n, a, y, mode )
call vderf( n, a, y )
call vmderf( n, a, y, mode )

C:
vsErf( n, a, y );
vmsErf( n, a, y, mode );
vdErf( n, a, y );
vmdErf( n, a, y, mode );

Include Files
• FORTRAN 77: mkl_vml.f77
• Fortran 90: mkl_vml.f90
• C: mkl_vml_functions.h

Input Parameters

n
    Type: FORTRAN 77: INTEGER; Fortran 90: INTEGER, INTENT(IN); C: const int
    Description: Specifies the number of elements to be calculated.
a
    Type:
        FORTRAN 77: REAL for vserf, vmserf; DOUBLE PRECISION for vderf, vmderf
        Fortran 90: REAL, INTENT(IN) for vserf, vmserf; DOUBLE PRECISION, INTENT(IN) for vderf, vmderf
        C: const float* for vsErf, vmsErf; const double* for vdErf, vmdErf
    Description:
        FORTRAN: Array that specifies the input vector a.
        C: Pointer to an array that contains the input vector a.

mode
    Type: FORTRAN 77: INTEGER*8; Fortran 90: INTEGER(KIND=8), INTENT(IN); C: const MKL_INT64
    Description: Overrides global VML mode setting for this function call. See vmlSetMode for possible values and their description.

Output Parameters

y
    Type:
        FORTRAN 77: REAL for vserf, vmserf; DOUBLE PRECISION for vderf, vmderf
        Fortran 90: REAL, INTENT(OUT) for vserf, vmserf; DOUBLE PRECISION, INTENT(OUT) for vderf, vmderf
        C: float* for vsErf, vmsErf; double* for vdErf, vmdErf
    Description:
        FORTRAN: Array that specifies the output vector y.
        C: Pointer to an array that contains the output vector y.

Description

The Erf function computes the error function values for elements of the input vector a and writes them to the output vector y.

The error function is defined as given by:
$$\operatorname{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x e^{-t^2}\,dt$$

Useful relations:
$$\operatorname{erfc}(x) = 1 - \operatorname{erf}(x),$$
where erfc is the complementary error function.
$$\Phi(x) = \frac{1}{2}\operatorname{erf}\left(\frac{x}{\sqrt{2}}\right) + \frac{1}{2},$$
where
$$\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-t^2/2}\,dt$$
is the cumulative normal distribution function.
$$\Phi^{-1}(x) = \sqrt{2}\,\operatorname{erf}^{-1}(2x - 1),$$
where Φ^-1(x) and erf^-1(x) are the inverses to Φ(x) and erf(x) respectively.

The following figure illustrates the relationships among Erf family functions (Erf, Erfc, CdfNorm).

Figure: Erf Family Functions Relationship

Useful relations for these functions:
$$\operatorname{erf}(x) + \operatorname{erfc}(x) = 1$$
$$\Phi(x) = \frac{1}{2}\operatorname{erfc}\left(-\frac{x}{\sqrt{2}}\right)$$

Special Values for Real Function v?Erf(x)

Argument        Result    Exception
+∞              +1
-∞              -1
QNAN            QNAN
SNAN            QNAN      INVALID

See Also
v?Erfc
v?CdfNorm

v?Erfc
Computes the complementary error function value of vector elements.
Syntax

Fortran:
call vserfc( n, a, y )
call vmserfc( n, a, y, mode )
call vderfc( n, a, y )
call vmderfc( n, a, y, mode )

C:
vsErfc( n, a, y );
vmsErfc( n, a, y, mode );
vdErfc( n, a, y );
vmdErfc( n, a, y, mode );

Include Files
• FORTRAN 77: mkl_vml.f77
• Fortran 90: mkl_vml.f90
• C: mkl_vml_functions.h

Input Parameters

n
    Type: FORTRAN 77: INTEGER; Fortran 90: INTEGER, INTENT(IN); C: const int
    Description: Specifies the number of elements to be calculated.

a
    Type:
        FORTRAN 77: REAL for vserfc, vmserfc; DOUBLE PRECISION for vderfc, vmderfc
        Fortran 90: REAL, INTENT(IN) for vserfc, vmserfc; DOUBLE PRECISION, INTENT(IN) for vderfc, vmderfc
        C: const float* for vsErfc, vmsErfc; const double* for vdErfc, vmdErfc
    Description:
        FORTRAN: Array that specifies the input vector a.
        C: Pointer to an array that contains the input vector a.

mode
    Type: FORTRAN 77: INTEGER*8; Fortran 90: INTEGER(KIND=8), INTENT(IN); C: const MKL_INT64
    Description: Overrides global VML mode setting for this function call. See vmlSetMode for possible values and their description.

Output Parameters

y
    Type:
        FORTRAN 77: REAL for vserfc, vmserfc; DOUBLE PRECISION for vderfc, vmderfc
        Fortran 90: REAL, INTENT(OUT) for vserfc, vmserfc; DOUBLE PRECISION, INTENT(OUT) for vderfc, vmderfc
        C: float* for vsErfc, vmsErfc; double* for vdErfc, vmdErfc
    Description:
        FORTRAN: Array that specifies the output vector y.
        C: Pointer to an array that contains the output vector y.

Description

The Erfc function computes the complementary error function values for elements of the input vector a and writes them to the output vector y.

The complementary error function is defined as follows:
$$\operatorname{erfc}(x) = \frac{2}{\sqrt{\pi}} \int_x^{\infty} e^{-t^2}\,dt$$

Useful relations:
$$\operatorname{erfc}(x) = 1 - \operatorname{erf}(x)$$
$$\Phi(x) = \frac{1}{2}\operatorname{erf}\left(\frac{x}{\sqrt{2}}\right) + \frac{1}{2},$$
where
$$\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-t^2/2}\,dt$$
is the cumulative normal distribution function.
$$\Phi^{-1}(x) = \sqrt{2}\,\operatorname{erf}^{-1}(2x - 1),$$
where Φ^-1(x) and erf^-1(x) are the inverses to Φ(x) and erf(x) respectively.

See also Figure "Erf Family Functions Relationship" in Erf function description for Erfc function relationship with the other functions of Erf family.
Special Values for Real Function v?Erfc(x)

Argument        Result    VML Error Status        Exception
X > underflow   +0        VML_STATUS_UNDERFLOW    UNDERFLOW
+∞              +0
-∞              +2
QNAN            QNAN
SNAN            QNAN                              INVALID

See Also
v?Erf
v?CdfNorm

v?CdfNorm
Computes the cumulative normal distribution function values of vector elements.

Syntax

Fortran:
call vscdfnorm( n, a, y )
call vmscdfnorm( n, a, y, mode )
call vdcdfnorm( n, a, y )
call vmdcdfnorm( n, a, y, mode )

C:
vsCdfNorm( n, a, y );
vmsCdfNorm( n, a, y, mode );
vdCdfNorm( n, a, y );
vmdCdfNorm( n, a, y, mode );

Include Files
• FORTRAN 77: mkl_vml.f77
• Fortran 90: mkl_vml.f90
• C: mkl_vml_functions.h

Input Parameters

n
    Type: FORTRAN 77: INTEGER; Fortran 90: INTEGER, INTENT(IN); C: const int
    Description: Specifies the number of elements to be calculated.

a
    Type:
        FORTRAN 77: REAL for vscdfnorm, vmscdfnorm; DOUBLE PRECISION for vdcdfnorm, vmdcdfnorm
        Fortran 90: REAL, INTENT(IN) for vscdfnorm, vmscdfnorm; DOUBLE PRECISION, INTENT(IN) for vdcdfnorm, vmdcdfnorm
        C: const float* for vsCdfNorm, vmsCdfNorm; const double* for vdCdfNorm, vmdCdfNorm
    Description:
        FORTRAN: Array that specifies the input vector a.
        C: Pointer to an array that contains the input vector a.

mode
    Type: FORTRAN 77: INTEGER*8; Fortran 90: INTEGER(KIND=8), INTENT(IN); C: const MKL_INT64
    Description: Overrides global VML mode setting for this function call. See vmlSetMode for possible values and their description.

Output Parameters

y
    Type:
        FORTRAN 77: REAL for vscdfnorm, vmscdfnorm; DOUBLE PRECISION for vdcdfnorm, vmdcdfnorm
        Fortran 90: REAL, INTENT(OUT) for vscdfnorm, vmscdfnorm; DOUBLE PRECISION, INTENT(OUT) for vdcdfnorm, vmdcdfnorm
        C: float* for vsCdfNorm, vmsCdfNorm; double* for vdCdfNorm, vmdCdfNorm
    Description:
        FORTRAN: Array that specifies the output vector y.
        C: Pointer to an array that contains the output vector y.
Description

The CdfNorm function computes the cumulative normal distribution function values for elements of the input vector a and writes them to the output vector y.

The cumulative normal distribution function is defined as given by:
$$\operatorname{CdfNorm}(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-t^2/2}\,dt$$

Useful relations:
$$\operatorname{CdfNorm}(x) = \frac{1}{2}\left(1 + \operatorname{Erf}\left(\frac{x}{\sqrt{2}}\right)\right) = 1 - \frac{1}{2}\operatorname{Erfc}\left(\frac{x}{\sqrt{2}}\right),$$
where Erf and Erfc are the error and complementary error functions.

See also Figure "Erf Family Functions Relationship" in Erf function description for CdfNorm function relationship with the other functions of Erf family.

Special Values for Real Function v?CdfNorm(x)

Argument        Result    VML Error Status        Exception
X < underflow   +0        VML_STATUS_UNDERFLOW    UNDERFLOW
+∞              +1
-∞              +0
QNAN            QNAN
SNAN            QNAN                              INVALID

See Also
v?Erf
v?Erfc

v?ErfInv
Computes inverse error function value of vector elements.

Syntax

Fortran:
call vserfinv( n, a, y )
call vmserfinv( n, a, y, mode )
call vderfinv( n, a, y )
call vmderfinv( n, a, y, mode )

C:
vsErfInv( n, a, y );
vmsErfInv( n, a, y, mode );
vdErfInv( n, a, y );
vmdErfInv( n, a, y, mode );

Include Files
• FORTRAN 77: mkl_vml.f77
• Fortran 90: mkl_vml.f90
• C: mkl_vml_functions.h

Input Parameters

n
    Type: FORTRAN 77: INTEGER; Fortran 90: INTEGER, INTENT(IN); C: const int
    Description: Specifies the number of elements to be calculated.

a
    Type:
        FORTRAN 77: REAL for vserfinv, vmserfinv; DOUBLE PRECISION for vderfinv, vmderfinv
        Fortran 90: REAL, INTENT(IN) for vserfinv, vmserfinv; DOUBLE PRECISION, INTENT(IN) for vderfinv, vmderfinv
        C: const float* for vsErfInv, vmsErfInv; const double* for vdErfInv, vmdErfInv
    Description:
        FORTRAN: Array that specifies the input vector a.
        C: Pointer to an array that contains the input vector a.

mode
    Type: FORTRAN 77: INTEGER*8; Fortran 90: INTEGER(KIND=8), INTENT(IN); C: const MKL_INT64
    Description: Overrides global VML mode setting for this function call. See vmlSetMode for possible values and their description.
Output Parameters

y
    Type:
        FORTRAN 77: REAL for vserfinv, vmserfinv; DOUBLE PRECISION for vderfinv, vmderfinv
        Fortran 90: REAL, INTENT(OUT) for vserfinv, vmserfinv; DOUBLE PRECISION, INTENT(OUT) for vderfinv, vmderfinv
        C: float* for vsErfInv, vmsErfInv; double* for vdErfInv, vmdErfInv
    Description:
        FORTRAN: Array that specifies the output vector y.
        C: Pointer to an array that contains the output vector y.

Description

The ErfInv function computes the inverse error function values for elements of the input vector a and writes them to the output vector y:
y = erf^-1(a),
where erf(x) is the error function defined as given by:
$$\operatorname{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x e^{-t^2}\,dt$$

Useful relations:
$$\operatorname{erf}^{-1}(x) = \operatorname{erfc}^{-1}(1 - x),$$
where erfc is the complementary error function.
$$\Phi(x) = \frac{1}{2}\operatorname{erf}\left(\frac{x}{\sqrt{2}}\right) + \frac{1}{2},$$
where
$$\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-t^2/2}\,dt$$
is the cumulative normal distribution function.
$$\Phi^{-1}(x) = \sqrt{2}\,\operatorname{erf}^{-1}(2x - 1),$$
where Φ^-1(x) and erf^-1(x) are the inverses to Φ(x) and erf(x) respectively.

Figure "ErfInv Family Functions Relationship" illustrates the relationships among ErfInv family functions (ErfInv, ErfcInv, CdfNormInv).

Figure: ErfInv Family Functions Relationship

Useful relations for these functions:
$$\operatorname{erfcinv}(x) = \operatorname{erfinv}(1 - x)$$
$$\operatorname{cdfnorminv}(x) = \sqrt{2}\,\operatorname{erfinv}(2x - 1) = -\sqrt{2}\,\operatorname{erfcinv}(2x)$$

Special Values for Real Function v?ErfInv(x)

Argument        Result    VML Error Status        Exception
+0              +0
-0              -0
+1              +∞        VML_STATUS_SING         ZERODIVIDE
-1              -∞        VML_STATUS_SING         ZERODIVIDE
|X| > 1         QNAN      VML_STATUS_ERRDOM       INVALID
+∞              QNAN      VML_STATUS_ERRDOM       INVALID
-∞              QNAN      VML_STATUS_ERRDOM       INVALID
QNAN            QNAN
SNAN            QNAN                              INVALID

See Also
v?ErfcInv
v?CdfNormInv

v?ErfcInv
Computes the inverse complementary error function value of vector elements.
Syntax

Fortran:
call vserfcinv( n, a, y )
call vmserfcinv( n, a, y, mode )
call vderfcinv( n, a, y )
call vmderfcinv( n, a, y, mode )

C:
vsErfcInv( n, a, y );
vmsErfcInv( n, a, y, mode );
vdErfcInv( n, a, y );
vmdErfcInv( n, a, y, mode );

Include Files
• FORTRAN 77: mkl_vml.f77
• Fortran 90: mkl_vml.f90
• C: mkl_vml_functions.h

Input Parameters

n
    Type: FORTRAN 77: INTEGER; Fortran 90: INTEGER, INTENT(IN); C: const int
    Description: Specifies the number of elements to be calculated.

a
    Type:
        FORTRAN 77: REAL for vserfcinv, vmserfcinv; DOUBLE PRECISION for vderfcinv, vmderfcinv
        Fortran 90: REAL, INTENT(IN) for vserfcinv, vmserfcinv; DOUBLE PRECISION, INTENT(IN) for vderfcinv, vmderfcinv
        C: const float* for vsErfcInv, vmsErfcInv; const double* for vdErfcInv, vmdErfcInv
    Description:
        FORTRAN: Array that specifies the input vector a.
        C: Pointer to an array that contains the input vector a.

mode
    Type: FORTRAN 77: INTEGER*8; Fortran 90: INTEGER(KIND=8), INTENT(IN); C: const MKL_INT64
    Description: Overrides global VML mode setting for this function call. See vmlSetMode for possible values and their description.

Output Parameters

y
    Type:
        FORTRAN 77: REAL for vserfcinv, vmserfcinv; DOUBLE PRECISION for vderfcinv, vmderfcinv
        Fortran 90: REAL, INTENT(OUT) for vserfcinv, vmserfcinv; DOUBLE PRECISION, INTENT(OUT) for vderfcinv, vmderfcinv
        C: float* for vsErfcInv, vmsErfcInv; double* for vdErfcInv, vmdErfcInv
    Description:
        FORTRAN: Array that specifies the output vector y.
        C: Pointer to an array that contains the output vector y.

Description

The ErfcInv function computes the inverse complementary error function values for elements of the input vector a and writes them to the output vector y.

The inverse complementary error function is defined as given by:
$$\operatorname{erfcinv}(x) = \operatorname{erfinv}(1 - x)$$
$$\operatorname{erfinv}(x) = \operatorname{erf}^{-1}(x)$$
$$\operatorname{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x e^{-t^2}\,dt$$
where erf(x) denotes the error function and erfinv(x) denotes the inverse error function.
See also Figure "ErfInv Family Functions Relationship" in ErfInv function description for ErfcInv function relationship with the other functions of ErfInv family.

Special Values for Real Function v?ErfcInv(x)

Argument    Result    VML Error Status     Exception
+1          +0
+2          -∞        VML_STATUS_SING      ZERODIVIDE
-0          +∞        VML_STATUS_SING      ZERODIVIDE
+0          +∞        VML_STATUS_SING      ZERODIVIDE
X < -0      QNAN      VML_STATUS_ERRDOM    INVALID
X > +2      QNAN      VML_STATUS_ERRDOM    INVALID
+∞          QNAN      VML_STATUS_ERRDOM    INVALID
-∞          QNAN      VML_STATUS_ERRDOM    INVALID
QNAN        QNAN
SNAN        QNAN                           INVALID

See Also
v?ErfInv
v?CdfNormInv

v?CdfNormInv

Computes the inverse cumulative normal distribution function values of vector elements.

Syntax

Fortran:
    call vscdfnorminv( n, a, y )
    call vmscdfnorminv( n, a, y, mode )
    call vdcdfnorminv( n, a, y )
    call vmdcdfnorminv( n, a, y, mode )
C:
    vsCdfNormInv( n, a, y );
    vmsCdfNormInv( n, a, y, mode );
    vdCdfNormInv( n, a, y );
    vmdCdfNormInv( n, a, y, mode );

Include Files
• FORTRAN 77: mkl_vml.f77
• Fortran 90: mkl_vml.f90
• C: mkl_vml_functions.h

Input Parameters

n
    Type: FORTRAN 77: INTEGER; Fortran 90: INTEGER, INTENT(IN); C: const int
    Description: Specifies the number of elements to be calculated.
a
    Type:
        FORTRAN 77: REAL for vscdfnorminv, vmscdfnorminv; DOUBLE PRECISION for vdcdfnorminv, vmdcdfnorminv
        Fortran 90: REAL, INTENT(IN) for vscdfnorminv, vmscdfnorminv; DOUBLE PRECISION, INTENT(IN) for vdcdfnorminv, vmdcdfnorminv
        C: const float* for vsCdfNormInv, vmsCdfNormInv; const double* for vdCdfNormInv, vmdCdfNormInv
    Description:
        FORTRAN: Array that specifies the input vector a.
        C: Pointer to an array that contains the input vector a.
mode
    Type: FORTRAN 77: INTEGER*8; Fortran 90: INTEGER(KIND=8), INTENT(IN); C: const MKL_INT64
    Description: Overrides global VML mode setting for this function call. See vmlSetMode for possible values and their description.
Output Parameters

y
    Type:
        FORTRAN 77: REAL for vscdfnorminv, vmscdfnorminv; DOUBLE PRECISION for vdcdfnorminv, vmdcdfnorminv
        Fortran 90: REAL, INTENT(OUT) for vscdfnorminv, vmscdfnorminv; DOUBLE PRECISION, INTENT(OUT) for vdcdfnorminv, vmdcdfnorminv
        C: float* for vsCdfNormInv, vmsCdfNormInv; double* for vdCdfNormInv, vmdCdfNormInv
    Description:
        FORTRAN: Array that specifies the output vector y.
        C: Pointer to an array that contains the output vector y.

Description

The CdfNormInv function computes the inverse cumulative normal distribution function values for elements of the input vector a and writes them to the output vector y.

The inverse cumulative normal distribution function is defined as given by:

$$\operatorname{CdfNormInv}(x) = \operatorname{CdfNorm}^{-1}(x), \qquad \operatorname{CdfNorm}(x) = \frac{1}{2}\left(1 + \operatorname{erf}\left(\frac{x}{\sqrt{2}}\right)\right)$$

where CdfNorm(x) denotes the cumulative normal distribution function.

Useful relations:

$$\operatorname{cdfnorminv}(x) = \sqrt{2}\,\operatorname{erfinv}(2x - 1) = \sqrt{2}\,\operatorname{erfcinv}(2 - 2x)$$

where erfinv(x) denotes the inverse error function and erfcinv(x) denotes the inverse complementary error function.

See also Figure "ErfInv Family Functions Relationship" in ErfInv function description for CdfNormInv function relationship with the other functions of ErfInv family.

Special Values for Real Function v?CdfNormInv(x)

Argument    Result    VML Error Status     Exception
+0.5        +0
+1          +∞        VML_STATUS_SING      ZERODIVIDE
-0          -∞        VML_STATUS_SING      ZERODIVIDE
+0          -∞        VML_STATUS_SING      ZERODIVIDE
X < -0      QNAN      VML_STATUS_ERRDOM    INVALID
X > +1      QNAN      VML_STATUS_ERRDOM    INVALID
+∞          QNAN      VML_STATUS_ERRDOM    INVALID
-∞          QNAN      VML_STATUS_ERRDOM    INVALID
QNAN        QNAN
SNAN        QNAN                           INVALID

See Also
v?ErfInv
v?ErfcInv

v?LGamma

Computes the natural logarithm of the absolute value of gamma function for vector elements.
Syntax Fortran: call vslgamma( n, a, y ) call vmslgamma( n, a, y, mode ) call vdlgamma( n, a, y ) call vmdlgamma( n, a, y, mode ) C: vsLGamma( n, a, y ); vmsLGamma( n, a, y, mode ); vdLGamma( n, a, y ); 9 Intel® Math Kernel Library Reference Manual 2084 vmdLGamma( n, a, y, mode ); Include Files • FORTRAN 77: mkl_vml.f77 • Fortran 90: mkl_vml.f90 • C: mkl_vml_functions.h Input Parameters Name Type Description n FORTRAN 77: INTEGER Fortran 90: INTEGER, INTENT(IN) C: const int Specifies the number of elements to be calculated. a FORTRAN 77: REAL for vslgamma, vmslgamma DOUBLE PRECISION for vdlgamma, vmdlgamma Fortran 90: REAL, INTENT(IN) for vslgamma, vmslgamma DOUBLE PRECISION, INTENT(IN) for vdlgamma, vmdlgamma C: const float* for vsLGamma, vmsLGamma const double* for vdLGamma, vmdLGamma FORTRAN: Array that specifies the input vector a. C: Pointer to an array that contains the input vector a. mode FORTRAN 77: INTEGER*8 Fortran 90: INTEGER(KIND=8), INTENT(IN) C: const MKL_INT64 Overrides global VML mode setting for this function call. See vmlSetMode for possible values and their description. Output Parameters Name Type Description y FORTRAN 77: REAL for vslgamma, vmslgamma DOUBLE PRECISION for vdlgamma, vmdlgamma Fortran 90: REAL, INTENT(OUT) for vslgamma, vmslgamma DOUBLE PRECISION, INTENT(OUT) for vdlgamma, vmdlgamma C: float* for vsLGamma, vmsLGamma double* for vdLGamma, vmdLGamma FORTRAN: Array that specifies the output vector y. C: Pointer to an array that contains the output vector y. Vector Mathematical Functions 9 2085 Description The v?LGamma function computes the natural logarithm of the absolute value of gamma function for elements of the input vector a and writes them to the output vector y. Precision overflow thresholds for the v?LGamma function are beyond the scope of this document. If the result does not meet the target precision, the function raises the OVERFLOW exception and sets the VML Error Status to VML_STATUS_OVERFLOW. 
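As background, the quantity computed here can be written out explicitly. The block below uses the standard mathematical definition of the gamma function (textbook material, not quoted from this manual) and sketches why the arguments 1 and 2 give a zero result, while zero and the negative integers are singular points.

```latex
y_i = \ln\left|\Gamma(a_i)\right|, \qquad
\Gamma(x) = \int_0^{\infty} t^{x-1} e^{-t}\,dt \quad (x > 0), \qquad
\Gamma(x+1) = x\,\Gamma(x)

\Gamma(1) = \Gamma(2) = 1
\;\Rightarrow\;
\ln\left|\Gamma(1)\right| = \ln\left|\Gamma(2)\right| = 0

\left|\Gamma(x)\right| \to \infty \ \text{as}\ x \to 0, -1, -2, \dots
\;\Rightarrow\;
\ln\left|\Gamma(x)\right| \to +\infty
```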
Special Values for Real Function v?LGamma(x)

Argument            Result    VML Error Status       Exception
+1                  +0
+2                  +0
+0                  +∞        VML_STATUS_SING        ZERODIVIDE
-0                  +∞        VML_STATUS_SING        ZERODIVIDE
negative integer    +∞        VML_STATUS_SING        ZERODIVIDE
-∞                  +∞
+∞                  +∞
X > overflow        +∞        VML_STATUS_OVERFLOW    OVERFLOW
QNAN                QNAN
SNAN                QNAN                             INVALID

v?TGamma

Computes the gamma function of vector elements.

Syntax

Fortran:
    call vstgamma( n, a, y )
    call vmstgamma( n, a, y, mode )
    call vdtgamma( n, a, y )
    call vmdtgamma( n, a, y, mode )
C:
    vsTGamma( n, a, y );
    vmsTGamma( n, a, y, mode );
    vdTGamma( n, a, y );
    vmdTGamma( n, a, y, mode );

Include Files
• FORTRAN 77: mkl_vml.f77
• Fortran 90: mkl_vml.f90
• C: mkl_vml_functions.h

Input Parameters

n
    Type: FORTRAN 77: INTEGER; Fortran 90: INTEGER, INTENT(IN); C: const int
    Description: Specifies the number of elements to be calculated.
a
    Type:
        FORTRAN 77: REAL for vstgamma, vmstgamma; DOUBLE PRECISION for vdtgamma, vmdtgamma
        Fortran 90: REAL, INTENT(IN) for vstgamma, vmstgamma; DOUBLE PRECISION, INTENT(IN) for vdtgamma, vmdtgamma
        C: const float* for vsTGamma, vmsTGamma; const double* for vdTGamma, vmdTGamma
    Description:
        FORTRAN: Array that specifies the input vector a.
        C: Pointer to an array that contains the input vector a.
mode
    Type: FORTRAN 77: INTEGER*8; Fortran 90: INTEGER(KIND=8), INTENT(IN); C: const MKL_INT64
    Description: Overrides global VML mode setting for this function call. See vmlSetMode for possible values and their description.

Output Parameters

y
    Type:
        FORTRAN 77: REAL for vstgamma, vmstgamma; DOUBLE PRECISION for vdtgamma, vmdtgamma
        Fortran 90: REAL, INTENT(OUT) for vstgamma, vmstgamma; DOUBLE PRECISION, INTENT(OUT) for vdtgamma, vmdtgamma
        C: float* for vsTGamma, vmsTGamma; double* for vdTGamma, vmdTGamma
    Description:
        FORTRAN: Array that specifies the output vector y.
        C: Pointer to an array that contains the output vector y.
Description

The v?TGamma function computes the gamma function for elements of the input vector a and writes them to the output vector y. Precision overflow thresholds for the v?TGamma function are beyond the scope of this document. If the result does not meet the target precision, the function raises the OVERFLOW exception and sets the VML Error Status to VML_STATUS_OVERFLOW.

Special Values for Real Function v?TGamma(x)

Argument            Result    VML Error Status       Exception
+0                  +∞        VML_STATUS_SING        ZERODIVIDE
-0                  -∞        VML_STATUS_SING        ZERODIVIDE
negative integer    QNAN      VML_STATUS_ERRDOM      INVALID
-∞                  QNAN      VML_STATUS_ERRDOM      INVALID
+∞                  +∞
X > overflow        +∞        VML_STATUS_OVERFLOW    OVERFLOW
QNAN                QNAN
SNAN                QNAN                             INVALID

Rounding Functions

v?Floor

Computes an integer value rounded towards minus infinity for each vector element.

Syntax

Fortran:
    call vsfloor( n, a, y )
    call vmsfloor( n, a, y, mode )
    call vdfloor( n, a, y )
    call vmdfloor( n, a, y, mode )
C:
    vsFloor( n, a, y );
    vmsFloor( n, a, y, mode );
    vdFloor( n, a, y );
    vmdFloor( n, a, y, mode );

Include Files
• FORTRAN 77: mkl_vml.f77
• Fortran 90: mkl_vml.f90
• C: mkl_vml_functions.h

Input Parameters

n
    Type: FORTRAN 77: INTEGER; Fortran 90: INTEGER, INTENT(IN); C: const int
    Description: Specifies the number of elements to be calculated.
a
    Type:
        FORTRAN 77: REAL for vsfloor, vmsfloor; DOUBLE PRECISION for vdfloor, vmdfloor
        Fortran 90: REAL, INTENT(IN) for vsfloor, vmsfloor; DOUBLE PRECISION, INTENT(IN) for vdfloor, vmdfloor
        C: const float* for vsFloor, vmsFloor; const double* for vdFloor, vmdFloor
    Description:
        FORTRAN: Array that specifies the input vector a.
        C: Pointer to an array that contains the input vector a.
mode
    Type: FORTRAN 77: INTEGER*8; Fortran 90: INTEGER(KIND=8), INTENT(IN); C: const MKL_INT64
    Description: Overrides global VML mode setting for this function call. See vmlSetMode for possible values and their description.
Output Parameters Name Type Description y FORTRAN 77: REAL for vsfloor, vmsfloor DOUBLE PRECISION for vdfloor, vmdfloor Fortran 90: REAL, INTENT(OUT) for vsfloor, vmsfloor DOUBLE PRECISION, INTENT(OUT) for vdfloor, vmdfloor C: float* for vsFloor, vmsfloor double* for vdFloor, vmdfloor FORTRAN: Array that specifies the output vector y. C: Pointer to an array that contains the output vector y. Description The function computes an integer value rounded towards minus infinity for each vector element. Special Values for Real Function v?Floor(x) Argument Result Exception +0 +0 -0 -0 +8 +8 -8 -8 SNAN QNAN INVALID QNAN QNAN v?Ceil Computes an integer value rounded towards plus infinity for each vector element. Syntax Fortran: call vsceil( n, a, y ) call vmsceil( n, a, y, mode ) call vdceil( n, a, y ) Vector Mathematical Functions 9 2089 call vmdceil( n, a, y, mode ) C: vsCeil( n, a, y ); vmsCeil( n, a, y, mode ); vdCeil( n, a, y ); vmdCeil( n, a, y, mode ); Include Files • FORTRAN 77: mkl_vml.f77 • Fortran 90: mkl_vml.f90 • C: mkl_vml_functions.h Input Parameters Name Type Description n FORTRAN 77: INTEGER Fortran 90: INTEGER, INTENT(IN) C: const int Specifies the number of elements to be calculated. a FORTRAN 77: REAL for vsceil, vmsceil DOUBLE PRECISION for vdceil, vmdceil Fortran 90: REAL, INTENT(IN) for vsceil, vmsceil DOUBLE PRECISION, INTENT(IN) for vdceil, vmdceil C: const float* for vsCeil, vmsceil const double* for vdCeil, vmdceil FORTRAN: Array that specifies the input vector a. C: Pointer to an array that contains the input vector a. mode FORTRAN 77: INTEGER*8 Fortran 90: INTEGER(KIND=8), INTENT(IN) C: const MKL_INT64 Overrides global VML mode setting for this function call. See vmlSetMode for possible values and their description. Output Parameters Name Type Description y FORTRAN 77: REAL for vsceil, vmsceil DOUBLE PRECISION for vdceil, vmdceil Fortran 90: REAL, INTENT(OUT) for vsceil, vmsceil FORTRAN: Array that specifies the output vector y. 
C: Pointer to an array that contains the output vector y. 9 Intel® Math Kernel Library Reference Manual 2090 Name Type Description DOUBLE PRECISION, INTENT(OUT) for vdceil, vmdceil C: float* for vsCeil, vmsceil double* for vdCeil, vmdceil Description The function computes an integer value rounded towards plus infinity for each vector element. Special Values for Real Function v?Ceil(x) Argument Result Exception +0 +0 -0 -0 +8 +8 -8 -8 SNAN QNAN INVALID QNAN QNAN v?Trunc Computes an integer value rounded towards zero for each vector element. Syntax Fortran: call vstrunc( n, a, y ) call vmstrunc( n, a, y, mode ) call vdtrunc( n, a, y ) call vmdtrunc( n, a, y, mode ) C: vsTrunc( n, a, y ); vmsTrunc( n, a, y, mode ); vdTrunc( n, a, y ); vmdTrunc( n, a, y, mode ); Include Files • FORTRAN 77: mkl_vml.f77 • Fortran 90: mkl_vml.f90 • C: mkl_vml_functions.h Input Parameters Name Type Description n FORTRAN 77: INTEGER Fortran 90: INTEGER, INTENT(IN) Specifies the number of elements to be calculated. Vector Mathematical Functions 9 2091 Name Type Description C: const int a FORTRAN 77: REAL for vstrunc, vmstrunc DOUBLE PRECISION for vdtrunc, vmdtrunc Fortran 90: REAL, INTENT(IN) for vstrunc, vmstrunc DOUBLE PRECISION, INTENT(IN) for vdtrunc, vmdtrunc C: const float* for vsTrunc, vmstrunc const double* for vdTrunc, vmdtrunc FORTRAN: Array that specifies the input vector a. C: Pointer to an array that contains the input vector a. mode FORTRAN 77: INTEGER*8 Fortran 90: INTEGER(KIND=8), INTENT(IN) C: const MKL_INT64 Overrides global VML mode setting for this function call. See vmlSetMode for possible values and their description. Output Parameters Name Type Description y FORTRAN 77: REAL for vstrunc, vmstrunc DOUBLE PRECISION for vdtrunc, vmdtrunc Fortran 90: REAL, INTENT(OUT) for vstrunc, vmstrunc DOUBLE PRECISION, INTENT(OUT) for vdtrunc, vmdtrunc C: float* for vsTrunc, vmstrunc double* for vdTrunc, vmdtrunc FORTRAN: Array that specifies the output vector y. 
C: Pointer to an array that contains the output vector y. Description The function computes an integer value rounded towards zero for each vector element. Special Values for Real Function v?Trunc(x) Argument Result Exception +0 +0 -0 -0 +8 +8 -8 -8 SNAN QNAN INVALID QNAN QNAN 9 Intel® Math Kernel Library Reference Manual 2092 v?Round Computes a value rounded to the nearest integer for each vector element. Syntax Fortran: call vsround( n, a, y ) call vmsround( n, a, y, mode ) call vdround( n, a, y ) call vmdround( n, a, y, mode ) C: vsRound( n, a, y ); vmsRound( n, a, y, mode ); vdRound( n, a, y ); vmdRound( n, a, y, mode ); Include Files • FORTRAN 77: mkl_vml.f77 • Fortran 90: mkl_vml.f90 • C: mkl_vml_functions.h Input Parameters Name Type Description n FORTRAN 77: INTEGER Fortran 90: INTEGER, INTENT(IN) C: const int Specifies the number of elements to be calculated. a FORTRAN 77: REAL for vsround, vmsround DOUBLE PRECISION for vdround, vmdround Fortran 90: REAL, INTENT(IN) for vsround, vmsround DOUBLE PRECISION, INTENT(IN) for vdround, vmdround C: const float* for vsRound, vmsround const double* for vdRound, vmdround FORTRAN: Array that specifies the input vector a. C: Pointer to an array that contains the input vector a. mode FORTRAN 77: INTEGER*8 Fortran 90: INTEGER(KIND=8), INTENT(IN) Overrides global VML mode setting for this function call. See vmlSetMode for possible values and their description. Vector Mathematical Functions 9 2093 Name Type Description C: const MKL_INT64 Output Parameters Name Type Description y FORTRAN 77: REAL for vsround, vmsround DOUBLE PRECISION for vdround, vmdround Fortran 90: REAL, INTENT(OUT) for vsround, vmsround DOUBLE PRECISION, INTENT(OUT) for vdround, vmdround C: float* for vsRound, vmsround double* for vdRound, vmdround FORTRAN: Array that specifies the output vector y. C: Pointer to an array that contains the output vector y. Description The function computes a value rounded to the nearest integer for each vector element. 
The rounding mode affects the results computed for inputs that fall halfway between consecutive integers. For example:
• f(0.5) = 0, for rounding modes set to round to nearest, round toward zero, or round to minus infinity.
• f(0.5) = 1, for the rounding mode set to round to plus infinity.
• f(-1.5) = -2, for rounding modes set to round to nearest or round to minus infinity.
• f(-1.5) = -1, for rounding modes set to round toward zero or round to plus infinity.

Special Values for Real Function v?Round(x)

Argument    Result    Exception
+0          +0
-0          -0
+∞          +∞
-∞          -∞
SNAN        QNAN      INVALID
QNAN        QNAN

v?NearbyInt

Computes a rounded integer value in the current rounding mode for each vector element.

Syntax

Fortran:
    call vsnearbyint( n, a, y )
    call vmsnearbyint( n, a, y, mode )
    call vdnearbyint( n, a, y )
    call vmdnearbyint( n, a, y, mode )
C:
    vsNearbyInt( n, a, y );
    vmsNearbyInt( n, a, y, mode );
    vdNearbyInt( n, a, y );
    vmdNearbyInt( n, a, y, mode );

Include Files
• FORTRAN 77: mkl_vml.f77
• Fortran 90: mkl_vml.f90
• C: mkl_vml_functions.h

Input Parameters

n
    Type: FORTRAN 77: INTEGER; Fortran 90: INTEGER, INTENT(IN); C: const int
    Description: Specifies the number of elements to be calculated.
a
    Type:
        FORTRAN 77: REAL for vsnearbyint, vmsnearbyint; DOUBLE PRECISION for vdnearbyint, vmdnearbyint
        Fortran 90: REAL, INTENT(IN) for vsnearbyint, vmsnearbyint; DOUBLE PRECISION, INTENT(IN) for vdnearbyint, vmdnearbyint
        C: const float* for vsNearbyInt, vmsNearbyInt; const double* for vdNearbyInt, vmdNearbyInt
    Description:
        FORTRAN: Array that specifies the input vector a.
        C: Pointer to an array that contains the input vector a.
mode
    Type: FORTRAN 77: INTEGER*8; Fortran 90: INTEGER(KIND=8), INTENT(IN); C: const MKL_INT64
    Description: Overrides global VML mode setting for this function call. See vmlSetMode for possible values and their description.
Output Parameters Name Type Description y FORTRAN 77: REAL for vsnearbyint, vmsnearbyint DOUBLE PRECISION for vdnearbyint, vmdnearbyint Fortran 90: REAL, INTENT(OUT) for vsnearbyint, vmsnearbyint FORTRAN: Array that specifies the output vector y. C: Pointer to an array that contains the output vector y. Vector Mathematical Functions 9 2095 Name Type Description DOUBLE PRECISION, INTENT(OUT) for vdnearbyint, vmdnearbyint C: float* for vsNearbyInt, vmsnearbyint double* for vdNearbyInt, vmdnearbyint Description The v?NearbyInt function computes a rounded integer value in a current rounding mode for each vector element. Halfway values, that is, 0.5, -1.5, and the like, are rounded off towards even values. Special Values for Real Function v?NearbyInt(x) Argument Result Exception +0 +0 -0 -0 +8 +8 -8 -8 SNAN QNAN INVALID QNAN QNAN v?Rint Computes a rounded integer value in the current rounding mode for each vector element with inexact result exception raised for each changed value. Syntax Fortran: call vsrint( n, a, y ) call vmsrint( n, a, y, mode ) call vdrint( n, a, y ) call vmdrint( n, a, y, mode ) C: vsRint( n, a, y ); vmsRint( n, a, y, mode ); vdRint( n, a, y ); vmdRint( n, a, y, mode ); Include Files • FORTRAN 77: mkl_vml.f77 • Fortran 90: mkl_vml.f90 • C: mkl_vml_functions.h 9 Intel® Math Kernel Library Reference Manual 2096 Input Parameters Name Type Description n FORTRAN 77: INTEGER Fortran 90: INTEGER, INTENT(IN) C: const int Specifies the number of elements to be calculated. a FORTRAN 77: REAL for vsrint, vmsrint DOUBLE PRECISION for vdrint, vmdrint Fortran 90: REAL, INTENT(IN) for vsrint, vmsrint DOUBLE PRECISION, INTENT(IN) for vdrint, vmdrint C: const float* for vsRint, vmsrint const double* for vdRint, vmdrint FORTRAN: Array that specifies the input vector a. C: Pointer to an array that contains the input vector a. mode FORTRAN 77: INTEGER*8 Fortran 90: INTEGER(KIND=8), INTENT(IN) C: const MKL_INT64 Overrides global VML mode setting for this function call. 
See vmlSetMode for possible values and their description.

Output Parameters

y
    Type:
        FORTRAN 77: REAL for vsrint, vmsrint; DOUBLE PRECISION for vdrint, vmdrint
        Fortran 90: REAL, INTENT(OUT) for vsrint, vmsrint; DOUBLE PRECISION, INTENT(OUT) for vdrint, vmdrint
        C: float* for vsRint, vmsRint; double* for vdRint, vmdRint
    Description:
        FORTRAN: Array that specifies the output vector y.
        C: Pointer to an array that contains the output vector y.

Description

The v?Rint function computes a rounded floating-point integer value using the current rounding mode for each vector element.

The rounding mode affects the results computed for inputs that fall halfway between consecutive integers. For example:
• f(0.5) = 0, for rounding modes set to round to nearest, round toward zero, or round to minus infinity.
• f(0.5) = 1, for the rounding mode set to round to plus infinity.
• f(-1.5) = -2, for rounding modes set to round to nearest or round to minus infinity.
• f(-1.5) = -1, for rounding modes set to round toward zero or round to plus infinity.

Special Values for Real Function v?Rint(x)

Argument    Result    Exception
+0          +0
-0          -0
+∞          +∞
-∞          -∞
SNAN        QNAN      INVALID
QNAN        QNAN

v?Modf

Computes a truncated integer value and the remaining fraction part for each vector element.

Syntax

Fortran:
    call vsmodf( n, a, y, z )
    call vmsmodf( n, a, y, z, mode )
    call vdmodf( n, a, y, z )
    call vmdmodf( n, a, y, z, mode )
C:
    vsModf( n, a, y, z );
    vmsModf( n, a, y, z, mode );
    vdModf( n, a, y, z );
    vmdModf( n, a, y, z, mode );

Include Files
• FORTRAN 77: mkl_vml.f77
• Fortran 90: mkl_vml.f90
• C: mkl_vml_functions.h

Input Parameters

n
    Type: FORTRAN 77: INTEGER; Fortran 90: INTEGER, INTENT(IN); C: const int
    Description: Specifies the number of elements to be calculated.
a
    Description:
        FORTRAN: Array that specifies the input vector a.
        C: Pointer to an array that contains the input vector a.
    Type:
        FORTRAN 77: REAL for vsmodf, vmsmodf; DOUBLE PRECISION for vdmodf, vmdmodf
        Fortran 90: REAL, INTENT(IN) for vsmodf, vmsmodf; DOUBLE PRECISION, INTENT(IN) for vdmodf, vmdmodf
        C: const float* for vsModf, vmsModf; const double* for vdModf, vmdModf
mode
    Type: FORTRAN 77: INTEGER*8; Fortran 90: INTEGER(KIND=8), INTENT(IN); C: const MKL_INT64
    Description: Overrides global VML mode setting for this function call. See vmlSetMode for possible values and their description.

Output Parameters

y, z
    Type:
        FORTRAN 77: REAL for vsmodf, vmsmodf; DOUBLE PRECISION for vdmodf, vmdmodf
        Fortran 90: REAL, INTENT(OUT) for vsmodf, vmsmodf; DOUBLE PRECISION, INTENT(OUT) for vdmodf, vmdmodf
        C: float* for vsModf, vmsModf; double* for vdModf, vmdModf
    Description:
        FORTRAN: Arrays that specify the output vectors y and z.
        C: Pointers to arrays that contain the output vectors y and z.

Description

The function computes a truncated integer value and the remaining fraction part for each vector element. The integer part is written to y (Result 1 in the table below) and the fraction part to z (Result 2). An inexact result exception is raised for each changed value.

Special Values for Real Function v?Modf(x)

Argument    Result 1    Result 2    Exception
+0          +0          +0
-0          -0          -0
+∞          +∞          +0
-∞          -∞          -0
SNAN        QNAN        QNAN        INVALID
QNAN        QNAN        QNAN

VML Pack/Unpack Functions

This section describes VML functions that convert vectors with unit increment to and from vectors with positive increment indexing, vector indexing, and mask indexing (see Appendix B for details on vector indexing methods).

The table below lists available VML Pack/Unpack functions, together with data types and indexing methods associated with them.

VML Pack/Unpack Functions

Function Short Name    Data Types    Indexing Methods    Description
v?Pack                 s, d, c, z    I, V, M             Gathers elements of arrays, indexed by different methods.
v?Unpack               s, d, c, z    I, V, M             Scatters vector elements to arrays with different indexing.
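The three indexing methods in the table can be pictured with plain scalar loops. The sketch below is only an illustration of the gather and scatter semantics for the double-precision case, using 0-based C indexing; the ref_ functions are hypothetical helpers, not MKL routines, and the count returned by ref_pack_m is a convenience added here (it assumes, as a reading of mask indexing, that a nonzero mask entry selects the element).

```c
#include <stddef.h>

/* Hypothetical reference helpers (not MKL functions). */

/* Positive-increment gather (I): y[i] = a[i * inca]. */
void ref_pack_i(int n, const double *a, int inca, double *y)
{
    for (int i = 0; i < n; ++i)
        y[i] = a[(size_t)i * (size_t)inca];
}

/* Index-vector gather (V): y[i] = a[ia[i]]. */
void ref_pack_v(int n, const double *a, const int *ia, double *y)
{
    for (int i = 0; i < n; ++i)
        y[i] = a[ia[i]];
}

/* Mask gather (M): scans n elements of a and copies those whose mask
   entry is nonzero (assumed convention); returns how many were copied. */
int ref_pack_m(int n, const double *a, const int *ma, double *y)
{
    int k = 0;
    for (int i = 0; i < n; ++i)
        if (ma[i] != 0)
            y[k++] = a[i];
    return k;
}

/* Positive-increment scatter (I), the inverse of ref_pack_i:
   y[i * incy] = a[i]. */
void ref_unpack_i(int n, const double *a, double *y, int incy)
{
    for (int i = 0; i < n; ++i)
        y[(size_t)i * (size_t)incy] = a[i];
}
```

With a = {10, 11, 12, 13, 14, 15}, ref_pack_i with n = 3 and inca = 2 gathers elements 0, 2, and 4 into a unit-stride vector, and ref_unpack_i with incy = 2 writes them back to the same strided positions.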
See Also Vector Arguments in VML v?Pack Copies elements of an array with specified indexing to a vector with unit increment. Syntax Fortran: call vspacki( n, a, inca, y ) call vspackv( n, a, ia, y ) call vspackm( n, a, ma, y ) call vdpacki( n, a, inca, y ) call vdpackv( n, a, ia, y ) call vdpackm( n, a, ma, y ) call vcpacki( n, a, inca, y ) call vcpackv( n, a, ia, y ) call vcpackm( n, a, ma, y ) call vzpacki( n, a, inca, y ) call vzpackv( n, a, ia, y ) call vzpackm( n, a, ma, y ) C: vsPackI( n, a, inca, y ); vsPackV( n, a, ia, y ); vsPackM( n, a, ma, y ); vdPackI( n, a, inca, y ); vdPackV( n, a, ia, y ); vdPackM( n, a, ma, y ); vcPackI( n, a, inca, y ); 9 Intel® Math Kernel Library Reference Manual 2100 vcPackV( n, a, ia, y ); vcPackM( n, a, ma, y ); vzPackI( n, a, inca, y ); vzPackV( n, a, ia, y ); vzPackM( n, a, ma, y ); Include Files • FORTRAN 77: mkl_vml.f77 • Fortran 90: mkl_vml.f90 • C: mkl_vml_functions.h Input Parameters Name Type Description n FORTRAN 77: INTEGER Fortran 90: INTEGER, INTENT(IN) C: const int Specifies the number of elements to be calculated. a FORTRAN 77: REAL for vspacki, vspackv, vspackm DOUBLE PRECISION for vdpacki, vdpackv, vdpackm COMPLEX for vcpacki, vcpackv, vcpackm DOUBLE COMPLEX for vzpacki, vzpackv, vzpackm Fortran 90: REAL, INTENT(IN) for vspacki, vspackv, vspackm DOUBLE PRECISION, INTENT(IN) for vdpacki, vdpackv, vdpackm COMPLEX, INTENT(IN) for vcpacki, vcpackv, vcpackm DOUBLE COMPLEX, INTENT(IN) for vzpacki, vzpackv, vzpackm C: const float* for vsPackI, vsPackV, vsPackM const double* for vdPackI, vdPackV, vdPackM const MKL_Complex8* for vcPackI, vcPackV, vcPackM const MKL_Complex16* for vzPackI, vzPackV, vzPackM FORTRAN: Array, DIMENSION at least(1 + (n-1)*inca) for v?packi, Array, DIMENSION at least max( n,max(ia[j]) ), j=0, …, n-1 for v?packv, Array, DIMENSION at least n for v?packm. Specifies the input vector a. C: Specifies pointer to an array that contains the input vector a. 
The arrays must be: for v?PackI, at least(1 + (n-1)*inca) for v?PackV, at least max( n,max(ia[j]) ), j=0, …, n-1 for v?PackM, at least n. Vector Mathematical Functions 9 2101 Name Type Description inca FORTRAN 77: INTEGER for vspacki, vdpacki, vcpacki, vzpacki Fortran 90: INTEGER, INTENT(IN) for vspacki, vdpacki, vcpacki, vzpacki C: const int for vsPackI, vdPackI, vcPackI, vzPackI Specifies the increment for the elements of a. ia FORTRAN 77: INTEGER for vspackv, vdpackv, vcpackv, vzpackv Fortran 90: INTEGER, INTENT(IN) for vspackv, vdpackv, vcpackv, vzpackv C: const int* for vsPackV, vdPackV, vcPackV, vzPackV FORTRAN: Array, DIMENSION at least n. Specifies the index vector for the elements of a. C: Specifies the pointer to an array of size at least n that contains the index vector for the elements of a. ma FORTRAN 77: INTEGER for vspackm, vdpackm, vcpackm, vzpackm Fortran 90: INTEGER, INTENT(IN) for vspackm, vdpackm, vcpackm, vzpackm C: const int* for vsPackM, vdPackM, vcPackM, vzPackM FORTRAN: Array, DIMENSION at least n, Specifies the mask vector for the elements of a. C: Specifies the pointer to an array of size at least n that contains the mask vector for the elements of a. Output Parameters Name Type Description y FORTRAN 77: REAL for vspacki, vspackv, vspackm DOUBLE PRECISION for vdpacki, vdpackv, vdpackm COMPLEX for vcpacki, vcpackv, vcpackm DOUBLE COMPLEX for vzpacki, vzpackv, vzpackm Fortran 90: REAL, INTENT(OUT) for vspacki, vspackv, vspackm DOUBLE PRECISION, INTENT(OUT) for vdpacki, vdpackv, vdpackm COMPLEX, INTENT(OUT) for vcpacki, vcpackv, vcpackm FORTRAN: Array, DIMENSION at least n. Specifies the output vector y. C: Pointer to an array of size at least n that contains the output vector y. 
9 Intel® Math Kernel Library Reference Manual 2102 Name Type Description DOUBLE COMPLEX, INTENT(OUT) for vzpacki, vzpackv, vzpackm C: float* for vsPackI, vsPackV, vsPackM double* for vdPackI, vdPackV, vdPackM const MKL_Complex8* for vcPackI, vcPackV, vcPackM const MKL_Complex16* for vzPackI, vzPackV, vzPackM v?Unpack Copies elements of a vector with unit increment to an array with specified indexing. Syntax Fortran: call vsunpacki( n, a, y, incy ) call vsunpackv( n, a, y, iy ) call vsunpackm( n, a, y, my ) call vdunpacki( n, a, y, incy ) call vdunpackv( n, a, y, iy ) call vdunpackm( n, a, y, my ) call vcunpacki( n, a, y, incy ) call vcunpackv( n, a, y, iy ) call vcunpackm( n, a, y, my ) call vzunpacki( n, a, y, incy ) call vzunpackv( n, a, y, iy ) call vzunpackm( n, a, y, my ) C: vsUnpackI( n, a, y, incy ); vsUnpackV( n, a, y, iy ); vsUnpackM( n, a, y, my ); vdUnpackI( n, a, y, incy ); vdUnpackV( n, a, y, iy ); vdUnpackM( n, a, y, my ); vcUnpackI( n, a, y, incy ); vcUnpackV( n, a, y, iy ); Vector Mathematical Functions 9 2103 vcUnpackM( n, a, y, my ); vzUnpackI( n, a, y, incy ); vzUnpackV( n, a, y, iy ); vzUnpackM( n, a, y, my ); Include Files • FORTRAN 77: mkl_vml.f77 • Fortran 90: mkl_vml.f90 • C: mkl_vml_functions.h Input Parameters Name Type Description n FORTRAN 77: INTEGER Fortran 90: INTEGER, INTENT(IN) C: const int Specifies the number of elements to be calculated. 
a FORTRAN 77: REAL for vsunpacki, vsunpackv, vsunpackm DOUBLE PRECISION for vdunpacki, vdunpackv, vdunpackm COMPLEX for vcunpacki, vcunpackv, vcunpackm DOUBLE COMPLEX for vzunpacki, vzunpackv, vznpackm Fortran 90: REAL, INTENT(IN) for vsunpacki, vsunpackv, vsunpackm DOUBLE PRECISION, INTENT(IN) for vdunpacki, vdunpackv, vdunpackm COMPLEX, INTENT(IN) for vcunpacki, vcunpackv, vcunpackm DOUBLE COMPLEX, INTENT(IN) for vzunpacki, vzunpackv, vzunpackm C: const float* for vsUnpackI, vsUnpackV, vsUnpackM const double* for vdUnpackI, vdUnpackV, vdUnpackM const MKL_Complex8* for vcUnpackI, vcUnpackV, vcUnpackM const MKL_Complex16* for vzUnpackI, vzUnpackV, vzUnpackM FORTRAN: Array, DIMENSION at least n. Specifies the input vector a. C: Specifies the pointer to an array of size at least n that contains the input vector a. incy FORTRAN 77: INTEGER for vsunpacki, vdunpacki, vcunpacki, vzunpacki Specifies the increment for the elements of y. 9 Intel® Math Kernel Library Reference Manual 2104 Name Type Description Fortran 90: INTEGER, INTENT(IN) for vsunpacki, vdunpacki, vcunpacki, vzunpacki C: const int for vsUnpackI, vdUnpackI, vcUnpackI, vzUnpackI iy FORTRAN 77: INTEGER for vsunpackv, vdunpackv, vcunpackv, vzunpackv Fortran 90: INTEGER, INTENT(IN) for vsunpackv, vdunpackv, vcunpackv, vzunpackv C: const int* for vsUnpackV, vdUnpackV, vcUnpackV, vzUnpackV FORTRAN: Array, DIMENSION at least n. Specifies the index vector for the elements of y. C: Specifies the pointer to an array of size at least n that contains the index vector for the elements of a. my FORTRAN 77: INTEGER for vsunpackm, vdunpackm, vcunpackm, vzunpackm Fortran 90: INTEGER, INTENT(IN) for vsunpackm, vdunpackm, vcunpackm, vzunpackm C: const int* for vsUnpackM, vdUnpackM, vcUnpackM, vzUnpackM FORTRAN: Array, DIMENSION at least n, Specifies the mask vector for the elements of y. C: Specifies the pointer to an array of size at least n that contains the mask vector for the elements of a. 
Output Parameters Name Type Description y FORTRAN 77: REAL for vsunpacki, vsunpackv, vsunpackm DOUBLE PRECISION for vdunpacki, vdunpackv, vdunpackm COMPLEX, INTENT(IN) for vcunpacki, vcunpackv, vcunpackm DOUBLE COMPLEX, INTENT(IN) for vzunpacki, vzunpackv, vzunpackm Fortran 90: REAL, INTENT(OUT) for vsunpacki, vsunpackv, vsunpackm DOUBLE PRECISION, INTENT(OUT) for vdunpacki, vdunpackv, vdunpackm COMPLEX, INTENT(OUT) for vcunpacki, vcunpackv, vcunpackm DOUBLE COMPLEX, INTENT(OUT) for vzunpacki, vzunpackv, vzunpackm FORTRAN: Array, DIMENSION for v?unpacki, at least (1 + (n-1)*incy) for v?unpackv, at least max( n,max(iy[j]) ),j=0,..., n-1 for v?unpackm, at least n C: Specifies the pointer to an array that contains the output vector y. The array must be: for v?UnpackI, at least (1 + (n-1)*incy) for v?UnpackV, at least max( n,max(ia[j]) ),j=0,..., n-1, for v?UnpackM, at least n. Vector Mathematical Functions 9 2105 Name Type Description C: float* for vsUnpackI, vsUnpackV, vsUnpackM double* for vdUnpackI, vdUnpackV, vdUnpackM const MKL_Complex8* for vcUnpackI, vcUnpackV, vcUnpackM const MKL_Complex16* for vzUnpackI, vzUnpackV, vzUnpackM VML Service Functions The VML Service functions enable you to set/get the accuracy mode and error code. These functions are available both in the Fortran and C interfaces. The table below lists available VML Service functions and their short description. VML Service Functions Function Short Name Description vmlSetMode Sets the VML mode vmlGetMode Gets the VML mode vmlSetErrStatus Sets the VML Error Status vmlGetErrStatus Gets the VML Error Status vmlClearErrStatus Clears the VML Error Status vmlSetErrorCallBack Sets the additional error handler callback function vmlGetErrorCallBack Gets the additional error handler callback function vmlClearErrorCallBack Deletes the additional error handler callback function vmlSetMode Sets a new mode for VML functions according to the mode parameter and stores the previous VML mode to oldmode. 
Syntax
Fortran: oldmode = vmlsetmode( mode )
C: oldmode = vmlSetMode( mode );

Include Files
• FORTRAN 77: mkl_vml.f77
• Fortran 90: mkl_vml.f90
• C: mkl_vml_functions.h

Input Parameters
Name  Type  Description
mode
    FORTRAN 77: INTEGER*8
    Fortran 90: INTEGER(KIND=8), INTENT(IN)
    C: const MKL_INT64
    Specifies the VML mode to be set.

Output Parameters
Name  Type  Description
oldmode
    FORTRAN 77: INTEGER*8
    Fortran 90: INTEGER(KIND=8)
    C: MKL_INT64
    Specifies the former VML mode.

Description
The vmlSetMode function sets a new mode for VML functions according to the mode parameter and stores the previous VML mode to oldmode. The mode change has a global effect on all the VML functions within a thread.

NOTE You can override the global mode setting and change the mode for a given VML function call by using a respective vm[s,d] variant of the function.

The mode parameter is designed to control accuracy, handling of denormalized numbers, and error handling. Table "Values of the mode Parameter" lists values of the mode parameter. You can obtain all other possible values of the mode parameter from the listed values by using the bitwise OR ( | ) operation to combine one value for accuracy, one value for handling of denormalized numbers, and one value for error control options. The default value of the mode parameter is VML_HA | VML_FTZDAZ_OFF | VML_ERRMODE_DEFAULT.

The VML_FTZDAZ_ON mode is specifically designed to improve the performance of computations that involve denormalized numbers at the cost of reasonable accuracy loss. This mode changes the numeric behavior of the functions: denormalized input values are treated as zeros (DAZ = denormals-are-zero) and denormalized results are flushed to zero (FTZ = flush-to-zero). Accuracy loss may occur if input and/or output values are close to the denormal range.
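The "one value per control group, combined with bitwise OR" scheme and the set-and-return-previous contract described above can be sketched in plain C without linking against Intel MKL. Everything prefixed with DEMO_ or demo_ is hypothetical: the numeric bit values are placeholders chosen for this illustration and are not the values defined in the MKL headers, and a plain static variable stands in for the per-thread mode.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder bit layout for illustration only, NOT the MKL header values.
   Each control group owns its own bit field, so one value from each group
   can be combined with bitwise OR and later separated with a mask. */
#define DEMO_ACC_LA       0x001u   /* accuracy group */
#define DEMO_ACC_HA       0x002u
#define DEMO_ACC_MASK     0x00Fu
#define DEMO_FTZDAZ_ON    0x010u   /* denormal handling group */
#define DEMO_FTZDAZ_OFF   0x020u
#define DEMO_FTZDAZ_MASK  0x0F0u
#define DEMO_ERR_ERRNO    0x100u   /* error mode group */
#define DEMO_ERR_STDERR   0x200u
#define DEMO_ERR_MASK     0xF00u

/* Stand-in for the library's current mode (hypothetical default). */
static uint32_t demo_mode = DEMO_ACC_HA | DEMO_FTZDAZ_OFF | DEMO_ERR_ERRNO;

/* Stores the new mode and returns the previous one, mirroring the
   oldmode = vmlSetMode( mode ) contract. */
uint32_t demo_set_mode(uint32_t mode)
{
    /* Reject bits that fall outside the three known fields. */
    assert((mode & ~(DEMO_ACC_MASK | DEMO_FTZDAZ_MASK | DEMO_ERR_MASK)) == 0);
    uint32_t old = demo_mode;
    demo_mode = mode;
    return old;
}

/* Returns the packed mode; callers isolate one group with its mask. */
uint32_t demo_get_mode(void)
{
    return demo_mode;
}
```

A caller would typically keep the returned old value and pass it back to demo_set_mode() later to restore the previous setting, which is the main reason a setter returns the former mode.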
Values of the mode Parameter
Value of mode            Description
Accuracy Control
VML_HA                   high accuracy versions of VML functions
VML_LA                   low accuracy versions of VML functions
VML_EP                   enhanced performance accuracy versions of VML functions
Denormalized Numbers Handling Control
VML_FTZDAZ_ON            Faster processing of denormalized inputs is enabled.
VML_FTZDAZ_OFF           Faster processing of denormalized inputs is disabled.
Error Mode Control
VML_ERRMODE_IGNORE       No action is set for computation errors.
VML_ERRMODE_ERRNO        On error, the errno variable is set.
VML_ERRMODE_STDERR       On error, the error text information is written to stderr.
VML_ERRMODE_EXCEPT       On error, an exception is raised.
VML_ERRMODE_CALLBACK     On error, an additional error handler function is called.
VML_ERRMODE_DEFAULT      On error, the errno variable is set, an exception is raised, and an additional error handler function is called.

Examples
The following examples show how to set low accuracy alone, and then low accuracy together with fast processing of denormalized numbers and the errno error mode, in the Fortran and C languages:

Fortran:
    oldmode = vmlsetmode( VML_LA )
    oldmode = vmlsetmode( IOR(VML_LA, IOR(VML_FTZDAZ_ON, VML_ERRMODE_ERRNO)) )

C:
    vmlSetMode( VML_LA );
    vmlSetMode( VML_LA | VML_FTZDAZ_ON | VML_ERRMODE_ERRNO );

vmlGetMode

Gets the VML mode.

Syntax
Fortran: mod = vmlgetmode()
C: mod = vmlGetMode( void );

Include Files
• FORTRAN 77: mkl_vml.f77
• Fortran 90: mkl_vml.f90
• C: mkl_vml_functions.h

Output Parameters
Name  Type  Description
mod
    FORTRAN: INTEGER
    C: int
    Specifies the packed mode parameter.

Description
The function vmlGetMode() returns the VML mode parameter that controls accuracy, handling of denormalized numbers, and error handling options. The mod variable value is a combination of the values listed in the table "Values of the mode Parameter". You can extract each of these values by applying the respective mask from the table "Values of Mask for the mode Parameter".
Values of Mask for the mode Parameter
Value of mask            Description
VML_ACCURACY_MASK        Specifies mask for accuracy mode selection.
VML_FTZDAZ_MASK          Specifies mask for FTZDAZ mode selection.
VML_ERRMODE_MASK         Specifies mask for error mode selection.

Examples

Fortran:
    mod = vmlgetmode()
    accm = IAND(mod, VML_ACCURACY_MASK)
    denm = IAND(mod, VML_FTZDAZ_MASK)
    errm = IAND(mod, VML_ERRMODE_MASK)

C:
    accm = vmlGetMode() & VML_ACCURACY_MASK;
    denm = vmlGetMode() & VML_FTZDAZ_MASK;
    errm = vmlGetMode() & VML_ERRMODE_MASK;

vmlSetErrStatus

Sets the new VML Error Status according to err and stores the previous VML Error Status to olderr.

Syntax
Fortran: olderr = vmlseterrstatus( err )
C: olderr = vmlSetErrStatus( err );

Include Files
• FORTRAN 77: mkl_vml.f77
• Fortran 90: mkl_vml.f90
• C: mkl_vml_functions.h

Input Parameters
Name  Type  Description
err
    FORTRAN 77: INTEGER
    Fortran 90: INTEGER, INTENT(IN)
    C: const int
    Specifies the VML error status to be set.

Output Parameters
Name  Type  Description
olderr
    FORTRAN: INTEGER
    C: int
    Specifies the former VML error status.

Description
Table "Values of the VML Status" lists possible values of the err parameter.

Values of the VML Status
Status                        Description
Successful Execution
VML_STATUS_OK                 The execution was completed successfully.
Warnings
VML_STATUS_ACCURACYWARNING    The execution was completed successfully in a different accuracy mode.
Errors
VML_STATUS_BADSIZE            The array dimension is not positive.
VML_STATUS_BADMEM             NULL pointer is passed.
VML_STATUS_ERRDOM             At least one of the array values is outside the domain of definition.
VML_STATUS_SING               At least one of the input array values causes a divide-by-zero exception or produces an invalid (QNaN) result.
VML_STATUS_OVERFLOW           An overflow has happened during the calculation process.
VML_STATUS_UNDERFLOW          An underflow has happened during the calculation process.

Examples
    olderr = vmlSetErrStatus( VML_STATUS_OK );
    olderr = vmlSetErrStatus( VML_STATUS_ERRDOM );
    olderr = vmlSetErrStatus( VML_STATUS_UNDERFLOW );

vmlGetErrStatus

Gets the VML Error Status.

Syntax
Fortran: err = vmlgeterrstatus( )
C: err = vmlGetErrStatus( void );

Include Files
• FORTRAN 77: mkl_vml.f77
• Fortran 90: mkl_vml.f90
• C: mkl_vml_functions.h

Output Parameters
Name  Type  Description
err
    FORTRAN: INTEGER
    C: int
    Specifies the VML error status.

vmlClearErrStatus

Sets the VML Error Status to VML_STATUS_OK and stores the previous VML Error Status to olderr.

Syntax
Fortran: olderr = vmlclearerrstatus( )
C: olderr = vmlClearErrStatus( void );

Include Files
• FORTRAN 77: mkl_vml.f77
• Fortran 90: mkl_vml.f90
• C: mkl_vml_functions.h

Output Parameters
Name  Type  Description
olderr
    FORTRAN: INTEGER
    C: int
    Specifies the former VML error status.

vmlSetErrorCallBack

Sets the additional error handler callback function and gets the old callback function.

Syntax
Fortran: oldcallback = vmlseterrorcallback( callback )
C: oldcallback = vmlSetErrorCallBack( callback );

Include Files
• Fortran 90: mkl_vml.f90
• C: mkl_vml_functions.h

Input Parameters
Name  Description
FORTRAN: callback
    Address of the callback function. The callback function has the following format:

        INTEGER FUNCTION ERRFUNC(par)
        TYPE (ERROR_STRUCTURE) par
        ! ...
        ! user error processing
        ! ...
        ERRFUNC = 0
        ! if ERRFUNC = 0  - standard VML error handler
        !                   is called after the callback
        ! if ERRFUNC != 0 - standard VML error handler
        !                   is not called
        END

    The passed error structure is defined as follows:

        TYPE ERROR_STRUCTURE
        SEQUENCE
        INTEGER*4 ICODE
        INTEGER*4 IINDEX
        REAL*8 DBA1
        REAL*8 DBA2
        REAL*8 DBR1
        REAL*8 DBR2
        CHARACTER(64) CFUNCNAME
        INTEGER*4 IFUNCNAMELEN
        REAL*8 DBA1IM
        REAL*8 DBA2IM
        REAL*8 DBR1IM
        REAL*8 DBR2IM
        END TYPE ERROR_STRUCTURE

C: callback
    Pointer to the callback function. The callback function has the following format:

        static int __stdcall MyHandler(DefVmlErrorContext* pContext)
        {
            /* Handler body */
            return 0; /* 0: the standard VML error handler is called afterwards */
        }

    The passed error structure is defined as follows:

        typedef struct _DefVmlErrorContext
        {
            int iCode;          /* Error status value */
            int iIndex;         /* Index for bad array element, or bad array
                                   dimension, or bad array pointer */
            double dbA1;        /* Error argument 1 */
            double dbA2;        /* Error argument 2 */
            double dbR1;        /* Error result 1 */
            double dbR2;        /* Error result 2 */
            char cFuncName[64]; /* Function name */
            int iFuncNameLen;   /* Length of function name */
            double dbA1Im;      /* Error argument 1, imag part */
            double dbA2Im;      /* Error argument 2, imag part */
            double dbR1Im;      /* Error result 1, imag part */
            double dbR2Im;      /* Error result 2, imag part */
        } DefVmlErrorContext;

Output Parameters
Name  Type  Description
oldcallback
    Fortran 90: INTEGER
    C: int
    FORTRAN: Address of the former callback function.
    C: Pointer to the former callback function.

NOTE This function does not have a FORTRAN 77 interface due to the use of internal structures.

Description
The callback function is called on each VML mathematical function error if the VML_ERRMODE_CALLBACK error mode is set (see "Values of the mode Parameter"). Use the vmlSetErrorCallBack() function if you need to define your own callback function instead of the default empty callback function.
The input structure for a callback function contains the following information about the error encountered:
• the input value that caused an error
• location (array index) of this value
• the computed result value
• error code
• name of the function in which the error occurred.

You can insert your own error processing into the callback function. This may include correcting the passed result values in order to pass them back and resume computation. The standard error handler is called after the callback function only if the callback returns 0.

vmlGetErrorCallBack

Gets the additional error handler callback function.

Syntax
Fortran: callback = vmlgeterrorcallback( )
C: callback = vmlGetErrorCallBack( void );

Include Files
• Fortran 90: mkl_vml.f90
• C: mkl_vml_functions.h

Output Parameters
Name  Description
callback
    Fortran 90: Address of the callback function
    C: Pointer to the callback function

vmlClearErrorCallBack

Deletes the additional error handler callback function and retrieves the former callback function.

Syntax
Fortran: oldcallback = vmlclearerrorcallback( )
C: oldcallback = vmlClearErrorCallBack( void );

Include Files
• Fortran 90: mkl_vml.f90
• C: mkl_vml_functions.h

Output Parameters
Name  Type  Description
oldcallback
    Fortran 90: INTEGER
    C: int
    FORTRAN: Address of the former callback function
    C: Pointer to the former callback function

Statistical Functions

Statistical functions in Intel® MKL are known as the Vector Statistical Library (VSL), which is designed for the purpose of
• generating vectors of pseudorandom and quasi-random numbers
• performing mathematical operations of convolution and correlation
• computing basic statistical estimates for single and double precision multi-dimensional datasets

The corresponding functionality is described in the respective Random Number Generators, Convolution and Correlation, and VSL Summary Statistics sections.
See VSL performance data in the online VSL Performance Data document available at http://software.intel.com/en-us/articles/intel-math-kernel-library-documentation/

The basic notion in VSL is a task. The task object is a data structure or descriptor that holds the parameters related to a specific statistical operation: random number generation, convolution and correlation, or summary statistics estimation. Such parameters can be an identifier of a random number generator, its internal state and parameters, data arrays, their shape and dimensions, an identifier of the operation and so forth. You can modify the VSL task parameters using the respective service functionality of the library.

Optimization Notice
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804

Random Number Generators

Intel MKL VSL provides a set of routines implementing commonly used pseudo- or quasi-random number generators with continuous and discrete distribution. To improve performance, all these routines were developed using calls to the highly optimized Basic Random Number Generators (BRNGs) and the library of vector mathematical functions (VML, see Chapter 9, "Vector Mathematical Functions").

VSL provides interfaces for both the Fortran and C languages.
For users of the C and C++ languages the mkl_vsl.h header file is provided. For users of the Fortran 90 or Fortran 95 language the mkl_vsl.f90 header file is provided. The mkl_vsl.fi header file available in the previous versions of Intel MKL is retained for backward compatibility. For users of the FORTRAN 77 language the mkl_vsl.f77 header file is provided. All header files are found in the following directory:

    ${MKL}/include

The mkl_vsl.f90 header is intended to be used via the Fortran include clause and is compatible with both standard forms of F90/F95 sources: the free form and the 72-column fixed form. If you need to use the VSL interface with 80- or 132-column fixed-form sources, you may add a new file to your project. That file is formatted as a 72-column fixed-form source and consists of a single include clause as follows:

    include 'mkl_vsl.f90'

This include clause causes the compiler to generate the module files mkl_vsl.mod and mkl_vsl_type.mod, which are used to process the Fortran use clauses referencing the VSL interface:

    use mkl_vsl_type
    use mkl_vsl

Because of this specific feature, you do not need to include the mkl_vsl.f90 header into each source of your project. You only need to include the header into some of the sources. In any case, make sure that the sources that depend on the VSL interface are compiled after those that include the header so that the module files mkl_vsl.mod and mkl_vsl_type.mod are generated prior to using them.

The mkl_vsl.f77 header is intended to be used via the Fortran include clause as follows:

    include 'mkl_vsl.f77'

NOTE For the Fortran 90 interface, VSL provides both a subroutine-style interface and a function-style interface. The default interface in this case is the function-style interface. The function-style interface, unlike the subroutine-style interface, allows the user to get the error status of each routine. The subroutine-style interface is provided for backward compatibility only.
To use the subroutine-style interface, manually include the mkl_vsl_subroutine.fi file instead of mkl_vsl.f90 by replacing the line include 'mkl_vsl.f90' in include\mkl.fi with the line include 'mkl_vsl_subroutine.fi'.

For the FORTRAN 77 interface, VSL provides only the function-style interface.

All VSL routines can be classified into three major categories:
• Transformation routines for different types of statistical distributions, for example, uniform, normal (Gaussian), binomial, etc. These routines indirectly call basic random number generators, which are either pseudorandom number generators or quasi-random number generators. A detailed description of the generators can be found in the Distribution Generators section.
• Service routines to handle random number streams: create, initialize, delete, copy, save to a binary file, load from a binary file, get the index of a basic generator. The description of these routines can be found in the Service Routines section.
• Registration routines for basic pseudorandom generators and routines that obtain properties of the registered generators (see the Advanced Service Routines section).

The last two categories are referred to as service routines.

Conventions

This document makes no specific differentiation between random, pseudorandom, and quasi-random numbers, nor between random, pseudorandom, and quasi-random number generators unless the context requires otherwise. For details, refer to the 'Random Numbers' section in the VSL Notes document provided at the Intel® MKL web page.

All generators of nonuniform distributions, both discrete and continuous, are built on the basis of the uniform distribution generators, called Basic Random Number Generators (BRNGs). The pseudorandom numbers with nonuniform distribution are obtained through an appropriate transformation of the uniformly distributed pseudorandom numbers. Such transformations are referred to as generation methods. For a given distribution, several generation methods can be used. See VSL Notes for the description of methods available for each generator.

An RNG task determines the environment in which random number generation is performed, in particular the parameters of the BRNG and its internal state. The output of VSL generators is a stream of random numbers that are used in Monte Carlo simulations. A random stream descriptor and a random stream are used as synonyms of an RNG task in the document unless the context requires otherwise.

The random stream descriptor specifies which BRNG should be used in a given transformation method. See the Random Streams and RNGs in Parallel Computation section of VSL Notes.

The term computational node means a logical or physical unit that can process data in parallel.

Mathematical Notation

The following notation is used throughout the text:

N                    The set of natural numbers N = {1, 2, 3 ...}.
Z                    The set of integers Z = {... -3, -2, -1, 0, 1, 2, 3 ...}.
R                    The set of real numbers.
$\lfloor a \rfloor$  The floor of a (the largest integer less than or equal to a).
$\oplus$ or xor      Bitwise exclusive OR.
$\binom{a}{k}$       Binomial coefficient or combination ($a \in R$, $a \ge 0$; $k \in N \cup \{0\}$).
                     For $a \ge k$ the binomial coefficient is defined as $\binom{a}{k} = \frac{a(a-1)\cdots(a-k+1)}{k!}$. If $a < k$, then $\binom{a}{k} = 0$.
$\Phi(x)$            Cumulative Gaussian distribution function $\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} \exp\left(-\frac{y^2}{2}\right) dy$ defined over $-\infty < x < +\infty$. $\Phi(-\infty) = 0$, $\Phi(+\infty) = 1$.
$\Gamma(\alpha)$     The complete gamma function $\Gamma(\alpha) = \int_{0}^{\infty} t^{\alpha-1} e^{-t} dt$, where $\alpha > 0$.
$B(p, q)$            The complete beta function $B(p, q) = \int_{0}^{1} t^{p-1} (1-t)^{q-1} dt$, where $p > 0$ and $q > 0$.
LCG(a, c, m)         Linear Congruential Generator $x_{n+1} = (a x_n + c) \bmod m$, where a is called the multiplier, c is called the increment, and m is called the modulus of the generator.
MCG(a, m)            Multiplicative Congruential Generator $x_{n+1} = (a x_n) \bmod m$ is a special case of the Linear Congruential Generator, where the increment c is taken to be 0.
GFSR(p, q)           Generalized Feedback Shift Register Generator $x_n = x_{n-p} \oplus x_{n-q}$.

Naming Conventions

The names of the Fortran routines in VSL random number generators are lowercase (virnguniform). The names are not case-sensitive.

In C, the names of the routines, types, and constants are case-sensitive and can be lowercase and uppercase (viRngUniform).

The names of generator routines have the following structure:

    v<type of result>rng<distribution> for the Fortran interface
    v<type of result>Rng<distribution> for the C interface,

where
• v is the prefix of a VSL vector function.
• <type of result> is either s, d, or i and specifies one of the following types:
    s    REAL for the Fortran interface
         float for the C interface
    d    DOUBLE PRECISION for the Fortran interface
         double for the C interface
    i    INTEGER for the Fortran interface
         int for the C interface
  Prefixes s and d apply to continuous distributions only, prefix i applies only to the discrete case.
• rng indicates that the routine is a random generator.
• <distribution> specifies the type of statistical distribution.

Names of service routines follow the template below:

    vsl<name>

where
• vsl is the prefix of a VSL service function.
• <name> contains a short function name.

For a more detailed description of service routines, refer to the Service Routines and Advanced Service Routines sections.
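The congruential recurrences listed under Mathematical Notation can be illustrated with a few lines of C. This is a minimal sketch of the recurrence only, assuming all parameters and states are below 2^31 so that the intermediate product fits in 64 bits; the function names lcg_next, mcg_next, and mcg_fill are hypothetical and are not part of VSL, whose generators also handle seeding, scaling to floating point, and vectorization.

```c
#include <assert.h>
#include <stdint.h>

/* One step of LCG(a, c, m): x_{n+1} = (a * x_n + c) mod m.
   Assumes a, c, x < 2^31, so a * x + c cannot overflow 64 bits. */
uint64_t lcg_next(uint64_t a, uint64_t c, uint64_t m, uint64_t x)
{
    assert(m != 0);
    return (a * x + c) % m;
}

/* MCG(a, m) is the special case of LCG with increment c = 0. */
uint64_t mcg_next(uint64_t a, uint64_t m, uint64_t x)
{
    return lcg_next(a, 0, m, x);
}

/* Writes n successive MCG states into out[], starting with the state
   that follows seed (the seed itself is not stored). */
void mcg_fill(uint64_t a, uint64_t m, uint64_t seed, uint64_t *out, int n)
{
    uint64_t x = seed;
    for (int i = 0; i < n; ++i) {
        x = mcg_next(a, m, x);
        out[i] = x;
    }
}
```

For instance, mcg_next(1132489760, 2147483647, x) advances a state under the multiplier and modulus quoted for the MCG(1132489760, 2^31 - 1) generator; dividing the state by the modulus would map it into the unit interval.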
The prototype of each generator routine corresponding to a given probability distribution fits the following structure:

    status = <function name>( method, stream, n, r, [<distribution parameters>] )

where
• method defines the method of generation. A detailed description of this parameter can be found in table "Values of <method> in method parameter". The structure of the method parameter name is explained below.
• stream defines the descriptor of the random stream and must have a non-zero value. Random streams, descriptors, and their usage are discussed further in Random Streams and Service Routines.
• n defines the number of random values to be generated. If n is less than or equal to zero, no values are generated. Furthermore, if n is negative, an error condition is set.
• r defines the destination array for the generated numbers. The dimension of the array must be large enough to store at least n random numbers.
• status defines the error status of a VSL routine. See the Error Reporting section for a detailed description of error status values.

Additional parameters included in the <distribution parameters> field are individual for each generator routine and are described in detail in the Distribution Generators section.

To invoke a distribution generator, use a call to the respective VSL routine. For example, to obtain a vector r, composed of n independent and identically distributed random numbers with normal (Gaussian) distribution, that have the mean value a and standard deviation sigma, write the following:

for the Fortran interface
    status = vsrnggaussian( method, stream, n, r, a, sigma )
for the C interface
    status = vsRngGaussian( method, stream, n, r, a, sigma );

The name of a method parameter has the following structure:

    VSL_RNG_METHOD_<distribution>_<method>
    VSL_RNG_METHOD_<distribution>_<method>_ACCURATE

where
• <distribution> is the probability distribution.
• <method> is the method name.
The two forms of the method parameter name correspond to the fast and accurate modes of random number generation (see the "Distribution Generators" section and VSL Notes for details).

Method names VSL_RNG_METHOD_<distribution>_<method> and VSL_RNG_METHOD_<distribution>_<method>_ACCURATE should be used with the v<precision>Rng<distribution> function only, where
• <precision> is
    s for single precision continuous distribution
    d for double precision continuous distribution
    i for discrete distribution
• <distribution> is the probability distribution.

Table "Values of <method> in method parameter" provides specific predefined values of the method name. The third column contains names of the functions that use the given method.

Values of <method> in method parameter
Method  Short Description  Functions

STD
    Standard method. Currently there is only one method for these functions.
    Functions: Uniform (continuous), Uniform (discrete), UniformBits, UniformBits32, UniformBits64

BOXMULLER
    BOXMULLER generates a normally distributed random number x through the pair of uniformly distributed numbers u1 and u2 according to the formula:
    $x = \sqrt{-2 \ln u_1} \, \sin 2\pi u_2$
    Functions: Gaussian, GaussianMV

BOXMULLER2
    BOXMULLER2 generates normally distributed random numbers x1 and x2 through the pair of uniformly distributed numbers u1 and u2 according to the formulas:
    $x_1 = \sqrt{-2 \ln u_1} \, \sin 2\pi u_2$
    $x_2 = \sqrt{-2 \ln u_1} \, \cos 2\pi u_2$
    Functions: Gaussian, GaussianMV, Lognormal

ICDF
    Inverse cumulative distribution function method.
    Functions: Exponential, Laplace, Weibull, Cauchy, Rayleigh, Gumbel, Bernoulli, Geometric, Gaussian, GaussianMV

GNORM
    For $\alpha > 1$, a gamma distributed random number is generated as a cube of a properly scaled normal random number; for $0.6 \le \alpha < 1$, a gamma distributed random number is generated using rejection from the Weibull distribution; for $\alpha < 0.6$, a gamma distributed random number is obtained using a transformation of the exponential power distribution; for $\alpha = 1$, the gamma distribution is reduced to the exponential distribution.
    Functions: Gamma

CJA
    For min(p, q) > 1, the Cheng method is used; for min(p, q) < 1, the Johnk method is used if $q + K \cdot p^2 + C \le 0$ (K = 0.852..., C = -0.956...), otherwise the Atkinson switching algorithm is used; for max(p, q) < 1, the method of Johnk is used; for min(p, q) < 1, max(p, q) > 1, the Atkinson switching algorithm is used (CJA stands for the first letters of Cheng, Johnk, Atkinson); for p = 1 or q = 1, the inverse cumulative distribution function method is used; for p = 1 and q = 1, the beta distribution is reduced to the uniform distribution.
    Functions: Beta

BTPE
    Acceptance/rejection method for $ntrial \cdot \min(p, 1 - p) \ge 30$ with decomposition into 4 regions:
    - 2 parallelograms
    - triangle
    - left exponential tail
    - right exponential tail
    Functions: Binomial

H2PE
    Acceptance/rejection method for large mode of distribution with decomposition into 3 regions:
    - rectangular
    - left exponential tail
    - right exponential tail
    Functions: Hypergeometric

PTPE
    Acceptance/rejection method for $\lambda \ge 27$ with decomposition into 4 regions:
    - 2 parallelograms
    - triangle
    - left exponential tail
    - right exponential tail;
    otherwise, table lookup method is used.
    Functions: Poisson

POISNORM
    For $\lambda \ge 1$, method based on Poisson inverse CDF approximation by Gaussian inverse CDF; for $\lambda < 1$, table lookup method is used.
    Functions: Poisson, PoissonV

NBAR
    Acceptance/rejection method for [condition missing in source], with decomposition into 5 regions:
    - rectangular
    - 2 trapezoid
    - left exponential tail
    - right exponential tail
    Functions: NegBinomial

NOTE In this document, routines are often referred to by their base name (Gaussian) when this does not lead to ambiguity. In the routine reference, the full name (vsrnggaussian, vsRngGaussian) is always used in prototypes and code examples.
Basic Generators

VSL provides the following BRNGs, which differ in speed and other properties:
• the 31-bit multiplicative congruential pseudorandom number generator MCG(1132489760, $2^{31} - 1$) [L'Ecuyer99]
• the 32-bit generalized feedback shift register pseudorandom number generator GFSR(250,103) [Kirkpatrick81]
• the combined multiple recursive pseudorandom number generator MRG-32k3a [L'Ecuyer99a]
• the 59-bit multiplicative congruential pseudorandom number generator MCG($13^{13}$, $2^{59}$) from NAG Numerical Libraries [NAG]
• Wichmann-Hill pseudorandom number generator (a set of 273 basic generators) from NAG Numerical Libraries [NAG]
• Mersenne Twister pseudorandom number generator MT19937 [Matsumoto98] with period length $2^{19937} - 1$ of the produced sequence
• Set of 6024 Mersenne Twister pseudorandom number generators MT2203 [Matsumoto98], [Matsumoto00]. Each of them generates a sequence of period length equal to $2^{2203} - 1$. Parameters of the generators provide mutual independence of the corresponding sequences.
• SIMD-oriented Fast Mersenne Twister pseudorandom number generator SFMT19937 [Saito08] with a period length equal to $2^{19937} - 1$ of the produced sequence.

Besides these pseudorandom number generators, VSL provides two basic quasi-random number generators:
• Sobol quasi-random number generator [Sobol76], [Bratley88], which works in arbitrary dimension. For dimensions greater than 40 the user should supply initialization parameters (initial direction numbers and primitive polynomials or direction numbers) by using the vslNewStreamEx function. See additional details on the interface for registration of the parameters in the library in VSL Notes.
• Niederreiter quasi-random number generator [Bratley92], which works in arbitrary dimension. For dimensions greater than 318 the user should supply initialization parameters (irreducible polynomials or direction numbers) by using the vslNewStreamEx function.
See additional details on the interface for registration of the parameters in the library in VSL Notes.

See some testing results for the generators in VSL Notes and comparative performance data at http://software.intel.com/sites/products/documentation/hpc/mkl/vsl/vsl_data/vsl_performance_data.htm.

For some basic generators, VSL provides two methods of creating independent random streams in multiprocessor computations, which are the leapfrog method and the block-splitting method. These sequence splitting methods are also useful in sequential Monte Carlo.

In addition, the MT2203 pseudorandom number generator is a set of 6024 generators designed to create up to 6024 independent random sequences, which might be used in parallel Monte Carlo simulations. Another generator that has the same feature is Wichmann-Hill. It allows creating up to 273 independent random streams. The properties of the generators designed for parallel computations are discussed in detail in [Coddington94].

You may want to design and use your own basic generators. VSL provides means of registration of such user-designed generators through the steps described in the Advanced Service Routines section.

There is also an option to utilize externally generated random numbers in VSL distribution generator routines. For this purpose VSL provides three additional basic random number generators:
- for external random data packed in a 32-bit integer array
- for external random data stored in a double precision floating-point array; data is supposed to be uniformly distributed over the (a,b) interval
- for external random data stored in a single precision floating-point array; data is supposed to be uniformly distributed over the (a,b) interval.

Such basic generators are called the abstract basic random number generators.

See VSL Notes for a more detailed description of the generator properties.
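The difference between the leapfrog and block-splitting methods mentioned above can be sketched in plain C. The sketch assumes the base sequence has already been written into an array, which is done purely for illustration: a library would instead reposition the generator state rather than materialize the sequence. The helper names leapfrog_take and block_take are hypothetical and are not VSL routines.

```c
#include <assert.h>
#include <stddef.h>

/* Leapfrog: stream k out of nstreams takes every nstreams-th element,
   that is base[k], base[k + nstreams], base[k + 2*nstreams], ...
   Returns the number of elements copied into out[]. */
size_t leapfrog_take(const unsigned *base, size_t nbase,
                     size_t k, size_t nstreams, unsigned *out)
{
    assert(nstreams > 0 && k < nstreams);
    size_t count = 0;
    for (size_t i = k; i < nbase; i += nstreams)
        out[count++] = base[i];
    return count;
}

/* Block-splitting: stream k takes the k-th contiguous block of length
   blen, that is base[k*blen], ..., base[k*blen + blen - 1], truncated
   at the end of the base sequence.
   Returns the number of elements copied into out[]. */
size_t block_take(const unsigned *base, size_t nbase,
                  size_t k, size_t blen, unsigned *out)
{
    assert(blen > 0);
    size_t count = 0;
    for (size_t i = k * blen; i < nbase && count < blen; ++i)
        out[count++] = base[i];
    return count;
}
```

With a base sequence of eight elements and three streams, leapfrog stream 1 receives elements 1, 4, and 7, whereas block-splitting with blocks of length 3 gives stream 1 elements 3, 4, and 5. In both schemes the streams never overlap, which is what makes them usable on separate computational nodes.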
BRNG Parameter Definition

Predefined values for the brng input parameter are as follows:

Values of brng parameter
Value                  Short Description
VSL_BRNG_MCG31         A 31-bit multiplicative congruential generator.
VSL_BRNG_R250          A generalized feedback shift register generator.
VSL_BRNG_MRG32K3A      A combined multiple recursive generator with two components of order 3.
VSL_BRNG_MCG59         A 59-bit multiplicative congruential generator.
VSL_BRNG_WH            A set of 273 Wichmann-Hill combined multiplicative congruential generators.
VSL_BRNG_MT19937       A Mersenne Twister pseudorandom number generator.
VSL_BRNG_MT2203        A set of 6024 Mersenne Twister pseudorandom number generators.
VSL_BRNG_SFMT19937     A SIMD-oriented Fast Mersenne Twister pseudorandom number generator.
VSL_BRNG_SOBOL         A 32-bit Gray code-based generator producing low-discrepancy sequences for dimensions $1 \le s \le 40$; user-defined dimensions are also available.
VSL_BRNG_NIEDERR       A 32-bit Gray code-based generator producing low-discrepancy sequences for dimensions $1 \le s \le 318$; user-defined dimensions are also available.
VSL_BRNG_IABSTRACT     An abstract random number generator for integer arrays.
VSL_BRNG_DABSTRACT An abstract random number generator for double precision floating-point arrays. VSL_BRNG_SABSTRACT An abstract random number generator for single precision floating-point arrays. See VSL Notes for detailed description. Optimization Notice Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804 Random Streams Random stream (or stream) is an abstract source of pseudo- and quasi-random sequences of uniform distribution. You can operate with stream state descriptors only. A stream state descriptor, which holds state descriptive information for a particular BRNG, is a necessary parameter in each routine of a distribution generator. Only the distribution generator routines operate with random streams directly. See VSL Notes for details. NOTE Random streams associated with abstract basic random number generator are called the abstract random streams. See VSL Notes for detailed description of abstract streams and their use. Statistical Functions 10 2123 You can create unlimited number of random streams by VSL Service Routines like NewStream and utilize them in any distribution generator to get the sequence of numbers of given probability distribution. When they are no longer needed, the streams should be deleted calling service routine DeleteStream. 
VSL provides service functions SaveStreamF and LoadStreamF to save random stream descriptive data to a binary file and to read this data from a binary file, respectively. See VSL Notes for detailed description.

Data Types
FORTRAN 77:
    INTEGER*4 vslstreamstate(2)
Fortran 90:
    TYPE VSL_STREAM_STATE
        INTEGER*4 descriptor1
        INTEGER*4 descriptor2
    END TYPE VSL_STREAM_STATE
C:
    typedef void* VSLStreamStatePtr;
See Advanced Service Routines for the format of the stream state structure for user-designed generators.

Error Reporting
VSL RNG routines return status codes of the performed operation to report errors to the calling program. The application should perform error-related actions and/or recover from the error. The status codes are of integer type. Codes with the VSL_ERROR_ prefix indicate VSL errors common to all VSL domains; codes with the VSL_RNG_ERROR_ prefix indicate VSL RNG errors.
VSL RNG errors have negative values, while warnings have positive values. A status code of zero indicates successful completion of the operation: VSL_ERROR_OK (or its synonym VSL_STATUS_OK).

Status Codes
Status Code
    Description

Common VSL
VSL_ERROR_OK, VSL_STATUS_OK
    No error, execution is successful.
VSL_ERROR_BADARGS
    Input argument value is not valid.
VSL_ERROR_CPU_NOT_SUPPORTED
    CPU version is not supported.
VSL_ERROR_FEATURE_NOT_IMPLEMENTED
    Feature invoked is not implemented.
VSL_ERROR_MEM_FAILURE
    System cannot allocate memory.
VSL_ERROR_NULL_PTR
    Input pointer argument is NULL.
VSL_ERROR_UNKNOWN
    Unknown error.

VSL RNG Specific
VSL_RNG_ERROR_BAD_FILE_FORMAT
    File format is unknown.
VSL_RNG_ERROR_BAD_MEM_FORMAT
    Descriptive random stream format is unknown.
VSL_RNG_ERROR_BAD_NBITS
    The value in NBits field is bad.
VSL_RNG_ERROR_BAD_NSEEDS
    The value in NSeeds field is bad.
VSL_RNG_ERROR_BAD_STREAM
    The random stream is invalid.
VSL_RNG_ERROR_BAD_STREAM_STATE_SIZE
    The value in StreamStateSize field is bad.
VSL_RNG_ERROR_BAD_UPDATE
    Callback function for an abstract BRNG returns an invalid number of updated entries in a buffer, that is, < 0 or > nmax.
VSL_RNG_ERROR_BAD_WORD_SIZE
    The value in WordSize field is bad.
VSL_RNG_ERROR_BRNG_NOT_SUPPORTED
    BRNG is not supported by the function.
VSL_RNG_ERROR_BRNG_TABLE_FULL
    Registration cannot be completed due to lack of free entries in the table of registered BRNGs.
VSL_RNG_ERROR_BRNGS_INCOMPATIBLE
    Two BRNGs are not compatible for the operation.
VSL_RNG_ERROR_FILE_CLOSE
    Error in closing the file.
VSL_RNG_ERROR_FILE_OPEN
    Error in opening the file.
VSL_RNG_ERROR_FILE_READ
    Error in reading the file.
VSL_RNG_ERROR_FILE_WRITE
    Error in writing the file.
VSL_RNG_ERROR_INVALID_ABSTRACT_STREAM
    The abstract random stream is invalid.
VSL_RNG_ERROR_INVALID_BRNG_INDEX
    BRNG index is not valid.
VSL_RNG_ERROR_LEAPFROG_UNSUPPORTED
    BRNG does not support Leapfrog method.
VSL_RNG_ERROR_NO_NUMBERS
    Callback function for an abstract BRNG returns zero as the number of updated entries in a buffer.
VSL_RNG_ERROR_QRNG_PERIOD_ELAPSED
    Period of the generator is exceeded.
VSL_RNG_ERROR_SKIPAHEAD_UNSUPPORTED
    BRNG does not support Skip-Ahead method.
VSL_RNG_ERROR_UNSUPPORTED_FILE_VER
    File format version is not supported.

VSL RNG Usage Model
A typical algorithm for VSL random number generators is as follows:
1. Create and initialize stream/streams. Functions vslNewStream, vslNewStreamEx, vslCopyStream, vslCopyStreamState, vslLeapfrogStream, vslSkipAheadStream.
2. Call one or more RNGs.
3. Process the output.
4. Delete the stream/streams. Function vslDeleteStream.

NOTE You may reiterate steps 2-3. Random number streams may be generated for different threads.

The following C example demonstrates generation of a random stream that is the output of basic generator MT19937. The seed is equal to 777.
The stream is used to generate 10,000 normally distributed random numbers in blocks of 1,000 random numbers with parameters a = 5 and sigma = 2. The stream is deleted after the generation is complete. The purpose of the example is to calculate the sample mean for normal distribution with the given parameters.

C Example of VSL RNG Usage

    #include <stdio.h>
    #include "mkl_vsl.h"

    int main()
    {
        double r[1000]; /* buffer for random numbers */
        double s;       /* average */
        VSLStreamStatePtr stream;
        int i, j;

        /* Initializing */
        s = 0.0;
        vslNewStream( &stream, VSL_BRNG_MT19937, 777 );

        /* Generating: 10 blocks of 1000 numbers each */
        for ( i=0; i<10; i++ )
        {
            vdRngGaussian( VSL_RNG_METHOD_GAUSSIAN_ICDF, stream, 1000, r, 5.0, 2.0 );
            for ( j=0; j<1000; j++ )
            {
                s += r[j];
            }
        }
        s /= 10000.0;

        /* Deleting the stream */
        vslDeleteStream( &stream );

        /* Printing results */
        printf( "Sample mean of normal distribution = %f\n", s );

        return 0;
    }

The Fortran version of the same example is below:

Fortran Example of VSL RNG Usage

          include 'mkl_vsl.f90'

          program MKL_VSL_GAUSSIAN

          USE MKL_VSL_TYPE
          USE MKL_VSL

          real(kind=8) r(1000) ! buffer for random numbers
          real(kind=8) s ! average
          real(kind=8) a, sigma ! parameters of normal distribution

          TYPE (VSL_STREAM_STATE) :: stream

          integer(kind=4) errcode
          integer(kind=4) i,j
          integer brng,method,seed,n

          n = 1000
          s = 0.0
          a = 5.0
          sigma = 2.0
          brng=VSL_BRNG_MT19937
          method=VSL_RNG_METHOD_GAUSSIAN_ICDF
          seed=777

    !     ***** Initializing *****
          errcode=vslnewstream( stream, brng, seed )

    !     ***** Generating *****
          do i = 1,10
              errcode=vdrnggaussian( method, stream, n, r, a, sigma )
              do j = 1, 1000
                  s = s + r(j)
              end do
          end do

          s = s / 10000.0

    !     ***** Deinitialize *****
          errcode=vsldeletestream( stream )

    !     ***** Printing results *****
          print *,"Sample mean of normal distribution = ", s

          end

Additionally, examples that demonstrate usage of VSL random number generators are available in the following directories:
    ${MKL}/examples/vslc/source
    ${MKL}/examples/vslf/source

Service Routines
Stream handling comprises routines for creating, deleting, or copying the streams and getting the index of a basic generator. A random stream can also be saved to and then read from a binary file. Table "Service Routines" lists all available service routines.

Service Routines
Routine                  Short Description
vslNewStream             Creates and initializes a random stream.
vslNewStreamEx           Creates and initializes a random stream for the generators with multiple initial conditions.
vsliNewAbstractStream    Creates and initializes an abstract random stream for integer arrays.
vsldNewAbstractStream    Creates and initializes an abstract random stream for double precision floating-point arrays.
vslsNewAbstractStream    Creates and initializes an abstract random stream for single precision floating-point arrays.
vslDeleteStream          Deletes previously created stream.
vslCopyStream            Copies a stream to another stream.
vslCopyStreamState       Creates a copy of a random stream state.
vslSaveStreamF           Writes a stream to a binary file.
vslLoadStreamF           Reads a stream from a binary file.
vslSaveStreamM           Writes a random stream descriptive data, including state, to a memory buffer.
vslLoadStreamM           Creates a new stream and reads stream descriptive data, including state, from the memory buffer.
vslGetStreamSize         Computes size of memory necessary to hold the random stream.
vslLeapfrogStream        Initializes the stream by the leapfrog method to generate a subsequence of the original sequence.
vslSkipAheadStream       Initializes the stream by the skip-ahead method.
vslGetStreamStateBrng    Obtains the index of the basic generator responsible for the generation of a given random stream.
vslGetNumRegBrngs Obtains the number of currently registered basic generators. Most of the generator-based work comprises three basic steps: 1. Creating and initializing a stream (vslNewStream, vslNewStreamEx, vslCopyStream, vslCopyStreamState, vslLeapfrogStream, vslSkipAheadStream). 2. Generating random numbers with given distribution, see Distribution Generators. 3. Deleting the stream (vslDeleteStream). Note that you can concurrently create multiple streams and obtain random data from one or several generators by using the stream state. You must use the vslDeleteStream function to delete all the streams afterwards. vslNewStream Creates and initializes a random stream. Syntax Fortran: status = vslnewstream( stream, brng, seed ) C: status = vslNewStream( &stream, brng, seed ); Include Files • FORTRAN 77: mkl_vsl.f77 • Fortran 90: mkl_vsl.f90 • C: mkl_vsl_functions.h Input Parameters Name Type Description brng FORTRAN 77: INTEGER Fortran 90: INTEGER, INTENT(IN) C: const int Index of the basic generator to initialize the stream. See Table Values of brng parameter for specific value. 10 Intel® Math Kernel Library Reference Manual 2128 Name Type Description seed FORTRAN 77: INTEGER Fortran 90: INTEGER, INTENT(IN) C: const unsigned int Initial condition of the stream. In the case of a quasirandom number generator seed parameter is used to set the dimension. If the dimension is greater than the dimension that brng can support or is less than 1, then the dimension is assumed to be equal to 1. Output Parameters Name Type Description stream FORTRAN 77: INTEGER*4 stream(2) Fortran 90: TYPE(VSL_STREAM_STATE), INTENT(OUT) C: VSLStreamStatePtr* Stream state descriptor Description For a basic generator with number brng, this function creates a new stream and initializes it with a 32-bit seed. The seed is an initial value used to select a particular sequence generated by the basic generator brng. The function is also applicable for generators with multiple initial conditions. 
See VSL Notes for a more detailed description of stream initialization for different basic generators. NOTE This function is not applicable for abstract basic random number generators. Please use vsliNewAbstractStream, vslsNewAbstractStream or vsldNewAbstractStream to utilize integer, single-precision or double-precision external random data respectively. Return Values VSL_ERROR_OK, VSL_STATUS_OK Indicates no error, execution is successful. VSL_RNG_ERROR_INVALID_BRNG_INDEX BRNG index is invalid. VSL_ERROR_MEM_FAILURE System cannot allocate memory for stream. vslNewStreamEx Creates and initializes a random stream for generators with multiple initial conditions. Syntax Fortran: status = vslnewstreamex( stream, brng, n, params ) C: status = vslNewStreamEx( &stream, brng, n, params ); Include Files • FORTRAN 77: mkl_vsl.f77 • Fortran 90: mkl_vsl.f90 • C: mkl_vsl_functions.h Statistical Functions 10 2129 Input Parameters Name Type Description brng FORTRAN 77: INTEGER Fortran 90: INTEGER, INTENT(IN) C: const int Index of the basic generator to initialize the stream. See Table "Values of brng parameter" for specific value. n FORTRAN 77: INTEGER Fortran 90: INTEGER, INTENT(IN) C: const int Number of initial conditions contained in params params FORTRAN 77: INTEGER Fortran 90: INTEGER(KIND=4), INTENT(IN) C: const unsigned int Array of initial conditions necessary for the basic generator brng to initialize the stream. In the case of a quasi-random number generator only the first element in params parameter is used to set the dimension. If the dimension is greater than the dimension that brng can support or is less than 1, then the dimension is assumed to be equal to 1. 
Output Parameters Name Type Description stream FORTRAN 77: INTEGER*4 stream(2) Fortran 90: TYPE(VSL_STREAM_STATE), INTENT(OUT) C: VSLStreamStatePtr* Stream state descriptor Description The vslNewStreamEx function provides an advanced tool to set the initial conditions for a basic generator if its input arguments imply several initialization parameters. Initial values are used to select a particular sequence generated by the basic generator brng. Whenever possible, use vslNewStream, which is analogous to vslNewStreamEx except that it takes only one 32-bit initial condition. In particular, vslNewStreamEx may be used to initialize the state table in Generalized Feedback Shift Register Generators (GFSRs). A more detailed description of this issue can be found in VSL Notes. This function is also used to pass user-defined initialization parameters of quasi-random number generators into the library. See VSL Notes for the format for their passing and registration in VSL. NOTE This function is not applicable for abstract basic random number generators. Please use vsliNewAbstractStream, vslsNewAbstractStream or vsldNewAbstractStream to utilize integer, single-precision or double-precision external random data respectively. Return Values VSL_ERROR_OK, VSL_STATUS_OK Indicates no error, execution is successful. VSL_RNG_ERROR_INVALID_BRNG_INDEX BRNG index is invalid. 10 Intel® Math Kernel Library Reference Manual 2130 VSL_ERROR_MEM_FAILURE System cannot allocate memory for stream. vsliNewAbstractStream Creates and initializes an abstract random stream for integer arrays. 
Syntax Fortran: status = vslinewabstractstream( stream, n, ibuf, icallback ) C: status = vsliNewAbstractStream( &stream, n, ibuf, icallback ); Include Files • FORTRAN 77: mkl_vsl.f77 • Fortran 90: mkl_vsl.f90 • C: mkl_vsl_functions.h Input Parameters Name Type Description n FORTRAN 77: INTEGER Fortran 90: INTEGER, INTENT(IN) C: const int Size of the array ibuf ibuf FORTRAN 77: INTEGER Fortran 90: INTEGER(KIND=4), INTENT(IN) C: const unsigned int Array of n 32-bit integers icallback See Note below Fortran: Address of the callback function used for ibuf update C: Pointer to the callback function used for ibuf update NOTE Format of the callback function in FORTRAN 77: INTEGER FUNCTION IUPDATEFUNC( stream, n, ibuf, nmin, nmax, idx ) INTEGER*4 stream(2) INTEGER n INTEGER*4 ibuf(n) INTEGER nmin INTEGER nmax INTEGER idx Statistical Functions 10 2131 Format of the callback function in Fortran 90: INTEGER FUNCTION IUPDATEFUNC[C]( stream, n, ibuf, nmin, nmax, idx ) TYPE(VSL_STREAM_STATE),POINTER :: stream[reference] INTEGER(KIND=4),INTENT(IN) :: n[reference] INTEGER(KIND=4),INTENT(OUT) :: ibuf[reference](0:n-1) INTEGER(KIND=4),INTENT(IN) :: nmin[reference] INTEGER(KIND=4),INTENT(IN) :: nmax[reference] INTEGER(KIND=4),INTENT(IN) :: idx[reference] Format of the callback function in C: int iUpdateFunc( VSLStreamStatePtr stream, int* n, unsigned int ibuf[], int* nmin, int* nmax, int* idx ); The callback function returns the number of elements in the array actually updated by the function. Table icallback Callback Function Parameters gives the description of the callback function parameters. icallback Callback Function Parameters Parameters Short Description stream Abstract random stream descriptor n Size of ibuf ibuf Array of random numbers associated with the stream stream nmin Minimal quantity of numbers to update nmax Maximal quantity of numbers that can be updated idx Position in cyclic buffer ibuf to start update 0=idx nmax. 
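A minimal sketch of a C callback with the signature shown above may help to see how the parameters interact. It is not taken from the library: the typedef is a stand-in so that the fragment compiles without mkl_vsl.h, and external_data and external_pos are hypothetical names for an external source of 32-bit words. The sketch treats ibuf as a cyclic buffer of *n entries, writes at most *nmax entries starting at position *idx, and returns the number of entries it actually updated.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in so the sketch compiles on its own; mkl_vsl.h provides the real type. */
typedef void *VSLStreamStatePtr;

/* Hypothetical external source of 32-bit random words. */
static const unsigned int external_data[] = { 11u, 22u, 33u, 44u, 55u, 66u, 77u };
static size_t external_pos = 0;

int iUpdateFunc(VSLStreamStatePtr stream, int *n, unsigned int ibuf[],
                int *nmin, int *nmax, int *idx)
{
    const size_t available = sizeof external_data / sizeof external_data[0];
    int updated = 0;

    (void)stream; /* unused in this sketch */
    (void)nmin;   /* a real callback should try to deliver at least *nmin numbers */
    assert(*n > 0 && *idx >= 0 && *idx < *n);

    /* Write into the cyclic buffer, wrapping around at position *n. */
    while (updated < *nmax && external_pos < available) {
        ibuf[(*idx + updated) % *n] = external_data[external_pos++];
        updated++;
    }
    return updated; /* 0 once the external source is exhausted */
}
```

Returning 0, a negative value, or a value greater than *nmax corresponds to the VSL_RNG_ERROR_NO_NUMBERS and VSL_RNG_ERROR_BAD_UPDATE conditions listed among the status codes.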
VSL_RNG_ERROR_NO_NUMBERS Callback function for an abstract BRNG returns 0 as the number of updated entries in a buffer. VSL_RNG_ERROR_QRNG_PERIOD_ELAPSED Period of the generator has been exceeded. vRngGaussian Generates normally distributed random numbers. Syntax Fortran: status = vsrnggaussian( method, stream, n, r, a, sigma ) status = vdrnggaussian( method, stream, n, r, a, sigma ) C: status = vsRngGaussian( method, stream, n, r, a, sigma ); status = vdRngGaussian( method, stream, n, r, a, sigma ); Include Files • FORTRAN 77: mkl_vsl.f77 • Fortran 90: mkl_vsl.f90 • C: mkl_vsl_functions.h Input Parameters Name Type Description method FORTRAN 77: INTEGER Fortran 90: INTEGER, INTENT(IN) C: const int Generation method. The specific values are as follows: VSL_RNG_METHOD_GAUSSIAN_BOXMULLER VSL_RNG_METHOD_GAUSSIAN_BOXMULLER2 VSL_RNG_METHOD_GAUSSIAN_ICDF See brief description of the methods BOXMULLER, BOXMULLER2, and ICDF in Table "Values of in method parameter" stream FORTRAN 77: INTEGER*4 stream(2) Fortran 90: TYPE (VSL_STREAM_STATE), INTENT(IN ) C: VSLStreamStatePtr Fortran: Descriptor of the stream state structure C: Pointer to the stream state structure n FORTRAN 77: INTEGER Fortran 90: INTEGER, INTENT(IN) C: const int Number of random values to be generated Statistical Functions 10 2159 Name Type Description a FORTRAN 77: REAL for vsrnggaussian DOUBLE PRECISION for vdrnggaussian Fortran 90: REAL(KIND=4), INTENT(IN) for vsrnggaussian REAL(KIND=8), INTENT(IN) for vdrnggaussian C: const float for vsRngGaussian const double for vdRngGaussian Mean value a. sigma FORTRAN 77: REAL for vsrnggaussian DOUBLE PRECISION for vdrnggaussian Fortran 90: REAL(KIND=4), INTENT(IN) for vsrnggaussian REAL(KIND=8), INTENT(IN) for vdrnggaussian C: const float for vsRngGaussian const double for vdRngGaussian Standard deviation s. 
Output Parameters
Name Type Description
r
    FORTRAN 77: REAL for vsrnggaussian
                DOUBLE PRECISION for vdrnggaussian
    Fortran 90: REAL(KIND=4), INTENT(OUT) for vsrnggaussian
                REAL(KIND=8), INTENT(OUT) for vdrnggaussian
    C: float* for vsRngGaussian
       double* for vdRngGaussian
    Vector of n normally distributed random numbers

Description
The vRngGaussian function generates random numbers with normal (Gaussian) distribution with mean value a and standard deviation σ (the sigma parameter), where a, σ ∈ R; σ > 0.
The probability density function is given by:
$$f_{a,\sigma}(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-a)^2}{2\sigma^2}\right), \quad -\infty < x < +\infty.$$
The cumulative distribution function is as follows:
$$F_{a,\sigma}(x) = \int_{-\infty}^{x} \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(y-a)^2}{2\sigma^2}\right) dy, \quad -\infty < x < +\infty.$$
The cumulative distribution function $F_{a,\sigma}(x)$ can be expressed in terms of the standard normal distribution $\Phi(x)$ as
$$F_{a,\sigma}(x) = \Phi\left(\frac{x-a}{\sigma}\right).$$

Return Values
VSL_ERROR_OK, VSL_STATUS_OK
    Indicates no error, execution is successful.
VSL_ERROR_NULL_PTR
    stream is a NULL pointer.
VSL_RNG_ERROR_BAD_STREAM
    stream is not a valid random stream.
VSL_RNG_ERROR_BAD_UPDATE
    Callback function for an abstract BRNG returns an invalid number of updated entries in a buffer, that is, < 0 or > nmax.
VSL_RNG_ERROR_NO_NUMBERS
    Callback function for an abstract BRNG returns 0 as the number of updated entries in a buffer.
VSL_RNG_ERROR_QRNG_PERIOD_ELAPSED
    Period of the generator has been exceeded.

vRngGaussianMV
Generates random numbers from multivariate normal distribution.

Syntax
Fortran:
    status = vsrnggaussianmv( method, stream, n, r, dimen, mstorage, a, t )
    status = vdrnggaussianmv( method, stream, n, r, dimen, mstorage, a, t )
C:
    status = vsRngGaussianMV( method, stream, n, r, dimen, mstorage, a, t );
    status = vdRngGaussianMV( method, stream, n, r, dimen, mstorage, a, t );

Include Files
• FORTRAN 77: mkl_vsl.f77
• Fortran 90: mkl_vsl.f90
• C: mkl_vsl_functions.h

Input Parameters
Name Type Description
method FORTRAN 77: INTEGER Fortran 90: INTEGER, INTENT(IN) C: const int Generation method.
The specific values are as follows: VSL_RNG_METHOD_GAUSSIANMV_BOXMULLER VSL_RNG_METHOD_GAUSSIANMV_BOXMULLER2 VSL_RNG_METHOD_GAUSSIANMV_ICDF See brief description of the methods BOXMULLER, BOXMULLER2, and ICDF in Table "Values of in method parameter" stream FORTRAN 77: INTEGER*4 stream(2) Fortran 90: TYPE (VSL_STREAM_STATE), INTENT(IN) C: VSLStreamStatePtr Fortran: Descriptor of the stream state structure. C: Pointer to the stream state structure n FORTRAN 77: INTEGER Fortran 90: INTEGER, INTENT(IN) C: const int Number of d-dimensional vectors to be generated dimen FORTRAN 77: INTEGER Fortran 90: INTEGER, INTENT(IN) C: const int Dimension d ( d = 1) of output random vectors mstorage FORTRAN 77: INTEGER Fortran 90: INTEGER, INTENT(IN) C: const int Fortran: Matrix storage scheme for upper triangular matrix TT. The routine supports three matrix storage schemes: • VSL_MATRIX_STORAGE_FULL— all d x d elements of the matrix TT are passed, however, only the upper triangle part is actually used in the routine. • VSL_MATRIX_STORAGE_PACKED— upper triangle elements of TT are packed by rows into a onedimensional array. • VSL_MATRIX_STORAGE_DIAGONAL— only diagonal elements of TT are passed. C: Matrix storage scheme for lower triangular matrix T. The routine supports three matrix storage schemes: • VSL_MATRIX_STORAGE_FULL— all d x d elements of the matrix T are passed, however, only the lower triangle part is actually used in the routine. 10 Intel® Math Kernel Library Reference Manual 2162 Name Type Description • VSL_MATRIX_STORAGE_PACKED— lower triangle elements of T are packed by rows into a onedimensional array. • VSL_MATRIX_STORAGE_DIAGONAL— only diagonal elements of T are passed. 
a
    FORTRAN 77: REAL for vsrnggaussianmv
                DOUBLE PRECISION for vdrnggaussianmv
    Fortran 90: REAL(KIND=4), INTENT(IN) for vsrnggaussianmv
                REAL(KIND=8), INTENT(IN) for vdrnggaussianmv
    C: const float* for vsRngGaussianMV
       const double* for vdRngGaussianMV
    Mean vector a of dimension d
t
    FORTRAN 77: REAL for vsrnggaussianmv
                DOUBLE PRECISION for vdrnggaussianmv
    Fortran 90: REAL(KIND=4), INTENT(IN) for vsrnggaussianmv
                REAL(KIND=8), INTENT(IN) for vdrnggaussianmv
    C: const float* for vsRngGaussianMV
       const double* for vdRngGaussianMV
    Fortran: Elements of the upper triangular matrix passed according to the matrix $T^T$ storage scheme mstorage.
    C: Elements of the lower triangular matrix passed according to the matrix T storage scheme mstorage.

Output Parameters
Name Type Description
r
    FORTRAN 77: REAL for vsrnggaussianmv
                DOUBLE PRECISION for vdrnggaussianmv
    Fortran 90: REAL(KIND=4), INTENT(OUT) for vsrnggaussianmv
                REAL(KIND=8), INTENT(OUT) for vdrnggaussianmv
    C: float* for vsRngGaussianMV
       double* for vdRngGaussianMV
    Array of n random vectors of dimension dimen

Description
The vRngGaussianMV function generates random numbers with d-variate normal (Gaussian) distribution with mean value a and variance-covariance matrix C, where $a \in R^d$; C is a d×d symmetric positive-definite matrix.
The probability density function is given by:
$$f_{a,C}(x) = \frac{1}{\sqrt{\det(2\pi C)}} \exp\left(-\frac{1}{2}(x-a)^T C^{-1} (x-a)\right),$$
where $x \in R^d$.
Matrix C can be represented as $C = T T^T$, where T is a lower triangular matrix, the Cholesky factor of C.
Instead of the variance-covariance matrix C, the generation routines require the Cholesky factor of C as input. To compute the Cholesky factor of matrix C, you can call MKL LAPACK routines for matrix factorization: ?potrf or ?pptrf for v?RngGaussianMV/v?rnggaussianmv routines (? means either s or d for single and double precision respectively). See Application Notes for more details.
Application Notes Since matrices are stored in Fortran by columns, while in C they are stored by rows, the usage of MKL factorization routines (assuming Fortran matrices storage) in combination with multivariate normal RNG (assuming C matrix storage) is slightly different in C and Fortran. The following tables help in using these routines in C and Fortran. For further information please refer to the appropriate VSL example file. Using Cholesky Factorization Routines in Fortran Matrix Storage Scheme Variance- Covariance Matrix Argument Factorization Routine UPLO Parameter in Factorizati on Routine Result of Factorizatio n as Input Argument for RNG VSL_MATRIX_STORAGE_FULL C in Fortran twodimensional array spotrf for vsrnggaussianmv dpotrf for vdrnggaussianmv ‘U’ Upper triangle of TT. Lower triangle is not used. VSL_MATRIX_STORAGE_PACK ED Lower triangle of C packed by columns into onedimensional array spptrf for vsrnggaussianmv dpptrf for vdrnggaussianmv ‘L’ Upper triangle of TT packed by rows into onedimensional array. 10 Intel® Math Kernel Library Reference Manual 2164 Using Cholesky Factorization Routines in C Matrix Storage Scheme Variance- Covariance Matrix Argument Factorization Routine UPLO Parameter in Factorizati on Routine Result of Factorizatio n as Input Argument for RNG VSL_MATRIX_STORAGE_FULL C in C twodimensional array spotrf for vsRngGaussianMV dpotrf for vdRngGaussianMV ‘U’ Upper triangle of TT. Lower triangle is not used. VSL_MATRIX_STORAGE_PACK ED Lower triangle of C packed by columns into onedimensional array spptrf for vsRngGaussianMV dpptrf for vdRngGaussianMV ‘L’ Upper triangle of TT packed by rows into onedimensional array. Return Values VSL_ERROR_OK, VSL_STATUS_OK Indicates no error, execution is successful. VSL_ERROR_NULL_PTR stream is a NULL pointer. VSL_RNG_ERROR_BAD_STREAM stream is not a valid random stream. 
VSL_RNG_ERROR_BAD_UPDATE Callback function for an abstract BRNG returns an invalid number of updated entries in a buffer, that is, < 0 or > nmax. VSL_RNG_ERROR_NO_NUMBERS Callback function for an abstract BRNG returns 0 as the number of updated entries in a buffer. VSL_RNG_ERROR_QRNG_PERIOD_ELAPSED Period of the generator has been exceeded. vRngExponential Generates exponentially distributed random numbers. Syntax Fortran: status = vsrngexponential( method, stream, n, r, a, beta ) status = vdrngexponential( method, stream, n, r, a, beta ) C: status = vsRngExponential( method, stream, n, r, a, beta ); status = vdRngExponential( method, stream, n, r, a, beta ); Include Files • FORTRAN 77: mkl_vsl.f77 • Fortran 90: mkl_vsl.f90 • C: mkl_vsl_functions.h Statistical Functions 10 2165 Input Parameters Name Type Description method FORTRAN 77: INTEGER Fortran 90: INTEGER, INTENT(IN) C: const int Generation method. The specific values are as follows: VSL_RNG_METHOD_EXPONENTIAL_ICDF VSL_RNG_METHOD_EXPONENTIAL_ICDF_ACCURATE Inverse cumulative distribution function method stream FORTRAN 77: INTEGER*4 stream(2) Fortran 90: TYPE (VSL_STREAM_STATE), INTENT(IN) C: VSLStreamStatePtr Fortran: Descriptor of the stream state structure. C: Pointer to the stream state structure n FORTRAN 77: INTEGER Fortran 90: INTEGER, INTENT(IN) C: const int Number of random values to be generated a FORTRAN 77: REAL for vsrngexponential DOUBLE PRECISION for vdrngexponential Fortran 90: REAL(KIND=4), INTENT(IN) for vsrngexponential REAL(KIND=8), INTENT(IN) for vdrngexponential C: const float for vsRngExponential C: const double for vdRngExponential Displacement a beta FORTRAN 77: REAL for vsrngexponential DOUBLE PRECISION for vdrngexponential Fortran 90: REAL(KIND=4), INTENT(IN) for vsrngexponential REAL(KIND=8), INTENT(IN) for vdrngexponential C: const float for vsRngExponential Scalefactor ß. 
    const double for vdRngExponential

Output Parameters
Name Type Description
r
    FORTRAN 77: REAL for vsrngexponential
                DOUBLE PRECISION for vdrngexponential
    Fortran 90: REAL(KIND=4), INTENT(OUT) for vsrngexponential
                REAL(KIND=8), INTENT(OUT) for vdrngexponential
    C: float* for vsRngExponential
       double* for vdRngExponential
    Vector of n exponentially distributed random numbers

Description
The vRngExponential function generates random numbers with exponential distribution that has displacement a and scalefactor β, where a, β ∈ R; β > 0.
The probability density function is given by:
$$f_{a,\beta}(x) = \begin{cases} \dfrac{1}{\beta} \exp\left(-\dfrac{x-a}{\beta}\right), & x \ge a \\ 0, & x < a \end{cases} \qquad -\infty < x < +\infty.$$
The cumulative distribution function is as follows:
$$F_{a,\beta}(x) = \begin{cases} 1 - \exp\left(-\dfrac{x-a}{\beta}\right), & x \ge a \\ 0, & x < a \end{cases} \qquad -\infty < x < +\infty.$$

Return Values
VSL_ERROR_OK, VSL_STATUS_OK
    Indicates no error, execution is successful.
VSL_ERROR_NULL_PTR
    stream is a NULL pointer.
VSL_RNG_ERROR_BAD_STREAM
    stream is not a valid random stream.
VSL_RNG_ERROR_BAD_UPDATE
    Callback function for an abstract BRNG returns an invalid number of updated entries in a buffer, that is, < 0 or > nmax.
VSL_RNG_ERROR_NO_NUMBERS
    Callback function for an abstract BRNG returns 0 as the number of updated entries in a buffer.
VSL_RNG_ERROR_QRNG_PERIOD_ELAPSED
    Period of the generator has been exceeded.

vRngLaplace
Generates random numbers with Laplace distribution.

Syntax
Fortran:
    status = vsrnglaplace( method, stream, n, r, a, beta )
    status = vdrnglaplace( method, stream, n, r, a, beta )
C:
    status = vsRngLaplace( method, stream, n, r, a, beta );
    status = vdRngLaplace( method, stream, n, r, a, beta );

Include Files
• FORTRAN 77: mkl_vsl.f77
• Fortran 90: mkl_vsl.f90
• C: mkl_vsl_functions.h

Input Parameters
Name Type Description
method FORTRAN 77: INTEGER Fortran 90: INTEGER, INTENT(IN) C: const int Generation method.
    The specific values are as follows:
    VSL_RNG_METHOD_LAPLACE_ICDF
    Inverse cumulative distribution function method
stream
    FORTRAN 77: INTEGER*4 stream(2)
    Fortran 90: TYPE (VSL_STREAM_STATE), INTENT(IN)
    C: VSLStreamStatePtr
    Fortran: Descriptor of the stream state structure.
    C: Pointer to the stream state structure
n
    FORTRAN 77: INTEGER
    Fortran 90: INTEGER, INTENT(IN)
    C: const int
    Number of random values to be generated
a
    FORTRAN 77: REAL for vsrnglaplace
                DOUBLE PRECISION for vdrnglaplace
    Fortran 90: REAL(KIND=4), INTENT(IN) for vsrnglaplace
                REAL(KIND=8), INTENT(IN) for vdrnglaplace
    C: const float for vsRngLaplace
       const double for vdRngLaplace
    Mean value a
beta
    FORTRAN 77: REAL for vsrnglaplace
                DOUBLE PRECISION for vdrnglaplace
    Fortran 90: REAL(KIND=4), INTENT(IN) for vsrnglaplace
                REAL(KIND=8), INTENT(IN) for vdrnglaplace
    C: const float for vsRngLaplace
       const double for vdRngLaplace
    Scalefactor β.

Output Parameters
Name Type Description
r
    FORTRAN 77: REAL for vsrnglaplace
                DOUBLE PRECISION for vdrnglaplace
    Fortran 90: REAL(KIND=4), INTENT(OUT) for vsrnglaplace
                REAL(KIND=8), INTENT(OUT) for vdrnglaplace
    C: float* for vsRngLaplace
       double* for vdRngLaplace
    Vector of n Laplace distributed random numbers

Description
The vRngLaplace function generates random numbers with Laplace distribution with mean value (or average) a and scalefactor β, where a, β ∈ R; β > 0. The scalefactor value determines the standard deviation as
$$\sigma = \beta\sqrt{2}.$$
The probability density function is given by:
$$f_{a,\beta}(x) = \frac{1}{2\beta} \exp\left(-\frac{|x-a|}{\beta}\right), \quad -\infty < x < +\infty.$$
The cumulative distribution function is as follows:
$$F_{a,\beta}(x) = \begin{cases} \dfrac{1}{2} \exp\left(-\dfrac{|x-a|}{\beta}\right), & x < a \\ 1 - \dfrac{1}{2} \exp\left(-\dfrac{|x-a|}{\beta}\right), & x \ge a \end{cases} \qquad -\infty < x < +\infty.$$

Return Values
VSL_ERROR_OK, VSL_STATUS_OK
    Indicates no error, execution is successful.
VSL_ERROR_NULL_PTR
    stream is a NULL pointer.
VSL_RNG_ERROR_BAD_STREAM
    stream is not a valid random stream.
VSL_RNG_ERROR_BAD_UPDATE
    Callback function for an abstract BRNG returns an invalid number of updated entries in a buffer, that is, < 0 or > nmax.
VSL_RNG_ERROR_NO_NUMBERS Callback function for an abstract BRNG returns 0 as the number of updated entries in a buffer. VSL_RNG_ERROR_QRNG_PERIOD_ELAPSED Period of the generator has been exceeded. vRngWeibull Generates Weibull distributed random numbers. Syntax Fortran: status = vsrngweibull( method, stream, n, r, alpha, a, beta ) status = vdrngweibull( method, stream, n, r, alpha, a, beta ) C: status = vsRngWeibull( method, stream, n, r, alpha, a, beta ); status = vdRngWeibull( method, stream, n, r, alpha, a, beta ); Include Files • FORTRAN 77: mkl_vsl.f77 • Fortran 90: mkl_vsl.f90 • C: mkl_vsl_functions.h 10 Intel® Math Kernel Library Reference Manual 2170 Input Parameters Name Type Description method FORTRAN 77: INTEGER Fortran 90: INTEGER, INTENT(IN) C: const int Generation method. The specific values are as follows: VSL_RNG_METHOD_WEIBULL_ICDF VSL_RNG_METHOD_WEIBULL_ICDF_ACCURATE Inverse cumulative distribution function method stream FORTRAN 77: INTEGER*4 stream(2) Fortran 90: TYPE (VSL_STREAM_STATE), INTENT(IN) C: VSLStreamStatePtr Fortran: Descriptor of the stream state structure. C: Pointer to the stream state structure n FORTRAN 77: INTEGER Fortran 90: INTEGER, INTENT(IN) C: const int Number of random values to be generated alpha FORTRAN 77: REAL for vsrngweibull DOUBLE PRECISION for vdrngweibull Fortran 90: REAL(KIND=4), INTENT(IN) for vsrngweibull REAL(KIND=8), INTENT(IN) for vdrngweibull C: const float for vsRngWeibull const double for vdRngWeibull Shape a. 
a
  Type: FORTRAN 77: REAL for vsrngweibull or DOUBLE PRECISION for vdrngweibull; Fortran 90: REAL(KIND=4), INTENT(IN) for vsrngweibull or REAL(KIND=8), INTENT(IN) for vdrngweibull; C: const float for vsRngWeibull or const double for vdRngWeibull
  Description: Displacement a.
beta
  Type: FORTRAN 77: REAL for vsrngweibull or DOUBLE PRECISION for vdrngweibull; Fortran 90: REAL(KIND=4), INTENT(IN) for vsrngweibull or REAL(KIND=8), INTENT(IN) for vdrngweibull; C: const float for vsRngWeibull or const double for vdRngWeibull
  Description: Scalefactor β.
Output Parameters
r
  Type: FORTRAN 77: REAL for vsrngweibull or DOUBLE PRECISION for vdrngweibull; Fortran 90: REAL(KIND=4), INTENT(OUT) for vsrngweibull or REAL(KIND=8), INTENT(OUT) for vdrngweibull; C: float* for vsRngWeibull or double* for vdRngWeibull
  Description: Vector of n Weibull distributed random numbers.
Description
The vRngWeibull function generates Weibull distributed random numbers with displacement a, scalefactor β, and shape α, where α, β, a ∈ R; α > 0, β > 0.
The probability density function is given by:
$$f_{\alpha,a,\beta}(x) = \begin{cases} \frac{\alpha}{\beta^{\alpha}}(x-a)^{\alpha-1}\exp\left(-\left(\frac{x-a}{\beta}\right)^{\alpha}\right), & x \ge a \\ 0, & x < a \end{cases}$$
The cumulative distribution function is as follows:
$$F_{\alpha,a,\beta}(x) = \begin{cases} 1 - \exp\left(-\left(\frac{x-a}{\beta}\right)^{\alpha}\right), & x \ge a \\ 0, & x < a \end{cases}$$
Return Values
VSL_ERROR_OK, VSL_STATUS_OK: Indicates no error, execution is successful.
VSL_ERROR_NULL_PTR: stream is a NULL pointer.
VSL_RNG_ERROR_BAD_STREAM: stream is not a valid random stream.
VSL_RNG_ERROR_BAD_UPDATE: Callback function for an abstract BRNG returns an invalid number of updated entries in a buffer, that is, < 0 or > nmax.
VSL_RNG_ERROR_NO_NUMBERS: Callback function for an abstract BRNG returns 0 as the number of updated entries in a buffer.
VSL_RNG_ERROR_QRNG_PERIOD_ELAPSED: Period of the generator has been exceeded.

vRngCauchy
Generates Cauchy distributed random values.
Syntax
Fortran:
status = vsrngcauchy( method, stream, n, r, a, beta )
status = vdrngcauchy( method, stream, n, r, a, beta )
C:
status = vsRngCauchy( method, stream, n, r, a, beta );
status = vdRngCauchy( method, stream, n, r, a, beta );
Include Files
• FORTRAN 77: mkl_vsl.f77
• Fortran 90: mkl_vsl.f90
• C: mkl_vsl_functions.h
Input Parameters
method
  Type: FORTRAN 77: INTEGER; Fortran 90: INTEGER, INTENT(IN); C: const int
  Description: Generation method. The specific value is as follows: VSL_RNG_METHOD_CAUCHY_ICDF: Inverse cumulative distribution function method.
stream
  Type: FORTRAN 77: INTEGER*4 stream(2); Fortran 90: TYPE (VSL_STREAM_STATE), INTENT(IN); C: VSLStreamStatePtr
  Description: Fortran: Descriptor of the stream state structure. C: Pointer to the stream state structure.
n
  Type: FORTRAN 77: INTEGER; Fortran 90: INTEGER, INTENT(IN); C: const int
  Description: Number of random values to be generated.
a
  Type: FORTRAN 77: REAL for vsrngcauchy or DOUBLE PRECISION for vdrngcauchy; Fortran 90: REAL(KIND=4), INTENT(IN) for vsrngcauchy or REAL(KIND=8), INTENT(IN) for vdrngcauchy; C: const float for vsRngCauchy or const double for vdRngCauchy
  Description: Displacement a.
beta
  Type: FORTRAN 77: REAL for vsrngcauchy or DOUBLE PRECISION for vdrngcauchy; Fortran 90: REAL(KIND=4), INTENT(IN) for vsrngcauchy or REAL(KIND=8), INTENT(IN) for vdrngcauchy; C: const float for vsRngCauchy or const double for vdRngCauchy
  Description: Scalefactor β.
Output Parameters
r
  Type: FORTRAN 77: REAL for vsrngcauchy or DOUBLE PRECISION for vdrngcauchy; Fortran 90: REAL(KIND=4), INTENT(OUT) for vsrngcauchy or REAL(KIND=8), INTENT(OUT) for vdrngcauchy; C: float* for vsRngCauchy or double* for vdRngCauchy
  Description: Vector of n Cauchy distributed random numbers.
Description
The vRngCauchy function generates Cauchy distributed random numbers with displacement a and scalefactor β, where a, β ∈ R; β > 0.
The probability density function is given by:
$$f_{a,\beta}(x) = \frac{1}{\pi\beta\left(1 + \left(\frac{x-a}{\beta}\right)^{2}\right)}, \quad -\infty < x < +\infty.$$
The cumulative distribution function is as follows:
$$F_{a,\beta}(x) = \frac{1}{2} + \frac{1}{\pi}\arctan\left(\frac{x-a}{\beta}\right), \quad -\infty < x < +\infty.$$
Return Values
VSL_ERROR_OK, VSL_STATUS_OK: Indicates no error, execution is successful.
VSL_ERROR_NULL_PTR: stream is a NULL pointer.
VSL_RNG_ERROR_BAD_STREAM: stream is not a valid random stream.
VSL_RNG_ERROR_BAD_UPDATE: Callback function for an abstract BRNG returns an invalid number of updated entries in a buffer, that is, < 0 or > nmax.
VSL_RNG_ERROR_NO_NUMBERS: Callback function for an abstract BRNG returns 0 as the number of updated entries in a buffer.
VSL_RNG_ERROR_QRNG_PERIOD_ELAPSED: Period of the generator has been exceeded.

vRngRayleigh
Generates Rayleigh distributed random values.
Syntax
Fortran:
status = vsrngrayleigh( method, stream, n, r, a, beta )
status = vdrngrayleigh( method, stream, n, r, a, beta )
C:
status = vsRngRayleigh( method, stream, n, r, a, beta );
status = vdRngRayleigh( method, stream, n, r, a, beta );
Include Files
• FORTRAN 77: mkl_vsl.f77
• Fortran 90: mkl_vsl.f90
• C: mkl_vsl_functions.h
Input Parameters
method
  Type: FORTRAN 77: INTEGER; Fortran 90: INTEGER, INTENT(IN); C: const int
  Description: Generation method. The specific values are as follows: VSL_RNG_METHOD_RAYLEIGH_ICDF, VSL_RNG_METHOD_RAYLEIGH_ICDF_ACCURATE: Inverse cumulative distribution function method.
stream
  Type: FORTRAN 77: INTEGER*4 stream(2); Fortran 90: TYPE (VSL_STREAM_STATE), INTENT(IN); C: VSLStreamStatePtr
  Description: Fortran: Descriptor of the stream state structure.
C: Pointer to the stream state structure.
n
  Type: FORTRAN 77: INTEGER; Fortran 90: INTEGER, INTENT(IN); C: const int
  Description: Number of random values to be generated.
a
  Type: FORTRAN 77: REAL for vsrngrayleigh or DOUBLE PRECISION for vdrngrayleigh; Fortran 90: REAL(KIND=4), INTENT(IN) for vsrngrayleigh or REAL(KIND=8), INTENT(IN) for vdrngrayleigh; C: const float for vsRngRayleigh or const double for vdRngRayleigh
  Description: Displacement a.
beta
  Type: FORTRAN 77: REAL for vsrngrayleigh or DOUBLE PRECISION for vdrngrayleigh; Fortran 90: REAL(KIND=4), INTENT(IN) for vsrngrayleigh or REAL(KIND=8), INTENT(IN) for vdrngrayleigh; C: const float for vsRngRayleigh or const double for vdRngRayleigh
  Description: Scalefactor β.
Output Parameters
r
  Type: FORTRAN 77: REAL for vsrngrayleigh or DOUBLE PRECISION for vdrngrayleigh; Fortran 90: REAL(KIND=4), INTENT(OUT) for vsrngrayleigh or REAL(KIND=8), INTENT(OUT) for vdrngrayleigh; C: float* for vsRngRayleigh or double* for vdRngRayleigh
  Description: Vector of n Rayleigh distributed random numbers.
Description
The vRngRayleigh function generates Rayleigh distributed random numbers with displacement a and scalefactor β, where a, β ∈ R; β > 0. The Rayleigh distribution is a special case of the Weibull distribution, where the shape parameter α = 2.
The probability density function is given by:
$$f_{a,\beta}(x) = \begin{cases} \frac{2(x-a)}{\beta^{2}}\exp\left(-\frac{(x-a)^{2}}{\beta^{2}}\right), & x \ge a \\ 0, & x < a \end{cases}$$
The cumulative distribution function is as follows:
$$F_{a,\beta}(x) = \begin{cases} 1 - \exp\left(-\frac{(x-a)^{2}}{\beta^{2}}\right), & x \ge a \\ 0, & x < a \end{cases}$$
Return Values
VSL_ERROR_OK, VSL_STATUS_OK: Indicates no error, execution is successful.
VSL_ERROR_NULL_PTR: stream is a NULL pointer.
VSL_RNG_ERROR_BAD_STREAM: stream is not a valid random stream.
VSL_RNG_ERROR_BAD_UPDATE: Callback function for an abstract BRNG returns an invalid number of updated entries in a buffer, that is, < 0 or > nmax.
VSL_RNG_ERROR_NO_NUMBERS: Callback function for an abstract BRNG returns 0 as the number of updated entries in a buffer.
VSL_RNG_ERROR_QRNG_PERIOD_ELAPSED: Period of the generator has been exceeded.
vRngLognormal
Generates lognormally distributed random numbers.
Syntax
Fortran:
status = vsrnglognormal( method, stream, n, r, a, sigma, b, beta )
status = vdrnglognormal( method, stream, n, r, a, sigma, b, beta )
C:
status = vsRngLognormal( method, stream, n, r, a, sigma, b, beta );
status = vdRngLognormal( method, stream, n, r, a, sigma, b, beta );
Include Files
• FORTRAN 77: mkl_vsl.f77
• Fortran 90: mkl_vsl.f90
• C: mkl_vsl_functions.h
Input Parameters
method
  Type: FORTRAN 77: INTEGER; Fortran 90: INTEGER, INTENT(IN); C: const int
  Description: Generation method. The specific values are as follows: VSL_RNG_METHOD_LOGNORMAL_BOXMULLER2, VSL_RNG_METHOD_LOGNORMAL_BOXMULLER2_ACCURATE: Box-Muller 2 based method.
stream
  Type: FORTRAN 77: INTEGER*4 stream(2); Fortran 90: TYPE (VSL_STREAM_STATE), INTENT(IN); C: VSLStreamStatePtr
  Description: Fortran: Descriptor of the stream state structure. C: Pointer to the stream state structure.
n
  Type: FORTRAN 77: INTEGER; Fortran 90: INTEGER, INTENT(IN); C: const int
  Description: Number of random values to be generated.
a
  Type: FORTRAN 77: REAL for vsrnglognormal or DOUBLE PRECISION for vdrnglognormal; Fortran 90: REAL(KIND=4), INTENT(IN) for vsrnglognormal or REAL(KIND=8), INTENT(IN) for vdrnglognormal; C: const float for vsRngLognormal or const double for vdRngLognormal
  Description: Average a of the subject normal distribution.
sigma
  Type: FORTRAN 77: REAL for vsrnglognormal or DOUBLE PRECISION for vdrnglognormal; Fortran 90: REAL(KIND=4), INTENT(IN) for vsrnglognormal or REAL(KIND=8), INTENT(IN) for vdrnglognormal; C: const float for vsRngLognormal or const double for vdRngLognormal
  Description: Standard deviation σ of the subject normal distribution.
b
  Type: FORTRAN 77: REAL for vsrnglognormal or DOUBLE PRECISION for vdrnglognormal; Fortran 90: REAL(KIND=4), INTENT(IN) for vsrnglognormal or REAL(KIND=8), INTENT(IN) for vdrnglognormal; C: const float for vsRngLognormal or const double for vdRngLognormal
  Description: Displacement b.
beta
  Type: FORTRAN 77: REAL for vsrnglognormal or DOUBLE PRECISION for vdrnglognormal; Fortran 90: REAL(KIND=4), INTENT(IN) for vsrnglognormal or REAL(KIND=8), INTENT(IN) for vdrnglognormal; C: const float for vsRngLognormal or const double for vdRngLognormal
  Description: Scalefactor β.
Output Parameters
r
  Type: FORTRAN 77: REAL for vsrnglognormal or DOUBLE PRECISION for vdrnglognormal; Fortran 90: REAL(KIND=4), INTENT(OUT) for vsrnglognormal or REAL(KIND=8), INTENT(OUT) for vdrnglognormal; C: float* for vsRngLognormal or double* for vdRngLognormal
  Description: Vector of n lognormally distributed random numbers.
Description
The vRngLognormal function generates lognormally distributed random numbers with average a and standard deviation σ of the subject normal distribution, displacement b, and scalefactor β, where a, σ, b, β ∈ R; σ > 0, β > 0.
The probability density function is given by:
$$f_{a,\sigma,b,\beta}(x) = \begin{cases} \frac{1}{\sigma (x-b)\sqrt{2\pi}}\exp\left(-\frac{\left[\ln\left(\frac{x-b}{\beta}\right) - a\right]^{2}}{2\sigma^{2}}\right), & x > b \\ 0, & x \le b \end{cases}$$
The cumulative distribution function is as follows:
$$F_{a,\sigma,b,\beta}(x) = \begin{cases} \Phi\left(\frac{\ln\left(\frac{x-b}{\beta}\right) - a}{\sigma}\right), & x > b \\ 0, & x \le b \end{cases}$$
where Φ is the cumulative distribution function of the standard normal distribution.
Return Values
VSL_ERROR_OK, VSL_STATUS_OK: Indicates no error, execution is successful.
VSL_ERROR_NULL_PTR: stream is a NULL pointer.
VSL_RNG_ERROR_BAD_STREAM: stream is not a valid random stream.
VSL_RNG_ERROR_BAD_UPDATE: Callback function for an abstract BRNG returns an invalid number of updated entries in a buffer, that is, < 0 or > nmax.
VSL_RNG_ERROR_NO_NUMBERS: Callback function for an abstract BRNG returns 0 as the number of updated entries in a buffer.
VSL_RNG_ERROR_QRNG_PERIOD_ELAPSED: Period of the generator has been exceeded.

vRngGumbel
Generates Gumbel distributed random values.
Syntax
Fortran:
status = vsrnggumbel( method, stream, n, r, a, beta )
status = vdrnggumbel( method, stream, n, r, a, beta )
C:
status = vsRngGumbel( method, stream, n, r, a, beta );
status = vdRngGumbel( method, stream, n, r, a, beta );
Include Files
• FORTRAN 77: mkl_vsl.f77
• Fortran 90: mkl_vsl.f90
• C: mkl_vsl_functions.h
Input Parameters
method
  Type: FORTRAN 77: INTEGER; Fortran 90: INTEGER, INTENT(IN); C: const int
  Description: Generation method. The specific value is as follows: VSL_RNG_METHOD_GUMBEL_ICDF: Inverse cumulative distribution function method.
stream
  Type: FORTRAN 77: INTEGER*4 stream(2); Fortran 90: TYPE (VSL_STREAM_STATE), INTENT(IN); C: VSLStreamStatePtr
  Description: Fortran: Descriptor of the stream state structure. C: Pointer to the stream state structure.
n
  Type: FORTRAN 77: INTEGER; Fortran 90: INTEGER, INTENT(IN); C: const int
  Description: Number of random values to be generated.
a
  Type: FORTRAN 77: REAL for vsrnggumbel or DOUBLE PRECISION for vdrnggumbel; Fortran 90: REAL(KIND=4), INTENT(IN) for vsrnggumbel or REAL(KIND=8), INTENT(IN) for vdrnggumbel; C: const float for vsRngGumbel or const double for vdRngGumbel
  Description: Displacement a.
beta
  Type: FORTRAN 77: REAL for vsrnggumbel or DOUBLE PRECISION for vdrnggumbel; Fortran 90: REAL(KIND=4), INTENT(IN) for vsrnggumbel or REAL(KIND=8), INTENT(IN) for vdrnggumbel; C: const float for vsRngGumbel or const double for vdRngGumbel
  Description: Scalefactor β.
Output Parameters
r
  Type: FORTRAN 77: REAL for vsrnggumbel or DOUBLE PRECISION for vdrnggumbel; Fortran 90: REAL(KIND=4), INTENT(OUT) for vsrnggumbel or REAL(KIND=8), INTENT(OUT) for vdrnggumbel; C: float* for vsRngGumbel or double* for vdRngGumbel
  Description: Vector of n random numbers with Gumbel distribution.
Description
The vRngGumbel function generates Gumbel distributed random numbers with displacement a and scalefactor β, where a, β ∈ R; β > 0.
The probability density function is given by:
$$f_{a,\beta}(x) = \frac{1}{\beta}\exp\left(\frac{x-a}{\beta}\right)\exp\left(-\exp\left(\frac{x-a}{\beta}\right)\right), \quad -\infty < x < +\infty.$$
The cumulative distribution function is as follows:
$$F_{a,\beta}(x) = 1 - \exp\left(-\exp\left(\frac{x-a}{\beta}\right)\right), \quad -\infty < x < +\infty.$$
Return Values
VSL_ERROR_OK, VSL_STATUS_OK: Indicates no error, execution is successful.
VSL_ERROR_NULL_PTR: stream is a NULL pointer.
VSL_RNG_ERROR_BAD_STREAM: stream is not a valid random stream.
VSL_RNG_ERROR_BAD_UPDATE: Callback function for an abstract BRNG returns an invalid number of updated entries in a buffer, that is, < 0 or > nmax.
VSL_RNG_ERROR_NO_NUMBERS: Callback function for an abstract BRNG returns 0 as the number of updated entries in a buffer.
VSL_RNG_ERROR_QRNG_PERIOD_ELAPSED: Period of the generator has been exceeded.

vRngGamma
Generates gamma distributed random values.
Syntax
Fortran:
status = vsrnggamma( method, stream, n, r, alpha, a, beta )
status = vdrnggamma( method, stream, n, r, alpha, a, beta )
C:
status = vsRngGamma( method, stream, n, r, alpha, a, beta );
status = vdRngGamma( method, stream, n, r, alpha, a, beta );
Include Files
• FORTRAN 77: mkl_vsl.f77
• Fortran 90: mkl_vsl.f90
• C: mkl_vsl_functions.h
Input Parameters
method
  Type: FORTRAN 77: INTEGER; Fortran 90: INTEGER, INTENT(IN); C: const int
  Description: Generation method. The specific values are as follows: VSL_RNG_METHOD_GAMMA_GNORM, VSL_RNG_METHOD_GAMMA_GNORM_ACCURATE: Acceptance/rejection method using random numbers with Gaussian distribution. See the brief description of the method GNORM in Table "Values of <method> in method parameter".
stream
  Type: FORTRAN 77: INTEGER*4 stream(2); Fortran 90: TYPE (VSL_STREAM_STATE), INTENT(IN); C: VSLStreamStatePtr
  Description: Fortran: Descriptor of the stream state structure. C: Pointer to the stream state structure.
n
  Type: FORTRAN 77: INTEGER; Fortran 90: INTEGER, INTENT(IN); C: const int
  Description: Number of random values to be generated.
alpha
  Type: FORTRAN 77: REAL for vsrnggamma or DOUBLE PRECISION for vdrnggamma; Fortran 90: REAL(KIND=4), INTENT(IN) for vsrnggamma or REAL(KIND=8), INTENT(IN) for vdrnggamma; C: const float for vsRngGamma or const double for vdRngGamma
  Description: Shape α.
a
  Type: FORTRAN 77: REAL for vsrnggamma or DOUBLE PRECISION for vdrnggamma; Fortran 90: REAL(KIND=4), INTENT(IN) for vsrnggamma or REAL(KIND=8), INTENT(IN) for vdrnggamma; C: const float for vsRngGamma or const double for vdRngGamma
  Description: Displacement a.
beta
  Type: FORTRAN 77: REAL for vsrnggamma or DOUBLE PRECISION for vdrnggamma; Fortran 90: REAL(KIND=4), INTENT(IN) for vsrnggamma or REAL(KIND=8), INTENT(IN) for vdrnggamma; C: const float for vsRngGamma or const double for vdRngGamma
  Description: Scalefactor β.
Output Parameters
r
  Type: FORTRAN 77: REAL for vsrnggamma or DOUBLE PRECISION for vdrnggamma; Fortran 90: REAL(KIND=4), INTENT(OUT) for vsrnggamma or REAL(KIND=8), INTENT(OUT) for vdrnggamma; C: float* for vsRngGamma or double* for vdRngGamma
  Description: Vector of n random numbers with gamma distribution.
Description
The vRngGamma function generates random numbers with gamma distribution that has shape parameter α, displacement a, and scale parameter β, where α, β, and a ∈ R; α > 0, β > 0.
The probability density function is given by:
$$f_{\alpha,a,\beta}(x) = \begin{cases} \frac{1}{\Gamma(\alpha)\beta^{\alpha}}(x-a)^{\alpha-1}e^{-(x-a)/\beta}, & x \ge a \\ 0, & x < a \end{cases}$$
where Γ(α) is the complete gamma function.
The cumulative distribution function is as follows:
$$F_{\alpha,a,\beta}(x) = \begin{cases} \int_{a}^{x}\frac{1}{\Gamma(\alpha)\beta^{\alpha}}(y-a)^{\alpha-1}e^{-(y-a)/\beta}\,dy, & x \ge a \\ 0, & x < a \end{cases}$$
Return Values
VSL_ERROR_OK, VSL_STATUS_OK: Indicates no error, execution is successful.
VSL_ERROR_NULL_PTR: stream is a NULL pointer.
VSL_RNG_ERROR_BAD_STREAM: stream is not a valid random stream.
VSL_RNG_ERROR_BAD_UPDATE: Callback function for an abstract BRNG returns an invalid number of updated entries in a buffer, that is, < 0 or > nmax.
VSL_RNG_ERROR_NO_NUMBERS: Callback function for an abstract BRNG returns 0 as the number of updated entries in a buffer.
VSL_RNG_ERROR_QRNG_PERIOD_ELAPSED: Period of the generator has been exceeded.

vRngBeta
Generates beta distributed random values.
Syntax
Fortran:
status = vsrngbeta( method, stream, n, r, p, q, a, beta )
status = vdrngbeta( method, stream, n, r, p, q, a, beta )
C:
status = vsRngBeta( method, stream, n, r, p, q, a, beta );
status = vdRngBeta( method, stream, n, r, p, q, a, beta );
Include Files
• FORTRAN 77: mkl_vsl.f77
• Fortran 90: mkl_vsl.f90
• C: mkl_vsl_functions.h
Input Parameters
method
  Type: FORTRAN 77: INTEGER; Fortran 90: INTEGER, INTENT(IN); C: const int
  Description: Generation method. The specific values are as follows: VSL_RNG_METHOD_BETA_CJA, VSL_RNG_METHOD_BETA_CJA_ACCURATE. See the brief description of the method CJA in Table "Values of <method> in method parameter".
stream
  Type: FORTRAN 77: INTEGER*4 stream(2); Fortran 90: TYPE (VSL_STREAM_STATE), INTENT(IN); C: VSLStreamStatePtr
  Description: Fortran: Descriptor of the stream state structure. C: Pointer to the stream state structure.
n
  Type: FORTRAN 77: INTEGER; Fortran 90: INTEGER, INTENT(IN); C: const int
  Description: Number of random values to be generated.
p
  Type: FORTRAN 77: REAL for vsrngbeta or DOUBLE PRECISION for vdrngbeta; Fortran 90: REAL(KIND=4), INTENT(IN) for vsrngbeta or REAL(KIND=8), INTENT(IN) for vdrngbeta; C: const float for vsRngBeta or const double for vdRngBeta
  Description: Shape p.
q
  Type: FORTRAN 77: REAL for vsrngbeta or DOUBLE PRECISION for vdrngbeta; Fortran 90: REAL(KIND=4), INTENT(IN) for vsrngbeta or REAL(KIND=8), INTENT(IN) for vdrngbeta; C: const float for vsRngBeta or const double for vdRngBeta
  Description: Shape q.
a
  Type: FORTRAN 77: REAL for vsrngbeta or DOUBLE PRECISION for vdrngbeta; Fortran 90: REAL(KIND=4), INTENT(IN) for vsrngbeta or REAL(KIND=8), INTENT(IN) for vdrngbeta; C: const float for vsRngBeta or const double for vdRngBeta
  Description: Displacement a.
beta
  Description: Scalefactor β.
  Type: FORTRAN 77: REAL for vsrngbeta or DOUBLE PRECISION for vdrngbeta; Fortran 90: REAL(KIND=4), INTENT(IN) for vsrngbeta
  REAL(KIND=8), INTENT(IN) for vdrngbeta; C: const float for vsRngBeta or const double for vdRngBeta
Output Parameters
r
  Type: FORTRAN 77: REAL for vsrngbeta or DOUBLE PRECISION for vdrngbeta; Fortran 90: REAL(KIND=4), INTENT(OUT) for vsrngbeta or REAL(KIND=8), INTENT(OUT) for vdrngbeta; C: float* for vsRngBeta or double* for vdRngBeta
  Description: Vector of n random numbers with beta distribution.
Description
The vRngBeta function generates random numbers with beta distribution that has shape parameters p and q, displacement a, and scale parameter β, where p, q, a, and β ∈ R; p > 0, q > 0, β > 0.
The probability density function is given by:
$$f_{p,q,a,\beta}(x) = \begin{cases} \frac{1}{B(p,q)\,\beta^{p+q-1}}(x-a)^{p-1}(\beta+a-x)^{q-1}, & a \le x < a+\beta \\ 0, & x < a \text{ or } x \ge a+\beta \end{cases}$$
where B(p, q) is the complete beta function.
The cumulative distribution function is as follows:
$$F_{p,q,a,\beta}(x) = \begin{cases} 0, & x < a \\ \int_{a}^{x}\frac{1}{B(p,q)\,\beta^{p+q-1}}(y-a)^{p-1}(\beta+a-y)^{q-1}\,dy, & a \le x < a+\beta \\ 1, & x \ge a+\beta \end{cases}$$
Return Values
VSL_ERROR_OK, VSL_STATUS_OK: Indicates no error, execution is successful.
VSL_ERROR_NULL_PTR: stream is a NULL pointer.
VSL_RNG_ERROR_BAD_STREAM: stream is not a valid random stream.
VSL_RNG_ERROR_BAD_UPDATE: Callback function for an abstract BRNG returns an invalid number of updated entries in a buffer, that is, < 0 or > nmax.
VSL_RNG_ERROR_NO_NUMBERS: Callback function for an abstract BRNG returns 0 as the number of updated entries in a buffer.
VSL_RNG_ERROR_QRNG_PERIOD_ELAPSED: Period of the generator has been exceeded.

Discrete Distributions
This section describes routines for generating random numbers with discrete distribution.

vRngUniform
Generates random numbers uniformly distributed over the interval [a, b).
Syntax
Fortran:
status = virnguniform( method, stream, n, r, a, b )
C:
status = viRngUniform( method, stream, n, r, a, b );
Include Files
• FORTRAN 77: mkl_vsl.f77
• Fortran 90: mkl_vsl.f90
• C: mkl_vsl_functions.h
Input Parameters
method
  Type: FORTRAN 77: INTEGER; Fortran 90: INTEGER, INTENT(IN); C: const int
  Description: Generation method; the specific value is as follows: VSL_RNG_METHOD_UNIFORM_STD: Standard method.
Currently there is only one method for this distribution generator.
stream
  Type: FORTRAN 77: INTEGER*4 stream(2); Fortran 90: TYPE (VSL_STREAM_STATE), INTENT(IN); C: VSLStreamStatePtr
  Description: Fortran: Descriptor of the stream state structure. C: Pointer to the stream state structure.
n
  Type: FORTRAN 77: INTEGER; Fortran 90: INTEGER, INTENT(IN); C: const int
  Description: Number of random values to be generated.
a
  Type: FORTRAN 77: INTEGER*4; Fortran 90: INTEGER(KIND=4), INTENT(IN); C: const int
  Description: Left interval bound a.
b
  Type: FORTRAN 77: INTEGER*4; Fortran 90: INTEGER(KIND=4), INTENT(IN); C: const int
  Description: Right interval bound b.
Output Parameters
r
  Type: FORTRAN 77: INTEGER*4; Fortran 90: INTEGER(KIND=4), INTENT(OUT); C: int*
  Description: Vector of n random numbers uniformly distributed over the interval [a, b).
Description
The vRngUniform function generates random numbers uniformly distributed over the interval [a, b), where a, b are the left and right bounds of the interval respectively, and a, b ∈ Z; a < b.
The probability distribution is given by:
$$P(X = k) = \frac{1}{b-a}, \quad k \in \{a, a+1, \ldots, b-1\}.$$
The cumulative distribution function is as follows:
$$F_{a,b}(x) = \begin{cases} 0, & x < a \\ \frac{\lfloor x - a + 1 \rfloor}{b-a}, & a \le x < b \\ 1, & x \ge b \end{cases}$$
Return Values
VSL_ERROR_OK, VSL_STATUS_OK: Indicates no error, execution is successful.
VSL_ERROR_NULL_PTR: stream is a NULL pointer.
VSL_RNG_ERROR_BAD_STREAM: stream is not a valid random stream.
VSL_RNG_ERROR_BAD_UPDATE: Callback function for an abstract BRNG returns an invalid number of updated entries in a buffer, that is, < 0 or > nmax.
VSL_RNG_ERROR_NO_NUMBERS: Callback function for an abstract BRNG returns 0 as the number of updated entries in a buffer.
VSL_RNG_ERROR_QRNG_PERIOD_ELAPSED: Period of the generator has been exceeded.

vRngUniformBits
Generates bits of underlying BRNG integer recurrence.
Syntax
Fortran:
status = virnguniformbits( method, stream, n, r )
C:
status = viRngUniformBits( method, stream, n, r );
Include Files
• FORTRAN 77: mkl_vsl.f77
• Fortran 90: mkl_vsl.f90
• C: mkl_vsl_functions.h
Input Parameters
method
  Type: FORTRAN 77: INTEGER; Fortran 90: INTEGER, INTENT(IN); C: const int
  Description: Generation method; the specific value is VSL_RNG_METHOD_UNIFORMBITS_STD.
stream
  Type: FORTRAN 77: INTEGER*4 stream(2); Fortran 90: TYPE (VSL_STREAM_STATE), INTENT(IN); C: VSLStreamStatePtr
  Description: Fortran: Descriptor of the stream state structure. C: Pointer to the stream state structure.
n
  Type: FORTRAN 77: INTEGER; Fortran 90: INTEGER, INTENT(IN); C: const int
  Description: Number of random values to be generated.
Output Parameters
r
  Type: FORTRAN 77: INTEGER*4; Fortran 90: INTEGER(KIND=4), INTENT(OUT); C: unsigned int*
  Description (Fortran): Vector of n random integer numbers. If the stream was generated by a 64-bit or a 128-bit generator, each integer value is represented by two or four elements of r respectively. The number of bytes occupied by each integer is contained in the field wordsize of the structure VSL_BRNG_PROPERTIES. The total number of bits actually used to store the value is contained in the field nbits of the same structure. See Advanced Service Routines for a more detailed discussion of VSLBRngProperties.
  Description (C): Vector of n random integer numbers. If the stream was generated by a 64-bit or a 128-bit generator, each integer value is represented by two or four elements of r respectively. The number of bytes occupied by each integer is contained in the field WordSize of the structure VSLBRngProperties. The total number of bits actually used to store the value is contained in the field NBits of the same structure. See Advanced Service Routines for a more detailed discussion of VSLBRngProperties.
Description
The vRngUniformBits function generates integer random values with uniform bit distribution.
The generators of uniformly distributed numbers can be represented as recurrence relations over integer values in modular arithmetic. Each integer can be treated as a vector of several bits. In a truly random generator these bits are random, while in pseudorandom generators this randomness can be violated. For example, a well-known drawback of linear congruential generators (LCGs) is that lower bits are less random than higher bits (see, for example, [Knuth81]). For this reason, care should be taken when using this function. Typically, in a 32-bit LCG only the 24 higher bits of an integer value can be considered random. See VSL Notes for details.
Return Values
VSL_ERROR_OK, VSL_STATUS_OK: Indicates no error, execution is successful.
VSL_ERROR_NULL_PTR: stream is a NULL pointer.
VSL_RNG_ERROR_BAD_STREAM: stream is not a valid random stream.
VSL_RNG_ERROR_BAD_UPDATE: Callback function for an abstract BRNG returns an invalid number of updated entries in a buffer, that is, < 0 or > nmax.
VSL_RNG_ERROR_NO_NUMBERS: Callback function for an abstract BRNG returns 0 as the number of updated entries in a buffer.
VSL_RNG_ERROR_QRNG_PERIOD_ELAPSED: Period of the generator has been exceeded.

vRngUniformBits32
Generates uniformly distributed bits in 32-bit chunks.
Syntax
Fortran:
status = virnguniformbits32( method, stream, n, r )
C:
status = viRngUniformBits32( method, stream, n, r );
Include Files
• FORTRAN 77: mkl_vsl.f77
• Fortran 90: mkl_vsl.f90
• C: mkl_vsl_functions.h
Input Parameters
method
  Type: FORTRAN 77: INTEGER; Fortran 90: INTEGER, INTENT(IN); C: const int
  Description: Generation method; the specific value is VSL_RNG_METHOD_UNIFORMBITS32_STD.
stream
  Type: FORTRAN 77: INTEGER*4 stream(2); Fortran 90: TYPE (VSL_STREAM_STATE), INTENT(IN); C: VSLStreamStatePtr
  Description: Fortran: Descriptor of the stream state structure.
C: Pointer to the stream state structure.
n
  Type: FORTRAN 77: INTEGER; Fortran 90: INTEGER, INTENT(IN); C: const int
  Description: Number of random values to be generated.
Output Parameters
r
  Type: FORTRAN 77: INTEGER*4; Fortran 90: INTEGER (KIND=4), INTENT(OUT); C: unsigned int*
  Description: Vector of n 32-bit random integer numbers with uniform bit distribution (Fortran and C).
Description
The vRngUniformBits32 function generates uniformly distributed bits in 32-bit chunks. Unlike vRngUniformBits, which provides the output of the underlying integer recurrence and does not guarantee uniform distribution across bits, vRngUniformBits32 is designed to ensure that each bit in the 32-bit chunk is uniformly distributed. See VSL Notes for details.
Return Values
VSL_ERROR_OK, VSL_STATUS_OK: Indicates no error, execution is successful.
VSL_ERROR_NULL_PTR: stream is a NULL pointer.
VSL_RNG_ERROR_BAD_STREAM: stream is not a valid random stream.
VSL_RNG_ERROR_BRNG_NOT_SUPPORTED: BRNG is not supported by the function.
VSL_RNG_ERROR_NO_NUMBERS: Callback function for an abstract BRNG returns 0 as the number of updated entries in a buffer.
VSL_RNG_ERROR_QRNG_PERIOD_ELAPSED: Period of the generator has been exceeded.

vRngUniformBits64
Generates uniformly distributed bits in 64-bit chunks.
Syntax
Fortran:
status = virnguniformbits64( method, stream, n, r )
C:
status = viRngUniformBits64( method, stream, n, r );
Include Files
• FORTRAN 77: mkl_vsl.f77
• Fortran 90: mkl_vsl.f90
• C: mkl_vsl_functions.h
Input Parameters
method
  Type: FORTRAN 77: INTEGER; Fortran 90: INTEGER, INTENT(IN); C: const int
  Description: Generation method; the specific value is VSL_RNG_METHOD_UNIFORMBITS64_STD.
stream
  Type: FORTRAN 77: INTEGER*4 stream(2); Fortran 90: TYPE (VSL_STREAM_STATE), INTENT(IN); C: VSLStreamStatePtr
  Description: Fortran: Descriptor of the stream state structure.
C: Pointer to the stream state structure.
n
  Type: FORTRAN 77: INTEGER; Fortran 90: INTEGER, INTENT(IN); C: const int
  Description: Number of random values to be generated.
Output Parameters
r
  Type: FORTRAN 77: INTEGER*8; Fortran 90: INTEGER (KIND=8), INTENT(OUT); C: unsigned long long*
  Description: Vector of n 64-bit random integer numbers with uniform bit distribution (Fortran and C).
Description
The vRngUniformBits64 function generates uniformly distributed bits in 64-bit chunks. Unlike vRngUniformBits, which provides the output of the underlying integer recurrence and does not guarantee uniform distribution across bits, vRngUniformBits64 is designed to ensure that each bit in the 64-bit chunk is uniformly distributed. See VSL Notes for details.
Return Values
VSL_ERROR_OK, VSL_STATUS_OK: Indicates no error, execution is successful.
VSL_ERROR_NULL_PTR: stream is a NULL pointer.
VSL_RNG_ERROR_BAD_STREAM: stream is not a valid random stream.
VSL_RNG_ERROR_BRNG_NOT_SUPPORTED: BRNG is not supported by the function.
VSL_RNG_ERROR_NO_NUMBERS: Callback function for an abstract BRNG returns 0 as the number of updated entries in a buffer.
VSL_RNG_ERROR_QRNG_PERIOD_ELAPSED: Period of the generator has been exceeded.

vRngBernoulli
Generates Bernoulli distributed random values.
Syntax
Fortran:
status = virngbernoulli( method, stream, n, r, p )
C:
status = viRngBernoulli( method, stream, n, r, p );
Include Files
• FORTRAN 77: mkl_vsl.f77
• Fortran 90: mkl_vsl.f90
• C: mkl_vsl_functions.h
Input Parameters
method
  Type: FORTRAN 77: INTEGER; Fortran 90: INTEGER, INTENT(IN); C: const int
  Description: Generation method. The specific value is as follows: VSL_RNG_METHOD_BERNOULLI_ICDF: Inverse cumulative distribution function method.
stream
  Type: FORTRAN 77: INTEGER*4 stream(2); Fortran 90: TYPE (VSL_STREAM_STATE), INTENT(IN); C: VSLStreamStatePtr
  Description: Fortran: Descriptor of the stream state structure.
C: Pointer to the stream state structure.
n
  Type: FORTRAN 77: INTEGER; Fortran 90: INTEGER, INTENT(IN); C: const int
  Description: Number of random values to be generated.
p
  Type: FORTRAN 77: DOUBLE PRECISION; Fortran 90: REAL(KIND=8), INTENT(IN); C: const double
  Description: Success probability p of a trial.
Output Parameters
r
  Type: FORTRAN 77: INTEGER*4; Fortran 90: INTEGER(KIND=4), INTENT(OUT); C: int*
  Description: Vector of n Bernoulli distributed random values.
Description
The vRngBernoulli function generates Bernoulli distributed random numbers with probability p of a single trial success, where p ∈ R; 0 ≤ p ≤ 1. A variate is called Bernoulli distributed if, after a trial, it is equal to 1 with probability of success p, and to 0 with probability 1 - p.
The probability distribution is given by:
P(X = 1) = p
P(X = 0) = 1 - p
The cumulative distribution function is as follows:
$$F_{p}(x) = \begin{cases} 0, & x < 0 \\ 1 - p, & 0 \le x < 1 \\ 1, & x \ge 1 \end{cases}$$
Return Values
VSL_ERROR_OK, VSL_STATUS_OK: Indicates no error, execution is successful.
VSL_ERROR_NULL_PTR: stream is a NULL pointer.
VSL_RNG_ERROR_BAD_STREAM: stream is not a valid random stream.
VSL_RNG_ERROR_BAD_UPDATE: Callback function for an abstract BRNG returns an invalid number of updated entries in a buffer, that is, < 0 or > nmax.
VSL_RNG_ERROR_NO_NUMBERS: Callback function for an abstract BRNG returns 0 as the number of updated entries in a buffer.
VSL_RNG_ERROR_QRNG_PERIOD_ELAPSED: Period of the generator has been exceeded.

vRngGeometric
Generates geometrically distributed random values.
Syntax
Fortran:
status = virnggeometric( method, stream, n, r, p )
C:
status = viRngGeometric( method, stream, n, r, p );
Include Files
• FORTRAN 77: mkl_vsl.f77
• Fortran 90: mkl_vsl.f90
• C: mkl_vsl_functions.h
Input Parameters
method
  Type: FORTRAN 77: INTEGER; Fortran 90: INTEGER, INTENT(IN); C: const int
  Description: Generation method.
The specific value is as follows: VSL_RNG_METHOD_GEOMETRIC_ICDF: Inverse cumulative distribution function method.
stream
  Type: FORTRAN 77: INTEGER*4 stream(2); Fortran 90: TYPE (VSL_STREAM_STATE), INTENT(IN); C: VSLStreamStatePtr
  Description: Fortran: Descriptor of the stream state structure. C: Pointer to the stream state structure.
n
  Type: FORTRAN 77: INTEGER; Fortran 90: INTEGER, INTENT(IN); C: const int
  Description: Number of random values to be generated.
p
  Type: FORTRAN 77: DOUBLE PRECISION; Fortran 90: REAL(KIND=8), INTENT(IN); C: const double
  Description: Success probability p of a trial.
Output Parameters
r
  Type: FORTRAN 77: INTEGER*4; Fortran 90: INTEGER(KIND=4), INTENT(OUT); C: int*
  Description: Vector of n geometrically distributed random values.
Description
The vRngGeometric function generates geometrically distributed random numbers with probability p of a single trial success, where p ∈ R; 0 < p < 1. A geometrically distributed variate represents the number of independent Bernoulli trials preceding the first success. The probability of a single Bernoulli trial success is p.
The probability distribution is given by:
$$P(X = k) = p\,(1-p)^{k}, \quad k \in \{0, 1, 2, \ldots\}.$$
The cumulative distribution function is as follows:
$$F_{p}(x) = \begin{cases} 0, & x < 0 \\ 1 - (1-p)^{\lfloor x + 1 \rfloor}, & x \ge 0 \end{cases}$$
Return Values
VSL_ERROR_OK, VSL_STATUS_OK: Indicates no error, execution is successful.
VSL_ERROR_NULL_PTR: stream is a NULL pointer.
VSL_RNG_ERROR_BAD_STREAM: stream is not a valid random stream.
VSL_RNG_ERROR_BAD_UPDATE: Callback function for an abstract BRNG returns an invalid number of updated entries in a buffer, that is, < 0 or > nmax.
VSL_RNG_ERROR_NO_NUMBERS: Callback function for an abstract BRNG returns 0 as the number of updated entries in a buffer.
VSL_RNG_ERROR_QRNG_PERIOD_ELAPSED: Period of the generator has been exceeded.

vRngBinomial
Generates binomially distributed random numbers.
Syntax
Fortran: status = virngbinomial( method, stream, n, r, ntrial, p )
C: status = viRngBinomial( method, stream, n, r, ntrial, p );

Include Files
• FORTRAN 77: mkl_vsl.f77
• Fortran 90: mkl_vsl.f90
• C: mkl_vsl_functions.h

Input Parameters
Name    Type    Description
method    FORTRAN 77: INTEGER; Fortran 90: INTEGER, INTENT(IN); C: const int    Generation method. The specific value is as follows: VSL_RNG_METHOD_BINOMIAL_BTPE. See brief description of the BTPE method in Table "Values of <method> in method parameter".
stream    FORTRAN 77: INTEGER*4 stream(2); Fortran 90: TYPE (VSL_STREAM_STATE), INTENT(IN); C: VSLStreamStatePtr    Fortran: Descriptor of the stream state structure. C: Pointer to the stream state structure
n    FORTRAN 77: INTEGER; Fortran 90: INTEGER, INTENT(IN); C: const int    Number of random values to be generated
ntrial    FORTRAN 77: INTEGER*4; Fortran 90: INTEGER(KIND=4), INTENT(IN); C: const int    Number of independent trials m
p    FORTRAN 77: DOUBLE PRECISION; Fortran 90: REAL(KIND=8), INTENT(IN); C: const double    Success probability p of a single trial

Output Parameters
Name    Type    Description
r    FORTRAN 77: INTEGER*4; Fortran 90: INTEGER(KIND=4), INTENT(OUT); C: int*    Vector of n binomially distributed random values

Description
The vRngBinomial function generates binomially distributed random numbers with number of independent Bernoulli trials m, and with probability p of a single trial success, where p ∈ R; 0 ≤ p ≤ 1, m ∈ N.
A binomially distributed variate represents the number of successes in m independent Bernoulli trials with probability of a single trial success p.
The probability distribution is given by:
P(X = k) = C(m, k)·p^k·(1 - p)^(m - k), k ∈ {0, 1, ..., m},
where C(m, k) = m!/(k!·(m - k)!) is the binomial coefficient.
The cumulative distribution function is as follows:
F(x) = 0 for x < 0; F(x) = sum over k = 0, ..., ⌊x⌋ of C(m, k)·p^k·(1 - p)^(m - k) for 0 ≤ x < m; F(x) = 1 for x ≥ m.

Return Values
VSL_ERROR_OK, VSL_STATUS_OK    Indicates no error, execution is successful.
VSL_ERROR_NULL_PTR    stream is a NULL pointer.
VSL_RNG_ERROR_BAD_STREAM    stream is not a valid random stream.
VSL_RNG_ERROR_BAD_UPDATE    Callback function for an abstract BRNG returns an invalid number of updated entries in a buffer, that is, < 0 or > nmax.
VSL_RNG_ERROR_NO_NUMBERS    Callback function for an abstract BRNG returns 0 as the number of updated entries in a buffer.
VSL_RNG_ERROR_QRNG_PERIOD_ELAPSED    Period of the generator has been exceeded.

vRngHypergeometric
Generates hypergeometrically distributed random values.

Syntax
Fortran: status = virnghypergeometric( method, stream, n, r, l, s, m )
C: status = viRngHypergeometric( method, stream, n, r, l, s, m );

Include Files
• FORTRAN 77: mkl_vsl.f77
• Fortran 90: mkl_vsl.f90
• C: mkl_vsl_functions.h

Input Parameters
Name    Type    Description
method    FORTRAN 77: INTEGER; Fortran 90: INTEGER, INTENT(IN); C: const int    Generation method. The specific value is as follows: VSL_RNG_METHOD_HYPERGEOMETRIC_H2PE. See brief description of the H2PE method in Table "Values of <method> in method parameter".
stream    FORTRAN 77: INTEGER*4 stream(2); Fortran 90: TYPE (VSL_STREAM_STATE), INTENT(IN); C: VSLStreamStatePtr    Fortran: Descriptor of the stream state structure.
C: Pointer to the stream state structure
n    FORTRAN 77: INTEGER; Fortran 90: INTEGER, INTENT(IN); C: const int    Number of random values to be generated
l    FORTRAN 77: INTEGER*4; Fortran 90: INTEGER(KIND=4), INTENT(IN); C: const int    Lot size l
s    FORTRAN 77: INTEGER*4; Fortran 90: INTEGER(KIND=4), INTENT(IN); C: const int    Size of sampling without replacement s
m    FORTRAN 77: INTEGER*4; Fortran 90: INTEGER(KIND=4), INTENT(IN); C: const int    Number of marked elements m

Output Parameters
Name    Type    Description
r    FORTRAN 77: INTEGER*4; Fortran 90: INTEGER(KIND=4), INTENT(OUT); C: int*    Vector of n hypergeometrically distributed random values

Description
The vRngHypergeometric function generates hypergeometrically distributed random values with lot size l, size of sampling s, and number of marked elements in the lot m, where l, m, s ∈ N ∪ {0}; l ≥ max(s, m).
Consider a lot of l elements comprising m "marked" and l - m "unmarked" elements. Sampling exactly s elements from this lot without replacement defines the hypergeometric distribution, which gives the probability that the group of s elements contains exactly k marked elements.
The probability distribution is given by:
P(X = k) = C(m, k)·C(l - m, s - k) / C(l, s), k ∈ {max(0, s + m - l), ..., min(s, m)},
where C(n, k) denotes the binomial coefficient.
The cumulative distribution function is as follows:
F(x) = 0 for x < max(0, s + m - l); F(x) = sum over k = max(0, s + m - l), ..., ⌊x⌋ of C(m, k)·C(l - m, s - k) / C(l, s) for max(0, s + m - l) ≤ x < min(s, m); F(x) = 1 for x ≥ min(s, m).

Return Values
VSL_ERROR_OK, VSL_STATUS_OK    Indicates no error, execution is successful.
VSL_ERROR_NULL_PTR    stream is a NULL pointer.
VSL_RNG_ERROR_BAD_STREAM    stream is not a valid random stream.
VSL_RNG_ERROR_BAD_UPDATE    Callback function for an abstract BRNG returns an invalid number of updated entries in a buffer, that is, < 0 or > nmax.
VSL_RNG_ERROR_NO_NUMBERS    Callback function for an abstract BRNG returns 0 as the number of updated entries in a buffer.
VSL_RNG_ERROR_QRNG_PERIOD_ELAPSED    Period of the generator has been exceeded.

vRngPoisson
Generates Poisson distributed random values.
Syntax
Fortran: status = virngpoisson( method, stream, n, r, lambda )
C: status = viRngPoisson( method, stream, n, r, lambda );

Include Files
• FORTRAN 77: mkl_vsl.f77
• Fortran 90: mkl_vsl.f90
• C: mkl_vsl_functions.h

Input Parameters
Name    Type    Description
method    FORTRAN 77: INTEGER; Fortran 90: INTEGER, INTENT(IN); C: const int    Generation method. The specific values are as follows: VSL_RNG_METHOD_POISSON_PTPE, VSL_RNG_METHOD_POISSON_POISNORM. See brief description of the PTPE and POISNORM methods in Table "Values of <method> in method parameter".
stream    FORTRAN 77: INTEGER*4 stream(2); Fortran 90: TYPE (VSL_STREAM_STATE), INTENT(IN); C: VSLStreamStatePtr    Fortran: Descriptor of the stream state structure. C: Pointer to the stream state structure
n    FORTRAN 77: INTEGER; Fortran 90: INTEGER, INTENT(IN); C: const int    Number of random values to be generated
lambda    FORTRAN 77: DOUBLE PRECISION; Fortran 90: REAL(KIND=8), INTENT(IN); C: const double    Distribution parameter λ.

Output Parameters
Name    Type    Description
r    FORTRAN 77: INTEGER*4; Fortran 90: INTEGER(KIND=4), INTENT(OUT); C: int*    Vector of n Poisson distributed random values

Description
The vRngPoisson function generates Poisson distributed random numbers with distribution parameter λ, where λ ∈ R; λ > 0.
The probability distribution is given by:
P(X = k) = λ^k·e^(-λ) / k!, k ∈ {0, 1, 2, ...}.
The cumulative distribution function is as follows:
F(x) = 0 for x < 0; F(x) = sum over k = 0, ..., ⌊x⌋ of λ^k·e^(-λ) / k! for x ≥ 0.

Return Values
VSL_ERROR_OK, VSL_STATUS_OK    Indicates no error, execution is successful.
VSL_ERROR_NULL_PTR    stream is a NULL pointer.
VSL_RNG_ERROR_BAD_STREAM    stream is not a valid random stream.
VSL_RNG_ERROR_BAD_UPDATE    Callback function for an abstract BRNG returns an invalid number of updated entries in a buffer, that is, < 0 or > nmax.
VSL_RNG_ERROR_NO_NUMBERS    Callback function for an abstract BRNG returns 0 as the number of updated entries in a buffer.
VSL_RNG_ERROR_QRNG_PERIOD_ELAPSED    Period of the generator has been exceeded.
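As a rough sanity check on values returned in r, the Poisson probabilities for a small distribution parameter can be worked out by hand. The block below assumes the standard Poisson mass function; the choice λ = 2 and the four-digit rounding are illustrative assumptions, not values taken from this manual.

```latex
P(X = k) = \frac{\lambda^{k} e^{-\lambda}}{k!}, \qquad k \in \{0, 1, 2, \ldots\}

\lambda = 2:\quad
P(X = 0) = e^{-2} \approx 0.1353,\quad
P(X = 1) = 2e^{-2} \approx 0.2707,\quad
P(X = 2) = 2e^{-2} \approx 0.2707,\quad
P(X = 3) = \tfrac{4}{3}e^{-2} \approx 0.1804

P(X = k + 1) = P(X = k)\cdot\frac{\lambda}{k + 1}
```

For a large n, the relative frequencies of 0, 1, 2, and 3 in r should approach these values, and the sample mean and sample variance should both approach λ.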
vRngPoissonV
Generates Poisson distributed random values with varying mean.

Syntax
Fortran: status = virngpoissonv( method, stream, n, r, lambda )
C: status = viRngPoissonV( method, stream, n, r, lambda );

Include Files
• FORTRAN 77: mkl_vsl.f77
• Fortran 90: mkl_vsl.f90
• C: mkl_vsl_functions.h

Input Parameters
Name    Type    Description
method    FORTRAN 77: INTEGER; Fortran 90: INTEGER, INTENT(IN); C: const int    Generation method. The specific value is as follows: VSL_RNG_METHOD_POISSONV_POISNORM. See brief description of the POISNORM method in Table "Values of <method> in method parameter".
stream    FORTRAN 77: INTEGER*4 stream(2); Fortran 90: TYPE (VSL_STREAM_STATE), INTENT(IN); C: VSLStreamStatePtr    Fortran: Descriptor of the stream state structure. C: Pointer to the stream state structure
n    FORTRAN 77: INTEGER; Fortran 90: INTEGER, INTENT(IN); C: const int    Number of random values to be generated
lambda    FORTRAN 77: DOUBLE PRECISION; Fortran 90: REAL(KIND=8), INTENT(IN); C: const double*    Array of n distribution parameters λ_i.

Output Parameters
Name    Type    Description
r    FORTRAN 77: INTEGER*4; Fortran 90: INTEGER(KIND=4), INTENT(OUT); C: int*    Vector of n Poisson distributed random values

Description
The vRngPoissonV function generates n Poisson distributed random numbers x_i (i = 1, ..., n) with distribution parameter λ_i, where λ_i ∈ R; λ_i > 0.
The probability distribution is given by:
P(X_i = k) = λ_i^k·e^(-λ_i) / k!, k ∈ {0, 1, 2, ...}.
The cumulative distribution function is as follows:
F_i(x) = 0 for x < 0; F_i(x) = sum over k = 0, ..., ⌊x⌋ of λ_i^k·e^(-λ_i) / k! for x ≥ 0.

Return Values
VSL_ERROR_OK, VSL_STATUS_OK    Indicates no error, execution is successful.
VSL_ERROR_NULL_PTR    stream is a NULL pointer.
VSL_RNG_ERROR_BAD_STREAM    stream is not a valid random stream.
VSL_RNG_ERROR_BAD_UPDATE    Callback function for an abstract BRNG returns an invalid number of updated entries in a buffer, that is, < 0 or > nmax.
VSL_RNG_ERROR_NO_NUMBERS    Callback function for an abstract BRNG returns 0 as the number of updated entries in a buffer.
VSL_RNG_ERROR_QRNG_PERIOD_ELAPSED    Period of the generator has been exceeded.

vRngNegBinomial
Generates random numbers with negative binomial distribution.

Syntax
Fortran: status = virngnegbinomial( method, stream, n, r, a, p )
C: status = viRngNegbinomial( method, stream, n, r, a, p );

Include Files
• FORTRAN 77: mkl_vsl.f77
• Fortran 90: mkl_vsl.f90
• C: mkl_vsl_functions.h

Input Parameters
Name    Type    Description
method    FORTRAN 77: INTEGER; Fortran 90: INTEGER, INTENT(IN); C: const int    Generation method. The specific value is: VSL_RNG_METHOD_NEGBINOMIAL_NBAR. See brief description of the NBAR method in Table "Values of <method> in method parameter".
stream    FORTRAN 77: INTEGER*4 stream(2); Fortran 90: TYPE (VSL_STREAM_STATE), INTENT(IN); C: VSLStreamStatePtr    Fortran: Descriptor of the stream state structure. C: Pointer to the stream state structure
n    FORTRAN 77: INTEGER; Fortran 90: INTEGER, INTENT(IN); C: const int    Number of random values to be generated
a    FORTRAN 77: DOUBLE PRECISION; Fortran 90: REAL(KIND=8), INTENT(IN); C: const double    The first distribution parameter a
p    FORTRAN 77: DOUBLE PRECISION; Fortran 90: REAL(KIND=8), INTENT(IN); C: const double    The second distribution parameter p

Output Parameters
Name    Type    Description
r    FORTRAN 77: INTEGER*4; Fortran 90: INTEGER(KIND=4), INTENT(OUT); C: int*    Vector of n random values with negative binomial distribution.

Description
The vRngNegBinomial function generates random numbers with negative binomial distribution and distribution parameters a and p, where p, a ∈ R; 0 < p < 1; a > 0.
If the first distribution parameter a ∈ N, this distribution is the same as the Pascal distribution. If a ∈ N, the distribution can be interpreted as the expected time of the a-th success in a sequence of Bernoulli trials, when the probability of success is p.
The probability distribution is given by:
P(X = k) = C(a + k - 1, k)·p^a·(1 - p)^k, k ∈ {0, 1, 2, ...},
where C(a + k - 1, k) = Γ(a + k) / (Γ(a)·k!) is the generalized binomial coefficient.
The cumulative distribution function is as follows:
F(x) = 0 for x < 0; F(x) = sum over k = 0, ..., ⌊x⌋ of C(a + k - 1, k)·p^a·(1 - p)^k for x ≥ 0.

Return Values
VSL_ERROR_OK, VSL_STATUS_OK    Indicates no error, execution is successful.
VSL_ERROR_NULL_PTR    stream is a NULL pointer.
VSL_RNG_ERROR_BAD_STREAM    stream is not a valid random stream.
VSL_RNG_ERROR_BAD_UPDATE    Callback function for an abstract BRNG returns an invalid number of updated entries in a buffer, that is, < 0 or > nmax.
VSL_RNG_ERROR_NO_NUMBERS    Callback function for an abstract BRNG returns 0 as the number of updated entries in a buffer.
VSL_RNG_ERROR_QRNG_PERIOD_ELAPSED    Period of the generator has been exceeded.

Advanced Service Routines
This section describes service routines for registering a user-designed basic generator (vslRegisterBrng) and for obtaining properties of the previously registered basic generators (vslGetBrngProperties). See VSL Notes ("Basic Generators" section of VSL Structure chapter) for the rationale behind providing several basic generators, including user-defined BRNGs.

Data types
The Advanced Service routines refer to a structure defining the properties of the basic generator. This structure is described in Fortran 90 as follows:
TYPE VSL_BRNG_PROPERTIES
    INTEGER streamstatesize
    INTEGER nseeds
    INTEGER includeszero
    INTEGER wordsize
    INTEGER nbits
    INTEGER initstream
    INTEGER sbrng
    INTEGER dbrng
    INTEGER ibrng
END TYPE VSL_BRNG_PROPERTIES
The C version is as follows:
typedef struct _VSLBRngProperties {
    int StreamStateSize;
    int NSeeds;
    int IncludesZero;
    int WordSize;
    int NBits;
    InitStreamPtr InitStream;
    sBRngPtr sBRng;
    dBRngPtr dBRng;
    iBRngPtr iBRng;
} VSLBRngProperties;
The following table provides brief descriptions of the fields of the above structure:

Field Descriptions
Field    Short Description
Fortran: streamstatesize; C: StreamStateSize    The size, in bytes, of the stream state structure for a given basic generator.
Fortran: nseeds; C: NSeeds    The number of 32-bit initial conditions (seeds) necessary to initialize the stream state structure for a given basic generator.
Fortran: includeszero; C: IncludesZero    Flag value indicating whether the generator can produce a random 0.
Fortran: wordsize; C: WordSize    Machine word size, in bytes, used in integer-value computations. Possible values: 4, 8, and 16 for 32, 64, and 128-bit generators, respectively.
Fortran: nbits; C: NBits    The number of bits required to represent a random value in integer arithmetic. Note that, for instance, 48-bit random values are stored to 64-bit (8 byte) memory locations. In this case, wordsize/WordSize is equal to 8 (number of bytes used to store the random value), while nbits/NBits contains the actual number of bits occupied by the value (in this example, 48).
Fortran: initstream; C: InitStream    Contains the pointer to the initialization routine of a given basic generator.
Fortran: sbrng; C: sBRng    Contains the pointer to the basic generator of single precision real numbers uniformly distributed over the interval (a,b) (REAL in Fortran and float in C).
Fortran: dbrng; C: dBRng    Contains the pointer to the basic generator of double precision real numbers uniformly distributed over the interval (a,b) (DOUBLE PRECISION in Fortran and double in C).
Fortran: ibrng; C: iBRng    Contains the pointer to the basic generator of integer numbers with uniform bit distribution [1] (INTEGER in Fortran and unsigned int in C).
[1] A specific generator that permits operations over single bits and bit groups of random numbers.

vslRegisterBrng
Registers user-defined basic generator.
Syntax
Fortran: brng = vslregisterbrng( properties )
C: brng = vslRegisterBrng( &properties );

Include Files
• Fortran 90: mkl_vsl.f90
• C: mkl_vsl_functions.h

Input Parameters
Name    Type    Description
properties    Fortran: TYPE(VSL_BRNG_PROPERTIES), INTENT(IN); C: const VSLBRngProperties*    Pointer to the structure containing properties of the basic generator to be registered
NOTE FORTRAN 77 support is unavailable for this function.

Output Parameters
Name    Type    Description
brng    Fortran: INTEGER, INTENT(OUT); C: int    Number (index) of the registered basic generator; used for identification. Negative values indicate a registration error.

Description
An example of a registration procedure can be found in the respective directory of the VSL examples.

Return Values
VSL_ERROR_OK, VSL_STATUS_OK    Indicates no error, execution is successful.
VSL_RNG_ERROR_BRNG_TABLE_FULL    Registration cannot be completed due to lack of free entries in the table of registered BRNGs.
VSL_RNG_ERROR_BAD_STREAM_STATE_SIZE    Bad value in StreamStateSize field.
VSL_RNG_ERROR_BAD_WORD_SIZE    Bad value in WordSize field.
VSL_RNG_ERROR_BAD_NSEEDS    Bad value in NSeeds field.
VSL_RNG_ERROR_BAD_NBITS    Bad value in NBits field.
VSL_ERROR_NULL_PTR    At least one of the fields iBRng, dBRng, sBRng or InitStream is a NULL pointer.

vslGetBrngProperties
Returns structure with properties of a given basic generator.

Syntax
Fortran: status = vslgetbrngproperties( brng, properties )
C: status = vslGetBrngProperties( brng, &properties );

Include Files
• Fortran 90: mkl_vsl.f90
• C: mkl_vsl_functions.h

Input Parameters
Name    Type    Description
brng    Fortran: INTEGER(KIND=4), INTENT(IN); C: const int    Number (index) of the registered basic generator; used for identification. See specific values in Table "Values of brng parameter". Negative values indicate a registration error.
NOTE FORTRAN 77 support is unavailable for this function.
Output Parameters
Name    Type    Description
properties    Fortran: TYPE(VSL_BRNG_PROPERTIES), INTENT(OUT); C: VSLBRngProperties*    Pointer to the structure containing properties of the generator with number brng

Description
The vslGetBrngProperties function returns a structure with properties of a given basic generator.

Return Values
VSL_ERROR_OK, VSL_STATUS_OK    Indicates no error, execution is successful.
VSL_RNG_ERROR_INVALID_BRNG_INDEX    BRNG index is invalid.

Formats for User-Designed Generators
To register a user-designed basic generator using the vslRegisterBrng function, you need to pass the pointer iBRng to the integer-value implementation of the generator; the pointers sBRng and dBRng to the generator implementations for single and double precision values, respectively; and the pointer InitStream to the stream initialization routine. Recommendations on defining such functions, with their input and output arguments, are given below. An example of the registration procedure for a user-designed generator can be found in the respective directory of VSL examples.
The respective pointers are defined as follows:
typedef int(*InitStreamPtr)( int method, VSLStreamStatePtr stream, int n, const unsigned int params[] );
typedef int(*sBRngPtr)( VSLStreamStatePtr stream, int n, float r[], float a, float b );
typedef int(*dBRngPtr)( VSLStreamStatePtr stream, int n, double r[], double a, double b );
typedef int(*iBRngPtr)( VSLStreamStatePtr stream, int n, unsigned int r[] );

InitStream
C:
int MyBrngInitStream( int method, VSLStreamStatePtr stream, int n, const unsigned int params[] )
{
    /* Initialize the stream */
    ...
} /* MyBrngInitStream */

Description
The initialization routine of a user-designed generator must initialize stream according to the specified initialization method, initial conditions params and the argument n. The value of method determines the initialization method to be used.
• If method is equal to 1, the initialization is by the standard generation method, which must be supported by all basic generators. In this case the function assumes that the stream structure was not previously initialized. The value of n is used as the actual number of 32-bit values passed as initial conditions through params. Note that it is not an error if the actual number of initial conditions passed to the function is insufficient to initialize the generator. Whenever this occurs, the basic generator must initialize the missing conditions using default settings.
• If method is equal to 2, the generation is by the leapfrog method, where n specifies the number of computational nodes (independent streams). Here the function assumes that the stream was previously initialized by the standard generation method. In this case params contains only one element, which identifies the computational node. If the generator does not support the leapfrog method, the function must return the error code VSL_ERROR_LEAPFROG_UNSUPPORTED.
• If method is equal to 3, the generation is by the block-splitting method. As above, the stream is assumed to be previously initialized by the standard generation method; params is not used, and n identifies the number of skipped elements. If the generator does not support the block-splitting method, the function must return the error code VSL_ERROR_SKIPAHEAD_UNSUPPORTED.
For a more detailed description of the leapfrog and the block-splitting methods, refer to the description of vslLeapfrogStream and vslSkipAheadStream, respectively.
Stream state structure is individual for every generator.
However, each structure has a number of fields that are the same for all the generators:
C:
typedef struct
{
    unsigned int Reserved1[2];
    unsigned int Reserved2[2];
    [fields specific for the given generator]
} MyStreamState;
The fields Reserved1 and Reserved2 are reserved for private needs only, and must not be modified by the user. When including specific fields into the structure, follow the rules below:
• The fields must fully describe the current state of the generator. For example, the state of a linear congruential generator can be identified by only one initial condition.
• If the generator can use both the leapfrog and the block-splitting methods, additional fields should be introduced to identify the independent streams. For example, in LCG(a, c, m), apart from the initial conditions, two more fields should be specified: the value of the multiplier a^k and the value of the increment (a^k - 1)c/(a - 1).
For a more detailed discussion, refer to [Knuth81] and [Gentle98]. An example of the registration procedure can be found in the respective directory of VSL examples.

iBRng
C:
int iMyBrng( VSLStreamStatePtr stream, int n, unsigned int r[] )
{
    int i; /* Loop variable */
    /* Generating integer random numbers */
    /* Pay attention to word size needed to store one random number */
    for( i = 0; i < n; i++)
    {
        r[i] = ...;
    }
    /* Update stream state */
    ...
    return errcode;
} /* iMyBrng */
NOTE When using 64 and 128-bit generators, consider digit capacity to store the numbers to the random vector r correctly. For example, storing one 64-bit value requires two elements of r, the first to store the lower 32 bits and the second to store the higher 32 bits. Similarly, use 4 elements of r to store a 128-bit value.
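The storage rule in the NOTE above can be sketched in plain C. This is a minimal illustration, assuming unsigned int is 32 bits wide; the helper names store_u64_values and load_u64_value are hypothetical and are not part of Intel MKL.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not an Intel MKL routine): store n 64-bit random
   values into the vector r of 32-bit elements, lower half first, as the
   NOTE describes. r must have room for 2*n elements.
   Assumes unsigned int is 32 bits wide. */
void store_u64_values(const uint64_t *values, int n, unsigned int r[])
{
    int i;
    assert(values != 0 && r != 0 && n >= 0);
    for (i = 0; i < n; i++) {
        r[2 * i]     = (unsigned int)(values[i] & 0xFFFFFFFFu); /* lower 32 bits  */
        r[2 * i + 1] = (unsigned int)(values[i] >> 32);         /* higher 32 bits */
    }
}

/* Reassemble 64-bit value number i from the two elements written above. */
uint64_t load_u64_value(const unsigned int r[], int i)
{
    assert(r != 0 && i >= 0);
    return ((uint64_t)r[2 * i + 1] << 32) | (uint64_t)r[2 * i];
}
```

A 128-bit generator would follow the same pattern with four elements of r per value, ordered from the lowest to the highest 32 bits.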
sBRng
C:
int sMyBrng( VSLStreamStatePtr stream, int n, float r[], float a, float b )
{
    int i; /* Loop variable */
    /* Generating float (a,b) random numbers */
    for ( i = 0; i < n; i++ )
    {
        r[i] = ...;
    }
    /* Update stream state */
    ...
    return errcode;
} /* sMyBrng */

dBRng
C:
int dMyBrng( VSLStreamStatePtr stream, int n, double r[], double a, double b )
{
    int i; /* Loop variable */
    /* Generating double (a,b) random numbers */
    for ( i = 0; i < n; i++ )
    {
        r[i] = ...;
    }
    /* Update stream state */
    ...
    return errcode;
} /* dMyBrng */

Convolution and Correlation
Intel MKL VSL provides a set of routines intended to perform linear convolution and correlation transformations for single and double precision real and complex data.
For correct definition of implemented operations, see the Mathematical Notation and Definitions section.
The current implementation provides:
• Fourier algorithms for one-dimensional single and double precision real and complex data
• Fourier algorithms for multi-dimensional single and double precision real and complex data
• Direct algorithms for one-dimensional single and double precision real and complex data
• Direct algorithms for multi-dimensional single and double precision real and complex data
One-dimensional algorithms cover the following functions from the IBM* ESSL library:
SCONF, SCORF
SCOND, SCORD
SDCON, SDCOR
DDCON, DDCOR
SDDCON, SDDCOR.
Special wrappers are designed to simulate these ESSL functions. The wrappers are provided as sample sources for Fortran and C. To reuse them, use the following directories:
${MKL}/examples/vslc/essl/vsl_wrappers
${MKL}/examples/vslf/essl/vsl_wrappers
Additionally, you can browse the examples demonstrating the calculation of the ESSL functions through the wrappers. You can find the examples in the following directories:
${MKL}/examples/vslc/essl
${MKL}/examples/vslf/essl
Convolution and correlation API provides interfaces for FORTRAN 77, Fortran 90 and C/89 languages.
You may use the C/89 interface also with later versions of C or C++, or Fortran 90 interface with programs written in Fortran 95.
For users of the C/C++ and Fortran languages, the mkl_vsl.h, mkl_vsl.f90, and mkl_vsl.f77 headers are provided. All header files are found under the directory:
${MKL}/include
See more details about the Fortran header in Random Number Generators section of this chapter.
Convolution and correlation API is implemented through task objects, or tasks. A task object is a data structure, or descriptor, which holds parameters that determine the specific convolution or correlation operation. Such parameters may be precision, type, and number of dimensions of user data, an identifier of the computation algorithm to be used, shapes of data arrays, and so on.
All the Intel MKL VSL convolution and correlation routines process task objects in one way or another: either create a new task descriptor, change the parameter settings, compute mathematical results of the convolution or correlation using the stored parameters, or perform other operations. Accordingly, all routines are split into the following groups:
Task Constructors - routines that create a new task object descriptor and set up most common parameters.
Task Editors - routines that can set or modify some parameter settings in the existing task descriptor.
Task Execution Routines - compute results of the convolution or correlation operation over the actual input data, using the operation parameters held in the task descriptor.
Task Copy - routines used to make several copies of the task descriptor.
Task Destructors - routines that delete task objects and free the memory.
When the task is executed or copied for the first time, a special process runs which is called task commitment. During this process, consistency of task parameters is checked and the required work data are prepared.
If the parameters are consistent, the task is tagged as committed successfully. The task remains committed until you edit its parameters. Hence, the task can be executed multiple times after a single commitment process. Since the task commitment process may include costly intermediate calculations such as preparation of Fourier transform of input data, launching the process only once can help speed up overall performance.

Naming Conventions
The names of Fortran routines in the convolution and correlation API are written in lowercase (vslsconvexec), while the names of Fortran types and constants are written in uppercase. The names are not case-sensitive.
In C, the names of routines, types, and constants are case-sensitive and can be lowercase and uppercase (vslsConvExec).
The names of routines have the following structure:
vsl[datatype]{Conv|Corr}<base name> for the C interface
vsl[datatype]{conv|corr}<base name> for the Fortran interface
where
• vsl is a prefix indicating that the routine belongs to Vector Statistical Library of Intel® MKL.
• [datatype] is optional. If present, the symbol specifies the type of the input and output data and can be s (for single precision real type), d (for double precision real type), c (for single precision complex type), or z (for double precision complex type).
• Conv or Corr specifies whether the routine refers to convolution or correlation task, respectively.
• <base name> field specifies a particular functionality that the routine is designed for, for example, NewTask, DeleteTask.
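The commit-once behavior described earlier in this section (a task is committed on its first execution and stays committed until a parameter is edited) can be pictured with a small toy model. The sketch below is an illustration only: ToyTask and the toy_* functions are hypothetical names invented here, the "mode < 0 is inconsistent" rule is an arbitrary assumption, and nothing in it reflects how Intel MKL implements task descriptors internally.

```c
#include <assert.h>

/* Toy model only: ToyTask and the toy_* functions are hypothetical names
   invented for this sketch; they are not Intel MKL types or routines. */
typedef struct {
    int mode;          /* an editable parameter */
    int committed;     /* 1 once the parameters have been checked */
    int commit_count;  /* how many times the costly commitment ran */
} ToyTask;

/* Constructor: sets the parameter; the task starts uncommitted. */
void toy_new_task(ToyTask *task, int mode)
{
    assert(task != 0);
    task->mode = mode;
    task->committed = 0;
    task->commit_count = 0;
}

/* Editor: changing a parameter invalidates the previous commitment. */
void toy_set_mode(ToyTask *task, int mode)
{
    assert(task != 0);
    task->mode = mode;
    task->committed = 0;
}

/* Executor: commits the task first if it is not committed yet.
   Returns 0 on success, -1 if the parameters are inconsistent. */
int toy_exec(ToyTask *task)
{
    assert(task != 0);
    if (!task->committed) {
        if (task->mode < 0) {
            return -1;            /* consistency check failed */
        }
        task->commit_count++;     /* stands in for costly preparation */
        task->committed = 1;
    }
    /* ... the actual computation would run here ... */
    return 0;
}
```

Calling toy_exec twice in a row runs the commitment step once; calling toy_set_mode in between forces it to run again on the next execution, which is why repeated execution without edits is the cheaper usage pattern.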
Data Types
All convolution or correlation routines use the following types for specifying data objects:
Type    Data Object
FORTRAN 77: INTEGER*4 task (2); Fortran 90: TYPE(VSL_CONV_TASK); C: VSLConvTaskPtr    Pointer to a task descriptor for convolution
FORTRAN 77: INTEGER*4 task (2); Fortran 90: TYPE(VSL_CORR_TASK); C: VSLCorrTaskPtr    Pointer to a task descriptor for correlation
FORTRAN 77: REAL*4; Fortran 90: REAL(KIND=4); C: float    Input/output user real data in single precision
FORTRAN 77: REAL*8; Fortran 90: REAL(KIND=8); C: double    Input/output user real data in double precision
FORTRAN 77: COMPLEX*8; Fortran 90: COMPLEX(KIND=4); C: MKL_Complex8    Input/output user complex data in single precision
FORTRAN 77: COMPLEX*16; Fortran 90: COMPLEX(KIND=8); C: MKL_Complex16    Input/output user complex data in double precision
FORTRAN 77: INTEGER; Fortran 90: INTEGER; C: int    All other data
Generic integer type (without specifying the byte size) is used for all integer data.
NOTE The actual size of the generic integer type is platform-dependent. Before you compile your application, set an appropriate byte size for integers. See details in the 'Using the ILP64 Interface vs. LP64 Interface' section of the Intel® MKL User's Guide.

Parameters
Basic parameters held by the task descriptor are assigned values when the task object is created, copied, or modified by task editors. Parameters of the correlation or convolution task are initially set up by task constructors when the task object is created. Parameter changes or additional settings are made by task editors. Additional parameters that define the location of the data being convolved must be specified when the task execution routine is invoked.
According to how the parameters are passed or assigned values, all of them can be categorized as either explicit (directly passed as routine parameters when a task object is created or executed) or optional (assigned some default or implicit values during task construction).
The following table lists all applicable parameters used in the Intel MKL convolution and correlation API.

Convolution and Correlation Task Parameters
Name    Category    Type    Default Value Label    Description
job    explicit    integer    Implied by the constructor name    Specifies whether the task relates to convolution or correlation
type    explicit    integer    Implied by the constructor name    Specifies the type (real or complex) of the input/output data. Set to real in the current version.
precision    explicit    integer    Implied by the constructor name    Specifies precision (single or double) of the input/output data to be provided in arrays x, y, z.
mode    explicit    integer    None    Specifies whether the convolution/correlation computation should be done via Fourier transforms, or by a direct method, or by automatically choosing between the two. See SetMode for the list of named constants for this parameter.
method    optional    integer    "auto"    Hints at a particular computation method if several methods are available for the given mode. Setting this parameter to "auto" means that software will choose the best available method.
internal_precision    optional    integer    Set equal to the value of precision    Specifies precision of internal calculations. Can enforce double precision calculations even when input/output data are single precision. See SetInternalPrecision for the list of named constants for this parameter.
dims    explicit    integer    None    Specifies the rank (number of dimensions) of the user data provided in arrays x, y, z. Can be in the range from 1 to 7.
x, y    explicit    real arrays    None    Specify input data arrays. See Data Allocation for more information.
z    explicit    real array    None    Specifies output data array. See Data Allocation for more information.
xshape, yshape, zshape    explicit    integer arrays    None    Define shapes of the arrays x, y, z. See Data Allocation for more information.
xstride, ystride, zstride    explicit    integer arrays    None    Define strides within arrays x, y, z, that is, specify the physical location of the input and output data in these arrays. See Data Allocation for more information.
start    optional    integer array    Undefined    Defines the first element of the mathematical result that will be stored to output array z. See SetStart and Data Allocation for more information.
decimation    optional    integer array    Undefined    Defines how to thin out the mathematical result that will be stored to output array z. See SetDecimation and Data Allocation for more information.
Users of the C or C++ language may pass the NULL pointer instead of any or all of the parameters xstride, ystride, or zstride for multi-dimensional calculations. In this case, the software assumes the dense data allocation for the arrays x, y, or z due to the Fortran-style "by columns" representation of multidimensional arrays.

Task Status and Error Reporting
The task status is an integer value, which is zero if no error has been detected while processing the task, or a specific non-zero error code otherwise. Negative status values indicate errors, and positive values indicate warnings.
An error can be caused by invalid parameter values, a system fault like a memory allocation failure, or can be an internal error self-detected by the software.
Each task descriptor contains the current status of the task. When creating a task object, the constructor assigns the VSL_STATUS_OK status to the task.
When processing the task afterwards, other routines such as editors or executors can change the task status if an error occurs and write a corresponding error code into the task status field. Note that at the stage of creating a task or editing its parameters, the set of parameters may be inconsistent. The parameter consistency check is only performed during the task commitment operation, which is implicitly invoked before task execution or task copying. If an error is detected at this stage, task execution or task copying is terminated and the task descriptor saves the corresponding error code. Once an error occurs, any further attempt to process that task descriptor is terminated and the task keeps the same error code.

Normally, every convolution or correlation function (except DeleteTask) returns the status assigned to the task while performing the function operation. The status codes are given symbolic names defined in the respective header files. For the C/C++ interface, these names are defined as macros via #define statements, and for the Fortran interface as integer constants via PARAMETER statements. If there is no error, the VSL_STATUS_OK status is returned, which is defined as zero:

C/C++: #define VSL_STATUS_OK 0
F90/F95: INTEGER(KIND=4) VSL_STATUS_OK
         PARAMETER(VSL_STATUS_OK = 0)
F77: INTEGER*4 VSL_STATUS_OK
     PARAMETER(VSL_STATUS_OK = 0)

In case of an error, a non-zero error code is returned, which indicates the origin of the failure. The following convolution/correlation status codes are predefined in the header files for both the C/C++ and Fortran languages.

Convolution/Correlation Status Codes

Status Code | Description
VSL_CC_ERROR_NOT_IMPLEMENTED | Requested functionality is not implemented.
VSL_CC_ERROR_ALLOCATION_FAILURE | Memory allocation failure.
VSL_CC_ERROR_BAD_DESCRIPTOR | Task descriptor is corrupted.
VSL_CC_ERROR_SERVICE_FAILURE | A service function has failed.
VSL_CC_ERROR_EDIT_FAILURE | Failure while editing the task.
VSL_CC_ERROR_EDIT_PROHIBITED | You cannot edit this parameter.
VSL_CC_ERROR_COMMIT_FAILURE | Task commitment has failed.
VSL_CC_ERROR_COPY_FAILURE | Failure while copying the task.
VSL_CC_ERROR_DELETE_FAILURE | Failure while deleting the task.
VSL_CC_ERROR_BAD_ARGUMENT | Bad argument or task parameter.
VSL_CC_ERROR_JOB | Bad parameter: job.
VSL_CC_ERROR_KIND | Bad parameter: kind.
VSL_CC_ERROR_MODE | Bad parameter: mode.
VSL_CC_ERROR_METHOD | Bad parameter: method.
VSL_CC_ERROR_TYPE | Bad parameter: type.
VSL_CC_ERROR_EXTERNAL_PRECISION | Bad parameter: external_precision.
VSL_CC_ERROR_INTERNAL_PRECISION | Bad parameter: internal_precision.
VSL_CC_ERROR_PRECISION | Incompatible external/internal precisions.
VSL_CC_ERROR_DIMS | Bad parameter: dims.
VSL_CC_ERROR_XSHAPE | Bad parameter: xshape.
VSL_CC_ERROR_YSHAPE | Bad parameter: yshape.
VSL_CC_ERROR_ZSHAPE | Bad parameter: zshape.
VSL_CC_ERROR_XSTRIDE | Bad parameter: xstride.
VSL_CC_ERROR_YSTRIDE | Bad parameter: ystride.
VSL_CC_ERROR_ZSTRIDE | Bad parameter: zstride.
VSL_CC_ERROR_X | Bad parameter: x.
VSL_CC_ERROR_Y | Bad parameter: y.
VSL_CC_ERROR_Z | Bad parameter: z.
VSL_CC_ERROR_START | Bad parameter: start.
VSL_CC_ERROR_DECIMATION | Bad parameter: decimation.
VSL_CC_ERROR_OTHER | Another error.

Task Constructors

Task constructors are routines intended for creating a new task descriptor and setting up basic parameters. No additional parameter adjustment is typically required and other routines can use the task object. Intel® MKL implementation of the convolution and correlation API provides two different forms of constructors: a general form and an X-form.
X-form constructors work in the same way as the general form constructors but also assign particular data to the first operand vector used in the convolution or correlation operation (stored in array x). Using X-form constructors is recommended when you need to compute multiple convolutions or correlations with the same data vector held in array x against different vectors held in array y. This helps improve performance by eliminating unnecessary overhead in repeated computation of intermediate data required for the operation. Each constructor routine has an associated one-dimensional version that provides algorithmic and computational benefits.

NOTE If the constructor fails to create a task descriptor, it returns the NULL task pointer.

The Table "Task Constructors" lists available task constructors:

Task Constructors

Routine | Description
vslConvNewTask/vslCorrNewTask | Creates a new convolution or correlation task descriptor for a multidimensional case.
vslConvNewTask1D/vslCorrNewTask1D | Creates a new convolution or correlation task descriptor for a one-dimensional case.
vslConvNewTaskX/vslCorrNewTaskX | Creates a new convolution or correlation task descriptor as an X-form for a multidimensional case.
vslConvNewTaskX1D/vslCorrNewTaskX1D | Creates a new convolution or correlation task descriptor as an X-form for a one-dimensional case.

vslConvNewTask/vslCorrNewTask
Creates a new convolution or correlation task descriptor for a multidimensional case.
Syntax

Fortran:
status = vslsconvnewtask(task, mode, dims, xshape, yshape, zshape)
status = vsldconvnewtask(task, mode, dims, xshape, yshape, zshape)
status = vslcconvnewtask(task, mode, dims, xshape, yshape, zshape)
status = vslzconvnewtask(task, mode, dims, xshape, yshape, zshape)
status = vslscorrnewtask(task, mode, dims, xshape, yshape, zshape)
status = vsldcorrnewtask(task, mode, dims, xshape, yshape, zshape)
status = vslccorrnewtask(task, mode, dims, xshape, yshape, zshape)
status = vslzcorrnewtask(task, mode, dims, xshape, yshape, zshape)

C:
status = vslsConvNewTask(task, mode, dims, xshape, yshape, zshape);
status = vsldConvNewTask(task, mode, dims, xshape, yshape, zshape);
status = vslcConvNewTask(task, mode, dims, xshape, yshape, zshape);
status = vslzConvNewTask(task, mode, dims, xshape, yshape, zshape);
status = vslsCorrNewTask(task, mode, dims, xshape, yshape, zshape);
status = vsldCorrNewTask(task, mode, dims, xshape, yshape, zshape);
status = vslcCorrNewTask(task, mode, dims, xshape, yshape, zshape);
status = vslzCorrNewTask(task, mode, dims, xshape, yshape, zshape);

Include Files
• FORTRAN 77: mkl_vsl.f77
• Fortran 90: mkl_vsl.f90
• C: mkl_vsl_functions.h

Input Parameters

Name | Type | Description
mode | FORTRAN 77: INTEGER; Fortran 90: INTEGER; C: const int | Specifies whether convolution/correlation calculation must be performed by using a direct algorithm or through Fourier transform of the input data. See Table "Values of mode parameter" for a list of possible values.
dims | FORTRAN 77: INTEGER; Fortran 90: INTEGER; C: const int | Rank of user data. Specifies number of dimensions for the input and output arrays x, y, and z used during the execution stage. Must be in the range from 1 to 7. The value is explicitly assigned by the constructor.
xshape | FORTRAN 77: INTEGER; Fortran 90: INTEGER, DIMENSION(*); C: const int[] | Defines the shape of the input data for the source array x. See Data Allocation for more information.
yshape | FORTRAN 77: INTEGER; Fortran 90: INTEGER, DIMENSION(*); C: const int[] | Defines the shape of the input data for the source array y. See Data Allocation for more information.
zshape | FORTRAN 77: INTEGER; Fortran 90: INTEGER, DIMENSION(*); C: const int[] | Defines the shape of the output data to be stored in array z. See Data Allocation for more information.

Output Parameters

Name | Type | Description
task | FORTRAN 77: INTEGER*4 task(2) for vslsconvnewtask, vsldconvnewtask, vslcconvnewtask, vslzconvnewtask; INTEGER*4 task(2) for vslscorrnewtask, vsldcorrnewtask, vslccorrnewtask, vslzcorrnewtask. Fortran 90: TYPE(VSL_CONV_TASK) for vslsconvnewtask, vsldconvnewtask, vslcconvnewtask, vslzconvnewtask; TYPE(VSL_CORR_TASK) for vslscorrnewtask, vsldcorrnewtask, vslccorrnewtask, vslzcorrnewtask. C: VSLConvTaskPtr* for vslsConvNewTask, vsldConvNewTask, vslcConvNewTask, vslzConvNewTask; VSLCorrTaskPtr* for vslsCorrNewTask, vsldCorrNewTask, vslcCorrNewTask, vslzCorrNewTask | Pointer to the task descriptor if created successfully or NULL pointer otherwise.
status | FORTRAN 77: INTEGER; Fortran 90: INTEGER; C: int | Set to VSL_STATUS_OK if the task is created successfully or set to non-zero error code otherwise.

Description

Each vslConvNewTask/vslCorrNewTask constructor creates a new convolution or correlation task descriptor with the user-specified values for explicit parameters. The optional parameters are set to their default values (see Table "Convolution and Correlation Task Parameters"). The parameters xshape, yshape, and zshape define the shapes of the input and output data provided by the arrays x, y, and z, respectively. Each shape parameter is an array of integers with its length equal to the value of dims. You explicitly assign the shape parameters when calling the constructor.
If the value of the parameter dims is 1, then xshape, yshape, and zshape are equal to the number of elements read from the arrays x and y or stored to the array z. Note that values of shape parameters may differ from physical shapes of arrays x, y, and z if non-trivial strides are assigned.

If the constructor fails to create a task descriptor, it returns a NULL task pointer.

vslConvNewTask1D/vslCorrNewTask1D
Creates a new convolution or correlation task descriptor for a one-dimensional case.

Syntax

Fortran:
status = vslsconvnewtask1d(task, mode, xshape, yshape, zshape)
status = vsldconvnewtask1d(task, mode, xshape, yshape, zshape)
status = vslcconvnewtask1d(task, mode, xshape, yshape, zshape)
status = vslzconvnewtask1d(task, mode, xshape, yshape, zshape)
status = vslscorrnewtask1d(task, mode, xshape, yshape, zshape)
status = vsldcorrnewtask1d(task, mode, xshape, yshape, zshape)
status = vslccorrnewtask1d(task, mode, xshape, yshape, zshape)
status = vslzcorrnewtask1d(task, mode, xshape, yshape, zshape)

C:
status = vslsConvNewTask1D(task, mode, xshape, yshape, zshape);
status = vsldConvNewTask1D(task, mode, xshape, yshape, zshape);
status = vslcConvNewTask1D(task, mode, xshape, yshape, zshape);
status = vslzConvNewTask1D(task, mode, xshape, yshape, zshape);
status = vslsCorrNewTask1D(task, mode, xshape, yshape, zshape);
status = vsldCorrNewTask1D(task, mode, xshape, yshape, zshape);
status = vslcCorrNewTask1D(task, mode, xshape, yshape, zshape);
status = vslzCorrNewTask1D(task, mode, xshape, yshape, zshape);

Include Files
• FORTRAN 77: mkl_vsl.f77
• Fortran 90: mkl_vsl.f90
• C: mkl_vsl_functions.h

Input Parameters

Name | Type | Description
mode | FORTRAN 77: INTEGER; Fortran 90: INTEGER; C: const int | Specifies whether convolution/correlation calculation must be performed by using a direct algorithm or through Fourier transform of the input data. See Table "Values of mode parameter" for a list of possible values.
xshape | FORTRAN 77: INTEGER; Fortran 90: INTEGER; C: const int | Defines the length of the input data sequence for the source array x. See Data Allocation for more information.
yshape | FORTRAN 77: INTEGER; Fortran 90: INTEGER; C: const int | Defines the length of the input data sequence for the source array y. See Data Allocation for more information.
zshape | FORTRAN 77: INTEGER; Fortran 90: INTEGER; C: const int | Defines the length of the output data sequence to be stored in array z. See Data Allocation for more information.

Output Parameters

Name | Type | Description
task | FORTRAN 77: INTEGER*4 task(2) for vslsconvnewtask1d, vsldconvnewtask1d, vslcconvnewtask1d, vslzconvnewtask1d; INTEGER*4 task(2) for vslscorrnewtask1d, vsldcorrnewtask1d, vslccorrnewtask1d, vslzcorrnewtask1d. Fortran 90: TYPE(VSL_CONV_TASK) for vslsconvnewtask1d, vsldconvnewtask1d, vslcconvnewtask1d, vslzconvnewtask1d; TYPE(VSL_CORR_TASK) for vslscorrnewtask1d, vsldcorrnewtask1d, vslccorrnewtask1d, vslzcorrnewtask1d. C: VSLConvTaskPtr* for vslsConvNewTask1D, vsldConvNewTask1D, vslcConvNewTask1D, vslzConvNewTask1D; VSLCorrTaskPtr* for vslsCorrNewTask1D, vsldCorrNewTask1D, vslcCorrNewTask1D, vslzCorrNewTask1D | Pointer to the task descriptor if created successfully or NULL pointer otherwise.
status | FORTRAN 77: INTEGER; Fortran 90: INTEGER; C: int | Set to VSL_STATUS_OK if the task is created successfully or set to non-zero error code otherwise.

Description

Each vslConvNewTask1D/vslCorrNewTask1D constructor creates a new convolution or correlation task descriptor with the user-specified values for explicit parameters. The optional parameters are set to their default values (see Table "Convolution and Correlation Task Parameters"). Unlike vslConvNewTask/vslCorrNewTask, these routines represent a special one-dimensional version of the constructor which assumes that the value of the parameter dims is 1.
The parameters xshape, yshape, and zshape are equal to the number of elements read from the arrays x and y or stored to the array z. You explicitly assign the shape parameters when calling the constructor.

vslConvNewTaskX/vslCorrNewTaskX
Creates a new convolution or correlation task descriptor for a multidimensional case and assigns source data to the first operand vector.

Syntax

Fortran:
status = vslsconvnewtaskx(task, mode, dims, xshape, yshape, zshape, x, xstride)
status = vsldconvnewtaskx(task, mode, dims, xshape, yshape, zshape, x, xstride)
status = vslcconvnewtaskx(task, mode, dims, xshape, yshape, zshape, x, xstride)
status = vslzconvnewtaskx(task, mode, dims, xshape, yshape, zshape, x, xstride)
status = vslscorrnewtaskx(task, mode, dims, xshape, yshape, zshape, x, xstride)
status = vsldcorrnewtaskx(task, mode, dims, xshape, yshape, zshape, x, xstride)
status = vslccorrnewtaskx(task, mode, dims, xshape, yshape, zshape, x, xstride)
status = vslzcorrnewtaskx(task, mode, dims, xshape, yshape, zshape, x, xstride)

C:
status = vslsConvNewTaskX(task, mode, dims, xshape, yshape, zshape, x, xstride);
status = vsldConvNewTaskX(task, mode, dims, xshape, yshape, zshape, x, xstride);
status = vslcConvNewTaskX(task, mode, dims, xshape, yshape, zshape, x, xstride);
status = vslzConvNewTaskX(task, mode, dims, xshape, yshape, zshape, x, xstride);
status = vslsCorrNewTaskX(task, mode, dims, xshape, yshape, zshape, x, xstride);
status = vsldCorrNewTaskX(task, mode, dims, xshape, yshape, zshape, x, xstride);
status = vslcCorrNewTaskX(task, mode, dims, xshape, yshape, zshape, x, xstride);
status = vslzCorrNewTaskX(task, mode, dims, xshape, yshape, zshape, x, xstride);

Include Files
• FORTRAN 77: mkl_vsl.f77
• Fortran 90: mkl_vsl.f90
• C: mkl_vsl_functions.h

Input Parameters

Name | Type | Description
mode | FORTRAN 77: INTEGER; Fortran 90: INTEGER; C: const int | Specifies whether convolution/correlation calculation must be performed by using a direct algorithm or through Fourier transform of the input data. See Table "Values of mode parameter" for a list of possible values.
dims | FORTRAN 77: INTEGER; Fortran 90: INTEGER; C: const int | Rank of user data. Specifies number of dimensions for the input and output arrays x, y, and z used during the execution stage. Must be in the range from 1 to 7. The value is explicitly assigned by the constructor.
xshape | FORTRAN 77: INTEGER; Fortran 90: INTEGER, DIMENSION(*); C: const int[] | Defines the shape of the input data for the source array x. See Data Allocation for more information.
yshape | FORTRAN 77: INTEGER; Fortran 90: INTEGER, DIMENSION(*); C: const int[] | Defines the shape of the input data for the source array y. See Data Allocation for more information.
zshape | FORTRAN 77: INTEGER; Fortran 90: INTEGER, DIMENSION(*); C: const int[] | Defines the shape of the output data to be stored in array z. See Data Allocation for more information.
x | FORTRAN 77: REAL*4 for real data in single precision flavors, REAL*8 for real data in double precision flavors, COMPLEX*8 for complex data in single precision flavors, COMPLEX*16 for complex data in double precision flavors. Fortran 90: REAL(KIND=4), DIMENSION(*) for real data in single precision flavors, REAL(KIND=8), DIMENSION(*) for real data in double precision flavors, COMPLEX(KIND=4), DIMENSION(*) for complex data in single precision flavors, COMPLEX(KIND=8), DIMENSION(*) for complex data in double precision flavors. C: const float[] for real data in single precision flavors, const double[] for real data in double precision flavors, const MKL_Complex8[] for complex data in single precision flavors, const MKL_Complex16[] for complex data in double precision flavors | Pointer to the array containing input data for the first operand vector. See Data Allocation for more information.
xstride | FORTRAN 77: INTEGER; Fortran 90: INTEGER, DIMENSION(*); C: const int[] | Strides for input data in the array x.

Output Parameters

Name | Type | Description
task | FORTRAN 77: INTEGER*4 task(2) for vslsconvnewtaskx, vsldconvnewtaskx, vslcconvnewtaskx, vslzconvnewtaskx; INTEGER*4 task(2) for vslscorrnewtaskx, vsldcorrnewtaskx, vslccorrnewtaskx, vslzcorrnewtaskx. Fortran 90: TYPE(VSL_CONV_TASK) for vslsconvnewtaskx, vsldconvnewtaskx, vslcconvnewtaskx, vslzconvnewtaskx; TYPE(VSL_CORR_TASK) for vslscorrnewtaskx, vsldcorrnewtaskx, vslccorrnewtaskx, vslzcorrnewtaskx. C: VSLConvTaskPtr* for vslsConvNewTaskX, vsldConvNewTaskX, vslcConvNewTaskX, vslzConvNewTaskX; VSLCorrTaskPtr* for vslsCorrNewTaskX, vsldCorrNewTaskX, vslcCorrNewTaskX, vslzCorrNewTaskX | Pointer to the task descriptor if created successfully or NULL pointer otherwise.
status | FORTRAN 77: INTEGER; Fortran 90: INTEGER; C: int | Set to VSL_STATUS_OK if the task is created successfully or set to non-zero error code otherwise.

Description

Each vslConvNewTaskX/vslCorrNewTaskX constructor creates a new convolution or correlation task descriptor with the user-specified values for explicit parameters. The optional parameters are set to their default values (see Table "Convolution and Correlation Task Parameters"). Unlike vslConvNewTask/vslCorrNewTask, these routines represent the so-called X-form version of the constructor, which means that in addition to creating the task descriptor they assign particular data to the first operand vector in array x used in convolution or correlation operation.
The task descriptor created by the vslConvNewTaskX/vslCorrNewTaskX constructor keeps the pointer to the array x until the task object is deleted by one of the destructor routines (see vslConvDeleteTask/vslCorrDeleteTask). Using this form of constructors is recommended when you need to compute multiple convolutions or correlations with the same data vector in array x against different vectors in array y. This helps improve performance by eliminating unnecessary overhead in repeated computation of intermediate data required for the operation.

The parameters xshape, yshape, and zshape define the shapes of the input and output data provided by the arrays x, y, and z, respectively. Each shape parameter is an array of integers with its length equal to the value of dims. You explicitly assign the shape parameters when calling the constructor. If the value of the parameter dims is 1, then xshape, yshape, and zshape are equal to the number of elements read from the arrays x and y or stored to the array z. Note that values of shape parameters may differ from physical shapes of arrays x, y, and z if non-trivial strides are assigned.

The stride parameter xstride specifies the physical location of the input data in the array x. In a one-dimensional case, stride is an interval between locations of consecutive elements of the array. For example, if the value of the parameter xstride is s, then only every s-th element of the array x will be used to form the input sequence. The stride value can be positive or negative but must not be zero.

vslConvNewTaskX1D/vslCorrNewTaskX1D
Creates a new convolution or correlation task descriptor for a one-dimensional case and assigns source data to the first operand vector.
Syntax

Fortran:
status = vslsconvnewtaskx1d(task, mode, xshape, yshape, zshape, x, xstride)
status = vsldconvnewtaskx1d(task, mode, xshape, yshape, zshape, x, xstride)
status = vslcconvnewtaskx1d(task, mode, xshape, yshape, zshape, x, xstride)
status = vslzconvnewtaskx1d(task, mode, xshape, yshape, zshape, x, xstride)
status = vslscorrnewtaskx1d(task, mode, xshape, yshape, zshape, x, xstride)
status = vsldcorrnewtaskx1d(task, mode, xshape, yshape, zshape, x, xstride)
status = vslccorrnewtaskx1d(task, mode, xshape, yshape, zshape, x, xstride)
status = vslzcorrnewtaskx1d(task, mode, xshape, yshape, zshape, x, xstride)

C:
status = vslsConvNewTaskX1D(task, mode, xshape, yshape, zshape, x, xstride);
status = vsldConvNewTaskX1D(task, mode, xshape, yshape, zshape, x, xstride);
status = vslcConvNewTaskX1D(task, mode, xshape, yshape, zshape, x, xstride);
status = vslzConvNewTaskX1D(task, mode, xshape, yshape, zshape, x, xstride);
status = vslsCorrNewTaskX1D(task, mode, xshape, yshape, zshape, x, xstride);
status = vsldCorrNewTaskX1D(task, mode, xshape, yshape, zshape, x, xstride);
status = vslcCorrNewTaskX1D(task, mode, xshape, yshape, zshape, x, xstride);
status = vslzCorrNewTaskX1D(task, mode, xshape, yshape, zshape, x, xstride);

Include Files
• FORTRAN 77: mkl_vsl.f77
• Fortran 90: mkl_vsl.f90
• C: mkl_vsl_functions.h

Input Parameters

Name | Type | Description
mode | FORTRAN 77: INTEGER; Fortran 90: INTEGER; C: const int | Specifies whether convolution/correlation calculation must be performed by using a direct algorithm or through Fourier transform of the input data. See Table "Values of mode parameter" for a list of possible values.
xshape | FORTRAN 77: INTEGER; Fortran 90: INTEGER; C: const int | Defines the length of the input data sequence for the source array x. See Data Allocation for more information.
yshape | FORTRAN 77: INTEGER; Fortran 90: INTEGER; C: const int | Defines the length of the input data sequence for the source array y. See Data Allocation for more information.
zshape | FORTRAN 77: INTEGER; Fortran 90: INTEGER; C: const int | Defines the length of the output data sequence to be stored in array z. See Data Allocation for more information.
x | FORTRAN 77: REAL*4 for real data in single precision flavors, REAL*8 for real data in double precision flavors, COMPLEX*8 for complex data in single precision flavors, COMPLEX*16 for complex data in double precision flavors. Fortran 90: REAL(KIND=4), DIMENSION(*) for real data in single precision flavors, REAL(KIND=8), DIMENSION(*) for real data in double precision flavors, COMPLEX(KIND=4), DIMENSION(*) for complex data in single precision flavors, COMPLEX(KIND=8), DIMENSION(*) for complex data in double precision flavors. C: const float[] for real data in single precision flavors, const double[] for real data in double precision flavors, const MKL_Complex8[] for complex data in single precision flavors, const MKL_Complex16[] for complex data in double precision flavors | Pointer to the array containing input data for the first operand vector. See Data Allocation for more information.
xstride | FORTRAN 77: INTEGER; Fortran 90: INTEGER; C: const int | Stride for input data sequence in the array x.

Output Parameters

Name | Type | Description
task | FORTRAN 77: INTEGER*4 task(2) for vslsconvnewtaskx1d, vsldconvnewtaskx1d, vslcconvnewtaskx1d, vslzconvnewtaskx1d; INTEGER*4 task(2) for vslscorrnewtaskx1d, vsldcorrnewtaskx1d, vslccorrnewtaskx1d, vslzcorrnewtaskx1d. Fortran 90: TYPE(VSL_CONV_TASK) for vslsconvnewtaskx1d, vsldconvnewtaskx1d, vslcconvnewtaskx1d, vslzconvnewtaskx1d; TYPE(VSL_CORR_TASK) for vslscorrnewtaskx1d, vsldcorrnewtaskx1d, vslccorrnewtaskx1d, vslzcorrnewtaskx1d. C: VSLConvTaskPtr* for vslsConvNewTaskX1D, vsldConvNewTaskX1D, vslcConvNewTaskX1D, vslzConvNewTaskX1D; VSLCorrTaskPtr* for vslsCorrNewTaskX1D, vsldCorrNewTaskX1D, vslcCorrNewTaskX1D, vslzCorrNewTaskX1D | Pointer to the task descriptor if created successfully or NULL pointer otherwise.
status | FORTRAN 77: INTEGER; Fortran 90: INTEGER; C: int | Set to VSL_STATUS_OK if the task is created successfully or set to non-zero error code otherwise.

Description

Each vslConvNewTaskX1D/vslCorrNewTaskX1D constructor creates a new convolution or correlation task descriptor with the user-specified values for explicit parameters. The optional parameters are set to their default values (see Table "Convolution and Correlation Task Parameters"). These routines represent a special one-dimensional version of the so-called X-form of the constructor. This assumes that the value of the parameter dims is 1 and that in addition to creating the task descriptor, constructor routines assign particular data to the first operand vector in array x used in convolution or correlation operation.

The task descriptor created by the vslConvNewTaskX1D/vslCorrNewTaskX1D constructor keeps the pointer to the array x until the task object is deleted by one of the destructor routines (see vslConvDeleteTask/vslCorrDeleteTask). Using this form of constructors is recommended when you need to compute multiple convolutions or correlations with the same data vector in array x against different vectors in array y. This helps improve performance by eliminating unnecessary overhead in repeated computation of intermediate data required for the operation.

The parameters xshape, yshape, and zshape are equal to the number of elements read from the arrays x and y or stored to the array z.
You explicitly assign the shape parameters when calling the constructor.

The stride parameter xstride specifies the physical location of the input data in the array x and is an interval between locations of consecutive elements of the array. For example, if the value of the parameter xstride is s, then only every s-th element of the array x will be used to form the input sequence. The stride value can be positive or negative but must not be zero.

Task Editors

Task editors in convolution and correlation API of Intel MKL are routines intended for setting up or changing the following task parameters (see Table "Convolution and Correlation Task Parameters"):
• mode
• internal_precision
• start
• decimation
For setting up or changing each of the above parameters, a separate routine exists.

NOTE Fields of the task descriptor structure are accessible only through the set of task editor routines provided with the software.

The work data computed during the last commitment process may become invalid with respect to new parameter settings. That is why after applying any of the editor routines to change the task descriptor settings, the task loses its commitment status and goes through the full commitment process again during the next execution or copy operation. For more information on task commitment, see the Introduction to Convolution and Correlation.

Table "Task Editors" lists available task editors.

Task Editors

Routine | Description
vslConvSetMode/vslCorrSetMode | Changes the value of the parameter mode for the operation of convolution or correlation.
vslConvSetInternalPrecision/vslCorrSetInternalPrecision | Changes the value of the parameter internal_precision for the operation of convolution or correlation.
vslConvSetStart/vslCorrSetStart | Sets the value of the parameter start for the operation of convolution or correlation.
vslConvSetDecimation/vslCorrSetDecimation | Sets the value of the parameter decimation for the operation of convolution or correlation.

NOTE You can use the NULL task pointer in calls to editor routines. In this case, the routine is terminated and no system crash occurs.

vslConvSetMode/vslCorrSetMode
Changes the value of the parameter mode in the convolution or correlation task descriptor.

Syntax

Fortran:
status = vslconvsetmode(task, newmode)
status = vslcorrsetmode(task, newmode)

C:
status = vslConvSetMode(task, newmode);
status = vslCorrSetMode(task, newmode);

Include Files
• FORTRAN 77: mkl_vsl.f77
• Fortran 90: mkl_vsl.f90
• C: mkl_vsl_functions.h

Input Parameters

Name | Type | Description
task | FORTRAN 77: INTEGER*4 task(2) for vslconvsetmode; INTEGER*4 task(2) for vslcorrsetmode. Fortran 90: TYPE(VSL_CONV_TASK) for vslconvsetmode; TYPE(VSL_CORR_TASK) for vslcorrsetmode. C: VSLConvTaskPtr for vslConvSetMode; VSLCorrTaskPtr for vslCorrSetMode | Pointer to the task descriptor.
newmode | FORTRAN 77: INTEGER; Fortran 90: INTEGER; C: const int | New value of the parameter mode.

Output Parameters

Name | Type | Description
status | FORTRAN 77: INTEGER; Fortran 90: INTEGER; C: int | Current status of the task.

Description

This function is declared in mkl_vsl.f77 for the FORTRAN 77 interface, in mkl_vsl.f90 for the Fortran 90 interface, and in mkl_vsl_functions.h for the C interface.

The routine changes the value of the parameter mode for the operation of convolution or correlation. This parameter defines whether the computation should be done via Fourier transforms of the input/output data or using a direct algorithm. The initial value for mode is assigned by a task constructor. Predefined values for the mode parameter are as follows:

Values of mode parameter

Value | Purpose
VSL_CONV_MODE_FFT | Compute convolution by using fast Fourier transform.
VSL_CORR_MODE_FFT | Compute correlation by using fast Fourier transform.
VSL_CONV_MODE_DIRECT | Compute convolution directly.
VSL_CORR_MODE_DIRECT | Compute correlation directly.
VSL_CONV_MODE_AUTO | Automatically choose direct or Fourier mode for convolution.
VSL_CORR_MODE_AUTO | Automatically choose direct or Fourier mode for correlation.

vslConvSetInternalPrecision/vslCorrSetInternalPrecision
Changes the value of the parameter internal_precision in the convolution or correlation task descriptor.

Syntax

Fortran:
status = vslconvsetinternalprecision(task, precision)
status = vslcorrsetinternalprecision(task, precision)

C:
status = vslConvSetInternalPrecision(task, precision);
status = vslCorrSetInternalPrecision(task, precision);

Include Files
• FORTRAN 77: mkl_vsl.f77
• Fortran 90: mkl_vsl.f90
• C: mkl_vsl_functions.h

Input Parameters

Name | Type | Description
task | FORTRAN 77: INTEGER*4 task(2) for vslconvsetinternalprecision; INTEGER*4 task(2) for vslcorrsetinternalprecision. Fortran 90: TYPE(VSL_CONV_TASK) for vslconvsetinternalprecision; TYPE(VSL_CORR_TASK) for vslcorrsetinternalprecision. C: VSLConvTaskPtr for vslConvSetInternalPrecision; VSLCorrTaskPtr for vslCorrSetInternalPrecision | Pointer to the task descriptor.
precision | FORTRAN 77: INTEGER; Fortran 90: INTEGER; C: const int | New value of the parameter internal_precision.

Output Parameters

Name | Type | Description
status | FORTRAN 77: INTEGER; Fortran 90: INTEGER; C: int | Current status of the task.

Description

The vslConvSetInternalPrecision/vslCorrSetInternalPrecision routine changes the value of the parameter internal_precision for the operation of convolution or correlation. This parameter defines whether the internal computations of the convolution or correlation result should be done in single or double precision. Initial value for internal_precision is assigned by a task constructor and set to either "single" or "double" according to the particular flavor of the constructor used. Changing the internal_precision can be useful if the default setting of this parameter was "single" but you want to calculate the result with double precision even if input and output data are represented in single precision. Predefined values for the internal_precision input parameter are as follows:

Values of internal_precision Parameter

Value | Purpose
VSL_CONV_PRECISION_SINGLE | Compute convolution with single precision.
VSL_CORR_PRECISION_SINGLE | Compute correlation with single precision.
VSL_CONV_PRECISION_DOUBLE | Compute convolution with double precision.
VSL_CORR_PRECISION_DOUBLE | Compute correlation with double precision.

vslConvSetStart/vslCorrSetStart
Changes the value of the parameter start in the convolution or correlation task descriptor.

Syntax

Fortran:
status = vslconvsetstart(task, start)
status = vslcorrsetstart(task, start)

C:
status = vslConvSetStart(task, start);
status = vslCorrSetStart(task, start);

Include Files
• FORTRAN 77: mkl_vsl.f77
• Fortran 90: mkl_vsl.f90
• C: mkl_vsl_functions.h

Input Parameters

Name | Type | Description
task | FORTRAN 77: INTEGER*4 task(2) for vslconvsetstart; INTEGER*4 task(2) for vslcorrsetstart. Fortran 90: TYPE(VSL_CONV_TASK) for vslconvsetstart; TYPE(VSL_CORR_TASK) for vslcorrsetstart. C: VSLConvTaskPtr for vslConvSetStart; VSLCorrTaskPtr for vslCorrSetStart | Pointer to the task descriptor.
start | FORTRAN 77: INTEGER; Fortran 90: INTEGER, DIMENSION(*); C: const int[] | New value of the parameter start.

Output Parameters

Name | Type | Description
status | FORTRAN 77: INTEGER; Fortran 90: INTEGER; C: int | Current status of the task.

Description

The vslConvSetStart/vslCorrSetStart routine sets the value of the parameter start for the operation of convolution or correlation.
In a one-dimensional case, this parameter points to the first element in the mathematical result that should be stored in the output array. In a multidimensional case, start is an array of indices and its length is equal to the number of dimensions specified by the parameter dims. For more information about the definition and effect of this parameter, see Data Allocation.

During the initial task descriptor construction, the default value for start is undefined and this parameter is not used. Therefore the only way to set and use the start parameter is via assigning it some value by one of the vslConvSetStart/vslCorrSetStart routines.

vslConvSetDecimation/vslCorrSetDecimation

Changes the value of the parameter decimation in the convolution or correlation task descriptor.

Syntax

Fortran:
    status = vslconvsetdecimation(task, decimation)
    status = vslcorrsetdecimation(task, decimation)
C:
    status = vslConvSetDecimation(task, decimation);
    status = vslCorrSetDecimation(task, decimation);

Include Files

• FORTRAN 77: mkl_vsl.f77
• Fortran 90: mkl_vsl.f90
• C: mkl_vsl_functions.h

Input Parameters

Name: task
Type:
    FORTRAN 77: INTEGER*4 task(2) for vslconvsetdecimation
                INTEGER*4 task(2) for vslcorrsetdecimation
    Fortran 90: TYPE(VSL_CONV_TASK) for vslconvsetdecimation
                TYPE(VSL_CORR_TASK) for vslcorrsetdecimation
    C: VSLConvTaskPtr for vslConvSetDecimation
       VSLCorrTaskPtr for vslCorrSetDecimation
Description: Pointer to the task descriptor.

Name: decimation
Type:
    FORTRAN 77: INTEGER
    Fortran 90: INTEGER, DIMENSION(*)
    C: const int[]
Description: New value of the parameter decimation.

Output Parameters

Name: status
Type:
    FORTRAN 77: INTEGER
    Fortran 90: INTEGER
    C: int
Description: Current status of the task.

Description

The routine sets the value of the parameter decimation for the operation of convolution or correlation.
This parameter determines how to thin out the mathematical result of convolution or correlation before writing it into the output data array. For example, in a one-dimensional case, if decimation = d > 1, only every d-th element of the mathematical result is written to the output array z. In a multidimensional case, decimation is an array of indices and its length is equal to the number of dimensions specified by the parameter dims. For more information about the definition and effect of this parameter, see Data Allocation.

During the initial task descriptor construction, the default value for decimation is undefined and this parameter is not used. Therefore the only way to set and use the decimation parameter is to assign it a value with one of the vslConvSetDecimation/vslCorrSetDecimation routines.

Task Execution Routines

Task execution routines compute convolution or correlation results based on parameters held by the task descriptor and on the user data supplied for input vectors.

After you create and adjust a task, you can execute it multiple times by applying it to different input/output data of the same type, precision, and shape.

Intel MKL provides the following forms of convolution/correlation execution routines:

• General form executors that use the task descriptor created by the general form constructor and expect to get two source data arrays x and y on input
• X-form executors that use the task descriptor created by the X-form constructor and expect to get only one source data array y on input because the first array x has already been specified at the construction stage

When the task is executed for the first time, the execution routine includes a task commitment operation, which involves two basic steps: a parameter consistency check and preparation of auxiliary data (for example, this might be the calculation of the Fourier transform for input data).

Each execution routine has an associated one-dimensional version that provides algorithmic and computational benefits.
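The effect of the start and decimation task parameters on a one-dimensional result can be illustrated without calling the library. The following is a minimal sketch in plain C, not MKL code: full_convolution and select_output are hypothetical helper names introduced here, and the sketch assumes a zero-based start index, whereas the exact index convention used by the library is the one defined in Data Allocation.

```c
#include <assert.h>

/* Full one-dimensional convolution computed directly:
 * full[k] = sum over i of x[i] * y[k - i], for k = 0 .. nx + ny - 2.
 * The caller provides room for nx + ny - 1 values in full. */
static void full_convolution(const float *x, int nx,
                             const float *y, int ny, float *full)
{
    for (int k = 0; k < nx + ny - 1; k++) {
        float acc = 0.0f;
        for (int i = 0; i < nx; i++) {
            int j = k - i;
            if (j >= 0 && j < ny)
                acc += x[i] * y[j];
        }
        full[k] = acc;
    }
}

/* Hypothetical helper (not an MKL routine): keeps every decimation-th value
 * of the full result, beginning at the zero-based index start (an assumed
 * convention), until nz values are written or the full result is exhausted.
 * Returns the number of values written to z. */
static int select_output(const float *full, int nfull,
                         int start, int decimation, float *z, int nz)
{
    assert(decimation >= 1);
    assert(start >= 0);
    int written = 0;
    for (int k = 0; k < nz; k++) {
        int src = start + decimation * k;
        if (src >= nfull)
            break;
        z[k] = full[src];
        written++;
    }
    return written;
}
```

For x = {1, 2, 3} and y = {1, 1, 1, 1}, the full result is {1, 3, 6, 6, 5, 3}; with start = 1 and decimation = 2 the helper keeps {3, 6, 3}. The library applies the equivalent selection internally, so in real code only the task editors need to be called.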
NOTE You can use the NULL task pointer in calls to execution routines. In this case, the routine is terminated and no system crash occurs.

If the task is executed successfully, the execution routine returns the zero status code. If an error is detected, the execution routine returns an error code which signals that a specific error has occurred. In particular, an error status code is returned in the following cases:

• if the task pointer is NULL
• if the task descriptor is corrupted
• if calculation has failed for some other reason.

NOTE Intel® MKL does not control floating-point errors, like overflow or gradual underflow, or operations with NaNs, etc.

If an error occurs, the task descriptor stores the error code.

The table below lists all task execution routines.

Task Execution Routines

Routine                          Description
vslConvExec/vslCorrExec          Computes convolution or correlation for a multidimensional case.
vslConvExec1D/vslCorrExec1D      Computes convolution or correlation for a one-dimensional case.
vslConvExecX/vslCorrExecX        Computes convolution or correlation as X-form for a multidimensional case.
vslConvExecX1D/vslCorrExecX1D    Computes convolution or correlation as X-form for a one-dimensional case.

vslConvExec/vslCorrExec

Computes convolution or correlation for a multidimensional case.
Syntax

Fortran:
    status = vslsconvexec(task, x, xstride, y, ystride, z, zstride)
    status = vsldconvexec(task, x, xstride, y, ystride, z, zstride)
    status = vslcconvexec(task, x, xstride, y, ystride, z, zstride)
    status = vslzconvexec(task, x, xstride, y, ystride, z, zstride)
    status = vslscorrexec(task, x, xstride, y, ystride, z, zstride)
    status = vsldcorrexec(task, x, xstride, y, ystride, z, zstride)
    status = vslccorrexec(task, x, xstride, y, ystride, z, zstride)
    status = vslzcorrexec(task, x, xstride, y, ystride, z, zstride)
C:
    status = vslsConvExec(task, x, xstride, y, ystride, z, zstride);
    status = vsldConvExec(task, x, xstride, y, ystride, z, zstride);
    status = vslcConvExec(task, x, xstride, y, ystride, z, zstride);
    status = vslzConvExec(task, x, xstride, y, ystride, z, zstride);
    status = vslsCorrExec(task, x, xstride, y, ystride, z, zstride);
    status = vsldCorrExec(task, x, xstride, y, ystride, z, zstride);
    status = vslcCorrExec(task, x, xstride, y, ystride, z, zstride);
    status = vslzCorrExec(task, x, xstride, y, ystride, z, zstride);

Include Files

• FORTRAN 77: mkl_vsl.f77
• Fortran 90: mkl_vsl.f90
• C: mkl_vsl_functions.h

Input Parameters

Name: task
Type:
    FORTRAN 77: INTEGER*4 task(2) for vslsconvexec, vsldconvexec, vslcconvexec, vslzconvexec
                INTEGER*4 task(2) for vslscorrexec, vsldcorrexec, vslccorrexec, vslzcorrexec
    Fortran 90: TYPE(VSL_CONV_TASK) for vslsconvexec, vsldconvexec, vslcconvexec, vslzconvexec
                TYPE(VSL_CORR_TASK) for vslscorrexec, vsldcorrexec, vslccorrexec, vslzcorrexec
    C: VSLConvTaskPtr for vslsConvExec, vsldConvExec, vslcConvExec, vslzConvExec
       VSLCorrTaskPtr for vslsCorrExec, vsldCorrExec, vslcCorrExec, vslzCorrExec
Description: Pointer to the task descriptor.

Name: x, y
Type:
    FORTRAN 77: REAL*4 for vslsconvexec and vslscorrexec,
                REAL*8 for vsldconvexec and vsldcorrexec,
                COMPLEX*8 for vslcconvexec and vslccorrexec,
                COMPLEX*16 for vslzconvexec and vslzcorrexec
    Fortran 90: REAL(KIND=4), DIMENSION(*) for vslsconvexec and vslscorrexec,
                REAL(KIND=8), DIMENSION(*) for vsldconvexec and vsldcorrexec,
                COMPLEX(KIND=4), DIMENSION(*) for vslcconvexec and vslccorrexec,
                COMPLEX(KIND=8), DIMENSION(*) for vslzconvexec and vslzcorrexec
    C: const float[] for vslsConvExec and vslsCorrExec,
       const double[] for vsldConvExec and vsldCorrExec,
       const MKL_Complex8[] for vslcConvExec and vslcCorrExec,
       const MKL_Complex16[] for vslzConvExec and vslzCorrExec
Description: Pointers to arrays containing input data. See Data Allocation for more information.

Name: xstride, ystride, zstride
Type:
    FORTRAN 77: INTEGER
    Fortran 90: INTEGER, DIMENSION(*)
    C: const int[]
Description: Strides for input and output data. For more information, see stride parameters.

Output Parameters

Name: z
Type:
    FORTRAN 77: REAL*4 for vslsconvexec and vslscorrexec,
                REAL*8 for vsldconvexec and vsldcorrexec,
                COMPLEX*8 for vslcconvexec and vslccorrexec,
                COMPLEX*16 for vslzconvexec and vslzcorrexec
    Fortran 90: REAL(KIND=4), DIMENSION(*) for vslsconvexec and vslscorrexec,
                REAL(KIND=8), DIMENSION(*) for vsldconvexec and vsldcorrexec,
                COMPLEX(KIND=4), DIMENSION(*) for vslcconvexec and vslccorrexec,
                COMPLEX(KIND=8), DIMENSION(*) for vslzconvexec and vslzcorrexec
    C: float[] for vslsConvExec and vslsCorrExec,
       double[] for vsldConvExec and vsldCorrExec,
       MKL_Complex8[] for vslcConvExec and vslcCorrExec,
       MKL_Complex16[] for vslzConvExec and vslzCorrExec
Description: Pointer to the array that stores output data. See Data Allocation for more information.

Name: status
Type:
    FORTRAN 77: INTEGER
    Fortran 90: INTEGER
    C: int
Description: Set to VSL_STATUS_OK if the task is executed successfully or set to non-zero error code otherwise.

Description

Each of the vslConvExec/vslCorrExec routines computes convolution or correlation of the data provided by the arrays x and y and then stores the results in the array z.
Parameters of the operation are read from the task descriptor created previously by a corresponding vslConvNewTask/vslCorrNewTask constructor and pointed to by task. If task is NULL, no operation is done.

The stride parameters xstride, ystride, and zstride specify the physical location of the input and output data in the arrays x, y, and z, respectively. In a one-dimensional case, stride is an interval between locations of consecutive elements of the array. For example, if the value of the parameter zstride is s, then only every s-th element of the array z will be used to store the output data. The stride value must be positive or negative but not zero.

vslConvExec1D/vslCorrExec1D

Computes convolution or correlation for a one-dimensional case.

Syntax

Fortran:
    status = vslsconvexec1d(task, x, xstride, y, ystride, z, zstride)
    status = vsldconvexec1d(task, x, xstride, y, ystride, z, zstride)
    status = vslcconvexec1d(task, x, xstride, y, ystride, z, zstride)
    status = vslzconvexec1d(task, x, xstride, y, ystride, z, zstride)
    status = vslscorrexec1d(task, x, xstride, y, ystride, z, zstride)
    status = vsldcorrexec1d(task, x, xstride, y, ystride, z, zstride)
    status = vslccorrexec1d(task, x, xstride, y, ystride, z, zstride)
    status = vslzcorrexec1d(task, x, xstride, y, ystride, z, zstride)
C:
    status = vslsConvExec1D(task, x, xstride, y, ystride, z, zstride);
    status = vsldConvExec1D(task, x, xstride, y, ystride, z, zstride);
    status = vslcConvExec1D(task, x, xstride, y, ystride, z, zstride);
    status = vslzConvExec1D(task, x, xstride, y, ystride, z, zstride);
    status = vslsCorrExec1D(task, x, xstride, y, ystride, z, zstride);
    status = vsldCorrExec1D(task, x, xstride, y, ystride, z, zstride);
    status = vslcCorrExec1D(task, x, xstride, y, ystride, z, zstride);
    status = vslzCorrExec1D(task, x, xstride, y, ystride, z, zstride);

Include Files

• FORTRAN 77: mkl_vsl.f77
• Fortran 90: mkl_vsl.f90
• C: mkl_vsl_functions.h

Input Parameters

Name: task
Type:
    FORTRAN 77: INTEGER*4 task(2) for vslsconvexec1d, vsldconvexec1d, vslcconvexec1d, vslzconvexec1d
                INTEGER*4 task(2) for vslscorrexec1d, vsldcorrexec1d, vslccorrexec1d, vslzcorrexec1d
    Fortran 90: TYPE(VSL_CONV_TASK) for vslsconvexec1d, vsldconvexec1d, vslcconvexec1d, vslzconvexec1d
                TYPE(VSL_CORR_TASK) for vslscorrexec1d, vsldcorrexec1d, vslccorrexec1d, vslzcorrexec1d
    C: VSLConvTaskPtr for vslsConvExec1D, vsldConvExec1D, vslcConvExec1D, vslzConvExec1D
       VSLCorrTaskPtr for vslsCorrExec1D, vsldCorrExec1D, vslcCorrExec1D, vslzCorrExec1D
Description: Pointer to the task descriptor.

Name: x, y
Type:
    FORTRAN 77: REAL*4 for vslsconvexec1d and vslscorrexec1d,
                REAL*8 for vsldconvexec1d and vsldcorrexec1d,
                COMPLEX*8 for vslcconvexec1d and vslccorrexec1d,
                COMPLEX*16 for vslzconvexec1d and vslzcorrexec1d
    Fortran 90: REAL(KIND=4), DIMENSION(*) for vslsconvexec1d and vslscorrexec1d,
                REAL(KIND=8), DIMENSION(*) for vsldconvexec1d and vsldcorrexec1d,
                COMPLEX(KIND=4), DIMENSION(*) for vslcconvexec1d and vslccorrexec1d,
                COMPLEX(KIND=8), DIMENSION(*) for vslzconvexec1d and vslzcorrexec1d
    C: const float[] for vslsConvExec1D and vslsCorrExec1D,
       const double[] for vsldConvExec1D and vsldCorrExec1D,
       const MKL_Complex8[] for vslcConvExec1D and vslcCorrExec1D,
       const MKL_Complex16[] for vslzConvExec1D and vslzCorrExec1D
Description: Pointers to arrays containing input data. See Data Allocation for more information.

Name: xstride, ystride, zstride
Type:
    FORTRAN 77: INTEGER
    Fortran 90: INTEGER
    C: const int
Description: Strides for input and output data. For more information, see stride parameters.
Output Parameters

Name: z
Type:
    FORTRAN 77: REAL*4 for vslsconvexec1d and vslscorrexec1d,
                REAL*8 for vsldconvexec1d and vsldcorrexec1d,
                COMPLEX*8 for vslcconvexec1d and vslccorrexec1d,
                COMPLEX*16 for vslzconvexec1d and vslzcorrexec1d
    Fortran 90: REAL(KIND=4), DIMENSION(*) for vslsconvexec1d and vslscorrexec1d,
                REAL(KIND=8), DIMENSION(*) for vsldconvexec1d and vsldcorrexec1d,
                COMPLEX(KIND=4), DIMENSION(*) for vslcconvexec1d and vslccorrexec1d,
                COMPLEX(KIND=8), DIMENSION(*) for vslzconvexec1d and vslzcorrexec1d
    C: float[] for vslsConvExec1D and vslsCorrExec1D,
       double[] for vsldConvExec1D and vsldCorrExec1D,
       MKL_Complex8[] for vslcConvExec1D and vslcCorrExec1D,
       MKL_Complex16[] for vslzConvExec1D and vslzCorrExec1D
Description: Pointer to the array that stores output data. See Data Allocation for more information.

Name: status
Type:
    FORTRAN 77: INTEGER
    Fortran 90: INTEGER
    C: int
Description: Set to VSL_STATUS_OK if the task is executed successfully or set to non-zero error code otherwise.

Description

Each of the vslConvExec1D/vslCorrExec1D routines computes convolution or correlation of the data provided by the arrays x and y and then stores the results in the array z. These routines represent a special one-dimensional version of the operation, assuming that the value of the parameter dims is 1. Using this version of execution routines can help speed up performance in case of one-dimensional data.

Parameters of the operation are read from the task descriptor created previously by a corresponding vslConvNewTask1D/vslCorrNewTask1D constructor and pointed to by task. If task is NULL, no operation is done.

vslConvExecX/vslCorrExecX

Computes convolution or correlation for a multidimensional case with the fixed first operand vector.
Syntax

Fortran:
    status = vslsconvexecx(task, y, ystride, z, zstride)
    status = vsldconvexecx(task, y, ystride, z, zstride)
    status = vslcconvexecx(task, y, ystride, z, zstride)
    status = vslzconvexecx(task, y, ystride, z, zstride)
    status = vslscorrexecx(task, y, ystride, z, zstride)
    status = vsldcorrexecx(task, y, ystride, z, zstride)
    status = vslccorrexecx(task, y, ystride, z, zstride)
    status = vslzcorrexecx(task, y, ystride, z, zstride)
C:
    status = vslsConvExecX(task, y, ystride, z, zstride);
    status = vsldConvExecX(task, y, ystride, z, zstride);
    status = vslcConvExecX(task, y, ystride, z, zstride);
    status = vslzConvExecX(task, y, ystride, z, zstride);
    status = vslsCorrExecX(task, y, ystride, z, zstride);
    status = vsldCorrExecX(task, y, ystride, z, zstride);
    status = vslcCorrExecX(task, y, ystride, z, zstride);
    status = vslzCorrExecX(task, y, ystride, z, zstride);

Include Files

• FORTRAN 77: mkl_vsl.f77
• Fortran 90: mkl_vsl.f90
• C: mkl_vsl_functions.h

Input Parameters

Name: task
Type:
    FORTRAN 77: INTEGER*4 task(2) for vslsconvexecx, vsldconvexecx, vslcconvexecx, vslzconvexecx
                INTEGER*4 task(2) for vslscorrexecx, vsldcorrexecx, vslccorrexecx, vslzcorrexecx
    Fortran 90: TYPE(VSL_CONV_TASK) for vslsconvexecx, vsldconvexecx, vslcconvexecx, vslzconvexecx
                TYPE(VSL_CORR_TASK) for vslscorrexecx, vsldcorrexecx, vslccorrexecx, vslzcorrexecx
    C: VSLConvTaskPtr for vslsConvExecX, vsldConvExecX, vslcConvExecX, vslzConvExecX
       VSLCorrTaskPtr for vslsCorrExecX, vsldCorrExecX, vslcCorrExecX, vslzCorrExecX
Description: Pointer to the task descriptor.

Name: y
Type:
    FORTRAN 77: REAL*4 for vslsconvexecx and vslscorrexecx,
                REAL*8 for vsldconvexecx and vsldcorrexecx,
                COMPLEX*8 for vslcconvexecx and vslccorrexecx,
                COMPLEX*16 for vslzconvexecx and vslzcorrexecx
    Fortran 90: REAL(KIND=4), DIMENSION(*) for vslsconvexecx and vslscorrexecx,
                REAL(KIND=8), DIMENSION(*) for vsldconvexecx and vsldcorrexecx,
                COMPLEX(KIND=4), DIMENSION(*) for vslcconvexecx and vslccorrexecx,
                COMPLEX(KIND=8), DIMENSION(*) for vslzconvexecx and vslzcorrexecx
    C: const float[] for vslsConvExecX and vslsCorrExecX,
       const double[] for vsldConvExecX and vsldCorrExecX,
       const MKL_Complex8[] for vslcConvExecX and vslcCorrExecX,
       const MKL_Complex16[] for vslzConvExecX and vslzCorrExecX
Description: Pointer to array containing input data (for the second operand vector). See Data Allocation for more information.

Name: ystride, zstride
Type:
    FORTRAN 77: INTEGER
    Fortran 90: INTEGER, DIMENSION(*)
    C: const int[]
Description: Strides for input and output data. For more information, see stride parameters.

Output Parameters

Name: z
Type:
    FORTRAN 77: REAL*4 for vslsconvexecx and vslscorrexecx,
                REAL*8 for vsldconvexecx and vsldcorrexecx,
                COMPLEX*8 for vslcconvexecx and vslccorrexecx,
                COMPLEX*16 for vslzconvexecx and vslzcorrexecx
    Fortran 90: REAL(KIND=4), DIMENSION(*) for vslsconvexecx and vslscorrexecx,
                REAL(KIND=8), DIMENSION(*) for vsldconvexecx and vsldcorrexecx,
                COMPLEX(KIND=4), DIMENSION(*) for vslcconvexecx and vslccorrexecx,
                COMPLEX(KIND=8), DIMENSION(*) for vslzconvexecx and vslzcorrexecx
    C: float[] for vslsConvExecX and vslsCorrExecX,
       double[] for vsldConvExecX and vsldCorrExecX,
       MKL_Complex8[] for vslcConvExecX and vslcCorrExecX,
       MKL_Complex16[] for vslzConvExecX and vslzCorrExecX
Description: Pointer to the array that stores output data. See Data Allocation for more information.

Name: status
Type:
    FORTRAN 77: INTEGER
    Fortran 90: INTEGER
    C: int
Description: Set to VSL_STATUS_OK if the task is executed successfully or set to non-zero error code otherwise.

Description

Each of the vslConvExecX/vslCorrExecX routines computes convolution or correlation of the data provided by the arrays x and y and then stores the results in the array z.
These routines represent a special version of the operation, which assumes that the first operand vector was set on the task construction stage and the task object keeps the pointer to the array x.

Parameters of the operation are read from the task descriptor created previously by a corresponding vslConvNewTaskX/vslCorrNewTaskX constructor and pointed to by task. If task is NULL, no operation is done.

Using this form of execution routines is recommended when you need to compute multiple convolutions or correlations with the same data vector in array x against different vectors in array y. This helps improve performance by eliminating unnecessary overhead in repeated computation of intermediate data required for the operation.

vslConvExecX1D/vslCorrExecX1D

Computes convolution or correlation for a one-dimensional case with the fixed first operand vector.

Syntax

Fortran:
    status = vslsconvexecx1d(task, y, ystride, z, zstride)
    status = vsldconvexecx1d(task, y, ystride, z, zstride)
    status = vslcconvexecx1d(task, y, ystride, z, zstride)
    status = vslzconvexecx1d(task, y, ystride, z, zstride)
    status = vslscorrexecx1d(task, y, ystride, z, zstride)
    status = vsldcorrexecx1d(task, y, ystride, z, zstride)
    status = vslccorrexecx1d(task, y, ystride, z, zstride)
    status = vslzcorrexecx1d(task, y, ystride, z, zstride)
C:
    status = vslsConvExecX1D(task, y, ystride, z, zstride);
    status = vsldConvExecX1D(task, y, ystride, z, zstride);
    status = vslcConvExecX1D(task, y, ystride, z, zstride);
    status = vslzConvExecX1D(task, y, ystride, z, zstride);
    status = vslsCorrExecX1D(task, y, ystride, z, zstride);
    status = vsldCorrExecX1D(task, y, ystride, z, zstride);
    status = vslcCorrExecX1D(task, y, ystride, z, zstride);
    status = vslzCorrExecX1D(task, y, ystride, z, zstride);

Include Files

• FORTRAN 77: mkl_vsl.f77
• Fortran 90: mkl_vsl.f90
• C: mkl_vsl_functions.h

Input Parameters

Name: task
Type:
    FORTRAN 77: INTEGER*4 task(2) for vslsconvexecx1d, vsldconvexecx1d, vslcconvexecx1d, vslzconvexecx1d
                INTEGER*4 task(2) for vslscorrexecx1d, vsldcorrexecx1d, vslccorrexecx1d, vslzcorrexecx1d
    Fortran 90: TYPE(VSL_CONV_TASK) for vslsconvexecx1d, vsldconvexecx1d, vslcconvexecx1d, vslzconvexecx1d
                TYPE(VSL_CORR_TASK) for vslscorrexecx1d, vsldcorrexecx1d, vslccorrexecx1d, vslzcorrexecx1d
    C: VSLConvTaskPtr for vslsConvExecX1D, vsldConvExecX1D, vslcConvExecX1D, vslzConvExecX1D
       VSLCorrTaskPtr for vslsCorrExecX1D, vsldCorrExecX1D, vslcCorrExecX1D, vslzCorrExecX1D
Description: Pointer to the task descriptor.

Name: y
Type:
    FORTRAN 77: REAL*4 for vslsconvexecx1d and vslscorrexecx1d,
                REAL*8 for vsldconvexecx1d and vsldcorrexecx1d,
                COMPLEX*8 for vslcconvexecx1d and vslccorrexecx1d,
                COMPLEX*16 for vslzconvexecx1d and vslzcorrexecx1d
    Fortran 90: REAL(KIND=4), DIMENSION(*) for vslsconvexecx1d and vslscorrexecx1d,
                REAL(KIND=8), DIMENSION(*) for vsldconvexecx1d and vsldcorrexecx1d,
                COMPLEX(KIND=4), DIMENSION(*) for vslcconvexecx1d and vslccorrexecx1d,
                COMPLEX(KIND=8), DIMENSION(*) for vslzconvexecx1d and vslzcorrexecx1d
    C: const float[] for vslsConvExecX1D and vslsCorrExecX1D,
       const double[] for vsldConvExecX1D and vsldCorrExecX1D,
       const MKL_Complex8[] for vslcConvExecX1D and vslcCorrExecX1D,
       const MKL_Complex16[] for vslzConvExecX1D and vslzCorrExecX1D
Description: Pointer to array containing input data (for the second operand vector). See Data Allocation for more information.

Name: ystride, zstride
Type:
    FORTRAN 77: INTEGER
    Fortran 90: INTEGER
    C: const int
Description: Strides for input and output data. For more information, see stride parameters.
Output Parameters

Name: z
Type:
    FORTRAN 77: REAL*4 for vslsconvexecx1d and vslscorrexecx1d,
                REAL*8 for vsldconvexecx1d and vsldcorrexecx1d,
                COMPLEX*8 for vslcconvexecx1d and vslccorrexecx1d,
                COMPLEX*16 for vslzconvexecx1d and vslzcorrexecx1d
    Fortran 90: REAL(KIND=4), DIMENSION(*) for vslsconvexecx1d and vslscorrexecx1d,
                REAL(KIND=8), DIMENSION(*) for vsldconvexecx1d and vsldcorrexecx1d,
                COMPLEX(KIND=4), DIMENSION(*) for vslcconvexecx1d and vslccorrexecx1d,
                COMPLEX(KIND=8), DIMENSION(*) for vslzconvexecx1d and vslzcorrexecx1d
    C: float[] for vslsConvExecX1D and vslsCorrExecX1D,
       double[] for vsldConvExecX1D and vsldCorrExecX1D,
       MKL_Complex8[] for vslcConvExecX1D and vslcCorrExecX1D,
       MKL_Complex16[] for vslzConvExecX1D and vslzCorrExecX1D
Description: Pointer to the array that stores output data. See Data Allocation for more information.

Name: status
Type:
    FORTRAN 77: INTEGER
    Fortran 90: INTEGER
    C: int
Description: Set to VSL_STATUS_OK if the task is executed successfully or set to non-zero error code otherwise.

Description

Each of the vslConvExecX1D/vslCorrExecX1D routines computes convolution or correlation of one-dimensional (assuming that dims = 1) data provided by the arrays x and y and then stores the results in the array z. These routines represent a special version of the operation, which expects that the first operand vector was set on the task construction stage.

Parameters of the operation are read from the task descriptor created previously by a corresponding vslConvNewTaskX1D/vslCorrNewTaskX1D constructor and pointed to by task. If task is NULL, no operation is done.

Using this form of execution routines is recommended when you need to compute multiple one-dimensional convolutions or correlations with the same data vector in array x against different vectors in array y.
This helps improve performance by eliminating unnecessary overhead in repeated computation of intermediate data required for the operation.

Task Destructors

Task destructors are routines designed for deleting task objects and deallocating memory.

vslConvDeleteTask/vslCorrDeleteTask

Destroys the task object and frees the memory.

Syntax

Fortran:
    errcode = vslconvdeletetask(task)
    errcode = vslcorrdeletetask(task)
C:
    errcode = vslConvDeleteTask(task);
    errcode = vslCorrDeleteTask(task);

Include Files

• FORTRAN 77: mkl_vsl.f77
• Fortran 90: mkl_vsl.f90
• C: mkl_vsl_functions.h

Input Parameters

Name: task
Type:
    FORTRAN 77: INTEGER*4 task(2) for vslconvdeletetask
                INTEGER*4 task(2) for vslcorrdeletetask
    Fortran 90: TYPE(VSL_CONV_TASK) for vslconvdeletetask
                TYPE(VSL_CORR_TASK) for vslcorrdeletetask
    C: VSLConvTaskPtr* for vslConvDeleteTask
       VSLCorrTaskPtr* for vslCorrDeleteTask
Description: Pointer to the task descriptor.

Output Parameters

Name: errcode
Type:
    FORTRAN 77: INTEGER
    Fortran 90: INTEGER
    C: int
Description: Contains 0 if the task object is deleted successfully. Contains an error code if an error occurred.

Description

The vslConvDeleteTask/vslCorrDeleteTask routine deletes the task descriptor object and frees any working memory and the memory allocated for the data structure. The task pointer is set to NULL.

Note that if the vslConvDeleteTask/vslCorrDeleteTask routine does not delete the task successfully, the routine returns an error code. This error code has no relation to the task status code and does not change it.

NOTE You can use the NULL task pointer in calls to destructor routines. In this case, the routine terminates with no system crash.

Task Copy

Task copy routines are designed for copying convolution and correlation task descriptors.

vslConvCopyTask/vslCorrCopyTask

Copies a descriptor for convolution or correlation task.
Syntax

Fortran:
    status = vslconvcopytask(newtask, srctask)
    status = vslcorrcopytask(newtask, srctask)
C:
    status = vslConvCopyTask(newtask, srctask);
    status = vslCorrCopyTask(newtask, srctask);

Include Files

• FORTRAN 77: mkl_vsl.f77
• Fortran 90: mkl_vsl.f90
• C: mkl_vsl_functions.h

Input Parameters

Name: srctask
Type:
    FORTRAN 77: INTEGER*4 srctask(2) for vslconvcopytask
                INTEGER*4 srctask(2) for vslcorrcopytask
    Fortran 90: TYPE(VSL_CONV_TASK) for vslconvcopytask
                TYPE(VSL_CORR_TASK) for vslcorrcopytask
    C: const VSLConvTaskPtr for vslConvCopyTask
       const VSLCorrTaskPtr for vslCorrCopyTask
Description: Pointer to the source task descriptor.

Output Parameters

Name: newtask
Type:
    FORTRAN 77: INTEGER*4 newtask(2) for vslconvcopytask
                INTEGER*4 newtask(2) for vslcorrcopytask
    Fortran 90: TYPE(VSL_CONV_TASK) for vslconvcopytask
                TYPE(VSL_CORR_TASK) for vslcorrcopytask
    C: VSLConvTaskPtr* for vslConvCopyTask
       VSLCorrTaskPtr* for vslCorrCopyTask
Description: Pointer to the new task descriptor.

Name: status
Type:
    FORTRAN 77: INTEGER
    Fortran 90: INTEGER
    C: int
Description: Current status of the source task.

Description

If a task object srctask already exists, you can use an appropriate vslConvCopyTask/vslCorrCopyTask routine to make its copy in newtask. After the copy operation, both source and new task objects will become committed (see Introduction to Convolution and Correlation for information about task commitment). If the source task was not previously committed, the commitment operation for this task is implicitly invoked before copying starts. If an error occurs during source task commitment, the task stores the error code in the status field. If an error occurs during copy operation, the routine returns a NULL pointer instead of a reference to a new task object.
Usage Examples

This section demonstrates how you can use the Intel MKL routines to perform some common convolution and correlation operations both for single-threaded and multithreaded calculations. The following two sample functions scond1 and sconf1 simulate the convolution and correlation functions SCOND and SCONF found in the IBM ESSL* library. The functions assume single-threaded calculations and can be used with C or C++ compilers.

Function scond1 for Single-Threaded Calculations

    #include "mkl_vsl.h"

    int scond1(
        float h[], int inch,
        float x[], int incx,
        float y[], int incy,
        int nh, int nx, int iy0, int ny)
    {
        int status;
        VSLConvTaskPtr task;
        vslsConvNewTask1D(&task,VSL_CONV_MODE_DIRECT,nh,nx,ny);
        vslConvSetStart(task, &iy0);
        status = vslsConvExec1D(task, h,inch, x,incx, y,incy);
        vslConvDeleteTask(&task);
        return status;
    }

Function sconf1 for Single-Threaded Calculations

    #include "mkl_vsl.h"

    int sconf1(
        int init,
        float h[], int inc1h,
        float x[], int inc1x, int inc2x,
        float y[], int inc1y, int inc2y,
        int nh, int nx, int m,
        int iy0, int ny,
        void* aux1, int naux1,
        void* aux2, int naux2)
    {
        int status;
        /* assume that aux1!=0 and naux1 is big enough */
        VSLConvTaskPtr* task = (VSLConvTaskPtr*)aux1;
        if (init != 0)
            /* initialization: */
            status = vslsConvNewTaskX1D(task,VSL_CONV_MODE_FFT,
                nh,nx,ny, h,inc1h);
        if (init == 0) {
            /* calculations: */
            int i;
            vslConvSetStart(*task, &iy0);
            for (i=0; i

... m > 1, you can use multiple threads for invoking the task execution against different data sequences. For such cases, use task copy routines to create m copies of the task object before the calculations stage and then run these copies with different threads. Ensure that you make all necessary parameter adjustments for the task (using Task Editors) before copying it.
The sample code in this case may look as follows:

    if (init == 0) {
        int i, status, ss[M];
        VSLConvTaskPtr tasks[M];
        /* assume that M is big enough */
        . . .
        vslConvSetStart(*task, &iy0);
        . . .
        for (i=0; i

...

dx(n) = i_n - 1 if xstride(n) > 0, or dx(n) = i_n - xshape(n) if xstride(n) < 0
dy(n) = j_n - 1 if ystride(n) > 0, or dy(n) = j_n - yshape(n) if ystride(n) < 0
dz(n) = k_n - 1 if zstride(n) > 0, or dz(n) = k_n - zshape(n) if zstride(n) < 0

The definitions of indices e, f, and g assume that indices for arrays x, y, and z start from one:

x(e) is defined for e = 1,...,length(x)
y(f) is defined for f = 1,...,length(y)
z(g) is defined for g = 1,...,length(z)

Below is a detailed explanation of how elements of the multidimensional output vector are stored in the array z for one-dimensional and two-dimensional cases.

One-dimensional case. If dims=1, then zshape is the number of the output values to be stored in the array z. The actual length of array z may be greater than zshape elements.

If zstride > 1, output values are stored with the stride: output(1) is stored to z(1), output(2) is stored to z(1+zstride), and so on. Hence, the actual length of z must be at least 1+zstride*(zshape-1) elements.

If zstride < 0, it still defines the stride between elements of array z. However, the order of the used elements is the opposite. For the k-th output value, output(k) is stored in z(1+|zstride|*(zshape-k)), where |zstride| is the absolute value of zstride. The actual length of the array z must be at least 1+|zstride|*(zshape-1) elements.

Two-dimensional case. If dims=2, the output data is a two-dimensional matrix. The value zstride(1) defines the stride inside matrix columns, that is, the stride between output(k1,k2) and output(k1+1,k2) for every pair of indices k1, k2. On the other hand, zstride(2) defines the stride between columns, that is, the stride between output(k1,k2) and output(k1,k2+1).

If zstride(2) is greater than zshape(1), this causes sparse allocation of columns.
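The storage rules described so far can be sketched as a small index calculation. This is an illustrative sketch in plain C rather than library code: output_offset and required_length_1d are hypothetical helper names, indices are shifted to the zero-based C convention, and the sketch assumes that the per-dimension displacements dz(n), weighted by zstride(n), simply add up, as the one-dimensional and two-dimensional descriptions suggest.

```c
#include <assert.h>

/* Zero-based offset in z of the output element with zero-based indices k[].
 * Mirrors the rule from the text, shifted to C indexing: along dimension n
 * the displacement is k[n] for a positive stride and k[n] - (zshape[n] - 1)
 * for a negative one. Summing stride * displacement is an assumption. */
static int output_offset(int dims, const int k[], const int zshape[],
                         const int zstride[])
{
    int offset = 0;
    for (int n = 0; n < dims; n++) {
        assert(zstride[n] != 0);
        assert(k[n] >= 0 && k[n] < zshape[n]);
        int d = zstride[n] > 0 ? k[n] : k[n] - (zshape[n] - 1);
        offset += zstride[n] * d;
    }
    return offset;
}

/* Minimum length of z in the one-dimensional case:
 * 1 + |zstride| * (zshape - 1). */
static int required_length_1d(int zshape, int zstride)
{
    assert(zshape >= 1 && zstride != 0);
    int s = zstride > 0 ? zstride : -zstride;
    return 1 + s * (zshape - 1);
}
```

With zshape = 4 and zstride = 2 the four output values land at offsets 0, 2, 4, and 6, so z needs at least 7 elements; with zstride = -2 the same locations are used in the opposite order. For a 2-by-3 output, zstride = (1, 4) leaves two unused elements after each column, which is the sparse column allocation mentioned above.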
If the value of zstride(2) is smaller than zshape(1), this may result in the transposition of the output matrix. For example, if zshape = (2,3), you can define zstride = (3,1) to allocate output values like a transposed matrix of the shape 3x2.

Whether zstride assumes this kind of transformation or not, you need to ensure that different elements output(k1,...,kdims) will be stored in different locations z(g).

VSL Summary Statistics

The VSL Summary Statistics domain comprises a set of routines that compute basic statistical estimates for single and double precision multi-dimensional datasets. See the definition of the supported operations in the Mathematical Notation and Definitions section.

The VSL Summary Statistics routines calculate:

• raw and central moments up to the fourth order
• skewness and excess kurtosis (further referred to as kurtosis for brevity)
• variation coefficient
• quantiles and order statistics
• minimum and maximum
• variance-covariance/correlation matrix
• pooled/group variance-covariance matrix and mean
• partial variance-covariance/correlation matrix
• robust estimators for variance-covariance matrix and mean in presence of outliers

The library also contains functions to perform the following tasks:

• Detect outliers in datasets
• Support missing values in datasets
• Parameterize correlation matrices
• Compute quantiles for streaming data

You can access the VSL Summary Statistics routines through the Fortran 90 and C89 language interfaces. You can also use the C89 interface with later versions of C/C++, or the Fortran 90 interface with programs written in Fortran 95.

For users of the C/C++ and Fortran languages, Intel MKL provides the mkl_vsl.h, mkl_vsl.f90, and mkl_vsl.f77 header files. All the header files are in the directory

    ${MKL}/include

See more details about the Fortran header in the Random Number Generators section of this chapter.
You can find examples that demonstrate calculation of the VSL Summary Statistics estimates in the following directories:

    ${MKL}/examples/vslc
    ${MKL}/examples/vslf

The VSL Summary Statistics API is implemented through task objects, or tasks. A task object is a data structure, or a descriptor, holding parameters that determine a specific VSL Summary Statistics operation. For example, such parameters may be precision, dimensions of user data, the matrix of the observations, or shapes of data arrays.

All the VSL Summary Statistics routines process a task object as follows:
1. Create a task.
2. Modify settings of the task parameters.
3. Compute statistical estimates.
4. Destroy the task.

The VSL Summary Statistics functions fall into the following categories:
• Task Constructors - routines that create a new task object descriptor and set up most common parameters (dimension, number of observations, and matrix of the observations).
• Task Editors - routines that can set or modify some parameter settings in the existing task descriptor.
• Task Computation Routine - a routine that computes specified statistical estimates.
• Task Destructor - a routine that deletes the task object and frees the memory.

A VSL Summary Statistics task object contains a series of pointers to the input and output data arrays. You can read and modify the datasets and estimates at any time, but you must allocate and release the memory for such data yourself.

See detailed information on the algorithms, API, and their usage in the Intel® MKL Summary Statistics Library Application Notes on the Intel® MKL web page.

Naming Conventions

The names of the Fortran routines in the VSL Summary Statistics are in lowercase (vslssseditquantiles), while the names of types and constants are in uppercase. The names are not case-sensitive.

In C, the names of the routines, types, and constants are case-sensitive and can be lowercase and uppercase (vslsSSEditQuantiles).
The names of routines have the following structure:

    vsl[datatype]SS<name> for the C interface
    vsl[datatype]ss<name> for the Fortran interface

where
• vsl is a prefix indicating that the routine belongs to the Vector Statistical Library of Intel MKL.
• [datatype] specifies the type of the input and/or output data and can be s (single precision real type), d (double precision real type), or i (integer type).
• SS/ss indicates that the routine is intended for calculations of the VSL Summary Statistics estimates.
• <name> specifies a particular functionality that the routine is designed for, for example, NewTask, Compute, DeleteTask.

NOTE The VSL Summary Statistics routine vslSSDeleteTask for deletion of the task is independent of the data type and its name omits the [datatype] field.

Data Types

The VSL Summary Statistics routines use the following data types for the calculations:

Type: Fortran 90: TYPE(VSL_SS_TASK); C: VSLSSTaskPtr
    Data Object: Pointer to a VSL Summary Statistics task
Type: Fortran 90: REAL(KIND=4); C: float
    Data Object: Input/output user data in single precision
Type: Fortran 90: REAL(KIND=8); C: double
    Data Object: Input/output user data in double precision
Type: Fortran 90: INTEGER or INTEGER(KIND=8); C: MKL_INT or long long
    Data Object: Other data

NOTE The actual size of the generic integer type is platform-specific and can be 32 or 64 bits in length. Before you compile your application, set an appropriate size for integers. See details in the 'Using the ILP64 Interface vs. LP64 Interface' section of the Intel® MKL User's Guide.

Parameters

The basic parameters in the task descriptor (addresses of dimensions, number of observations, and datasets) are assigned values when the task constructor creates the task object or when the task editors modify it. Other parameters are determined by the specific task and changed by the task editors.

Task Status and Error Reporting

The task status is an integer value, which is zero if no error is detected, or a specific non-zero error code otherwise.
Negative status values indicate errors, and positive values indicate warnings. An error can be caused by invalid parameter values or a memory allocation failure.

The status codes have symbolic names defined in the respective header files. For the C/C++ interface, these names are defined as macros via the #define statements, and for the Fortran interface as integer constants via the PARAMETER operators.

If no error is detected, the function returns the VSL_STATUS_OK code, which is defined as zero:

    C/C++: #define VSL_STATUS_OK 0
    F90/F95: INTEGER, PARAMETER::VSL_STATUS_OK = 0

In the case of an error, the function returns a non-zero error code that specifies the origin of the failure. The header files for both C/C++ and Fortran languages define the following VSL Summary Statistics status codes:

VSL Summary Statistics Status Codes

Status Code
    Description
VSL_STATUS_OK
    Operation is successfully completed.
VSL_SS_ERROR_ALLOCATION_FAILURE
    Memory allocation has failed.
VSL_SS_ERROR_BAD_DIMEN
    Dimension value is invalid.
VSL_SS_ERROR_BAD_OBSERV_N
    Invalid number (zero or negative) of observations was obtained.
VSL_SS_ERROR_STORAGE_NOT_SUPPORTED
    Storage format is not supported.
VSL_SS_ERROR_BAD_INDC_ADDR
    Array of indices is not defined.
VSL_SS_ERROR_BAD_WEIGHTS
    Array of weights contains negative values.
VSL_SS_ERROR_BAD_MEAN_ADDR
    Array of means is not defined.
VSL_SS_ERROR_BAD_2R_MOM_ADDR
    Array of the second order raw moments is not defined.
VSL_SS_ERROR_BAD_3R_MOM_ADDR
    Array of the third order raw moments is not defined.
VSL_SS_ERROR_BAD_4R_MOM_ADDR
    Array of the fourth order raw moments is not defined.
VSL_SS_ERROR_BAD_2C_MOM_ADDR
    Array of the second order central moments is not defined.
VSL_SS_ERROR_BAD_3C_MOM_ADDR
    Array of the third order central moments is not defined.
VSL_SS_ERROR_BAD_4C_MOM_ADDR
    Array of the fourth order central moments is not defined.
VSL_SS_ERROR_BAD_KURTOSIS_ADDR
    Array of kurtosis values is not defined.
VSL_SS_ERROR_BAD_SKEWNESS_ADDR
    Array of skewness values is not defined.
VSL_SS_ERROR_BAD_MIN_ADDR
    Array of minimum values is not defined.
VSL_SS_ERROR_BAD_MAX_ADDR
    Array of maximum values is not defined.
VSL_SS_ERROR_BAD_VARIATION_ADDR
    Array of variation coefficients is not defined.
VSL_SS_ERROR_BAD_COV_ADDR
    Covariance matrix is not defined.
VSL_SS_ERROR_BAD_COR_ADDR
    Correlation matrix is not defined.
VSL_SS_ERROR_BAD_QUANT_ORDER_ADDR
    Array of quantile orders is not defined.
VSL_SS_ERROR_BAD_QUANT_ORDER
    Quantile order value is invalid.
VSL_SS_ERROR_BAD_QUANT_ADDR
    Array of quantiles is not defined.
VSL_SS_ERROR_BAD_ORDER_STATS_ADDR
    Array of order statistics is not defined.
VSL_SS_ERROR_MOMORDER_NOT_SUPPORTED
    Moment of requested order is not supported.
VSL_SS_NOT_FULL_RANK_MATRIX
    Correlation matrix is not of full rank.
VSL_SS_ERROR_ALL_OBSERVS_OUTLIERS
    All observations are outliers. (At least one observation must not be an outlier.)
VSL_SS_ERROR_BAD_ROBUST_COV_ADDR
    Robust covariance matrix is not defined.
VSL_SS_ERROR_BAD_ROBUST_MEAN_ADDR
    Array of robust means is not defined.
VSL_SS_ERROR_METHOD_NOT_SUPPORTED
    Requested method is not supported.
VSL_SS_ERROR_NULL_TASK_DESCRIPTOR
    Task descriptor is null.
VSL_SS_ERROR_BAD_OBSERV_ADDR
    Dataset matrix is not defined.
VSL_SS_ERROR_BAD_ACCUM_WEIGHT_ADDR
    Pointer to the variable that holds the value of accumulated weight is not defined.
VSL_SS_ERROR_SINGULAR_COV
    Covariance matrix is singular.
VSL_SS_ERROR_BAD_POOLED_COV_ADDR
    Pooled covariance matrix is not defined.
VSL_SS_ERROR_BAD_POOLED_MEAN_ADDR
    Array of pooled means is not defined.
VSL_SS_ERROR_BAD_GROUP_COV_ADDR
    Group covariance matrix is not defined.
VSL_SS_ERROR_BAD_GROUP_MEAN_ADDR
    Array of group means is not defined.
VSL_SS_ERROR_BAD_GROUP_INDC_ADDR
    Array of group indices is not defined.
VSL_SS_ERROR_BAD_GROUP_INDC
    Group indices have improper values.
VSL_SS_ERROR_BAD_OUTLIERS_PARAMS_ADDR
    Array of parameters for the outlier detection algorithm is not defined.
VSL_SS_ERROR_BAD_OUTLIERS_PARAMS_N_ADDR
    Pointer to size of the parameter array for the outlier detection algorithm is not defined.
VSL_SS_ERROR_BAD_OUTLIERS_WEIGHTS_ADDR
    Output of the outlier detection algorithm is not defined.
VSL_SS_ERROR_BAD_ROBUST_COV_PARAMS_ADDR
    Array of parameters of the robust covariance estimation algorithm is not defined.
VSL_SS_ERROR_BAD_ROBUST_COV_PARAMS_N_ADDR
    Pointer to the number of parameters of the algorithm for robust covariance is not defined.
VSL_SS_ERROR_BAD_STORAGE_ADDR
    Pointer to the variable that holds the storage format is not defined.
VSL_SS_ERROR_BAD_PARTIAL_COV_IDX_ADDR
    Array that encodes sub-components of a random vector for the partial covariance algorithm is not defined.
VSL_SS_ERROR_BAD_PARTIAL_COV_IDX
    Array that encodes sub-components of a random vector for partial covariance has improper values.
VSL_SS_ERROR_BAD_PARTIAL_COV_ADDR
    Partial covariance matrix is not defined.
VSL_SS_ERROR_BAD_PARTIAL_COR_ADDR
    Partial correlation matrix is not defined.
VSL_SS_ERROR_BAD_MI_PARAMS_ADDR
    Array of parameters for the Multiple Imputation method is not defined.
VSL_SS_ERROR_BAD_MI_PARAMS_N_ADDR
    Pointer to number of parameters for the Multiple Imputation method is not defined.
VSL_SS_ERROR_BAD_MI_BAD_PARAMS_N
    Size of the parameter array of the Multiple Imputation method is invalid.
VSL_SS_ERROR_BAD_MI_PARAMS
    Parameters of the Multiple Imputation method are invalid.
VSL_SS_ERROR_BAD_MI_INIT_ESTIMATES_N_ADDR
    Pointer to the number of initial estimates in the Multiple Imputation method is not defined.
VSL_SS_ERROR_BAD_MI_INIT_ESTIMATES_ADDR
    Array of initial estimates for the Multiple Imputation method is not defined.
VSL_SS_ERROR_BAD_MI_SIMUL_VALS_ADDR
    Array of simulated missing values in the Multiple Imputation method is not defined.
VSL_SS_ERROR_BAD_MI_SIMUL_VALS_N_ADDR
    Pointer to the size of the array of simulated missing values in the Multiple Imputation method is not defined.
VSL_SS_ERROR_BAD_MI_ESTIMATES_N_ADDR
    Pointer to the number of parameter estimates in the Multiple Imputation method is not defined.
VSL_SS_ERROR_BAD_MI_ESTIMATES_ADDR
    Array of parameter estimates in the Multiple Imputation method is not defined.
VSL_SS_ERROR_BAD_MI_SIMUL_VALS_N
    Invalid size of the array of simulated values in the Multiple Imputation method.
VSL_SS_ERROR_BAD_MI_ESTIMATES_N
    Invalid size of an array to hold parameter estimates obtained using the Multiple Imputation method.
VSL_SS_ERROR_BAD_MI_OUTPUT_PARAMS
    Array of output parameters in the Multiple Imputation method is not defined.
VSL_SS_ERROR_BAD_MI_PRIOR_N_ADDR
    Pointer to the number of prior parameters is not defined.
VSL_SS_ERROR_BAD_MI_PRIOR_ADDR
    Array of prior parameters is not defined.
VSL_SS_ERROR_BAD_MI_MISSING_VALS_N
    Invalid number of missing values was obtained.
VSL_SS_SEMIDEFINITE_COR
    Correlation matrix passed into the parameterization function is semi-definite.
VSL_SS_ERROR_BAD_PARAMTR_COR_ADDR
    Correlation matrix to be parameterized is not defined.
VSL_SS_ERROR_BAD_COR
    All eigenvalues of the correlation matrix to be parameterized are non-positive.
VSL_SS_ERROR_BAD_STREAM_QUANT_PARAMS_N_ADDR
    Pointer to the number of parameters for the quantile computation algorithm for streaming data is not defined.
VSL_SS_ERROR_BAD_STREAM_QUANT_PARAMS_ADDR
    Array of parameters of the quantile computation algorithm for streaming data is not defined.
VSL_SS_ERROR_BAD_STREAM_QUANT_PARAMS_N
    Invalid number of parameters of the quantile computation algorithm for streaming data has been obtained.
VSL_SS_ERROR_BAD_STREAM_QUANT_PARAMS
    Invalid parameters of the quantile computation algorithm for streaming data have been passed.
VSL_SS_ERROR_BAD_STREAM_QUANT_ORDER_ADDR
    Array of the quantile orders for streaming data is not defined.
VSL_SS_ERROR_BAD_STREAM_QUANT_ORDER
    Invalid quantile order for streaming data is defined.
VSL_SS_ERROR_BAD_STREAM_QUANT_ADDR
    Array of quantiles for streaming data is not defined.

Routines for robust covariance estimation, outlier detection, partial covariance estimation, multiple imputation, and parameterization of a correlation matrix can return internal error codes that are related to a specific implementation. Such error codes indicate invalid input data or other bugs in the Intel MKL routines other than the VSL Summary Statistics routines.

Task Constructors

Task constructors are routines intended for creating a new task descriptor and setting up basic parameters.

NOTE If the constructor fails to create a task descriptor, it returns the NULL task pointer.

vslSSNewTask

Creates and initializes a new summary statistics task descriptor.

Syntax

Fortran:
    status = vslsssnewtask(task, p, n, xstorage, x, w, indices)
    status = vsldssnewtask(task, p, n, xstorage, x, w, indices)
C:
    status = vslsSSNewTask(&task, p, n, xstorage, x, w, indices);
    status = vsldSSNewTask(&task, p, n, xstorage, x, w, indices);

Include Files
• Fortran 90: mkl_vsl.f90
• C: mkl_vsl_functions.h

Input Parameters

Name: p
    Type: Fortran: INTEGER; C: MKL_INT
    Description: Dimension of the task, number of variables
Name: n
    Type: Fortran: INTEGER; C: MKL_INT
    Description: Number of observations
Name: xstorage
    Type: Fortran: INTEGER; C: MKL_INT
    Description: Storage format of matrix of observations
Name: x
    Type: Fortran: REAL(KIND=4) DIMENSION(*) for vslsssnewtask, REAL(KIND=8) DIMENSION(*) for vsldssnewtask; C: float* for vslsSSNewTask, double* for vsldSSNewTask
    Description: Matrix of observations
Name: w
    Type: Fortran: REAL(KIND=4) DIMENSION(*) for vslsssnewtask, REAL(KIND=8) DIMENSION(*) for vsldssnewtask; C: float* for vslsSSNewTask, double* for vsldSSNewTask
    Description: Array of weights of size n. Elements of the array are non-negative numbers. If a NULL pointer is passed, each observation is assigned weight equal to 1.
Name: indices
    Type: Fortran: INTEGER, DIMENSION(*); C: MKL_INT*
    Description: Array of vector components that will be processed. Size of array is p. If a NULL pointer is passed, all components of random vector are processed.

Output Parameters

Name: task
    Type: Fortran: TYPE(VSL_SS_TASK); C: VSLSSTaskPtr*
    Description: Descriptor of the task
Name: status
    Type: Fortran: INTEGER; C: int
    Description: Set to VSL_STATUS_OK if the task is created successfully, otherwise a non-zero error code is returned.

Description

Each vslSSNewTask constructor routine creates a new summary statistics task descriptor with the user-specified value for a required parameter, dimension of the task. The optional parameters (matrix of observations, its storage format, number of observations, weights of observations, and indices of the random vector components) are set to their default values.

The observations of random p-dimensional vector ξ = (ξ1, ..., ξi, ..., ξp), which are n vectors of dimension p, are passed as a one-dimensional array x. The parameter xstorage defines the storage format of the observations and takes one of the possible values listed in Table "Storage format of matrix of observations and order statistics".

NOTE Since matrices in Fortran are stored by columns while in C they are stored by rows, initialization of the xstorage variable in Fortran is opposite to that in C. Set xstorage to VSL_SS_MATRIX_STORAGE_COLS, if the dataset is stored as a two-dimensional matrix that consists of p rows and n columns; otherwise, use the VSL_SS_MATRIX_STORAGE_ROWS constant.

Storage format of matrix of observations and order statistics

Parameter
    Description
VSL_SS_MATRIX_STORAGE_ROWS
    The observations of random vector ξ are packed by rows: n data points for the vector component ξ1 come first, n data points for the vector component ξ2 come second, and so forth.
VSL_SS_MATRIX_STORAGE_COLS
    The observations of random vector ξ are packed by columns: the first p-dimensional observation of the vector ξ comes first, the second p-dimensional observation of the vector comes second, and so forth.

A one-dimensional array w of size n contains non-negative weights assigned to the observations. You can pass a NULL array into the constructor. In this case, each observation is assigned the default value of the weight.

You can choose vector components for which you wish to compute statistical estimates. If an element of the vector indices of size p contains 0, the observations that correspond to this component are excluded from the calculations. If you pass the NULL value of the parameter into the constructor, statistical estimates for all random variables are computed.

If the constructor fails to create a task descriptor, it returns the NULL task pointer.

Task Editors

Task editors are intended to set up or change the task parameters listed in Table "Parameters of VSL Summary Statistics Task to Be Initialized or Modified". As an example, to compute the sample mean for a one-dimensional dataset, initialize a variable for the mean value, and pass its address into the task as shown in the example below:

    #include "mkl_vsl.h"

    #define DIM 1
    #define N 1000

    int main()
    {
        VSLSSTaskPtr task;
        double x[N];      /* fill x with N observations before computing */
        double mean;
        MKL_INT p, n, xstorage;
        int status;

        /* initialize variables used in the computations of sample mean */
        p = DIM;
        n = N;
        xstorage = VSL_SS_MATRIX_STORAGE_ROWS;
        mean = 0.0;

        /* create task */
        status = vsldSSNewTask( &task, &p, &n, &xstorage, x, 0, 0 );
        if (status != VSL_STATUS_OK) return 1;

        /* initialize task parameters */
        status = vsldSSEditTask( task, VSL_SS_ED_MEAN, &mean );

        /* compute mean using SS fast method */
        status = vsldSSCompute( task, VSL_SS_MEAN, VSL_SS_METHOD_FAST );

        /* deallocate task resources */
        status = vslSSDeleteTask( &task );

        return 0;
    }

Use the single (vslsssedittask) or double (vsldssedittask) version of an editor to initialize task parameters of single or double precision, respectively.
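The two storage formats described for vslSSNewTask determine where a given observation sits in the one-dimensional array x. The sketch below illustrates that layout in plain C under stated assumptions: it does not call Intel MKL, the enum storage_t and the function obs_offset are hypothetical names, and the enum values are stand-ins rather than the real VSL_SS_MATRIX_STORAGE_ROWS and VSL_SS_MATRIX_STORAGE_COLS constants from the header files. Offsets are zero-based, as is usual in C.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for VSL_SS_MATRIX_STORAGE_ROWS / _COLS;
   the real constants are defined in the MKL header files. */
typedef enum { STORAGE_ROWS, STORAGE_COLS } storage_t;

/* Zero-based offset in x of observation j (0..n-1)
   of vector component i (0..p-1). */
size_t obs_offset(storage_t storage, size_t p, size_t n, size_t i, size_t j)
{
    assert(i < p && j < n);
    if (storage == STORAGE_ROWS)
        return i * n + j;  /* n points of component 0, then component 1, ... */
    return j * p + i;      /* p components of observation 0, then observation 1, ... */
}
```

For p = 1, as in the example above, both formats give the same offset j, which is why the choice of xstorage does not change the layout of a one-dimensional dataset.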
Use an integer version of an editor (vslissedittask) to initialize parameters of the integer type.

Table "VSL Summary Statistics Task Editors" lists the task editors for VSL Summary Statistics. Each of them initializes and/or modifies a respective group of related parameters.

VSL Summary Statistics Task Editors

Editor
    Description
vslSSEditTask
    Changes a pointer in the task descriptor.
vslSSEditMoments
    Changes pointers to arrays associated with raw and central moments.
vslSSEditCovCor
    Changes pointers to arrays associated with covariance and/or correlation matrices.
vslSSEditPartialCovCor
    Changes pointers to arrays associated with partial covariance and/or correlation matrices.
vslSSEditQuantiles
    Changes pointers to arrays associated with quantile/order statistics calculations.
vslSSEditStreamQuantiles
    Changes pointers to arrays for quantile related calculations for streaming data.
vslSSEditPooledCovariance
    Changes pointers to arrays associated with algorithms related to a pooled covariance matrix.
vslSSEditRobustCovariance
    Changes pointers to arrays for robust estimation of a covariance matrix and mean.
vslSSEditOutliersDetection
    Changes pointers to arrays for detection of outliers.
vslSSEditMissingValues
    Changes pointers to arrays associated with the method of supporting missing values in a dataset.
vslSSEditCorParameterization
    Changes pointers to arrays associated with the algorithm for parameterization of a correlation matrix.

NOTE You can use the NULL task pointer in calls to editor routines. In this case, the routine is terminated and no system crash occurs.

vslSSEditTask

Modifies address of an input/output parameter in the task descriptor.
Syntax

Fortran:
    status = vslsssedittask(task, parameter, par_addr)
    status = vsldssedittask(task, parameter, par_addr)
    status = vslissedittask(task, parameter, par_addr)
C:
    status = vslsSSEditTask(task, parameter, par_addr);
    status = vsldSSEditTask(task, parameter, par_addr);
    status = vsliSSEditTask(task, parameter, par_addr);

Include Files
• Fortran 90: mkl_vsl.f90
• C: mkl_vsl_functions.h

Input Parameters

Name: task
    Type: Fortran: TYPE(VSL_SS_TASK); C: VSLSSTaskPtr
    Description: Descriptor of the task
Name: parameter
    Type: Fortran: INTEGER; C: MKL_INT
    Description: Parameter to change
Name: par_addr
    Type: Fortran: REAL(KIND=4) DIMENSION(*) for vslsssedittask, REAL(KIND=8) DIMENSION(*) for vsldssedittask, INTEGER DIMENSION(*) for vslissedittask; C: float* for vslsSSEditTask, double* for vsldSSEditTask, MKL_INT* for vsliSSEditTask
    Description: Address of the new parameter

Output Parameters

Name: status
    Type: Fortran: INTEGER; C: int
    Description: Current status of the task

Description

The vslSSEditTask routine replaces the pointer to the parameter stored in the VSL Summary Statistics task descriptor with the par_addr pointer. If you pass the NULL pointer to the editor, no changes take place in the task and a corresponding error code is returned. See Table "Parameters of VSL Summary Statistics Task to Be Initialized or Modified" for the predefined values of the parameter.

Use the single (vslsssedittask) or double (vsldssedittask) version of the editor to initialize task parameters of single or double precision, respectively. Use an integer version of the editor (vslissedittask) to initialize parameters of the integer type.

Parameters of VSL Summary Statistics Task to Be Initialized or Modified

Each entry below gives the parameter value, its type (i, d, or s), its purpose, and its initialization.

VSL_SS_ED_DIMEN
    Type: i
    Purpose: Address of a variable that holds the task dimension
    Initialization: Required. Positive integer value.
VSL_SS_ED_OBSERV_N
    Type: i
    Purpose: Address of a variable that holds the number of observations
    Initialization: Required.
    Positive integer value.
VSL_SS_ED_OBSERV
    Type: d, s
    Purpose: Address of the observation matrix
    Initialization: Required. Provide the matrix containing your observations.
VSL_SS_ED_OBSERV_STORAGE
    Type: i
    Purpose: Address of a variable that holds the storage format for the observation matrix
    Initialization: Required. Provide a storage format supported by the library whenever you pass a matrix of observations. [1]
VSL_SS_ED_INDC
    Type: i
    Purpose: Address of the array of indices
    Initialization: Optional. Provide this array if you need to process individual components of the random vector. Set entry i of the array to one to include the ith coordinate in the analysis. Set entry i of the array to zero to exclude the ith coordinate from the analysis.
VSL_SS_ED_WEIGHTS
    Type: d, s
    Purpose: Address of the array of observation weights
    Initialization: Optional. If the observations have weights different from the default weight (one), set entries of the array to non-negative floating point values.
VSL_SS_ED_MEAN
    Type: d, s
    Purpose: Address of the array of means
    Initialization: Optional. Set entries of the array to meaningful values (typically zero) if you intend to compute a progressive estimate. Otherwise, do not initialize the array.
VSL_SS_ED_2R_MOM
    Type: d, s
    Purpose: Address of an array of raw moments of the second order
    Initialization: Optional. Set entries of the array to meaningful values (typically zero) if you intend to compute a progressive estimate. Otherwise, do not initialize the array.
VSL_SS_ED_3R_MOM
    Type: d, s
    Purpose: Address of an array of raw moments of the third order
    Initialization: Optional. Set entries of the array to meaningful values (typically zero) if you intend to compute a progressive estimate. Otherwise, do not initialize the array.
VSL_SS_ED_4R_MOM
    Type: d, s
    Purpose: Address of an array of raw moments of the fourth order
    Initialization: Optional. Set entries of the array to meaningful values (typically zero) if you intend to compute a progressive estimate. Otherwise, do not initialize the array.
VSL_SS_ED_2C_MOM
    Type: d, s
    Purpose: Address of an array of central moments of the second order
    Initialization: Optional.
    Set entries of the array to meaningful values (typically zero) if you intend to compute a progressive estimate. Otherwise, do not initialize the array. Make sure you also provide arrays for raw moments of the first and second order.
VSL_SS_ED_3C_MOM
    Type: d, s
    Purpose: Address of an array of central moments of the third order
    Initialization: Optional. Set entries of the array to meaningful values (typically zero) if you intend to compute a progressive estimate. Otherwise, do not initialize the array. Make sure you also provide arrays for raw moments of the first, second, and third order.
VSL_SS_ED_4C_MOM
    Type: d, s
    Purpose: Address of an array of central moments of the fourth order
    Initialization: Optional. Set entries of the array to meaningful values (typically zero) if you intend to compute a progressive estimate. Otherwise, do not initialize the array. Make sure you also provide arrays for raw moments of the first, second, third, and fourth order.
VSL_SS_ED_KURTOSIS
    Type: d, s
    Purpose: Address of the array of kurtosis estimates
    Initialization: Optional. Set entries of the array to meaningful values (typically zero) if you intend to compute a progressive estimate. Otherwise, do not initialize the array. Make sure you also provide arrays for raw moments of the first, second, third, and fourth order.
VSL_SS_ED_SKEWNESS
    Type: d, s
    Purpose: Address of the array of skewness estimates
    Initialization: Optional. Set entries of the array to meaningful values (typically zero) if you intend to compute a progressive estimate. Otherwise, do not initialize the array. Make sure you also provide arrays for raw moments of the first, second, and third order.
VSL_SS_ED_MIN
    Type: d, s
    Purpose: Address of the array of minimum estimates
    Initialization: Optional. Set entries of the array to meaningful values, such as the values of the first observation.
VSL_SS_ED_MAX
    Type: d, s
    Purpose: Address of the array of maximum estimates
    Initialization: Optional. Set entries of the array to meaningful values, such as the values of the first observation.
VSL_SS_ED_VARIATION
    Type: d, s
    Purpose: Address of the array of variation coefficients
    Initialization: Optional. Set entries of the array to meaningful values (typically zero) if you intend to compute a progressive estimate. Otherwise, do not initialize the array. Make sure you also provide arrays for raw moments of the first and second order.
VSL_SS_ED_COV
    Type: d, s
    Purpose: Address of a covariance matrix
    Initialization: Optional. Set entries of the array to meaningful values (typically zero) if you intend to compute a progressive estimate. Make sure you also provide an array for the mean.
VSL_SS_ED_COV_STORAGE
    Type: i
    Purpose: Address of the variable that holds the storage format for a covariance matrix
    Initialization: Required. Provide a storage format supported by the library whenever you intend to compute the covariance matrix. [2]
VSL_SS_ED_COR
    Type: d, s
    Purpose: Address of a correlation matrix
    Initialization: Optional. Set entries of the array to meaningful values (typically zero) if you intend to compute a progressive estimate. If you initialize the matrix in a non-trivial way, make sure that the main diagonal contains variance values. Also, provide an array for the mean.
VSL_SS_ED_COR_STORAGE
    Type: i
    Purpose: Address of the variable that holds the storage format for a correlation matrix
    Initialization: Required. Provide a storage format supported by the library whenever you intend to compute the correlation matrix. [2]
VSL_SS_ED_ACCUM_WEIGHT
    Type: d, s
    Purpose: Address of the array of size 2 that holds the accumulated weight (sum of weights) in the first position and the sum of weights squared in the second position
    Initialization: Optional. Set the entries of the array to meaningful values (typically zero) if you intend to do progressive processing of the dataset or need the sum of weights and sum of squared weights assigned to observations.
VSL_SS_ED_QUANT_ORDER_N
    Type: i
    Purpose: Address of the variable that holds the number of quantile orders
    Initialization: Required. Positive integer value. Provide the number of quantile orders whenever you compute quantiles.
VSL_SS_ED_QUANT_ORDER
    Type: d, s
    Purpose: Address of the array of quantile orders
    Initialization: Required. Set entries of the array to values from the interval (0,1). Provide this parameter whenever you compute quantiles.
VSL_SS_ED_QUANT_QUANTILES
    Type: d, s
    Purpose: Address of the array of quantiles
    Initialization: None.
VSL_SS_ED_ORDER_STATS
    Type: d, s
    Purpose: Address of the array of order statistics
    Initialization: None.
VSL_SS_ED_GROUP_INDC
    Type: i
    Purpose: Address of the array of group indices used in computation of a pooled covariance matrix
    Initialization: Required. Set entry i to integer value k if the observation belongs to group k. k takes values in the range [0, g-1], where g is the number of groups.
VSL_SS_ED_POOLED_COV_STORAGE
    Type: i
    Purpose: Address of a variable that holds the storage format for a pooled covariance matrix
    Initialization: Required. Provide a storage format supported by the library whenever you intend to compute pooled covariance. [2]
VSL_SS_ED_POOLED_MEAN
    Type: d, s
    Purpose: Address of an array of pooled means
    Initialization: None.
VSL_SS_ED_POOLED_COV
    Type: d, s
    Purpose: Address of pooled covariance matrices
    Initialization: None.
VSL_SS_ED_GROUP_COV_INDC
    Type: i
    Purpose: Address of an array of indices for which covariance/means should be computed
    Initialization: Optional. Set the kth entry of the array to 1 if you need group covariance and mean for group k; otherwise set it to zero.
VSL_SS_ED_GROUP_MEANS
    Type: d, s
    Purpose: Address of an array of group means
    Initialization: None.
VSL_SS_ED_GROUP_COV_STORAGE
    Type: i
    Purpose: Address of a variable that holds the storage format for a group covariance matrix
    Initialization: Required. Provide a storage format supported by the library whenever you intend to get group covariance. [2]
VSL_SS_ED_GROUP_COV
    Type: d, s
    Purpose: Address of group covariance matrices
    Initialization: None.
VSL_SS_ED_ROBUST_COV_STORAGE
    Type: i
    Purpose: Address of a variable that holds the storage format for a robust covariance matrix
    Initialization: Required. Provide a storage format supported by the library whenever you compute robust covariance. [2]
VSL_SS_ED_ROBUST_COV_PARAMS_N
    Type: i
    Purpose: Address of a variable that holds the number of algorithmic parameters of the method for robust covariance estimation
    Initialization: Required. Set to the number of TBS parameters, VSL_SS_TBS_PARAMS_N.
VSL_SS_ED_ROBUST_COV_PARAMS
    Type: d, s
    Purpose: Address of an array of parameters of the method for robust estimation of a covariance
    Initialization: Required. Set the entries of the array according to the description in EditRobustCovariance.
VSL_SS_ED_ROBUST_MEAN
    Type: d, s
    Purpose: Address of an array of robust means
    Initialization: None.
VSL_SS_ED_ROBUST_COV
    Type: d, s
    Purpose: Address of a robust covariance matrix
    Initialization: None.
VSL_SS_ED_OUTLIERS_PARAMS_N
    Type: i
    Purpose: Address of a variable that holds the number of parameters of the outlier detection method
    Initialization: Required. Set to the number of outlier detection parameters, VSL_SS_BACON_PARAMS_N.
VSL_SS_ED_OUTLIERS_PARAMS
    Type: d, s
    Purpose: Address of an array of algorithmic parameters for the outlier detection method
    Initialization: Required. Set the entries of the array according to the description in EditOutliersDetection.
VSL_SS_ED_OUTLIERS_WEIGHT
    Type: d, s
    Purpose: Address of an array of weights assigned to observations by the outlier detection method
    Initialization: None.
VSL_SS_ED_ORDER_STATS_STORAGE
    Type: i
    Purpose: Address of a variable that holds the storage format of an order statistics matrix
    Initialization: Required. Provide a storage format supported by the library whenever you compute a matrix of order statistics. [1]
VSL_SS_ED_PARTIAL_COV_IDX
    Type: i
    Purpose: Address of an array that encodes subcomponents of a random vector
    Initialization: Required. Set the entries of the array according to the description in EditPartialCovCor.
VSL_SS_ED_PARTIAL_COV
    Type: d, s
    Purpose: Address of a partial covariance matrix
    Initialization: None.
VSL_SS_ED_PARTIAL_COV_STORAGE
    Type: i
    Purpose: Address of a variable that holds the storage format of a partial covariance matrix
    Initialization: Required. Provide a storage format supported by the library whenever you compute the partial covariance. [2]
VSL_SS_ED_PARTIAL_COR
    Type: d, s
    Purpose: Address of a partial correlation matrix
    Initialization: None.
VSL_SS_ED_PARTIAL_COR_STORAGE
    Type: i
    Purpose: Address of a variable that holds the storage format for a partial correlation matrix
    Initialization: Required. Provide a storage format supported by the library whenever you compute the partial correlation. [2]
VSL_SS_ED_MI_PARAMS_N
    Type: i
    Purpose: Address of a variable that holds the number of algorithmic parameters for the Multiple Imputation method
    Initialization: Required. Set to the number of MI parameters, VSL_SS_MI_PARAMS_SIZE.
VSL_SS_ED_MI_PARAMS
    Type: d, s
    Purpose: Address of an array of algorithmic parameters for the Multiple Imputation method
    Initialization: Required. Set entries of the array according to the description in EditMissingValues.
VSL_SS_ED_MI_INIT_ESTIMATES_N
    Type: i
    Purpose: Address of a variable that holds the number of initial estimates for the Multiple Imputation method
    Initialization: Optional. Set to p+p*(p+1)/2, where p is the task dimension.
VSL_SS_ED_MI_INIT_ESTIMATES
    Type: d, s
    Purpose: Address of an array of initial estimates for the Multiple Imputation method
    Initialization: Optional. Set the values of the array according to the description in "Basic Components of the Multiple Imputation Function in Summary Statistics Library" in the Intel® MKL Summary Statistics Library Application Notes document on the Intel® MKL web page.
VSL_SS_ED_MI_SIMUL_VALS_N
    Type: i
    Purpose: Address of a variable that holds the number of simulated values in the Multiple Imputation method
    Initialization: Optional. Positive integer indicating the number of missing points in the observation matrix.
VSL_SS_ED_MI_SIMUL_VALS
    Type: d, s
    Purpose: Address of an array of simulated values in the Multiple Imputation method
    Initialization: None.
VSL_SS_ED_MI_ESTIMATES_N
    Type: i
    Purpose: Address of a variable that holds the number of estimates obtained as a result of the Multiple Imputation method
    Initialization: Optional. Positive integer number defined according to the description in "Basic Components of the Multiple Imputation Function in Summary Statistics Library" in the Intel® MKL Summary Statistics Library Application Notes document on the Intel® MKL web page.
VSL_SS_ED_MI_ESTIMATES d, s Address of an array of estimates obtained as a result of the Multiple Imputation method None. VSL_SS_ED_MI_PRIOR_N i Address of a variable that holds the number of prior parameters for the Multiple Imputation method Optional. If you pass a userdefined array of prior parameters, set this parameter to (p2+3*p+4)/ 2, where p is the task dimension. VSL_SS_ED_MI_PRIOR d, s Address of an array of prior parameters for the Multiple Imputation method Optional. Set entries of the array of prior parameters according to the description in "Basic Components of the Multiple Imputation Function in Summary Statistics Library" in the Intel® MKL Summary Statistics Library Application Notes document on the Intel® MKL web page. Statistical Functions 10 2277 Parameter Value Type Purpose Initialization VSL_SS_ED_PARAMTR_COR d, s Address of a parameterized correlation matrix None. VSL_SS_ED_PARAMTR_COR_ST ORAGE i Address of a variable that holds the storage format of a parameterized correlation matrix Required. Provide a storage format supported by the library whenever you compute the parameterized correlation matrix.2 VSL_SS_ED_STREAM_QUANT_P ARAMS_N i Address of a variable that holds the number of parameters of a quantile computation method for streaming data Required. Set to the number of quantile computation parameters, VSL_SS_SQUANTS_ZW_PARAMS_N. VSL_SS_ED_STREAM_QUANT_P ARAMS d, s Address of an array of parameters of a quantile computation method for streaming data Required. Set the entries of the array according to the description in "Computing Quantiles for Streaming Data" in the Intel® MKL Summary Statistics Library Application Notes document on the Intel® MKL web page. VSL_SS_ED_STREAM_QUANT_O RDER_N i Address of a variable that holds the number of quantile orders for streaming data Required. Positive integer value. VSL_SS_ED_STREAM_QUANT_O RDER d, s Address of an array of quantile orders for streaming data Required. 
Set entries of the array to values from the interval (0,1). Provide this parameter whenever you compute quantiles. VSL_SS_ED_STREAM_QUANT_Q UANTILES d, s Address of an array of quantiles for streaming data None. 1. See Table: "Storage format of matrix of observations and order statistics" for storage formats. 2. See Table: "Storage formats of a variance-covariance/correlation matrix" for storage formats. vslSSEditMoments Modifies the pointers to arrays that hold moment estimates. Syntax Fortran: status = vslssseditmoments(task, mean, r2m, r3m, r4m, c2m, c3m, c4m) status = vsldsseditmoments(task, mean, r2m, r3m, r4m, c2m, c3m, c4m) C: status = vslsSSEditMoments(task, mean, r2m, r3m, r4m, c2m, c3m, c4m); status = vsldSSEditMoments(task, mean, r2m, r3m, r4m, c2m, c3m, c4m); 10 Intel® Math Kernel Library Reference Manual 2278 Include Files • Fortran 90: mkl_vsl.f90 • C: mkl_vsl_functions.h Input Parameters Name Type Description task Fortran: TYPE(VSL_SS_TASK) C: VSLSSTaskPtr Descriptor of the task mean Fortran: REAL(KIND=4) DIMENSION(*) for vslssseditmoments REAL(KIND=8) DIMENSION(*) for vsldsseditmoments C: float* for vslsSSEditMoments double* for vsldSSEditMoments Pointer to the array of means r2m Fortran: REAL(KIND=4) DIMENSION(*) for vslssseditmoments REAL(KIND=8) DIMENSION(*) for vsldsseditmoments C: float* for vslsSSEditMoments double* for vsldSSEditMoments Pointer to the array of raw moments of the 2nd order r3m Fortran: REAL(KIND=4) DIMENSION(*) for vslssseditmoments REAL(KIND=8) DIMENSION(*) for vsldsseditmoments C: float* for vslsSSEditMoments double* for vsldSSEditMoments Pointer to the array of raw moments of the 3rd order r4m Fortran: REAL(KIND=4) DIMENSION(*) for vslssseditmoments REAL(KIND=8) DIMENSION(*) for vsldsseditmoments C: float* for vslsSSEditMoments double* for vsldSSEditMoments Pointer to the array of raw moments of the 4th order c2m Fortran: REAL(KIND=4) DIMENSION(*) for vslssseditmoments REAL(KIND=8) DIMENSION(*) for vsldsseditmoments C: 
float* for vslsSSEditMoments double* for vsldSSEditMoments Pointer to the array of central moments of the 2nd order c3m Fortran: REAL(KIND=4) DIMENSION(*) for vslssseditmoments Pointer to the array of central moments of the 3rd order Statistical Functions 10 2279 Name Type Description REAL(KIND=8) DIMENSION(*) for vsldsseditmoments C: float* for vslsSSEditMoments double* for vsldSSEditMoments c4m Fortran: REAL(KIND=4) DIMENSION(*) for vslssseditmoments REAL(KIND=8) DIMENSION(*) for vsldsseditmoments C: float* for vslsSSEditMoments double* for vsldSSEditMoments Pointer to the array of central moments of the 4th order Output Parameters Name Type Description status Fortran: INTEGER C: int Current status of the task Description The vslSSEditMoments routine replaces pointers to the arrays that hold estimates of raw and central moments with values passed as corresponding parameters of the routine. If an input parameter is NULL, the value of the relevant parameter remains unchanged. vslSSEditCovCor Modifies the pointers to covariance/correlation parameters. 
Syntax

Fortran:
status = vslssseditcovcor(task, mean, cov, cov_storage, cor, cor_storage)
status = vsldsseditcovcor(task, mean, cov, cov_storage, cor, cor_storage)

C:
status = vslsSSEditCovCor(task, mean, cov, cov_storage, cor, cor_storage);
status = vsldSSEditCovCor(task, mean, cov, cov_storage, cor, cor_storage);

Include Files

• Fortran 90: mkl_vsl.f90
• C: mkl_vsl_functions.h

Input Parameters

Name | Type | Description
task | Fortran: TYPE(VSL_SS_TASK); C: VSLSSTaskPtr | Descriptor of the task
mean | Fortran: REAL(KIND=4) DIMENSION(*) for vslssseditcovcor, REAL(KIND=8) DIMENSION(*) for vsldsseditcovcor; C: float* for vslsSSEditCovCor, double* for vsldSSEditCovCor | Pointer to the array of means
cov | Fortran: REAL(KIND=4) DIMENSION(*) for vslssseditcovcor, REAL(KIND=8) DIMENSION(*) for vsldsseditcovcor; C: float* for vslsSSEditCovCor, double* for vsldSSEditCovCor | Pointer to a covariance matrix
cov_storage | Fortran: INTEGER; C: MKL_INT* | Pointer to the storage format of the covariance matrix
cor | Fortran: REAL(KIND=4) DIMENSION(*) for vslssseditcovcor, REAL(KIND=8) DIMENSION(*) for vsldsseditcovcor; C: float* for vslsSSEditCovCor, double* for vsldSSEditCovCor | Pointer to a correlation matrix
cor_storage | Fortran: INTEGER; C: MKL_INT* | Pointer to the storage format of the correlation matrix

Output Parameters

Name | Type | Description
status | Fortran: INTEGER; C: int | Current status of the task

Description

The vslSSEditCovCor routine replaces pointers to the array of means, covariance/correlation arrays, and their storage format with values passed as corresponding parameters of the routine. See Table "Storage formats of a variance-covariance/correlation matrix" for possible values of the cov_storage and cor_storage parameters. If an input parameter is NULL, the old value of the parameter remains unchanged in the VSL Summary Statistics task descriptor.
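The storage formats selected through cov_storage and cor_storage determine where element c(i,j) of a p-by-p symmetric matrix lands in the one-dimensional array cp. The following is a minimal sketch in C of that mapping, not library code: the helper names are hypothetical, the packed formulas are assumed to use 1-based i and j (they only yield distinct positions under that reading), the full-storage formula is assumed to use 0-based i and j, and the matrix dimension is assumed to be p in every formula, as the array size p*(p+1)/2 implies.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helpers for illustration only; they are not part of Intel MKL. */

/* Full storage: 0-based i, j; returns the 0-based offset i*p + j. */
size_t full_offset(size_t p, size_t i, size_t j)
{
    assert(i < p && j < p);
    return i * p + j;
}

/* Lower packed: 1-based i, j with j <= i <= p.
   Returns the 1-based position i + (2p - j)(j - 1)/2. */
size_t lower_packed_pos(size_t p, size_t i, size_t j)
{
    assert(j >= 1 && j <= i && i <= p);
    return i + (2 * p - j) * (j - 1) / 2;
}

/* Upper packed: 1-based i, j with i <= j.
   Returns the 1-based position i + j(j - 1)/2. */
size_t upper_packed_pos(size_t i, size_t j)
{
    assert(i >= 1 && i <= j);
    return i + j * (j - 1) / 2;
}

/* Number of elements in either packed format: p(p + 1)/2. */
size_t packed_size(size_t p)
{
    return p * (p + 1) / 2;
}
```

Because the packed positions are 1-based, subtract 1 before indexing a C array: for p = 3, lower_packed_pos(3, 2, 2) is 4, so c(2,2) would sit at cp[3].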
Storage formats of a variance-covariance/correlation matrix

Parameter | Description
VSL_SS_MATRIX_STORAGE_FULL | A symmetric variance-covariance/correlation matrix is a one-dimensional array with elements c(i,j) stored as cp(i*p + j). The size of the array is p*p.
VSL_SS_MATRIX_STORAGE_L_PACKED | A symmetric variance-covariance/correlation matrix with elements c(i,j) is packed as a one-dimensional array cp(i + (2p - j)*(j - 1)/2) for j ≤ i. The size of the array is p*(p + 1)/2.
VSL_SS_MATRIX_STORAGE_U_PACKED | A symmetric variance-covariance/correlation matrix with elements c(i,j) is packed as a one-dimensional array cp(i + j*(j - 1)/2) for i ≤ j. The size of the array is p*(p + 1)/2.

vslSSEditPartialCovCor

Modifies the pointers to partial covariance/correlation parameters.

Syntax

Fortran:
status = vslssseditpartialcovcor(task, p_idx_array, cov, cov_storage, cor, cor_storage, p_cov, p_cov_storage, p_cor, p_cor_storage)
status = vsldsseditpartialcovcor(task, p_idx_array, cov, cov_storage, cor, cor_storage, p_cov, p_cov_storage, p_cor, p_cor_storage)

C:
status = vslsSSEditPartialCovCor(task, p_idx_array, cov, cov_storage, cor, cor_storage, p_cov, p_cov_storage, p_cor, p_cor_storage);
status = vsldSSEditPartialCovCor(task, p_idx_array, cov, cov_storage, cor, cor_storage, p_cov, p_cov_storage, p_cor, p_cor_storage);

Include Files

• Fortran 90: mkl_vsl.f90
• C: mkl_vsl_functions.h

Input Parameters

Name | Type | Description
task | Fortran: TYPE(VSL_SS_TASK); C: VSLSSTaskPtr | Descriptor of the task
p_idx_array | Fortran: INTEGER; C: MKL_INT* | Pointer to the array that encodes indices of subcomponents Z and Y of the random vector as described in section Mathematical Notation and Definitions. p_idx_array[i] equals -1 if the i-th component of the random vector belongs to Z, and 1 if the i-th component of the random vector belongs to Y.
cov | Fortran: REAL(KIND=4) DIMENSION(*) for vslssseditpartialcovcor, REAL(KIND=8) DIMENSION(*) for vsldsseditpartialcovcor; C: float* for vslsSSEditPartialCovCor, double* for vsldSSEditPartialCovCor | Pointer to a covariance matrix
cov_storage | Fortran: INTEGER; C: MKL_INT* | Pointer to the storage format of the covariance matrix
cor | Fortran: REAL(KIND=4) DIMENSION(*) for vslssseditpartialcovcor, REAL(KIND=8) DIMENSION(*) for vsldsseditpartialcovcor; C: float* for vslsSSEditPartialCovCor, double* for vsldSSEditPartialCovCor | Pointer to a correlation matrix
cor_storage | Fortran: INTEGER; C: MKL_INT* | Pointer to the storage format of the correlation matrix
p_cov | Fortran: REAL(KIND=4) DIMENSION(*) for vslssseditpartialcovcor, REAL(KIND=8) DIMENSION(*) for vsldsseditpartialcovcor; C: float* for vslsSSEditPartialCovCor, double* for vsldSSEditPartialCovCor | Pointer to a partial covariance matrix
p_cov_storage | Fortran: INTEGER; C: MKL_INT* | Pointer to the storage format of the partial covariance matrix
p_cor | Fortran: REAL(KIND=4) DIMENSION(*) for vslssseditpartialcovcor, REAL(KIND=8) DIMENSION(*) for vsldsseditpartialcovcor; C: float* for vslsSSEditPartialCovCor, double* for vsldSSEditPartialCovCor | Pointer to a partial correlation matrix
p_cor_storage | Fortran: INTEGER; C: MKL_INT* | Pointer to the storage format of the partial correlation matrix

Output Parameters

Name | Type | Description
status | Fortran: INTEGER; C: int | Current status of the task

Description

The vslSSEditPartialCovCor routine replaces pointers to covariance/correlation arrays, partial covariance/correlation arrays, and their storage format with values passed as corresponding parameters of the routine. See Table "Storage formats of a variance-covariance/correlation matrix" for possible values of the cov_storage, cor_storage, p_cov_storage, and p_cor_storage parameters.
If an input parameter is NULL, the old value of the parameter remains unchanged in the VSL Summary Statistics task descriptor.

vslSSEditQuantiles

Modifies the pointers to parameters related to quantile computations.

Syntax

Fortran:
status = vslssseditquantiles(task, quant_order_n, quant_order, quants, order_stats, order_stats_storage)
status = vsldsseditquantiles(task, quant_order_n, quant_order, quants, order_stats, order_stats_storage)

C:
status = vslsSSEditQuantiles(task, quant_order_n, quant_order, quants, order_stats, order_stats_storage);
status = vsldSSEditQuantiles(task, quant_order_n, quant_order, quants, order_stats, order_stats_storage);

Include Files

• Fortran 90: mkl_vsl.f90
• C: mkl_vsl_functions.h

Input Parameters

Name | Type | Description
task | Fortran: TYPE(VSL_SS_TASK); C: VSLSSTaskPtr | Descriptor of the task
quant_order_n | Fortran: INTEGER; C: MKL_INT* | Pointer to the number of quantile orders
quant_order | Fortran: REAL(KIND=4) DIMENSION(*) for vslssseditquantiles, REAL(KIND=8) DIMENSION(*) for vsldsseditquantiles; C: float* for vslsSSEditQuantiles, double* for vsldSSEditQuantiles | Pointer to the array of quantile orders
quants | Fortran: REAL(KIND=4) DIMENSION(*) for vslssseditquantiles, REAL(KIND=8) DIMENSION(*) for vsldsseditquantiles; C: float* for vslsSSEditQuantiles, double* for vsldSSEditQuantiles | Pointer to the array of quantiles
order_stats | Fortran: REAL(KIND=4) DIMENSION(*) for vslssseditquantiles, REAL(KIND=8) DIMENSION(*) for vsldsseditquantiles; C: float* for vslsSSEditQuantiles, double* for vsldSSEditQuantiles | Pointer to the array of order statistics
order_stats_storage | Fortran: INTEGER; C: MKL_INT* | Pointer to the storage format of the order statistics array

Output Parameters

Name | Type | Description
status | Fortran: INTEGER; C: int | Current status of the task

Description

The vslSSEditQuantiles routine replaces pointers to the number of quantile orders, the array of quantile orders, the array of quantiles, the array that holds order statistics, and the storage format for the order statistics with values passed into the routine. See Table "Storage format of matrix of observations and order statistics" for possible values of the order_stats_storage parameter. If an input parameter is NULL, the corresponding parameter in the VSL Summary Statistics task descriptor remains unchanged.

vslSSEditStreamQuantiles

Modifies the pointers to parameters related to quantile computations for streaming data.

Syntax

Fortran:
status = vslssseditstreamquantiles(task, quant_order_n, quant_order, quants, nparams, params)
status = vsldsseditstreamquantiles(task, quant_order_n, quant_order, quants, nparams, params)

C:
status = vslsSSEditStreamQuantiles(task, quant_order_n, quant_order, quants, nparams, params);
status = vsldSSEditStreamQuantiles(task, quant_order_n, quant_order, quants, nparams, params);

Include Files

• Fortran 90: mkl_vsl.f90
• C: mkl_vsl_functions.h

Input Parameters

Name | Type | Description
task | Fortran: TYPE(VSL_SS_TASK); C: VSLSSTaskPtr | Descriptor of the task
quant_order_n | Fortran: INTEGER; C: MKL_INT* | Pointer to the number of quantile orders
quant_order | Fortran: REAL(KIND=4) DIMENSION(*) for vslssseditstreamquantiles, REAL(KIND=8) DIMENSION(*) for vsldsseditstreamquantiles; C: float* for vslsSSEditStreamQuantiles, double* for vsldSSEditStreamQuantiles | Pointer to the array of quantile orders
quants | Fortran: REAL(KIND=4) DIMENSION(*) for vslssseditstreamquantiles, REAL(KIND=8) DIMENSION(*) for vsldsseditstreamquantiles; C: float* for vslsSSEditStreamQuantiles, double* for vsldSSEditStreamQuantiles | Pointer to the array of quantiles
nparams | Fortran: INTEGER; C: MKL_INT* | Pointer to the number of the algorithm parameters
params | Fortran: REAL(KIND=4) DIMENSION(*) for vslssseditstreamquantiles, REAL(KIND=8) DIMENSION(*) for vsldsseditstreamquantiles; C: float* for vslsSSEditStreamQuantiles, double* for vsldSSEditStreamQuantiles | Pointer to the array of the algorithm parameters

Output Parameters

Name | Type | Description
status | Fortran: INTEGER; C: int | Current status of the task

Description

The vslSSEditStreamQuantiles routine replaces pointers to the number of quantile orders, the array of quantile orders, the array of quantiles, the number of the algorithm parameters, and the array of the algorithm parameters with values passed into the routine. If an input parameter is NULL, the corresponding parameter in the VSL Summary Statistics task descriptor remains unchanged.

vslSSEditPooledCovariance

Modifies pooled/group covariance matrix array pointers.

Syntax

Fortran:
status = vslssseditpooledcovariance(task, grp_indices, pld_mean, pld_cov, grp_cov_indices, grp_means, grp_cov)
status = vsldsseditpooledcovariance(task, grp_indices, pld_mean, pld_cov, grp_cov_indices, grp_means, grp_cov)

C:
status = vslsSSEditPooledCovariance(task, grp_indices, pld_mean, pld_cov, grp_cov_indices, grp_means, grp_cov);
status = vsldSSEditPooledCovariance(task, grp_indices, pld_mean, pld_cov, grp_cov_indices, grp_means, grp_cov);

Include Files

• Fortran 90: mkl_vsl.f90
• C: mkl_vsl_functions.h

Input Parameters

Name | Type | Description
task | Fortran: TYPE(VSL_SS_TASK); C: VSLSSTaskPtr | Descriptor of the task
grp_indices | Fortran: INTEGER DIMENSION(*); C: MKL_INT* | Pointer to an array of size n. The i-th element of the array contains the number of the group the i-th observation belongs to.
pld_mean | Fortran: REAL(KIND=4) DIMENSION(*) for vslssseditpooledcovariance, REAL(KIND=8) DIMENSION(*) for vsldsseditpooledcovariance; C: float* for vslsSSEditPooledCovariance, double* for vsldSSEditPooledCovariance | Pointer to the array of pooled means
pld_cov | Fortran: REAL(KIND=4) DIMENSION(*) for vslssseditpooledcovariance, REAL(KIND=8) DIMENSION(*) for vsldsseditpooledcovariance; C: float* for vslsSSEditPooledCovariance, double* for vsldSSEditPooledCovariance | Pointer to the array that holds a pooled covariance matrix
grp_cov_indices | Fortran: INTEGER DIMENSION(*); C: MKL_INT* | Pointer to the array that contains indices of group covariance matrices to return
grp_means | Fortran: REAL(KIND=4) DIMENSION(*) for vslssseditpooledcovariance, REAL(KIND=8) DIMENSION(*) for vsldsseditpooledcovariance; C: float* for vslsSSEditPooledCovariance, double* for vsldSSEditPooledCovariance | Pointer to the array of group means
grp_cov | Fortran: REAL(KIND=4) DIMENSION(*) for vslssseditpooledcovariance, REAL(KIND=8) DIMENSION(*) for vsldsseditpooledcovariance; C: float* for vslsSSEditPooledCovariance, double* for vsldSSEditPooledCovariance | Pointer to the array that holds group covariance matrices

Output Parameters

Name | Type | Description
status | Fortran: INTEGER; C: int | Current status of the task

Description

The vslSSEditPooledCovariance routine replaces pointers to the array of group indices, the array of pooled means, the array for a pooled covariance matrix, the array of indices of group matrices, the array of group means, and the array for group covariance matrices with values passed into the editor. If an input parameter is NULL, the corresponding parameter in the VSL Summary Statistics task descriptor remains unchanged. Use the vslSSEditTask routine to replace the storage format for pooled and group covariance matrices.

vslSSEditRobustCovariance

Modifies pointers to arrays related to a robust covariance matrix.
Syntax

Fortran:
status = vslssseditrobustcovariance(task, rcov_storage, nparams, params, rmean, rcov)
status = vsldsseditrobustcovariance(task, rcov_storage, nparams, params, rmean, rcov)

C:
status = vslsSSEditRobustCovariance(task, rcov_storage, nparams, params, rmean, rcov);
status = vsldSSEditRobustCovariance(task, rcov_storage, nparams, params, rmean, rcov);

Include Files

• Fortran 90: mkl_vsl.f90
• C: mkl_vsl_functions.h

Input Parameters

Name | Type | Description
task | Fortran: TYPE(VSL_SS_TASK); C: VSLSSTaskPtr | Descriptor of the task
rcov_storage | Fortran: INTEGER; C: MKL_INT* | Pointer to the storage format of a robust covariance matrix
nparams | Fortran: INTEGER; C: MKL_INT* | Pointer to the number of method parameters
params | Fortran: REAL(KIND=4) DIMENSION(*) for vslssseditrobustcovariance, REAL(KIND=8) DIMENSION(*) for vsldsseditrobustcovariance; C: float* for vslsSSEditRobustCovariance, double* for vsldSSEditRobustCovariance | Pointer to the array of method parameters
rmean | Fortran: REAL(KIND=4) DIMENSION(*) for vslssseditrobustcovariance, REAL(KIND=8) DIMENSION(*) for vsldsseditrobustcovariance; C: float* for vslsSSEditRobustCovariance, double* for vsldSSEditRobustCovariance | Pointer to the array of robust means
rcov | Fortran: REAL(KIND=4) DIMENSION(*) for vslssseditrobustcovariance, REAL(KIND=8) DIMENSION(*) for vsldsseditrobustcovariance; C: float* for vslsSSEditRobustCovariance, double* for vsldSSEditRobustCovariance | Pointer to a robust covariance matrix

Output Parameters

Name | Type | Description
status | Fortran: INTEGER; C: int | Current status of the task

Description

The vslSSEditRobustCovariance routine uses values passed as parameters of the routine to replace:
• pointers to covariance matrix storage
• pointers to the number of method parameters and to the array of the method parameters of size nparams
• pointers to the arrays that hold robust means and covariance

See Table "Storage formats of a variance-covariance/correlation matrix" for possible values of the rcov_storage parameter. If an input parameter is NULL, the corresponding parameter in the task descriptor remains unchanged.

Intel MKL provides a Translated Biweight S-estimator (TBS) for robust estimation of a variance-covariance matrix and mean [Rocke96]. Use one iteration of the Maronna algorithm with the reweighting step [Maronna02] to compute the initial point of the algorithm. Pack the parameters of the TBS algorithm into the params array and pass them into the editor. Table "Structure of the Array of TBS Parameters" describes the params structure.

Structure of the Array of TBS Parameters

Array Position | Algorithm Parameter | Description
0 | e | Breakdown point, the number of outliers the algorithm can hold. By default, the value is (n-p)/(2n).
1 | a | Asymptotic rejection probability, see details in [Rocke96]. By default, the value is 0.001.
2 | d | Stopping criterion: the algorithm is terminated if weights are changed less than d. By default, the value is 0.001.
3 | max_iter | Maximum number of iterations. The algorithm terminates after max_iter iterations. By default, the value is 10. If you set this parameter to zero, the function returns a robust estimate of the variance-covariance matrix computed using the Maronna method [Maronna02] only.

See additional details of the algorithm usage model in the Intel® MKL Summary Statistics Library Application Notes document on the Intel® MKL web page.

vslSSEditOutliersDetection

Modifies array pointers related to multivariate outliers detection.
Syntax

Fortran:
status = vslssseditoutliersdetection(task, nparams, params, w)
status = vsldsseditoutliersdetection(task, nparams, params, w)

C:
status = vslsSSEditOutliersDetection(task, nparams, params, w);
status = vsldSSEditOutliersDetection(task, nparams, params, w);

Include Files

• Fortran 90: mkl_vsl.f90
• C: mkl_vsl_functions.h

Input Parameters

Name | Type | Description
task | Fortran: TYPE(VSL_SS_TASK); C: VSLSSTaskPtr | Descriptor of the task
nparams | Fortran: INTEGER; C: MKL_INT* | Pointer to the number of method parameters
params | Fortran: REAL(KIND=4) DIMENSION(*) for vslssseditoutliersdetection, REAL(KIND=8) DIMENSION(*) for vsldsseditoutliersdetection; C: float* for vslsSSEditOutliersDetection, double* for vsldSSEditOutliersDetection | Pointer to the array of method parameters
w | Fortran: REAL(KIND=4) DIMENSION(*) for vslssseditoutliersdetection, REAL(KIND=8) DIMENSION(*) for vsldsseditoutliersdetection; C: float* for vslsSSEditOutliersDetection, double* for vsldSSEditOutliersDetection | Pointer to an array of size n. The array holds the weights of observations to be marked as outliers.

Output Parameters

Name | Type | Description
status | Fortran: INTEGER; C: int | Current status of the task

Description

The vslSSEditOutliersDetection routine uses the parameters passed to replace:
• the pointers to the number of method parameters and to the array of the method parameters of size nparams
• the pointer to the array that holds the calculated weights of the observations

If an input parameter is NULL, the corresponding parameter in the task descriptor remains unchanged.

Intel MKL provides the BACON algorithm ([Billor00]) for the detection of multivariate outliers. Pack the parameters of the BACON algorithm into the params array and pass them into the editor. Table "Structure of the Array of BACON Parameters" describes the params structure.
Structure of the Array of BACON Parameters

Array Position | Algorithm Parameter | Description
0 | Method to start the algorithm | The parameter takes one of the following possible values: VSL_SS_METHOD_BACON_MEDIAN_INIT, if the algorithm is started using the median estimate (this is the default value of the parameter); VSL_SS_METHOD_BACON_MAHALANOBIS_INIT, if the algorithm is started using the Mahalanobis distances.
1 | a | One-tailed probability that defines the (1 - a) quantile of the χ² distribution with p degrees of freedom. The recommended value is a/n, where n is the number of observations. By default, the value is 0.05.
2 | d | Stopping criterion; the algorithm is terminated if the size of the basic subset is changed less than d. By default, the value is 0.005.

The output of the algorithm is the vector of weights, BaconWeights, such that BaconWeights(i) = 0 if the i-th observation is detected as an outlier. Otherwise BaconWeights(i) = w(i), where w is the vector of input weights. If you do not provide the vector of input weights, BaconWeights(i) is set to 1 if the i-th observation is not detected as an outlier.

See additional details about the usage model of the algorithm in the Intel® MKL Summary Statistics Library Application Notes document on the Intel® MKL web page.

vslSSEditMissingValues

Modifies pointers to arrays associated with the method of supporting missing values in a dataset.
Syntax

Fortran:
status = vslssseditmissingvalues(task, nparams, params, init_estimates_n, init_estimates, prior_n, prior, simul_missing_vals_n, simul_missing_vals, estimates_n, estimates)
status = vsldsseditmissingvalues(task, nparams, params, init_estimates_n, init_estimates, prior_n, prior, simul_missing_vals_n, simul_missing_vals, estimates_n, estimates)

C:
status = vslsSSEditMissingValues(task, nparams, params, init_estimates_n, init_estimates, prior_n, prior, simul_missing_vals_n, simul_missing_vals, estimates_n, estimates);
status = vsldSSEditMissingValues(task, nparams, params, init_estimates_n, init_estimates, prior_n, prior, simul_missing_vals_n, simul_missing_vals, estimates_n, estimates);

Include Files

• Fortran 90: mkl_vsl.f90
• C: mkl_vsl_functions.h

Input Parameters

Name | Type | Description
task | Fortran: TYPE(VSL_SS_TASK); C: VSLSSTaskPtr | Descriptor of the task
nparams | Fortran: INTEGER; C: MKL_INT* | Pointer to the number of method parameters
params | Fortran: REAL(KIND=4) DIMENSION(*) for vslssseditmissingvalues, REAL(KIND=8) DIMENSION(*) for vsldsseditmissingvalues; C: float* for vslsSSEditMissingValues, double* for vsldSSEditMissingValues | Pointer to the array of method parameters
init_estimates_n | Fortran: INTEGER; C: MKL_INT* | Pointer to the number of initial estimates for mean and a variance-covariance matrix
init_estimates | Fortran: REAL(KIND=4) DIMENSION(*) for vslssseditmissingvalues, REAL(KIND=8) DIMENSION(*) for vsldsseditmissingvalues; C: float* for vslsSSEditMissingValues, double* for vsldSSEditMissingValues | Pointer to the array that holds initial estimates for mean and a variance-covariance matrix
prior_n | Fortran: INTEGER; C: MKL_INT* | Pointer to the number of prior parameters
prior | Fortran: REAL(KIND=4) DIMENSION(*) for vslssseditmissingvalues, REAL(KIND=8) DIMENSION(*) for vsldsseditmissingvalues; C: float* for vslsSSEditMissingValues, double* for vsldSSEditMissingValues | Pointer to the array of prior parameters
simul_missing_vals_n | Fortran: INTEGER; C: MKL_INT* | Pointer to the size of the array that holds output of the Multiple Imputation method
simul_missing_vals | Fortran: REAL(KIND=4) DIMENSION(*) for vslssseditmissingvalues, REAL(KIND=8) DIMENSION(*) for vsldsseditmissingvalues; C: float* for vslsSSEditMissingValues, double* for vsldSSEditMissingValues | Pointer to the array of size k*m, where k is the total number of missing values, and m is the number of copies of missing values. The array holds m sets of simulated missing values for the matrix of observations.
estimates_n | Fortran: INTEGER; C: MKL_INT* | Pointer to the number of estimates to be returned by the routine
estimates | Fortran: REAL(KIND=4) DIMENSION(*) for vslssseditmissingvalues, REAL(KIND=8) DIMENSION(*) for vsldsseditmissingvalues; C: float* for vslsSSEditMissingValues, double* for vsldSSEditMissingValues | Pointer to the array that holds estimates of the mean and a variance-covariance matrix.

Output Parameters

Name | Type | Description
status | Fortran: INTEGER; C: int | Current status of the task

Description

The vslSSEditMissingValues routine uses values passed as parameters of the routine to replace:
• pointers to the number and the array of the method parameters
• pointers to the number and the array of initial mean/variance-covariance estimates
• pointers to the number and the array of prior parameters
• pointers to the number and the array of simulated missing values
• pointers to the number and the array of the intermediate mean/covariance estimates

If an input parameter is NULL, the corresponding parameter in the task descriptor remains unchanged.
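The counts passed through init_estimates_n, prior_n, simul_missing_vals_n, and estimates_n all follow from the task dimension p and the Multiple Imputation parameters. The sketch below, in C, only evaluates the size formulas this section states: p + p(p+1)/2 for init_estimates, (p^2 + 3p + 4)/2 for prior, m times the number of missing values for simul_missing_vals, and m*da_iter_num*(p + (p^2 + p)/2) for estimates. The helper names are hypothetical and are not part of Intel MKL; the offset helper assumes the m imputed sets are stored one after another, which is how the worked example in this section reads.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helpers for illustration only; they are not part of Intel MKL. */

/* One set of estimates: p means plus the upper triangle of a p-by-p matrix.
   This is also the minimum size of init_estimates. */
size_t mi_estimate_set_size(size_t p)
{
    return p + p * (p + 1) / 2;
}

/* Minimum size of the prior array: (p^2 + 3p + 4)/2. */
size_t mi_prior_size(size_t p)
{
    return (p * p + 3 * p + 4) / 2;
}

/* Minimum size of simul_missing_vals: m sets of missing_vals_num values. */
size_t mi_simul_vals_size(size_t m, size_t missing_vals_num)
{
    return m * missing_vals_num;
}

/* Size of the estimates array: one estimate set per DA iteration per imputed set. */
size_t mi_estimates_size(size_t m, size_t da_iter_num, size_t p)
{
    return m * da_iter_num * mi_estimate_set_size(p);
}

/* 0-based offset of the j-th missing value (in order of appearance in the
   matrix of observations) within imputed set s. Assumes sets are contiguous. */
size_t mi_simul_offset(size_t s, size_t j, size_t missing_vals_num)
{
    assert(j < missing_vals_num);
    return s * missing_vals_num + j;
}
```

For a task of dimension 4 with three missing values and m = 2, mi_simul_vals_size(2, 3) gives 6 and the second imputed set starts at offset mi_simul_offset(1, 0, 3), that is, position 3.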
Before you call the VSL Summary Statistics routines to process missing values, preprocess the dataset and denote missing observations with one of the following predefined constants:
• VSL_SS_SNAN, if the dataset is stored in single precision floating-point arithmetic
• VSL_SS_DNAN, if the dataset is stored in double precision floating-point arithmetic

Intel MKL provides the VSL_SS_METHOD_MI method to support missing values in the dataset based on the Multiple Imputation (MI) approach described in [Schafer97]. The following components support Multiple Imputation:
• Expectation Maximization (EM) algorithm to compute the start point for the Data Augmentation (DA) procedure
• DA function

NOTE The DA component of the MI procedure is simulation-based and uses the VSL_BRNG_MCG59 basic random number generator with predefined seed = 250 and the Gaussian distribution generator (ICDF method) available in Intel MKL [Gaussian].

Pack the parameters of the MI algorithm into the params array. Table "Structure of the Array of MI Parameters" describes the params structure.

Structure of the Array of MI Parameters

Array Position | Algorithm Parameter | Description
0 | em_iter_num | Maximal number of iterations for the EM algorithm. By default, this value is 50.
1 | da_iter_num | Maximal number of iterations for the DA algorithm. By default, this value is 30.
2 | e | Stopping criterion for the EM algorithm. The algorithm terminates if the maximum absolute value of the element-wise difference between the previous and current parameter values is less than e. By default, this value is 0.001.
3 | m | Number of sets to impute
4 | missing_vals_num | Total number of missing values in the dataset

You can also pass initial estimates into the EM algorithm by packing both the vector of means and the variance-covariance matrix as a one-dimensional array init_estimates. The size of the array should be at least p + p(p + 1)/2.
For i = 0, ..., p-1, init_estimates[i] contains the initial estimate of the i-th mean. The remaining positions of the array are occupied by the upper triangular part of the variance-covariance matrix.

If you provide no initial estimates for the EM algorithm, the editor uses the default values, that is, the vector of zero means and the identity matrix as a variance-covariance matrix.

You can also pass prior parameters for µ and Σ into the library: µ0, t, m, and Λ^(-1). Pack these parameters as a one-dimensional array prior with a size of at least (p^2 + 3p + 4)/2. The storage format is as follows:
• prior[0], ..., prior[p-1] contain the elements of the vector µ0.
• prior[p] contains the parameter t.
• prior[p+1] contains the parameter m.
• The remaining positions are occupied by the upper-triangular part of the inverted matrix Λ^(-1).

If you provide no prior parameters, the editor uses their default values:
• The array of p zeros is used as µ0.
• t is set to 0.
• m is set to p.
• The zero matrix is used as an initial approximation of Λ^(-1).

The EditMissingValues editor returns m sets of imputed values and/or a sequence of parameter estimates drawn during the DA procedure.

The editor returns the imputed values as the simul_missing_vals array. The size of the array should be sufficient to hold m sets, each of size missing_vals_num, that is, at least m*missing_vals_num in total. The editor packs the imputed values one by one in the order of their appearance in the matrix of observations.

For example, consider a task of dimension 4. The total number of observations n is 10. The second observation vector misses variables 1 and 2, and the seventh observation vector lacks variable 1. The number of sets to impute is m=2. Then, simul_missing_vals[0] and simul_missing_vals[1] contain the first and the second points for the second observation vector, and simul_missing_vals[2] holds the first point for the seventh observation.
Positions 3, 4, and 5 are formed similarly.

To estimate convergence of the DA algorithm and choose a proper value of the number of DA iterations, request the sequence of parameter estimates that are produced during the DA procedure. The editor returns the sequence of parameters as a single array. The size of the array is m*da_iter_num*(p + (p^2 + p)/2), where
• m is the number of sets of values to impute.
• da_iter_num is the number of DA iterations.
• The value p + (p^2 + p)/2 determines the size of the memory to hold one set of the parameter estimates.

In each set of the parameters, the vector of means occupies the first p positions and the remaining (p^2 + p)/2 positions are intended for the upper-triangular part of the variance-covariance matrix.

Upon successful generation of m sets of imputed values, you can place them in cells of the data matrix with missing values and use the VSL Summary Statistics routines to analyze and get estimates for each of the m complete datasets.

NOTE Intel MKL implementation of the MI algorithm rewrites cells of the dataset that contain the VSL_SS_SNAN/VSL_SS_DNAN values. If you want to use the VSL Summary Statistics routines to process the data with missing values again, mask the positions of the empty cells.

See additional details of the algorithm usage model in the Intel® MKL Summary Statistics Library Application Notes document on the Intel® MKL web page.

vslSSEditCorParameterization
Modifies pointers to arrays related to the algorithm of correlation matrix parameterization.
Syntax
Fortran:
status = vslssseditcorparameterization(task, cor, cor_storage, pcor, pcor_storage)
status = vsldsseditcorparameterization(task, cor, cor_storage, pcor, pcor_storage)
C:
status = vslsSSEditCorParameterization(task, cor, cor_storage, pcor, pcor_storage);
status = vsldSSEditCorParameterization(task, cor, cor_storage, pcor, pcor_storage);

Include Files
• Fortran 90: mkl_vsl.f90
• C: mkl_vsl_functions.h

Input Parameters
Name | Type | Description
task | Fortran: TYPE(VSL_SS_TASK); C: VSLSSTaskPtr | Descriptor of the task
cor | Fortran: REAL(KIND=4) DIMENSION(*) for vslssseditcorparameterization, REAL(KIND=8) DIMENSION(*) for vsldsseditcorparameterization; C: float* for vslsSSEditCorParameterization, double* for vsldSSEditCorParameterization | Pointer to the correlation matrix
cor_storage | Fortran: INTEGER; C: MKL_INT* | Pointer to the storage format of the correlation matrix
pcor | Fortran: REAL(KIND=4) DIMENSION(*) for vslssseditcorparameterization, REAL(KIND=8) DIMENSION(*) for vsldsseditcorparameterization; C: float* for vslsSSEditCorParameterization, double* for vsldSSEditCorParameterization | Pointer to the parameterized correlation matrix
pcor_storage | Fortran: INTEGER; C: MKL_INT* | Pointer to the storage format of the parameterized correlation matrix

Output Parameters
Name | Type | Description
status | Fortran: INTEGER; C: int | Current status of the task

Description
The vslSSEditCorParameterization routine replaces the pointer to the correlation matrix, the pointer to the correlation matrix storage format, the pointer to the parameterized correlation matrix, and the pointer to the parameterized correlation matrix storage format with the values passed as parameters of the routine. See Table "Storage formats of a variance-covariance/correlation matrix" for possible values of the cor_storage and pcor_storage parameters.
If an input parameter is NULL, the corresponding parameter in the VSL Summary Statistics task descriptor remains unchanged.

Task Computation Routines
Task computation routines calculate statistical estimates on the data provided and parameters held in the task descriptor. After you create the task and initialize its parameters, you can call the computation routines as many times as necessary. Table "VSL Summary Statistics Estimates Obtained with vslSSCompute Routine" lists the statistical estimates that you can obtain using the vslSSCompute routine.

NOTE The VSL Summary Statistics computation routines do not signal floating-point errors, such as overflow or gradual underflow, or operations with NaNs (except for the missing values in the observations).

VSL Summary Statistics Estimates Obtained with vslSSCompute Routine
Estimate | Support of Observations Available in Blocks | Description
VSL_SS_MEAN | Yes | Computes the array of means.
VSL_SS_2R_MOM | Yes | Computes the array of the 2nd order raw moments.
VSL_SS_3R_MOM | Yes | Computes the array of the 3rd order raw moments.
VSL_SS_4R_MOM | Yes | Computes the array of the 4th order raw moments.
VSL_SS_2C_MOM | Yes | Computes the array of the 2nd order central moments.
VSL_SS_3C_MOM | Yes | Computes the array of the 3rd order central moments.
VSL_SS_4C_MOM | Yes | Computes the array of the 4th order central moments.
VSL_SS_KURTOSIS | Yes | Computes the array of kurtosis values.
VSL_SS_SKEWNESS | Yes | Computes the array of skewness values.
VSL_SS_MIN | Yes | Computes the array of minimum values.
VSL_SS_MAX | Yes | Computes the array of maximum values.
VSL_SS_VARIATION | Yes | Computes the array of variation coefficients.
VSL_SS_COV | Yes | Computes a covariance matrix.
VSL_SS_COR | Yes | Computes a correlation matrix.
VSL_SS_POOLED_COV | No | Computes a pooled covariance matrix.
VSL_SS_GROUP_COV | No | Computes group covariance matrices.
VSL_SS_QUANTS | No | Computes quantiles.
VSL_SS_ORDER_STATS | No | Computes order statistics.
VSL_SS_ROBUST_COV | No | Computes a robust covariance matrix.
VSL_SS_OUTLIERS | No | Detects outliers in the dataset.
VSL_SS_PARTIAL_COV | No | Computes a partial covariance matrix.
VSL_SS_PARTIAL_COR | No | Computes a partial correlation matrix.
VSL_SS_MISSING_VALS | No | Supports missing values in datasets.
VSL_SS_PARAMTR_COR | No | Computes a parameterized correlation matrix.
VSL_SS_STREAM_QUANTS | Yes | Computes quantiles for streaming data.

Table "VSL Summary Statistics Computation Methods" lists estimate calculation methods supported by Intel MKL. See the Intel® MKL Summary Statistics Library Application Notes document on the Intel® MKL web page for a detailed description of the methods.

VSL Summary Statistics Computation Methods
Method | Description
VSL_SS_METHOD_FAST | Fast method for calculation of the estimates
VSL_SS_METHOD_1PASS | One-pass method for calculation of estimates
VSL_SS_METHOD_TBS | TBS method for robust estimation of covariance and mean
VSL_SS_METHOD_BACON | BACON method for detection of multivariate outliers
VSL_SS_METHOD_MI | Multiple imputation method for support of missing values
VSL_SS_METHOD_SD | Spectral decomposition method for parameterization of a correlation matrix
VSL_SS_METHOD_SQUANTS_ZW | Zhang-Wang (ZW) method for quantile estimation for streaming data
VSL_SS_METHOD_SQUANTS_ZW_FAST | Fast ZW method for quantile estimation for streaming data

You can calculate all requested estimates in one call of the routine. For example, to compute a kurtosis and covariance matrix using a fast method, pass a combination of the pre-defined parameters into the Compute routine as shown in the example below:
...
    method = VSL_SS_METHOD_FAST;
    task_params = VSL_SS_KURTOSIS|VSL_SS_COV;
    …
    status = vsldSSCompute( task, task_params, method );

To compute statistical estimates for the next block of observations, you can do one of the following:
• copy the observations to memory, starting with the address available to the task
• use one of the appropriate Editors to modify the pointer to the new dataset in the task.

The library does not detect changes that you make to the dataset or to the computed statistical estimates. To obtain statistical estimates for a new matrix, change the observations and initialize relevant arrays. You can follow this procedure to compute statistical estimates for observations that come in portions. See Table "VSL Summary Statistics Estimates Obtained with vslSSCompute Routine" for information on such observations supported by the Intel MKL VSL Summary Statistics estimators.

To modify parameters of the task using the Task Editors, set the address of the targeted matrix of the observations or change the respective vector component indices. After you complete editing the task parameters, you can compute statistical estimates in the modified environment.

If the task completes successfully, the computation routine returns the zero status code. If an error is detected, the computation routine returns an error code. In particular, an error status code is returned in the following cases:
• the task pointer is NULL
• memory allocation has failed
• the calculation has failed for some other reason

NOTE You can use the NULL task pointer in calls to editor routines. In this case, the routine is terminated and no system crash occurs.

vslSSCompute
Computes VSL Summary Statistics estimates.
Syntax
Fortran:
status = vslssscompute(task, estimates, method)
status = vsldsscompute(task, estimates, method)
C:
status = vslsSSCompute(task, estimates, method);
status = vsldSSCompute(task, estimates, method);

Include Files
• Fortran 90: mkl_vsl.f90
• C: mkl_vsl_functions.h

Input Parameters
Name | Type | Description
task | Fortran: TYPE(VSL_SS_TASK); C: VSLSSTaskPtr | Descriptor of the task
estimates | Fortran: INTEGER (KIND=8); C: unsigned long long | List of statistical estimates to compute
method | Fortran: INTEGER; C: MKL_INT | Method to be used in calculations

Output Parameters
Name | Type | Description
status | Fortran: INTEGER; C: int | Current status of the task

Description
The vslSSCompute routine calculates statistical estimates passed as the estimates parameter using the algorithms passed as the method parameter of the routine. The computations are done in the context of the task descriptor that contains pointers to all required and optional, if necessary, properly initialized arrays. In one call of the function, you can compute several estimates using proper methods for their calculation. See Table "VSL Summary Statistics Estimates Obtained with vslSSCompute Routine" for the list of the estimates that you can calculate with the vslSSCompute routine. See Table "VSL Summary Statistics Computation Methods" for the list of possible values of the method parameter.

To initialize single or double precision version task parameters, use the single (vslssscompute) or double (vsldsscompute) version of the editor, respectively. To initialize parameters of the integer type, use an integer version of the editor (vslisscompute).

NOTE Requesting a combination of the VSL_SS_MISSING_VALS value and any other estimate parameter in the Compute function results in processing only the missing values.

Task Destructor
Task destructor is the vslSSDeleteTask routine intended to delete task objects and release memory.
vslSSDeleteTask
Destroys the task object and releases the memory.

Syntax
Fortran:
status = vslssdeletetask(task)
C:
status = vslSSDeleteTask(&task);

Include Files
• Fortran 90: mkl_vsl.f90
• C: mkl_vsl_functions.h

Input Parameters
Name | Type | Description
task | Fortran: TYPE(VSL_SS_TASK); C: VSLSSTaskPtr* | Descriptor of the task to destroy

Output Parameters
Name | Type | Description
status | Fortran: INTEGER; C: int | Set to VSL_STATUS_OK if the task is deleted; otherwise, a non-zero error code is returned.

Description
The vslSSDeleteTask routine deletes the task descriptor object, releases the memory allocated for the structure, and sets the task pointer to NULL. If vslSSDeleteTask fails to delete the task successfully, it returns an error code.

NOTE Calling the destructor with the NULL pointer as the parameter terminates the function with no system crash.

Usage Examples
The following examples show various standard operations with Summary Statistics routines.

Calculating Fixed Estimates for Fixed Data
The example shows recurrent calculation of the same estimates with a given set of variables for the complete life cycle of the task in the case of a variance-covariance matrix. The set of vector components to process remains unchanged, and the data comes in blocks. Before you call the vslSSCompute routine, initialize pointers to arrays for mean and covariance and set buffers.
…
    double w[2];
    /* calculating mean for the 1st and 3rd random vector components */
    MKL_INT indices[DIM] = {1, 0, 1};

    /* Initialize parameters of the task */
    p = DIM;
    n = N;
    xstorage = VSL_SS_MATRIX_STORAGE_ROWS;
    covstorage = VSL_SS_MATRIX_STORAGE_FULL;
    w[0] = 0.0; w[1] = 0.0;
    for ( i = 0; i < p; i++ ) mean[i] = 0.0;
    for ( i = 0; i < p*p; i++ ) cov[i] = 0.0;

    status = vsldSSNewTask( &task, &p, &n, &xstorage, x, 0, indices );
    status = vsldSSEditTask ( task, VSL_SS_ED_ACCUM_WEIGHT, w );
    status = vsldSSEditCovCor( task, mean, cov, &covstorage, 0, 0 );

You can process data arrays that come in blocks as follows:

    for ( i = 0; i < num_of_blocks; i++ )
    {
        status = vsldSSCompute( task, VSL_SS_COV, VSL_SS_METHOD_FAST );
        /* Read new data block into array x */
    }
    …

Calculating Different Estimates for Variable Data
The context of your calculation may change in the process of data analysis. The example below shows the data that comes in two blocks. You need to estimate a covariance matrix for the complete data, and the third central moment for the second block of the data using the weights that were accumulated for the previous datasets. The second block of the data is stored in another array.
You can proceed as follows:

    /* Set parameters for the task */
    p = DIM;
    n = N;
    xstorage = VSL_SS_MATRIX_STORAGE_ROWS;
    covstorage = VSL_SS_MATRIX_STORAGE_FULL;
    w[0] = 0.0; w[1] = 0.0;
    for ( i = 0; i < p; i++ ) mean[i] = 0.0;
    for ( i = 0; i < p*p; i++ ) cov[i] = 0.0;

    /* Create task */
    status = vsldSSNewTask( &task, &p, &n, &xstorage, x1, 0, indices );

    /* Initialize the task parameters */
    status = vsldSSEditTask( task, VSL_SS_ED_ACCUM_WEIGHT, w );
    status = vsldSSEditCovCor( task, mean, cov, &covstorage, 0, 0 );

    /* Calculate covariance for the x1 data */
    status = vsldSSCompute( task, VSL_SS_COV, VSL_SS_METHOD_FAST );

    /* Initialize array of the 3rd central moments and pass the pointer to the task */
    for ( i = 0; i < p; i++ ) c3_m[i] = 0.0;

    /* Modify task context */
    status = vsldSSEditTask( task, VSL_SS_ED_3C_MOM, c3_m );
    status = vsldSSEditTask( task, VSL_SS_ED_OBSERV, x2 );

    /* Calculate covariance for the x1 & x2 data block */
    /* Calculate the 3rd central moment for the 2nd data block using earlier accumulated weight */
    status = vsldSSCompute( task, VSL_SS_COV|VSL_SS_3C_MOM, VSL_SS_METHOD_FAST );
    …
    status = vslSSDeleteTask( &task );

Similarly, you can modify indices of the variables to be processed for the next data block.

Mathematical Notation and Definitions
The following notations are used in the mathematical definitions and the description of the Intel MKL VSL Summary Statistics functions.

Matrix and Weights of Observations
For a random p-dimensional vector ξ = (ξ_1, ..., ξ_i, ..., ξ_p), this manual denotes the following:
• (X)_i = (x_ij), j = 1..n, is the result of n independent observations for the i-th component ξ_i of the vector ξ.
• The two-dimensional array X = (x_ij), of size p x n, is the matrix of observations.
• The column [X]_j = (x_ij), i = 1..p, of the matrix X is the j-th observation of the random vector ξ.
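As a small illustration of this notation, the following C sketch copies the j-th observation [X]_j out of a matrix of observations held in a flat array. It assumes, purely for illustration, that the p x n matrix is laid out row by row (all n observations of the first component, then all n observations of the second, and so on) and that indices are 0-based. The helper name get_observation is hypothetical and is not part of Intel MKL.

```c
#include <stddef.h>

/* Hypothetical helper (not an Intel MKL routine).
 * x   : p-by-n matrix of observations, ASSUMED to be stored row by row,
 *       so that element x_ij (0-based) is found at x[i*n + j]
 * obs : output array of length p that receives the j-th observation
 */
void get_observation(const double *x, size_t p, size_t n, size_t j, double *obs)
{
    for (size_t i = 0; i < p; i++) {
        obs[i] = x[i * n + j];   /* component i of observation j */
    }
}
```

With a column-by-column layout the index expression would become x[j*p + i] instead; which layout applies depends on the storage format selected when the task is created.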
Each observation [X]_j is assigned a non-negative weight w_j, where
• The vector (w_j), j = 1..n, is a vector of weights corresponding to n observations of the random vector ξ.
• W = w_1 + ... + w_n is the accumulated weight corresponding to observations X.

Vector of sample means
for all i = 1, ..., p.

Vector of sample variances
for all i = 1, ..., p.

Vector of sample raw/algebraic moments of k-th order, k ≥ 1
for all i = 1, ..., p.

Vector of sample central moments of the third and the fourth order
for all i = 1, ..., p and k = 3, 4.

Vector of sample excess kurtosis values
for all i = 1, ..., p.

Vector of sample skewness values
for all i = 1, ..., p.

Vector of sample variation coefficients
for all i = 1, ..., p.

Matrix of order statistics
Matrix Y = (y_ij), of size p x n, in which the i-th row (Y)_i = (y_ij), j = 1..n, is obtained as a result of sorting in the ascending order of row (X)_i = (x_ij), j = 1..n, in the original matrix of observations.

Vector of sample minimum values
for all i = 1, ..., p.

Vector of sample maximum values
for all i = 1, ..., p.

Vector of sample median values
for all i = 1, ..., p.

Vector of sample quantile values
For a positive integer number q and k belonging to the interval [0, q-1], point z_i is the k-th q-quantile of the random variable ξ_i if P{ξ_i ≤ z_i} ≥ β and P{ξ_i ≥ z_i} ≥ 1 - β, where
• P is the probability measure.
• β = k/n is the quantile order.

The calculation of quantiles is as follows: j = [(n-1)β] and f = {(n-1)β} are the integer and fractional parts of the number (n-1)β, respectively, and the vector of sample quantile values is
Q(X,β) = (Q_1(X,β), ..., Q_p(X,β))
where
Q_i(X,β) = y_i,j+1 + f(y_i,j+2 - y_i,j+1)
for all i = 1, ..., p.

Variance-covariance matrix
C(X) = (c_ij(X)), of size p x p, where

Pooled and group variance-covariance matrices
The set N = {1, ..., n} is partitioned into non-intersecting subsets G_r. The observation [X]_j = (x_ij), i = 1..p, belongs to the group r if j ∈ G_r.
One observation belongs to one group only. The group mean and variance-covariance matrices are calculated similarly to the formulas above:
for all i = 1, ..., p,
where
for all i = 1, ..., p and j = 1, ..., p.

A pooled variance-covariance matrix and a pooled mean are computed as weighted means over group covariance matrices and group means, correspondingly:
for all i = 1, ..., p,
for all i = 1, ..., p and j = 1, ..., p.

Correlation matrix
for all i = 1, ..., p and j = 1, ..., p.

Partial variance-covariance matrix
For a random vector ξ partitioned into two components Z and Y, a variance-covariance matrix C describes the structure of dependencies in the vector ξ:
The partial covariance matrix P(X) = (p_ij(X)), of size k x k, is defined as
where k is the dimension of Y.

Partial correlation matrix
The following is a partial correlation matrix for all i = 1, ..., k and j = 1, ..., k:
where
• k is the dimension of Y.
• p_ij(X) are elements of the partial variance-covariance matrix.

Fourier Transform Functions
The general form of the discrete Fourier transform is

$$z_{k_1, k_2, \ldots, k_d} = \sigma \sum_{j_d=0}^{n_d-1} \cdots \sum_{j_2=0}^{n_2-1} \sum_{j_1=0}^{n_1-1} w_{j_1, j_2, \ldots, j_d} \exp\left(\delta\, i\, 2\pi \sum_{l=1}^{d} \frac{j_l k_l}{n_l}\right)$$

for k_l = 0, ..., n_l - 1 (l = 1, ..., d), where σ is a scale factor, δ = -1 for the forward transform, and δ = +1 for the inverse (backward) transform. In the forward transform, the input (periodic) sequence {w_j1,j2,...,jd} typically belongs to the set of complex-valued sequences and real-valued sequences (forward domain). Respective domains for the backward transform, or backward domains, are represented by complex-valued sequences and complex-valued conjugate-even sequences.

Intel® Math Kernel Library (Intel® MKL) provides an interface for computing a discrete Fourier transform through the fast Fourier transform algorithm.
This chapter describes the following implementations of the fast Fourier transform functions available in Intel MKL:
• Fast Fourier transform (FFT) functions for single-processor or shared-memory systems (see FFT Functions below)
• Cluster FFT functions for distributed-memory architectures (available with Intel® MKL for the Linux* and Windows* operating systems only).

NOTE Intel MKL also supports the FFTW3* interfaces to the fast Fourier transform functionality for symmetric multiprocessing (SMP) systems.

Both FFT and Cluster FFT functions support a five-stage usage model for computing an FFT:
1. Allocate a fresh descriptor for the problem with a call to the DftiCreateDescriptor or DftiCreateDescriptorDM function. The descriptor captures the configuration of the transform, such as the dimensionality (or rank), sizes, number of transforms, memory layout of the input/output data (defined by strides), and scaling factors. Many of the configuration settings are assigned default values in this call and may need modification depending on your application.
2. Optionally adjust the descriptor configuration with a call to the DftiSetValue or DftiSetValueDM function as needed. Typically, you must carefully define the data storage layout for an FFT or the data distribution among processes for a Cluster FFT. The configuration settings of the descriptor, such as the default values, can be obtained with the DftiGetValue or DftiGetValueDM function.
3. Commit the descriptor with a call to the DftiCommitDescriptor or DftiCommitDescriptorDM function, that is, make the descriptor ready for the transform computation. Once the descriptor is committed, the parameters of the transform, such as the type and number of transforms, strides and distances, the type and storage layout of the data, and so on, are "frozen" in the descriptor.
4. Compute the transform with a call to the DftiComputeForward/DftiComputeBackward or DftiComputeForwardDM/DftiComputeBackwardDM functions as many times as needed. With the committed descriptor, the compute functions only accept pointers to the input/output data and compute the transform as defined. To modify any configuration parameters later on, use DftiSetValue followed by DftiCommitDescriptor (DftiSetValueDM followed by DftiCommitDescriptorDM) or create and commit another descriptor.
5. Deallocate the descriptor with a call to the DftiFreeDescriptor or DftiFreeDescriptorDM function. This returns the memory internally consumed by the descriptor to the operating system.

All the above functions return an integer status value, which is zero upon successful completion of the operation. You can interpret a non-zero status with the help of the DftiErrorClass or DftiErrorMessage function.

The FFT functions support lengths with arbitrary factors. You can improve performance of the Intel MKL FFT if the length of your data vector permits factorization into powers of optimized radices. See the Intel MKL User's Guide for specific radices supported efficiently and the length constraints.

NOTE The FFT functions assume the Cartesian representation of complex data (that is, the real and imaginary parts define a complex number). The Intel MKL Vector Mathematical Functions provide an efficient tool for conversion to and from the polar representation (see Example "Conversion from Cartesian to polar representation of complex data" and Example "Conversion from polar to Cartesian representation of complex data").

Optimization Notice
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations.
Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804

FFT Functions
The fast Fourier transform function library of Intel MKL provides one-dimensional, two-dimensional, and multi-dimensional transforms (of up to seven dimensions), and both Fortran and C interfaces for all transform functions.

Table "FFT Functions in Intel MKL" lists FFT functions implemented in Intel MKL:

FFT Functions in Intel MKL
Function Name | Operation
Descriptor Manipulation Functions
DftiCreateDescriptor | Allocates the descriptor data structure and initializes it with default configuration values.
DftiCommitDescriptor | Performs all initialization for the actual FFT computation.
DftiFreeDescriptor | Frees memory allocated for a descriptor.
DftiCopyDescriptor | Makes a copy of an existing descriptor.
FFT Computation Functions
DftiComputeForward | Computes the forward FFT.
DftiComputeBackward | Computes the backward FFT.
Descriptor Configuration Functions
DftiSetValue | Sets one particular configuration parameter with the specified configuration value.
DftiGetValue | Gets the value of one particular configuration parameter.
Status Checking Functions
DftiErrorClass | Checks if the status reflects an error of a predefined class.
DftiErrorMessage | Translates the numeric value of an error status into a message.

Computing an FFT
The FFT functions described later in this chapter are provided with the Fortran and C interfaces.
Fortran 95 is required because it offers features that have no counterpart in FORTRAN 77.

NOTE The Fortran interface of the FFT computation functions requires one-dimensional data arrays for any dimension of FFT problem. For multidimensional transforms, you can pass the address of the first column of the multidimensional data to the computation functions.

The materials presented in this chapter assume the availability of native complex types in C as they are specified in C9X.

You can find code examples that use FFT interface functions to compute transform results in the Fourier Transform Functions Code Examples section in Appendix C.

For most common situations, an FFT computation can be effected by four function calls (refer to the usage model for details). A single data structure, the descriptor, stores configuration parameters that can be changed independently.

The descriptor data structure, when created, contains information about the length and domain of the FFT to be computed, as well as the setting of several configuration parameters. Default settings for some of these parameters are as follows:
• The FFT to be computed does not have a scale factor.
• There is only one set of data to be transformed.
• The data is stored contiguously in memory.
• The computed result overwrites the input data (the transform is in-place).

The default settings can be changed one at a time through the function DftiSetValue as illustrated in the Example "Changing Default Settings (Fortran)" and Example "Changing Default Settings (C)".

FFT Interface
To use the FFT functions, you need to access the module MKL_DFTI through the "use" statement in Fortran, or include the header file mkl_dfti.h in C.

The Fortran interface provides a derived type DFTI_DESCRIPTOR, named constants representing various names of configuration parameters and their possible values, and overloaded functions through the generic functionality of Fortran 95.
The C interface provides the DFTI_DESCRIPTOR_HANDLE type, named constants of two enumeration types DFTI_CONFIG_PARAM and DFTI_CONFIG_VALUE, and functions, some of which accept different numbers of input arguments.

NOTE The current version of the library may not support some of the FFT functions or functionality described in the subsequent sections of this chapter. You can find the complete list of the implementation-specific exceptions in the Intel MKL Release Notes.

For the main categories of Intel MKL FFT functions, see FFT Functions.

Descriptor Manipulation Functions
There are four functions in this category: create a descriptor, commit a descriptor, copy a descriptor, and free a descriptor.

DftiCreateDescriptor
Allocates the descriptor data structure and initializes it with default configuration values.

Syntax
Fortran:
status = DftiCreateDescriptor( desc_handle, precision, forward_domain, dimension, length )
C:
status = DftiCreateDescriptor(&desc_handle, precision, forward_domain, dimension, length);

Include Files
• Fortran 90: mkl_dfti.f90
• C: mkl_dfti.h

Input Parameters
Name | Type | Description
precision | Fortran: INTEGER; C: enum | Precision of the transform: DFTI_SINGLE or DFTI_DOUBLE.
forward_domain | Fortran: INTEGER; C: enum | Forward domain of the transform: DFTI_COMPLEX or DFTI_REAL.
dimension | Fortran: INTEGER; C: MKL_LONG | Dimension of the transform.
length | Fortran: INTEGER if dimension = 1, array INTEGER, DIMENSION(*) otherwise; C: MKL_LONG if dimension == 1, array of type MKL_LONG otherwise | Length of the transform for a one-dimensional transform. Lengths of each dimension for a multi-dimensional transform.

Output Parameters
Name | Type | Description
desc_handle | Fortran: DFTI_DESCRIPTOR; C: DFTI_DESCRIPTOR_HANDLE | FFT descriptor.
status | Fortran: INTEGER; C: MKL_LONG | Function completion status.
Description
This function allocates memory for the descriptor data structure and instantiates it with all the default configuration settings with respect to the precision, forward domain, dimension, and length of the desired transform. Because memory is allocated dynamically, the result is actually a pointer to the created descriptor. This function is slightly different from the "initialization" function that can be found in software packages or libraries that implement more traditional algorithms for computing FFT. This function does not perform any significant computational work such as computation of twiddle factors. The function DftiCommitDescriptor does this work after the function DftiSetValue has set values of all needed parameters.

The function returns the zero status when it completes successfully. See Status Checking Functions for more information on the returned status.

Interface and Prototype

    ! Fortran interface.
    ! Note that the body provided below only illustrates the list of different
    ! parameters and the types of dummy parameters. You can rely only on the function
    ! name following keyword INTERFACE. For the precise definition of the
    ! interface, see the include/mkl_dfti.f90 file in the Intel MKL directory.
    INTERFACE DftiCreateDescriptor
      FUNCTION some_actual_function_1d(desc, precision, domain, dim, length)
        INTEGER :: some_actual_function_1d
        ...
        INTEGER, INTENT(IN) :: length
      END FUNCTION some_actual_function_1d
      FUNCTION some_actual_function_md(desc, precision, domain, dim, lengths)
        INTEGER :: some_actual_function_md
        ...
        INTEGER, INTENT(IN), DIMENSION(*) :: lengths
      END FUNCTION some_actual_function_md
      ...
    END INTERFACE DftiCreateDescriptor

Note that the function is overloaded, because the actual parameter for the formal parameter length can be a scalar or a rank-one array.
The function is also overloaded with respect to the type of the precision parameter to provide an option of using a precision-specific function for the generic name. Using more specific functions can reduce the size of a statically linked executable for applications that use only single-precision FFTs or only double-precision FFTs. To use this option, change the "USE MKL_DFTI" statement in your program unit to one of the following:

    USE MKL_DFTI, FORGET=>DFTI_SINGLE, DFTI_SINGLE=>DFTI_SINGLE_R
    USE MKL_DFTI, FORGET=>DFTI_DOUBLE, DFTI_DOUBLE=>DFTI_DOUBLE_R

where the name "FORGET" can be replaced with any name that is not used in the program unit.

    /* C prototype.
     * Note that the preprocessor definition provided below only illustrates
     * that the actual function called may be determined at compile time.
     * You can rely only on the declaration of the function.
     * For precise definition of the preprocessor macro, see the include/mkl_dfti.h
     * file in the Intel MKL directory.
     */
    MKL_LONG DftiCreateDescriptor(DFTI_DESCRIPTOR_HANDLE * pHandle,
                                  enum DFTI_CONFIG_VALUE precision,
                                  enum DFTI_CONFIG_VALUE domain,
                                  MKL_LONG dimension,
                                  ... /* length/lengths */ );

    #define DftiCreateDescriptor(desc,prec,domain,dim,sizes) \
        ((prec)==DFTI_SINGLE && (dim)==1) ? \
        some_actual_function_s1d((desc),(domain),(MKL_LONG)(sizes)) : \
        ...

Variable length/lengths is interpreted as a scalar (MKL_LONG) or an array (MKL_LONG*), depending on the value of parameter dimension. If the value of parameter precision is known at compile time, an optimizing compiler retains only the call to the respective specific function, thereby reducing the size of the statically linked application. Avoid direct calls to the specific functions used in the preprocessor macro definition, because their interface may change in future releases of the library.
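The scalar-or-array convention for the trailing argument can be illustrated without Intel MKL. The sketch below uses the standard <stdarg.h> facilities to read either a single length or a pointer to an array of lengths, depending on the dimension argument. The function total_points is hypothetical and is not part of the library, and long stands in for MKL_LONG as an assumption made only for this example.

```c
#include <stdarg.h>

/* Hypothetical helper (not an Intel MKL function): returns the total number
 * of points described by the trailing argument, which is read as
 *   - a scalar long          when dimension == 1
 *   - a pointer to long[dimension] otherwise
 */
long total_points(long dimension, ...)
{
    va_list ap;
    long total = 1;

    va_start(ap, dimension);
    if (dimension == 1) {
        total = va_arg(ap, long);            /* scalar length */
    } else {
        long *lengths = va_arg(ap, long *);  /* array of lengths */
        for (long i = 0; i < dimension; i++) {
            total *= lengths[i];
        }
    }
    va_end(ap);
    return total;
}
```

Because the trailing argument is not type-checked, a caller of such a function must pass exactly the expected type: a call like total_points(1, 1024L) works, whereas passing a plain int where a long is read is undefined behavior. This is one reason a wrapper macro may cast the scalar argument explicitly, as the (MKL_LONG)(sizes) cast in the macro above does.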
If the use of the macro is undesirable, you can safely undefine it after inclusion of the Intel MKL FFT header file, as follows:

#include "mkl_dfti.h"
#undef DftiCreateDescriptor

See Also
DFTI_PRECISION
DFTI_FORWARD_DOMAIN
DFTI_DIMENSION, DFTI_LENGTHS
Configuration Parameters

DftiCommitDescriptor
Performs all initialization for the actual FFT computation.

Syntax
Fortran:
status = DftiCommitDescriptor( desc_handle )
C:
status = DftiCommitDescriptor(desc_handle);

Include Files
• FORTRAN 90: mkl_dfti.f90
• C: mkl_dfti.h

Input Parameters
desc_handle
  Type: FORTRAN: DFTI_DESCRIPTOR; C: DFTI_DESCRIPTOR_HANDLE
  Description: FFT descriptor.

Output Parameters
desc_handle
  Type: FORTRAN: DFTI_DESCRIPTOR; C: DFTI_DESCRIPTOR_HANDLE
  Description: Updated FFT descriptor.
status
  Type: FORTRAN: INTEGER; C: MKL_LONG
  Description: Function completion status.

Description

This function completes initialization of a previously created descriptor, which is required before the descriptor can be used for FFT computations. Typically, this committal performs all initialization that facilitates the actual FFT computation. This initialization may involve exploring many different factorizations of the input length to find the optimal computation method.

Any change of configuration parameters of a committed descriptor via the set value function (see Descriptor Configuration) requires a re-committal of the descriptor before a computation function can be invoked. Typically, this committal function call is immediately followed by a computation function call (see FFT Computation).

The function returns the zero status when it completes successfully. See Status Checking Functions for more information on the returned status.

Interface and Prototype

! Fortran interface
INTERFACE DftiCommitDescriptor
!Note that the body provided here is to illustrate the different
!argument list and types of dummy arguments.
!The interface does not guarantee what the actual function names are.
!Users can only rely on the function name following the
!keyword INTERFACE
  FUNCTION some_actual_function_1 ( Desc_Handle )
    INTEGER :: some_actual_function_1
    TYPE(DFTI_DESCRIPTOR), POINTER :: Desc_Handle
  END FUNCTION some_actual_function_1
END INTERFACE DftiCommitDescriptor

/* C prototype */
MKL_LONG DftiCommitDescriptor( DFTI_DESCRIPTOR_HANDLE );

DftiFreeDescriptor
Frees the memory allocated for a descriptor.

Syntax
Fortran:
status = DftiFreeDescriptor( desc_handle )
C:
status = DftiFreeDescriptor(&desc_handle);

Include Files
• FORTRAN 90: mkl_dfti.f90
• C: mkl_dfti.h

Input Parameters
desc_handle
  Type: FORTRAN: DFTI_DESCRIPTOR; C: DFTI_DESCRIPTOR_HANDLE
  Description: FFT descriptor.

Output Parameters
desc_handle
  Type: FORTRAN: DFTI_DESCRIPTOR; C: DFTI_DESCRIPTOR_HANDLE
  Description: Memory for the FFT descriptor is released.
status
  Type: FORTRAN: INTEGER; C: MKL_LONG
  Description: Function completion status.

Description

This function frees all memory allocated for a descriptor.

NOTE Memory allocation/deallocation inside Intel MKL is managed by Intel MKL memory management software. So, even after successful completion of DftiFreeDescriptor, the memory space may remain allocated to the application, because the memory management software sometimes does not return the memory space to the OS but considers the space free and can reuse it for future memory allocation. See Example "mkl_free_buffers Usage with FFT Functions" in the description of the service function FreeBuffers on how to use Intel MKL memory management software and release memory to the OS.

The function returns the zero status when it completes successfully. See Status Checking Functions for more information on the returned status.

Interface and Prototype

! Fortran interface
INTERFACE DftiFreeDescriptor
!Note that the body provided here is to illustrate the different
!argument list and types of dummy arguments. The interface
!does not guarantee what the actual function names are.
!Users can only rely on the function name following the
!keyword INTERFACE
  FUNCTION some_actual_function_3( Desc_Handle )
    INTEGER :: some_actual_function_3
    TYPE(DFTI_DESCRIPTOR), POINTER :: Desc_Handle
  END FUNCTION some_actual_function_3
END INTERFACE DftiFreeDescriptor

/* C prototype */
MKL_LONG DftiFreeDescriptor( DFTI_DESCRIPTOR_HANDLE * );

DftiCopyDescriptor
Makes a copy of an existing descriptor.

Syntax
Fortran:
status = DftiCopyDescriptor( desc_handle_original, desc_handle_copy )
C:
status = DftiCopyDescriptor(desc_handle_original, &desc_handle_copy);

Include Files
• FORTRAN 90: mkl_dfti.f90
• C: mkl_dfti.h

Input Parameters
desc_handle_original
  Type: FORTRAN: DFTI_DESCRIPTOR; C: DFTI_DESCRIPTOR_HANDLE
  Description: The FFT descriptor to make a copy of.

Output Parameters
desc_handle_copy
  Type: FORTRAN: DFTI_DESCRIPTOR; C: DFTI_DESCRIPTOR_HANDLE
  Description: The copy of the FFT descriptor.
status
  Type: FORTRAN: INTEGER; C: MKL_LONG
  Description: Function completion status.

Description

This function makes a copy of an existing descriptor and provides a pointer to it. The copy retains all information of the original descriptor even if the original is destroyed via the free descriptor function DftiFreeDescriptor.

The function returns the zero status when it completes successfully. See Status Checking Functions for more information on the returned status.

Interface and Prototype

! Fortran interface
INTERFACE DftiCopyDescriptor
!Note that the body provided here is to illustrate the different
!argument list and types of dummy arguments. The interface
!does not guarantee what the actual function names are.
!Users can only rely on the function name following the
!keyword INTERFACE
  FUNCTION some_actual_function_2( Desc_Handle_Original, Desc_Handle_Copy )
    INTEGER :: some_actual_function_2
    TYPE(DFTI_DESCRIPTOR), POINTER :: Desc_Handle_Original, Desc_Handle_Copy
  END FUNCTION some_actual_function_2
END INTERFACE DftiCopyDescriptor

/* C prototype */
MKL_LONG DftiCopyDescriptor( DFTI_DESCRIPTOR_HANDLE, DFTI_DESCRIPTOR_HANDLE * );

FFT Computation Functions

There are two functions in this category: compute the forward transform, and compute the backward transform.

DftiComputeForward
Computes the forward FFT.

Syntax
Fortran:
status = DftiComputeForward( desc_handle, x_inout )
status = DftiComputeForward( desc_handle, x_in, y_out )
status = DftiComputeForward( desc_handle, xre_inout, xim_inout )
status = DftiComputeForward( desc_handle, xre_in, xim_in, yre_out, yim_out )
C:
status = DftiComputeForward(desc_handle, x_inout);
status = DftiComputeForward(desc_handle, x_in, y_out);
status = DftiComputeForward(desc_handle, xre_inout, xim_inout);
status = DftiComputeForward(desc_handle, xre_in, xim_in, yre_out, yim_out);

Input Parameters
desc_handle
  Type: FORTRAN: DFTI_DESCRIPTOR; C: DFTI_DESCRIPTOR_HANDLE
  Description: FFT descriptor.
x_inout, x_in
  Type: FORTRAN: Array REAL(KIND=WP) or COMPLEX(KIND=WP), DIMENSION(*), where type and working precision WP must be consistent with the forward domain and precision specified in the descriptor.
  C: Array of type float or double depending on the precision of the transform, specified in the DFTI_PRECISION configuration setting.
  Description: Data to be transformed in case of a real forward domain, specified in the DFTI_FORWARD_DOMAIN configuration setting.
xre_inout, xim_inout, xre_in, xim_in
  Type: FORTRAN: Array REAL(KIND=WP) or COMPLEX(KIND=WP), DIMENSION(*), where type and working precision WP must be consistent with the forward domain and precision specified in the descriptor.
  C: Array of type float or double depending on the precision of the transform.
  Description: Real and imaginary parts of the data to be transformed in case of a complex forward domain, specified in the DFTI_FORWARD_DOMAIN configuration setting.

The suffix in parameter names corresponds to the value of the configuration parameter DFTI_PLACEMENT as follows:
• _inout to DFTI_INPLACE
• _in or _out to DFTI_NOT_INPLACE

Output Parameters
y_out
  Type: FORTRAN: Array REAL(KIND=WP) or COMPLEX(KIND=WP), DIMENSION(*), where type and working precision WP must be consistent with the forward domain and precision specified in the descriptor.
  C: Array of type float or double depending on the precision of the transform.
  Description: The transformed data in case of a real backward domain, determined by the DFTI_FORWARD_DOMAIN configuration setting.
xre_inout, xim_inout, yre_out, yim_out
  Type: FORTRAN: Array REAL(KIND=WP) or COMPLEX(KIND=WP), DIMENSION(*), where type and working precision WP must be consistent with the forward domain and precision specified in the descriptor.
  C: Array of type float or double depending on the precision of the transform.
  Description: Real and imaginary parts of the transformed data in case of a complex backward domain, determined by the DFTI_FORWARD_DOMAIN configuration setting.
status
  Type: FORTRAN: INTEGER; C: MKL_LONG
  Description: Function completion status.

The suffix in parameter names corresponds to the value of the configuration parameter DFTI_PLACEMENT as follows:
• _inout to DFTI_INPLACE
• _in or _out to DFTI_NOT_INPLACE

Include Files
• FORTRAN 90: mkl_dfti.f90
• C: mkl_dfti.h

Description

The DftiComputeForward function accepts the descriptor handle parameter and one or more data parameters. Provided the descriptor is configured and committed successfully, this function computes the forward FFT, that is, the transform with the minus sign in the exponent, d = -1.
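The role of the sign d can be illustrated with a naive length-4 transform written in plain C. This is a toy sketch under stated assumptions, not the library's algorithm: the type sketch_cplx and the function sketch_dft4 are hypothetical names, and the scale argument only mimics the idea of the forward and backward scale factors. It assumes the usual definition y[k] = scale * sum over j of x[j] * exp(d * 2*pi*i * j*k / N) with N = 4.

```c
#include <assert.h>

/* Hypothetical complex type for this sketch; not an Intel MKL type. */
typedef struct { double re, im; } sketch_cplx;

/*
 * Naive length-4 DFT: y[k] = scale * sum_j x[j] * exp(d * 2*pi*i * j*k / 4).
 * d = -1 gives the forward transform, d = +1 the backward transform.
 * For N = 4 the twiddle factors are exactly 1, i, -1, -i (or their
 * conjugates), so no math library is needed.
 */
void sketch_dft4(const sketch_cplx x[4], sketch_cplx y[4], int d, double scale)
{
    /* cos(pi*m/2) and sin(pi*m/2) for m = 0..3 */
    static const double c[4] = { 1.0, 0.0, -1.0, 0.0 };
    static const double s[4] = { 0.0, 1.0, 0.0, -1.0 };

    assert(d == -1 || d == 1);
    for (int k = 0; k < 4; ++k) {
        double re = 0.0, im = 0.0;
        for (int j = 0; j < 4; ++j) {
            int m = (j * k) % 4;
            double wr = c[m];
            double wi = d * s[m];            /* exp(d * i * pi * m / 2) */
            re += x[j].re * wr - x[j].im * wi;
            im += x[j].re * wi + x[j].im * wr;
        }
        y[k].re = scale * re;
        y[k].im = scale * im;
    }
}
```

Applying the sketch with d = -1 and scale 1.0, then with d = +1 and scale 0.25 (that is, 1/N), returns the original four values, which is why a backward scale of 1/N is a common choice when a round trip should reproduce the input.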
The number and types of the data parameters that the function requires may vary depending on the configuration of the descriptor. This variation is accommodated by variable parameters in C and the generic interface in Fortran. The generic Fortran interface to the function is based on a set of specific functions. These functions can check for inconsistency between the required and actual number of parameters. However, the specific functions disregard the type of the actual parameters and instead use the interpretation defined in the descriptor by configuration parameters DFTI_FORWARD_DOMAIN, DFTI_INPUT_STRIDES, DFTI_INPUT_DISTANCE, and so on.

The function returns the zero status when it completes successfully. See Status Checking Functions for more information on the returned status.

Interface and Prototype

! Fortran interface.
! Note that the body provided below only illustrates the list of different
! parameters and the types of dummy parameters. You can rely only on the function
! name following keyword INTERFACE. For the precise definition of the
! interface, see the include/mkl_dfti.f90 file in the Intel MKL directory.
INTERFACE DftiComputeForward
  FUNCTION some_actual_function_1(desc,sSrcDst)
    INTEGER some_actual_function_1
    REAL(4), INTENT(INOUT), DIMENSION(*) :: sSrcDst
    ...
  END FUNCTION some_actual_function_1
  FUNCTION some_actual_function_2(desc,cSrcDst)
    INTEGER some_actual_function_2
    COMPLEX(8), INTENT(INOUT), DIMENSION(*) :: cSrcDst
    ...
  END FUNCTION some_actual_function_2
  FUNCTION some_actual_function_3(desc,sSrcDstRe,sSrcDstIm)
    INTEGER some_actual_function_3
    REAL(4), INTENT(INOUT), DIMENSION(*) :: sSrcDstRe
    REAL(4), INTENT(INOUT), DIMENSION(*) :: sSrcDstIm
    ...
  END FUNCTION some_actual_function_3
  ...
END INTERFACE DftiComputeForward

The Fortran interface requires that the data parameters have the type of assumed-size rank-1 array, even for multidimensional transforms.
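How a multidimensional element can be located inside a rank-1 array may be sketched in C as follows. The helper sketch_flat_index is a hypothetical name, and the layout it uses (an offset in strides[0] followed by one step per dimension) is an assumption made for this illustration; see DFTI_INPUT_STRIDES, DFTI_OUTPUT_STRIDES for the authoritative definition.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of Intel MKL): maps a multi-index
 * idx[0..rank-1] to a position in a rank-1 (flat) array.
 * Assumed layout: strides[0] is the offset of the first element and
 * strides[d + 1] is the step, in elements, along dimension d.
 */
size_t sketch_flat_index(const size_t strides[], const size_t idx[], size_t rank)
{
    size_t pos;

    assert(strides != NULL && idx != NULL);
    pos = strides[0];
    for (size_t d = 0; d < rank; ++d)
        pos += idx[d] * strides[d + 1];
    return pos;
}
```

For a 3-by-4 array stored row by row, the strides would be {0, 4, 1}, so element (2, 1) sits at position 9 of the flat array. Swapping the last two strides to {0, 1, 3} would describe the same data stored column by column.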
The implementations of the FFT interface require the data to be stored linearly in memory with a regular "stride" pattern capable of describing multidimensional array layout (discussed more fully in DFTI_INPUT_STRIDES, DFTI_OUTPUT_STRIDES, see also [3]), and the function requires that the data parameters refer to the first element of the data. Consequently, the data arrays should be specified with the DIMENSION(*) attribute and the storage associated with the actual multidimensional arrays via the EQUIVALENCE statement.

/* C prototype */
MKL_LONG DftiComputeForward( DFTI_DESCRIPTOR_HANDLE, void*, ... );

See Also
DFTI_FORWARD_DOMAIN
DFTI_PLACEMENT
DFTI_PACKED_FORMAT
DFTI_DIMENSION, DFTI_LENGTHS
DFTI_INPUT_DISTANCE, DFTI_OUTPUT_DISTANCE
DFTI_INPUT_STRIDES, DFTI_OUTPUT_STRIDES
DftiComputeBackward

DftiComputeBackward
Computes the backward FFT.

Syntax
Fortran:
status = DftiComputeBackward( desc_handle, x_inout )
status = DftiComputeBackward( desc_handle, y_in, x_out )
status = DftiComputeBackward( desc_handle, xre_inout, xim_inout )
status = DftiComputeBackward( desc_handle, yre_in, yim_in, xre_out, xim_out )
C:
status = DftiComputeBackward(desc_handle, x_inout);
status = DftiComputeBackward(desc_handle, y_in, x_out);
status = DftiComputeBackward(desc_handle, xre_inout, xim_inout);
status = DftiComputeBackward(desc_handle, yre_in, yim_in, xre_out, xim_out);

Input Parameters
desc_handle
  Type: FORTRAN: DFTI_DESCRIPTOR; C: DFTI_DESCRIPTOR_HANDLE
  Description: FFT descriptor.
x_inout, y_in
  Type: FORTRAN: Array REAL(KIND=WP) or COMPLEX(KIND=WP), DIMENSION(*), where type and working precision WP must be consistent with the forward domain and precision specified in the descriptor.
  C: Array of type float or double depending on the precision of the transform, specified in the DFTI_PRECISION configuration setting.
  Description: Data to be transformed in case of a real backward domain, determined by the DFTI_FORWARD_DOMAIN configuration setting.
xre_inout, xim_inout, yre_in, yim_in
  Type: FORTRAN: Array REAL(KIND=WP) or COMPLEX(KIND=WP), DIMENSION(*), where type and working precision WP must be consistent with the forward domain and precision specified in the descriptor.
  C: Array of type float or double depending on the precision of the transform.
  Description: Real and imaginary parts of the data to be transformed in case of a complex backward domain, determined by the DFTI_FORWARD_DOMAIN configuration setting.

The suffix in parameter names corresponds to the value of the configuration parameter DFTI_PLACEMENT as follows:
• _inout to DFTI_INPLACE
• _in or _out to DFTI_NOT_INPLACE

Output Parameters
x_out
  Type: FORTRAN: Array REAL(KIND=WP) or COMPLEX(KIND=WP), DIMENSION(*), where type and working precision WP must be consistent with the forward domain and precision specified in the descriptor.
  C: Array of type float or double depending on the precision of the transform.
  Description: The transformed data in case of a real forward domain, specified in the DFTI_FORWARD_DOMAIN configuration setting.
xre_inout, xim_inout, xre_out, xim_out
  Type: FORTRAN: Array REAL(KIND=WP) or COMPLEX(KIND=WP), DIMENSION(*), where type and working precision WP must be consistent with the forward domain and precision specified in the descriptor.
  C: Array of type float or double depending on the precision of the transform.
  Description: Real and imaginary parts of the transformed data in case of a complex forward domain, specified in the DFTI_FORWARD_DOMAIN configuration setting.
status
  Type: FORTRAN: INTEGER; C: MKL_LONG
  Description: Function completion status.

The suffix in parameter names corresponds to the value of the configuration parameter DFTI_PLACEMENT as follows:
• _inout to DFTI_INPLACE
• _in or _out to DFTI_NOT_INPLACE

Include Files
• FORTRAN 90: mkl_dfti.f90
• C: mkl_dfti.h

Description

The function accepts the descriptor handle parameter and one or more data parameters.
Provided the descriptor is configured and committed successfully, the DftiComputeBackward function computes the inverse FFT, that is, the transform with the plus sign in the exponent, d = +1.

The number and types of the data parameters that the function requires may vary depending on the configuration of the descriptor. This variation is accommodated by variable parameters in C and the generic interface in Fortran. The generic Fortran interface to the computation function is based on a set of specific functions. These functions can check for inconsistency between the required and actual number of parameters. However, the specific functions disregard the type of the actual parameters and instead use the interpretation defined in the descriptor by configuration parameters DFTI_FORWARD_DOMAIN, DFTI_INPUT_STRIDES, DFTI_INPUT_DISTANCE, and so on.

The function returns the zero status when it completes successfully. See Status Checking Functions for more information on the returned status.

Interface and Prototype

! Fortran interface.
! Note that the body provided below only illustrates the list of different
! parameters and the types of dummy parameters. You can rely only on the function
! name following keyword INTERFACE. For the precise definition of the
! interface, see the include/mkl_dfti.f90 file in the Intel MKL directory.
INTERFACE DftiComputeBackward
  FUNCTION some_actual_function_1(desc,sSrcDst)
    INTEGER some_actual_function_1
    REAL(4), INTENT(INOUT), DIMENSION(*) :: sSrcDst
    ...
  END FUNCTION some_actual_function_1
  FUNCTION some_actual_function_2(desc,cSrcDst)
    INTEGER some_actual_function_2
    COMPLEX(8), INTENT(INOUT), DIMENSION(*) :: cSrcDst
    ...
  END FUNCTION some_actual_function_2
  FUNCTION some_actual_function_3(desc,sSrcDstRe,sSrcDstIm)
    INTEGER some_actual_function_3
    REAL(4), INTENT(INOUT), DIMENSION(*) :: sSrcDstRe
    REAL(4), INTENT(INOUT), DIMENSION(*) :: sSrcDstIm
    ...
  END FUNCTION some_actual_function_3
  ...
END INTERFACE DftiComputeBackward

The Fortran interface requires that the data parameters have the type of assumed-size rank-1 array, even for multidimensional transforms. The implementations of the FFT interface require the data to be stored linearly in memory with a regular "stride" pattern capable of describing multidimensional array layout (discussed more fully in DFTI_INPUT_STRIDES, DFTI_OUTPUT_STRIDES, see also [3]), and the function requires that the data parameters refer to the first element of the data. Consequently, the data arrays should be specified with the DIMENSION(*) attribute and the storage associated with the actual multidimensional arrays via the EQUIVALENCE statement.

/* C prototype */
MKL_LONG DftiComputeBackward( DFTI_DESCRIPTOR_HANDLE, void *, ... );

See Also
DFTI_FORWARD_DOMAIN
DFTI_PLACEMENT
DFTI_PACKED_FORMAT
DFTI_DIMENSION, DFTI_LENGTHS
DFTI_INPUT_DISTANCE, DFTI_OUTPUT_DISTANCE
DFTI_INPUT_STRIDES, DFTI_OUTPUT_STRIDES
DftiComputeForward

Descriptor Configuration Functions

There are two functions in this category: the value setting function DftiSetValue sets one particular configuration parameter to an appropriate value, and the value getting function DftiGetValue reads the value of one particular configuration parameter. While all configuration parameters are readable, you cannot set a few of them. Some of these contain fixed information of a particular implementation, such as the version number; others contain dynamic information, which is derived by the implementation during execution of one of the functions. See Configuration Settings for details.

DftiSetValue
Sets one particular configuration parameter with the specified configuration value.
Syntax
Fortran:
status = DftiSetValue( desc_handle, config_param, config_val )
C:
status = DftiSetValue(desc_handle, config_param, config_val);

Include Files
• FORTRAN 90: mkl_dfti.f90
• C: mkl_dfti.h

Input Parameters
desc_handle
  Type: FORTRAN: DFTI_DESCRIPTOR; C: DFTI_DESCRIPTOR_HANDLE
  Description: FFT descriptor.
config_param
  Type: FORTRAN: INTEGER; C: enum
  Description: Configuration parameter.
config_val
  Type: Depends on the configuration parameter.
  Description: Configuration value.

Output Parameters
desc_handle
  Type: FORTRAN: DFTI_DESCRIPTOR; C: DFTI_DESCRIPTOR_HANDLE
  Description: Updated FFT descriptor.
status
  Type: FORTRAN: INTEGER; C: MKL_LONG
  Description: Function completion status.

Description

This function sets one particular configuration parameter with the specified configuration value. Each configuration parameter is a named constant, and the configuration value must have the corresponding type, which can be a named constant or a native type. For available configuration parameters and the corresponding configuration values, see:
• DFTI_PRECISION
• DFTI_FORWARD_DOMAIN
• DFTI_DIMENSION, DFTI_LENGTHS
• DFTI_PLACEMENT
• DFTI_FORWARD_SCALE, DFTI_BACKWARD_SCALE
• DFTI_NUMBER_OF_USER_THREADS
• DFTI_INPUT_STRIDES, DFTI_OUTPUT_STRIDES
• DFTI_NUMBER_OF_TRANSFORMS
• DFTI_INPUT_DISTANCE, DFTI_OUTPUT_DISTANCE
• DFTI_COMPLEX_STORAGE, DFTI_REAL_STORAGE, DFTI_CONJUGATE_EVEN_STORAGE
• DFTI_PACKED_FORMAT
• DFTI_WORKSPACE
• DFTI_ORDERING

The DftiSetValue function cannot be used to change configuration parameters DFTI_FORWARD_DOMAIN, DFTI_PRECISION, DFTI_DIMENSION, and DFTI_LENGTHS. Use the DftiCreateDescriptor function to set them.

The function returns the zero status when it completes successfully. See Status Checking Functions for more information on the returned status.

Interface and Prototype

! Fortran interface
INTERFACE DftiSetValue
!Note that the body provided here is to illustrate the different
!argument list and types of dummy arguments. The interface
!does not guarantee what the actual function names are.
!Users can only rely on the function name following the
!keyword INTERFACE
  FUNCTION some_actual_function_6_INTVAL( Desc_Handle, Config_Param, INTVAL )
    INTEGER :: some_actual_function_6_INTVAL
    Type(DFTI_DESCRIPTOR), POINTER :: Desc_Handle
    INTEGER, INTENT(IN) :: Config_Param
    INTEGER, INTENT(IN) :: INTVAL
  END FUNCTION some_actual_function_6_INTVAL
  FUNCTION some_actual_function_6_SGLVAL( Desc_Handle, Config_Param, SGLVAL )
    INTEGER :: some_actual_function_6_SGLVAL
    Type(DFTI_DESCRIPTOR), POINTER :: Desc_Handle
    INTEGER, INTENT(IN) :: Config_Param
    REAL, INTENT(IN) :: SGLVAL
  END FUNCTION some_actual_function_6_SGLVAL
  FUNCTION some_actual_function_6_DBLVAL( Desc_Handle, Config_Param, DBLVAL )
    INTEGER :: some_actual_function_6_DBLVAL
    Type(DFTI_DESCRIPTOR), POINTER :: Desc_Handle
    INTEGER, INTENT(IN) :: Config_Param
    REAL (KIND(0D0)), INTENT(IN) :: DBLVAL
  END FUNCTION some_actual_function_6_DBLVAL
  FUNCTION some_actual_function_6_INTVEC( Desc_Handle, Config_Param, INTVEC )
    INTEGER :: some_actual_function_6_INTVEC
    Type(DFTI_DESCRIPTOR), POINTER :: Desc_Handle
    INTEGER, INTENT(IN) :: Config_Param
    INTEGER, INTENT(IN) :: INTVEC(*)
  END FUNCTION some_actual_function_6_INTVEC
  FUNCTION some_actual_function_6_CHARS( Desc_Handle, Config_Param, CHARS )
    INTEGER :: some_actual_function_6_CHARS
    Type(DFTI_DESCRIPTOR), POINTER :: Desc_Handle
    INTEGER, INTENT(IN) :: Config_Param
    CHARACTER(*), INTENT(IN) :: CHARS
  END FUNCTION some_actual_function_6_CHARS
END INTERFACE DftiSetValue

/* C prototype */
MKL_LONG DftiSetValue( DFTI_DESCRIPTOR_HANDLE, DFTI_CONFIG_PARAM , ... );

See Also
Configuration Settings
DftiCreateDescriptor
DftiGetValue

DftiGetValue
Gets the configuration value of one particular configuration parameter.
Syntax
Fortran:
status = DftiGetValue( desc_handle, config_param, config_val )
C:
status = DftiGetValue(desc_handle, config_param, &config_val);

Include Files
• FORTRAN 90: mkl_dfti.f90
• C: mkl_dfti.h

Input Parameters
desc_handle
  Type: FORTRAN: DFTI_DESCRIPTOR; C: DFTI_DESCRIPTOR_HANDLE
  Description: FFT descriptor.
config_param
  Type: FORTRAN: INTEGER; C: enum
  Description: Configuration parameter. See Table "Configuration Parameters" for allowable values of config_param.

Output Parameters
config_val
  Type: Depends on the configuration parameter.
  Description: Configuration value.
status
  Type: FORTRAN: INTEGER; C: MKL_LONG
  Description: Function completion status.

Description

This function gets the configuration value of one particular configuration parameter. Each configuration parameter is a named constant, and the configuration value must have the corresponding type, which can be a named constant or a native type. For available configuration parameters and the corresponding configuration values, see:
• DFTI_PRECISION
• DFTI_FORWARD_DOMAIN
• DFTI_DIMENSION, DFTI_LENGTHS
• DFTI_PLACEMENT
• DFTI_FORWARD_SCALE, DFTI_BACKWARD_SCALE
• DFTI_NUMBER_OF_USER_THREADS
• DFTI_INPUT_STRIDES, DFTI_OUTPUT_STRIDES
• DFTI_NUMBER_OF_TRANSFORMS
• DFTI_INPUT_DISTANCE, DFTI_OUTPUT_DISTANCE
• DFTI_COMPLEX_STORAGE, DFTI_REAL_STORAGE, DFTI_CONJUGATE_EVEN_STORAGE
• DFTI_PACKED_FORMAT
• DFTI_WORKSPACE
• DFTI_COMMIT_STATUS
• DFTI_ORDERING

The function returns the zero status when it completes successfully. See Status Checking Functions for more information on the returned status.

Interface and Prototype

! Fortran interface
INTERFACE DftiGetValue
!Note that the body provided here is to illustrate the different
!argument list and types of dummy arguments. The interface
!does not guarantee what the actual function names are.
!Users can only rely on the function name following the
!keyword INTERFACE
  FUNCTION some_actual_function_7_INTVAL( Desc_Handle, Config_Param, INTVAL )
    INTEGER :: some_actual_function_7_INTVAL
    Type(DFTI_DESCRIPTOR), POINTER :: Desc_Handle
    INTEGER, INTENT(IN) :: Config_Param
    INTEGER, INTENT(OUT) :: INTVAL
  END FUNCTION some_actual_function_7_INTVAL
  FUNCTION some_actual_function_7_SGLVAL( Desc_Handle, Config_Param, SGLVAL )
    INTEGER :: some_actual_function_7_SGLVAL
    Type(DFTI_DESCRIPTOR), POINTER :: Desc_Handle
    INTEGER, INTENT(IN) :: Config_Param
    REAL, INTENT(OUT) :: SGLVAL
  END FUNCTION some_actual_function_7_SGLVAL
  FUNCTION some_actual_function_7_DBLVAL( Desc_Handle, Config_Param, DBLVAL )
    INTEGER :: some_actual_function_7_DBLVAL
    Type(DFTI_DESCRIPTOR), POINTER :: Desc_Handle
    INTEGER, INTENT(IN) :: Config_Param
    REAL (KIND(0D0)), INTENT(OUT) :: DBLVAL
  END FUNCTION some_actual_function_7_DBLVAL
  FUNCTION some_actual_function_7_INTVEC( Desc_Handle, Config_Param, INTVEC )
    INTEGER :: some_actual_function_7_INTVEC
    Type(DFTI_DESCRIPTOR), POINTER :: Desc_Handle
    INTEGER, INTENT(IN) :: Config_Param
    INTEGER, INTENT(OUT) :: INTVEC(*)
  END FUNCTION some_actual_function_7_INTVEC
  FUNCTION some_actual_function_7_INTPNT( Desc_Handle, Config_Param, INTPNT )
    INTEGER :: some_actual_function_7_INTPNT
    Type(DFTI_DESCRIPTOR), POINTER :: Desc_Handle
    INTEGER, INTENT(IN) :: Config_Param
    INTEGER, DIMENSION(*), POINTER :: INTPNT
  END FUNCTION some_actual_function_7_INTPNT
  FUNCTION some_actual_function_7_CHARS( Desc_Handle, Config_Param, CHARS )
    INTEGER :: some_actual_function_7_CHARS
    Type(DFTI_DESCRIPTOR), POINTER :: Desc_Handle
    INTEGER, INTENT(IN) :: Config_Param
    CHARACTER(*), INTENT(OUT):: CHARS
  END FUNCTION some_actual_function_7_CHARS
END INTERFACE DftiGetValue

/* C prototype */
MKL_LONG DftiGetValue( DFTI_DESCRIPTOR_HANDLE, DFTI_CONFIG_PARAM , ...
);

See Also
Configuration Settings
DftiSetValue

Status Checking Functions

All of the descriptor manipulation, FFT computation, and descriptor configuration functions return an integer value denoting the status of the operation. Two functions serve to check the status. The first function is a logical function that checks if the status reflects an error of a predefined class, and the second is an error message function that returns a character string.

DftiErrorClass
Checks whether the status reflects an error of a predefined class.

Syntax
Fortran:
predicate = DftiErrorClass( status, error_class )
C:
predicate = DftiErrorClass(status, error_class);

Include Files
• FORTRAN 90: mkl_dfti.f90
• C: mkl_dfti.h

Input Parameters
status
  Type: FORTRAN: INTEGER; C: MKL_LONG
  Description: Completion status of an FFT function.
error_class
  Type: FORTRAN: INTEGER; C: MKL_LONG
  Description: Predefined error class.

Output Parameters
predicate
  Type: FORTRAN: LOGICAL; C: MKL_LONG
  Description: Result of checking.

Description

The FFT interface in Intel MKL provides a set of predefined error classes listed in Table "Predefined Error Classes". They are named constants and have the type INTEGER in Fortran and MKL_LONG in C.

Predefined Error Classes
Named Constants    Comments
DFTI_NO_ERROR    No error. The zero status belongs to this class.
DFTI_MEMORY_ERROR    Usually associated with memory allocation
DFTI_INVALID_CONFIGURATION    Invalid settings of one or more configuration parameters
DFTI_INCONSISTENT_CONFIGURATION    Inconsistent configuration or input parameters
DFTI_NUMBER_OF_THREADS_ERROR    Number of OMP threads in the computation function is not equal to the number of OMP threads in the initialization stage (commit function)
DFTI_MULTITHREADED_ERROR    Usually associated with a value that OMP routines return in case of errors
DFTI_BAD_DESCRIPTOR    Descriptor is unusable for computation
DFTI_UNIMPLEMENTED    Unimplemented legitimate settings; implementation dependent
DFTI_MKL_INTERNAL_ERROR    Internal library error
DFTI_1D_LENGTH_EXCEEDS_INT32    Length of one of dimensions exceeds 2^32 - 1 (4 bytes).

The DftiErrorClass function returns a non-zero value in C or the value of .TRUE. in Fortran if the status belongs to a predefined error class. To check whether a function call was successful, call DftiErrorClass with a specific error class. However, the zero value of the status belongs to the DFTI_NO_ERROR class and thus the zero status indicates successful completion of an operation. See Example "Using Status Checking Functions" for an illustration of correct use of the status checking functions.

NOTE It is incorrect to directly compare a status with a predefined class.

Interface and Prototype

! Fortran interface
INTERFACE DftiErrorClass
!Note that the body provided here is to illustrate the different
!argument list and types of dummy arguments. The interface
!does not guarantee what the actual function names are.
!Users can only rely on the function name following the
!keyword INTERFACE
  FUNCTION some_actual_function_8( Status, Error_Class )
    LOGICAL some_actual_function_8
    INTEGER, INTENT(IN) :: Status, Error_Class
  END FUNCTION some_actual_function_8
END INTERFACE DftiErrorClass

/* C prototype */
MKL_LONG DftiErrorClass( MKL_LONG , MKL_LONG );

DftiErrorMessage
Generates an error message.

Syntax
Fortran:
error_message = DftiErrorMessage( status )
C:
error_message = DftiErrorMessage(status);

Include Files
• FORTRAN 90: mkl_dfti.f90
• C: mkl_dfti.h

Input Parameters
status
  Type: FORTRAN: INTEGER; C: MKL_LONG
  Description: Completion status of a function.

Output Parameters
error_message
  Type: FORTRAN: CHARACTER(LEN=DFTI_MAX_MESSAGE_LENGTH); C: Array of char
  Description: The character string with the error message.

Description

The error message function generates an error message character string. In Fortran, use a character string of length DFTI_MAX_MESSAGE_LENGTH as a target for the error message. In C, the function returns a pointer to a constant character string, that is, a character array with terminating '\0' character, and you do not need to free this pointer.

Example "Using Status Checking Functions" shows how this function can be used.

Interface and Prototype

! Fortran interface
INTERFACE DftiErrorMessage
!Note that the body provided here is to illustrate the different
!argument list and types of dummy arguments. The interface
!does not guarantee what the actual function names are.
!Users can only rely on the function name following the
!keyword INTERFACE
  FUNCTION some_actual_function_9( Status )
    CHARACTER(LEN=DFTI_MAX_MESSAGE_LENGTH) some_actual_function_9
    INTEGER, INTENT(IN) :: Status
  END FUNCTION some_actual_function_9
END INTERFACE DftiErrorMessage

/* C prototype */
char *DftiErrorMessage( MKL_LONG );

Configuration Settings

Each of the configuration parameters is identified by a named constant in the MKL_DFTI module. In C, these named constants have the enumeration type DFTI_CONFIG_PARAM.

All the Intel MKL FFT configuration parameters are readable. Some of them are read-only, while others can be set using the DftiCreateDescriptor or DftiSetValue function.

Values of the configuration parameters fall into the following groups:
• Values that have native data types. For example, the number of simultaneous transforms requested has an integer value, while the scale factor for a forward transform is a single-precision number.
• Values that are discrete in nature and are provided in the MKL_DFTI module as named constants. For example, the domain of the forward transform requires values to be named constants. In C, the named constants for configuration values have the enumeration type DFTI_CONFIG_VALUE.

Table "Configuration Parameters" summarises the information on configuration parameters, along with their types and values. For more details of each configuration parameter, see the subsection describing this parameter.

Configuration Parameters

Most common configuration parameters, no default, must be set explicitly by DftiCreateDescriptor
DFTI_PRECISION
  Type/Value: Named constant DFTI_SINGLE or DFTI_DOUBLE
  Comments: Precision of the computation.
DFTI_FORWARD_DOMAIN
  Type/Value: Named constant DFTI_COMPLEX or DFTI_REAL
  Comments: Type of the transform.
DFTI_DIMENSION
  Type/Value: Integer scalar
  Comments: Dimension of the transform.
DFTI_LENGTHS
  Type/Value: Integer scalar/array
  Comments: Lengths of each dimension.
Common configuration parameters, settable by DftiSetValue

DFTI_PLACEMENT
    Type/Value: Named constant DFTI_INPLACE or DFTI_NOT_INPLACE
    Comments: Defines whether the result overwrites the input data. Default value: DFTI_INPLACE.

DFTI_FORWARD_SCALE
    Type/Value: Floating-point scalar
    Comments: Scale factor for the forward transform. Default value: 1.0. Precision of the value should be the same as defined by DFTI_PRECISION.

DFTI_BACKWARD_SCALE
    Type/Value: Floating-point scalar
    Comments: Scale factor for the backward transform. Default value: 1.0. Precision of the value should be the same as defined by DFTI_PRECISION.

DFTI_NUMBER_OF_USER_THREADS
    Type/Value: Integer scalar
    Comments: Number of threads that concurrently use the same descriptor to compute FFT.

DFTI_DESCRIPTOR_NAME
    Type/Value: Character string
    Comments: Assigns a name to a descriptor. Assumed length of the string is DFTI_MAX_NAME_LENGTH. Default value: empty string.

Data layout configuration parameters for single and multiple transforms. Settable by DftiSetValue

DFTI_INPUT_STRIDES
    Type/Value: Integer array
    Comments: Defines the input data layout.

DFTI_OUTPUT_STRIDES
    Type/Value: Integer array
    Comments: Defines the output data layout.

DFTI_NUMBER_OF_TRANSFORMS
    Type/Value: Integer scalar
    Comments: Number of transforms. Default value: 1.

DFTI_INPUT_DISTANCE
    Type/Value: Integer scalar
    Comments: Defines the distance between input data sets for multiple transforms. Default value: 0.

DFTI_OUTPUT_DISTANCE
    Type/Value: Integer scalar
    Comments: Defines the distance between output data sets for multiple transforms. Default value: 0.

DFTI_COMPLEX_STORAGE
    Type/Value: Named constant DFTI_COMPLEX_COMPLEX or DFTI_REAL_REAL
    Comments: Defines whether the real and imaginary parts of data for a complex transform are interleaved in one array or split in two arrays. Default value: DFTI_COMPLEX_COMPLEX.

DFTI_REAL_STORAGE
    Type/Value: Named constant DFTI_REAL_REAL
    Comments: Defines how real data for a real transform is stored. Only the DFTI_REAL_REAL value is supported.
DFTI_CONJUGATE_EVEN_STORAGE
    Type/Value: Named constant DFTI_COMPLEX_COMPLEX or DFTI_COMPLEX_REAL
    Comments: Defines whether the complex data in the backward domain of a real transform is stored as complex elements or as real elements. For the default value, see the detailed description.

DFTI_PACKED_FORMAT
    Type/Value: Named constant DFTI_CCE_FORMAT, DFTI_CCS_FORMAT, DFTI_PACK_FORMAT, or DFTI_PERM_FORMAT
    Comments: Defines the layout of real elements in the backward domain of a one-dimensional or two-dimensional real transform.

Advanced configuration parameters, settable by DftiSetValue

DFTI_WORKSPACE
    Type/Value: Named constant DFTI_ALLOW or DFTI_AVOID
    Comments: Defines whether the library should prefer algorithms using additional memory. Default value: DFTI_ALLOW.

DFTI_ORDERING
    Type/Value: Named constant DFTI_ORDERED or DFTI_BACKWARD_SCRAMBLED
    Comments: Defines whether the result of a complex transform is ordered or permuted. Default value: DFTI_ORDERED.

Read-only configuration parameters

DFTI_COMMIT_STATUS
    Type/Value: Named constant DFTI_UNCOMMITTED or DFTI_COMMITTED
    Comments: Readiness of the descriptor for computation.

DFTI_VERSION
    Type/Value: String
    Comments: Version of Intel MKL. Assumed length of the string is DFTI_VERSION_LENGTH.

DFTI_PRECISION

The configuration parameter DFTI_PRECISION denotes the floating-point precision in which the transform is to be carried out. A setting of DFTI_SINGLE stands for single precision, and a setting of DFTI_DOUBLE stands for double precision. The data must be presented in this precision, the computation is carried out in this precision, and the result is delivered in this precision.

DFTI_PRECISION does not have a default value. Set it explicitly by calling the DftiCreateDescriptor function.

NOTE Fortran module MKL_DFTI also defines named constants DFTI_SINGLE_R and DFTI_DOUBLE_R, with the same semantics as DFTI_SINGLE and DFTI_DOUBLE, respectively. Do not use these constants to set the DFTI_PRECISION configuration parameter. Use them only as described in section DftiCreateDescriptor.
See Also
DFTI_FORWARD_DOMAIN
DFTI_DIMENSION, DFTI_LENGTHS
DftiCreateDescriptor

DFTI_FORWARD_DOMAIN

The general form of a discrete Fourier transform is

$$ z_{k_1, k_2, \ldots, k_d} = \sigma \sum_{j_d=0}^{n_d-1} \cdots \sum_{j_2=0}^{n_2-1} \sum_{j_1=0}^{n_1-1} w_{j_1, j_2, \ldots, j_d} \exp\left( \delta \, i \, 2\pi \sum_{l=1}^{d} \frac{j_l k_l}{n_l} \right) $$

where w is the input sequence and z is the output sequence, both indexed by $k_l = 0, \ldots, n_l - 1$ for $l = 1, \ldots, d$. The scale factor $\sigma$ is an arbitrary real number with the default value of 1.0, and $\delta$ is the sign in the exponent: $\delta = -1$ for the forward transform and $\delta = +1$ for the backward transform.

The Intel MKL implementation of the FFT algorithm, used for fast computation of discrete Fourier transforms, supports forward transforms on input sequences of two domains, as specified by configuration parameter DFTI_FORWARD_DOMAIN: general complex-valued sequences (DFTI_COMPLEX domain) and general real-valued sequences (DFTI_REAL domain). The forward transform maps the forward domain to the corresponding backward domain, as shown in Table "Correspondence of Forward and Backward Domain".

The conjugate-even domain covers complex-valued sequences with the symmetry property:

$$ z_{k_1, k_2, \ldots, k_d} = \operatorname{conjg}\left( z_{n_1 - k_1,\, n_2 - k_2,\, \ldots,\, n_d - k_d} \right) $$

where the index arithmetic is performed modulo respective size, that is,

$$ z_{\ldots,\, \mathrm{expr}_l,\, \ldots} \equiv z_{\ldots,\, \operatorname{mod}(\mathrm{expr}_l,\, n_l),\, \ldots} $$

and therefore

$$ z_{\ldots,\, n_l,\, \ldots} \equiv z_{\ldots,\, 0,\, \ldots} $$

Due to this property of conjugate-even sequences, only a part of such a sequence is stored in the computer memory, as described in DFTI_CONJUGATE_EVEN_STORAGE.

Correspondence of Forward and Backward Domain

  Forward Domain            Implied Backward Domain
  Complex (DFTI_COMPLEX)    Complex (DFTI_COMPLEX)
  Real (DFTI_REAL)          Conjugate-even

DFTI_FORWARD_DOMAIN does not have a default value. Set it explicitly by calling the DftiCreateDescriptor function.

See Also
DFTI_PRECISION
DFTI_DIMENSION, DFTI_LENGTHS
DftiCreateDescriptor

DFTI_DIMENSION, DFTI_LENGTHS

The dimension of the transform is a positive integer value represented in an integer scalar of Integer data type in Fortran and MKL_LONG data type in C.
For a one-dimensional transform, the transform length is specified by a positive integer value represented in an integer scalar of Integer data type in Fortran and MKL_LONG data type in C. For a multi-dimensional (≥ 2) transform, the lengths of each of the dimensions are supplied in an integer array (Integer data type in Fortran and MKL_LONG data type in C).

DFTI_DIMENSION and DFTI_LENGTHS do not have a default value. To set them, use the DftiCreateDescriptor function and not the DftiSetValue function.

See Also
DFTI_FORWARD_DOMAIN
DFTI_PRECISION
DftiCreateDescriptor
DftiSetValue

DFTI_PLACEMENT

By default, the computational functions overwrite the input data with the output result. That is, the default setting of the configuration parameter DFTI_PLACEMENT is DFTI_INPLACE. You can change that by setting it to DFTI_NOT_INPLACE.

NOTE With DFTI_NOT_INPLACE, the input and output data sets must have no common elements.

See Also
DftiSetValue

DFTI_FORWARD_SCALE, DFTI_BACKWARD_SCALE

The forward transform and backward transform are each associated with a scale factor σ of its own, with the default value of 1. You can specify the scale factors using one or both of the configuration parameters DFTI_FORWARD_SCALE and DFTI_BACKWARD_SCALE. For example, for a one-dimensional transform of length n, you can use the default scale of 1 for the forward transform and set the scale factor for the backward transform to be 1/n, thus making the backward transform the inverse of the forward transform.

Set the scale factor configuration parameter using a real floating-point data type of the same precision as the value for DFTI_PRECISION.

NOTE For inquiry of the scale factor with the DftiGetValue function in C, the config_val parameter must have the same floating-point precision as the descriptor.

See Also
DftiSetValue
DFTI_PRECISION
DftiGetValue

DFTI_NUMBER_OF_USER_THREADS

Use one of the following techniques to parallelize your application:

a.
You specify the parallel mode within the FFT module of Intel MKL instead of creating threads in your application. See Intel MKL User's Guide for more information on how to do this. See also Example "Using Intel MKL Internal Threading Mode".

b. You create threads in the application yourself and have each thread perform all stages of FFT implementation, including descriptor initialization, FFT computation, and descriptor deallocation. In this case, each descriptor is used only within its corresponding thread, and you set single-threaded mode for Intel MKL. See Example "Using Parallel Mode with Multiple Descriptors Initialized in a Parallel Region".

c. You create threads in the application yourself after initializing all FFT descriptors. This implies that threading is employed for parallel FFT computation only, and the descriptors are released upon return from the parallel region. In this case, each descriptor is used only within its corresponding thread. You must explicitly set the single-threaded mode for Intel MKL; otherwise, the actual number of threads may differ from one, because the DftiCommitDescriptor function is not in a parallel region. See Example "Using Parallel Mode with Multiple Descriptors Initialized in One Thread".

d. You create threads in the application yourself after initializing the only FFT descriptor. This implies that threading is employed for parallel FFT computation only, and the descriptor is released upon return from the parallel region. In this case, each thread uses the same descriptor. See Example "Using Parallel Mode with a Common Descriptor".

In cases "a", "b", and "c", listed above, set the parameter DFTI_NUMBER_OF_USER_THREADS to 1 (its default value), since each particular descriptor instance is used only in a single thread.
In case "d", use the DftiSetValue() function to set the DFTI_NUMBER_OF_USER_THREADS to the actual number of FFT computation threads, because multiple threads will be using the same descriptor. If this setting is not done, your program will work incorrectly or fail, since the descriptor contains individual data for each thread. WARNING • Avoid parallelizing your program and employing the Intel MKL internal threading simultaneously because this will slow down the performance. Note that in case "d" above, FFT computation is automatically initiated in a single-threading mode. • Do not change the number of threads after the DftiCommitDescriptor() function completes FFT initialization. Optimization Notice Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804 See Also DftiSetValue DFTI_INPUT_STRIDES, DFTI_OUTPUT_STRIDES The FFT interface provides configuration parameters that define the layout of multidimensional data in the computer memory. For d-dimensional data set X defined by dimensions N1 x N2 x ... x Nd, the layout describes where a particular element X(k1, k2, ..., kd) of the data set is located. 
The memory address of the element X(k1, k2, ..., kd) is expressed by the formula

    address of X(k1, k2, ..., kd) = address of X(0, 0, ..., 0) + offset
                                  = address of X(0, 0, ..., 0) + s0 + k1*s1 + k2*s2 + ... + kd*sd,

where s0 is the displacement and s1, ..., sd are generalized strides. The configuration parameters DFTI_INPUT_STRIDES and DFTI_OUTPUT_STRIDES enable you to get and set these values. The configuration value is an array of values (s0, s1, ..., sd) of INTEGER data type in Fortran and MKL_LONG data type in C.

The offset is counted in elements of the data type defined by the descriptor configuration (rather than by the type of the variable passed to the computation functions). Specifically, the DFTI_FORWARD_DOMAIN, DFTI_COMPLEX_STORAGE, and DFTI_CONJUGATE_EVEN_STORAGE configuration parameters define the type of the elements as shown in Table "Assumed Element Types of the Input/Output Data":

Assumed Element Types of the Input/Output Data

DFTI_FORWARD_DOMAIN=DFTI_COMPLEX, DFTI_COMPLEX_STORAGE=DFTI_COMPLEX_COMPLEX
    Element type in the forward domain: Complex
    Element type in the backward domain: Complex

DFTI_FORWARD_DOMAIN=DFTI_COMPLEX, DFTI_COMPLEX_STORAGE=DFTI_REAL_REAL
    Element type in the forward domain: Real
    Element type in the backward domain: Real

DFTI_FORWARD_DOMAIN=DFTI_REAL, DFTI_CONJUGATE_EVEN_STORAGE=DFTI_COMPLEX_REAL
    Element type in the forward domain: Real
    Element type in the backward domain: Real

DFTI_FORWARD_DOMAIN=DFTI_REAL, DFTI_CONJUGATE_EVEN_STORAGE=DFTI_COMPLEX_COMPLEX
    Element type in the forward domain: Real
    Element type in the backward domain: Complex

The DFTI_INPUT_STRIDES configuration parameter describes the layout of the input data, and the element type is defined by the forward domain for the DftiComputeForward function, and by the backward domain for the DftiComputeBackward function. The DFTI_OUTPUT_STRIDES configuration parameter describes the layout of the output data, and the element type is defined by the backward domain for the DftiComputeForward function, and by the forward domain for the DftiComputeBackward function.
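As a rough illustration of the address formula above, the following C sketch computes an element offset from a stride array (s0, s1, ..., sd) and builds the strides of an unpadded row-major C array. It does not call Intel MKL: the function names element_offset and default_c_strides are hypothetical, and plain long stands in for MKL_LONG.

```c
#include <assert.h>
#include <stddef.h>

/* Offset, in elements, of X(k[0], ..., k[d-1]) from X(0, ..., 0).
   strides[0] is the displacement s0; strides[1..d] are s1..sd.
   Hypothetical helper, not part of the MKL API. */
long element_offset(size_t d, const long *strides, const long *k)
{
    assert(d >= 1 && strides != NULL && k != NULL);
    long offset = strides[0];
    for (size_t i = 0; i < d; ++i)
        offset += k[i] * strides[i + 1];
    return offset;
}

/* Fill strides[0..d] for an unpadded row-major (C) array whose
   dimensions are dims[0..d-1], with dims[d-1] varying fastest. */
void default_c_strides(size_t d, const long *dims, long *strides)
{
    assert(d >= 1 && dims != NULL && strides != NULL);
    strides[0] = 0;   /* no displacement */
    strides[d] = 1;   /* innermost dimension is contiguous */
    for (size_t i = d - 1; i >= 1; --i)
        strides[i] = strides[i + 1] * dims[i];
}
```

For a 2-by-3-by-4 array, default_c_strides yields (0, 12, 4, 1), so element (1, 2, 3) sits 23 elements past the first one. The unit of "element" is whatever the descriptor configuration implies, as the table above describes.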
For in-place transforms, the configuration set by DFTI_OUTPUT_STRIDES is ignored except when the element types in forward and backward domains are different. If they are different, set DFTI_OUTPUT_STRIDES explicitly (even though the transform is in-place). For in-place transforms, the configuration must be consistent, that is, the locations of the first elements in input and output must coincide in each dimension. The DFTI_PLACEMENT configuration parameter defines whether the transform is in-place or out-of-place.

The configuration parameters define the layout of input and output data, and not the forward-domain and backward-domain data. If the data layouts in forward domain and backward domain differ, set DFTI_INPUT_STRIDES and DFTI_OUTPUT_STRIDES explicitly and then commit the descriptor before calling computation functions.

The FFT interface supports both positive and negative stride values. If you use negative strides, set the displacement of the data as follows:

$$ s_0 = \sum_{i=1}^{d} (N_i - 1) \cdot \max(0, -s_i) $$

The default setting of strides in a general multi-dimensional case assumes that the array that contains the data has no padding. The order of the strides depends on the programming language. For example:

    /* C/C++ */
    MKL_LONG dims[] = { nd, …, n2, n1 };
    DftiCreateDescriptor( &hand, precision, domain, d, dims );
    // The above call assumes data declaration: type X[nd]…[n2][n1]
    // Default strides are { 0, n_{d-1}*…*n2*n1, …, n2*n1, n1, 1 }

    ! Fortran
    INTEGER :: dims(d) = [n1, n2, …, nd]
    status = DftiCreateDescriptor( hand, precision, domain, d, dims)
    ! The above call assumes data declaration: type X(n1,n2,…,nd)
    !
    ! Default strides are [ 0, 1, n1, n1*n2, …, n1*n2*…*n_{d-1} ]

Note that in case of a real FFT (DFTI_FORWARD_DOMAIN=DFTI_REAL), where different data layouts in the backward domain are available (see DFTI_PACKED_FORMAT), the default value of the strides is not intuitive for the recommended CCE format (configuration setting DFTI_CONJUGATE_EVEN_STORAGE=DFTI_COMPLEX_COMPLEX). In case of an in-place real transform with the CCE format, set the strides explicitly, as follows:

    /* C/C++ */
    MKL_LONG dims[] = { nd, …, n2, n1 };
    MKL_LONG rstrides[] = { 0, 2*n_{d-1}*…*n2*(n1/2+1), …, 2*n2*(n1/2+1), 2*(n1/2+1), 1 };
    MKL_LONG cstrides[] = { 0, n_{d-1}*…*n2*(n1/2+1), …, n2*(n1/2+1), (n1/2+1), 1 };
    DftiCreateDescriptor( &hand, precision, DFTI_REAL, d, dims );
    DftiSetValue(hand, DFTI_CONJUGATE_EVEN_STORAGE, DFTI_COMPLEX_COMPLEX);
    // Set the strides appropriately for forward/backward transform

    ! Fortran
    INTEGER :: dims(d) = [n1, n2, …, nd]
    INTEGER :: rstrides(1+d) = [0, 1, 2*(n1/2+1), 2*(n1/2+1)*n2, … ]
    INTEGER :: cstrides(1+d) = [0, 1, (n1/2+1), (n1/2+1)*n2, … ]
    status = DftiCreateDescriptor( hand, precision, domain, d, dims)
    status = DftiSetValue( hand, DFTI_CONJUGATE_EVEN_STORAGE, DFTI_COMPLEX_COMPLEX)
    ! Set the strides appropriately for forward/backward transform

See Also
DFTI_FORWARD_DOMAIN
DFTI_PLACEMENT
DftiSetValue
DftiCommitDescriptor
DftiComputeForward
DftiComputeBackward

DFTI_NUMBER_OF_TRANSFORMS

In some situations, you may need to perform a number of FFTs of the same dimension and lengths. For example, you may need to transform a number of one-dimensional data sets of the same length. To specify this number, use the DFTI_NUMBER_OF_TRANSFORMS parameter, which has the default value of 1. You can set this parameter to a positive integer value using the Integer data type in Fortran and MKL_LONG data type in C.

NOTE The data sets to be transformed must not have common elements.
Therefore one (or both) of the configuration parameters DFTI_INPUT_DISTANCE and DFTI_OUTPUT_DISTANCE is required if DFTI_NUMBER_OF_TRANSFORMS is greater than one.

See Also
DFTI_INPUT_DISTANCE, DFTI_OUTPUT_DISTANCE
DftiSetValue

DFTI_INPUT_DISTANCE, DFTI_OUTPUT_DISTANCE

The FFT interface in Intel MKL enables computation of multiple transforms. To compute multiple transforms, you need to specify the data distribution of the multiple sets of data. The distance between the first data elements of consecutive data sets, DFTI_INPUT_DISTANCE for input data or DFTI_OUTPUT_DISTANCE for output data, specifies the distribution. The configuration setting is a value of INTEGER data type in Fortran and MKL_LONG data type in C.

The default value for both configuration settings is one. You must set this parameter explicitly if the number of transforms is greater than one (see DFTI_NUMBER_OF_TRANSFORMS).

The distance is counted in elements of the data type defined by the descriptor configuration (rather than by the type of the variable passed to the computation functions). Specifically, the DFTI_FORWARD_DOMAIN, DFTI_COMPLEX_STORAGE, and DFTI_CONJUGATE_EVEN_STORAGE configuration parameters define the type of the elements as shown in Table "Assumed Element Types of the Input/Output Data".

For in-place transforms, the configuration set by DFTI_OUTPUT_DISTANCE is ignored except when the element types in forward and backward domains are different. If they are different, set DFTI_OUTPUT_DISTANCE explicitly (even though the transform is in-place). For in-place transforms, the configuration must be consistent, that is, the locations of the data sets on input and output must coincide. The DFTI_PLACEMENT configuration parameter defines whether the transform is in-place or out-of-place.

The configuration parameters define the distance within input and output data, and not within the forward-domain and backward-domain data.
If the distances in the forward and backward domains differ, set DFTI_INPUT_DISTANCE and DFTI_OUTPUT_DISTANCE explicitly and then commit the descriptor before calling computation functions.

The following examples illustrate setting of the DFTI_INPUT_DISTANCE configuration parameter:

    /* C/C++ */
    MKL_LONG dims[] = { nd, …, n2, n1 };
    MKL_LONG distance = nd*…*n2*n1;
    DftiCreateDescriptor( &hand, precision, DFTI_COMPLEX, d, dims );
    DftiSetValue( hand, DFTI_NUMBER_OF_TRANSFORMS, (MKL_LONG)howmany );
    DftiSetValue( hand, DFTI_INPUT_DISTANCE, distance );

    ! Fortran
    INTEGER :: dims(d) = [n1, n2, …, nd]
    INTEGER :: distance = n1*n2*…*nd
    status = DftiCreateDescriptor( hand, precision, DFTI_COMPLEX, d, dims)
    status = DftiSetValue( hand, DFTI_NUMBER_OF_TRANSFORMS, howmany )
    status = DftiSetValue( hand, DFTI_INPUT_DISTANCE, distance )

See Also
DFTI_PLACEMENT
DftiSetValue
DftiCommitDescriptor
DftiComputeForward
DftiComputeBackward

DFTI_COMPLEX_STORAGE, DFTI_REAL_STORAGE, DFTI_CONJUGATE_EVEN_STORAGE

Depending on the value of configuration parameter DFTI_FORWARD_DOMAIN, the implementation of FFT supports several storage schemes for input and output data (see document [3] for the rationale behind the definition of the storage schemes). The data elements are placed within contiguous memory blocks, defined with generalized strides (see DFTI_INPUT_STRIDES, DFTI_OUTPUT_STRIDES). For multiple transforms, each nth set of data (where nth ≥ 0) should be located within the same memory block, and the data sets should be placed at a distance from each other (see DFTI_NUMBER_OF_TRANSFORMS and DFTI_INPUT_DISTANCE, DFTI_OUTPUT_DISTANCE).

NOTE In C/C++, avoid setting up multidimensional arrays with lists of pointers to one-dimensional arrays. Instead use a one-dimensional array with the explicit indexing to access the data elements.

C notation is used in this section to describe association of mathematical entities with the data elements stored in memory.
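The explicit indexing recommended in the note can be sketched in plain C as follows. This is only an illustration under stated assumptions: the data sets are two-dimensional, unpadded, row-major, and stored back to back in one flat array, and the helper names contiguous_distance and batch_index_2d are hypothetical rather than MKL functions.

```c
#include <assert.h>
#include <stddef.h>

/* Number of elements in one unpadded data set: n1 * n2 * ... * nd.
   This is the distance to use when data sets are stored back to back. */
size_t contiguous_distance(size_t d, const size_t *dims)
{
    assert(d >= 1 && dims != NULL);
    size_t count = 1;
    for (size_t i = 0; i < d; ++i)
        count *= dims[i];
    return count;
}

/* Flat index of element (k2, k1) in data set number nth (counted from 0),
   for row-major n2-by-n1 data sets kept in a single one-dimensional array. */
size_t batch_index_2d(size_t nth, size_t distance, size_t n1,
                      size_t k2, size_t k1)
{
    assert(k1 < n1);
    return nth * distance + k2 * n1 + k1;
}
```

With 3-by-4 data sets the distance is 12, so element (1, 3) of data set 2 is at flat index 31. A single allocation of howmany * distance elements then holds all data sets, with no table of row pointers.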
FFT Examples demonstrate the usage of storage formats in both C and Fortran.

Storage schemes for complex domain. For the DFTI_COMPLEX forward domain, both input and output sequences belong to the complex domain. In this case, the configuration parameter DFTI_COMPLEX_STORAGE can have one of the two values: DFTI_COMPLEX_COMPLEX (default) or DFTI_REAL_REAL.

NOTE In the Intel MKL FFT implementation, storage schemes for a forward complex domain and the respective backward complex domain are the same.

With DFTI_COMPLEX_COMPLEX storage, the complex-valued data sequence is referenced by a single complex parameter Z so that complex-valued element z(k1, k2, ..., kd) of the sequence is located at Z[nth*distance + stride_0 + k1*stride_1 + k2*stride_2 + ... + kd*stride_d] as a structure consisting of the real and imaginary parts.

The following example illustrates a typical usage of the DFTI_COMPLEX_COMPLEX storage:

    complex :: x(n)
    ...
    ! on input, for i=1,...,N: x(i) = r_{i-1}
    status = DftiComputeForward( desc_handle, x )
    ! on output, for i=1,...,N: x(i) = z_{i-1}

With the DFTI_REAL_REAL storage, the complex-valued data sequence is referenced by two real parameters ZRe and ZIm so that complex-valued element z(k1, k2, ..., kd) of the sequence is computed as ZRe[nth*distance + stride_0 + k1*stride_1 + k2*stride_2 + ... + kd*stride_d] + √(-1) × ZIm[nth*distance + stride_0 + k1*stride_1 + k2*stride_2 + ... + kd*stride_d].

A typical usage of the DFTI_REAL_REAL storage is illustrated by the following example:

    real :: xre(n), xim(n)
    ...
    status = DftiSetValue( desc_handle, DFTI_COMPLEX_STORAGE, DFTI_REAL_REAL)
    ! on input, for i=1,...,N: cmplx(xre(i),xim(i)) = r_{i-1}
    status = DftiComputeForward( desc_handle, xre, xim )
    ! on output, for i=1,...,N: cmplx(xre(i),xim(i)) = z_{i-1}

Storage scheme for the real and conjugate-even domains.
The setting for the storage schemes for real and conjugate-even domains is recorded in the configuration parameters DFTI_REAL_STORAGE and DFTI_CONJUGATE_EVEN_STORAGE. Since a forward real domain corresponds to a conjugate-even backward domain, they are considered together. The example below uses one-, two- and three-dimensional real to conjugate-even transforms. In-place computation is assumed whenever possible (that is, when the input data type matches the output data type).

One-Dimensional Transform

Consider a one-dimensional n-length transform of the form

$$ z_k = \sum_{j=0}^{n-1} r_j \exp\left( -i \, 2\pi \frac{jk}{n} \right), \quad k = 0, \ldots, n-1 $$

where r is the real input sequence. There is a symmetry:

For even n: z(n/2+i) = conjg(z(n/2-i)), 1 ≤ i ≤ n/2-1, and moreover z(0) and z(n/2) are real values.
For odd n: z(m+i) = conjg(z(m-i+1)), m = floor(n/2), 1 ≤ i ≤ m, and moreover z(0) is a real value.

Comparison of the Storage Effects of Complex-to-Complex and Real-to-Complex FFTs for a Forward Transform

N=8
  Input Vectors                     Output Vectors
  Complex FFT          Real FFT     Complex FFT          Real FFT
  Complex Data         Real Data    Complex Data         Real Data
  Real   Imaginary     Real         Real     Imaginary   CCS        Pack     Perm
  r0     0.000000      r0           z0       0.000000    z0         z0       z0
  r1     0.000000      r1           Re(z1)   Im(z1)      0.000000   Re(z1)   z4
  r2     0.000000      r2           Re(z2)   Im(z2)      Re(z1)     Im(z1)   Re(z1)
  r3     0.000000      r3           Re(z3)   Im(z3)      Im(z1)     Re(z2)   Im(z1)
  r4     0.000000      r4           z4       0.000000    Re(z2)     Im(z2)   Re(z2)
  r5     0.000000      r5           Re(z3)   -Im(z3)     Im(z2)     Re(z3)   Im(z2)
  r6     0.000000      r6           Re(z2)   -Im(z2)     Re(z3)     Im(z3)   Re(z3)
  r7     0.000000      r7           Re(z1)   -Im(z1)     Im(z3)     z4       Im(z3)
                                                         z4
                                                         0.000000

N=7
  Input Vectors                     Output Vectors
  Complex FFT          Real FFT     Complex FFT          Real FFT
  Complex Data         Real Data    Complex Data         Real Data
  Real   Imaginary     Real         Real     Imaginary   CCS        Pack     Perm
  r0     0.000000      r0           z0       0.000000    z0         z0       z0
  r1     0.000000      r1           Re(z1)   Im(z1)      0.000000   Re(z1)   Re(z1)
  r2     0.000000      r2           Re(z2)   Im(z2)      Re(z1)     Im(z1)   Im(z1)
  r3     0.000000      r3           Re(z3)   Im(z3)      Im(z1)     Re(z2)   Re(z2)
  r4     0.000000      r4           Re(z3)   -Im(z3)     Re(z2)     Im(z2)   Im(z2)
  r5     0.000000      r5           Re(z2)   -Im(z2)     Im(z2)     Re(z3)   Re(z3)
  r6     0.000000      r6           Re(z1)   -Im(z1)     Re(z3)     Im(z3)   Im(z3)
                                                         Im(z3)
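To make the CCS, Pack, and Perm columns of the table above concrete, the following C sketch lays out the non-redundant half of a conjugate-even sequence in each of the three formats. It is an illustration only and does not call Intel MKL: the function names are hypothetical, and the inputs re[k] and im[k] are assumed to hold Re(z_k) and Im(z_k) for k = 0, ..., floor(n/2), with im[0] (and im[n/2] for even n) equal to zero.

```c
#include <assert.h>
#include <stddef.h>

/* CCS: (Re, Im) pairs for k = 0..n/2; x needs 2*(n/2) + 2 reals. */
void pack_ccs(size_t n, const double *re, const double *im, double *x)
{
    assert(n >= 1);
    size_t m = n / 2;
    for (size_t k = 0; k <= m; ++k) {
        x[2 * k] = re[k];
        x[2 * k + 1] = im[k];
    }
}

/* Pack: Re(z0), then (Re, Im) pairs, then Re(z_{n/2}) for even n;
   x needs n reals. */
void pack_pack(size_t n, const double *re, const double *im, double *x)
{
    assert(n >= 1);
    size_t m = n / 2;
    size_t last = (n % 2 == 0) ? m - 1 : m;  /* last k stored as a pair */
    x[0] = re[0];
    for (size_t k = 1; k <= last; ++k) {
        x[2 * k - 1] = re[k];
        x[2 * k] = im[k];
    }
    if (n % 2 == 0)
        x[n - 1] = re[m];
}

/* Perm: for even n, Re(z0) and Re(z_{n/2}) come first, then the pairs;
   for odd n the layout is the same as Pack. x needs n reals. */
void pack_perm(size_t n, const double *re, const double *im, double *x)
{
    assert(n >= 1);
    if (n % 2 != 0) {
        pack_pack(n, re, im, x);
        return;
    }
    size_t m = n / 2;
    x[0] = re[0];
    x[1] = re[m];
    for (size_t k = 1; k < m; ++k) {
        x[2 * k] = re[k];
        x[2 * k + 1] = im[k];
    }
}
```

For n = 8 the three functions reproduce the CCS, Pack, and Perm columns row by row, including the two extra CCS entries (z4 and 0) that make the CCS array two reals longer than the input.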
Comparison of the Storage Effects of Complex-to-Complex and Complex-to-Real FFTs for Backward Transform

N=8
  Input Vectors                     Output Vectors
  Complex FFT          Real FFT     Complex FFT          Real FFT
  Complex Data         Real Data    Complex Data         Real Data
  Real   Imaginary     Real         Real     Imaginary   CCS        Pack     Perm
  r0     0.000000      r0           z0       0.000000    z0         z0       z0
  r1     0.000000      r1           Re(z1)   Im(z1)      0.000000   Re(z1)   z4
  r2     0.000000      r2           Re(z2)   Im(z2)      Re(z1)     Im(z1)   Re(z1)
  r3     0.000000      r3           Re(z3)   Im(z3)      Im(z1)     Re(z2)   Im(z1)
  r4     0.000000      r4           z4       0.000000    Re(z2)     Im(z2)   Re(z2)
  r5     0.000000      r5           Re(z3)   -Im(z3)     Im(z2)     Re(z3)   Im(z2)
  r6     0.000000      r6           Re(z2)   -Im(z2)     Re(z3)     Im(z3)   Re(z3)
  r7     0.000000      r7           Re(z1)   -Im(z1)     Im(z3)     z4       Im(z3)
                                                         z4
                                                         0.000000

N=7
  Input Vectors                     Output Vectors
  Complex FFT          Real FFT     Complex FFT          Real FFT
  Complex Data         Real Data    Complex Data         Real Data
  Real   Imaginary     Real         Real     Imaginary   CCS        Pack     Perm
  r0     0.000000      r0           z0       0.000000    z0         z0       z0
  r1     0.000000      r1           Re(z1)   Im(z1)      0.000000   Re(z1)   Re(z1)
  r2     0.000000      r2           Re(z2)   Im(z2)      Re(z1)     Im(z1)   Im(z1)
  r3     0.000000      r3           Re(z3)   Im(z3)      Im(z1)     Re(z2)   Re(z2)
  r4     0.000000      r4           Re(z3)   -Im(z3)     Re(z2)     Im(z2)   Im(z2)
  r5     0.000000      r5           Re(z2)   -Im(z2)     Im(z2)     Re(z3)   Re(z3)
  r6     0.000000      r6           Re(z1)   -Im(z1)     Re(z3)     Im(z3)   Im(z3)
                                                         Im(z3)

Assume that the stride has the default value of one.

This complex conjugate-symmetric vector can be stored in the complex array of size m+1 or in the real array of size 2m+2 or 2m depending on which packed format is used.

Two-Dimensional Transform

Each of the real-to-complex functions computes the forward FFT of a two-dimensional real matrix according to the mathematical equation

$$ z_{j,p} = \sum_{a=0}^{m-1} \sum_{b=0}^{n-1} r_{a,b} \exp\left( -i \, 2\pi \left( \frac{ja}{m} + \frac{pb}{n} \right) \right) $$

where r is the real input matrix and a, b are summation indices. The mathematical result z(j,p), 0 ≤ j ≤ m-1, 0 ≤ p ≤ n-1, is the complex matrix of size (m,n).
This mathematical result can be stored in the real two-dimensional array of size:

(m+2,n+2) (CCS format), or
(m,n) (Pack or Perm formats), or
(2*(m/2+1), n) (CCE format, Fortran interface),
(m, 2*(n/2+1)) (CCE format, C interface)

or in the complex two-dimensional array of size:

(m/2+1, n) (CCE format, Fortran interface),
(m, n/2+1) (CCE format, C interface)

Since the multidimensional array data are arranged differently in Fortran and C (see DFTI_INPUT_STRIDES, DFTI_OUTPUT_STRIDES), the output array that holds the computational result contains complex conjugate-symmetric columns (for Fortran) or complex conjugate-symmetric rows (for C).

The following tables give examples of output data layout in Pack format for a forward two-dimensional real-to-complex FFT of a 6-by-4 real matrix. Note that the same layout is used for the input data of the corresponding backward complex-to-real FFT.

Fortran-interface Data Layout for a 6-by-4 Matrix

  z(1,1)      Re z(1,2)   Im z(1,2)   z(1,3)
  Re z(2,1)   Re z(2,2)   Re z(2,3)   Re z(2,4)
  Im z(2,1)   Im z(2,2)   Im z(2,3)   Im z(2,4)
  Re z(3,1)   Re z(3,2)   Re z(3,3)   Re z(3,4)
  Im z(3,1)   Im z(3,2)   Im z(3,3)   Im z(3,4)
  z(4,1)      Re z(4,2)   Im z(4,2)   z(4,3)

For the above example, the stride array is (0, 1, 6).

C-interface Data Layout for a 6-by-4 Matrix

  z(1,1)      Re z(1,2)   Im z(1,2)   z(1,3)
  Re z(2,1)   Re z(2,2)   Im z(2,2)   Re z(2,3)
  Im z(2,1)   Re z(3,2)   Im z(3,2)   Im z(2,3)
  Re z(3,1)   Re z(4,2)   Im z(4,2)   Re z(3,3)
  Im z(3,1)   Re z(5,2)   Im z(5,2)   Im z(3,3)
  z(4,1)      Re z(6,2)   Im z(6,2)   z(4,3)

For the second example, the stride array is (0, 4, 1).

See DFTI_INPUT_STRIDES, DFTI_OUTPUT_STRIDES for details. See also DFTI_PACKED_FORMAT.
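The array sizes listed above for an m-by-n real transform can be collected in a small helper, sketched here in C for the C-interface (row-major) variants. The enum and function names are hypothetical and are not the DFTI_* constants defined by Intel MKL; the arithmetic simply restates the sizes given in this section.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical labels for the packed formats; not MKL's DFTI_* constants. */
typedef enum { FMT_CCS, FMT_PACK, FMT_PERM, FMT_CCE } packed_format;

/* Number of real elements needed to hold the result of an m-by-n
   real-to-complex transform, C interface. */
size_t real_elements_2d_c(packed_format fmt, size_t m, size_t n)
{
    assert(m >= 1 && n >= 1);
    switch (fmt) {
    case FMT_CCS:
        return (m + 2) * (n + 2);
    case FMT_PACK:
    case FMT_PERM:
        return m * n;
    case FMT_CCE:
        return m * (2 * (n / 2 + 1));
    }
    return 0;
}

/* Number of complex elements in CCE format, C interface: m * (n/2 + 1). */
size_t complex_elements_cce_2d_c(size_t m, size_t n)
{
    assert(m >= 1 && n >= 1);
    return m * (n / 2 + 1);
}
```

For the 6-by-4 example, Pack and Perm need 24 reals (the same as the input), CCS needs 48, and CCE needs 18 complex elements, that is, 36 reals.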
Three-Dimensional Transform

Each of the real-to-complex functions computes the forward FFT of a three-dimensional real matrix according to the mathematical equation

$$ z_{j,t,q} = \sum_{a=0}^{m-1} \sum_{b=0}^{n-1} \sum_{c=0}^{k-1} r_{a,b,c} \exp\left( -i \, 2\pi \left( \frac{ja}{m} + \frac{tb}{n} + \frac{qc}{k} \right) \right) $$

where r is the real input matrix and a, b, c are summation indices. The mathematical result z(j,t,q), 0 ≤ j ≤ m-1, 0 ≤ t ≤ n-1, 0 ≤ q ≤ k-1, is the complex matrix of size (m,n,k), which is a complex conjugate-symmetric, or conjugate-even, matrix as follows:

z(m1,n1,k1) = conjg(z(m-m1,n-n1,k-k1)), where each dimension is periodic.

This mathematical result can be stored in the complex three-dimensional array of size:

(m/2+1,n,k) (CCE format, Fortran interface),
(m,n,k/2+1) (CCE format, C interface).

Since the multidimensional array data are arranged differently in Fortran and C (see DFTI_INPUT_STRIDES, DFTI_OUTPUT_STRIDES), the output array that holds the computational result contains complex conjugate-symmetric columns (for Fortran) or complex conjugate-symmetric rows (for C).

NOTE CCE is the only packed format for a three-dimensional real FFT.

In both in-place and out-of-place REAL FFT, for real data, the stride and distance parameters are in REAL units and for complex data, they are in COMPLEX units. So elements of the input and output data can be placed in different elements of the input-output array of the in-place FFT.

1. DFTI_REAL_REAL for real domain, DFTI_COMPLEX_REAL for conjugate-even domain (by default). It is used for 1D and 2D REAL FFT.

• A typical usage of in-place transform is as follows:

    // m = floor( n/2 )
    REAL :: X(0:2*m+1)
    ...some other code...
    ...assuming in-place transform...
    Status = DftiComputeForward( Desc_Handle, X )

On input, X(p) = r(p), p = 0,1,...,n-1.
On output, the data is stored in one of the formats Pack, Perm, or CCS (see DFTI_PACKED_FORMAT).

CCS format: X(2*k) = Re(zk), X(2*k+1) = Im(zk), k = 0,1,...,m.
Pack format:
even n: X(0) = Re(z0), X(2*k-1) = Re(zk), X(2*k) = Im(zk), k = 1,...,m-1, and X(n-1) = Re(zm)
odd n: X(0) = Re(z0), X(2*k-1) = Re(zk), X(2*k) = Im(zk), k = 1,...,m

Perm format:
even n: X(0) = Re(z0), X(1) = Re(zm), X(2*k) = Re(zk), X(2*k+1) = Im(zk), k = 1,...,m-1,
odd n: X(0) = Re(z0), X(2*k-1) = Re(zk), X(2*k) = Im(zk), k = 1,...,m.

See Example "One-dimensional In-place FFT (Fortran Interface)", Example "One-dimensional In-place FFT (C Interface)", Example "Two-dimensional FFT (Fortran Interface)", and Example "Two-dimensional FFT (C Interface)".

Input and output data exchange roles in the backward transform.

• A typical usage of out-of-place transform is as follows:

    // m = floor( n/2 )
    REAL :: X(0:n-1)
    REAL :: Y(0:2*m+1)
    ...some other code...
    ...assuming out-of-place transform...
    Status = DftiComputeForward( Desc_Handle, X, Y )

On input, X(p) = r(p), p = 0,1,...,n-1.
On output, the data is stored in one of the formats Pack, Perm, or CCS (see DFTI_PACKED_FORMAT).

CCS format: Y(2*k) = Re(zk), Y(2*k+1) = Im(zk), k = 0,1,...,m.

Pack format:
even n: Y(0) = Re(z0), Y(2*k-1) = Re(zk), Y(2*k) = Im(zk), k = 1,...,m-1, and Y(n-1) = Re(zm)
odd n: Y(0) = Re(z0), Y(2*k-1) = Re(zk), Y(2*k) = Im(zk), k = 1,...,m

Perm format:
even n: Y(0) = Re(z0), Y(1) = Re(zm), Y(2*k) = Re(zk), Y(2*k+1) = Im(zk), k = 1,...,m-1,
odd n: Y(0) = Re(z0), Y(2*k-1) = Re(zk), Y(2*k) = Im(zk), k = 1,...,m.

Notice that if the stride of the output array is not set to the default unit stride, the real and imaginary parts of one complex element are placed with this stride. For example:

CCS format: Y(2*k*s) = Re(zk), Y((2*k+1)*s) = Im(zk), k = 0,1,...,m, where s is the stride.

See Example "One-dimensional Out-of-place FFT (Fortran Interface)" and Example "One-dimensional Out-of-place FFT (C Interface)".

Input and output data exchange roles in the backward transform.

2. DFTI_REAL_REAL for real domain, DFTI_COMPLEX_COMPLEX for conjugate-even domain.
It is used for 1D, 2D and 3D REAL FFT. The CCE format is set by default. You must explicitly set the storage scheme in this case, because its value is not the default one.

• A typical usage of in-place transform is as follows:

    // m = floor( n/2 )
    REAL :: X(0:2*m+1)
    ...some other code...
    ...assuming in-place transform...
    Status = DftiSetValue( Desc_Handle, DFTI_CONJUGATE_EVEN_STORAGE, DFTI_COMPLEX_COMPLEX)
    ...
    Status = DftiComputeForward( Desc_Handle, X)

On input, X(p) = r(p), p = 0,1,...,n-1.
On output, X(2*k) = Re(zk), X(2*k+1) = Im(zk), k = 0,1,...,m.

See Example "Two-Dimensional REAL In-place FFT (Fortran Interface)".

Input and output data exchange roles in the backward transform.

• A typical usage of out-of-place transform is as follows:

    // m = floor( n/2 )
    REAL :: X(0:n-1)
    COMPLEX :: Y(0:m)
    ...some other code...
    ...assuming out-of-place transform...
    Status = DftiSetValue( Desc_Handle, DFTI_CONJUGATE_EVEN_STORAGE, DFTI_COMPLEX_COMPLEX)
    ...
    Status = DftiComputeForward( Desc_Handle, X, Y )

On input, X(p) = r(p), p = 0,1,...,n-1.
On output, Y(k) = zk, k = 0,1,...,m.

See Example "Two-Dimensional REAL Out-of-place FFT (Fortran Interface)" and Example "Three-Dimensional REAL FFT (C Interface)".

Input and output data exchange roles in the backward transform.

See Also
DftiSetValue

DFTI_PACKED_FORMAT

The result of the forward transform (that is, in the frequency domain) of real data is represented in several possible packed formats: Pack, Perm, CCS, or CCE. The data can be packed due to the symmetry property of the FFT of real data.

Use the following non-default settings for real transforms of all ranks:

• The configuration parameter DFTI_CONJUGATE_EVEN_STORAGE has the value of DFTI_COMPLEX_COMPLEX.
• Elements of the result in the conjugate-even domain have a complex type.
• The configuration parameter DFTI_PACKED_FORMAT has the value of DFTI_CCE_FORMAT.
The following setting is the default for one-dimensional and two-dimensional real transforms:
• The configuration parameter DFTI_CONJUGATE_EVEN_STORAGE has the value of DFTI_COMPLEX_REAL.
• Data elements in the frequency domain have a real type.
• The value of DFTI_PACKED_FORMAT defines how real and imaginary parts of the data are laid out in the result.

NOTE This setting does not apply to three-dimensional and higher-rank transforms. Though not recommended, it is the default for backward compatibility.

The CCE format stores the values of the first half of the output complex conjugate-even signal resulting from the forward FFT. For a multi-dimensional real transform of size n1 * n2 * n3 * ... * nk, the size of the complex matrix in CCE format is (n1/2+1) * n2 * n3 * ... * nk for Fortran and n1 * n2 * ... * (nk/2+1) for C.

The CCS format is similar to the CCE format and is identical to it for a one-dimensional transform. It differs slightly for multi-dimensional real transforms. In CCS format, the output samples of the FFT are arranged as shown in Table "Packed Format Output Samples" for a one-dimensional FFT and in Table "CCS Format Output Samples (Two-Dimensional Matrix (m+2)-by-(n+2))" for a two-dimensional FFT.

The Pack format is a compact representation of a complex conjugate-symmetric sequence, but the elements are arranged intuitively for complex FFT algorithms rather than for real FFTs. In the Pack format, the output samples of the FFT are arranged as shown in Table "Packed Format Output Samples" for a one-dimensional FFT and in Table "Pack Format Output Samples (Two-Dimensional Matrix m-by-n)" for a two-dimensional FFT.

The Perm format is a permutation of the Pack format for even lengths and is the same as the Pack format for odd lengths.
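As an illustration of the one-dimensional Pack and Perm layouts, the C sketch below decodes a packed real array into the complex values z0, ..., zm with m = floor(n/2), following the index formulas given for these formats. It does not call Intel MKL; cplx, unpack_pack, and unpack_perm are hypothetical names chosen for this example. The imaginary parts of z0 and, for even n, of zm are zero for real input, which is why the packed formats omit them.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper type for this sketch; not an Intel MKL type. */
typedef struct { double re, im; } cplx;

/* Pack: x[0] = Re z_0; x[2k-1] = Re z_k, x[2k] = Im z_k for
 * k = 1..m-1 (even n) or k = 1..m (odd n); even n: x[n-1] = Re z_m. */
void unpack_pack(const double *x, size_t n, cplx *z)
{
    size_t m, last;
    assert(n > 0);
    m = n / 2;
    last = (n % 2 == 0) ? m - 1 : m;  /* last k with both parts stored */
    z[0].re = x[0];
    z[0].im = 0.0;
    for (size_t k = 1; k <= last; ++k) {
        z[k].re = x[2 * k - 1];
        z[k].im = x[2 * k];
    }
    if (n % 2 == 0) {
        z[m].re = x[n - 1];
        z[m].im = 0.0;
    }
}

/* Perm, even n: x[0] = Re z_0, x[1] = Re z_m; x[2k] = Re z_k,
 * x[2k+1] = Im z_k for k = 1..m-1. Odd n: identical to Pack. */
void unpack_perm(const double *x, size_t n, cplx *z)
{
    size_t m;
    assert(n > 0);
    if (n % 2 != 0) {
        unpack_pack(x, n, z);
        return;
    }
    m = n / 2;
    z[0].re = x[0];
    z[0].im = 0.0;
    z[m].re = x[1];
    z[m].im = 0.0;
    for (size_t k = 1; k < m; ++k) {
        z[k].re = x[2 * k];
        z[k].im = x[2 * k + 1];
    }
}
```

For even n both routines return the same z values from differently ordered inputs; for odd n the two layouts coincide.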
In Perm format, the output samples of the FFT are arranged as shown in Table "Packed Format Output Samples" for a one-dimensional FFT and in Table "Perm Format Output Samples (Two-Dimensional Matrix m-by-n)" for a two-dimensional FFT.

Packed Format Output Samples

For n = 2*s
FFT Real  0     1         2     3     ...   n-2        n-1        n         n+1
CCS       R0    0         R1    I1    ...   R(n/2-1)   I(n/2-1)   R(n/2)    0
Pack      R0    R1        I1    R2    ...   I(n/2-1)   R(n/2)
Perm      R0    R(n/2)    R1    I1    ...   R(n/2-1)   I(n/2-1)

For n = 2*s + 1
FFT Real  0     1     2     3     ...   n-4      n-3      n-2      n-1      n     n+1
CCS       R0    0     R1    I1    ...   I(s-2)   R(s-1)   I(s-1)   Rs       Is
Pack      R0    R1    I1    R2    ...   R(s-1)   I(s-1)   Rs       Is
Perm      R0    R1    I1    R2    ...   R(s-1)   I(s-1)   Rs       Is

Note that Table "Packed Format Output Samples" uses the following notation for complex data entries:
Rj = Re zj
Ij = Im zj
See also Table "Comparison of the Storage Effects of Complex-to-Complex and Real-to-Complex FFTs for Forward Transform" and Table "Comparison of the Storage Effects of Complex-to-Complex and Complex-to-Real FFTs for Backward Transform".

CCS Format Output Samples (Two-Dimensional Matrix (m+2)-by-(n+2))

For m = 2*s, n = 2*k
z(1,1)          0               REz(1,2)        IMz(1,2)        ...   REz(1,k)        IMz(1,k)        z(1,k+1)        0
0               0               0               0               ...   0               0               0               0
REz(2,1)        REz(2,2)        REz(2,3)        REz(2,4)        ...   REz(2,n-1)      REz(2,n)        n/u*            n/u
IMz(2,1)        IMz(2,2)        IMz(2,3)        IMz(2,4)        ...   IMz(2,n-1)      IMz(2,n)        n/u             n/u
...             ...             ...             ...             ...   ...             ...             n/u             n/u
REz(m/2,1)      REz(m/2,2)      REz(m/2,3)      REz(m/2,4)      ...   REz(m/2,n-1)    REz(m/2,n)      n/u             n/u
IMz(m/2,1)      IMz(m/2,2)      IMz(m/2,3)      IMz(m/2,4)      ...   IMz(m/2,n-1)    IMz(m/2,n)      n/u             n/u
z(m/2+1,1)      0               REz(m/2+1,2)    IMz(m/2+1,2)    ...   REz(m/2+1,k)    IMz(m/2+1,k)    z(m/2+1,k+1)    0
0               0               0               0               ...   0               0               n/u             n/u

For m = 2*s+1, n = 2*k
z(1,1)          0               REz(1,2)        IMz(1,2)        ...   REz(1,k)        IMz(1,k)        z(1,k+1)        0
0               0               0               0               ...   0               0               0               0
REz(2,1)        REz(2,2)        REz(2,3)        REz(2,4)        ...   REz(2,n-1)      REz(2,n)        n/u             n/u
IMz(2,1)        IMz(2,2)        IMz(2,3)        IMz(2,4)        ...   IMz(2,n-1)      IMz(2,n)        n/u             n/u
...             ...             ...             ...             ...   ...             ...             n/u             n/u
REz(s,1)        REz(s,2)        REz(s,3)        REz(s,4)        ...   REz(s,n-1)      REz(s,n)        n/u             n/u
IMz(s,1)        IMz(s,2)        IMz(s,3)        IMz(s,4)        ...   IMz(s,n-1)      IMz(s,n)        n/u             n/u

For m = 2*s, n = 2*k+1
z(1,1)          0               REz(1,2)        IMz(1,2)        ...   IMz(1,k-1)      REz(1,k)        IMz(1,k)
0               0               0               0               ...   0               0               0
REz(2,1)        REz(2,2)        REz(2,3)        REz(2,4)        ...   REz(2,n-1)      REz(2,n)        n/u*
IMz(2,1)        IMz(2,2)        IMz(2,3)        IMz(2,4)        ...   IMz(2,n-1)      IMz(2,n)        n/u
...             ...             ...             ...             ...   ...             ...             n/u
REz(m/2,1)      REz(m/2,2)      REz(m/2,3)      REz(m/2,4)      ...   REz(m/2,n-1)    REz(m/2,n)      n/u
IMz(m/2,1)      IMz(m/2,2)      IMz(m/2,3)      IMz(m/2,4)      ...   IMz(m/2,n-1)    IMz(m/2,n)      n/u
z(m/2+1,1)      0               REz(m/2+1,2)    IMz(m/2+1,2)    ...   IMz(m/2+1,k-1)  REz(m/2+1,k)    IMz(m/2+1,k)
0               0               0               0               ...   0               0               n/u

For m = 2*s+1, n = 2*k+1
z(1,1)          0               REz(1,2)        IMz(1,2)        ...   IMz(1,k-1)      REz(1,k)        IMz(1,k)
0               0               0               0               ...   0               0               0
REz(2,1)        REz(2,2)        REz(2,3)        REz(2,4)        ...   REz(2,n-1)      REz(2,n)        n/u
IMz(2,1)        IMz(2,2)        IMz(2,3)        IMz(2,4)        ...   IMz(2,n-1)      IMz(2,n)        n/u
...             ...             ...             ...             ...   ...             ...             n/u
REz(s,1)        REz(s,2)        REz(s,3)        REz(s,4)        ...   REz(s,n-1)      REz(s,n)        n/u
IMz(s,1)        IMz(s,2)        IMz(s,3)        IMz(s,4)        ...   IMz(s,n-1)      IMz(s,n)        n/u

* n/u - not used.

Note that in the Table "CCS Format Output Samples (Two-Dimensional Matrix (m+2)-by-(n+2))", (n+2) columns are used for even n = k*2, while n columns are used for odd n = k*2+1.

Pack Format Output Samples (Two-Dimensional Matrix m-by-n)

For m = 2*s, n = 2*k
z(1,1)          REz(1,2)        IMz(1,2)        REz(1,3)        ...   IMz(1,k)        z(1,k+1)
REz(2,1)        REz(2,2)        REz(2,3)        REz(2,4)        ...   REz(2,n-1)      REz(2,n)
IMz(2,1)        IMz(2,2)        IMz(2,3)        IMz(2,4)        ...   IMz(2,n-1)      IMz(2,n)
...             ...             ...             ...             ...   ...             ...
REz(m/2,1)      REz(m/2,2)      REz(m/2,3)      REz(m/2,4)      ...   REz(m/2,n-1)    REz(m/2,n)
IMz(m/2,1)      IMz(m/2,2)      IMz(m/2,3)      IMz(m/2,4)      ...   IMz(m/2,n-1)    IMz(m/2,n)
z(m/2+1,1)      REz(m/2+1,2)    IMz(m/2+1,2)    REz(m/2+1,3)    ...   IMz(m/2+1,k)    z(m/2+1,k+1)

For m = 2*s+1, n = 2*k
z(1,1)          REz(1,2)        IMz(1,2)        REz(1,3)        ...   IMz(1,k)        z(1,n/2+1)
REz(2,1)        REz(2,2)        REz(2,3)        REz(2,4)        ...   REz(2,n-1)      REz(2,n)
IMz(2,1)        IMz(2,2)        IMz(2,3)        IMz(2,4)        ...   IMz(2,n-1)      IMz(2,n)
...             ...             ...             ...             ...   ...             ...
REz(s,1)        REz(s,2)        REz(s,3)        REz(s,4)        ...   REz(s,n-1)      REz(s,n)
IMz(s,1)        IMz(s,2)        IMz(s,3)        IMz(s,4)        ...   IMz(s,n-1)      IMz(s,n)

Perm Format Output Samples (Two-Dimensional Matrix m-by-n)

For m = 2*s, n = 2*k+1
z(1,1)          z(1,k+1)        REz(1,2)        IMz(1,2)        ...   REz(1,k)        IMz(1,k)
z(m/2+1,1)      z(m/2+1,k+1)    REz(m/2+1,2)    IMz(m/2+1,2)    ...   REz(m/2+1,k)    IMz(m/2+1,k)
REz(2,1)        REz(2,2)        REz(2,3)        REz(2,4)        ...   REz(2,n-1)      REz(2,n)
IMz(2,1)        IMz(2,2)        IMz(2,3)        IMz(2,4)        ...   IMz(2,n-1)      IMz(2,n)
...             ...             ...             ...             ...   ...             ...
REz(m/2,1)      REz(m/2,2)      REz(m/2,3)      REz(m/2,4)      ...   REz(m/2,n-1)    REz(m/2,n)
IMz(m/2,1)      IMz(m/2,2)      IMz(m/2,3)      IMz(m/2,4)      ...   IMz(m/2,n-1)    IMz(m/2,n)

For m = 2*s+1, n = 2*k+1
z(1,1)          z(1,k+1)        REz(1,2)        IMz(1,2)        ...   REz(1,k)        IMz(1,k)
REz(2,1)        REz(2,2)        REz(2,3)        REz(2,4)        ...   REz(2,n-1)      REz(2,n)
IMz(2,1)        IMz(2,2)        IMz(2,3)        IMz(2,4)        ...   IMz(2,n-1)      IMz(2,n)
...             ...             ...             ...             ...   ...             ...
REz(s,1)        REz(s,2)        REz(s,3)        REz(s,4)        ...   REz(s,n-1)      REz(s,n)
IMz(s,1)        IMz(s,2)        IMz(s,3)        IMz(s,4)        ...   IMz(s,n-1)      IMz(s,n)

The tables for two-dimensional FFT use Fortran-interface conventions. For C-interface specifics in storing packed data, see DFTI_COMPLEX_STORAGE, DFTI_REAL_STORAGE, DFTI_CONJUGATE_EVEN_STORAGE. See also Table "Fortran-interface Data Layout for a 6-by-4 Matrix" and Table "C-interface Data Layout for a 6-by-4 Matrix" for examples of Fortran-interface and C-interface formats.
To better understand packed formats for two-dimensional transforms, refer to these examples in your Intel MKL directory:
C: ./examples/dftc/source/config_conjugate_even_storage.c
Fortran: ./examples/dftf/source/config_conjugate_even_storage.f90

See Also
DftiSetValue

DFTI_WORKSPACE
The computation step for some FFT algorithms requires a scratch space for permutation or other purposes.
To manage the use of the auxiliary storage, Intel MKL enables you to set the configuration parameter DFTI_WORKSPACE with the following values:
DFTI_ALLOW (default)   Permits the use of the auxiliary storage.
DFTI_AVOID             Instructs Intel MKL to avoid using the auxiliary storage if possible.

See Also
DftiSetValue

DFTI_COMMIT_STATUS
The DFTI_COMMIT_STATUS configuration parameter indicates whether the descriptor is ready for computation. The parameter has two possible values:
DFTI_UNCOMMITTED   Default value, set after a successful call of DftiCreateDescriptor.
DFTI_COMMITTED     The value after a successful call to DftiCommitDescriptor.
A computation function called with an uncommitted descriptor returns an error.
You cannot directly set this configuration parameter in a call to DftiSetValue, but a change in the configuration of a committed descriptor may change the commit status of the descriptor to DFTI_UNCOMMITTED.

See Also
DftiCreateDescriptor
DftiCommitDescriptor
DftiSetValue

DFTI_ORDERING
Some FFT algorithms apply an explicit permutation stage that is time consuming [4]. The exclusion of this step is similar to applying an FFT to input data whose order is scrambled, or allowing a scrambled order of the FFT results. In applications such as convolution and power spectrum calculation, the order of result or data is unimportant and thus using scrambled data is acceptable if it leads to better performance. The following options are available in Intel MKL:
• DFTI_ORDERED: Forward transform data ordered, backward transform data ordered (default option).
• DFTI_BACKWARD_SCRAMBLED: Forward transform data ordered, backward transform data scrambled.
Table "Scrambled Order Transform" tabulates the effect of this configuration setting.

Scrambled Order Transform
DFTI_ORDERING             DftiComputeForward       DftiComputeBackward
                          Input → Output           Input → Output
DFTI_ORDERED              ordered → ordered        ordered → ordered
DFTI_BACKWARD_SCRAMBLED   ordered → scrambled      scrambled → ordered

NOTE The word "scrambled" in this table means "permit scrambled order if possible". In some situations permitting out-of-order data gives no performance advantage and an implementation may choose to ignore the suggestion.

See Also
DftiSetValue

Cluster FFT Functions
This section describes the cluster Fast Fourier Transform (FFT) functions implemented in Intel® MKL.

NOTE These functions are available only for the Linux* and Windows* operating systems.

The cluster FFT function library was designed to perform fast Fourier transforms on a cluster, that is, a group of computers interconnected via a network. Each computer (node) in the cluster has its own memory and processor(s). Data interchanges between the nodes are provided by the network.
One or more processes may be running in parallel on each cluster node. To organize communication between different processes, the cluster FFT function library uses the Message Passing Interface (MPI). To avoid dependence on a specific MPI implementation (for example, MPICH, Intel® MPI, and others), the library works with MPI via a message-passing library for linear algebra called BLACS.
Cluster FFT functions of Intel MKL provide one-dimensional, two-dimensional, and multi-dimensional (up to the order of 7) functions and both Fortran and C interfaces for all transform functions.
To develop applications using the cluster FFT functions, you should have basic skills in MPI programming.
The interfaces for the Intel MKL cluster FFT functions are similar to the corresponding interfaces for the conventional Intel MKL FFT functions, described earlier in this chapter. Refer there for details not explained in this section.
Table "Cluster FFT Functions in Intel MKL" lists cluster FFT functions implemented in Intel MKL:

Cluster FFT Functions in Intel MKL
Function Name              Operation
Descriptor Manipulation Functions
DftiCreateDescriptorDM     Allocates memory for the descriptor data structure and preliminarily initializes it.
DftiCommitDescriptorDM     Performs all initialization for the actual FFT computation.
DftiFreeDescriptorDM       Frees memory allocated for a descriptor.
FFT Computation Functions
DftiComputeForwardDM       Computes the forward FFT.
DftiComputeBackwardDM      Computes the backward FFT.
Descriptor Configuration Functions
DftiSetValueDM             Sets one particular configuration parameter with the specified configuration value.
DftiGetValueDM             Gets the value of one particular configuration parameter.

Computing Cluster FFT
The cluster FFT functions described later in this section are provided with Fortran and C interfaces. Fortran stands for Fortran 95.
Cluster FFT computation is performed by the DftiComputeForwardDM and DftiComputeBackwardDM functions, called in a program that uses MPI, referred to below as an MPI program. After an MPI program starts, a number of processes are created. MPI identifies each process by its rank. The processes are independent of one another and communicate via MPI. A function called in an MPI program is invoked in all the processes. Each process manipulates data according to its rank. Input or output data for a cluster FFT transform is a sequence of real or complex values. A cluster FFT computation function operates on the local part of the input data, that is, the part of the data assigned to a particular process, and generates the local part of the output data. Running in parallel and communicating through MPI, the processes together perform the entire FFT computation. FFT computations using the Intel MKL cluster FFT functions typically consist of the following steps:
1. Initiate MPI by calling MPI_Init in C/C++ or MPI_INIT in Fortran (the function must be called prior to calling any FFT function and any MPI function).
2. Allocate memory for the descriptor and create it by calling DftiCreateDescriptorDM.
3. Specify values of configuration parameters by one or more calls to DftiSetValueDM.
4. Obtain values of configuration parameters needed to create local data arrays; the values are retrieved by calling DftiGetValueDM.
5. Initialize the descriptor for the FFT computation by calling DftiCommitDescriptorDM.
6. Create arrays for local parts of input and output data and fill the local part of input data with values. (For more information, see Distributing Data among Processes.)
7. Compute the transform by calling DftiComputeForwardDM or DftiComputeBackwardDM.
8. Gather local output data into the global array using MPI functions. (This step is optional because you may need to immediately employ the data differently.)
9. Release memory allocated for the descriptor by calling DftiFreeDescriptorDM.
10. Finalize communication through MPI by calling MPI_Finalize in C/C++ or MPI_FINALIZE in Fortran (the function must be called after the last call to a cluster FFT function and the last call to an MPI function).
Several code examples in the "Examples for Cluster FFT Functions" section in Appendix C illustrate cluster FFT computations.

Distributing Data among Processes
The Intel MKL cluster FFT functions store all input and output multi-dimensional arrays (matrices) in one-dimensional arrays (vectors). The arrays are stored in the row-major order in C/C++ and in the column-major order in Fortran. For example, a two-dimensional matrix A of size (m,n) is stored in a vector B of size m*n so that
• B[i*n+j]=A[i][j] in C/C++ (i=0, ..., m-1, j=0, ..., n-1)
• B((j-1)*m+i)=A(i,j) in Fortran (i=1, ..., m, j=1, ..., n).
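The two storage conventions can be written as small index functions. The C sketch below is only an illustration: index_row_major and index_col_major_1based are hypothetical names, and the second function assumes that both the matrix indices and the position in B are 1-based, as is usual in Fortran.

```c
#include <assert.h>
#include <stddef.h>

/* C/C++ convention: 0-based, row-major. Position of A[i][j] in B
 * for a matrix with n columns. */
size_t index_row_major(size_t i, size_t j, size_t n)
{
    return i * n + j;
}

/* Fortran convention: 1-based, column-major. 1-based position of
 * A(i,j) in B for a matrix with m rows. */
size_t index_col_major_1based(size_t i, size_t j, size_t m)
{
    assert(i >= 1 && j >= 1);
    return (j - 1) * m + i;
}
```

With m = 3 and n = 2, the last Fortran element A(3,2) lands at position 6 = m*n, and the last C element A[2][1] lands at offset 5 = m*n-1.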
NOTE Order of FFT dimensions is the same as the order of array dimensions in the programming language. For example, a 3-dimensional FFT with Lengths=(m,n,l) can be computed over an array Ar[m][n][l] in C/C++ or AR(m,n,l) in Fortran.

All MPI processes involved in cluster FFT computation operate on their own portions of data. These local arrays make up the virtual global array that the fast Fourier transform is applied to. It is your responsibility to properly allocate local arrays (if needed), fill them with initial data, and gather resulting data into an actual global array or process the resulting data differently. To be able to do this, see the sections below on how the virtual global array is composed of the local ones.

Multi-dimensional transforms
If the dimension of the transform is greater than one, the cluster FFT function library splits data in the dimension whose index changes most slowly, so that the parts contain all elements with several consecutive values of this index. It is the first dimension in C and the last one in Fortran. If the global array is two-dimensional, in C, it gives each process several consecutive rows. The term "rows" will be used regardless of the array dimension and programming language. Local arrays are placed in memory allocated for the virtual global array consecutively, in the order determined by process ranks. For example, in case of two processes, during the computation of a three-dimensional transform whose matrix has size (11,15,12), the processes may store local arrays of sizes (6,15,12) and (5,15,12), respectively.
If p is the number of MPI processes and the matrix of a transform to be computed has size (m,n,l), in C, each MPI process works with a local data array of size (mq, n, l), where the mq sum to m over q = 0, ..., p-1 (Σ mq = m). Local input arrays must contain appropriate parts of the actual global input array, and then local output arrays will contain appropriate parts of the actual global output array.
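To make the condition Σ mq = m concrete, the following C sketch splits m rows over p processes as evenly as possible. This split is an assumption made only for illustration: the text does not state how Intel MKL chooses the mq, and the real values must always be read from the descriptor rather than computed. The names row_block and split_rows are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of one process's share of the rows. */
typedef struct {
    size_t start;  /* index of the first global row (0-based) */
    size_t count;  /* number of consecutive rows, m_q */
} row_block;

/* ASSUMED even split, for illustration only: the first (m % p)
 * ranks receive one extra row. Intel MKL may distribute the rows
 * differently. */
row_block split_rows(size_t m, size_t p, size_t q)
{
    row_block b;
    size_t base, extra;
    assert(p > 0 && q < p);
    base = m / p;
    extra = m % p;
    b.count = base + (q < extra ? 1 : 0);
    b.start = q * base + (q < extra ? q : extra);
    return b;
}
```

For m = 11 and p = 2 this gives 6 rows starting at row 0 for rank 0 and 5 rows starting at row 6 for rank 1, which matches the (6,15,12) and (5,15,12) example above.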
You can figure out which particular rows of the global array the local array must contain from the following configuration parameters of the cluster FFT interface: CDFT_LOCAL_NX, CDFT_LOCAL_START_X, and CDFT_LOCAL_SIZE. To retrieve values of the parameters, use the DftiGetValueDM function:
• CDFT_LOCAL_NX specifies how many rows of the global array the current process receives.
• CDFT_LOCAL_START_X specifies which row of the global input or output array corresponds to the first row of the local input or output array. If A is a global array and L is the appropriate local array, then
  - L[i][j][k]=A[i+cdft_local_start_x][j][k], where i=0, ..., mq-1, j=0, ..., n-1, k=0, ..., l-1 for C/C++
  - L(i,j,k)=A(i,j,k+cdft_local_start_x-1), where i=1, ..., m, j=1, ..., n, k=1, ..., lq for Fortran.
Example "2D Out-of-place Cluster FFT Computation" in Appendix C shows how the data is distributed among processes for a two-dimensional cluster FFT computation.

One-dimensional transforms
In this case, input and output data are distributed among processes differently, and even the numbers of elements stored in a particular process before and after the transform may be different. Each local array stores a segment of consecutive elements of the appropriate global array. Such a segment is determined by the number of elements and a shift with respect to the first array element. So, to specify segments of the global input and output arrays that a particular process receives, four configuration parameters are needed: CDFT_LOCAL_NX, CDFT_LOCAL_START_X, CDFT_LOCAL_OUT_NX, and CDFT_LOCAL_OUT_START_X. Use the DftiGetValueDM function to retrieve their values.
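In C/C++, where arrays are flattened in row-major order, both the multi-dimensional rule L[i][j][k]=A[i+cdft_local_start_x][j][k] and the one-dimensional segment rule describe one contiguous block of the flattened global array. The sketch below copies such a block; it is an illustration under stated assumptions, not Intel MKL code. The function name copy_local_part is hypothetical, the start and count arguments stand for values that would be obtained through DftiGetValueDM, and double elements are used for brevity although real data is often complex.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy `count` consecutive rows, beginning at global row `start`
 * (0-based, as in the C/C++ formula), from a flattened row-major
 * global array into a local array.
 * row_len is the number of elements per row: n*l for an (m,n,l)
 * array, or 1 for a one-dimensional transform, where a "row" is a
 * single element and `start` is the element shift. */
void copy_local_part(const double *global, size_t start, size_t count,
                     size_t row_len, double *local)
{
    assert(global != NULL && local != NULL);
    memcpy(local, global + start * row_len,
           count * row_len * sizeof *local);
}
```

Note that the local buffer itself may have to be larger than count*row_len elements; its required size is given by CDFT_LOCAL_SIZE.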
The meaning of the four configuration parameters depends upon the type of the transform, as shown in Table "Data Distribution Configuration Parameters for 1D Transforms":

Data Distribution Configuration Parameters for 1D Transforms
Meaning of the Parameter             Forward Transform         Backward Transform
Number of elements in input array    CDFT_LOCAL_NX             CDFT_LOCAL_OUT_NX
Elements shift in input array        CDFT_LOCAL_START_X        CDFT_LOCAL_OUT_START_X
Number of elements in output array   CDFT_LOCAL_OUT_NX         CDFT_LOCAL_NX
Elements shift in output array       CDFT_LOCAL_OUT_START_X    CDFT_LOCAL_START_X

Memory size for local data
The memory size needed for local arrays cannot be calculated from CDFT_LOCAL_NX (CDFT_LOCAL_OUT_NX) alone, because the cluster FFT functions sometimes require slightly more memory for local data than the size of the appropriate sub-array. The configuration parameter CDFT_LOCAL_SIZE specifies the size of the local input and output array in data elements. Each local input and output array must have a size of at least CDFT_LOCAL_SIZE*size_of_element. Note that in the current implementation of the cluster FFT interface, data elements can be real or complex values, each complex value consisting of the real and imaginary parts. If you employ a user-defined workspace for in-place transforms (for more information, refer to Table "Settable configuration Parameters"), it must have the same size as the local arrays. Example "1D In-place Cluster FFT Computations" in Appendix C illustrates how the cluster FFT functions distribute data among processes in case of a one-dimensional FFT computation performed with a user-defined workspace.

Available Auxiliary Functions
If a global input array is located on one MPI process and you want to obtain its local parts, or you want to gather the global output array on one MPI process, you can use the functions MKL_CDFT_ScatterData and MKL_CDFT_GatherData to distribute or gather data among processes, respectively. These functions are defined in a file that is delivered with Intel MKL and located in the following subdirectory of the Intel MKL installation directory: examples/cdftc/source/cdft_example_support.c for C/C++ and examples/cdftf/source/cdft_example_support.f90 for Fortran.

Restriction on Lengths of Transforms
The algorithm that the Intel MKL cluster FFT functions use to distribute data among processes imposes a restriction on lengths of transforms with respect to the number of MPI processes used for the FFT computation:
• For a multi-dimensional transform, lengths of the first two dimensions in C/C++ or of the last two dimensions in Fortran must be not less than the number of MPI processes.
• Length of a one-dimensional transform must be the product of two integers each of which is not less than the number of MPI processes.
Non-compliance with the restriction causes an error CDFT_SPREAD_ERROR (refer to Error Codes for details). To achieve compliance, you can change the transform lengths and/or the number of MPI processes, which is specified at the start of an MPI program. MPI-2 enables changing the number of processes during execution of an MPI program.

Cluster FFT Interface
To use the cluster FFT functions, you need to access the module MKL_CDFT through the "use" statement in Fortran, or access the header file mkl_cdft.h through "include" in C/C++.
The Fortran interface provides a derived type DFTI_DESCRIPTOR_DM; a number of named constants representing various names of configuration parameters and their possible values; and a number of overloaded functions through the generic functionality of Fortran 95.
The C interface provides a structure type DFTI_DESCRIPTOR_DM_HANDLE and a number of functions, some of which accept a different number of input arguments.
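The restriction stated under Restriction on Lengths of Transforms can be checked before a descriptor is created, so that CDFT_SPREAD_ERROR is not encountered at run time. The following C sketch is one possible pre-check written directly from the two rules above; it is not part of Intel MKL, and the function names multi_dim_lengths_ok and one_dim_length_ok are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Multi-dimensional rule, C/C++ dimension order: the first two
 * lengths must each be not less than the number of processes p. */
bool multi_dim_lengths_ok(const size_t *lengths, size_t dim, size_t p)
{
    assert(lengths != NULL && dim >= 2);
    return lengths[0] >= p && lengths[1] >= p;
}

/* One-dimensional rule: n must be a product of two integers, each
 * not less than p. Every factorization n = d * (n/d) with d <= n/d
 * is tried. */
bool one_dim_length_ok(size_t n, size_t p)
{
    for (size_t d = 1; d * d <= n; ++d) {
        if (n % d == 0 && d >= p && n / d >= p) {
            return true;
        }
    }
    return false;
}
```

For example, with 4 processes a one-dimensional length of 16 (4*4) or 20 (4*5) satisfies the rule, while 12 and 15 do not, because every factorization of those lengths has a factor smaller than 4.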
To provide communication between parallel processes through MPI, the following include statement must be present in your code:
• Fortran: INCLUDE "mpif.h" (for some MPI versions, "mpif90.h" header may be used instead).
• C/C++: #include "mpi.h"
There are three main categories of the cluster FFT functions in Intel MKL:
1. Descriptor Manipulation. There are three functions in this category. The DftiCreateDescriptorDM function creates an FFT descriptor whose storage is allocated dynamically. The DftiCommitDescriptorDM function "commits" the descriptor to all its settings. The DftiFreeDescriptorDM function frees up the memory allocated for the descriptor.
2. FFT Computation. There are two functions in this category. The DftiComputeForwardDM function performs the forward FFT computation, and the DftiComputeBackwardDM function performs the backward FFT computation.
3. Descriptor Configuration. There are two functions in this category. The DftiSetValueDM function sets one specific configuration value to one of the many configuration parameters. The DftiGetValueDM function gets the current value of any of these configuration parameters, all of which are readable. These parameters, though many, are handled one at a time.

Descriptor Manipulation Functions
There are three functions in this category: create a descriptor, commit a descriptor, and free a descriptor.

DftiCreateDescriptorDM
Allocates memory for the descriptor data structure and preliminarily initializes it.

Syntax
Fortran:
    Status = DftiCreateDescriptorDM(comm, handle, v1, v2, dim, size)
    Status = DftiCreateDescriptorDM(comm, handle, v1, v2, dim, sizes)
C:
    status = DftiCreateDescriptorDM(comm, &handle, v1, v2, dim, size );
    status = DftiCreateDescriptorDM(comm, &handle, v1, v2, dim, sizes );

Include Files
• FORTRAN 90: mkl_cdft.f90
• C: mkl_cdft.h

Input Parameters
comm     MPI communicator, e.g. MPI_COMM_WORLD.
v1       Precision of the transform.
v2       Type of the forward domain. Must be DFTI_COMPLEX for complex-to-complex transforms or DFTI_REAL for real-to-complex transforms.
dim      Dimension of the transform.
size     Length of the transform in a one-dimensional case.
sizes    Lengths of the transform in a multi-dimensional case.

Output Parameters
handle   Pointer to the descriptor handle of transform. If the function completes successfully, the pointer to the created handle is stored in the variable.

Description
This function allocates memory in a particular MPI process for the descriptor data structure and instantiates it with default configuration settings with respect to the precision, domain, dimension, and length of the desired transform. The domain is understood to be the domain of the forward transform. The result is a pointer to the created descriptor. This function differs slightly from the "initialization" function found in more traditional software packages or libraries used for computing the FFT (the role played here by DftiCommitDescriptorDM): it does not perform any significant computation work, such as twiddle factor computation, because the default configuration settings can still be changed using the function DftiSetValueDM.
The value of the parameter v1 is specified through the named constants DFTI_SINGLE and DFTI_DOUBLE. It corresponds to the precision of input data, output data, and computation. A setting of DFTI_SINGLE indicates single-precision floating-point data type and a setting of DFTI_DOUBLE indicates double-precision floating-point data type.
The parameter dim is a simple positive integer indicating the dimension of the transform.
In C/C++, for one-dimensional transforms, length is a single integer value of the parameter size having type MKL_LONG; for multi-dimensional transforms, length is supplied with the parameter sizes, which is an array of integers having type MKL_LONG. In Fortran, length is an integer or an array of integers.
Return Values
The function returns DFTI_NO_ERROR when it completes successfully. In this case, the pointer to the created descriptor handle is stored in handle. If the function fails, it returns a value of another error class constant (for the list of constants, refer to the Error Codes section).

Interface and Prototype
    ! Fortran Interface
    INTERFACE DftiCreateDescriptorDM
      INTEGER(4) FUNCTION DftiCreateDescriptorDMn(C,H,P1,P2,D,L)
        TYPE(DFTI_DESCRIPTOR_DM), POINTER :: H
        INTEGER(4) C,P1,P2,D,L(*)
      END FUNCTION
      INTEGER(4) FUNCTION DftiCreateDescriptorDM1(C,H,P1,P2,D,L)
        TYPE(DFTI_DESCRIPTOR_DM), POINTER :: H
        INTEGER(4) C,P1,P2,D,L
      END FUNCTION
    END INTERFACE

    /* C/C++ prototype */
    MKL_LONG DftiCreateDescriptorDM(MPI_Comm,DFTI_DESCRIPTOR_DM_HANDLE*,
        enum DFTI_CONFIG_VALUE,enum DFTI_CONFIG_VALUE,MKL_LONG,...);

DftiCommitDescriptorDM
Performs all initialization for the actual FFT computation.

Syntax
Fortran:
    Status = DftiCommitDescriptorDM(handle)
C:
    status = DftiCommitDescriptorDM(handle);

Include Files
• FORTRAN 90: mkl_cdft.f90
• C: mkl_cdft.h

Input Parameters
handle   The descriptor handle. Must be valid, that is, created in a call to DftiCreateDescriptorDM.

Description
The cluster FFT interface requires a function that completes initialization of a previously created descriptor before the descriptor can be used for FFT computations in a particular MPI process. The DftiCommitDescriptorDM function performs all initialization that facilitates the actual FFT computation. For the current implementation, it may involve exploring many different factorizations of the input length to search for a highly efficient computation method.
Any change of configuration parameters of a committed descriptor via the set value function (see Descriptor Configuration) requires a re-committal of the descriptor before a computation function can be invoked. Typically, this committal function is called right before a computation function call (see FFT Computation).
Return Values
The function returns DFTI_NO_ERROR when it completes successfully. If the function fails, it returns a value of another error class constant (for the list of constants, refer to the Error Codes section).

Interface and Prototype
    ! Fortran Interface
    INTERFACE DftiCommitDescriptorDM
      INTEGER(4) FUNCTION DftiCommitDescriptorDM(handle)
        TYPE(DFTI_DESCRIPTOR_DM), POINTER :: handle
      END FUNCTION
    END INTERFACE

    /* C/C++ prototype */
    MKL_LONG DftiCommitDescriptorDM(DFTI_DESCRIPTOR_DM_HANDLE handle);

DftiFreeDescriptorDM
Frees memory allocated for a descriptor.

Syntax
Fortran:
    Status = DftiFreeDescriptorDM(handle)
C:
    status = DftiFreeDescriptorDM(&handle);

Include Files
• FORTRAN 90: mkl_cdft.f90
• C: mkl_cdft.h

Input Parameters
handle   The descriptor handle. Must be valid, that is, created in a call to DftiCreateDescriptorDM.

Output Parameters
handle   The descriptor handle. Memory allocated for the handle is released on output.

Description
This function frees up all memory allocated for a descriptor in a particular MPI process. Call the DftiFreeDescriptorDM function to delete the descriptor handle. Upon successful completion of DftiFreeDescriptorDM the descriptor handle is no longer valid.

Return Values
The function returns DFTI_NO_ERROR when it completes successfully. If the function fails, it returns a value of another error class constant (for the list of constants, refer to the Error Codes section).

Interface and Prototype
    ! Fortran Interface
    INTERFACE DftiFreeDescriptorDM
      INTEGER(4) FUNCTION DftiFreeDescriptorDM(handle)
        TYPE(DFTI_DESCRIPTOR_DM), POINTER :: handle
      END FUNCTION
    END INTERFACE

    /* C/C++ prototype */
    MKL_LONG DftiFreeDescriptorDM(DFTI_DESCRIPTOR_DM_HANDLE *handle);

FFT Computation Functions
There are two functions in this category: compute the forward transform and compute the backward transform.

DftiComputeForwardDM
Computes the forward FFT.
Syntax
Fortran:
    Status = DftiComputeForwardDM(handle, in_X, out_X)
    Status = DftiComputeForwardDM(handle, in_out_X)
C:
    status = DftiComputeForwardDM(handle, in_X, out_X);
    status = DftiComputeForwardDM(handle, in_out_X);

Include Files
• FORTRAN 90: mkl_cdft.f90
• C: mkl_cdft.h

Input Parameters
handle           The descriptor handle.
in_X, in_out_X   Local part of input data. Array of complex values. Refer to the Distributing Data among Processes section on how to allocate and initialize the array.

Output Parameters
out_X, in_out_X  Local part of output data. Array of complex values. Refer to the Distributing Data among Processes section on how to allocate the array.

Description
The DftiComputeForwardDM function computes the forward FFT. Forward FFT is the transform using the factor e^(-i2π/n).
Before you call the function, the valid descriptor, created by DftiCreateDescriptorDM, must be configured and committed using the DftiCommitDescriptorDM function.
The computation is carried out by calling the DftiComputeForward function. So, the functions have very much in common, and details not explicitly mentioned below can be found in the description of DftiComputeForward.
Local part of input data, as well as local part of the output data, is an appropriate sequence of complex values (each complex value consists of two real numbers: real part and imaginary part) that a particular process stores. See the Distributing Data Among Processes section for details.
Refer to the Configuration Settings section for the list of configuration parameters that the descriptor passes to the function.
The configuration parameter DFTI_PRECISION determines the precision of input data, output data, and transform: a setting of DFTI_SINGLE indicates single-precision floating-point data type and a setting of DFTI_DOUBLE indicates double-precision floating-point data type.
The configuration parameter DFTI_PLACEMENT informs the function whether the computation should be in-place. If the value of this parameter is DFTI_INPLACE (default), you must call the function with two parameters, otherwise you must supply three parameters. If DFTI_PLACEMENT = DFTI_INPLACE and three parameters are supplied, then the third parameter is ignored.

CAUTION Even in case of an out-of-place transform, the local array of input data in_X may be changed. To save data, make its copy before calling DftiComputeForwardDM.

In case of an in-place transform, DftiComputeForwardDM dynamically allocates and deallocates a work buffer of the same size as the local input/output array requires.

NOTE You can specify your own workspace of the same size through the configuration parameter CDFT_WORKSPACE to avoid redundant memory allocation.

Return Values
The function returns DFTI_NO_ERROR when it completes successfully. If the function fails, it returns a value of another error class constant (for the list of constants, refer to the Error Codes section).

Interface and Prototype
! Fortran Interface
INTERFACE DftiComputeForwardDM
  INTEGER(4) FUNCTION DftiComputeForwardDM(h, in_X, out_X)
    TYPE(DFTI_DESCRIPTOR_DM), POINTER :: h
    COMPLEX(8), DIMENSION(*) :: in_X, out_X
  END FUNCTION DftiComputeForwardDM
  INTEGER(4) FUNCTION DftiComputeForwardDMi(h, in_out_X)
    TYPE(DFTI_DESCRIPTOR_DM), POINTER :: h
    COMPLEX(8), DIMENSION(*) :: in_out_X
  END FUNCTION DftiComputeForwardDMi
  INTEGER(4) FUNCTION DftiComputeForwardDMs(h, in_X, out_X)
    TYPE(DFTI_DESCRIPTOR_DM), POINTER :: h
    COMPLEX(4), DIMENSION(*) :: in_X, out_X
  END FUNCTION DftiComputeForwardDMs
  INTEGER(4) FUNCTION DftiComputeForwardDMis(h, in_out_X)
    TYPE(DFTI_DESCRIPTOR_DM), POINTER :: h
    COMPLEX(4), DIMENSION(*) :: in_out_X
  END FUNCTION DftiComputeForwardDMis
END INTERFACE

/* C/C++ prototype */
MKL_LONG DftiComputeForwardDM(DFTI_DESCRIPTOR_DM_HANDLE handle, void *in_X,...);

DftiComputeBackwardDM
Computes the backward FFT.

Syntax
Fortran:
Status = DftiComputeBackwardDM(handle, in_X, out_X)
Status = DftiComputeBackwardDM(handle, in_out_X)
C:
status = DftiComputeBackwardDM(handle, in_X, out_X);
status = DftiComputeBackwardDM(handle, in_out_X);

Include Files
• FORTRAN 90: mkl_cdft.f90
• C: mkl_cdft.h

Input Parameters
handle The descriptor handle.
in_X, in_out_X Local part of input data. Array of complex values. Refer to the Distributing Data among Processes section on how to allocate and initialize the array.

Output Parameters
out_X, in_out_X Local part of output data. Array of complex values. Refer to the Distributing Data among Processes section on how to allocate the array.

Description
The DftiComputeBackwardDM function computes the backward FFT. Backward FFT is the transform using the factor $e^{i2\pi/n}$.
Before you call the function, the valid descriptor, created by DftiCreateDescriptorDM, must be configured and committed using the DftiCommitDescriptorDM function.
The computation is carried out by calling the DftiComputeBackward function.
So, the functions have very much in common, and details not explicitly mentioned below can be found in the description of DftiComputeBackward.
Local part of input data, as well as local part of the output data, is an appropriate sequence of complex values (each complex value consists of two real numbers: real part and imaginary part) that a particular process stores. See the Distributing Data among Processes section for details.
Refer to the Configuration Settings section for the list of configuration parameters that the descriptor passes to the function.
The configuration parameter DFTI_PRECISION determines the precision of input data, output data, and transform: a setting of DFTI_SINGLE indicates single-precision floating-point data type and a setting of DFTI_DOUBLE indicates double-precision floating-point data type.
The configuration parameter DFTI_PLACEMENT informs the function whether the computation should be in-place. If the value of this parameter is DFTI_INPLACE (default), you must call the function with two parameters, otherwise you must supply three parameters. If DFTI_PLACEMENT = DFTI_INPLACE and three parameters are supplied, then the third parameter is ignored.

CAUTION Even in case of an out-of-place transform, the local array of input data in_X may be changed. To save data, make its copy before calling DftiComputeBackwardDM.

In case of an in-place transform, DftiComputeBackwardDM dynamically allocates and deallocates a work buffer of the same size as the local input/output array requires.

NOTE You can specify your own workspace of the same size through the configuration parameter CDFT_WORKSPACE to avoid redundant memory allocation.

Return Values
The function returns DFTI_NO_ERROR when it completes successfully. If the function fails, it returns a value of another error class constant (for the list of constants, refer to the Error Codes section).

Interface and Prototype
! Fortran Interface
INTERFACE DftiComputeBackwardDM
  INTEGER(4) FUNCTION DftiComputeBackwardDM(h, in_X, out_X)
    TYPE(DFTI_DESCRIPTOR_DM), POINTER :: h
    COMPLEX(8), DIMENSION(*) :: in_X, out_X
  END FUNCTION DftiComputeBackwardDM
  INTEGER(4) FUNCTION DftiComputeBackwardDMi(h, in_out_X)
    TYPE(DFTI_DESCRIPTOR_DM), POINTER :: h
    COMPLEX(8), DIMENSION(*) :: in_out_X
  END FUNCTION DftiComputeBackwardDMi
  INTEGER(4) FUNCTION DftiComputeBackwardDMs(h, in_X, out_X)
    TYPE(DFTI_DESCRIPTOR_DM), POINTER :: h
    COMPLEX(4), DIMENSION(*) :: in_X, out_X
  END FUNCTION DftiComputeBackwardDMs
  INTEGER(4) FUNCTION DftiComputeBackwardDMis(h, in_out_X)
    TYPE(DFTI_DESCRIPTOR_DM), POINTER :: h
    COMPLEX(4), DIMENSION(*) :: in_out_X
  END FUNCTION DftiComputeBackwardDMis
END INTERFACE

/* C/C++ prototype */
MKL_LONG DftiComputeBackwardDM(DFTI_DESCRIPTOR_DM_HANDLE handle, void *in_X,...);

Descriptor Configuration Functions
There are two functions in this category: the value setting function DftiSetValueDM sets one particular configuration parameter to an appropriate value, and the value getting function DftiGetValueDM reads the value of one particular configuration parameter.
Some configuration parameters used by cluster FFT functions originate from the conventional FFT interface (see Configuration Settings subsection in the "FFT Functions" section for details).
Other configuration parameters are specific to the cluster FFT. Integer values of these parameters have type MKL_LONG in C/C++ and INTEGER(4) in Fortran. Configuration parameters that are floating-point scalars have type float or double in C/C++ and REAL(4) or REAL(8) in Fortran. The configuration parameters whose values are named constants have the enum type in C/C++ and INTEGER in Fortran. They are defined in the mkl_cdft.h header file in C/C++ and MKL_CDFT module in Fortran.
Names of the configuration parameters specific to the cluster FFT interface have prefix CDFT.

DftiSetValueDM
Sets one particular configuration parameter with the specified configuration value.

Syntax
Fortran:
Status = DftiSetValueDM (handle, param, value)
C:
status = DftiSetValueDM (handle, param, value);

Include Files
• FORTRAN 90: mkl_cdft.f90
• C: mkl_cdft.h

Input Parameters
handle The descriptor handle. Must be valid, that is, created in a call to DftiCreateDescriptorDM.
param Name of a parameter to be set up in the descriptor handle. See Table "Settable Configuration Parameters" for the list of available parameters.
value Value of the parameter.

Description
This function sets one particular configuration parameter with the specified configuration value. The configuration parameter is one of the named constants listed in the table below, and the configuration value must have the corresponding type. See Configuration Settings for details of the meaning of each setting and for possible values of the parameters whose values are named constants.

Settable Configuration Parameters
Parameter Name | Data Type | Description | Default Value
DFTI_FORWARD_SCALE | Floating-point scalar | Scale factor of forward transform. | 1.0
DFTI_BACKWARD_SCALE | Floating-point scalar | Scale factor of backward transform. | 1.0
DFTI_PLACEMENT | Named constant | Placement of the computation result. | DFTI_INPLACE
DFTI_ORDERING | Named constant | Scrambling of data order. | DFTI_ORDERED
CDFT_WORKSPACE | Array of an appropriate type | Auxiliary buffer, a user-defined workspace. Enables saving memory during in-place computations. | NULL (allocate workspace dynamically).
DFTI_PACKED_FORMAT | Named constant | Packed format, real data. | DFTI_PERM_FORMAT: default and the only available value for one-dimensional transforms. DFTI_CCE_FORMAT: default and the only available value for multi-dimensional transforms.
DFTI_TRANSPOSE | Named constant | This parameter determines how the output data is located for multi-dimensional transforms. If the parameter value is DFTI_NONE, the data is located in a usual manner described in this manual. If the value is DFTI_ALLOW, the last (first) global transposition is not performed for a forward (backward) transform. | DFTI_NONE

Return Values
The function returns DFTI_NO_ERROR when it completes successfully. If the function fails, it returns a value of another error class constant (for the list of constants, refer to the Error Codes section).

Interface and Prototype
! Fortran Interface
INTERFACE DftiSetValueDM
  INTEGER(4) FUNCTION DftiSetValueDM(h, p, v)
    TYPE(DFTI_DESCRIPTOR_DM), POINTER :: h
    INTEGER(4) :: p, v
  END FUNCTION
  INTEGER(4) FUNCTION DftiSetValueDMd(h, p, v)
    TYPE(DFTI_DESCRIPTOR_DM), POINTER :: h
    INTEGER(4) :: p
    REAL(8) :: v
  END FUNCTION
  INTEGER(4) FUNCTION DftiSetValueDMs(h, p, v)
    TYPE(DFTI_DESCRIPTOR_DM), POINTER :: h
    INTEGER(4) :: p
    REAL(4) :: v
  END FUNCTION
  INTEGER(4) FUNCTION DftiSetValueDMsw(h, p, v)
    TYPE(DFTI_DESCRIPTOR_DM), POINTER :: h
    INTEGER(4) :: p
    COMPLEX(4) :: v(*)
  END FUNCTION
  INTEGER(4) FUNCTION DftiSetValueDMdw(h, p, v)
    TYPE(DFTI_DESCRIPTOR_DM), POINTER :: h
    INTEGER(4) :: p
    COMPLEX(8) :: v(*)
  END FUNCTION
END INTERFACE

/* C/C++ prototype */
MKL_LONG DftiSetValueDM(DFTI_DESCRIPTOR_DM_HANDLE handle, int param,...);

DftiGetValueDM
Gets the value of one particular configuration parameter.

Syntax
Fortran:
Status = DftiGetValueDM(handle, param, value)
C:
status = DftiGetValueDM(handle, param, &value);

Include Files
• FORTRAN 90: mkl_cdft.f90
• C: mkl_cdft.h

Input Parameters
handle The descriptor handle. Must be valid, that is, created in a call to DftiCreateDescriptorDM.
param Name of a parameter to be retrieved from the descriptor.
See Table "Retrievable Configuration Parameters" for the list of available parameters.

Output Parameters
value Value of the parameter.

Description
This function gets the configuration value of one particular configuration parameter. The configuration parameter is one of the named constants listed in the table below, and the configuration value has the corresponding type, which can be a named constant or a native type. Possible values of the named constants can be found in Table "Configuration Parameters" and relevant subsections of the Configuration Settings section.

Retrievable Configuration Parameters
Parameter Name | Data Type | Description
DFTI_PRECISION | Named constant | Precision of computation, input data and output data.
DFTI_DIMENSION | Integer scalar | Dimension of the transform.
DFTI_LENGTHS | Array of integer values | Array of lengths of the transform. Number of lengths corresponds to the dimension of the transform.
DFTI_FORWARD_SCALE | Floating-point scalar | Scale factor of forward transform.
DFTI_BACKWARD_SCALE | Floating-point scalar | Scale factor of backward transform.
DFTI_PLACEMENT | Named constant | Placement of the computation result.
DFTI_COMMIT_STATUS | Named constant | Shows whether descriptor has been committed.
DFTI_FORWARD_DOMAIN | Named constant | Forward domain of transforms, has the value of DFTI_COMPLEX or DFTI_REAL.
DFTI_ORDERING | Named constant | Scrambling of data order.
CDFT_MPI_COMM | Type of MPI communicator | MPI communicator used for transforms.
CDFT_LOCAL_SIZE | Integer scalar | Necessary size of input, output, and buffer arrays in data elements.
CDFT_LOCAL_X_START | Integer scalar | Row/element number of the global array that corresponds to the first row/element of the local array. For more information, see Distributing Data among Processes.
CDFT_LOCAL_NX | Integer scalar | The number of rows/elements of the global array stored in the local array. For more information, see Distributing Data among Processes.
CDFT_LOCAL_OUT_X_START | Integer scalar | Element number of the appropriate global array that corresponds to the first element of the input or output local array in a 1D case. For details, see Distributing Data among Processes.
CDFT_LOCAL_OUT_NX | Integer scalar | The number of elements of the appropriate global array that are stored in the input or output local array in a 1D case. For details, see Distributing Data among Processes.

Return Values
The function returns DFTI_NO_ERROR when it completes successfully. If the function fails, it returns a value of another error class constant (for the list of constants, refer to the Error Codes section).

Interface and Prototype
! Fortran Interface
INTERFACE DftiGetValueDM
  INTEGER(4) FUNCTION DftiGetValueDM(h, p, v)
    TYPE(DFTI_DESCRIPTOR_DM), POINTER :: h
    INTEGER(4) :: p, v
  END FUNCTION
  INTEGER(4) FUNCTION DftiGetValueDMar(h, p, v)
    TYPE(DFTI_DESCRIPTOR_DM), POINTER :: h
    INTEGER(4) :: p, v(*)
  END FUNCTION
  INTEGER(4) FUNCTION DftiGetValueDMd(h, p, v)
    TYPE(DFTI_DESCRIPTOR_DM), POINTER :: h
    INTEGER(4) :: p
    REAL(8) :: v
  END FUNCTION
  INTEGER(4) FUNCTION DftiGetValueDMs(h, p, v)
    TYPE(DFTI_DESCRIPTOR_DM), POINTER :: h
    INTEGER(4) :: p
    REAL(4) :: v
  END FUNCTION
END INTERFACE

/* C/C++ prototype */
MKL_LONG DftiGetValueDM(DFTI_DESCRIPTOR_DM_HANDLE handle, int param,...);

Error Codes
All the cluster FFT functions return an integer value denoting the status of the operation. These values are identified by named constants. Each function returns DFTI_NO_ERROR if no errors were encountered during execution. Otherwise, a function generates an error code. In addition to FFT error codes, the cluster FFT interface has its own ones. Named constants specific to the cluster FFT interface have prefix "CDFT" in names. Table "Error Codes that Cluster FFT Functions Return" lists error codes that the cluster FFT functions may return.

Error Codes that Cluster FFT Functions Return
Named Constants | Comments
DFTI_NO_ERROR | No error.
DFTI_MEMORY_ERROR | Usually associated with memory allocation.
DFTI_INVALID_CONFIGURATION | Invalid settings of one or more configuration parameters.
DFTI_INCONSISTENT_CONFIGURATION | Inconsistent configuration or input parameters.
DFTI_NUMBER_OF_THREADS_ERROR | Number of OMP threads in the computation function is not equal to the number of OMP threads in the initialization stage (commit function).
DFTI_MULTITHREADED_ERROR | Usually associated with a value that OMP routines return in case of errors.
DFTI_BAD_DESCRIPTOR | Descriptor is unusable for computation.
DFTI_UNIMPLEMENTED | Unimplemented legitimate settings; implementation dependent.
DFTI_MKL_INTERNAL_ERROR | Internal library error.
DFTI_1D_LENGTH_EXCEEDS_INT32 | Length of one of dimensions exceeds $2^{32}-1$ (4 bytes).
CDFT_SPREAD_ERROR | Data cannot be distributed. (For more information, see Distributing Data among Processes.)
CDFT_MPI_ERROR | MPI error. Occurs when calling MPI.

PBLAS Routines 12
This chapter describes the Intel® Math Kernel Library implementation of the PBLAS (Parallel Basic Linear Algebra Subprograms) routines from the ScaLAPACK package for distributed-memory architecture. PBLAS is intended for use in vector-vector, matrix-vector, and matrix-matrix operations to simplify the parallelization of linear algebra codes. The design of PBLAS is as consistent as possible with that of the BLAS.
The routine descriptions are arranged in several sections according to the PBLAS level of operation:
• PBLAS Level 1 Routines (distributed vector-vector operations)
• PBLAS Level 2 Routines (distributed matrix-vector operations)
• PBLAS Level 3 Routines (distributed matrix-matrix operations)
Each section presents the routine and function group descriptions in alphabetical order by the routine group name; for example, the p?asum group, the p?axpy group. The question mark in the group name corresponds to a character indicating the data type (s, d, c, and z or their combination); see Routine Naming Conventions.

NOTE PBLAS routines are provided only with Intel® MKL versions for Linux* and Windows* OSs.

Generally, PBLAS runs on a network of computers using MPI as a message-passing layer and a set of prebuilt communication subprograms (BLACS), as well as a set of PBLAS optimized for the target architecture. The Intel MKL version of PBLAS is optimized for Intel® processors. For the detailed system and environment requirements, see Intel® MKL Release Notes and Intel® MKL User's Guide.
For full reference on PBLAS routines and related information, see http://www.netlib.org/scalapack/html/pblas_qref.html.

Optimization Notice
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804

Overview
The model of the computing environment for PBLAS is represented as either a one-dimensional array of processes or a two-dimensional process grid. To use PBLAS, all global matrices or vectors must be distributed on this array or grid prior to calling the PBLAS routines.
PBLAS uses the two-dimensional block-cyclic data distribution as a layout for dense matrix computations. This distribution provides good work balance between available processors and makes it possible to use PBLAS Level 3 routines for optimal local computations. Information about the data distribution that is required to establish the mapping between each global array and its corresponding process and memory location is contained in the so-called array descriptor associated with each global array. Table "Content of Array Descriptor for Dense Matrices" gives an example of an array descriptor structure.

Content of Array Descriptor for Dense Matrices
Array Element # | Name | Definition
1 | dtype | Descriptor type (=1 for dense matrices)
2 | ctxt | BLACS context handle for the process grid
3 | m | Number of rows in the global array
4 | n | Number of columns in the global array
5 | mb | Row blocking factor
6 | nb | Column blocking factor
7 | rsrc | Process row over which the first row of the global array is distributed
8 | csrc | Process column over which the first column of the global array is distributed
9 | lld | Leading dimension of the local array

The number of rows and columns of a global dense matrix that a particular process in a grid receives after data distributing is denoted by LOCr() and LOCc(), respectively. To compute these numbers, you can use the ScaLAPACK tool routine numroc.
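
To make the block-cyclic counts concrete, the following C sketch re-implements the arithmetic behind numroc for one dimension, together with two small index helpers. It is an illustration under stated assumptions, not the library routine: numroc_sketch, owner_of, and local_index_of are hypothetical names, indices in the two helpers are 0-based, and the helper formulas follow the standard block-cyclic definition rather than any text in this manual. In descriptor terms, LOCr() would correspond to a call with (m, mb, process row, rsrc, number of process rows) and LOCc() to a call with (n, nb, process column, csrc, number of process columns).

```c
#include <assert.h>

/* Number of rows (or columns) of a global dimension of length n that
 * process iproc owns when blocks of size nb are dealt out cyclically
 * over nprocs processes, starting with process isrcproc.
 * Mirrors the logic of the ScaLAPACK tool routine numroc
 * (illustrative re-implementation, not the library routine). */
static int numroc_sketch(int n, int nb, int iproc, int isrcproc, int nprocs)
{
    assert(n >= 0 && nb > 0 && nprocs > 0);
    /* Distance of this process from the one holding the first block. */
    int mydist = (nprocs + iproc - isrcproc) % nprocs;
    int nblocks = n / nb;                  /* number of whole blocks */
    int count = (nblocks / nprocs) * nb;   /* every process gets these */
    int extrablks = nblocks % nprocs;      /* leftover whole blocks */
    if (mydist < extrablks) {
        count += nb;                       /* one more whole block */
    } else if (mydist == extrablks) {
        count += n % nb;                   /* the final partial block */
    }
    return count;
}

/* Process that owns 0-based global index g. */
static int owner_of(int g, int nb, int isrcproc, int nprocs)
{
    return (g / nb + isrcproc) % nprocs;
}

/* 0-based position of global index g inside its owner's local array. */
static int local_index_of(int g, int nb, int nprocs)
{
    return (g / (nb * nprocs)) * nb + g % nb;
}
```

For example, with n = 10, nb = 3, and two processes starting at process 0, process 0 holds global indices 0..2 and 6..8 (6 elements) and process 1 holds 3..5 and 9 (4 elements), so numroc_sketch(10, 3, 0, 0, 2) and numroc_sketch(10, 3, 1, 0, 2) give 6 and 4.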
After the block-cyclic distribution of global data is done, you may choose to perform an operation on a submatrix of the global matrix A, which is contained in the global subarray sub(A), defined by the following 6 values (for dense matrices):
m The number of rows of sub(A)
n The number of columns of sub(A)
a A pointer to the local array containing the entire global array A
ia The row index of sub(A) in the global array
ja The column index of sub(A) in the global array
desca The array descriptor for the global array A

Intel MKL provides the PBLAS routines with an interface similar to the interface used in the Netlib PBLAS (see http://www.netlib.org/scalapack/html/pblas_qref.html).

Routine Naming Conventions
The naming convention for PBLAS routines is similar to that used for BLAS routines (see Routine Naming Conventions in Chapter 2). A general rule is that each routine name in PBLAS that has a BLAS equivalent is simply the BLAS name prefixed by the initial letter p, which stands for "parallel".
The Intel MKL PBLAS routine names have the following structure:
p <character> <name> <mod> ( )
The <character> field indicates the Fortran data type:
s real, single precision
c complex, single precision
d real, double precision
z complex, double precision
i integer
Some routines and functions can have combined character codes, such as sc or dz. For example, the function pscasum uses a complex input array and returns a real value.
The <name> field, in PBLAS level 1, indicates the operation type. For example, the PBLAS level 1 routines p?dot, p?swap, p?copy compute a vector dot product, vector swap, and a copy vector, respectively.
In PBLAS level 2 and 3, <name> reflects the matrix argument type:
ge general matrix
sy symmetric matrix
he Hermitian matrix
tr triangular matrix
In PBLAS level 3, <name>=tran indicates the transposition of the matrix.
The <mod> field, if present, provides additional details of the operation.
The PBLAS level 1 names can have the following characters in the <mod> field:
c conjugated vector
u unconjugated vector
The PBLAS level 2 names can have the following additional characters in the <mod> field:
mv matrix-vector product
sv solving a system of linear equations with matrix-vector operations
r rank-1 update of a matrix
r2 rank-2 update of a matrix.
The PBLAS level 3 names can have the following additional characters in the <mod> field:
mm matrix-matrix product
sm solving a system of linear equations with matrix-matrix operations
rk rank-k update of a matrix
r2k rank-2k update of a matrix.
The examples below show how to interpret PBLAS routine names:
pddot: double-precision real distributed vector-vector dot product
pcdotc: complex distributed vector-vector dot product, conjugated
pscasum: sum of magnitudes of distributed vector elements, single precision real output and single precision complex input
pcdotu: distributed vector-vector dot product, unconjugated, complex
psgemv: distributed matrix-vector product, general matrix, single precision
pztrmm: distributed matrix-matrix product, triangular matrix, double-precision complex.

PBLAS Level 1 Routines
PBLAS Level 1 includes routines and functions that perform distributed vector-vector operations. Table "PBLAS Level 1 Routine Groups and Their Data Types" lists the PBLAS Level 1 routine groups and the data types associated with them.

PBLAS Level 1 Routine Groups and Their Data Types
Routine or Function Group | Data Types | Description
p?amax | s, d, c, z | Calculates an index of the distributed vector element with maximum absolute value
p?asum | s, d, sc, dz | Calculates sum of magnitudes of a distributed vector
p?axpy | s, d, c, z | Calculates distributed vector-scalar product
p?copy | s, d, c, z | Copies a distributed vector
p?dot | s, d | Calculates a dot product of two distributed real vectors
p?dotc | c, z | Calculates a dot product of two distributed complex vectors, one of them is conjugated
p?dotu | c, z | Calculates a dot product of two distributed complex vectors
p?nrm2 | s, d, sc, dz | Calculates the 2-norm (Euclidean norm) of a distributed vector
p?scal | s, d, c, z, cs, zd | Calculates a product of a distributed vector by a scalar
p?swap | s, d, c, z | Swaps two distributed vectors

p?amax
Computes the global index of the element of a distributed vector with maximum absolute value.

Syntax
call psamax(n, amax, indx, x, ix, jx, descx, incx)
call pdamax(n, amax, indx, x, ix, jx, descx, incx)
call pcamax(n, amax, indx, x, ix, jx, descx, incx)
call pzamax(n, amax, indx, x, ix, jx, descx, incx)

Include Files
• C: mkl_pblas.h

Description
The functions p?amax compute the global index of the maximum element in absolute value of a distributed vector sub(x), where sub(x) denotes X(ix, jx:jx+n-1) if incx=m_x, and X(ix:ix+n-1, jx) if incx=1.

Input Parameters
n (global) INTEGER. The length of distributed vector sub(x), n≥0.
x (local) REAL for psamax
    DOUBLE PRECISION for pdamax
    COMPLEX for pcamax
    DOUBLE COMPLEX for pzamax
    Array, DIMENSION ((jx-1)*m_x + ix+(n-1)*abs(incx)). This array contains the entries of the distributed vector sub(x).
ix, jx (global) INTEGER. The row and column indices in the distributed matrix X indicating the first row and the first column of the submatrix sub(X), respectively.
descx (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix X.
incx (global) INTEGER. Specifies the increment for the elements of sub(x). Only two values are supported, namely 1 and m_x. incx must not be zero.

Output Parameters
amax (global) REAL for psamax.
    DOUBLE PRECISION for pdamax.
    COMPLEX for pcamax.
    DOUBLE COMPLEX for pzamax.
    Maximum absolute value (magnitude) of elements of the distributed vector only in its scope.
indx (global) INTEGER. The global index of the maximum element in absolute value of the distributed vector sub(x) only in its scope.

p?asum
Computes the sum of magnitudes of elements of a distributed vector.

Syntax
call psasum(n, asum, x, ix, jx, descx, incx)
call pscasum(n, asum, x, ix, jx, descx, incx)
call pdasum(n, asum, x, ix, jx, descx, incx)
call pdzasum(n, asum, x, ix, jx, descx, incx)

Include Files
• C: mkl_pblas.h

Description
The functions p?asum compute the sum of the magnitudes of elements of a distributed vector sub(x), where sub(x) denotes X(ix, jx:jx+n-1) if incx=m_x, and X(ix:ix+n-1, jx) if incx=1.

Input Parameters
n (global) INTEGER. The length of distributed vector sub(x), n≥0.
x (local) REAL for psasum
    DOUBLE PRECISION for pdasum
    COMPLEX for pscasum
    DOUBLE COMPLEX for pdzasum
    Array, DIMENSION ((jx-1)*m_x + ix+(n-1)*abs(incx)). This array contains the entries of the distributed vector sub(x).
ix, jx (global) INTEGER. The row and column indices in the distributed matrix X indicating the first row and the first column of the submatrix sub(X), respectively.
descx (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix X.
incx (global) INTEGER. Specifies the increment for the elements of sub(x). Only two values are supported, namely 1 and m_x. incx must not be zero.

Output Parameters
asum (local) REAL for psasum and pscasum.
    DOUBLE PRECISION for pdasum and pdzasum.
    Contains the sum of magnitudes of elements of the distributed vector only in its scope.

p?axpy
Computes a distributed vector-scalar product and adds the result to a distributed vector.

Syntax
call psaxpy(n, a, x, ix, jx, descx, incx, y, iy, jy, descy, incy)
call pdaxpy(n, a, x, ix, jx, descx, incx, y, iy, jy, descy, incy)
call pcaxpy(n, a, x, ix, jx, descx, incx, y, iy, jy, descy, incy)
call pzaxpy(n, a, x, ix, jx, descx, incx, y, iy, jy, descy, incy)

Include Files
• C: mkl_pblas.h

Description
The p?axpy routines perform the following operation with distributed vectors:
sub(y) := sub(y) + a*sub(x)
where:
a is a scalar;
sub(x) and sub(y) are n-element distributed vectors.
sub(x) denotes X(ix, jx:jx+n-1) if incx=m_x, and X(ix:ix+n-1, jx) if incx=1;
sub(y) denotes Y(iy, jy:jy+n-1) if incy=m_y, and Y(iy:iy+n-1, jy) if incy=1.

Input Parameters
n (global) INTEGER. The length of distributed vectors, n≥0.
a (local) REAL for psaxpy
    DOUBLE PRECISION for pdaxpy
    COMPLEX for pcaxpy
    DOUBLE COMPLEX for pzaxpy
    Specifies the scalar a.
x (local) REAL for psaxpy
    DOUBLE PRECISION for pdaxpy
    COMPLEX for pcaxpy
    DOUBLE COMPLEX for pzaxpy
    Array, DIMENSION ((jx-1)*m_x + ix+(n-1)*abs(incx)). This array contains the entries of the distributed vector sub(x).
ix, jx (global) INTEGER. The row and column indices in the distributed matrix X indicating the first row and the first column of the submatrix sub(X), respectively.
descx (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix X.
incx (global) INTEGER. Specifies the increment for the elements of sub(x).
    Only two values are supported, namely 1 and m_x. incx must not be zero.
y (local) REAL for psaxpy
    DOUBLE PRECISION for pdaxpy
    COMPLEX for pcaxpy
    DOUBLE COMPLEX for pzaxpy
    Array, DIMENSION ((jy-1)*m_y + iy+(n-1)*abs(incy)). This array contains the entries of the distributed vector sub(y).
iy, jy (global) INTEGER. The row and column indices in the distributed matrix Y indicating the first row and the first column of the submatrix sub(Y), respectively.
descy (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix Y.
incy (global) INTEGER. Specifies the increment for the elements of sub(y). Only two values are supported, namely 1 and m_y. incy must not be zero.

Output Parameters
y Overwritten by sub(y) := sub(y) + a*sub(x).

p?copy
Copies one distributed vector to another vector.

Syntax
call pscopy(n, x, ix, jx, descx, incx, y, iy, jy, descy, incy)
call pdcopy(n, x, ix, jx, descx, incx, y, iy, jy, descy, incy)
call pccopy(n, x, ix, jx, descx, incx, y, iy, jy, descy, incy)
call pzcopy(n, x, ix, jx, descx, incx, y, iy, jy, descy, incy)
call picopy(n, x, ix, jx, descx, incx, y, iy, jy, descy, incy)

Include Files
• C: mkl_pblas.h

Description
The p?copy routines perform a copy operation with distributed vectors defined as
sub(y) = sub(x),
where sub(x) and sub(y) are n-element distributed vectors.
sub(x) denotes X(ix, jx:jx+n-1) if incx=m_x, and X(ix:ix+n-1, jx) if incx=1;
sub(y) denotes Y(iy, jy:jy+n-1) if incy=m_y, and Y(iy:iy+n-1, jy) if incy=1.

Input Parameters
n (global) INTEGER. The length of distributed vectors, n≥0.
x (local) REAL for pscopy
    DOUBLE PRECISION for pdcopy
    COMPLEX for pccopy
    DOUBLE COMPLEX for pzcopy
    INTEGER for picopy
    Array, DIMENSION ((jx-1)*m_x + ix+(n-1)*abs(incx)). This array contains the entries of the distributed vector sub(x).
ix, jx (global) INTEGER.
    The row and column indices in the distributed matrix X indicating the first row and the first column of the submatrix sub(X), respectively.
descx (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix X.
incx (global) INTEGER. Specifies the increment for the elements of sub(x). Only two values are supported, namely 1 and m_x. incx must not be zero.
y (local) REAL for pscopy
    DOUBLE PRECISION for pdcopy
    COMPLEX for pccopy
    DOUBLE COMPLEX for pzcopy
    INTEGER for picopy
    Array, DIMENSION ((jy-1)*m_y + iy+(n-1)*abs(incy)). This array contains the entries of the distributed vector sub(y).
iy, jy (global) INTEGER. The row and column indices in the distributed matrix Y indicating the first row and the first column of the submatrix sub(Y), respectively.
descy (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix Y.
incy (global) INTEGER. Specifies the increment for the elements of sub(y). Only two values are supported, namely 1 and m_y. incy must not be zero.

Output Parameters
y Overwritten with the distributed vector sub(x).

p?dot
Computes the dot product of two distributed real vectors.

Syntax
call psdot(n, dot, x, ix, jx, descx, incx, y, iy, jy, descy, incy)
call pddot(n, dot, x, ix, jx, descx, incx, y, iy, jy, descy, incy)

Include Files
• C: mkl_pblas.h

Description
The p?dot functions compute the dot product dot of two distributed real vectors defined as
dot = sub(x)'*sub(y)
where sub(x) and sub(y) are n-element distributed vectors.
sub(x) denotes X(ix, jx:jx+n-1) if incx=m_x, and X(ix:ix+n-1, jx) if incx=1;
sub(y) denotes Y(iy, jy:jy+n-1) if incy=m_y, and Y(iy:iy+n-1, jy) if incy=1.

Input Parameters
n (global) INTEGER. The length of distributed vectors, n≥0.
x (local) REAL for psdot
    DOUBLE PRECISION for pddot
    Array, DIMENSION ((jx-1)*m_x + ix+(n-1)*abs(incx)). This array contains the entries of the distributed vector sub(x).
ix, jx (global) INTEGER. The row and column indices in the distributed matrix X indicating the first row and the first column of the submatrix sub(X), respectively.
descx (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix X.
incx (global) INTEGER. Specifies the increment for the elements of sub(x). Only two values are supported, namely 1 and m_x. incx must not be zero.
y (local) REAL for psdot, DOUBLE PRECISION for pddot. Array, DIMENSION ((jy-1)*m_y + iy+(n-1)*abs(incy)). This array contains the entries of the distributed vector sub(y).
iy, jy (global) INTEGER. The row and column indices in the distributed matrix Y indicating the first row and the first column of the submatrix sub(Y), respectively.
descy (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix Y.
incy (global) INTEGER. Specifies the increment for the elements of sub(y). Only two values are supported, namely 1 and m_y. incy must not be zero.

Output Parameters
dot (local) REAL for psdot, DOUBLE PRECISION for pddot. Dot product of sub(x) and sub(y) only in their scope.

p?dotc
Computes the dot product of two distributed complex vectors, one of which is conjugated.

Syntax
call pcdotc(n, dotc, x, ix, jx, descx, incx, y, iy, jy, descy, incy)
call pzdotc(n, dotc, x, ix, jx, descx, incx, y, iy, jy, descy, incy)

Include Files
• C: mkl_pblas.h

Description
The p?dotc functions compute the dot product dotc of two distributed vectors, one of which is conjugated:
dotc = conjg(sub(x)')*sub(y),
where sub(x) and sub(y) are n-element distributed vectors.
sub(x) denotes X(ix, jx:jx+n-1) if incx = m_x, and X(ix:ix+n-1, jx) if incx = 1;
sub(y) denotes Y(iy, jy:jy+n-1) if incy = m_y, and Y(iy:iy+n-1, jy) if incy = 1.

Input Parameters
n (global) INTEGER. The length of distributed vectors, n ≥ 0.
x (local) COMPLEX for pcdotc, DOUBLE COMPLEX for pzdotc. Array, DIMENSION ((jx-1)*m_x + ix+(n-1)*abs(incx)).
This array contains the entries of the distributed vector sub(x).
ix, jx (global) INTEGER. The row and column indices in the distributed matrix X indicating the first row and the first column of the submatrix sub(X), respectively.
descx (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix X.
incx (global) INTEGER. Specifies the increment for the elements of sub(x). Only two values are supported, namely 1 and m_x. incx must not be zero.
y (local) COMPLEX for pcdotc, DOUBLE COMPLEX for pzdotc. Array, DIMENSION ((jy-1)*m_y + iy+(n-1)*abs(incy)). This array contains the entries of the distributed vector sub(y).
iy, jy (global) INTEGER. The row and column indices in the distributed matrix Y indicating the first row and the first column of the submatrix sub(Y), respectively.
descy (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix Y.
incy (global) INTEGER. Specifies the increment for the elements of sub(y). Only two values are supported, namely 1 and m_y. incy must not be zero.

Output Parameters
dotc (local) COMPLEX for pcdotc, DOUBLE COMPLEX for pzdotc. Dot product of sub(x) and sub(y) only in their scope.

p?dotu
Computes the dot product of two distributed complex vectors.

Syntax
call pcdotu(n, dotu, x, ix, jx, descx, incx, y, iy, jy, descy, incy)
call pzdotu(n, dotu, x, ix, jx, descx, incx, y, iy, jy, descy, incy)

Include Files
• C: mkl_pblas.h

Description
The p?dotu functions compute the dot product dotu of two distributed vectors defined as
dotu = sub(x)'*sub(y),
where sub(x) and sub(y) are n-element distributed vectors.
sub(x) denotes X(ix, jx:jx+n-1) if incx = m_x, and X(ix:ix+n-1, jx) if incx = 1;
sub(y) denotes Y(iy, jy:jy+n-1) if incy = m_y, and Y(iy:iy+n-1, jy) if incy = 1.

Input Parameters
n (global) INTEGER. The length of distributed vectors, n ≥ 0.
x (local) COMPLEX for pcdotu, DOUBLE COMPLEX for pzdotu. Array, DIMENSION ((jx-1)*m_x + ix+(n-1)*abs(incx)). This array contains the entries of the distributed vector sub(x).
ix, jx (global) INTEGER. The row and column indices in the distributed matrix X indicating the first row and the first column of the submatrix sub(X), respectively.
descx (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix X.
incx (global) INTEGER. Specifies the increment for the elements of sub(x). Only two values are supported, namely 1 and m_x. incx must not be zero.
y (local) COMPLEX for pcdotu, DOUBLE COMPLEX for pzdotu. Array, DIMENSION ((jy-1)*m_y + iy+(n-1)*abs(incy)). This array contains the entries of the distributed vector sub(y).
iy, jy (global) INTEGER. The row and column indices in the distributed matrix Y indicating the first row and the first column of the submatrix sub(Y), respectively.
descy (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix Y.
incy (global) INTEGER. Specifies the increment for the elements of sub(y). Only two values are supported, namely 1 and m_y. incy must not be zero.

Output Parameters
dotu (local) COMPLEX for pcdotu, DOUBLE COMPLEX for pzdotu. Dot product of sub(x) and sub(y) only in their scope.

p?nrm2
Computes the Euclidean norm of a distributed vector.

Syntax
call psnrm2(n, norm2, x, ix, jx, descx, incx)
call pdnrm2(n, norm2, x, ix, jx, descx, incx)
call pscnrm2(n, norm2, x, ix, jx, descx, incx)
call pdznrm2(n, norm2, x, ix, jx, descx, incx)

Include Files
• C: mkl_pblas.h

Description
The p?nrm2 functions compute the Euclidean norm of a distributed vector sub(x), where sub(x) is an n-element distributed vector.
sub(x) denotes X(ix, jx:jx+n-1) if incx = m_x, and X(ix:ix+n-1, jx) if incx = 1.

Input Parameters
n (global) INTEGER. The length of distributed vector sub(x), n ≥ 0.
x (local) REAL for psnrm2, DOUBLE PRECISION for pdnrm2, COMPLEX for pscnrm2, DOUBLE COMPLEX for pdznrm2. Array, DIMENSION ((jx-1)*m_x + ix+(n-1)*abs(incx)). This array contains the entries of the distributed vector sub(x).
ix, jx (global) INTEGER. The row and column indices in the distributed matrix X indicating the first row and the first column of the submatrix sub(X), respectively.
descx (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix X.
incx (global) INTEGER. Specifies the increment for the elements of sub(x). Only two values are supported, namely 1 and m_x. incx must not be zero.

Output Parameters
norm2 (local) REAL for psnrm2 and pscnrm2, DOUBLE PRECISION for pdnrm2 and pdznrm2. Contains the Euclidean norm of a distributed vector only in its scope.

p?scal
Computes a product of a distributed vector by a scalar.

Syntax
call psscal(n, a, x, ix, jx, descx, incx)
call pdscal(n, a, x, ix, jx, descx, incx)
call pcscal(n, a, x, ix, jx, descx, incx)
call pzscal(n, a, x, ix, jx, descx, incx)
call pcsscal(n, a, x, ix, jx, descx, incx)
call pzdscal(n, a, x, ix, jx, descx, incx)

Include Files
• C: mkl_pblas.h

Description
The p?scal routines multiply an n-element distributed vector sub(x) by the scalar a:
sub(x) = a*sub(x),
where sub(x) denotes X(ix, jx:jx+n-1) if incx = m_x, and X(ix:ix+n-1, jx) if incx = 1.

Input Parameters
n (global) INTEGER. The length of distributed vector sub(x), n ≥ 0.
a (global) REAL for psscal and pcsscal, DOUBLE PRECISION for pdscal and pzdscal, COMPLEX for pcscal, DOUBLE COMPLEX for pzscal. Specifies the scalar a.
x (local) REAL for psscal, DOUBLE PRECISION for pdscal, COMPLEX for pcscal and pcsscal, DOUBLE COMPLEX for pzscal and pzdscal. Array, DIMENSION ((jx-1)*m_x + ix+(n-1)*abs(incx)). This array contains the entries of the distributed vector sub(x).
ix, jx (global) INTEGER.
The row and column indices in the distributed matrix X indicating the first row and the first column of the submatrix sub(X), respectively.
descx (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix X.
incx (global) INTEGER. Specifies the increment for the elements of sub(x). Only two values are supported, namely 1 and m_x. incx must not be zero.

Output Parameters
x Overwritten by the updated distributed vector sub(x).

p?swap
Swaps two distributed vectors.

Syntax
call psswap(n, x, ix, jx, descx, incx, y, iy, jy, descy, incy)
call pdswap(n, x, ix, jx, descx, incx, y, iy, jy, descy, incy)
call pcswap(n, x, ix, jx, descx, incx, y, iy, jy, descy, incy)
call pzswap(n, x, ix, jx, descx, incx, y, iy, jy, descy, incy)

Include Files
• C: mkl_pblas.h

Description
Given two distributed vectors sub(x) and sub(y), the p?swap routines return vectors sub(y) and sub(x) swapped, each replacing the other.
Here sub(x) denotes X(ix, jx:jx+n-1) if incx = m_x, and X(ix:ix+n-1, jx) if incx = 1;
sub(y) denotes Y(iy, jy:jy+n-1) if incy = m_y, and Y(iy:iy+n-1, jy) if incy = 1.

Input Parameters
n (global) INTEGER. The length of distributed vectors, n ≥ 0.
x (local) REAL for psswap, DOUBLE PRECISION for pdswap, COMPLEX for pcswap, DOUBLE COMPLEX for pzswap. Array, DIMENSION ((jx-1)*m_x + ix+(n-1)*abs(incx)). This array contains the entries of the distributed vector sub(x).
ix, jx (global) INTEGER. The row and column indices in the distributed matrix X indicating the first row and the first column of the submatrix sub(X), respectively.
descx (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix X.
incx (global) INTEGER. Specifies the increment for the elements of sub(x). Only two values are supported, namely 1 and m_x. incx must not be zero.
y (local) REAL for psswap, DOUBLE PRECISION for pdswap, COMPLEX for pcswap, DOUBLE COMPLEX for pzswap. Array, DIMENSION ((jy-1)*m_y + iy+(n-1)*abs(incy)).
This array contains the entries of the distributed vector sub(y).
iy, jy (global) INTEGER. The row and column indices in the distributed matrix Y indicating the first row and the first column of the submatrix sub(Y), respectively.
descy (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix Y.
incy (global) INTEGER. Specifies the increment for the elements of sub(y). Only two values are supported, namely 1 and m_y. incy must not be zero.

Output Parameters
x Overwritten by distributed vector sub(y).
y Overwritten by distributed vector sub(x).

PBLAS Level 2 Routines
This section describes PBLAS Level 2 routines, which perform distributed matrix-vector operations. Table "PBLAS Level 2 Routine Groups and Their Data Types" lists the PBLAS Level 2 routine groups and the data types associated with them.

PBLAS Level 2 Routine Groups and Their Data Types
Routine Groups   Data Types   Description
p?gemv           s, d, c, z   Matrix-vector product using a distributed general matrix
p?agemv          s, d, c, z   Matrix-vector product using absolute values for a distributed general matrix
p?ger            s, d         Rank-1 update of a distributed general matrix
p?gerc           c, z         Rank-1 update (conjugated) of a distributed general matrix
p?geru           c, z         Rank-1 update (unconjugated) of a distributed general matrix
p?hemv           c, z         Matrix-vector product using a distributed Hermitian matrix
p?ahemv          c, z         Matrix-vector product using absolute values for a distributed Hermitian matrix
p?her            c, z         Rank-1 update of a distributed Hermitian matrix
p?her2           c, z         Rank-2 update of a distributed Hermitian matrix
p?symv           s, d         Matrix-vector product using a distributed symmetric matrix
p?asymv          s, d         Matrix-vector product using absolute values for a distributed symmetric matrix
p?syr            s, d         Rank-1 update of a distributed symmetric matrix
p?syr2           s, d         Rank-2 update of a distributed symmetric matrix
p?trmv           s, d, c, z   Distributed matrix-vector product using a triangular matrix
p?atrmv          s, d, c, z   Distributed matrix-vector product using absolute values for a triangular matrix
p?trsv           s, d, c, z   Solves a system of linear equations whose coefficients are in a distributed triangular matrix

p?gemv
Computes a distributed matrix-vector product using a general matrix.

Syntax
call psgemv(trans, m, n, alpha, a, ia, ja, desca, x, ix, jx, descx, incx, beta, y, iy, jy, descy, incy)
call pdgemv(trans, m, n, alpha, a, ia, ja, desca, x, ix, jx, descx, incx, beta, y, iy, jy, descy, incy)
call pcgemv(trans, m, n, alpha, a, ia, ja, desca, x, ix, jx, descx, incx, beta, y, iy, jy, descy, incy)
call pzgemv(trans, m, n, alpha, a, ia, ja, desca, x, ix, jx, descx, incx, beta, y, iy, jy, descy, incy)

Include Files
• C: mkl_pblas.h

Description
The p?gemv routines perform a distributed matrix-vector operation defined as
sub(y) := alpha*sub(A)*sub(x) + beta*sub(y),
or
sub(y) := alpha*sub(A)'*sub(x) + beta*sub(y),
or
sub(y) := alpha*conjg(sub(A)')*sub(x) + beta*sub(y),
where alpha and beta are scalars, sub(A) is an m-by-n submatrix, sub(A) = A(ia:ia+m-1, ja:ja+n-1), and sub(x) and sub(y) are subvectors.
When trans = 'N' or 'n',
sub(x) denotes X(ix, jx:jx+n-1) if incx = m_x, and X(ix:ix+n-1, jx) if incx = 1,
sub(y) denotes Y(iy, jy:jy+m-1) if incy = m_y, and Y(iy:iy+m-1, jy) if incy = 1.
When trans = 'T', 't', 'C', or 'c',
sub(x) denotes X(ix, jx:jx+m-1) if incx = m_x, and X(ix:ix+m-1, jx) if incx = 1,
sub(y) denotes Y(iy, jy:jy+n-1) if incy = m_y, and Y(iy:iy+n-1, jy) if incy = 1.

Input Parameters
trans (global) CHARACTER*1. Specifies the operation:
if trans = 'N' or 'n', then sub(y) := alpha*sub(A)*sub(x) + beta*sub(y);
if trans = 'T' or 't', then sub(y) := alpha*sub(A)'*sub(x) + beta*sub(y);
if trans = 'C' or 'c', then sub(y) := alpha*conjg(sub(A)')*sub(x) + beta*sub(y).
m (global) INTEGER. Specifies the number of rows of the distributed matrix sub(A), m ≥ 0.
n (global) INTEGER.
Specifies the number of columns of the distributed matrix sub(A), n ≥ 0.
alpha (global) REAL for psgemv, DOUBLE PRECISION for pdgemv, COMPLEX for pcgemv, DOUBLE COMPLEX for pzgemv. Specifies the scalar alpha.
a (local) REAL for psgemv, DOUBLE PRECISION for pdgemv, COMPLEX for pcgemv, DOUBLE COMPLEX for pzgemv. Array, DIMENSION (lld_a, LOCq(ja+n-1)). Before entry this array must contain the local pieces of the distributed matrix sub(A).
ia, ja (global) INTEGER. The row and column indices in the distributed matrix A indicating the first row and the first column of the submatrix sub(A), respectively.
desca (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix A.
x (local) REAL for psgemv, DOUBLE PRECISION for pdgemv, COMPLEX for pcgemv, DOUBLE COMPLEX for pzgemv. Array, DIMENSION ((jx-1)*m_x + ix+(n-1)*abs(incx)) when trans = 'N' or 'n', and ((jx-1)*m_x + ix+(m-1)*abs(incx)) otherwise. This array contains the entries of the distributed vector sub(x).
ix, jx (global) INTEGER. The row and column indices in the distributed matrix X indicating the first row and the first column of the submatrix sub(x), respectively.
descx (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix X.
incx (global) INTEGER. Specifies the increment for the elements of sub(x). Only two values are supported, namely 1 and m_x. incx must not be zero.
beta (global) REAL for psgemv, DOUBLE PRECISION for pdgemv, COMPLEX for pcgemv, DOUBLE COMPLEX for pzgemv. Specifies the scalar beta. When beta is set to zero, then sub(y) need not be set on input.
y (local) REAL for psgemv, DOUBLE PRECISION for pdgemv, COMPLEX for pcgemv, DOUBLE COMPLEX for pzgemv. Array, DIMENSION ((jy-1)*m_y + iy+(m-1)*abs(incy)) when trans = 'N' or 'n', and ((jy-1)*m_y + iy+(n-1)*abs(incy)) otherwise. This array contains the entries of the distributed vector sub(y).
iy, jy (global) INTEGER.
The row and column indices in the distributed matrix Y indicating the first row and the first column of the submatrix sub(y), respectively.
descy (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix Y.
incy (global) INTEGER. Specifies the increment for the elements of sub(y). Only two values are supported, namely 1 and m_y. incy must not be zero.

Output Parameters
y Overwritten by the updated distributed vector sub(y).

p?agemv
Computes a distributed matrix-vector product using absolute values for a general matrix.

Syntax
call psagemv(trans, m, n, alpha, a, ia, ja, desca, x, ix, jx, descx, incx, beta, y, iy, jy, descy, incy)
call pdagemv(trans, m, n, alpha, a, ia, ja, desca, x, ix, jx, descx, incx, beta, y, iy, jy, descy, incy)
call pcagemv(trans, m, n, alpha, a, ia, ja, desca, x, ix, jx, descx, incx, beta, y, iy, jy, descy, incy)
call pzagemv(trans, m, n, alpha, a, ia, ja, desca, x, ix, jx, descx, incx, beta, y, iy, jy, descy, incy)

Include Files
• C: mkl_pblas.h

Description
The p?agemv routines perform a distributed matrix-vector operation defined as
sub(y) := abs(alpha)*abs(sub(A))*abs(sub(x)) + abs(beta*sub(y)),
or
sub(y) := abs(alpha)*abs(sub(A)')*abs(sub(x)) + abs(beta*sub(y)),
or
sub(y) := abs(alpha)*abs(conjg(sub(A)'))*abs(sub(x)) + abs(beta*sub(y)),
where alpha and beta are scalars, sub(A) is an m-by-n submatrix, sub(A) = A(ia:ia+m-1, ja:ja+n-1), and sub(x) and sub(y) are subvectors.
When trans = 'N' or 'n',
sub(x) denotes X(ix:ix, jx:jx+n-1) if incx = m_x, and X(ix:ix+n-1, jx:jx) if incx = 1,
sub(y) denotes Y(iy:iy, jy:jy+m-1) if incy = m_y, and Y(iy:iy+m-1, jy:jy) if incy = 1.
When trans = 'T', 't', 'C', or 'c',
sub(x) denotes X(ix:ix, jx:jx+m-1) if incx = m_x, and X(ix:ix+m-1, jx:jx) if incx = 1,
sub(y) denotes Y(iy:iy, jy:jy+n-1) if incy = m_y, and Y(iy:iy+n-1, jy:jy) if incy = 1.

Input Parameters
trans (global) CHARACTER*1.
Specifies the operation:
if trans = 'N' or 'n', then sub(y) := |alpha|*|sub(A)|*|sub(x)| + |beta*sub(y)|;
if trans = 'T' or 't', then sub(y) := |alpha|*|sub(A)'|*|sub(x)| + |beta*sub(y)|;
if trans = 'C' or 'c', then sub(y) := |alpha|*|sub(A)'|*|sub(x)| + |beta*sub(y)|.
m (global) INTEGER. Specifies the number of rows of the distributed matrix sub(A), m ≥ 0.
n (global) INTEGER. Specifies the number of columns of the distributed matrix sub(A), n ≥ 0.
alpha (global) REAL for psagemv, DOUBLE PRECISION for pdagemv, COMPLEX for pcagemv, DOUBLE COMPLEX for pzagemv. Specifies the scalar alpha.
a (local) REAL for psagemv, DOUBLE PRECISION for pdagemv, COMPLEX for pcagemv, DOUBLE COMPLEX for pzagemv. Array, DIMENSION (lld_a, LOCq(ja+n-1)). Before entry this array must contain the local pieces of the distributed matrix sub(A).
ia, ja (global) INTEGER. The row and column indices in the distributed matrix A indicating the first row and the first column of the submatrix sub(A), respectively.
desca (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix A.
x (local) REAL for psagemv, DOUBLE PRECISION for pdagemv, COMPLEX for pcagemv, DOUBLE COMPLEX for pzagemv. Array, DIMENSION ((jx-1)*m_x + ix+(n-1)*abs(incx)) when trans = 'N' or 'n', and ((jx-1)*m_x + ix+(m-1)*abs(incx)) otherwise. This array contains the entries of the distributed vector sub(x).
ix, jx (global) INTEGER. The row and column indices in the distributed matrix X indicating the first row and the first column of the submatrix sub(x), respectively.
descx (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix X.
incx (global) INTEGER. Specifies the increment for the elements of sub(x). Only two values are supported, namely 1 and m_x. incx must not be zero.
beta (global) REAL for psagemv, DOUBLE PRECISION for pdagemv, COMPLEX for pcagemv, DOUBLE COMPLEX for pzagemv. Specifies the scalar beta.
When beta is set to zero, then sub(y) need not be set on input.
y (local) REAL for psagemv, DOUBLE PRECISION for pdagemv, COMPLEX for pcagemv, DOUBLE COMPLEX for pzagemv. Array, DIMENSION ((jy-1)*m_y + iy+(m-1)*abs(incy)) when trans = 'N' or 'n', and ((jy-1)*m_y + iy+(n-1)*abs(incy)) otherwise. This array contains the entries of the distributed vector sub(y).
iy, jy (global) INTEGER. The row and column indices in the distributed matrix Y indicating the first row and the first column of the submatrix sub(y), respectively.
descy (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix Y.
incy (global) INTEGER. Specifies the increment for the elements of sub(y). Only two values are supported, namely 1 and m_y. incy must not be zero.

Output Parameters
y Overwritten by the updated distributed vector sub(y).

p?ger
Performs a rank-1 update of a distributed general matrix.

Syntax
call psger(m, n, alpha, x, ix, jx, descx, incx, y, iy, jy, descy, incy, a, ia, ja, desca)
call pdger(m, n, alpha, x, ix, jx, descx, incx, y, iy, jy, descy, incy, a, ia, ja, desca)

Include Files
• C: mkl_pblas.h

Description
The p?ger routines perform a distributed matrix-vector operation defined as
sub(A) := alpha*sub(x)*sub(y)' + sub(A),
where:
alpha is a scalar,
sub(A) is an m-by-n distributed general matrix, sub(A) = A(ia:ia+m-1, ja:ja+n-1),
sub(x) is an m-element distributed vector, sub(y) is an n-element distributed vector,
sub(x) denotes X(ix, jx:jx+m-1) if incx = m_x, and X(ix:ix+m-1, jx) if incx = 1,
sub(y) denotes Y(iy, jy:jy+n-1) if incy = m_y, and Y(iy:iy+n-1, jy) if incy = 1.

Input Parameters
m (global) INTEGER. Specifies the number of rows of the distributed matrix sub(A), m ≥ 0.
n (global) INTEGER. Specifies the number of columns of the distributed matrix sub(A), n ≥ 0.
alpha (global) REAL for psger, DOUBLE PRECISION for pdger. Specifies the scalar alpha.
x (local) REAL for psger, DOUBLE PRECISION for pdger. Array, DIMENSION at least ((jx-1)*m_x + ix+(m-1)*abs(incx)). This array contains the entries of the distributed vector sub(x).
ix, jx (global) INTEGER. The row and column indices in the distributed matrix X indicating the first row and the first column of the submatrix sub(x), respectively.
descx (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix X.
incx (global) INTEGER. Specifies the increment for the elements of sub(x). Only two values are supported, namely 1 and m_x. incx must not be zero.
y (local) REAL for psger, DOUBLE PRECISION for pdger. Array, DIMENSION at least ((jy-1)*m_y + iy+(n-1)*abs(incy)). This array contains the entries of the distributed vector sub(y).
iy, jy (global) INTEGER. The row and column indices in the distributed matrix Y indicating the first row and the first column of the submatrix sub(y), respectively.
descy (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix Y.
incy (global) INTEGER. Specifies the increment for the elements of sub(y). Only two values are supported, namely 1 and m_y. incy must not be zero.
a (local) REAL for psger, DOUBLE PRECISION for pdger. Array, DIMENSION (lld_a, LOCq(ja+n-1)). Before entry this array contains the local pieces of the distributed matrix sub(A).
ia, ja (global) INTEGER. The row and column indices in the distributed matrix A indicating the first row and the first column of the submatrix sub(A), respectively.
desca (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix A.

Output Parameters
a Overwritten by the updated distributed matrix sub(A).

p?gerc
Performs a rank-1 update (conjugated) of a distributed general matrix.
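To make the difference between the unconjugated update sub(A) := alpha*sub(x)*sub(y)' + sub(A) and the conjugated update performed by p?gerc concrete, the following sketch applies both formulas to small, non-distributed Python lists. It is only a sequential reference for the arithmetic, not a call into PBLAS; the function name rank1_update and its conjugate flag are hypothetical and do not exist in Intel MKL.

```python
def rank1_update(alpha, x, y, a, conjugate=False):
    """Return alpha*x*y' + a, or alpha*x*conjg(y') + a when conjugate=True.

    x has m entries, y has n entries, and a is an m-by-n list of rows.
    Hypothetical helper: a sequential reference, not a PBLAS routine.
    """
    m, n = len(x), len(y)
    if len(a) != m or any(len(row) != n for row in a):
        raise ValueError("a must be m-by-n with m = len(x) and n = len(y)")
    result = []
    for i in range(m):
        row = []
        for j in range(n):
            # The only difference between the two updates is this conjugation.
            yj = y[j].conjugate() if conjugate else y[j]
            row.append(alpha * x[i] * yj + a[i][j])
        result.append(row)
    return result


x = [1 + 1j, 2]
y = [1j, 3]
a = [[0, 0], [0, 0]]
unconjugated = rank1_update(1, x, y, a)                # uses y as given
conjugated = rank1_update(1, x, y, a, conjugate=True)  # uses conjg(y)
```

Only the entries multiplied by a non-real element of y differ between the two results, which is why the real-valued p?ger needs no conjugated variant.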
Syntax
call pcgerc(m, n, alpha, x, ix, jx, descx, incx, y, iy, jy, descy, incy, a, ia, ja, desca)
call pzgerc(m, n, alpha, x, ix, jx, descx, incx, y, iy, jy, descy, incy, a, ia, ja, desca)

Include Files
• C: mkl_pblas.h

Description
The p?gerc routines perform a distributed matrix-vector operation defined as
sub(A) := alpha*sub(x)*conjg(sub(y)') + sub(A),
where:
alpha is a scalar,
sub(A) is an m-by-n distributed general matrix, sub(A) = A(ia:ia+m-1, ja:ja+n-1),
sub(x) is an m-element distributed vector, sub(y) is an n-element distributed vector,
sub(x) denotes X(ix, jx:jx+m-1) if incx = m_x, and X(ix:ix+m-1, jx) if incx = 1,
sub(y) denotes Y(iy, jy:jy+n-1) if incy = m_y, and Y(iy:iy+n-1, jy) if incy = 1.

Input Parameters
m (global) INTEGER. Specifies the number of rows of the distributed matrix sub(A), m ≥ 0.
n (global) INTEGER. Specifies the number of columns of the distributed matrix sub(A), n ≥ 0.
alpha (global) COMPLEX for pcgerc, DOUBLE COMPLEX for pzgerc. Specifies the scalar alpha.
x (local) COMPLEX for pcgerc, DOUBLE COMPLEX for pzgerc. Array, DIMENSION at least ((jx-1)*m_x + ix+(m-1)*abs(incx)). This array contains the entries of the distributed vector sub(x).
ix, jx (global) INTEGER. The row and column indices in the distributed matrix X indicating the first row and the first column of the submatrix sub(x), respectively.
descx (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix X.
incx (global) INTEGER. Specifies the increment for the elements of sub(x). Only two values are supported, namely 1 and m_x. incx must not be zero.
y (local) COMPLEX for pcgerc, DOUBLE COMPLEX for pzgerc. Array, DIMENSION at least ((jy-1)*m_y + iy+(n-1)*abs(incy)). This array contains the entries of the distributed vector sub(y).
iy, jy (global) INTEGER. The row and column indices in the distributed matrix Y indicating the first row and the first column of the submatrix sub(y), respectively.
descy (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix Y.
incy (global) INTEGER. Specifies the increment for the elements of sub(y). Only two values are supported, namely 1 and m_y. incy must not be zero.
a (local) COMPLEX for pcgerc, DOUBLE COMPLEX for pzgerc. Array, DIMENSION at least (lld_a, LOCq(ja+n-1)). Before entry this array contains the local pieces of the distributed matrix sub(A).
ia, ja (global) INTEGER. The row and column indices in the distributed matrix A indicating the first row and the first column of the submatrix sub(A), respectively.
desca (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix A.

Output Parameters
a Overwritten by the updated distributed matrix sub(A).

p?geru
Performs a rank-1 update (unconjugated) of a distributed general matrix.

Syntax
call pcgeru(m, n, alpha, x, ix, jx, descx, incx, y, iy, jy, descy, incy, a, ia, ja, desca)
call pzgeru(m, n, alpha, x, ix, jx, descx, incx, y, iy, jy, descy, incy, a, ia, ja, desca)

Include Files
• C: mkl_pblas.h

Description
The p?geru routines perform a distributed matrix-vector operation defined as
sub(A) := alpha*sub(x)*sub(y)' + sub(A),
where:
alpha is a scalar,
sub(A) is an m-by-n distributed general matrix, sub(A) = A(ia:ia+m-1, ja:ja+n-1),
sub(x) is an m-element distributed vector, sub(y) is an n-element distributed vector,
sub(x) denotes X(ix, jx:jx+m-1) if incx = m_x, and X(ix:ix+m-1, jx) if incx = 1,
sub(y) denotes Y(iy, jy:jy+n-1) if incy = m_y, and Y(iy:iy+n-1, jy) if incy = 1.

Input Parameters
m (global) INTEGER. Specifies the number of rows of the distributed matrix sub(A), m ≥ 0.
n (global) INTEGER. Specifies the number of columns of the distributed matrix sub(A), n ≥ 0.
alpha (global) COMPLEX for pcgeru, DOUBLE COMPLEX for pzgeru. Specifies the scalar alpha.
x (local) COMPLEX for pcgeru, DOUBLE COMPLEX for pzgeru. Array, DIMENSION at least ((jx-1)*m_x + ix+(m-1)*abs(incx)). This array contains the entries of the distributed vector sub(x).
ix, jx (global) INTEGER. The row and column indices in the distributed matrix X indicating the first row and the first column of the submatrix sub(x), respectively.
descx (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix X.
incx (global) INTEGER. Specifies the increment for the elements of sub(x). Only two values are supported, namely 1 and m_x. incx must not be zero.
y (local) COMPLEX for pcgeru, DOUBLE COMPLEX for pzgeru. Array, DIMENSION at least ((jy-1)*m_y + iy+(n-1)*abs(incy)). This array contains the entries of the distributed vector sub(y).
iy, jy (global) INTEGER. The row and column indices in the distributed matrix Y indicating the first row and the first column of the submatrix sub(y), respectively.
descy (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix Y.
incy (global) INTEGER. Specifies the increment for the elements of sub(y). Only two values are supported, namely 1 and m_y. incy must not be zero.
a (local) COMPLEX for pcgeru, DOUBLE COMPLEX for pzgeru. Array, DIMENSION at least (lld_a, LOCq(ja+n-1)). Before entry this array contains the local pieces of the distributed matrix sub(A).
ia, ja (global) INTEGER. The row and column indices in the distributed matrix A indicating the first row and the first column of the submatrix sub(A), respectively.
desca (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix A.

Output Parameters
a Overwritten by the updated distributed matrix sub(A).

p?hemv
Computes a distributed matrix-vector product using a Hermitian matrix.
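As a rough illustration of what it means for a Hermitian matrix-vector product to read only one triangle of the matrix (the role of the uplo argument of p?hemv), the sketch below computes alpha*A*x + beta*y on plain, non-distributed Python lists. The function hemv_reference is hypothetical and is not a PBLAS or Intel MKL routine; entries outside the selected triangle are deliberately set to None to show that they are never read.

```python
def hemv_reference(uplo, alpha, a, x, beta, y):
    """Return alpha*A*x + beta*y for a Hermitian A, reading one triangle of a.

    Hypothetical sequential helper; a is an n-by-n list of rows.
    """
    n = len(x)
    upper = uplo in ("U", "u")
    if not upper and uplo not in ("L", "l"):
        raise ValueError("uplo must be 'U', 'u', 'L' or 'l'")

    def element(i, j):
        # Entries in the selected triangle are read directly; the others are
        # rebuilt from the Hermitian property A(i,j) = conjg(A(j,i)).
        stored = (i <= j) if upper else (i >= j)
        return a[i][j] if stored else a[j][i].conjugate()

    return [
        alpha * sum(element(i, j) * x[j] for j in range(n)) + beta * y[i]
        for i in range(n)
    ]


# The same Hermitian matrix, stored once by its upper and once by its lower
# triangle; None marks the entry that must not be referenced.
a_upper = [[2, 1 - 1j],
           [None, 3]]
a_lower = [[2, None],
           [1 + 1j, 3]]
x = [1, 1j]
from_upper = hemv_reference("U", 1, a_upper, x, 0, [0, 0])
from_lower = hemv_reference("L", 1, a_lower, x, 0, [0, 0])
```

Both calls describe the same matrix, so they produce the same vector.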
Syntax
call pchemv(uplo, n, alpha, a, ia, ja, desca, x, ix, jx, descx, incx, beta, y, iy, jy, descy, incy)
call pzhemv(uplo, n, alpha, a, ia, ja, desca, x, ix, jx, descx, incx, beta, y, iy, jy, descy, incy)

Include Files
• C: mkl_pblas.h

Description
The p?hemv routines perform a distributed matrix-vector operation defined as
sub(y) := alpha*sub(A)*sub(x) + beta*sub(y),
where:
alpha and beta are scalars,
sub(A) is an n-by-n Hermitian distributed matrix, sub(A) = A(ia:ia+n-1, ja:ja+n-1),
sub(x) and sub(y) are distributed vectors.
sub(x) denotes X(ix, jx:jx+n-1) if incx = m_x, and X(ix:ix+n-1, jx) if incx = 1,
sub(y) denotes Y(iy, jy:jy+n-1) if incy = m_y, and Y(iy:iy+n-1, jy) if incy = 1.

Input Parameters
uplo (global) CHARACTER*1. Specifies whether the upper or lower triangular part of the Hermitian distributed matrix sub(A) is used:
If uplo = 'U' or 'u', then the upper triangular part of sub(A) is used.
If uplo = 'L' or 'l', then the lower triangular part of sub(A) is used.
n (global) INTEGER. Specifies the order of the distributed matrix sub(A), n ≥ 0.
alpha (global) COMPLEX for pchemv, DOUBLE COMPLEX for pzhemv. Specifies the scalar alpha.
a (local) COMPLEX for pchemv, DOUBLE COMPLEX for pzhemv. Array, DIMENSION (lld_a, LOCq(ja+n-1)). This array contains the local pieces of the distributed matrix sub(A). Before entry, when uplo = 'U' or 'u', the n-by-n upper triangular part of the distributed matrix sub(A) must contain the upper triangular part of the Hermitian distributed matrix, and the strictly lower triangular part of sub(A) is not referenced; when uplo = 'L' or 'l', the n-by-n lower triangular part of the distributed matrix sub(A) must contain the lower triangular part of the Hermitian distributed matrix, and the strictly upper triangular part of sub(A) is not referenced.
ia, ja (global) INTEGER.
The row and column indices in the distributed matrix A indicating the first row and the first column of the submatrix sub(A), respectively.
desca (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix A.
x (local) COMPLEX for pchemv, DOUBLE COMPLEX for pzhemv. Array, DIMENSION at least ((jx-1)*m_x + ix+(n-1)*abs(incx)). This array contains the entries of the distributed vector sub(x).
ix, jx (global) INTEGER. The row and column indices in the distributed matrix X indicating the first row and the first column of the submatrix sub(x), respectively.
descx (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix X.
incx (global) INTEGER. Specifies the increment for the elements of sub(x). Only two values are supported, namely 1 and m_x. incx must not be zero.
beta (global) COMPLEX for pchemv, DOUBLE COMPLEX for pzhemv. Specifies the scalar beta. When beta is set to zero, then sub(y) need not be set on input.
y (local) COMPLEX for pchemv, DOUBLE COMPLEX for pzhemv. Array, DIMENSION at least ((jy-1)*m_y + iy+(n-1)*abs(incy)). This array contains the entries of the distributed vector sub(y).
iy, jy (global) INTEGER. The row and column indices in the distributed matrix Y indicating the first row and the first column of the submatrix sub(y), respectively.
descy (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix Y.
incy (global) INTEGER. Specifies the increment for the elements of sub(y). Only two values are supported, namely 1 and m_y. incy must not be zero.

Output Parameters
y Overwritten by the updated distributed vector sub(y).

p?ahemv
Computes a distributed matrix-vector product using absolute values for a Hermitian matrix.
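Like the other routines in this chapter, p?ahemv addresses its vector operands through ix, jx, and incx. The sketch below lists which global entries of X make up sub(x) for the two supported increments, and checks that the array-size expression (jx-1)*m_x + ix + (n-1)*abs(incx) coincides with the position of the last entry of sub(x). It assumes a non-distributed, column-major picture of X in which m_x is the number of rows of X; both function names are hypothetical and are not part of PBLAS.

```python
def sub_vector_indices(n, ix, jx, incx, m_x):
    """Return the 1-based (row, column) pairs of X that make up sub(x)."""
    if incx == 1:
        # Column segment X(ix:ix+n-1, jx).
        return [(ix + k, jx) for k in range(n)]
    if incx == m_x:
        # Row segment X(ix, jx:jx+n-1).
        return [(ix, jx + k) for k in range(n)]
    raise ValueError("incx must be 1 or m_x")


def column_major_position(i, j, m_x):
    """1-based position of X(i, j) when X is stored column by column."""
    return (j - 1) * m_x + i


m_x = 4                      # assumed number of rows of X
n, ix, jx = 3, 2, 1

column_part = sub_vector_indices(n, ix, jx, 1, m_x)
row_part = sub_vector_indices(n, ix, jx, m_x, m_x)

# The size expression from the parameter descriptions, for each increment.
size_for_column = (jx - 1) * m_x + ix + (n - 1) * abs(1)
size_for_row = (jx - 1) * m_x + ix + (n - 1) * abs(m_x)

last_in_column = column_major_position(*column_part[-1], m_x)
last_in_row = column_major_position(*row_part[-1], m_x)
```

Under these assumptions, stepping by m_x moves one column to the right within the same row, which is why incx = m_x selects a row of X.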
Syntax
call pcahemv(uplo, n, alpha, a, ia, ja, desca, x, ix, jx, descx, incx, beta, y, iy, jy, descy, incy)
call pzahemv(uplo, n, alpha, a, ia, ja, desca, x, ix, jx, descx, incx, beta, y, iy, jy, descy, incy)
Include Files
• C: mkl_pblas.h
Description
The p?ahemv routines perform a distributed matrix-vector operation defined as
sub(y) := abs(alpha)*abs(sub(A))*abs(sub(x)) + abs(beta*sub(y)),
where:
alpha and beta are scalars,
sub(A) is an n-by-n Hermitian distributed matrix, sub(A) = A(ia:ia+n-1, ja:ja+n-1),
sub(x) and sub(y) are distributed vectors.
sub(x) denotes X(ix, jx:jx+n-1) if incx = m_x, and X(ix:ix+n-1, jx) if incx = 1,
sub(y) denotes Y(iy, jy:jy+n-1) if incy = m_y, and Y(iy:iy+n-1, jy) if incy = 1.
Input Parameters
uplo (global) CHARACTER*1. Specifies whether the upper or lower triangular part of the Hermitian distributed matrix sub(A) is used: If uplo = 'U' or 'u', then the upper triangular part of sub(A) is used. If uplo = 'L' or 'l', then the lower triangular part of sub(A) is used.
n (global) INTEGER. Specifies the order of the distributed matrix sub(A), n ≥ 0.
alpha (global) COMPLEX for pcahemv; DOUBLE COMPLEX for pzahemv. Specifies the scalar alpha.
a (local) COMPLEX for pcahemv; DOUBLE COMPLEX for pzahemv. Array, DIMENSION (lld_a, LOCq(ja+n-1)). This array contains the local pieces of the distributed matrix sub(A). Before entry, when uplo = 'U' or 'u', the n-by-n upper triangular part of sub(A) must contain the upper triangular part of the Hermitian distributed matrix, and the strictly lower triangular part of sub(A) is not referenced. When uplo = 'L' or 'l', the n-by-n lower triangular part of sub(A) must contain the lower triangular part of the Hermitian distributed matrix, and the strictly upper triangular part of sub(A) is not referenced.
ia, ja (global) INTEGER.
The row and column indices in the distributed matrix A indicating the first row and the first column of the submatrix sub(A), respectively.
desca (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix A.
x (local) COMPLEX for pcahemv; DOUBLE COMPLEX for pzahemv. Array, DIMENSION at least ((jx-1)*m_x + ix + (n-1)*abs(incx)). This array contains the entries of the distributed vector sub(x).
ix, jx (global) INTEGER. The row and column indices in the distributed matrix X indicating the first row and the first column of the submatrix sub(x), respectively.
descx (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix X.
incx (global) INTEGER. Specifies the increment for the elements of sub(x). Only two values are supported, namely 1 and m_x. incx must not be zero.
beta (global) COMPLEX for pcahemv; DOUBLE COMPLEX for pzahemv. Specifies the scalar beta. When beta is set to zero, sub(y) need not be set on input.
y (local) COMPLEX for pcahemv; DOUBLE COMPLEX for pzahemv. Array, DIMENSION at least ((jy-1)*m_y + iy + (n-1)*abs(incy)). This array contains the entries of the distributed vector sub(y).
iy, jy (global) INTEGER. The row and column indices in the distributed matrix Y indicating the first row and the first column of the submatrix sub(y), respectively.
descy (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix Y.
incy (global) INTEGER. Specifies the increment for the elements of sub(y). Only two values are supported, namely 1 and m_y. incy must not be zero.
Output Parameters
y Overwritten by the updated distributed vector sub(y).
p?her
Performs a rank-1 update of a distributed Hermitian matrix.
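To make the rank-1 update concrete, the following serial Python sketch applies sub(A) := alpha*sub(x)*conjg(sub(x)') + sub(A) to an ordinary in-memory matrix, touching only the triangle selected by uplo. It is a sketch under stated assumptions rather than the library's algorithm: her_reference is a hypothetical name, and distribution, descriptors and increments are not modeled.

```python
def her_reference(uplo, alpha, x, a):
    """Serial model of the Hermitian rank-1 update.

    alpha is a real scalar, x is a complex vector of length n, and `a` is an
    n-by-n list of lists updated in place. Only the triangle selected by
    uplo is modified; the other strict triangle is left untouched.
    """
    n = len(x)
    upper = uplo.upper() == 'U'
    for i in range(n):
        for j in range(n):
            in_triangle = (i <= j) if upper else (i >= j)
            if in_triangle:
                a[i][j] += alpha * x[i] * complex(x[j]).conjugate()
    return a


# The -7 in the strictly lower triangle is never read or written for uplo='U'.
a = [[1, 2 + 1j],
     [-7, 3]]
her_reference('U', 2.0, [1 + 1j, 2], a)
print(a)
```

Because alpha is real and each diagonal term is alpha*abs(x(i))**2, the diagonal of the updated matrix stays real, which is what keeps sub(A) Hermitian.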
Syntax
call pcher(uplo, n, alpha, x, ix, jx, descx, incx, a, ia, ja, desca)
call pzher(uplo, n, alpha, x, ix, jx, descx, incx, a, ia, ja, desca)
Include Files
• C: mkl_pblas.h
Description
The p?her routines perform a distributed matrix-vector operation defined as
sub(A) := alpha*sub(x)*conjg(sub(x)') + sub(A),
where:
alpha is a real scalar,
sub(A) is an n-by-n distributed Hermitian matrix, sub(A) = A(ia:ia+n-1, ja:ja+n-1),
sub(x) is a distributed vector.
sub(x) denotes X(ix, jx:jx+n-1) if incx = m_x, and X(ix:ix+n-1, jx) if incx = 1.
Input Parameters
uplo (global) CHARACTER*1. Specifies whether the upper or lower triangular part of the Hermitian distributed matrix sub(A) is used: If uplo = 'U' or 'u', then the upper triangular part of sub(A) is used. If uplo = 'L' or 'l', then the lower triangular part of sub(A) is used.
n (global) INTEGER. Specifies the order of the distributed matrix sub(A), n ≥ 0.
alpha (global) REAL for pcher; DOUBLE PRECISION for pzher. Specifies the scalar alpha.
x (local) COMPLEX for pcher; DOUBLE COMPLEX for pzher. Array, DIMENSION at least ((jx-1)*m_x + ix + (n-1)*abs(incx)). This array contains the entries of the distributed vector sub(x).
ix, jx (global) INTEGER. The row and column indices in the distributed matrix X indicating the first row and the first column of the submatrix sub(x), respectively.
descx (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix X.
incx (global) INTEGER. Specifies the increment for the elements of sub(x). Only two values are supported, namely 1 and m_x. incx must not be zero.
a (local) COMPLEX for pcher; DOUBLE COMPLEX for pzher. Array, DIMENSION (lld_a, LOCq(ja+n-1)). This array contains the local pieces of the distributed matrix sub(A).
Before entry with uplo = 'U' or 'u', the n-by-n upper triangular part of sub(A) must contain the upper triangular part of the Hermitian distributed matrix, and the strictly lower triangular part of sub(A) is not referenced. With uplo = 'L' or 'l', the n-by-n lower triangular part of sub(A) must contain the lower triangular part of the Hermitian distributed matrix, and the strictly upper triangular part of sub(A) is not referenced.
ia, ja (global) INTEGER. The row and column indices in the distributed matrix A indicating the first row and the first column of the submatrix sub(A), respectively.
desca (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix A.
Output Parameters
a With uplo = 'U' or 'u', the upper triangular part of the array a is overwritten by the upper triangular part of the updated distributed matrix sub(A). With uplo = 'L' or 'l', the lower triangular part of the array a is overwritten by the lower triangular part of the updated distributed matrix sub(A).
p?her2
Performs a rank-2 update of a distributed Hermitian matrix.
Syntax
call pcher2(uplo, n, alpha, x, ix, jx, descx, incx, y, iy, jy, descy, incy, a, ia, ja, desca)
call pzher2(uplo, n, alpha, x, ix, jx, descx, incx, y, iy, jy, descy, incy, a, ia, ja, desca)
Include Files
• C: mkl_pblas.h
Description
The p?her2 routines perform a distributed matrix-vector operation defined as
sub(A) := alpha*sub(x)*conjg(sub(y)') + conjg(alpha)*sub(y)*conjg(sub(x)') + sub(A),
where:
alpha is a scalar,
sub(A) is an n-by-n distributed Hermitian matrix, sub(A) = A(ia:ia+n-1, ja:ja+n-1),
sub(x) and sub(y) are distributed vectors.
sub(x) denotes X(ix, jx:jx+n-1) if incx = m_x, and X(ix:ix+n-1, jx) if incx = 1,
sub(y) denotes Y(iy, jy:jy+n-1) if incy = m_y, and Y(iy:iy+n-1, jy) if incy = 1.
Input Parameters
uplo (global) CHARACTER*1.
Specifies whether the upper or lower triangular part of the distributed Hermitian matrix sub(A) is used: If uplo = 'U' or 'u', then the upper triangular part of sub(A) is used. If uplo = 'L' or 'l', then the lower triangular part of sub(A) is used.
n (global) INTEGER. Specifies the order of the distributed matrix sub(A), n ≥ 0.
alpha (global) COMPLEX for pcher2; DOUBLE COMPLEX for pzher2. Specifies the scalar alpha.
x (local) COMPLEX for pcher2; DOUBLE COMPLEX for pzher2. Array, DIMENSION at least ((jx-1)*m_x + ix + (n-1)*abs(incx)). This array contains the entries of the distributed vector sub(x).
ix, jx (global) INTEGER. The row and column indices in the distributed matrix X indicating the first row and the first column of the submatrix sub(x), respectively.
descx (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix X.
incx (global) INTEGER. Specifies the increment for the elements of sub(x). Only two values are supported, namely 1 and m_x. incx must not be zero.
y (local) COMPLEX for pcher2; DOUBLE COMPLEX for pzher2. Array, DIMENSION at least ((jy-1)*m_y + iy + (n-1)*abs(incy)). This array contains the entries of the distributed vector sub(y).
iy, jy (global) INTEGER. The row and column indices in the distributed matrix Y indicating the first row and the first column of the submatrix sub(y), respectively.
descy (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix Y.
incy (global) INTEGER. Specifies the increment for the elements of sub(y). Only two values are supported, namely 1 and m_y. incy must not be zero.
a (local) COMPLEX for pcher2; DOUBLE COMPLEX for pzher2. Array, DIMENSION (lld_a, LOCq(ja+n-1)). This array contains the local pieces of the distributed matrix sub(A).
Before entry with uplo = 'U' or 'u', the n-by-n upper triangular part of sub(A) must contain the upper triangular part of the Hermitian distributed matrix, and the strictly lower triangular part of sub(A) is not referenced. With uplo = 'L' or 'l', the n-by-n lower triangular part of sub(A) must contain the lower triangular part of the Hermitian distributed matrix, and the strictly upper triangular part of sub(A) is not referenced.
ia, ja (global) INTEGER. The row and column indices in the distributed matrix A indicating the first row and the first column of the submatrix sub(A), respectively.
desca (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix A.
Output Parameters
a With uplo = 'U' or 'u', the upper triangular part of the array a is overwritten by the upper triangular part of the updated distributed matrix sub(A). With uplo = 'L' or 'l', the lower triangular part of the array a is overwritten by the lower triangular part of the updated distributed matrix sub(A).
p?symv
Computes a distributed matrix-vector product using a symmetric matrix.
Syntax
call pssymv(uplo, n, alpha, a, ia, ja, desca, x, ix, jx, descx, incx, beta, y, iy, jy, descy, incy)
call pdsymv(uplo, n, alpha, a, ia, ja, desca, x, ix, jx, descx, incx, beta, y, iy, jy, descy, incy)
Include Files
• C: mkl_pblas.h
Description
The p?symv routines perform a distributed matrix-vector operation defined as
sub(y) := alpha*sub(A)*sub(x) + beta*sub(y),
where:
alpha and beta are scalars,
sub(A) is an n-by-n symmetric distributed matrix, sub(A) = A(ia:ia+n-1, ja:ja+n-1),
sub(x) and sub(y) are distributed vectors.
sub(x) denotes X(ix, jx:jx+n-1) if incx = m_x, and X(ix:ix+n-1, jx) if incx = 1,
sub(y) denotes Y(iy, jy:jy+n-1) if incy = m_y, and Y(iy:iy+n-1, jy) if incy = 1.
Input Parameters
uplo (global) CHARACTER*1.
Specifies whether the upper or lower triangular part of the symmetric distributed matrix sub(A) is used: If uplo = 'U' or 'u', then the upper triangular part of sub(A) is used. If uplo = 'L' or 'l', then the lower triangular part of sub(A) is used.
n (global) INTEGER. Specifies the order of the distributed matrix sub(A), n ≥ 0.
alpha (global) REAL for pssymv; DOUBLE PRECISION for pdsymv. Specifies the scalar alpha.
a (local) REAL for pssymv; DOUBLE PRECISION for pdsymv. Array, DIMENSION (lld_a, LOCq(ja+n-1)). This array contains the local pieces of the distributed matrix sub(A). Before entry, when uplo = 'U' or 'u', the n-by-n upper triangular part of sub(A) must contain the upper triangular part of the symmetric distributed matrix, and the strictly lower triangular part of sub(A) is not referenced. When uplo = 'L' or 'l', the n-by-n lower triangular part of sub(A) must contain the lower triangular part of the symmetric distributed matrix, and the strictly upper triangular part of sub(A) is not referenced.
ia, ja (global) INTEGER. The row and column indices in the distributed matrix A indicating the first row and the first column of the submatrix sub(A), respectively.
desca (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix A.
x (local) REAL for pssymv; DOUBLE PRECISION for pdsymv. Array, DIMENSION at least ((jx-1)*m_x + ix + (n-1)*abs(incx)). This array contains the entries of the distributed vector sub(x).
ix, jx (global) INTEGER. The row and column indices in the distributed matrix X indicating the first row and the first column of the submatrix sub(x), respectively.
descx (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix X.
incx (global) INTEGER. Specifies the increment for the elements of sub(x). Only two values are supported, namely 1 and m_x. incx must not be zero.
beta (global) REAL for pssymv; DOUBLE PRECISION for pdsymv. Specifies the scalar beta. When beta is set to zero, sub(y) need not be set on input.
y (local) REAL for pssymv; DOUBLE PRECISION for pdsymv. Array, DIMENSION at least ((jy-1)*m_y + iy + (n-1)*abs(incy)). This array contains the entries of the distributed vector sub(y).
iy, jy (global) INTEGER. The row and column indices in the distributed matrix Y indicating the first row and the first column of the submatrix sub(y), respectively.
descy (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix Y.
incy (global) INTEGER. Specifies the increment for the elements of sub(y). Only two values are supported, namely 1 and m_y. incy must not be zero.
Output Parameters
y Overwritten by the updated distributed vector sub(y).
p?asymv
Computes a distributed matrix-vector product using absolute values for a symmetric matrix.
Syntax
call psasymv(uplo, n, alpha, a, ia, ja, desca, x, ix, jx, descx, incx, beta, y, iy, jy, descy, incy)
call pdasymv(uplo, n, alpha, a, ia, ja, desca, x, ix, jx, descx, incx, beta, y, iy, jy, descy, incy)
Include Files
• C: mkl_pblas.h
Description
The p?asymv routines perform a distributed matrix-vector operation defined as
sub(y) := abs(alpha)*abs(sub(A))*abs(sub(x)) + abs(beta*sub(y)),
where:
alpha and beta are scalars,
sub(A) is an n-by-n symmetric distributed matrix, sub(A) = A(ia:ia+n-1, ja:ja+n-1),
sub(x) and sub(y) are distributed vectors.
sub(x) denotes X(ix, jx:jx+n-1) if incx = m_x, and X(ix:ix+n-1, jx) if incx = 1,
sub(y) denotes Y(iy, jy:jy+n-1) if incy = m_y, and Y(iy:iy+n-1, jy) if incy = 1.
Input Parameters
uplo (global) CHARACTER*1. Specifies whether the upper or lower triangular part of the symmetric distributed matrix sub(A) is used: If uplo = 'U' or 'u', then the upper triangular part of sub(A) is used. If uplo = 'L' or 'l', then the lower triangular part of sub(A) is used.
n (global) INTEGER.
Specifies the order of the distributed matrix sub(A), n ≥ 0.
alpha (global) REAL for psasymv; DOUBLE PRECISION for pdasymv. Specifies the scalar alpha.
a (local) REAL for psasymv; DOUBLE PRECISION for pdasymv. Array, DIMENSION (lld_a, LOCq(ja+n-1)). This array contains the local pieces of the distributed matrix sub(A). Before entry, when uplo = 'U' or 'u', the n-by-n upper triangular part of sub(A) must contain the upper triangular part of the symmetric distributed matrix, and the strictly lower triangular part of sub(A) is not referenced. When uplo = 'L' or 'l', the n-by-n lower triangular part of sub(A) must contain the lower triangular part of the symmetric distributed matrix, and the strictly upper triangular part of sub(A) is not referenced.
ia, ja (global) INTEGER. The row and column indices in the distributed matrix A indicating the first row and the first column of the submatrix sub(A), respectively.
desca (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix A.
x (local) REAL for psasymv; DOUBLE PRECISION for pdasymv. Array, DIMENSION at least ((jx-1)*m_x + ix + (n-1)*abs(incx)). This array contains the entries of the distributed vector sub(x).
ix, jx (global) INTEGER. The row and column indices in the distributed matrix X indicating the first row and the first column of the submatrix sub(x), respectively.
descx (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix X.
incx (global) INTEGER. Specifies the increment for the elements of sub(x). Only two values are supported, namely 1 and m_x. incx must not be zero.
beta (global) REAL for psasymv; DOUBLE PRECISION for pdasymv. Specifies the scalar beta. When beta is set to zero, sub(y) need not be set on input.
y (local) REAL for psasymv; DOUBLE PRECISION for pdasymv. Array, DIMENSION at least ((jy-1)*m_y + iy + (n-1)*abs(incy)).
This array contains the entries of the distributed vector sub(y).
iy, jy (global) INTEGER. The row and column indices in the distributed matrix Y indicating the first row and the first column of the submatrix sub(y), respectively.
descy (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix Y.
incy (global) INTEGER. Specifies the increment for the elements of sub(y). Only two values are supported, namely 1 and m_y. incy must not be zero.
Output Parameters
y Overwritten by the updated distributed vector sub(y).
p?syr
Performs a rank-1 update of a distributed symmetric matrix.
Syntax
call pssyr(uplo, n, alpha, x, ix, jx, descx, incx, a, ia, ja, desca)
call pdsyr(uplo, n, alpha, x, ix, jx, descx, incx, a, ia, ja, desca)
Include Files
• C: mkl_pblas.h
Description
The p?syr routines perform a distributed matrix-vector operation defined as
sub(A) := alpha*sub(x)*sub(x)' + sub(A),
where:
alpha is a scalar,
sub(A) is an n-by-n distributed symmetric matrix, sub(A) = A(ia:ia+n-1, ja:ja+n-1),
sub(x) is a distributed vector.
sub(x) denotes X(ix, jx:jx+n-1) if incx = m_x, and X(ix:ix+n-1, jx) if incx = 1.
Input Parameters
uplo (global) CHARACTER*1. Specifies whether the upper or lower triangular part of the symmetric distributed matrix sub(A) is used: If uplo = 'U' or 'u', then the upper triangular part of sub(A) is used. If uplo = 'L' or 'l', then the lower triangular part of sub(A) is used.
n (global) INTEGER. Specifies the order of the distributed matrix sub(A), n ≥ 0.
alpha (global) REAL for pssyr; DOUBLE PRECISION for pdsyr. Specifies the scalar alpha.
x (local) REAL for pssyr; DOUBLE PRECISION for pdsyr. Array, DIMENSION at least ((jx-1)*m_x + ix + (n-1)*abs(incx)). This array contains the entries of the distributed vector sub(x).
ix, jx (global) INTEGER. The row and column indices in the distributed matrix X indicating the first row and the first column of the submatrix sub(x), respectively.
descx (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix X.
incx (global) INTEGER. Specifies the increment for the elements of sub(x). Only two values are supported, namely 1 and m_x. incx must not be zero.
a (local) REAL for pssyr; DOUBLE PRECISION for pdsyr. Array, DIMENSION (lld_a, LOCq(ja+n-1)). This array contains the local pieces of the distributed matrix sub(A). Before entry with uplo = 'U' or 'u', the n-by-n upper triangular part of sub(A) must contain the upper triangular part of the symmetric distributed matrix, and the strictly lower triangular part of sub(A) is not referenced. With uplo = 'L' or 'l', the n-by-n lower triangular part of sub(A) must contain the lower triangular part of the symmetric distributed matrix, and the strictly upper triangular part of sub(A) is not referenced.
ia, ja (global) INTEGER. The row and column indices in the distributed matrix A indicating the first row and the first column of the submatrix sub(A), respectively.
desca (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix A.
Output Parameters
a With uplo = 'U' or 'u', the upper triangular part of the array a is overwritten by the upper triangular part of the updated distributed matrix sub(A). With uplo = 'L' or 'l', the lower triangular part of the array a is overwritten by the lower triangular part of the updated distributed matrix sub(A).
p?syr2
Performs a rank-2 update of a distributed symmetric matrix.
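As a rough serial illustration of the rank-2 update that p?syr2 performs, the Python sketch below applies sub(A) := alpha*sub(x)*sub(y)' + alpha*sub(y)*sub(x)' + sub(A) to a small in-memory matrix. The function name syr2_reference is hypothetical, and the sketch ignores distribution, descriptors and increments. It only shows which entries change for a given uplo.

```python
def syr2_reference(uplo, alpha, x, y, a):
    """Serial model of the symmetric rank-2 update.

    `a` is an n-by-n list of lists updated in place. Only the triangle
    selected by uplo is modified, matching the Output Parameters text.
    """
    n = len(x)
    upper = uplo.upper() == 'U'
    for i in range(n):
        for j in range(n):
            in_triangle = (i <= j) if upper else (i >= j)
            if in_triangle:
                a[i][j] += alpha * (x[i] * y[j] + y[i] * x[j])
    return a


# uplo = 'L': the strictly upper entry a[0][1] is neither read nor written.
a = [[1.0, 0.0],
     [2.0, 3.0]]
syr2_reference('L', 0.5, [1.0, 2.0], [3.0, -1.0], a)
print(a)
```

Setting y equal to x and halving alpha reproduces the p?syr rank-1 update from the preceding section, since alpha/2*(x*x' + x*x') = alpha*x*x'.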
Syntax
call pssyr2(uplo, n, alpha, x, ix, jx, descx, incx, y, iy, jy, descy, incy, a, ia, ja, desca)
call pdsyr2(uplo, n, alpha, x, ix, jx, descx, incx, y, iy, jy, descy, incy, a, ia, ja, desca)
Include Files
• C: mkl_pblas.h
Description
The p?syr2 routines perform a distributed matrix-vector operation defined as
sub(A) := alpha*sub(x)*sub(y)' + alpha*sub(y)*sub(x)' + sub(A),
where:
alpha is a scalar,
sub(A) is an n-by-n distributed symmetric matrix, sub(A) = A(ia:ia+n-1, ja:ja+n-1),
sub(x) and sub(y) are distributed vectors.
sub(x) denotes X(ix, jx:jx+n-1) if incx = m_x, and X(ix:ix+n-1, jx) if incx = 1,
sub(y) denotes Y(iy, jy:jy+n-1) if incy = m_y, and Y(iy:iy+n-1, jy) if incy = 1.
Input Parameters
uplo (global) CHARACTER*1. Specifies whether the upper or lower triangular part of the distributed symmetric matrix sub(A) is used: If uplo = 'U' or 'u', then the upper triangular part of sub(A) is used. If uplo = 'L' or 'l', then the lower triangular part of sub(A) is used.
n (global) INTEGER. Specifies the order of the distributed matrix sub(A), n ≥ 0.
alpha (global) REAL for pssyr2; DOUBLE PRECISION for pdsyr2. Specifies the scalar alpha.
x (local) REAL for pssyr2; DOUBLE PRECISION for pdsyr2. Array, DIMENSION at least ((jx-1)*m_x + ix + (n-1)*abs(incx)). This array contains the entries of the distributed vector sub(x).
ix, jx (global) INTEGER. The row and column indices in the distributed matrix X indicating the first row and the first column of the submatrix sub(x), respectively.
descx (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix X.
incx (global) INTEGER. Specifies the increment for the elements of sub(x). Only two values are supported, namely 1 and m_x. incx must not be zero.
y (local) REAL for pssyr2; DOUBLE PRECISION for pdsyr2. Array, DIMENSION at least ((jy-1)*m_y + iy + (n-1)*abs(incy)). This array contains the entries of the distributed vector sub(y).
iy, jy (global) INTEGER.
The row and column indices in the distributed matrix Y indicating the first row and the first column of the submatrix sub(y), respectively.
descy (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix Y.
incy (global) INTEGER. Specifies the increment for the elements of sub(y). Only two values are supported, namely 1 and m_y. incy must not be zero.
a (local) REAL for pssyr2; DOUBLE PRECISION for pdsyr2. Array, DIMENSION (lld_a, LOCq(ja+n-1)). This array contains the local pieces of the distributed matrix sub(A). Before entry with uplo = 'U' or 'u', the n-by-n upper triangular part of sub(A) must contain the upper triangular part of the distributed symmetric matrix, and the strictly lower triangular part of sub(A) is not referenced. With uplo = 'L' or 'l', the n-by-n lower triangular part of sub(A) must contain the lower triangular part of the distributed symmetric matrix, and the strictly upper triangular part of sub(A) is not referenced.
ia, ja (global) INTEGER. The row and column indices in the distributed matrix A indicating the first row and the first column of the submatrix sub(A), respectively.
desca (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix A.
Output Parameters
a With uplo = 'U' or 'u', the upper triangular part of the array a is overwritten by the upper triangular part of the updated distributed matrix sub(A). With uplo = 'L' or 'l', the lower triangular part of the array a is overwritten by the lower triangular part of the updated distributed matrix sub(A).
p?trmv
Computes a distributed matrix-vector product using a triangular matrix.
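The interplay of the uplo, trans and diag arguments of p?trmv can be easier to follow on a small serial example. The Python sketch below models sub(x) := op(sub(A))*sub(x) for an ordinary in-memory matrix. trmv_reference is a hypothetical name, and the sketch does not model distribution, descriptors or increments.

```python
def trmv_reference(uplo, trans, diag, a, x):
    """Serial model of sub(x) := op(sub(A))*sub(x) for a triangular matrix.

    uplo  : 'U' or 'L', which triangle of `a` holds the matrix
    trans : 'N' (A), 'T' (transpose), 'C' (conjugate transpose)
    diag  : 'U' treats the diagonal as ones without reading it, 'N' reads it
    Returns the new vector; `a` and `x` are not modified.
    """
    n = len(x)
    upper = uplo.upper() == 'U'
    unit = diag.upper() == 'U'
    t = trans.upper()

    def tri(i, j):
        # Element (i, j) of the triangular matrix; the other strict
        # triangle is treated as zero and never read.
        if i == j:
            return 1 if unit else a[i][j]
        return a[i][j] if (i < j) == upper else 0

    def op(i, j):
        if t == 'N':
            return tri(i, j)
        if t == 'T':
            return tri(j, i)
        return complex(tri(j, i)).conjugate()  # 'C'

    return [sum(op(i, j) * x[j] for j in range(n)) for i in range(n)]


a = [[2, 3],
     [99, 4]]          # 99 is outside the upper triangle and is ignored
print(trmv_reference('U', 'N', 'N', a, [1, 1]))
print(trmv_reference('U', 'T', 'N', a, [1, 1]))
print(trmv_reference('U', 'N', 'U', a, [1, 1]))
```

With diag = 'U' the stored diagonal values 2 and 4 are never read, which matches the statement that they "are assumed to be unity".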
Syntax
call pstrmv(uplo, trans, diag, n, a, ia, ja, desca, x, ix, jx, descx, incx)
call pdtrmv(uplo, trans, diag, n, a, ia, ja, desca, x, ix, jx, descx, incx)
call pctrmv(uplo, trans, diag, n, a, ia, ja, desca, x, ix, jx, descx, incx)
call pztrmv(uplo, trans, diag, n, a, ia, ja, desca, x, ix, jx, descx, incx)
Include Files
• C: mkl_pblas.h
Description
The p?trmv routines perform one of the following distributed matrix-vector operations defined as
sub(x) := sub(A)*sub(x), or
sub(x) := sub(A)'*sub(x), or
sub(x) := conjg(sub(A)')*sub(x),
where:
sub(A) is an n-by-n unit, or non-unit, upper or lower triangular distributed matrix, sub(A) = A(ia:ia+n-1, ja:ja+n-1),
sub(x) is an n-element distributed vector.
sub(x) denotes X(ix, jx:jx+n-1) if incx = m_x, and X(ix:ix+n-1, jx) if incx = 1.
Input Parameters
uplo (global) CHARACTER*1. Specifies whether the distributed matrix sub(A) is upper or lower triangular: if uplo = 'U' or 'u', then the matrix is upper triangular; if uplo = 'L' or 'l', then the matrix is lower triangular.
trans (global) CHARACTER*1. Specifies the form of op(sub(A)) used in the matrix equation: if trans = 'N' or 'n', then sub(x) := sub(A)*sub(x); if trans = 'T' or 't', then sub(x) := sub(A)'*sub(x); if trans = 'C' or 'c', then sub(x) := conjg(sub(A)')*sub(x).
diag (global) CHARACTER*1. Specifies whether the matrix sub(A) is unit triangular: if diag = 'U' or 'u', then the matrix is unit triangular; if diag = 'N' or 'n', then the matrix is not unit triangular.
n (global) INTEGER. Specifies the order of the distributed matrix sub(A), n ≥ 0.
a (local) REAL for pstrmv; DOUBLE PRECISION for pdtrmv; COMPLEX for pctrmv; DOUBLE COMPLEX for pztrmv. Array, DIMENSION at least (lld_a, LOCq(1, ja+n-1)).
Before entry with uplo = 'U' or 'u', this array contains the local entries corresponding to the entries of the upper triangular distributed matrix sub(A), and the local entries corresponding to the entries of the strictly lower triangular part of the distributed matrix sub(A) are not referenced. Before entry with uplo = 'L' or 'l', this array contains the local entries corresponding to the entries of the lower triangular distributed matrix sub(A), and the local entries corresponding to the entries of the strictly upper triangular part of the distributed matrix sub(A) are not referenced. When diag = 'U' or 'u', the local entries corresponding to the diagonal elements of the submatrix sub(A) are not referenced either, but are assumed to be unity.
ia, ja (global) INTEGER. The row and column indices in the distributed matrix A indicating the first row and the first column of the submatrix sub(A), respectively.
desca (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix A.
x (local) REAL for pstrmv; DOUBLE PRECISION for pdtrmv; COMPLEX for pctrmv; DOUBLE COMPLEX for pztrmv. Array, DIMENSION at least ((jx-1)*m_x + ix + (n-1)*abs(incx)). This array contains the entries of the distributed vector sub(x).
ix, jx (global) INTEGER. The row and column indices in the distributed matrix X indicating the first row and the first column of the submatrix sub(x), respectively.
descx (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix X.
incx (global) INTEGER. Specifies the increment for the elements of sub(x). Only two values are supported, namely 1 and m_x. incx must not be zero.
Output Parameters
x Overwritten by the transformed distributed vector sub(x).
p?atrmv
Computes a distributed matrix-vector product using absolute values for a triangular matrix.
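For orientation, here is a serial Python sketch of the absolute-value product that p?atrmv computes, sub(y) := abs(alpha)*abs(op(sub(A)))*abs(sub(x)) + abs(beta*sub(y)). It is only a model under stated assumptions: atrmv_reference is a hypothetical name, abs is taken to be the element-wise modulus (this manual does not define it for complex data), and distribution, descriptors and increments are ignored. Under that reading, trans = 'T' and trans = 'C' give the same result, because conjugation does not change a modulus.

```python
def atrmv_reference(uplo, trans, diag, alpha, a, x, beta, y):
    """Serial model of the p?atrmv operation on an in-memory matrix.

    ASSUMPTION: abs() is the element-wise modulus. Returns the new y;
    no argument is modified.
    """
    n = len(x)
    upper = uplo.upper() == 'U'
    unit = diag.upper() == 'U'
    transposed = trans.upper() != 'N'

    def abs_tri(i, j):
        # Modulus of element (i, j) of the triangular matrix; the other
        # strict triangle counts as zero and is never read.
        if i == j:
            return 1.0 if unit else abs(a[i][j])
        return abs(a[i][j]) if (i < j) == upper else 0.0

    result = []
    for i in range(n):
        row_sum = sum((abs_tri(j, i) if transposed else abs_tri(i, j)) * abs(x[j])
                      for j in range(n))
        result.append(abs(alpha) * row_sum + abs(beta * y[i]))
    return result


a = [[-2, 3],
     [99, -4]]         # 99 is outside the upper triangle and is ignored
print(atrmv_reference('U', 'N', 'N', -2, a, [1, -1], -1, [1, 2]))
print(atrmv_reference('U', 'T', 'N', -2, a, [1, -1], -1, [1, 2]))
```

Note that, unlike p?trmv, the result is accumulated into a separate vector sub(y) and sub(x) is left unchanged.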
Syntax
call psatrmv(uplo, trans, diag, n, alpha, a, ia, ja, desca, x, ix, jx, descx, incx, beta, y, iy, jy, descy, incy)
call pdatrmv(uplo, trans, diag, n, alpha, a, ia, ja, desca, x, ix, jx, descx, incx, beta, y, iy, jy, descy, incy)
call pcatrmv(uplo, trans, diag, n, alpha, a, ia, ja, desca, x, ix, jx, descx, incx, beta, y, iy, jy, descy, incy)
call pzatrmv(uplo, trans, diag, n, alpha, a, ia, ja, desca, x, ix, jx, descx, incx, beta, y, iy, jy, descy, incy)
Include Files
• C: mkl_pblas.h
Description
The p?atrmv routines perform one of the following distributed matrix-vector operations defined as
sub(y) := abs(alpha)*abs(sub(A))*abs(sub(x)) + abs(beta*sub(y)), or
sub(y) := abs(alpha)*abs(sub(A)')*abs(sub(x)) + abs(beta*sub(y)), or
sub(y) := abs(alpha)*abs(conjg(sub(A)'))*abs(sub(x)) + abs(beta*sub(y)),
where:
alpha and beta are scalars,
sub(A) is an n-by-n unit, or non-unit, upper or lower triangular distributed matrix, sub(A) = A(ia:ia+n-1, ja:ja+n-1),
sub(x) and sub(y) are n-element distributed vectors.
sub(x) denotes X(ix, jx:jx+n-1) if incx = m_x, and X(ix:ix+n-1, jx) if incx = 1,
sub(y) denotes Y(iy, jy:jy+n-1) if incy = m_y, and Y(iy:iy+n-1, jy) if incy = 1.
Input Parameters
uplo (global) CHARACTER*1. Specifies whether the distributed matrix sub(A) is upper or lower triangular: if uplo = 'U' or 'u', then the matrix is upper triangular; if uplo = 'L' or 'l', then the matrix is lower triangular.
trans (global) CHARACTER*1. Specifies the form of op(sub(A)) used in the matrix equation: if trans = 'N' or 'n', then sub(y) := |alpha|*|sub(A)|*|sub(x)| + |beta*sub(y)|; if trans = 'T' or 't', then sub(y) := |alpha|*|sub(A)'|*|sub(x)| + |beta*sub(y)|; if trans = 'C' or 'c', then sub(y) := |alpha|*|conjg(sub(A)')|*|sub(x)| + |beta*sub(y)|.
diag (global) CHARACTER*1. Specifies whether the matrix sub(A) is unit triangular: if diag = 'U' or 'u', then the matrix is unit triangular; if diag = 'N' or 'n', then the matrix is not unit triangular.
n (global) INTEGER. Specifies the order of the distributed matrix sub(A), n ≥ 0.
alpha (global) REAL for psatrmv; DOUBLE PRECISION for pdatrmv; COMPLEX for pcatrmv; DOUBLE COMPLEX for pzatrmv. Specifies the scalar alpha.
a (local) REAL for psatrmv; DOUBLE PRECISION for pdatrmv; COMPLEX for pcatrmv; DOUBLE COMPLEX for pzatrmv. Array, DIMENSION at least (lld_a, LOCq(1, ja+n-1)). Before entry with uplo = 'U' or 'u', this array contains the local entries corresponding to the entries of the upper triangular distributed matrix sub(A), and the local entries corresponding to the entries of the strictly lower triangular part of the distributed matrix sub(A) are not referenced. Before entry with uplo = 'L' or 'l', this array contains the local entries corresponding to the entries of the lower triangular distributed matrix sub(A), and the local entries corresponding to the entries of the strictly upper triangular part of the distributed matrix sub(A) are not referenced. When diag = 'U' or 'u', the local entries corresponding to the diagonal elements of the submatrix sub(A) are not referenced either, but are assumed to be unity.
ia, ja (global) INTEGER. The row and column indices in the distributed matrix A indicating the first row and the first column of the submatrix sub(A), respectively.
desca (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix A.
x (local) REAL for psatrmv; DOUBLE PRECISION for pdatrmv; COMPLEX for pcatrmv; DOUBLE COMPLEX for pzatrmv. Array, DIMENSION at least ((jx-1)*m_x + ix + (n-1)*abs(incx)). This array contains the entries of the distributed vector sub(x).
ix, jx (global) INTEGER. The row and column indices in the distributed matrix X indicating the first row and the first column of the submatrix sub(x), respectively.
descx (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix X.
incx (global) INTEGER. Specifies the increment for the elements of sub(x). Only two values are supported, namely 1 and m_x. incx must not be zero.
beta (global) REAL for psatrmv; DOUBLE PRECISION for pdatrmv; COMPLEX for pcatrmv; DOUBLE COMPLEX for pzatrmv. Specifies the scalar beta. When beta is set to zero, sub(y) need not be set on input.
y (local) REAL for psatrmv; DOUBLE PRECISION for pdatrmv; COMPLEX for pcatrmv; DOUBLE COMPLEX for pzatrmv. Array, DIMENSION at least ((jy-1)*m_y + iy + (n-1)*abs(incy)). Because sub(A) is n-by-n, this bound is the same for every value of trans. This array contains the entries of the distributed vector sub(y).
iy, jy (global) INTEGER. The row and column indices in the distributed matrix Y indicating the first row and the first column of the submatrix sub(y), respectively.
descy (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix Y.
incy (global) INTEGER. Specifies the increment for the elements of sub(y). Only two values are supported, namely 1 and m_y. incy must not be zero.
Output Parameters
y Overwritten by the updated distributed vector sub(y).
p?trsv
Solves a system of linear equations whose coefficients are in a distributed triangular matrix.
Syntax
call pstrsv(uplo, trans, diag, n, a, ia, ja, desca, x, ix, jx, descx, incx)
call pdtrsv(uplo, trans, diag, n, a, ia, ja, desca, x, ix, jx, descx, incx)
call pctrsv(uplo, trans, diag, n, a, ia, ja, desca, x, ix, jx, descx, incx)
call pztrsv(uplo, trans, diag, n, a, ia, ja, desca, x, ix, jx, descx, incx)
Include Files
• C: mkl_pblas.h
Description
The p?trsv routines solve one of the systems of equations:
sub(A)*sub(x) = b, or
sub(A)'*sub(x) = b, or
conjg(sub(A)')*sub(x) = b,
where:
sub(A) is an n-by-n unit, or non-unit, upper or lower triangular distributed matrix, sub(A) = A(ia:ia+n-1, ja:ja+n-1),
b and sub(x) are n-element distributed vectors,
sub(x) denotes X(ix, jx:jx+n-1) if incx = m_x, and X(ix:ix+n-1, jx) if incx = 1.
The routine does not test for singularity or near-singularity.
Such tests must be performed before calling this routine.
Input Parameters
uplo (global) CHARACTER*1. Specifies whether the distributed matrix sub(A) is upper or lower triangular: if uplo = 'U' or 'u', then the matrix is upper triangular; if uplo = 'L' or 'l', then the matrix is lower triangular.
trans (global) CHARACTER*1. Specifies the form of the system of equations: if trans = 'N' or 'n', then sub(A)*sub(x) = b; if trans = 'T' or 't', then sub(A)'*sub(x) = b; if trans = 'C' or 'c', then conjg(sub(A)')*sub(x) = b.
diag (global) CHARACTER*1. Specifies whether the matrix sub(A) is unit triangular: if diag = 'U' or 'u', then the matrix is unit triangular; if diag = 'N' or 'n', then the matrix is not unit triangular.
n (global) INTEGER. Specifies the order of the distributed matrix sub(A), n ≥ 0.
a (local) REAL for pstrsv, DOUBLE PRECISION for pdtrsv, COMPLEX for pctrsv, DOUBLE COMPLEX for pztrsv. Array, DIMENSION at least (lld_a, LOCq(1, ja+n-1)). Before entry with uplo = 'U' or 'u', this array contains the local entries corresponding to the entries of the upper triangular distributed matrix sub(A), and the local entries corresponding to the entries of the strictly lower triangular part of the distributed matrix sub(A) are not referenced. Before entry with uplo = 'L' or 'l', this array contains the local entries corresponding to the entries of the lower triangular distributed matrix sub(A), and the local entries corresponding to the entries of the strictly upper triangular part of the distributed matrix sub(A) are not referenced. When diag = 'U' or 'u', the local entries corresponding to the diagonal elements of the submatrix sub(A) are not referenced either, but are assumed to be unity.
ia, ja (global) INTEGER. The row and column indices in the distributed matrix A indicating the first row and the first column of the submatrix sub(A), respectively.
desca (global and local) INTEGER array of dimension 8.
The array descriptor of the distributed matrix A.
x (local) REAL for pstrsv, DOUBLE PRECISION for pdtrsv, COMPLEX for pctrsv, DOUBLE COMPLEX for pztrsv. Array, DIMENSION at least ((jx-1)*m_x + ix+(n-1)*abs(incx)). This array contains the entries of the distributed vector sub(x). Before entry, sub(x) must contain the n-element right-hand side distributed vector b.
ix, jx (global) INTEGER. The row and column indices in the distributed matrix X indicating the first row and the first column of the submatrix sub(x), respectively.
descx (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix X.
incx (global) INTEGER. Specifies the increment for the elements of sub(x). Only two values are supported, namely 1 and m_x. incx must not be zero.
Output Parameters
x Overwritten with the solution vector.

PBLAS Level 3 Routines
The PBLAS Level 3 routines perform distributed matrix-matrix operations. Table "PBLAS Level 3 Routine Groups and Their Data Types" lists the PBLAS Level 3 routine groups and the data types associated with them.
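The local array dimensions quoted in the PBLAS routine descriptions, such as (lld_a, LOCq(ja+n-1)), depend on how many columns of the global matrix fall on the calling process column. The Python sketch below is not part of Intel MKL. It assumes that LOCq(K) follows the usual ScaLAPACK NUMROC convention for a block-cyclic distribution, that is, LOCq(K) = NUMROC(K, nb, mycol, csrc, npcol); the function name and the argument names n, nb, iproc, isrcproc, and nprocs are chosen here for illustration only.

```python
def numroc(n, nb, iproc, isrcproc, nprocs):
    """Count the rows or columns of an n-long dimension owned by process iproc.

    n        -- global number of rows or columns
    nb       -- block size of the block-cyclic distribution
    iproc    -- coordinate of the process being asked about
    isrcproc -- coordinate of the process that owns the first block
    nprocs   -- number of processes along this dimension
    """
    # Distance of this process from the owner of the first block.
    mydist = (nprocs + iproc - isrcproc) % nprocs
    nblocks = n // nb                  # number of complete blocks
    count = (nblocks // nprocs) * nb   # complete rounds: every process gets these
    extra = nblocks % nprocs           # complete blocks left after the full rounds
    if mydist < extra:
        count += nb                    # one additional complete block
    elif mydist == extra:
        count += n % nb                # the trailing partial block, if any
    return count


# Hypothetical layout: 10 columns, block size 3, 2 process columns,
# first block stored on process column 0.
local_cols = [numroc(10, 3, p, 0, 2) for p in range(2)]
```

With this layout, columns 1 to 3 and 7 to 9 go to process column 0, and columns 4 to 6 and 10 go to process column 1, so the second dimension of each local array must be at least the corresponding entry of local_cols.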
PBLAS Level 3 Routine Groups and Their Data Types
Routine Group   Data Types   Description
p?geadd         s, d, c, z   Distributed matrix-matrix sum of general matrices
p?tradd         s, d, c, z   Distributed matrix-matrix sum of triangular matrices
p?gemm          s, d, c, z   Distributed matrix-matrix product of general matrices
p?hemm          c, z         Distributed matrix-matrix product, one matrix is Hermitian
p?herk          c, z         Rank-k update of a distributed Hermitian matrix
p?her2k         c, z         Rank-2k update of a distributed Hermitian matrix
p?symm          s, d, c, z   Matrix-matrix product of distributed symmetric matrices
p?syrk          s, d, c, z   Rank-k update of a distributed symmetric matrix
p?syr2k         s, d, c, z   Rank-2k update of a distributed symmetric matrix
p?tran          s, d         Transposition of a real distributed matrix
p?tranc         c, z         Transposition of a complex distributed matrix (conjugated)
p?tranu         c, z         Transposition of a complex distributed matrix
p?trmm          s, d, c, z   Distributed matrix-matrix product, one matrix is triangular
p?trsm          s, d, c, z   Solution of a distributed matrix equation, one matrix is triangular

p?geadd
Performs sum operation for two distributed general matrices.
Syntax
call psgeadd(trans, m, n, alpha, a, ia, ja, desca, beta, c, ic, jc, descc)
call pdgeadd(trans, m, n, alpha, a, ia, ja, desca, beta, c, ic, jc, descc)
call pcgeadd(trans, m, n, alpha, a, ia, ja, desca, beta, c, ic, jc, descc)
call pzgeadd(trans, m, n, alpha, a, ia, ja, desca, beta, c, ic, jc, descc)
Include Files
• C: mkl_pblas.h
Description
The p?geadd routines perform sum operation for two distributed general matrices. The operation is defined as sub(C):=beta*sub(C) + alpha*op(sub(A)), where: op(x) is one of op(x) = x, or op(x) = x', alpha and beta are scalars, sub(C) is an m-by-n distributed matrix, sub(C)=C(ic:ic+m-1, jc:jc+n-1). sub(A) is a distributed matrix, sub(A)=A(ia:ia+n-1, ja:ja+m-1).
Input Parameters
trans (global) CHARACTER*1.
Specifies the operation: if trans = 'N' or 'n', then op(sub(A)) := sub(A); if trans = 'T' or 't', then op(sub(A)) := sub(A)'; if trans = 'C' or 'c', then op(sub(A)) := sub(A)'.
m (global) INTEGER. Specifies the number of rows of the distributed matrix sub(C) and the number of columns of the submatrix sub(A), m ≥ 0.
n (global) INTEGER. Specifies the number of columns of the distributed matrix sub(C) and the number of rows of the submatrix sub(A), n ≥ 0.
alpha (global) REAL for psgeadd, DOUBLE PRECISION for pdgeadd, COMPLEX for pcgeadd, DOUBLE COMPLEX for pzgeadd. Specifies the scalar alpha.
a (local) REAL for psgeadd, DOUBLE PRECISION for pdgeadd, COMPLEX for pcgeadd, DOUBLE COMPLEX for pzgeadd. Array, DIMENSION (lld_a, LOCq(ja+m-1)). This array contains the local pieces of the distributed matrix sub(A).
ia, ja (global) INTEGER. The row and column indices in the distributed matrix A indicating the first row and the first column of the submatrix sub(A), respectively.
desca (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix A.
beta (global) REAL for psgeadd, DOUBLE PRECISION for pdgeadd, COMPLEX for pcgeadd, DOUBLE COMPLEX for pzgeadd. Specifies the scalar beta. When beta is equal to zero, then sub(C) need not be set on input.
c (local) REAL for psgeadd, DOUBLE PRECISION for pdgeadd, COMPLEX for pcgeadd, DOUBLE COMPLEX for pzgeadd. Array, DIMENSION (lld_c, LOCq(jc+n-1)). This array contains the local pieces of the distributed matrix sub(C).
ic, jc (global) INTEGER. The row and column indices in the distributed matrix C indicating the first row and the first column of the submatrix sub(C), respectively.
descc (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix C.
Output Parameters
c Overwritten by the updated submatrix.

p?tradd
Performs sum operation for two distributed triangular matrices.
Syntax
call pstradd(uplo, trans, m, n, alpha, a, ia, ja, desca, beta, c, ic, jc, descc)
call pdtradd(uplo, trans, m, n, alpha, a, ia, ja, desca, beta, c, ic, jc, descc)
call pctradd(uplo, trans, m, n, alpha, a, ia, ja, desca, beta, c, ic, jc, descc)
call pztradd(uplo, trans, m, n, alpha, a, ia, ja, desca, beta, c, ic, jc, descc)
Include Files
• C: mkl_pblas.h
Description
The p?tradd routines perform sum operation for two distributed triangular matrices. The operation is defined as sub(C):=beta*sub(C) + alpha*op(sub(A)), where: op(x) is one of op(x) = x, or op(x) = x', or op(x) = conjg(x'), alpha and beta are scalars, sub(C) is an m-by-n distributed matrix, sub(C)=C(ic:ic+m-1, jc:jc+n-1). sub(A) is a distributed matrix, sub(A)=A(ia:ia+n-1, ja:ja+m-1).
Input Parameters
uplo (global) CHARACTER*1. Specifies whether the distributed matrix sub(C) is upper or lower triangular: if uplo = 'U' or 'u', then the matrix is upper triangular; if uplo = 'L' or 'l', then the matrix is lower triangular.
trans (global) CHARACTER*1. Specifies the operation: if trans = 'N' or 'n', then op(sub(A)) := sub(A); if trans = 'T' or 't', then op(sub(A)) := sub(A)'; if trans = 'C' or 'c', then op(sub(A)) := conjg(sub(A)').
m (global) INTEGER. Specifies the number of rows of the distributed matrix sub(C) and the number of columns of the submatrix sub(A), m ≥ 0.
n (global) INTEGER. Specifies the number of columns of the distributed matrix sub(C) and the number of rows of the submatrix sub(A), n ≥ 0.
alpha (global) REAL for pstradd, DOUBLE PRECISION for pdtradd, COMPLEX for pctradd, DOUBLE COMPLEX for pztradd. Specifies the scalar alpha.
a (local) REAL for pstradd, DOUBLE PRECISION for pdtradd, COMPLEX for pctradd, DOUBLE COMPLEX for pztradd. Array, DIMENSION (lld_a, LOCq(ja+m-1)). This array contains the local pieces of the distributed matrix sub(A).
ia, ja (global) INTEGER.
The row and column indices in the distributed matrix A indicating the first row and the first column of the submatrix sub(A), respectively.
desca (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix A.
beta (global) REAL for pstradd, DOUBLE PRECISION for pdtradd, COMPLEX for pctradd, DOUBLE COMPLEX for pztradd. Specifies the scalar beta. When beta is equal to zero, then sub(C) need not be set on input.
c (local) REAL for pstradd, DOUBLE PRECISION for pdtradd, COMPLEX for pctradd, DOUBLE COMPLEX for pztradd. Array, DIMENSION (lld_c, LOCq(jc+n-1)). This array contains the local pieces of the distributed matrix sub(C).
ic, jc (global) INTEGER. The row and column indices in the distributed matrix C indicating the first row and the first column of the submatrix sub(C), respectively.
descc (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix C.
Output Parameters
c Overwritten by the updated submatrix.

p?gemm
Computes a scalar-matrix-matrix product and adds the result to a scalar-matrix product for distributed matrices.
Syntax
call psgemm(transa, transb, m, n, k, alpha, a, ia, ja, desca, b, ib, jb, descb, beta, c, ic, jc, descc)
call pdgemm(transa, transb, m, n, k, alpha, a, ia, ja, desca, b, ib, jb, descb, beta, c, ic, jc, descc)
call pcgemm(transa, transb, m, n, k, alpha, a, ia, ja, desca, b, ib, jb, descb, beta, c, ic, jc, descc)
call pzgemm(transa, transb, m, n, k, alpha, a, ia, ja, desca, b, ib, jb, descb, beta, c, ic, jc, descc)
Include Files
• C: mkl_pblas.h
Description
The p?gemm routines perform a matrix-matrix operation with general distributed matrices.
The operation is defined as sub(C) := alpha*op(sub(A))*op(sub(B)) + beta*sub(C), where: op(x) is one of op(x) = x, or op(x) = x', alpha and beta are scalars, sub(A)=A(ia:ia+m-1, ja:ja+k-1), sub(B)=B(ib:ib+k-1, jb:jb+n-1), and sub(C)=C(ic:ic+m-1, jc:jc+n-1) are distributed matrices.
Input Parameters
transa (global) CHARACTER*1. Specifies the form of op(sub(A)) used in the matrix multiplication: if transa = 'N' or 'n', then op(sub(A)) = sub(A); if transa = 'T' or 't', then op(sub(A)) = sub(A)'; if transa = 'C' or 'c', then op(sub(A)) = sub(A)'.
transb (global) CHARACTER*1. Specifies the form of op(sub(B)) used in the matrix multiplication: if transb = 'N' or 'n', then op(sub(B)) = sub(B); if transb = 'T' or 't', then op(sub(B)) = sub(B)'; if transb = 'C' or 'c', then op(sub(B)) = sub(B)'.
m (global) INTEGER. Specifies the number of rows of the distributed matrices op(sub(A)) and sub(C), m ≥ 0.
n (global) INTEGER. Specifies the number of columns of the distributed matrices op(sub(B)) and sub(C), n ≥ 0.
k (global) INTEGER. Specifies the number of columns of the distributed matrix op(sub(A)) and the number of rows of the distributed matrix op(sub(B)). The value of k must be greater than or equal to 0.
alpha (global) REAL for psgemm, DOUBLE PRECISION for pdgemm, COMPLEX for pcgemm, DOUBLE COMPLEX for pzgemm. Specifies the scalar alpha. When alpha is equal to zero, then the local entries of the arrays a and b corresponding to the entries of the submatrices sub(A) and sub(B) respectively need not be set on input.
a (local) REAL for psgemm, DOUBLE PRECISION for pdgemm, COMPLEX for pcgemm, DOUBLE COMPLEX for pzgemm. Array, DIMENSION (lld_a, kla), where kla is LOCq(ja+k-1) when transa = 'N' or 'n', and is LOCq(ja+m-1) otherwise. Before entry this array must contain the local pieces of the distributed matrix sub(A).
ia, ja (global) INTEGER.
The row and column indices in the distributed matrix A indicating the first row and the first column of the submatrix sub(A), respectively.
desca (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix A.
b (local) REAL for psgemm, DOUBLE PRECISION for pdgemm, COMPLEX for pcgemm, DOUBLE COMPLEX for pzgemm. Array, DIMENSION (lld_b, klb), where klb is LOCq(jb+n-1) when transb = 'N' or 'n', and is LOCq(jb+k-1) otherwise. Before entry this array must contain the local pieces of the distributed matrix sub(B).
ib, jb (global) INTEGER. The row and column indices in the distributed matrix B indicating the first row and the first column of the submatrix sub(B), respectively.
descb (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix B.
beta (global) REAL for psgemm, DOUBLE PRECISION for pdgemm, COMPLEX for pcgemm, DOUBLE COMPLEX for pzgemm. Specifies the scalar beta. When beta is equal to zero, then sub(C) need not be set on input.
c (local) REAL for psgemm, DOUBLE PRECISION for pdgemm, COMPLEX for pcgemm, DOUBLE COMPLEX for pzgemm. Array, DIMENSION (lld_c, LOCq(jc+n-1)). Before entry this array must contain the local pieces of the distributed matrix sub(C).
ic, jc (global) INTEGER. The row and column indices in the distributed matrix C indicating the first row and the first column of the submatrix sub(C), respectively.
descc (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix C.
Output Parameters
c Overwritten by the m-by-n distributed matrix alpha*op(sub(A))*op(sub(B)) + beta*sub(C).

p?hemm
Performs a scalar-matrix-matrix product (one matrix operand is Hermitian) and adds the result to a scalar-matrix product.
Syntax
call pchemm(side, uplo, m, n, alpha, a, ia, ja, desca, b, ib, jb, descb, beta, c, ic, jc, descc)
call pzhemm(side, uplo, m, n, alpha, a, ia, ja, desca, b, ib, jb, descb, beta, c, ic, jc, descc)
Include Files
• C: mkl_pblas.h
Description
The p?hemm routines perform a matrix-matrix operation with distributed matrices. The operation is defined as sub(C):=alpha*sub(A)*sub(B)+ beta*sub(C), or sub(C):=alpha*sub(B)*sub(A)+ beta*sub(C), where: alpha and beta are scalars, sub(A) is a Hermitian distributed matrix, sub(A)=A(ia:ia+m-1, ja:ja+m-1), if side = 'L', and sub(A)=A(ia:ia+n-1, ja:ja+n-1), if side = 'R'. sub(B) and sub(C) are m-by-n distributed matrices. sub(B)=B(ib:ib+m-1, jb:jb+n-1), sub(C)=C(ic:ic+m-1, jc:jc+n-1).
Input Parameters
side (global) CHARACTER*1. Specifies whether the Hermitian distributed matrix sub(A) appears on the left or right in the operation: if side = 'L' or 'l', then sub(C) := alpha*sub(A)*sub(B) + beta*sub(C); if side = 'R' or 'r', then sub(C) := alpha*sub(B)*sub(A) + beta*sub(C).
uplo (global) CHARACTER*1. Specifies whether the upper or lower triangular part of the Hermitian distributed matrix sub(A) is used: if uplo = 'U' or 'u', then the upper triangular part is used; if uplo = 'L' or 'l', then the lower triangular part is used.
m (global) INTEGER. Specifies the number of rows of the distributed submatrix sub(C), m ≥ 0.
n (global) INTEGER. Specifies the number of columns of the distributed submatrix sub(C), n ≥ 0.
alpha (global) COMPLEX for pchemm, DOUBLE COMPLEX for pzhemm. Specifies the scalar alpha.
a (local) COMPLEX for pchemm, DOUBLE COMPLEX for pzhemm. Array, DIMENSION (lld_a, LOCq(ja+na-1)).
Before entry this array must contain the local pieces of the Hermitian distributed matrix sub(A), such that when uplo = 'U' or 'u', the na-by-na upper triangular part of the distributed matrix sub(A) must contain the upper triangular part of the Hermitian distributed matrix and the strictly lower triangular part of sub(A) is not referenced, and when uplo = 'L' or 'l', the na-by-na lower triangular part of the distributed matrix sub(A) must contain the lower triangular part of the Hermitian distributed matrix and the strictly upper triangular part of sub(A) is not referenced.
ia, ja (global) INTEGER. The row and column indices in the distributed matrix A indicating the first row and the first column of the submatrix sub(A), respectively.
desca (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix A.
b (local) COMPLEX for pchemm, DOUBLE COMPLEX for pzhemm. Array, DIMENSION (lld_b, LOCq(jb+n-1)). Before entry this array must contain the local pieces of the distributed matrix sub(B).
ib, jb (global) INTEGER. The row and column indices in the distributed matrix B indicating the first row and the first column of the submatrix sub(B), respectively.
descb (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix B.
beta (global) COMPLEX for pchemm, DOUBLE COMPLEX for pzhemm. Specifies the scalar beta. When beta is set to zero, then sub(C) need not be set on input.
c (local) COMPLEX for pchemm, DOUBLE COMPLEX for pzhemm. Array, DIMENSION (lld_c, LOCq(jc+n-1)). Before entry this array must contain the local pieces of the distributed matrix sub(C).
ic, jc (global) INTEGER. The row and column indices in the distributed matrix C indicating the first row and the first column of the submatrix sub(C), respectively.
descc (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix C.
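As a point of reference for the side and uplo parameters described above, the following Python sketch evaluates the p?hemm operation serially on ordinary nested lists. It is an illustration under stated assumptions, not Intel MKL code: the function name hemm_reference is hypothetical, the matrices are not distributed, and the descriptor and index arguments (ia, ja, desca, and so on) are omitted. Only the triangle of a selected by uplo is read, and the diagonal is taken as real, which is the usual BLAS convention for Hermitian matrices.

```python
def hemm_reference(side, uplo, alpha, a, b, beta, c):
    """Serial model of sub(C) := alpha*A*B + beta*C (side 'L')
    or sub(C) := alpha*B*A + beta*C (side 'R') with Hermitian A."""
    m, n = len(b), len(b[0])
    left = side.upper() == 'L'
    upper = uplo.upper() == 'U'

    def a_elem(i, j):
        # Diagonal entries of a Hermitian matrix are real.
        if i == j:
            return complex(a[i][i].real)
        # Read the stored triangle; rebuild the other one by conjugation.
        stored = (i < j) if upper else (i > j)
        return a[i][j] if stored else a[j][i].conjugate()

    # When beta is zero, c is not read, matching "need not be set on input".
    out = [[(beta * c[i][j] if beta != 0 else 0) for j in range(n)]
           for i in range(m)]
    for i in range(m):
        for j in range(n):
            if left:
                s = sum(a_elem(i, p) * b[p][j] for p in range(m))
            else:
                s = sum(b[i][p] * a_elem(p, j) for p in range(n))
            out[i][j] += alpha * s
    return out


# The value 99 marks entries that must never be referenced.
a_upper = [[2, 1 + 1j], [99, 3]]
a_lower = [[2, 99], [1 - 1j, 3]]
b = [[1], [1j]]
c = [[0], [0]]
from_upper = hemm_reference('L', 'U', 1, a_upper, b, 0, c)
from_lower = hemm_reference('L', 'L', 1, a_lower, b, 0, c)
```

Both calls describe the same Hermitian matrix through different triangles, so from_upper and from_lower agree.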
Output Parameters
c Overwritten by the m-by-n updated distributed matrix.

p?herk
Performs a rank-k update of a distributed Hermitian matrix.
Syntax
call pcherk(uplo, trans, n, k, alpha, a, ia, ja, desca, beta, c, ic, jc, descc)
call pzherk(uplo, trans, n, k, alpha, a, ia, ja, desca, beta, c, ic, jc, descc)
Include Files
• C: mkl_pblas.h
Description
The p?herk routines perform a distributed matrix-matrix operation defined as sub(C):=alpha*sub(A)*conjg(sub(A)')+ beta*sub(C), or sub(C):=alpha*conjg(sub(A)')*sub(A)+ beta*sub(C), where: alpha and beta are scalars, sub(C) is an n-by-n Hermitian distributed matrix, sub(C)=C(ic:ic+n-1, jc:jc+n-1). sub(A) is a distributed matrix, sub(A)=A(ia:ia+n-1, ja:ja+k-1), if trans = 'N' or 'n', and sub(A)=A(ia:ia+k-1, ja:ja+n-1) otherwise.
Input Parameters
uplo (global) CHARACTER*1. Specifies whether the upper or lower triangular part of the Hermitian distributed matrix sub(C) is used: if uplo = 'U' or 'u', then the upper triangular part of sub(C) is used; if uplo = 'L' or 'l', then the lower triangular part of sub(C) is used.
trans (global) CHARACTER*1. Specifies the operation: if trans = 'N' or 'n', then sub(C) := alpha*sub(A)*conjg(sub(A)') + beta*sub(C); if trans = 'C' or 'c', then sub(C) := alpha*conjg(sub(A)')*sub(A) + beta*sub(C).
n (global) INTEGER. Specifies the order of the distributed matrix sub(C), n ≥ 0.
k (global) INTEGER. On entry with trans = 'N' or 'n', k specifies the number of columns of the distributed matrix sub(A), and on entry with trans = 'C' or 'c', k specifies the number of rows of the distributed matrix sub(A), k ≥ 0.
alpha (global) REAL for pcherk, DOUBLE PRECISION for pzherk. Specifies the scalar alpha.
a (local) COMPLEX for pcherk, DOUBLE COMPLEX for pzherk. Array, DIMENSION (lld_a, kla), where kla is LOCq(ja+k-1) when trans = 'N' or 'n', and is LOCq(ja+n-1) otherwise.
Before entry with trans = 'N' or 'n', this array contains the local pieces of the distributed matrix sub(A).
ia, ja (global) INTEGER. The row and column indices in the distributed matrix A indicating the first row and the first column of the submatrix sub(A), respectively.
desca (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix A.
beta (global) REAL for pcherk, DOUBLE PRECISION for pzherk. Specifies the scalar beta.
c (local) COMPLEX for pcherk, DOUBLE COMPLEX for pzherk. Array, DIMENSION (lld_c, LOCq(jc+n-1)). Before entry with uplo = 'U' or 'u', this array contains the n-by-n upper triangular part of the Hermitian distributed matrix sub(C) and its strictly lower triangular part is not referenced. Before entry with uplo = 'L' or 'l', this array contains the n-by-n lower triangular part of the Hermitian distributed matrix sub(C) and its strictly upper triangular part is not referenced.
ic, jc (global) INTEGER. The row and column indices in the distributed matrix C indicating the first row and the first column of the submatrix sub(C), respectively.
descc (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix C.
Output Parameters
c With uplo = 'U' or 'u', the upper triangular part of sub(C) is overwritten by the upper triangular part of the updated distributed matrix. With uplo = 'L' or 'l', the lower triangular part of sub(C) is overwritten by the lower triangular part of the updated distributed matrix.

p?her2k
Performs a rank-2k update of a Hermitian distributed matrix.
Syntax
Fortran 77:
call pcher2k(uplo, trans, n, k, alpha, a, ia, ja, desca, b, ib, jb, descb, beta, c, ic, jc, descc)
call pzher2k(uplo, trans, n, k, alpha, a, ia, ja, desca, b, ib, jb, descb, beta, c, ic, jc, descc)
Include Files
• C: mkl_pblas.h
Description
The p?her2k routines perform a distributed matrix-matrix operation defined as sub(C):=alpha*sub(A)*conjg(sub(B)')+ conjg(alpha)*sub(B)*conjg(sub(A)')+beta*sub(C), or sub(C):=alpha*conjg(sub(A)')*sub(B)+ conjg(alpha)*conjg(sub(B)')*sub(A) + beta*sub(C), where: alpha and beta are scalars, sub(C) is an n-by-n Hermitian distributed matrix, sub(C) = C(ic:ic+n-1, jc:jc+n-1). sub(A) is a distributed matrix, sub(A)=A(ia:ia+n-1, ja:ja+k-1), if trans = 'N' or 'n', and sub(A) = A(ia:ia+k-1, ja:ja+n-1) otherwise. sub(B) is a distributed matrix, sub(B) = B(ib:ib+n-1, jb:jb+k-1), if trans = 'N' or 'n', and sub(B)=B(ib:ib+k-1, jb:jb+n-1) otherwise.
Input Parameters
uplo (global) CHARACTER*1. Specifies whether the upper or lower triangular part of the Hermitian distributed matrix sub(C) is used: if uplo = 'U' or 'u', then the upper triangular part of sub(C) is used; if uplo = 'L' or 'l', then the lower triangular part of sub(C) is used.
trans (global) CHARACTER*1. Specifies the operation: if trans = 'N' or 'n', then sub(C) := alpha*sub(A)*conjg(sub(B)') + conjg(alpha)*sub(B)*conjg(sub(A)') + beta*sub(C); if trans = 'C' or 'c', then sub(C) := alpha*conjg(sub(A)')*sub(B) + conjg(alpha)*conjg(sub(B)')*sub(A) + beta*sub(C).
n (global) INTEGER. Specifies the order of the distributed matrix sub(C), n ≥ 0.
k (global) INTEGER. On entry with trans = 'N' or 'n', k specifies the number of columns of the distributed matrices sub(A) and sub(B), and on entry with trans = 'C' or 'c', k specifies the number of rows of the distributed matrices sub(A) and sub(B), k ≥ 0.
alpha (global) COMPLEX for pcher2k, DOUBLE COMPLEX for pzher2k. Specifies the scalar alpha.
a (local) COMPLEX for pcher2k, DOUBLE COMPLEX for pzher2k. Array, DIMENSION (lld_a, kla), where kla is LOCq(ja+k-1) when trans = 'N' or 'n', and is LOCq(ja+n-1) otherwise. Before entry with trans = 'N' or 'n', this array contains the local pieces of the distributed matrix sub(A).
ia, ja (global) INTEGER. The row and column indices in the distributed matrix A indicating the first row and the first column of the submatrix sub(A), respectively.
desca (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix A.
b (local) COMPLEX for pcher2k, DOUBLE COMPLEX for pzher2k. Array, DIMENSION (lld_b, klb), where klb is LOCq(jb+k-1) when trans = 'N' or 'n', and is LOCq(jb+n-1) otherwise. Before entry with trans = 'N' or 'n', this array contains the local pieces of the distributed matrix sub(B).
ib, jb (global) INTEGER. The row and column indices in the distributed matrix B indicating the first row and the first column of the submatrix sub(B), respectively.
descb (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix B.
beta (global) REAL for pcher2k, DOUBLE PRECISION for pzher2k. Specifies the scalar beta.
c (local) COMPLEX for pcher2k, DOUBLE COMPLEX for pzher2k. Array, DIMENSION (lld_c, LOCq(jc+n-1)). Before entry with uplo = 'U' or 'u', this array contains the n-by-n upper triangular part of the Hermitian distributed matrix sub(C) and its strictly lower triangular part is not referenced. Before entry with uplo = 'L' or 'l', this array contains the n-by-n lower triangular part of the Hermitian distributed matrix sub(C) and its strictly upper triangular part is not referenced.
ic, jc (global) INTEGER. The row and column indices in the distributed matrix C indicating the first row and the first column of the submatrix sub(C), respectively.
descc (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix C.
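To make the roles of uplo and trans in the rank-2k update concrete, the following Python sketch models the operation serially on nested lists. It is a minimal illustration, not the library implementation: the name her2k_reference is hypothetical, nothing is distributed, and the descriptor and index arguments are left out. It assumes the standard BLAS definition of the update, alpha*A*conjg(B') + conjg(alpha)*B*conjg(A') + beta*C for trans = 'N' and alpha*conjg(A')*B + conjg(alpha)*conjg(B')*A + beta*C for trans = 'C', and it writes only the triangle of C selected by uplo.

```python
def her2k_reference(uplo, trans, alpha, a, b, beta, c):
    """Serial model of the Hermitian rank-2k update of an n-by-n matrix c."""
    n = len(c)
    if trans.upper() == 'C':
        # a and b are k-by-n here; replace them by their conjugate
        # transposes (n-by-k) so one formula covers both cases.
        a = [[a[p][i].conjugate() for p in range(len(a))] for i in range(n)]
        b = [[b[p][i].conjugate() for p in range(len(b))] for i in range(n)]
    k = len(a[0])
    upper = uplo.upper() == 'U'

    out = [row[:] for row in c]  # the other triangle is left untouched
    for i in range(n):
        cols = range(i, n) if upper else range(0, i + 1)
        for j in cols:
            s = sum(alpha * a[i][p] * b[j][p].conjugate()
                    + alpha.conjugate() * b[i][p] * a[j][p].conjugate()
                    for p in range(k))
            out[i][j] = s + beta * c[i][j]
    return out


# n = 2, k = 1 example with a complex alpha and a real beta.
a = [[1], [1j]]
b = [[1], [1]]
c = [[0, 0], [0, 0]]
updated = her2k_reference('U', 'N', 1j, a, b, 1.0, c)

# Same update expressed with trans = 'C' and the k-by-n operands.
a_h = [[1, -1j]]
b_h = [[1, 1]]
updated_c = her2k_reference('U', 'C', 1j, a_h, b_h, 1.0, c)
```

The two terms of the update are conjugate transposes of each other, which is why the result stays Hermitian and only one triangle needs to be stored.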
Output Parameters
c With uplo = 'U' or 'u', the upper triangular part of sub(C) is overwritten by the upper triangular part of the updated distributed matrix. With uplo = 'L' or 'l', the lower triangular part of sub(C) is overwritten by the lower triangular part of the updated distributed matrix.

p?symm
Performs a scalar-matrix-matrix product (one matrix operand is symmetric) and adds the result to a scalar-matrix product for distributed matrices.
Syntax
Fortran 77:
call pssymm(side, uplo, m, n, alpha, a, ia, ja, desca, b, ib, jb, descb, beta, c, ic, jc, descc)
call pdsymm(side, uplo, m, n, alpha, a, ia, ja, desca, b, ib, jb, descb, beta, c, ic, jc, descc)
call pcsymm(side, uplo, m, n, alpha, a, ia, ja, desca, b, ib, jb, descb, beta, c, ic, jc, descc)
call pzsymm(side, uplo, m, n, alpha, a, ia, ja, desca, b, ib, jb, descb, beta, c, ic, jc, descc)
Include Files
• C: mkl_pblas.h
Description
The p?symm routines perform a matrix-matrix operation with distributed matrices. The operation is defined as sub(C):=alpha*sub(A)*sub(B)+ beta*sub(C), or sub(C):=alpha*sub(B)*sub(A)+ beta*sub(C), where: alpha and beta are scalars, sub(A) is a symmetric distributed matrix, sub(A)=A(ia:ia+m-1, ja:ja+m-1), if side = 'L', and sub(A)=A(ia:ia+n-1, ja:ja+n-1), if side = 'R'. sub(B) and sub(C) are m-by-n distributed matrices. sub(B)=B(ib:ib+m-1, jb:jb+n-1), sub(C)=C(ic:ic+m-1, jc:jc+n-1).
Input Parameters
side (global) CHARACTER*1. Specifies whether the symmetric distributed matrix sub(A) appears on the left or right in the operation: if side = 'L' or 'l', then sub(C) := alpha*sub(A)*sub(B) + beta*sub(C); if side = 'R' or 'r', then sub(C) := alpha*sub(B)*sub(A) + beta*sub(C).
uplo (global) CHARACTER*1. Specifies whether the upper or lower triangular part of the symmetric distributed matrix sub(A) is used: if uplo = 'U' or 'u', then the upper triangular part is used; if uplo = 'L' or 'l', then the lower triangular part is used.
m (global) INTEGER.
Specifies the number of rows of the distributed submatrix sub(C), m ≥ 0.
n (global) INTEGER. Specifies the number of columns of the distributed submatrix sub(C), n ≥ 0.
alpha (global) REAL for pssymm, DOUBLE PRECISION for pdsymm, COMPLEX for pcsymm, DOUBLE COMPLEX for pzsymm. Specifies the scalar alpha.
a (local) REAL for pssymm, DOUBLE PRECISION for pdsymm, COMPLEX for pcsymm, DOUBLE COMPLEX for pzsymm. Array, DIMENSION (lld_a, LOCq(ja+na-1)). Before entry this array must contain the local pieces of the symmetric distributed matrix sub(A), such that when uplo = 'U' or 'u', the na-by-na upper triangular part of the distributed matrix sub(A) must contain the upper triangular part of the symmetric distributed matrix and the strictly lower triangular part of sub(A) is not referenced, and when uplo = 'L' or 'l', the na-by-na lower triangular part of the distributed matrix sub(A) must contain the lower triangular part of the symmetric distributed matrix and the strictly upper triangular part of sub(A) is not referenced.
ia, ja (global) INTEGER. The row and column indices in the distributed matrix A indicating the first row and the first column of the submatrix sub(A), respectively.
desca (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix A.
b (local) REAL for pssymm, DOUBLE PRECISION for pdsymm, COMPLEX for pcsymm, DOUBLE COMPLEX for pzsymm. Array, DIMENSION (lld_b, LOCq(jb+n-1)). Before entry this array must contain the local pieces of the distributed matrix sub(B).
ib, jb (global) INTEGER. The row and column indices in the distributed matrix B indicating the first row and the first column of the submatrix sub(B), respectively.
descb (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix B.
beta (global) REAL for pssymm, DOUBLE PRECISION for pdsymm, COMPLEX for pcsymm, DOUBLE COMPLEX for pzsymm. Specifies the scalar beta.
When beta is set to zero, then sub(C) need not be set on input.
c (local) REAL for pssymm, DOUBLE PRECISION for pdsymm, COMPLEX for pcsymm, DOUBLE COMPLEX for pzsymm. Array, DIMENSION (lld_c, LOCq(jc+n-1)). Before entry this array must contain the local pieces of the distributed matrix sub(C).
ic, jc (global) INTEGER. The row and column indices in the distributed matrix C indicating the first row and the first column of the submatrix sub(C), respectively.
descc (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix C.
Output Parameters
c Overwritten by the m-by-n updated matrix.

p?syrk
Performs a rank-k update of a symmetric distributed matrix.
Syntax
call pssyrk(uplo, trans, n, k, alpha, a, ia, ja, desca, beta, c, ic, jc, descc)
call pdsyrk(uplo, trans, n, k, alpha, a, ia, ja, desca, beta, c, ic, jc, descc)
call pcsyrk(uplo, trans, n, k, alpha, a, ia, ja, desca, beta, c, ic, jc, descc)
call pzsyrk(uplo, trans, n, k, alpha, a, ia, ja, desca, beta, c, ic, jc, descc)
Include Files
• C: mkl_pblas.h
Description
The p?syrk routines perform a distributed matrix-matrix operation defined as sub(C):=alpha*sub(A)*sub(A)'+ beta*sub(C), or sub(C):=alpha*sub(A)'*sub(A)+ beta*sub(C), where: alpha and beta are scalars, sub(C) is an n-by-n symmetric distributed matrix, sub(C)=C(ic:ic+n-1, jc:jc+n-1). sub(A) is a distributed matrix, sub(A)=A(ia:ia+n-1, ja:ja+k-1), if trans = 'N' or 'n', and sub(A)=A(ia:ia+k-1, ja:ja+n-1) otherwise.
Input Parameters
uplo (global) CHARACTER*1. Specifies whether the upper or lower triangular part of the symmetric distributed matrix sub(C) is used: if uplo = 'U' or 'u', then the upper triangular part of sub(C) is used; if uplo = 'L' or 'l', then the lower triangular part of sub(C) is used.
trans (global) CHARACTER*1. Specifies the operation: if trans = 'N' or 'n', then sub(C) := alpha*sub(A)*sub(A)' + beta*sub(C); if trans = 'T' or 't', then sub(C) := alpha*sub(A)'*sub(A) + beta*sub(C).
n (global) INTEGER. Specifies the order of the distributed matrix sub(C), n ≥ 0.
k (global) INTEGER. On entry with trans = 'N' or 'n', k specifies the number of columns of the distributed matrix sub(A), and on entry with trans = 'T' or 't', k specifies the number of rows of the distributed matrix sub(A), k ≥ 0.
alpha (global) REAL for pssyrk, DOUBLE PRECISION for pdsyrk, COMPLEX for pcsyrk, DOUBLE COMPLEX for pzsyrk. Specifies the scalar alpha.
a (local) REAL for pssyrk, DOUBLE PRECISION for pdsyrk, COMPLEX for pcsyrk, DOUBLE COMPLEX for pzsyrk. Array, DIMENSION (lld_a, kla), where kla is LOCq(ja+k-1) when trans = 'N' or 'n', and is LOCq(ja+n-1) otherwise. Before entry with trans = 'N' or 'n', this array contains the local pieces of the distributed matrix sub(A).
ia, ja (global) INTEGER. The row and column indices in the distributed matrix A indicating the first row and the first column of the submatrix sub(A), respectively.
desca (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix A.
beta (global) REAL for pssyrk, DOUBLE PRECISION for pdsyrk, COMPLEX for pcsyrk, DOUBLE COMPLEX for pzsyrk. Specifies the scalar beta.
c (local) REAL for pssyrk, DOUBLE PRECISION for pdsyrk, COMPLEX for pcsyrk, DOUBLE COMPLEX for pzsyrk. Array, DIMENSION (lld_c, LOCq(jc+n-1)). Before entry with uplo = 'U' or 'u', this array contains the n-by-n upper triangular part of the symmetric distributed matrix sub(C) and its strictly lower triangular part is not referenced. Before entry with uplo = 'L' or 'l', this array contains the n-by-n lower triangular part of the symmetric distributed matrix sub(C) and its strictly upper triangular part is not referenced.
ic, jc (global) INTEGER. The row and column indices in the distributed matrix C indicating the first row and the first column of the submatrix sub(C), respectively.
descc (global and local) INTEGER array of dimension 8.
The array descriptor of the distributed matrix C.

Output Parameters

c With uplo = 'U' or 'u', the upper triangular part of sub(C) is overwritten by the upper triangular part of the updated distributed matrix.
With uplo = 'L' or 'l', the lower triangular part of sub(C) is overwritten by the lower triangular part of the updated distributed matrix.

p?syr2k

Performs a rank-2k update of a symmetric distributed matrix.

Syntax

call pssyr2k(uplo, trans, n, k, alpha, a, ia, ja, desca, b, ib, jb, descb, beta, c, ic, jc, descc)
call pdsyr2k(uplo, trans, n, k, alpha, a, ia, ja, desca, b, ib, jb, descb, beta, c, ic, jc, descc)
call pcsyr2k(uplo, trans, n, k, alpha, a, ia, ja, desca, b, ib, jb, descb, beta, c, ic, jc, descc)
call pzsyr2k(uplo, trans, n, k, alpha, a, ia, ja, desca, b, ib, jb, descb, beta, c, ic, jc, descc)

Include Files

• C: mkl_pblas.h

Description

The p?syr2k routines perform a distributed matrix-matrix operation defined as

sub(C) := alpha*sub(A)*sub(B)' + alpha*sub(B)*sub(A)' + beta*sub(C),

or

sub(C) := alpha*sub(A)'*sub(B) + alpha*sub(B)'*sub(A) + beta*sub(C),

where:
alpha and beta are scalars,
sub(C) is an n-by-n symmetric distributed matrix, sub(C)=C(ic:ic+n-1, jc:jc+n-1),
sub(A) is a distributed matrix, sub(A)=A(ia:ia+n-1, ja:ja+k-1) if trans = 'N' or 'n', and sub(A)=A(ia:ia+k-1, ja:ja+n-1) otherwise,
sub(B) is a distributed matrix, sub(B)=B(ib:ib+n-1, jb:jb+k-1) if trans = 'N' or 'n', and sub(B)=B(ib:ib+k-1, jb:jb+n-1) otherwise.

Input Parameters

uplo (global) CHARACTER*1. Specifies whether the upper or lower triangular part of the symmetric distributed matrix sub(C) is used:
If uplo = 'U' or 'u', then the upper triangular part of sub(C) is used.
If uplo = 'L' or 'l', then the lower triangular part of sub(C) is used.

trans (global) CHARACTER*1.
Specifies the operation:
if trans = 'N' or 'n', then sub(C) := alpha*sub(A)*sub(B)' + alpha*sub(B)*sub(A)' + beta*sub(C);
if trans = 'T' or 't', then sub(C) := alpha*sub(B)'*sub(A) + alpha*sub(A)'*sub(B) + beta*sub(C).

n (global) INTEGER. Specifies the order of the distributed matrix sub(C), n ≥ 0.

k (global) INTEGER. On entry with trans = 'N' or 'n', k specifies the number of columns of the distributed matrices sub(A) and sub(B), and on entry with trans = 'T' or 't', k specifies the number of rows of the distributed matrices sub(A) and sub(B), k ≥ 0.

alpha (global) REAL for pssyr2k, DOUBLE PRECISION for pdsyr2k, COMPLEX for pcsyr2k, DOUBLE COMPLEX for pzsyr2k. Specifies the scalar alpha.

a (local) REAL for pssyr2k, DOUBLE PRECISION for pdsyr2k, COMPLEX for pcsyr2k, DOUBLE COMPLEX for pzsyr2k.
Array, DIMENSION (lld_a, kla), where kla is LOCq(ja+k-1) when trans = 'N' or 'n', and is LOCq(ja+n-1) otherwise. Before entry with trans = 'N' or 'n', this array contains the local pieces of the distributed matrix sub(A).

ia, ja (global) INTEGER. The row and column indices in the distributed matrix A indicating the first row and the first column of the submatrix sub(A), respectively.

desca (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix A.

b (local) REAL for pssyr2k, DOUBLE PRECISION for pdsyr2k, COMPLEX for pcsyr2k, DOUBLE COMPLEX for pzsyr2k.
Array, DIMENSION (lld_b, klb), where klb is LOCq(jb+k-1) when trans = 'N' or 'n', and is LOCq(jb+n-1) otherwise. Before entry with trans = 'N' or 'n', this array contains the local pieces of the distributed matrix sub(B).

ib, jb (global) INTEGER. The row and column indices in the distributed matrix B indicating the first row and the first column of the submatrix sub(B), respectively.

descb (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix B.
beta (global) REAL for pssyr2k, DOUBLE PRECISION for pdsyr2k, COMPLEX for pcsyr2k, DOUBLE COMPLEX for pzsyr2k. Specifies the scalar beta.

c (local) REAL for pssyr2k, DOUBLE PRECISION for pdsyr2k, COMPLEX for pcsyr2k, DOUBLE COMPLEX for pzsyr2k.
Array, DIMENSION (lld_c, LOCq(jc+n-1)).
Before entry with uplo = 'U' or 'u', this array contains the n-by-n upper triangular part of the symmetric distributed matrix sub(C), and its strictly lower triangular part is not referenced.
Before entry with uplo = 'L' or 'l', this array contains the n-by-n lower triangular part of the symmetric distributed matrix sub(C), and its strictly upper triangular part is not referenced.

ic, jc (global) INTEGER. The row and column indices in the distributed matrix C indicating the first row and the first column of the submatrix sub(C), respectively.

descc (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix C.

Output Parameters

c With uplo = 'U' or 'u', the upper triangular part of sub(C) is overwritten by the upper triangular part of the updated distributed matrix.
With uplo = 'L' or 'l', the lower triangular part of sub(C) is overwritten by the lower triangular part of the updated distributed matrix.

p?tran

Transposes a real distributed matrix.

Syntax

call pstran(m, n, alpha, a, ia, ja, desca, beta, c, ic, jc, descc)
call pdtran(m, n, alpha, a, ia, ja, desca, beta, c, ic, jc, descc)

Include Files

• C: mkl_pblas.h

Description

The p?tran routines transpose a real distributed matrix. The operation is defined as

sub(C) := beta*sub(C) + alpha*sub(A)',

where:
alpha and beta are scalars,
sub(C) is an m-by-n distributed matrix, sub(C)=C(ic:ic+m-1, jc:jc+n-1),
sub(A) is a distributed matrix, sub(A)=A(ia:ia+n-1, ja:ja+m-1).

Input Parameters

m (global) INTEGER. Specifies the number of rows of the distributed matrix sub(C), m ≥ 0.

n (global) INTEGER. Specifies the number of columns of the distributed matrix sub(C), n ≥ 0.
alpha (global) REAL for pstran, DOUBLE PRECISION for pdtran. Specifies the scalar alpha.

a (local) REAL for pstran, DOUBLE PRECISION for pdtran.
Array, DIMENSION (lld_a, LOCq(ja+m-1)). This array contains the local pieces of the distributed matrix sub(A).

ia, ja (global) INTEGER. The row and column indices in the distributed matrix A indicating the first row and the first column of the submatrix sub(A), respectively.

desca (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix A.

beta (global) REAL for pstran, DOUBLE PRECISION for pdtran. Specifies the scalar beta. When beta is equal to zero, sub(C) need not be set on input.

c (local) REAL for pstran, DOUBLE PRECISION for pdtran.
Array, DIMENSION (lld_c, LOCq(jc+n-1)). This array contains the local pieces of the distributed matrix sub(C).

ic, jc (global) INTEGER. The row and column indices in the distributed matrix C indicating the first row and the first column of the submatrix sub(C), respectively.

descc (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix C.

Output Parameters

c Overwritten by the updated submatrix.

p?tranu

Transposes a distributed complex matrix.

Syntax

call pctranu(m, n, alpha, a, ia, ja, desca, beta, c, ic, jc, descc)
call pztranu(m, n, alpha, a, ia, ja, desca, beta, c, ic, jc, descc)

Include Files

• C: mkl_pblas.h

Description

The p?tranu routines transpose a complex distributed matrix. The operation is defined as

sub(C) := beta*sub(C) + alpha*sub(A)',

where:
alpha and beta are scalars,
sub(C) is an m-by-n distributed matrix, sub(C)=C(ic:ic+m-1, jc:jc+n-1),
sub(A) is a distributed matrix, sub(A)=A(ia:ia+n-1, ja:ja+m-1).

Input Parameters

m (global) INTEGER. Specifies the number of rows of the distributed matrix sub(C), m ≥ 0.

n (global) INTEGER. Specifies the number of columns of the distributed matrix sub(C), n ≥ 0.
alpha (global) COMPLEX for pctranu, DOUBLE COMPLEX for pztranu. Specifies the scalar alpha.

a (local) COMPLEX for pctranu, DOUBLE COMPLEX for pztranu.
Array, DIMENSION (lld_a, LOCq(ja+m-1)). This array contains the local pieces of the distributed matrix sub(A).

ia, ja (global) INTEGER. The row and column indices in the distributed matrix A indicating the first row and the first column of the submatrix sub(A), respectively.

desca (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix A.

beta (global) COMPLEX for pctranu, DOUBLE COMPLEX for pztranu. Specifies the scalar beta. When beta is equal to zero, sub(C) need not be set on input.

c (local) COMPLEX for pctranu, DOUBLE COMPLEX for pztranu.
Array, DIMENSION (lld_c, LOCq(jc+n-1)). This array contains the local pieces of the distributed matrix sub(C).

ic, jc (global) INTEGER. The row and column indices in the distributed matrix C indicating the first row and the first column of the submatrix sub(C), respectively.

descc (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix C.

Output Parameters

c Overwritten by the updated submatrix.

p?tranc

Transposes a complex distributed matrix, conjugated.

Syntax

call pctranc(m, n, alpha, a, ia, ja, desca, beta, c, ic, jc, descc)
call pztranc(m, n, alpha, a, ia, ja, desca, beta, c, ic, jc, descc)

Include Files

• C: mkl_pblas.h

Description

The p?tranc routines compute the conjugate transpose of a complex distributed matrix. The operation is defined as

sub(C) := beta*sub(C) + alpha*conjg(sub(A)'),

where:
alpha and beta are scalars,
sub(C) is an m-by-n distributed matrix, sub(C)=C(ic:ic+m-1, jc:jc+n-1),
sub(A) is a distributed matrix, sub(A)=A(ia:ia+n-1, ja:ja+m-1).

Input Parameters

m (global) INTEGER. Specifies the number of rows of the distributed matrix sub(C), m ≥ 0.

n (global) INTEGER.
Specifies the number of columns of the distributed matrix sub(C), n ≥ 0.

alpha (global) COMPLEX for pctranc, DOUBLE COMPLEX for pztranc. Specifies the scalar alpha.

a (local) COMPLEX for pctranc, DOUBLE COMPLEX for pztranc.
Array, DIMENSION (lld_a, LOCq(ja+m-1)). This array contains the local pieces of the distributed matrix sub(A).

ia, ja (global) INTEGER. The row and column indices in the distributed matrix A indicating the first row and the first column of the submatrix sub(A), respectively.

desca (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix A.

beta (global) COMPLEX for pctranc, DOUBLE COMPLEX for pztranc. Specifies the scalar beta. When beta is equal to zero, sub(C) need not be set on input.

c (local) COMPLEX for pctranc, DOUBLE COMPLEX for pztranc.
Array, DIMENSION (lld_c, LOCq(jc+n-1)). This array contains the local pieces of the distributed matrix sub(C).

ic, jc (global) INTEGER. The row and column indices in the distributed matrix C indicating the first row and the first column of the submatrix sub(C), respectively.

descc (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix C.

Output Parameters

c Overwritten by the updated submatrix.

p?trmm

Computes a scalar-matrix-matrix product (one matrix operand is triangular) for distributed matrices.

Syntax

call pstrmm(side, uplo, transa, diag, m, n, alpha, a, ia, ja, desca, b, ib, jb, descb)
call pdtrmm(side, uplo, transa, diag, m, n, alpha, a, ia, ja, desca, b, ib, jb, descb)
call pctrmm(side, uplo, transa, diag, m, n, alpha, a, ia, ja, desca, b, ib, jb, descb)
call pztrmm(side, uplo, transa, diag, m, n, alpha, a, ia, ja, desca, b, ib, jb, descb)

Include Files

• C: mkl_pblas.h

Description

The p?trmm routines perform a matrix-matrix operation using triangular matrices.
The operation is defined as

sub(B) := alpha*op(sub(A))*sub(B)

or

sub(B) := alpha*sub(B)*op(sub(A))

where:
alpha is a scalar,
sub(B) is an m-by-n distributed matrix, sub(B)=B(ib:ib+m-1, jb:jb+n-1),
sub(A) is a unit or non-unit, upper or lower triangular distributed matrix, sub(A)=A(ia:ia+m-1, ja:ja+m-1) if side = 'L' or 'l', and sub(A)=A(ia:ia+n-1, ja:ja+n-1) if side = 'R' or 'r',
op(sub(A)) is one of op(sub(A)) = sub(A), or op(sub(A)) = sub(A)', or op(sub(A)) = conjg(sub(A)').

Input Parameters

side (global) CHARACTER*1. Specifies whether op(sub(A)) appears on the left or right of sub(B) in the operation:
if side = 'L' or 'l', then sub(B) := alpha*op(sub(A))*sub(B);
if side = 'R' or 'r', then sub(B) := alpha*sub(B)*op(sub(A)).

uplo (global) CHARACTER*1. Specifies whether the distributed matrix sub(A) is upper or lower triangular:
if uplo = 'U' or 'u', then the matrix is upper triangular;
if uplo = 'L' or 'l', then the matrix is lower triangular.

transa (global) CHARACTER*1. Specifies the form of op(sub(A)) used in the matrix multiplication:
if transa = 'N' or 'n', then op(sub(A)) = sub(A);
if transa = 'T' or 't', then op(sub(A)) = sub(A)';
if transa = 'C' or 'c', then op(sub(A)) = conjg(sub(A)').

diag (global) CHARACTER*1. Specifies whether the matrix sub(A) is unit triangular:
if diag = 'U' or 'u', then the matrix is unit triangular;
if diag = 'N' or 'n', then the matrix is not unit triangular.

m (global) INTEGER. Specifies the number of rows of the distributed matrix sub(B), m ≥ 0.

n (global) INTEGER. Specifies the number of columns of the distributed matrix sub(B), n ≥ 0.

alpha (global) REAL for pstrmm, DOUBLE PRECISION for pdtrmm, COMPLEX for pctrmm, DOUBLE COMPLEX for pztrmm. Specifies the scalar alpha. When alpha is zero, the array b need not be set before entry.
a (local) REAL for pstrmm, DOUBLE PRECISION for pdtrmm, COMPLEX for pctrmm, DOUBLE COMPLEX for pztrmm.
Array, DIMENSION (lld_a, ka), where ka is at least LOCq(1, ja+m-1) when side = 'L' or 'l' and is at least LOCq(1, ja+n-1) when side = 'R' or 'r'.
Before entry with uplo = 'U' or 'u', this array contains the local entries corresponding to the entries of the upper triangular distributed matrix sub(A), and the local entries corresponding to the entries of the strictly lower triangular part of the distributed matrix sub(A) are not referenced.
Before entry with uplo = 'L' or 'l', this array contains the local entries corresponding to the entries of the lower triangular distributed matrix sub(A), and the local entries corresponding to the entries of the strictly upper triangular part of the distributed matrix sub(A) are not referenced.
When diag = 'U' or 'u', the local entries corresponding to the diagonal elements of the submatrix sub(A) are not referenced either, but are assumed to be unity.

ia, ja (global) INTEGER. The row and column indices in the distributed matrix A indicating the first row and the first column of the submatrix sub(A), respectively.

desca (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix A.

b (local) REAL for pstrmm, DOUBLE PRECISION for pdtrmm, COMPLEX for pctrmm, DOUBLE COMPLEX for pztrmm.
Array, DIMENSION (lld_b, LOCq(1, jb+n-1)). Before entry, this array contains the local pieces of the distributed matrix sub(B).

ib, jb (global) INTEGER. The row and column indices in the distributed matrix B indicating the first row and the first column of the submatrix sub(B), respectively.

descb (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix B.

Output Parameters

b Overwritten by the transformed distributed matrix.

p?trsm

Solves a distributed matrix equation (one matrix operand is triangular).
Syntax

call pstrsm(side, uplo, transa, diag, m, n, alpha, a, ia, ja, desca, b, ib, jb, descb)
call pdtrsm(side, uplo, transa, diag, m, n, alpha, a, ia, ja, desca, b, ib, jb, descb)
call pctrsm(side, uplo, transa, diag, m, n, alpha, a, ia, ja, desca, b, ib, jb, descb)
call pztrsm(side, uplo, transa, diag, m, n, alpha, a, ia, ja, desca, b, ib, jb, descb)

Include Files

• C: mkl_pblas.h

Description

The p?trsm routines solve one of the following distributed matrix equations:

op(sub(A))*X = alpha*sub(B),

or

X*op(sub(A)) = alpha*sub(B),

where:
alpha is a scalar,
X and sub(B) are m-by-n distributed matrices, sub(B)=B(ib:ib+m-1, jb:jb+n-1);
sub(A) is a unit or non-unit, upper or lower triangular distributed matrix, sub(A)=A(ia:ia+m-1, ja:ja+m-1) if side = 'L' or 'l', and sub(A)=A(ia:ia+n-1, ja:ja+n-1) if side = 'R' or 'r';
op(sub(A)) is one of op(sub(A)) = sub(A), or op(sub(A)) = sub(A)', or op(sub(A)) = conjg(sub(A)').

The distributed matrix sub(B) is overwritten by the solution matrix X.

Input Parameters

side (global) CHARACTER*1. Specifies whether op(sub(A)) appears on the left or right of X in the equation:
if side = 'L' or 'l', then op(sub(A))*X = alpha*sub(B);
if side = 'R' or 'r', then X*op(sub(A)) = alpha*sub(B).

uplo (global) CHARACTER*1. Specifies whether the distributed matrix sub(A) is upper or lower triangular:
if uplo = 'U' or 'u', then the matrix is upper triangular;
if uplo = 'L' or 'l', then the matrix is lower triangular.

transa (global) CHARACTER*1. Specifies the form of op(sub(A)) used in the matrix equation:
if transa = 'N' or 'n', then op(sub(A)) = sub(A);
if transa = 'T' or 't', then op(sub(A)) = sub(A)';
if transa = 'C' or 'c', then op(sub(A)) = conjg(sub(A)').

diag (global) CHARACTER*1. Specifies whether the matrix sub(A) is unit triangular:
if diag = 'U' or 'u', then the matrix is unit triangular;
if diag = 'N' or 'n', then the matrix is not unit triangular.

m (global) INTEGER.
Specifies the number of rows of the distributed matrix sub(B), m ≥ 0.

n (global) INTEGER. Specifies the number of columns of the distributed matrix sub(B), n ≥ 0.

alpha (global) REAL for pstrsm, DOUBLE PRECISION for pdtrsm, COMPLEX for pctrsm, DOUBLE COMPLEX for pztrsm. Specifies the scalar alpha. When alpha is zero, a is not referenced and b need not be set before entry.

a (local) REAL for pstrsm, DOUBLE PRECISION for pdtrsm, COMPLEX for pctrsm, DOUBLE COMPLEX for pztrsm.
Array, DIMENSION (lld_a, ka), where ka is at least LOCq(1, ja+m-1) when side = 'L' or 'l' and is at least LOCq(1, ja+n-1) when side = 'R' or 'r'.
Before entry with uplo = 'U' or 'u', this array contains the local entries corresponding to the entries of the upper triangular distributed matrix sub(A), and the local entries corresponding to the entries of the strictly lower triangular part of the distributed matrix sub(A) are not referenced.
Before entry with uplo = 'L' or 'l', this array contains the local entries corresponding to the entries of the lower triangular distributed matrix sub(A), and the local entries corresponding to the entries of the strictly upper triangular part of the distributed matrix sub(A) are not referenced.
When diag = 'U' or 'u', the local entries corresponding to the diagonal elements of the submatrix sub(A) are not referenced either, but are assumed to be unity.

ia, ja (global) INTEGER. The row and column indices in the distributed matrix A indicating the first row and the first column of the submatrix sub(A), respectively.

desca (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix A.

b (local) REAL for pstrsm, DOUBLE PRECISION for pdtrsm, COMPLEX for pctrsm, DOUBLE COMPLEX for pztrsm.
Array, DIMENSION (lld_b, LOCq(1, jb+n-1)). Before entry, this array contains the local pieces of the distributed matrix sub(B).

ib, jb (global) INTEGER.
The row and column indices in the distributed matrix B indicating the first row and the first column of the submatrix sub(B), respectively.

descb (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix B.

Output Parameters

b Overwritten by the solution distributed matrix X.

Partial Differential Equations Support

The Intel® Math Kernel Library (Intel® MKL) provides tools for solving Partial Differential Equations (PDE). These tools are the Trigonometric Transform interface routines (see Trigonometric Transform Routines) and the Poisson Library (see Poisson Library Routines).

The Poisson Library is designed for fast solving of simple Helmholtz, Poisson, and Laplace problems. The solver is based on the Trigonometric Transform interface, which is, in turn, based on the Intel MKL Fast Fourier Transform (FFT) interface (refer to Fourier Transform Functions), optimized for Intel® processors.

Direct use of the Trigonometric Transform routines may be helpful to those who have already implemented their own solvers similar to the one that the Poisson Library provides. Because it may be hard to modify the original code to make it work with the Poisson Library, you are encouraged to use the fast (staggered) sine/cosine transforms implemented in the Trigonometric Transform interface to improve the performance of your solver.

Both Trigonometric Transform and Poisson Library routines can be called from C and Fortran 90, although the interface descriptions use the C convention. Fortran 90 users can find the specifics of routine calls in the "Calling PDE Support Routines from Fortran 90" section.

Optimization Notice

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations.
Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

Trigonometric Transform Routines

In addition to the Fast Fourier Transform (FFT) interface, described in chapter "Fast Fourier Transforms", Intel® MKL supports the Real Discrete Trigonometric Transforms (sometimes called real-to-real Discrete Fourier Transforms) interface. In this manual, the interface is referred to as the TT interface. It implements a group of routines (TT routines) used to compute sine/cosine, staggered sine/cosine, and twice staggered sine/cosine transforms (referred to as staggered2 sine/cosine transforms, for brevity). The TT interface is flexible: you can adjust routines to your particular needs at the cost of manually tuning routine parameters, or just call routines with default parameter values. The current Intel MKL implementation of the TT interface can be used in solving partial differential equations and contains routines that are helpful for Fast Poisson and similar solvers.

To describe the Intel MKL TT interface, the C convention is used. Fortran users should refer to Calling PDE Support Routines from Fortran 90.

For the list of Trigonometric Transforms currently implemented in the Intel MKL TT interface, see Transforms Implemented.
If you are used to the FFTW interface (www.fftw.org), you can call the TT interface functions through real-to-real FFTW to Intel MKL wrappers without changing FFTW function calls in your code (refer to the "FFTW to Intel® MKL Wrappers for FFTW 3.x" section in Appendix F for details). However, you are strongly encouraged to use the native TT interface for better performance. Another reason to use the wrappers cautiously is that the TT and the real-to-real FFTW interfaces are not fully compatible, and some features of the real-to-real FFTW, such as strides and multidimensional transforms, are not available through wrappers.

Transforms Implemented

TT routines allow computing the following transforms:

Forward sine transform
Backward sine transform
Forward staggered sine transform
Backward staggered sine transform
Forward staggered2 sine transform
Backward staggered2 sine transform
Forward cosine transform
Backward cosine transform
Forward staggered cosine transform
Backward staggered cosine transform
Forward staggered2 cosine transform
Backward staggered2 cosine transform

NOTE The size of the transform n can be any integer greater than or equal to 2.

Sequence of Invoking TT Routines

Computation of a transform using the TT interface is conceptually divided into four steps, each of which is performed via a dedicated routine. Table "TT Interface Routines" lists the routines and briefly describes their purpose and use.

Most TT routines have versions operating with single-precision and double-precision data. Names of such routines begin respectively with "s" and "d". The wildcard "?" stands for either of these symbols in routine names.

TT Interface Routines

Routine                       Description
?_init_trig_transform         Initializes basic data structures of Trigonometric Transforms.
?_commit_trig_transform       Checks consistency and correctness of user-defined data and creates a data structure to be used by the Intel MKL FFT interface (1).
?_forward_trig_transform,
?_backward_trig_transform     Computes a forward/backward Trigonometric Transform of a specified type using the appropriate formula (see Transforms Implemented).
free_trig_transform           Cleans the memory used by a data structure needed for calling the FFT interface (1).

(1) TT routines call the Intel MKL FFT interface for better performance.

To find a transformed vector for a particular input vector only once, the Intel MKL TT interface routines are normally invoked in the order in which they are listed in Table "TT Interface Routines".

NOTE Though the order of invoking TT routines may be changed, it is highly recommended to follow the above order of routine calls.

The diagram in Figure "Typical Order of Invoking TT Interface Routines" indicates the typical order in which TT interface routines can be invoked in a general case (prefixes and suffixes in routine names are omitted).

Figure: Typical Order of Invoking TT Interface Routines

A general scheme of using TT routines for double-precision computations is shown below. A similar scheme holds for single-precision computations, with the only difference being the initial letter of routine names.

    ...
    d_init_trig_transform(&n, &tt_type, ipar, dpar, &ir);
    /* Change parameters in ipar if necessary. */
    /* Note that the result of the Transform will be in f!
       If you want to preserve the data stored in f,
       save them before this place in your code. */
    d_commit_trig_transform(f, &handle, ipar, dpar, &ir);
    d_forward_trig_transform(f, &handle, ipar, dpar, &ir);
    d_backward_trig_transform(f, &handle, ipar, dpar, &ir);
    free_trig_transform(&handle, ipar, &ir);
    /* here the user may clean the memory used by f, dpar, ipar */
    ...

You can find examples of Fortran 90 and C code that use TT interface routines to solve a one-dimensional Helmholtz problem in the examples\pdettf\source and examples\pdettc\source folders of your Intel MKL directory.
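Because the transform result is written into f, one way to keep the original data is to copy the vector before the commit and transform calls. The sketch below is a minimal illustration in plain C under stated assumptions: the helper name save_vector is hypothetical and is not an Intel MKL routine, and the caller supplies the vector length (n+1 elements, or n for staggered2 transforms).

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of Intel MKL): returns a heap copy of the
   first len elements of f, or NULL if f is NULL, len is 0, or the
   allocation fails. The caller releases the copy with free(). */
double *save_vector(const double *f, size_t len)
{
    double *copy;

    if (f == NULL || len == 0)
        return NULL;
    copy = malloc(len * sizeof *copy);
    if (copy != NULL)
        memcpy(copy, f, len * sizeof *copy);
    return copy;
}
```

A caller would typically invoke save_vector(f, n + 1) right before d_commit_trig_transform and free the copy once it is no longer needed.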
Interface Description

All types in this documentation are standard C types: int, float, and double. Fortran 90 users can call the routines with INTEGER, REAL, and DOUBLE PRECISION Fortran types, respectively (see examples in the examples\pdettf\source and examples\pdettc\source folders of your Intel MKL directory).

The interface description uses the built-in type int for integer values. If you employ the ILP64 interface, read this type as long long int (or INTEGER*8 for Fortran). For more information, refer to the Intel MKL User's Guide.

Routine Options

All TT routines use parameters to pass various options to one another. These parameters are the arrays ipar, dpar, and spar. Values for these parameters should be specified very carefully (see Common Parameters). You can change these values during computations to meet your needs.

WARNING To avoid failure or wrong results, you must provide correct and consistent parameters to the routines.

User Data Arrays

TT routines take arrays of user data as input. For example, user arrays are passed to the routine d_forward_trig_transform to compute a forward Trigonometric Transform. To minimize storage requirements and improve the overall run-time efficiency, Intel MKL TT routines do not make copies of user input arrays.

NOTE If you need a copy of your input data arrays, save them yourself.

TT Routines

This section gives a detailed description of the TT routines, their syntax, parameters, and the values they return. Double-precision and single-precision versions of the same routine are described together.

TT routines call the Intel MKL FFT interface (described in section "FFT Functions" in chapter "Fast Fourier Transforms"), which enhances performance of the routines.

?_init_trig_transform

Initializes basic data structures of a Trigonometric Transform.
Syntax

void d_init_trig_transform(int *n, int *tt_type, int ipar[], double dpar[], int *stat);
void s_init_trig_transform(int *n, int *tt_type, int ipar[], float spar[], int *stat);

Include Files

• FORTRAN 90: mkl_trig_transforms.f90
• C: mkl_trig_transforms.h

Input Parameters

n int*. Contains the size of the problem, which should be a positive integer greater than 1. Note that the data vector of the transform, which other TT routines will use, must have size n+1 for all but staggered2 transforms. Staggered2 transforms require a vector of size n.

tt_type int*. Contains the type of transform to compute, defined via a set of named constants. The following constants are available in the current implementation of the TT interface: MKL_SINE_TRANSFORM, MKL_STAGGERED_SINE_TRANSFORM, MKL_STAGGERED2_SINE_TRANSFORM; MKL_COSINE_TRANSFORM, MKL_STAGGERED_COSINE_TRANSFORM, MKL_STAGGERED2_COSINE_TRANSFORM.

Output Parameters

ipar int array of size 128. Contains integer data needed for Trigonometric Transform computations.

dpar double array of size 5n/2+2. Contains double-precision data needed for Trigonometric Transform computations.

spar float array of size 5n/2+2. Contains single-precision data needed for Trigonometric Transform computations.

stat int*. Contains the routine completion status, which is also written to ipar[6]. The status should be 0 to proceed to other TT routines.

Description

The ?_init_trig_transform routine initializes basic data structures for Trigonometric Transforms of appropriate precision. After a call to ?_init_trig_transform, all subsequently invoked TT routines use values of the ipar and dpar (spar) array parameters returned by ?_init_trig_transform. The routine initializes the entire array ipar. In the dpar or spar array, ?_init_trig_transform initializes elements that do not depend upon the type of transform. For a detailed description of arrays ipar, dpar and spar, refer to the Common Parameters section.
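The array sizes stated above can be collected in one place before any TT routine is called. The following is a minimal sketch, not part of Intel MKL: the enum tt_kind and the function names are hypothetical stand-ins (real code would use the MKL_*_TRANSFORM constants from mkl_trig_transforms.h), and rounding 5n/2 up for odd n is an assumption made here to stay on the safe side.

```c
#include <stddef.h>

/* Hypothetical local tags; real code would use the MKL_*_TRANSFORM constants. */
enum tt_kind { TT_PLAIN, TT_STAGGERED, TT_STAGGERED2 };

#define TT_IPAR_SIZE 128  /* ipar is an int array of size 128 */

/* Length of the data vector f: n for staggered2 transforms, n+1 otherwise. */
size_t tt_data_len(size_t n, enum tt_kind kind)
{
    return (kind == TT_STAGGERED2) ? n : n + 1;
}

/* Length of dpar or spar: 5n/2+2, with 5n/2 rounded up
   (the rounding for odd n is an assumption of this sketch). */
size_t tt_par_len(size_t n)
{
    return (5 * n + 1) / 2 + 2;
}
```

For example, with n = 8 a plain sine transform needs a data vector of 9 elements and a dpar array of 22 elements.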
You can skip calling the initialization routine in your code. For more information, see Caveat on Parameter Modifications.

Return Values

stat= 0 The routine successfully completed the task. In general, to proceed with computations, the routine should complete with this stat value.
stat= -99999 The routine failed to complete the task.

?_commit_trig_transform

Checks consistency and correctness of user's data as well as initializes certain data structures required to perform the Trigonometric Transform.

Syntax

void d_commit_trig_transform(double f[], DFTI_DESCRIPTOR_HANDLE *handle, int ipar[], double dpar[], int *stat);
void s_commit_trig_transform(float f[], DFTI_DESCRIPTOR_HANDLE *handle, int ipar[], float spar[], int *stat);

Include Files

• FORTRAN 90: mkl_trig_transforms.f90
• C: mkl_trig_transforms.h

Input Parameters

f double for d_commit_trig_transform, float for s_commit_trig_transform, array of size n for staggered2 transforms and of size n+1 for all other transforms, where n is the size of the problem. Contains the data vector to be transformed. Note that the following values should be 0.0 up to rounding errors:
• f[0] and f[n] for sine transforms
• f[n] for staggered cosine transforms
• f[0] for staggered sine transforms.
Otherwise, the routine will produce a warning, and the result of the computations for sine transforms may be wrong. These restrictions meet the requirements of the Poisson Library (described in the Poisson Library Routines section), which the TT interface is primarily designed for.

ipar int array of size 128. Contains integer data needed for Trigonometric Transform computations.

dpar double array of size 5n/2+2. Contains double-precision data needed for Trigonometric Transform computations. The routine initializes most elements of this array.

spar float array of size 5n/2+2. Contains single-precision data needed for Trigonometric Transform computations.
The routine initializes most elements of this array.
Output Parameters
handle DFTI_DESCRIPTOR_HANDLE*. The data structure used by Intel MKL FFT interface (for details, refer to section "FFT Functions" in chapter "Fast Fourier Transforms").
ipar Contains integer data needed for Trigonometric Transform computations. On output, ipar[6] is updated with the stat value.
dpar Contains double-precision data needed for Trigonometric Transform computations. On output, the entire array is initialized.
spar Contains single-precision data needed for Trigonometric Transform computations. On output, the entire array is initialized.
stat int*. Contains the routine completion status, which is also written to ipar[6].
Description
The routine ?_commit_trig_transform checks consistency and correctness of the parameters to be passed to the transform routines ?_forward_trig_transform and/or ?_backward_trig_transform. The routine also initializes the following data structures: handle, dpar in case of d_commit_trig_transform, and spar in case of s_commit_trig_transform. The ?_commit_trig_transform routine initializes only those elements of dpar or spar that depend upon the type of transform, defined in the ?_init_trig_transform routine and passed to ?_commit_trig_transform with the ipar array. The size of the problem n, which determines sizes of the array parameters, is also passed to the routine with the ipar array and defined in the previously called ?_init_trig_transform routine. For a detailed description of arrays ipar, dpar and spar, refer to the Common Parameters section. The routine performs only a basic check for correctness and consistency of the parameters. If you are going to modify parameters of TT routines, see the Caveat on Parameter Modifications section. Unlike ?_init_trig_transform, the ?_commit_trig_transform routine is mandatory, and you cannot skip calling it in your code.
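The array sizes quoted in the parameter descriptions above can be collected in a few helper functions. The following is a minimal sketch, not part of the TT interface: the enum tt_kind and the functions tt_f_len and tt_par_len are hypothetical names introduced for illustration, and the sketch assumes that 5n/2+2 is meant as the C integer expression 5*n/2+2.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the transform kinds. Real code would use the
   MKL_*_TRANSFORM named constants from mkl_trig_transforms.h instead. */
typedef enum {
    TT_SINE, TT_STAGGERED_SINE, TT_STAGGERED2_SINE,
    TT_COSINE, TT_STAGGERED_COSINE, TT_STAGGERED2_COSINE
} tt_kind;

/* Fixed length of the ipar array. */
enum { TT_IPAR_LEN = 128 };

/* Length of the data vector f: n for staggered2 transforms, n+1 otherwise. */
static size_t tt_f_len(int n, tt_kind kind)
{
    int staggered2 = (kind == TT_STAGGERED2_SINE ||
                      kind == TT_STAGGERED2_COSINE);
    return (size_t)(staggered2 ? n : n + 1);
}

/* Length of dpar or spar: 5n/2+2 (assumed here to use integer division). */
static size_t tt_par_len(int n)
{
    return (size_t)(5 * n / 2 + 2);
}
```

With helpers like these, a caller could allocate f, dpar (or spar) and ipar once, before the first TT routine is invoked, and reuse them for every transform of the same size and type.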
Return Values
stat= 11 The routine produced some warnings and made some changes in the parameters to achieve their correctness and/or consistency. You may proceed with computations by assigning ipar[6]=0 if you are sure that the parameters are correct.
stat= 10 The routine made some changes in the parameters to achieve their correctness and/or consistency. You may proceed with computations by assigning ipar[6]=0 if you are sure that the parameters are correct.
stat= 1 The routine produced some warnings. You may proceed with computations by assigning ipar[6]=0 if you are sure that the parameters are correct.
stat= 0 The routine completed the task normally.
stat= -100 The routine stopped for any of the following reasons:
• An error in the user's data was encountered.
• Data in ipar, dpar or spar parameters became incorrect and/or inconsistent as a result of modifications.
stat= -1000 The routine stopped because of an FFT interface error.
stat= -10000 The routine stopped because the initialization failed to complete or the parameter ipar[0] was altered by mistake.
NOTE Although positive values of stat usually indicate minor problems with the input data and Trigonometric Transform computations can be continued, it is highly recommended that you investigate the problem first and achieve stat=0.
?_forward_trig_transform
Computes the forward Trigonometric Transform of type specified by the parameter.
Syntax
void d_forward_trig_transform(double f[], DFTI_DESCRIPTOR_HANDLE *handle, int ipar[], double dpar[], int *stat);
void s_forward_trig_transform(float f[], DFTI_DESCRIPTOR_HANDLE *handle, int ipar[], float spar[], int *stat);
Include Files
• FORTRAN 90: mkl_trig_transforms.f90
• C: mkl_trig_transforms.h
Input Parameters
f double for d_forward_trig_transform, float for s_forward_trig_transform, array of size n for staggered2 transforms and of size n+1 for all other transforms, where n is the size of the problem.
On input, contains the data vector to be transformed. Note that the following values should be 0.0 up to rounding errors:
• f[0] and f[n] for sine transforms
• f[n] for staggered cosine transforms
• f[0] for staggered sine transforms.
Otherwise, the routine will produce a warning, and the result of the computations for sine transforms may be wrong. The above restrictions meet the requirements of the Poisson Library (described in the Poisson Library Routines section), which the TT interface is primarily designed for.
handle DFTI_DESCRIPTOR_HANDLE*. The data structure used by Intel MKL FFT interface (for details, refer to section "FFT Functions" in chapter "Fast Fourier Transforms").
ipar int array of size 128. Contains integer data needed for Trigonometric Transform computations.
dpar double array of size 5n/2+2. Contains double-precision data needed for Trigonometric Transform computations.
spar float array of size 5n/2+2. Contains single-precision data needed for Trigonometric Transform computations.
Output Parameters
f Contains the transformed vector on output.
ipar Contains integer data needed for Trigonometric Transform computations. On output, ipar[6] is updated with the stat value.
stat int*. Contains the routine completion status, which is also written to ipar[6].
Description
The routine computes the forward Trigonometric Transform of type defined in the ?_init_trig_transform routine and passed to ?_forward_trig_transform with the ipar array. The size of the problem n, which determines sizes of the array parameters, is also passed to the routine with the ipar array and defined in the previously called ?_init_trig_transform routine. The other data that facilitates the computation is created by ?_commit_trig_transform and supplied in dpar or spar. For a detailed description of arrays ipar, dpar and spar, refer to the Common Parameters section. The routine has a commit step, which calls the ?_commit_trig_transform routine.
The transform is computed according to formulas given in the Transforms Implemented section. The routine replaces the input vector f with the transformed vector.
NOTE If you need a copy of the data vector f to be transformed, make the copy before calling the ?_forward_trig_transform routine.
Return Values
stat= 0 The routine completed the task normally.
stat= -100 The routine stopped for any of the following reasons:
• An error in the user's data was encountered.
• Data in ipar, dpar or spar parameters became incorrect and/or inconsistent as a result of modifications.
stat= -1000 The routine stopped because of an FFT interface error.
stat= -10000 The routine stopped because its commit step failed to complete or the parameter ipar[0] was altered by mistake.
?_backward_trig_transform
Computes the backward Trigonometric Transform of type specified by the parameter.
Syntax
void d_backward_trig_transform(double f[], DFTI_DESCRIPTOR_HANDLE *handle, int ipar[], double dpar[], int *stat);
void s_backward_trig_transform(float f[], DFTI_DESCRIPTOR_HANDLE *handle, int ipar[], float spar[], int *stat);
Include Files
• FORTRAN 90: mkl_trig_transforms.f90
• C: mkl_trig_transforms.h
Input Parameters
f double for d_backward_trig_transform, float for s_backward_trig_transform, array of size n for staggered2 transforms and of size n+1 for all other transforms, where n is the size of the problem. On input, contains the data vector to be transformed. Note that the following values should be 0.0 up to rounding errors:
• f[0] and f[n] for sine transforms
• f[n] for staggered cosine transforms
• f[0] for staggered sine transforms.
Otherwise, the routine will produce a warning, and the result of the computations for sine transforms may be wrong. The above restrictions meet the requirements of the Poisson Library (described in the Poisson Library Routines section), which the TT interface is primarily designed for.
handle DFTI_DESCRIPTOR_HANDLE*.
The data structure used by Intel MKL FFT interface (for details, refer to section "FFT Functions" in chapter "Fast Fourier Transforms").
ipar int array of size 128. Contains integer data needed for Trigonometric Transform computations.
dpar double array of size 5n/2+2. Contains double-precision data needed for Trigonometric Transform computations.
spar float array of size 5n/2+2. Contains single-precision data needed for Trigonometric Transform computations.
Output Parameters
f Contains the transformed vector on output.
ipar Contains integer data needed for Trigonometric Transform computations. On output, ipar[6] is updated with the stat value.
stat int*. Contains the routine completion status, which is also written to ipar[6].
Description
The routine computes the backward Trigonometric Transform of type defined in the ?_init_trig_transform routine and passed to ?_backward_trig_transform with the ipar array. The size of the problem n, which determines sizes of the array parameters, is also passed to the routine with the ipar array and defined in the previously called ?_init_trig_transform routine. The other data that facilitates the computation is created by ?_commit_trig_transform and supplied in dpar or spar. For a detailed description of arrays ipar, dpar and spar, refer to the Common Parameters section. The routine has a commit step, which calls the ?_commit_trig_transform routine. The transform is computed according to formulas given in the Transforms Implemented section. The routine replaces the input vector f with the transformed vector.
NOTE If you need a copy of the data vector f to be transformed, make the copy before calling the ?_backward_trig_transform routine.
Return Values
stat= 0 The routine completed the task normally.
stat= -100 The routine stopped for any of the following reasons:
• An error in the user's data was encountered.
• Data in ipar, dpar or spar parameters became incorrect and/or inconsistent as a result of modifications.
stat= -1000 The routine stopped because of an FFT interface error.
stat= -10000 The routine stopped because its commit step failed to complete or the parameter ipar[0] was altered by mistake.
free_trig_transform
Cleans the memory allocated for the data structure used by the FFT interface.
Syntax
void free_trig_transform(DFTI_DESCRIPTOR_HANDLE *handle, int ipar[], int *stat);
Include Files
• FORTRAN 90: mkl_trig_transforms.f90
• C: mkl_trig_transforms.h
Input Parameters
ipar int array of size 128. Contains integer data needed for Trigonometric Transform computations.
handle DFTI_DESCRIPTOR_HANDLE*. The data structure used by Intel MKL FFT interface (for details, refer to section "FFT Functions" in chapter "Fast Fourier Transforms").
Output Parameters
handle The data structure used by Intel MKL FFT interface. Memory allocated for the structure is released on output.
ipar Contains integer data needed for Trigonometric Transform computations. On output, ipar[6] is updated with the stat value.
stat int*. Contains the routine completion status, which is also written to ipar[6].
Description
The free_trig_transform routine cleans the memory used by the handle structure, needed for Intel MKL FFT functions. To release the memory allocated for other parameters, include cleaning of the memory in your code.
Return Values
stat= 0 The routine completed the task normally.
stat= -1000 The routine stopped because of an FFT interface error.
stat= -99999 The routine failed to complete the task.
Common Parameters
This section provides description of array parameters that hold TT routine options: ipar, dpar and spar.
NOTE Initial values are assigned to the array parameters by the appropriate ?_init_trig_transform and ?_commit_trig_transform routines.
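The stat values listed in the Return Values sections of the TT routines above fall into three groups: zero for success, positive values for warnings, and negative values for errors. The sketch below shows one way calling code might group and describe them. The names tt_stat_class, tt_classify_stat and tt_stat_text are hypothetical helpers introduced here for illustration and are not provided by Intel MKL; the message strings paraphrase the descriptions above.

```c
/* Hypothetical grouping of TT status values; not part of the TT interface. */
typedef enum { TT_STAT_OK, TT_STAT_WARNING, TT_STAT_ERROR } tt_stat_class;

/* Zero means success, positive values are warnings, negative are errors. */
static tt_stat_class tt_classify_stat(int stat)
{
    if (stat == 0) return TT_STAT_OK;
    return (stat > 0) ? TT_STAT_WARNING : TT_STAT_ERROR;
}

/* Short description of each status value documented for the TT routines. */
static const char *tt_stat_text(int stat)
{
    switch (stat) {
    case 11:     return "warnings produced and parameters changed";
    case 10:     return "parameters changed";
    case 1:      return "warnings produced";
    case 0:      return "completed normally";
    case -100:   return "error in user data or inconsistent ipar/dpar/spar";
    case -1000:  return "FFT interface error";
    case -10000: return "init or commit step failed, or ipar[0] altered";
    case -99999: return "routine failed to complete the task";
    default:     return "status value not listed in this section";
    }
}
```

A caller would typically stop on TT_STAT_ERROR and investigate on TT_STAT_WARNING before continuing, in line with the recommendation to achieve stat=0.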
ipar int array of size 128, holds integer data needed for Trigonometric Transform computations. Its elements are described in Table "Elements of the ipar Array":
Elements of the ipar Array
Index Description
0 Contains the size of the problem to solve. The ?_init_trig_transform routine sets ipar[0]=n, and all subsequently called TT routines use ipar[0] as the size of the transform.
1 Contains error messaging options:
  • ipar[1]=-1 indicates that all error messages will be printed to the file MKL_Trig_Transforms_log.txt in the folder from which the routine is called. If the file does not exist, the routine tries to create it. If the attempt fails, the routine prints information that the file cannot be created to the standard output device.
  • ipar[1]=0 indicates that no error messages will be printed.
  • ipar[1]=1 (default) indicates that all error messages will be printed to the preconnected default output device (usually, screen).
  In case of errors, each TT routine assigns a non-zero value to stat regardless of the ipar[1] setting.
2 Contains warning messaging options:
  • ipar[2]=-1 indicates that all warning messages will be printed to the file MKL_Trig_Transforms_log.txt in the directory from which the routine is called. If the file does not exist, the routine tries to create it. If the attempt fails, the routine prints information that the file cannot be created to the standard output device.
  • ipar[2]=0 indicates that no warning messages will be printed.
  • ipar[2]=1 (default) indicates that all warning messages will be printed to the preconnected default output device (usually, screen).
  In case of warnings, the stat parameter will acquire a non-zero value regardless of the ipar[2] setting.
3 through 4 Reserved for future use.
5 Contains the type of the transform.
The ?_init_trig_transform routine sets ipar[5]=tt_type, and all subsequently called TT routines use ipar[5] as the type of the transform.
6 Contains the stat value returned by the last completed TT routine. Used to check that the previous call to a TT routine completed with stat=0.
7 Informs the ?_commit_trig_transform routines whether to initialize data structures dpar (spar) and handle. ipar[7]=0 indicates that the routine should skip the initialization and only check correctness and consistency of the parameters. Otherwise, the routine initializes the data structures. The default value is 1. Checking the correctness and consistency of input data without initializing the data structures dpar, spar and handle avoids performance losses when the same transform is used repeatedly for different data vectors. Note that you can benefit from ipar[7] only if you are sure that you have supplied a proper tolerance value in the dpar or spar array. Otherwise, avoid tuning this parameter.
8 Contains message style options for TT routines. If ipar[8]=0 then TT routines print all error and warning messages in Fortran-style notations. Otherwise, TT routines print the messages in C-style notations. The default value is 1. When selecting between these notations, keep in mind that by default, numbering of elements in C arrays starts from 0 and in Fortran, it starts from 1. For example, for a C-style message "parameter ipar[0]=3 should be an even integer", the corresponding Fortran-style message will be "parameter ipar(1)=3 should be an even integer". The use of ipar[8] enables you to view messages in a more convenient style.
9 Specifies the number of OpenMP threads to run TT routines in the OpenMP environment of the Poisson Library. The default value is 1. It is highly recommended that you do not alter this value. See also Caveat on Parameter Modifications.
10 Specifies the mode of compatibility with FFTW. The default value is 0.
Set the value to 1 to invoke compatibility with FFTW. In the latter case, results will not be normalized, because FFTW does not do this. It is highly recommended not to alter this value, but rather use real-to-real FFTW to MKL wrappers, described in the "FFTW to Intel® MKL Wrappers for FFTW 3.x" section in Appendix F. See also Caveat on Parameter Modifications.
11 through 127 Reserved for future use.
NOTE You may declare the ipar array in your code as int ipar[11]. However, for compatibility with later versions of Intel MKL TT interface, which may require more ipar values, it is highly recommended to declare ipar as int ipar[128].
Arrays dpar and spar are the same except in the data precision:
dpar double array of size 5n/2+2, holds data needed for double-precision routines to perform TT computations. This array is initialized in the d_init_trig_transform and d_commit_trig_transform routines.
spar float array of size 5n/2+2, holds data needed for single-precision routines to perform TT computations. This array is initialized in the s_init_trig_transform and s_commit_trig_transform routines.
As dpar and spar have similar elements in respective positions, the elements are described together in Table "Elements of the dpar and spar Arrays":
Elements of the dpar and spar Arrays
Index Description
0 Contains the first absolute tolerance used by the appropriate ?_commit_trig_transform routine. For a staggered cosine or a sine transform, f[n] should be equal to 0.0 and for a staggered sine or a sine transform, f[0] should be equal to 0.0. The ?_commit_trig_transform routine checks whether absolute values of these parameters are below dpar[0]*n or spar[0]*n, depending on the routine precision. To suppress warnings resulting from tolerance checks, set dpar[0] or spar[0] to a sufficiently large number.
1 Reserved for future use.
2 through 5n/2+1 Contain tabulated values of trigonometric functions.
Contents of the elements depend upon the type of transform tt_type, set up in the ?_commit_trig_transform routine:
  • If tt_type=MKL_SINE_TRANSFORM, the transform uses only the first n/2 array elements, which contain tabulated sine values.
  • If tt_type=MKL_STAGGERED_SINE_TRANSFORM, the transform uses only the first 3n/2 array elements, which contain tabulated sine and cosine values.
  • If tt_type=MKL_STAGGERED2_SINE_TRANSFORM, the transform uses all the 5n/2 array elements, which contain tabulated sine and cosine values.
  • If tt_type=MKL_COSINE_TRANSFORM, the transform uses only the first n array elements, which contain tabulated cosine values.
  • If tt_type=MKL_STAGGERED_COSINE_TRANSFORM, the transform uses only the first 3n/2 elements, which contain tabulated sine and cosine values.
  • If tt_type=MKL_STAGGERED2_COSINE_TRANSFORM, the transform uses all the 5n/2 elements, which contain tabulated sine and cosine values.
NOTE To save memory, you can define the array size depending upon the type of transform.
Caveat on Parameter Modifications
Flexibility of the TT interface enables you to skip calling the ?_init_trig_transform routine and to initialize the basic data structures explicitly in your code. You may also need to modify the contents of ipar, dpar and spar arrays after initialization. When doing so, provide correct and consistent data in the arrays. Mistakenly altered arrays cause errors or wrong computation. You can perform a basic check for correctness and consistency of parameters by calling the ?_commit_trig_transform routine; however, this does not ensure the correct result of a transform but only reduces the chance of errors or wrong results.
NOTE To supply correct and consistent parameters to TT routines, you should have considerable experience in using the TT interface and good understanding of elements that the ipar, spar and dpar arrays contain and dependencies between values of these elements.
However, in rare cases, even advanced users might fail to compute a transform using TT routines after the parameter modifications. In such cases, contact technical support at http://www.intel.com/software/products/support/ .
WARNING The only way that ensures proper computation of the Trigonometric Transforms is to follow a typical sequence of invoking the routines and not change the default set of parameters. So, avoid modifications of ipar, dpar and spar arrays unless a strong need arises.
Implementation Details
Several aspects of the Intel MKL TT interface are platform-specific and language-specific. To promote portability across platforms and ease of use across different languages, users are provided with the TT language-specific header files to include in their code. The following header files are currently available:
• mkl_trig_transforms.h, to be used together with mkl_dfti.h, for C programs.
• mkl_trig_transforms.f90, to be used together with mkl_dfti.f90, for Fortran 90 programs.
NOTE Use of the Intel MKL TT software without including one of the above header files is not supported.
C-specific Header File
The C-specific header file defines the following function prototypes:
  void d_init_trig_transform(int *, int *, int *, double *, int *);
  void d_commit_trig_transform(double *, DFTI_DESCRIPTOR_HANDLE *, int *, double *, int *);
  void d_forward_trig_transform(double *, DFTI_DESCRIPTOR_HANDLE *, int *, double *, int *);
  void d_backward_trig_transform(double *, DFTI_DESCRIPTOR_HANDLE *, int *, double *, int *);
  void s_init_trig_transform(int *, int *, int *, float *, int *);
  void s_commit_trig_transform(float *, DFTI_DESCRIPTOR_HANDLE *, int *, float *, int *);
  void s_forward_trig_transform(float *, DFTI_DESCRIPTOR_HANDLE *, int *, float *, int *);
  void s_backward_trig_transform(float *, DFTI_DESCRIPTOR_HANDLE *, int *, float *, int *);
  void free_trig_transform(DFTI_DESCRIPTOR_HANDLE *, int *, int *);
Fortran-Specific Header File
The Fortran 90-specific header file defines the following function prototypes:
  SUBROUTINE D_INIT_TRIG_TRANSFORM(n, tt_type, ipar, dpar, stat)
    INTEGER, INTENT(IN) :: n, tt_type
    INTEGER, INTENT(INOUT) :: ipar(*)
    REAL(8), INTENT(INOUT) :: dpar(*)
    INTEGER, INTENT(OUT) :: stat
  END SUBROUTINE D_INIT_TRIG_TRANSFORM
  SUBROUTINE D_COMMIT_TRIG_TRANSFORM(f, handle, ipar, dpar, stat)
    REAL(8), INTENT(INOUT) :: f(*)
    TYPE(DFTI_DESCRIPTOR), POINTER :: handle
    INTEGER, INTENT(INOUT) :: ipar(*)
    REAL(8), INTENT(INOUT) :: dpar(*)
    INTEGER, INTENT(OUT) :: stat
  END SUBROUTINE D_COMMIT_TRIG_TRANSFORM
  SUBROUTINE D_FORWARD_TRIG_TRANSFORM(f, handle, ipar, dpar, stat)
    REAL(8), INTENT(INOUT) :: f(*)
    TYPE(DFTI_DESCRIPTOR), POINTER :: handle
    INTEGER, INTENT(INOUT) :: ipar(*)
    REAL(8), INTENT(INOUT) :: dpar(*)
    INTEGER, INTENT(OUT) :: stat
  END SUBROUTINE D_FORWARD_TRIG_TRANSFORM
  SUBROUTINE D_BACKWARD_TRIG_TRANSFORM(f, handle, ipar, dpar, stat)
    REAL(8), INTENT(INOUT) :: f(*)
    TYPE(DFTI_DESCRIPTOR), POINTER :: handle
    INTEGER, INTENT(INOUT) :: ipar(*)
    REAL(8), INTENT(INOUT) :: dpar(*)
    INTEGER, INTENT(OUT) :: stat
  END SUBROUTINE D_BACKWARD_TRIG_TRANSFORM
  SUBROUTINE S_INIT_TRIG_TRANSFORM(n, tt_type, ipar, spar, stat)
    INTEGER, INTENT(IN) :: n, tt_type
    INTEGER, INTENT(INOUT) :: ipar(*)
    REAL(4), INTENT(INOUT) :: spar(*)
    INTEGER, INTENT(OUT) :: stat
  END SUBROUTINE S_INIT_TRIG_TRANSFORM
  SUBROUTINE S_COMMIT_TRIG_TRANSFORM(f, handle, ipar, spar, stat)
    REAL(4), INTENT(INOUT) :: f(*)
    TYPE(DFTI_DESCRIPTOR), POINTER :: handle
    INTEGER, INTENT(INOUT) :: ipar(*)
Fortran 90 specifics of the TT routines usage are similar for all Intel MKL PDE support tools and described in the Calling PDE Support Routines from Fortran 90 section.
Poisson Library Routines
In addition to Real Discrete Trigonometric Transforms (TT) interface (refer to Trigonometric Transform Routines), Intel® MKL supports the Poisson Library interface, referred to as PL interface. The interface implements a group of routines (PL routines) used to compute a solution of Laplace, Poisson, and Helmholtz problems of special kind using discrete Fourier transforms. Laplace and Poisson problems are special cases of a more general Helmholtz problem. The problems being solved are defined more exactly in the Poisson Library Implemented subsection.
The PL interface provides much flexibility of use: you can adjust routines to your particular needs at the cost of manual tuning routine parameters or just call routines with default parameter values. The interface can adjust style of error and warning messages to C or Fortran notations by setting up a dedicated parameter. This adds convenience to debugging, because users can read information in the way that is natural for their code. The Intel MKL PL interface currently contains only routines that implement the following solvers:
• Fast Laplace, Poisson and Helmholtz solvers in a Cartesian coordinate system
• Fast Poisson and Helmholtz solvers in a spherical coordinate system.
To describe the Intel MKL PL interface, the C convention is used.
Fortran usage specifics can be found in the Calling PDE Support Routines from Fortran 90 section.
NOTE Fortran users should mind that respective array indices in Fortran increase by 1.
Poisson Library Implemented
PL routines enable approximate solving of certain two-dimensional and three-dimensional problems. Figure "Structure of the Poisson Library" shows the general structure of the Poisson Library.
Structure of the Poisson Library
Sections below provide details of the problems that can be solved using Intel MKL PL.
Two-Dimensional Problems
Notational Conventions
The PL interface description uses the following notation for boundaries of a rectangular domain ax < x < bx, ay < y < by on a Cartesian plane:
bd_ax = {x = ax, ay ≤ y ≤ by}, bd_bx = {x = bx, ay ≤ y ≤ by}
bd_ay = {ax ≤ x ≤ bx, y = ay}, bd_by = {ax ≤ x ≤ bx, y = by}.
The wildcard "+" may stand for any of the symbols ax, bx, ay, by, so that bd_+ denotes any of the above boundaries.
The PL interface description uses the following notation for boundaries of a rectangular domain aφ < φ < bφ, aθ < θ < bθ on a sphere 0 ≤ φ ≤ 2π, 0 ≤ θ ≤ π:
bd_aφ = {φ = aφ, aθ ≤ θ ≤ bθ}, bd_bφ = {φ = bφ, aθ ≤ θ ≤ bθ}
bd_aθ = {aφ ≤ φ ≤ bφ, θ = aθ}, bd_bθ = {aφ ≤ φ ≤ bφ, θ = bθ}.
The wildcard "~" may stand for any of the symbols aφ, bφ, aθ, bθ, so that bd_~ denotes any of the above boundaries.
Two-dimensional (2D) Helmholtz problem on a Cartesian plane
The 2D Helmholtz problem is to find an approximate solution of the Helmholtz equation
-∂²u/∂x² - ∂²u/∂y² + qu = f(x, y), q = const ≥ 0
in a rectangle, that is, a rectangular domain ax < x < bx, ay < y < by, with one of the following boundary conditions on each boundary bd_+:
• The Dirichlet boundary condition u(x, y) = G(x, y)
• The Neumann boundary condition ∂u/∂n (x, y) = g(x, y)
where n = -x on bd_ax, n = x on bd_bx, n = -y on bd_ay, n = y on bd_by.
Two-dimensional (2D) Poisson problem on a Cartesian plane
The Poisson problem is a special case of the Helmholtz problem, when q=0.
The 2D Poisson problem is to find an approximate solution of the Poisson equation
-∂²u/∂x² - ∂²u/∂y² = f(x, y)
in a rectangle ax < x < bx, ay < y < by with the Dirichlet or Neumann boundary condition on each boundary bd_+. In case of a problem with the Neumann boundary condition on the entire boundary, you can find the solution of the problem only up to a constant. In this case, the Poisson Library will compute the solution that provides the minimal Euclidean norm of a residual.
Two-dimensional (2D) Laplace problem on a Cartesian plane
The Laplace problem is a special case of the Helmholtz problem, when q=0 and f(x, y)=0. The 2D Laplace problem is to find an approximate solution of the Laplace equation
∂²u/∂x² + ∂²u/∂y² = 0
in a rectangle ax < x < bx, ay < y < by with the Dirichlet or Neumann boundary condition on each boundary bd_+.
Helmholtz problem on a sphere
The Helmholtz problem on a sphere is to find an approximate solution of the Helmholtz equation
-Δs u + qu = f, q = const ≥ 0, where Δs = (1/sin²θ) ∂²/∂φ² + (1/sin θ) ∂/∂θ (sin θ ∂/∂θ),
in a spherical rectangle, that is, a domain bounded by angles aφ ≤ φ ≤ bφ, aθ ≤ θ ≤ bθ, with boundary conditions for particular domains listed in Table "Details of Helmholtz Problem on a Sphere".
Details of Helmholtz Problem on a Sphere
Domain on a sphere | Boundary condition | Periodic/nonperiodic case
Rectangular, that is, bφ - aφ < 2π and bθ - aθ < π | Homogeneous Dirichlet boundary conditions on each boundary bd_~ | non-periodic
Where aφ = 0, bφ = 2π, and bθ - aθ < π | Homogeneous Dirichlet boundary conditions on the boundaries bd_aθ and bd_bθ | periodic
Entire sphere, that is, aφ = 0, bφ = 2π, aθ = 0, and bθ = π | Boundary condition at the poles. | periodic
Poisson problem on a sphere
The Poisson problem is a special case of the Helmholtz problem, when q=0. The Poisson problem on a sphere is to find an approximate solution of the Poisson equation
-Δs u = f
in a spherical rectangle aφ ≤ φ ≤ bφ, aθ ≤ θ ≤ bθ in cases listed in Table "Details of Helmholtz Problem on a Sphere".
The solution to the Poisson problem on the entire sphere can be found up to a constant only. In this case, Poisson Library will compute the solution that provides the minimal Euclidean norm of a residual.
Approximation of 2D problems
To find an approximate solution for any of the 2D problems, a uniform mesh is built in the rectangular domain:
xi = ax + i·hx, yj = ay + j·hy, i = 0, ..., nx, j = 0, ..., ny, hx = (bx - ax)/nx, hy = (by - ay)/ny
in the Cartesian case and
φi = aφ + i·hφ, θj = aθ + j·hθ, i = 0, ..., nφ, j = 0, ..., nθ, hφ = (bφ - aφ)/nφ, hθ = (bθ - aθ)/nθ
in the spherical case. Poisson Library uses the standard five-point finite difference approximation on this mesh to compute the approximation to the solution:
• In the Cartesian case, the values of the approximate solution will be computed in the mesh points (xi, yj) provided that the user knows the values of the right-hand side f(x, y) in these points and the values of the appropriate boundary functions G(x, y) and/or g(x, y) in the mesh points lying on the boundary of the rectangular domain.
• In the spherical case, the values of the approximate solution will be computed in the mesh points (φi, θj) provided that the user knows the values of the right-hand side f(φ, θ) in these points.
NOTE The number of mesh intervals nφ in the φ direction of a spherical mesh must be even in the periodic case. The current implementation of the Poisson Library does not support meshes with the number of intervals that does not meet this condition.
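To make the five-point approximation concrete, the sketch below applies the discrete Cartesian operator at one interior mesh point. It assumes the Helmholtz equation has the form -∂²u/∂x² - ∂²u/∂y² + qu = f(x, y), which is consistent with the special cases q=0 and f=0 described above. The functions fill_quadratic and helmholtz5 and the storage layout u[j*(nx+1)+i] are hypothetical choices made for this illustration; they are not PL routines and do not describe how the library stores or solves the problem.

```c
#include <stddef.h>

/* Fill u with the test function u(x,y) = x*x + y*y on the uniform mesh
   x_i = ax + i*hx, y_j = ay + j*hy, i = 0..nx, j = 0..ny.
   u must hold (nx+1)*(ny+1) values, stored as u[j*(nx+1) + i]. */
static void fill_quadratic(double *u, int nx, int ny,
                           double ax, double bx, double ay, double by)
{
    double hx = (bx - ax) / nx, hy = (by - ay) / ny;
    for (int j = 0; j <= ny; j++) {
        for (int i = 0; i <= nx; i++) {
            double x = ax + i * hx, y = ay + j * hy;
            u[(size_t)j * (nx + 1) + i] = x * x + y * y;
        }
    }
}

/* Five-point approximation of -u_xx - u_yy + q*u at interior point (i,j),
   where 0 < i < nx and 0 < j < ny. */
static double helmholtz5(const double *u, int nx, int i, int j,
                         double hx, double hy, double q)
{
    size_t ld = (size_t)nx + 1;               /* row length of the mesh */
    double c = u[j * ld + i];                 /* centre value           */
    double uxx = (u[j * ld + i - 1] - 2.0 * c + u[j * ld + i + 1]) / (hx * hx);
    double uyy = (u[(j - 1) * ld + i] - 2.0 * c + u[(j + 1) * ld + i]) / (hy * hy);
    return -uxx - uyy + q * c;
}
```

For the quadratic test function the second differences are exact, so the value returned at a mesh point equals -4 + q(x² + y²) up to rounding. This gives a quick way to validate a right-hand side before handing it to a solver.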
Three-Dimensional Problems
Notational Conventions
The PL interface description uses the following notation for boundaries of a parallelepiped domain ax < x < bx, ay < y
<character><name>_<action>( )
where
• <character> indicates the data type:
  s real, single precision
  d real, double precision
• <name> indicates the task type:
  trnlsp nonlinear least squares problem without constraints
  trnlspbc nonlinear least squares problem with boundary constraints
  jacobi computation of the Jacobian matrix using central differences
• <action> indicates an action on the task:
  init initializes the solver
  check checks correctness of the input parameters
  solve solves the problem
  get retrieves the number of iterations, the stop criterion, the initial residual, and the final residual
  delete releases the allocated data
Nonlinear Least Squares Problem without Constraints
The nonlinear least squares problem without constraints can be described as follows:
min ||F(x)||₂² = min ||y - f(x)||₂², x ∈ Rⁿ, y ∈ Rᵐ,
where F(x): Rⁿ → Rᵐ is a twice differentiable function in Rⁿ.
Solving a nonlinear least squares problem means searching for the best approximation to the vector y with the model function fi(x) and nonlinear variables x. The best approximation means that the sum of squares of residuals yi - fi(x) is the minimum.
See usage examples in FORTRAN and C in the examples\solver\source folder of your Intel MKL directory (ex_nlsqp_f.f and ex_nlsqp_c.c, respectively).
RCI TR Routines
Routine Name Operation
?trnlsp_init Initializes the solver.
?trnlsp_check Checks correctness of the input parameters.
?trnlsp_solve Solves a nonlinear least squares problem using the Trust-Region algorithm.
?trnlsp_get Retrieves the number of iterations, stop criterion, initial residual, and final residual.
?trnlsp_delete Releases allocated data.
?trnlsp_init
Initializes the solver of a nonlinear least squares problem.
Syntax
Fortran:
res = strnlsp_init(handle, n, m, x, eps, iter1, iter2, rs)
res = dtrnlsp_init(handle, n, m, x, eps, iter1, iter2, rs)
C:
res = strnlsp_init(&handle, &n, &m, x, eps, &iter1, &iter2, &rs);
res = dtrnlsp_init(&handle, &n, &m, x, eps, &iter1, &iter2, &rs);
Include Files
• Fortran: mkl_rci.fi
• C: mkl_rci.h
Description
The ?trnlsp_init routine initializes the solver. After initialization, all subsequent invocations of the ?trnlsp_solve routine should use the values of the handle returned by ?trnlsp_init. The eps array contains the stopping criteria:
eps Value Description
1 Δ < eps(1)
2 ||F(x)||₂ < eps(2)
3 The Jacobian matrix is singular. ||J(x)(1:m,j)||₂ < eps(3), j = 1, ..., n
4 ||s||₂ < eps(4)
5 ||F(x)||₂ - ||F(x) - J(x)s||₂ < eps(5)
6 The trial step precision. If eps(6) = 0, then the trial step meets the required precision (≤ 1.0D-10).
Note:
• J(x) is the Jacobian matrix.
• Δ is the trust-region area.
• F(x) is the value of the functional.
• s is the trial step.
Input Parameters
n INTEGER. Length of x.
m INTEGER. Length of F(x).
x REAL for strnlsp_init, DOUBLE PRECISION for dtrnlsp_init. Array of size n. Initial guess.
eps REAL for strnlsp_init, DOUBLE PRECISION for dtrnlsp_init. Array of size 6; contains stopping criteria. See the values in the Description section.
iter1 INTEGER. Specifies the maximum number of iterations.
iter2 INTEGER. Specifies the maximum number of iterations of trial step calculation.
rs REAL for strnlsp_init, DOUBLE PRECISION for dtrnlsp_init. Definition of initial size of the trust region (boundary of the trial step). The minimum value is 0.1, and the maximum value is 100.0. Based on your knowledge of the objective function and initial guess you can increase or decrease the initial trust region. It can influence the iteration process, for example, the direction of the iteration process and the number of iterations. The default value is 100.0.
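The meaning of two of the stopping criteria and of the rs bounds can be illustrated with a few lines of C. This is a sketch for reading the eps table, not a description of the solver's internals: norm2_below and rs_in_range are hypothetical helper names, and in C the criteria eps(2) and eps(4) correspond to the zero-based elements eps[1] and eps[3].

```c
/* Test whether the Euclidean norm of v is below eps, as in the criteria
   ||F(x)||2 < eps(2) and ||s||2 < eps(4). The comparison uses squared
   values to avoid a square root and assumes eps > 0. */
static int norm2_below(const double *v, int len, double eps)
{
    double sum = 0.0;
    for (int k = 0; k < len; k++)
        sum += v[k] * v[k];
    return eps > 0.0 && sum < eps * eps;
}

/* Test whether an initial trust-region size lies in the documented
   range from 0.1 to 100.0. */
static int rs_in_range(double rs)
{
    return rs >= 0.1 && rs <= 100.0;
}
```

For example, a residual vector F = (3.0e-4, 4.0e-4) has Euclidean norm 5.0e-4, so it satisfies the second criterion for eps[1] = 1.0e-3 but not for eps[1] = 1.0e-4.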
Output Parameters

handle    Type _TRNSP_HANDLE_t in C/C++ and INTEGER*8 in FORTRAN.
res       INTEGER. Indicates task completion status.
          • res = TR_SUCCESS - the routine completed the task normally.
          • res = TR_INVALID_OPTION - there was an error in the input parameters.
          • res = TR_OUT_OF_MEMORY - there was a memory error.
          TR_SUCCESS, TR_INVALID_OPTION, and TR_OUT_OF_MEMORY are defined in mkl_rci.fi (Fortran) and mkl_rci.h (C) include files.

See Also
?trnlsp_solve

?trnlsp_check

Checks the correctness of handle and arrays containing Jacobian matrix, objective function, and stopping criteria.

Syntax

Fortran:
res = strnlsp_check(handle, n, m, fjac, fvec, eps, info)
res = dtrnlsp_check(handle, n, m, fjac, fvec, eps, info)

C:
res = strnlsp_check(&handle, &n, &m, fjac, fvec, eps, info);
res = dtrnlsp_check(&handle, &n, &m, fjac, fvec, eps, info);

Include Files
• Fortran: mkl_rci.fi
• C: mkl_rci.h

Description

The ?trnlsp_check routine checks the arrays passed into the solver as input parameters. If an array contains any INF or NaN values, the routine sets the flag in output array info (see the description of the values returned in the Output Parameters section for the info array).

Input Parameters

handle    Type _TRNSP_HANDLE_t in C/C++ and INTEGER*8 in FORTRAN.
n         INTEGER. Length of x.
m         INTEGER. Length of F(x).
fjac      REAL for strnlsp_check, DOUBLE PRECISION for dtrnlsp_check. Array of size m by n. Contains the Jacobian matrix of the function.
fvec      REAL for strnlsp_check, DOUBLE PRECISION for dtrnlsp_check. Array of size m. Contains the function values at x, where $fvec(i) = (y_i - f_i(x))$.
eps       REAL for strnlsp_check, DOUBLE PRECISION for dtrnlsp_check. Array of size 6; contains stopping criteria. See the values in the Description section of ?trnlsp_init.

Output Parameters

info      INTEGER. Array of size 6.
Results of input parameter checking:

info(0) in C, info(1) in Fortran: flags for handle
    0    The handle is valid.
    1    The handle is not allocated.
info(1) in C, info(2) in Fortran: flags for fjac
    0    The fjac array is valid.
    1    The fjac array is not allocated.
    2    The fjac array contains NaN.
    3    The fjac array contains Inf.
info(2) in C, info(3) in Fortran: flags for fvec
    0    The fvec array is valid.
    1    The fvec array is not allocated.
    2    The fvec array contains NaN.
    3    The fvec array contains Inf.
info(3) in C, info(4) in Fortran: flags for eps
    0    The eps array is valid.
    1    The eps array is not allocated.
    2    The eps array contains NaN.
    3    The eps array contains Inf.
    4    The eps array contains a value less than or equal to zero.

res       INTEGER. Information about completion of the task. res = TR_SUCCESS - the routine completed the task normally. TR_SUCCESS is defined in the mkl_rci.h and mkl_rci.fi include files.

?trnlsp_solve

Solves a nonlinear least squares problem using the TR algorithm.

Syntax

Fortran:
res = strnlsp_solve(handle, fvec, fjac, RCI_Request)
res = dtrnlsp_solve(handle, fvec, fjac, RCI_Request)

C:
res = strnlsp_solve(&handle, fvec, fjac, &RCI_Request);
res = dtrnlsp_solve(&handle, fvec, fjac, &RCI_Request);

Include Files
• Fortran: mkl_rci.fi
• C: mkl_rci.h

Description

The ?trnlsp_solve routine uses the TR algorithm to solve nonlinear least squares problems. The problem is stated as follows:

$$\min_{x \in R^n} \|F(x)\|_2^2 = \min_{x \in R^n} \|y - f(x)\|_2^2, \quad y \in R^m,\ x \in R^n,\ f: R^n \to R^m,\ m \ge n,$$

where
• $F(x): R^n \to R^m$
• $m \ge n$

From a current point $x_{current}$, the algorithm uses the trust-region approach:

$$\min_{x \in R^n} \|F(x_{current}) + J(x_{current})(x_{new} - x_{current})\|_2^2 \quad \text{subject to} \quad \|x_{new} - x_{current}\| \le \Delta_{current}$$

to get $x_{new} = x_{current} + s$ that satisfies

$$\min_{x \in R^n} \|J^T(x)J(x)s + J^T F(x)\|_2^2$$

where
• J(x) is the Jacobian matrix
• s is the trial step
• $\|s\|_2 \le \Delta_{current}$

The RCI_Request parameter provides additional information:

RCI_Request Value    Description
2                    Request to calculate the Jacobian matrix and put the result into fjac
1                    Request to recalculate the function at vector x and put the result into fvec
0                    One successful iteration step on the current trust-region radius (that does not mean that the value of x has changed)
-1                   The algorithm has exceeded the maximum number of iterations
-2                   $\Delta < eps(1)$
-3                   $\|F(x)\|_2 < eps(2)$
-4                   The Jacobian matrix is singular. $\|J(x)(1:m,j)\|_2 < eps(3)$, j = 1, ..., n
-5                   $\|s\|_2 < eps(4)$
-6                   $\|F(x)\|_2 - \|F(x) - J(x)s\|_2 < eps(5)$

Note:
• J(x) is the Jacobian matrix.
• $\Delta$ is the trust-region area.
• F(x) is the value of the functional.
• s is the trial step.

Input Parameters

handle    Type _TRNSP_HANDLE_t in C/C++ and INTEGER*8 in FORTRAN.
fvec      REAL for strnlsp_solve, DOUBLE PRECISION for dtrnlsp_solve. Array of size m. Contains the function values at x, where $fvec(i) = (y_i - f_i(x))$.
fjac      REAL for strnlsp_solve, DOUBLE PRECISION for dtrnlsp_solve. Array of size (m,n). Contains the Jacobian matrix of the function.

Output Parameters

fvec           REAL for strnlsp_solve, DOUBLE PRECISION for dtrnlsp_solve. Array of size m. Updated function evaluated at x.
RCI_Request    INTEGER. Informs about the task stage. See the Description section for the parameter values and their meaning.
res            INTEGER. Indicates the task completion. res = TR_SUCCESS - the routine completed the task normally. TR_SUCCESS is defined in the mkl_rci.h and mkl_rci.fi include files.

?trnlsp_get

Retrieves the number of iterations, stop criterion, initial residual, and final residual.
Syntax

Fortran:
res = strnlsp_get(handle, iter, st_cr, r1, r2)
res = dtrnlsp_get(handle, iter, st_cr, r1, r2)

C:
res = strnlsp_get(&handle, &iter, &st_cr, &r1, &r2);
res = dtrnlsp_get(&handle, &iter, &st_cr, &r1, &r2);

Include Files
• Fortran: mkl_rci.fi
• C: mkl_rci.h

Description

The routine retrieves the current number of iterations, the stop criterion, the initial residual, and the final residual.

The initial residual is the value of the functional ($\|y - f(x)\|$) of the initial x values provided by the user.

The final residual is the value of the functional ($\|y - f(x)\|$) of the final x resulting from the algorithm operation.

The st_cr parameter contains the stop criterion:

st_cr Value    Description
1              The algorithm has exceeded the maximum number of iterations
2              $\Delta < eps(1)$
3              $\|F(x)\|_2 < eps(2)$
4              The Jacobian matrix is singular. $\|J(x)(1:m,j)\|_2 < eps(3)$, j = 1, ..., n
5              $\|s\|_2 < eps(4)$
6              $\|F(x)\|_2 - \|F(x) - J(x)s\|_2 < eps(5)$

Note:
• J(x) is the Jacobian matrix.
• $\Delta$ is the trust-region area.
• F(x) is the value of the functional.
• s is the trial step.

Input Parameters

handle    Type _TRNSP_HANDLE_t in C/C++ and INTEGER*8 in FORTRAN.

Output Parameters

iter      INTEGER. Contains the current number of iterations.
st_cr     INTEGER. Contains the stop criterion. See the Description section for the parameter values and their meanings.
r1        REAL for strnlsp_get, DOUBLE PRECISION for dtrnlsp_get. Contains the residual ($\|y - f(x)\|$) given the initial x.
r2        REAL for strnlsp_get, DOUBLE PRECISION for dtrnlsp_get. Contains the final residual, that is, the value of the functional ($\|y - f(x)\|$) of the final x resulting from the algorithm operation.
res       INTEGER. Indicates the task completion. res = TR_SUCCESS - the routine completed the task normally. TR_SUCCESS is defined in the mkl_rci.h and mkl_rci.fi include files.

?trnlsp_delete

Releases allocated data.
Syntax

Fortran:
res = strnlsp_delete(handle)
res = dtrnlsp_delete(handle)

C:
res = strnlsp_delete(&handle);
res = dtrnlsp_delete(&handle);

Include Files
• Fortran: mkl_rci.fi
• C: mkl_rci.h

Description

The ?trnlsp_delete routine releases all memory allocated for the handle.

This routine flags memory as not used, but to actually release all memory you must call the support function mkl_free_buffers.

Input Parameters

handle    Type _TRNSP_HANDLE_t in C/C++ and INTEGER*8 in FORTRAN.

Output Parameters

res       INTEGER. Indicates the task completion. res = TR_SUCCESS means the routine completed the task normally. TR_SUCCESS is defined in the mkl_rci.h and mkl_rci.fi include files.

Nonlinear Least Squares Problem with Linear (Bound) Constraints

The nonlinear least squares problem with linear bound constraints is very similar to the nonlinear least squares problem without constraints but it has the following constraints:

$$l_i \le x_i \le u_i, \quad i = 1, \dots, n.$$

See usage examples in FORTRAN and C in the examples\solver\source folder of your Intel MKL directory (ex_nlsqp_bc_f.f and ex_nlsqp_bc_c.c, respectively).

RCI TR Routines for Problem with Bound Constraints

Routine Name        Operation
?trnlspbc_init      Initializes the solver.
?trnlspbc_check     Checks correctness of the input parameters.
?trnlspbc_solve     Solves a nonlinear least squares problem using RCI and the Trust-Region algorithm.
?trnlspbc_get       Retrieves the number of iterations, stop criterion, initial residual, and final residual.
?trnlspbc_delete    Releases allocated data.

?trnlspbc_init

Initializes the solver of nonlinear least squares problem with linear (boundary) constraints.
Syntax

Fortran:
res = strnlspbc_init(handle, n, m, x, LW, UP, eps, iter1, iter2, rs)
res = dtrnlspbc_init(handle, n, m, x, LW, UP, eps, iter1, iter2, rs)

C:
res = strnlspbc_init(&handle, &n, &m, x, LW, UP, eps, &iter1, &iter2, &rs);
res = dtrnlspbc_init(&handle, &n, &m, x, LW, UP, eps, &iter1, &iter2, &rs);

Include Files
• Fortran: mkl_rci.fi
• C: mkl_rci.h

Description

The ?trnlspbc_init routine initializes the solver. After initialization, all subsequent invocations of the ?trnlspbc_solve routine should use the values of the handle returned by ?trnlspbc_init.

The eps array contains the stopping criteria:

eps Value    Description
1            $\Delta < eps(1)$
2            $\|F(x)\|_2 < eps(2)$
3            The Jacobian matrix is singular. $\|J(x)(1:m,j)\|_2 < eps(3)$, j = 1, ..., n
4            $\|s\|_2 < eps(4)$
5            $\|F(x)\|_2 - \|F(x) - J(x)s\|_2 < eps(5)$
6            The trial step precision. If eps(6) = 0, then the trial step meets the required precision ($\le$ 1.0D-10).

Note:
• J(x) is the Jacobian matrix.
• $\Delta$ is the trust-region area.
• F(x) is the value of the functional.
• s is the trial step.

Input Parameters

n        INTEGER. Length of x.
m        INTEGER. Length of F(x).
x        REAL for strnlspbc_init, DOUBLE PRECISION for dtrnlspbc_init. Array of size n. Initial guess.
LW       REAL for strnlspbc_init, DOUBLE PRECISION for dtrnlspbc_init. Array of size n. Contains lower bounds for x ($lw_i < x_i$).
UP       REAL for strnlspbc_init, DOUBLE PRECISION for dtrnlspbc_init. Array of size n. Contains upper bounds for x ($up_i > x_i$).
eps      REAL for strnlspbc_init, DOUBLE PRECISION for dtrnlspbc_init. Array of size 6; contains stopping criteria. See the values in the Description section.
iter1    INTEGER. Specifies the maximum number of iterations.
iter2    INTEGER. Specifies the maximum number of iterations of trial step calculation.
rs       REAL for strnlspbc_init, DOUBLE PRECISION for dtrnlspbc_init. Initial size of the trust region (boundary of the trial step). The minimum value is 0.1, and the maximum value is 100.0.
Based on your knowledge of the objective function and initial guess, you can increase or decrease the initial trust region. It can influence the iteration process, for example, the direction of the iteration process and the number of iterations. The default value is 100.0.

Output Parameters

handle    Type _TRNSPBC_HANDLE_t in C/C++ and INTEGER*8 in FORTRAN.
res       INTEGER. Informs about the task completion.
          • res = TR_SUCCESS - the routine completed the task normally.
          • res = TR_INVALID_OPTION - there was an error in the input parameters.
          • res = TR_OUT_OF_MEMORY - there was a memory error.
          TR_SUCCESS, TR_INVALID_OPTION, and TR_OUT_OF_MEMORY are defined in mkl_rci.fi (Fortran) and mkl_rci.h (C) include files.

?trnlspbc_check

Checks the correctness of handle and arrays containing Jacobian matrix, objective function, lower and upper bounds, and stopping criteria.

Syntax

Fortran:
res = strnlspbc_check(handle, n, m, fjac, fvec, LW, UP, eps, info)
res = dtrnlspbc_check(handle, n, m, fjac, fvec, LW, UP, eps, info)

C:
res = strnlspbc_check(&handle, &n, &m, fjac, fvec, LW, UP, eps, info);
res = dtrnlspbc_check(&handle, &n, &m, fjac, fvec, LW, UP, eps, info);

Include Files
• Fortran: mkl_rci.fi
• C: mkl_rci.h

Description

The ?trnlspbc_check routine checks the arrays passed into the solver as input parameters. If an array contains any INF or NaN values, the routine sets the flag in output array info (see the description of the values returned in the Output Parameters section for the info array).

Input Parameters

handle    Type _TRNSPBC_HANDLE_t in C/C++ and INTEGER*8 in FORTRAN.
n         INTEGER. Length of x.
m         INTEGER. Length of F(x).
fjac      REAL for strnlspbc_check, DOUBLE PRECISION for dtrnlspbc_check. Array of size m by n. Contains the Jacobian matrix of the function.
fvec      REAL for strnlspbc_check, DOUBLE PRECISION for dtrnlspbc_check. Array of size m. Contains the function values at x, where $fvec(i) = (y_i - f_i(x))$.
LW        REAL for strnlspbc_check, DOUBLE PRECISION for dtrnlspbc_check. Array of size n. Contains lower bounds for x ($lw_i < x_i$).
UP        REAL for strnlspbc_check, DOUBLE PRECISION for dtrnlspbc_check. Array of size n. Contains upper bounds for x ($up_i > x_i$).
eps       REAL for strnlspbc_check, DOUBLE PRECISION for dtrnlspbc_check. Array of size 6; contains stopping criteria. See the values in the Description section of ?trnlspbc_init.

Output Parameters

info      INTEGER. Array of size 6. Results of input parameter checking:

info(0) in C, info(1) in Fortran: flags for handle
    0    The handle is valid.
    1    The handle is not allocated.
info(1) in C, info(2) in Fortran: flags for fjac
    0    The fjac array is valid.
    1    The fjac array is not allocated.
    2    The fjac array contains NaN.
    3    The fjac array contains Inf.
info(2) in C, info(3) in Fortran: flags for fvec
    0    The fvec array is valid.
    1    The fvec array is not allocated.
    2    The fvec array contains NaN.
    3    The fvec array contains Inf.
info(3) in C, info(4) in Fortran: flags for LW
    0    The LW array is valid.
    1    The LW array is not allocated.
    2    The LW array contains NaN.
    3    The LW array contains Inf.
    4    The lower bound is greater than the upper bound.
info(4) in C, info(5) in Fortran: flags for UP
    0    The UP array is valid.
    1    The UP array is not allocated.
    2    The UP array contains NaN.
    3    The UP array contains Inf.
    4    The upper bound is less than the lower bound.
info(5) in C, info(6) in Fortran: flags for eps
    0    The eps array is valid.
    1    The eps array is not allocated.
    2    The eps array contains NaN.
    3    The eps array contains Inf.
    4    The eps array contains a value less than or equal to zero.

res       INTEGER. Information about completion of the task. res = TR_SUCCESS - the routine completed the task normally. TR_SUCCESS is defined in the mkl_rci.h and mkl_rci.fi include files.

?trnlspbc_solve

Solves a nonlinear least squares problem with linear (bound) constraints using the Trust-Region algorithm.
Syntax

Fortran:
res = strnlspbc_solve(handle, fvec, fjac, RCI_Request)
res = dtrnlspbc_solve(handle, fvec, fjac, RCI_Request)

C:
res = strnlspbc_solve(&handle, fvec, fjac, &RCI_Request);
res = dtrnlspbc_solve(&handle, fvec, fjac, &RCI_Request);

Include Files
• Fortran: mkl_rci.fi
• C: mkl_rci.h

Description

The ?trnlspbc_solve routine, based on RCI, uses the Trust-Region algorithm to solve nonlinear least squares problems with linear (bound) constraints. The problem is stated as follows:

$$\min_{x \in R^n} \|F(x)\|_2^2 = \min_{x \in R^n} \|y - f(x)\|_2^2, \quad y \in R^m,\ x \in R^n,\ f: R^n \to R^m,\ m \ge n,$$

where $l_i \le x_i \le u_i$, i = 1, ..., n.

The RCI_Request parameter provides additional information:

RCI_Request Value    Description
2                    Request to calculate the Jacobian matrix and put the result into fjac
1                    Request to recalculate the function at vector x and put the result into fvec
0                    One successful iteration step on the current trust-region radius (that does not mean that the value of x has changed)
-1                   The algorithm has exceeded the maximum number of iterations
-2                   $\Delta < eps(1)$
-3                   $\|F(x)\|_2 < eps(2)$
-4                   The Jacobian matrix is singular. $\|J(x)(1:m,j)\|_2 < eps(3)$, j = 1, ..., n
-5                   $\|s\|_2 < eps(4)$
-6                   $\|F(x)\|_2 - \|F(x) - J(x)s\|_2 < eps(5)$

Note:
• J(x) is the Jacobian matrix.
• $\Delta$ is the trust-region area.
• F(x) is the value of the functional.
• s is the trial step.

Input Parameters

handle    Type _TRNSPBC_HANDLE_t in C/C++ and INTEGER*8 in FORTRAN.
fvec      REAL for strnlspbc_solve, DOUBLE PRECISION for dtrnlspbc_solve. Array of size m. Contains the function values at x, where $fvec(i) = (y_i - f_i(x))$.
fjac      REAL for strnlspbc_solve, DOUBLE PRECISION for dtrnlspbc_solve. Array of size m by n. Contains the Jacobian matrix of the function.

Output Parameters

fvec           REAL for strnlspbc_solve, DOUBLE PRECISION for dtrnlspbc_solve. Array of size m. Updated function evaluated at x.
RCI_Request    INTEGER. Informs about the task stage. See the Description section for the parameter values and their meaning.
res            INTEGER.
Informs about the task completion. res = TR_SUCCESS means the routine completed the task normally. TR_SUCCESS is defined in the mkl_rci.h and mkl_rci.fi include files.

?trnlspbc_get

Retrieves the number of iterations, stop criterion, initial residual, and final residual.

Syntax

Fortran:
res = strnlspbc_get(handle, iter, st_cr, r1, r2)
res = dtrnlspbc_get(handle, iter, st_cr, r1, r2)

C:
res = strnlspbc_get(&handle, &iter, &st_cr, &r1, &r2);
res = dtrnlspbc_get(&handle, &iter, &st_cr, &r1, &r2);

Include Files
• Fortran: mkl_rci.fi
• C: mkl_rci.h

Description

The routine retrieves the current number of iterations, the stop criterion, the initial residual, and the final residual.

The st_cr parameter contains the stop criterion:

st_cr Value    Description
1              The algorithm has exceeded the maximum number of iterations
2              $\Delta < eps(1)$
3              $\|F(x)\|_2 < eps(2)$
4              The Jacobian matrix is singular. $\|J(x)(1:m,j)\|_2 < eps(3)$, j = 1, ..., n
5              $\|s\|_2 < eps(4)$
6              $\|F(x)\|_2 - \|F(x) - J(x)s\|_2 < eps(5)$

Note:
• J(x) is the Jacobian matrix.
• $\Delta$ is the trust-region area.
• F(x) is the value of the functional.
• s is the trial step.

Input Parameters

handle    Type _TRNSPBC_HANDLE_t in C/C++ and INTEGER*8 in FORTRAN.

Output Parameters

iter      INTEGER. Contains the current number of iterations.
st_cr     INTEGER. Contains the stop criterion. See the Description section for the parameter values and their meanings.
r1        REAL for strnlspbc_get, DOUBLE PRECISION for dtrnlspbc_get. Contains the residual ($\|y - f(x)\|$) given the initial x.
r2        REAL for strnlspbc_get, DOUBLE PRECISION for dtrnlspbc_get. Contains the final residual, that is, the value of the function ($\|y - f(x)\|$) of the final x resulting from the algorithm operation.
res       INTEGER. Informs about the task completion. res = TR_SUCCESS - the routine completed the task normally. TR_SUCCESS is defined in the mkl_rci.h and mkl_rci.fi include files.

?trnlspbc_delete

Releases allocated data.
Syntax

Fortran:
res = strnlspbc_delete(handle)
res = dtrnlspbc_delete(handle)

C:
res = strnlspbc_delete(&handle);
res = dtrnlspbc_delete(&handle);

Include Files
• Fortran: mkl_rci.fi
• C: mkl_rci.h

Description

The ?trnlspbc_delete routine releases all memory allocated for the handle.

NOTE This routine flags memory as not used, but to actually release all memory you must call the support function mkl_free_buffers.

Input Parameters

handle    Type _TRNSPBC_HANDLE_t in C/C++ and INTEGER*8 in FORTRAN.

Output Parameters

res       INTEGER. Informs about the task completion. res = TR_SUCCESS means the routine completed the task normally. TR_SUCCESS is defined in the mkl_rci.h and mkl_rci.fi include files.

Jacobian Matrix Calculation Routines

This section describes routines that compute the Jacobian matrix using the central difference algorithm. Jacobian matrix calculation is required to solve a nonlinear least squares problem and systems of nonlinear equations (with or without linear bound constraints). Routines for calculation of the Jacobian matrix have "Black-Box" interfaces, where you pass the objective function via parameters. Your objective function must have a fixed interface.

Jacobian Matrix Calculation Routines

Routine Name      Operation
?jacobi_init      Initializes the solver.
?jacobi_solve     Computes the Jacobian matrix of the function on the basis of RCI using the central difference algorithm.
?jacobi_delete    Removes data.
?jacobi           Computes the Jacobian matrix of the fcn function using the central difference algorithm.
?jacobix          Presents an alternative interface for the ?jacobi function enabling you to pass additional data into the objective function.

?jacobi_init

Initializes the solver for Jacobian calculations.
Syntax

Fortran:
res = sjacobi_init(handle, n, m, x, fjac, eps)
res = djacobi_init(handle, n, m, x, fjac, eps)

C:
res = sjacobi_init(&handle, &n, &m, x, fjac, &eps);
res = djacobi_init(&handle, &n, &m, x, fjac, &eps);

Include Files
• Fortran: mkl_rci.fi
• C: mkl_rci.h

Description

The routine initializes the solver.

Input Parameters

n        INTEGER. Length of x.
m        INTEGER. Length of F.
x        REAL for sjacobi_init, DOUBLE PRECISION for djacobi_init. Array of size n. Vector at which the function is evaluated.
eps      REAL for sjacobi_init, DOUBLE PRECISION for djacobi_init. Precision of the Jacobian matrix calculation.
fjac     REAL for sjacobi_init, DOUBLE PRECISION for djacobi_init. Array of size (m,n). Contains the Jacobian matrix of the function.

Output Parameters

handle    Data object of the _JACOBIMATRIX_HANDLE_t type in C/C++ and INTEGER*8 in FORTRAN.
res       INTEGER. Indicates task completion status.
          • res = TR_SUCCESS - the routine completed the task normally.
          • res = TR_INVALID_OPTION - there was an error in the input parameters.
          • res = TR_OUT_OF_MEMORY - there was a memory error.
          TR_SUCCESS, TR_INVALID_OPTION, and TR_OUT_OF_MEMORY are defined in mkl_rci.fi (Fortran) and mkl_rci.h (C) include files.

?jacobi_solve

Computes the Jacobian matrix of the function using RCI and the central difference algorithm.

Syntax

Fortran:
res = sjacobi_solve(handle, f1, f2, RCI_Request)
res = djacobi_solve(handle, f1, f2, RCI_Request)

C:
res = sjacobi_solve(&handle, f1, f2, &RCI_Request);
res = djacobi_solve(&handle, f1, f2, &RCI_Request);

Include Files
• Fortran: mkl_rci.fi
• C: mkl_rci.h

Description

The ?jacobi_solve routine computes the Jacobian matrix of the function using RCI and the central difference algorithm.

See usage examples in FORTRAN and C in the examples\solver\source folder of your Intel MKL directory (sjacobi_rci_f.f, djacobi_rci_f.f and sjacobi_rci_c.c, djacobi_rci_c.c, respectively).
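To make the central difference idea concrete, here is a minimal, self-contained C sketch that approximates a Jacobian column by column from function values at x + h and x - h. It does not call Intel MKL and is not a description of how the library works internally. The names objective_fn, example_fcn, and central_diff_jacobian are hypothetical, the step h is an illustrative parameter that is not claimed to equal the eps argument of the MKL routines, the callback takes m and n by value for simplicity, and the column-major layout of fjac is an assumption of this sketch.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical callback type: evaluates f (length m) at x (length n). */
typedef void (*objective_fn)(int m, int n, const double *x, double *f);

/* Hypothetical test function with m = 3 outputs and n = 2 inputs. */
static void example_fcn(int m, int n, const double *x, double *f)
{
    (void)m;
    (void)n;
    f[0] = x[0] * x[0] + x[1];
    f[1] = x[0] * x[0] * x[0];
    f[2] = x[0] * x[1];
}

/*
 * Fills fjac (m-by-n, column-major: fjac[i + j*m] = d f_i / d x_j) with
 * central difference estimates (f(x + h*e_j) - f(x - h*e_j)) / (2*h).
 * x is perturbed in place and restored before returning.
 * Returns 0 on success, -1 if temporary storage cannot be allocated.
 */
static int central_diff_jacobian(objective_fn fcn, int n, int m,
                                 double *fjac, double *x, double h)
{
    assert(n > 0 && m > 0 && h > 0.0);

    double *f1 = malloc((size_t)m * sizeof *f1); /* values at x + h*e_j */
    double *f2 = malloc((size_t)m * sizeof *f2); /* values at x - h*e_j */
    if (f1 == NULL || f2 == NULL) {
        free(f1);
        free(f2);
        return -1;
    }

    for (int j = 0; j < n; ++j) {
        const double saved = x[j];

        x[j] = saved + h;
        fcn(m, n, x, f1);
        x[j] = saved - h;
        fcn(m, n, x, f2);
        x[j] = saved; /* restore the evaluation point */

        for (int i = 0; i < m; ++i)
            fjac[i + j * m] = (f1[i] - f2[i]) / (2.0 * h);
    }

    free(f1);
    free(f2);
    return 0;
}
```

The two temporary vectors play a role similar to f1 and f2 in the ?jacobi_solve interface: one holds function values at the forward-shifted point and the other at the backward-shifted point. In the RCI variant the caller performs the function evaluations on request, whereas this sketch calls the function directly.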
Input Parameters

handle    Type _JACOBIMATRIX_HANDLE_t in C/C++ and INTEGER*8 in FORTRAN.

Output Parameters

f1             REAL for sjacobi_solve, DOUBLE PRECISION for djacobi_solve. Array of size m. Contains the updated function values at x + eps.
f2             REAL for sjacobi_solve, DOUBLE PRECISION for djacobi_solve. Array of size m. Contains the updated function values at x - eps.
RCI_Request    INTEGER. Informs about the task completion. When equal to 0, the task has completed successfully.
               RCI_Request = 1 indicates that you should compute the function values at the current x point and put the results into f1.
               RCI_Request = 2 indicates that you should compute the function values at the current x point and put the results into f2.
res            INTEGER. Indicates the task completion status.
               • res = TR_SUCCESS - the routine completed the task normally.
               • res = TR_INVALID_OPTION - there was an error in the input parameters.
               TR_SUCCESS and TR_INVALID_OPTION are defined in mkl_rci.fi (Fortran) and mkl_rci.h (C) include files.

See Also
?jacobi_init

?jacobi_delete

Releases allocated data.

Syntax

Fortran:
res = sjacobi_delete(handle)
res = djacobi_delete(handle)

C:
res = sjacobi_delete(&handle);
res = djacobi_delete(&handle);

Include Files
• Fortran: mkl_rci.fi
• C: mkl_rci.h

Description

The ?jacobi_delete routine releases all memory allocated for the handle.

This routine flags memory as not used, but to actually release all memory you must call the support function mkl_free_buffers.

Input Parameters

handle    Type _JACOBIMATRIX_HANDLE_t in C/C++ and INTEGER*8 in FORTRAN.

Output Parameters

res       INTEGER. Informs about the task completion. res = TR_SUCCESS means the routine completed the task normally. TR_SUCCESS is defined in the mkl_rci.h and mkl_rci.fi include files.

?jacobi

Computes the Jacobian matrix of the objective function using the central difference algorithm.
Syntax

Fortran:
res = sjacobi(fcn, n, m, fjac, x, jac_eps)
res = djacobi(fcn, n, m, fjac, x, jac_eps)

C:
res = sjacobi(fcn, &n, &m, fjac, x, &jac_eps);
res = djacobi(fcn, &n, &m, fjac, x, &jac_eps);

Include Files
• Fortran: mkl_rci.fi
• C: mkl_rci.h

Description

The ?jacobi routine computes the Jacobian matrix for function fcn using the central difference algorithm. This routine has a "Black-Box" interface, where you input the objective function via parameters. Your objective function must have a fixed interface.

See calling and usage examples in FORTRAN and C in the examples\solver\source folder of your Intel MKL directory (ex_nlsqp_f.f, ex_nlsqp_bc_f.f and ex_nlsqp_c.c, ex_nlsqp_bc_c.c, respectively).

Input Parameters

fcn      User-supplied subroutine to evaluate the function that defines the least squares problem. Call fcn (m, n, x, f) with the following parameters:

         Input Parameters
         m    INTEGER. Length of f.
         n    INTEGER. Length of x.
         x    REAL for sjacobi, DOUBLE PRECISION for djacobi. Array of size n. Vector at which the function is evaluated. The fcn function should not change this parameter.

         Output Parameters
         f    REAL for sjacobi, DOUBLE PRECISION for djacobi. Array of size m; contains the function values at x.

         You need to declare fcn as EXTERNAL in the calling program.
n        INTEGER. Length of x.
m        INTEGER. Length of F.
x        REAL for sjacobi, DOUBLE PRECISION for djacobi. Array of size n. Vector at which the function is evaluated.
eps      REAL for sjacobi, DOUBLE PRECISION for djacobi. Precision of the Jacobian matrix calculation (shown as jac_eps in the Syntax section).

Output Parameters

fjac     REAL for sjacobi, DOUBLE PRECISION for djacobi. Array of size (m,n). Contains the Jacobian matrix of the function.
res      INTEGER. Indicates task completion status.
         • res = TR_SUCCESS - the routine completed the task normally.
         • res = TR_INVALID_OPTION - there was an error in the input parameters.
         • res = TR_OUT_OF_MEMORY - there was a memory error.
TR_SUCCESS, TR_INVALID_OPTION, and TR_OUT_OF_MEMORY are defined in mkl_rci.fi (Fortran) and mkl_rci.h (C) include files.

See Also
?jacobix

?jacobix

Alternative interface for ?jacobi function for passing additional data into the objective function.

Syntax

Fortran:
res = sjacobix(fcn, n, m, fjac, x, jac_eps, user_data)
res = djacobix(fcn, n, m, fjac, x, jac_eps, user_data)

C:
res = sjacobix(fcn, &n, &m, fjac, x, &jac_eps, user_data);
res = djacobix(fcn, &n, &m, fjac, x, &jac_eps, user_data);

Include Files
• Fortran: mkl_rci.fi
• C: mkl_rci.h

Description

The ?jacobix routine presents an alternative interface for the ?jacobi function that enables you to pass additional data into the objective function fcn.

See calling and usage examples in FORTRAN and C in the examples\solver\source folder of your Intel MKL directory (ex_nlsqp_f_x.f, ex_nlsqp_bc_f_x.f and ex_nlsqp_c_x.c, ex_nlsqp_bc_c_x.c, respectively).

Input Parameters

fcn      User-supplied subroutine to evaluate the function that defines the least squares problem. Call fcn (m, n, x, f, user_data) with the following parameters:

         Input Parameters
         m            INTEGER. Length of f.
         n            INTEGER. Length of x.
         x            REAL for sjacobix, DOUBLE PRECISION for djacobix. Array of size n. Vector at which the function is evaluated. The fcn function should not change this parameter.
         user_data    INTEGER*8 for Fortran, void* for C. (Fortran) Your additional data, if any. Otherwise, a dummy argument. (C) Pointer to your additional data, if any. Otherwise, a dummy argument.

         Output Parameters
         f            REAL for sjacobix, DOUBLE PRECISION for djacobix. Array of size m; contains the function values at x.

         You need to declare fcn as EXTERNAL in the calling program.
n        INTEGER. Length of x.
m        INTEGER. Length of F.
x        REAL for sjacobix, DOUBLE PRECISION for djacobix. Array of size n. Vector at which the function is evaluated.
eps          REAL for sjacobix, DOUBLE PRECISION for djacobix. Precision of the Jacobian matrix calculation.
user_data    (Fortran) INTEGER*8. Contains your additional data. If there is no additional data, this is a dummy argument. (C) void*. Pointer to your additional data. If there is no additional data, this is a dummy argument.

Output Parameters

fjac     REAL for sjacobix, DOUBLE PRECISION for djacobix. Array of size (m,n). Contains the Jacobian matrix of the function.
res      INTEGER. Indicates task completion status.
         • res = TR_SUCCESS - the routine completed the task normally.
         • res = TR_INVALID_OPTION - there was an error in the input parameters.
         • res = TR_OUT_OF_MEMORY - there was a memory error.
         TR_SUCCESS, TR_INVALID_OPTION, and TR_OUT_OF_MEMORY are defined in mkl_rci.fi (Fortran) and mkl_rci.h (C) include files.

See Also
?jacobi

Support Functions 15

Intel® Math Kernel Library (Intel® MKL) support functions are used to:
– retrieve information about the current Intel MKL version
– additionally control the number of threads
– handle errors
– test characters and character strings for equality
– measure user time for a process and elapsed CPU time
– measure CPU frequency
– free memory allocated by Intel MKL memory management software
– facilitate easy linking

Functions described below are subdivided according to their purpose into the following groups:
Version Information Functions
Threading Control Functions
Error Handling Functions
Equality Test Functions
Timing Functions
Memory Functions
Miscellaneous Utility Functions
Functions Supporting the Single Dynamic Library

Table "Intel MKL Support Functions" contains the list of support functions common for Intel MKL.

Intel MKL Support Functions

Function Name    Operation

Version Information Functions
mkl_get_version    Returns information about the active library version.
mkl_get_version_string    Returns information about the library version string.

Threading Control Functions
mkl_set_num_threads           Suggests the number of threads to use.
mkl_domain_set_num_threads    Suggests the number of threads for a particular function domain.
mkl_set_dynamic               Enables Intel MKL to dynamically change the number of threads.
mkl_get_max_threads           Inquires about the number of threads targeted for parallelism.
mkl_domain_get_max_threads    Inquires about the number of threads targeted for parallelism in different domains.
mkl_get_dynamic               Returns the current value of the MKL_DYNAMIC variable.

Error Handling Functions
xerbla     Handles error conditions for the BLAS, LAPACK, VSL, VML routines.
pxerbla    Handles error conditions for the ScaLAPACK routines.

Equality Test Functions
lsame     Tests two characters for equality regardless of the case.
lsamen    Tests two character strings for equality regardless of the case.

Timing Functions
second/dsecnd                Returns user time for a process.
mkl_get_cpu_clocks           Returns full precision elapsed CPU clocks.
mkl_get_cpu_frequency        Returns CPU frequency value in GHz.
mkl_get_max_cpu_frequency    Returns the maximum CPU frequency value in GHz.
mkl_get_clocks_frequency     Returns the frequency value in GHz based on constant-rate Time Stamp Counter.

Memory Functions
mkl_free_buffers           Frees memory buffers.
mkl_thread_free_buffers    Frees memory buffers allocated only in the current thread.
mkl_mem_stat               Reports an amount of memory utilized by Intel MKL memory management software.
mkl_disable_fast_mm        Enables Intel MKL to dynamically turn off memory management.
mkl_malloc                 Allocates the aligned memory buffer.
mkl_free                   Frees the aligned memory buffer allocated by mkl_malloc.

Miscellaneous Utility Functions
mkl_progress               Tracks computational progress of selective MKL routines.
mkl_enable_instructions    Allows Intel MKL to dispatch Intel® Advanced Vector Extensions (Intel® AVX) if run on the respective hardware (or simulation).
Functions Supporting the Single Dynamic Library (SDL) mkl_set_interface_layer Sets the interface layer for Intel MKL at run time. mkl_set_threading_layer Sets the threading layer for Intel MKL at run time. mkl_set_xerbla Replaces the error handling routine. Use with SDL on Windows* OS. mkl_set_progress Replaces the progress information routine. Use with SDL on Windows* OS. 15 Intel® Math Kernel Library Reference Manual 2520 Version Information Functions Intel® MKL provides two methods for extracting information about the library version number: • extracting a version string using the mkl_get_version_string function • using the mkl_get_version function to obtain an MKLVersion structure that contains the version information A makefile is also provided to automatically build the examples and output summary files containing the version information for the current library. mkl_get_version Returns information about the active library C version. Syntax void mkl_get_version( MKLVersion* pVersion ); Include Files • C: mkl_service.h Output Parameters pVersion Pointer to the MKLVersion structure. Description The mkl_get_version function collects information about the active C version of the Intel MKL software and returns this information in a structure of MKLVersion type by the pVersion address. The MKLVersion structure type is defined in the mkl_types.h file. The following fields of the MKLVersion structure are available: MajorVersion is the major number of the current library version. MinorVersion is the minor number of the current library version. UpdateVersion is the update number of the current library version. ProductStatus is the status of the current library version. Possible variants could be “Beta”, “Product”. Build is the string that contains the build date and the internal build number. Processor is the processor optimization that is targeted for the specific processor. 
It does not identify the processor installed in the system; rather, it reflects the Intel MKL detection of the optimization that is optimal for that processor.

NOTE MKLGetVersion is an obsolete name for the mkl_get_version function that is referenced in the library for back compatibility purposes but is deprecated and subject to removal in subsequent releases.

Optimization Notice
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804

mkl_get_version Usage
----------------------------------------------------------------------------------------------
#include <stdio.h>
#include <stdlib.h>
#include "mkl_service.h"

int main(void)
{
    MKLVersion Version;

    mkl_get_version(&Version);
    // MKL_Get_Version(&Version);

    printf("Major version: %d\n",Version.MajorVersion);
    printf("Minor version: %d\n",Version.MinorVersion);
    printf("Update version: %d\n",Version.UpdateVersion);
    printf("Product status: %s\n",Version.ProductStatus);
    printf("Build: %s\n",Version.Build);
    printf("Processor optimization: %s\n",Version.Processor);
    printf("================================================================\n");
    printf("\n");
    return 0;
}

Output:
Major Version 9
Minor Version 0
Update Version 0
Product status Product
Build 061909.09
Processor optimization Intel® Xeon® Processor with Intel® 64 architecture

mkl_get_version_string
Gets the library version string.

Syntax
Fortran: call mkl_get_version_string( buf )
C: mkl_get_version_string( buf, len );

Include Files
• FORTRAN 77: mkl_service.fi
• C: mkl_service.h

Output Parameters
Name    Type                                 Description
buf     FORTRAN: CHARACTER*198; C: char*     Source string
len     FORTRAN: INTEGER; C: int             Length of the source string

Description
The function returns a string that contains the library version information.

NOTE MKLGetVersionString is an obsolete name for the mkl_get_version_string function that is referenced in the library for back compatibility purposes but is deprecated and subject to removal in subsequent releases.
See example below:

Examples

Fortran Example
      program mkl_get_version_string
      character*198 buf
      call mkl_get_version_string(buf)
      write(*,'(a)') buf
      end

C Example
#include <stdio.h>
#include "mkl_service.h"

int main(void)
{
    int len=198;
    char buf[198];

    mkl_get_version_string(buf, len);
    printf("%s\n",buf);
    printf("\n");
    return 0;
}

Threading Control Functions

Intel® MKL provides optional threading control functions that take precedence over OpenMP* environment variable settings with the same purpose (see Intel® MKL User's Guide for details). These functions enable you to specify the number of threads for Intel MKL independently of the OpenMP* settings. Although Intel MKL may actually use a different number of threads from the number suggested, the controls also enable you to instruct the library to try using the suggested number when the number used in the calling application is unavailable. See the following examples of Fortran and C usage:

Fortran Usage
      call mkl_set_num_threads( foo )
      ierr = mkl_domain_set_num_threads( num, MKL_DOMAIN_BLAS )
      call mkl_set_dynamic ( 1 )
      num = mkl_get_max_threads()
      num = mkl_domain_get_max_threads( MKL_DOMAIN_BLAS )
      ret = mkl_get_dynamic()

C Usage
#include "mkl.h" // Mandatory to make these definitions work!
mkl_set_num_threads(num);
return_code = mkl_domain_set_num_threads( num, MKL_DOMAIN_FFT );
mkl_set_dynamic( 1 );
num = mkl_get_max_threads();
num = mkl_domain_get_max_threads( MKL_DOMAIN_FFT );
return_code = mkl_get_dynamic();

NOTE Always remember to add #include "mkl.h" to use the C usage syntax.

mkl_set_num_threads
Suggests the number of threads to use.
Syntax
Fortran: call mkl_set_num_threads( number )
C: void mkl_set_num_threads( number );

Include Files
• FORTRAN 77: mkl_service.fi
• C: mkl_service.h

Input Parameters
Name      Type                         Description
number    FORTRAN: INTEGER; C: int     Number of threads suggested by user

Description
This function allows you to specify how many threads Intel MKL should use. The number is a hint, and there is no guarantee that exactly this number of threads will be used. Enter a positive integer. This routine takes precedence over the MKL_NUM_THREADS environment variable.

NOTE Always remember to add #include "mkl.h" to use the C usage syntax.

See Intel MKL User's Guide for implementation details.

mkl_domain_set_num_threads
Suggests the number of threads for a particular function domain.

Syntax
Fortran: ierr = mkl_domain_set_num_threads( num, mask )
C: ierr = mkl_domain_set_num_threads( num, mask );

Include Files
• FORTRAN 77: mkl_service.fi
• C: mkl_service.h

Input Parameters
Name    Type                         Description
num     FORTRAN: INTEGER; C: int     Number of threads suggested by user
mask    FORTRAN: INTEGER; C: int     Name of the targeted domain

Description
This function allows you to request different domains of Intel MKL to use different numbers of threads. The currently supported domains are:
• MKL_DOMAIN_BLAS - BLAS
• MKL_DOMAIN_FFT - FFT (excluding cluster FFT)
• MKL_DOMAIN_VML - Vector Math Library
• MKL_DOMAIN_PARDISO - PARDISO
• MKL_DOMAIN_ALL - another way to do what mkl_set_num_threads does

This is only a hint, and use of this number of threads is not guaranteed. Enter a valid domain and a positive integer for the number of threads. This routine has precedence over the MKL_DOMAIN_NUM_THREADS environment variable. See Intel MKL User's Guide for implementation details.

Return Values
1 (true)     Indicates no error, execution is successful.
0 (false)    Indicates failure, possibly because the inputs were invalid.
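The return convention above (1 for success, 0 for failure) is the reverse of the common C habit of returning 0 on success, so it is worth checking explicitly. The following is a minimal sketch of that contract in plain C. Everything named demo_... is a hypothetical stand-in written for this illustration, not an Intel MKL constant or function; treating the "all" domain as setting every per-domain hint and returning 0 from the getter on an invalid domain are modeling assumptions only.

```c
/* Hypothetical stand-ins for illustration only; these are NOT the MKL
   domain constants or the MKL functions. */
enum demo_domain {
    DEMO_DOMAIN_ALL = 0,
    DEMO_DOMAIN_BLAS,
    DEMO_DOMAIN_FFT,
    DEMO_DOMAIN_VML,
    DEMO_DOMAIN_PARDISO,
    DEMO_DOMAIN_COUNT
};

/* Per-domain thread hints, all starting at 1. */
static int demo_hint[DEMO_DOMAIN_COUNT] = {1, 1, 1, 1, 1};

/* Mimics the documented contract: returns 1 when the request is accepted
   and 0 when the thread count or the domain is invalid. */
int demo_domain_set_num_threads(int num, int domain)
{
    int d;

    if (num < 1 || domain < 0 || domain >= DEMO_DOMAIN_COUNT)
        return 0;                       /* failure: invalid input */

    if (domain == DEMO_DOMAIN_ALL) {    /* assumption: "all" sets every hint */
        for (d = 0; d < DEMO_DOMAIN_COUNT; d++)
            demo_hint[d] = num;
    } else {
        demo_hint[domain] = num;
    }
    return 1;                           /* success */
}

/* Returns the stored hint for a domain (0 for an invalid domain; this
   choice is an assumption of the sketch). */
int demo_domain_get_max_threads(int domain)
{
    if (domain < 0 || domain >= DEMO_DOMAIN_COUNT)
        return 0;
    return demo_hint[domain];
}
```

A caller following this pattern would test the result with something like if (!ierr) { /* request rejected */ } rather than if (ierr), and would still treat an accepted value as a hint, not a guarantee.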
mkl_set_dynamic
Enables Intel MKL to dynamically change the number of threads.

Syntax
Fortran: call mkl_set_dynamic( boolean_var )
C: void mkl_set_dynamic( boolean_var );

Include Files
• FORTRAN 77: mkl_service.fi
• C: mkl_service.h

Input Parameters
Name           Type                         Description
boolean_var    FORTRAN: INTEGER; C: int     The parameter that determines whether dynamic adjustment of the number of threads is enabled or disabled.

Description
This function indicates whether or not Intel MKL can dynamically change the number of threads. The default is true, regardless of how the OMP_DYNAMIC variable is set, and this setting takes precedence over the OMP_DYNAMIC variable. A value of false does not guarantee that the number of threads you requested will be used, but it means that Intel MKL will attempt to use that value. This routine takes precedence over the environment variable MKL_DYNAMIC.

Note that if Intel MKL is called from within a parallel region, Intel MKL may not thread unless MKL_DYNAMIC is set to false, either with the environment variable or by this routine call. See Intel MKL User's Guide for implementation details.

mkl_get_max_threads
Inquires about the number of threads targeted for parallelism.

Syntax
Fortran: num = mkl_get_max_threads()
C: num = mkl_get_max_threads();

Include Files
• FORTRAN 77: mkl_service.fi
• C: mkl_service.h

Description
This function allows you to inquire independently of OpenMP* how many threads Intel MKL is targeting for parallelism. The number is a hint, and there is no guarantee that exactly this number of threads will be used. See Intel MKL User's Guide for implementation details.

Return Values
The output is INTEGER equal to the number of threads.

mkl_domain_get_max_threads
Inquires about the number of threads targeted for parallelism in different domains.
Syntax
Fortran: ierr = mkl_domain_get_max_threads( mask )
C: ierr = mkl_domain_get_max_threads( mask );

Include Files
• FORTRAN 77: mkl_service.fi
• C: mkl_service.h

Input Parameters
Name    Type                         Description
mask    FORTRAN: INTEGER; C: int     The name of the targeted domain

Description
This function allows you to inquire what number of threads is being used as a hint for a given Intel MKL domain. The inquiry does not imply that this is the actual number of threads used. The number may vary depending on the value of the MKL_DYNAMIC variable and/or problem size, system resources, etc. But the function returns the value that MKL is targeting for a given domain. The currently supported domains are:
• MKL_DOMAIN_BLAS - BLAS
• MKL_DOMAIN_FFT - FFT (excluding cluster FFT)
• MKL_DOMAIN_VML - Vector Math Library
• MKL_DOMAIN_PARDISO - PARDISO
• MKL_DOMAIN_ALL - another way to do what mkl_get_max_threads does.

Enter a valid domain. See Intel MKL User's Guide for implementation details.

Return Values
Returns the hint about the number of threads for a given domain.

mkl_get_dynamic
Returns current value of MKL_DYNAMIC variable.

Syntax
Fortran: ret = mkl_get_dynamic()
C: ret = mkl_get_dynamic();

Include Files
• FORTRAN 77: mkl_service.fi
• C: mkl_service.h

Description
This function returns the current value of the MKL_DYNAMIC variable. This variable can be changed by manipulating the MKL_DYNAMIC environment variable before the Intel MKL run is launched or by calling mkl_set_dynamic(). Doing the latter has precedence over the former. The function returns a value of 0 or 1: 1 indicates that MKL_DYNAMIC is true, 0 indicates that MKL_DYNAMIC is false. This variable indicates whether or not Intel MKL can dynamically change the number of threads. A value of false does not guarantee that the number of threads you requested will be used, but it means that Intel MKL will attempt to use that value.

Note that if Intel MKL is called from within a parallel region, Intel MKL may not thread unless MKL_DYNAMIC is set to false, either with the environment variable or by this routine call. See Intel MKL User's Guide for implementation details.

Return Values
1    Indicates MKL_DYNAMIC is true.
0    Indicates MKL_DYNAMIC is false.

Error Handling Functions

xerbla
Error handling routine called by BLAS, LAPACK, VML, VSL routines.

Syntax
Fortran: call xerbla( srname, info )
C: xerbla( srname, info, len );

Include Files
• FORTRAN 77: mkl_blas.fi
• C: mkl_blas.h

Input Parameters
Name      Type                                  Description
srname    FORTRAN: CHARACTER*(*); C: char*      The name of the routine that called xerbla
info      FORTRAN: INTEGER; C: int*             The position of the invalid parameter in the parameter list of the calling routine
len       C: int                                Length of the source string

Description
The routine xerbla is an error handler for the BLAS, LAPACK, VSL, and VML routines. It is called by a BLAS, LAPACK, VSL or VML routine if an input parameter has an invalid value. If an issue is found with an input parameter, xerbla prints a message similar to the following:

MKL ERROR: Parameter 6 was incorrect on entry to DGEMM

and then returns to your application. Comments in the LAPACK reference code (http://www.netlib.org/lapack/explore-html/xerbla.f.html) suggest this behavior, though the LAPACK User's Guide recommends that the execution should stop when an error is found.

Note that xerbla is an internal function. You can change or disable printing of an error message by providing your own xerbla function. See the FORTRAN and C examples below.
Examples

      subroutine xerbla (srname, info)
      character*(*) srname   !Name of subprogram that called xerbla
      integer*4 info         !Position of the invalid parameter in the parameter list
      return                 !Return to the calling subprogram
      end

#include <stdio.h>

void xerbla(char* srname, int* info, int len){
    // srname - name of the function that called xerbla
    // info - position of the invalid parameter in the parameter list
    // len - length of the name in bytes
    printf("\nXERBLA is called :%s: %d\n",srname,*info);
}

pxerbla
Error handling routine called by ScaLAPACK routines.

Syntax
call pxerbla(ictxt, srname, info)

Include Files
• C: mkl_scalapack.h

Input Parameters
ictxt     (global) INTEGER. The BLACS context handle, indicating the global context of the operation. The context itself is global.
srname    (global) CHARACTER*6. The name of the routine which called pxerbla.
info      (global) INTEGER. The position of the invalid parameter in the parameter list of the calling routine.

Description
This routine is an error handler for the ScaLAPACK routines. It is called if an input parameter has an invalid value. A message is printed and program execution continues. For ScaLAPACK driver and computational routines, a RETURN statement is issued following the call to pxerbla. Control returns to the higher-level calling routine, and you can determine how the program should proceed. However, in the specialized low-level ScaLAPACK routines (auxiliary routines that are Level 2 equivalents of computational routines), the call to pxerbla() is immediately followed by a call to BLACS_ABORT() to terminate program execution since recovery from an error at this level in the computation is not possible.

It is always good practice to check for a nonzero value of info on return from a ScaLAPACK routine. Installers may consider modifying this routine in order to call system-specific exception-handling facilities.

Equality Test Functions

lsame
Tests two characters for equality regardless of the case.
Syntax
Fortran: val = lsame( ca, cb )
C: val = lsame( ca, cb );

Include Files
• FORTRAN 77: mkl_blas.fi
• C: mkl_blas.h

Input Parameters
Name      Type                                       Description
ca, cb    FORTRAN: CHARACTER*1; C: const char*       FORTRAN: The single characters to be compared
                                                     C: Pointers to the single characters to be compared

Output Parameters
Name    Type                         Description
val     FORTRAN: LOGICAL; C: int     Result of the comparison

Description
This logical function returns .TRUE. if ca is the same letter as cb regardless of the case, and .FALSE. otherwise.

lsamen
Tests two character strings for equality regardless of the case.

Syntax
Fortran: val = lsamen( n, ca, cb )
C: val = lsamen( n, ca, cb );

Include Files
• FORTRAN 77: mkl_lapack.fi
• C: mkl_lapack.h

Input Parameters
Name      Type                                        Description
n         FORTRAN: INTEGER; C: const int*             FORTRAN: The number of characters in ca and cb to be compared.
                                                      C: Pointer to the number of characters in ca and cb to be compared.
ca, cb    FORTRAN: CHARACTER*(*); C: const char*      Specify two character strings of length at least n to be compared. Only the first n characters of each string will be accessed.

Output Parameters
Name    Type                         Description
val     FORTRAN: LOGICAL; C: int     FORTRAN: Result of the comparison. .TRUE. if ca and cb are equivalent except for the case, and .FALSE. otherwise. The function also returns .FALSE. if len(ca) or len(cb) is less than n.
                                     C: Result of the comparison. Non-zero if ca and cb are equivalent except for the case, and zero otherwise.

Description
This logical function tests if the first n letters of one string are the same as the first n letters of another string, regardless of the case.

Timing Functions

second/dsecnd
Returns elapsed CPU time in seconds.
Syntax
Fortran: val = second()
         val = dsecnd()
C: val = second();
   val = dsecnd();

Include Files
• FORTRAN 77: mkl_lapack.fi
• C: mkl_lapack.h

Output Parameters
Name    Type                                                            Description
val     FORTRAN: REAL for second, DOUBLE PRECISION for dsecnd           Elapsed CPU time in seconds
        C: float for second, double for dsecnd

Description
The second/dsecnd functions return the elapsed CPU time in seconds. The difference between these functions is that dsecnd returns the result with double precision.

Apply each function in pairs: call it once directly before the routine to be measured and once directly after it. The difference between the returned values is the time spent in the routine.

The second/dsecnd functions get the time from the elapsed CPU clocks divided by frequency. Obtaining the frequency may take some time when the second/dsecnd function runs for the first time. To eliminate the effect of this extra time on your measurements, make the first call to second/dsecnd in advance.

Do not use second for measuring short time intervals because the single-precision format is not capable of holding sufficient timer precision.

mkl_get_cpu_clocks
Returns full precision elapsed CPU clocks.

Syntax
Fortran: call mkl_get_cpu_clocks( clocks )
C: mkl_get_cpu_clocks( &clocks );

Include Files
• FORTRAN 77: mkl_service.fi
• C: mkl_service.h

Output Parameters
Name      Type                                           Description
clocks    FORTRAN: INTEGER*8; C: unsigned MKL_INT64      Elapsed CPU clocks

Description
The mkl_get_cpu_clocks function returns the elapsed CPU clocks. This may be useful when timing short intervals with high resolution. Like second/dsecnd, the mkl_get_cpu_clocks function is applied in pairs. Note that out-of-order code execution on IA-32 or Intel® 64 architecture processors may slightly perturb the exact elapsed CPU clocks value, which may be important when measuring extremely short time intervals.
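The arithmetic behind paired clock readings can be sketched in a few lines of standard C. The helper names below (elapsed_clocks, clocks_to_seconds) are hypothetical and are not part of Intel MKL; the sketch only assumes that two clock readings were taken around the measured code and that a frequency in GHz is available, and it applies the "elapsed clocks divided by frequency" relation stated above.

```c
/* Elapsed clock ticks between two readings taken directly before and
   directly after the code being measured. */
unsigned long long elapsed_clocks(unsigned long long start,
                                  unsigned long long end)
{
    return end - start;
}

/* Convert a tick count to seconds, given a frequency in GHz
   (1 GHz = 1.0e9 ticks per second). */
double clocks_to_seconds(unsigned long long clocks, double freq_ghz)
{
    return (double)clocks / (freq_ghz * 1.0e9);
}
```

For example, 1,500,000,000 elapsed clocks at 3 GHz correspond to 0.5 seconds. With Intel MKL, the two readings would come from a pair of mkl_get_cpu_clocks calls and the frequency from one of the frequency functions in this chapter, which report GHz.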
NOTE getcpuclocks is an obsolete name for the mkl_get_cpu_clocks function that is referenced in the library for back compatibility purposes but is deprecated and subject to removal in subsequent releases.

mkl_get_cpu_frequency
Returns the current CPU frequency value in GHz.

Syntax
Fortran: freq = mkl_get_cpu_frequency()
C: freq = mkl_get_cpu_frequency();

Include Files
• FORTRAN 77: mkl_service.fi
• C: mkl_service.h

Output Parameters
Name    Type                                     Description
freq    FORTRAN: DOUBLE PRECISION; C: double     Current CPU frequency value in GHz

Description
The function mkl_get_cpu_frequency returns the current CPU frequency in GHz.

NOTE getcpufrequency is an obsolete name for the mkl_get_cpu_frequency function that is referenced in the library for back compatibility purposes but is deprecated and subject to removal in subsequent releases.

mkl_get_max_cpu_frequency
Returns the maximum CPU frequency value in GHz.
Syntax
Fortran: freq = mkl_get_max_cpu_frequency()
C: freq = mkl_get_max_cpu_frequency();

Include Files
• FORTRAN 77: mkl_service.fi
• C: mkl_service.h

Output Parameters
Name    Type                                     Description
freq    FORTRAN: DOUBLE PRECISION; C: double     Maximum CPU frequency value in GHz

Description
The function mkl_get_max_cpu_frequency returns the maximum CPU frequency in GHz.

mkl_get_clocks_frequency
Returns the frequency value in GHz based on constant-rate Time Stamp Counter.

Syntax
Fortran: freq = mkl_get_clocks_frequency()
C: freq = mkl_get_clocks_frequency();

Include Files
• FORTRAN 77: mkl_service.fi
• C: mkl_service.h

Output Parameters
Name    Type                                     Description
freq    FORTRAN: DOUBLE PRECISION; C: double     Frequency value in GHz

Description
The function mkl_get_clocks_frequency returns the CPU frequency value (in GHz) based on constant-rate Time Stamp Counter (TSC). Use of the constant-rate TSC ensures that each clock tick is constant even if the CPU frequency changes. Therefore, the returned frequency is constant.

NOTE Obtaining the frequency may take some time when mkl_get_clocks_frequency is called for the first time. The same holds for functions second/dsecnd, which call mkl_get_clocks_frequency.

See Also
second/dsecnd

Memory Functions

This section describes the Intel MKL memory support functions. See the Intel® MKL User's Guide for details of the Intel MKL memory management.

mkl_free_buffers
Frees memory buffers.

Syntax
Fortran: call mkl_free_buffers
C: mkl_free_buffers();

Include Files
• FORTRAN 77: mkl_service.fi
• C: mkl_service.h

Description
The mkl_free_buffers function frees the memory allocated by the Intel MKL memory management software. The memory management software allocates new buffers if no free buffers are currently available. Call mkl_free_buffers() to free all memory buffers and to avoid memory leaking on completion of work with the Intel MKL functions, that is, after the last call of an Intel MKL function from your application. See Intel® MKL User's Guide for details.

NOTE MKL_FreeBuffers is an obsolete name for the mkl_free_buffers function that is referenced in the library for back compatibility purposes but is deprecated and subject to removal in subsequent releases.

mkl_free_buffers Usage with FFT Functions
----------------------------------------------------------------------------------------------
DFTI_DESCRIPTOR_HANDLE hand1;
DFTI_DESCRIPTOR_HANDLE hand2;
void mkl_free_buffers(void);
. . . . . .
/* Using MKL FFT */
Status = DftiCreateDescriptor(&hand1, DFTI_SINGLE, DFTI_COMPLEX, dim, m1);
Status = DftiCommitDescriptor(hand1);
Status = DftiComputeForward(hand1, s_array1);
. . . . . .
Status = DftiCreateDescriptor(&hand2, DFTI_SINGLE, DFTI_COMPLEX, dim, m2);
Status = DftiCommitDescriptor(hand2);
. . . . . .
Status = DftiFreeDescriptor(&hand1);
/* Do not call mkl_free_buffers() here as the hand2 descriptor will be corrupted! */
. . . . . .
Status = DftiComputeBackward(hand2, s_array2);
Status = DftiFreeDescriptor(&hand2);
/* Here user finishes the MKL FFT usage */
/* Memory leak will be reported by any memory control tool */
/* Use mkl_free_buffers() to avoid memory leaking */
mkl_free_buffers();
----------------------------------------------------------------------------------------------

If the memory space is sufficient, use mkl_free_buffers after the last call of the MKL functions. Otherwise, a drop in performance can occur due to reallocation of buffers for the subsequent MKL functions.

WARNING For FFT calls, do not use mkl_free_buffers between DftiCreateDescriptor(hand) and DftiFreeDescriptor(&hand).

mkl_thread_free_buffers
Frees memory buffers allocated in the current thread.
Syntax
Fortran: call mkl_thread_free_buffers
C: mkl_thread_free_buffers();

Include Files
• FORTRAN 77: mkl_service.fi
• C: mkl_service.h

Description
The mkl_thread_free_buffers function frees the memory allocated by the Intel MKL memory management in the current thread only. Memory buffers allocated in other threads are not affected. Call mkl_thread_free_buffers() to avoid memory leaking when you cannot call the mkl_free_buffers function in a multi-threaded application because you are not sure whether all the other running Intel MKL functions have completed.

mkl_disable_fast_mm
Enables Intel MKL to dynamically turn off memory management.

Syntax
Fortran: mm = mkl_disable_fast_mm
C: mm = mkl_disable_fast_mm();

Include Files
• FORTRAN 77: mkl_service.fi
• C: mkl_service.h

Description
The Intel MKL memory management software is turned on by default. To turn it off dynamically before any Intel MKL function call, you can use the mkl_disable_fast_mm function similarly to the MKL_DISABLE_FAST_MM environment variable (see Intel® MKL User's Guide for details). Call the mkl_disable_fast_mm function to make Intel MKL allocate and free memory from call to call. Note that disabling the Intel MKL memory management software negatively impacts performance of some Intel MKL routines, especially for small problem sizes.

The function return value 1 indicates that the Intel MKL memory management was turned off successfully. The function return value 0 indicates a failure.

mkl_mem_stat
Reports amount of memory utilized by Intel MKL memory management software.
Syntax
Fortran: AllocatedBytes = mkl_mem_stat( AllocatedBuffers )
C: AllocatedBytes = mkl_mem_stat( &AllocatedBuffers );

Include Files
• FORTRAN 77: mkl_service.fi
• C: mkl_service.h

Output Parameters
Name                Type                                  Description
AllocatedBytes      FORTRAN: INTEGER*8; C: MKL_INT64      Amount of allocated bytes
AllocatedBuffers    FORTRAN: INTEGER*4; C: int            Number of allocated buffers

Description
The function returns the amount of memory, in bytes, allocated in the AllocatedBuffers buffers. If there are no allocated buffers at the moment, the function returns 0. Call the mkl_mem_stat() function to check the Intel MKL memory status.

Note that after calling mkl_free_buffers there should not be any allocated buffers.

See Example "mkl_malloc(), mkl_free(), mkl_mem_stat() Usage".

NOTE MKL_MemStat is an obsolete name for the mkl_mem_stat function that is referenced in the library for back compatibility purposes but is deprecated and subject to removal in subsequent releases.

mkl_malloc
Allocates the aligned memory buffer.

Syntax
Fortran: a_ptr = mkl_malloc( alloc_size, alignment )
C: a_ptr = mkl_malloc( alloc_size, alignment );

Include Files
• FORTRAN 77: mkl_service.fi
• C: mkl_service.h

Input Parameters
Name          Type                               Description
alloc_size    FORTRAN: INTEGER*4; C: size_t      Size of the buffer to be allocated. Note that Fortran type INTEGER*4 is given for the 32-bit systems. Otherwise, it is INTEGER*8.
alignment     FORTRAN: INTEGER*4; C: int         Alignment of the allocated buffer

Output Parameters
Name     Type                            Description
a_ptr    FORTRAN: POINTER; C: void*      Pointer to the allocated buffer

Description
The function allocates a buffer of alloc_size bytes, aligned on the alignment boundary, and returns a pointer to this buffer. The function returns NULL if alloc_size < 1. If alignment is not a power of 2, the alignment 32 is used.

See Example "mkl_malloc(), mkl_free(), mkl_mem_stat() Usage".
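The two rules in the description (NULL for a size below 1, and a fallback to 32 when the alignment is not a power of 2) can be illustrated with a small standard-C sketch. This is not how Intel MKL implements mkl_malloc; the names effective_alignment, sketch_aligned_malloc, and sketch_aligned_free are hypothetical, and the over-allocation technique is simply one portable way to obtain an aligned block from malloc.

```c
#include <stdint.h>
#include <stdlib.h>

/* True when x is a positive power of two (1, 2, 4, 8, ...). */
static int is_power_of_two(int x)
{
    return x > 0 && (x & (x - 1)) == 0;
}

/* Literal reading of the documented rule: keep a power-of-two alignment,
   otherwise fall back to 32. */
int effective_alignment(int alignment)
{
    return is_power_of_two(alignment) ? alignment : 32;
}

/* Hypothetical stand-in: returns NULL for size < 1, otherwise a block whose
   address is a multiple of the effective alignment. */
void *sketch_aligned_malloc(size_t size, int alignment)
{
    size_t align;
    void *base;
    uintptr_t start, aligned;

    if (size < 1)
        return NULL;

    align = (size_t)effective_alignment(alignment);

    /* Over-allocate: alignment slack plus room to remember the base pointer. */
    base = malloc(size + align - 1 + sizeof(void *));
    if (base == NULL)
        return NULL;

    start = (uintptr_t)base + sizeof(void *);
    aligned = (start + align - 1) & ~(uintptr_t)(align - 1);
    ((void **)aligned)[-1] = base;      /* stored just below the aligned block */
    return (void *)aligned;
}

/* Releases a block obtained from sketch_aligned_malloc. */
void sketch_aligned_free(void *p)
{
    if (p != NULL)
        free(((void **)p)[-1]);
}
```

Under these assumptions, a request such as sketch_aligned_malloc(100, 48) yields a 32-byte aligned block, because 48 is not a power of 2. As with mkl_malloc and mkl_free, a block from the sketch allocator must be released with its matching free routine, not with plain free.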
mkl_free
Frees the aligned memory buffer allocated by mkl_malloc.

Syntax
Fortran: call mkl_free( a_ptr )
C: mkl_free( a_ptr );

Include Files
• FORTRAN 77: mkl_service.fi
• C: mkl_service.h

Input Parameters
Name     Type                            Description
a_ptr    FORTRAN: POINTER; C: void*      Pointer to the buffer to be freed

Description
The function frees the buffer pointed to by a_ptr and allocated by mkl_malloc().

See Example "mkl_malloc(), mkl_free(), mkl_mem_stat() Usage".

Examples of mkl_malloc(), mkl_free(), mkl_mem_stat() Usage

Usage Example in Fortran
      PROGRAM FOO
      REAL*8 A,B,C
      POINTER (A_PTR,A(1)), (B_PTR,B(1)), (C_PTR,C(1))
      INTEGER N, I
      REAL*8 ALPHA, BETA
      INTEGER*8 ALLOCATED_BYTES
      INTEGER*4 ALLOCATED_BUFFERS
#ifdef _SYSTEM_BITS32
      INTEGER*4 MKL_MALLOC
      INTEGER*4 ALLOC_SIZE
#else
      INTEGER*8 MKL_MALLOC
      INTEGER*8 ALLOC_SIZE
#endif
      INTEGER*8 MKL_MEM_STAT
      EXTERNAL MKL_MALLOC, MKL_FREE, MKL_MEM_STAT

      ALPHA = 1.1; BETA = -1.2
      N = 1000
      ALLOC_SIZE = 8*N*N
      A_PTR = MKL_MALLOC(ALLOC_SIZE,64)
      B_PTR = MKL_MALLOC(ALLOC_SIZE,64)
      C_PTR = MKL_MALLOC(ALLOC_SIZE,64)
      DO I=1,N*N
         A(I) = I
         B(I) = -I
         C(I) = 0.0
      END DO

      CALL DGEMM('N','N',N,N,N,ALPHA,A,N,B,N,BETA,C,N)

      ALLOCATED_BYTES = MKL_MEM_STAT(ALLOCATED_BUFFERS)
      PRINT *,'DGEMM uses ',ALLOCATED_BYTES,' bytes in ',
     $     ALLOCATED_BUFFERS,' buffers '

      CALL MKL_FREE_BUFFERS

      ALLOCATED_BYTES = MKL_MEM_STAT(ALLOCATED_BUFFERS)
      IF (ALLOCATED_BYTES > 0) THEN
         PRINT *,'MKL MEMORY LEAK!'
         PRINT *,'AFTER MKL_FREE_BUFFERS there are ',
     $        ALLOCATED_BYTES,' bytes in ',
     $        ALLOCATED_BUFFERS,' buffers'
      END IF

      CALL MKL_FREE(A_PTR)
      CALL MKL_FREE(B_PTR)
      CALL MKL_FREE(C_PTR)

      STOP
      END

Usage Example in C
#include <stdio.h>
#include <mkl.h>

int main(void) {
    double *a, *b, *c;
    int n, i;
    double alpha, beta;
    MKL_INT64 AllocatedBytes;
    int N_AllocatedBuffers;

    alpha = 1.1; beta = -1.2;
    n = 1000;
    a = (double*)mkl_malloc(n*n*sizeof(double),64);
    b = (double*)mkl_malloc(n*n*sizeof(double),64);
    c = (double*)mkl_malloc(n*n*sizeof(double),64);
    for (i=0;i<(n*n);i++) {
        a[i] = (double)(i+1);
        b[i] = (double)(-i-1);
        c[i] = 0.0;
    }

    dgemm("N","N",&n,&n,&n,&alpha,a,&n,b,&n,&beta,c,&n);

    AllocatedBytes = mkl_mem_stat(&N_AllocatedBuffers);
    printf("\nDGEMM uses %ld bytes in %d buffers",(long)AllocatedBytes,N_AllocatedBuffers);

    mkl_free_buffers();

    AllocatedBytes = mkl_mem_stat(&N_AllocatedBuffers);
    if (AllocatedBytes > 0) {
        printf("\nMKL memory leak!");
        printf("\nAfter mkl_free_buffers there are %ld bytes in %d buffers",
               (long)AllocatedBytes,N_AllocatedBuffers);
    }

    mkl_free(a);
    mkl_free(b);
    mkl_free(c);

    return 0;
}

Miscellaneous Utility Functions

mkl_progress
Provides progress information.
Syntax Fortran: stopflag = mkl_progress( thread, step, stage ) C: stopflag = mkl_progress( thread, step, stage, lstage ); Include Files • FORTRAN 77: mkl_service.fi • C: mkl_lapack.h and mkl_service.h Input Parameters Name Type Description thread FORTRAN: INTEGER*4 C: const int* FORTRAN: The number of the thread the progress routine is called from. 0 is passed for sequential code. 15 Intel® Math Kernel Library Reference Manual 2542 Name Type Description C: Pointer to the number of the thread the progress routine is called from. 0 is passed for sequential code. step FORTRAN: INTEGER*4 C: const int* FORTRAN: The linear progress indicator that shows the amount of work done. Increases from 0 to the linear size of the problem during the computation. C: Pointer to the linear progress indicator that shows the amount of work done. Increases from 0 to the linear size of the problem during the computation. stage FORTRAN: CHARACTER*(*) C: const char* Message indicating the name of the routine or the name of the computation stage the progress routine is called from. lstage C: int The length of a stage string excluding the trailing NULL character. Output Parameters Name Type Description stopflag FORTRAN: INTEGER C: int The stopping flag. A non-zero flag forces the routine to be interrupted. The zero flag is the default return value. Description The mkl_progress function is intended to track progress of a lengthy computation and/or interrupt the computation. By default this routine does nothing but the user application can redefine it to obtain the computation progress information. You can set it to perform certain operations during the routine computation, for instance, to print a progress indicator. A non-zero return value may be supplied by the redefined function to break the computation. The progress function mkl_progress is regularly called from some LAPACK and DSS/PARDISO functions during the computation. 
Refer to a specific LAPACK or DSS/PARDISO function description to see whether the function supports this feature or not.

Application Notes
Note that mkl_progress is a Fortran routine. To redefine the progress routine from C, the name must be spelled differently, parameters must be passed by reference, and an extra parameter giving the length of the stage string must be taken into account. The stage string is not terminated with the NULL character. The C interface of the progress routine is as follows:

int mkl_progress_( int* thread, int* step, char* stage, int lstage ); // Linux, Mac
int MKL_PROGRESS( int* thread, int* step, char* stage, int lstage ); // Windows

The following examples print progress information to standard output in Fortran and C:

Examples
Fortran example:

      integer function mkl_progress( thread, step, stage )
      integer*4 thread, step
      character*(*) stage
      print*,'Thread:',thread,',stage:',stage,',step:',step
      mkl_progress = 0
      return
      end

C example:

#include <stdio.h>
#include <string.h>
#define BUFLEN 16

int mkl_progress_( int* ithr, int* step, char* stage, int lstage )
{
    char buf[BUFLEN];
    /* stage is not NULL-terminated: copy at most BUFLEN-1 characters */
    if( lstage >= BUFLEN ) lstage = BUFLEN-1;
    strncpy( buf, stage, lstage );
    buf[lstage] = '\0';
    printf( "In thread %i, at stage %s, steps passed %i\n", *ithr, buf, *step );
    return 0;
}

mkl_enable_instructions
Allows dispatching Intel® Advanced Vector Extensions.

Syntax
Fortran:
irc = mkl_enable_instructions(MKL_AVX_ENABLE)
C:
irc = mkl_enable_instructions(MKL_AVX_ENABLE);

Include Files
• FORTRAN 77: mkl_service.fi
• C: mkl_service.h

Input Parameters
MKL_AVX_ENABLE   Parameter indicating which new instructions the user needs to enable.

Output Parameters
Name   Type                            Description
irc    FORTRAN: INTEGER*4, C: int      Value reflecting AVX usage status:
       =1 MKL uses the AVX code, if the hardware supports Intel® AVX.
       =0 The request is rejected. Most likely, mkl_enable_instructions has been called after another Intel MKL function.
Description
This function is currently void and deprecated but can be used in future Intel MKL releases.
NOTE Always remember to add #include "mkl.h" to use the C usage syntax.

Functions Supporting the Single Dynamic Library
Intel® MKL provides the Single Dynamic Library (SDL), which enables setting the interface and threading layer for Intel MKL at run time. See Intel® MKL User's Guide for details of SDL and layered model concept. This section describes the functions supporting SDL.

mkl_set_interface_layer
Sets the interface layer for Intel MKL at run time. Use with the Single Dynamic Library.

Syntax
Fortran:
interface = mkl_set_interface_layer( required_interface )
C:
interface = mkl_set_interface_layer( required_interface );

Include Files
• FORTRAN 77: mkl_service.fi
• C: mkl_service.h

Input Parameters
Name                 Type                         Description
required_interface   FORTRAN: INTEGER, C: int     Determines the interface layer. Possible values:
                     MKL_INTERFACE_LP64 for the LP64 interface.
                     MKL_INTERFACE_ILP64 for the ILP64 interface.
Description
If you are using the Single Dynamic Library (SDL), the mkl_set_interface_layer function sets LP64 or ILP64 interface for Intel MKL at run time. Call this function prior to calling any other Intel MKL function in your application except mkl_set_threading_layer. You can call mkl_set_interface_layer and mkl_set_threading_layer in any order.
The mkl_set_interface_layer function takes precedence over the MKL_INTERFACE_LAYER environment variable.
See Intel MKL User's Guide for the layered model concept and usage details of SDL.

mkl_set_threading_layer
Sets the threading layer for Intel MKL at run time. Use with the Single Dynamic Library (SDL).

Syntax
Fortran:
threading = mkl_set_threading_layer( required_threading )
C:
threading = mkl_set_threading_layer( required_threading );

Include Files
• FORTRAN 77: mkl_service.fi
• C: mkl_service.h

Input Parameters
Name                 Type                         Description
required_threading   FORTRAN: INTEGER, C: int     Determines the threading layer. Possible values:
                     MKL_THREADING_INTEL for Intel threading.
                     MKL_THREADING_SEQUENTIAL for the sequential mode of Intel MKL.
                     MKL_THREADING_PGI for PGI threading on Windows* or Linux* operating system only.
                     MKL_THREADING_GNU for GNU threading on Linux* operating system only.

Description
If you are using the Single Dynamic Library (SDL), the mkl_set_threading_layer function sets the specified threading layer for Intel MKL at run time. Call this function prior to calling any other Intel MKL function in your application except mkl_set_interface_layer. You can call mkl_set_threading_layer and mkl_set_interface_layer in any order.
The mkl_set_threading_layer function takes precedence over the MKL_THREADING_LAYER environment variable.
See Intel MKL User's Guide for the layered model concept and usage details of SDL.

mkl_set_xerbla
Replaces the error handling routine. Use with the Single Dynamic Library on Windows* OS.
Syntax
Fortran:
old_xerbla_ptr = mkl_set_xerbla( new_xerbla_ptr )
C:
old_xerbla_ptr = mkl_set_xerbla( new_xerbla_ptr );

Include Files
• FORTRAN 77: mkl_service.fi
• C: mkl_service.h

Input Parameters
Name             Type          Description
new_xerbla_ptr   XerblaEntry   Pointer to the error handling routine to be used.

Description
If you are linking with the Single Dynamic Library (SDL) mkl_rt.lib on Windows* OS, the mkl_set_xerbla function replaces the error handling routine that is called by Intel MKL functions with the routine specified by the parameter.
See Intel MKL User's Guide for details of SDL.

Return Values
The function returns the pointer to the replaced error handling routine.

See Also
xerbla

mkl_set_progress
Replaces the progress information routine. Use with the Single Dynamic Library (SDL) on Windows* OS.

Syntax
Fortran:
old_progress_ptr = mkl_set_progress( new_progress_ptr )
C:
old_progress_ptr = mkl_set_progress( new_progress_ptr );

Include Files
• FORTRAN 77: mkl_service.fi
• C: mkl_service.h

Input Parameters
Name               Type            Description
new_progress_ptr   ProgressEntry   Pointer to the progress information routine to be used.

Description
If you are linking with the Single Dynamic Library (SDL) mkl_rt.lib on Windows* OS, the mkl_set_progress function replaces the currently used progress information routine with the routine specified by the parameter.
See Intel MKL User's Guide for details of SDL.

Return Values
The function returns the pointer to the replaced progress information routine.

See Also
mkl_progress

BLACS Routines 16
This chapter describes the Intel® Math Kernel Library implementation of FORTRAN 77 routines from the BLACS (Basic Linear Algebra Communication Subprograms) package.
These routines are used to support a linear algebra oriented message passing interface that may be implemented efficiently and uniformly across a large range of distributed memory platforms.
The BLACS routines make linear algebra applications both easier to program and more portable. For this purpose, they are used in Intel MKL intended for the Linux* and Windows* OSs as the communication layer of ScaLAPACK and Cluster FFT.
On computers, a linear algebra matrix is represented by a two dimensional array (2D array), and therefore the BLACS operate on 2D arrays. The basic matrix shapes are described in the Matrix Shapes section.
The BLACS routines implemented in Intel MKL are of four categories:
• Combines
• Point to Point Communication
• Broadcast
• Support.
The Combines take data distributed over processes and combine the data to produce a result. The Point to Point routines are intended for point-to-point communication and Broadcast routines send data possessed by one process to all processes within a scope. The Support routines perform distinct tasks that can be used for initialization, destruction, information, and miscellaneous tasks.

Matrix Shapes
The BLACS routines recognize the two most common classes of matrices for dense linear algebra. The first of these classes consists of general rectangular matrices, which in machine storage are 2D arrays consisting of m rows and n columns, with a leading dimension, lda, that determines the distance between successive columns in memory.
The general rectangular matrices take the following parameters as input when determining what array to operate on:
m     (input) INTEGER. The number of matrix rows to be operated on.
n     (input) INTEGER. The number of matrix columns to be operated on.
a     (input/output) TYPE (depends on routine), array of dimension (lda,n). A pointer to the beginning of the (sub)array to be sent.
lda   (input) INTEGER. The distance between two elements in matrix row.
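The role of lda can be made concrete with a short C sketch. It assumes the column-major (Fortran) layout implied by the parameter descriptions above and uses zero-based indices; elem_offset and submatrix_sum are hypothetical helper names introduced for illustration and are not BLACS routines.

```c
#include <assert.h>
#include <stddef.h>

/* Offset of element (i, j), zero-based, in a column-major array whose
   leading dimension is lda. Two neighbours in the same row, (i, j) and
   (i, j+1), are exactly lda elements apart. */
size_t elem_offset(int i, int j, int lda)
{
    assert(i >= 0 && j >= 0 && lda > i);
    return (size_t)i + (size_t)j * (size_t)lda;
}

/* Sum the m x n submatrix that starts at row i0, column j0 of a parent
   array with leading dimension lda. The pointer "a" plays the role of
   the argument a: it points at the first element of the (sub)array,
   while lda still refers to the parent array. */
double submatrix_sum(const double *parent, int lda,
                     int i0, int j0, int m, int n)
{
    const double *a = parent + elem_offset(i0, j0, lda);
    double sum = 0.0;
    int i, j;

    for (j = 0; j < n; j++)
        for (i = 0; i < m; i++)
            sum += a[elem_offset(i, j, lda)];
    return sum;
}
```

For a 4 x 3 parent array filled with 0, 1, ..., 11 in storage order, the 2 x 2 block starting at (1, 1) holds 5, 6, 9, and 10, so submatrix_sum returns 30. The same m, n, a, lda quadruple is what a routine taking a general rectangular matrix would receive for that block.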
The second class of matrices recognized by the BLACS are trapezoidal matrices (triangular matrices are a sub-class of trapezoidal). Trapezoidal arrays are defined by m, n, and lda, as above, but they have two additional parameters as well. These parameters are:
uplo   (input) CHARACTER*1. Indicates whether the matrix is upper or lower trapezoidal, as discussed below.
diag   (input) CHARACTER*1. Indicates whether the diagonal of the matrix is unit diagonal (will not be operated on) or otherwise (will be operated on).
The shape of the trapezoidal arrays is determined by these parameters as follows:
[Figure: Trapezoidal Arrays Shapes]
The packing of arrays, if required, so that they may be sent efficiently is hidden, allowing the user to concentrate on the logical matrix, rather than on how the data is organized in the system memory.

BLACS Combine Operations
This section describes BLACS routines that combine the data to produce a result.
In a combine operation, each participating process contributes data that is combined with other processes' data to produce a result. This result can be given to a particular process (called the destination process), or to all participating processes. If the result is given to only one process, the operation is referred to as a leave-on-one combine, and if the result is given to all participating processes the operation is referred to as a leave-on-all combine.
At present, three kinds of combines are supported. They are:
• element-wise summation
• element-wise absolute value maximization
• element-wise absolute value minimization
of general rectangular arrays.
Note that a combine operation combines data between processes. By definition, a combine performed across a scope of only one process does not change the input data. This is why the operations (max/min/sum) are specified as element-wise.
Element-wise indicates that each element of the input array will be combined with the corresponding element from all other processes' arrays to produce the result. Thus, a 4 x 2 array of inputs produces a 4 x 2 answer array.
When the max/min comparison is being performed, absolute value is used. For example, -5 and 5 are equivalent. However, the returned value is unchanged; that is, it is not the absolute value, but is a signed value instead. Therefore, if you performed a BLACS absolute value maximum combine on the numbers -5, 3, 1, 8 the result would be 8.
The initial symbol ? in the routine names below masks the data type:
i   integer
s   single precision real
d   double precision real
c   single precision complex
z   double precision complex.

BLACS Combines
Routine name   Results of operation
gamx2d         Entries of result matrix will have the value of the greatest absolute value found in that position.
gamn2d         Entries of result matrix will have the value of the smallest absolute value found in that position.
gsum2d         Entries of result matrix will have the summation of that position.

?gamx2d
Performs element-wise absolute value maximization.

Syntax
call igamx2d( icontxt, scope, top, m, n, a, lda, ra, ca, rcflag, rdest, cdest )
call sgamx2d( icontxt, scope, top, m, n, a, lda, ra, ca, rcflag, rdest, cdest )
call dgamx2d( icontxt, scope, top, m, n, a, lda, ra, ca, rcflag, rdest, cdest )
call cgamx2d( icontxt, scope, top, m, n, a, lda, ra, ca, rcflag, rdest, cdest )
call zgamx2d( icontxt, scope, top, m, n, a, lda, ra, ca, rcflag, rdest, cdest )

Input Parameters
icontxt   INTEGER. Integer handle that indicates the context.
scope     CHARACTER*1. Indicates what scope the combine should proceed on. Limited to ROW, COLUMN, or ALL.
top       CHARACTER*1. Communication pattern to use during the combine operation.
m         INTEGER. The number of matrix rows to be combined.
n         INTEGER. The number of matrix columns to be combined.
a         TYPE array (lda, n).
Matrix to be compared with to produce the maximum.
lda       INTEGER. The leading dimension of the matrix A, that is, the distance between two successive elements in a matrix row.
rcflag    INTEGER. If rcflag = -1, the arrays ra and ca are not referenced and need not exist. Otherwise, rcflag indicates the leading dimension of these arrays, and so must be >= m.
rdest     INTEGER. The process row coordinate of the process that should receive the result. If rdest or cdest = -1, all processes within the indicated scope receive the answer.
cdest     INTEGER. The process column coordinate of the process that should receive the result. If rdest or cdest = -1, all processes within the indicated scope receive the answer.

Output Parameters
a    TYPE array (lda, n). Contains the result if this process is selected to receive the answer, or intermediate results if the process is not selected to receive the result.
ra   INTEGER array (rcflag, n). If rcflag = -1, this array will not be referenced, and need not exist. Otherwise, it is an integer array (of size at least rcflag x n) indicating the row index of the process that provided the maximum. If the calling process is not selected to receive the result, this array will contain intermediate (useless) results.
ca   INTEGER array (rcflag, n). If rcflag = -1, this array will not be referenced, and need not exist. Otherwise, it is an integer array (of size at least rcflag x n) indicating the column index of the process that provided the maximum. If the calling process is not selected to receive the result, this array will contain intermediate (useless) results.

Description
This routine performs element-wise absolute value maximization, that is, each element of matrix A is compared with the corresponding element of the other processes' matrices. Note that the value of A is returned, but the absolute value is used to determine the maximum (the 1-norm is used for complex numbers).
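The comparison rule described above can be modelled in a single process, which may help when checking expected results by hand. The sketch below is only a model under stated assumptions: amx_combine and absval are hypothetical names, no message passing takes place, real data only is handled, and a single owner index stands in for the ra/ca pair of process coordinates.

```c
#include <assert.h>

/* Absolute value without pulling in the math library. */
static double absval(double x)
{
    return x < 0.0 ? -x : x;
}

/* Element-wise absolute-value maximum over nproc local arrays of count
   elements each. local[p] is the array contributed by "process" p.
   result[k] receives the signed value of greatest magnitude and
   owner[k] the index p of the process that supplied it. */
void amx_combine(const double *const *local, int nproc, int count,
                 double *result, int *owner)
{
    int p, k;

    assert(nproc > 0 && count >= 0);
    for (k = 0; k < count; k++) {
        result[k] = local[0][k];
        owner[k] = 0;
        for (p = 1; p < nproc; p++) {
            /* Compare magnitudes, but keep the original sign. */
            if (absval(local[p][k]) > absval(result[k])) {
                result[k] = local[p][k];
                owner[k] = p;
            }
        }
    }
}
```

With four processes contributing {-5, 1}, {3, -9}, {1, 2}, and {8, 4}, the model yields 8 from process 3 in the first position and -9 from process 1 in the second, showing that the sign of the winning element is preserved. Ties in magnitude, such as -5 against 5, are resolved here in favour of the lower process index; that tie-breaking choice is an assumption of the sketch and is not specified in this section.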
Combines may be globally-blocking, so they must be programmed as if no process returns until all have called the routine.

See Also
BLACS Routines Usage Example

?gamn2d
Performs element-wise absolute value minimization.

Syntax
call igamn2d( icontxt, scope, top, m, n, a, lda, ra, ca, rcflag, rdest, cdest )
call sgamn2d( icontxt, scope, top, m, n, a, lda, ra, ca, rcflag, rdest, cdest )
call dgamn2d( icontxt, scope, top, m, n, a, lda, ra, ca, rcflag, rdest, cdest )
call cgamn2d( icontxt, scope, top, m, n, a, lda, ra, ca, rcflag, rdest, cdest )
call zgamn2d( icontxt, scope, top, m, n, a, lda, ra, ca, rcflag, rdest, cdest )

Input Parameters
icontxt   INTEGER. Integer handle that indicates the context.
scope     CHARACTER*1. Indicates what scope the combine should proceed on. Limited to ROW, COLUMN, or ALL.
top       CHARACTER*1. Communication pattern to use during the combine operation.
m         INTEGER. The number of matrix rows to be combined.
n         INTEGER. The number of matrix columns to be combined.
a         TYPE array (lda, n). Matrix to be compared with to produce the minimum.
lda       INTEGER. The leading dimension of the matrix A, that is, the distance between two successive elements in a matrix row.
rcflag    INTEGER. If rcflag = -1, the arrays ra and ca are not referenced and need not exist. Otherwise, rcflag indicates the leading dimension of these arrays, and so must be >= m.
rdest     INTEGER. The process row coordinate of the process that should receive the result. If rdest or cdest = -1, all processes within the indicated scope receive the answer.
cdest     INTEGER. The process column coordinate of the process that should receive the result. If rdest or cdest = -1, all processes within the indicated scope receive the answer.

Output Parameters
a    TYPE array (lda, n). Contains the result if this process is selected to receive the answer, or intermediate results if the process is not selected to receive the result.
ra   INTEGER array (rcflag, n).
If rcflag = -1, this array will not be referenced, and need not exist. Otherwise, it is an integer array (of size at least rcflag x n) indicating the row index of the process that provided the minimum. If the calling process is not selected to receive the result, this array will contain intermediate (useless) results.
ca   INTEGER array (rcflag, n). If rcflag = -1, this array will not be referenced, and need not exist. Otherwise, it is an integer array (of size at least rcflag x n) indicating the column index of the process that provided the minimum. If the calling process is not selected to receive the result, this array will contain intermediate (useless) results.

Description
This routine performs element-wise absolute value minimization, that is, each element of matrix A is compared with the corresponding element of the other processes' matrices. Note that the value of A is returned, but the absolute value is used to determine the minimum (the 1-norm is used for complex numbers).
Combines may be globally-blocking, so they must be programmed as if no process returns until all have called the routine.

See Also
BLACS Routines Usage Example

?gsum2d
Performs element-wise summation.

Syntax
call igsum2d( icontxt, scope, top, m, n, a, lda, rdest, cdest )
call sgsum2d( icontxt, scope, top, m, n, a, lda, rdest, cdest )
call dgsum2d( icontxt, scope, top, m, n, a, lda, rdest, cdest )
call cgsum2d( icontxt, scope, top, m, n, a, lda, rdest, cdest )
call zgsum2d( icontxt, scope, top, m, n, a, lda, rdest, cdest )

Input Parameters
icontxt   INTEGER. Integer handle that indicates the context.
scope     CHARACTER*1. Indicates what scope the combine should proceed on. Limited to ROW, COLUMN, or ALL.
top       CHARACTER*1. Communication pattern to use during the combine operation.
m         INTEGER. The number of matrix rows to be combined.
n         INTEGER. The number of matrix columns to be combined.
a         TYPE array (lda, n). Matrix to be added to produce the sum.
lda       INTEGER.
The leading dimension of the matrix A, that is, the distance between two successive elements in a matrix row.
rdest     INTEGER. The process row coordinate of the process that should receive the result. If rdest or cdest = -1, all processes within the indicated scope receive the answer.
cdest     INTEGER. The process column coordinate of the process that should receive the result. If rdest or cdest = -1, all processes within the indicated scope receive the answer.

Output Parameters
a   TYPE array (lda, n). Contains the result if this process is selected to receive the answer, or intermediate results if the process is not selected to receive the result.

Description
This routine performs element-wise summation, that is, each element of matrix A is summed with the corresponding element of the other processes' matrices. Combines may be globally-blocking, so they must be programmed as if no process returns until all have called the routine.

See Also
BLACS Routines Usage Example

BLACS Point To Point Communication
This section describes BLACS routines for point to point communication.
Point to point communication requires two complementary operations. The send operation produces a message that is then consumed by the receive operation. These operations have various resources associated with them. The main such resource is the buffer that holds the data to be sent or serves as the area where the incoming data is to be received. The level of blocking indicates what correlation the return from a send/receive operation has with the availability of these resources and with the status of the message.

Non-blocking
The return from the send or receive operations does not imply that the resources may be reused, that the message has been sent/received or that the complementary operation has been called. Return means only that the send/receive has been started, and will be completed at some later time. Polling is required to determine when the operation has finished.
In non-blocking message passing, the concept of communication/computation overlap (abbreviated C/C overlap) is important. If a system possesses C/C overlap, independent computation can occur at the same time as communication. That means a non-blocking operation can be posted, and unrelated work can be done while the message is sent/received in parallel. If C/C overlap is not present, after returning from the routine call, computation will be interrupted at some later time when the message is actually sent or received.

Locally-blocking
Return from the send or receive operations indicates that the resources may be reused. However, since this only depends on local information, it is unknown whether the complementary operation has been called. There are no locally-blocking receives: the send must be completed before the receive buffer is available for re-use.
If a receive has not been posted at the time a locally-blocking send is issued, buffering will be required to avoid losing the message. Buffering can be done on the sending process, the receiving process, or not done at all, losing the message.

Globally-blocking
Return from a globally-blocking procedure indicates that the operation resources may be reused, and that the complement of the operation has at least been posted. Since the receive has been posted, there is no buffering required for globally-blocking sends: the message is always sent directly into the user's receive buffer.

Almost all processors support non-blocking communication, as well as some other level of blocking sends. What level of blocking the send possesses varies between platforms. For instance, the Intel® processors support locally-blocking sends, with buffering done on the receiving process. This is a very important distinction, because codes written assuming locally-blocking sends will hang on platforms with globally-blocking sends.
Below is a simple example of how this can occur:

IAM = MY_PROCESS_ID()
IF (IAM .EQ. 0) THEN
   SEND TO PROCESS 1
   RECV FROM PROCESS 1
ELSE IF (IAM .EQ. 1) THEN
   SEND TO PROCESS 0
   RECV FROM PROCESS 0
END IF

If the send is globally-blocking, process 0 enters the send, and waits for process 1 to start its receive before continuing. In the meantime, process 1 starts to send to 0, and waits for 0 to receive before continuing. Both processes are now waiting on each other, and the program will never continue.
The solution for this case is obvious. One of the processes simply reverses the order of its communication calls and the hang is avoided. However, when the communication is not just between two processes, but rather involves a hierarchy of processes, determining how to avoid this kind of difficulty can become problematic.
For this reason, it was decided the BLACS would support locally-blocking sends. On systems natively supporting globally-blocking sends, non-blocking sends coupled with buffering are used to simulate locally-blocking sends. The BLACS support globally-blocking receives.
In addition, the BLACS specify that point to point messages between two given processes will be strictly ordered. If process 0 sends three messages (label them A, B, and C) to process 1, process 1 must receive A before it can receive B, and message C can be received only after both A and B. The main reason for this restriction is that it allows for the computation of message identifiers.
Note, however, that messages from different processes are not ordered. If processes 0, . . ., 3 send messages A, . . ., D to process 4, process 4 may receive these messages in any order that is convenient.

Convention
The convention used in the communication routine names follows the template ?xxyy2d, where the letter in the ?
position indicates the data type being sent, xx is replaced to indicate the shape of the matrix, and the yy positions are used to indicate the type of communication to perform:
i    integer
s    single precision real
d    double precision real
c    single precision complex
z    double precision complex
ge   The data to be communicated is stored in a general rectangular matrix.
tr   The data to be communicated is stored in a trapezoidal matrix.
sd   Send. One process sends to another.
rv   Receive. One process receives from another.

BLACS Point To Point Communication
Routine name     Operation performed
gesd2d, trsd2d   Take the indicated matrix and send it to the destination process.
gerv2d, trrv2d   Receive a message from the process into the matrix.

As a simple example, the pseudo code given above is rewritten below in terms of the BLACS. It is further specified that the data being exchanged is the double precision vector X, which is 5 elements long, and that ICONTXT holds a valid context handle.

CALL BLACS_GRIDINFO(ICONTXT, NPROW, NPCOL, MYPROW, MYPCOL)
IF (MYPROW.EQ.0 .AND. MYPCOL.EQ.0) THEN
   CALL DGESD2D(ICONTXT, 5, 1, X, 5, 1, 0)
   CALL DGERV2D(ICONTXT, 5, 1, X, 5, 1, 0)
ELSE IF (MYPROW.EQ.1 .AND. MYPCOL.EQ.0) THEN
   CALL DGESD2D(ICONTXT, 5, 1, X, 5, 0, 0)
   CALL DGERV2D(ICONTXT, 5, 1, X, 5, 0, 0)
END IF

?gesd2d
Takes a general rectangular matrix and sends it to the destination process.

Syntax
call igesd2d( icontxt, m, n, a, lda, rdest, cdest )
call sgesd2d( icontxt, m, n, a, lda, rdest, cdest )
call dgesd2d( icontxt, m, n, a, lda, rdest, cdest )
call cgesd2d( icontxt, m, n, a, lda, rdest, cdest )
call zgesd2d( icontxt, m, n, a, lda, rdest, cdest )

Input Parameters
icontxt        INTEGER. Integer handle that indicates the context.
m, n, a, lda   Describe the matrix to be sent. See Matrix Shapes for details.
rdest          INTEGER. The process row coordinate of the process to send the message to.
cdest          INTEGER. The process column coordinate of the process to send the message to.
Description
This routine takes the indicated general rectangular matrix and sends it to the destination process located at {RDEST, CDEST} in the process grid. Return from the routine indicates that the buffer (the matrix A) may be reused. The routine is locally-blocking, that is, it will return even if the corresponding receive is not posted.

See Also
BLACS Routines Usage Example

?trsd2d
Takes a trapezoidal matrix and sends it to the destination process.

Syntax
call itrsd2d( icontxt, uplo, diag, m, n, a, lda, rdest, cdest )
call strsd2d( icontxt, uplo, diag, m, n, a, lda, rdest, cdest )
call dtrsd2d( icontxt, uplo, diag, m, n, a, lda, rdest, cdest )
call ctrsd2d( icontxt, uplo, diag, m, n, a, lda, rdest, cdest )
call ztrsd2d( icontxt, uplo, diag, m, n, a, lda, rdest, cdest )

Input Parameters
icontxt                    INTEGER. Integer handle that indicates the context.
uplo, diag, m, n, a, lda   Describe the matrix to be sent. See Matrix Shapes for details.
rdest                      INTEGER. The process row coordinate of the process to send the message to.
cdest                      INTEGER. The process column coordinate of the process to send the message to.

Description
This routine takes the indicated trapezoidal matrix and sends it to the destination process located at {RDEST, CDEST} in the process grid. Return from the routine indicates that the buffer (the matrix A) may be reused. The routine is locally-blocking, that is, it will return even if the corresponding receive is not posted.

?gerv2d
Receives a message from the process into the general rectangular matrix.

Syntax
call igerv2d( icontxt, m, n, a, lda, rsrc, csrc )
call sgerv2d( icontxt, m, n, a, lda, rsrc, csrc )
call dgerv2d( icontxt, m, n, a, lda, rsrc, csrc )
call cgerv2d( icontxt, m, n, a, lda, rsrc, csrc )
call zgerv2d( icontxt, m, n, a, lda, rsrc, csrc )

Input Parameters
icontxt     INTEGER. Integer handle that indicates the context.
m, n, lda   Describe the matrix to be received. See Matrix Shapes for details.
rsrc        INTEGER.
The process row coordinate of the source of the message.
csrc        INTEGER. The process column coordinate of the source of the message.

Output Parameters
a   An array of dimension (lda,n) to receive the incoming message into.

Description
This routine receives a message from process {RSRC, CSRC} into the general rectangular matrix A. This routine is globally-blocking, that is, return from the routine indicates that the message has been received into A.

See Also
BLACS Routines Usage Example

?trrv2d
Receives a message from the process into the trapezoidal matrix.

Syntax
call itrrv2d( icontxt, uplo, diag, m, n, a, lda, rsrc, csrc )
call strrv2d( icontxt, uplo, diag, m, n, a, lda, rsrc, csrc )
call dtrrv2d( icontxt, uplo, diag, m, n, a, lda, rsrc, csrc )
call ctrrv2d( icontxt, uplo, diag, m, n, a, lda, rsrc, csrc )
call ztrrv2d( icontxt, uplo, diag, m, n, a, lda, rsrc, csrc )

Input Parameters
icontxt                 INTEGER. Integer handle that indicates the context.
uplo, diag, m, n, lda   Describe the matrix to be received. See Matrix Shapes for details.
rsrc                    INTEGER. The process row coordinate of the source of the message.
csrc                    INTEGER. The process column coordinate of the source of the message.

Output Parameters
a   An array of dimension (lda,n) to receive the incoming message into.

Description
This routine receives a message from process {RSRC, CSRC} into the trapezoidal matrix A. This routine is globally-blocking, that is, return from the routine indicates that the message has been received into A.

BLACS Broadcast Routines
This section describes BLACS broadcast routines.
A broadcast sends data possessed by one process to all processes within a scope. Broadcast, much like point to point communication, has two complementary operations. The process that owns the data to be broadcast issues a broadcast/send. All processes within the same scope must then issue the complementary broadcast/receive.
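Which processes fall "within the same scope" follows from the process grid coordinates, and a small C sketch can make that concrete. It assumes the usual BLACS meaning of scope: a row scope covers the sender's process row, a column scope covers its process column, and an all scope covers the whole grid. The names in_scope and receivers_needed are hypothetical and are not BLACS routines.

```c
#include <assert.h>

/* Returns 1 if process (r, c) takes part in a scoped operation issued by
   process (r0, c0), and 0 otherwise. scope is matched on its first
   letter only: 'R' (row), 'C' (column) or 'A' (all), in either case. */
int in_scope(char scope, int r0, int c0, int r, int c)
{
    switch (scope) {
    case 'R': case 'r': return r == r0;
    case 'C': case 'c': return c == c0;
    case 'A': case 'a': return 1;
    default:
        assert(!"unknown scope");
        return 0;
    }
}

/* Number of broadcast/receive calls that must be issued on an
   nprow x npcol grid when process (r0, c0) issues the broadcast/send:
   every process in the scope except the sender itself. */
int receivers_needed(char scope, int nprow, int npcol, int r0, int c0)
{
    int r, c, count = 0;

    for (r = 0; r < nprow; r++)
        for (c = 0; c < npcol; c++)
            if (in_scope(scope, r0, c0, r, c) && !(r == r0 && c == c0))
                count++;
    return count;
}
```

On a 2 x 3 grid with the sender at (0, 1), a row broadcast needs 2 matching receives, a column broadcast needs 1, and a broadcast over the whole grid needs 5.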
The BLACS define that both broadcast/send and broadcast/receive are globally-blocking. Broadcast/receives cannot be locally-blocking since they must post a receive. Note that receives cannot be locally-blocking. The point at which a given process can leave a broadcast/receive operation depends on the topology, so, to avoid a hang as the topology is varied, the broadcast/receive must be treated as if no process can leave until all processes have called the operation.
Broadcast/sends could be defined to be locally-blocking. Since no information is being received, as long as locally-blocking point to point sends are used, the broadcast/send will be locally-blocking. However, defining one process within a scope to be locally-blocking while all other processes are globally-blocking adds little to the programmability of the code. On the other hand, leaving the option open to have globally-blocking broadcast/sends may allow for optimization on some platforms.
The fact that broadcasts are defined as globally-blocking has several important implications. The first is that scoped operations (broadcasts or combines) must be strictly ordered, that is, all processes within a scope must agree on the order of calls to separate scoped operations. This constraint falls in line with that already in place for the computation of message IDs, and is present in point to point communication as well.
A less obvious result is that scoped operations with SCOPE = 'ALL' must be ordered with respect to any other scoped operation. This means that if there are two broadcasts to be done, one along a column, and one involving the entire process grid, all processes within the process column issuing the column broadcast must agree on which broadcast will be performed first.
The convention used in the communication routine names follows the template ?xxyy2d, where the letter in the ?
position indicates the data type being sent, xx is replaced to indicate the shape of the matrix, and the yy positions are used to indicate the type of communication to perform: i integer s single precision real d double precision real c single precision complex z double precision complex ge The data to be communicated is stored in a general rectangular matrix. tr The data to be communicated is stored in a trapezoidal matrix. bs Broadcast/send. A process begins the broadcast of data within a scope. br Broadcast/receive A process receives and participates in the broadcast of data within a scope. BLACS Broadcast Routines Routine name Operation performed gebs2d trbs2d Start a broadcast along a scope. gebr2d trbr2d Receive and participate in a broadcast along a scope. Optimization Notice Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for BLACS Routines 16 2559 Optimization Notice use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804 ?gebs2d Starts a broadcast along a scope for a general rectangular matrix. Syntax call igebs2d( icontxt, scope, top, m, n, a, lda ) call sgebs2d( icontxt, scope, top, m, n, a, lda ) call dgebs2d( icontxt, scope, top, m, n, a, lda ) call cgebs2d( icontxt, scope, top, m, n, a, lda ) call zgebs2d( icontxt, scope, top, m, n, a, lda ) Input Parameters icontxt INTEGER. 
Integer handle that indicates the context.
scope CHARACTER*1. Indicates what scope the broadcast should proceed on. Limited to 'Row', 'Column', or 'All'.
top CHARACTER*1. Indicates the communication pattern to use for the broadcast.
m, n, a, lda Describe the matrix to be sent. See Matrix Shapes for details.

Description
This routine starts a broadcast along a scope. All other processes within the scope must call broadcast/receive for the broadcast to proceed. At the end of a broadcast, all processes within the scope will possess the data in the general rectangular matrix A.
Broadcasts may be globally-blocking. This means no process is guaranteed to return from a broadcast until all processes in the scope have called the appropriate routine (broadcast/send or broadcast/receive).

See Also
BLACS Routines Usage Example

?trbs2d
Starts a broadcast along a scope for a trapezoidal matrix.

Syntax
call itrbs2d( icontxt, scope, top, uplo, diag, m, n, a, lda )
call strbs2d( icontxt, scope, top, uplo, diag, m, n, a, lda )
call dtrbs2d( icontxt, scope, top, uplo, diag, m, n, a, lda )
call ctrbs2d( icontxt, scope, top, uplo, diag, m, n, a, lda )
call ztrbs2d( icontxt, scope, top, uplo, diag, m, n, a, lda )

Input Parameters
icontxt INTEGER. Integer handle that indicates the context.
scope CHARACTER*1. Indicates what scope the broadcast should proceed on. Limited to 'Row', 'Column', or 'All'.
top CHARACTER*1. Indicates the communication pattern to use for the broadcast.
uplo, diag, m, n, a, lda Describe the matrix to be sent. See Matrix Shapes for details.

Description
This routine starts a broadcast along a scope. All other processes within the scope must call broadcast/receive for the broadcast to proceed. At the end of a broadcast, all processes within the scope will possess the data in the trapezoidal matrix A.
Broadcasts may be globally-blocking.
This means no process is guaranteed to return from a broadcast until all processes in the scope have called the appropriate routine (broadcast/send or broadcast/receive).

?gebr2d
Receives and participates in a broadcast along a scope for a general rectangular matrix.

Syntax
call igebr2d( icontxt, scope, top, m, n, a, lda, rsrc, csrc )
call sgebr2d( icontxt, scope, top, m, n, a, lda, rsrc, csrc )
call dgebr2d( icontxt, scope, top, m, n, a, lda, rsrc, csrc )
call cgebr2d( icontxt, scope, top, m, n, a, lda, rsrc, csrc )
call zgebr2d( icontxt, scope, top, m, n, a, lda, rsrc, csrc )

Input Parameters
icontxt INTEGER. Integer handle that indicates the context.
scope CHARACTER*1. Indicates what scope the broadcast should proceed on. Limited to 'Row', 'Column', or 'All'.
top CHARACTER*1. Indicates the communication pattern to use for the broadcast.
m, n, lda Describe the matrix to be sent. See Matrix Shapes for details.
rsrc INTEGER. The process row coordinate of the process that called broadcast/send.
csrc INTEGER. The process column coordinate of the process that called broadcast/send.

Output Parameters
a An array of dimension (lda,n) to receive the incoming message into.

Description
This routine receives and participates in a broadcast along a scope. At the end of a broadcast, all processes within the scope will possess the data in the general rectangular matrix A.
Broadcasts may be globally-blocking. This means no process is guaranteed to return from a broadcast until all processes in the scope have called the appropriate routine (broadcast/send or broadcast/receive).

See Also
BLACS Routines Usage Example

?trbr2d
Receives and participates in a broadcast along a scope for a trapezoidal matrix.
Syntax
call itrbr2d( icontxt, scope, top, uplo, diag, m, n, a, lda, rsrc, csrc )
call strbr2d( icontxt, scope, top, uplo, diag, m, n, a, lda, rsrc, csrc )
call dtrbr2d( icontxt, scope, top, uplo, diag, m, n, a, lda, rsrc, csrc )
call ctrbr2d( icontxt, scope, top, uplo, diag, m, n, a, lda, rsrc, csrc )
call ztrbr2d( icontxt, scope, top, uplo, diag, m, n, a, lda, rsrc, csrc )

Input Parameters
icontxt INTEGER. Integer handle that indicates the context.
scope CHARACTER*1. Indicates what scope the broadcast should proceed on. Limited to 'Row', 'Column', or 'All'.
top CHARACTER*1. Indicates the communication pattern to use for the broadcast.
uplo, diag, m, n, lda Describe the matrix to be sent. See Matrix Shapes for details.
rsrc INTEGER. The process row coordinate of the process that called broadcast/send.
csrc INTEGER. The process column coordinate of the process that called broadcast/send.

Output Parameters
a An array of dimension (lda,n) to receive the incoming message into.

Description
This routine receives and participates in a broadcast along a scope. At the end of a broadcast, all processes within the scope will possess the data in the trapezoidal matrix A.
Broadcasts may be globally-blocking. This means no process is guaranteed to return from a broadcast until all processes in the scope have called the appropriate routine (broadcast/send or broadcast/receive).

BLACS Support Routines
The support routines perform distinct tasks that can be used for:
• Initialization
• Destruction
• Information Purposes
• Miscellaneous Tasks.

Initialization Routines
This section describes BLACS routines that deal with grid/context creation, and processing before the grid/context has been defined.

BLACS Initialization Routines
Routine name      Operation performed
blacs_pinfo       Returns the number of processes available for use.
blacs_setup       Allocates virtual machine and spawns processes.
blacs_get         Gets values that BLACS use for internal defaults.
blacs_set         Sets values that BLACS use for internal defaults.
blacs_gridinit    Assigns available processes into BLACS process grid.
blacs_gridmap     Maps available processes into BLACS process grid.

blacs_pinfo
Returns the number of processes available for use.

Syntax
call blacs_pinfo( mypnum, nprocs )

Output Parameters
mypnum INTEGER. An integer between 0 and (nprocs - 1) that uniquely identifies each process.
nprocs INTEGER. The number of processes available for BLACS use.

Description
This routine is used when some initial system information is required before the BLACS are set up. On all platforms except PVM, nprocs is the actual number of processes available for use, that is, nprow * npcol <= nprocs. In PVM, the virtual machine may not have been set up before this call, and therefore no parallel machine exists. In this case, nprocs is returned as less than one. If a process has been spawned via the keyboard, it receives mypnum of 0, and all other processes get mypnum of -1. As a result, the user can distinguish between processes. This routine returns the correct values for mypnum and nprocs only after the virtual machine has been set up via a call to blacs_setup.

See Also
BLACS Routines Usage Example

blacs_setup
Allocates virtual machine and spawns processes.

Syntax
call blacs_setup( mypnum, nprocs )

Input Parameters
nprocs INTEGER. On the process spawned from the keyboard rather than from pvmspawn, this parameter indicates the number of processes to create when building the virtual machine.

Output Parameters
mypnum INTEGER. An integer between 0 and (nprocs - 1) that uniquely identifies each process.
nprocs INTEGER. For all processes other than the one spawned from the keyboard, this parameter means the number of processes available for BLACS use.

Description
This routine only accomplishes meaningful work in the PVM BLACS. On all other platforms, it is functionally equivalent to blacs_pinfo.
The BLACS assume a static system, that is, the given number of processes does not change. PVM supplies a dynamic system, allowing processes to be added to the system on the fly.

blacs_setup is used to allocate the virtual machine and spawn off processes. It reads in a file called blacs_setup.dat, in which the first line must be the name of your executable. The second line is optional, but if it exists, it should be a PVM spawn flag. Legal values at this time are 0 (PvmTaskDefault), 4 (PvmTaskDebug), 8 (PvmTaskTrace), and 12 (PvmTaskDebug + PvmTaskTrace). The primary reason for this line is to allow the user to easily turn on and off PVM debugging. Additional lines, if any, specify what machines should be added to the current configuration before spawning nprocs-1 processes to the machines in a round robin fashion.

nprocs is input on the process which has no PVM parent (that is, mypnum=0), and both parameters are output for all processes. So, on PVM systems, the call to blacs_pinfo informs you that the virtual machine has not been set up, and a call to blacs_setup then sets up the machine and returns the real values for mypnum and nprocs.

Note that if the file blacs_setup.dat does not exist, the BLACS prompt the user for the executable name, and processes are spawned to the current PVM configuration.

See Also
BLACS Routines Usage Example

blacs_get
Gets values that BLACS use for internal defaults.

Syntax
call blacs_get( icontxt, what, val )

Input Parameters
icontxt INTEGER. On values of what that are tied to a particular context, this parameter is the integer handle indicating the context. Otherwise, ignored.
what INTEGER. Indicates what BLACS internal(s) should be returned in val.
Present options are:
• what = 0 : Handle indicating default system context
• what = 1 : The BLACS message ID range
• what = 2 : The BLACS debug level the library was compiled with
• what = 10 : Handle indicating the system context used to define the BLACS context whose handle is icontxt
• what = 11 : Number of rings multiring topology is presently using
• what = 12 : Number of branches general tree topology is presently using.

Output Parameters
val INTEGER. The value of the BLACS internal.

Description
This routine gets the values that the BLACS are using for internal defaults. Some values are tied to a BLACS context, and some are more general. The most common use is in retrieving a default system context for input into blacs_gridinit or blacs_gridmap.

Some systems, such as MPI*, supply their own version of context. For those users who mix system code with BLACS code, a BLACS context should be formed in reference to a system context. Thus, the grid creation routines take a system context as input. If you wish to have strictly portable code, you may use blacs_get to retrieve a default system context that will include all available processes. This value is not tied to a BLACS context, so the parameter icontxt is unused.

blacs_get returns information on three quantities that are tied to an individual BLACS context, which is passed in as icontxt. The information that may be retrieved is:
• The handle of the system context upon which this BLACS context was defined
• The number of rings for TOP = 'M' (multiring broadcast)
• The number of branches for TOP = 'T' (general tree broadcast/general tree gather).

See Also
BLACS Routines Usage Example

blacs_set
Sets values that BLACS use for internal defaults.

Syntax
call blacs_set( icontxt, what, val )

Input Parameters
icontxt INTEGER. For values of what that are tied to a particular context, this parameter is the integer handle indicating the context. Otherwise, ignored.
what INTEGER. Indicates what BLACS internal(s) should be set. Present values are:
• 1 = The BLACS message ID range
• 11 = Number of rings for multiring topology to use
• 12 = Number of branches for general tree topology to use.
val INTEGER. Array of dimension (*). Indicates the value(s) the internals should be set to. The specific meanings depend on what values, as discussed in Description below.

Description
This routine sets the BLACS internal defaults depending on what values:

what = 1
Setting the BLACS message ID range. If you wish to mix the BLACS with other message-passing packages, restrict the BLACS to a certain message ID range not to be used by the non-BLACS routines. The message ID range must be set before the first call to blacs_gridinit or blacs_gridmap. Subsequent calls will have no effect. Because the message ID range is not tied to a particular context, the parameter icontxt is ignored, and val is defined as:
VAL (input) INTEGER array of dimension (2)
VAL(1) : The smallest message ID (also called message type or message tag) the BLACS should use.
VAL(2) : The largest message ID (also called message type or message tag) the BLACS should use.

what = 11
Set number of rings for TOP = 'M' (multiring broadcast). This quantity is tied to a context, so icontxt is used, and val is defined as:
VAL (input) INTEGER array of dimension (1)
VAL(1) : The number of rings for multiring topology to use.

what = 12
Set number of branches for TOP = 'T' (general tree broadcast/general tree gather). This quantity is tied to a context, so icontxt is used, and val is defined as:
VAL (input) INTEGER array of dimension (1)
VAL(1) : The number of branches for general tree topology to use.

blacs_gridinit
Assigns available processes into BLACS process grid.

Syntax
call blacs_gridinit( icontxt, order, nprow, npcol )

Input Parameters
icontxt INTEGER. Integer handle indicating the system context to be used in creating the BLACS context.
Call blacs_get to obtain a default system context.
order CHARACTER*1. Indicates how to map processes to BLACS grid. Options are:
• 'R' : Use row-major natural ordering
• 'C' : Use column-major natural ordering
• ELSE : Use row-major natural ordering
nprow INTEGER. Indicates how many process rows the process grid should contain.
npcol INTEGER. Indicates how many process columns the process grid should contain.

Output Parameters
icontxt INTEGER. Integer handle to the created BLACS context.

Description
All BLACS codes must call this routine, or its sister routine blacs_gridmap. These routines take the available processes, and assign, or map, them into a BLACS process grid. In other words, they establish how the BLACS coordinate system maps into the native machine process numbering system. Each BLACS grid is contained in a context, so that it does not interfere with distributed operations that occur within other grids/contexts. These grid creation routines may be called repeatedly to define additional contexts/grids.

The creation of a grid requires input from all processes that are defined to be in this grid. Processes belonging to more than one grid have to agree on which grid formation will be serviced first, much like the globally blocking sum or broadcast.

These grid creation routines set up various internals for the BLACS, and one of them must be called before any calls are made to the non-initialization BLACS.

Note that these routines map already existing processes to a grid: the processes are not created dynamically. On most parallel machines, the processes are actual processors (hardware), and they are "created" when you run your executable. When using the PVM BLACS, if the virtual machine has not been set up yet, the routine blacs_setup should be used to create the virtual machine.

This routine creates a simple nprow x npcol process grid.
This process grid uses the first nprow * npcol processes, and assigns them to the grid in a row- or column-major natural ordering. If these process-to-grid mappings are unacceptable, call blacs_gridmap.

See Also
BLACS Routines Usage Example
blacs_get
blacs_gridmap
blacs_setup

blacs_gridmap
Maps available processes into BLACS process grid.

Syntax
call blacs_gridmap( icontxt, usermap, ldumap, nprow, npcol )

Input Parameters
icontxt INTEGER. Integer handle indicating the system context to be used in creating the BLACS context. Call blacs_get to obtain a default system context.
usermap INTEGER. Array, dimension (ldumap, npcol), indicating the process-to-grid mapping.
ldumap INTEGER. Leading dimension of the 2D array usermap. ldumap >= nprow.
nprow INTEGER. Indicates how many process rows the process grid should contain.
npcol INTEGER. Indicates how many process columns the process grid should contain.

Output Parameters
icontxt INTEGER. Integer handle to the created BLACS context.

Description
All BLACS codes must call this routine, or its sister routine blacs_gridinit. These routines take the available processes, and assign, or map, them into a BLACS process grid. In other words, they establish how the BLACS coordinate system maps into the native machine process numbering system. Each BLACS grid is contained in a context, so that it does not interfere with distributed operations that occur within other grids/contexts. These grid creation routines may be called repeatedly to define additional contexts/grids.

The creation of a grid requires input from all processes that are defined to be in this grid. Processes belonging to more than one grid have to agree on which grid formation will be serviced first, much like the globally blocking sum or broadcast.

These grid creation routines set up various internals for the BLACS, and one of them must be called before any calls are made to the non-initialization BLACS.
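As a hedged illustration of what a usermap array might contain, the sketch below fills a 2 x 3 array in row-major natural ordering without calling any BLACS routine. It assumes system process numbers run from 0 upward (as on non-PVM platforms) and that the Fortran element usermap(i,j), with 1-based indices, describes grid position {i-1, j-1}; the program name and the grid size are invented for the example.

```fortran
program usermap_sketch
  ! Hypothetical sketch: build a process-to-grid mapping array of the
  ! kind passed to blacs_gridmap, using row-major natural ordering.
  ! No BLACS routine is called; only the array contents are shown.
  implicit none
  integer, parameter :: nprow = 2, npcol = 3
  integer :: usermap(nprow, npcol)
  integer :: i, j

  ! Process numbers increase along each grid row first.
  do i = 1, nprow
     do j = 1, npcol
        usermap(i, j) = (i - 1) * npcol + (j - 1)
     end do
  end do

  ! Print the mapping one grid row per line.
  do i = 1, nprow
     write(*, '(3i4)') usermap(i, :)
  end do
end program usermap_sketch
```

The first printed row is 0 1 2 and the second is 3 4 5. Swapping the loop roles, so that the number advances down each column first, would give the column-major variant.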
Note that these routines map already existing processes to a grid: the processes are not created dynamically. On most parallel machines, the processes are actual processors (hardware), and they are "created" when you run your executable. When using the PVM BLACS, if the virtual machine has not been set up yet, the routine blacs_setup should be used to create the virtual machine.

This routine allows the user to map processes to the process grid in an arbitrary manner. usermap(i,j) holds the process number of the process to be placed in {i, j} of the process grid. On most distributed systems, this process number is a machine defined number between 0 and nprocs-1. For PVM, these node numbers are the PVM TIDs (Task IDs). The blacs_gridmap routine is intended for an experienced user. The blacs_gridinit routine is much simpler. blacs_gridinit simply performs a gridmap where the first nprow * npcol processes are mapped into the current grid in a row-major natural ordering. If you are an experienced user, blacs_gridmap allows you to take advantage of your system's actual layout. That is, you can map nodes that are physically connected to be neighbors in the BLACS grid, etc. The blacs_gridmap routine also opens the way for multigridding: you can separate your nodes into arbitrary grids, join them together at some later date, and then re-split them into new grids. blacs_gridmap also provides the ability to make arbitrary grids or subgrids (for example, a "nearest neighbor" grid), which can greatly facilitate operations among processes that do not fall on a row or column of the main process grid.

See Also
BLACS Routines Usage Example
blacs_get
blacs_gridinit
blacs_setup

Destruction Routines
This section describes BLACS routines that destroy grids, abort processes, and free resources.

BLACS Destruction Routines
Routine name      Operation performed
blacs_freebuff    Frees BLACS buffer.
blacs_gridexit    Frees a BLACS context.
blacs_abort       Aborts all processes.
blacs_exit        Frees all BLACS contexts and releases all allocated memory.

blacs_freebuff
Frees BLACS buffer.

Syntax
call blacs_freebuff( icontxt, wait )

Input Parameters
icontxt INTEGER. Integer handle that indicates the BLACS context.
wait INTEGER. Parameter indicating whether to wait for non-blocking operations or not. If it equals 0, the operations should not be waited for; free only unused buffers. Otherwise, wait in order to free all buffers.

Description
This routine releases the BLACS buffer.
The BLACS have at least one internal buffer that is used for packing messages. The number of internal buffers depends on what platform you are running the BLACS on. On systems where memory is tight, keeping this buffer or buffers may become expensive. Call blacs_freebuff to release the buffer. However, the next call of a communication routine that requires packing reallocates the buffer.
The wait parameter determines whether the BLACS should wait for any non-blocking operations to be completed or not. If wait = 0, the BLACS free any buffers that can be freed without waiting. If wait is not 0, the BLACS free all internal buffers, even if non-blocking operations must be completed first.

blacs_gridexit
Frees a BLACS context.

Syntax
call blacs_gridexit( icontxt )

Input Parameters
icontxt INTEGER. Integer handle that indicates the BLACS context to be freed.

Description
This routine frees a BLACS context.
Release the resources when contexts are no longer needed. After freeing a context, the context no longer exists, and its handle may be re-used if new contexts are defined.

blacs_abort
Aborts all processes.

Syntax
call blacs_abort( icontxt, errornum )

Input Parameters
icontxt INTEGER. Integer handle that indicates the BLACS context to be aborted.
errornum INTEGER. User-defined integer error number.

Description
This routine aborts all the BLACS processes, not only those confined to a particular context.
Use blacs_abort to abort all the processes in case of a serious error. Note that both parameters are input, but the routine uses them only in printing out the error message. The context handle passed in is not required to be a valid context handle.

blacs_exit
Frees all BLACS contexts and releases all allocated memory.

Syntax
call blacs_exit( continue )

Input Parameters
continue INTEGER. Flag indicating whether message passing continues after the BLACS are done. If continue is non-zero, the user is assumed to continue using the machine after completing the BLACS. Otherwise, no message passing is assumed after calling this routine.

Description
This routine frees all BLACS contexts and releases all allocated memory.
This routine should be called when a process has finished all use of the BLACS. The continue parameter indicates whether the user will be using the underlying communication platform after the BLACS are finished. This information is most important for the PVM BLACS. If continue is set to 0, then pvm_exit is called; otherwise, it is not called. Setting continue not equal to 0 indicates that explicit PVM send/recvs will be called after the BLACS routines are used. In that case, make sure your code calls pvm_exit itself. PVM users should either call blacs_exit or explicitly call pvm_exit to avoid PVM problems.

See Also
BLACS Routines Usage Example

Informational Routines
This section describes BLACS routines that return information involving the process grid.

BLACS Informational Routines
Routine name      Operation performed
blacs_gridinfo    Returns information on the current grid.
blacs_pnum        Returns the system process number of the process in the process grid.
blacs_pcoord      Returns the row and column coordinates in the process grid.

blacs_gridinfo
Returns information on the current grid.

Syntax
call blacs_gridinfo( icontxt, nprow, npcol, myprow, mypcol )

Input Parameters
icontxt INTEGER. Integer handle that indicates the context.

Output Parameters
nprow INTEGER.
Number of process rows in the current process grid.
npcol INTEGER. Number of process columns in the current process grid.
myprow INTEGER. Row coordinate of the calling process in the process grid.
mypcol INTEGER. Column coordinate of the calling process in the process grid.

Description
This routine returns information on the current grid. If the context handle does not point at a valid context, all quantities are returned as -1.

See Also
BLACS Routines Usage Example

blacs_pnum
Returns the system process number of the process in the process grid.

Syntax
result = blacs_pnum( icontxt, prow, pcol )

Input Parameters
icontxt INTEGER. Integer handle that indicates the context.
prow INTEGER. Row coordinate of the process whose system process number is to be determined.
pcol INTEGER. Column coordinate of the process whose system process number is to be determined.

Description
This function returns the system process number of the process at {PROW, PCOL} in the process grid.

See Also
BLACS Routines Usage Example

blacs_pcoord
Returns the row and column coordinates in the process grid.

Syntax
call blacs_pcoord( icontxt, pnum, prow, pcol )

Input Parameters
icontxt INTEGER. Integer handle that indicates the context.
pnum INTEGER. Process number whose coordinates are to be determined. This parameter stands for the process number of the underlying machine, that is, it is a tid for PVM.

Output Parameters
prow INTEGER. Row coordinate of the pnum process in the BLACS grid.
pcol INTEGER. Column coordinate of the pnum process in the BLACS grid.

Description
Given the system process number, this routine returns the row and column coordinates in the BLACS process grid.

See Also
BLACS Routines Usage Example

Miscellaneous Routines
This section describes the blacs_barrier routine.
BLACS Miscellaneous Routines
Routine name      Operation performed
blacs_barrier     Holds up execution of all processes within the indicated scope until they have all called the routine.

blacs_barrier
Holds up execution of all processes within the indicated scope.

Syntax
call blacs_barrier( icontxt, scope )

Input Parameters
icontxt INTEGER. Integer handle that indicates the context.
scope CHARACTER*1. Parameter that indicates whether a process row (scope='R'), column ('C'), or entire grid ('A') will participate in the barrier.

Description
This routine holds up execution of all processes within the indicated scope until they have all called the routine.

Examples of BLACS Routines Usage

Example. BLACS Usage. Hello World
The following routine takes the available processes, forms them into a process grid, and then has each process check in with the process at {0,0} in the process grid.

      PROGRAM HELLO
*     -- BLACS example code --
*     Written by Clint Whaley 7/26/94
*     Performs a simple check-in type hello world
*     ..
*     .. External Functions ..
      INTEGER BLACS_PNUM
      EXTERNAL BLACS_PNUM
*     ..
*     .. Variable Declaration ..
      INTEGER CONTXT, IAM, NPROCS, NPROW, NPCOL, MYPROW, MYPCOL
      INTEGER ICALLER, I, J, HISROW, HISCOL
*
*     Determine my process number and the number of processes in
*     machine
*
      CALL BLACS_PINFO(IAM, NPROCS)
*
*     If in PVM, create virtual machine if it doesn't exist
*
      IF (NPROCS .LT. 1) THEN
         IF (IAM .EQ. 0) THEN
            WRITE(*, 1000)
            READ(*, 2000) NPROCS
         END IF
         CALL BLACS_SETUP(IAM, NPROCS)
      END IF
*
*     Set up process grid that is as close to square as possible
*
      NPROW = INT( SQRT( REAL(NPROCS) ) )
      NPCOL = NPROCS / NPROW
*
*     Get default system context, and define grid
*
      CALL BLACS_GET(0, 0, CONTXT)
      CALL BLACS_GRIDINIT(CONTXT, 'Row', NPROW, NPCOL)
      CALL BLACS_GRIDINFO(CONTXT, NPROW, NPCOL, MYPROW, MYPCOL)
*
*     If I'm not in grid, go to end of program
*
      IF ( (MYPROW.GE.NPROW) .OR. (MYPCOL.GE.NPCOL) ) GOTO 30
*
*     Get my process ID from my grid coordinates
*
      ICALLER = BLACS_PNUM(CONTXT, MYPROW, MYPCOL)
*
*     If I am process {0,0}, receive check-in messages from
*     all nodes
*
      IF ( (MYPROW.EQ.0) .AND. (MYPCOL.EQ.0) ) THEN
         WRITE(*,*) ' '
         DO 20 I = 0, NPROW-1
            DO 10 J = 0, NPCOL-1
               IF ( (I.NE.0) .OR. (J.NE.0) ) THEN
                  CALL IGERV2D(CONTXT, 1, 1, ICALLER, 1, I, J)
               END IF
*
*              Make sure ICALLER is where we think in process grid
*
               CALL BLACS_PCOORD(CONTXT, ICALLER, HISROW, HISCOL)
               IF ( (HISROW.NE.I) .OR. (HISCOL.NE.J) ) THEN
                  WRITE(*,*) 'Grid error! Halting . . .'
                  STOP
               END IF
               WRITE(*, 3000) I, J, ICALLER
   10       CONTINUE
   20    CONTINUE
         WRITE(*,*) ' '
         WRITE(*,*) 'All processes checked in. Run finished.'
*
*     All processes but {0,0} send process ID as a check-in
*
      ELSE
         CALL IGESD2D(CONTXT, 1, 1, ICALLER, 1, 0, 0)
      END IF
   30 CONTINUE
      CALL BLACS_EXIT(0)
 1000 FORMAT('How many processes in machine?')
 2000 FORMAT(I)
 3000 FORMAT('Process {',i2,',',i2,'} (node number =',I,
     $       ') has checked in.')
      STOP
      END

Example. BLACS Usage. PROCMAP
This routine maps processes to a grid using blacs_gridmap.

      SUBROUTINE PROCMAP(CONTEXT, MAPPING, BEGPROC, NPROW, NPCOL, IMAP)
*
*     -- BLACS example code --
*     Written by Clint Whaley 7/26/94
*     ..
*     .. Scalar Arguments ..
      INTEGER CONTEXT, MAPPING, BEGPROC, NPROW, NPCOL
*     ..
*     .. Array Arguments ..
      INTEGER IMAP(NPROW, *)
*     ..
*
*  Purpose
*  =======
*  PROCMAP maps NPROW*NPCOL processes starting from process BEGPROC to
*  the grid in a variety of ways depending on the parameter MAPPING.
*
*  Arguments
*  =========
*
*  CONTEXT  (output) INTEGER
*           This integer is used by the BLACS to indicate a context.
*           A context is a universe where messages exist and do not
*           interact with other context's messages. The context
*           includes the definition of a grid, and each process's
*           coordinates in it.
*
*  MAPPING  (input) INTEGER
*           Way to map processes to grid.
*           Choices are:
*           1 : row-major natural ordering
*           2 : column-major natural ordering
*
*  BEGPROC  (input) INTEGER
*           The process number (between 0 and NPROCS-1) to use as
*           {0,0}. From this process, processes will be assigned
*           to the grid as indicated by MAPPING.
*
*  NPROW    (input) INTEGER
*           The number of process rows the created grid
*           should have.
*
*  NPCOL    (input) INTEGER
*           The number of process columns the created grid
*           should have.
*
*  IMAP     (workspace) INTEGER array of dimension (NPROW, NPCOL)
*           Workspace, where the array which maps the
*           processes to the grid will be stored for the
*           call to GRIDMAP.
*
*  ===============================================================
*
*     ..
*     .. External Functions ..
      INTEGER BLACS_PNUM
      EXTERNAL BLACS_PNUM
*     ..
*     .. External Subroutines ..
      EXTERNAL BLACS_PINFO, BLACS_GRIDINIT, BLACS_GRIDMAP
*     ..
*     .. Local Scalars ..
      INTEGER TMPCONTXT, NPROCS, I, J, K
*     ..
*     .. Executable Statements ..
*
*     See how many processes there are in the system
*
      CALL BLACS_PINFO( I, NPROCS )
      IF (NPROCS-BEGPROC .LT. NPROW*NPCOL) THEN
         WRITE(*,*) 'Not enough processes for grid'
         STOP
      END IF
*
*     Temporarily map all processes into 1 x NPROCS grid
*
      CALL BLACS_GET( 0, 0, TMPCONTXT )
      CALL BLACS_GRIDINIT( TMPCONTXT, 'Row', 1, NPROCS )
      K = BEGPROC
*
*     If we want a row-major natural ordering
*
      IF (MAPPING .EQ. 1) THEN
         DO I = 1, NPROW
            DO J = 1, NPCOL
               IMAP(I, J) = BLACS_PNUM(TMPCONTXT, 0, K)
               K = K + 1
            END DO
         END DO
*
*     If we want a column-major natural ordering
*
      ELSE IF (MAPPING .EQ. 2) THEN
         DO J = 1, NPCOL
            DO I = 1, NPROW
               IMAP(I, J) = BLACS_PNUM(TMPCONTXT, 0, K)
               K = K + 1
            END DO
         END DO
      ELSE
         WRITE(*,*) 'Unknown mapping.'
         STOP
      END IF
*
*     Free temporary context
*
      CALL BLACS_GRIDEXIT(TMPCONTXT)
*
*     Apply the new mapping to form desired context
*
      CALL BLACS_GET( 0, 0, CONTEXT )
      CALL BLACS_GRIDMAP( CONTEXT, IMAP, NPROW, NPROW, NPCOL )
      RETURN
      END

Example. BLACS Usage. PARALLEL DOT PRODUCT
This routine does a bone-headed parallel double precision dot product of two vectors. Arguments are input on process {0,0}, and output everywhere else.

      DOUBLE PRECISION FUNCTION PDDOT( CONTXT, N, X, Y )
*
*     -- BLACS example code --
*     Written by Clint Whaley 7/26/94
*     ..
*     .. Scalar Arguments ..
      INTEGER CONTXT, N
*     ..
*     .. Array Arguments ..
      DOUBLE PRECISION X(*), Y(*)
*     ..
*
*  Purpose
*  =======
*  PDDOT is a restricted parallel version of the BLAS routine
*  DDOT. It assumes that the increment on both vectors is one,
*  and that process {0,0} starts out owning the vectors and
*  has N. It returns the dot product of the two N-length vectors
*  X and Y, that is, PDDOT = X' Y.
*
*  Arguments
*  =========
*
*  CONTXT   (input) INTEGER
*           This integer is used by the BLACS to indicate a context.
*           A context is a universe where messages exist and do not
*           interact with other context's messages. The context
*           includes the definition of a grid, and each process's
*           coordinates in it.
*
*  N        (input/output) INTEGER
*           The length of the vectors X and Y. Input
*           for {0,0}, output for everyone else.
*
*  X        (input/output) DOUBLE PRECISION array of dimension (N)
*           The vector X of PDDOT = X' Y. Input for {0,0},
*           output for everyone else.
*
*  Y        (input/output) DOUBLE PRECISION array of dimension (N)
*           The vector Y of PDDOT = X' Y. Input for {0,0},
*           output for everyone else.
*
*  ===============================================================
*
*     ..
*     .. External Functions ..
      DOUBLE PRECISION DDOT
      EXTERNAL DDOT
*     ..
*     .. External Subroutines ..
      EXTERNAL BLACS_GRIDINFO, DGEBS2D, DGEBR2D, DGSUM2D
*     ..
*     .. Local Scalars ..
      INTEGER IAM, NPROCS, NPROW, NPCOL, MYPROW, MYPCOL, I, LN
      DOUBLE PRECISION LDDOT
*     ..
*     .. Executable Statements ..
* * Find out what grid has been set up, and pretend it is 1-D * CALL BLACS_GRIDINFO( CONTXT, NPROW, NPCOL, MYPROW, MYPCOL ) IAM = MYPROW*NPCOL + MYPCOL NPROCS = NPROW * NPCOL * * Temporarily map all processes into 1 x NPROCS grid * CALL BLACS_GET( 0, 0, TMPCONTXT ) CALL BLACS_GRIDINIT( TMPCONTXT, 'Row', 1, NPROCS ) K = BEGPROC * * Do bone-headed thing, and just send entire X and Y to * everyone * IF ( (MYPROW.EQ.0) .AND. (MYPCOL.EQ.0) ) THEN CALL IGEBS2D(CONTXT, 'All', 'i-ring', 1, 1, N, 1 ) CALL DGEBS2D(CONTXT, 'All', 'i-ring', N, 1, X, N ) CALL DGEBS2D(CONTXT, 'All', 'i-ring', N, 1, Y, N ) ELSE CALL IGEBR2D(CONTXT, 'All', 'i-ring', 1, 1, N, 1, 0, 0 ) CALL DGEBR2D(CONTXT, 'All', 'i-ring', N, 1, X, N, 0, 0 ) CALL DGEBR2D(CONTXT, 'All', 'i-ring', N, 1, Y, N, 0, 0 ) ENDIF * * Find out the number of local rows to multiply (LN), and * where in vectors to start (I) * LN = N / NPROCS I = 1 + IAM * LN * * Last process does any extra rows * IF (IAM .EQ. NPROCS-1) LN = LN + MOD(N, NPROCS) * * Figure dot product of my piece of X and Y * LDDOT = DDOT( LN, X(I), 1, Y(I), 1 ) * * Add local dot products to get global dot product; * give all procs the answer * CALL DGSUM2D( CONTXT, 'All', '1-tree', 1, 1, LDDOT, 1, -1, 0 ) PDDOT = LDDOT RETURN BLACS Routines 16 2577 END Example. BLACS Usage. PARALLEL MATRIX INFINITY NORM This routine does a parallel infinity norm on a distributed double precision matrix. Unlike the PDDOT example, this routine assumes the matrix has already been distributed. DOUBLE PRECISION FUNCTION PDINFNRM(CONTXT, LM, LN, A, LDA, WORK) * * -- BLACS example code -- * Written by Clint Whaley. * .. * .. Scalar Arguments .. INTEGER CONTEXT, LM, LN, LDA * .. * .. Array Arguments .. DOUBLE PRECISION A(LDA, *), WORK(*) * .. * * Purpose * ======= * Compute the infinity norm of a distributed matrix, where * the matrix is spread across a 2D process grid. The result is * left on all processes. 
* * Arguments * ========= * * CONTEXT (input) INTEGER * This integer is used by the BLACS to indicate a context. * A context is a universe where messages exist and do not * interact with other context's messages. The context * includes the definition of a grid, and each process's * coordinates in it. * * LM (input) INTEGER * Number of rows of the global matrix owned by this * process. * * LN (input) INTEGER * Number of columns of the global matrix owned by this * process. * * A (input) DOUBLE PRECISION, dimension (LDA,N) * The matrix whose norm you wish to compute. * * LDA (input) INTEGER * Leading Dimension of A. * * WORK (temporary) DOUBLE PRECISION array, dimension (LM) * Temporary work space used for summing rows. * * .. External Subroutines .. EXTERNAL BLACS_GRIDINFO, DGEBS2D, DGEBR2D, DGSUM2D, DGAMX2D * .. * .. External Functions .. INTEGER IDAMAX DOUBLE PRECISION DASUM * 16 Intel® Math Kernel Library Reference Manual 2578 * .. Local Scalars .. INTEGER NPROW, NPCOL, MYROW, MYCOL, I, J DOUBLE PRECISION MAX * * .. Executable Statements .. * * Get process grid information * CALL BLACS_GRIDINFO( CONTXT, NPROW, NPCOL, MYPROW, MYPCOL ) * * Add all local rows together * DO 20 I = 1, LM WORK(I) = DASUM(LN, A(I,1), LDA) 20 CONTINUE * * Find sum of global matrix rows and store on column 0 of * process grid * CALL DGSUM2D(CONTXT, 'Row', '1-tree', LM, 1, WORK, LM, MYROW, 0) * * Find maximum sum of rows for supnorm * IF (MYCOL .EQ. 0) THEN MAX = WORK(IDAMAX(LM,WORK,1)) IF (LM .LT. 1) MAX = 0.0D0 CALL DGAMX2D(CONTXT, 'Col', 'h', 1, 1, MAX, 1, I, I, -1, -1, 0) END IF * * Process column 0 has answer; send answer to all nodes * IF (MYCOL .EQ. 
0) THEN CALL DGEBS2D(CONTXT, 'Row', ' ', 1, 1, MAX, 1) ELSE CALL DGEBR2D(CONTXT, 'Row', ' ', 1, 1, MAX, 1, 0, 0) END IF * PDINFNRM = MAX BLACS Routines 16 2579 * RETURN * * End of PDINFNRM * END 16 Intel® Math Kernel Library Reference Manual 2580 Data Fitting Functions 17 Data Fitting functions in Intel® MKL provide spline-based interpolation capabilities that you can use to approximate functions, function derivatives or integrals, and perform cell search operations. The Data Fitting component is task based. The task is a data structure or descriptor that holds the parameters related to a specific Data Fitting operation. You can modify the task parameters using the task editing functionality of the library. For definition of the implemented operations, see Mathematical Conventions. Data Fitting routines use the following workflow to process a task: 1. Create a task or multiple tasks. 2. Modify the task parameters. 3. Perform a Data Fitting computation. 4. Destroy the task or tasks. All Data Fitting functions fall into the following categories: Task Creation and Initialization Routines - routines that create a new Data Fitting task descriptor and initialize the most common parameters, such as partition of the interpolation interval, values of the vectorvalued function, and the parameters describing their structure. Task Editors - routines that set or modify parameters in an existing Data Fitting task. Computational Routines - routines that perform Data Fitting computations, such as construction of a spline, interpolation, computation of derivatives and integrals, and search. Task Destructors - routines that delete Data Fitting task descriptors and deallocate resources. You can access the Data Fitting routines through the Fortran and C89/C99 language interfaces. 
You can also use the C89 interface with more recent versions of C/C++, or the Fortran 90 interface with programs written in Fortran 95.

The ${MKL}/include directory of the Intel® MKL contains the following Data Fitting header files:
• C/C++: mkl_df.h
• Fortran: mkl_df.f90 and mkl_df.f77

You can find examples that demonstrate C/C++ and Fortran usage of Data Fitting routines in the ${MKL}/examples/datafittingc and ${MKL}/examples/datafittingf directories, respectively.

Naming Conventions

The names of the Data Fitting functions in the Fortran interface are in lowercase, while the names of the types and constants are in uppercase. The names of the Data Fitting functions, types, and constants in the C/C++ interface are case-sensitive and can be in lowercase, uppercase, and mixed case.

The names of all routines have the following structure:
df[datatype]<base_name>
where
• df is a prefix indicating that the routine belongs to the Data Fitting component of Intel MKL.
• [datatype] field specifies the type of the input and/or output data and can be s (for the single precision real type), d (for the double precision real type), or i (for the integer type). This field is omitted in the names of the routines that are not data type dependent.
• <base_name> field specifies the functionality the routine performs. For example, this field can be NewTask1D, Interpolate1D, or DeleteTask.

Data Types

The Data Fitting component provides routines for processing single and double precision real data types. The results of cell search operations are returned as a generic integer data type.

All Data Fitting routines use the following data type:

Type                         Data Object
Fortran: TYPE(DF_TASK)       Pointer to a task
C: DFTaskPtr

NOTE The actual size of the generic integer type is platform-dependent. Before compiling your application, you need to set an appropriate byte size for integers. For details, see section Using the ILP64 Interface vs. LP64 Interface of the Intel® MKL User's Guide.
Mathematical Conventions

This section explains the notation used for Data Fitting function descriptions. Spline notations are based on the terminology and definitions of [deBoor2001]. The definition of Subbotin quadratic splines follows the conventions of [StechSub76].

Mathematical Notation in the Data Fitting Component

Concept: Partition of interpolation interval [a, b], where
• x_i denotes breakpoints.
• [x_i, x_{i+1}) denotes a sub-interval (cell) of size x_{i+1} - x_i.
Mathematical Notation: {x_i}, i = 1,...,n, where a = x_1 < x_2 <...

b), the df?integrateex1d routine passes max(llim, b) as the left integration limit and rlim as the right integration limit to the user-defined callback function.
• If the left and the right integration limits belong to the interpolation interval, the df?integrateex1d routine passes them to the user-defined callback function unchanged.

The value of the integral is the sum of integral values obtained on the sub-intervals.

See Also
df?integrate1d/df?integrateex1d
df?integrcallback
df?searchcellscallback

df?searchcellscallback
A callback function for user-defined search to be passed into df?interpolateex1d or df?searchcellsex1d.

Syntax
Fortran:
status = dfssearchcellscallback(n, site, cell, flag, params)
status = dfdsearchcellscallback(n, site, cell, flag, params)
C:
status = dfsSearchCellsCallBack(n, site, cell, flag, params)
status = dfdSearchCellsCallBack(n, site, cell, flag, params)

Include Files
• Fortran: mkl_df.f90 and mkl_df.f77
• C: mkl_df.h

Input Parameters

n
  Type: Fortran: INTEGER(KIND=8)
        C: long long*
  Description: Number of interpolation sites.

site
  Type: Fortran: REAL(KIND=4) DIMENSION(*) for dfssearchcellscallback
                 REAL(KIND=8) DIMENSION(*) for dfdsearchcellscallback
        C: float* for dfsSearchCellsCallBack
           double* for dfdSearchCellsCallBack
  Description: Array of interpolation sites of size n.

cell
  Type: Fortran: INTEGER(KIND=8) DIMENSION(*)
        C: long long*
  Description: Array of size n that returns indices of the cells computed by the callback function.

flag
  Type: Fortran: INTEGER(KIND=4) DIMENSION(*)
        C: int*
  Description: Array of size n, with values set as follows:
  • If the cell with index cell[i] contains site[i], set flag[i] to 1.
  • Otherwise, set flag[i] to zero. In this case, the library interprets the index as an approximation and computes the index of the cell containing site[i] by using the provided index as a starting point for the search.

params
  Type: Fortran: INTEGER DIMENSION(*)
        C: void*
  Description: Pointer to user-defined parameters of the callback function.

Output Parameters

status
  Type: Fortran: INTEGER
        C: int
  Description: The status returned by the callback function:
  • Zero indicates successful completion of the callback operation.
  • A negative value indicates an error.
  • The DF_STATUS_EXACT_RESULT status indicates that cell indices returned by the callback function are exact. In this case, you do not need to initialize entries of the flag array.
  • A positive value indicates a warning.
  See "Task Status and Error Reporting" for error code definitions.

Description
When passed into the df?interpolateex1d or df?searchcellsex1d routine, this function performs a user-defined search.

See Also
df?interpolate1d/df?interpolateex1d
df?interpcallback

Task Destructors

Task destructors are routines used to delete task descriptors and deallocate the corresponding memory resources. The Data Fitting task destructor dfdeletetask destroys a Data Fitting task and frees the memory.

dfdeletetask
Destroys a Data Fitting task object and frees the memory.

Syntax
Fortran:
status = dfdeletetask(task)
C:
status = dfDeleteTask(&task)

Include Files
• Fortran: mkl_df.f90 and mkl_df.f77
• C: mkl_df.h

Input Parameters

task
  Type: Fortran: TYPE(DF_TASK)
        C: DFTaskPtr
  Description: Descriptor of the task to destroy.

Output Parameters

status
  Type: Fortran: INTEGER
        C: int
  Description: Status of the routine:
  • DF_STATUS_OK if the task is deleted successfully.
  • Non-zero error code if the operation failed. See "Task Status and Error Reporting" for error code definitions.

Description
Given a pointer to a task descriptor, this routine deletes the Data Fitting task descriptor and frees the memory allocated for the structure. If the task is deleted successfully, the routine sets the task pointer to NULL. Otherwise, the routine returns an error code.

Linear Solvers Basics A

Many applications in science and engineering require the solution of a system of linear equations. This problem is usually expressed mathematically by the matrix-vector equation, Ax = b, where A is an m-by-n matrix, x is the n element column vector and b is the m element column vector. The matrix A is usually referred to as the coefficient matrix, and the vectors x and b are referred to as the solution vector and the right-hand side, respectively.

Basic concepts related to solving linear systems with sparse matrices are described in section Sparse Linear Systems that follows.

Sparse Linear Systems

In many real-life applications, most of the elements in A are zero. Such a matrix is referred to as sparse. Conversely, matrices with very few zero elements are called dense. For sparse matrices, computing the solution to the equation Ax = b can be made much more efficient with respect to both storage and computation time, if the sparsity of the matrix can be exploited. The more an algorithm can exploit the sparsity without sacrificing the correctness, the better the algorithm.
Generally speaking, computer software that finds solutions to systems of linear equations is called a solver. A solver designed to work specifically on sparse systems of equations is called a sparse solver. Solvers are usually classified into two groups: direct and iterative.

Iterative Solvers start with an initial approximation to a solution and attempt to estimate the difference between the approximation and the true result. Based on the difference, an iterative solver calculates a new approximation that is closer to the true result than the initial approximation. This process is repeated until the difference between the approximation and the true result is sufficiently small. The main drawback to iterative solvers is that the rate of convergence depends greatly on the values in the matrix A. Consequently, it is not possible to predict how long it will take for an iterative solver to produce a solution. In fact, for ill-conditioned matrices, the iterative process may not converge to a solution at all. However, for well-conditioned matrices it is possible for iterative solvers to converge to a solution very quickly. Consequently, for the right applications, iterative solvers can be very efficient.

Direct Solvers, on the other hand, often factor the matrix A into the product of two triangular matrices and then perform a forward and backward triangular solve. This approach makes the time required to solve a system of linear equations relatively predictable, based on the size of the matrix. In fact, for sparse matrices, the solution time can be predicted based on the number of non-zero elements in the matrix A.

Matrix Fundamentals

A matrix is a rectangular array of either real or complex numbers. A matrix is denoted by a capital letter; its elements are denoted by the same lower case letter with row/column subscripts. Thus, the value of the element in row i and column j in matrix A is denoted by a(i,j).
For example, a 3 by 4 matrix A is written as follows:

    | a(1,1)  a(1,2)  a(1,3)  a(1,4) |
A = | a(2,1)  a(2,2)  a(2,3)  a(2,4) |
    | a(3,1)  a(3,2)  a(3,3)  a(3,4) |

Note that with the above notation, we assume the standard Fortran programming language convention of starting array indices at 1 rather than the C programming language convention of starting them at 0.

A matrix in which all of the elements are real numbers is called a real matrix. A matrix that contains at least one complex number is called a complex matrix. A real or complex matrix A with the property that a(i,j) = a(j,i) is called a symmetric matrix. A complex matrix A with the property that a(i,j) = conj(a(j,i)) is called a Hermitian matrix. Note that programs that manipulate symmetric and Hermitian matrices need only store half of the matrix values, since the values of the non-stored elements can be quickly reconstructed from the stored values.

A matrix that has the same number of rows as it has columns is referred to as a square matrix. The elements in a square matrix that have the same row index and column index are called the diagonal elements of the matrix, or simply the diagonal of the matrix.

The transpose of a matrix A is the matrix obtained by "flipping" the elements of the array about its diagonal. That is, we exchange the elements a(i,j) and a(j,i). For a complex matrix, if we both flip the elements about the diagonal and then take the complex conjugate of the element, the resulting matrix is called the Hermitian transpose or conjugate transpose of the original matrix. The transpose and Hermitian transpose of a matrix A are denoted by A^T and A^H respectively.

A column vector, or simply a vector, is an n × 1 matrix, and a row vector is a 1 × n matrix.

A real or complex matrix A is said to be positive definite if the vector-matrix product x^T A x is greater than zero for all non-zero vectors x. A matrix that is not positive definite is referred to as indefinite.

An upper (or lower) triangular matrix is a square matrix in which all elements below (or above) the diagonal are zero.
A unit triangular matrix is an upper or lower triangular matrix with all 1's along the diagonal.

A matrix P is called a permutation matrix if, for any matrix A, the result of the matrix product PA is identical to A except for interchanging the rows of A. For a square matrix, it can be shown that if PA is a permutation of the rows of A, then AP^T is the same permutation of the columns of A. Additionally, it can be shown that the inverse of P is P^T.

In order to save space, a permutation matrix is usually stored as a linear array, called a permutation vector, rather than as a two-dimensional array. Specifically, if the permutation matrix maps the i-th row of a matrix to the j-th row, then the i-th element of the permutation vector is j.

A matrix with non-zero elements only on the diagonal is called a diagonal matrix. As is the case with a permutation matrix, it is usually stored as a vector of values, rather than as a matrix.

Direct Method

For solvers that use the direct method, the basic technique employed in finding the solution of the system Ax = b is to first factor A into triangular matrices. That is, find a lower triangular matrix L and an upper triangular matrix U, such that A = LU. Having obtained such a factorization (usually referred to as an LU decomposition or LU factorization), the solution to the original problem can be rewritten as follows.

Ax = b
LUx = b
L(Ux) = b

This leads to the following two-step process for finding the solution to the original system of equations:
1. Solve the system of equations Ly = b.
2. Solve the system Ux = y.

Solving the systems Ly = b and Ux = y is referred to as a forward solve and a backward solve, respectively.

If a symmetric matrix A is also positive definite, it can be shown that A can be factored as A = LL^T, where L is a lower triangular matrix. Similarly, a Hermitian matrix, A, that is positive definite can be factored as A = LL^H. For both symmetric and Hermitian matrices, a factorization of this form is called a Cholesky factorization.
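The forward and backward solves described above can be sketched in C as follows. This is a minimal dense illustration, assuming the factors L and U are already available as row-major arrays with non-zero diagonals; the function names are hypothetical and are not part of Intel MKL.

```c
#include <stddef.h>

/* Forward solve: find y such that L*y = b, where L is an n-by-n lower
   triangular matrix stored as a dense row-major array. */
void forward_solve(size_t n, const double *L, const double *b, double *y)
{
    for (size_t i = 0; i < n; ++i) {
        double s = b[i];
        for (size_t j = 0; j < i; ++j)
            s -= L[i * n + j] * y[j];
        y[i] = s / L[i * n + i];
    }
}

/* Backward solve: find x such that U*x = y, where U is an n-by-n upper
   triangular matrix stored as a dense row-major array. */
void backward_solve(size_t n, const double *U, const double *y, double *x)
{
    for (size_t i = n; i-- > 0; ) {
        double s = y[i];
        for (size_t j = i + 1; j < n; ++j)
            s -= U[i * n + j] * x[j];
        x[i] = s / U[i * n + i];
    }
}

/* Two-step solution of A*x = b given A = L*U.
   work must provide room for n doubles (it receives the vector y). */
void lu_solve(size_t n, const double *L, const double *U,
              const double *b, double *work, double *x)
{
    forward_solve(n, L, b, work);   /* step 1: L*y = b */
    backward_solve(n, U, work, x);  /* step 2: U*x = y */
}
```

For a Cholesky factorization the same two routines apply with U replaced by the transpose of L, so only L needs to be stored.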
In a Cholesky factorization, the matrix U in an LU decomposition is either L^T or L^H. Consequently, a solver can increase its efficiency by only storing L, and one-half of A, and not computing U. Therefore, users who can express their application as the solution of a system of positive definite equations will gain a significant performance improvement over using a general representation.

For matrices that are symmetric (or Hermitian) but not positive definite, there are still some significant efficiencies to be had. It can be shown that if A is symmetric but not positive definite, then A can be factored as A = LDL^T, where D is a diagonal matrix and L is a lower unit triangular matrix. Similarly, if A is Hermitian, it can be factored as A = LDL^H. In either case, we again only need to store L, D, and half of A and we need not compute U. However, the backward solve phases must be amended to solving L^T x = D^-1 y rather than L^T x = y.

Fill-In and Reordering of Sparse Matrices

Two important concepts associated with the solution of sparse systems of equations are fill-in and reordering. The following example illustrates these concepts.

Consider the system of linear equations Ax = b, where A is a symmetric positive definite sparse matrix, and A and b are defined by the following:

A star (*) is used to represent zeros and to emphasize the sparsity of A. The Cholesky factorization of A is: A = LL^T, where L is the following:

Notice that even though the matrix A is relatively sparse, the lower triangular matrix L has no zeros below the diagonal. If we computed L and then used it for the forward and backward solve phase, we would do as much computation as if A had been dense.

The situation of L having non-zeros in places where A has zeros is referred to as fill-in. Computationally, it would be more efficient if a solver could exploit the non-zero structure of A in such a way as to reduce the fill-in when computing L.
By doing this, the solver would only need to compute the non-zero entries in L. Toward this end, consider permuting the rows and columns of A. As described in the Matrix Fundamentals section, the permutations of the rows of A can be represented as a permutation matrix, P. The result of permuting the rows is the product of P and A. Suppose, in the above example, we swap the first and fifth rows of A, then swap the first and fifth columns of A, and call the resulting matrix B. Mathematically, we can express the process of permuting the rows and columns of A to get B as B = PAP^T. After permuting the rows and columns of A, we see that B is given by the following:

Since B is obtained from A by simply switching rows and columns, the numbers of non-zero entries in A and B are the same. However, when we find the Cholesky factorization, B = LL^T, we see the following:

The fill-in associated with B is much smaller than the fill-in associated with A. Consequently, the storage and computation time needed to factor B is much smaller than to factor A. Based on this, we see that an efficient sparse solver needs to find a permutation P of the matrix A that minimizes the fill-in for factoring B = PAP^T, and then use the factorization of B to solve the original system of equations.

Although the above example is based on a symmetric positive definite matrix and a Cholesky decomposition, the same approach works for a general LU decomposition. Specifically, let P be a permutation matrix, B = PAP^T and suppose that B can be factored as B = LU. Then

Ax = b
PA(P^-1 P)x = Pb
PA(P^T P)x = Pb
(PAP^T)(Px) = Pb
B(Px) = Pb
LU(Px) = Pb

It follows that if we obtain an LU factorization for B, we can solve the original system of equations by a three-step process:
1. Solve Ly = Pb.
2. Solve Uz = y.
3. Set x = P^T z.
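The three-step process above can be sketched in C as follows. This is a dense illustration only: it assumes a 5-by-5 system, an LU factorization without pivoting (adequate for the symmetric positive definite case discussed here), and a permutation vector in which perm[i] = j means that row i is moved to row j. All function names are hypothetical, and the exact comparison with zero in count_fill is a simplification for the illustration.

```c
#include <stddef.h>

enum { N = 5 };  /* order of the illustrative system */

/* LU factorization without pivoting: A = L*U with a unit diagonal in L.
   All matrices are dense, row-major, N-by-N. */
void lu_factor(const double *A, double *L, double *U)
{
    for (size_t i = 0; i < N * N; ++i) { L[i] = 0.0; U[i] = 0.0; }
    for (size_t i = 0; i < N; ++i) {
        for (size_t j = i; j < N; ++j) {          /* row i of U */
            double s = A[i * N + j];
            for (size_t k = 0; k < i; ++k)
                s -= L[i * N + k] * U[k * N + j];
            U[i * N + j] = s;
        }
        L[i * N + i] = 1.0;
        for (size_t r = i + 1; r < N; ++r) {      /* column i of L */
            double s = A[r * N + i];
            for (size_t k = 0; k < i; ++k)
                s -= L[r * N + k] * U[k * N + i];
            L[r * N + i] = s / U[i * N + i];
        }
    }
}

/* Number of non-zero entries strictly below the diagonal of L. */
size_t count_fill(const double *L)
{
    size_t count = 0;
    for (size_t i = 1; i < N; ++i)
        for (size_t j = 0; j < i; ++j)
            if (L[i * N + j] != 0.0)
                ++count;
    return count;
}

/* B = P*A*P^T, where perm[i] = j means that row (and column) i moves to j. */
void permute_symmetric(const double *A, const size_t *perm, double *B)
{
    for (size_t i = 0; i < N; ++i)
        for (size_t j = 0; j < N; ++j)
            B[perm[i] * N + perm[j]] = A[i * N + j];
}

/* Solve A*x = b through the factorization of B = P*A*P^T. */
void solve_reordered(const double *A, const size_t *perm,
                     const double *b, double *x)
{
    double B[N * N], L[N * N], U[N * N], pb[N], y[N], z[N];

    permute_symmetric(A, perm, B);
    lu_factor(B, L, U);
    for (size_t i = 0; i < N; ++i)                /* form P*b */
        pb[perm[i]] = b[i];
    for (size_t i = 0; i < N; ++i) {              /* 1. solve L*y = P*b */
        double s = pb[i];
        for (size_t j = 0; j < i; ++j)
            s -= L[i * N + j] * y[j];
        y[i] = s;                                 /* L has a unit diagonal */
    }
    for (size_t i = N; i-- > 0; ) {               /* 2. solve U*z = y */
        double s = y[i];
        for (size_t j = i + 1; j < N; ++j)
            s -= U[i * N + j] * z[j];
        z[i] = s / U[i * N + i];
    }
    for (size_t i = 0; i < N; ++i)                /* 3. set x = P^T*z */
        x[i] = z[perm[i]];
}
```

For a matrix whose only off-diagonal non-zeros lie in its first row and first column, swapping the first and last rows and columns confines the non-zeros of L to its last row, which count_fill makes visible when applied before and after the permutation.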
If we apply this three-step process to the current example, we first need to perform the forward solve of the system of equations Ly = Pb:

This gives:

The second step is to perform the backward solve, Uz = y. Or, in this case, since a Cholesky factorization is used, L^T z = y.

This gives:

The third and final step is to set x = P^T z. This gives:

Sparse Matrix Storage Formats

As discussed above, it is more efficient to store only the non-zero elements of a sparse matrix. There are a number of common storage formats used for sparse matrices, but most of them employ the same basic technique. That is, store all non-zero elements of the matrix into a linear array and provide auxiliary arrays to describe the locations of the non-zero elements in the original matrix.

Storage Formats for the Direct Sparse Solvers

Storing the non-zero elements of a sparse matrix into a linear array is done by walking down each column (column-major format) or across each row (row-major format) in order, and writing the non-zero elements to a linear array in the order they appear in the walk.

For symmetric matrices, it is necessary to store only the upper triangular half of the matrix (upper triangular format) or the lower triangular half of the matrix (lower triangular format).

The Intel MKL direct sparse solvers use a row-major upper triangular storage format: the matrix is compressed row-by-row and for symmetric matrices only non-zero elements in the upper triangular half of the matrix are stored.

The Intel MKL sparse matrix storage format for direct sparse solvers is specified by three arrays: values, columns, and rowIndex. The following table describes the arrays in terms of the values, row, and column positions of the non-zero elements in a sparse matrix.

values     A real or complex array that contains the non-zero elements of a sparse matrix. The non-zero elements are mapped into the values array using the row-major upper triangular storage mapping described above.
columns    Element i of the integer array columns is the number of the column that contains the i-th element in the values array.
rowIndex   Element j of the integer array rowIndex gives the index of the element in the values array that is the first non-zero element in row j.

The length of the values and columns arrays is equal to the number of non-zero elements in the matrix.

As the rowIndex array gives the location of the first non-zero element within a row, and the non-zero elements are stored consecutively, the number of non-zero elements in the i-th row is equal to rowIndex(i+1) - rowIndex(i).

To have this relationship hold for the last row of the matrix, an additional entry (dummy entry) is added to the end of rowIndex. With one-based indexing, its value is equal to the number of non-zero elements plus one; with zero-based indexing, it is equal to the number of non-zero elements. This makes the total length of the rowIndex array one larger than the number of rows in the matrix.

NOTE The Intel MKL sparse storage scheme for the direct sparse solvers supports both one-based indexing and zero-based indexing.

Consider the symmetric matrix A (a star (*) represents a zero):

    |  1  -1   *  -3   * |
    | -1   5   *   *   * |
A = |  *   *   4   6   4 |
    | -3   *   6   7   * |
    |  *   *   4   *  -5 |

Only elements from the upper triangle are stored. The actual arrays for the matrix A are as follows:

Storage Arrays for a Symmetric Matrix

one-based indexing
values   = (1 -1 -3 5 4 6 4 7 -5)
columns  = (1 2 4 2 3 4 5 4 5)
rowIndex = (1 4 5 8 9 10)

zero-based indexing
values   = (1 -1 -3 5 4 6 4 7 -5)
columns  = (0 1 3 1 2 3 4 3 4)
rowIndex = (0 3 4 7 8 9)

For a non-symmetric or non-Hermitian matrix, all non-zero elements need to be stored.
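As an illustration of how the three arrays are traversed, the following C sketch computes the number of stored elements in a row and a matrix-vector product for a symmetric matrix held in the upper triangular format. It assumes zero-based indexing and plain int indices; the function names are hypothetical and are not Intel MKL routines.

```c
/* Number of stored elements in row i of a matrix in the 3-array format
   (zero-based indexing). */
int csr3_row_count(const int *rowIndex, int i)
{
    return rowIndex[i + 1] - rowIndex[i];
}

/* y = A*x for an n-by-n symmetric matrix A whose upper triangle is stored
   in the 3-array format with zero-based indexing. */
void csr3_sym_upper_matvec(int n, const double *values, const int *columns,
                           const int *rowIndex, const double *x, double *y)
{
    for (int i = 0; i < n; ++i)
        y[i] = 0.0;
    for (int i = 0; i < n; ++i) {
        for (int k = rowIndex[i]; k < rowIndex[i + 1]; ++k) {
            int j = columns[k];
            y[i] += values[k] * x[j];      /* stored element a(i,j) */
            if (j != i)
                y[j] += values[k] * x[i];  /* mirrored element a(j,i) */
        }
    }
}
```

Each off-diagonal element is read once and applied twice, which is what allows the solver to keep only half of the matrix.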
Consider the non-symmetric matrix B (a star (*) represents a zero):

    |  1  -1   *  -3   * |
    | -2   5   *   *   * |
B = |  *   *   4   6   4 |
    | -4   *   2   7   * |
    |  *   8   *   *  -5 |

The matrix B has 13 non-zero elements, and all of them are stored as follows:

Storage Arrays for a Non-Symmetric Matrix

one-based indexing
values   = (1 -1 -3 -2 5 4 6 4 -4 2 7 8 -5)
columns  = (1 2 4 1 2 3 4 5 1 3 4 2 5)
rowIndex = (1 4 6 9 12 14)

zero-based indexing
values   = (1 -1 -3 -2 5 4 6 4 -4 2 7 8 -5)
columns  = (0 1 3 0 1 2 3 4 0 2 3 1 4)
rowIndex = (0 3 5 8 11 13)

Direct sparse solvers can also solve symmetrically structured systems of equations. A symmetrically structured system of equations is one where the pattern of non-zero elements is symmetric. That is, a matrix has a symmetric structure if a(i,j) is not zero if and only if a(j,i) is not zero. From the point of view of the solver software, a "non-zero" element of a matrix is any element stored in the values array, even if its value is equal to 0. In that sense, any non-symmetric matrix can be turned into a symmetrically structured matrix by carefully adding zeros to the values array. For example, the above matrix B can be turned into a symmetrically structured matrix by adding two "non-zero" entries, that is, two explicitly stored zeros:

    |  1  -1   *  -3   * |
    | -2   5   *   *   0 |
B = |  *   *   4   6   4 |
    | -4   *   2   7   * |
    |  *   8   0   *  -5 |

The matrix B can be considered to be symmetrically structured with 15 non-zero elements and represented as:

Storage Arrays for a Symmetrically Structured Matrix

one-based indexing
values   = (1 -1 -3 -2 5 0 4 6 4 -4 2 7 8 0 -5)
columns  = (1 2 4 1 2 5 3 4 5 1 3 4 2 3 5)
rowIndex = (1 4 7 10 13 16)

zero-based indexing
values   = (1 -1 -3 -2 5 0 4 6 4 -4 2 7 8 0 -5)
columns  = (0 1 3 0 1 4 2 3 4 0 2 3 1 2 4)
rowIndex = (0 3 6 9 12 15)

Storage Format Restrictions

The storage format for the sparse solver must conform to two important restrictions:
- the non-zero values in a given row must be placed into the values array in the order in which they occur in the row (from left to right);
- no diagonal element can be omitted from the values array for any symmetric or structurally symmetric matrix.
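The two restrictions above can be checked mechanically before a matrix is handed to a solver. The following C sketch is one way to do it, assuming zero-based indexing and plain int indices; the function name and its return codes are hypothetical, and the diagonal check is applied unconditionally, so it is meant for symmetric or structurally symmetric input.

```c
/* Checks the two storage restrictions for an n-by-n matrix in the 3-array
   format (zero-based indexing).
   Returns 0 if both restrictions hold,
           1 if the column numbers in some row are not strictly increasing,
           2 if some row has no stored diagonal element. */
int csr3_check_restrictions(int n, const int *columns, const int *rowIndex)
{
    for (int i = 0; i < n; ++i) {
        int has_diagonal = 0;
        for (int k = rowIndex[i]; k < rowIndex[i + 1]; ++k) {
            if (k > rowIndex[i] && columns[k] <= columns[k - 1])
                return 1;          /* elements are not in left-to-right order */
            if (columns[k] == i)
                has_diagonal = 1;  /* diagonal entry is stored explicitly */
        }
        if (!has_diagonal)
            return 2;
    }
    return 0;
}
```

Only the columns and rowIndex arrays are inspected, because a stored diagonal element satisfies the restriction even when its value is zero.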
The second restriction implies that if symmetric or structurally symmetric matrices have zero diagonal elements, then they must be explicitly represented in the values array.

Sparse Matrix Storage Formats for Sparse BLAS Level 2 and Level 3

This section describes in detail the sparse matrix storage formats supported in the current version of the Intel MKL Sparse BLAS Level 2 and Level 3.

CSR Format

The Intel MKL compressed sparse row (CSR) format is specified by four arrays: values, columns, pointerB, and pointerE. The following table describes the arrays in terms of the values, row, and column positions of the non-zero elements in a sparse matrix A.

values     A real or complex array that contains the non-zero elements of A. Values of the non-zero elements of A are mapped into the values array using the row-major storage mapping described above.
columns    Element i of the integer array columns is the number of the column in A that contains the i-th value in the values array.
pointerB   Element j of this integer array gives the index of the element in the values array that is the first non-zero element in row j of A. Note that this index is equal to pointerB(j) - pointerB(1) + 1.
pointerE   An integer array that contains row indices, such that pointerE(j) - pointerB(1) is the index of the element in the values array that is the last non-zero element in row j of A.

The length of the values and columns arrays is equal to the number of non-zero elements in A. The length of the pointerB and pointerE arrays is equal to the number of rows in A.

NOTE The Intel MKL Sparse BLAS routines support the CSR format both with one-based indexing and zero-based indexing.
The matrix B can be represented in the CSR format as:

Storage Arrays for a Matrix in CSR Format

one-based indexing
values   = (1 -1 -3 -2 5 4 6 4 -4 2 7 8 -5)
columns  = (1 2 4 1 2 3 4 5 1 3 4 2 5)
pointerB = (1 4 6 9 12)
pointerE = (4 6 9 12 14)

zero-based indexing
values   = (1 -1 -3 -2 5 4 6 4 -4 2 7 8 -5)
columns  = (0 1 3 0 1 2 3 4 0 2 3 1 4)
pointerB = (0 3 5 8 11)
pointerE = (3 5 8 11 13)

This storage format is used in the NIST Sparse BLAS library [Rem05].

Note that the storage format accepted for the direct sparse solvers and described above (see Storage Formats for the Direct Sparse Solvers) is a variation of the CSR format. It is also used in the Intel MKL Sparse BLAS Level 2 both with one-based indexing and zero-based indexing. The above matrix B can be represented in this format (referred to as the 3-array variation of the CSR format) as:

Storage Arrays for a Matrix in CSR Format (3-Array Variation)

one-based indexing
values   = (1 -1 -3 -2 5 4 6 4 -4 2 7 8 -5)
columns  = (1 2 4 1 2 3 4 5 1 3 4 2 5)
rowIndex = (1 4 6 9 12 14)

zero-based indexing
values   = (1 -1 -3 -2 5 4 6 4 -4 2 7 8 -5)
columns  = (0 1 3 0 1 2 3 4 0 2 3 1 4)
rowIndex = (0 3 5 8 11 13)

The 3-array variation of the CSR format has a restriction: all non-zero elements are stored contiguously, that is, the set of non-zero elements in the row J goes just after the set of non-zero elements in the row J-1.

There are no such restrictions in the general (NIST) CSR format. This may be useful, for example, if there is a need to operate with different submatrices of the matrix at the same time. In this case, it is enough to define the arrays pointerB and pointerE for each needed submatrix so that all these arrays are pointers to the same array values.
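The role of the pointerB and pointerE arrays can be illustrated with a small C sketch of a matrix-vector product. It assumes zero-based indexing with pointerB[0] equal to 0, so that the entries of pointerB and pointerE index the values array directly, and it treats pointerE[i] as one past the last element of row i, as in the zero-based arrays above. The function name is hypothetical and is not an Intel MKL routine.

```c
/* y = A*x for an m-row matrix A in the 4-array CSR format.
   Assumes zero-based indexing with pointerB[0] == 0; row i occupies
   values[pointerB[i]] .. values[pointerE[i] - 1]. */
void csr4_matvec(int m, const double *values, const int *columns,
                 const int *pointerB, const int *pointerE,
                 const double *x, double *y)
{
    for (int i = 0; i < m; ++i) {
        double s = 0.0;
        for (int k = pointerB[i]; k < pointerE[i]; ++k)
            s += values[k] * x[columns[k]];
        y[i] = s;
    }
}
```

Because the row ranges need not be adjacent, a second pair of pointer arrays can describe a submatrix over the same values and columns arrays. For instance, shortening each row range so that it stops before the first element whose column number is 3 or larger selects the three leading columns of B without copying any data.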
Comparing the array rowIndex from the Table "Storage Arrays for a Non-Symmetric Example Matrix" with the arrays pointerB and pointerE from the Table "Storage Arrays for an Example Matrix in CSR Format", it is easy to see that

    pointerB(i) = rowIndex(i) for i = 1, ..., 5;
    pointerE(i) = rowIndex(i+1) for i = 1, ..., 5.

This enables calling a routine that has values, columns, pointerB, and pointerE as input parameters for a sparse matrix stored in the format accepted for the direct sparse solvers. For example, a routine with the interface:

    Subroutine name_routine(.... , values, columns, pointerB, pointerE, ...)

can be called with parameters values, columns, rowIndex as follows:

    call name_routine(.... , values, columns, rowIndex, rowIndex(2), ...)

CSC Format

The compressed sparse column format (CSC) is similar to the CSR format, but the columns are used instead of the rows. In other words, the CSC format is identical to the CSR format for the transposed matrix. The CSC format is specified by four arrays: values, rows, pointerB, and pointerE. The following table describes the arrays in terms of the values, row, and column positions of the non-zero elements in a sparse matrix A.

values
    A real or complex array that contains the non-zero elements of A. Values of the non-zero elements of A are mapped into the values array using the column-major storage mapping.
rows
    Element i of the integer array rows is the number of the row in A that contains the i-th value in the values array.
pointerB
    Element j of this integer array gives the index of the element in the values array that is the first non-zero element in column j of A. Note that this index is equal to pointerB(j) - pointerB(1) + 1.
pointerE
    An integer array that contains column indices, such that pointerE(j) - pointerB(1) is the index of the element in the values array that is the last non-zero element in column j of A.

The length of the values and rows arrays is equal to the number of non-zero elements in A. The length of the pointerB and pointerE arrays is equal to the number of columns in A.

NOTE
Intel MKL Sparse BLAS routines support the CSC format both with one-based indexing and zero-based indexing.

The above matrix B can be represented in the CSC format as:

Storage Arrays for a Matrix in CSC Format

one-based indexing
    values   = (1 -2 -4 -1 5 8 4 2 -3 6 7 4 -5)
    rows     = (1 2 4 1 2 5 3 4 1 3 4 3 5)
    pointerB = (1 4 7 9 12)
    pointerE = (4 7 9 12 14)
zero-based indexing
    values   = (1 -2 -4 -1 5 8 4 2 -3 6 7 4 -5)
    rows     = (0 1 3 0 1 4 2 3 0 2 3 2 4)
    pointerB = (0 3 6 8 11)
    pointerE = (3 6 8 11 13)

Coordinate Format

The coordinate format is the most flexible and simplest format for the sparse matrix representation. Only non-zero elements are stored, and the coordinates of each non-zero element are given explicitly. Many commercial libraries support the matrix-vector multiplication for the sparse matrices in the coordinate format.

The Intel MKL coordinate format is specified by three arrays: values, rows, and columns, and a parameter nnz, which is the number of non-zero elements in A. All three arrays have dimension nnz. The following table describes the arrays in terms of the values, row, and column positions of the non-zero elements in a sparse matrix A.

values
    A real or complex array that contains the non-zero elements of A in any order.
rows
    Element i of the integer array rows is the number of the row in A that contains the i-th value in the values array.
columns
    Element i of the integer array columns is the number of the column in A that contains the i-th value in the values array.

NOTE
Intel MKL Sparse BLAS routines support the coordinate format both with one-based indexing and zero-based indexing.
For example, the sparse matrix C can be represented in the coordinate format as follows:

Storage Arrays for an Example Matrix in the Coordinate Format

one-based indexing
    values  = (1 -1 -3 -2 5 4 6 4 -4 2 7 8 -5)
    rows    = (1 1 1 2 2 3 3 3 4 4 4 5 5)
    columns = (1 2 3 1 2 3 4 5 1 3 4 2 5)
zero-based indexing
    values  = (1 -1 -3 -2 5 4 6 4 -4 2 7 8 -5)
    rows    = (0 0 0 1 1 2 2 2 3 3 3 4 4)
    columns = (0 1 2 0 1 2 3 4 0 2 3 1 4)

Diagonal Storage Format

If the sparse matrix has diagonals containing only zero elements, then the diagonal storage format can be used to reduce the amount of information needed to locate the non-zero elements. This storage format is particularly useful in many applications where the matrix arises from a finite element or finite difference discretization.

The Intel MKL diagonal storage format is specified by two arrays: values and distance, and two parameters: ndiag, which is the number of non-empty diagonals, and lval, which is the declared leading dimension in the calling (sub)programs. The following table describes the arrays values and distance:

values
    A real or complex two-dimensional array dimensioned as lval by ndiag. Each column of it contains the non-zero elements of a certain diagonal of A. The key point of the storage is that each element in values retains the row number of the original matrix. To achieve this, diagonals in the lower triangular part of the matrix are padded from the top, and those in the upper triangular part are padded from the bottom. Note that the absolute value of distance(i) is the number of elements to be padded for diagonal i.
distance
    An integer array with dimension ndiag. Element i of the array distance is the distance between the i-th stored diagonal and the main diagonal. The distance is positive if the diagonal is above the main diagonal, and negative if the diagonal is below the main diagonal. The main diagonal has a distance equal to zero.
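The padding rule described for values and distance can be sketched as a small conversion routine. This is a minimal illustration under stated assumptions rather than an Intel MKL routine: dense_to_diag is a hypothetical name, the dense matrix is rebuilt here from the coordinate-format arrays listed for matrix C, and values is laid out column by column with leading dimension LVAL.

```c
#include <assert.h>

#define N 5       /* order of the example matrix C */
#define NDIAG 5   /* number of non-empty diagonals of C */
#define LVAL 5    /* leading dimension of the values array */

/* Matrix C, rebuilt from the coordinate-format arrays in the text. */
static const double c_dense[N][N] = {
    { 1, -1, -3,  0,  0},
    {-2,  5,  0,  0,  0},
    { 0,  0,  4,  6,  4},
    {-4,  0,  2,  7,  0},
    { 0,  8,  0,  0, -5}
};

/* Hypothetical helper: fill values (column-major, LVAL by NDIAG) so that
   the element of stored diagonal d lying in row i of the matrix lands in
   row i of column d. Slots with no matrix element are padding and receive
   the value pad. Lower diagonals are therefore padded from the top and
   upper diagonals from the bottom. */
void dense_to_diag(const double a[N][N], const int distance[NDIAG],
                   double values[LVAL * NDIAG], double pad)
{
    for (int d = 0; d < NDIAG; ++d) {
        assert(distance[d] > -N && distance[d] < N);
        for (int i = 0; i < N; ++i) {
            int j = i + distance[d];   /* column of the element in row i */
            values[i + d * LVAL] = (j >= 0 && j < N) ? a[i][j] : pad;
        }
    }
}
```

Called with distance = (-3, -1, 0, 1, 2), the first column of values holds three padded slots followed by -4 and 8, which is the "padded from the top" case.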
The above matrix C can be represented in the diagonal storage format as follows:

    distance = (-3 -1 0 1 2)

    values =
         *   *   1  -1  -3
         *  -2   5   0   0
         *   0   4   6   4
        -4   2   7   0   *
         8   0  -5   *   *

where the asterisks denote padded elements.

When storing symmetric, Hermitian, or skew-symmetric matrices, it is necessary to store only the upper or the lower triangular part of the matrix.

For the Intel MKL triangular solver routines, elements of the array distance must be sorted in increasing order. In all other cases the diagonals and distances can be stored in arbitrary order.

Skyline Storage Format

The skyline storage format is important for the direct sparse solvers, and it is well suited for Cholesky or LU decomposition when no pivoting is required.

The skyline storage format accepted in Intel MKL can store only a triangular matrix or a triangular part of a matrix. This format is specified by two arrays: values and pointers. The following table describes these arrays:

values
    A scalar array. For a lower triangular matrix it contains the set of elements from each row of the matrix starting from the first non-zero element to and including the diagonal element. For an upper triangular matrix it contains the set of elements from each column of the matrix starting with the first non-zero element down to and including the diagonal element. Encountered zero elements are included in the sets.
pointers
    An integer array with dimension (m+1), where m is the number of rows for the lower triangle (columns for the upper triangle). pointers(i) - pointers(1) + 1 gives the index of the element in values that is the first non-zero element in row (column) i. The value of pointers(m+1) is set to nnz + pointers(1), where nnz is the number of elements in the array values.
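The pointers convention can be made concrete with an element lookup for the lower-triangle case. The sketch below is an assumption-laden illustration, not an Intel MKL routine: skyline_lower_get is a hypothetical name, indices i and j are one-based as in the text, and the arrays are the ones given for the lower triangle of the example matrix C.

```c
#include <assert.h>

/* Lower triangle of the example matrix C in skyline format (one-based
   pointers), as listed in the text. */
static const double sky_values[12] = {1, -2, 5, 4, -4, 0, 2, 7, 8, 0, 0, -5};
static const int sky_pointers[6]   = {1, 2, 4, 5, 9, 13};

/* Hypothetical helper: return element (i, j), with one-based i >= j, of a
   lower triangular matrix stored in skyline format. Row i holds
   pointers(i+1) - pointers(i) elements that end at the diagonal, so its
   first stored column is i - len + 1. Columns to the left of that are
   zero and are not stored. */
double skyline_lower_get(const double *values, const int *pointers,
                         int i, int j)
{
    assert(j >= 1 && j <= i);
    int start = pointers[i - 1] - pointers[0];  /* zero-based offset of row i */
    int len = pointers[i] - pointers[i - 1];    /* stored elements in row i */
    int first_col = i - len + 1;
    if (j < first_col)
        return 0.0;
    return values[start + (j - first_col)];
}
```

The upper-triangle case is symmetric: the same arithmetic applies with the roles of rows and columns exchanged.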
For example, the lower triangle of the matrix C given above can be stored as follows:

    values   = ( 1 -2 5 4 -4 0 2 7 8 0 0 -5 )
    pointers = ( 1 2 4 5 9 13 )

and the upper triangle of this matrix C can be stored as follows:

    values   = ( 1 -1 5 -3 0 4 6 7 4 0 -5 )
    pointers = ( 1 2 4 7 9 12 )

This storage format is supported by the NIST Sparse BLAS library [Rem05].

Note that the Intel MKL Sparse BLAS routines operating with the skyline storage format do not support general matrices.

BSR Format

The Intel MKL block compressed sparse row (BSR) format for sparse matrices is specified by four arrays: values, columns, pointerB, and pointerE. The following table describes these arrays.

values
    A real array that contains the elements of the non-zero blocks of a sparse matrix. The elements are stored block-by-block in row-major order. A non-zero block is a block that contains at least one non-zero element. All elements of non-zero blocks are stored, even if some of them are equal to zero. Within each non-zero block, elements are stored in column-major order in the case of one-based indexing, and in row-major order in the case of zero-based indexing.
columns
    Element i of the integer array columns is the number of the column in the block matrix that contains the i-th non-zero block.
pointerB
    Element j of this integer array gives the index of the element in the columns array that is the first non-zero block in row j of the block matrix.
pointerE
    Element j of this integer array gives the index of the element in the columns array that contains the last non-zero block in row j of the block matrix, plus 1.

The length of the values array is equal to the number of all elements in the non-zero blocks, and the length of the columns array is equal to the number of non-zero blocks. The length of the pointerB and pointerE arrays is equal to the number of block rows in the block matrix.
NOTE
Intel MKL Sparse BLAS routines support the BSR format both with one-based indexing and zero-based indexing.

For example, consider the sparse matrix D

If the size of the block equals 2, then the sparse matrix D can be represented as a 3x3 block matrix E with the following structure:

where

The matrix D can be represented in the BSR format as follows:

one-based indexing
    values   = (1 2 0 1 6 8 7 2 1 5 4 1 4 0 3 0 7 0 2 0)
    columns  = (1 2 2 2 3)
    pointerB = (1 3 4)
    pointerE = (3 4 6)
zero-based indexing
    values   = (1 0 2 1 6 7 8 2 1 4 5 1 4 3 0 0 7 2 0 0)
    columns  = (0 1 1 1 2)
    pointerB = (0 2 3)
    pointerE = (2 3 5)

This storage format is supported by the NIST Sparse BLAS library [Rem05].

Intel MKL supports the variation of the BSR format that is specified by three arrays: values, columns, and rowIndex. The following table describes these arrays.

values
    A real array that contains the elements of the non-zero blocks of a sparse matrix. The elements are stored block by block in row-major order. A non-zero block is a block that contains at least one non-zero element. All elements of non-zero blocks are stored, even if some of them are equal to zero. Within each non-zero block the elements are stored in column-major order in the case of one-based indexing, and in row-major order in the case of zero-based indexing.
columns
    Element i of the integer array columns is the number of the column in the block matrix that contains the i-th non-zero block.
rowIndex
    Element j of this integer array gives the index of the element in the columns array that is the first non-zero block in row j of the block matrix.

The length of the values array is equal to the number of all elements in the non-zero blocks, and the length of the columns array is equal to the number of non-zero blocks.
As the rowIndex array gives the location of the first non-zero block within a row, and the non-zero blocks are stored consecutively, the number of non-zero blocks in the i-th row is equal to the difference rowIndex(i+1) - rowIndex(i).

To retain this relationship for the last row of the block matrix, an additional entry (dummy entry) is added to the end of rowIndex with a value equal to the number of non-zero blocks plus one (for one-based indexing). This makes the total length of the rowIndex array one larger than the number of rows of the block matrix.

The above matrix D can be represented in this 3-array variation of the BSR format as follows:

one-based indexing
    values   = (1 2 0 1 6 8 7 2 1 5 4 1 4 0 3 0 7 0 2 0)
    columns  = (1 2 2 2 3)
    rowIndex = (1 3 4 6)
zero-based indexing
    values   = (1 0 2 1 6 7 8 2 1 4 5 1 4 3 0 0 7 2 0 0)
    columns  = (0 1 1 1 2)
    rowIndex = (0 2 3 5)

When storing symmetric matrices, it is necessary to store only the upper or the lower triangular part of the matrix.

For example, consider the symmetric sparse matrix F:

If the size of the block equals 2, then the sparse matrix F can be represented as a 3x3 block matrix G with the following structure:

where

The symmetric matrix F can be represented in this 3-array variation of the BSR format (storing only the upper triangular part) as follows:

one-based indexing
    values   = (1 2 0 1 6 8 7 2 1 5 4 2 7 0 2 0)
    columns  = (1 2 2 3)
    rowIndex = (1 3 4 5)
zero-based indexing
    values   = (1 0 2 1 6 7 8 2 1 4 5 2 7 2 0 0)
    columns  = (0 1 1 2)
    rowIndex = (0 2 3 4)

Routine and Function Arguments B

The major arguments in the BLAS routines are vector and matrix, whereas VML functions work on vector arguments only. The sections that follow discuss each of these arguments and provide examples.

Vector Arguments in BLAS

Vector arguments are passed in one-dimensional arrays. The array dimension (length) and vector increment are passed as integer variables.
The length determines the number of elements in the vector. The increment (also called stride) determines the spacing between vector elements and the order of the elements in the array in which the vector is passed.

A vector of length n and increment incx is passed in a one-dimensional array x whose values are defined as

    x(1), x(1+|incx|), ..., x(1+(n-1)*|incx|)

If incx is positive, then the elements in array x are stored in increasing order. If incx is negative, the elements in array x are stored in decreasing order with the first element defined as x(1+(n-1)*|incx|). If incx is zero, then all elements of the vector have the same value, x(1). The dimension of the one-dimensional array that stores the vector must always be at least

    idimx = 1 + (n-1)*|incx|

Example. One-dimensional Real Array

Let x(1:7) be the one-dimensional real array

    x = (1.0, 3.0, 5.0, 7.0, 9.0, 11.0, 13.0).

If incx = 2 and n = 3, then the vector argument with elements in order from first to last is (1.0, 5.0, 9.0).
If incx = -2 and n = 4, then the vector elements in order from first to last are (13.0, 9.0, 5.0, 1.0).
If incx = 0 and n = 4, then the vector elements in order from first to last are (1.0, 1.0, 1.0, 1.0).

One-dimensional substructures of a matrix, such as the rows, columns, and diagonals, can be passed as vector arguments with the starting address and increment specified. In Fortran, storing the m-by-n matrix is based on column-major ordering where the increment between elements in the same column is 1, the increment between elements in the same row is m, and the increment between elements on the same diagonal is m + 1.

Example. Two-dimensional Real Matrix

Let a be the real 5 x 4 matrix declared as REAL A (5,4).

To scale the third column of a by 2.0, use the BLAS routine sscal with the following calling sequence:

    call sscal (5, 2.0, a(1,3), 1)

To scale the second row, use the statement:

    call sscal (4, 2.0, a(2,1), 5)

To scale the main diagonal of A by 2.0, use the statement (the main diagonal of a 5 x 4 matrix has four elements):

    call sscal (4, 2.0, a(1,1), 6)

NOTE
The default vector increment is assumed to be 1.

Vector Arguments in VML

Vector arguments of VML mathematical functions are passed in one-dimensional arrays with unit vector increment. This means that a vector of length n is passed contiguously in an array a whose values are defined as a[0], a[1], ..., a[n-1] (for the C interface).

To accommodate arrays with other increments, or more complicated indexing, VML contains auxiliary pack/unpack functions that gather the array elements into a contiguous vector and then scatter them after the computation is complete.

Generally, if the vector elements are stored in a one-dimensional array a as a[m0], a[m1], ..., a[mn-1] and need to be regrouped into an array y as y[k0], y[k1], ..., y[kn-1], VML pack/unpack functions can use one of the following indexing methods:

Positive Increment Indexing

    kj = incy * j, mj = inca * j, j = 0, ..., n-1

Constraint: incy > 0 and inca > 0.

For example, setting incy = 1 specifies gathering array elements into a contiguous vector.

This method is similar to that used in BLAS, with the exception that negative and zero increments are not permitted.

Index Vector Indexing

    kj = iy[j], mj = ia[j], j = 0, ..., n-1,

where ia and iy are arrays of length n that contain index vectors for the input and output arrays a and y, respectively.

Mask Vector Indexing

Indices kj, mj are such that:

    my[kj] ≠ 0, ma[mj] ≠ 0, j = 0, ..., n-1,

where ma and my are arrays that contain mask vectors for the input and output arrays a and y, respectively.
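The increment and mask conventions above can be sketched in plain C with zero-based arrays. The helpers below are hypothetical illustrations and are not the VML pack/unpack functions themselves: strided_element mirrors the BLAS rule for positive, negative, and zero increments, pack_increment mirrors positive increment indexing, and pack_mask is a simplified mask gather that assumes a contiguous output (no output mask).

```c
#include <assert.h>
#include <stddef.h>

/* Return logical element k (0 <= k < n) of a BLAS-style vector stored in x
   with increment incx. A negative increment walks the array backwards,
   and a zero increment always yields x[0]. */
double strided_element(const double *x, int n, int incx, int k)
{
    assert(n > 0 && k >= 0 && k < n);
    if (incx >= 0)
        return x[(size_t)k * (size_t)incx];
    return x[(size_t)(n - 1 - k) * (size_t)(-incx)];
}

/* Positive increment indexing: y[incy*j] = a[inca*j], j = 0, ..., n-1. */
void pack_increment(int n, const double *a, int inca, double *y, int incy)
{
    assert(inca > 0 && incy > 0);
    for (int j = 0; j < n; ++j)
        y[(size_t)incy * (size_t)j] = a[(size_t)inca * (size_t)j];
}

/* Simplified mask gather: copy every a[m] with ma[m] != 0 into the
   contiguous array y and return the number of elements copied. */
int pack_mask(int len, const double *a, const int *ma, double *y)
{
    int count = 0;
    for (int m = 0; m < len; ++m)
        if (ma[m] != 0)
            y[count++] = a[m];
    return count;
}
```

With the seven-element array from the example, strided_element reproduces the three cases listed for incx = 2, incx = -2, and incx = 0.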
Matrix Arguments

Matrix arguments of the Intel® Math Kernel Library routines can be stored in either one- or two-dimensional arrays, using the following storage schemes:
• conventional full storage (in a two-dimensional array)
• packed storage for Hermitian, symmetric, or triangular matrices (in a one-dimensional array)
• band storage for band matrices (in a two-dimensional array)
• rectangular full packed storage for symmetric, Hermitian, or triangular matrices, which is as compact as the packed storage while maintaining efficiency by using Level 3 BLAS/LAPACK kernels.

Full storage is the following obvious scheme: a matrix A is stored in a two-dimensional array a, with the matrix element aij stored in the array element a(i,j).

If a matrix is triangular (upper or lower, as specified by the argument uplo), only the elements of the relevant triangle are stored; the remaining elements of the array need not be set.

Routines that handle symmetric or Hermitian matrices allow for either the upper or lower triangle of the matrix to be stored in the corresponding elements of the array:

    if uplo = 'U', aij is stored in a(i,j) for i ≤ j; other elements of a need not be set.
    if uplo = 'L', aij is stored in a(i,j) for j ≤ i; other elements of a need not be set.

Packed storage allows you to store symmetric, Hermitian, or triangular matrices more compactly: the relevant triangle (again, as specified by the argument uplo) is packed by columns in a one-dimensional array ap:

    if uplo = 'U', aij is stored in ap(i+j(j-1)/2) for i ≤ j
    if uplo = 'L', aij is stored in ap(i+(2*n-j)*(j-1)/2) for j ≤ i.

In descriptions of LAPACK routines, arrays with packed matrices have names ending in p.

Band storage is as follows: an m-by-n band matrix with kl non-zero sub-diagonals and ku non-zero super-diagonals is stored compactly in a two-dimensional array ab with kl+ku+1 rows and n columns. Columns of the matrix are stored in the corresponding columns of the array, and diagonals of the matrix are stored in rows of the array. Thus,

    aij is stored in ab(ku+1+i-j,j) for max(1,j-ku) ≤ i ≤ min(m,j+kl).

Use the band storage scheme only when kl and ku are much less than the matrix size n. Although the routines work correctly for all values of kl and ku, using the band storage is inefficient if your matrices are not really banded.

The band storage scheme is illustrated by the following example, when m = n = 6, kl = 2, ku = 1. Array elements marked * are not used by the routines:

     *   a12  a23  a34  a45  a56
    a11  a22  a33  a44  a55  a66
    a21  a32  a43  a54  a65   *
    a31  a42  a53  a64   *    *

When a general band matrix is supplied for LU factorization, space must be allowed to store kl additional super-diagonals generated by fill-in as a result of row interchanges. This means that the matrix is stored according to the above scheme, but with kl + ku super-diagonals. Thus,

    aij is stored in ab(kl+ku+1+i-j,j) for max(1,j-ku) ≤ i ≤ min(m,j+kl).

The band storage scheme for LU factorization is illustrated by the following example, when m = n = 6, kl = 2, ku = 1:

     *    *    *    +    +    +
     *    *    +    +    +    +
     *   a12  a23  a34  a45  a56
    a11  a22  a33  a44  a55  a66
    a21  a32  a43  a54  a65   *
    a31  a42  a53  a64   *    *

Array elements marked * are not used by the routines; elements marked + need not be set on entry, but are required by the LU factorization routines to store the results. The input array will be overwritten on exit by the details of the LU factorization as follows:

     *    *    *   u14  u25  u36
     *    *   u13  u24  u35  u46
     *   u12  u23  u34  u45  u56
    u11  u22  u33  u44  u55  u66
    m21  m32  m43  m54  m65   *
    m31  m42  m53  m64   *    *

where uij are the elements of the upper triangular matrix U, and mij are the multipliers used during factorization.

Triangular band matrices are stored in the same format, with either kl = 0 if upper triangular, or ku = 0 if lower triangular. For symmetric or Hermitian band matrices with k sub-diagonals or super-diagonals, you need to store only the upper or lower triangle, as specified by the argument uplo:

    if uplo = 'U', aij is stored in ab(k+1+i-j,j) for max(1,j-k) ≤ i ≤ j
    if uplo = 'L', aij is stored in ab(1+i-j,j) for j ≤ i ≤ min(n,j+k).
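The packed and band index formulas above can be checked with a few small index functions. This is a sketch under stated assumptions, not library code: the function names are hypothetical, i and j are one-based as in the formulas, and each function returns a zero-based offset into the corresponding array as it would be addressed from C (band storage is assumed column-major with leading dimension ldab).

```c
#include <assert.h>

/* Offset of a(i,j), i <= j, in packed storage with uplo = 'U':
   ap(i + j(j-1)/2), converted to a zero-based offset. */
int packed_upper_offset(int i, int j)
{
    assert(i >= 1 && i <= j);
    return i + j * (j - 1) / 2 - 1;
}

/* Offset of a(i,j), j <= i <= n, in packed storage with uplo = 'L':
   ap(i + (2n-j)(j-1)/2), converted to a zero-based offset. */
int packed_lower_offset(int n, int i, int j)
{
    assert(j >= 1 && j <= i && i <= n);
    return i + (2 * n - j) * (j - 1) / 2 - 1;
}

/* Nonzero if a(i,j) lies inside the band of an m-by-n matrix with kl
   sub-diagonals and ku super-diagonals. */
int in_band(int m, int n, int kl, int ku, int i, int j)
{
    if (i < 1 || i > m || j < 1 || j > n)
        return 0;
    return i >= j - ku && i <= j + kl;
}

/* Offset of a(i,j) in general band storage: ab(ku+1+i-j, j) with ab
   stored column by column and leading dimension ldab >= kl+ku+1. */
int band_offset(int ku, int ldab, int i, int j)
{
    return (ku + i - j) + (j - 1) * ldab;
}
```

For the 6-by-6 example with kl = 2 and ku = 1, band_offset places a11 in the second row of the first column of ab, matching the illustration.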
In descriptions of LAPACK routines, arrays that hold matrices in band storage have names ending in b.

In Fortran, column-major ordering of storage is assumed. This means that elements of the same column occupy successive storage locations.

Three quantities are usually associated with a two-dimensional array argument: its leading dimension, which specifies the number of storage locations between elements in the same row, its number of rows, and its number of columns. For a matrix in full storage, the leading dimension of the array must be at least as large as the number of rows in the matrix.

A character transposition parameter is often passed to indicate whether the matrix argument is to be used in normal or transposed form or, for a complex matrix, if the conjugate transpose of the matrix is to be used.

The values of the transposition parameter for these three cases are the following:

    'N' or 'n'    normal (no conjugation, no transposition)
    'T' or 't'    transpose
    'C' or 'c'    conjugate transpose.

Example. Two-Dimensional Complex Array

Suppose A (1:5, 1:4) is the complex two-dimensional array presented by matrix

Let transa be the transposition parameter, m be the number of rows, n be the number of columns, and lda be the leading dimension. Then if transa = 'N', m = 4, n = 2, and lda = 5, the matrix argument would be

If transa = 'T', m = 4, n = 2, and lda = 5, the matrix argument would be

If transa = 'C', m = 4, n = 2, and lda = 5, the matrix argument would be

Note that care should be taken when using a leading dimension value which is different from the number of rows specified in the declaration of the two-dimensional array. For example, suppose the array A above is declared as COMPLEX A (5,4).

Then if transa = 'N', m = 3, n = 4, and lda = 4, the matrix argument will be

Rectangular Full Packed storage allows you to store symmetric, Hermitian, or triangular matrices as compactly as the Packed storage while maintaining efficiency by using Level 3 BLAS/LAPACK kernels. To store an n-by-n triangle (and suppose for simplicity that n is even), you partition the triangle into three parts: two n/2-by-n/2 triangles and an n/2-by-n/2 square, then pack this as an n-by-n/2 rectangle (or n/2-by-n rectangle), by transposing (or transpose-conjugating) one of the triangles and packing it next to the other triangle. Since the two triangles are stored in full storage, you can use existing efficient routines on them.

There are eight cases of RFP storage representation: n is even or odd, the packed matrix is transposed or not, and the triangular matrix is lower or upper. See below for all the eight storage schemes illustrated:
• n is odd, A is lower triangular
• n is even, A is lower triangular
• n is odd, A is upper triangular
• n is even, A is upper triangular

Intel MKL provides a number of routines, such as ?hfrk and ?sfrk, that perform BLAS operations working directly on RFP matrices, as well as some conversion routines; for instance, ?tpttf goes from the standard packed format to RFP and ?trttf goes from the full format to RFP.

Please refer to the Netlib site for more information.

Note that in the descriptions of LAPACK routines, arrays with RFP matrices have names ending in fp.

Code Examples C

This appendix presents code examples of using some Intel MKL routines and functions. You can find here example code written in both Fortran and C.

Please refer to respective chapters in the manual for detailed descriptions of function parameters and operation.

BLAS Code Examples

Example. Using BLAS Level 1 Function

The following example illustrates a call to the BLAS Level 1 function sdot. This function performs a vector-vector operation of computing a scalar product of two single-precision real vectors x and y.

Parameters
    n       Specifies the number of elements in vectors x and y.
    incx    Specifies the increment for the elements of x.
    incy    Specifies the increment for the elements of y.

    program dot_main
    real x(10), y(10), sdot, res
    integer n, incx, incy, i
    external sdot
    n = 5
    incx = 2
    incy = 1
    do i = 1, 10
      x(i) = 2.0e0
      y(i) = 1.0e0
    end do
    res = sdot (n, x, incx, y, incy)
    print*, 'SDOT = ', res
    end

As a result of this program execution, the following line is printed:

    SDOT = 10.000

Example. Using BLAS Level 1 Routine

The following example illustrates a call to the BLAS Level 1 routine scopy. This routine performs a vector-vector operation of copying a single-precision real vector x to a vector y.

Parameters
    n       Specifies the number of elements in vectors x and y.
    incx    Specifies the increment for the elements of x.
    incy    Specifies the increment for the elements of y.

    program copy_main
    real x(10), y(10)
    integer n, incx, incy, i
    n = 3
    incx = 3
    incy = 1
    do i = 1, 10
      x(i) = i
    end do
    call scopy (n, x, incx, y, incy)
    print*, 'Y = ', (y(i), i = 1, n)
    end

As a result of this program execution, the following line is printed:

    Y = 1.00000 4.00000 7.00000

Example. Using BLAS Level 2 Routine

The following example illustrates a call to the BLAS Level 2 routine sger. This routine performs a matrix-vector operation

    a := alpha*x*y' + a.

Parameters
    alpha   Specifies a scalar alpha.
    x       m-element vector.
    y       n-element vector.
    a       m-by-n matrix.

    program ger_main
    real a(5,3), x(10), y(10), alpha
    integer m, n, incx, incy, i, j, lda
    m = 2
    n = 3
    lda = 5
    incx = 2
    incy = 1
    alpha = 0.5
    do i = 1, 10
      x(i) = 1.0
      y(i) = 1.0
    end do
    do i = 1, m
      do j = 1, n
        a(i,j) = j
      end do
    end do
    call sger (m, n, alpha, x, incx, y, incy, a, lda)
    print*, 'Matrix A: '
    do i = 1, m
      print*, (a(i,j), j = 1, n)
    end do
    end

As a result of this program execution, matrix a is printed as follows:

    Matrix A:
    1.50000 2.50000 3.50000
    1.50000 2.50000 3.50000

Example. Using BLAS Level 3 Routine

The following example illustrates a call to the BLAS Level 3 routine ssymm. With side = 'l', as used here, this routine performs a matrix-matrix operation

    c := alpha*a*b + beta*c.

Parameters
    alpha   Specifies a scalar alpha.
    beta    Specifies a scalar beta.
    a       Symmetric matrix.
    b       m-by-n matrix.
    c       m-by-n matrix.

    program symm_main
    real a(3,3), b(3,2), c(3,3), alpha, beta
    integer m, n, lda, ldb, ldc, i, j
    character uplo, side
    uplo = 'u'
    side = 'l'
    m = 3
    n = 2
    lda = 3
    ldb = 3
    ldc = 3
    alpha = 0.5
    beta = 2.0
    do i = 1, m
      do j = 1, m
        a(i,j) = 1.0
      end do
    end do
    do i = 1, m
      do j = 1, n
        c(i,j) = 1.0
        b(i,j) = 2.0
      end do
    end do
    call ssymm (side, uplo, m, n, alpha, a, lda, b, ldb, beta, c, ldc)
    print*, 'Matrix C: '
    do i = 1, m
      print*, (c(i,j), j = 1, n)
    end do
    end

As a result of this program execution, matrix c is printed as follows:

    Matrix C:
    5.00000 5.00000
    5.00000 5.00000
    5.00000 5.00000

The following example illustrates a call from a C program to the complex BLAS Level 1 function zdotc(). This function computes the dot product of two double-precision complex vectors.

Example. Calling a Complex BLAS Level 1 Function from C

In this example, the complex dot product is returned in the structure c.
    #include <stdio.h>
    #include "mkl_blas.h"
    #define N 5
    int main(void)
    {
      int n, inca = 1, incb = 1, i;
      MKL_Complex16 a[N], b[N], c;
      n = N;
      for( i = 0; i < n; i++ ){
        a[i].real = (double)i;
        a[i].imag = (double)i * 2.0;
        b[i].real = (double)(n - i);
        b[i].imag = (double)i * 2.0;
      }
      /* zdotc is declared in mkl_blas.h; the result is returned through c */
      zdotc( &c, &n, a, &inca, b, &incb );
      printf( "The complex dot product is: ( %6.2f, %6.2f )\n", c.real, c.imag );
      return 0;
    }

NOTE
Instead of calling BLAS directly from C programs, you might wish to use the CBLAS interface; this is the supported way of calling BLAS from C. For more information about CBLAS, see Appendix D, which presents CBLAS, the C interface to the Basic Linear Algebra Subprograms (BLAS) implemented in Intel® MKL.

Fourier Transform Functions Code Examples

This section presents code examples of functions described in the “FFT Functions” and “Cluster FFT Functions” sections in the “Fourier Transform Functions” chapter. The examples are grouped in subsections
• Examples for FFT Functions, including Examples of Using Multi-Threading for FFT Computation
• Examples for Cluster FFT Functions
• Auxiliary data transformations.

FFT Code Examples

This section presents code examples of using the FFT interface functions described in the “Fourier Transform Functions” chapter. Here are the examples of two one-dimensional computations. These examples use the default settings for all of the configuration parameters, which are specified in “Configuration Settings”.

One-dimensional In-place FFT (Fortran Interface)

    ! Fortran example.
    ! 1D complex to complex, and real to conjugate-even
    Use MKL_DFTI
    Complex :: X(32)
    Real :: Y(34)
    type(DFTI_DESCRIPTOR), POINTER :: My_Desc1_Handle, My_Desc2_Handle
    Integer :: Status
    !...put input data into X(1),...,X(32); Y(1),...,Y(32)

    ! Perform a complex to complex transform
    Status = DftiCreateDescriptor( My_Desc1_Handle, DFTI_SINGLE,&
       DFTI_COMPLEX, 1, 32 )
    Status = DftiCommitDescriptor( My_Desc1_Handle )
    Status = DftiComputeForward( My_Desc1_Handle, X )
    Status = DftiFreeDescriptor(My_Desc1_Handle)
    ! result is given by {X(1),X(2),...,X(32)}

    ! Perform a real to complex conjugate-even transform
    Status = DftiCreateDescriptor(My_Desc2_Handle, DFTI_SINGLE,&
       DFTI_REAL, 1, 32)
    Status = DftiCommitDescriptor(My_Desc2_Handle)
    Status = DftiComputeForward(My_Desc2_Handle, Y)
    Status = DftiFreeDescriptor(My_Desc2_Handle)
    ! result is given in CCS format.

One-dimensional Out-of-place FFT (Fortran Interface)

    ! Fortran example.
    ! 1D complex to complex, and real to conjugate-even
    Use MKL_DFTI
    Complex :: X_in(32)
    Complex :: X_out(32)
    Real :: Y_in(32)
    Real :: Y_out(34)
    type(DFTI_DESCRIPTOR), POINTER :: My_Desc1_Handle, My_Desc2_Handle
    Integer :: Status
    !...put input data into X_in(1),...,X_in(32); Y_in(1),...,Y_in(32)

    ! Perform a complex to complex transform
    Status = DftiCreateDescriptor( My_Desc1_Handle, DFTI_SINGLE,&
       DFTI_COMPLEX, 1, 32 )
    Status = DftiSetValue( My_Desc1_Handle, DFTI_PLACEMENT, DFTI_NOT_INPLACE)
    Status = DftiCommitDescriptor( My_Desc1_Handle )
    Status = DftiComputeForward( My_Desc1_Handle, X_in, X_out )
    Status = DftiFreeDescriptor(My_Desc1_Handle)
    ! result is given by {X_out(1),X_out(2),...,X_out(32)}

    ! Perform a real to complex conjugate-even transform
    Status = DftiCreateDescriptor(My_Desc2_Handle, DFTI_SINGLE,&
       DFTI_REAL, 1, 32)
    Status = DftiSetValue( My_Desc2_Handle, DFTI_PLACEMENT, DFTI_NOT_INPLACE)
    Status = DftiCommitDescriptor(My_Desc2_Handle)
    Status = DftiComputeForward(My_Desc2_Handle, Y_in, Y_out)
    Status = DftiFreeDescriptor(My_Desc2_Handle)
    ! result is given by Y_out in CCS format.
One-dimensional In-place FFT (C Interface)

    /* C example, float _Complex is defined in C9X */
    #include "mkl_dfti.h"
    float _Complex x[32];
    float y[34];
    DFTI_DESCRIPTOR_HANDLE my_desc1_handle;
    DFTI_DESCRIPTOR_HANDLE my_desc2_handle;
    MKL_LONG status;
    //...put input data into x[0],...,x[31]; y[0],...,y[31]
    status = DftiCreateDescriptor( &my_desc1_handle, DFTI_SINGLE,
       DFTI_COMPLEX, 1, 32);
    status = DftiCommitDescriptor( my_desc1_handle );
    status = DftiComputeForward( my_desc1_handle, x);
    status = DftiFreeDescriptor(&my_desc1_handle);
    /* result is x[0], ..., x[31]*/
    status = DftiCreateDescriptor( &my_desc2_handle, DFTI_SINGLE,
       DFTI_REAL, 1, 32);
    status = DftiCommitDescriptor( my_desc2_handle);
    status = DftiComputeForward( my_desc2_handle, y);
    status = DftiFreeDescriptor(&my_desc2_handle);
    /* result is given in CCS format*/

One-dimensional Out-of-place FFT (C Interface)

    /* C example, float _Complex is defined in C9X */
    #include "mkl_dfti.h"
    float _Complex x_in[32];
    float _Complex x_out[32];
    float y_in[32];
    float y_out[34];
    DFTI_DESCRIPTOR_HANDLE my_desc1_handle;
    DFTI_DESCRIPTOR_HANDLE my_desc2_handle;
    MKL_LONG status;
    //...put input data into x_in[0],...,x_in[31]; y_in[0],...,y_in[31]
    status = DftiCreateDescriptor( &my_desc1_handle, DFTI_SINGLE,
       DFTI_COMPLEX, 1, 32);
    status = DftiSetValue( my_desc1_handle, DFTI_PLACEMENT, DFTI_NOT_INPLACE);
    status = DftiCommitDescriptor( my_desc1_handle );
    status = DftiComputeForward( my_desc1_handle, x_in, x_out);
    status = DftiFreeDescriptor(&my_desc1_handle);
    /* result is x_out[0], ..., x_out[31]*/
    status = DftiCreateDescriptor( &my_desc2_handle, DFTI_SINGLE,
       DFTI_REAL, 1, 32);
    status = DftiSetValue( my_desc2_handle, DFTI_PLACEMENT, DFTI_NOT_INPLACE);
    status = DftiCommitDescriptor( my_desc2_handle);
    status = DftiComputeForward( my_desc2_handle, y_in, y_out);
    status = DftiFreeDescriptor(&my_desc2_handle);
    /* result is given by y_out in CCS format*/

Two-dimensional FFT (Fortran Interface)

The following is an example of two simple two-dimensional transforms. Notice that the data and result parameters in computation functions are all declared as assumed-size rank-1 arrays DIMENSION(0:*). Therefore a two-dimensional array must be transformed to a one-dimensional array by an EQUIVALENCE statement or other facilities of Fortran.

    ! Fortran example.
    ! 2D complex to complex, and real to conjugate-even
    Use MKL_DFTI
    Complex :: X_2D(32,100)
    Real :: Y_2D(34, 102)
    Complex :: X(3200)
    Real :: Y(3468)
    Equivalence (X_2D, X)
    Equivalence (Y_2D, Y)
    type(DFTI_DESCRIPTOR), POINTER :: My_Desc1_Handle, My_Desc2_Handle
    Integer :: Status, L(2)
    !...put input data into X_2D(j,k), Y_2D(j,k), 1<=j<=32, 1<=k<=100
    !...set L(1) = 32, L(2) = 100
    !...the transform is a 32-by-100

    ! Perform a complex to complex transform
    Status = DftiCreateDescriptor( My_Desc1_Handle, DFTI_SINGLE,&
       DFTI_COMPLEX, 2, L)
    Status = DftiCommitDescriptor( My_Desc1_Handle)
    Status = DftiComputeForward( My_Desc1_Handle, X)
    Status = DftiFreeDescriptor(My_Desc1_Handle)
    ! result is given by X_2D(j,k), 1<=j<=32, 1<=k<=100

    ! Perform a real to complex conjugate-even transform
    Status = DftiCreateDescriptor( My_Desc2_Handle, DFTI_SINGLE,&
       DFTI_REAL, 2, L)
    Status = DftiCommitDescriptor( My_Desc2_Handle)
    Status = DftiComputeForward( My_Desc2_Handle, Y)
    Status = DftiFreeDescriptor(My_Desc2_Handle)
    ! result is given by the complex value z(j,k) 1<=j<=32; 1<=k<=100
    ! and is stored in CCS format

Two-dimensional FFT (C Interface)

    /* C99 example */
    #include "mkl_dfti.h"
    float _Complex x[32][100];
    float y[34][102];
    DFTI_DESCRIPTOR_HANDLE my_desc1_handle;
    DFTI_DESCRIPTOR_HANDLE my_desc2_handle;
    MKL_LONG status, l[2];
    //...put input data into x[j][k] 0<=j<=31, 0<=k<=99
    //...put input data into y[j][k] 0<=j<=31, 0<=k<=99
    l[0] = 32; l[1] = 100;
    status = DftiCreateDescriptor( &my_desc1_handle, DFTI_SINGLE,
       DFTI_COMPLEX, 2, l);
    status = DftiCommitDescriptor( my_desc1_handle);
    status = DftiComputeForward( my_desc1_handle, x);
    status = DftiFreeDescriptor(&my_desc1_handle);
    /* result is the complex value x[j][k], 0<=j<=31, 0<=k<=99 */
    status = DftiCreateDescriptor( &my_desc2_handle, DFTI_SINGLE,
       DFTI_REAL, 2, l);
    status = DftiCommitDescriptor( my_desc2_handle);
    status = DftiComputeForward( my_desc2_handle, y);
    status = DftiFreeDescriptor(&my_desc2_handle);
    /* result is the complex value z(j,k) 0<=j<=31; 0<=k<=99
       and is stored in CCS format */

The following examples demonstrate how you can change the default configuration settings by using the DftiSetValue function.

For instance, to preserve the input data after the FFT computation, the configuration of the DFTI_PLACEMENT should be changed to "not in place" from the default choice of "in place."

Changing Default Settings (Fortran)

The code below illustrates how this can be done:

    ! Fortran example
    ! 1D complex to complex, not in place
    Use MKL_DFTI
    Complex :: X_in(32), X_out(32)
    type(DFTI_DESCRIPTOR), POINTER :: My_Desc_Handle
    Integer :: Status
    !...put input data into X_in(j), 1<=j<=32
    Status = DftiCreateDescriptor( My_Desc_Handle,&
       DFTI_SINGLE, DFTI_COMPLEX, 1, 32)
    Status = DftiSetValue( My_Desc_Handle, DFTI_PLACEMENT, DFTI_NOT_INPLACE)
    Status = DftiCommitDescriptor( My_Desc_Handle)
    Status = DftiComputeForward( My_Desc_Handle, X_in, X_out)
    Status = DftiFreeDescriptor (My_Desc_Handle)
    !
result is X_out(1),X_out(2),...,X_out(32) Changing Default Settings (C) /* C99 example */ #include "mkl_dfti.h" float _Complex x_in[32], x_out[32]; DFTI_DESCRIPTOR_HANDLE my_desc_handle; MKL_LONG status; //...put input data into x_in[j], 0 <= j < 32 status = DftiCreateDescriptor( &my_desc_handle, DFTI_SINGLE, DFTI_COMPLEX, 1, 32); status = DftiSetValue( my_desc_handle, DFTI_PLACEMENT, DFTI_NOT_INPLACE); status = DftiCommitDescriptor( my_desc_handle); status = DftiComputeForward( my_desc_handle, x_in, x_out); status = DftiFreeDescriptor(&my_desc_handle); /* result is x_out[0], x_out[1], ..., x_out[31] */ Using Status Checking Functions The example illustrates the use of status checking functions described in Chapter 11. /* C */ DFTI_DESCRIPTOR_HANDLE desc; MKL_LONG status; // . . . descriptor creation and other code status = DftiCommitDescriptor(desc); if (status && !DftiErrorClass(status,DFTI_NO_ERROR)) { printf ('Error: %s\n', DftiErrorMessage(status)); } ! Fortran type(DFTI_DESCRIPTOR), POINTER :: desc integer status ! ...descriptor creation and other code status = DftiCommitDescriptor(desc) Code Examples C 2659 if (status .ne. 0) then if (.not. DftiErrorClass(status,DFTI_NO_ERROR) then print *, 'Error: ‘, DftiErrorMessage(status) endif endif Computing 2D FFT by One-Dimensional Transforms Below is an example where a 20-by-40 two-dimensional FFT is computed explicitly using one-dimensional transforms. Notice that the data and result parameters in computation functions are all declared as assumedsize rank-1 array DIMENSION(0:*). Therefore two-dimensional array must be transformed to onedimensional array by EQUIVALENCE statement or other facilities of Fortran. ! Fortran use mkl_dfti Complex :: X_2D(20,40) Complex :: X(800) Equivalence (X_2D, X) INTEGER :: STRIDE(2) type(DFTI_DESCRIPTOR), POINTER :: Desc_Handle_Dim1 type(DFTI_DESCRIPTOR), POINTER :: Desc_Handle_Dim2 ! ... 
Status = DftiCreateDescriptor(Desc_Handle_Dim1, DFTI_SINGLE,&
                              DFTI_COMPLEX, 1, 20 )
Status = DftiCreateDescriptor(Desc_Handle_Dim2, DFTI_SINGLE,&
                              DFTI_COMPLEX, 1, 40 )
! perform 40 one-dimensional transforms along 1st dimension
Status = DftiSetValue( Desc_Handle_Dim1, DFTI_NUMBER_OF_TRANSFORMS, 40 )
Status = DftiSetValue( Desc_Handle_Dim1, DFTI_INPUT_DISTANCE, 20 )
Status = DftiSetValue( Desc_Handle_Dim1, DFTI_OUTPUT_DISTANCE, 20 )
Status = DftiCommitDescriptor( Desc_Handle_Dim1 )
Status = DftiComputeForward( Desc_Handle_Dim1, X )
! perform 20 one-dimensional transforms along 2nd dimension
Stride(1) = 0; Stride(2) = 20
Status = DftiSetValue( Desc_Handle_Dim2, DFTI_NUMBER_OF_TRANSFORMS, 20 )
Status = DftiSetValue( Desc_Handle_Dim2, DFTI_INPUT_DISTANCE, 1 )
Status = DftiSetValue( Desc_Handle_Dim2, DFTI_OUTPUT_DISTANCE, 1 )
Status = DftiSetValue( Desc_Handle_Dim2, DFTI_INPUT_STRIDES, Stride )
Status = DftiSetValue( Desc_Handle_Dim2, DFTI_OUTPUT_STRIDES, Stride )
Status = DftiCommitDescriptor( Desc_Handle_Dim2 )
Status = DftiComputeForward( Desc_Handle_Dim2, X )
Status = DftiFreeDescriptor( Desc_Handle_Dim1 )
Status = DftiFreeDescriptor( Desc_Handle_Dim2 )

/* C */
#include "mkl_dfti.h"
float _Complex x[20][40];
MKL_LONG stride[2];
MKL_LONG status;
DFTI_DESCRIPTOR_HANDLE desc_handle_dim1;
DFTI_DESCRIPTOR_HANDLE desc_handle_dim2;
//...
status = DftiCreateDescriptor( &desc_handle_dim1, DFTI_SINGLE, DFTI_COMPLEX, 1, 20 );
status = DftiCreateDescriptor( &desc_handle_dim2, DFTI_SINGLE, DFTI_COMPLEX, 1, 40 );
/* perform 40 one-dimensional transforms along 1st dimension */
/* note that the 1st dimension data are not unit-stride */
stride[0] = 0; stride[1] = 40;
status = DftiSetValue( desc_handle_dim1, DFTI_NUMBER_OF_TRANSFORMS, 40 );
status = DftiSetValue( desc_handle_dim1, DFTI_INPUT_DISTANCE, 1 );
status = DftiSetValue( desc_handle_dim1, DFTI_OUTPUT_DISTANCE, 1 );
status = DftiSetValue( desc_handle_dim1, DFTI_INPUT_STRIDES, stride );
status = DftiSetValue( desc_handle_dim1, DFTI_OUTPUT_STRIDES, stride );
status = DftiCommitDescriptor( desc_handle_dim1 );
status = DftiComputeForward( desc_handle_dim1, x );
/* perform 20 one-dimensional transforms along 2nd dimension */
/* note that the 2nd dimension is unit stride */
status = DftiSetValue( desc_handle_dim2, DFTI_NUMBER_OF_TRANSFORMS, 20 );
status = DftiSetValue( desc_handle_dim2, DFTI_INPUT_DISTANCE, 40 );
status = DftiSetValue( desc_handle_dim2, DFTI_OUTPUT_DISTANCE, 40 );
status = DftiCommitDescriptor( desc_handle_dim2 );
status = DftiComputeForward( desc_handle_dim2, x );
status = DftiFreeDescriptor( &desc_handle_dim1 );
status = DftiFreeDescriptor( &desc_handle_dim2 );

The following are examples of real multi-dimensional transforms with CCE format storage of the conjugate-even complex matrix. Example “Two-Dimensional REAL In-place FFT (Fortran Interface)” is a two-dimensional in-place transform and Example “Two-Dimensional REAL Out-of-place FFT (Fortran Interface)” is a two-dimensional out-of-place transform in the Fortran interface. Example “Three-Dimensional REAL FFT (C Interface)” is a three-dimensional out-of-place transform in the C interface. Note that the data and result parameters in computation functions are all declared as assumed-size rank-1 arrays DIMENSION(0:*).
Therefore a two-dimensional array must be mapped to a one-dimensional array by an EQUIVALENCE statement or other facilities of Fortran.

Two-Dimensional REAL In-place FFT (Fortran Interface)

! Fortran example.
! 2D and real to conjugate-even
Use MKL_DFTI
Real :: X_2D(34,100) ! 34 = (32/2 + 1)*2
Real :: X(3400)
Equivalence (X_2D, X)
type(DFTI_DESCRIPTOR), POINTER :: My_Desc_Handle
Integer :: Status, L(2)
Integer :: strides_in(3)
Integer :: strides_out(3)
! ...put input data into X_2D(j,k), 1<=j<=32,1<=k<=100
! ...set L(1) = 32, L(2) = 100
! ...set strides_in(1) = 0, strides_in(2) = 1, strides_in(3) = 34
! ...set strides_out(1) = 0, strides_out(2) = 1, strides_out(3) = 17
! ...the transform is a 32-by-100
! Perform a real to complex conjugate-even transform
Status = DftiCreateDescriptor( My_Desc_Handle, DFTI_SINGLE,&
                               DFTI_REAL, 2, L )
Status = DftiSetValue(My_Desc_Handle, DFTI_CONJUGATE_EVEN_STORAGE,&
                      DFTI_COMPLEX_COMPLEX)
Status = DftiSetValue(My_Desc_Handle, DFTI_INPUT_STRIDES, strides_in)
Status = DftiSetValue(My_Desc_Handle, DFTI_OUTPUT_STRIDES, strides_out)
Status = DftiCommitDescriptor( My_Desc_Handle)
Status = DftiComputeForward( My_Desc_Handle, X )
Status = DftiFreeDescriptor(My_Desc_Handle)
! result is given by the complex value z(j,k) 1<=j<=17; 1<=k<=100 and
! is stored in real matrix X_2D in CCE format.

Two-Dimensional REAL Out-of-place FFT (Fortran Interface)

! Fortran example.
! 2D and real to conjugate-even
Use MKL_DFTI
Real :: X_2D(32,100)
Complex :: Y_2D(17, 100) ! 17 = 32/2 + 1
Real :: X(3200)
Complex :: Y(1700)
Equivalence (X_2D, X)
Equivalence (Y_2D, Y)
type(DFTI_DESCRIPTOR), POINTER :: My_Desc_Handle
Integer :: Status, L(2)
Integer :: strides_out(3)
! ...put input data into X_2D(j,k), 1<=j<=32,1<=k<=100
! ...set L(1) = 32, L(2) = 100
! ...set strides_out(1) = 0, strides_out(2) = 1, strides_out(3) = 17
! ...the transform is a 32-by-100
! Perform a real to complex conjugate-even transform
Status = DftiCreateDescriptor( My_Desc_Handle, DFTI_SINGLE,&
                               DFTI_REAL, 2, L )
Status = DftiSetValue(My_Desc_Handle,&
                      DFTI_CONJUGATE_EVEN_STORAGE, DFTI_COMPLEX_COMPLEX)
Status = DftiSetValue( My_Desc_Handle, DFTI_PLACEMENT, DFTI_NOT_INPLACE )
Status = DftiSetValue(My_Desc_Handle,&
                      DFTI_OUTPUT_STRIDES, strides_out)
Status = DftiCommitDescriptor(My_Desc_Handle)
Status = DftiComputeForward(My_Desc_Handle, X, Y)
Status = DftiFreeDescriptor(My_Desc_Handle)
! result is given by the complex value z(j,k) 1<=j<=17; 1<=k<=100 and
! is stored in complex matrix Y_2D in CCE format.

Three-Dimensional REAL FFT (C Interface)

/* C99 example */
#include "mkl_dfti.h"
float x[32][100][19];
float _Complex y[32][100][10]; /* 10 = 19/2 + 1 */
DFTI_DESCRIPTOR_HANDLE my_desc_handle;
MKL_LONG status, l[3];
MKL_LONG strides_out[4];
//...put input data into x[j][k][s] 0<=j<=31, 0<=k<=99, 0<=s<=18
l[0] = 32; l[1] = 100; l[2] = 19;
strides_out[0] = 0; strides_out[1] = 1000;
strides_out[2] = 10; strides_out[3] = 1;
status = DftiCreateDescriptor( &my_desc_handle, DFTI_SINGLE, DFTI_REAL, 3, l );
status = DftiSetValue(my_desc_handle, DFTI_CONJUGATE_EVEN_STORAGE, DFTI_COMPLEX_COMPLEX);
status = DftiSetValue( my_desc_handle, DFTI_PLACEMENT, DFTI_NOT_INPLACE );
status = DftiSetValue(my_desc_handle, DFTI_OUTPUT_STRIDES, strides_out);
status = DftiCommitDescriptor(my_desc_handle);
status = DftiComputeForward(my_desc_handle, x, y);
status = DftiFreeDescriptor(&my_desc_handle);
/* result is the complex value z(j,k,s) 0<=j<=31; 0<=k<=99, 0<=s<=9
   and is stored in complex matrix y in CCE format. */

Examples of Using Multi-Threading for FFT Computation

The following sample program shows how to employ internal threading in Intel MKL for FFT computation (see case "a" in “Number of user threads”).
To specify the number of threads inside Intel MKL, use the following settings:
set MKL_NUM_THREADS = 1 for one-threaded mode;
set MKL_NUM_THREADS = 4 for multi-threaded mode.
Note that the configuration parameter DFTI_NUMBER_OF_USER_THREADS must be equal to its default value 1.

Using Intel MKL Internal Threading Mode

#include "mkl_dfti.h"
int main ()
{
    float x[200][100];
    DFTI_DESCRIPTOR_HANDLE fft;
    MKL_LONG len[2] = {200, 100};
    // initialize x
    DftiCreateDescriptor ( &fft, DFTI_SINGLE, DFTI_REAL, 2, len );
    DftiCommitDescriptor ( fft );
    DftiComputeForward ( fft, x );
    DftiFreeDescriptor ( &fft );
    return 0;
}

The following Example “Using Parallel Mode with Multiple Descriptors Initialized in a Parallel Region” and Example “Using Parallel Mode with Multiple Descriptors Initialized in One Thread” illustrate a parallel customer program with each descriptor instance used only in a single thread (see cases "b" and "c" in “Number of user threads”).

Specify the number of threads for Example “Using Parallel Mode with Multiple Descriptors Initialized in a Parallel Region” like this:
set MKL_NUM_THREADS = 1 for Intel MKL to work in the single-threaded mode (recommended);
set OMP_NUM_THREADS = 4 for the customer program to work in the multi-threaded mode.
The configuration parameter DFTI_NUMBER_OF_USER_THREADS must have its default value of 1.

Using Parallel Mode with Multiple Descriptors Initialized in a Parallel Region

Note that in this example, the program can be transformed to become single-threaded at the customer level but using parallel mode within Intel MKL (case "a"). To achieve this, you need to set the parameter DFTI_NUMBER_OF_TRANSFORMS = 4 and to set the corresponding parameter DFTI_INPUT_DISTANCE = 5000.
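The distance of 5000 quoted above is the element count of one 50-by-100 transform, so consecutive transforms start 5000 elements apart when the data is viewed as one flat array. The plain C sketch below makes no Intel MKL calls; it only checks that arithmetic against the layout of an array shaped like x in this example. The names cpair, batch_distance, expected_offset, and measured_offset are hypothetical helpers introduced for illustration, and cpair merely stands in for MKL_Complex8.

```c
#include <stddef.h>

/* Hypothetical stand-in for MKL_Complex8 (a pair of floats); not the MKL type. */
typedef struct { float re, im; } cpair;

enum { NTRANS = 4, LEN0 = 50, LEN1 = 100 };

/* Same shape as the array x in the example: 4 transforms of 50x100 points. */
static cpair x[NTRANS][LEN0][LEN1];

/* Number of elements between the starts of consecutive transforms,
   i.e. the value the text gives for DFTI_INPUT_DISTANCE. */
static size_t batch_distance(size_t len0, size_t len1)
{
    return len0 * len1;
}

/* Where transform number th is expected to start in the flat array. */
static size_t expected_offset(size_t th)
{
    return th * batch_distance(LEN0, LEN1);
}

/* The same offset, measured from the actual memory layout of x. */
static size_t measured_offset(size_t th)
{
    const char *base  = (const char *)x;
    const char *start = (const char *)&x[th][0][0];
    return (size_t)(start - base) / sizeof(cpair);
}
```

For every th from 0 to 3, measured_offset(th) equals expected_offset(th); for th = 3 both give 15000, the start of the fourth transform.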
C code for the example “Using Parallel Mode with Multiple Descriptors Initialized in a Parallel Region” is as follows:

#include "mkl_dfti.h"
#include <omp.h>
#define ARRAY_LEN(a) sizeof(a)/sizeof(a[0])
int main ()
{
    // 4 OMP threads, each does 2D FFT 50x100 points
    MKL_Complex8 x[4][50][100];
    int nth = ARRAY_LEN(x);
    MKL_LONG len[2] = {ARRAY_LEN(x[0]), ARRAY_LEN(x[0][0])};
    int th;
    // assume x is initialized and do 2D FFTs
    #pragma omp parallel for shared(len, x)
    for (th = 0; th < nth; th++)
    {
        DFTI_DESCRIPTOR_HANDLE myFFT;
        DftiCreateDescriptor (&myFFT, DFTI_SINGLE, DFTI_COMPLEX, 2, len);
        DftiCommitDescriptor (myFFT);
        DftiComputeForward (myFFT, x[th]);
        DftiFreeDescriptor (&myFFT);
    }
    return 0;
}

Fortran code for the example is as follows:

program fft2d_private_descr_main
use mkl_dfti
integer nth, len(2)
! 4 OMP threads, each does 2D FFT 50x100 points
parameter (nth = 4, len = (/50, 100/))
complex x(len(2)*len(1), nth)
type(dfti_descriptor), pointer :: myFFT
integer th, myStatus
! assume x is initialized and do 2D FFTs
!$OMP PARALLEL DO SHARED(len, x) PRIVATE(myFFT, myStatus)
do th = 1, nth
    myStatus = DftiCreateDescriptor (myFFT, DFTI_SINGLE, DFTI_COMPLEX, 2, len)
    myStatus = DftiCommitDescriptor (myFFT)
    myStatus = DftiComputeForward (myFFT, x(:, th))
    myStatus = DftiFreeDescriptor (myFFT)
end do
!$OMP END PARALLEL DO
end

Specify the number of threads for Example “Using Parallel Mode with Multiple Descriptors Initialized in One Thread” like this:
set MKL_NUM_THREADS = 1 for Intel MKL to work in the single-threaded mode (obligatory);
set OMP_NUM_THREADS = 4 for the customer program to work in the multi-threaded mode.
The configuration parameter DFTI_NUMBER_OF_USER_THREADS must have the default value of 1.
Using Parallel Mode with Multiple Descriptors Initialized in One Thread

C code for the example is as follows:

#include "mkl_dfti.h"
#include <omp.h>
#define ARRAY_LEN(a) sizeof(a)/sizeof(a[0])
int main ()
{
    // 4 OMP threads, each does 2D FFT 50x100 points
    MKL_Complex8 x[4][50][100];
    int nth = ARRAY_LEN(x);
    MKL_LONG len[2] = {ARRAY_LEN(x[0]), ARRAY_LEN(x[0][0])};
    DFTI_DESCRIPTOR_HANDLE FFT[ARRAY_LEN(x)];
    int th;
    for (th = 0; th < nth; th++)
        DftiCreateDescriptor (&FFT[th], DFTI_SINGLE, DFTI_COMPLEX, 2, len);
    for (th = 0; th < nth; th++)
        DftiCommitDescriptor (FFT[th]);
    // assume x is initialized and do 2D FFTs
    #pragma omp parallel for shared(FFT, x)
    for (th = 0; th < nth; th++)
        DftiComputeForward (FFT[th], x[th]);
    for (th = 0; th < nth; th++)
        DftiFreeDescriptor (&FFT[th]);
    return 0;
}

Fortran code for the example is as follows:

program fft2d_array_descr_main
use mkl_dfti
integer nth, len(2)
! 4 OMP threads, each does 2D FFT 50x100 points
parameter (nth = 4, len = (/50, 100/))
complex x(len(2)*len(1), nth)
type thread_data
    type(dfti_descriptor), pointer :: FFT
end type thread_data
type(thread_data) :: workload(nth)
integer th, status, myStatus
do th = 1, nth
    status = DftiCreateDescriptor (workload(th)%FFT, DFTI_SINGLE, DFTI_COMPLEX, 2, len)
    status = DftiCommitDescriptor (workload(th)%FFT)
end do
! assume x is initialized and do 2D FFTs
!$OMP PARALLEL DO SHARED(len, x, workload) PRIVATE(myStatus)
do th = 1, nth
    myStatus = DftiComputeForward (workload(th)%FFT, x(:, th))
end do
!$OMP END PARALLEL DO
do th = 1, nth
    status = DftiFreeDescriptor (workload(th)%FFT)
end do
end

The following Example “Using Parallel Mode with a Common Descriptor” illustrates a parallel customer program with a common descriptor used in several threads (see case "d" in “Number of user threads”).
In this case, the number of threads, as well as any other configuration parameter, must not be changed after the DftiCommitDescriptor() function has initialized the FFT.

Using Parallel Mode with a Common Descriptor

C code for the example is as follows:

#include "mkl_dfti.h"
#include <omp.h>
#define ARRAY_LEN(a) sizeof(a)/sizeof(a[0])
int main ()
{
    // 4 OMP threads, each does 2D FFT 50x100 points
    MKL_Complex8 x[4][50][100];
    int nth = ARRAY_LEN(x);
    MKL_LONG len[2] = {ARRAY_LEN(x[0]), ARRAY_LEN(x[0][0])};
    DFTI_DESCRIPTOR_HANDLE FFT;
    int th;
    DftiCreateDescriptor (&FFT, DFTI_SINGLE, DFTI_COMPLEX, 2, len);
    DftiSetValue (FFT, DFTI_NUMBER_OF_USER_THREADS, nth);
    DftiCommitDescriptor (FFT);
    // assume x is initialized and do 2D FFTs
    #pragma omp parallel for shared(FFT, x)
    for (th = 0; th < nth; th++)
        DftiComputeForward (FFT, x[th]);
    DftiFreeDescriptor (&FFT);
    return 0;
}

Fortran code for the example is as follows:

program fft2d_shared_descr_main
use mkl_dfti
integer nth, len(2)
! 4 OMP threads, each does 2D FFT 50x100 points
parameter (nth = 4, len = (/50, 100/))
complex x(len(2)*len(1), nth)
type(dfti_descriptor), pointer :: FFT
integer th, status, myStatus
status = DftiCreateDescriptor (FFT, DFTI_SINGLE, DFTI_COMPLEX, 2, len)
status = DftiSetValue (FFT, DFTI_NUMBER_OF_USER_THREADS, nth)
status = DftiCommitDescriptor (FFT)
! assume x is initialized and do 2D FFTs
!$OMP PARALLEL DO SHARED(len, x, FFT) PRIVATE(myStatus)
do th = 1, nth
    myStatus = DftiComputeForward (FFT, x(:, th))
end do
!$OMP END PARALLEL DO
status = DftiFreeDescriptor (FFT)
end

Examples for Cluster FFT Functions

The following C example computes a 2-dimensional out-of-place FFT using the cluster FFT interface:

2D Out-of-place Cluster FFT Computation

DFTI_DESCRIPTOR_DM_HANDLE desc;
MKL_LONG len[2],v,i,j,n,s;
Complex *in,*out;
MPI_Init(...);
// Create descriptor for 2D FFT
len[0]=nx;
len[1]=ny;
DftiCreateDescriptorDM(MPI_COMM_WORLD,&desc,DFTI_DOUBLE,DFTI_COMPLEX,2,len);
// Ask necessary length of in and out arrays and allocate memory
DftiGetValueDM(desc,CDFT_LOCAL_SIZE,&v);
in=(Complex*)malloc(v*sizeof(Complex));
out=(Complex*)malloc(v*sizeof(Complex));
// Fill local array with initial data. Current process performs n rows,
// 0 row of in corresponds to s row of virtual global array
DftiGetValueDM(desc,CDFT_LOCAL_NX,&n);
DftiGetValueDM(desc,CDFT_LOCAL_X_START,&s);
// Virtual global array globalIN is defined by function f as
// globalIN[i*ny+j]=f(i,j)
for(i=0;i ...

... polar conversion of complex data
// Cartesian representation: z = re + I*im
// Polar representation: z = r * exp( I*phi )
#include <mkl.h>
void variant1_Cartesian2Polar(int n,const double *re,const double *im,
                              double *r,double *phi)
{
    vdHypot(n,re,im,r);   // compute radii r[]
    vdAtan2(n,im,re,phi); // compute phases phi[]
}
void variant2_Cartesian2Polar(int n,const MKL_Complex16 *z,double *r,double *phi,
                              double *temp_re,double *temp_im)
{
    vzAbs(n,z,r);         // compute radii r[]
    vdPackI(n, (double*)z + 0, 2, temp_re);
    vdPackI(n, (double*)z + 1, 2, temp_im);
    vdAtan2(n,temp_im,temp_re,phi); // compute phases phi[]
}

Conversion from polar to Cartesian representation of complex data

// Polar->Cartesian conversion of complex data.
// Polar representation: z = r * exp( I*phi )
// Cartesian representation: z = re + I*im
#include <mkl.h>
void variant1_Polar2Cartesian(int n,const double *r,const double *phi,
                              double *re,double *im)
{
    vdSinCos(n,phi,im,re); // compute direction, i.e. z[]/abs(z[])
    vdMul(n,r,re,re);      // scale real part
    vdMul(n,r,im,im);      // scale imaginary part
}
void variant2_Polar2Cartesian(int n,const double *r,const double *phi,
                              MKL_Complex16 *z,
                              double *temp_re,double *temp_im)
{
    vdSinCos(n,phi,temp_im,temp_re); // compute direction, i.e. z[]/abs(z[])
    vdMul(n,r,temp_im,temp_im);      // scale imaginary part
    vdMul(n,r,temp_re,temp_re);      // scale real part
    vdUnpackI(n,temp_re,(double*)z + 0, 2); // fill in result.re
    vdUnpackI(n,temp_im,(double*)z + 1, 2); // fill in result.im
}

CBLAS Interface to the BLAS (Appendix D)

This appendix presents CBLAS, the C interface to the Basic Linear Algebra Subprograms (BLAS) implemented in Intel® MKL.
Similar to BLAS, the CBLAS interface includes the following levels of functions:
• “Level 1 CBLAS” (vector-vector operations)
• “Level 2 CBLAS” (matrix-vector operations)
• “Level 3 CBLAS” (matrix-matrix operations)
• “Sparse CBLAS” (operations on sparse vectors)
To obtain the C interface, the Fortran routine names are prefixed with cblas_ (for example, dasum becomes cblas_dasum). Names of all CBLAS functions are in lowercase letters.
Complex functions ?dotc and ?dotu become CBLAS subroutines (void functions); they return the complex result via a void pointer, added as the last parameter. CBLAS names of these functions are suffixed with _sub. For example, the BLAS function cdotc corresponds to cblas_cdotc_sub.

WARNING Users of the CBLAS interface should be aware that CBLAS is just a C interface to the BLAS, which is based on the FORTRAN standard and subject to the FORTRAN standard restrictions. In particular, the output parameters should not be referenced through more than one argument.
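The _sub convention described above, in which the complex result comes back through a trailing void pointer, can be sketched in plain C as follows. This is not Intel MKL code: ref_zdotc_sub is a hypothetical reference routine whose parameter list mirrors the cblas_zdotc_sub prototype, and it assumes that complex values are stored as interleaved (real, imaginary) pairs of double and that both increments are positive.

```c
/* Hypothetical reference routine, not part of Intel MKL.
   Computes sum(conj(X[i]) * Y[i]) and stores the complex result
   through the void pointer dotc, as the _sub routines do.
   Assumptions: interleaved (re, im) doubles; incX > 0 and incY > 0. */
static void ref_zdotc_sub(const int N, const void *X, const int incX,
                          const void *Y, const int incY, void *dotc)
{
    const double *x = (const double *)X;
    const double *y = (const double *)Y;
    double *result = (double *)dotc;
    double re = 0.0, im = 0.0;

    for (int i = 0; i < N; i++) {
        const double xr = x[2 * i * incX], xi = x[2 * i * incX + 1];
        const double yr = y[2 * i * incY], yi = y[2 * i * incY + 1];
        /* conj(x) * y = (xr - I*xi) * (yr + I*yi) */
        re += xr * yr + xi * yi;
        im += xr * yi - xi * yr;
    }
    result[0] = re;  /* real part of the dot product */
    result[1] = im;  /* imaginary part of the dot product */
}
```

In line with the warning above, the result buffer passed as dotc should be distinct from the storage passed as X and Y.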
In the descriptions of CBLAS interfaces, links provided for each function group lead to the descriptions of the respective Fortran-interface BLAS functions.

CBLAS Arguments

The arguments of CBLAS functions comply with the following rules:
• Input arguments are declared with the const modifier.
• Non-complex scalar input arguments are passed by value.
• Complex scalar input arguments are passed as void pointers.
• Array arguments are passed by address.
• BLAS character arguments are replaced by the appropriate enumerated type.
• Level 2 and Level 3 routines acquire an additional parameter of type CBLAS_ORDER as their first argument. This parameter specifies whether two-dimensional arrays are row-major (CblasRowMajor) or column-major (CblasColMajor).

Enumerated Types

The CBLAS interface uses the following enumerated types:

enum CBLAS_ORDER {
    CblasRowMajor=101,  /* row-major arrays */
    CblasColMajor=102}; /* column-major arrays */
enum CBLAS_TRANSPOSE {
    CblasNoTrans=111,    /* trans='N' */
    CblasTrans=112,      /* trans='T' */
    CblasConjTrans=113}; /* trans='C' */
enum CBLAS_UPLO {
    CblasUpper=121,  /* uplo ='U' */
    CblasLower=122}; /* uplo ='L' */
enum CBLAS_DIAG {
    CblasNonUnit=131, /* diag ='N' */
    CblasUnit=132};   /* diag ='U' */
enum CBLAS_SIDE {
    CblasLeft=141,   /* side ='L' */
    CblasRight=142}; /* side ='R' */

Level 1 CBLAS

This is an interface to “BLAS Level 1 Routines and Functions”, which perform basic vector-vector operations.
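As a rough illustration of what Level 1 routines compute and how the argument rules above look in practice (const input arrays, real scalars by value, arrays by address), the plain C sketch below gives reference versions of two operations. The names ref_dasum and ref_daxpy are hypothetical and are not Intel MKL functions; their parameter lists mirror cblas_dasum and cblas_daxpy, and the sketch assumes positive increments only.

```c
/* Hypothetical reference routines, not part of Intel MKL.
   Assumption: incX > 0 and incY > 0. */

/* Sum of absolute values of N elements of X taken with stride incX. */
static double ref_dasum(const int N, const double *X, const int incX)
{
    double sum = 0.0;
    for (int i = 0; i < N; i++) {
        const double v = X[i * incX];
        sum += (v < 0.0) ? -v : v;
    }
    return sum;
}

/* Y := alpha*X + Y over N elements, with strides incX and incY. */
static void ref_daxpy(const int N, const double alpha,
                      const double *X, const int incX,
                      double *Y, const int incY)
{
    for (int i = 0; i < N; i++) {
        Y[i * incY] += alpha * X[i * incX];
    }
}
```

With X = {1, -2, 3, -4, 5, -6}, a call with N = 3 and incX = 2 visits only the elements 1, 3, and 5, which shows how an increment selects every second element of an array.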
?asum
float cblas_sasum(const int N, const float *X, const int incX);
double cblas_dasum(const int N, const double *X, const int incX);
float cblas_scasum(const int N, const void *X, const int incX);
double cblas_dzasum(const int N, const void *X, const int incX);

?axpy
void cblas_saxpy(const int N, const float alpha, const float *X, const int incX, float *Y, const int incY);
void cblas_daxpy(const int N, const double alpha, const double *X, const int incX, double *Y, const int incY);
void cblas_caxpy(const int N, const void *alpha, const void *X, const int incX, void *Y, const int incY);
void cblas_zaxpy(const int N, const void *alpha, const void *X, const int incX, void *Y, const int incY);

?copy
void cblas_scopy(const int N, const float *X, const int incX, float *Y, const int incY);
void cblas_dcopy(const int N, const double *X, const int incX, double *Y, const int incY);
void cblas_ccopy(const int N, const void *X, const int incX, void *Y, const int incY);
void cblas_zcopy(const int N, const void *X, const int incX, void *Y, const int incY);

?dot
float cblas_sdot(const int N, const float *X, const int incX, const float *Y, const int incY);
double cblas_ddot(const int N, const double *X, const int incX, const double *Y, const int incY);

?sdot
float cblas_sdsdot(const int N, const float SB, const float *SX, const int incX, const float *SY, const int incY);
double cblas_dsdot(const int N, const float *SX, const int incX, const float *SY, const int incY);

?dotc
void cblas_cdotc_sub(const int N, const void *X, const int incX, const void *Y, const int incY, void *dotc);
void cblas_zdotc_sub(const int N, const void *X, const int incX, const void *Y, const int incY, void *dotc);

?dotu
void cblas_cdotu_sub(const int N, const void *X, const int incX, const void *Y, const int incY, void *dotu);
void cblas_zdotu_sub(const int N, const void *X, const int incX, const void *Y, const int incY, void *dotu);

?nrm2
float cblas_snrm2(const int N, const float *X, const int incX);
double cblas_dnrm2(const int N, const double *X, const int incX);
float cblas_scnrm2(const int N, const void *X, const int incX);
double cblas_dznrm2(const int N, const void *X, const int incX);

?rot
void cblas_srot(const int N, float *X, const int incX, float *Y, const int incY, const float c, const float s);
void cblas_drot(const int N, double *X, const int incX, double *Y, const int incY, const double c, const double s);

?rotg
void cblas_srotg(float *a, float *b, float *c, float *s);
void cblas_drotg(double *a, double *b, double *c, double *s);

?rotm
void cblas_srotm(const int N, float *X, const int incX, float *Y, const int incY, const float *P);
void cblas_drotm(const int N, double *X, const int incX, double *Y, const int incY, const double *P);

?rotmg
void cblas_srotmg(float *d1, float *d2, float *b1, const float b2, float *P);
void cblas_drotmg(double *d1, double *d2, double *b1, const double b2, double *P);

?scal
void cblas_sscal(const int N, const float alpha, float *X, const int incX);
void cblas_dscal(const int N, const double alpha, double *X, const int incX);
void cblas_cscal(const int N, const void *alpha, void *X, const int incX);
void cblas_zscal(const int N, const void *alpha, void *X, const int incX);
void cblas_csscal(const int N, const float alpha, void *X, const int incX);
void cblas_zdscal(const int N, const double alpha, void *X, const int incX);

?swap
void cblas_sswap(const int N, float *X, const int incX, float *Y, const int incY);
void cblas_dswap(const int N, double *X, const int incX, double *Y, const int incY);
void cblas_cswap(const int N, void *X, const int incX, void *Y, const int incY);
void cblas_zswap(const int N, void *X, const int incX, void *Y, const int incY);

i?amax
CBLAS_INDEX cblas_isamax(const int N, const float *X, const int incX);
CBLAS_INDEX cblas_idamax(const int N, const double *X, const int incX);
CBLAS_INDEX cblas_icamax(const int N, const void *X, const int incX);
CBLAS_INDEX cblas_izamax(const int N, const void *X, const int incX);

i?amin
CBLAS_INDEX cblas_isamin(const int N, const float *X, const int incX);
CBLAS_INDEX cblas_idamin(const int N, const double *X, const int incX);
CBLAS_INDEX cblas_icamin(const int N, const void *X, const int incX);
CBLAS_INDEX cblas_izamin(const int N, const void *X, const int incX);

?cabs1
double cblas_dcabs1(const void *z);
float cblas_scabs1(const void *c);

Level 2 CBLAS

This is an interface to “BLAS Level 2 Routines”, which perform basic matrix-vector operations. Each C routine in this group has an additional parameter of type CBLAS_ORDER (the first argument) that determines whether the two-dimensional arrays use column-major or row-major storage.

?gbmv
void cblas_sgbmv(const enum CBLAS_ORDER order, const enum CBLAS_TRANSPOSE TransA, const int M, const int N, const int KL, const int KU, const float alpha, const float *A, const int lda, const float *X, const int incX, const float beta, float *Y, const int incY);
void cblas_dgbmv(const enum CBLAS_ORDER order, const enum CBLAS_TRANSPOSE TransA, const int M, const int N, const int KL, const int KU, const double alpha, const double *A, const int lda, const double *X, const int incX, const double beta, double *Y, const int incY);
void cblas_cgbmv(const enum CBLAS_ORDER order, const enum CBLAS_TRANSPOSE TransA, const int M, const int N, const int KL, const int KU, const void *alpha, const void *A, const int lda, const void *X, const int incX, const void *beta, void *Y, const int incY);
void cblas_zgbmv(const enum CBLAS_ORDER order, const enum CBLAS_TRANSPOSE TransA, const int M, const int N, const int KL, const int KU, const void *alpha, const void *A, const int lda, const void *X, const int incX, const void *beta, void *Y, const int incY);

?gemv
void cblas_sgemv(const enum CBLAS_ORDER order, const enum CBLAS_TRANSPOSE TransA, const int M, const int N, const float alpha, const float *A, const int lda, const float *X, const int incX, const float beta, float *Y, const int incY);
void cblas_dgemv(const enum CBLAS_ORDER order, const enum CBLAS_TRANSPOSE TransA, const int M, const int N, const double alpha, const double *A, const int lda, const double *X, const int incX, const double beta, double *Y, const int incY);
void cblas_cgemv(const enum CBLAS_ORDER order, const enum CBLAS_TRANSPOSE TransA, const int M, const int N, const void *alpha, const void *A, const int lda, const void *X, const int incX, const void *beta, void *Y, const int incY);
void cblas_zgemv(const enum CBLAS_ORDER order, const enum CBLAS_TRANSPOSE TransA, const int M, const int N, const void *alpha, const void *A, const int lda, const void *X, const int incX, const void *beta, void *Y, const int incY);

?ger
void cblas_sger(const enum CBLAS_ORDER order, const int M, const int N, const float alpha, const float *X, const int incX, const float *Y, const int incY, float *A, const int lda);
void cblas_dger(const enum CBLAS_ORDER order, const int M, const int N, const double alpha, const double *X, const int incX, const double *Y, const int incY, double *A, const int lda);

?gerc
void cblas_cgerc(const enum CBLAS_ORDER order, const int M, const int N, const void *alpha, const void *X, const int incX, const void *Y, const int incY, void *A, const int lda);
void cblas_zgerc(const enum CBLAS_ORDER order, const int M, const int N, const void *alpha, const void *X, const int incX, const void *Y, const int incY, void *A, const int lda);

?geru
void cblas_cgeru(const enum CBLAS_ORDER order, const int M, const int N, const void *alpha, const void *X, const int incX, const void *Y, const int incY, void *A, const int lda);
void cblas_zgeru(const enum CBLAS_ORDER order, const int M, const int N, const void *alpha, const void *X, const int incX, const void *Y, const int incY, void *A, const int lda);

?hbmv
void cblas_chbmv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const int N, const int K, const void *alpha, const void *A, const int lda, const void *X, const int incX, const void *beta, void *Y, const int incY);
void cblas_zhbmv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const int N, const int K, const void *alpha, const void *A, const int lda, const void *X, const int incX, const void *beta, void *Y, const int incY);

?hemv
void cblas_chemv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const int N, const void *alpha, const void *A, const int lda, const void *X, const int incX, const void *beta, void *Y, const int incY);
void cblas_zhemv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const int N, const void *alpha, const void *A, const int lda, const void *X, const int incX, const void *beta, void *Y, const int incY);

?her
void cblas_cher(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const int N, const float alpha, const void *X, const int incX, void *A, const int lda);
void cblas_zher(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const int N, const double alpha, const void *X, const int incX, void *A, const int lda);

?her2
void cblas_cher2(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const int N, const void *alpha, const void *X, const int incX, const void *Y, const int incY, void *A, const int lda);
void cblas_zher2(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const int N, const void *alpha, const void *X, const int incX, const void *Y, const int incY, void *A, const int lda);

?hpmv
void cblas_chpmv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const int N, const void *alpha, const void *Ap, const void *X, const int incX, const void *beta, void *Y, const int incY);
void cblas_zhpmv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const int N, const void *alpha, const void *Ap, const void *X, const int incX, const void *beta, void *Y, const int incY);

?hpr
void cblas_chpr(const enum CBLAS_ORDER order,
const enum CBLAS_UPLO Uplo, const int N, const float alpha, const void *X, const int incX, void *A); void cblas_zhpr(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const int N, const double alpha, const void *X, const int incX, void *A); ?hpr2 void cblas_chpr2(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const int N, const void *alpha, const void *X, const int incX, const void *Y, const int incY, void *Ap); void cblas_zhpr2(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const int N, const void *alpha, const void *X, const int incX, const void *Y, const int incY, void *Ap); ?sbmv void cblas_ssbmv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const int N, const int K, const float alpha, const float *A, const int lda, const float *X, const int incX, const float beta, float *Y, const int incY); CBLAS Interface to the BLAS D 2673 void cblas_dsbmv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const int N, const int K, const double alpha, const double *A, const int lda, const double *X, const int incX, const double beta, double *Y, const int incY); ?spmv void cblas_sspmv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const int N, const float alpha, const float *Ap, const float *X, const int incX, const float beta, float *Y, const int incY); void cblas_dspmv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const int N, const double alpha, const double *Ap, const double *X, const int incX, const double beta, double *Y, const int incY); ?spr void cblas_sspr(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const int N, const float alpha, const float *X, const int incX, float *Ap); void cblas_dspr(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const int N, const double alpha, const double *X, const int incX, double *Ap); ?spr2 void cblas_sspr2(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const int N, const float alpha, const float *X, const int incX, const float *Y, 
const int incY, float *A); void cblas_dspr2(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const int N, const double alpha, const double *X, const int incX, const double *Y, const int incY, double *A); ?symv void cblas_ssymv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const int N, const float alpha, const float *A, const int lda, const float *X, const int incX, const float beta, float *Y, const int incY); void cblas_dsymv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const int N, const double alpha, const double *A, const int lda, const double *X, const int incX, const double beta, double *Y, const int incY); ?syr void cblas_ssyr(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const int N, const float alpha, const float *X, const int incX, float *A, const int lda); void cblas_dsyr(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const int N, const double alpha, const double *X, const int incX, double *A, const int lda); ?syr2 void cblas_ssyr2(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const int N, const float alpha, const float *X, const int incX, const float *Y, const int incY, float *A, const int lda); void cblas_dsyr2(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const int N, const double alpha, const double *X, const int incX, const double *Y, const int incY, double *A, const int lda); ?tbmv void cblas_stbmv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_DIAG Diag, const int N, const int K, const float *A, const int lda, float *X, const int incX); void cblas_dtbmv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_DIAG Diag, const int N, const int K, const double *A, const int lda, double *X, const int incX); void cblas_ctbmv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_DIAG Diag, const int N, 
const int K, const void *A, const int lda, void *X, const int incX); D Intel® Math Kernel Library Reference Manual 2674 void cblas_ztbmv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_DIAG Diag, const int N, const int K, const void *A, const int lda, void *X, const int incX); ?tbsv void cblas_stbsv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_DIAG Diag, const int N, const int K, const float *A, const int lda, float *X, const int incX); void cblas_dtbsv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_DIAG Diag, const int N, const int K, const double *A, const int lda, double *X, const int incX); void cblas_ctbsv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_DIAG Diag, const int N, const int K, const void *A, const int lda, void *X, const int incX); void cblas_ztbsv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_DIAG Diag, const int N, const int K, const void *A, const int lda, void *X, const int incX); ?tpmv void cblas_stpmv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_DIAG Diag, const int N, const float *Ap, float *X, const int incX); void cblas_dtpmv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_DIAG Diag, const int N, const double *Ap, double *X, const int incX); void cblas_ctpmv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_DIAG Diag, const int N,const void *Ap, void *X, const int incX); void cblas_ztpmv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_DIAG Diag, const int N,const void *Ap, void *X, 
const int incX); ?tpsv void cblas_stpsv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_DIAG Diag, const int N,const float *Ap, float *X, const int incX); void cblas_dtpsv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_DIAG Diag, const int N,const double *Ap, double *X, const int incX); void cblas_ctpsv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_DIAG Diag, const int N,const void *Ap, void *X, const int incX); void cblas_ztpsv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_DIAG Diag, const int N,const void *Ap, void *X, const int incX); ?trmv void cblas_strmv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_DIAG Diag, const int N,const float *A, const int lda, float *X, const int incX); void cblas_dtrmv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_DIAG Diag, const int N,const double *A, const int lda, double *X, const int incX); void cblas_ctrmv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_DIAG Diag, const int N,const void *A, const int lda, void *X, const int incX); void cblas_ztrmv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_DIAG Diag, const int N,const void *A, const int lda, void *X, const int incX); ?trsv void cblas_strsv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_DIAG Diag, const int N,const float *A, const int lda, float *X, const int incX); void cblas_dtrsv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_DIAG Diag, const int 
N, const double *A, const int lda, double *X, const int incX);
void cblas_ctrsv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_DIAG Diag, const int N, const void *A, const int lda, void *X, const int incX);
void cblas_ztrsv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_DIAG Diag, const int N, const void *A, const int lda, void *X, const int incX);

Level 3 CBLAS

This is an interface to “BLAS Level 3 Routines”, which perform basic matrix-matrix operations. Each C routine in this group has an additional parameter of type CBLAS_ORDER (the first argument) that determines whether the two-dimensional arrays use column-major or row-major storage.

?gemm
void cblas_sgemm(const enum CBLAS_ORDER Order, const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_TRANSPOSE TransB, const int M, const int N, const int K, const float alpha, const float *A, const int lda, const float *B, const int ldb, const float beta, float *C, const int ldc);
void cblas_dgemm(const enum CBLAS_ORDER Order, const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_TRANSPOSE TransB, const int M, const int N, const int K, const double alpha, const double *A, const int lda, const double *B, const int ldb, const double beta, double *C, const int ldc);
void cblas_cgemm(const enum CBLAS_ORDER Order, const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_TRANSPOSE TransB, const int M, const int N, const int K, const void *alpha, const void *A, const int lda, const void *B, const int ldb, const void *beta, void *C, const int ldc);
void cblas_zgemm(const enum CBLAS_ORDER Order, const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_TRANSPOSE TransB, const int M, const int N, const int K, const void *alpha, const void *A, const int lda, const void *B, const int ldb, const void *beta, void *C, const int ldc);

?hemm
void cblas_chemm(const enum CBLAS_ORDER Order, const enum
CBLAS_SIDE Side, const enum CBLAS_UPLO Uplo, const int M, const int N, const void *alpha, const void *A, const int lda, const void *B, const int ldb, const void *beta, void *C, const int ldc);
void cblas_zhemm(const enum CBLAS_ORDER Order, const enum CBLAS_SIDE Side, const enum CBLAS_UPLO Uplo, const int M, const int N, const void *alpha, const void *A, const int lda, const void *B, const int ldb, const void *beta, void *C, const int ldc);

?herk
void cblas_cherk(const enum CBLAS_ORDER Order, const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE Trans, const int N, const int K, const float alpha, const void *A, const int lda, const float beta, void *C, const int ldc);
void cblas_zherk(const enum CBLAS_ORDER Order, const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE Trans, const int N, const int K, const double alpha, const void *A, const int lda, const double beta, void *C, const int ldc);

?her2k
void cblas_cher2k(const enum CBLAS_ORDER Order, const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE Trans, const int N, const int K, const void *alpha, const void *A, const int lda, const void *B, const int ldb, const float beta, void *C, const int ldc);
void cblas_zher2k(const enum CBLAS_ORDER Order, const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE Trans, const int N, const int K, const void *alpha, const void *A, const int lda, const void *B, const int ldb, const double beta, void *C, const int ldc);

?symm
void cblas_ssymm(const enum CBLAS_ORDER Order, const enum CBLAS_SIDE Side, const enum CBLAS_UPLO Uplo, const int M, const int N, const float alpha, const float *A, const int lda, const float *B, const int ldb, const float beta, float *C, const int ldc);
void cblas_dsymm(const enum CBLAS_ORDER Order, const enum CBLAS_SIDE Side, const enum CBLAS_UPLO Uplo, const int M, const int N, const double alpha, const double *A, const int lda, const double *B, const int ldb, const double beta, double *C, const int ldc);
void cblas_csymm(const enum CBLAS_ORDER Order,
const enum CBLAS_SIDE Side, const enum CBLAS_UPLO Uplo, const int M, const int N, const void *alpha, const void *A, const int lda, const void *B, const int ldb, const void *beta, void *C, const int ldc);
void cblas_zsymm(const enum CBLAS_ORDER Order, const enum CBLAS_SIDE Side, const enum CBLAS_UPLO Uplo, const int M, const int N, const void *alpha, const void *A, const int lda, const void *B, const int ldb, const void *beta, void *C, const int ldc);

?syrk
void cblas_ssyrk(const enum CBLAS_ORDER Order, const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE Trans, const int N, const int K, const float alpha, const float *A, const int lda, const float beta, float *C, const int ldc);
void cblas_dsyrk(const enum CBLAS_ORDER Order, const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE Trans, const int N, const int K, const double alpha, const double *A, const int lda, const double beta, double *C, const int ldc);
void cblas_csyrk(const enum CBLAS_ORDER Order, const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE Trans, const int N, const int K, const void *alpha, const void *A, const int lda, const void *beta, void *C, const int ldc);
void cblas_zsyrk(const enum CBLAS_ORDER Order, const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE Trans, const int N, const int K, const void *alpha, const void *A, const int lda, const void *beta, void *C, const int ldc);

?syr2k
void cblas_ssyr2k(const enum CBLAS_ORDER Order, const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE Trans, const int N, const int K, const float alpha, const float *A, const int lda, const float *B, const int ldb, const float beta, float *C, const int ldc);
void cblas_dsyr2k(const enum CBLAS_ORDER Order, const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE Trans, const int N, const int K, const double alpha, const double *A, const int lda, const double *B, const int ldb, const double beta, double *C, const int ldc);
void cblas_csyr2k(const enum
CBLAS_ORDER Order, const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE Trans, const int N, const int K, const void *alpha, const void *A, const int lda, const void *B, const int ldb, const void *beta, void *C, const int ldc);
void cblas_zsyr2k(const enum CBLAS_ORDER Order, const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE Trans, const int N, const int K, const void *alpha, const void *A, const int lda, const void *B, const int ldb, const void *beta, void *C, const int ldc);

?trmm
void cblas_strmm(const enum CBLAS_ORDER Order, const enum CBLAS_SIDE Side, const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_DIAG Diag, const int M, const int N, const float alpha, const float *A, const int lda, float *B, const int ldb);
void cblas_dtrmm(const enum CBLAS_ORDER Order, const enum CBLAS_SIDE Side, const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_DIAG Diag, const int M, const int N, const double alpha, const double *A, const int lda, double *B, const int ldb);
void cblas_ctrmm(const enum CBLAS_ORDER Order, const enum CBLAS_SIDE Side, const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_DIAG Diag, const int M, const int N, const void *alpha, const void *A, const int lda, void *B, const int ldb);
void cblas_ztrmm(const enum CBLAS_ORDER Order, const enum CBLAS_SIDE Side, const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_DIAG Diag, const int M, const int N, const void *alpha, const void *A, const int lda, void *B, const int ldb);

?trsm
void cblas_strsm(const enum CBLAS_ORDER Order, const enum CBLAS_SIDE Side, const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_DIAG Diag, const int M, const int N, const float alpha, const float *A, const int lda, float *B, const int ldb);
void cblas_dtrsm(const enum CBLAS_ORDER Order, const enum CBLAS_SIDE Side, const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_DIAG
Diag, const int M, const int N, const double alpha, const double *A, const int lda, double *B, const int ldb);
void cblas_ctrsm(const enum CBLAS_ORDER Order, const enum CBLAS_SIDE Side, const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_DIAG Diag, const int M, const int N, const void *alpha, const void *A, const int lda, void *B, const int ldb);
void cblas_ztrsm(const enum CBLAS_ORDER Order, const enum CBLAS_SIDE Side, const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_DIAG Diag, const int M, const int N, const void *alpha, const void *A, const int lda, void *B, const int ldb);

Sparse CBLAS

This is an interface to Sparse BLAS Level 1 Routines, which perform a number of common vector operations on sparse vectors stored in compressed form. Note that all index parameters, indx, are in C-type notation and vary in the range [0..N-1].

?axpyi
void cblas_saxpyi(const int N, const float alpha, const float *X, const int *indx, float *Y);
void cblas_daxpyi(const int N, const double alpha, const double *X, const int *indx, double *Y);
void cblas_caxpyi(const int N, const void *alpha, const void *X, const int *indx, void *Y);
void cblas_zaxpyi(const int N, const void *alpha, const void *X, const int *indx, void *Y);

?doti
float cblas_sdoti(const int N, const float *X, const int *indx, const float *Y);
double cblas_ddoti(const int N, const double *X, const int *indx, const double *Y);

?dotci
void cblas_cdotci_sub(const int N, const void *X, const int *indx, const void *Y, void *dotui);
void cblas_zdotci_sub(const int N, const void *X, const int *indx, const void *Y, void *dotui);

?dotui
void cblas_cdotui_sub(const int N, const void *X, const int *indx, const void *Y, void *dotui);
void cblas_zdotui_sub(const int N, const void *X, const int *indx, const void *Y, void *dotui);

?gthr
void cblas_sgthr(const int N, const float *Y, float *X, const int *indx);
void cblas_dgthr(const int N, const double *Y, double *X, const int *indx);
void cblas_cgthr(const int N, const void *Y, void *X, const int *indx);
void cblas_zgthr(const int N, const void *Y, void *X, const int *indx);

?gthrz
void cblas_sgthrz(const int N, float *Y, float *X, const int *indx);
void cblas_dgthrz(const int N, double *Y, double *X, const int *indx);
void cblas_cgthrz(const int N, void *Y, void *X, const int *indx);
void cblas_zgthrz(const int N, void *Y, void *X, const int *indx);

?roti
void cblas_sroti(const int N, float *X, const int *indx, float *Y, const float c, const float s);
void cblas_droti(const int N, double *X, const int *indx, double *Y, const double c, const double s);

?sctr
void cblas_ssctr(const int N, const float *X, const int *indx, float *Y);
void cblas_dsctr(const int N, const double *X, const int *indx, double *Y);
void cblas_csctr(const int N, const void *X, const int *indx, void *Y);
void cblas_zsctr(const int N, const void *X, const int *indx, void *Y);

Specific Features of Fortran 95 Interfaces for LAPACK Routines (Appendix E)

Intel® MKL implements a Fortran 95 interface to the LAPACK package, referred to below as MKL LAPACK95, that exposes the full capability of the MKL FORTRAN 77 LAPACK routines. This is the principal difference between Intel MKL and the Netlib Fortran 95 implementation of LAPACK. Compared with the Intel MKL LAPACK77 implementation, MKL LAPACK95 adds a package of source interfaces together with wrappers that make the implementation compiler-independent. As a result, the MKL LAPACK package can be used in all programming environments intended for Fortran 95. Depending on the degree and type of difference from the Netlib implementation, the MKL LAPACK95 interfaces fall into several groups that require different transformations (see “MKL Fortran 95 Interfaces for LAPACK Routines vs. Netlib Implementation”).
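The calling sequences in this appendix list required arguments first and enclose each optional argument in square brackets, as in GESV(A,B[,IPIV][,INFO]). As a rough illustration of how such a sequence splits into required and optional arguments, here is a minimal C sketch. The names arg_counts and count_args are hypothetical helpers introduced only for this example; they are not part of Intel MKL or Netlib, and the sketch assumes a well-formed calling sequence.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper type (not part of Intel MKL): number of required and
 * optional arguments found in one calling sequence. */
typedef struct {
    int required;
    int optional;
} arg_counts;

/* Counts the arguments in a calling sequence written in this appendix's
 * notation, e.g. "GESV(A,B[,IPIV][,INFO])". An argument enclosed in square
 * brackets is counted as optional; every other argument is required.
 * Assumes the input is well formed (no nested or unbalanced brackets). */
arg_counts count_args(const char *spec)
{
    arg_counts n = {0, 0};
    int in_brackets = 0;   /* nonzero while between '[' and ']' */
    int in_name = 0;       /* nonzero while inside an identifier */
    const char *p;

    assert(spec != NULL);

    /* Skip the interface name: everything up to the opening parenthesis. */
    p = spec;
    while (*p != '\0' && *p != '(') {
        p++;
    }
    if (*p == '(') {
        p++;
    }

    /* Scan the parameter list up to the closing parenthesis. */
    for (; *p != '\0' && *p != ')'; p++) {
        if (*p == '[') {
            in_brackets = 1;
            in_name = 0;
        } else if (*p == ']') {
            in_brackets = 0;
            in_name = 0;
        } else if (*p == ',' || *p == ' ') {
            in_name = 0;
        } else if (!in_name) {
            /* First character of a new identifier. */
            in_name = 1;
            if (in_brackets) {
                n.optional++;
            } else {
                n.required++;
            }
        }
    }
    return n;
}
```

For example, count_args("GESV(A,B[,IPIV][,INFO])") reports two required arguments (A, B) and two optional ones (IPIV, INFO).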
The groups are given in full, with the calling sequences of the routines and the relevant differences from their Netlib counterparts. The following conventions are used:

<interface> ::= <name of interface> ‘(’ <parameter list> ‘)’
<parameter list> ::= <first parameter> {<parameter>}*
<first parameter> ::= <identifier>
<parameter> ::= <non-optional parameter> | <optional parameter>
<non-optional parameter> ::= ‘,’ <identifier>
<optional parameter> ::= ‘[,’ <identifier> ‘]’
<name of interface> ::= <identifier>

where defined notions are separated from definitions by ::=, notion names are marked by angle brackets, terminals are given in quotes, and {…}* denotes repetition zero, one, or more times. <first parameter> and each <non-optional parameter> must be present in all calls of the denoted interface; <optional parameter> may be omitted. Comments to interface definitions are provided where necessary. Comment lines begin with the character !. Two interfaces with the same name are listed when a subroutine has two call variants that differ in argument types.

Interfaces Identical to Netlib

GERFS(A,AF,IPIV,B,X[,TRANS][,FERR][,BERR][,INFO])
GETRI(A,IPIV[,INFO])
GEEQU(A,R,C[,ROWCND][,COLCND][,AMAX][,INFO])
GESV(A,B[,IPIV][,INFO])
GESVX(A,B,X[,AF][,IPIV][,FACT][,TRANS][,EQUED][,R][,C][,FERR][,BERR][,RCOND][,RPVGRW][,INFO])
GTSV(DL,D,DU,B[,INFO])
GTSVX(DL,D,DU,B,X[,DLF][,DF][,DUF][,DU2][,IPIV][,FACT][,TRANS][,FERR][,BERR][,RCOND][,INFO])
POSV(A,B[,UPLO][,INFO])
POSVX(A,B,X[,UPLO][,AF][,FACT][,EQUED][,S][,FERR][,BERR][,RCOND][,INFO])
PTSV(D,E,B[,INFO])
PTSVX(D,E,B,X[,DF][,EF][,FACT][,FERR][,BERR][,RCOND][,INFO])
SYSV(A,B[,UPLO][,IPIV][,INFO])
SYSVX(A,B,X[,UPLO][,AF][,IPIV][,FACT][,FERR][,BERR][,RCOND][,INFO])
HESVX(A,B,X[,UPLO][,AF][,IPIV][,FACT][,FERR][,BERR][,RCOND][,INFO])
HESV(A,B[,UPLO][,IPIV][,INFO])
SPSV(AP,B[,UPLO][,IPIV][,INFO])
HPSV(AP,B[,UPLO][,IPIV][,INFO])
SYTRD(A,TAU[,UPLO][,INFO])
ORGTR(A,TAU[,UPLO][,INFO])
HETRD(A,TAU[,UPLO][,INFO])
UNGTR(A,TAU[,UPLO][,INFO])
SYGST(A,B[,ITYPE][,UPLO][,INFO])
HEGST(A,B[,ITYPE][,UPLO][,INFO])
GELS(A,B[,TRANS][,INFO])
GELSY(A,B[,RANK][,JPVT][,RCOND][,INFO])
GELSS(A,B[,RANK][,S][,RCOND][,INFO])
GELSD(A,B[,RANK][,S][,RCOND][,INFO])
GGLSE(A,B,C,D,X[,INFO])
GGGLM(A,B,D,X,Y[,INFO])
SYEV(A,W[,JOBZ][,UPLO][,INFO])
HEEV(A,W[,JOBZ][,UPLO][,INFO])
SYEVD(A,W[,JOBZ][,UPLO][,INFO])
HEEVD(A,W[,JOBZ][,UPLO][,INFO])
STEV(D,E[,Z][,INFO])
STEVD(D,E[,Z][,INFO])
STEVX(D,E,W[,Z][,VL][,VU][,IL][,IU][,M][,IFAIL][,ABSTOL][,INFO])
STEVR(D,E,W[,Z][,VL][,VU][,IL][,IU][,M][,ISUPPZ][,ABSTOL][,INFO])
GEES(A,WR,WI[,VS][,SELECT][,SDIM][,INFO])
GEES(A,W[,VS][,SELECT][,SDIM][,INFO])
GEESX(A,WR,WI[,VS][,SELECT][,SDIM][,RCONDE][,RCONDV][,INFO])
GEESX(A,W[,VS][,SELECT][,SDIM][,RCONDE][,RCONDV][,INFO])
GEEV(A,WR,WI[,VL][,VR][,INFO])
GEEV(A,W[,VL][,VR][,INFO])
GEEVX(A,WR,WI[,VL][,VR][,BALANC][,ILO][,IHI][,SCALE][,ABNRM][,RCONDE][,RCONDV][,INFO])
GEEVX(A,W[,VL][,VR][,BALANC][,ILO][,IHI][,SCALE][,ABNRM][,RCONDE][,RCONDV][,INFO])
GESVD(A,S[,U][,VT][,WW][,JOB][,INFO])
GGSVD(A,B,ALPHA,BETA[,K][,L][,U][,V][,Q][,IWORK][,INFO])
SYGV(A,B,W[,ITYPE][,JOBZ][,UPLO][,INFO])
HEGV(A,B,W[,ITYPE][,JOBZ][,UPLO][,INFO])
SYGVD(A,B,W[,ITYPE][,JOBZ][,UPLO][,INFO])
HEGVD(A,B,W[,ITYPE][,JOBZ][,UPLO][,INFO])
SPGVD(AP,BP,W[,ITYPE][,UPLO][,Z][,INFO])
HPGVD(AP,BP,W[,ITYPE][,UPLO][,Z][,INFO])
SPGVX(AP,BP,W[,ITYPE][,UPLO][,Z][,VL][,VU][,IL][,IU][,M][,IFAIL][,ABSTOL][,INFO])
HPGVX(AP,BP,W[,ITYPE][,UPLO][,Z][,VL][,VU][,IL][,IU][,M][,IFAIL][,ABSTOL][,INFO])
SBGVD(AB,BB,W[,UPLO][,Z][,INFO])
HBGVD(AB,BB,W[,UPLO][,Z][,INFO])
SBGVX(AB,BB,W[,UPLO][,Z][,VL][,VU][,IL][,IU][,M][,IFAIL][,Q][,ABSTOL][,INFO])
HBGVX(AB,BB,W[,UPLO][,Z][,VL][,VU][,IL][,IU][,M][,IFAIL][,Q][,ABSTOL][,INFO])
GGES(A,B,ALPHAR,ALPHAI,BETA[,VSL][,VSR][,SELECT][,SDIM][,INFO])
GGES(A,B,ALPHA,BETA[,VSL][,VSR][,SELECT][,SDIM][,INFO])
GGESX(A,B,ALPHAR,ALPHAI,BETA[,VSL][,VSR][,SELECT][,SDIM][,RCONDE][,RCONDV][,INFO])
GGEV(A,B,ALPHAR,ALPHAI,BETA[,VL][,VR][,INFO])
GGEV(A,B,ALPHA,BETA[,VL][,VR][,INFO])
GGEVX(A,B,ALPHAR,ALPHAI,BETA[,VL][,VR][,BALANC][,ILO][,IHI][,LSCALE][,RSCALE][,ABNRM][,BBNRM][,RCONDE][,RCONDV][,INFO])
GGEVX(A,B,ALPHA,BETA[,VL][,VR][,BALANC][,ILO][,IHI][,LSCALE][,RSCALE][,ABNRM][,BBNRM][,RCONDE][,RCONDV][,INFO])

Interfaces with Replaced Argument Names

Argument names in the routines of this group are replaced as follows:

Netlib Argument Name    MKL Argument Name
A                       AB
A                       AP
AF                      AFB
AF                      AFP
B                       BB
B                       BP
K                       KL

GBSV(AB,B[,KL][,IPIV][,INFO])
! netlib: (A,B,K,IPIV,INFO)
GBSVX(AB,B,X[,KL][,AFB][,IPIV][,FACT][,TRANS][,EQUED][,R][,C][,FERR][,BERR][,RCOND][,RPVGRW][,INFO])
! netlib: (A,B,X,KL,AF,IPIV,FACT,TRANS,EQUED,R,C,FERR,
! BERR,RCOND,RPVGRW,INFO)
PPSV(AP,B[,UPLO][,INFO])
! netlib: (A,B,UPLO,INFO)
PPSVX(AP,B,X[,UPLO][,AFP][,FACT][,EQUED][,S][,FERR][,BERR][,RCOND][,INFO])
! netlib: (A,B,X,UPLO,AF,FACT,EQUED,S,FERR,BERR,RCOND,INFO)
PBSV(AB,B[,UPLO][,INFO])
! netlib: (A,B,UPLO,INFO)
PBSVX(AB,B,X[,UPLO][,AFB][,FACT][,EQUED][,S][,FERR][,BERR][,RCOND][,INFO])
! netlib: (A,B,X,UPLO,AF,FACT,EQUED,S,FERR,BERR,RCOND,INFO)
SPSVX(AP,B,X[,UPLO][,AFP][,IPIV][,FACT][,FERR][,BERR][,RCOND][,INFO])
! netlib: (A,B,X,UPLO,AF,IPIV,FACT,FERR,BERR,RCOND,INFO)
HPSVX(AP,B,X[,UPLO][,AFP][,IPIV][,FACT][,FERR][,BERR][,RCOND][,INFO])
! netlib: (A,B,X,UPLO,AF,IPIV,FACT,FERR,BERR,RCOND,INFO)
SPEV(AP,W[,UPLO][,Z][,INFO])
! netlib: (A,W,UPLO,Z,INFO)
HPEV(AP,W[,UPLO][,Z][,INFO])
! netlib: (A,W,UPLO,Z,INFO)
SPEVD(AP,W[,UPLO][,Z][,INFO])
! netlib: (A,W,UPLO,Z,INFO)
HPEVD(AP,W[,UPLO][,Z][,INFO])
! netlib: (A,W,UPLO,Z,INFO)
SPEVX(AP,W[,UPLO][,Z][,VL][,VU][,IL][,IU][,M][,IFAIL][,ABSTOL][,INFO])
! netlib: (A,B,W,UPLO,Z,VL,VU,IL,IU,M,IFAIL,ABSTOL,INFO)
HPEVX(AP,W[,UPLO][,Z][,VL][,VU][,IL][,IU][,M][,IFAIL][,ABSTOL][,INFO])
! netlib: (A,B,W,UPLO,Z,VL,VU,IL,IU,M,IFAIL,ABSTOL,INFO)
SBEV(AB,W[,UPLO][,Z][,INFO])
! netlib: (A,W,UPLO,Z,INFO)
HBEV(AB,W[,UPLO][,Z][,INFO])
! netlib: (A,W,UPLO,Z,INFO)
SBEVD(AB,W[,UPLO][,Z][,INFO])
! netlib: (A,W,UPLO,Z,INFO)
HBEVD(AB,W[,UPLO][,Z][,INFO])
! netlib: (A,W,UPLO,Z,INFO)
SBEVX(AB,W[,UPLO][,Z][,VL][,VU][,IL][,IU][,M][,IFAIL][,Q][,ABSTOL][,INFO])
! netlib: (A,B,W,UPLO,Z,VL,VU,IL,IU,M,IFAIL,Q,ABSTOL,INFO)
HBEVX(AB,W[,UPLO][,Z][,VL][,VU][,IL][,IU][,M][,IFAIL][,Q][,ABSTOL][,INFO])
! netlib: (A,B,W,UPLO,Z,VL,VU,IL,IU,M,IFAIL,Q,ABSTOL,INFO)
SPGV(AP,BP,W[,ITYPE][,UPLO][,Z][,INFO])
! netlib: (A,B,W,ITYPE,UPLO,Z,INFO)
HPGV(AP,BP,W[,ITYPE][,UPLO][,Z][,INFO])
! netlib: (A,B,W,ITYPE,UPLO,Z,INFO)
SBGV(AB,BB,W[,UPLO][,Z][,INFO])
! netlib: (A,B,W,UPLO,Z,INFO)
HBGV(AB,BB,W[,UPLO][,Z][,INFO])
! netlib: (A,B,W,UPLO,Z,INFO)

Modified Netlib Interfaces

SYEVX(A,W[,UPLO][,Z][,VL][,VU][,IL][,IU][,M][,IFAIL][,ABSTOL][,INFO])
! Interface netlib95 exists, parameters:
! netlib: (A,W,JOBZ,UPLO,VL,VU,IL,IU,M,IFAIL,ABSTOL,INFO)
! Different order for parameter UPLO, netlib: 4, mkl: 3
! Absent mkl parameter: JOBZ
! Extra mkl parameter: Z
HEEVX(A,W[,UPLO][,Z][,VL][,VU][,IL][,IU][,M][,IFAIL][,ABSTOL][,INFO])
! Interface netlib95 exists, parameters:
! netlib: (A,W,JOBZ,UPLO,VL,VU,IL,IU,M,IFAIL,ABSTOL,INFO)
! Different order for parameter UPLO, netlib: 4, mkl: 3
! Absent mkl parameter: JOBZ
! Extra mkl parameter: Z
SYEVR(A,W[,UPLO][,Z][,VL][,VU][,IL][,IU][,M][,ISUPPZ][,ABSTOL][,INFO])
! Interface netlib95 exists, parameters:
! netlib: (A,W,JOBZ,UPLO,VL,VU,IL,IU,M,ISUPPZ,ABSTOL,INFO)
! Different order for parameter UPLO, netlib: 4, mkl: 3
! Absent mkl parameter: JOBZ
! Extra mkl parameter: Z
HEEVR(A,W[,UPLO][,Z][,VL][,VU][,IL][,IU][,M][,ISUPPZ][,ABSTOL][,INFO])
! Interface netlib95 exists, parameters:
! netlib: (A,W,JOBZ,UPLO,VL,VU,IL,IU,M,ISUPPZ,ABSTOL,INFO)
! Different order for parameter UPLO, netlib: 4, mkl: 3
! Absent mkl parameter: JOBZ
! Extra mkl parameter: Z
GESDD(A,S[,U][,VT][,JOBZ][,INFO])
! Interface netlib95 exists, parameters:
! netlib: (A,S,U,VT,WW,JOB,INFO)
! Different number for parameter, netlib: 7, mkl: 6
! Absent mkl parameter: WW
! Absent mkl parameter: JOB
! Different order for parameter INFO, netlib: 7, mkl: 6
! Extra mkl parameter: JOBZ
SYGVX(A,B,W[,ITYPE][,UPLO][,Z][,VL][,VU][,IL][,IU][,M][,IFAIL][,ABSTOL][,INFO])
! Interface netlib95 exists, parameters:
! netlib: (A,B,W,ITYPE,JOBZ,UPLO,VL,VU,IL,IU,M,IFAIL,ABSTOL,INFO)
! Different order for parameter UPLO, netlib: 6, mkl: 5
! Absent mkl parameter: JOBZ
! Extra mkl parameter: Z
HEGVX(A,B,W[,ITYPE][,UPLO][,Z][,VL][,VU][,IL][,IU][,M][,IFAIL][,ABSTOL][,INFO])
! Interface netlib95 exists, parameters:
! netlib: (A,B,W,ITYPE,JOBZ,UPLO,VL,VU,IL,IU,M,IFAIL,ABSTOL,INFO)
! Different order for parameter UPLO, netlib: 6, mkl: 5
! Absent mkl parameter: JOBZ
! Extra mkl parameter: Z
GETRS(A,IPIV,B[,TRANS][,INFO])
! Interface netlib95 exists:
! Different intents for parameter A, netlib: INOUT, mkl: IN

Interfaces Absent From Netlib

GTTRF(DL,D,DU,DU2[,IPIV][,INFO])
PPTRF(A[,UPLO][,INFO])
PBTRF(A[,UPLO][,INFO])
PTTRF(D,E[,INFO])
SYTRF(A[,UPLO][,IPIV][,INFO])
HETRF(A[,UPLO][,IPIV][,INFO])
SPTRF(A[,UPLO][,IPIV][,INFO])
HPTRF(A[,UPLO][,IPIV][,INFO])
GBTRS(A,B,IPIV[,KL][,TRANS][,INFO])
GTTRS(DL,D,DU,DU2,B,IPIV[,TRANS][,INFO])
POTRS(A,B[,UPLO][,INFO])
PPTRS(A,B[,UPLO][,INFO])
PBTRS(A,B[,UPLO][,INFO])
PTTRS(D,E,B[,INFO])
PTTRS(D,E,B[,UPLO][,INFO])
SYTRS(A,B,IPIV[,UPLO][,INFO])
HETRS(A,B,IPIV[,UPLO][,INFO])
SPTRS(A,B,IPIV[,UPLO][,INFO])
HPTRS(A,B,IPIV[,UPLO][,INFO])
TRTRS(A,B[,UPLO][,TRANS][,DIAG][,INFO])
TPTRS(A,B[,UPLO][,TRANS][,DIAG][,INFO])
TBTRS(A,B[,UPLO][,TRANS][,DIAG][,INFO])
GECON(A,ANORM,RCOND[,NORM][,INFO])
GBCON(A,IPIV,ANORM,RCOND[,KL][,NORM][,INFO])
GTCON(DL,D,DU,DU2,IPIV,ANORM,RCOND[,NORM][,INFO])
POCON(A,ANORM,RCOND[,UPLO][,INFO])
PPCON(A,ANORM,RCOND[,UPLO][,INFO])
PBCON(A,ANORM,RCOND[,UPLO][,INFO])
PTCON(D,E,ANORM,RCOND[,INFO])
SYCON(A,IPIV,ANORM,RCOND[,UPLO][,INFO])
HECON(A,IPIV,ANORM,RCOND[,UPLO][,INFO])
SPCON(A,IPIV,ANORM,RCOND[,UPLO][,INFO])
HPCON(A,IPIV,ANORM,RCOND[,UPLO][,INFO])
TRCON(A,RCOND[,UPLO][,DIAG][,NORM][,INFO])
TPCON(A,RCOND[,UPLO][,DIAG][,NORM][,INFO])
TBCON(A,RCOND[,UPLO][,DIAG][,NORM][,INFO])
GBRFS(A,AF,IPIV,B,X[,KL][,TRANS][,FERR][,BERR][,INFO])
GTRFS(DL,D,DU,DLF,DF,DUF,DU2,IPIV,B,X[,TRANS][,FERR][,BERR][,INFO])
PORFS(A,AF,B,X[,UPLO][,FERR][,BERR][,INFO])
PPRFS(A,AF,B,X[,UPLO][,FERR][,BERR][,INFO])
PBRFS(A,AF,B,X[,UPLO][,FERR][,BERR][,INFO])
PTRFS(D,DF,E,EF,B,X[,FERR][,BERR][,INFO])
PTRFS(D,DF,E,EF,B,X[,UPLO][,FERR][,BERR][,INFO])
SYRFS(A,AF,IPIV,B,X[,UPLO][,FERR][,BERR][,INFO])
HERFS(A,AF,IPIV,B,X[,UPLO][,FERR][,BERR][,INFO])
SPRFS(A,AF,IPIV,B,X[,UPLO][,FERR][,BERR][,INFO])
HPRFS(A,AF,IPIV,B,X[,UPLO][,FERR][,BERR][,INFO])
TRRFS(A,B,X[,UPLO][,TRANS][,DIAG][,FERR][,BERR][,INFO])
TPRFS(A,B,X[,UPLO][,TRANS][,DIAG][,FERR][,BERR][,INFO])
TBRFS(A,B,X[,UPLO][,TRANS][,DIAG][,FERR][,BERR][,INFO])
POTRI(A[,UPLO][,INFO])
PPTRI(A[,UPLO][,INFO])
SYTRI(A,IPIV[,UPLO][,INFO])
HETRI(A,IPIV[,UPLO][,INFO])
SPTRI(A,IPIV[,UPLO][,INFO])
HPTRI(A,IPIV[,UPLO][,INFO])
TRTRI(A[,UPLO][,DIAG][,INFO])
TPTRI(A[,UPLO][,DIAG][,INFO])
GBEQU(A,R,C[,KL][,ROWCND][,COLCND][,AMAX][,INFO])
POEQU(A,S[,SCOND][,AMAX][,INFO])
PPEQU(A,S[,SCOND][,AMAX][,UPLO][,INFO])
PBEQU(A,S[,SCOND][,AMAX][,UPLO][,INFO])
HESV(A,B[,UPLO][,IPIV][,INFO])
HPSV(A,B[,UPLO][,IPIV][,INFO])
GEQRF(A[,TAU][,INFO])
GEQPF(A,JPVT[,TAU][,INFO])
GEQP3(A,JPVT[,TAU][,INFO])
ORGQR(A,TAU[,INFO])
ORMQR(A,TAU,C[,SIDE][,TRANS][,INFO])
UNGQR(A,TAU[,INFO])
UNMQR(A,TAU,C[,SIDE][,TRANS][,INFO])
GELQF(A[,TAU][,INFO])
ORGLQ(A,TAU[,INFO])
ORMLQ(A,TAU,C[,SIDE][,TRANS][,INFO])
UNGLQ(A,TAU[,INFO])
UNMLQ(A,TAU,C[,SIDE][,TRANS][,INFO])
GEQLF(A[,TAU][,INFO])
ORGQL(A,TAU[,INFO])
UNGQL(A,TAU[,INFO])
ORMQL(A,TAU,C[,SIDE][,TRANS][,INFO])
UNMQL(A,TAU,C[,SIDE][,TRANS][,INFO])
GERQF(A[,TAU][,INFO])
ORGRQ(A,TAU[,INFO])
UNGRQ(A,TAU[,INFO])
ORMRQ(A,TAU,C[,SIDE][,TRANS][,INFO])
UNMRQ(A,TAU,C[,SIDE][,TRANS][,INFO])
TZRZF(A[,TAU][,INFO])
ORMRZ(A,TAU,C,L[,SIDE][,TRANS][,INFO])
UNMRZ(A,TAU,C,L[,SIDE][,TRANS][,INFO])
GGQRF(A,B[,TAUA][,TAUB][,INFO])
GGRQF(A,B[,TAUA][,TAUB][,INFO])
GEBRD(A[,D][,E][,TAUQ][,TAUP][,INFO])
GBBRD(A[,C][,D][,E][,Q][,PT][,KL][,M][,INFO])
ORGBR(A,TAU[,VECT][,INFO])
ORMBR(A,TAU,C[,VECT][,SIDE][,TRANS][,INFO])
ORMTR(A,TAU,C[,SIDE][,UPLO][,TRANS][,INFO])
UNGBR(A,TAU[,VECT][,INFO])
UNMBR(A,TAU,C[,VECT][,SIDE][,TRANS][,INFO])
BDSQR(D,E[,VT][,U][,C][,UPLO][,INFO])
BDSDC(D,E[,U][,VT][,Q][,IQ][,UPLO][,INFO])
UNMTR(A,TAU,C[,SIDE][,UPLO][,TRANS][,INFO])
SPTRD(A,TAU[,UPLO][,INFO])
OPGTR(A,TAU,Q[,UPLO][,INFO])
OPMTR(A,TAU,C[,SIDE][,UPLO][,TRANS][,INFO])
HPTRD(A,TAU[,UPLO][,INFO])
UPGTR(A,TAU,Q[,UPLO][,INFO])
UPMTR(A,TAU,C[,SIDE][,UPLO][,TRANS][,INFO])
SBTRD(A[,Q][,VECT][,UPLO][,INFO])
HBTRD(A[,Q][,VECT][,UPLO][,INFO])
STERF(D,E[,INFO])
STEQR(D,E[,Z][,COMPZ][,INFO])
STEDC(D,E[,Z][,COMPZ][,INFO])
STEGR(D,E,W[,Z][,VL][,VU][,IL][,IU][,M][,ISUPPZ][,ABSTOL][,INFO])
PTEQR(D,E[,Z][,COMPZ][,INFO])
STEBZ(D,E,M,NSPLIT,W,IBLOCK,ISPLIT[,ORDER][,VL][,VU][,IL][,IU][,ABSTOL][,INFO])
STEIN(D,E,W,IBLOCK,ISPLIT,Z[,IFAILV][,INFO])
DISNA(D,SEP[,JOB][,MINMN][,INFO])
SPGST(A,B[,ITYPE][,UPLO][,INFO])
HPGST(A,B[,ITYPE][,UPLO][,INFO])
SBGST(A,B[,X][,UPLO][,INFO])
HBGST(A,B[,X][,UPLO][,INFO])
PBSTF(B[,UPLO][,INFO])
GEHRD(A[,TAU][,ILO][,IHI][,INFO])
ORGHR(A,TAU[,ILO][,IHI][,INFO])
ORMHR(A,TAU,C[,ILO][,IHI][,SIDE][,TRANS][,INFO])
UNGHR(A,TAU[,ILO][,IHI][,INFO])
UNMHR(A,TAU,C[,ILO][,IHI][,SIDE][,TRANS][,INFO])
GEBAL(A[,SCALE][,ILO][,IHI][,JOB][,INFO])
GEBAK(V,SCALE[,ILO][,IHI][,JOB][,SIDE][,INFO])
HSEQR(H,WR,WI[,ILO][,IHI][,Z][,JOB][,COMPZ][,INFO])
HSEQR(H,W[,ILO][,IHI][,Z][,JOB][,COMPZ][,INFO])
HSEIN(H,WR,WI,SELECT[,VL][,VR][,IFAILL][,IFAILR][,INITV][,EIGSRC][,M][,INFO])
HSEIN(H,W,SELECT[,VL][,VR][,IFAILL][,IFAILR][,INITV][,EIGSRC][,M][,INFO])
TREVC(T[,HOWMNY][,SELECT][,VL][,VR][,M][,INFO])
TRSNA(T[,S][,SEP][,VL][,VR][,SELECT][,M][,INFO])
TREXC(T,IFST,ILST[,Q][,INFO])
TRSEN(T,SELECT[,WR][,WI][,M][,S][,SEP][,Q][,INFO])
TRSEN(T,SELECT[,W][,M][,S][,SEP][,Q][,INFO])
TRSYL(A,B,C,SCALE[,TRANA][,TRANB][,ISGN][,INFO])
GGHRD(A,B[,ILO][,IHI][,Q][,Z][,COMPQ][,COMPZ][,INFO])
GGBAL(A,B[,ILO][,IHI][,LSCALE][,RSCALE][,JOB][,INFO])
GGBAK(V[,ILO][,IHI][,LSCALE][,RSCALE][,JOB][,INFO])
HGEQZ(H,T[,ILO][,IHI][,ALPHAR][,ALPHAI][,BETA][,Q][,Z][,JOB][,COMPQ][,COMPZ][,INFO])
HGEQZ(H,T[,ILO][,IHI][,ALPHA][,BETA][,Q][,Z][,JOB][,COMPQ][,COMPZ][,INFO])
TGEVC(S,P[,HOWMNY][,SELECT][,VL][,VR][,M][,INFO])
TGEXC(A,B[,IFST][,ILST][,Z][,Q][,INFO])
TGSEN(A,B,SELECT[,ALPHAR][,ALPHAI][,BETA][,IJOB][,Q][,Z][,PL][,PR][,DIF][,M][,INFO])
TGSEN(A,B,SELECT[,ALPHA][,BETA][,IJOB][,Q][,Z][,PL][,PR][,DIF][,M][,INFO])
TGSYL(A,B,C,D,E,F[,IJOB][,TRANS][,SCALE][,DIF][,INFO])
TGSNA(A,B[,S][,DIF][,VL][,VR][,SELECT][,M][,INFO])
GGSVP(A,B,TOLA,TOLB[,K][,L][,U][,V][,Q][,INFO])
TGSJA(A,B,TOLA,TOLB,K,L[,U][,V][,Q][,JOBU][,JOBV][,JOBQ][,ALPHA][,BETA][,NCYCLE][,INFO])

Interfaces of New Functionality

GETRF(A[,IPIV][,INFO])
! Interface netlib95 exists, parameters:
! netlib: (A,IPIV,RCOND,NORM,INFO)
! Different number for parameter, netlib: 5, mkl: 3
! Different order for parameter INFO, netlib: 5, mkl: 3
! Absent mkl parameter: NORM
! Absent mkl parameter: RCOND
GBTRF(A[,KL][,M][,IPIV][,INFO])
! Interface netlib95 exists, parameters:
! netlib: (A,K,M,IPIV,RCOND,NORM,INFO)
! Different number for parameter, netlib: 7, mkl: 5
! Different order for parameter INFO, netlib: 7, mkl: 5
! Absent mkl parameter: NORM
! Replace parameter name: netlib: K: mkl: KL
! Absent mkl parameter: RCOND
POTRF(A[,UPLO][,INFO])
! Interface netlib95 exists, parameters:
! netlib: (A,UPLO,RCOND,NORM,INFO)
! Different number for parameter, netlib: 5, mkl: 3
! Different order for parameter INFO, netlib: 5, mkl: 3
! Absent mkl parameter: NORM
!
Absent mkl parameter: RCOND

Appendix F. FFTW Interface to Intel® Math Kernel Library

Intel® Math Kernel Library (Intel® MKL) offers FFTW2 and FFTW3 interfaces to the Intel MKL Fast Fourier Transform and Trigonometric Transform functionality. The purpose of these interfaces is to enable applications using FFTW (www.fftw.org) to gain performance with Intel MKL without changing the program source code.

Both the FFTW2 and FFTW3 interfaces are provided in open source as FFTW wrappers to Intel MKL. For ease of use, the FFTW3 interface is also integrated in Intel MKL.

Notational Conventions

This appendix typically employs path notations for Windows* OS.

FFTW2 Interface to Intel® Math Kernel Library

This section describes a collection of wrappers providing the FFTW 2.x interface to Intel MKL. The wrappers translate calls to FFTW 2.x functions into calls to the Intel MKL Fast Fourier Transform interface (FFT interface). The wrappers correspond to FFTW version 2.x and Intel MKL versions 7.0 or higher.

Because of differences between the FFTW and Intel MKL FFT functionality, there are restrictions on using the wrappers instead of the FFTW functions. Some FFTW functions have empty wrappers. However, many typical FFTs can be computed using these wrappers.

Refer to chapter 11, "Fourier Transform Functions", to better understand the effects of using the wrappers.

More wrappers may be added in the future to extend the FFTW functionality available with Intel MKL.

Wrappers Reference

This section provides a brief reference for the FFTW 2.x C interface. For details, refer to the original FFTW 2.x documentation available at www.fftw.org.

Each FFTW function has its own wrapper. Some of them, which are not expressly listed in this section, are empty and do nothing, but they are provided to avoid link errors and satisfy the function calls.
The Intel MKL FFT interface operates on both single-precision (float) and double-precision data types.

One-dimensional Complex-to-complex FFTs

The following functions compute a one-dimensional complex-to-complex Fast Fourier transform.

    fftw_plan fftw_create_plan(int n, fftw_direction dir, int flags);
    fftw_plan fftw_create_plan_specific(int n, fftw_direction dir, int flags, fftw_complex *in, int istride, fftw_complex *out, int ostride);
    void fftw(fftw_plan plan, int howmany, fftw_complex *in, int istride, int idist, fftw_complex *out, int ostride, int odist);
    void fftw_one(fftw_plan plan, fftw_complex *in, fftw_complex *out);
    void fftw_destroy_plan(fftw_plan plan);

Multi-dimensional Complex-to-complex FFTs

The following functions compute a multi-dimensional complex-to-complex Fast Fourier transform.

    fftwnd_plan fftwnd_create_plan(int rank, const int *n, fftw_direction dir, int flags);
    fftwnd_plan fftw2d_create_plan(int nx, int ny, fftw_direction dir, int flags);
    fftwnd_plan fftw3d_create_plan(int nx, int ny, int nz, fftw_direction dir, int flags);
    fftwnd_plan fftwnd_create_plan_specific(int rank, const int *n, fftw_direction dir, int flags, fftw_complex *in, int istride, fftw_complex *out, int ostride);
    fftwnd_plan fftw2d_create_plan_specific(int nx, int ny, fftw_direction dir, int flags, fftw_complex *in, int istride, fftw_complex *out, int ostride);
    fftwnd_plan fftw3d_create_plan_specific(int nx, int ny, int nz, fftw_direction dir, int flags, fftw_complex *in, int istride, fftw_complex *out, int ostride);
    void fftwnd(fftwnd_plan plan, int howmany, fftw_complex *in, int istride, int idist, fftw_complex *out, int ostride, int odist);
    void fftwnd_one(fftwnd_plan plan, fftw_complex *in, fftw_complex *out);
    void fftwnd_destroy_plan(fftwnd_plan plan);

One-dimensional Real-to-half-complex/Half-complex-to-real FFTs

Half-complex representation of a conjugate-even symmetric vector of size N in a real array of the same size N consists of N/2+1 real parts of the elements of the vector
followed by non-zero imaginary parts in the reverse order. Because the Intel MKL FFT interface does not currently support this representation, all wrappers of this kind are empty and do nothing.

Nevertheless, you can perform one-dimensional real-to-complex and complex-to-real transforms using rfftwnd functions with rank=1.

See Also
    Multi-dimensional Real-to-complex/Complex-to-real FFTs

Multi-dimensional Real-to-complex/Complex-to-real FFTs

The following functions compute multi-dimensional real-to-complex and complex-to-real Fast Fourier transforms.

    rfftwnd_plan rfftwnd_create_plan(int rank, const int *n, fftw_direction dir, int flags);
    rfftwnd_plan rfftw2d_create_plan(int nx, int ny, fftw_direction dir, int flags);
    rfftwnd_plan rfftw3d_create_plan(int nx, int ny, int nz, fftw_direction dir, int flags);
    rfftwnd_plan rfftwnd_create_plan_specific(int rank, const int *n, fftw_direction dir, int flags, fftw_real *in, int istride, fftw_real *out, int ostride);
    rfftwnd_plan rfftw2d_create_plan_specific(int nx, int ny, fftw_direction dir, int flags, fftw_real *in, int istride, fftw_real *out, int ostride);
    rfftwnd_plan rfftw3d_create_plan_specific(int nx, int ny, int nz, fftw_direction dir, int flags, fftw_real *in, int istride, fftw_real *out, int ostride);
    void rfftwnd_real_to_complex(rfftwnd_plan plan, int howmany, fftw_real *in, int istride, int idist, fftw_complex *out, int ostride, int odist);
    void rfftwnd_complex_to_real(rfftwnd_plan plan, int howmany, fftw_complex *in, int istride, int idist, fftw_real *out, int ostride, int odist);
    void rfftwnd_one_real_to_complex(rfftwnd_plan plan, fftw_real *in, fftw_complex *out);
    void rfftwnd_one_complex_to_real(rfftwnd_plan plan, fftw_complex *in, fftw_real *out);
    void rfftwnd_destroy_plan(rfftwnd_plan plan);

Multi-threaded FFTW

This section discusses multi-threaded FFTW wrappers only.
MPI FFTW wrappers, available only with Intel MKL for the Linux* and Windows* operating systems, are described in section "MPI FFTW Wrappers".

Unlike the original FFTW interface, every computational function in the FFTW2 interface to Intel MKL provides multithreaded computation by default, with the number of threads defined by the number of processors available on the system (see section "Managing Performance and Memory" in the Intel MKL User's Guide). To limit the number of threads that use the FFTW interface, call the threaded FFTW computational functions:

    void fftw_threads(int nthreads, fftw_plan plan, int howmany, fftw_complex *in, int istride, int idist, fftw_complex *out, int ostride, int odist);
    void fftw_threads_one(int nthreads, fftw_plan plan, fftw_complex *in, fftw_complex *out);
    ...
    void rfftwnd_threads_real_to_complex(int nthreads, rfftwnd_plan plan, int howmany, fftw_real *in, int istride, int idist, fftw_complex *out, int ostride, int odist);

Compared to its non-threaded counterpart, every threaded computational function has threads_ as the second part of its name and an additional first parameter, nthreads. Set the nthreads parameter to the thread limit to ensure that the computation uses at most that number of threads.

FFTW Support Functions

The FFTW wrappers provide memory allocation functions to be used with FFTW:

    void* fftw_malloc(size_t n);
    void fftw_free(void* x);

The fftw_malloc wrapper aligns the memory on a 16-byte boundary.

If fftw_malloc fails to allocate memory, it aborts the application. To override this behavior, set a global variable fftw_malloc_hook and optionally the complementary variable fftw_free_hook:

    void *(*fftw_malloc_hook) (size_t n);
    void (*fftw_free_hook) (void *p);

The wrappers use the function fftw_die to abort the application in cases when a caller cannot be informed of an error otherwise (for example, in computational functions that return void).
To override this behavior, set a global variable fftw_die_hook:

    void (*fftw_die_hook)(const char *error_string);
    void fftw_die(const char *s);

Limitations of the FFTW2 Interface to Intel MKL

The FFTW2 wrappers implement the functionality of only those FFTW functions that Intel MKL can reasonably support. Other functions are provided as no-operation functions, whose only purpose is to satisfy link-time symbol resolution. Specifically, no-operation functions include:

• Real-to-half-complex and respective backward transforms
• Print plan functions
• Functions for importing/exporting/forgetting wisdom
• Most of the FFTW functions not covered by the original FFTW2 documentation

Because the Intel MKL implementation of FFTW2 wrappers does not use plan and plan node structures declared in fftw.h, the behavior of an application that relies on the internals of the plan structures defined in that header file is undefined.

FFTW2 wrappers define a plan as a set of attributes, such as strides, used to commit the Intel MKL FFT descriptor structure. If an FFTW2 computational function is called with attributes different from those recorded in the plan, the function attempts to adjust the attributes of the plan and recommit the descriptor. Thus, repeated calls of a computational function with the same plan but different strides, distances, and other parameters may be inefficient, because each such call can trigger a recommit.

Plan creation functions disregard most planner flags passed through the flags parameter. These functions take into account only the following values of flags:

• FFTW_IN_PLACE
  If this value of flags is supplied, the plan is marked so that computational functions using that plan ignore the parameters related to output (out, ostride, and odist). Unlike the original FFTW interface, the wrappers never use the out parameter as a scratch space for in-place transforms.
• FFTW_THREADSAFE
  If this value of flags is supplied, the plan is marked read-only.
  An attempt to change attributes of a read-only plan aborts the application.

FFTW wrappers are generally not thread safe. Therefore, do not use the same plan in parallel user threads simultaneously.

Calling Wrappers from Fortran

The FFTW2 wrappers to Intel MKL provide the following subroutines for calling from Fortran:

    call fftw_f77_create_plan(plan, n, dir, flags)
    call fftw_f77(plan, howmany, in, istride, idist, out, ostride, odist)
    call fftw_f77_one(plan, in, out)
    call fftw_f77_threads(nthreads, plan, howmany, in, istride, idist, out, ostride, odist)
    call fftw_f77_threads_one(nthreads, plan, in, out)
    call fftw_f77_destroy_plan(plan)
    call fftwnd_f77_create_plan(plan, rank, n, dir, flags)
    call fftw2d_f77_create_plan(plan, nx, ny, dir, flags)
    call fftw3d_f77_create_plan(plan, nx, ny, nz, dir, flags)
    call fftwnd_f77(plan, howmany, in, istride, idist, out, ostride, odist)
    call fftwnd_f77_one(plan, in, out)
    call fftwnd_f77_threads(nthreads, plan, howmany, in, istride, idist, out, ostride, odist)
    call fftwnd_f77_threads_one(nthreads, plan, in, out)
    call fftwnd_f77_destroy_plan(plan)
    call rfftw_f77_create_plan(plan, n, dir, flags)
    call rfftw_f77(plan, howmany, in, istride, idist, out, ostride, odist)
    call rfftw_f77_one(plan, in, out)
    call rfftw_f77_threads(nthreads, plan, howmany, in, istride, idist, out, ostride, odist)
    call rfftw_f77_threads_one(nthreads, plan, in, out)
    call rfftw_f77_destroy_plan(plan)
    call rfftwnd_f77_create_plan(plan, rank, n, dir, flags)
    call rfftw2d_f77_create_plan(plan, nx, ny, dir, flags)
    call rfftw3d_f77_create_plan(plan, nx, ny, nz, dir, flags)
    call rfftwnd_f77_complex_to_real(plan, howmany, in, istride, idist, out, ostride, odist)
    call rfftwnd_f77_one_complex_to_real(plan, in, out)
    call rfftwnd_f77_real_to_complex(plan, howmany, in, istride, idist, out, ostride, odist)
    call rfftwnd_f77_one_real_to_complex(plan, in, out)
    call rfftwnd_f77_threads_complex_to_real(nthreads, plan, howmany, in,
        istride, idist, out, ostride, odist)
    call rfftwnd_f77_threads_one_complex_to_real(nthreads, plan, in, out)
    call rfftwnd_f77_threads_real_to_complex(nthreads, plan, howmany, in, istride, idist, out, ostride, odist)
    call rfftwnd_f77_threads_one_real_to_complex(nthreads, plan, in, out)
    call rfftwnd_f77_destroy_plan(plan)
    call fftw_f77_threads_init(info)

The FFTW Fortran functions are wrappers to the FFTW C functions, so their functionality and limitations are the same as those of the corresponding C wrappers.

See Also
    Wrappers Reference
    Limitations of the FFTW2 Interface to Intel MKL

Installation

Wrappers are delivered as source code, which you must compile to build the wrapper library. Then you can substitute the wrapper and Intel MKL libraries for the FFTW library.

The source code for the wrappers and the makefiles with the wrapper list files are located in the .\interfaces\fftw2xc and .\interfaces\fftw2xf subdirectories in the Intel MKL directory for C and Fortran wrappers, respectively.

Creating the Wrapper Library

Two header files are used to compile the C wrapper library: fftw2_mkl.h and fftw.h. The fftw2_mkl.h file is located in the .\interfaces\fftw2xc\wrappers subdirectory in the Intel MKL directory.

Three header files are used to compile the Fortran wrapper library: fftw2_mkl.h, fftw2_f77_mkl.h, and fftw.h. The fftw2_mkl.h and fftw2_f77_mkl.h files are located in the .\interfaces\fftw2xf\wrappers subdirectory in the Intel MKL directory.

The file fftw.h, used to compile libraries for both interfaces and located in the .\include\fftw subdirectory in the Intel MKL directory, slightly differs from the original FFTW (www.fftw.org) header file fftw.h.

The source code for the wrappers, makefiles, and function list files is located in subdirectories .\interfaces\fftw2xc and .\interfaces\fftw2xf in the Intel MKL directory for C and Fortran wrappers, respectively.
A wrapper library contains C or Fortran wrappers for complex and real transforms in serial and multithreaded modes for one of the two data types (double or float). The makefile parameters specify the platform (required), the compiler, and the data precision, which selects the data type. The makefile comment heading provides the exact description of these parameters.

Because a C compiler builds the Fortran wrapper library, function names in the wrapper library and Fortran object module may be different. The file fftw2_f77_mkl.h in the .\interfaces\fftw2xf\source subdirectory in the Intel MKL directory defines function names according to the names in the Fortran module. If a required name is missing in the file, you can modify the file to add the name before building the library.

To build the library, run the make command on Linux* OS and Mac OS* X or the nmake command on Windows* OS with appropriate parameters.

For example, on Linux OS the command make libintel64 builds a double-precision wrapper library for Intel® 64 architecture based applications using the Intel® C++ Compiler or the Intel® Fortran Compiler version 9.1 or higher (the compiler and data precision are chosen by default).

Each makefile creates the library in the directory with the Intel MKL libraries corresponding to the platform used, for example, ./lib/ia32 (on Linux OS and Mac OS X) or .\lib\ia32 (on Windows* OS).

In the wrapper library names, the suffix corresponds to the compiler used, the letter "f" precedes the underscore for Fortran, and the letter "c" precedes the underscore for C. For example:

    fftw2xf_intel.lib (on Windows OS); libfftw2xf_intel.a (on Linux OS and Mac OS X)
    fftw2xc_intel.lib (on Windows OS); libfftw2xc_intel.a (on Linux OS and Mac OS X)
    fftw2xc_ms.lib (on Windows OS); libfftw2xc_gnu.a (on Linux OS and Mac OS X)
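The naming convention above can be expressed in a few lines of code, which may help when a build script has to locate the library it just produced. The following is a minimal sketch in plain C, not part of Intel MKL: the function name wrapper_lib_name and its parameters are hypothetical, and it only reproduces the file names listed above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not part of Intel MKL: builds the expected FFTW2
 * wrapper library file name from the naming convention in the text.
 *   lang     - 'c' for the C wrappers, 'f' for the Fortran wrappers
 *   compiler - compiler suffix, for example "intel", "gnu", or "ms"
 *   windows  - nonzero for Windows OS (.lib extension), zero for
 *              Linux OS and Mac OS X ("lib" prefix, .a extension)
 * Returns the number of characters written, or -1 if buf is too small. */
int wrapper_lib_name(char *buf, size_t size, char lang,
                     const char *compiler, int windows)
{
    assert(lang == 'c' || lang == 'f');
    int n = snprintf(buf, size, "%sfftw2x%c_%s%s",
                     windows ? "" : "lib", lang, compiler,
                     windows ? ".lib" : ".a");
    return (n < 0 || (size_t)n >= size) ? -1 : n;
}
```

For instance, wrapper_lib_name(buf, sizeof buf, 'c', "gnu", 0) fills buf with libfftw2xc_gnu.a, matching the last example in the list above.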
Application Assembling

Use the necessary original FFTW (www.fftw.org) header files without any modifications. Use the created wrapper library and the Intel MKL library instead of the FFTW library.

Running Examples

Intel MKL provides examples to demonstrate how to use the wrapper library. The source code for the examples, makefiles used to run them, and the example list files are located in the .\examples\fftw2xc and .\examples\fftw2xf subdirectories in the Intel MKL directory for C and Fortran, respectively. To build examples, several additional files are needed: fftw.h, fftw_threads.h, rfftw.h, rfftw_threads.h, and fftw_f77.i. These files are distributed with permission from FFTW and are available in .\include\fftw. The original files can also be found in FFTW 2.1.5 at http://www.fftw.org/download.html.

An example makefile uses the function parameter in addition to the parameters that the respective wrapper library makefile uses (see Creating the Wrapper Library). The makefile comment heading provides the exact description of these parameters.

An example makefile normally invokes examples. However, if the appropriate wrapper library is not yet created, the makefile first builds the library the same way as the wrapper library makefile does and then proceeds to examples.

If the parameter function= is defined, only the specified example runs. Otherwise, all examples from the appropriate subdirectory run. The subdirectory .\_results is created, and the results are stored there in the .res files.

MPI FFTW Wrappers

MPI FFTW wrappers for FFTW 2 are available only with Intel® MKL for the Linux* and Windows* operating systems.

MPI FFTW Wrappers Reference

This section provides a reference for the MPI FFTW C interface.
Complex MPI FFTW

Complex One-dimensional MPI FFTW Transforms

    fftw_mpi_plan fftw_mpi_create_plan(MPI_Comm comm, int n, fftw_direction dir, int flags);
    void fftw_mpi(fftw_mpi_plan p, int n_fields, fftw_complex *local_data, fftw_complex *work);
    void fftw_mpi_local_sizes(fftw_mpi_plan p, int *local_n, int *local_start, int *local_n_after_transform, int *local_start_after_transform, int *total_local_size);
    void fftw_mpi_destroy_plan(fftw_mpi_plan plan);

Argument restrictions:

• Supported values of flags are FFTW_ESTIMATE, FFTW_MEASURE, FFTW_SCRAMBLED_INPUT and FFTW_SCRAMBLED_OUTPUT. The same algorithm corresponds to all these values of the flags parameter. If any other flags value is supplied, the wrapper library reports an error 'CDFT error in wrapper: unknown flags'.
• The only supported value of n_fields is 1.

Complex Multi-dimensional MPI FFTW Transforms

    fftwnd_mpi_plan fftw2d_mpi_create_plan(MPI_Comm comm, int nx, int ny, fftw_direction dir, int flags);
    fftwnd_mpi_plan fftw3d_mpi_create_plan(MPI_Comm comm, int nx, int ny, int nz, fftw_direction dir, int flags);
    fftwnd_mpi_plan fftwnd_mpi_create_plan(MPI_Comm comm, int dim, int *n, fftw_direction dir, int flags);
    void fftwnd_mpi(fftwnd_mpi_plan p, int n_fields, fftw_complex *local_data, fftw_complex *work, fftwnd_mpi_output_order output_order);
    void fftwnd_mpi_local_sizes(fftwnd_mpi_plan p, int *local_nx, int *local_x_start, int *local_ny_after_transpose, int *local_y_start_after_transpose, int *total_local_size);
    void fftwnd_mpi_destroy_plan(fftwnd_mpi_plan plan);

Argument restrictions:

• Supported values of flags are FFTW_ESTIMATE and FFTW_MEASURE. If any other value of flags is supplied, the wrapper library reports an error 'CDFT error in wrapper: unknown flags'.
• The only supported value of n_fields is 1.
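Because the complex MPI wrappers accept only n_fields equal to 1, code written for original FFTW that transforms several interleaved fields at once would have to process one field per call. One possible workaround, sketched below in plain C, copies a field into a contiguous buffer before the transform and copies it back afterwards. This is an illustration under stated assumptions, not something the wrappers provide: the interleaved layout (element i of field f at index i*n_fields + f) is assumed to match the original FFTW 2.x convention, complex_pair is a stand-in for fftw_complex so that the sketch compiles without FFTW, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for fftw_complex from the FFTW 2.x header fftw.h, declared here
 * only so that the sketch compiles without FFTW (assumption). */
typedef struct { double re, im; } complex_pair;

/* Hypothetical helper: copies field number `field` out of an array holding
 * n_fields interleaved fields (element i of field f is assumed to sit at
 * index i * n_fields + f) into a contiguous single-field buffer, which
 * could then be passed to a transform call with n_fields equal to 1. */
void extract_field(const complex_pair *interleaved, size_t count,
                   size_t n_fields, size_t field, complex_pair *single)
{
    assert(n_fields > 0 && field < n_fields);
    for (size_t i = 0; i < count; ++i)
        single[i] = interleaved[i * n_fields + field];
}

/* Hypothetical inverse helper: writes a contiguous single-field buffer
 * back into its slots of the interleaved array. */
void store_field(const complex_pair *single, size_t count,
                 size_t n_fields, size_t field, complex_pair *interleaved)
{
    assert(n_fields > 0 && field < n_fields);
    for (size_t i = 0; i < count; ++i)
        interleaved[i * n_fields + field] = single[i];
}
```

The extra copies cost memory bandwidth, so where the application controls its own data layout, storing each field contiguously from the start is likely the simpler choice.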
Real MPI FFTW

Real-to-Complex MPI FFTW Transforms

    rfftwnd_mpi_plan rfftw2d_mpi_create_plan(MPI_Comm comm, int nx, int ny, fftw_direction dir, int flags);
    rfftwnd_mpi_plan rfftw3d_mpi_create_plan(MPI_Comm comm, int nx, int ny, int nz, fftw_direction dir, int flags);
    rfftwnd_mpi_plan rfftwnd_mpi_create_plan(MPI_Comm comm, int dim, int *n, fftw_direction dir, int flags);
    void rfftwnd_mpi(rfftwnd_mpi_plan p, int n_fields, fftw_real *local_data, fftw_real *work, fftwnd_mpi_output_order output_order);
    void rfftwnd_mpi_local_sizes(rfftwnd_mpi_plan p, int *local_nx, int *local_x_start, int *local_ny_after_transpose, int *local_y_start_after_transpose, int *total_local_size);
    void rfftwnd_mpi_destroy_plan(rfftwnd_mpi_plan plan);

Argument restrictions:

• Supported values of flags are FFTW_ESTIMATE and FFTW_MEASURE. If any other value of flags is supplied, the wrapper library reports an error 'CDFT error in wrapper: unknown flags'.
• The only supported value of n_fields is 1.
• Function rfftwnd_mpi_create_plan can be used for both one-dimensional and multi-dimensional transforms.
• Both values of the output_order parameter are supported: FFTW_NORMAL_ORDER and FFTW_TRANSPOSED_ORDER.

Creating MPI FFTW Wrapper Library

The source code for the wrappers, makefile, and wrapper list file is located in the .\interfaces\fftw2x_cdft subdirectory in the Intel MKL directory.

A wrapper library contains C wrappers for Complex One-dimensional MPI FFTW Transforms and Complex Multi-dimensional MPI FFTW Transforms. The library also contains empty C wrappers for Real Multidimensional MPI FFTW Transforms. For details, see MPI FFTW Wrappers Reference.

The makefile parameters specify the platform (required), compiler, and data precision. The makefile comment heading provides the exact description of these parameters.
To build the library, run the make command on Linux* OS and Mac OS* X or the nmake command on Windows* OS with appropriate parameters.

For example, on Linux OS the command make libintel64 builds a double-precision wrapper library for Intel® 64 architecture based applications using Intel MPI 2.0 and the Intel® C++ Compiler version 9.1 or higher (the compiler and data precision are chosen by default).

The makefile creates the wrapper library in the directory with the Intel MKL libraries corresponding to the platform used, for example, ./lib/ia32 (on Linux OS) or .\lib\ia32 (on Windows* OS).

In the wrapper library names, the suffix corresponds to the data precision used. For example, fftw2x_cdft_SINGLE.lib on Windows OS; libfftw2x_cdft_DOUBLE.a on Linux OS.

Application Assembling with MPI FFTW Wrapper Library

Use the necessary original FFTW (www.fftw.org) header files without any modifications. Use the created MPI FFTW wrapper library and the Intel MKL library instead of the FFTW library.

Running Examples

Intel MKL provides examples that demonstrate how to use the MPI FFTW wrapper library for FFTW2. The C source code for the examples, makefiles used to run them, and the example list files are located in the .\examples\fftw2x_cdft subdirectory in the Intel MKL directory. To build examples, one additional file, fftw_mpi.h, is needed. This file is distributed with permission from FFTW and is available in .\include\fftw. The original file can also be found in FFTW 2.1.5 at http://www.fftw.org/download.html.

Parameters for the example makefiles are described in the makefile comment headings and are similar to the wrapper library makefile parameters (see Creating MPI FFTW Wrapper Library).

The table below lists examples available in the .\examples\fftw2x_cdft\source subdirectory.
Examples of MPI FFTW Wrappers

    Source file for the example    Description
    wrappers_c1d.c                 One-dimensional Complex MPI FFTW transform, using plan = fftw_mpi_create_plan(...)
    wrappers_c2d.c                 Two-dimensional Complex MPI FFTW transform, using plan = fftw2d_mpi_create_plan(...)
    wrappers_c3d.c                 Three-dimensional Complex MPI FFTW transform, using plan = fftw3d_mpi_create_plan(...)
    wrappers_c4d.c                 Four-dimensional Complex MPI FFTW transform, using plan = fftwnd_mpi_create_plan(...)
    wrappers_r1d.c                 One-dimensional Real MPI FFTW transform, using plan = rfftw_mpi_create_plan(...)
    wrappers_r2d.c                 Two-dimensional Real MPI FFTW transform, using plan = rfftw2d_mpi_create_plan(...)
    wrappers_r3d.c                 Three-dimensional Real MPI FFTW transform, using plan = rfftw3d_mpi_create_plan(...)
    wrappers_r4d.c                 Four-dimensional Real MPI FFTW transform, using plan = rfftwnd_mpi_create_plan(...)

FFTW3 Interface to Intel® Math Kernel Library

This section describes a collection of FFTW3 wrappers to Intel MKL. The wrappers translate calls of FFTW3 functions to the calls of the Intel MKL Fourier transform (FFT) or Trigonometric Transform (TT) functions. The purpose of FFTW3 wrappers is to enable developers whose programs currently use the FFTW3 library to gain performance with the Intel MKL Fourier transforms without changing the program source code.

The wrappers correspond to the FFTW release 3.2 and the Intel MKL releases starting with 10.2. For a detailed description of the FFTW interface, refer to www.fftw.org. For a detailed description of the Intel MKL FFT and TT functionality the wrappers use, see chapter 11 and section "Trigonometric Transform Routines" in chapter 13, respectively.

The FFTW3 wrappers provide limited functionality compared to the original FFTW 3.2 library because of differences between the FFTW and Intel MKL FFT and TT functionality. This section describes limitations of the FFTW3 wrappers and hints for their usage.
Nevertheless, many typical FFT tasks can be performed using the FFTW3 wrappers to Intel MKL. More functionality may be added to the wrappers and Intel MKL in the future to reduce the constraints of the FFTW3 interface to Intel MKL.

The FFTW3 wrappers are integrated in Intel MKL. The only change required to use Intel MKL through the FFTW3 wrappers is to link your application using FFTW3 against Intel MKL.

A reference implementation of the FFTW3 wrappers is also provided in open source. You can find it in the interfaces directory of the Intel MKL distribution. You can use the reference implementation to create your own wrapper library (see Building Your Own Wrapper Library).

Using FFTW3 Wrappers

The FFTW3 wrappers are a set of functions and data structures depending on one another. The wrappers are not designed to provide the interface on a function-per-function basis. Some FFTW3 wrapper functions are empty and do nothing, but they are present to avoid link errors and satisfy function calls.

This manual does not list the declarations of the functions that the FFTW3 wrappers provide (you can find the declarations in the fftw3.h header file). Instead, this section comments on particular limitations of the wrappers and provides usage hints:

• The FFTW3 wrappers do not support long double precision because Intel MKL FFT functions operate only on single- and double-precision floating-point data types (float and double, respectively). Therefore the functions with prefix fftwl_, supporting the long double data type, are not provided.
• The wrappers provide equivalent implementations for double- and single-precision functions (those with prefixes fftw_ and fftwf_, respectively). All comments in this section therefore apply equally to both; for brevity, they refer only to the functions with prefix fftw_, that is, the double-precision functions.
• The FFTW3 interface that the wrappers provide is defined in header files fftw3.h and fftw3.f. These files are borrowed from the FFTW 3.2 package and distributed within Intel MKL with permission. Additionally, files fftw3_mkl.h, fftw3_mkl.f, and fftw3_mkl_f77.h define supporting structures, supplementary constants and macros, and expose the Fortran interface in C.
• Actual functionality of the plan creation wrappers is implemented in the guru64 set of functions. Basic interface, advanced interface, and guru interface plan creation functions call the guru64 interface functions. Thus, all types of the FFTW3 plan creation interface in the wrappers are functional.
• Plan creation functions may return a NULL plan, indicating that the functionality is not supported. Always check the result returned by plan creation functions in your application. In particular, the following problems return a NULL plan:
  – c2r and r2c problems with a split storage of complex data.
  – r2r problems with kind values FFTW_R2HC, FFTW_HC2R, and FFTW_DHT. The only supported r2r kinds are even/odd DFTs (sine/cosine transforms).
  – Multidimensional r2r transforms.
  – Transforms of multidimensional vectors. That is, the only supported values for parameter howmany_rank in guru and guru64 plan creation functions are 0 and 1.
  – Multidimensional transforms with rank > MKL_MAXRANK.
• The MKL_RODFT00 value of the kind parameter is introduced by the FFTW3 wrappers. For better performance, you are strongly encouraged to use this value rather than FFTW_RODFT00. To use this kind value, provide an extra first element equal to 0.0 for the input/output vectors. Consider the following example:

    plan1 = fftw_plan_r2r_1d(n, in1, out1, FFTW_RODFT00, FFTW_ESTIMATE);
    plan2 = fftw_plan_r2r_1d(n, in2, out2, MKL_RODFT00, FFTW_ESTIMATE);

  Both plans perform the same transform, except that the in2/out2 arrays have one extra zero element at location 0.
  For example, if n=3, in1={x,y,z} and out1={u,v,w}, then in2={0,x,y,z} and out2={0,u,v,w}.
• The flags parameter in plan creation functions is always ignored. The same algorithm is used regardless of the value of this parameter. In particular, flags values FFTW_ESTIMATE, FFTW_MEASURE, etc. have no effect.
• For multithreaded plans, use the normal sequence of calls to the fftw_init_threads() and fftw_plan_with_nthreads() functions (refer to FFTW documentation).
• FFTW3 wrappers are not fully thread safe. If the new-array execute functions, such as fftw_execute_dft(), share the same plan from parallel user threads, set the number of the sharing threads before creation of the plan. For this purpose, the FFTW3 wrappers provide a header file fftw3_mkl.h, which defines a global structure fftw3_mkl with a field to be set to the number of sharing threads. Below is an example of setting the number of sharing threads:

    #include "fftw3.h"
    #include "fftw3_mkl.h"
    fftw3_mkl.number_of_user_threads = 4;
    plan = fftw_plan_dft(...);

• Memory allocation function fftw_malloc returns memory aligned at a 16-byte boundary. You must free the memory with fftw_free.
• The FFTW3 wrappers to Intel MKL use the 32-bit int type in both LP64 and ILP64 interfaces of Intel MKL. Use guru64 FFTW3 interfaces for 64-bit sizes.
• Fortran wrappers (see Calling Wrappers from Fortran) use the INTEGER type, which is 32-bit in LP64 interfaces and 64-bit in ILP64 interfaces.
• The wrappers typically indicate a problem by returning a NULL plan. In a few cases, the wrappers may report a descriptive message of the problem detected. By default the reporting is turned off.
  To turn it on, set variable fftw3_mkl.verbose to a non-zero value, for example:

    #include "fftw3.h"
    #include "fftw3_mkl.h"
    fftw3_mkl.verbose = 1;
    plan = fftw_plan_r2r(...);

• The following functions are empty:
  – For saving, loading, and printing plans
  – For saving and loading wisdom
  – For estimating arithmetic cost of the transforms.
• Do not use macro FFTW_DLL with the FFTW3 wrappers to Intel MKL.
• Do not use negative stride values. Though the FFTW3 wrappers accept negative strides in parts of the advanced and guru FFTW interfaces, the underlying implementation does not support them.

Calling Wrappers from Fortran

Intel MKL also provides Fortran 77 interfaces of the FFTW3 wrappers. The Fortran wrappers are available for all FFTW3 interface functions and are based on the C interface of the FFTW3 wrappers. Therefore they have the same functionality and restrictions as the corresponding C interface wrappers.

The Fortran wrappers use the default INTEGER type for integer arguments. The default INTEGER is 32-bit in Intel MKL LP64 interfaces and 64-bit in ILP64 interfaces. Argument plan in a Fortran application must have type INTEGER*8.

The wrappers that are double-precision subroutines have prefix dfftw_; single-precision subroutines have prefix sfftw_ and provide equivalent functionality. Long double subroutines (with prefix lfftw_) are not provided.

The Fortran FFTW3 wrappers use the default Intel® Fortran compiler convention for name decoration. If your compiler uses a different convention, or if you are using compiler options affecting the name decoration (such as /Qlowercase), you may need to compile the wrappers from sources, as described in section Building Your Own Wrapper Library.

For interoperability with C, the declaration of the Fortran FFTW3 interface is provided in header file include/fftw/fftw3_mkl_f77.h.

You can call Fortran wrappers from a FORTRAN 77 or Fortran 90 application, although Intel MKL does not provide a Fortran 90 module for the wrappers.
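The INTEGER*8 requirement for the plan argument can be pictured from the C side. The sketch below is only an illustration, under the assumption (consistent with the FFTW documentation for its legacy Fortran interface) that a plan is an opaque pointer-sized handle. The helper names plan_to_handle and handle_to_plan are hypothetical and are not part of Intel MKL or FFTW.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not an Intel MKL or FFTW function: store a C pointer
   in a 64-bit integer, the way a Fortran INTEGER*8 variable would hold it. */
int64_t plan_to_handle(void *plan)
{
    return (int64_t)(intptr_t)plan;
}

/* Hypothetical helper: recover the original pointer from the 64-bit handle. */
void *handle_to_plan(int64_t handle)
{
    return (void *)(intptr_t)handle;
}
```

A 64-bit integer is wide enough for a pointer on both IA-32 and Intel® 64 architectures, so the round trip loses nothing. On mainstream platforms a NULL pointer maps to the integer 0, which is why Fortran code can test a plan against 0 to detect a failed plan creation.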
For a detailed description of the FFTW Fortran interface, refer to FFTW3 documentation (www.fftw.org).

The following example illustrates calling the FFTW3 wrappers from Fortran:

      INTEGER*8 PLAN
      INTEGER N
      INCLUDE 'fftw3.f'
      COMPLEX*16 IN(*), OUT(*)
      !...initialize array IN
      CALL DFFTW_PLAN_DFT_1D(PLAN, N, IN, OUT, -1, FFTW_ESTIMATE)
      IF (PLAN .EQ. 0) STOP
      CALL DFFTW_EXECUTE(PLAN)
      !...result is in array OUT

Building Your Own Wrapper Library

The FFTW3 wrappers to Intel MKL are delivered both integrated in Intel MKL and as source code, which can be compiled to build a standalone wrapper library with exactly the same functionality. Normally you do not need to build the wrappers yourself. However, if your Fortran application is compiled with a compiler that uses a different name decoration than the Intel® Fortran compiler, or if you are using compiler options altering the Fortran name decoration, you may need to build wrappers that use the appropriate name decoration convention.

The source code for the wrappers, makefiles, and function list files are located in subdirectories .\interfaces\fftw3xc and .\interfaces\fftw3xf in the Intel MKL directory for C and Fortran wrappers, respectively.

To build the wrappers,
1. Change the current directory to the wrapper directory
2. Run the make command on Linux* OS and Mac OS* X or the nmake command on Windows* OS with a required target and optionally several parameters.

The target, that is, one of {libia32, libintel64}, defines the platform architecture, and the other parameters facilitate selection of the compiler, size of the default INTEGER type, and placement of the resulting wrapper library. You can find a detailed and up-to-date description of the parameters in the makefile.
In the following example, the make command is used to build the FFTW3 Fortran wrappers to MKL for use from the GNU g77 Fortran compiler on Linux OS based on Intel® 64 architecture:

    cd interfaces/fftw3xf
    make libintel64 compiler=gnu fname=a_name__ install_to=/my/path

This command builds the wrapper library using the GNU gcc compiler, decorates the name with the second underscore, and places the result, named libfftw3xf_gcc.a, into directory /my/path. The name of the resulting library includes the name of the compiler used and may be changed by an optional parameter.

Building an Application

Normally, the only change needed to build your application with FFTW3 wrappers replacing the original FFTW library is to add Intel MKL at the link stage (see section "Linking Your Application with Intel® Math Kernel Library" in the Intel MKL User's Guide).

If you recompile your application, add subdirectory include\fftw to the search path for header files to avoid FFTW3 version conflicts.

Sometimes, you may have to modify your application according to the following recommendations:

• The application requires #include "fftw3.h", which it probably already includes.
• The application does not require #include "mkl_dfti.h".
• The application does not require #include "fftw3_mkl.h". It is required only if you want to use the MKL_RODFT00 constant.
• If the application does not check whether a NULL plan is returned by plan creation functions, this check must be added, because the FFTW3 to Intel MKL wrappers do not provide 100% of FFTW3 functionality.
• If the application is threaded, take care with shared plans, because the execute functions in the wrappers are not thread safe, unlike the original FFTW3 functions. See the note about setting fftw3_mkl.number_of_user_threads in section "Using FFTW3 Wrappers".

Running Examples

There are some examples that demonstrate how to use the wrapper library.
The source code for the examples, makefiles used to run them, and the example list files are located in the .\examples\fftw3xc and .\examples\fftw3xf subdirectories in the Intel MKL directory. To build Fortran examples, one additional file fftw3.f is needed. This file is distributed with permission from FFTW and is available in the .\include\fftw subdirectory of the Intel MKL directory. The original file can also be found in FFTW 3.2 at http://www.fftw.org/download.html.

Example makefile parameters are similar to the wrapper library makefile parameters. Example makefiles normally build and invoke the examples. If the parameter function= is defined, then only the specified example will run. Otherwise, all examples will be executed. Results of running the examples are saved in subdirectory .\_results in files with extension .res.

For detailed information about options for the example makefile, refer to the makefile.

MPI FFTW Wrappers

This section describes a collection of MPI FFTW wrappers to Intel® MKL. The wrappers correspond to the FFTW 3.3 Alpha release and the Intel MKL releases starting with 10.3. For a detailed description of the MPI FFTW interface, refer to www.fftw.org.

MPI FFTW wrappers are available only with Intel MKL for the Linux* and Windows* operating systems.

These wrappers translate calls of MPI FFTW functions to the calls of the Intel MKL cluster Fourier transform (CFFT) functions. The purpose of the wrappers is to enable users of MPI FFTW functions to improve the performance of their applications without changing the program source code.

Although the MPI FFTW wrappers provide less functionality than the original FFTW 3.3 because of differences between MPI FFTW and Intel MKL CFFT, the wrappers cover many typical CFFT use cases.

The MPI FFTW wrappers are provided as source code. To use the wrappers, you need to build your own wrapper library (see Building Your Own Wrapper Library).
See Also
Cluster FFT Functions

Building Your Own Wrapper Library

The MPI FFTW wrappers for FFTW3 are delivered as source code, which can be compiled to build a wrapper library.

The source code for the wrappers, makefiles, and function list files are located in subdirectory .\interfaces\fftw3x_cdft in the Intel MKL directory.

To build the wrappers,
1. Change the current directory to the wrapper directory
2. Run the make command on Linux* OS or the nmake command on Windows* OS with a required target and optionally several parameters.

The target, that is, one of {libia32, libintel64}, defines the platform architecture, and the other parameters specify the compiler, size of the default INTEGER type, as well as the name and placement of the resulting wrapper library. You can find a detailed and up-to-date description of the parameters in the makefile.

In the following example, the make command is used to build the MPI FFTW wrappers to Intel MKL for use from the GNU C compiler on Linux OS based on Intel® 64 architecture:

    cd interfaces/fftw3x_cdft
    make libintel64 compiler=gnu mpi=openmpi INSTALL_DIR=/my/path

This command builds the wrapper library using the GNU gcc compiler so that the final user executable can use Open MPI, and places the result, named libfftw3x_cdft_DOUBLE.a, into directory /my/path.

Building an Application

Normally, the only change needed to build your application with MPI FFTW wrappers replacing the original FFTW3 library is to add Intel MKL and the wrapper library at the link stage (see section "Linking Your Application with Intel® Math Kernel Library" in the Intel MKL User's Guide).

When you are recompiling your application, add subdirectory include\fftw to the search path for header files to avoid FFTW3 version conflicts.

Running Examples

There are some examples that demonstrate how to use the MPI FFTW wrapper library for FFTW3.
The source code for the examples, makefiles used to run them, and the example list files are located in the .\examples\fftw3x_cdft subdirectory in the Intel MKL directory.

Example makefile parameters are similar to the wrapper library makefile parameters. Example makefiles normally build and invoke the examples. Results of running the examples are saved in subdirectory .\_results in files with extension .res.

For detailed information about options for the example makefile, refer to the makefile.

See Also
Building Your Own Wrapper Library

Bibliography

For more information about the BLAS, Sparse BLAS, LAPACK, ScaLAPACK, Sparse Solver, VML, VSL, FFT, and Non-Linear Optimization Solvers functionality, refer to the following publications:

• BLAS Level 1 C. Lawson, R. Hanson, D. Kincaid, and F. Krogh. Basic Linear Algebra Subprograms for Fortran Usage, ACM Transactions on Mathematical Software, Vol.5, No.3 (September 1979) 308-325. • BLAS Level 2 J. Dongarra, J. Du Croz, S. Hammarling, and R. Hanson. An Extended Set of Fortran Basic Linear Algebra Subprograms, ACM Transactions on Mathematical Software, Vol.14, No.1 (March 1988) 1-32. • BLAS Level 3 J. Dongarra, J. Du Croz, I. Duff, and S. Hammarling. A Set of Level 3 Basic Linear Algebra Subprograms, ACM Transactions on Mathematical Software (December 1989). • Sparse BLAS D. Dodson, R. Grimes, and J. Lewis. Sparse Extensions to the FORTRAN Basic Linear Algebra Subprograms, ACM Transactions on Mathematical Software, Vol.17, No.2 (June 1991). D. Dodson, R. Grimes, and J. Lewis. Algorithm 692: Model Implementation and Test Package for the Sparse Basic Linear Algebra Subprograms, ACM Transactions on Mathematical Software, Vol.17, No.2 (June 1991). [Duff86] I.S.Duff, A.M.Erisman, and J.K.Reid. Direct Methods for Sparse Matrices. Clarendon Press, Oxford, UK, 1986. [CXML01] Compaq Extended Math Library. Reference Guide, Oct.2001. [Rem05] K.Remington.
A NIST FORTRAN Sparse Blas User's Guide. (available on http://math.nist.gov/~KRemington/fspblas/) [Saad94] Y.Saad. SPARSKIT: A Basic Tool-kit for Sparse Matrix Computation. Version 2, 1994. (http://www.cs.umn.edu/~saad) [Saad96] Y.Saad. Iterative Methods for Linear Systems. PWS Publishing, Boston, 1996. • LAPACK [AndaPark94] A. A. Anda and H. Park. Fast plane rotations with dynamic scaling, SIAM J. Matrix Anal. Appl., Vol. 15 (1994), pp. 162-174. [Bischof92] http://citeseer.ist.psu.edu/bischof92framework.html [Demmel92] J. Demmel and K. Veselic. Jacobi's method is more accurate than QR, SIAM J. Matrix Anal. Appl. 13(1992):1204-1246. [deRijk98] P. P. M. De Rijk. A one-sided Jacobi algorithm for computing the singular value decomposition on a vector computer, SIAM J. Sci. Stat. Comp., Vol. 10 (1998), pp. 359-371. [Dhillon04] I. Dhillon, B. Parlett. Multiple representations to compute orthogonal eigenvectors of symmetric tridiagonal matrices, Linear Algebra and its Applications, 387(1), pp. 1-28, August 2004. [Dhillon04-02] I. Dhillon, B. Parlett. Orthogonal Eigenvectors and Relative Gaps, SIAM Journal on Matrix Analysis and Applications, Vol. 25, 2004. (Also LAPACK Working Note 154.) [Dhillon97] I. Dhillon. A new O(n^2) algorithm for the symmetric tridiagonal eigenvalue/eigenvector problem, Computer Science Division Technical Report No. UCB/CSD-97-971, UC Berkeley, May 1997. [Drmac08-1] Z. Drmac and K. Veselic. New fast and accurate Jacobi SVD algorithm I, SIAM J. Matrix Anal. Appl. Vol. 35, No. 2 (2008), pp. 1322-1342. LAPACK Working note 169. [Drmac08-2] Z. Drmac and K. Veselic. New fast and accurate Jacobi SVD algorithm II, SIAM J. Matrix Anal. Appl. Vol. 35, No. 2 (2008), pp. 1343-1362. LAPACK Working note 170. [Drmac08-3] Z. Drmac and K. Bujanovic. On the failure of rank-revealing QR factorization software - a case study, ACM Trans. Math. Softw. Vol. 35, No 2 (2008), pp. 1-28. LAPACK Working note 176. [Drmac08-4] Z. Drmac.
Implementation of Jacobi rotations for accurate singular value computation in floating point arithmetic, SIAM J. Sci. Comp., Vol. 18 (1997), pp. 1200-1222. [Golub96] G. Golub and C. Van Loan. Matrix Computations, Johns Hopkins University Press, Baltimore, third edition,1996. [LUG] E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen. LAPACK Users' Guide, Third Edition, Society for Industrial and Applied Mathematics (SIAM), 1999. [Kahan66] W. Kahan. Accurate Eigenvalues of a Symmetric Tridiagonal Matrix, Report CS41, Computer Science Dept., Stanford University, July 21, 1966. [Marques06] O.Marques, E.J.Riedy, and Ch.Voemel. Benefits of IEEE-754 Features in Modern Symmetric Tridiagonal Eigensolvers, SIAM Journal on Scientific Computing, Vol.28, No.5, 2006. (Tech report version in LAPACK Working Note 172 with the same title.) [Sutton09] Brian D. Sutton. Computing the complete CS decomposition, Numer. Algorithms, 50(1):33-65, 2009. • ScaLAPACK [SLUG] L. Blackford, J. Choi, A.Cleary, E. D'Azevedo, J. Demmel, I. Dhillon, J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K.Stanley, D. Walker, and R. Whaley. ScaLAPACK Users' Guide, Society for Industrial and Applied Mathematics (SIAM), 1997. • Sparse Solver [Duff99] I. S. Duff and J. Koster. The Design and Use of Algorithms for Permuting Large Entries to the Diagonal of Sparse Matrices. SIAM J. Matrix Analysis and Applications, 20(4):889-901, 1999. [Dong95] J. Dongarra, V.Eijkhout, A.Kalhan. Reverse Communication Interface for Linear Algebra Templates for Iterative Methods. UT-CS-95-291, May 1995. http:// www.netlib.org/lapack/lawnspdf/lawn99.pdf [Karypis98] G. Karypis and V. Kumar. A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs. SIAM Journal on Scientific Computing, 20(1): 359-392, 1998. [Li99] X.S. Li and J.W. Demmel. A Scalable Sparse Direct Solver Using Static Pivoting. 
In Proceedings of the 9th SIAM Conference on Parallel Processing for Scientific Computing, San Antonio, Texas, March 22-24, 1999. [Liu85] J.W.H. Liu. Modification of the Minimum-Degree algorithm by multiple elimination. ACM Transactions on Mathematical Software, 11(2):141-153, 1985. [Menon98] R. Menon and L. Dagum. OpenMP: An Industry-Standard API for Shared-Memory Programming. IEEE Computational Science & Engineering, 1:46-55, 1998. http://www.openmp.org. [Saad03] Y. Saad. Iterative Methods for Sparse Linear Systems. 2nd edition, SIAM, Philadelphia, PA, 2003. [Schenk00] O. Schenk. Scalable Parallel Sparse LU Factorization Methods on Shared Memory Multiprocessors. PhD thesis, ETH Zurich, 2000. [Schenk00-2] O. Schenk, K. Gartner, and W. Fichtner. Efficient Sparse LU Factorization with Left-right Looking Strategy on Shared Memory Multiprocessors. BIT, 40(1): 158-176, 2000. [Schenk01] O. Schenk and K. Gartner. Sparse Factorization with Two-Level Scheduling in PARDISO. In Proceedings of the 10th SIAM Conference on Parallel Processing for Scientific Computing, Portsmouth, Virginia, March 12-14, 2001. [Schenk02] O. Schenk and K. Gartner. Two-level scheduling in PARDISO: Improved Scalability on Shared Memory Multiprocessing Systems. Parallel Computing, 28:187-197, 2002. [Schenk03] O. Schenk and K. Gartner. Solving Unsymmetric Sparse Systems of Linear Equations with PARDISO. Journal of Future Generation Computer Systems, 20(3):475-487, 2004. [Schenk04] O. Schenk and K. Gartner. On Fast Factorization Pivoting Methods for Sparse Symmetric Indefinite Systems. Technical Report, Department of Computer Science, University of Basel, 2004, submitted. [Sonn89] P. Sonneveld. CGS, a Fast Lanczos-Type Solver for Nonsymmetric Linear Systems. SIAM Journal on Scientific and Statistical Computing, 10:36-52, 1989. [Young71] D.M.Young. Iterative Solution of Large Linear Systems. New York, Academic Press, Inc., 1971.
• VSL [Billor00] Nedret Billor, Ali S. Hadib, and Paul F. Velleman. BACON: blocked adaptive computationally efficient outlier nominators. Computational Statistics & Data Analysis, 34, 279-298, 2000. [Bratley87] Bratley P., Fox B.L., and Schrage L.E. A Guide to Simulation. 2nd edition. Springer-Verlag, New York, 1987. [Bratley88] Bratley P. and Fox B.L. Implementing Sobol's Quasirandom Sequence Generator, ACM Transactions on Mathematical Software, Vol. 14, No. 1, Pages 88-100, March 1988. [Bratley92] Bratley P., Fox B.L., and Niederreiter H. Implementation and Tests of Low- Discrepancy Sequences, ACM Transactions on Modeling and Computer Simulation, Vol. 2, No. 3, Pages 195-213, July 1992. [Coddington94] Coddington, P. D. Analysis of Random Number Generators Using Monte Carlo Simulation. Int. J. Mod. Phys. C-5, 547, 1994. [Gentle98] Gentle, James E. Random Number Generation and Monte Carlo Methods, Springer-Verlag New York, Inc., 1998. [L'Ecuyer94] L'Ecuyer, Pierre. Uniform Random Number Generation. Annals of Operations Research, 53, 77-120, 1994. [L'Ecuyer99] L'Ecuyer, Pierre. Tables of Linear Congruential Generators of Different Sizes and Good Lattice Structure. Mathematics of Computation, 68, 225, 249-260, 1999. [L'Ecuyer99a] L'Ecuyer, Pierre. Good Parameter Sets for Combined Multiple Recursive Random Number Generators. Operations Research, 47, 1, 159-164, 1999. [L'Ecuyer01] L'Ecuyer, Pierre. Software for Uniform Random Number Generation: Distinguishing the Good and the Bad. Proceedings of the 2001 Winter Simulation Conference, IEEE Press, 95-105, Dec. 2001. [Kirkpatrick81] Kirkpatrick, S., and Stoll, E. A Very Fast Shift-Register Sequence Random Number Generator. Journal of Computational Physics, V. 40. 517-526, 1981. [Knuth81] Knuth, Donald E. The Art of Computer Programming, Volume 2, Seminumerical Algorithms. 2nd edition, Addison-Wesley Publishing Company, Reading, Massachusetts, 1981. 
[Maronna02] Maronna, R.A., and Zamar, R.H., Robust Multivariate Estimates for High-Dimensional Datasets, Technometrics, 44, 307-317, 2002. [Matsumoto98] Matsumoto, M., and Nishimura, T. Mersenne Twister: A 623-Dimensionally Equidistributed Uniform Pseudo-Random Number Generator, ACM Transactions on Modeling and Computer Simulation, Vol. 8, No. 1, Pages 3-30, January 1998. [Matsumoto00] Matsumoto, M., and Nishimura, T. Dynamic Creation of Pseudorandom Number Generators, 56-69, in: Monte Carlo and Quasi-Monte Carlo Methods 1998, Ed. Niederreiter, H. and Spanier, J., Springer 2000, http://www.math.sci.hiroshima-u.ac.jp/%7Em-mat/MT/DC/dc.html. [NAG] NAG Numerical Libraries. http://www.nag.co.uk/numeric/numerical_libraries.asp [Rocke96] David M. Rocke, Robustness properties of S-estimators of multivariate location and shape in high dimension. The Annals of Statistics, 24(3), 1327-1345, 1996. [Saito08] Saito, M., and Matsumoto, M. SIMD-oriented Fast Mersenne Twister: a 128-bit Pseudorandom Number Generator. Monte Carlo and Quasi-Monte Carlo Methods 2006, Springer, Pages 607 – 622, 2008. http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/ARTICLES/earticles.html [Schafer97] Schafer, J.L., Analysis of Incomplete Multivariate Data. Chapman & Hall, 1997. [Sobol76] Sobol, I.M., and Levitan, Yu.L. The production of points uniformly distributed in a multidimensional cube. Preprint 40, Institute of Applied Mathematics, USSR Academy of Sciences, 1976 (In Russian). [VSL Notes] Intel® MKL Vector Statistical Library Notes, a document present on the Intel® MKL product at http://software.intel.com/en-us/articles/intel-math-kernel-library-documentation/ [VSL Data] Intel® MKL Vector Statistical Library Performance, a document present on the Intel® MKL product at http://software.intel.com/en-us/articles/intel-math-kernel-library-documentation/ • VML [C99] ISO/IEC 9899:1999/Cor 3:2007. Programming languages -- C. [Muller97] J.M.Muller.
Elementary functions: algorithms and implementation, Birkhauser Boston, 1997. [IEEE754] IEEE Standard for Binary Floating-Point Arithmetic. ANSI/IEEE Std 754-2008. [VML Data] Intel® MKL Vector Math Library Performance and Accuracy, a document present on the Intel® MKL product at http://software.intel.com/en-us/articles/intel-math-kernel-library-documentation/ • FFT [1] E. Oran Brigham, The Fast Fourier Transform and Its Applications, Prentice Hall, New Jersey, 1988. [2] Athanasios Papoulis, The Fourier Integral and its Applications, 2nd edition, McGraw-Hill, New York, 1984. [3] Ping Tak Peter Tang, DFTI - a new interface for Fast Fourier Transform libraries, ACM Transactions on Mathematical Software, Vol. 31, Issue 4, Pages 475 - 507, 2005. [4] Charles Van Loan, Computational Frameworks for the Fast Fourier Transform, SIAM, Philadelphia, 1992. • Optimization Solvers [Conn00] A. R. Conn, N. I.M. Gould, P. L. Toint. Trust-region Methods. SIAM Society for Industrial & Applied Mathematics, Englewood Cliffs, New Jersey, MPS-SIAM Series on Optimization edition, 2000. [Dong95] J. Dongarra, V. Eijkhout, A. Kalhan. Reverse communication interface for linear algebra templates for iterative methods. 1995. • Data Fitting Functions [deBoor2001] Carl deBoor. A Practical Guide to Splines. Revised Edition. Springer-Verlag New York Berlin Heidelberg, 2001. [StechSub76] S.B. Stechkin and Yu. Subbotin. Splines in Numerical Mathematics. Izd. Nauka, Moscow, 1976.

For a reference implementation of BLAS, sparse BLAS, LAPACK, and ScaLAPACK packages (without platform-specific optimizations) visit www.netlib.org

Glossary

A^H
Denotes the conjugate transpose of a general matrix A. See also conjugate matrix.

A^T
Denotes the transpose of a general matrix A. See also transpose.

band matrix
A general m-by-n matrix A such that a_ij = 0 for |i - j| > l, where 1 < l < min(m, n).
For example, any tridiagonal matrix is a band matrix.

band storage
A special storage scheme for band matrices. A matrix is stored in a two-dimensional array: columns of the matrix are stored in the corresponding columns of the array, and diagonals of the matrix are stored in rows of the array.

BLAS
Abbreviation for Basic Linear Algebra Subprograms. These subprograms implement vector, matrix-vector, and matrix-matrix operations.

BRNG
Abbreviation for Basic Random Number Generator. Basic random number generators are pseudorandom number generators imitating i.i.d. random number sequences of uniform distribution. Distributions other than uniform are generated by applying different transformation techniques to the sequences of random numbers of uniform distribution.

BRNG registration
Standardized mechanism that allows a user to include a user-designed BRNG into the VSL and use it along with the predefined VSL basic generators.

Bunch-Kaufman factorization
Representation of a real symmetric or complex Hermitian matrix A in the form A = PUDU^H P^T (or A = PLDL^H P^T) where P is a permutation matrix, U and L are upper and lower triangular matrices with unit diagonal, and D is a Hermitian block-diagonal matrix with 1-by-1 and 2-by-2 diagonal blocks. U and L have 2-by-2 unit diagonal blocks corresponding to the 2-by-2 blocks of D.

c
When found as the first letter of routine names, c indicates the usage of single-precision complex data type.

CBLAS
C interface to the BLAS. See BLAS.

CDF
Cumulative Distribution Function. The function that determines probability distribution for univariate or multivariate random variable X. For univariate distribution the cumulative distribution function is the function of real argument x, which for every x takes a value equal to probability of the event A: X ≤ x.
For multivariate distribution the cumulative distribution function is the function of a real vector x = (x_1, x_2, ..., x_n), which, for every x, takes a value equal to probability of the event A = (X_1 ≤ x_1 & X_2 ≤ x_2 & ... & X_n ≤ x_n).

Cholesky factorization
Representation of a symmetric positive-definite or, for complex data, Hermitian positive-definite matrix A in the form A = U^H U or A = LL^H, where L is a lower triangular matrix and U is an upper triangular matrix.

condition number
The number κ(A) defined for a given square matrix A as follows: κ(A) = ||A|| ||A^-1||.

conjugate matrix
The matrix A^H defined for a given general matrix A as follows: (A^H)_ij = (a_ji)*.

conjugate number
The conjugate of a complex number z = a + bi is z* = a - bi.

d
When found as the first letter of routine names, d indicates the usage of double-precision real data type.

dot product
The number denoted x · y and defined for given vectors x and y as follows: x · y = Σ_i x_i y_i. Here x_i and y_i stand for the i-th elements of x and y, respectively.

double precision
A floating-point data type. On Intel® processors, this data type allows you to store real numbers x such that 2.23*10^-308 < |x| < 1.79*10^308. For this data type, the machine precision ε is approximately 10^-15, which means that double-precision numbers usually contain no more than 15 significant decimal digits. For more information, refer to Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 1: Basic Architecture.

eigenvalue
See eigenvalue problem.

eigenvalue problem
A problem of finding non-zero vectors x and numbers λ (for a given square matrix A) such that Ax = λx. Here the numbers λ are called the eigenvalues of the matrix A and the vectors x are called the eigenvectors of the matrix A.

eigenvector
See eigenvalue problem.

elementary reflector (Householder matrix)
Matrix of a general form H = I - τvv^T, where v is a column vector and τ is a scalar.
In LAPACK elementary reflectors are used, for example, to represent the matrix Q in the QR factorization (the matrix Q is represented as a product of elementary reflectors).

factorization
Representation of a matrix as a product of matrices. See also Bunch-Kaufman factorization, Cholesky factorization, LU factorization, LQ factorization, QR factorization, Schur factorization.

FFTs
Abbreviation for Fast Fourier Transforms. See Chapter 11 of this book.

full storage
A storage scheme allowing you to store matrices of any kind. A matrix A is stored in a two-dimensional array a, with the matrix element a_ij stored in the array element a(i,j).

Hermitian matrix
A square matrix A that is equal to its conjugate matrix A^H. The conjugate A^H is defined as follows: (A^H)_ij = (a_ji)*.

I
See identity matrix.

identity matrix
A square matrix I whose diagonal elements are 1, and off-diagonal elements are 0. For any matrix A, AI = A and IA = A.

i.i.d.
Independent Identically Distributed.

in-place
Qualifier of an operation. A function that performs its operation in-place takes its input from an array and returns its output to the same array.

Intel MKL
Abbreviation for Intel® Math Kernel Library.

inverse matrix
The matrix denoted as A^-1 and defined for a given square matrix A as follows: AA^-1 = A^-1 A = I. A^-1 does not exist for singular matrices A.

LQ factorization
Representation of an m-by-n matrix A as A = LQ or A = (L 0)Q. Here Q is an n-by-n orthogonal (unitary) matrix. For m ≤ n, L is an m-by-m lower triangular matrix with real diagonal elements; for m > n, L consists of L1 stacked on top of L2, where L1 is an n-by-n lower triangular matrix, and L2 is a rectangular matrix.

LU factorization
Representation of a general m-by-n matrix A as A = PLU, where P is a permutation matrix, L is lower triangular with unit diagonal elements (lower trapezoidal if m > n) and U is upper triangular (upper trapezoidal if m < n).

machine precision
The number ε determining the precision of the machine representation of real numbers. For Intel® architecture, the machine precision is approximately 10^-7 for single-precision data, and approximately 10^-15 for double-precision data. The precision also determines the number of significant decimal digits in the machine representation of real numbers. See also double precision and single precision.

MPI
Message Passing Interface. This standard defines the user interface and functionality for a wide range of message-passing capabilities in parallel computing.

MPICH
A freely available, portable implementation of MPI standard for message-passing libraries.

orthogonal matrix
A real square matrix A whose transpose and inverse are equal, that is, A^T = A^-1, and therefore AA^T = A^T A = I. All eigenvalues of an orthogonal matrix have the absolute value 1.

packed storage
A storage scheme allowing you to store symmetric, Hermitian, or triangular matrices more compactly. The upper or lower triangle of a matrix is packed by columns in a one-dimensional array.

PDF
Probability Density Function. The function that determines probability distribution for univariate or multivariate continuous random variable X. The probability density function f(x) is closely related with the cumulative distribution function F(x). For univariate distribution the relation is F(x) = ∫_{-∞}^{x} f(t) dt. For multivariate distribution the relation is F(x_1, ..., x_n) = ∫_{-∞}^{x_1} ... ∫_{-∞}^{x_n} f(t_1, ..., t_n) dt_1 ... dt_n.

positive-definite matrix
A square matrix A such that Ax · x > 0 for any non-zero vector x. Here · denotes the dot product.

pseudorandom number generator
A completely deterministic algorithm that imitates truly random sequences.

QR factorization
Representation of an m-by-n matrix A as A = QR, where Q is an m-by-m orthogonal (unitary) matrix, and R is n-by-n upper triangular with real diagonal elements (if m ≥ n) or trapezoidal (if m < n) matrix.

random stream
An abstract source of independent identically distributed random numbers of uniform distribution.
In this manual a random stream points to a structure that uniquely defines a random number sequence generated by a basic generator associated with a given random stream. RNG Abbreviation for Random Number Generator. In this manual the term "random number generators" stands for pseudorandom number generators, that is, generators based on completely deterministic algorithms imitating truly random sequences. Glossary H 2711 Rectangular Full Packed (RFP) storage A storage scheme combining the full and packed storage schemes for the upper or lower triangle of the matrix. This combination enables using half of the full storage as packed storage while maintaining efficiency by using Level 3 BLAS/LAPACK kernels as the full storage. s When found as the first letter of routine names, s indicates the usage of single-precision real data type. ScaLAPACK Stands for Scalable Linear Algebra PACKage. Schur factorization Representation of a square matrix A in the form A = ZTZH. Here T is an upper quasi-triangular matrix (for complex A, triangular matrix) called the Schur form of A; the matrix Z is orthogonal (for complex A, unitary). Columns of Z are called Schur vectors. single precision A floating-point data type. On Intel® processors, this data type allows you to store real numbers x such that 1.18*10-38 < | x | < 3.40*1038. For this data type, the machine precision (e) is approximately 10-7, which means that single-precision numbers usually contain no more than 7 significant decimal digits. For more information, refer to Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 1: Basic Architecture. singular matrix A matrix whose determinant is zero. If A is a singular matrix, the inverse A-1 does not exist, and the system of equations Ax = b does not have a unique solution (that is, there exist no solutions or an infinite number of solutions). singular value The numbers defined for a given general matrix A as the eigenvalues of the matrix AAH. See also SVD. 
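The machine precision and single precision entries give orders of magnitude only. As a rough illustration, the sketch below measures the gap between 1 and the next larger representable number, which is one common definition of machine precision (other definitions differ by a factor of two). It assumes IEEE 754 arithmetic with round-to-nearest; the helper names `machine_epsilon` and `to_single` are hypothetical and belong to this sketch, not to Intel MKL.

```python
import struct


def to_single(x):
    """Round a Python float (double precision) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def machine_epsilon(round_fn=float):
    """Return the smallest power of two eps for which round_fn(1 + eps) > 1."""
    eps = 1.0
    # Keep halving while 1 + eps/2 is still distinguishable from 1 after rounding.
    while round_fn(1.0 + eps / 2.0) > 1.0:
        eps /= 2.0
    return eps


eps_double = machine_epsilon()           # Python floats are double precision
eps_single = machine_epsilon(to_single)  # emulate single precision by rounding

print(eps_single, eps_double)
```

Under these assumptions the loop stops at 2^-23 (about 1.19e-7) for single precision and 2^-52 (about 2.22e-16) for double precision.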
SMP
Abbreviation for Symmetric MultiProcessing. Intel MKL offers performance gains through parallelism provided by the SMP feature.

sparse BLAS
Routines performing basic vector operations on sparse vectors. Sparse BLAS routines take advantage of vectors' sparsity: they allow you to store only non-zero elements of vectors. See BLAS.

sparse vectors
Vectors in which most of the components are zeros.

storage scheme
The way of storing matrices. See full storage, packed storage, and band storage.

SVD
Abbreviation for Singular Value Decomposition. See also Singular value decomposition section in Chapter 5.

symmetric matrix
A square matrix A such that $a_{ij} = a_{ji}$.

transpose
The transpose of a given matrix A is a matrix $A^T$ such that $(A^T)_{ij} = a_{ji}$ (rows of A become columns of $A^T$, and columns of A become rows of $A^T$).

trapezoidal matrix
A matrix A such that $A = (A_1\ A_2)$, where $A_1$ is an upper triangular matrix and $A_2$ is a rectangular matrix.

triangular matrix
A matrix A is called an upper (lower) triangular matrix if all its subdiagonal elements (superdiagonal elements) are zeros. Thus, for an upper triangular matrix $a_{ij} = 0$ when i > j; for a lower triangular matrix $a_{ij} = 0$ when i < j.

tridiagonal matrix
A matrix whose non-zero elements are in three diagonals only: the leading diagonal, the first subdiagonal, and the first superdiagonal.

unitary matrix
A complex square matrix A whose conjugate transpose and inverse are equal, that is, $A^H = A^{-1}$, and therefore $AA^H = A^HA = I$. All eigenvalues of a unitary matrix have the absolute value 1.

VML
Abbreviation for Vector Mathematical Library. See Chapter 9 of this book.

VSL
Abbreviation for Vector Statistical Library. See Chapter 10 of this book.

z
When found as the first letter of routine names, z indicates the usage of double-precision complex data type.
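The sparse BLAS and sparse vectors entries describe storing only the non-zero elements of a vector. A minimal sketch of that idea follows: a compressed sparse vector is kept as a list of values plus a list of their positions. The function names `compress`, `scatter`, and `sparse_dot` are hypothetical helpers for this illustration and are not the Sparse BLAS routines themselves; indices here are zero-based, which is an assumption of the sketch.

```python
def compress(y):
    """Convert a full-storage vector into compressed form (values, indices)."""
    indx = [i for i, v in enumerate(y) if v != 0]
    x = [y[i] for i in indx]
    return x, indx


def scatter(x, indx, n):
    """Expand a compressed sparse vector back into a full-storage vector of length n."""
    y = [0.0] * n
    for v, i in zip(x, indx):
        y[i] = v
    return y


def sparse_dot(x, indx, y):
    """Dot product of a compressed sparse vector (x, indx) with a full-storage vector y."""
    return sum(v * y[i] for v, i in zip(x, indx))


full = [0.0, 3.0, 0.0, 0.0, -2.0, 0.0]
x, indx = compress(full)            # x = [3.0, -2.0], indx = [1, 4]
restored = scatter(x, indx, len(full))
d = sparse_dot(x, indx, [1.0, 2.0, 3.0, 4.0, 5.0, 6.0])  # 3*2 + (-2)*5 = -4.0
```

Only two values and two indices are stored for the six-element vector, and the dot product touches only those two positions.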
H Intel® Math Kernel Library Reference Manual 2712 Index ?_backward_trig_transform 2450 ?_commit_Helmholtz_2D 2467 ?_commit_Helmholtz_3D 2467 ?_commit_sph_np 2476 ?_commit_sph_p 2476 ?_commit_trig_transform 2446 ?_forward_trig_transform 2448 ?_Helmholtz_2D 2470 ?_Helmholtz_3D 2470 ?_init_Helmholtz_2D 2465 ?_init_Helmholtz_3D 2465 ?_init_sph_np 2475 ?_init_sph_p 2475 ?_init_trig_transform 2445 ?_sph_np 2478 ?_sph_p 2478 ?asum 54 ?axpby 327 ?axpy 55 ?axpyi 141 ?bdsdc 756 ?bdsqr 752 ?cabs1 73 ?ConvExec 2239 ?ConvExec1D 2242 ?ConvExecX 2246 ?ConvExecX1D 2249 ?ConvNewTask 2220 ?ConvNewTask1D 2223 ?ConvNewTaskX 2225 ?copy 56 ?CorrExec 2239 ?CorrExec1D 2242 ?CorrExecX 2246 ?CorrExecX1D 2249 ?CorrNewTask 2220 ?CorrNewTask1D 2223 ?CorrNewTaskX 2225 ?CorrNewTaskX1D 2228 ?dbtf2 1872 ?dbtrf 1873 ?disna 818 ?dot 58 ?dotc 60 ?dotci 144 ?doti 143 ?dotu 61 ?dotui 145 ?dtsvb 595 ?dttrf 1874 ?dttrfb 363 ?dttrsb 392 ?dttrsv 1875 ?gamn2d 2552 ?gamx2d 2551 ?gbbrd 739 ?gbcon 422 ?gbequ 542 ?gbequb 545 ?gbmv 75 ?gbrfs 458 ?gbrfsx 461 ?gbsv 574 ?gbsvx 576 ?gbsvxx 582 ?gbtf2 1166 ?gbtrf 359 ?gbtrs 387 ?gebak 849 ?gebal 847 ?gebd2 1167 ?gebr2d 2561 ?gebrd 736 ?gebs2d 2560 ?gecon 420 ?geequ 538 ?geequb 540 ?gees 1020 ?geesx 1024 ?geev 1028 ?geevx 1032 ?gehd2 1168 ?gehrd 835 ?gejsv 1045 ?gelq2 1170 ?gelqf 689 ?gels 930 ?gelsd 939 ?gelss 937 ?gelsy 933 ?gem2vc 331 ?gem2vu 329 ?gemm 119 ?gemm3m 333 ?gemv 77 ?geql2 1171 ?geqlf 700 ?geqp3 678 ?geqpf 676 ?geqr2 1172 ?geqr2p 1174 ?geqrf 671 ?geqrfp 674 ?ger 79 ?gerc 81 ?gerfs 449 ?gerfsx 452 ?gerq2 1175 ?gerqf 710 ?geru 82 ?gerv2d 2557 ?gesc2 1176 ?gesd2d 2556 ?gesdd 1041 ?gesv 558 ?gesvd 1037 ?gesvj 1051 ?gesvx 561 ?gesvxx 567 ?getc2 1177 ?getf2 1178 ?getrf 357 ?getri 514 ?getrs 385 ?ggbak 883 ?ggbal 880 ?gges 1121 ?ggesx 1126 ?ggev 1132 ?ggevx 1136 Index 2713 ?ggglm 946 ?gghrd 878 ?gglse 943 ?ggqrf 728 ?ggrqf 731 ?ggsvd 1055 ?ggsvp 910 ?gsum2d 2553 ?gsvj0 1432 ?gsvj1 1434 ?gtcon 424 ?gthr 146 ?gthrz 147 ?gtrfs 467 ?gtsv 589 ?gtsvx 591 ?gttrf 361 
?gttrs 389 ?gtts2 1179 ?hbev 993 ?hbevd 998 ?hbevx 1004 ?hbgst 829 ?hbgv 1105 ?hbgvd 1110 ?hbgvx 1117 ?hbtrd 791 ?hecon 438 ?heequb 556 ?heev 951 ?heevd 956 ?heevr 970 ?heevx 963 ?heft2 1419 ?hegst 822 ?hegv 1068 ?hegvd 1074 ?hegvx 1081 ?hemm 122 ?hemv 86 ?her 87 ?her2 89 ?her2k 126 ?herdb 766 ?herfs 494 ?herfsx 496 ?herk 124 ?hesv 642 ?hesvx 645 ?hesvxx 649 ?heswapr 1413 ?hetrd 772 ?hetrf 378 ?hetri 522 ?hetri2 525 ?hetri2x 529 ?hetrs 404 ?hetrs2 408 ?hfrk 1438 ?hgeqz 885 ?hpcon 441 ?hpev 977 ?hpevd 981 ?hpevx 988 ?hpgst 825 ?hpgv 1087 ?hpgvd 1092 ?hpgvx 1099 ?hpmv 91 ?hpr 92 ?hpr2 94 ?hprfs 504 ?hpsv 661 ?hpsvx 663 ?hptrd 784 ?hptrf 383 ?hptri 532 ?hptrs 411 ?hsein 855 ?hseqr 851 ?isnan 1180 ?jacobi 2515 ?jacobi_delete 2514 ?jacobi_init 2512 ?jacobi_solve 2513 ?jacobix 2516 ?la_gbamv 1455 ?la_gbrcond 1457 ?la_gbrcond_c 1459 ?la_gbrcond_x 1460 ?la_gbrfsx_extended 1462 ?la_gbrpvgrw 1467 ?la_geamv 1468 ?la_gercond 1470 ?la_gercond_c 1471 ?la_gercond_x 1472 ?la_gerfsx_extended 1473 ?la_heamv 1478 ?la_hercond_c 1480 ?la_hercond_x 1481 ?la_herfsx_extended 1482 ?la_herpvgrw 1487 ?la_porcond 1489 ?la_porcond_c 1490 ?la_porcond_x 1492 ?la_porfsx_extended 1493 ?la_porpvgrw 1498 ?la_rpvgrw 1503 ?la_syamv 1431, 1505 ?la_syrcond 1507 ?la_syrcond_c 1508 ?la_syrcond_x 1509 ?la_syrfsx_extended 1511 ?la_syrpvgrw 1516 ?la_wwaddw 1517 ?labrd 1181 ?lacgv 1155 ?lacn2 1184 ?lacon 1185 ?lacp2 1455 ?lacpy 1186 ?lacrm 1156 ?lacrt 1156 ?ladiv 1187 ?lae2 1188 ?laebz 1189 ?laed0 1192 ?laed1 1194 ?laed2 1195 ?laed3 1197 ?laed4 1199 ?laed5 1200 ?laed6 1200 ?laed7 1202 ?laed8 1204 ?laed9 1207 ?laeda 1208 ?laein 1209 ?laesy 1157 ?laev2 1212 Intel® Math Kernel Library Reference Manual 2714 ?laexc 1213 ?lag2 1214 ?lags2 1216 ?lagtf 1218 ?lagtm 1220 ?lagts 1221 ?lagv2 1223 ?lahef 1378 ?lahqr 1224 ?lahr2 1228 ?lahrd 1226 ?laic1 1230 ?laisnan 1181 ?laln2 1232 ?lals0 1234 ?lalsa 1236 ?lalsd 1239 ?lamc1 1526 ?lamc2 1526 ?lamc3 1527 ?lamc4 1528 ?lamc5 1528 ?lamch 1525 ?lamrg 1241 ?lamsh 1866 ?laneg 
1242 ?langb 1243 ?lange 1244 ?langt 1245 ?lanhb 1248 ?lanhe 1253 ?lanhf 1443 ?lanhp 1250 ?lanhs 1246 ?lansb 1247 ?lansf 1442 ?lansp 1249 ?lanst/?lanht 1251 ?lansy 1252 ?lantb 1255 ?lantp 1256 ?lantr 1257 ?lanv2 1259 ?lapll 1259 ?lapmr 1260 ?lapmt 1262 ?lapy2 1262 ?lapy3 1263 ?laqgb 1264 ?laqge 1265 ?laqhb 1266 ?laqhe 1499 ?laqhp 1501 ?laqp2 1268 ?laqps 1269 ?laqr0 1270 ?laqr1 1273 ?laqr2 1274 ?laqr3 1277 ?laqr4 1280 ?laqr5 1282 ?laqsb 1285 ?laqsp 1286 ?laqsy 1287 ?laqtr 1289 ?lar1v 1290 ?lar2v 1293 ?larcm 1502 ?laref 1867 ?larf 1294 ?larfb 1295 ?larfg 1298 ?larfgp 1299 ?larfp 1429 ?larft 1300 ?larfx 1302 ?largv 1304 ?larnv 1305 ?larra 1306 ?larrb 1307 ?larrc 1309 ?larrd 1310 ?larre 1312 ?larrf 1315 ?larrj 1317 ?larrk 1318 ?larrr 1319 ?larrv 1320 ?larscl2 1504 ?lartg 1323 ?lartgp 1324 ?lartgs 1326 ?lartv 1327 ?laruv 1328 ?larz 1329 ?larzb 1330 ?larzt 1332 ?las2 1334 ?lascl 1335 ?lascl2 1504 ?lasd0 1336 ?lasd1 1338 ?lasd2 1340 ?lasd3 1342 ?lasd4 1344 ?lasd5 1346 ?lasd6 1347 ?lasd7 1350 ?lasd8 1353 ?lasd9 1354 ?lasda 1356 ?lasdq 1358 ?lasdt 1360 ?laset 1361 ?lasorte 1868 ?lasq1 1362 ?lasq2 1363 ?lasq3 1364 ?lasq4 1365 ?lasq5 1366 ?lasq6 1367 ?lasr 1368 ?lasrt 1371 ?lasrt2 1869 ?lassq 1372 ?lasv2 1373 ?laswp 1374 ?lasy2 1375 ?lasyf 1377 ?latbs 1380 ?latdf 1382 ?latps 1383 ?latrd 1385 ?latrs 1387 ?latrz 1390 ?lauu2 1392 ?lauum 1393 ?nrm2 62 ?opgtr 781 ?opmtr 782 Index 2715 ?orbdb/?unbdb 925 ?orcsd/?uncsd 1060 ?org2l/?ung2l 1394 ?org2r/?ung2r 1395 ?orgbr 742 ?orghr 837 ?orgl2/?ungl2 1396 ?orglq 692 ?orgql 702 ?orgqr 681 ?orgr2/?ungr2 1397 ?orgrq 712 ?orgtr 768 ?orm2l/?unm2l 1399 ?orm2r/?unm2r 1400 ?ormbr 744 ?ormhr 839 ?orml2/?unml2 1402 ?ormlq 694 ?ormql 706 ?ormqr 683 ?ormr2/?unmr2 1404 ?ormr3/?unmr3 1405 ?ormrq 716 ?ormrz 723 ?ormtr 770 ?pbcon 430 ?pbequ 552 ?pbrfs 480 ?pbstf 831 ?pbsv 617 ?pbsvx 619 ?pbtf2 1407 ?pbtrf 371 ?pbtrs 398 ?pftrf 368 ?pftri 517 ?pftrs 395 ?pocon 426 ?poequ 547 ?poequb 549 ?porfs 469 ?porfsx 472 ?posv 596 ?posvx 599 ?posvxx 604 ?potf2 1408 
?potrf 364 ?potri 516 ?potrs 393 ?ppcon 428 ?ppequ 550 ?pprfs 478 ?ppsv 611 ?ppsvx 612 ?pptrf 369 ?pptri 519 ?pptrs 396 ?pstf2 1451 ?pstrf 366 ?ptcon 432 ?pteqr 810 ?ptrfs 483 ?ptsv 623 ?ptsvx 625 ?pttrf 373 ?pttrs 400 ?pttrsv 1876 ?ptts2 1409 ?rot 63, 1158 ?rotg 64 ?roti 148 ?rotm 65 ?rotmg 67 ?rscl 1411 ?sbev 991 ?sbevd 995 ?sbevx 1001 ?sbgst 827 ?sbgv 1103 ?sbgvd 1107 ?sbgvx 1113 ?sbmv 95 ?sbtrd 789 ?scal 69 ?sctr 149 ?sdot 59 ?sfrk 1437 ?spcon 439 ?spev 975 ?spevd 979 ?spevx 985 ?spgst 823 ?spgv 1085 ?spgvd 1089 ?spgvx 1096 ?spmv 98, 1159 ?spr 99, 1161 ?spr2 101 ?sprfs 501 ?spsv 655 ?spsvx 657 ?sptrd 779 ?sptrf 381 ?sptri 530 ?sptrs 409 ?stebz 813 ?stedc 801 ?stegr 805 ?stein 815 ?stemr 798 ?steqr 795 ?steqr2 1878 ?sterf 793 ?stev 1008 ?stevd 1009 ?stevr 1015 ?stevx 1012 ?sum1 1165 ?swap 70 ?sycon 434 ?syconv 436 ?syequb 554 ?syev 949 ?syevd 954 ?syevr 966 ?syevx 959 ?sygs2/?hegs2 1415 ?sygst 820 ?sygv 1066 ?sygvd 1071 ?sygvx 1077 ?symm 128 ?symv 102, 1162 ?syr 104, 1163 ?syr2 106 ?syr2k 133 ?syrdb 764 ?syrfs 485 ?syrfsx 488 Intel® Math Kernel Library Reference Manual 2716 ?syrk 131 ?sysv 629 ?sysvx 631 ?sysvxx 635 ?syswapr 1411 ?syswapr1 1414 ?sytd2/?hetd2 1417 ?sytf2 1418 ?sytrd 762 ?sytrf 374 ?sytri 520 ?sytri2 523 ?sytri2x 527 ?sytrs 402 ?sytrs2 406 ?tbcon 447 ?tbmv 107 ?tbsv 109 ?tbtrs 418 ?tfsm 1440 ?tftri 535 ?tfttp 1444 ?tfttr 1445 ?tgevc 890 ?tgex2 1421 ?tgexc 894 ?tgsen 896 ?tgsja 914 ?tgsna 906 ?tgsy2 1423 ?tgsyl 902 ?tpcon 445 ?tpmv 112 ?tprfs 508 ?tpsv 113 ?tptri 536 ?tptrs 416 ?tpttf 1446 ?tpttr 1448 ?trbr2d 2562 ?trbs2d 2560 ?trcon 443 ?trevc 860 ?trexc 868 ?trmm 135 ?trmv 115 ?trnlsp_check 2499 ?trnlsp_delete 2503 ?trnlsp_get 2502 ?trnlsp_init 2497 ?trnlsp_solve 2500 ?trnlspbc_check 2506 ?trnlspbc_delete 2511 ?trnlspbc_get 2510 ?trnlspbc_init 2505 ?trnlspbc_solve 2508 ?trrfs 506 ?trrv2d 2558 ?trsd2d 2557 ?trsen 870 ?trsm 138 ?trsna 864 ?trsv 117 ?trsyl 874 ?trti2 1426 ?trtri (LAPACK) 534 ?trtrs (LAPACK) 413 ?trttf 1449 ?trttp 1450 ?tzrzf 720 
?ungbr 747 ?unghr 842 ?unglq 696 ?ungql 704 ?ungqr 685 ?ungrq 714 ?ungtr 775 ?unmbr 749 ?unmhr 844 ?unmlq 698 ?unmql 708 ?unmqr 687 ?unmrq 718 ?unmrz 725 ?unmtr 776 ?upgtr 786 ?upmtr 787 1-norm value complex Hermitian matrix packed storage 1250 complex Hermitian matrix in RFP format 1443 complex Hermitian tridiagonal matrix 1251 complex symmetric matrix 1252 general rectangular matrix 1244, 1779 general tridiagonal matrix 1245 Hermitian band matrix 1248 real symmetric matrix 1252, 1782 real symmetric matrix in RFP format 1442 real symmetric tridiagonal matrix 1251 symmetric band matrix 1247 symmetric matrix packed storage 1249 trapezoidal matrix 1257 triangular band matrix 1255 triangular matrix packed storage 1256 upper Hessenberg matrix 1246, 1780 A absolute value of a vector element largest 71 smallest 72 accuracy modes, in VML 1969 adding magnitudes of elements of a distributed vector 2377 adding magnitudes of the vector elements 54 arguments matrix 2646 sparse vector 140 vector 2645 array descriptor 1535, 2373 auxiliary functions ?la_lin_berr 1488 auxiliary routines LAPACK ScaLAPACK 1739 B backward error 1488 balancing a matrix 847 band storage scheme 2646 basic quasi-number generator Niederreiter 2121 Sobol 2121 Index 2717 basic random number generators GFSR 2121 MCG, 32-bit 2121 MCG, 59-bit 2121 Mersenne Twister MT19937 2121 MT2203 2121 MRG 2121 Wichmann-Hill 2121 bdsdc 756 Bernoulli 2195 Beta 2186 bidiagonal matrix LAPACK 734 ScaLAPACK 1666 Binomial 2198 bisection 1307 BLACS broadcast 2559 combines 2550 destruction routines 2568 informational routines 2570 initialization routines 2562 miscellaneous routines 2571 point to point communication 2554 ?gamn2d 2552 ?gamx2d 2551 ?gebr2d 2561 ?gebs2d 2560 ?gerv2d 2557 ?gesd2d 2556 ?gsum2d 2553 ?trbr2d 2562 ?trbs2d 2560 ?trrv2d 2558 ?trsd2d 2557 blacs_abort 2569 blacs_barrier 2571 blacs_exit 2569 blacs_freebuff 2568 blacs_get 2564 blacs_gridexit 2569 blacs_gridinfo 2570 blacs_gridinit 2566 blacs_gridmap 2567 
blacs_pcoord 2571 blacs_pinfo 2563 blacs_pnum 2570 blacs_set 2565 blacs_setup 2563 usage examples 2572 BLACS routines matrix shapes 2549 blacs_abort 2569 blacs_barrier 2571 blacs_exit 2569 blacs_freebuff 2568 blacs_get 2564 blacs_gridexit 2569 blacs_gridinfo 2570 blacs_gridinit 2566 blacs_gridmap 2567 blacs_pcoord 2571 blacs_pinfo 2563 blacs_pnum 2570 blacs_set 2565 blacs_setup 2563 BLAS Code Examples 2653 BLAS Level 1 routines ?asum 53, 54 ?axpby 327 ?axpy 53, 55 ?cabs1 53, 73 ?copy 53, 56 ?dot 53, 58 ?dotc 53, 60 ?dotu 53, 61 ?nrm2 53, 62 ?rot 53, 63 ?rotg 53, 64 ?rotm 53, 65 ?rotmg 67 ?rotmq 53 ?scal 53, 69 ?sdot 53, 59 ?swap 53, 70 code example 2653 i?amax 53, 71 i?amin 53, 72 BLAS Level 2 routines ?gbmv 74, 75 ?gem2vc 331 ?gem2vu 329 ?gemv 74, 77 ?ger 74, 79 ?gerc 74, 81 ?geru 74, 82 ?hbmv 74, 84 ?hemv 74, 86 ?her 74, 87 ?her2 74, 89 ?hpmv 74, 91 ?hpr 74, 92 ?hpr2 74, 94 ?sbmv 74, 95 ?spmv 74, 98 ?spr 74, 99 ?spr2 74, 101 ?symv 74, 102 ?syr 74, 104 ?syr2 74, 106 ?tbmv 74, 107 ?tbsv 74, 109 ?tpmv 74, 112 ?tpsv 74, 113 ?trmv 74, 115 ?trsv 74, 117 code example 2654 BLAS Level 3 routines ?gemm 118, 119 ?gemm3m 333 ?hemm 118, 122 ?her2k 118, 126 ?herk 118, 124 ?symm 118, 128 ?syr2k 118, 133 ?syrk 118, 131 ?tfsm 1440 ?trmm 118, 135 ?trsm 118, 138 code example 2654 BLAS routines routine groups BLAS-like extensions 327 BLAS-like transposition routines mkl_?imatcopy 335 mkl_?omatadd 344 mkl_?omatcopy 338 mkl_?omatcopy2 341 block reflector Intel® Math Kernel Library Reference Manual 2718 general matrix LAPACK 1330 ScaLAPACK 1807 general rectangular matrix LAPACK 1295 ScaLAPACK 1795 triangular factor LAPACK 1300, 1332 ScaLAPACK 1802, 1813 block-cyclic distribution 1535, 2373 block-splitting method 2121 BRNG 2115, 2116, 2121 Bunch-Kaufman factorization Hermitian matrix packed storage 383 symmetric matrix packed storage 381 C C Datatypes 49 C interface conventions LAPACK 348 Cauchy 2173 cbbcsd 920 CBLAS arguments 2669 level 1 (vector operations) 2670 level 2 (matrix-vector 
operations) 2672 level 3 (matrix-matrix operations) 2676 sparse BLAS 2678 CBLAS to the BLAS 2669 cgbcon 422 cgbrfsx 461 cgbsvx 576 cgbtrs 387 cgecon 420 cgeqpf 676 cgtrfs 467 chegs2 1415 cheswapr 1413 chetd2 1417 chetri2 525 chetri2x 529 chetrs2 408 chgeqz 885 chla_transtype 1529 Cholesky factorization Hermitian positive semi-definite matrix 1451 Hermitian positive semidefinite matrix 366 Hermitian positive-definite matrix band storage 371, 398, 619, 1546, 1558 packed storage 369, 612 split 831 symmetric positive semi-definite matrix 1451 symmetric positive semidefinite matrix 366 symmetric positive-definite matrix band storage 371, 398, 619, 1546, 1558 packed storage 369, 612 chseqr 851 cla_gbamv 1455 cla_gbrcond_c 1459 cla_gbrcond_x 1460 cla_gbrfsx_extended 1462 cla_gbrpvgrw 1467 cla_geamv 1468 cla_gercond_c 1471 cla_gercond_x 1472 cla_gerfsx_extended 1473 cla_heamv 1478 cla_hercond_c 1480 cla_hercond_x 1481 cla_herfsx_extended 1482 cla_herpvgrw 1487 cla_lin_berr 1488 cla_porcond_c 1490 cla_porcond_x 1492 cla_porfsx_extended 1493 cla_porpvgrw 1498 cla_rpvgrw 1503 cla_syamv 1505 cla_syrcond_c 1508 cla_syrcond_x 1509 cla_syrfsx_extended 1511 cla_syrpvgrw 1516 cla_wwaddw 1517 clag2z 1427 clapmr 1260 clapmt 1262 clarfb 1295 clarft 1300 clarscl2 1504 clascl2 1504 clatps 1383 clatrd 1385 clatrs 1387 clatrz 1390 clauu2 1392 clauum 1393 code examples BLAS Level 1 function 2653 BLAS Level 1 routine 2653 BLAS Level 2 routine 2654 BLAS Level 3 routine 2654 communication subprograms complex division in real arithmetic 1187 complex Hermitian matrix 1-norm value LAPACK 1253 ScaLAPACK 1782 factorization with diagonal pivoting method 1419 Frobenius norm LAPACK 1253 ScaLAPACK 1782 infinity- norm LAPACK 1253 ScaLAPACK 1782 largest absolute value of element LAPACK 1253 ScaLAPACK 1782 complex Hermitian matrix in packed form 1-norm value 1250 Frobenius norm 1250 infinity- norm 1250 largest absolute value of element 1250 complex Hermitian tridiagonal matrix 1-norm value 1251 Frobenius 
norm 1251 infinity- norm 1251 largest absolute value of element 1251 complex matrix complex elementary reflector ScaLAPACK 1809 complex symmetric matrix 1-norm value 1252 Frobenius norm 1252 infinity- norm 1252 largest absolute value of element 1252 complex vector 1-norm using true absolute value Index 2719 LAPACK 1165 ScaLAPACK 1745 conjugation LAPACK 1155 ScaLAPACK 1743 complex vector conjugation LAPACK 1155 ScaLAPACK 1743 component-wise relative error 1488 compressed sparse vectors 140 computational node 2117 Computational Routines 669 condition number band matrix 422 general matrix LAPACK 420 ScaLAPACK 1564, 1566, 1568 Hermitian matrix packed storage 441 Hermitian positive-definite matrix band storage 430 packed storage 428 tridiagonal 432 symmetric matrix packed storage 439 symmetric positive-definite matrix band storage 430 packed storage 428 tridiagonal 432 triangular matrix band storage 447 packed storage 445 tridiagonal matrix 424 configuration parameters, in FFT interface 2313 Configuration Settings, for Fourier transform functions 2332 Continuous Distribution Generators 2153 Continuous Distributions 2156 ConvCopyTask 2254 ConvDeleteTask 2253 converting a DOUBLE COMPLEX triangular matrix to COMPLEX 1454 converting a double-precision triangular matrix to singleprecision 1453 converting a sparse vector into compressed storage form and writing zeros to the original vector 147 converting compressed sparse vectors into full storage form 149 ConvInternalPrecision 2234 Convolution and Correlation 2214 Convolution Functions ?ConvExec 2239 ?ConvExec1D 2242 ?ConvExecX 2246 ?ConvExecX1D 2249 ?ConvNewTask 2220 ?ConvNewTask1D 2223 ?ConvNewTaskX 2225 ?ConvNewTaskX1D 2228 ConvCopyTask 2254 ConvDeleteTask 2253 ConvSetDecimation 2237 ConvSetInternalPrecision 2234 ConvSetMode 2232 ConvSetStart 2235 CorrCopyTask 2254 CorrDeleteTask 2253 ConvSetMode 2232 ConvSetStart 2235 copying distributed vectors 2379 matrices distributed 1770 global parallel 1772 local replicated 1772 
two-dimensional LAPACK 1186, 1455 ScaLAPACK 1773 vectors 56 copying a matrix 1444–1446, 1448–1450 CopyStream 2138 CopyStreamState 2139 CorrCopyTask 2254 CorrDeleteTask 2253 Correlation Functions ?CorrExec 2239 ?CorrExec1D 2242 ?CorrExecX 2246 ?CorrExecX1D 2249 ?CorrNewTask 2220 ?CorrNewTask1D 2223 ?CorrNewTaskX 2225 ?CorrNewTaskX1D 2228 CorrSetDecimation 2237 CorrSetInternalPrecision 2234 CorrSetMode 2232 CorrSetStart 2235 CorrSetInternalDecimation 2237 CorrSetInternalPrecision 2234 CorrSetMode 2232 CorrSetStart 2235 cosine-sine decomposition LAPACK 919, 1060 cpbtf2 1407 cporfsx 472 cpotf2 1408 cpprfs 478 cpptrs 396 cptts2 1409 Cray 1879 crscl 1411 cs decomposition See also LAPACK routines, cs decomposition 919 CSD (cosine-sine decomposition) LAPACK 919, 1060 csyconv 436 csyswapr 1411 csyswapr1 1414 csytf2 1418 csytri2 523 csytri2x 527 csytrs2 406 ctgex2 1421 ctgsy2 1423 ctrexc 868 ctrti2 1426 cunbdb 925 cuncsd 1060 cung2l 1394 cung2r 1395 cungbr 747 cungl2 1396 cungr2 1397 cunm2l 1399 cunm2r 1400 cunml2 1402 cunmr2 1404 cunmr3 1405 Intel® Math Kernel Library Reference Manual 2720 D data type in VML 1969 shorthand 41 Data Types 2124 Datatypes, C language 49 dbbcsd 920 dbdsdc 756 dcg_check 1946 dcg_get 1948 dcg_init 1945 dcgmrhs_check 1949 dcgmrhs_get 1952 dcgmrhs_init 1948 DeleteStream 2137 descriptor configuration cluster FFT 2356 descriptor manipulation cluster FFT 2356 DF task dfdconstruct1d 2606 dfdConstruct1D 2606 dfdeditidxptr 2604 dfdEditIdxPtr 2604 dfdeditppspline1d 2595 dfdEditPPSpline1D 2595 dfdeditptr 2601 dfdEditPtr 2601 dfdeletetask 2627 dfDeleteTask 2627 dfdintegrate1d 2613 dfdIntegrate1D 2613 dfdintegrateex1d 2613 dfdIntegrateEx1D 2613 dfdintegrcallback 2623 dfdIntegrCallBack 2623 dfdinterpcallback 2621 dfdInterpCallBack 2621 dfdinterpolate1d 2607 dfdInterpolate1D 2607 dfdinterpolateex1d 2607 dfdInterpolateEx1D 2607 dfdnewtask1d 2592 dfdNewTask1D 2592 dfdsearchcells1d 2619 dfdSearchCells1D 2619 dfdsearchcellscallback 2625 dfdSearchCellsCallBack 2625 
dfdsearchcellsex1d 2619 dfdSearchCellsEx1D 2619 dfgmres_check 1953 dfgmres_get 1956 dfgmres_init 1952 dfieditptr 2601 dfiEditPtr 2601 dfieditval 2602 dfiEditVal 2602 dfsconstruct1d 2606 dfsConstruct1D 2606 dfseditidxptr 2604 dfsEditIdxPtr 2604 dfseditppspline1d 2595 dfsEditPPSpline1D 2595 dfseditptr 2601 dfsEditPtr 2601 dfsintegrate1d 2613 dfsIntegrate1D 2613 dfsintegrateex1d 2613 dfsIntegrateEx1D 2613 dfsintegrcallback 2623 dfsIntegrCallBack 2623 dfsinterpcallback 2621 dfsInterpCallBack 2621 dfsinterpolate1d 2607 dfsInterpolate1D 2607 dfsinterpolateex1d 2607 dfsInterpolateEx1D 2607 dfsnewtask1d 2592 dfsNewTask1D 2592 dfssearchcells1d 2619 dfsSearchCells1D 2619 dfssearchcellscallback 2625 dfsSearchCellsCallBack 2625 dfssearchcellsex1d 2619 dfsSearchCellsEx1D 2619 DFT routines descriptor configuration DftiSetValue 2325 DftiCommitDescriptor 2316 DftiCommitDescriptorDM 2358 DftiComputeBackward 2322 DftiComputeBackwardDM 2362 DftiComputeForward 2320 DftiComputeForwardDM 2360 DftiCopyDescriptor 2318 DftiCreateDescriptor 2314 DftiCreateDescriptorDM 2357 DftiErrorClass 2329 DftiErrorMessage 2331 DftiFreeDescriptor 2317 DftiFreeDescriptorDM 2359 DftiGetValue 2327 DftiGetValueDM 2367 DftiSetValue 2325 DftiSetValueDM 2365 dgbcon 422 dgbrfsx 461 dgbsvx 576 dgbtrs 387 dgecon 420 dgejsv 1045 dgeqpf 676 dgesvj 1051 dgtrfs 467 dhgeqz 885 dhseqr 851 diagonal elements LAPACK 1361 ScaLAPACK 1817 diagonal pivoting factorization Hermitian indefinite matrix 649 symmetric indefinite matrix 635 diagonally dominant tridiagonal matrix solving systems of linear equations 392 diagonally dominant-like banded matrix solving systems of linear equations 1553 diagonally dominant-like tridiagonal matrix solving systems of linear equations 1555 dimension 2645 Direct Sparse Solver (DSS) Interface Routines 1914 Discrete Distribution Generators 2153, 2154 Discrete Distributions 2189 Discrete Fourier Transform DftiSetValue 2325 distributed complex matrix transposition 2433, 2434 distributed general 
matrix matrix-vector product 2387, 2389 rank-1 update 2391 rank-1 update, unconjugated 2394 Index 2721 rank-l update, conjugated 2393 distributed Hermitian matrix matrix-vector product 2396, 2397 rank-1 update 2399 rank-2 update 2400 rank-k update 2422 distributed matrix equation AX = B 2437 distributed matrix-matrix operation rank-k update distributed Hermitian matrix 2422 transposition complex matrix 2433 complex matrix, conjugated 2434 real matrix 2432 distributed matrix-vector operation product Hermitian matrix 2396, 2397 symmetric matrix 2402, 2404 triangular matrix 2409, 2410 rank-1 update Hermitian matrix 2399 symmetric matrix 2406 rank-1 update, conjugated 2393 rank-1 update, unconjugated 2394 rank-2 update Hermitian matrix 2400 symmetric matrix 2407 distributed real matrix transposition 2432 distributed symmetric matrix matrix-vector product 2402, 2404 rank-1 update 2406 rank-2 update 2407 distributed triangular matrix matrix-vector product 2409, 2410 solving systems of linear equations 2413 distributed vector-scalar product 2384 distributed vectors adding magnitudes of vector elements 2377 copying 2379 dot product complex vectors 2382 complex vectors, conjugated 2381 real vectors 2380 Euclidean norm 2383 global index of maximum element 2376 linear combination of vectors 2378 sum of vectors 2378 swapping 2385 vector-scalar product 2384 distributed-memory computations Distribution Generators 2153 Distribution Generators Supporting Accurate Mode 2154 divide and conquer algorithm 1706, 1715 djacobi 2515 djacobi_delete 2514 djacobi_init 2512 djacobi_solve 2513 djacobix 2516 dla_gbamv 1455 dla_gbrcond 1457 dla_gbrfsx_extended 1462 dla_gbrpvgrw 1467 dla_geamv 1468 dla_gercond 1470 dla_gerfsx_extended 1473 dla_lin_berr 1488 dla_porcond 1489 dla_porfsx_extended 1493 dla_porpvgrw 1498 dla_rpvgrw 1503 dla_syamv 1505 dla_syrcond 1507 dla_syrfsx_extended 1511 dla_syrpvgrw 1516 dla_wwaddw 1517 dlag2s 1427 dlapmr 1260 dlapmt 1262 dlarfb 1295 dlarft 1300 dlarscl2 1504 
dlartgp 1324 dlartgs 1326 dlascl2 1504 dlat2s 1453 dlatps 1383 dlatrd 1385 dlatrs 1387 dlatrz 1390 dlauu2 1392 dlauum 1393 dNewAbstractStream 2133 dorbdb 925 dorcsd 1060 dorg2l 1394 dorg2r 1395 dorgl2 1396 dorgr2 1397 dorm2l 1399 dorm2r 1400 dorml2 1402 dormr2 1404 dormr3 1405 dot product complex vectors, conjugated 60 complex vectors, unconjugated 61 distributed complex vectors, conjugated 2381 distributed complex vectors, unconjugated 2382 distributed real vectors 2380 real vectors 58 real vectors (extended precision) 59 sparse complex vectors 145 sparse complex vectors, conjugated 144 sparse real vectors 143 dpbtf2 1407 dporfsx 472 dpotf2 1408 dpprfs 478 dpptrs 396 dptts2 1409 driver expert 1536 simple 1536 Driver Routines 557, 930 drscl 1411 dss_create 1916 dsyconv 436 dsygs2 1415 dsyswapr 1411 dsyswapr1 1414 dsytd2 1417 dsytf2 1418 dsytri2 523 dsytri2x 527 dsytrs2 406 dtgex2 1421 dtgsy2 1423 dtrexc 868 Intel® Math Kernel Library Reference Manual 2722 dtrnlsp_check 2499 dtrnlsp_delete 2503 dtrnlsp_get 2502 dtrnlsp_init 2497 dtrnlsp_solve 2500 dtrnlspbc_check 2506 dtrnlspbc_delete 2511 dtrnlspbc_get 2510 dtrnlspbc_init 2505 dtrnlspbc_solve 2508 dtrti2 1426 dzsum1 1165 E eigenpairs, sorting 1868 eigenvalue problems general matrix 833, 877, 1656 generalized form 819 Hermitian matrix 758 symmetric matrix 758 symmetric tridiagonal matrix 1870, 1878 eigenvalues eigenvalue problems 758 eigenvectors eigenvalue problems 758 elementary reflector complex matrix 1809 general matrix 1329, 1804 general rectangular matrix LAPACK 1294, 1302 ScaLAPACK 1793, 1798 LAPACK generation 1298, 1299 ScaLAPACK generation 1800 error diagnostics, in VML 1973 error estimation for linear equations distributed tridiagonal coefficient matrix 1576 error handling pxerbla 1882, 2530 xerbla 1973 errors in solutions of linear equations banded matrix 461, 1462, 1493 distributed tridiagonal coefficient matrix 1576 general matrix band storage 458 Hermitian indefinite matrix 496, 1482 Hermitian matrix 
packed storage 504 Hermitian positive-definite matrix band storage 480 packed storage 478 symmetric indefinite matrix 488, 1511 symmetric matrix packed storage 501 symmetric positive-definite matrix band storage 480 packed storage 478 triangular matrix band storage 511 packed storage 508 tridiagonal matrix 467 Estimates 2606 Euclidean norm of a distributed vector 2383 of a vector 62 expert driver 1536 Exponential 2165 F factorization Bunch-Kaufman LAPACK 357 ScaLAPACK 1538 Cholesky LAPACK 357, 1407, 1408 ScaLAPACK 1857 diagonal pivoting Hermitian matrix complex 1419 packed 663 symmetric matrix indefinite 1418 packed 657 LU LAPACK 357 ScaLAPACK 1538 orthogonal LAPACK 670 ScaLAPACK 1586 partial complex Hermitian indefinite matrix 1378 real/complex symmetric matrix 1377 triangular factorization 357, 1538 upper trapezoidal matrix 1390 fast Fourier transform DftiCommitDescriptor 2316 DftiCommitDescriptorDM 2358 DftiComputeBackward 2322 DftiComputeBackwardDM 2362 DftiComputeForwardDM 2360 DftiCopyDescriptor 2318 DftiCreateDescriptor 2314 DftiCreateDescriptorDM 2357 DftiErrorClass 2329 DftiErrorMessage 2331 DftiFreeDescriptor 2317 DftiFreeDescriptorDM 2359 DftiGetValue 2327 DftiGetValueDM 2367 DftiSetValueDM 2365 fast Fourier Transform DftiComputeForward 2320 FFT computation cluster FFT 2356 FFT functions descriptor manipulation DftiCommitDescriptor 2316 DftiCommitDescriptorDM 2358 DftiCopyDescriptor 2318 DftiCreateDescriptor 2314 DftiCreateDescriptorDM 2357 DftiFreeDescriptor 2317 DftiFreeDescriptorDM 2359 DFT computation DftiComputeBackward 2322 DftiComputeForward 2320 FFT computation DftiComputeForwardDM 2360 status checking DftiErrorClass 2329 DftiErrorMessage 2331 FFT Interface 2313 FFT routines descriptor configuration DftiGetValue 2327 DftiGetValueDM 2367 DftiSetValueDM 2365 Index 2723 FFT computation DftiComputeBackwardDM 2362 FFTW interface to Intel(R) MKL for FFTW2 2689 for FFTW3 2697 fill-in, for sparse matrices 2631 finding index of the element of a vector 
with the largest absolute value of the real part 1744 element of a vector with the largest absolute value 71 element of a vector with the largest absolute value of the real part and its global index 1745 element of a vector with the smallest absolute value 72 font conventions 41 Fortran 95 interface conventions BLAS, Sparse BLAS 52 LAPACK 351 Fortran 95 Interfaces for LAPACK absent from Netlib 2684 identical to Netlib 2681 modified Netlib interfaces 2684 new functionality 2687 with replaced Netlib argument names 2682 Fortran 95 Interfaces for LAPACK Routines specific MKL features Fortran 95 LAPACK interface vs. Netlib 352 free_Helmholtz_2D 2474 free_Helmholtz_3D 2474 free_sph_np 2480 free_sph_p 2480 free_trig_transform 2451 Frobenius norm complex Hermitian matrix packed storage 1250 complex Hermitian matrix in RFP format 1443 complex Hermitian tridiagonal matrix 1251 complex symmetric matrix 1252 general rectangular matrix 1244, 1779 general tridiagonal matrix 1245 Hermitian band matrix 1248 real symmetric matrix 1252, 1782 real symmetric matrix in RFP format 1442 real symmetric tridiagonal matrix 1251 symmetric band matrix 1247 symmetric matrix packed storage 1249 trapezoidal matrix 1257 triangular band matrix 1255 triangular matrix packed storage 1256 upper Hessenberg matrix 1246, 1780 full storage scheme 2646 full-storage vectors 140 function name conventions, in VML 1970 G Gamma 2183 gathering sparse vector's elements into compressed form and writing zeros to these elements 147 Gaussian 2159 GaussianMV 2161 gbcon 422 gbsvx 576 gbtrs 387 gecon 420 general distributed matrix scalar-matrix-matrix product 2418 general matrix block reflector 1330, 1807 eigenvalue problems 833, 877, 1656 elementary reflector 1329, 1804 estimating the condition number band storage 422 inverting matrix LAPACK 514 ScaLAPACK 1578 LQ factorization 689, 1598 LU factorization band storage 359, 1166, 1540, 1542, 1872, 1873 matrix-vector product band storage 75 multiplying by orthogonal 
matrix from LQ factorization 1402, 1846 from QR factorization 1400, 1843 from RQ factorization 1404, 1849 from RZ factorization 1405 multiplying by unitary matrix from LQ factorization 1402, 1846 from QR factorization 1400, 1843 from RQ factorization 1404, 1849 from RZ factorization 1405 QL factorization LAPACK 700 ScaLAPACK 1608 QR factorization with pivoting 676, 678, 1589 rank-1 update 79 rank-1 update, conjugated 81 rank-1 update, unconjugated 82 reduction to bidiagonal form 1167, 1181, 1751 reduction to upper Hessenberg form 1754 RQ factorization LAPACK 710 ScaLAPACK 1636 scalar-matrix-matrix product 119, 333 solving systems of linear equations band storage LAPACK 387 ScaLAPACK 1551 general rectangular distributed matrix computing scaling factors 1583 equilibration 1583 general rectangular matrix 1-norm value LAPACK 1244 ScaLAPACK 1779 block reflector LAPACK 1295 ScaLAPACK 1795 elementary reflector LAPACK 1294, 1798 ScaLAPACK 1793 Frobenius norm LAPACK 1244 ScaLAPACK 1779 infinity- norm LAPACK 1244 ScaLAPACK 1779 largest absolute value of element LAPACK 1244 ScaLAPACK 1779 LQ factorization LAPACK 1170 ScaLAPACK 1756 multiplication LAPACK 1335 Intel® Math Kernel Library Reference Manual 2724 ScaLAPACK 1815 QL factorization LAPACK 1171 ScaLAPACK 1758 QR factorization LAPACK 1172, 1174 ScaLAPACK 1760 reduction of first columns LAPACK 1226, 1228 ScaLAPACK 1775 reduction to bidiagonal form 1765 row interchanges LAPACK 1374 ScaLAPACK 1821 RQ factorization LAPACK 1175 ScaLAPACK 1617, 1762 scaling 1787 general square matrix reduction to upper Hessenberg form 1168 trace 1822 general triangular matrix LU factorization band storage 1746 general tridiagonal matrix 1-norm value 1245 Frobenius norm 1245 infinity- norm 1245 largest absolute value of element 1245 general tridiagonal triangular matrix LU factorization band storage 1748 generalized eigenvalue problems complex Hermitian-definite problem band storage 829 packed storage 825 real symmetric-definite problem band 
storage 827 packed storage 823 See also LAPACK routines, generalized eigenvalue problems 819 Generalized LLS Problems 943 Generalized Nonsymmetric Eigenproblems 1120 generalized Schur factorization 1223, 1293, 1304, 1305 Generalized Singular Value Decomposition 910 generalized Sylvester equation 902 Generalized Symmetric-Definite Eigenproblems 1065 generation methods 2116 Geometric 2196 geqpf 676 GetBrngProperties 2210 getcpuclocks 2533 getcpufrequency 2534 GetNumRegBrngs 2152 GetStreamSize 2145 GetStreamStateBrng 2151 GFSR 2118 Givens rotation modified Givens transformation parameters 67 of sparse vectors 148 parameters 64 global array 1535, 2373 global index of maximum element of a distributed vector 2376 gtrfs 467 Gumbel 2181 H Helmholtz problem three-dimensional 2461 two-dimensional 2458 Helmholtz problem on a sphere non-periodic 2459 periodic 2459 Hermitian band matrix 1-norm value 1248 Frobenius norm 1248 infinity-norm 1248 largest absolute value of element 1248 Hermitian distributed matrix rank-n update 2424 scalar-matrix-matrix product 2420 Hermitian indefinite matrix matrix-vector product 1478 Hermitian matrix Bunch-Kaufman factorization packed storage 383 eigenvalues and eigenvectors 1713, 1715, 1717 estimating the condition number packed storage 441 generalized eigenvalue problems 819 inverting the matrix packed storage 532 matrix-vector product band storage 84 packed storage 91 rank-1 update packed storage 92 rank-2 update packed storage 94 rank-2k update 126 rank-k update 124 reducing to standard form LAPACK 1415 ScaLAPACK 1859 reducing to tridiagonal form LAPACK 1385, 1417 ScaLAPACK 1823, 1861 scalar-matrix-matrix product 122 scaling 1789 solving systems of linear equations packed storage 411 Hermitian positive-definite distributed matrix computing scaling factors 1584 equilibration 1584 Hermitian positive semidefinite matrix Cholesky factorization 366 Hermitian positive-definite band matrix Cholesky factorization 1407 Hermitian positive-definite
distributed matrix inverting the matrix 1580 Hermitian positive-definite matrix Cholesky factorization band storage 371, 1546 packed storage 369 estimating the condition number band storage 430 packed storage 428 inverting the matrix packed storage 519 solving systems of linear equations band storage 398, 1558 packed storage 396 Hermitian positive-definite tridiagonal matrix solving systems of linear equations 1560 heswapr 1413 hetri2 525 hetri2x 529 hgeqz 885 Householder matrix LAPACK 1298, 1299 ScaLAPACK 1800 Householder reflector 1867 hseqr 851 Hypergeometric 2200 I i?amax 71 i?amin 72 i?max1 1164 IBM ESSL library 2214 IEEE arithmetic 1778 IEEE standard implementation 1880 signbit position 1882 ila?lr 1432 iladiag 1530 ilaenv 1520 ilaprec 1531 ilatrans 1531 ilauplo 1532 ilaver 1519 ILU0 preconditioner 1958 Incomplete LU Factorization Technique 1958 increment 2645 iNewAbstractStream 2131 infinity-norm complex Hermitian matrix packed storage 1250 complex Hermitian matrix in RFP format 1443 complex Hermitian tridiagonal matrix 1251 complex symmetric matrix 1252 general rectangular matrix 1244, 1779 general tridiagonal matrix 1245 Hermitian band matrix 1248 real symmetric matrix 1252, 1782 real symmetric matrix in RFP format 1442 real symmetric tridiagonal matrix 1251 symmetric band matrix 1247 symmetric matrix packed storage 1249 trapezoidal matrix 1257 triangular band matrix 1255 triangular matrix packed storage 1256 upper Hessenberg matrix 1246, 1780 Interface Consideration 153 inverse matrix.
inverting a matrix 514, 1578, 1580, 1581 inverting a matrix general matrix LAPACK 514 ScaLAPACK 1578 Hermitian matrix packed storage 532 Hermitian positive-definite matrix LAPACK 516 packed storage 519 ScaLAPACK 1580 symmetric matrix packed storage 530 symmetric positive-definite matrix LAPACK 516 packed storage 519 ScaLAPACK 1580 triangular distributed matrix 1581 triangular matrix packed storage 536 iparmq 1522 Iterative Sparse Solvers 1932 Iterative Sparse Solvers based on Reverse Communication Interface (RCI ISS) 1932 J Jacobi plane rotations 1051 Jacobian matrix calculation routines ?jacobi 2515 ?jacobi_delete 2514 ?jacobi_init 2512 ?jacobi_solve 2513 ?jacobix 2516 L la_gbamv 1455 la_gbrcond 1457 la_gbrcond_c 1459 la_gbrcond_x 1460 la_gercond 1470 la_gercond_c 1471 la_gercond_x 1472 la_hercond_c 1480 la_hercond_x 1481 la_lin_berr 1488 la_porcond 1489 la_porcond_c 1490 la_porcond_x 1492 la_syrcond 1507 la_syrcond_c 1508 la_syrcond_x 1509 LAPACK naming conventions 347 LAPACK auxiliary routines ?la_geamv 1468 ?la_heamv 1478 ?la_syamv 1505 ?larscl2 1504 ?lascl2 1504 LAPACK routines ?gsvj0 1432 ?gsvj1 1434 ?hfrk 1438 ?larfp 1429 ?sfrk 1437 2-by-2 generalized eigenvalue problem 1214 2-by-2 Hermitian matrix plane rotation 1293 2-by-2 orthogonal matrices 1216 2-by-2 real matrix generalized Schur factorization 1223 2-by-2 real nonsymmetric matrix Schur factorization 1259 2-by-2 symmetric matrix plane rotation 1293 2-by-2 triangular matrix singular values 1334 SVD 1373 approximation to smallest eigenvalue 1365 auxiliary routines ?gbtf2 1166 ?gebd2 1167 ?gehd2 1168 ?gelq2 1170 ?geql2 1171 ?geqr2 1172 ?geqr2p 1174 ?gerq2 1175 ?gesc2 1176 ?getc2 1177 ?getf2 1178 Intel® Math Kernel Library Reference Manual 2726 ?gtts2 1179 ?hetf2 1419 ?hfrk 1438 ?isnan 1180 ?la_gbrpvgrw 1467 ?la_herpvgrw 1487 ?la_porpvgrw 1498 ?la_rpvgrw 1503 ?la_syrpvgrw 1516 ?la_wwaddw 1517 ?labrd 1181 ?lacgv 1155 ?lacn2 1184 ?lacon 1185 ?lacp2 1455 ?lacpy 1186 ?lacrm 1156 ?lacrt 1156 ?ladiv 1187 ?lae2 
1188 ?laebz 1189 ?laed0 1192 ?laed1 1194 ?laed2 1195 ?laed3 1197 ?laed4 1199 ?laed5 1200 ?laed6 1200 ?laed7 1202 ?laed8 1204 ?laed9 1207 ?laeda 1208 ?laein 1209 ?laesy 1157 ?laev2 1212 ?laexc 1213 ?lag2 1214 ?lags2 1216 ?lagtf 1218 ?lagtm 1220 ?lagts 1221 ?lagv2 1223 ?lahef 1378 ?lahqr 1224 ?lahr2 1228 ?lahrd 1226 ?laic1 1230 ?laisnan 1181 ?laln2 1232 ?lals0 1234 ?lalsa 1236 ?lalsd 1239 ?lamrg 1241 ?laneg 1242 ?langb 1243 ?lange 1244 ?langt 1245 ?lanhb 1248 ?lanhe 1253 ?lanhf 1443 ?lanhp 1250 ?lanhs 1246 ?lansb 1247 ?lansf 1442 ?lansp 1249 ?lanst/?lanht 1251 ?lansy 1252 ?lantb 1255 ?lantp 1256 ?lantr 1257 ?lanv2 1259 ?lapll 1259 ?lapmr 1260 ?lapmt 1262 ?lapy2 1262 ?lapy3 1263 ?laqgb 1264 ?laqge 1265 ?laqhb 1266 ?laqhe 1499 ?laqhp 1501 ?laqp2 1268 ?laqps 1269 ?laqr0 1270 ?laqr1 1273 ?laqr2 1274 ?laqr3 1277 ?laqr4 1280 ?laqr5 1282 ?laqsb 1285 ?laqsp 1286 ?laqsy 1287 ?laqtr 1289 ?lar1v 1290 ?lar2v 1293 ?larcm 1502 ?larf 1294 ?larfb 1295 ?larfg 1298 ?larfgp 1299 ?larfp 1429 ?larft 1300 ?larfx 1302 ?largv 1304 ?larnv 1305 ?larra 1306 ?larrb 1307 ?larrc 1309 ?larrd 1310 ?larre 1312 ?larrf 1315 ?larrj 1317 ?larrk 1318 ?larrr 1319 ?larrv 1320 ?lartg 1323 ?lartgp 1324 ?lartgs 1326 ?lartv 1327 ?laruv 1328 ?larz 1329 ?larzb 1330 ?larzt 1332 ?las2 1334 ?lascl 1335 ?lasd0 1336 ?lasd1 1338 ?lasd2 1340 ?lasd3 1342 ?lasd4 1344 ?lasd5 1346 ?lasd6 1347 ?lasd7 1350 ?lasd8 1353 ?lasd9 1354 ?lasda 1356 ?lasdq 1358 ?lasdt 1360 ?laset 1361 ?lasq1 1362 Index 2727 ?lasq2 1363 ?lasq3 1364 ?lasq4 1365 ?lasq5 1366 ?lasq6 1367 ?lasr 1368 ?lasrt 1371 ?lassq 1372 ?lasv2 1373 ?laswp 1374 ?lasy2 1375 ?lasyf 1377 ?latbs 1380 ?latdf 1382 ?latps 1383 ?latrd 1385 ?latrs 1387 ?latrz 1390 ?lauu2 1392 ?lauum 1393 ?orbdb/?unbdb 925 ?orcsd/?uncsd 1060 ?org2l/?ung2l 1394 ?org2r/?ung2r 1395 ?orgl2l/?ungl2 1396 ?orgr2/?ungr2 1397 ?orm2l/?unm2l 1399 ?orm2r/?unm2r 1400 ?orml2/?unml2 1402 ?ormr2/?unmr2 1404 ?ormr3/?unmr3 1405 ?pbtf2 1407 ?potf2 1408 ?pstf2 1451 ?ptts2 1409 ?rot 1158 ?rscl 1411 ?sfrk 1437 ?spmv 
1159 ?spr 1161 ?sum1 1165 ?sygs2/?hegs2 1415 ?symv 1162 ?syr 1163 ?sytd2/?hetd2 1417 ?sytf2 1418 ?tfttp 1444 ?tfttr 1445 ?tgex2 1421 ?tgsy2 1423 ?tpttf 1446 ?tpttr 1448 ?trti2 1426 ?trttf 1449 ?trttp 1450 clag2z 1427 dlag2s 1427 dlat2s 1453 i?max1 1164 ila?lc 1431 ila?lr 1432 slag2d 1428 zlag2c 1429 zlat2c 1454 banded matrix equilibration ?gbequ 542 ?gbequb 545 bidiagonal divide and conquer 1360 block reflector triangular factor 1300, 1332 checking for safe infinity 1523 checking for strings equality 1524 complex Hermitian matrix packed storage 1250 complex Hermitian matrix in RFP format 1443 complex Hermitian tridiagonal matrix 1251 complex matrix multiplication 1156, 1502 complex symmetric matrix computing eigenvalues and eigenvectors 1157 matrix-vector product 1162 symmetric rank-1 update 1163 complex symmetric packed matrix symmetric rank-1 update 1161 complex vector 1-norm using true absolute value 1165 index of element with max absolute value 1164 linear transformation 1156 matrix-vector product 1159 plane rotation 1158 complex vector conjugation 1155 condition number estimation ?disna 818 ?gbcon 422 ?gecon 420 ?gtcon 424 ?hecon 438 ?hpcon 441 ?pbcon 430 ?pocon 426 ?ppcon 428 ?ptcon 432 ?spcon 439 ?sycon 434 ?tbcon 447 ?tpcon 445 ?trcon 443 determining machine parameters 1526 diagonally dominant triangular factorization ?dttrfb 363 dqd transform 1367 dqds transform 1366 driver routines generalized LLS problems ?ggglm 946 ?gglse 943 generalized nonsymmetric eigenproblems ?gges 1121 ?ggesx 1126 ?ggev 1132 ?ggevx 1136 Intel® Math Kernel Library Reference Manual 2728 generalized symmetric definite eigenproblems ?hbgv 1105 ?hbgvd 1110 ?hbgvx 1117 ?hegv 1068 ?hegvd 1074 ?hegvx 1081 ?hpgv 1087 ?hpgvd 1092 ?hpgvx 1099 ?sbgv 1103 ?sbgvd 1107 ?sbgvx 1113 ?spgv 1085 ?spgvd 1089 ?spgvx 1096 ?sygv 1066 ?sygvd 1071 ?sygvx 1077 linear least squares problems ?gels 930 ?gelsd 939 ?gelss 937 ?gelsy 933 ?lals0 (auxiliary) 1234 ?lalsa (auxiliary) 1236 ?lalsd (auxiliary) 1239 
nonsymmetric eigenproblems ?gees 1020 ?geesx 1024 ?geev 1028 ?geevx 1032 singular value decomposition ?gejsv 1045 ?gelsd 939 ?gesdd 1041 ?gesvd 1037 ?gesvj 1051 ?ggsvd 1055 solving linear equations ?dtsvb 595 ?gbsv 574 ?gbsvx 576 ?gbsvxx 582 ?gesv 558 ?gesvx 561 ?gesvxx 567 ?gtsv 589 ?gtsvx 591 ?hesv 642 ?hesvx 645 ?hesvxx 649 ?hpsv 661 ?hpsvx 663 ?pbsv 617 ?pbsvx 619 ?posv 596 ?posvx 599 ?posvxx 604 ?ppsv 611 ?ppsvx 612 ?ptsv 623 ?ptsvx 625 ?spsv 655 ?spsvx 657 ?sysv 629 ?sysvx 631 ?sysvxx 635 symmetric eigenproblems ?hbev 993 ?hbevd 998 ?hbevx 1004 ?heev 951 ?heevd 956 ?heevr 970 ?heevx 963 ?hpev 977 ?hpevd 981 ?hpevx 988 ?sbev 991 ?sbevd 995 ?sbevx 1001 ?spev 975 ?spevd 979 ?spevx 985 ?stev 1008 ?stevd 1009 ?stevr 1015 ?stevx 1012 ?syev 949 ?syevd 954 ?syevr 966 ?syevx 959 environmental enquiry 1520, 1522 finding a relatively isolated eigenvalue 1315 general band matrix equilibration 1264 general matrix block reflector 1330 elementary reflector 1329 reduction to bidiagonal form 1167, 1181 general matrix equilibration ?geequ 538 ?geequb 540 general rectangular matrix block reflector 1295 elementary reflector 1294, 1302 equilibration 1265, 1499, 1501 LQ factorization 1170 plane rotation 1368 QL factorization 1171 QR factorization 1172, 1174 row interchanges 1374 RQ factorization 1175 general square matrix reduction to upper Hessenberg form 1168 general tridiagonal matrix 1218, 1220, 1221, 1245, 1312, 1320 generalized eigenvalue problems ?hbgst 829 ?hegst 822 ?hpgst 825 ?pbstf 831 ?sbgst 827 ?spgst 823 ?sygst 820 generalized SVD ?ggsvp 910 ?tgsja 914 generalized Sylvester equation ?tgsyl 902 Hermitian band matrix equilibration 1266, 1287 Index 2729 Hermitian band matrix in packed storage equilibration 1286 Hermitian indefinite matrix equilibration ?heequb 556 Hermitian matrix computing eigenvalues and eigenvectors 1212 Hermitian positive-definite matrix equilibration ?poequ 547 ?poequb 549 Householder matrix elementary reflector 1298, 1299 ila?lc 1431 ila?lr 1432 
incremental condition estimation 1230 linear dependence of vectors 1259 LQ factorization ?gelq2 1170 ?gelqf 689 ?orglq 692 ?ormlq 694 ?unglq 696 ?unmlq 698 LU factorization general band matrix 1166 matrix equilibration ?laqgb 1264 ?laqge 1265 ?laqhb 1266 ?laqhe 1499 ?laqhp 1501 ?laqsb 1285 ?laqsp 1286 ?laqsy 1287 ?pbequ 552 ?ppequ 550 matrix inversion ?getri 514 ?hetri 522 ?hetri2 525 ?hetri2x 529 ?hptri 532 ?potri 516 ?pptri 519 ?sptri 530 ?sytri 520 ?sytri2 523 ?sytri2x 527 ?tptri 536 ?trtri 534 matrix-matrix product ?lagtm 1220 merging sets of singular values 1340, 1350 mixed precision iterative refinement subroutines 558, 596, 1427–1429 nonsymmetric eigenvalue problems ?gebak 849 ?gebal 847 ?gehrd 835 ?hsein 855 ?hseqr 851 ?orghr 837 ?ormhr 839 ?trevc 860 ?trexc 868 ?trsen 870 ?trsna 864 ?unghr 842 ?unmhr 844 off-diagonal and diagonal elements 1361 permutation list creation 1241 permutation of matrix columns 1262 permutation of matrix rows 1260 plane rotation 1323, 1324, 1326, 1327, 1368 plane rotation vector 1304 QL factorization ?geql2 1171 ?geqlf 700 ?orgql 702 ?ormql 706 ?ungql 704 ?unmql 708 QR factorization ?geqp3 678 ?geqpf 676 ?geqr2 1172 ?geqr2p 1174 ?geqrf 671 ?geqrfp 674 ?ggqrf 728 ?ggrqf 731 ?laqp2 1268 ?laqps 1269 ?orgqr 681 ?ormqr 683 ?ungqr 685 ?unmqr 687 p?geqrf 1587 random numbers vector 1305 real lower bidiagonal matrix SVD 1358 real square bidiagonal matrix singular values 1362 real symmetric matrix 1252 real symmetric matrix in RFP format 1442 real symmetric tridiagonal matrix 1189, 1251 real upper bidiagonal matrix singular values 1336 SVD 1338, 1356, 1358 real upper quasi-triangular matrix orthogonal similarity transformation 1213 reciprocal condition numbers for eigenvalues and/or eigenvectors ?tgsna 906 rectangular full packed format 368, 395 RQ factorization ?gerq2 1175 ?gerqf 710 ?orgrq 712 ?ormrq 716 ?ungrq 714 ?unmrq 718 RZ factorization ?ormrz 723 ?tzrzf 720 ?unmrz 725 singular value decomposition ?bdsdc 756 ?bdsqr 752 ?gbbrd 739
?gebrd 736 ?orgbr 742 ?ormbr 744 ?ungbr 747 ?unmbr 749 solution refinement and error estimation ?gbrfs 458 Intel® Math Kernel Library Reference Manual 2730 ?gbrfsx 461 ?gerfs 449 ?gerfsx 452 ?gtrfs 467 ?herfs 494 ?herfsx 496 ?hprfs 504 ?la_gbrfsx_extended 1462 ?la_gerfsx_extended 1473 ?la_herfsx_extended 1482 ?la_porfsx_extended 1493 ?la_syrfsx_extended 1511 ?pbrfs 480 ?porfs 469 ?porfsx 472 ?pprfs 478 ?ptrfs 483 ?sprfs 501 ?syrfs 485 ?syrfsx 488 ?tbrfs 511 ?tprfs 508 ?trrfs 506 solving linear equations ?dttrsb 392 ?gbtrs 387 ?getrs 385 ?gttrs 389 ?heswapr 1413 ?hetrs 404 ?hetrs2 408 ?hptrs 411 ?laln2 1232 ?laqtr 1289 ?pbtrs 398 ?pftrs 395 ?potrs 393 ?pptrs 396 ?pttrs 400 ?sptrs 409 ?syswapr 1411 ?syswapr1 1414 ?sytrs 402 ?sytrs2 406 ?tbtrs 418 ?tptrs 416 ?trtrs 413 sorting numbers 1371 square root 1262, 1263 square roots 1342, 1344, 1346, 1353, 1354, 1524 Sylvester equation ?lasy2 1375 ?tgsy2 1423 ?trsyl 874 symmetric band matrix equilibration 1285, 1287 symmetric band matrix in packed storage equilibration 1286 symmetric eigenvalue problems ?disna 818 ?hbtrd 791 ?herdb 766 ?hetrd 772 ?hptrd 784 ?opgtr 781 ?opmtr 782 ?orgtr 768 ?ormtr 770 ?pteqr 810 ?sbtrd 789 ?sptrd 779 ?stebz 813 ?stedc 801 ?stegr 805 ?stein 815 ?stemr 798 ?steqr 795 ?sterf 793 ?syrdb 764 ?sytrd 762 ?ungtr 775 ?unmtr 776 ?upgtr 786 ?upmtr 787 auxiliary ?lae2 1188 ?laebz 1189 ?laed0 1192 ?laed1 1194 ?laed2 1195 ?laed3 1197 ?laed4 1199 ?laed5 1200 ?laed6 1200 ?laed7 1202 ?laed8 1204 ?laed9 1207 ?laeda 1208 symmetric indefinite matrix equilibration ?syequb 554 symmetric matrix computing eigenvalues and eigenvectors 1212 packed storage 1249 symmetric positive-definite matrix equilibration ?poequ 547 ?poequb 549 symmetric positive-definite tridiagonal matrix eigenvalues 1363 trapezoidal matrix 1257, 1390 triangular factorization ?gbtrf 359 ?getrf 357 ?gttrf 361 ?hetrf 378 ?hptrf 383 ?pbtrf 371 ?potrf 364 ?pptrf 369 ?pstrf 366 ?pttrf 373 ?sptrf 381 ?sytrf 374 p?dbtrf 1542 triangular matrix packed 
storage 1256 triangular matrix factorization ?pftrf 368 ?pftri 517 ?tftri 535 triangular system of equations 1383, 1387 tridiagonal band matrix 1255 uniform distribution 1328 unreduced symmetric tridiagonal matrix 1192 updated upper bidiagonal matrix SVD 1347 updating sum of squares 1372 upper Hessenberg matrix computing a specified eigenvector 1209 eigenvalues 1224 Schur factorization 1224 utility functions and routines ?labad 1524 ?lamc1 1526 ?lamc2 1526 ?lamc3 1527 ?lamc4 1528 ?lamc5 1528 ?lamch 1525 chla_transtype 1529 ieeeck 1523 iladiag 1530 ilaenv 1520 ilaprec 1531 ilatrans 1531 ilauplo 1532 ilaver 1519 iparmq 1522 lsamen 1524 second/dsecnd 1529 xerbla_array 1532 Laplace 2168 Laplace problem three-dimensional 2461 two-dimensional 2459 largest absolute value of element complex Hermitian matrix packed storage 1250 complex Hermitian matrix in RFP format 1443 complex Hermitian tridiagonal matrix 1251 complex symmetric matrix 1252 general rectangular matrix 1244, 1779 general tridiagonal matrix 1245 Hermitian band matrix 1248 real symmetric matrix 1252, 1782 real symmetric matrix in RFP format 1442 real symmetric tridiagonal matrix 1251 symmetric band matrix 1247 symmetric matrix packed storage 1249 trapezoidal matrix 1257 triangular band matrix 1255 triangular matrix packed storage 1256 upper Hessenberg matrix 1246, 1780 leading dimension 2648 leapfrog method 2121 LeapfrogStream 2146 least squares problems length.
dimension 2645 library version 2521 Library Version Obtaining 2521 library version string 2523 linear combination of distributed vectors 2378 linear combination of vectors 55, 327 Linear Congruential Generator 2118 linear equations, solving tridiagonal symmetric positive-definite matrix LAPACK 623 ScaLAPACK 1699 band matrix LAPACK 574, 576 ScaLAPACK 1685 banded matrix extra precise iterative refinement LAPACK 582 extra precise iterative refinement 461, 1462, 1493 LAPACK 582 Cholesky-factored matrix LAPACK 398 ScaLAPACK 1558 diagonally dominant tridiagonal matrix LAPACK 392, 595 diagonally dominant-like matrix banded 1553 tridiagonal 1555 general band matrix ScaLAPACK 1687 general matrix band storage 387, 1551 extra precise iterative refinement 452 extra precise iterative refinement 1473 general tridiagonal matrix ScaLAPACK 1689 Hermitian indefinite matrix extra precise iterative refinement LAPACK 649 extra precise iterative refinement 1482 LAPACK 649 Hermitian matrix error bounds 645, 663 packed storage 411, 661, 663 Hermitian positive-definite matrix band storage LAPACK 617 ScaLAPACK 1697 error bounds LAPACK 599 ScaLAPACK 1693 extra precise iterative refinement LAPACK 604 LAPACK linear equations, solving multiple right-sides symmetric packed storage 396, 611, 612 ScaLAPACK 1693 Hermitian positive-definite tridiagonal linear equations 1876 Hermitian positive-definite tridiagonal matrix 1560 multiple right-hand sides band matrix LAPACK 574, 576 ScaLAPACK 1685 banded matrix LAPACK 582 diagonally dominant tridiagonal matrix 595 Hermitian indefinite matrix LAPACK 649 Hermitian matrix 642, 661 Hermitian positive-definite matrix band storage 617 square matrix LAPACK 558, 561, 567 ScaLAPACK 1679, 1681 symmetric indefinite matrix LAPACK 635 symmetric matrix 629, 655 symmetric positive-definite matrix band storage 617 symmetric/Hermitian positive-definite matrix LAPACK 604 tridiagonal matrix 589, 591 overestimated or
underestimated system 1701 square matrix error bounds LAPACK 561, 576 ScaLAPACK 1681 extra precise iterative refinement LAPACK 567 LAPACK 558, 561, 567 ScaLAPACK 1679, 1681 symmetric indefinite matrix extra precise iterative refinement LAPACK 635 extra precise iterative refinement 1511 LAPACK 635 symmetric matrix error bounds 631, 657 packed storage 409, 655, 657 symmetric positive-definite matrix band storage LAPACK 617 ScaLAPACK 1697 error bounds LAPACK 599 ScaLAPACK 1693 extra precise iterative refinement LAPACK 472, 604 LAPACK 596, 599, 604 packed storage 396, 611, 612 ScaLAPACK 1691, 1693 symmetric positive-definite tridiagonal linear equations 1876 triangular matrix band storage 418, 1851 packed storage 416 tridiagonal Hermitian positive-definite matrix error bounds 625 LAPACK 623 ScaLAPACK 1699 tridiagonal matrix error bounds 591 LAPACK 389, 400, 589, 591 LAPACK auxiliary 1290 ScaLAPACK auxiliary 1875 tridiagonal symmetric positive-definite matrix error bounds 625 Linear Least Squares (LLS) Problems 930 LoadStreamF 2141 LoadStreamM 2144 Lognormal 2178 LQ factorization computing the elements of orthogonal matrix Q 692 real orthogonal matrix Q 1600 unitary matrix Q 696, 1602 general rectangular matrix 1170, 1756 lsame 2530 lsamen 1524, 2531 LU factorization band matrix blocked algorithm 1873 unblocked algorithm 1872 diagonally dominant tridiagonal matrix 363 diagonally dominant-like tridiagonal matrix 1543 general band matrix 1166 general matrix 1178, 1763 solving linear equations general matrix 1176 square matrix 1681 tridiagonal matrix 1179, 1221 triangular band matrix 1746 tridiagonal band matrix 1748 tridiagonal matrix 361, 1218, 1874 with complete pivoting 1177, 1382 with partial pivoting 1178, 1763 M machine parameters LAPACK 1525 ScaLAPACK 1881 matrix arguments column-major ordering 2645, 2648 example 2649 leading dimension 2648 number of columns 2648 number of rows 2648 transposition parameter 2648 matrix block QR factorization with pivoting 1268
matrix converters mkl_?csrbsr 304 mkl_?csrcoo 301 mkl_?csrcsc 307 mkl_?csrdia 309 mkl_?csrsky 313 mkl_?dnscsr 298 matrix equation AX = B 138, 355, 385, 1440, 1536, 1550 matrix one-dimensional substructures 2645 matrix-matrix operation product general distributed matrix 2418 general matrix 119, 333 rank-2k update Hermitian distributed matrix 2424 Hermitian matrix 126 symmetric distributed matrix 2430 symmetric matrix 133 rank-k update Hermitian matrix 124 symmetric distributed matrix 2428 rank-n update symmetric matrix 131 scalar-matrix-matrix product Hermitian distributed matrix 2420 Hermitian matrix 122 symmetric distributed matrix 2426 symmetric matrix 128 matrix-matrix operation:scalar-matrix-matrix product triangular distributed matrix 2435 triangular matrix 135 matrix-vector operation product Hermitian matrix 84, 86, 91 real symmetric matrix 98, 102 triangular matrix 107, 112, 115 rank-1 update Hermitian matrix 87, 92 real symmetric matrix 99, 104 rank-2 update Hermitian matrix 89, 94 symmetric matrix 101, 106 matrix-vector operation:product Hermitian matrix band storage 84 packed storage 91 Index 2733 real symmetric matrix packed storage 98 symmetric matrix band storage 95 triangular matrix band storage 107 packed storage 112 matrix-vector operation:rank-1 update Hermitian matrix packed storage 92 real symmetric matrix packed storage 99 matrix-vector operation:rank-2 update Hermitian matrix packed storage 94 symmetric matrix packed storage 101 mkl_?bsrgemv 164 mkl_?bsrmm 246 mkl_?bsrmv 218 mkl_?bsrsm 268 mkl_?bsrsv 232 mkl_?bsrsymv 173 mkl_?bsrtrsv 184 mkl_?coogemv 166 mkl_?coomm 254 mkl_?coomv 225 mkl_?coosm 265 mkl_?coosv 239 mkl_?coosymv 176 mkl_?cootrsv 186 mkl_?cscmm 250 mkl_?cscmv 222 mkl_?cscsm 261 mkl_?cscsv 235 mkl_?csradd 316 mkl_?csrbsr 304 mkl_?csrcoo 301 mkl_?csrcsc 307 mkl_?csrdia 309 mkl_?csrgemv 161 mkl_?csrmm 242 mkl_?csrmultcsr 320 mkl_?csrmultd 324 mkl_?csrmv 215 mkl_?csrsky 313 mkl_?csrsm 257 mkl_?csrsv 228 mkl_?csrsymv 171 mkl_?csrtrsv 
181 mkl_?diagemv 169 mkl_?diamm 284 mkl_?diamv 272 mkl_?diasm 291 mkl_?diasv 278 mkl_?diasymv 178 mkl_?diatrsv 189 mkl_?dnscsr 298 mkl_?imatcopy 335 mkl_?omatadd 344 mkl_?omatcopy 338 mkl_?omatcopy2 341 mkl_?skymm 288 mkl_?skymv 275 mkl_?skysm 295 mkl_?skysv 281 mkl_cspblas_?bsrgemv 194 mkl_cspblas_?bsrsymv 202 mkl_cspblas_?bsrtrsv 209 mkl_cspblas_?coogemv 197 mkl_cspblas_?coosymv 204 mkl_cspblas_?csrgemv 192 mkl_cspblas_?csrsymv 199 mkl_cspblas_?csrtrsv 207 mkl_cspblas_?dcootrsv 212 mkl_disable_fast_mm 2538 MKL_Disable_Fast_MM 2538 mkl_domain_get_max_threads 2527 MKL_Domain_Get_Max_Threads 2527 mkl_domain_set_num_threads 2525 MKL_Domain_Set_Num_Threads 2525 mkl_enable_instructions 2544 MKL_Enable_Instructions 2544 mkl_free usage example 2540 MKL_free 2540 mkl_free_buffers 2536 MKL_Free_Buffers 2536 MKL_FreeBuffers 2536 mkl_get_clocks_frequency 2535 MKL_Get_Clocks_Frequency 2535 mkl_get_cpu_clocks 2533 MKL_Get_Cpu_Clocks 2533 mkl_get_cpu_frequency 2534 MKL_Get_Cpu_Frequency 2534 mkl_get_dynamic 2528 MKL_Get_Dynamic 2528 mkl_get_max_cpu_frequency 2534 MKL_Get_Max_Cpu_Frequency 2534 mkl_get_max_threads 2526 MKL_Get_Max_Threads 2526 mkl_get_version 2521 MKL_Get_Version 2521 mkl_get_version_string 2523 mkl_malloc usage example 2540 MKL_malloc 2539 mkl_mem_stat usage example 2540 MKL_Mem_Stat 2538 MKL_MemStat 2538 mkl_progress 2542 mkl_set_dynamic 2526 MKL_Set_Dynamic 2526 mkl_set_interface_layer 2545 mkl_set_num_threads 2524 MKL_Set_Num_Threads 2524 mkl_set_progress 2547 mkl_set_threading_layer 2546 mkl_set_xerbla 2546 mkl_thread_free_buffers 2537 MKL_Thread_Free_Buffers 2537 MKLGetVersion 2521 MKLGetVersionString 2523 MPI Multiplicative Congruential Generator 2118 N naming conventions BLAS 51 LAPACK 668, 1536 Nonlinear Optimization Solvers 2496 PBLAS 2374 Sparse BLAS Level 1 140 Sparse BLAS Level 2 151 Sparse BLAS Level 3 151 VML 1970 negative eigenvalues 1778 NegBinomial 2206 Intel® Math Kernel Library Reference Manual 2734 NewStream 2128 NewStreamEx 2129 NewTaskX1D 
2228 Nonsymmetric Eigenproblems 1019 O off-diagonal elements initialization 1817 LAPACK 1361 ScaLAPACK 1817 one-dimensional FFTs storage effects 2341–2343 orthogonal matrix CS decomposition LAPACK 920, 925, 1060 from LQ factorization LAPACK 1396 ScaLAPACK 1836 from QL factorization LAPACK 1394, 1399 ScaLAPACK 1833, 1840 from QR factorization LAPACK 1395 ScaLAPACK 1835 from RQ factorization LAPACK 1397 ScaLAPACK 1838 P p?agemv 2389 p?ahemv 2397 p?amax 2376 p?asum 2377 p?asymv 2404 p?atrmv 2410 p?axpy 2378 p?copy 2379 p?dbsv 1687 p?dbtrf 1542 p?dbtrs 1553 p?dbtrsv 1746 p?dot 2380 p?dotc 2381 p?dotu 2382 p?dtsv 1689 p?dttrf 1543 p?dttrs 1555 p?dttrsv 1748 p?gbsv 1685 p?gbtrf 1540 p?gbtrs 1551 p?geadd 2415 p?gebd2 1751 p?gebrd 1666 p?gecon 1564 p?geequ 1583 p?gehd2 1754 p?gehrd 1657 p?gelq2 1756 p?gelqf 1598 p?gels 1701 p?gemm 2418 p?gemv 2387 p?geql2 1758 p?geqlf 1608 p?geqpf 1589 p?geqr2 1760 p?geqrf 1587 p?ger 2391 p?gerc 2393 p?gerfs 1570 p?gerq2 1762 p?gerqf 1617 p?geru 2394 p?gesv 1679 p?gesvd 1723 p?gesvx 1681 p?getf2 1763 p?getrf 1538 p?getri 1578 p?getrs 1550 p?ggqrf 1633 p?ggrqf 1636 p?heev 1713 p?heevd 1715 p?heevx 1717 p?hegst 1677 p?hegvx 1732 p?hemm 2420 p?hemv 2396 p?her 2399 p?her2 2400 p?her2k 2424 p?herk 2422 p?hetrd 1646 p?labad 1879 p?labrd 1765 p?lachkieee 1880 p?lacon 1768 p?laconsb 1769 p?lacp2 1770 p?lacp3 1772 p?lacpy 1773 p?laevswp 1774 p?lahqr 1664 p?lahrd 1775 p?laiect 1778 p?lamch 1881 p?lange 1779 p?lanhs 1780 p?lantr 1783 p?lapiv 1785 p?laqge 1787 p?laqsy 1789 p?lared1d 1791 p?lared2d 1792 p?larf 1793 p?larfb 1795 p?larfc 1798 p?larfg 1800 p?larft 1802 p?larz 1804 p?larzb 1807 p?larzt 1813 p?lascl 1815 p?laset 1817 p?lasmsub 1818 p?lasnbt 1882 p?lassq 1819 p?laswp 1821 p?latra 1822 p?latrd 1823 p?latrz 1828 p?lauu2 1830 p?lauum 1831 p?lawil 1832 p?max1 1744 p?nrm2 2383 Index 2735 p?org2l/p?ung2l 1833 p?org2r/p?ung2r 1835 p?orgl2/p?ungl2 1836 p?orglq 1600 p?orgql 1609 p?orgqr 1591 p?orgr2/p?ungr2 1838 p?orgrq 1619 p?orm2l/p?unm2l 1840 
p?orm2r/p?unm2r 1843 p?ormbr 1669 p?ormhr 1659 p?orml2/p?unml2 1846 p?ormlq 1603 p?ormql 1612 p?ormqr 1594 p?ormr2/p?unmr2 1849 p?ormrq 1622 p?ormrz 1628 p?ormtr 1643 p?pbsv 1697 p?pbtrf 1546 p?pbtrs 1558 p?pbtrsv 1851 p?pocon 1566 p?poequ 1584 p?porfs 1573 p?posv 1691 p?posvx 1693 p?potf2 1857 p?potrf 1545 p?potri 1580 p?potrs 1557 p?ptsv 1699 p?pttrf 1548 p?pttrs 1560 p?pttrsv 1854 p?rscl 1858 p?scal 2384 p?stebz 1651 p?stein 1653 p?sum1 1745 p?swap 2385 p?syev 1704 p?syevd 1706 p?syevx 1708 p?sygs2/p?hegs2 1859 p?sygst 1676 p?sygvx 1726 p?symm 2426 p?symv 2402 p?syr 2406 p?syr2 2407 p?syr2k 2430 p?syrk 2428 p?sytd2/p?hetd2 1861 p?sytrd 1640 p?tradd 2416 p?tran 2432 p?tranc 2434 p?tranu 2433 p?trcon 1568 p?trmm 2435 p?trmv 2409 p?trrfs 1576 p?trsm 2437 p?trsv 2413 p?trti2 1864 p?trtri 1581 p?trtrs 1562 p?tzrzf 1626 p?unglq 1602 p?ungql 1611 p?ungqr 1592 p?ungrq 1620 p?unmbr 1672 p?unmhr 1662 p?unmlq 1605 p?unmql 1615 p?unmqr 1596 p?unmrq 1624 p?unmrz 1631 p?unmtr 1648 Packed formats 2347 packed storage scheme 2646 parallel direct solver (Pardiso) 1885 parallel direct sparse solver interface pardiso 1886 pardiso_64 1903 pardiso_getenv 1904 pardiso_setenv 1904 pardisoinit 1902 parameters for a Givens rotation 64 modified Givens transformation 67 pardiso 1886 PARDISO parameters 1905 pardiso_64 1903 pardiso_getenv 1904 pardiso_setenv 1904 PARDISO* solver 1885 pardisoinit 1902 Partial Differential Equations support Helmholtz problem on a sphere 2459 Poisson problem on a sphere 2460 three-dimensional Helmholtz problem 2461 three-dimensional Laplace problem 2461 three-dimensional Poisson problem 2461 two-dimensional Helmholtz problem 2458 two-dimensional Laplace problem 2459 two-dimensional Poisson problem 2458 PBLAS Level 1 functions p?amax 2376 p?asum 2377 p?dot 2380 p?dotc 2381 p?dotu 2382 p?nrm2 2383 PBLAS Level 1 routines p?amax 2375 p?asum 2375 p?axpy 2375, 2378 p?copy 2375, 2379 p?dot 2375 p?dotc 2375 p?dotu 2375 p?nrm2 2375 p?scal 2375, 2384 p?swap 2375, 2385 
PBLAS Level 2 routines ?agemv 2386 ?asymv 2386 ?gemv 2386 ?ger 2386 ?gerc 2386 ?geru 2386 ?hemv 2386 ?her 2386 ?her2 2386 ?symv 2386 Intel® Math Kernel Library Reference Manual 2736 ?syr 2386 ?syr2 2386 ?trmv 2386 ?trsv 2386 p?agemv 2389 p?ahemv 2397 p?asymv 2404 p?atrmv 2410 p?gemv 2387 p?ger 2391 p?gerc 2393 p?geru 2394 p?hemv 2396 p?her 2399 p?her2 2400 p?symv 2402 p?syr 2406 p?syr2 2407 p?trmv 2409 p?trsv 2413 PBLAS Level 3 routines p?geadd 2415 p?gemm 2414, 2418 p?hemm 2414, 2420 p?her2k 2414, 2424 p?herk 2414, 2422 p?symm 2414, 2426 p?syr2k 2414, 2430 p?syrk 2414, 2428 p?tradd 2416 p?tran 2432 p?tranc 2434 p?tranu 2433 p?trmm 2414, 2435 p?trsm 2414, 2437 PBLAS routines routine groups pcagemv 2389 pcahemv 2397 pcamax 2376 pcatrmv 2410 pcaxpy 2378 pccopy 2379 pcdotc 2381 pcdotu 2382 pcgeadd 2415 pcgecon 1564 pcgemm 2418 pcgemv 2387 pcgerc 2393 pcgeru 2394 pchemm 2420 pchemv 2396 pcher 2399 pcher2 2400 pcher2k 2424 pcherk 2422 pcnrm2 2383 pcscal 2384 pcsscal 2384 pcswap 2385 pcsymm 2426 pcsyr2k 2430 pcsyrk 2428 pctradd 2416 pctranu 2433 pctrmm 2435 pctrmv 2409 pctrsm 2437 pctrsv 2413 pdagemv 2389 pdamax 2376 pdasum 2377 pdasymv 2404 pdatrmv 2410 pdaxpy 2378 pdcopy 2379 pddot 2380 PDE support pdgeadd 2415 pdgecon 1564 pdgemm 2418 pdgemv 2387 pdger 2391 pdlaiectb 1778 pdlaiectl 1778 pdnrm2 2383 pdscal 2384 pdswap 2385 pdsymm 2426 pdsymv 2402 pdsyr 2406 pdsyr2 2407 pdsyr2k 2430 pdsyrk 2428 pdtradd 2416 pdtran 2432 pdtranc 2434 pdtrmm 2435 pdtrmv 2409 pdtrsm 2437 pdtrsv 2413 pdzasum 2377 permutation matrix 2630 picopy 2379 pivoting matrix rows or columns 1785 PL Interface 2457 points rotation in the modified plane 65 in the plane 63 Poisson 2202 Poisson Library routines ?_commit_Helmholtz_2D 2467 ?_commit_Helmholtz_3D 2467 ?_commit_sph_np 2476 ?_commit_sph_p 2476 ?_Helmholtz_2D 2470 ?_Helmholtz_3D 2470 ?_init_Helmholtz_2D 2465 ?_init_Helmholtz_3D 2465 ?_init_sph_np 2475 ?_init_sph_p 2475 ?_sph_np 2478 ?_sph_p 2478 free_Helmholtz_2D 2474 free_Helmholtz_3D 2474 
free_sph_np 2480 free_sph_p 2480 structure 2457 Poisson problem on a sphere 2460 three-dimensional 2461 two-dimensional 2458 PoissonV 2204 pprfs 478 pptrs 396 preconditioned Jacobi SVD 1045 preconditioners based on incomplete LU factorization dcsrilu0 1961 Index 2737 dcsrilut 1963 Preconditioners Interface Description 1960 process grid 1535, 2373 product distributed matrix-vector general matrix 2387, 2389 distributed vector-scalar 2384 matrix-vector distributed Hermitian matrix 2396, 2397 distributed symmetric matrix 2402, 2404 distributed triangular matrix 2409, 2410 general matrix 75, 77, 329, 331, 1468 Hermitian indefinite matrix 1478 Hermitian matrix 84, 86, 91 real symmetric matrix 98, 102 symmetric indefinite matrix 1505 triangular matrix 107, 112, 115 scalar-matrix general distributed matrix 2418 general matrix 119, 333 Hermitian distributed matrix 2420 Hermitian matrix 122 scalar-matrix-matrix general distributed matrix 2418 general matrix 119, 333 Hermitian distributed matrix 2420 Hermitian matrix 122 symmetric distributed matrix 2426 symmetric matrix 128 triangular distributed matrix 2435 triangular matrix 135 vector-scalar 69 product:matrix-vector general matrix band storage 75 Hermitian matrix band storage 84 packed storage 91 real symmetric matrix packed storage 98 symmetric matrix band storage 95 triangular matrix band storage 107 packed storage 112 psagemv 2389 psamax 2376 psasum 2377 psasymv 2404 psatrmv 2410 psaxpy 2378 pscasum 2377 pscopy 2379 psdot 2380 pseudorandom numbers psgeadd 2415 psgecon 1564 psgemm 2418 psgemv 2387 psger 2391 pslaiect 1778 psnrm2 2383 psscal 2384 psswap 2385 pssymm 2426 pssymv 2402 pssyr 2406 pssyr2 2407 pssyr2k 2430 pssyrk 2428 pstradd 2416 pstran 2432 pstranc 2434 pstrmm 2435 pstrmv 2409 pstrsm 2437 pstrsv 2413 pxerbla 1882, 2530 pzagemv 2389 pzahemv 2397 pzamax 2376 pzatrmv 2410 pzaxpy 2378 pzcopy 2379 pzdotc 2381 pzdotu 2382 pzdscal 2384 pzgeadd 2415 pzgecon 1564 pzgemm 2418 pzgemv 2387 pzgerc 2393 pzgeru 2394 pzhemm 
2420 pzhemv 2396 pzher 2399 pzher2 2400 pzher2k 2424 pzherk 2422 pznrm2 2383 pzscal 2384 pzswap 2385 pzsymm 2426 pzsyr2k 2430 pzsyrk 2428 pztradd 2416 pztranu 2433 pztrmm 2435 pztrmv 2409 pztrsm 2437 pztrsv 2413 Q QL factorization computing the elements of complex matrix Q 704 orthogonal matrix Q 1609 real matrix Q 702 unitary matrix Q 1611 general rectangular matrix LAPACK 1171 ScaLAPACK 1758 multiplying general matrix by orthogonal matrix Q 1612 unitary matrix Q 1615 QR factorization computing the elements of orthogonal matrix Q 681, 1591 unitary matrix Q 685, 1592 general rectangular matrix LAPACK 1172, 1174, 1175 ScaLAPACK 1760, 1762 with pivoting ScaLAPACK 1589 quasi-random numbers quasi-triangular matrix LAPACK 833, 877 ScaLAPACK 1656 quasi-triangular system of equations 1289 Intel® Math Kernel Library Reference Manual 2738 R random number generators 2115 random stream 2123 random stream descriptor 2117 Random Streams 2123 rank-1 update conjugated, distributed general matrix 2393 conjugated, general matrix 81 distributed general matrix 2391 distributed Hermitian matrix 2399 distributed symmetric matrix 2406 general matrix 79 Hermitian matrix packed storage 92 real symmetric matrix packed storage 99 unconjugated, distributed general matrix 2394 unconjugated, general matrix 82 rank-2 update distributed Hermitian matrix 2400 distributed symmetric matrix 2407 Hermitian matrix packed storage 94 symmetric matrix packed storage 101 rank-2k update Hermitian distributed matrix 2424 Hermitian matrix 126 symmetric distributed matrix 2430 symmetric matrix 133 rank-k update distributed Hermitian matrix 2422 Hermitian matrix 124 symmetric distributed matrix 2428 rank-n update symmetric matrix 131 Rayleigh 2175 RCI CG Interface 1933 RCI CG sparse solver routines dcg 1946, 1950 dcg_check 1946 dcg_get 1948 dcg_init 1945 dcgmrhs_check 1949 dcgmrhs_get 1952 dcgmrhs_init 1948 RCI FGMRES Interface 1938 RCI FGMRES sparse solver routines dfgmres_check 1953 dfgmres_get 1956 
dfgmres_init 1952 RCI FGMRES sparse solver routines dfgmres 1954 RCI ISS 1932 RCI ISS interface 1932 RCI ISS sparse solver routines implementation details 1957 real matrix QR factorization with pivoting 1269 real symmetric matrix 1-norm value 1252 Frobenius norm 1252 infinity-norm 1252 largest absolute value of element 1252 real symmetric tridiagonal matrix 1-norm value 1251 Frobenius norm 1251 infinity-norm 1251 largest absolute value of element 1251 reducing generalized eigenvalue problems LAPACK 820 ScaLAPACK 1676 reduction to upper Hessenberg form general matrix 1754 general square matrix 1168 refining solutions of linear equations band matrix 458 banded matrix 461, 1462, 1493 general matrix 449, 452, 1473, 1570 Hermitian indefinite matrix 496, 1482 Hermitian matrix packed storage 504 Hermitian positive-definite matrix band storage 480 packed storage 478 symmetric indefinite matrix 488, 1511 symmetric matrix packed storage 501 symmetric positive-definite matrix band storage 480 packed storage 478 symmetric/Hermitian positive-definite distributed matrix 1573 tridiagonal matrix 467 RegisterBrng 2209 registering a basic generator 2208 reordering of matrices 2631 Reverse Communication Interface 1932 rotation of points in the modified plane 65 of points in the plane 63 of sparse vectors 148 parameters for a Givens rotation 64 parameters of modified Givens transformation 67 routine name conventions BLAS 51 Nonlinear Optimization Solvers 2496 PBLAS 2374 Sparse BLAS Level 1 140 Sparse BLAS Level 2 151 Sparse BLAS Level 3 151 RQ factorization computing the elements of complex matrix Q 714 orthogonal matrix Q 1619 real matrix Q 712 unitary matrix Q 1620 S SaveStreamF 2140 SaveStreamM 2142 sbbcsd 920 sbdsdc 756 ScaLAPACK ScaLAPACK routines 1D array redistribution 1791, 1792 auxiliary routines ?combamax1 1745 ?dbtf2 1872 ?dbtrf 1873 ?dttrf 1874 ?dttrsv 1875 ?lamsh 1866 ?laref 1867 ?lasorte 1868 ?lasrt2 1869 ?pttrsv 1876 ?stein2 1870 ?steqr2 1878 p?dbtrsv 1746 
p?dttrsv 1748 p?gebd2 1751 p?gehd2 1754 p?gelq2 1756 p?geql2 1758 p?geqr2 1760 p?gerq2 1762 p?getf2 1763 p?labrd 1765 p?lacgv 1743 p?lacon 1768 p?laconsb 1769 p?lacp2 1770 p?lacp3 1772 p?lacpy 1773 p?laevswp 1774 p?lahrd 1775 p?laiect 1778 p?lange 1779 p?lanhs 1780 p?lansy, p?lanhe 1782 p?lantr 1783 p?lapiv 1785 p?laqge 1787 p?laqsy 1789 p?lared1d 1791 p?lared2d 1792 p?larf 1793 p?larfb 1795 p?larfc 1798 p?larfg 1800 p?larft 1802 p?larz 1804 p?larzb 1807 p?larzc 1809 p?larzt 1813 p?lascl 1815 p?laset 1817 p?lasmsub 1818 p?lassq 1819 p?laswp 1821 p?latra 1822 p?latrd 1823 p?latrs 1826 p?latrz 1828 p?lauu2 1830 p?lauum 1831 p?lawil 1832 p?max1 1744 p?org2l/p?ung2l 1833 p?org2r/p?ung2r 1835 p?orgl2/p?ungl2 1836 p?orgr2/p?ungr2 1838 p?orm2l/p?unm2l 1840 p?orm2r/p?unm2r 1843 p?orml2/p?unml2 1846 p?ormr2/p?unmr2 1849 p?pbtrsv 1851 p?potf2 1857 p?pttrsv 1854 p?rscl 1858 p?sum1 1745 p?sygs2/p?hegs2 1859 p?sytd2/p?hetd2 1861 p?trti2 1864 pdlaiectb 1778 pdlaiectl 1778 pslaiect 1778 block reflector triangular factor 1802, 1813 Cholesky factorization 1548 complex matrix complex elementary reflector 1809 complex vector 1-norm using true absolute value 1745 complex vector conjugation 1743 condition number estimation p?gecon 1564 p?pocon 1566 p?trcon 1568 driver routines p?dbsv 1687 p?dtsv 1689 p?gbsv 1685 p?gels 1701 p?gesv 1679 p?gesvd 1723 p?gesvx 1681 p?heev 1713 p?heevd 1715 p?heevx 1717 p?hegvx 1732 p?pbsv 1697 p?posv 1691 p?posvx 1693 p?ptsv 1699 p?syev 1704 p?syevd 1706 p?syevx 1708 p?sygvx 1726 error estimation p?trrfs 1576 error handling pxerbla 1882, 2530 general matrix block reflector 1807 elementary reflector 1804 LU factorization 1763 reduction to upper Hessenberg form 1754 general rectangular matrix elementary reflector 1793 LQ factorization 1756 QL factorization 1758 QR factorization 1760 reduction to bidiagonal form 1765 reduction to real bidiagonal form 1751 row interchanges 1821 RQ factorization 1762 generalized eigenvalue problems p?hegst 1677 p?sygst 1676 
Householder matrix elementary reflector 1800 LQ factorization p?gelq2 1756 p?gelqf 1598 p?orglq 1600 p?ormlq 1603 p?unglq 1602 p?unmlq 1605 LU factorization p?dbtrsv 1746 p?dttrf 1543 p?dttrsv 1748 Intel® Math Kernel Library Reference Manual 2740 p?getf2 1763 matrix equilibration p?geequ 1583 p?poequ 1584 matrix inversion p?getri 1578 p?potri 1580 p?trtri 1581 nonsymmetric eigenvalue problems p?gehrd 1657 p?lahqr 1664 p?ormhr 1659 p?unmhr 1662 QL factorization ?geqlf 1608 ?ungql 1611 p?geql2 1758 p?orgql 1609 p?ormql 1612 p?unmql 1615 QR factorization p?geqpf 1589 p?geqr2 1760 p?ggqrf 1633 p?orgqr 1591 p?ormqr 1594 p?ungqr 1592 p?unmqr 1596 RQ factorization p?gerq2 1762 p?gerqf 1617 p?ggrqf 1636 p?orgrq 1619 p?ormrq 1622 p?ungrq 1620 p?unmrq 1624 RZ factorization p?ormrz 1628 p?tzrzf 1626 p?unmrz 1631 singular value decomposition p?gebrd 1666 p?ormbr 1669 p?unmbr 1672 solution refinement and error estimation p?gerfs 1570 p?porfs 1573 solving linear equations ?dttrsv 1875 ?pttrsv 1876 p?dbtrs 1553 p?dttrs 1555 p?gbtrs 1551 p?getrs 1550 p?potrs 1557 p?pttrs 1560 p?trtrs 1562 symmetric eigenproblems p?hetrd 1646 p?ormtr 1643 p?stebz 1651 p?stein 1653 p?sytrd 1640 p?unmtr 1648 symmetric eigenvalue problems ?stein2 1870 ?steqr2 1878 trapezoidal matrix 1828 triangular factorization ?dbtrf 1873 ?dttrf 1874 p?dbtrsv 1746 p?dttrsv 1748 p?gbtrf 1540 p?getrf 1538 p?pbtrf 1546 p?potrf 1545 p?pttrf 1548 triangular system of equations 1826 updating sum of squares 1819 utility functions and routines p?labad 1879 p?lachkieee 1880 p?lamch 1881 p?lasnbt 1882 pxerbla 1882, 2530 scalar-matrix product 119, 122, 128, 333, 2418, 2420, 2426 scalar-matrix-matrix product general distributed matrix 2418 general matrix 119, 333 symmetric distributed matrix 2426 symmetric matrix 128 triangular distributed matrix 2435 triangular matrix 135 scaling general rectangular matrix 1787 symmetric/Hermitian matrix 1789 scaling factors general rectangular distributed matrix 1583 Hermitian positive 
definite distributed matrix 1584 symmetric positive definite distributed matrix 1584 scattering compressed sparse vector's elements into full storage form 149 Schur decomposition 894, 896 Schur factorization 1223, 1224, 1259 scsum1 1165 second/dsecnd 2532 Service Functions 1972 Service Routines 2127 SetInternalDecimation 2237 sgbcon 422 sgbrfsx 461 sgbsvx 576 sgbtrs 387 sgecon 420 sgejsv 1045 sgeqpf 676 sgesvj 1051 sgtrfs 467 shgeqz 885 shseqr 851 simple driver 1536 Single Dynamic Library mkl_set_interface_layer 2545 mkl_set_progress 2547 mkl_set_threading_layer 2546 mkl_set_xerbla 2546 single node matrix 1866 singular value decomposition LAPACK 734 LAPACK routines, singular value decomposition 1666 ScaLAPACK 1666, 1723 See also LAPACK routines, singular value decomposition 734 Singular Value Decomposition 1037 sjacobi 2515 sjacobi_delete 2514 sjacobi_init 2512 sjacobi_solve 2513 Index 2741 sjacobix 2516 SkipAheadStream 2148 sla_gbamv 1455 sla_gbrcond 1457 sla_gbrfsx_extended 1462 sla_gbrpvgrw 1467 sla_geamv 1468 sla_gercond 1470 sla_gerfsx_extended 1473 sla_lin_berr 1488 sla_porcond 1489 sla_porfsx_extended 1493 sla_porpvgrw 1498 sla_rpvgrw 1503 sla_syamv 1505 sla_syrcond 1507 sla_syrfsx_extended 1511 sla_syrpvgrw 1516 sla_wwaddw 1517 slag2d 1428 slapmr 1260 slapmt 1262 slarfb 1295 slarft 1300 slarscl2 1504 slartgp 1324 slartgs 1326 slascl2 1504 slatps 1383 slatrd 1385 slatrs 1387 slatrz 1390 slauu2 1392 slauum 1393 small subdiagonal element 1818 smallest absolute value of a vector element 72 sNewAbstractStream 2135 solver direct 2629 iterative 2629 Solver Sparse 1885 solving linear equations 387 solving linear equations. linear equations 1551 solving linear equations. 
See linear equations 1232 sorbdb 925 sorcsd 1060 sorg2l 1394 sorg2r 1395 sorgl2 1396 sorgr2 1397 sorm2l 1399 sorm2r 1400 sorml2 1402 sormr2 1404 sormr3 1405 sorting eigenpairs 1868 numbers in increasing/decreasing order LAPACK 1371 ScaLAPACK 1869 Sparse BLAS Level 1 data types 140 naming conventions 140 Sparse BLAS Level 1 routines and functions ?axpyi 141 ?dotci 144 ?doti 143 ?dotui 145 ?gthr 146 ?gthrz 147 ?roti 148 ?sctr 149 Sparse BLAS Level 2 naming conventions 151 sparse BLAS Level 2 routines mkl_?bsrgemv 164 mkl_?bsrmv 218 mkl_?bsrsv 232 mkl_?bsrsymv 173 mkl_?bsrtrsv 184 mkl_?coogemv 166 mkl_?coomv 225 mkl_?coosv 239 mkl_?coosymv 176 mkl_?cootrsv 186 mkl_?cscmv 222 mkl_?cscsv 235 mkl_?csrgemv 161 mkl_?csrmv 215 mkl_?csrsv 228 mkl_?csrsymv 171 mkl_?csrtrsv 181 mkl_?diagemv 169 mkl_?diamv 272 mkl_?diasv 278 mkl_?diasymv 178 mkl_?diatrsv 189 mkl_?skymv 275 mkl_?skysv 281 mkl_cspblas_?bsrgemv 194 mkl_cspblas_?bsrsymv 202 mkl_cspblas_?bsrtrsv 209 mkl_cspblas_?coogemv 197 mkl_cspblas_?coosymv 204 mkl_cspblas_?cootrsv 212 mkl_cspblas_?csrgemv 192 mkl_cspblas_?csrsymv 199 mkl_cspblas_?csrtrsv 207 Sparse BLAS Level 3 naming conventions 151 sparse BLAS Level 3 routines mkl_?bsrmm 246 mkl_?bsrsm 268 mkl_?coomm 254 mkl_?coosm 265 mkl_?cscmm 250 mkl_?cscsm 261 mkl_?csradd 316 mkl_?csrmm 242 mkl_?csrmultcsr 320 mkl_?csrmultd 324 mkl_?csrsm 257 mkl_?diamm 284 mkl_?diasm 291 mkl_?skymm 288 mkl_?skysm 295 sparse BLAS routines mkl_?csrbsr 304 mkl_?csrcoo 301 mkl_?csrcsc 307 mkl_?csrdia 309 mkl_?csrsky 313 mkl_?dnscsr 298 sparse matrices 151 sparse matrix 151 Sparse Matrix Storage Formats 152 sparse solver parallel direct sparse solver interface pardiso 1886 Intel® Math Kernel Library Reference Manual 2742 pardiso_64 1903 pardiso_getenv 1904 pardiso_setenv 1904 pardisoinit 1902 Sparse Solver direct sparse solver interface dss_create 1916 dss_define_structure dss_define_structure 1918 dss_delete 1926 dss_factor 1921 dss_factor_complex 1921 dss_factor_real 1921 dss_reorder 1920 
dss_solve 1923 dss_solve_complex 1923 dss_solve_real 1923 dss_statistics 1927 mkl_cvt_to_null_terminated_str 1930 iterative sparse solver interface dcg 1946 dcg_check 1946 dcg_get 1948 dcg_init 1945 dcgmrhs 1950 dcgmrhs_check 1949 dcgmrhs_get 1952 dcgmrhs_init 1948 dfgmres 1954 dfgmres_check 1953 dfgmres_get 1956 dfgmres_init 1952 preconditioners based on incomplete LU factorization dcsrilu0 1961 dcsrilut 1963 Sparse Solvers 1905 sparse vectors adding and scaling 141 complex dot product, conjugated 144 complex dot product, unconjugated 145 compressed form 140 converting to compressed form 146, 147 converting to full-storage form 149 full-storage form 140 Givens rotation 148 norm 140 passed to BLAS level 1 routines 140 real dot product 143 scaling 140 spbtf2 1407 specific hardware support mkl_enable_instructions 2544 Spline Methods 2606 split Cholesky factorization (band matrices) 831 sporfsx 472 spotf2 1408 spprfs 478 spptrs 396 sptts2 1409 square matrix 1-norm estimation LAPACK 1184, 1185 ScaLAPACK 1768 srscl 1411 ssyconv 436 ssygs2 1415 ssyswapr 1411 ssyswapr1 1414 ssytd2 1417 ssytf2 1418 ssytri2 523 ssytri2x 527 ssytrs2 406 stgex2 1421 stgsy2 1423 stream 2123 strexc 868 stride. 
increment 2645 strnlsp_check 2499 strnlsp_delete 2503 strnlsp_get 2502 strnlsp_init 2497 strnlsp_solve 2500 strnlspbc_check 2506 strnlspbc_delete 2511 strnlspbc_get 2510 strnlspbc_init 2505 strnlspbc_solve 2508 strti2 1426 sum of distributed vectors 2378 of magnitudes of elements of a distributed vector 2377 of magnitudes of the vector elements 54 of sparse vector and full-storage vector 141 of vectors 55, 327 sum of squares updating LAPACK 1372 ScaLAPACK 1819 summary statistics vsldsscompute 2302 vsldSSCompute 2302 vsldsseditcorparameterization 2298 vsldSSEditCorParameterization 2298 vsldsseditcovcor 2280 vsldSSEditCovCor 2280 vsldsseditmissingvalues 2294 vsldSSEditMissingValues 2294 vsldsseditmoments 2278 vsldSSEditMoments 2278 vsldsseditoutliersdetection 2292 vsldSSEditOutliersDetection 2292 vsldsseditpartialcovcor 2282 vsldSSEditPartialCovCor 2282 vsldsseditpooledcovariance 2287 vsldSSEditPooledCovariance 2287 vsldsseditquantiles 2284 vsldSSEditQuantiles 2284 vsldsseditrobustcovariance 2289 vsldSSEditRobustCovariance 2289 vsldsseditstreamquantiles 2286 vsldSSEditStreamQuantiles 2286 vsldssedittask 2270 vsldSSEditTask 2270 vsldssnewtask 2267 vsldSSNewTask 2267 vslissedittask 2270 vsliSSEditTask 2270 vslssdeletetask 2303 vslSSDeleteTask 2303 vslssscompute 2302 vslsSSCompute 2302 vslssseditcorparameterization 2298 vslsSSEditCorParameterization 2298 vslssseditcovcor 2280 vslsSSEditCovCor 2280 vslssseditmissingvalues 2294 vslsSSEditMissingValues 2294 Index 2743 vslssseditmoments 2278 vslsSSEditMoments 2278 vslssseditoutliersdetection 2292 vslsSSEditOutliersDetection 2292 vslssseditpartialcovcor 2282 vslsSSEditPartialCovCor 2282 vslssseditpooledcovariance 2287 vslsSSEditPooledCovariance 2287 vslssseditquantiles 2284 vslsSSEditQuantiles 2284 vslssseditrobustcovariance 2289 vslsSSEditRobustCovariance 2289 vslssseditstreamquantiles 2286 vslsSSEditStreamQuantiles 2286 vslsssedittask 2270 vslsSSEditTask 2270 vslsssnewtask 2267 vslsSSNewTask 2267 summary statistics usage 
examples 2304 support functions mkl_free 2540 mkl_malloc 2539 mkl_mem_stat 2538 mkl_progress 2542 support routines mkl_disable_fast_mm 2538 mkl_free_buffers 2536 mkl_thread_free_buffers 2537 progress information 2542 SVD (singular value decomposition) LAPACK 734 ScaLAPACK 1666 swapping adjacent diagonal blocks 1213, 1421 swapping distributed vectors 2385 swapping vectors 70 Sylvester's equation 874 symmetric band matrix 1-norm value 1247 Frobenius norm 1247 infinity- norm 1247 largest absolute value of element 1247 symmetric distributed matrix rank-n update 2428, 2430 scalar-matrix-matrix product 2426 Symmetric Eigenproblems 948 symmetric indefinite matrix factorization with diagonal pivoting method 1418 matrix-vector product 1505 symmetric matrix Bunch-Kaufman factorization packed storage 381 eigenvalues and eigenvectors 1704, 1706, 1708 estimating the condition number packed storage 439 generalized eigenvalue problems 819 inverting the matrix packed storage 530 matrix-vector product band storage 95 packed storage 98, 1159 rank-1 update packed storage 99, 1161 rank-2 update packed storage 101 rank-2k update 133 rank-n update 131 reducing to standard form LAPACK 1415 ScaLAPACK 1859 reducing to tridiagonal form LAPACK 1385 ScaLAPACK 1823 scalar-matrix-matrix product 128 scaling 1789 solving systems of linear equations packed storage 409 symmetric matrix in packed form 1-norm value 1249 Frobenius norm 1249 infinity- norm 1249 largest absolute value of element 1249 symmetric positive definite distributed matrix computing scaling factors 1584 equilibration 1584 symmetric positive semidefinite matrix Cholesky factorization 366 symmetric positive-definite band matrix Cholesky factorization 1407 symmetric positive-definite distributed matrix inverting the matrix 1580 symmetric positive-definite matrix Cholesky factorization band storage 371, 1546 LAPACK 1408 packed storage 369 ScaLAPACK 1545, 1857 estimating the condition number band storage 430 packed storage 428 
tridiagonal matrix 432 inverting the matrix packed storage 519 solving systems of linear equations band storage 398, 1558 LAPACK 393 packed storage 396 ScaLAPACK 1557 symmetric positive-definite tridiagonal matrix solving systems of linear equations 1560 system of linear equations with a distributed triangular matrix 2413 with a triangular matrix band storage 109 packed storage 113 systems of linear equations linear equations 1875 systems of linear equationslinear equations 1550 syswapr 1411 syswapr1 1414 sytri2 523 sytri2x 527 T Task Computation Routines 2606 Task Creation and Initialization NewTask1d 2592 Task Status 2590 threading control mkl_domain_get_max_threads 2527 mkl_domain_set_num_threads 2525 mkl_get_dynamic 2528 mkl_get_max_threads 2526 mkl_set_dynamic 2526 mkl_set_num_threads 2524 Threading Control 2524 timing functions mkl_get_clocks_frequency 2535 MKL_Get_Cpu_Clocks 2533 Intel® Math Kernel Library Reference Manual 2744 mkl_get_cpu_frequency 2534 mkl_get_max_cpu_frequency 2534 second/dsecnd 2532 TR routines ?trnlsp_check 2499 ?trnlsp_delete 2503 ?trnlsp_get 2502 ?trnlsp_init 2497 ?trnlsp_solve 2500 ?trnlspbc_check 2506 ?trnlspbc_delete 2511 ?trnlspbc_get 2510 ?trnlspbc_init 2505 ?trnlspbc_solve 2508 nonlinear least squares problem with linear bound constraints 2504 without constraints 2496 organization and implementation 2495 transposition distributed complex matrix 2433 distributed complex matrix, conjugated 2434 distributed real matrix 2432 Transposition and General Memory Movement Routines 327 transposition parameter 2648 trapezoidal matrix 1-norm value 1257 Frobenius norm 1257 infinity- norm 1257 largest absolute value of element 1257 reduction to triangular form 1828 RZ factorization LAPACK 720 ScaLAPACK 1626 trexc 868 triangular band matrix 1-norm value 1255 Frobenius norm 1255 infinity- norm 1255 largest absolute value of element 1255 triangular banded equations LAPACK 1380 ScaLAPACK 1851 triangular distributed matrix inverting the matrix 1581 
scalar-matrix-matrix product 2435 triangular factorization band matrix 359, 1540, 1542, 1746, 1873 diagonally dominant tridiagonal matrix LAPACK 363 general matrix 357, 1538 Hermitian matrix packed storage 383 Hermitian positive semidefinite matrix 366 Hermitian positive-definite matrix band storage 371, 1546 packed storage 369 tridiagonal matrix 373, 1548 symmetric matrix packed storage 381 symmetric positive semidefinite matrix 366 symmetric positive-definite matrix band storage 371, 1546 packed storage 369 tridiagonal matrix 373, 1548 tridiagonal matrix LAPACK 361 ScaLAPACK 1874 triangular matrix 1-norm value LAPACK 1257 ScaLAPACK 1783 copying 1444–1446, 1448–1450 estimating the condition number band storage 447 packed storage 445 Frobenius norm LAPACK 1257 ScaLAPACK 1783 infinity- norm LAPACK 1257 ScaLAPACK 1783 inverting the matrix LAPACK 1426 packed storage 536 ScaLAPACK 1864 largest absolute value of element LAPACK 1257 ScaLAPACK 1783 matrix-vector product band storage 107 packed storage 112 product blocked algorithm 1393, 1831 LAPACK 1392, 1393 ScaLAPACK 1830, 1831 unblocked algorithm 1392 ScaLAPACK 1656 scalar-matrix-matrix product 135 solving systems of linear equations band storage 109, 418 packed storage 113, 416 ScaLAPACK 1562 swapping adjacent diagonal blocks 1421 triangular matrix factorization Hermitian positive-definite matrix 364 symmetric positive-definite matrix 364 triangular matrix in packed form 1-norm value 1256 Frobenius norm 1256 infinity- norm 1256 largest absolute value of element 1256 triangular system of equations solving with scale factor LAPACK 1387 ScaLAPACK 1826 tridiagonal matrix estimating the condition number 424 solving systems of linear equations ScaLAPACK 1875 tridiagonal system of equations 1409 tridiagonal triangular factorization band matrix 1748 tridiagonal triangular system of equations 1854 trigonometric transform backward cosine 2442 backward sine 2442 backward staggered cosine 2443 backward staggered sine 2442 
backward twice staggered cosine 2443 backward twice staggered sine 2442 forward cosine 2442 forward sine 2442 forward staggered cosine 2443 forward staggered sine 2442 forward twice staggered cosine 2443 forward twice staggered sine 2442 Trigonometric Transform interface routines ?_backward_trig_transform 2450 Index 2745 ?_commit_trig_transform 2446 ?_forward_trig_transform 2448 ?_init_trig_transform 2445 free_trig_transform 2451 Trigonometric Transforms interface 2445 TT interface 2441 TT routines 2445 two matrices QR factorization LAPACK 728 ScaLAPACK 1633 U ungbr 747 Uniform (continuous) 2156 Uniform (discrete) 2189 UniformBits 2191 UniformBits32 2192 UniformBits64 2193 unitary matrix CS decomposition LAPACK 920, 925, 1060 from LQ factorization LAPACK 1396 ScaLAPACK 1836 from QL factorization LAPACK 1394, 1399 ScaLAPACK 1833, 1840 from QR factorization LAPACK 1395 ScaLAPACK 1835 from RQ factorization LAPACK 1397 ScaLAPACK 1838 ScaLAPACK 1656, 1666 Unpack Functions 1972 updating rank-1 distributed general matrix 2391 distributed Hermitian matrix 2399 distributed symmetric matrix 2406 general matrix 79 Hermitian matrix 87, 92 real symmetric matrix 99, 104 rank-1, conjugated distributed general matrix 2393 general matrix 81 rank-1, unconjugated distributed general matrix 2394 general matrix 82 rank-2 distributed Hermitian matrix 2400 distributed symmetric matrix 2407 Hermitian matrix 89, 94 symmetric matrix 101, 106 rank-2k Hermitian distributed matrix 2424 Hermitian matrix 126 symmetric distributed matrix 2430 symmetric matrix 133 rank-k distributed Hermitian matrix 2422 Hermitian matrix 124 symmetric distributed matrix 2428 rank-n symmetric matrix 131 updating:rank-1 Hermitian matrix packed storage 92 real symmetric matrix packed storage 99 updating:rank-2 Hermitian matrix packed storage 94 symmetric matrix packed storage 101 upper Hessenberg matrix 1-norm value LAPACK 1246 ScaLAPACK 1780 Frobenius norm LAPACK 1246 ScaLAPACK 1780 infinity- norm LAPACK 1246 
ScaLAPACK 1780 largest absolute value of element LAPACK 1246 ScaLAPACK 1780 ScaLAPACK 1656 user time 1529 V v?Abs 1989 v?Acos 2042 v?Acosh 2061 v?Add 1976 v?Arg 1991 v?Asin 2045 v?Asinh 2064 v?Atan 2047 v?Atan2 2050 v?Atanh 2067 v?Cbrt 2004 v?CdfNorm 2075 v?CdfNormInv 2082 v?Ceil 2089 v?CIS 2038 v?Conj 1987 v?Cos 2031 v?Cosh 2052 v?Div 1997 v?Erf 2070 v?Erfc 2073 v?ErfcInv 2080 v?ErfInv 2077 v?Exp 2019 v?Expm1 2022 v?Floor 2088 v?Hypot 2017 v?Inv 1995 v?InvCbrt 2006 v?InvSqrt 2002 v?lgamma 2084 v?LGamma 2084 v?LinearFrac 1993 v?Ln 2024 v?Log10 2027 v?Log1p 2030 v?Modf 2098 v?Mul 1983 v?MulByConj 1986 v?NearbyInt 2094 v?Pack 2100 v?Pow 2011 v?Pow2o3 2007 v?Pow3o2 2009 Intel® Math Kernel Library Reference Manual 2746 v?Powx 2014 v?Rint 2096 v?Round 2093 v?Sin 2034 v?SinCos 2036 v?Sinh 2055 v?Sqr 1981 v?Sqrt 2000 v?Sub 1979 v?Tan 2040 v?Tanh 2058 v?tgamma 2086 v?TGamma 2086 v?Trunc 2091 v?Unpack 2103 vcAdd 1976 vcPackI 2100 vcPackM 2100 vcPackV 2100 vcSin 2034 vcSub 1979 vcUnpackI 2103 vcUnpackM 2103 vcUnpackV 2103 vdAdd 1976 vdlgamma 2084 vdLGamma 2084 vdPackI 2100 vdPackM 2100 vdPackV 2100 vdSin 2034 vdSub 1979 vdtgamma 2086 vdTGamma 2086 vdUnpackI 2103 vdUnpackM 2103 vdUnpackV 2103 vector arguments array dimension 2645 default 2646 examples 2645 increment 2645 length 2645 matrix one-dimensional substructures 2645 sparse vector 140 vector conjugation 1155, 1743 vector indexing 1973 vector mathematical functions absolute value 1989 addition 1976 argument 1991 complementary error function value 2073 complex exponent of real vector elements 2038 computing a rounded integer value and raising inexact result exception 2096 computing a rounded integer value in current rounding mode 2094 computing a truncated integer value 2098 conjugation 1987 cosine 2031 cube root 2004 cumulative normal distribution function value 2075 denary logarithm 2027 division 1997 error function value 2070 exponential 2019 exponential of elements decreased by 1 2022 four-quadrant arctangent 2050 
gamma function 2084, 2086 hyperbolic cosine 2052 hyperbolic sine 2055 hyperbolic tangent 2058 inverse complementary error function value 2080 inverse cosine 2042 inverse cube root 2006 inverse cumulative normal distribution function value 2082 inverse error function value 2077 inverse hyperbolic cosine 2061 inverse hyperbolic sine 2064 inverse hyperbolic tangent 2067 inverse sine 2045 inverse square root 2002 inverse tangent 2047 inversion 1995 linear fraction transformation 1993 multiplication 1983 multiplication of conjugated vector element 1986 natural logarithm 2024 natural logarithm of vector elements increased by 1 2030 power 2011 power (constant) 2014 power 2/3 2007 power 3/2 2009 rounding to nearest integer value 2093 rounding towards minus infinity 2088 rounding towards plus infinity 2089 rounding towards zero 2091 scaling 1504 scaling, reciprocal 1504 sine 2034 sine and cosine 2036 square root 2000 square root of sum of squares 2017 squaring 1981 subtraction 1979 tangent 2040 Vector Mathematical Functions vector multiplication LAPACK 1411 ScaLAPACK 1858 vector pack function 2100 vector statistics functions Bernoulli 2195 Beta 2186 Binomial 2198 Cauchy 2173 CopyStream 2138 CopyStreamState 2139 DeleteStream 2137 dNewAbstractStream 2133 Exponential 2165 Gamma 2183 Gaussian 2159 GaussianMV 2161 Geometric 2196 GetBrngProperties 2210 GetNumRegBrngs 2152 GetStreamSize 2145 GetStreamStateBrng 2151 Gumbel 2181 Hypergeometric 2200 iNewAbstractStream 2131 Laplace 2168 LeapfrogStream 2146 LoadStreamF 2141 LoadStreamM 2144 Lognormal 2178 NegBinomial 2206 Index 2747 NewStream 2128 NewStreamEx 2129 Poisson 2202 PoissonV 2204 Rayleigh 2175 RegisterBrng 2209 SaveStreamF 2140 SaveStreamM 2142 SkipAheadStream 2148 sNewAbstractStream 2135 Uniform (continuous) 2156 Uniform (discrete) 2189 UniformBits 2191 UniformBits32 2192 UniformBits64 2193 Weibull 2170 vector unpack function 2103 vector-scalar product sparse vectors 141 vectors adding magnitudes of vector elements 54 
copying 56 dot product complex vectors 61 complex vectors, conjugated 60 real vectors 58 element with the largest absolute value 71 element with the largest absolute value of real part and its index 1745 element with the smallest absolute value 72 Euclidean norm 62 Givens rotation 64 linear combination of vectors 55, 327 modified Givens transformation parameters 67 rotation of points 63 rotation of points in the modified plane 65 sparse vectors 140 sum of vectors 55, 327 swapping 70 vector-scalar product 69 viRngUniformBits 2191 viRngUniformBits32 2192 viRngUniformBits64 2193 vmcAdd 1976 vmcSin 2034 vmcSub 1979 vmdAdd 1976 vmdSin 2034 vmdSub 1979 vml Functions Interface 1971 Input Parameters 1972 Output Parameters 1973 VML 1969 VML arithmetic functions 1976 VML exponential and logarithmic functions 2019 VML functions mathematical functions v?Abs 1989 v?Acos 2042 v?Acosh 2061 v?Add 1976 v?Arg 1991 v?Asin 2045 v?Asinh 2064 v?Atan 2047 v?Atan2 2050 v?Atanh 2067 v?Cbrt 2004 v?CdfNorm 2075 v?CdfNormInv 2082 v?Ceil 2089 v?CIS 2038 v?Conj 1987 v?Cos 2031 v?Cosh 2052 v?Div 1997 v?Erf 2070 v?Erfc 2073 v?ErfcInv 2080 v?ErfInv 2077 v?Exp 2019 v?Expm1 2022 v?Floor 2088 v?Hypot 2017 v?Inv 1995 v?InvCbrt 2006 v?InvSqrt 2002 v?LGamma 2084 v?LinearFrac 1993 v?Ln 2024 v?Log10 2027 v?Log1p 2030 v?Modf 2098 v?Mul 1983 v?MulByConj 1986 v?NearbyInt 2094 v?Pow 2011 v?Pow2o3 2007 v?Pow3o2 2009 v?Powx 2014 v?Rint 2096 v?Round 2093 v?Sin 2034 v?SinCos 2036 v?Sinh 2055 v?Sqr 1981 v?Sqrt 2000 v?Sub 1979 v?Tan 2040 v?Tanh 2058 v?TGamma 2086 v?Trunc 2091 pack/unpack functions v?Pack 2100 v?Unpack 2103 service functions ClearErrorCallBack 2114 ClearErrStatus 2111 GetErrorCallBack 2114 GetErrStatus 2110 GetMode 2108 SetErrorCallBack 2111 SetErrStatus 2109 SetMode 2106 VML hyperbolic functions 2052 VML mathematical functions arithmetic 1976 exponential and logarithmic 2019 hyperbolic 2052 power and root 1995 rounding 2088 special 2070 special value notations 1976 trigonometric 2031 VML 
Mathematical Functions 1971 VML Pack Functions 1971 VML Pack/Unpack Functions 2100 VML power and root functions 1995 VML rounding functions 2088 VML Service Functions 2106 VML special functions 2070 VML trigonometric functions 2031 vmlClearErrorCallBack 2114 vmlClearErrStatus 2111 vmlGetErrorCallBack 2114 vmlGetErrStatus 2110 vmlGetMode 2108 vmlSetErrorCallBack 2111 vmlSetErrStatus 2109 vmlSetMode 2106 vmsAdd 1976 vmsSin 2034 vmsSub 1979 vmzAdd 1976 vmzSin 2034 vmzSub 1979 vsAdd 1976 VSL Fortran header 2115 VSL routines advanced service routines GetBrngProperties 2210 RegisterBrng 2209 convolution/correlation CopyTask 2254 DeleteTask 2253 Exec 2239 Exec1D 2242 ExecX 2246 ExecX1D 2249 NewTask 2220 NewTask1D 2223 NewTaskX 2225 NewTaskX1D 2228 SetInternalPrecision 2234 generator routines Bernoulli 2195 Beta 2186 Binomial 2198 Cauchy 2173 Exponential 2165 Gamma 2183 Gaussian 2159 GaussianMV 2161 Geometric 2196 Gumbel 2181 Hypergeometric 2200 Laplace 2168 Lognormal 2178 NegBinomial 2206 Poisson 2202 PoissonV 2204 Rayleigh 2175 Uniform (continuous) 2156 Uniform (discrete) 2189 UniformBits 2191 UniformBits32 2192 UniformBits64 2193 Weibull 2170 service routines CopyStream 2138 CopyStreamState 2139 DeleteStream 2137 dNewAbstractStream 2133 GetNumRegBrngs 2152 GetStreamSize 2145 GetStreamStateBrng 2151 iNewAbstractStream 2131 LeapfrogStream 2146 LoadStreamF 2141 LoadStreamM 2144 NewStream 2128 NewStreamEx 2129 SaveStreamF 2140 SaveStreamM 2142 SkipAheadStream 2148 sNewAbstractStream 2135 summary statistics Compute 2302 DeleteTask 2303 EditCorParameterization 2298 EditCovCor 2280 EditMissingValues 2294 EditMoments 2278 EditOutliersDetection 2292 EditPartialCovCor 2282 EditPooledCovariance 2287 EditQuantiles 2284 EditRobustCovariance 2289 EditStreamQuantiles 2286 EditTask 2270 NewTask 2267 VSL routines:convolution/correlation SetInternalDecimation 2237 SetMode 2232 SetStart 2235 VSL Summary Statistics 2261 VSL task 2115 
vslConvCopyTask 2254 vslCorrCopyTask 2254 vsldsscompute 2302 vsldSSCompute 2302 vsldsseditcorparameterization 2298 vsldSSEditCorParameterization 2298 vsldsseditcovcor 2280 vsldSSEditCovCor 2280 vsldsseditmissingvalues 2294 vsldSSEditMissingValues 2294 vsldsseditmoments 2278 vsldSSEditMoments 2278 vsldsseditoutliersdetection 2292 vsldSSEditOutliersDetection 2292 vsldsseditpartialcovcor 2282 vsldSSEditPartialCovCor 2282 vsldsseditpooledcovariance 2287 vsldSSEditPooledCovariance 2287 vsldsseditquantiles 2284 vsldSSEditQuantiles 2284 vsldsseditrobustcovariance 2289 vsldSSEditRobustCovariance 2289 vsldsseditstreamquantiles 2286 vsldSSEditStreamQuantiles 2286 vsldssedittask 2270 vsldSSEditTask 2270 vsldssnewtask 2267 vsldSSNewTask 2267 vslgamma 2084 vsLGamma 2084 vslissedittask 2270 vsliSSEditTask 2270 vslLoadStreamF 2141 vslSaveStreamF 2140 vslssdeletetask 2303 vslSSDeleteTask 2303 vslssscompute 2302 vslsSSCompute 2302 vslssseditcorparameterization 2298 vslsSSEditCorParameterization 2298 vslssseditcovcor 2280 vslsSSEditCovCor 2280 Index 2749 vslssseditmissingvalues 2294 vslsSSEditMissingValues 2294 vslssseditmoments 2278 vslsSSEditMoments 2278 vslssseditoutliersdetection 2292 vslsSSEditOutliersDetection 2292 vslssseditpartialcovcor 2282 vslsSSEditPartialCovCor 2282 vslssseditpooledcovariance 2287 vslsSSEditPooledCovariance 2287 vslssseditquantiles 2284 vslsSSEditQuantiles 2284 vslssseditrobustcovariance 2289 vslsSSEditRobustCovariance 2289 vslssseditstreamquantiles 2286 vslsSSEditStreamQuantiles 2286 vslsssedittask 2270 vslsSSEditTask 2270 vslsssnewtask 2267 vslsSSNewTask 2267 vsPackI 2100 vsPackM 2100 vsPackV 2100 vsSin 2034 vsSub 1979 vstgamma 2086 vsTGamma 2086 vsUnpackI 2103 vsUnpackM 2103 vsUnpackV 2103 vzAdd 1976 vzPackI 2100 vzPackM 2100 vzPackV 2100 vzSin 2034 vzSub 1979 vzUnpackI 2103 vzUnpackM 2103 vzUnpackV 2103 W Weibull 2170 Wilkinson transform 1832 X xerbla 2529 xerbla_array 1532 xerbla, error reporting routine 1973 Z zbbcsd 920 zdla_gercond_c 1471 
Intel(R) VTune(TM) Amplifier XE 2011 Getting Started Tutorials for Linux* OS

Document Number: 324207-005US
World Wide Web: http://developer.intel.com

Contents

Legal Information
Introducing the Intel(R) VTune(TM) Amplifier XE
Prerequisites
Navigation Quick Start
Key Terms and Concepts

Chapter 1: Tutorial: Finding Hotspots
  Learning Objectives
  Workflow Steps to Identify and Analyze Hotspots
  Build Application and Create New Project
  Run Hotspots Analysis
  Interpret Result Data
  Analyze Code
  Tune Algorithms
  Compare with Previous Result
  Summary

Chapter 2: Tutorial: Analyzing Locks and Waits
  Learning Objectives
  Workflow Steps to Identify Locks and Waits
  Build Application and Create New Project
  Run Locks and Waits Analysis
  Interpret Result Data
  Analyze Code
  Remove Lock
  Compare with Previous Result
  Summary

Chapter 3: Tutorial: Identifying Hardware Issues
  Learning Objectives
  Workflow Steps to Identify Hardware Issues
  Build Application and Create New Project
  Run General Exploration Analysis
  Interpret Results
  Analyze Code
  Resolve Issue
  Resolve Next Issue
  Summary

Chapter 4: More Resources
  Getting Help
  Product Website and Support

Chapter 5: Intel(R) VTune(TM) Amplifier XE Tutorials Troubleshooting
  Troubleshooting

Legal Information

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT.
EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR.

Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order. Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm

Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families.
Go to: http://www.intel.com/products/processor_number/

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

BlueMoon, BunnyPeople, Celeron, Celeron Inside, Centrino, Centrino Inside, Cilk, Core Inside, E-GOLD, i960, Intel, the Intel logo, Intel AppUp, Intel Atom, Intel Atom Inside, Intel Core, Intel Inside, Intel Insider, the Intel Inside logo, Intel NetBurst, Intel NetMerge, Intel NetStructure, Intel SingleDriver, Intel SpeedStep, Intel Sponsors of Tomorrow., the Intel Sponsors of Tomorrow. logo, Intel StrataFlash, Intel vPro, Intel XScale, InTru, the InTru logo, the InTru Inside logo, InTru soundmark, Itanium, Itanium Inside, MCS, MMX, Moblin, Pentium, Pentium Inside, Puma, skoool, the skoool logo, SMARTi, Sound Mark, The Creators Project, The Journey Inside, Thunderbolt, Ultrabook, vPro Inside, VTune, Xeon, Xeon Inside, X-GOLD, XMM, X-PMU and XPOSYS are trademarks of Intel Corporation in the U.S. and/or other countries.

*Other names and brands may be claimed as the property of others.

Microsoft, Windows, Visual Studio, Visual C++, and the Windows logo are trademarks, or registered trademarks of Microsoft Corporation in the United States and/or other countries. Microsoft product screen shot(s) reprinted with permission from Microsoft Corporation.

Java is a registered trademark of Oracle and/or its affiliates.

Copyright (C) 2010-2011, Intel Corporation. All rights reserved.
Introducing the Intel(R) VTune(TM) Amplifier XE

The Intel(R) VTune(TM) Amplifier XE, an Intel(R) Parallel Studio XE tool, provides information on code performance for users developing serial and multithreaded applications on Windows* and Linux* operating systems. On Windows systems, the VTune Amplifier XE integrates into Microsoft Visual Studio* software and is also available as a standalone GUI client. On Linux systems, VTune Amplifier XE works only as a standalone GUI client. On both Windows and Linux systems, you can benefit from using the command-line interface for collecting data remotely or for performing regression testing.

VTune Amplifier XE helps you analyze the algorithm choices and identify where and how your application can benefit from available hardware resources. Use the VTune Amplifier XE to locate or determine the following:

• The most time-consuming (hot) functions in your application and/or on the whole system
• Sections of code that do not effectively utilize available processor time
• The best sections of code to optimize for sequential performance and for threaded performance
• Synchronization objects that affect the application performance
• Whether, where, and why your application spends time on input/output operations
• The performance impact of different synchronization methods, different numbers of threads, or different algorithms
• Thread activity and transitions
• Hardware-related bottlenecks in your code

Intel VTune Amplifier XE Tutorials

These tutorials tell you how to use the VTune Amplifier XE to analyze the performance of a sample application by identifying software- and hardware-related issues in the code.

• Finding Hotspots
• Analyzing Locks and Waits
• Identifying Hardware Issues

Check http://software.intel.com/en-us/articles/intel-software-product-tutorials/ for the printable version (PDF) of product tutorials.
See Also
Getting Help

Prerequisites

You need the following tools, skills, and knowledge to effectively use these tutorials.

Required Tools

You need the following tools to use these tutorials:

• Intel(R) VTune(TM) Amplifier XE
• Sample code included with the VTune Amplifier XE. VTune Amplifier XE provides the following sample applications:
  • tachyon application used for the Finding Hotspots and Analyzing Locks and Waits tutorials
  • matrix application used for the Identifying Hardware Issues tutorial
• VTune Amplifier XE Help

To acquire the VTune Amplifier XE:

If you do not already have access to the VTune Amplifier XE, you can download an evaluation copy from http://software.intel.com/en-us/articles/intel-software-evaluation-center/. To install the VTune Amplifier XE, follow the instructions in the Release Notes.

To install and set up VTune Amplifier XE sample code:

1. Copy the tachyon_vtune_amp_xe.tar.gz and matrix_vtune_amp_xe.tar.gz files from the samples/ folder in the Intel VTune Amplifier XE installation directory to a writable directory or share on your system. The default installation directory is /opt/intel/vtune_amplifier_xe_2011.
2. Extract the sample(s) from the .tar.gz archives.

NOTE
• Samples are non-deterministic. Your screens may vary from the screen shots shown throughout these tutorials.
• Samples are designed only to illustrate VTune Amplifier XE features and do not represent best practices for tuning the code. Results may vary depending on the nature of the analysis.

To run the VTune Amplifier XE:

Launch the amplxe-gui script from the /opt/intel/vtune_amplifier_xe_2011/bin32 directory.

To access VTune Amplifier XE Help:

See the Getting Help topic.
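The sample setup steps above can be sketched as a small shell helper. This is only an illustration under stated assumptions: the function name extract_sample and the variables INSTALL_DIR, SAMPLES_DIR, and WORK_DIR are hypothetical, the default installation path is the one quoted above, and the exact subfolder under samples/ that holds the archives may differ on your system.

```shell
#!/bin/sh
# Hypothetical helper: copy one sample archive out of the installation
# tree into a writable directory and unpack it there.
# INSTALL_DIR default is the path quoted in the tutorial; SAMPLES_DIR and
# WORK_DIR are assumptions and should be adjusted to your system.
INSTALL_DIR="${INSTALL_DIR:-/opt/intel/vtune_amplifier_xe_2011}"
SAMPLES_DIR="${SAMPLES_DIR:-$INSTALL_DIR/samples}"
WORK_DIR="${WORK_DIR:-$HOME/samples}"

extract_sample() {
    archive="$1"                       # e.g. tachyon_vtune_amp_xe.tar.gz
    mkdir -p "$WORK_DIR" &&
    cp "$SAMPLES_DIR/$archive" "$WORK_DIR/" &&
    tar -xzf "$WORK_DIR/$archive" -C "$WORK_DIR"
}

# Typical use (left commented out so that sourcing this file has no effect):
#   extract_sample tachyon_vtune_amp_xe.tar.gz
#   extract_sample matrix_vtune_amp_xe.tar.gz
#   "$INSTALL_DIR/bin32/amplxe-gui" &
```

Keeping the copy and the extraction in a writable working directory leaves the installation tree untouched, which matches the instruction to copy the archives before unpacking them.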
Navigation Quick Start

Standalone Intel(R) VTune(TM) Amplifier XE

• Use the VTune Amplifier XE menu to control result collection, define and view project properties, and set various options.
• Use the VTune Amplifier XE toolbar to configure and control result collection.
• Use the Project Navigator to manage your VTune Amplifier XE projects and collected analysis results. Click the Project Navigator button on the toolbar to enable/disable the Project Navigator.
• Use the VTune Amplifier XE result tabs to manage result data. You can view or change the result file location from the Project Properties dialog box.
• Use the drop-down menu to select a viewpoint, a preset configuration of windows/panes for an analysis result. For each analysis type, you can switch among several preset configurations to focus on particular performance metrics. Click the yellow question mark icon to read the viewpoint description.
• Switch between window tabs to explore the analysis type configuration options and collected data provided by the selected viewpoint.
• Use the Grouping drop-down menu to choose a granularity level for grouping data in the grid.
• Use the filter toolbar to filter out the result data according to the selected categories.

Key Terms and Concepts

Key Terms

baseline: A performance metric used as a basis for comparison of the application versions before and after optimization. Baseline should be measurable and reproducible.

CPU time: The amount of time a thread spends executing on a logical processor. For multiple threads, the CPU time of the threads is summed. The application CPU time is the sum of the CPU time of all the threads that run the application.

Elapsed time: The total time your target ran, calculated as follows: Wall clock time at end of application – Wall clock time at start of application.
hotspot: A section of code that took a long time to execute. Some hotspots may indicate bottlenecks and can be removed, while other hotspots inevitably take a long time to execute due to their nature.

target: A target is an executable file you analyze using the Intel(R) VTune(TM) Amplifier XE.

viewpoint: A preset result tab configuration that filters out the data collected during a performance analysis and enables you to focus on specific performance problems. When you select a viewpoint, you select a set of performance metrics the VTune Amplifier XE shows in the windows/panes of the result tab. To select the required viewpoint, click the button and use the drop-down menu at the top of the result tab.

Wait time: The amount of time that a given thread waited for some event to occur, such as synchronization waits and I/O waits.

Key Concept: CPU Usage

For the user-mode sampling and tracing analysis types, the Intel(R) VTune(TM) Amplifier XE identifies a processor utilization scale, calculates the target CPU usage, and defines default utilization ranges depending on the number of processor cores. You can change the utilization ranges by dragging the slider in the CPU Usage histogram in the Summary window.

Utilization Type   Description
Idle               All CPUs are waiting; no threads are running.
Poor               Poor usage. By default, poor usage is when the number of simultaneously running CPUs is less than or equal to 50% of the target CPU usage.
OK                 Acceptable (OK) usage. By default, OK usage is when the number of simultaneously running CPUs is between 51-85% of the target CPU usage.
Ideal              Ideal usage. By default, Ideal usage is when the number of simultaneously running CPUs is between 86-100% of the target CPU usage.

Key Concept: Data of Interest

The VTune Amplifier XE maintains a special column called Data of Interest. This column is highlighted with a yellow background and a yellow star in the column header.
The data in the Data of Interest column is used by various windows as follows:

• The Call Stack pane calculates the contribution, shown in the contribution bar, using the Data of Interest column values.
• The Filter bar uses the data of interest values to calculate the percentage indicated in the filtered option.
• The Source/Assembly window uses this column for hotspot navigation.

If a viewpoint has more than one column with numeric data or bars, you can change the default Data of Interest column by right-clicking the required column and selecting the Set Column as Data of Interest command from the pop-up menu.

Key Concept: Event-based Metrics

When analyzing data collected during a hardware event-based sampling analysis, the VTune Amplifier XE uses performance metrics. Each metric is an event ratio with its own threshold values. As soon as the performance of a program unit per metric exceeds the threshold, the VTune Amplifier XE marks this value as a performance issue (in pink) and provides recommendations on how to fix it. Each column in the Bottom-up pane provides data per metric. To read the metric description and see the formula used for the metric calculation, mouse over the metric column header. To read the description of the hardware issue and see the threshold formula used for this issue, mouse over the link cell in the grid.

For the full list of metrics used by the VTune Amplifier XE, see the Hardware Event-based Metrics topic in the online help.

Key Concept: Event-based Sampling Analysis

VTune Amplifier XE introduces a set of advanced hardware analysis types based on the event-based sampling data collection and targeted for the Intel(R) Core(TM) 2 processor family, processors based on the Intel(R) microarchitecture code name Nehalem and Intel(R) microarchitecture code name Sandy Bridge.
Depending on the analysis type, the VTune Amplifier XE monitors a set of hardware events and presents the collected data as so-called hardware event-based metrics defined by Intel architects (for example, Clockticks per Instructions Retired, Contested Accesses, and so on). It is typically recommended to start with the General Exploration analysis type, which collects the maximum number of events and provides the widest picture of the hardware issues that affected the performance of your application.

For more information on the event-based sampling analysis, see the Hardware Event-based Sampling Collection topic in the online help.

Key Concept: Event Skid

Event skid is the recording of an event not exactly on the code line that caused the event. Event skids may even result in a caller function event being recorded in the callee function. Event skid is caused by a number of factors:

• The delay in propagating the event out of the processor's microcode through the interrupt controller (APIC) and back into the processor.
• The current instruction retirement cycle must be completed.
• When the interrupt is received, the processor must serialize its instruction stream, which causes a flushing of the execution pipeline.

The Intel(R) processors support accurate event location for some events. These events are called precise events. See the online help for more details.

Key Concept: Finalization

Finalization is the process of the Intel(R) VTune(TM) Amplifier XE converting the collected data to a database, resolving symbol information, and pre-computing data to make further analysis more efficient and responsive. The VTune Amplifier XE finalizes data automatically when data collection completes.
You may want to re-finalize a result to:

• update symbol information after changes in the search directories settings
• resolve [Unknown] entries in the results

Key Concept: Hotspots Analysis

The Hotspots analysis helps you understand the application flow and identify sections of code that took a long time to execute (hotspots). A large number of samples collected at a specific process, thread, or module can imply high processor utilization and potential performance bottlenecks. Some hotspots can be removed, while other hotspots are fundamental to the application functionality and cannot be removed.

The Intel(R) VTune(TM) Amplifier XE creates a list of functions in your application ordered by the amount of time spent in a function. It also detects the call stacks for each of these functions so you can see how the hot functions are called. The VTune Amplifier XE uses a low overhead (about 5%) user-mode sampling and tracing collection that gets you the information you need without slowing down the application execution significantly.

Key Concept: Locks and Waits Analysis

While the Concurrency analysis helps identify where your application is not parallel, the Locks and Waits analysis helps identify the cause of the ineffective processor utilization. One of the most common problems is threads waiting too long on synchronization objects (locks). Performance suffers when waits occur while cores are under-utilized. During the Locks and Waits analysis you can estimate the impact each synchronization object introduces to the application and understand how long the application was required to wait on each synchronization object, or in blocking APIs, such as sleep and blocking I/O.

Key Concept: Thread Concurrency

The number of active threads corresponds to the concurrency level of an application.
By comparing the concurrency level with the number of processors, Intel(R) VTune(TM) Amplifier XE classifies how an application utilizes the processors in the system. It defines default utilization ranges depending on the number of processor cores and displays the thread concurrency in the Summary and Bottom-up windows. You can change the utilization ranges by dragging the slider in the Summary window.

Thread concurrency may be higher than CPU Usage if threads are in the runnable state and not consuming CPU time. VTune Amplifier XE defines the Target Concurrency level for your application that is, by default, equal to the number of physical cores.

Utilization Type   Description
Idle               All threads in the application are waiting; no threads are running. There can be only one bar in the Thread Concurrency histogram indicating Idle utilization.
Poor               Poor utilization. By default, poor utilization is when the number of threads is up to 50% of the target concurrency.
OK                 Acceptable (OK) utilization. By default, OK utilization is when the number of threads is between 51-85% of the target concurrency.
Ideal              Ideal utilization. By default, ideal utilization is when the number of threads is between 86-115% of the target concurrency.
Over               Over-utilization. By default, over-utilization is when the number of threads is more than 115% of the target concurrency.

Tutorial: Finding Hotspots

Learning Objectives

This tutorial shows how to use the Hotspots analysis of the Intel(R) VTune(TM) Amplifier XE to understand where the sample application is spending time, identify hotspots (the most time-consuming program units), and detect how they were called. Some hotspots may indicate bottlenecks that can be removed, while other hotspots are inevitable and take a long time to execute due to their nature.
Typically, the hotspot functions identified during the Hotspots analysis use the most time-consuming algorithms and are good candidates for parallelization. The Hotspots analysis is useful for analyzing the performance of both serial and parallel applications.

Estimated completion time: 15 minutes.
Sample application: tachyon.

After you complete this tutorial, you should be able to:

• Choose an analysis target.
• Choose the Hotspots analysis type.
• Run the Hotspots analysis to locate the most time-consuming functions in an application.
• Analyze the function call flow and threads.
• Analyze the source code to locate the most time-critical code lines.
• Compare results before and after optimization.

Start Here: Workflow Steps to Identify and Analyze Hotspots

Workflow Steps to Identify and Analyze Hotspots

You can use the Intel(R) VTune(TM) Amplifier XE to identify and analyze hotspot functions in your serial or parallel application by performing a series of steps in a workflow. This tutorial guides you through these workflow steps while using a sample ray-tracer application named tachyon.

1. Build an application to analyze for hotspots and create a new VTune Amplifier XE project.
2. Choose and run the Hotspots analysis.
3. Interpret the result data.
4. View and analyze code of the performance-critical function.
5. Modify the code to tune the algorithms.
6. Re-build the target, re-run the Hotspots analysis, and compare the result data before and after optimization.

Optimization Notice

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors.
Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

Build Application and Create New Project

Before you start analyzing your application target for hotspots, do the following:

1. Build the application in the release mode with full optimizations.
2. Run the application without debugging to create a performance baseline.
3. Create a VTune Amplifier XE project.

Choose a Build Mode and Build a Target

1. Browse to the directory where you extracted the sample code (for example, /home/intel/samples/tachyon_vtune_amp_xe). Make sure this directory contains Makefile.
2. Clean up all the previous builds as follows:
   $ make clean
3. Build your target in the release mode as follows:
   $ make release
   The tachyon_find_hotspots application is built.

Create a Performance Baseline

1. Run tachyon_find_hotspots with dat/balls.dat as an input parameter. For example:
   $ /home/intel/samples/tachyon_vtune_amp_xe/tachyon_find_hotspots dat/balls.dat
   The tachyon_find_hotspots application starts running.

   NOTE
   Before you start the application, minimize the amount of other software running on your computer to get more accurate results.

2. Note the execution time displayed in the window caption or in the shell window. For the tachyon_find_hotspots executable in the figure above, the execution time is 83.539 seconds. The total execution time is the baseline against which you will compare subsequent runs of the application.

   NOTE
   Run the application several times, note the execution time for each run, and use the average number. This helps to minimize skewed results due to transient system activity.

Create a Project

1. Set the EDITOR or VISUAL environment variable to associate your source files with the code editor (like emacs, vi, vim, gedit, and so on). For example:
   $ export EDITOR=gedit
2. From the bin32 directory (for IA-32 architecture) or the bin64 directory (for Intel(R) 64 architecture) of the installation directory, run the amplxe-gui script to launch the VTune Amplifier XE GUI. By default, the installation directory is /opt/intel/vtune_amplifier_xe_2011.
3. Create a new project via File > New > Project.... The Create a Project dialog box opens.
4. Specify the project name tachyon that will be used as the project directory name. The VTune Amplifier XE creates the tachyon project directory under the root/intel/My Amplifier XE Projects directory and opens the Project Properties: Target dialog box.
5. In the Application to Launch pane of the Target tab, specify and configure your target as follows:
   • For the Application field, browse to the tachyon_find_hotspots executable in the directory where you built the sample, for example: /home/intel/samples/en/tachyon_vtune_amp_xe/tachyon_find_hotspots.
   • For the Application parameters field, enter dat/balls.dat.
6. Click OK to apply the settings and exit the Project Properties dialog box.

Recap

You built the target in the Release mode, created the performance baseline, and created the VTune Amplifier XE project for your analysis target. Your application is ready for analysis.

Key Terms and Concepts
• Term: target, baseline

Next Step
Run Hotspots Analysis

Run Hotspots Analysis

Before running an analysis, choose a configuration level to influence Intel(R) VTune(TM) Amplifier XE analysis scope and running time. In this tutorial, you run the Hotspots analysis to identify the hotspots that took the most time to execute.

To run an analysis:

1. From the VTune Amplifier XE toolbar, click the New Analysis button. The VTune Amplifier XE result tab opens with the Analysis Type window active.
2. On the left pane of the Analysis Type window, locate the analysis tree and select Algorithm Analysis > Hotspots. The right pane is updated with the default options for the Hotspots analysis.
3. Click the Start button on the right command bar. VTune Amplifier XE launches the tachyon_find_hotspots application, which takes balls.dat as an input file, renders it, calculates the execution time, and exits. VTune Amplifier XE finalizes the collected results and opens the Hotspots viewpoint.

NOTE
To make sure the performance of the application is repeatable, go through the entire tuning process on the same system with a minimal amount of other software executing.

Recap

You launched the Hotspots data collection that analyzes function calls and CPU time spent in each program unit of your application.

NOTE
This tutorial explains how to run an analysis from the VTune Amplifier XE graphical user interface (GUI). You can also use the VTune Amplifier XE command-line interface (amplxe-cl command) to run an analysis. For more details, check the Command-line Interface Support section of the VTune Amplifier XE Help.

Key Terms and Concepts
• Term: hotspot, Elapsed time, viewpoint
• Concept: Hotspots Analysis, Finalization

Next Step
Interpret Result Data

Interpret Result Data

When the sample application exits, the Intel(R) VTune(TM) Amplifier XE finalizes the results and opens the Hotspots viewpoint that consists of the Summary, Bottom-up, and Top-down Tree windows. To interpret the data on the sample code performance, do the following:

• Understand the basic performance metrics provided by the Hotspots analysis.
• Analyze the most time-consuming functions.
• Analyze CPU usage per function.

NOTE
The screenshots and execution time data provided in this tutorial are created on a system with four CPU cores. Your data may vary depending on the number and type of CPU cores on your system.

Understand the Basic Hotspots Metrics

Start analysis with the Summary window.
To interpret the data, hover over the question mark icons to read the pop-up help and better understand what each performance metric means.
Note that CPU Time for the sample application is equal to 89.876 seconds. It is the sum of CPU time for all application threads. Total Thread Count is 1, so the sample application is single-threaded.
The Top Hotspots section provides data on the most time-consuming functions (hotspot functions) sorted by CPU time spent on their execution. For the sample application, the initialize_2D_buffer function, which took 52.939 seconds to execute, shows up at the top of the list as the hottest function. The [Others] entry at the bottom shows the sum of CPU time for all functions not listed in the table.

Analyze the Most Time-consuming Functions
Click the Bottom-up tab to explore the Bottom-up pane. By default, the data in the grid is sorted by Function. You may change the grouping level using the Grouping drop-down menu at the top of the grid.
Analyze the CPU Time column values. This column is marked with a yellow star as the Data of Interest column. It means that the VTune Amplifier XE uses this type of data for some calculations (for example, filtering, stack contribution, and others). Functions that took most CPU time to execute are listed on top. The initialize_2D_buffer function took 52.939 seconds to execute. Click the arrow sign at the initialize_2D_buffer function to expand the stacks calling this function. You see that it was called only by the setup_2D_buffer function.
Select the initialize_2D_buffer function in the grid and explore the data provided in the Call Stack pane on the right. The Call Stack pane displays full stack data for each hotspot function and enables you to navigate between function call stacks and understand the impact of each stack on the function CPU time.
The stack functions in the Call Stack pane are represented in the following format: <module>!<function> - <file>:<line number>, where the line number corresponds to the line calling the next function in the stack. For the sample application, the hottest function is called at line 87 of the setup_2D_buffer function in the global.cpp file.

Analyze CPU Usage per Function
VTune Amplifier XE enables you to analyze the collected data from different perspectives by using multiple viewpoints. For the Hotspots analysis result, you may switch to the Hotspots by CPU Usage viewpoint to understand how your hotspot function performs in terms of the CPU usage. Explore this viewpoint to determine how your application utilized available cores and identify the most serial code.
If you go back to the Summary window, you can see the CPU Usage Histogram that represents the Elapsed time and usage level for the available logical processors. The tachyon_find_hotspots application ran mostly on one logical CPU. If you hover over the highest bar, you see that it spent 79.695 seconds using one core only, which is classified by the VTune Amplifier XE as a Poor utilization for a dual-core system. To understand what prevented the application from using all available logical CPUs effectively, explore the Bottom-up pane.
To get the detailed CPU usage information per function, use the expand button in the Bottom-up window to expand the CPU Time column. Note that initialize_2D_buffer is the function with the longest poor CPU utilization (red bars). This means that the processor cores were underutilized most of the time spent on executing this function.
If you change the grouping level (highlighted in the figure above) in the Bottom-up pane from Function/Call Stack to Thread/Function/Call Stack, you see that the initialize_2D_buffer function belongs to the thread_video thread. This thread is also identified as a hotspot and shows up at the top in the Bottom-up pane.
To get detailed information on the hotspot thread performance, explore the Timeline pane.
• Timeline area. When you hover over the graph element, the timeline tooltip displays the time passed since the application has been launched.
• Threads area that shows the distribution of CPU time utilization per thread. Hover over a bar to see the CPU time utilization in percent for this thread at each moment of time. Green zones show the time threads are active.
• CPU Usage area that shows the distribution of CPU time utilization for the whole application. Hover over a bar to see the application-level CPU time utilization in percent at each moment of time.
VTune Amplifier XE calculates the overall CPU Usage metric as the sum of CPU time per each thread of the Threads area. Maximum CPU Usage value is equal to [number of processor cores] x 100%.
The Timeline analysis also identifies the thread_video thread as the most active. The tooltip shows that CPU time values are about 100% whereas the maximum CPU time value for dual-core systems is 200%. This means that the processor cores were half-utilized for most of the time spent on executing the tachyon_find_hotspots application.

Recap
You identified a function that took the most CPU time and could be a good candidate for algorithm tuning.

Key Terms and Concepts
• Term: Elapsed time, CPU time, viewpoint
• Concept: Hotspots Analysis, CPU Usage, Data of Interest

Next Step
Analyze Code

Analyze Code
You identified initialize_2D_buffer as the hottest function. In the Bottom-up pane, double-click this function to open the Source window and analyze the source code:
• Understand basic options provided in the Source window.
• Identify the hottest code lines.

Understand Basic Source Window Options
The table below explains some of the features available in the Source window when viewing the Hotspots analysis data.
• Source pane displaying the source code of the application if the function symbol information is available. The code line that took the most CPU time to execute is highlighted. The source code in the Source pane is not editable. If the function symbol information is not available, the Assembly pane opens displaying assembler instructions for the selected hotspot function. To enable the Source pane, make sure to build the target properly.
• Assembly pane displaying the assembler instructions for the selected hotspot function. Assembler instructions are grouped by basic blocks. The assembler instructions for the selected hotspot function are highlighted. To get help on an assembler instruction, right-click the instruction and select Instruction Reference.
NOTE To get the help on a particular instruction, make sure to have the Adobe* Acrobat Reader* 9 (or later) installed. If an earlier version of the Adobe Acrobat Reader is installed, the Instruction Reference opens but you need to locate the help on each instruction manually.
• Processor time attributed to a particular code line. If the hotspot is a system function, its time, by default, is attributed to the user function that called this system function.
• Source window toolbar. Use the hotspot navigation buttons to switch between most performance-critical code lines. Hotspot navigation is based on the metric column selected as a Data of Interest. For the Hotspots analysis, this is CPU Time. Use the Source/Assembly buttons to toggle the Source/Assembly panes (if both of them are available) on/off.
• Heat map markers to quickly identify performance-critical code lines (hotspots). The bright blue markers indicate hot lines for the function you selected for analysis. Light blue markers indicate hot lines for other functions. Scroll to a marker to locate the hot code line it identifies.

Identify the Hottest Code Lines
When you identify a hotspot in the serial code, you can make some changes in the code to tune the algorithms and speed up that hotspot. Another option is to parallelize the sample code by adding threads to the application so that it performs well on multi-core processors. This tutorial focuses on algorithm tuning.
By default, when you double-click the hotspot in the Bottom-up pane, VTune Amplifier XE opens the source file related to this function. For the initialize_2D_buffer function, the hottest code line is 121. This code is used to initialize a memory array using non-sequential memory locations.
Click the Source Editor button on the Source window toolbar to open the default code editor and work on optimizing the code.

Recap
You identified the code section that took the most CPU time to execute.

Key Terms and Concepts
• Term: hotspot, CPU time
• Concept: Hotspots Analysis, Data of Interest

Next Step
Tune Algorithms

Tune Algorithms
In the Source window, you identified that in the initialize_2D_buffer hotspot function the code line 121 took the most CPU time. Focus on this line and do the following:
1. Open the code editor.
2. Optimize the algorithm used in this code section.

Open the Code Editor
In the Source window, click the Source Editor button to open the initbuffer.cpp file in the default code editor.
The hotspot line is used to initialize a memory array using non-sequential memory locations. For demonstration purposes, the code lines are commented as a slower method of filling the array.

Resolve the Problem
To resolve this issue, optimize your algorithm as follows:
1. Edit lines 110 and 113 to comment out code lines 111-125 marked as a "First (slower) method".
2. Edit line 144 to uncomment code lines 145-151 marked as a "Faster method".
In this step, you interchange the for loops to initialize the array in sequential memory locations.
3. Save the changes made in the source file.
4. Browse to the directory where you extracted the sample code (for example, /home/intel/samples/en/tachyon_vtune_amp_xe).
5. Rebuild your target in the release mode using the make command as follows:
$ make clean
$ make release
The tachyon_find_hotspots application is rebuilt and stored in the tachyon_vtune_amp_xe directory.
6. Run tachyon_find_hotspots as follows:
$ /home/intel/samples/en/tachyon_vtune_amp_xe/tachyon_find_hotspots dat/balls.dat
The system runs the tachyon_find_hotspots application. Note that execution time reduced from 83.539 seconds to 43.760 seconds.

Recap
You interchanged the loops in the hotspot function, rebuilt the application, and got a performance gain of 40 seconds.

Key Terms and Concepts
• Term: hotspot

Next Step
Compare with Previous Result

Optimization Notice
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804

Compare with Previous Result
You optimized your code to apply a loop interchange mechanism that gave you 40 seconds of improvement in the application execution time.
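The loop interchange itself can be illustrated with a minimal sketch. This is not the code from initbuffer.cpp: the Buffer type and the fill_strided and fill_sequential functions are hypothetical names, and the buffer layout (one contiguous array stored row by row) is an assumption made for the illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the tutorial's buffer: a rows x cols image stored
// row by row in one contiguous array (element (r, c) lives at r * cols + c).
using Buffer = std::vector<unsigned int>;

// Slower ordering: the inner loop walks down a column, so consecutive writes
// are cols elements apart in memory (non-sequential access).
Buffer fill_strided(std::size_t rows, std::size_t cols, unsigned int value) {
    assert(rows > 0 && cols > 0);
    Buffer buf(rows * cols);
    for (std::size_t c = 0; c < cols; ++c) {
        for (std::size_t r = 0; r < rows; ++r) {
            buf[r * cols + c] = value;
        }
    }
    return buf;
}

// Ordering after loop interchange: the inner loop walks along a row, so
// consecutive writes touch adjacent memory locations.
Buffer fill_sequential(std::size_t rows, std::size_t cols, unsigned int value) {
    assert(rows > 0 && cols > 0);
    Buffer buf(rows * cols);
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) {
            buf[r * cols + c] = value;
        }
    }
    return buf;
}
```

Both functions produce the same buffer contents; only the order in which memory is visited differs. How much time the sequential ordering saves depends on the buffer size and the cache hierarchy of the system.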
To understand whether you got rid of the hotspot and what kind of optimization you got per function, re-run the Hotspots analysis on the optimized code and compare results:
• Compare results before and after optimization.
• Identify the performance gain.

Compare Results Before and After Optimization
1. Run the Hotspots analysis on the modified code.
2. Click the Compare Results button on the Intel(R) VTune(TM) Amplifier XE toolbar. The Compare Results window opens.
3. Specify the Hotspots analysis results you want to compare and click the Compare Results button.
The Hotspots Bottom-up window opens, showing the CPU time usage across the two results and the differences side by side.
• Difference in CPU time between the two results, calculated as the CPU time of the first result minus the CPU time of the second result.
• CPU time for the initial version of the tachyon_find_hotspots application.
• CPU time for the optimized version of the tachyon_find_hotspots application.

Identify the Performance Gain
Explore the Bottom-up pane to compare CPU time data for the first hotspot: CPU Time:r001hs - CPU Time:r002hs = CPU Time: Difference. 52.939s - 11.971s = 40.968s, which means that you got the optimization of ~41 seconds for the initialize_2D_buffer function.
If you switch to the Summary window, you see that the Elapsed time also shows 3.6 seconds of optimization for the whole application execution.

Recap
You ran the Hotspots analysis on the optimized code and compared the results before and after optimization using the Compare mode of the VTune Amplifier XE. Compare analysis results regularly to look for regressions and to track how incremental changes to the code affect its performance. You may also want to use the VTune Amplifier XE command-line interface and run the amplxe-cl command to test your code for regressions. For more details, see the Command-line Interface Support section in the VTune Amplifier XE online help.
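The per-function comparison described in this section is a plain subtraction, and the same arithmetic can be applied to numbers copied out of two results when checking for regressions by hand. The sketch below is only an illustration: the CpuTimes type and the cpu_time_difference and is_regression functions are hypothetical and are not part of VTune Amplifier XE.

```cpp
#include <cassert>
#include <map>
#include <string>

// CPU time in seconds keyed by function name; a hypothetical stand-in for
// one analysis result (for example, r001hs or r002hs).
using CpuTimes = std::map<std::string, double>;

// Difference = CPU time in the first result - CPU time in the second result.
// A function missing from the second result counts as 0 seconds there.
CpuTimes cpu_time_difference(const CpuTimes& first, const CpuTimes& second) {
    CpuTimes diff;
    for (const auto& entry : first) {
        const auto it = second.find(entry.first);
        const double second_time = (it != second.end()) ? it->second : 0.0;
        diff[entry.first] = entry.second - second_time;
    }
    return diff;
}

// A negative difference means the function got slower in the second result.
// The tolerance absorbs run-to-run noise; its value is up to the caller.
bool is_regression(double difference, double tolerance_seconds) {
    assert(tolerance_seconds >= 0.0);
    return difference < -tolerance_seconds;
}
```

With the values from this tutorial, an entry of 52.939 seconds for initialize_2D_buffer in the first result and 11.971 seconds in the second gives a difference of about 40.968 seconds, which is not a regression.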

Key Terms and Concepts
• Term: hotspot, CPU time
• Concept: Hotspots Analysis

Next Step
Read Summary

Summary
You have completed the Finding Hotspots tutorial. Here are some important things to remember when using the Intel(R) VTune(TM) Amplifier XE to analyze your code for hotspots:
Step 1. Choose and Build Your Target
• Create a performance baseline to compare the application versions before and after optimization. Make sure to use the same workload for each application run.
• Create a VTune Amplifier XE project and use the Project Properties: Target tab to choose and configure your analysis target.
Step 2. Run Analysis
• Use the Analysis Type configuration window to choose, configure, and run the analysis. For example, you may limit the data collection to a predefined amount of data or enable the VTune Amplifier XE to collect more accurate CPU time data. You can also run the analysis from command line using the amplxe-cl command.
Step 3. Interpret Results and Resolve the Issue
• Start analyzing the performance of your application from the Summary window to explore the performance metrics for the whole application. Then, move to the Bottom-up window to analyze the performance per function. Focus on the hotspots - functions that took the most CPU time. By default, they are located at the top of the table.
• Double-click the hotspot function in the Bottom-up pane or Call Stack pane to open its source code at the code line that took the most CPU time.
Step 4. Compare Results Before and After Optimization
• Perform regular regression testing by comparing analysis results before and after optimization. From GUI, click the Compare Results button on the VTune Amplifier XE toolbar. From command line, use the amplxe-cl command.

Tutorial: Analyzing Locks and Waits

Learning Objectives
This tutorial shows how to use the Locks and Waits analysis of the Intel(R) VTune(TM) Amplifier XE to identify one of the most common reasons for an inefficient parallel application - threads waiting too long on synchronization objects (locks) while processor cores are underutilized. Focus your tuning efforts on objects with long waits where the system is underutilized.
Estimated completion time: 15 minutes.
Sample application: tachyon.
After you complete this tutorial, you should be able to:
• Choose an analysis target.
• Choose the Locks and Waits analysis type.
• Run the Locks and Waits analysis.
• Identify the synchronization objects with long waits and poor CPU utilization.
• Analyze the source code to locate the most critical code lines.
• Compare results before and after optimization.

Start Here
Workflow Steps to Identify Locks and Waits

Workflow Steps to Identify Locks and Waits
You can use the Intel(R) VTune(TM) Amplifier XE to understand the cause of the ineffective processor utilization by performing a series of steps in a workflow. This tutorial guides you through these workflow steps while using a sample ray-tracer application named tachyon.
1. Build an application to analyze for locks and waits and create a new VTune Amplifier XE project.
2. Run the Locks and Waits analysis.
3. Interpret the result data.
4. View and analyze code of the performance-critical function.
5. Modify the code to remove the lock.
6. Re-build the target, re-run the Locks and Waits analysis, and compare the result data before and after optimization.

Build Application and Create New Project
Before you start analyzing your application for locks and waits, do the following:
1. Build the application in the release mode with full optimizations.
2. Run the application without debugging to create a performance baseline.
3. Create a VTune Amplifier XE project.

Choose a Build Mode and Build a Target
1. Browse to the directory where you extracted the sample code (for example, /home/intel/samples/en/tachyon_vtune_amp_xe). Make sure this directory contains Makefile.
2. Clean up all the previous builds using the following command:
$ make clean
3. Build your target in the release mode using the following command:
$ make release
The tachyon_analyze_locks application is built and stored in the tachyon_vtune_amp_xe directory.

Create a Performance Baseline
1. Run tachyon_analyze_locks with dat/balls.dat as an input parameter. For example:
$ /home/intel/samples/en/tachyon_vtune_amp_xe/tachyon_analyze_locks dat/balls.dat
The tachyon_analyze_locks application runs in multiple sections (depending on the number of CPUs in your system).
NOTE Before you start the application, minimize the amount of other software running on your computer to get more accurate results.
2. Note the execution time displayed in the window caption and in the shell window. For the tachyon_analyze_locks executable in the figure above, the execution time is 29.647 seconds. The total execution time is the baseline against which you will compare subsequent runs of the application.
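A single run can be skewed by transient system activity, so a baseline is often taken as the mean of several runs. The sketch below shows that arithmetic only; the baseline_seconds and improvement_seconds functions are hypothetical helpers, not part of the sample application or of VTune Amplifier XE.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Mean of several measured execution times, in seconds.
double baseline_seconds(const std::vector<double>& run_times) {
    assert(!run_times.empty());
    const double total = std::accumulate(run_times.begin(), run_times.end(), 0.0);
    return total / static_cast<double>(run_times.size());
}

// Time saved by a later run relative to the baseline, in seconds.
// A negative value means the later run was slower than the baseline.
double improvement_seconds(double baseline, double later_run) {
    return baseline - later_run;
}
```

For example, comparing a later run against the 29.647-second execution time noted above amounts to calling improvement_seconds(29.647, later_run) with the newly measured time.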
NOTE Run the application several times, note the execution time for each run, and use the average number. This helps to minimize skewed results due to transient system activity.

Create a Project
1. Set the EDITOR or VISUAL environment variable to associate your source files with a code editor (such as emacs, vi, vim, or gedit). For example:
$ export EDITOR=gedit
2. From the bin32 directory (for IA-32 architecture) or the bin64 directory (for Intel(R) 64 architecture) under the installation directory, run the amplxe-gui script to launch the VTune Amplifier XE. By default, the installation directory is /opt/intel/vtune_amplifier_xe_2011.
3. Create a new project via File > New > Project.... The Create a Project dialog box opens.
4. Specify the project name tachyon, which is used as the project directory name. VTune Amplifier XE creates a project directory under the root/intel/My Amplifier XE Projects directory and opens the Project Properties: Target dialog box.
5. In the Application to Launch pane of the Target tab, specify and configure your target as follows:
• For the Application field, browse to the tachyon_analyze_locks executable in your sample code directory (for example, /home/intel/samples/en/tachyon_vtune_amp_xe/tachyon_analyze_locks).
• For the Application parameters field, specify dat/balls.dat.
6. Click OK to apply the settings and exit the Project Properties dialog box.

Recap
You built the target in the Release mode, created the performance baseline, and created the VTune Amplifier XE project for your analysis target. Your application is ready for analysis.

Key Terms and Concepts
• Term: target, baseline
• Concept: Locks and Waits Analysis

Next Step
Run Locks and Waits Analysis

Run Locks and Waits Analysis
Before running an analysis, choose a configuration level to define the Intel(R) VTune(TM) Amplifier XE analysis scope and running time. In this tutorial, you run the Locks and Waits analysis to identify synchronization objects that caused contention and fix the problem in the source.
To run an analysis:
1. From the VTune Amplifier XE toolbar, click the New Analysis button. The VTune Amplifier XE result tab opens with the Analysis Type window active.
2. From the analysis tree on the left, select Algorithm Analysis > Locks and Waits. The right pane is updated with the default options for the Locks and Waits analysis.
3. Click the Start button on the right command bar. The VTune Amplifier XE launches the tachyon_analyze_locks executable, which renders the balls.dat input file, calculates the execution time, and exits. The VTune Amplifier XE finalizes the collected data and opens the results in the Locks and Waits viewpoint.
NOTE To make sure the performance of the application is repeatable, go through the entire tuning process on the same system with a minimal amount of other software executing.

Recap
You ran the Locks and Waits data collection that analyzes how long the application had to wait on each synchronization object, or on blocking APIs, such as sleep() and blocking I/O, and estimates processor utilization during the wait.
NOTE This tutorial explains how to run an analysis from the VTune Amplifier XE graphical user interface (GUI). You can also use the VTune Amplifier XE command-line interface (amplxe-cl command) to run an analysis. For more details, check the Command-line Interface Support section of the VTune Amplifier XE Help.

Key Terms and Concepts
• Term: viewpoint
• Concept: Locks and Waits Analysis, Finalization

Next Step
Interpret Result Data

Interpret Result Data
When the sample application exits, the Intel(R) VTune(TM) Amplifier XE finalizes the results and opens the Locks and Waits viewpoint that consists of the Summary window, Bottom-up pane, Top-down Tree pane, Call Stack pane, and Timeline pane.
To interpret the data on the sample code performance, do the following:
• Analyze the basic performance metrics provided by the Locks and Waits analysis.
• Identify locks.
NOTE The screenshots and execution time data provided in this tutorial are created on a system with four CPU cores. Your data may vary depending on the number and type of CPU cores on your system.

Analyze the Basic Locks and Waits Metrics
Start with exploring the data provided in the Summary window for the whole application performance. To interpret the data, hover over the question mark icons to read the pop-up help and better understand what each performance metric means.
The Result Summary section provides data on the overall application performance per the following metrics:
1) Elapsed Time is the total time for each core when it was either waiting or not utilized by the application;
2) Total Thread Count is the number of threads in the application;
3) Wait Time is the amount of time the application threads waited for some event to occur, such as synchronization waits and I/O waits;
4) Wait Count is the overall number of times the system wait API was called for the analyzed application;
5) CPU Time is the sum of CPU time for all threads;
6) Spin Time is the time a thread is active in a synchronization construct.
For the tachyon_analyze_locks application, the Wait time is high. To identify the cause, you need to understand how this Wait time was distributed per synchronization objects.
The Top Waiting Objects section provides the list of five synchronization objects with the highest Wait Time and Wait Count, sorted by the Wait Time metric. For the tachyon_analyze_locks application, focus on the first three objects and explore the Bottom-up pane data for more details.
The Thread Concurrency Histogram represents the Elapsed time and concurrency level for the specified number of running threads.
Ideally, the highest bar of your chart should be within the Ok or Ideal utilization range.
Note the Target value. By default, this number is equal to the number of physical cores. Consider this number as your optimization goal.
The Average metric is calculated as CPU time / Elapsed time. Use this number as a baseline for your performance measurements. The closer this number is to the number of cores, the better.
For the sample code, the chart shows that tachyon_analyze_locks is a multithreaded application running two threads on a machine with four cores, but it is not using the available cores effectively. The Average CPU Usage on the chart is about 0.8, while your target should be to bring it as close to 4 as possible (for the system with four cores).
Hover over the second bar to understand how long the application ran serially. The tooltip shows that the application ran one thread for almost 29 seconds, which is classified as Poor concurrency.
The CPU Usage Histogram represents the Elapsed time and usage level for the logical CPUs. Ideally, the highest bar of your chart should be within the Ok or Ideal utilization range.
The tachyon_analyze_locks application ran mostly on one logical CPU. If you hover over the second bar, you see that it spent 24.897 seconds using one core only, which is classified by the VTune Amplifier XE as a Poor utilization. To understand what prevented the application from using all available logical CPUs effectively, explore the Bottom-up pane.

Identify Locks
Click the Bottom-up tab to open the Bottom-up pane.
• Synchronization objects that control threads in the application. The hash (unique number) appended to some names of the objects identifies the stack creating this synchronization object. For Intel(R) Threading Building Blocks (Intel(R) TBB), VTune Amplifier XE is able to recognize all types of Intel TBB objects.
To display an overhead introduced by Intel TBB library internals, the VTune Amplifier XE creates a pseudo synchronization object TBB scheduler that includes all waits from the Intel TBB runtime libraries.
• The utilization of the processor time when a given thread waited for some event to occur. By default, the synchronization objects are sorted by Poor processor utilization type. Bars showing OK or Ideal utilization (orange and green) indicate that the processors are utilized well. You should focus your optimization efforts on functions with the longest poor CPU utilization (red bars if the bar format is selected). Next, search for the longest over-utilized time (blue bars). This is the Data of Interest column for the Locks and Waits analysis results that is used for different types of calculations, for example: call stack contribution, percentage value on the filter toolbar.
• Number of times the corresponding system wait API was called. For a lock, it is the number of times the lock was contended and caused a wait. Usually you are recommended to focus your tuning efforts on the waits with both high Wait Time and Wait Count values, especially if they have poor utilization.
• Wait time, during which the CPU is busy. This often occurs when a synchronization API causes the CPU to poll while the software thread is waiting. Some Spin time may be preferable to the alternative of the increased thread context switches. However, too much Spin time can reflect lost opportunity for productive work.
For the analyzed sample code, you see that the top three synchronization objects caused the longest Wait time. The red bars in the Wait Time column indicate that most of the time for these objects processor cores were underutilized.
Consider the first item in the Bottom-up pane, which is the most interesting one. It is a Mutex that shows much serial time and is causing a wait.
Click the arrow sign at the object name to expand the node and see the draw_task wait function that contains this mutex and call stack. Double-click the Mutex to see the source code for the wait function.

Recap
You identified a synchronization object with the high Wait Time and Wait Count values and poor CPU utilization that could be a lock affecting application parallelism. Your next step is to analyze the code of this function.

Key Terms and Concepts
• Term: Elapsed time, Wait time
• Concept: Locks and Waits Analysis, CPU Usage, Data of Interest

Next Step
Analyze Code

Analyze Code
You identified the mutex that caused significant Wait time and poor processor utilization. Double-click this critical section in the Bottom-up pane to view the source. The Intel(R) VTune(TM) Amplifier XE opens source and disassembly code. Focus on the Source pane and analyze the source code:
• Understand basic options provided in the Source window.
• Identify the hottest code lines.

Understand Basic Source View Options
The table below explains some of the features available in the Source pane for the Locks and Waits viewpoint.
• Source code of the application displayed if the function symbol information is available. When you go to the source by double-clicking the synchronization object in the Bottom-up pane, the VTune Amplifier XE opens the wait function containing this object and highlights the code line that took the most Wait time. The source code in the Source pane is not editable. If the function symbol information is not available, the Assembly pane opens displaying assembler instructions for the selected wait function. To view the source code in the Source pane, make sure to build the target properly.
• Processor time and utilization bar attributed to a particular code line.
The colored bar represents the distribution of the Wait time according to the utilization levels (Idle, Poor, Ok, Ideal, and Over) defined by the VTune Amplifier XE. The longer the bar, the higher the value. Ok utilization level is not available for systems with a small number of cores. This is the Data of Interest column for the Locks and Waits analysis.
• Number of times the corresponding system wait API was called while this code line was executing. For a lock, it is the number of times the lock was contended and caused a wait.
• Source window toolbar. Use hotspot navigation buttons to switch between most performance-critical code lines. Hotspot navigation is based on the metric column selected as a Data of Interest. For the Locks and Waits analysis, this is Wait Time. Use the source file editor button to open and edit your code in your default editor.

Identify the Hottest Code Lines
The VTune Amplifier XE highlights line 165 entering the rgb_mutex mutex in the draw_task function. The draw_task function was waiting for almost 27 seconds while this code line was executing and most of the time the processor was underutilized. During this time, the critical section was contended 491 times.
The rgb_mutex is the place where the application is serializing. Each thread has to wait for the mutex to be available before it can proceed. Only one thread can be in the mutex at a time. You need to optimize the code to make it more concurrent.
Click the Source Editor button on the Source window toolbar to open the code editor and optimize the code.

Recap
You identified the code section that caused a significant wait and during which the processor was poorly utilized.

Key Terms and Concepts
• Term: Wait time
• Concept: CPU Usage, Locks and Waits Analysis, Data of Interest

Next Step
Remove Lock

Remove Lock
In the Source window, you located the mutex that caused a significant wait while the processor cores were underutilized and that generated multiple waits.
Focus on this line and do the following:
1. Open the code editor.
2. Modify the code to remove the lock.

Open the Code Editor
Click the Source Editor button to open the analyze_locks.cpp file in your default editor at the hotspot code line:

Remove the Lock
The rgb_mutex was introduced to protect calculation from multithreaded access. The brief analysis shows that the code is thread safe and the mutex is not really needed. To resolve this issue:
1. Comment out code lines 165 and 172 to disable the mutex.
2. Save the changes made in the source file.
3. Browse to the directory where you extracted the sample code (for example, /home/intel/samples/en/tachyon_vtune_amp_xe).
4. Rebuild your target in the release mode using the make command as follows:
$ make clean
$ make release
The tachyon_analyze_locks application is rebuilt and stored in the tachyon_vtune_amp_xe directory.
5. Run tachyon_analyze_locks as follows:
$ /home/intel/samples/en/tachyon_vtune_amp_xe/tachyon_analyze_locks dat/balls.dat
System runs the tachyon_analyze_locks application. Note that execution time reduced from 29.647 seconds to 14.615 seconds.

Recap
You optimized the application execution time by removing the unnecessary critical section that caused a lot of Wait time.

Key Terms and Concepts
• Term: hotspot
• Concept: Locks and Waits Analysis

Next Step
Compare with Previous Result

Compare with Previous Result
You made sure that removing the mutex gave you 15 seconds of optimization in the application execution time. To understand the impact of your changes and how the CPU utilization has changed, re-run the Locks and Waits analysis on the optimized code and compare results:
• Compare results before and after optimization.
• Identify the performance gain.
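
As a rough illustration of the Remove Lock step, the sketch below mimics the pattern the tutorial describes: worker threads compute pixel colors, and a global mutex wraps the calculation. It is not the tachyon source. Only the name rgb_mutex comes from the tutorial; the functions shade and render and the row-striping scheme are hypothetical and were chosen for the example. The lock can be dropped in this sketch because each thread writes only to its own rows, which is the kind of property the tutorial's "brief analysis" would have to confirm in real code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for the per-pixel color calculation.
// It reads only its arguments, so concurrent calls cannot interfere.
static std::uint32_t shade(int x, int y) {
    return static_cast<std::uint32_t>(x * 31 + y * 17) & 0xFFFFFFu;
}

static std::mutex rgb_mutex;  // name taken from the tutorial

// Renders a width x height image using `threads` workers.
// Worker t handles rows t, t + threads, t + 2 * threads, ...
// so no two workers ever write the same pixel.
std::vector<std::uint32_t> render(int width, int height, int threads, bool use_lock) {
    assert(width > 0 && height > 0 && threads > 0);
    std::vector<std::uint32_t> image(static_cast<std::size_t>(width) * height);
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&, t] {
            for (int y = t; y < height; y += threads) {
                for (int x = 0; x < width; ++x) {
                    const std::size_t idx = static_cast<std::size_t>(y) * width + x;
                    if (use_lock) {
                        // Original pattern: every pixel waits for the global lock,
                        // which serializes the workers.
                        std::lock_guard<std::mutex> guard(rgb_mutex);
                        image[idx] = shade(x, y);
                    } else {
                        // Lock removed: safe here only because rows are disjoint.
                        image[idx] = shade(x, y);
                    }
                }
            }
        });
    }
    for (auto& worker : pool) worker.join();
    return image;
}
```

Both variants produce the same image, for example `render(64, 48, 4, true) == render(64, 48, 4, false)`; they differ only in how long the workers spend waiting. If workers shared any mutable state, removing the lock would not be safe.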

Compare Results Before and After Optimization
1. Run the Locks and Waits analysis on the modified code.
2. Click the Compare Results button on the Intel(R) VTune(TM) Amplifier XE toolbar. The Compare Results window opens.
3. Specify the Locks and Waits analysis results you want to compare:

The Summary window opens providing the statistics for the difference between collected results. Click the Bottom-up tab to see the list of synchronization objects used in the code, Wait time utilization across the two results, and the differences side by side:

Difference in Wait time per utilization level between the two results in the following format: = . By default, the Difference column is expanded to display comparison data per utilization level. You may collapse the column to see the total difference data per Wait time.

Wait time and CPU utilization for the initial version of the code.

Wait time and CPU utilization for the optimized version of the code.

Difference in Wait count between the two results in the following format: = - .

Wait count for the initial version of the code.

Wait count for the optimized version of the code.

Identify the Performance Gain
The Elapsed time data in the Summary window shows the optimization of 4 seconds for the whole application execution and Wait time decreased by 37.5 seconds.

According to the Thread Concurrency histogram, before optimization (blue bar) the application ran serially for 9 seconds poorly utilizing available processor cores but after optimization (orange bar) it ran serially only for 2 seconds. After optimization the application ran 5 threads simultaneously overutilizing the cores for almost 5 seconds. Further, you may consider this direction as an additional area for improvement.

In the Bottom-up pane, locate the Mutex you identified as a bottleneck in your code.
Since you removed it during optimization, the optimized result r004lw does not show any performance data for this synchronization object. If you collapse the Wait Time:Difference column by clicking the button, you see that with the optimized result you got almost 27 seconds of optimization in Wait time.

Recap
You ran the Locks and Waits analysis on the optimized code and compared the results before and after optimization using the Compare mode of the VTune Amplifier XE. The comparison shows that, with the optimized version of the tachyon_analyze_locks application (r004lw result), you managed to remove the lock preventing application parallelism and significantly reduce the application execution time.

Compare analysis results regularly to look for regressions and to track how incremental changes to the code affect its performance. You may also want to use the VTune Amplifier XE command-line interface and run the amplxe-cl command to test your code for regressions. For more details, see the Command-line Interface Support section in the VTune Amplifier XE online help.

Key Terms and Concepts
• Term: hotspot, Wait time
• Concept: Locks and Waits Analysis, CPU Usage

Next Step
Read Summary

Summary
You have completed the Analyzing Locks and Waits tutorial. Here are some important things to remember when using the Intel(R) VTune(TM) Amplifier XE to analyze your code for locks and waits:

Step 1. Choose and Build Your Target
• Create a performance baseline to compare the application versions before and after optimization. Make sure to use the same workload for each application run.
• Create a VTune Amplifier XE project and use the Project Properties: Target tab to choose and configure your analysis target.

Step 2. Run Analysis
• Use the Analysis Type configuration window to choose, configure, and run the analysis.
For example, you may limit the data collection to a predefined amount of data or enable the VTune Amplifier XE to collect more accurate CPU time data. You can also run the analysis from the command line using the amplxe-cl command.

Step 3. Interpret Results and Resolve the Issue
• Start analyzing the performance of your application with the Summary pane to explore the performance metrics for the whole application. Then, move to the Bottom-up window to analyze the synchronization objects. Focus on the synchronization objects that under- or over-utilized the available logical CPUs and have the highest Wait time and Wait Count values. By default, the objects with the highest Wait time values show up at the top of the window.
• Expand the most time-critical synchronization object in the Bottom-up pane and double-click the wait function it belongs to. This opens the source code for this wait function at the code line with the highest Wait time value.

Step 4. Compare Results Before and After Optimization
• Perform regular regression testing by comparing analysis results before and after optimization. From the GUI, click the Compare Results button on the VTune Amplifier XE toolbar. From the command line, use the amplxe-cl command.
• Expand each data column by clicking the button to identify the performance gain per CPU utilization level.

Chapter 3: Tutorial: Identifying Hardware Issues

Learning Objectives
This tutorial shows how to use the General Exploration analysis of the Intel(R) VTune(TM) Amplifier XE to identify the hardware-related issues in the sample application.
Estimated completion time: 15 minutes.
Sample application: matrix.
After you complete this tutorial, you should be able to:
• Choose an analysis target.
• Run the General Exploration analysis for Intel(R) microarchitecture code name Nehalem.
• Understand the event-based performance metrics.
• Identify the types of the most critical hardware issues for the application as a whole.
• Identify the modules/functions that caused the most critical hardware issues.
• Analyze the source code to locate the most critical code lines.
• Identify the next steps of the performance analysis to get more detailed results.

Start Here
Workflow Steps to Identify Hardware Issues

Workflow Steps to Identify Hardware Issues
You can use an advanced event-based sampling analysis of the Intel® VTune™ Amplifier XE to identify the most significant hardware issues that affect the performance of your application. This tutorial guides you through these workflow steps running the General Exploration analysis type on a sample matrix application.
1. Build an application to analyze for hardware issues and create a new VTune Amplifier XE project.
2. Choose and run the General Exploration analysis.
3. Interpret the result data.
4. View and analyze code of the performance-critical functions.
5. Modify the code to resolve the detected performance issues and rebuild the code.

Build Application and Create New Project
Before you start analyzing hardware issues affecting the performance of your application, do the following:
1. Build application in the release mode with full optimizations.
2. Create a VTune Amplifier XE project.

Choose a Build Mode and Build a Target
1. Browse to the directory where you extracted the sample code (for example, /home/sample/matrix/linux). Make sure this directory contains Makefile.
2. Build your target in the release mode using the make command. The matrix application is automatically built with the GNU* compiler (as matrix.gcc) and stored in the matrix/linux directory.

Create a Project
1. Set the EDITOR or VISUAL environment variable to associate your source files with the code editor (like emacs, vi, vim, gedit, and so on). For example:
$ export EDITOR=gedit
2.
From the /bin32 directory (for IA-32 architecture) or from the /bin64 directory (for Intel(R) 64 architecture), run the amplxe-gui script launching the VTune Amplifier XE GUI client. By default, the installation directory is /opt/intel/vtune_amplifier_xe_2011.
3. Create a new project via File > New > Project.... The Create a Project dialog box opens.
4. Specify the project name matrix that will be used as the project directory name and click the Create Project button. By default, the VTune Amplifier XE creates a project directory under the root/intel/amplxe/Projects directory and opens the Project Properties: Target dialog box.
5. In the Target: Application to Launch pane, browse to the matrix.gcc application and click OK.

Recap
You built the target in the Release mode and created the VTune Amplifier XE project for your analysis target. Your application is ready for analysis.

Key Terms and Concepts
• Term: target
• Concept: Event-based Sampling Analysis

Next Step
Run General Exploration Analysis

Run General Exploration Analysis
Before running an analysis, choose a configuration level to influence Intel(R) VTune(TM) Amplifier XE analysis scope and running time. In this tutorial, you run the General Exploration analysis on the Intel(R) Core(TM) i7 processor based on the Intel(R) microarchitecture code name Nehalem. The General Exploration analysis type helps identify the widest scope of hardware issues that affect the application performance. This analysis type is based on the hardware event-based sampling collection.

To run the analysis:
1. From the VTune Amplifier XE toolbar, click the New Analysis button. The New Amplifier XE Result tab opens with the Analysis Type configuration window active.
2. From the analysis tree on the left, select the Advanced Intel(R) Microarchitecture Code Name Nehalem Analysis > General Exploration analysis type.
3. Click the Start button on the right to run the analysis.
The VTune Amplifier XE launches the matrix application that calculates matrix transformations and exits. The VTune Amplifier XE finalizes the collected data and opens the results in the Hardware Issues viewpoint.

NOTE
To make sure the performance of the application is repeatable, go through the entire tuning process on the same system with a minimal amount of other software executing.

Recap
You ran the General Exploration analysis that monitors how your application performs against a set of event-based hardware metrics. To see the list of processor events used for this analysis type, see the Details section of the General Exploration configuration pane.

Key Terms and Concepts
• Term: viewpoint
• Concept: Event-based Sampling Analysis, Finalization

Next Step
Interpret Results

Interpret Results
When the application exits, the Intel(R) VTune(TM) Amplifier XE finalizes the results and opens the Hardware Issues viewpoint that consists of the Summary window, Bottom-up window, and Timeline pane. To interpret the collected data and understand where you should focus your tuning efforts for the specific hardware, do the following:
• Understand the event-based metrics
• Identify the hardware issues that affect the performance of your application

NOTE
The screenshots and execution time data provided in this tutorial are created on a system with four CPU cores. Your data may vary depending on the number and type of CPU cores on your system.

Understand the Event-based Metrics
Click the Summary tab to explore the data provided in the Summary window for the whole application performance.

Elapsed time is the wall time from the beginning to the end of the collection. Treat this metric as your basic performance baseline against which you will compare subsequent runs of the application. The goal of your optimization is to reduce the value of this metric.

Event-based performance metrics.
Each metric is an event ratio provided by Intel architects. Mouse over the yellow icon to see the metric description and formula used for the metric calculation.

Values calculated for each metric based on the event count. VTune Amplifier XE highlights those values that exceed the threshold set for the corresponding metric. Such a value highlighted in pink signifies an application-level hardware issue.

The text below a metric with the detected hardware issue describes the issue, potential cause and recommendations on the next steps, and displays a threshold formula used for calculation. Mouse over the truncated text to read a full description.

A quick look at the summary results shows that the matrix application has the following issues:
• CPI (Clockticks per Instructions Retired) Rate
• Retire Stalls
• LLC Miss
• LLC Load Misses Serviced by Remote DRAM
• Execution Stalls
• Data Sharing

Identify the Hardware Issues
Click the Bottom-up tab to open the Bottom-up window and see how each program unit performs against the event-based metrics. Each row represents a program unit and percentage of the CPU cycles used by this unit. Program units that take more than 5% of the CPU time are considered hotspots. This means that by resolving a hardware issue that, for example, took about 20% of the CPU cycles, you can obtain 20% optimization for the hotspot.

By default, the VTune Amplifier XE sorts data in the descending order by Clockticks and provides the hotspots at the top of the list. You see that the multiply1 function is the most obvious hotspot in the matrix application. It has the highest event count (Clockticks and Instructions Retired events) and most of the hardware issues were also detected during execution of this function.

NOTE
Mouse over a column header with an event-based metric name to see the metric description. Mouse over a highlighted cell to read the description of the hardware issue detected for the program unit.
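
The arithmetic behind two of the figures above is simple enough to sketch. CPI is the Clockticks count divided by the Instructions Retired count, and a program unit counts as a hotspot when it takes more than 5% of the total. The function names below are illustrative, and the CPI threshold of 1 follows the "CPI Rate is high (>1)" rule quoted in this tutorial. This is a sketch of the ratios only, not the VTune Amplifier XE implementation, and the event counts in the usage note are made up.

```cpp
#include <cassert>
#include <cstdint>

// Clockticks per Instructions Retired (CPI).
double cpi(std::uint64_t clockticks, std::uint64_t instructions_retired) {
    assert(instructions_retired > 0);
    return static_cast<double>(clockticks) / static_cast<double>(instructions_retired);
}

// True when the CPI exceeds 1.0, the threshold this tutorial quotes
// for the CPI Rate metric.
bool cpi_is_high(std::uint64_t clockticks, std::uint64_t instructions_retired) {
    return cpi(clockticks, instructions_retired) > 1.0;
}

// Fraction of all clockticks attributed to one program unit.
double clocktick_share(std::uint64_t unit_clockticks, std::uint64_t total_clockticks) {
    assert(total_clockticks > 0 && unit_clockticks <= total_clockticks);
    return static_cast<double>(unit_clockticks) / static_cast<double>(total_clockticks);
}

// A program unit is treated as a hotspot when it takes more than 5%
// of the total clockticks.
bool is_hotspot(std::uint64_t unit_clockticks, std::uint64_t total_clockticks) {
    return clocktick_share(unit_clockticks, total_clockticks) > 0.05;
}
```

With hypothetical counts, 3000 clockticks over 2000 retired instructions give a CPI of 1.5, which would be flagged as high, and a function with 600 of 10000 clockticks (6%) would be listed as a hotspot.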

For the multiply1 function, the VTune Amplifier XE highlights the same issues that were detected as the issues affecting the performance of the whole application:
• CPI Rate is high (>1). Potential causes are memory stalls, instruction starvation, branch misprediction, or long-latency instructions. To define the cause for your code, explore other metrics in the Bottom-up window.
• The Retire Stalls metric shows that during the execution of the multiply1 function, about 95% (0.945) of CPU cycles were waiting for data to arrive. This may result from branch misprediction, instruction starvation, long-latency operations, and other issues. Once you have located the stalled instructions in your code, analyze metrics such as LLC Miss, Execution Stalls, Remote Accesses, Data Sharing, and Contested Accesses. You can also look for long-latency instructions like divisions and string operations to understand the cause.
• The LLC Miss metric shows that about 120% (1.220) of CPU cycles were spent waiting for LLC load misses to be serviced. Possible optimizations are to reduce the data working set size, improve data access locality, block and consume data in chunks that fit in the LLC, or better exploit hardware prefetchers. Consider using software prefetchers, but beware that they can increase latency by interfering with normal loads and can increase pressure on the memory system.
• The LLC Load Misses Serviced by Remote DRAM metric shows that 55% (0.554) of cycles were spent servicing memory requests from remote DRAM. Wherever possible, try to consistently use data on the same core, or at least the same package, as it was allocated on.
• The Execution Stalls metric shows that 36% (0.364) of cycles were spent with no micro-operations executed. Look for long-latency operations at code regions with high execution stalls and try to use alternative methods or lower latency operations.
For example, consider replacing div operations with right-shifts or try to reduce the latency of memory accesses.
• The Data Sharing metric took about 7% (0.066) of cycles. To understand the cause, examine the Contested Accesses metric to determine whether the major component of data sharing is due to contested accesses or simple read sharing. Read sharing is a lower priority than Contested Accesses or issues such as LLC Misses and Remote Accesses. If simple read sharing is a performance bottleneck, consider changing data layout across threads or rearranging computation. However, this type of tuning may not be straightforward and could bring more serious performance issues back.

Recap
You analyzed the data provided in the Hardware Issues viewpoint, explored the event-based metrics, and identified the areas where your sample application had hardware issues. You also identified the exact function that performs poorly against these metrics and is a good candidate for further analysis.

Key Terms and Concepts
• Term: viewpoint, baseline, Elapsed time
• Concept: Event-based Sampling Analysis, Event-based Metrics

Next Step
Analyze Code

Analyze Code
You identified a hotspot function with a number of hardware issues. Double-click the multiply1 function in the Bottom-up window to open the source code:

The table below explains some of the features available in the Source pane when viewing the event-based sampling analysis data.

Source pane displaying the source code of the application, which is available if the function symbol information is available. The code line that took the highest number of Clockticks samples is highlighted. The source code in the Source pane is not editable.

Values per hardware event attributed to a particular code line. By default, the data is sorted by the Clockticks event count. Focus on the events that constitute the metrics identified as performance-critical in the Bottom-up window.
To identify these events, mouse over the metric column header in the Bottom-up window. Drag-and-drop the columns to organize the view for your convenience. VTune Amplifier XE remembers your settings and restores them each time you open the viewpoint.

Hotspot navigation buttons to switch between code lines that took a long time to execute.

Source file editor button to open and edit your code in the default editor.

Assembly button to toggle the Assembly pane that displays assembly instructions for the selected function.

In the Source pane for the multiply1 function, you see that line 38 took most of the Clockticks event samples during execution. But from your code knowledge, you understand that the culprit should be line 39. Due to event skid (that may happen at the low granularity level like source line, instruction, or basic block), the VTune Amplifier XE mistakenly attributed the samples collected for line 39 to line 38. This code section multiplies matrices in the loop but accesses the memory inefficiently. Focus on this section and try to reduce the memory issues.

Recap
You analyzed the code for the hotspot function identified in the Bottom-up window and located the hotspot line that generated a high number of CPU Clockticks.

Key Terms and Concepts
• Concept: Event Skid

Next Step
Resolve Issue

Resolve Issue
In the Source pane, you identified that in the multiply1 function the code line 39 resulted in the highest values for the Clockticks event. To solve this issue, do the following:
• Change the multiplication algorithm and, if using the Intel(R) compiler, enable vectorization.
• Re-run the analysis to verify optimization.

Change Algorithm
NOTE
The proposed solution is one of the multiple ways to optimize the memory access and is used for demonstration purposes only.

1. Open the matrix.c file from the sample code directory (for example, /home/sample/matrix/src).
For this sample, the matrix.c file is used to initialize the functions used in the multiply.c file.
2. In line 90, replace the multiply1 function name with the multiply2 function. This new function uses the loop interchange mechanism that optimizes the memory access in the code.

The proposed optimization assumes you may use the Intel(R) C++ Compiler to build the code. The Intel compiler helps vectorize the code, which means that it uses SIMD instructions that can work with several data elements simultaneously. If only one source file is used, the Intel compiler enables vectorization automatically. The current sample uses several source files, which is why the multiply2 function uses #pragma ivdep to instruct the compiler to ignore assumed vector dependencies. This information lets the compiler enable the Supplemental Streaming SIMD Extensions 3 (SSSE3).

3. Save files and rebuild the project using the compiler of your choice. If you have the Intel(R) compiler installed, you may run it from the code sample directory (for example: /home/sample/matrix/linux) as follows:
make icc
The matrix application is automatically built with the Intel compiler (as matrix.icc) and stored in the matrix/linux directory.

Verify Optimization
1. From the VTune Amplifier XE toolbar, click the Project Properties button. The Project Properties dialog box opens with the Target tab active. The Launch Application pane is open by default.
2. In the Application field, click the Browse... button and navigate to the updated matrix application. This tutorial uses the application compiled with the Intel compiler, matrix.icc.
3. Click OK to close the dialog box.
4. From the VTune Amplifier XE toolbar, click the New Analysis button. The Analysis Type configuration window opens.
5.
From the left pane, select Advanced Intel(R) Microarchitecture Code Name Nehalem Analysis > General Exploration and click the Start button on the right. VTune Amplifier XE reruns the General Exploration analysis for the updated matrix target and creates a new result, r001ge, that opens automatically.
6. In the r001ge result, click the Summary tab to see the Elapsed time value for the optimized code:

You see that the Elapsed time has reduced from 15.730 seconds to 1.678 seconds and the VTune Amplifier XE now identifies only three types of issues for the application performance: high CPI Rate, Retire Stalls, and LLC Miss.

Recap
You solved the memory access issue for the sample application by interchanging the loops and sped up the execution time. You also considered using the Intel compiler to enable instruction vectorization.

Key Terms and Concepts
• Concept: Event-based Sampling Analysis

Next Step
Resolve Next Issue

Optimization Notice
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804

Resolve Next Issue
You got a significant performance boost by optimizing the memory access for the multiply1 function.
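
For reference, the loop interchange applied in the Change Algorithm step can be sketched as follows. This is an illustrative reconstruction, not the code from multiply.c: the names multiply_ijk and multiply_ikj, the Matrix alias, and the flat row-major layout are assumptions made for the example. In the first version the inner loop walks down a column of b, jumping n elements on every step. In the second version the two inner loops are swapped, so the inner loop reads b and updates c along contiguous rows.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// n x n matrix stored row by row in one flat array (assumed layout).
using Matrix = std::vector<double>;

// Original loop order: the inner loop over k reads b[k * n + j],
// striding through memory by n elements per iteration.
void multiply_ijk(std::size_t n, const Matrix& a, const Matrix& b, Matrix& c) {
    assert(a.size() == n * n && b.size() == n * n && c.size() == n * n);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            for (std::size_t k = 0; k < n; ++k)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
}

// Interchanged loop order: the inner loop over j touches b and c
// at consecutive addresses, which is friendlier to the caches.
void multiply_ikj(std::size_t n, const Matrix& a, const Matrix& b, Matrix& c) {
    assert(a.size() == n * n && b.size() == n * n && c.size() == n * n);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < n; ++k) {
            const double aik = a[i * n + k];  // reused across the whole inner loop
            for (std::size_t j = 0; j < n; ++j)
                c[i * n + j] += aik * b[k * n + j];
        }
}
```

Both functions accumulate into c, so c must be zeroed by the caller, and both compute the same product. Only the order in which memory is visited changes, which is why the interchange can reduce LLC misses without altering the result.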
According to the data provided in the Summary window for your updated result, r001ge, you still have high CPI rate, LLC Miss, and Retire Stalls issues. You can try to optimize your code further following the steps below:
• Analyze results after optimization
• Use more advanced algorithms
• Verify optimization

Analyze Results after Optimization
To get more details on the issues that still affect the performance of the matrix application, switch to the Bottom-up window:

You see that the multiply2 function (in fact, updated multiply1 function) is still a hotspot. Double-click this function to view the source code and click both the Source and Assembly buttons on the toolbar to enable the Source and Assembly panes.

In the Source pane, the VTune Amplifier XE highlights line 53 that took the highest number of Clockticks samples. This is again the section where matrices are multiplied. The Assembly pane is automatically synchronized with the Source pane. It highlights the basic blocks corresponding to the code line highlighted in the Source pane. If you compiled the application with the Intel(R) Compiler, you can see that highlighted block 1 includes vectorization instructions added after your previous optimization. All vectorization instructions have the p (packed) postfix (for example, mulpd). You may use the -vec-report3 option of the Intel compiler to generate the compiler optimization report and see which loops were not vectorized and why. For more details, see the Intel compiler documentation.

Use More Advanced Algorithms
1. Open the matrix.c file from the Source Files of the matrix project.
2. In line 90, replace the multiply2 function name with the multiply3 function. This function loads the matrix data in blocks.
3. Save the files and rebuild the project.

Verify Optimization
1.
From the VTune Amplifier XE File menu, select New > Quick Intel(R) Microarchitecture Code Name Nehalem - General Exploration Analysis. VTune Amplifier XE reruns the General Exploration analysis for the updated matrix target and creates a new result, r002ge, that opens automatically.
2. In the r002ge result, click the Summary tab to see the Elapsed time value for the optimized code:

You see that the Elapsed time has reduced a little, from 1.678 seconds to 1.244 seconds, but the hardware issues identified in the previous run (CPI Rate, Retire Stalls, and LLC Miss) stayed practically the same. This means that there is more room for improvement and you can try other, more effective, mechanisms of matrix multiplication.

Recap
You tried optimizing the mechanism of matrix multiplication and obtained 0.4 seconds of optimization in the application execution time.

Key Terms and Concepts
• Concept: Event-based Sampling Analysis, Event-based Metrics

Next Step
Read Summary

Summary
You have completed the Identifying Hardware Issues tutorial. Here are some important things to remember when using the Intel(R) VTune(TM) Amplifier XE to analyze your code for hardware issues:

Step 1. Choose and Build Your Target
• Create a VTune Amplifier XE project and use the Project Properties: Target tab to choose and configure your analysis target.

Step 2. Run Analysis
• Use the Analysis Type configuration window to choose, configure, and run the analysis. You may choose between a predefined analysis type like the General Exploration type used in this tutorial, or create a new custom analysis type and add events of your choice. For more details on the custom collection, see the Creating a New Analysis Type topic in the product online help.

Step 3.
Interpret Results and Resolve the Issue
• Start analyzing the performance of your application from the Summary window to explore the event-based performance metrics for the whole application. Mouse over the yellow help icons to read the metric descriptions. Use the Elapsed time value as your performance baseline.
• Move to the Bottom-up window and analyze the performance per function. Focus on the hotspots - functions that took the highest Clockticks event count. By default, they are located at the top of the table. Analyze the hardware issues detected for the hotspot functions. Hardware issues are highlighted in pink. Mouse over a highlighted value to read the issue description and see the threshold formula.
• Double-click the hotspot function in the Bottom-up pane to open its source code at the code line that took the highest Clockticks event count.
• Consider using Intel(R) Compiler to vectorize instructions. Explore the compiler documentation for more details.

Chapter 4: More Resources

Getting Help
Intel(R) VTune(TM) Amplifier XE provides a number of Getting Started tutorials. These tutorials use a sample application to demonstrate the basic product features and workflows. You can access these documents through the Help menu or by clicking the VTune Amplifier XE icon. For the standalone user interface, the tutorials are available via the Help > Getting Started Tutorials menu.

To view help in the standalone user interface, select Intel VTune Amplifier XE 2011 Help from the Help menu.

Navigating in the Product Usage Workflow
Where applicable, the VTune Amplifier XE help topics provide a Where am I in the workflow? button. Click the button to view the workflow with a highlight on the stage that this topic discusses.

Using Context-Sensitive Help
Context-sensitive help enables easy access to help topics on active GUI elements. The following context-sensitive help features are available on a product-specific basis:
• F1 Help: Press F1 to get help for an active dialog box, property page, pane, or window.

Product Website and Support
The following links provide information and support on Intel software products, including Intel(R) Parallel Studio XE:
• http://software.intel.com/en-us/articles/tools/ Intel(R) Software Development Products Knowledge Base.
• http://www.intel.com/software/products/support/ Technical support information, to register your product, or to contact Intel.
For additional support information, see the Technical Support section of your Release Notes.

System Requirements
For detailed information on system requirements, see the Release Notes.
Intel(R) VTune(TM) Amplifier XE Tutorials Troubleshooting

Troubleshooting
Problem: The Start button is disabled
The Start button on the command toolbar is disabled.
Solution: Make sure you specified an analysis target. If the target is not specified, click the Project Properties button on the command toolbar and enter the target name in the Application to Launch pane. For the General Exploration analysis, the Start button may be disabled if you mistakenly chose the incorrect processor type. The selected analysis type should match your processor type.

Intel(R) VTune(TM) Amplifier XE 2011 Getting Started Tutorials for Windows* OS
Document Number: 323906-005US
World Wide Web: http://developer.intel.com

Contents
Legal Information
Introducing the Intel® VTune™ Amplifier XE
Prerequisites
Navigation Quick Start
Key Terms and Concepts
Chapter 1: Tutorial: Finding Hotspots
• Learning Objectives
• Workflow Steps to Identify and Analyze Hotspots
• Visual Studio* IDE: Choose Project and Build Application
• Standalone GUI: Build Application and Create New Project
• Run Hotspots Analysis
• Interpret Result Data
• Analyze Code
• Tune Algorithms
• Compare with Previous Result
• Summary
Chapter 2: Tutorial: Analyzing Locks and Waits
• Learning Objectives
• Workflow Steps to Identify Locks and Waits
• Visual Studio* IDE: Choose Project and Build Application
• Standalone GUI: Build Application and Create New Project
• Run Locks and Waits Analysis
• Interpret Result Data
• Analyze Code
• Remove Lock
• Compare with Previous Result
• Summary
Chapter 3: Tutorial: Identifying Hardware Issues
• Learning Objectives
• Workflow Steps to Identify Hardware Issues
• Visual Studio* IDE: Choose Project and Build Application
• Standalone GUI: Build Application and Create New Project
• Run General Exploration Analysis
• Interpret Results
• Analyze Code
• Resolve Issue
• Resolve Next Issue
• Summary
Chapter 4: More Resources
• Getting Help
• Product Website and Support
Chapter 5: Intel® VTune™ Amplifier XE Tutorials Troubleshooting
• Troubleshooting

Legal Information
INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR.

Intel may make changes to specifications and product descriptions at any time, without notice.
Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.

Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm

Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. Go to: http://www.intel.com/products/processor_number/

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
BlueMoon, BunnyPeople, Celeron, Celeron Inside, Centrino, Centrino Inside, Cilk, Core Inside, E-GOLD, i960, Intel, the Intel logo, Intel AppUp, Intel Atom, Intel Atom Inside, Intel Core, Intel Inside, Intel Insider, the Intel Inside logo, Intel NetBurst, Intel NetMerge, Intel NetStructure, Intel SingleDriver, Intel SpeedStep, Intel Sponsors of Tomorrow., the Intel Sponsors of Tomorrow. logo, Intel StrataFlash, Intel vPro, Intel XScale, InTru, the InTru logo, the InTru Inside logo, InTru soundmark, Itanium, Itanium Inside, MCS, MMX, Moblin, Pentium, Pentium Inside, Puma, skoool, the skoool logo, SMARTi, Sound Mark, The Creators Project, The Journey Inside, Thunderbolt, Ultrabook, vPro Inside, VTune, Xeon, Xeon Inside, X-GOLD, XMM, X-PMU and XPOSYS are trademarks of Intel Corporation in the U.S. and/or other countries.

*Other names and brands may be claimed as the property of others.

Microsoft, Windows, Visual Studio, Visual C++, and the Windows logo are trademarks, or registered trademarks of Microsoft Corporation in the United States and/or other countries. Microsoft product screen shot(s) reprinted with permission from Microsoft Corporation.

Java is a registered trademark of Oracle and/or its affiliates.

Copyright (C) 2010-2011, Intel Corporation. All rights reserved.

Introducing the Intel® VTune™ Amplifier XE
The Intel® VTune™ Amplifier XE, an Intel® Parallel Studio XE tool, provides information on code performance for users developing serial and multithreaded applications on Windows* and Linux* operating systems. On Windows systems, the VTune Amplifier XE integrates into Microsoft Visual Studio* software and is also available as a standalone GUI client. On Linux systems, VTune Amplifier XE works only as a standalone GUI client.
On both Windows and Linux systems, you can benefit from using the command-line interface for collecting data remotely or for performing regression testing.

VTune Amplifier XE helps you analyze the algorithm choices and identify where and how your application can benefit from available hardware resources. Use the VTune Amplifier XE to locate or determine the following:
• The most time-consuming (hot) functions in your application and/or on the whole system
• Sections of code that do not effectively utilize available processor time
• The best sections of code to optimize for sequential performance and for threaded performance
• Synchronization objects that affect the application performance
• Whether, where, and why your application spends time on input/output operations
• The performance impact of different synchronization methods, different numbers of threads, or different algorithms
• Thread activity and transitions
• Hardware-related bottlenecks in your code

Intel VTune Amplifier XE Tutorials
These tutorials tell you how to use the VTune Amplifier XE to analyze the performance of a sample application by identifying software- and hardware-related issues in the code.
• Finding Hotspots
• Analyzing Locks and Waits
• Identifying Hardware Issues
Check http://software.intel.com/en-us/articles/intel-software-product-tutorials/ for the printable version (PDF) of product tutorials.

See Also
Getting Help

Prerequisites
You need the following tools, skills, and knowledge to effectively use these tutorials.

NOTE The instructions and screen shots in these tutorials refer to the Visual Studio* 2005 integrated development environment (IDE). They may slightly differ for other versions of Visual Studio IDE or for the standalone version of the Intel® VTune™ Amplifier XE. See online help for details.
Required Tools
You need the following tools to use these tutorials:
• Intel® VTune™ Amplifier XE
• Sample code included with the VTune Amplifier XE. VTune Amplifier XE provides the following sample applications:
  • tachyon application used for the Finding Hotspots and Analyzing Locks and Waits tutorials
  • matrix application used for the Identifying Hardware Issues tutorial
• VTune Amplifier XE Help
• Microsoft Visual Studio* 2005 or later

To acquire the VTune Amplifier XE:
If you do not already have access to the VTune Amplifier XE, you can download an evaluation copy from http://software.intel.com/en-us/articles/intel-software-evaluation-center/.
To install the VTune Amplifier XE, follow the instructions in the Release Notes.

To install and set up VTune Amplifier XE sample code:
1. Copy the tachyon_vtune_amp_xe.zip and matrix_vtune_amp_xe.zip files from the samples \\C++ folder in the Intel VTune Amplifier XE installation directory to a writable directory or share on your system. The default installation directory is C:\Program Files\Intel\VTune Amplifier XE 2011 (on certain systems, the folder name is Program Files (x86) instead of Program Files).
2. Extract the sample(s) from the .zip file.

NOTE
• Samples are non-deterministic. Your screens may vary from the screen shots shown throughout these tutorials.
• Samples are designed only to illustrate VTune Amplifier XE features and do not represent best practices for tuning the code. Results may vary depending on the nature of the analysis.

To run the VTune Amplifier XE:
• For Microsoft Visual Studio*: VTune Amplifier XE integrates into Visual Studio when installation completes. To configure and run an analysis, open your solution and go to Tools > Intel VTune Amplifier XE 2011 > New Analysis... from the Visual Studio menu or click the New Analysis button from the VTune Amplifier XE toolbar. See the Navigation Quick Start for more details.
• For the standalone interface: From the Start menu, select All Programs > Intel Parallel Studio XE 2011 > Intel VTune Amplifier XE 2011.

To access VTune Amplifier XE Help:
See the Getting Help topic.

Required Skills and Knowledge
These tutorials are designed for developers with the following skills and knowledge:
• Basic understanding of the Microsoft Visual Studio* 2005 development environment (IDE), including how to:
  • Open a project/solution.
  • Display the Solution Explorer and Output windows.
  • Compile and link a project.
  • Ensure a project compiled successfully.
  • Access the Document Explorer window.

Navigation Quick Start
Intel® VTune™ Amplifier XE / Microsoft Visual Studio* 2005 Integration

NOTE This topic describes integration into Microsoft Visual Studio* 2005. Integration into other versions of the Visual Studio IDE, or the standalone VTune Amplifier XE interface, may differ slightly.

The VTune Amplifier XE integrates into the Visual Studio* development environment (IDE) and can be accessed from the menus, toolbar, and Solution Explorer in the following manner:
• Use the VTune Amplifier XE toolbar to configure and control result collection.
• VTune Amplifier XE results (*.amplxe) show up in the Solution Explorer under the My Amplifier XE Results folder. To configure and control result collection, right-click the project in the Solution Explorer and select the Intel VTune Amplifier XE 2011 menu from the pop-up menu. To manage previously collected results, right-click the result (for example, r002hs.amplxe) and select the required command from the pop-up menu.
• Use the drop-down menu to select a viewpoint, a preset configuration of windows/panes for an analysis result. For each analysis type, you can switch among several preset configurations to focus on particular performance metrics.
• Click the buttons on navigation toolbars to change window views and toggle window panes on and off.
• In the Timeline pane, analyze the thread activity and transitions presented for the user-mode sampling and tracing analysis results (for example, Hotspots, Concurrency, Locks and Waits) or analyze the distribution of the application performance per metric over time for the event-based sampling analysis results (for example, Memory Access, Bandwidth Breakdown).
• Use the Call Stack pane to view call paths for a function selected in the grid.
• Use the filter toolbar to filter out the result data according to the selected categories.
• In Microsoft Visual Studio* 2005/2008, use the Dynamic Help window to access help topics related to the current VTune Amplifier XE window/pane.

Key Terms and Concepts

Key Terms
baseline: A performance metric used as a basis for comparison of the application versions before and after optimization. A baseline should be measurable and reproducible.
CPU time: The amount of time a thread spends executing on a logical processor. For multiple threads, the CPU time of the threads is summed. The application CPU time is the sum of the CPU time of all the threads that run the application.
Elapsed time: The total time your target ran, calculated as follows: Wall clock time at end of application - Wall clock time at start of application.
hotspot: A section of code that took a long time to execute. Some hotspots may indicate bottlenecks and can be removed, while other hotspots inevitably take a long time to execute due to their nature.
target: An executable file you analyze using the Intel® VTune™ Amplifier XE.
viewpoint: A preset result tab configuration that filters out the data collected during a performance analysis and enables you to focus on specific performance problems. When you select a viewpoint, you select a set of performance metrics the VTune Amplifier XE shows in the windows/panes of the result tab.
To select the required viewpoint, click the button and use the drop-down menu at the top of the result tab.
Wait time: The amount of time that a given thread waited for some event to occur, such as synchronization waits and I/O waits.

Key Concept: CPU Usage
For the user-mode sampling and tracing analysis types, the Intel® VTune™ Amplifier XE identifies a processor utilization scale, calculates the target CPU usage, and defines default utilization ranges depending on the number of processor cores. You can change the utilization ranges by dragging the slider in the CPU Usage histogram in the Summary window.

Utilization Type | Description
Idle | All CPUs are waiting - no threads are running.
Poor | Poor usage. By default, poor usage is when the number of simultaneously running CPUs is less than or equal to 50% of the target CPU usage.
OK | Acceptable (OK) usage. By default, OK usage is when the number of simultaneously running CPUs is between 51-85% of the target CPU usage.
Ideal | Ideal usage. By default, Ideal usage is when the number of simultaneously running CPUs is between 86-100% of the target CPU usage.

Key Concept: Data of Interest
The VTune Amplifier XE maintains a special column called Data of Interest. This column is highlighted with a yellow background and a yellow star in the column header. The data in the Data of Interest column is used by various windows as follows:
• The Call Stack pane calculates the contribution, shown in the contribution bar, using the Data of Interest column values.
• The Filter bar uses the data of interest values to calculate the percentage indicated in the filtered option.
• The Source/Assembly window uses this column for hotspot navigation.
If a viewpoint has more than one column with numeric data or bars, you can change the default Data of Interest column by right-clicking the required column and selecting the Set Column as Data of Interest command from the pop-up menu.
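
The CPU time definition and the default CPU Usage ranges given earlier in this section amount to a sum and a simple threshold rule. The following Python sketch is illustrative only and is not part of the VTune Amplifier XE: the function names application_cpu_time and classify_cpu_usage are hypothetical, and the handling of values that fall between or outside the documented ranges (for example, 50.5%, or anything above 100%) is an assumption.

```python
from typing import Iterable


def application_cpu_time(thread_cpu_times: Iterable[float]) -> float:
    """Application CPU time: the sum of the CPU time of all threads."""
    return sum(thread_cpu_times)


def classify_cpu_usage(running_cpus: float, target_cpus: int) -> str:
    """Map the number of simultaneously running CPUs to a default
    utilization type (Idle, Poor, OK, Ideal).

    Assumptions (not stated in the source text):
    - values strictly between 50% and 51%, or 85% and 86%, are
      assigned to the higher range's lower neighbour (OK, OK);
    - values above 100% are reported as Ideal.
    """
    if target_cpus <= 0:
        raise ValueError("target_cpus must be positive")
    if running_cpus <= 0:
        return "Idle"  # all CPUs are waiting, no threads are running

    percent = 100.0 * running_cpus / target_cpus
    if percent <= 50:
        return "Poor"
    if percent <= 85:
        return "OK"
    return "Ideal"


# Example: 5 of 8 target CPUs busy is 62.5% of the target CPU usage.
print(classify_cpu_usage(5, 8))
print(application_cpu_time([1.5, 2.0, 0.5]))
```

Dragging the slider in the CPU Usage histogram corresponds to changing the 50 and 85 thresholds in this sketch.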

Key Concept: Event-based Metrics
When analyzing data collected during a hardware event-based sampling analysis, the VTune Amplifier XE uses performance metrics. Each metric is an event ratio with its own threshold values. As soon as the performance of a program unit per metric exceeds the threshold, the VTune Amplifier XE marks this value as a performance issue (in pink) and provides recommendations on how to fix it. Each column in the Bottom-up pane provides data per metric. To read the metric description and see the formula used for the metric calculation, mouse over the metric column header. To read the description of the hardware issue and see the threshold formula used for this issue, mouse over the link cell in the grid. For the full list of metrics used by the VTune Amplifier XE, see the Hardware Event-based Metrics topic in the online help.

Key Concept: Event-based Sampling Analysis
VTune Amplifier XE introduces a set of advanced hardware analysis types based on the event-based sampling data collection and targeted at the Intel® Core™ 2 processor family and at processors based on the Intel® microarchitecture code name Nehalem and Intel® microarchitecture code name Sandy Bridge. Depending on the analysis type, the VTune Amplifier XE monitors a set of hardware events and presents the collected data as hardware event-based metrics defined by Intel architects (for example, Clockticks per Instructions Retired, Contested Accesses, and so on). Typically, it is recommended that you start with the General Exploration analysis type, which collects the maximum number of events and provides the widest picture of the hardware issues that affected the performance of your application. For more information on the event-based sampling analysis, see the Hardware Event-based Sampling Collection topic in the online help.

Key Concept: Event Skid
Event skid is the recording of an event not exactly on the code line that caused the event.
Event skids may even result in a caller function event being recorded in the callee function. Event skid is caused by a number of factors:
• The delay in propagating the event out of the processor's microcode through the interrupt controller (APIC) and back into the processor.
• The current instruction retirement cycle must be completed.
• When the interrupt is received, the processor must serialize its instruction stream, which causes a flushing of the execution pipeline.
The Intel(R) processors support accurate event location for some events. These events are called precise events. See the online help for more details.

Key Concept: Finalization
Finalization is the process of the Intel® VTune™ Amplifier XE converting the collected data to a database, resolving symbol information, and pre-computing data to make further analysis more efficient and responsive. The VTune Amplifier XE finalizes data automatically when data collection completes. You may want to re-finalize a result to:
• update symbol information after changes in the search directories settings
• resolve the [Unknown] entries in the results

Key Concept: Hotspots Analysis
The Hotspots analysis helps you understand the application flow and identify sections of code that took a long time to execute (hotspots). A large number of samples collected at a specific process, thread, or module can imply high processor utilization and potential performance bottlenecks. Some hotspots can be removed, while other hotspots are fundamental to the application functionality and cannot be removed. The Intel® VTune™ Amplifier XE creates a list of functions in your application ordered by the amount of time spent in a function. It also detects the call stacks for each of these functions so you can see how the hot functions are called.
The VTune Amplifier XE uses a low-overhead (about 5%) user-mode sampling and tracing collection that gets you the information you need without slowing down the application execution significantly.

Key Concept: Locks and Waits Analysis
While the Concurrency analysis helps identify where your application is not parallel, the Locks and Waits analysis helps identify the cause of the ineffective processor utilization. One of the most common problems is threads waiting too long on synchronization objects (locks). Performance suffers when waits occur while cores are under-utilized. During the Locks and Waits analysis you can estimate the impact each synchronization object introduces to the application and understand how long the application was required to wait on each synchronization object, or in blocking APIs, such as sleep and blocking I/O.

Key Concept: Thread Concurrency
The number of active threads corresponds to the concurrency level of an application. By comparing the concurrency level with the number of processors, Intel® VTune™ Amplifier XE classifies how an application utilizes the processors in the system. It defines default utilization ranges depending on the number of processor cores and displays the thread concurrency in the Summary and Bottom-up windows. You can change the utilization ranges by dragging the slider in the Summary window. Thread concurrency may be higher than CPU Usage if threads are in the runnable state and not consuming CPU time. VTune Amplifier XE defines the Target Concurrency level for your application that is, by default, equal to the number of physical cores.

Utilization Type | Description
Idle | All threads in the application are waiting - no threads are running. There can be only one bar in the Thread Concurrency histogram indicating Idle utilization.
Poor | Poor utilization. By default, poor utilization is when the number of threads is up to 50% of the target concurrency.
OK | Acceptable (OK) utilization. By default, OK utilization is when the number of threads is between 51-85% of the target concurrency.
Ideal | Ideal utilization. By default, ideal utilization is when the number of threads is between 86-115% of the target concurrency.
Over | Over-utilization. By default, over-utilization is when the number of threads is more than 115% of the target concurrency.

Tutorial: Finding Hotspots

Learning Objectives
This tutorial shows how to use the Hotspots analysis of the Intel® VTune™ Amplifier XE to understand where the sample application is spending time, identify hotspots (the most time-consuming program units), and detect how they were called. Some hotspots may indicate bottlenecks that can be removed, while other hotspots are inevitable and take a long time to execute due to their nature. Typically, the hotspot functions identified during the Hotspots analysis use the most time-consuming algorithms and are good candidates for parallelization.
The Hotspots analysis is useful for analyzing the performance of both serial and parallel applications.
Estimated completion time: 15 minutes.
Sample application: tachyon.
After you complete this tutorial, you should be able to:
• Choose an analysis target.
• Choose the Hotspots analysis type.
• Run the Hotspots analysis to locate the most time-consuming functions in an application.
• Analyze the function call flow and threads.
• Analyze the source code to locate the most time-critical code lines.
• Compare results before and after optimization.

Start Here
Workflow Steps to Identify and Analyze Hotspots

Workflow Steps to Identify and Analyze Hotspots
You can use the Intel® VTune™ Amplifier XE to identify and analyze hotspot functions in your serial or parallel application by performing a series of steps in a workflow. This tutorial guides you through these workflow steps while using a sample ray-tracer application named tachyon.
1. Do one of the following:
   • Visual Studio* IDE: Choose a project, verify settings, and build application
   • Standalone GUI: Build an application to analyze for hotspots and create a new VTune Amplifier XE project
2. Choose and run the Hotspots analysis.
3. Interpret the result data.
4. View and analyze code of the performance-critical function.
5. Modify the code to tune the algorithms or rebuild the code with Intel® Compiler.
6. Re-build the target, re-run the Hotspots analysis, and compare the result data before and after optimization.

Visual Studio* IDE: Choose Project and Build Application
Before you start analyzing your application target for hotspots, do the following:
1. Choose a project with the analysis target in the Visual Studio IDE.
2. Configure the Microsoft Visual Studio* environment to download the debug information for system libraries so that VTune Amplifier XE can properly identify system functions and classify and attribute functions.
3. Configure Visual Studio project properties to generate the debug information for your application so that VTune Amplifier XE can open the source code.
4. Build the target in the release mode with full optimizations, which is recommended for performance analysis.
5. Run the application without debugging to create a performance baseline.

For this tutorial, your target is a ray-tracer application, tachyon. To learn how to install and set up the sample code, see Prerequisites.

NOTE
• The steps below are provided for Microsoft Visual Studio 2005. They may slightly differ for other versions of Visual Studio.
• Steps provided by this tutorial are generic and applicable to any application. You may choose to follow the proposed workflow using your own application.

Choose a Project
1. From the Visual Studio menu, select File > Open > Project/Solution.... The Open Project dialog box opens.
2. In the Open Project dialog box, browse to the location you used to extract the tachyon_vtune_amp_xe.zip file and select the tachyon_vtune_amp_xe.sln file. The solution is added to Visual Studio IDE and shows up in the Solution Explorer.
3. In the Solution Explorer, right-click the find_hotspots project and select Project > Set as StartUp Project. find_hotspots appears in bold in the Solution Explorer.
When you choose a project in Visual Studio IDE, the VTune Amplifier XE automatically creates the config.amplxeproj project file and sets the find_hotspots application as an analysis target in the project properties.

Enable Downloading the Debug Information for System Libraries
1. Go to Tools > Options.... The Options dialog box opens.
2. From the left pane, select Debugging > Symbols.
3. In the Symbol file (.pdb) locations field, click the button and specify the following address: http://msdl.microsoft.com/download/symbols.
4. Make sure the added address is checked.
5. In the Cache symbols from symbol servers to this directory field, specify a directory where the downloaded symbol files will be stored.
6. For Microsoft Visual Studio* 2005, check the Load symbols using the updated settings when this dialog is closed box.
7. Click OK.

Enable Generating Debug Information for Your Binary Files
1. Select the find_hotspots project and go to Project > Properties.
2. From the find_hotspots Property Pages dialog box, select Configuration Properties > General and make sure the selected Configuration (top of the dialog) is Active(Release).
3. From the find_hotspots Property Pages dialog box, select the C/C++ > General pane and specify the Debug Information Format as Program Database (/Zi).
4. From the find_hotspots Property Pages dialog box, select Linker > Debugging and set the Generate Debug Info option to Yes (/DEBUG).

Choose a Build Mode and Build a Target
1. Go to the Build > Configuration Manager... dialog box and select the Release mode for your target project.
2. From the Visual Studio menu, select Build > Build find_hotspots. The tachyon_find_hotspots.exe application is built.

NOTE The build configuration for tachyon may initially be set to Debug, which is typically used for development. When analyzing performance issues with the VTune Amplifier XE, it is recommended that you use the Release build with normal optimizations. In this way, the VTune Amplifier XE is able to analyze the realistic performance of your application.

Create a Performance Baseline
1. From the Visual Studio menu, select Debug > Start Without Debugging. The tachyon_find_hotspots.exe application starts running.

NOTE Before you start the application, minimize the amount of other software running on your computer to get more accurate results.

2. Note the execution time displayed in the window caption.
For the tachyon_find_hotspots.exe executable in the figure above, the execution time is 63.609 seconds. The total execution time is the baseline against which you will compare subsequent runs of the application.
NOTE Run the application several times, note the execution time for each run, and use the average. This helps to minimize skewed results due to transient system activity.

Recap
You chose the target for the Hotspots analysis, set up your environment to enable generating symbol information for system libraries and your binary files, built the target in the Release mode, and created the performance baseline. Your application is ready for analysis.

Key Terms and Concepts
• Term: target
• Concept: Hotspots Analysis

Next Step
Run Hotspots Analysis

Standalone GUI: Build Application and Create New Project
Before you start analyzing your application target for hotspots, do the following:
1. Build the application. If you build the code in Visual Studio*, make sure to:
• Configure the Microsoft Visual Studio* environment to download the debug information for system libraries so that VTune Amplifier XE can properly identify system functions and classify and attribute functions.
• Configure Visual Studio project properties to generate the debug information for your application so that VTune Amplifier XE can open the source code.
• Build the target in the release mode with full optimizations, which is recommended for performance analysis.
2. Run the application without debugging to create a performance baseline.
3. Create a VTune Amplifier XE project.
NOTE The steps below are provided for Microsoft Visual Studio 2005. They may differ slightly for other versions of Visual Studio.

Enable Downloading the Debug Information for System Libraries
1. Go to Tools > Options.... The Options dialog box opens.
2. From the left pane, select Debugging > Symbols.
3.
In the Symbol file (.pdb) locations field, click the button and specify the following address: http://msdl.microsoft.com/download/symbols.
4. Make sure the added address is checked.
5. In the Cache symbols from symbol servers to this directory field, specify a directory where the downloaded symbol files will be stored.
6. For Microsoft Visual Studio* 2005, check the Load symbols using the updated settings when this dialog is closed box.
7. Click OK.

Enable Generating Debug Information for Your Binary Files
1. Select the find_hotspots project and go to Project > Properties.
2. From the find_hotspots Property Pages dialog box, select Configuration Properties > General and make sure the selected Configuration (top of the dialog) is Active(Release).
3. From the find_hotspots Property Pages dialog box, select the C/C++ > General pane and specify the Debug Information Format as Program Database (/Zi).
4. From the find_hotspots Property Pages dialog box, select Linker > Debugging and set the Generate Debug Info option to Yes (/DEBUG).

Choose a Build Mode and Build a Target
1. Go to the Build > Configuration Manager... dialog box and select the Release mode for your target project.
2. From the Visual Studio menu, select Build > Build find_hotspots. The tachyon_find_hotspots.exe application is built.
NOTE The build configuration for tachyon may initially be set to Debug, which is typically used for development. When analyzing performance issues with the VTune Amplifier XE, use the Release build with normal optimizations so that the VTune Amplifier XE can analyze the realistic performance of your application.

Create a Performance Baseline
1. From the Visual Studio menu, select Debug > Start Without Debugging.
The tachyon_find_hotspots.exe application starts running.
NOTE Before you start the application, minimize the amount of other software running on your computer to get more accurate results.
2. Note the execution time displayed in the window caption. For the tachyon_find_hotspots.exe executable in the figure above, the execution time is 63.609 seconds. The total execution time is the baseline against which you will compare subsequent runs of the application.
NOTE Run the application several times, note the execution time for each run, and use the average. This helps to minimize skewed results due to transient system activity.

Create a Project
1. From the Start menu, select Intel Parallel Studio XE 2011 > Intel VTune Amplifier XE 2011 to launch the VTune Amplifier XE standalone GUI.
2. Create a new project via File > New > Project.... The Create a Project dialog box opens.
3. Specify the project name tachyon that will be used as the project directory name. The VTune Amplifier XE creates the tachyon project directory under the %USERPROFILE%\My Documents\My Amplifier XE Projects directory and opens the Project Properties: Target dialog box.
4. In the Application to Launch pane of the Target tab, specify and configure your target as follows:
• For the Application field, browse to: \find_hotspots.exe, for example: C:\samples\tachyon_vtune_amp_xe\vc8\find_hotspots_Win32_Release\find_hotspots.exe.
5. Click OK to apply the settings and exit the Project Properties dialog box.

Recap
You set up your environment to enable generating symbol information for system libraries and your binary files, built the target in the Release mode, created the performance baseline, and created the VTune Amplifier XE project for your analysis target. Your application is ready for analysis.
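The baseline advice above (run several times and use the average) can be sketched in a few lines of C++. This is only an illustration: the helper name average_baseline is hypothetical and is not part of the VTune Amplifier XE or the tachyon sample, and the run times you pass in are whatever you noted from the window caption.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Hypothetical helper (not part of VTune Amplifier XE or tachyon):
// averages the execution times, in seconds, noted from several runs.
double average_baseline(const std::vector<double>& run_times_sec) {
    assert(!run_times_sec.empty());  // at least one measured run is required
    const double total =
        std::accumulate(run_times_sec.begin(), run_times_sec.end(), 0.0);
    return total / static_cast<double>(run_times_sec.size());
}
```

Calling the helper with the times from three or more runs gives a single baseline figure that is less sensitive to transient system activity than any one run.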
Key Terms and Concepts
• Term: target, baseline

Next Step
Run Hotspots Analysis

Run Hotspots Analysis
In this tutorial, you run the Hotspots analysis to identify the hotspots, that is, the functions that took the most time to execute.
To run an analysis:
1. From the VTune Amplifier XE toolbar, click the New Analysis button. The VTune Amplifier XE result tab opens with the Analysis Type window active.
2. On the left pane of the Analysis Type window, locate the analysis tree and select Algorithm Analysis > Hotspots. The right pane is updated with the default options for the Hotspots analysis.
3. Click the Start button on the right command bar. VTune Amplifier XE launches the tachyon_find_hotspots application, which renders the balls.dat input file, calculates the execution time, and exits. VTune Amplifier XE finalizes the collected results and opens the Hotspots viewpoint.
NOTE To make sure the performance of the application is repeatable, go through the entire tuning process on the same system with a minimal amount of other software executing.

Recap
You launched the Hotspots data collection that analyzes function calls and CPU time spent in each program unit of your application.
NOTE This tutorial explains how to run an analysis from the VTune Amplifier XE graphical user interface (GUI). You can also use the VTune Amplifier XE command-line interface (amplxe-cl command) to run an analysis. For more details, check the Command-line Interface Support section of the VTune Amplifier XE Help.

Key Terms and Concepts
• Term: hotspot, Elapsed time, viewpoint
• Concept: Hotspots Analysis, Finalization

Next Step
Interpret Result Data

Interpret Result Data
When the sample application exits, the Intel® VTune™ Amplifier XE finalizes the results and opens the Hotspots viewpoint that consists of the Summary, Bottom-up, and Top-down Tree windows.
To interpret the data on the sample code performance, do the following:
• Understand the basic performance metrics provided by the Hotspots analysis.
• Analyze the most time-consuming functions.
• Analyze CPU usage per function.
NOTE The screenshots and execution time data provided in this tutorial were created on a system with four CPU cores. Your data may vary depending on the number and type of CPU cores on your system.

Understand the Basic Hotspots Metrics
Start the analysis with the Summary window. To interpret the data, hover over the question mark icons to read the pop-up help and better understand what each performance metric means.
Note that CPU Time for the sample application is equal to 64.907 seconds. It is the sum of CPU time for all application threads. Total Thread Count is 3, so the sample application is multi-threaded.
The Top Hotspots section provides data on the most time-consuming functions (hotspot functions) sorted by CPU time spent on their execution. For the sample application, the initialize_2D_buffer function, which took 27.671 seconds to execute, shows up at the top of the list as the hottest function. The [Others] entry at the bottom shows the sum of CPU time for all functions not listed in the table.

Analyze the Most Time-consuming Functions
Click the Bottom-up tab to explore the Bottom-up pane. By default, the data in the grid is grouped by Function. You may change the grouping level using the Grouping drop-down menu at the top of the grid.
Analyze the CPU Time column values. This column is marked with a yellow star as the Data of Interest column. It means that the VTune Amplifier XE uses this type of data for some calculations (for example, filtering, stack contribution, and others). Functions that took the most CPU time to execute are listed at the top. The initialize_2D_buffer function took 27.671 seconds to execute.
Click the plus sign next to the initialize_2D_buffer function to expand the stacks calling this function. You see that it was called only by the setup_2D_buffer function.
Select the initialize_2D_buffer function in the grid and explore the data provided in the Call Stack pane on the right. The Call Stack pane displays full stack data for each hotspot function and enables you to navigate between function call stacks and understand the impact of each stack on the function CPU time. The stack functions in the Call Stack pane are represented in the following format: <module>!<function> - <file>:<line number>, where the line number corresponds to the line calling the next function in the stack.
For the sample application, the hottest function initialize_2D_buffer is called at line 86 of the setup_2D_buffer function in the global.cpp file.

Analyze CPU Usage per Function
VTune Amplifier XE enables you to analyze the collected data from different perspectives by using multiple viewpoints. For the Hotspots analysis result, you may switch to the Hotspots by CPU Usage viewpoint to understand how your hotspot function performs in terms of the CPU usage. Explore this viewpoint to determine how your application utilized available cores and identify the most serial code.
If you go back to the Summary window, you can see the CPU Usage Histogram that represents the Elapsed time and usage level for the available logical processors. Ideally, the highest bar of your chart should match the Target level.
The tachyon_find_hotspots application ran mostly on one logical CPU. If you hover over the highest bar, you see that it spent 62.491 seconds using one core only, which is classified by the VTune Amplifier XE as a Poor utilization for a dual-core system. To understand what prevented the application from using all available logical CPUs effectively, explore the Bottom-up pane.
To get the detailed CPU usage information per function, use the button in the Bottom-up window to expand the CPU Time column.
Note that initialize_2D_buffer is the function with the longest poor CPU utilization (red bars). This means that the processor cores were underutilized most of the time spent on executing this function.
If you change the grouping level (highlighted in the figure above) in the Bottom-up pane from Function/Call Stack to Thread/Function/Call Stack, you see that the initialize_2D_buffer function belongs to the thread_video thread. This thread is also identified as a hotspot and shows up at the top in the Bottom-up pane. To get detailed information on the hotspot thread performance, explore the Timeline pane.
• Timeline area. When you hover over the graph element, the timeline tooltip displays the time passed since the application has been launched.
• Threads area that shows the distribution of CPU time utilization per thread. Hover over a bar to see the CPU time utilization in percent for this thread at each moment of time. Green zones show the time threads are active.
• CPU Usage area that shows the distribution of CPU time utilization for the whole application. Hover over a bar to see the application-level CPU time utilization in percent at each moment of time. VTune Amplifier XE calculates the overall CPU Usage metric as the sum of CPU time per each thread of the Threads area. Maximum CPU Usage value is equal to [number of processor cores] x 100%.
The Timeline analysis also identifies the thread_video thread as the most active. The tooltip shows that CPU time values rarely exceed 100% whereas the maximum CPU time value for dual-core systems is 200%. This means that the processor cores were half-utilized for most of the time spent on executing the tachyon_find_hotspots application.

Recap
You identified a function that took the most CPU time and could be a good candidate for algorithm tuning.
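The CPU Usage arithmetic described above (overall usage as the sum of per-thread CPU time, with a maximum of [number of processor cores] x 100%) can be written out as a small C++ sketch. The function names are hypothetical and are not part of any VTune Amplifier XE interface; the per-thread percentages are values you would read from the Timeline tooltips.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Hypothetical helpers that only restate the arithmetic from the text.

// Maximum CPU Usage value: [number of processor cores] x 100%.
double max_cpu_usage_percent(int cores) {
    assert(cores > 0);
    return 100.0 * cores;
}

// Overall CPU Usage at one moment: the sum of the per-thread
// CPU time utilization values, each given in percent.
double overall_cpu_usage_percent(const std::vector<double>& per_thread_percent) {
    return std::accumulate(per_thread_percent.begin(),
                           per_thread_percent.end(), 0.0);
}

// Share of the available capacity in use (1.0 means all cores are busy).
double utilization_fraction(const std::vector<double>& per_thread_percent,
                            int cores) {
    return overall_cpu_usage_percent(per_thread_percent) /
           max_cpu_usage_percent(cores);
}
```

For example, three threads at an assumed 100%, 0%, and 0% on a dual-core system give 100% out of a possible 200%, a fraction of 0.5, which matches the "half-utilized" reading in the tutorial.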
Key Terms and Concepts
• Term: Elapsed time, CPU time, viewpoint
• Concept: Hotspots Analysis, CPU Usage, Data of Interest

Next Step
Analyze Code

Analyze Code
You identified initialize_2D_buffer as the hottest function. In the Bottom-up pane, double-click this function to open the Source window and analyze the source code:
• Understand basic options provided in the Source window.
• Identify the hottest code lines.

Understand Basic Source Window Options
The following explains some of the features available in the Source window when viewing the Hotspots analysis data.
• Source pane displaying the source code of the application if the function symbol information is available. The code line that took the most CPU time to execute is highlighted. The source code in the Source pane is not editable. If the function symbol information is not available, the Assembly pane opens displaying assembler instructions for the selected hotspot function. To enable the Source pane, make sure to build the target properly.
• Assembly pane displaying the assembler instructions for the selected hotspot function. Assembler instructions are grouped by basic blocks. The assembler instructions for the selected hotspot function are highlighted. To get help on an assembler instruction, right-click the instruction and select Instruction Reference.
NOTE To get the help on a particular instruction, make sure to have the Adobe* Acrobat Reader* 9 (or later) installed. If an earlier version of the Adobe Acrobat Reader is installed, the Instruction Reference opens but you need to locate the help on each instruction manually.
• Processor time attributed to a particular code line. If the hotspot is a system function, its time, by default, is attributed to the user function that called this system function.
• Source window toolbar. Use the hotspot navigation buttons to switch between the most performance-critical code lines.
Hotspot navigation is based on the metric column selected as a Data of Interest. For the Hotspots analysis, this is CPU Time. Use the Source/Assembly buttons to toggle the Source/Assembly panes (if both of them are available) on/off.
• Heat map markers to quickly identify performance-critical code lines (hotspots). The bright blue markers indicate hot lines for the function you selected for analysis. Light blue markers indicate hot lines for other functions. Scroll to a marker to locate the hot code line it identifies.

Identify the Hottest Code Lines
When you identify a hotspot in the serial code, you can make some changes in the code to tune the algorithms and speed up that hotspot. Another option is to parallelize the sample code by adding threads to the application so that it performs well on multi-core processors. This tutorial focuses on algorithm tuning.
By default, when you double-click the hotspot in the Bottom-up pane, VTune Amplifier XE opens the source file related to this function, highlighting the code line that took the most CPU time. For the initialize_2D_buffer function, the hottest code line is 84. This code is used to initialize a memory array using non-sequential memory locations. Click the Source Editor button on the Source window toolbar to open the default code editor and work on optimizing the code.

Recap
You identified the code section that took the most CPU time to execute.

Key Terms and Concepts
• Term: hotspot, CPU time
• Concept: Hotspots Analysis, Data of Interest

Next Step
Tune Algorithms

Tune Algorithms
In the Source window, you identified that in the initialize_2D_buffer hotspot function the code line 84 took the most CPU time. Focus on this line and do the following:
1. Open the code editor.
2. Resolve the performance problem using any of these options:
• Optimize the algorithm used in this code section.
• Recompile the code with the Intel® Compiler.
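To make the "non-sequential memory locations" problem concrete, here is a minimal C++ sketch of the access pattern and of loop interchange, one common way to optimize such a loop. It is not the tachyon source: the function names, the flat row-major buffer, and the stored values are assumptions chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative only: this is not the tachyon source. The buffer is row-major,
// so element (row, col) lives at index row * cols + col.

// Slower pattern: the inner loop steps through rows, so consecutive writes
// land `cols` elements apart (non-sequential memory locations).
void fill_non_sequential(std::vector<unsigned>& buf,
                         std::size_t rows, std::size_t cols) {
    assert(buf.size() == rows * cols);
    for (std::size_t col = 0; col < cols; ++col)
        for (std::size_t row = 0; row < rows; ++row)
            buf[row * cols + col] = static_cast<unsigned>(row * cols + col);
}

// Interchanged loops: the inner loop steps through columns, so consecutive
// writes touch adjacent elements (sequential memory locations).
void fill_sequential(std::vector<unsigned>& buf,
                     std::size_t rows, std::size_t cols) {
    assert(buf.size() == rows * cols);
    for (std::size_t row = 0; row < rows; ++row)
        for (std::size_t col = 0; col < cols; ++col)
            buf[row * cols + col] = static_cast<unsigned>(row * cols + col);
}
```

Both functions write the same values; only the order of the writes differs. On large buffers the sequential version tends to use the processor caches more effectively, which is the kind of gain this tutorial measures.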
Open the Code Editor
In the Source window, click the Source Editor button to open the find_hotspots.cpp file in the default code editor at the hotspot line.
Hotspot line 84 is used to initialize a memory array using non-sequential memory locations. For demonstration purposes, the code lines are commented as a slower method of filling the array.

Resolve the Problem
To resolve this issue, use one of the following methods:

Option 1: Optimize your algorithm
1. Edit line 79 to comment out code lines 82-88 marked as a "First (slower) method".
2. Edit line 95 to uncomment code lines 98-104 marked as a "Faster method". In this step, you interchange the for loops to initialize the array in sequential memory locations.
3. From the Visual Studio menu, select Build > Rebuild find_hotspots. The project is rebuilt.
4. From the Visual Studio Debug menu, select Start Without Debugging to run the application. Visual Studio runs the tachyon_find_hotspots.exe. Note that the execution time has decreased from 63.609 seconds to 57.282 seconds.

Option 2: Recompile the code with Intel® Compiler
This option assumes that you have Intel® Composer XE installed. Composer XE is part of Intel® Parallel Studio XE. By default, the Intel® Compiler, one of the Composer components, uses powerful optimization switches, which typically provide some gain in performance. For more details on the Intel compiler, see the Intel Composer documentation.
As an alternative, you may consider running the default Microsoft Visual Studio compiler applying more aggressive optimization switches.
To recompile the code with the Intel compiler:
1. From the Visual Studio Project menu, select Intel Composer XE > Use Intel C++....
2. In the Confirmation window, click OK to confirm your choice. The project in Solution Explorer appears with the ComposerXE icon.
3.
From the Visual Studio menu, select Build > Rebuild find_hotspots. The project is rebuilt with the Intel compiler.
4. From the Visual Studio menu, select Debug > Start Without Debugging. Visual Studio runs the tachyon_find_hotspots.exe. Note that the execution time is reduced.

Recap
You interchanged the loops in the hotspot function, rebuilt the application, and got a performance gain of about 6 seconds. You also considered an alternative optimization technique using the Intel C++ compiler.

Key Terms and Concepts
• Term: hotspot

Next Step
Compare with Previous Result

Optimization Notice
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804

Compare with Previous Result
You optimized your code by applying loop interchange, which gave you about 6 seconds of improvement in the application execution time. To understand whether you got rid of the hotspot and what kind of optimization you got per function, re-run the Hotspots analysis on the optimized code and compare results:
• Compare results before and after optimization.
• Identify the performance gain.

Compare Results Before and After Optimization
1. Run the Hotspots analysis on the modified code.
2. Click the Compare Results button on the Intel® VTune™ Amplifier XE toolbar. The Compare Results window opens.
3.
Specify the Hotspots analysis results you want to compare and click the Compare Results button.
The Hotspots Bottom-up window opens, showing the CPU time usage across the two results and the differences side by side.
• Difference in CPU time between the two results in the following format: <difference in CPU time> = <CPU time of the first result> - <CPU time of the second result>.
• CPU time for the initial version of the tachyon_find_hotspots.exe application.
• CPU time for the optimized version of the tachyon_find_hotspots.exe.

Identify the Performance Gain
Explore the Bottom-up pane to compare CPU time data for the first hotspot: CPU Time:r000hs - CPU Time:r001hs = CPU Time: Difference. 27.671s - 21.321s = 6.350s, which means that you got an optimization of ~6 seconds for the initialize_2D_buffer function.
If you switch to the Summary window, you see that the Elapsed time also shows 3.6 seconds of optimization for the whole application execution.

Recap
You ran the Hotspots analysis on the optimized code and compared the results before and after optimization using the Compare mode of the VTune Amplifier XE. Compare analysis results regularly to look for regressions and to track how incremental changes to the code affect its performance. You may also want to use the VTune Amplifier XE command-line interface and run the amplxe-cl command to test your code for regressions. For more details, see the Command-line Interface Support section in the VTune Amplifier XE online help.

Key Terms and Concepts
• Term: hotspot, CPU time
• Concept: Hotspots Analysis

Next Step
Read Summary

Summary
You have completed the Finding Hotspots tutorial. Here are some important things to remember when using the Intel® VTune™ Amplifier XE to analyze your code for hotspots:
Step 1.
Choose and Build Your Target
• Configure the Microsoft* symbol server and your project properties to get the most accurate results for system and user binaries and to analyze the performance of your application at the code line level.
• Create a performance baseline to compare the application versions before and after optimization. Make sure to use the same workload for each application run.
• Use the Project Properties: Target tab to choose and configure your analysis target. For Visual Studio* projects, the analysis target settings are inherited automatically.

Step 2. Run Analysis
• Use the Analysis Type configuration window to choose, configure, and run the analysis. For example, you may limit the data collection to a predefined amount of data or enable the VTune Amplifier XE to collect more accurate CPU time data. You can also run the analysis from the command line using the amplxe-cl command.

Step 3. Interpret Results and Resolve the Issue
• Start analyzing the performance of your application from the Summary window to explore the performance metrics for the whole application. Then, move to the Bottom-up window to analyze the performance per function. Focus on the hotspots, the functions that took the most CPU time. By default, they are located at the top of the table.
• Double-click the hotspot function in the Bottom-up pane or Call Stack pane to open its source code at the code line that took the most CPU time.
• Consider using the Intel® Compiler, part of the Intel® Composer XE, to optimize your code. Explore the compiler documentation for more details.

Step 4. Compare Results Before and After Optimization
• Perform regular regression testing by comparing analysis results before and after optimization. From the GUI, click the Compare Results button on the VTune Amplifier XE toolbar. From the command line, use the amplxe-cl command.
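The before/after comparison in Step 4 is a per-function subtraction, the same arithmetic as the 27.671s - 21.321s = 6.350s example in this tutorial. The C++ sketch below shows that arithmetic on a hypothetical in-memory representation (a map from function name to CPU time); it does not read VTune Amplifier XE result files, and the type and function names are assumptions.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical representation of one analysis result:
// function name -> CPU time in seconds.
using CpuTimes = std::map<std::string, double>;

// For every function present in either result, computes
//   difference = CPU time before - CPU time after.
// A function missing from one result counts as 0 seconds there.
CpuTimes cpu_time_difference(const CpuTimes& before, const CpuTimes& after) {
    CpuTimes diff;
    for (const auto& entry : before) diff[entry.first] = entry.second;
    for (const auto& entry : after) diff[entry.first] -= entry.second;
    return diff;
}
```

A positive difference means the function got faster after the change; a negative one flags a possible regression worth a closer look.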
Tutorial: Analyzing Locks and Waits

Learning Objectives
This tutorial shows how to use the Locks and Waits analysis of the Intel® VTune™ Amplifier XE to identify one of the most common reasons for an inefficient parallel application: threads waiting too long on synchronization objects (locks) while processor cores are underutilized. Focus your tuning efforts on objects with long waits where the system is underutilized.
Estimated completion time: 15 minutes.
Sample application: tachyon.
After you complete this tutorial, you should be able to:
• Choose an analysis target.
• Choose the Locks and Waits analysis type.
• Run the Locks and Waits analysis.
• Identify the synchronization objects with long waits and poor CPU utilization.
• Analyze the source code to locate the most critical code lines.
• Compare results before and after optimization.
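The problem this tutorial targets, threads waiting on a lock while cores sit idle, can be illustrated with a generic C++ sketch. It is not taken from the tachyon sample: it uses std::mutex and std::thread from the C++ standard library, and the function names and workload are assumptions. The first function holds the lock for the whole workload, so the threads run one after another; the second takes the lock only to publish a privately computed result.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Illustrative only: not taken from the tachyon sample.

// Long waits: each thread holds the mutex for its whole workload, so the
// other threads wait and the work is effectively serialized.
long sum_with_long_wait(int threads, int items) {
    assert(threads > 0 && items >= 0);
    std::mutex lock;
    long total = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&lock, &total, items] {
            std::lock_guard<std::mutex> guard(lock);
            for (int i = 1; i <= items; ++i) total += i;
        });
    }
    for (std::thread& th : pool) th.join();
    return total;
}

// Short waits: each thread computes privately and takes the mutex only
// for the brief moment needed to add its partial sum to the total.
long sum_with_short_wait(int threads, int items) {
    assert(threads > 0 && items >= 0);
    std::mutex lock;
    long total = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&lock, &total, items] {
            long local = 0;
            for (int i = 1; i <= items; ++i) local += i;
            std::lock_guard<std::mutex> guard(lock);
            total += local;
        });
    }
    for (std::thread& th : pool) th.join();
    return total;
}
```

Both versions return the same sum. The difference is how long each thread keeps the others waiting, which is the kind of behavior a locks-and-waits analysis is meant to surface.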
Start Here
Workflow Steps to Identify Locks and Waits

Workflow Steps to Identify Locks and Waits
You can use the Intel® VTune™ Amplifier XE to understand the cause of the ineffective processor utilization by performing a series of steps in a workflow. This tutorial guides you through these workflow steps while using a sample ray-tracer application named tachyon.
1. Do one of the following:
• Visual Studio* IDE: Choose a project, verify settings, and build the application.
• Standalone GUI: Build an application to analyze for locks and waits and create a new VTune Amplifier XE project.
2. Run the Locks and Waits analysis.
3. Interpret the result data.
4. View and analyze code of the performance-critical function.
5. Modify the code to remove the lock.
6. Re-build the target, re-run the Locks and Waits analysis, and compare the result data before and after optimization.

Visual Studio* IDE: Choose Project and Build Application
Before you start analyzing your application for locks, do the following:
1. Choose a project with the analysis target in the Visual Studio IDE.
2. Configure the Microsoft Visual Studio* environment to download the debug information for system libraries so that VTune Amplifier XE can properly identify system functions and classify and attribute functions.
3. Configure Visual Studio project properties to generate the debug information for your application so that VTune Amplifier XE can open the source code.
4. Build the target in the release mode with full optimizations, which is recommended for performance analysis.
5. Run the application without debugging to create a performance baseline.

For this tutorial, your target is a ray-tracer application, tachyon. To learn how to install and set up the sample code, see Prerequisites.
• The steps below are provided for Microsoft Visual Studio* 2005. Steps for other versions of Visual Studio IDE may differ slightly. See online help for details.
• The steps in this tutorial are generic and apply to any application. You may follow the proposed workflow using your own application.

Choose a Project
1. From the Visual Studio menu, select File > Open > Project/Solution.... The Open Project dialog box opens.
2. In the Open Project dialog box, browse to the location you used to unzip the tachyon_vtune_amp_xe.zip file and select the tachyon_vtune_amp_xe.sln file. The solution is added to Visual Studio and shows up in the Solution Explorer.
3. In Solution Explorer, right-click the analyze_locks project and select Project > Set as StartUp Project. analyze_locks appears in bold in Solution Explorer.

Enable Downloading the Debug Information for System Libraries
1. Go to Tools > Options.... The Options dialog box opens.
2. From the left pane, select Debugging > Symbols.
3. In the Symbol file (.pdb) locations field, click the button and specify the following address: http://msdl.microsoft.com/download/symbols.
4. Make sure the added address is checked.
5. In the Cache symbols from symbol servers to this directory field, specify a directory where the downloaded symbol files will be stored.
6. For Microsoft* Visual Studio* 2005, check the Load symbols using the updated settings when this dialog is closed box.
7. Click OK.

Enable Generating Debug Information for Your Binary Files
1. Select the analyze_locks project and go to Project > Properties.
2. From the analyze_locks Property Pages dialog box, select Configuration Properties > General and make sure the selected Configuration (top of the dialog) is Active(Release).
3. From the analyze_locks Property Pages dialog box, select the C/C++ > General pane and specify the Debug Information Format as Program Database (/Zi).
4.
From the analyze_locks Property Pages dialog box, select Linker > Debugging and set the Generate Debug Info option to Yes (/DEBUG).

Choose a Build Mode and Build a Target
1. Go to the Build > Configuration Manager... dialog box and select the Release mode for your target project.
2. From the Visual Studio menu, select Build > Build analyze_locks. The tachyon_analyze_locks application is built.
NOTE The build configuration for tachyon may initially be set to Debug, which is typically used for development. When analyzing performance issues with the VTune Amplifier XE, use the Release build with normal optimizations so that the VTune Amplifier XE can analyze the realistic performance of your application.

Create a Performance Baseline
1. From the Visual Studio menu, select Debug > Start Without Debugging. The tachyon_analyze_locks application runs in multiple sections (depending on the number of CPUs in your system).
NOTE Before you start the application, minimize the amount of other software running on your computer to get more accurate results.
2. Note the execution time displayed in the window caption. For the tachyon_analyze_locks executable in the figure above, the execution time is 33.578 seconds. The total execution time is the baseline against which you will compare subsequent runs of the application.
NOTE Run the application several times, note the execution time for each run, and use the average. This helps to minimize skewed results due to transient system activity.

Recap
You selected the analyze_locks project as the target for the Locks and Waits analysis.
Key Terms and Concepts
• Term: target

Next Step
Run Locks and Waits Analysis

Standalone GUI: Build Application and Create New Project
Before you start analyzing your application for locks and waits, do the following:
1. Build the application. If you build the code in Visual Studio*, make sure to:
• Configure the Microsoft Visual Studio* environment to download the debug information for system libraries so that the VTune Amplifier XE can properly identify system functions and classify and attribute functions.
• Configure Visual Studio project properties to generate the debug information for your application so that the VTune Amplifier XE can open the source code.
• Build the target in the release mode with full optimizations, which is recommended for performance analysis.
2. Run the application without debugging to create a performance baseline.
3. Create a VTune Amplifier XE project.
NOTE The steps below are provided for Microsoft Visual Studio* 2005. Steps for other versions of Visual Studio IDE may differ slightly.

Enable Downloading the Debug Information for System Libraries
1. Go to Tools > Options.... The Options dialog box opens.
2. From the left pane, select Debugging > Symbols.
3. In the Symbol file (.pdb) locations field, click the button and specify the following address: http://msdl.microsoft.com/download/symbols.
4. Make sure the added address is checked.
5. In the Cache symbols from symbol servers to this directory field, specify a directory where the downloaded symbol files will be stored.
6. For Microsoft* Visual Studio* 2005, check the Load symbols using the updated settings when this dialog is closed box.
7. Click OK.

Enable Generating Debug Information for Your Binary Files
1. Select the analyze_locks project and go to Project > Properties.
2.
From the analyze_locks Property Pages dialog box, select Configuration Properties > General and make sure the selected Configuration (top of the dialog) is Active(Release).
3. From the analyze_locks Property Pages dialog box, select C/C++ > General pane and specify the Debug Information Format as Program Database (/Zi).
4. From the analyze_locks Property Pages dialog box, select Linker > Debugging and set the Generate Debug Info option to Yes (/DEBUG).

Choose a Build Mode and Build a Target
1. Go to the Build > Configuration Manager... dialog box and select the Release mode for your target project.
2. From the Visual Studio menu, select Build > Build analyze_locks. The tachyon_analyze_locks application is built.
NOTE The build configuration for tachyon may initially be set to Debug, which is typically used for development. When analyzing performance issues with the VTune Amplifier XE, use the Release build with normal optimizations so that the VTune Amplifier XE can analyze the realistic performance of your application.

Create a Performance Baseline
1. From the Visual Studio menu, select Debug > Start Without Debugging. The tachyon_analyze_locks application runs in multiple sections (depending on the number of CPUs in your system).
NOTE Before you start the application, minimize the amount of other software running on your computer to get more accurate results.
2. Note the execution time displayed in the window caption. For the tachyon_analyze_locks executable in the figure above, the execution time is 33.578 seconds. The total execution time is the baseline against which you will compare subsequent runs of the application.
NOTE Run the application several times, note the execution time for each run, and use the average number.
This helps to minimize skewed results due to transient system activity.

Create a Project
1. From the Start menu, select Intel Parallel Studio XE 2011 > Intel VTune Amplifier XE 2011 to launch the VTune Amplifier XE standalone GUI.
2. Create a new project via File > New > Project.... The Create a Project dialog box opens.
3. Specify the project name tachyon that will be used as the project directory name. VTune Amplifier XE creates a project directory under the %USERPROFILE%\My Documents\My Amplifier XE Projects directory and opens the Project Properties: Target dialog box.
4. In the Application to Launch pane of the Target tab, specify and configure your target as follows:
• For the Application field, browse to: \analyze_locks.exe.
5. Click OK to apply the settings and exit the Project Properties dialog box.

Recap
You set up your environment to enable generating symbol information for system libraries and your binary files, built the target in the Release mode, created the performance baseline, and created the VTune Amplifier XE project for your analysis target. Your application is ready for analysis.

Key Terms and Concepts
• Term: target, baseline
• Concept: Locks and Waits Analysis

Next Step
Run Locks and Waits Analysis

Run Locks and Waits Analysis
Before running an analysis, choose a configuration level to define the Intel® VTune™ Amplifier XE analysis scope and running time. In this tutorial, you run the Locks and Waits analysis to identify synchronization objects that caused contention and fix the problem in the source.
To run an analysis:
1. From the VTune Amplifier XE toolbar, click the New Analysis button. The VTune Amplifier XE result tab opens with the Analysis Type window active.
2. From the analysis tree on the left, select Algorithm Analysis > Locks and Waits.
The right pane is updated with the default options for the Locks and Waits analysis.
3. Click the Start button on the right command bar. The VTune Amplifier XE launches the tachyon_analyze_locks executable, which renders the balls.dat input file, calculates the execution time, and exits. The VTune Amplifier XE finalizes the collected data and opens the results in the Locks and Waits viewpoint.
NOTE To make sure the performance of the application is repeatable, go through the entire tuning process on the same system with a minimal amount of other software executing.

Recap
You ran the Locks and Waits data collection, which analyzes how long the application had to wait on each synchronization object or on blocking APIs, such as sleep() and blocking I/O, and estimates processor utilization during the wait.
NOTE This tutorial explains how to run an analysis from the VTune Amplifier XE graphical user interface (GUI). You can also use the VTune Amplifier XE command-line interface (amplxe-cl command) to run an analysis. For more details, check the Command-line Interface Support section of the VTune Amplifier XE Help.

Key Terms and Concepts
• Term: viewpoint
• Concept: Locks and Waits Analysis, Finalization

Next Step
Interpret Result Data

Interpret Result Data
When the sample application exits, the Intel® VTune™ Amplifier XE finalizes the results and opens the Locks and Waits viewpoint that consists of the Summary window, Bottom-up pane, Top-down Tree pane, Call Stack pane, and Timeline pane. To interpret the data on the sample code performance, do the following:
• Analyze the basic performance metrics provided by the Locks and Waits analysis.
• Identify locks.
NOTE The screenshots and execution time data provided in this tutorial were created on a system with four CPU cores. Your data may vary depending on the number and type of CPU cores on your system.
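The Run Locks and Waits Analysis step notes that the same collection can be started with the amplxe-cl command. The Python sketch below only assembles such a command line and does not run anything. The -collect locksandwaits and -result-dir options, and the r000lw directory name, are assumptions based on typical amplxe-cl usage, so check them against the Command-line Interface Support section of the help. The function name, the executable path, and the input file argument are hypothetical.

```python
def locks_and_waits_command(app, app_args=(), result_dir="r000lw"):
    """Assemble an amplxe-cl argument list for a Locks and Waits collection."""
    return [
        "amplxe-cl",
        "-collect", "locksandwaits",  # assumed analysis type name
        "-result-dir", result_dir,    # assumed option; directory name is hypothetical
        "--",                         # everything after this is the target command
        app,
        *app_args,
    ]


if __name__ == "__main__":
    # Hypothetical target and input file; substitute the paths from your own build.
    cmd = locks_and_waits_command("tachyon_analyze_locks.exe", ["balls.dat"])
    print(" ".join(cmd))
```

Building the argument list separately from running it makes the command easy to log, and the same list can later be passed to subprocess.run on a machine where the tool is installed.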
Analyze the Basic Locks and Waits Metrics
Start by exploring the data provided in the Summary window for the whole application performance. To interpret the data, hover over the question mark icons to read the pop-up help and better understand what each performance metric means.
The Result Summary section provides data on the overall application performance per the following metrics:
1) Elapsed Time is the wall-clock time from the beginning to the end of the collection;
2) Total Thread Count is the number of threads in the application;
3) Wait Time is the amount of time the application threads waited for some event to occur, such as synchronization waits and I/O waits;
4) Wait Count is the overall number of times the system wait API was called for the analyzed application;
5) CPU Time is the sum of CPU time for all threads;
6) Spin Time is the time a thread is active in a synchronization construct.
For the tachyon_analyze_locks application, the Wait time is high. To identify the cause, you need to understand how this Wait time was distributed across synchronization objects.
The Top Waiting Objects section lists the five synchronization objects with the highest Wait Time and Wait Count, sorted by the Wait Time metric. For the tachyon_analyze_locks application, focus on the first three objects and explore the Bottom-up pane data for more details.
The Thread Concurrency Histogram represents the Elapsed time and concurrency level for the specified number of running threads. Ideally, the highest bar of your chart should be within the Ok or Ideal utilization range. Note the Target value. By default, this number is equal to the number of physical cores. Consider this number as your optimization goal. The Average metric is calculated as CPU time / Elapsed time. Use this number as a baseline for your performance measurements.
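The Average metric is a plain ratio, so it is easy to reproduce by hand. The following minimal sketch shows the calculation and the gap to the Target value. The function names are hypothetical, and the 0.7 and 4 used in the example are the Average CPU Usage and core count quoted for the sample in this tutorial.

```python
def average_cpu_usage(cpu_time_s, elapsed_time_s):
    """Average = CPU time / Elapsed time, as defined for the Summary window."""
    if elapsed_time_s <= 0:
        raise ValueError("elapsed time must be positive")
    return cpu_time_s / elapsed_time_s


def gap_to_target(average, target_cores):
    """How far the measured Average is below the Target (number of cores)."""
    return target_cores - average


if __name__ == "__main__":
    # Values quoted in the tutorial text: Average of about 0.7 on four cores.
    print(f"gap to target: {gap_to_target(0.7, 4):.1f} cores")
```

A gap close to zero means the threads keep nearly all cores busy for the whole run; a large gap means most of the Elapsed time passes with cores idle.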
The closer the Average value is to the number of cores, the better. For the sample code, the chart shows that tachyon_analyze_locks is a multithreaded application running four threads on a machine with four cores, but it is not using the available cores effectively. The Average CPU Usage on the chart is about 0.7, while your target should be to bring it as close to 4 as possible (for a system with four cores). Hover over the second bar to understand how long the application ran serially. The tooltip shows that the application ran one thread for almost 15 seconds, which is classified as Poor concurrency.
The CPU Usage Histogram represents the Elapsed time and usage level for the logical CPUs. Ideally, the highest bar of your chart should be within the Ok or Ideal utilization range.
The tachyon_analyze_locks application ran mostly on one logical CPU. If you hover over the second bar, you see that it spent 16.603 seconds using one core only, which the VTune Amplifier XE classifies as Poor utilization. To understand what prevented the application from using all available logical CPUs effectively, explore the Bottom-up pane.

Identify Locks
Click the Bottom-up tab to open the Bottom-up pane.
Synchronization objects that control threads in the application. The hash (unique number) appended to some object names identifies the stack that created the synchronization object. For Intel® Threading Building Blocks (Intel® TBB), the VTune Amplifier XE is able to recognize all types of Intel TBB objects. To display the overhead introduced by Intel TBB library internals, the VTune Amplifier XE creates a pseudo synchronization object, TBB scheduler, that includes all waits from the Intel TBB runtime libraries.
The utilization of the processor time when a given thread waited for some event to occur. By default, the synchronization objects are sorted by Poor processor utilization type.
Bars showing OK or Ideal utilization (orange and green) are utilizing the processors well. Focus your optimization efforts on functions with the longest poor CPU utilization (red bars if the bar format is selected). Next, search for the longest over-utilized time (blue bars). This is the Data of Interest column for the Locks and Waits analysis results; it is used for different types of calculations, for example, the call stack contribution and the percentage value on the filter toolbar.
Number of times the corresponding system wait API was called. For a lock, it is the number of times the lock was contended and caused a wait. Focus your tuning efforts on the waits with both high Wait Time and high Wait Count values, especially if they have poor utilization.
Wait time during which the CPU is busy. This often occurs when a synchronization API causes the CPU to poll while the software thread is waiting. Some Spin time may be preferable to the alternative of increased thread context switches. However, too much Spin time can reflect lost opportunity for productive work.
For the analyzed sample code, the top three synchronization objects caused the longest Wait time. The red bars in the Wait Time column indicate that the processor cores were underutilized for most of the time spent waiting on these objects. From your knowledge of the code, you may understand that the Manual and Auto Reset Event objects are most likely related to the join where the main program is waiting for the worker threads to finish. This should not be a problem. The third item in the Bottom-up pane is more interesting. It is a Critical Section that shows much serial time and is causing a wait. Click the plus sign at the object name to expand the node and see the draw_task wait function that contains this critical section and call stack.
Double-click the Critical Section to see the source code for the wait function.

Recap
You identified a synchronization object with high Wait Time and Wait Count values and poor CPU utilization that could be a lock affecting application parallelism. Your next step is to analyze the code of this function.

Key Terms and Concepts
• Term: Elapsed time, Wait time
• Concept: Locks and Waits Analysis, CPU Usage, Data of Interest

Next Step
Analyze Code

Analyze Code
You identified the critical section that caused significant Wait time and poor processor utilization. Double-click this critical section in the Bottom-up pane to view the source. The Intel® VTune™ Amplifier XE opens source and disassembly code. Focus on the Source pane and analyze the source code:
• Understand basic options provided in the Source window.
• Identify the hottest code lines.

Understand Basic Source View Options
The table below explains some of the features available in the Source pane for the Locks and Waits viewpoint.
Source code of the application, displayed if the function symbol information is available. When you go to the source by double-clicking the synchronization object in the Bottom-up pane, the VTune Amplifier XE opens the wait function containing this object and highlights the code line that took the most Wait time. The source code in the Source pane is not editable. If the function symbol information is not available, the Assembly pane opens displaying assembler instructions for the selected wait function. To view the source code in the Source pane, make sure to build the target properly.
Processor time and utilization bar attributed to a particular code line. The colored bar represents the distribution of the Wait time according to the utilization levels (Idle, Poor, Ok, Ideal, and Over) defined by the VTune Amplifier XE. The longer the bar, the higher the value. Ok utilization level is not available for systems with a small number of cores.
This is the Data of Interest column for the Locks and Waits analysis.
Number of times the corresponding system wait API was called while this code line was executing. For a lock, it is the number of times the lock was contended and caused a wait.
Source window toolbar. Use hotspot navigation buttons to switch between the most performance-critical code lines. Hotspot navigation is based on the metric column selected as the Data of Interest. For the Locks and Waits analysis, this is Wait Time. Use the source file editor button to open and edit your code in your default editor.

Identify the Hottest Code Lines
The VTune Amplifier XE highlights line 170, which enters the critical section rgb_critical_section in the draw_task function. The draw_task function was waiting for almost 27 seconds while this code line was executing, and for most of that time the processor was underutilized. During this time, the critical section was contended 438 times.
The rgb_critical_section is the place where the application is serializing. Each thread has to wait for the critical section to be available before it can proceed. Only one thread can be in the critical section at a time. You need to optimize the code to make it more concurrent.
Click the Source Editor button on the Source window toolbar to open the code editor and optimize the code.

Recap
You identified the code section that caused a significant wait and during which the processor was poorly utilized.

Key Terms and Concepts
• Term: Wait time
• Concept: CPU Usage, Locks and Waits Analysis, Data of Interest

Next Step
Remove Lock

Remove Lock
In the Source window, you located the critical section that caused a significant wait while the processor cores were underutilized and that generated a high Wait Count. Focus on this line and do the following:
1. Open the code editor.
2. Modify the code to remove the lock.
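To illustrate why a critical section serializes threads, and when it may be unnecessary, here is a minimal Python sketch. It is not the tachyon source: the image size, the shade function, and the row-per-thread layout are all hypothetical, and the lock merely plays the role of rgb_critical_section. The sketch assumes each thread writes only to rows it owns, which is the kind of situation where a lock adds waiting without adding safety.

```python
import threading

WIDTH, HEIGHT = 64, 16  # hypothetical image size


def shade(x, y):
    """Stand-in for the per-pixel color computation (hypothetical)."""
    return (x * 31 + y * 17) % 256


def render(num_threads, use_lock):
    """Fill an image with num_threads workers, optionally guarding each row with a lock."""
    image = [[0] * WIDTH for _ in range(HEIGHT)]
    lock = threading.Lock()  # plays the role of rgb_critical_section

    def worker(first_row):
        # Each thread owns rows first_row, first_row + num_threads, ... (disjoint data).
        for y in range(first_row, HEIGHT, num_threads):
            if use_lock:
                with lock:  # only one thread at a time may compute a row
                    image[y] = [shade(x, y) for x in range(WIDTH)]
            else:
                image[y] = [shade(x, y) for x in range(WIDTH)]

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return image


if __name__ == "__main__":
    print(render(4, use_lock=True) == render(4, use_lock=False))
```

Both variants produce the same image because no two threads ever touch the same row, so the lock protects nothing. Note that CPython's global interpreter lock means this sketch demonstrates correctness rather than speedup; in a native application the unlocked variant is the one that lets all cores work at once.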
Open the Code Editor
Click the Source Editor button to open the analyze_locks.cpp file in your default editor at the hotspot code line.

Remove the Lock
The rgb_critical_section was introduced to protect the calculation from multithreaded access. A brief analysis shows that the code is thread safe and the critical section is not really needed. To resolve this issue:
NOTE The steps below are provided for Microsoft Visual Studio* 2005. Steps for other versions of Visual Studio IDE or for the standalone version of the VTune Amplifier XE may slightly differ.
1. Comment out code lines 170 and 178 to disable the critical section.
2. From Solution Explorer, select the analyze_locks project.
3. From the Visual Studio menu, select Build > Rebuild analyze_locks. The project is rebuilt.
4. From the Visual Studio menu, select Debug > Start Without Debugging to run the application. Visual Studio runs tachyon_analyze_locks.exe. Note that the execution time was reduced from 33.578 seconds to 20.328 seconds.

Recap
You optimized the application execution time by removing the unnecessary critical section that caused a lot of Wait time.

Key Terms and Concepts
• Term: hotspot
• Concept: Locks and Waits Analysis

Next Step
Compare with Previous Result

Compare with Previous Result
You made sure that removing the critical section reduced the application execution time by about 13 seconds. To understand the impact of your changes and how the CPU utilization has changed, re-run the Locks and Waits analysis on the optimized code and compare results:
• Compare results before and after optimization.
• Identify the performance gain.

Compare Results Before and After Optimization
1. Run the Locks and Waits analysis on the modified code.
2. Click the Compare Results button on the Intel® VTune™ Amplifier XE toolbar. The Compare Results window opens.
3.
Specify the Locks and Waits analysis results you want to compare:
The Summary window opens, providing the statistics for the difference between the collected results. Click the Bottom-up tab to see the list of synchronization objects used in the code, Wait time utilization across the two results, and the differences side by side:
Difference in Wait time per utilization level between the two results in the following format: = . By default, the Difference column is expanded to display comparison data per utilization level. You may collapse the column to see the total difference data per Wait time.
Wait time and CPU utilization for the initial version of the code.
Wait time and CPU utilization for the optimized version of the code.
Difference in Wait count between the two results in the following format: = - .
Wait count for the initial version of the code.
Wait count for the optimized version of the code.

Identify the Performance Gain
The Elapsed time data in the Summary window shows an improvement of 4 seconds for the whole application execution, and Wait time decreased by 37.5 seconds.
According to the Thread Concurrency histogram, before optimization (blue bar) the application ran serially for 9 seconds, poorly utilizing the available processor cores, but after optimization (orange bar) it ran serially for only 2 seconds. After optimization, the application ran 5 threads simultaneously, overutilizing the cores for almost 5 seconds. You may consider this an additional area for improvement.
In the Bottom-up pane, locate the Critical Section you identified as a bottleneck in your code. Since you removed it during optimization, the optimized result r001lw does not show any performance data for this synchronization object. If you collapse the Wait Time:Difference column by clicking the button, you see that the optimized result saves almost 29 seconds of Wait time.
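The gain figures above are differences between a "before" and an "after" measurement. A minimal sketch of that arithmetic follows, using the wall-clock times quoted in the Remove the Lock step (33.578 seconds before, 20.328 seconds after). The function name and the returned field names are hypothetical and are not part of any VTune Amplifier XE interface.

```python
def compare_runs(before_s, after_s):
    """Summarize the change between a baseline run and an optimized run."""
    if before_s <= 0 or after_s <= 0:
        raise ValueError("run times must be positive")
    saved = before_s - after_s
    return {
        "saved_s": saved,                         # absolute time saved
        "speedup": before_s / after_s,            # how many times faster
        "reduction_pct": 100.0 * saved / before_s,  # share of the baseline removed
    }


if __name__ == "__main__":
    # Execution times quoted in the tutorial for the sample application.
    result = compare_runs(33.578, 20.328)
    for name, value in result.items():
        print(f"{name}: {value:.3f}")
```

Reporting the relative reduction alongside the absolute difference makes results from machines with different baselines easier to compare.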
Recap
You ran the Locks and Waits analysis on the optimized code and compared the results before and after optimization using the Compare mode of the VTune Amplifier XE. The comparison shows that, with the optimized version of the tachyon_analyze_locks application (r001lw result), you removed the lock that prevented application parallelism and significantly reduced the application execution time.
Compare analysis results regularly to look for regressions and to track how incremental changes to the code affect its performance. You may also want to use the VTune Amplifier XE command-line interface and run the amplxe-cl command to test your code for regressions. For more details, see the Command-line Interface Support section in the VTune Amplifier XE online help.

Key Terms and Concepts
• Term: hotspot, Wait time
• Concept: Locks and Waits Analysis, CPU Usage

Next Step
Read Summary

Summary
You have completed the Analyzing Locks and Waits tutorial. Here are some important things to remember when using the Intel® VTune™ Amplifier XE to analyze your code for locks and waits:

Step 1. Choose and Build Your Target
• Configure the Microsoft* symbol server and your project properties to get the most accurate results for system and user binaries and to analyze the performance of your application at the code line level.
• Create a performance baseline to compare the application versions before and after optimization. Make sure to use the same workload for each application run.
• Use the Project Properties: Target tab to choose and configure your analysis target. For Visual Studio* projects, the analysis target settings are inherited automatically.

Step 2. Run Analysis
• Use the Analysis Type configuration window to choose, configure, and run the analysis.
For example, you may limit the data collection to a predefined amount of data or enable the VTune Amplifier XE to collect more accurate CPU time data. You can also run the analysis from the command line using the amplxe-cl command.

Step 3. Interpret Results and Resolve the Issue
• Start analyzing the performance of your application with the Summary pane to explore the performance metrics for the whole application. Then, move to the Bottom-up window to analyze the synchronization objects. Focus on the synchronization objects that under- or over-utilized the available logical CPUs and have the highest Wait time and Wait Count values. By default, the objects with the highest Wait time values show up at the top of the window.
• Expand the most time-critical synchronization object in the Bottom-up pane and double-click the wait function it belongs to. This opens the source code for this wait function at the code line with the highest Wait time value.

Step 4. Compare Results Before and After Optimization
• Perform regular regression testing by comparing analysis results before and after optimization. From the GUI, click the Compare Results button on the VTune Amplifier XE toolbar. From the command line, use the amplxe-cl command.
• Expand each data column by clicking the button to identify the performance gain per CPU utilization level.

Tutorial: Identifying Hardware Issues 3

Learning Objectives
This tutorial shows how to use the General Exploration analysis of the Intel® VTune™ Amplifier XE to identify the hardware-related issues in the sample application.
Estimated completion time: 15 minutes.
Sample application: matrix.
After you complete this tutorial, you should be able to:
• Choose an analysis target.
• Run the General Exploration analysis for Intel® microarchitecture code name Nehalem.
• Understand the event-based performance metrics.
• Identify the types of the most critical hardware issues for the application as a whole.
• Identify the modules/functions that caused the most critical hardware issues.
• Analyze the source code to locate the most critical code lines.
• Identify the next steps of the performance analysis to get more detailed results.

Start Here
Workflow Steps to Identify Hardware Issues

Workflow Steps to Identify Hardware Issues
You can use an advanced event-based sampling analysis of the Intel® VTune™ Amplifier XE to identify the most significant hardware issues that affect the performance of your application. This tutorial guides you through these workflow steps running the General Exploration analysis type on a sample matrix application.
1. Do one of the following:
• Visual Studio* IDE: Choose a project, verify settings, and build application.
• Standalone GUI: Build an application to analyze for hardware issues and create a new VTune Amplifier XE project.
2. Choose and run the General Exploration analysis.
3. Interpret the result data.
4. View and analyze code of the performance-critical functions.
5. Modify the code to resolve the detected performance issues and rebuild the code.

Visual Studio* IDE: Choose Project and Build Application
Before you start analyzing hardware issues affecting the performance of your application, do the following:
1. Choose a project with the analysis target in the Visual Studio IDE.
2. Configure the Microsoft Visual Studio* environment to download the debug information for system libraries so that the VTune Amplifier XE can properly identify system functions and classify and attribute functions.
3. Configure Visual Studio project properties to generate the debug information for your application so that the VTune Amplifier XE can open the source code.
4. Build the target in the release mode with full optimizations, which is recommended for performance analysis.
For this tutorial, your target is the matrix application that calculates matrix transformations. To learn how to install and set up the sample code, see Prerequisites.
• The steps below are provided for Microsoft Visual Studio* 2005. Steps for other versions of Visual Studio IDE or for the standalone version of the Intel® VTune™ Amplifier XE 2011 may slightly differ.
• Steps provided by this tutorial are generic and applicable to any application. You may choose to follow the proposed workflow using your own application.

Choose a Project
1. From the Visual Studio menu, select File > Open > Project/Solution.... The Open Project dialog box opens.
2. In the Open Project dialog box, browse to the location where you extracted the matrix_vtune_amp_xe.zip file and select the matrix.sln file. The solution is added to Visual Studio and shows up in the Solution Explorer.
VTune Amplifier XE automatically inherits Visual Studio settings and uses the currently opened project as a target project for performance analysis. When you choose a project in Visual Studio IDE, the VTune Amplifier XE automatically creates the config.amplxeproj project file and sets the matrix application as an analysis target in the project properties.

Enable Downloading the Debug Information for System Libraries
1. Go to Tools > Options.... The Options dialog box opens.
2. From the left pane, select Debugging > Symbols.
3. In the Symbol file (.pdb) locations field, click the button and specify the following address: http://msdl.microsoft.com/download/symbols.
4. Make sure the added address is checked.
5. In the Cache symbols from symbol servers to this directory field, specify a directory where the downloaded symbol files will be stored.
6. For Microsoft Visual Studio* 2005, check the Load symbols using the updated settings when this dialog is closed box.
7. Click OK.
Enable Generating Debug Information for Your Binary Files
1. Select the matrix project and go to Project > Properties.
2. From the matrix Property Pages dialog box, select Configuration Properties > General and make sure the selected Configuration (top of the dialog) is Active(Release).
3. From the matrix Property Pages dialog box, select C/C++ > General pane and specify the Debug Information Format as Program Database (/Zi).
4. From the matrix Property Pages dialog box, select Linker > Debugging and set the Generate Debug Info option to Yes (/DEBUG).

Choose a Build Mode and Build a Target
1. Go to the Build > Configuration Manager... dialog box and select the Release mode for your target project.
2. From the Visual Studio menu, select Build > Build matrix. The matrix.exe application is built.

Recap
You selected the matrix project as the target for the hardware event-based sampling analysis, set up your environment to enable generating symbol information for system libraries and your binary files, and built the target in the Release mode. Your application is ready for analysis.

Next Step
Run General Exploration Analysis

Standalone GUI: Build Application and Create New Project
Before you start analyzing hardware issues affecting the performance of your application, do the following:
1. Build the application. If you build the code in Visual Studio*, make sure to:
• Configure the Microsoft Visual Studio* environment to download the debug information for system libraries so that the VTune Amplifier XE can properly identify system functions and classify and attribute functions.
• Configure Visual Studio project properties to generate the debug information for your application so that the VTune Amplifier XE can open the source code.
• Build the target in the release mode with full optimizations, which is recommended for performance analysis.
2. Create a VTune Amplifier XE project.
NOTE The steps below are provided for Microsoft Visual Studio* 2005. Steps for other versions of Visual Studio IDE or for the standalone version of the Intel® VTune™ Amplifier XE may differ slightly.

Enable Downloading the Debug Information for System Libraries
1. Go to Tools > Options.... The Options dialog box opens.
2. From the left pane, select Debugging > Symbols.
3. In the Symbol file (.pdb) locations field, click the button and specify the following address: http://msdl.microsoft.com/download/symbols.
4. Make sure the added address is checked.
5. In the Cache symbols from symbol servers to this directory field, specify a directory where the downloaded symbol files will be stored.
6. For Microsoft Visual Studio* 2005, check the Load symbols using the updated settings when this dialog is closed box.
7. Click OK.

Enable Generating Debug Information for Your Binary Files
1. Select the matrix project and go to Project > Properties.
2. From the matrix Property Pages dialog box, select Configuration Properties > General and make sure the selected Configuration (top of the dialog) is Active(Release).
3. From the matrix Property Pages dialog box, select C/C++ > General pane and specify the Debug Information Format as Program Database (/Zi).
4. From the matrix Property Pages dialog box, select Linker > Debugging and set the Generate Debug Info option to Yes (/DEBUG).

Choose a Build Mode and Build a Target
1. Go to the Build > Configuration Manager... dialog box and select the Release mode for your target project.
2. From the Visual Studio menu, select Build > Build matrix. The matrix.exe application is built.

Create a Project
1.
From the Start menu select Intel Parallel Studio XE 2011 > Intel VTune Amplifier XE 2011 to launch the VTune Amplifier XE GUI client.
2. Create a new project via File > New > Project.... The Create a Project dialog box opens.
3. Specify the project name matrix, which will be used as the project directory name, and click the Create Project button. By default, the VTune Amplifier XE creates a project directory under the %USERPROFILE%\My Documents\My Amplifier XE Projects directory and opens the Project Properties: Target dialog box.
4. In the Target: Application to Launch pane, browse to the matrix.exe application and click OK.

Recap

You set up your environment to enable generating symbol information for system libraries and your binary files, built the target in the Release mode, and created the VTune Amplifier XE project for your analysis target. Your application is ready for analysis.

Key Terms and Concepts

• Term: target
• Concept: Event-based Sampling Analysis

Next Step

Run General Exploration Analysis

Run General Exploration Analysis

After building the target, you can run it with the Intel® VTune™ Amplifier XE to analyze its performance. In this tutorial, you run the General Exploration analysis on the Intel® Core™ i7 processor based on the Intel® microarchitecture code name Nehalem. The General Exploration analysis type helps identify the widest scope of hardware issues that affect the application performance. This analysis type is based on the hardware event-based sampling collection.

NOTE
The steps below are provided for Microsoft Visual Studio* 2005. Steps for other versions of the Visual Studio IDE or for the standalone version of the VTune Amplifier XE may differ slightly.

To run the analysis:

1. From the VTune Amplifier XE toolbar, click the New Analysis button. The New Amplifier XE Result tab opens with the Analysis Type configuration window active.
2.
From the analysis tree on the left, select the Advanced Intel(R) Microarchitecture Code Name Nehalem Analysis > General Exploration analysis type.
3. Click the Start button on the right to run the analysis.

The VTune Amplifier XE launches the matrix application that calculates matrix transformations and exits. The VTune Amplifier XE finalizes the collected data and opens the results in the Hardware Issues viewpoint.

NOTE
To make sure the performance of the application is repeatable, go through the entire tuning process on the same system with a minimal amount of other software executing.

Recap

You ran the General Exploration analysis that monitors how your application performs against a set of event-based hardware metrics. To see the list of processor events used for this analysis type, see the Details section of the General Exploration configuration pane.

Key Terms and Concepts

• Term: viewpoint
• Concept: Event-based Sampling Analysis, Finalization

Next Step

Interpret Results

Interpret Results

When the application exits, the Intel® VTune™ Amplifier XE finalizes the results and opens the Hardware Issues viewpoint that consists of the Summary window, Bottom-up window, and Timeline pane. To interpret the collected data and understand where you should focus your tuning efforts for the specific hardware, do the following:

• Understand the event-based metrics
• Identify the hardware issues that affect the performance of your application

NOTE
The screenshots and execution time data provided in this tutorial were created on a system with four CPU cores. Your data may vary depending on the number and type of CPU cores on your system.

Understand the Event-based Metrics

Click the Summary tab to explore the data provided in the Summary window for the whole application performance.
Elapsed time is the wall time from the beginning to the end of the collection. Treat this metric as your basic performance baseline against which you will compare subsequent runs of the application. The goal of your optimization is to reduce the value of this metric.

Event-based performance metrics. Each metric is an event ratio provided by Intel architects. Mouse over the yellow icon to see the metric description and the formula used for the metric calculation.

Values calculated for each metric based on the event count. VTune Amplifier XE highlights those values that exceed the threshold set for the corresponding metric. Such a value highlighted in pink signifies an application-level hardware issue.

The text below a metric with a detected hardware issue describes the issue, its potential cause, and recommendations on the next steps, and displays the threshold formula used for the calculation. Mouse over the truncated text to read a full description.

A quick look at the summary results shows that the matrix application has the following issues:

• CPI (Clockticks per Instructions Retired) Rate
• Retire Stalls
• LLC Miss
• LLC Load Misses Serviced by Remote DRAM
• Execution Stalls
• Data Sharing

Identify the Hardware Issues

Click the Bottom-up tab to open the Bottom-up window and see how each program unit performs against the event-based metrics. Each row represents a program unit and the percentage of the CPU cycles used by this unit. Program units that take more than 5% of the CPU time are considered hotspots. This means that by resolving a hardware issue that, for example, took about 20% of the CPU cycles, you can obtain a 20% optimization for the hotspot. By default, the VTune Amplifier XE sorts data in descending order by Clockticks and provides the hotspots at the top of the list.
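The two thresholds mentioned in this tutorial can be illustrated with a few lines of C. This is only a sketch of the arithmetic, not VTune Amplifier XE code: the function names are hypothetical, CPI is taken literally from the metric's name (Clockticks per Instructions Retired), the 5% hotspot limit comes from the paragraph above, and the CPI limit of 1 is the one this tutorial applies to the multiply1 function. Any event counts passed in are assumed to come from your own result.

```c
#include <assert.h>
#include <stdint.h>

/* CPI rate: clockticks divided by instructions retired (hypothetical helper). */
double cpi_rate(uint64_t clockticks, uint64_t instructions_retired)
{
    assert(instructions_retired > 0);
    return (double)clockticks / (double)instructions_retired;
}

/* The tutorial treats a CPI rate above 1 as high. */
int cpi_is_high(double cpi)
{
    return cpi > 1.0;
}

/* The tutorial treats a program unit that takes more than 5% of the
   CPU time as a hotspot; here CPU time is approximated by clockticks. */
int is_hotspot(uint64_t unit_clockticks, uint64_t total_clockticks)
{
    assert(total_clockticks > 0);
    return (double)unit_clockticks / (double)total_clockticks > 0.05;
}
```

For example, a function with 3000 clockticks and 1000 instructions retired has a CPI rate of 3 and would be flagged, while a function that accounts for 40 of 1000 total clockticks would not count as a hotspot.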
You see that the multiply1 function is the most obvious hotspot in the matrix application. It has the highest event count (Clockticks and Instructions Retired events), and most of the hardware issues were also detected during execution of this function.

NOTE
Mouse over a column header with an event-based metric name to see the metric description. Mouse over a highlighted cell to read the description of the hardware issue detected for the program unit.

For the multiply1 function, the VTune Amplifier XE highlights the same issues (except for the Data Sharing issue) that were detected as the issues affecting the performance of the whole application:

• CPI Rate is high (>1). Potential causes are memory stalls, instruction starvation, branch misprediction, or long-latency instructions. To determine the cause for your code, explore other metrics in the Bottom-up window.
• The Retire Stalls metric shows that during the execution of the multiply1 function, about 90% (0.902) of CPU cycles were waiting for data to arrive. This may result from branch misprediction, instruction starvation, long-latency operations, and other issues. Once you have located the stalled instructions in your code, analyze metrics such as LLC Miss, Execution Stalls, Remote Accesses, Data Sharing, and Contested Accesses. You can also look for long-latency instructions like divisions and string operations to understand the cause.
• The LLC Miss metric shows that about 60% (0.592) of CPU cycles were spent waiting for LLC load misses to be serviced. Possible optimizations are to reduce the data working set size, improve data access locality, block and consume data in chunks that fit in the LLC, or better exploit hardware prefetchers. Consider using software prefetchers, but beware that they can increase latency by interfering with normal loads and can increase pressure on the memory system.
• The LLC Load Misses Serviced by Remote DRAM metric shows that 34% (0.340) of cycles were spent servicing memory requests from remote DRAM. Wherever possible, try to consistently use data on the same core, or at least the same package, on which it was allocated.
• The Execution Stalls metric shows that 54% (0.543) of cycles were spent with no micro-operations executed. Look for long-latency operations at code regions with high execution stalls and try to use alternative methods or lower-latency operations. For example, consider replacing div operations with right-shifts or try to reduce the latency of memory accesses.

Recap

You analyzed the data provided in the Hardware Issues viewpoint, explored the event-based metrics, and identified the areas where your sample application had hardware issues. You also identified the exact function that performs poorly against the metrics and is a good candidate for further analysis.

Key Terms and Concepts

• Term: viewpoint, baseline, Elapsed time
• Concept: Event-based Sampling Analysis, Event-based Metrics

Next Step

Analyze Code

Analyze Code

You identified a hotspot function with a number of hardware issues. Double-click the multiply1 function in the Bottom-up window to open the source code.

The table below explains some of the features available in the Source pane when viewing the event-based sampling analysis data.

Source pane displaying the source code of the application, which is available if the function symbol information is available. The code line that took the highest number of Clockticks samples is highlighted. The source code in the Source pane is not editable.

Values per hardware event attributed to a particular code line. By default, the data is sorted by the Clockticks event count. Focus on the events that constitute the metrics identified as performance-critical in the Bottom-up window.
To identify these events, mouse over the metric column header in the Bottom-up window. Drag-and-drop the columns to organize the view for your convenience. VTune Amplifier XE remembers your settings and restores them each time you open the viewpoint.

Hotspot navigation buttons to switch between code lines that took a long time to execute.

Source file editor button to open and edit your code in the default editor.

Assembly button to toggle the Assembly pane that displays assembly instructions for the selected function.

In the Source pane for the multiply1 function, you see that line 39 took most of the Clockticks event samples during execution. This code section multiplies matrices in the loop but accesses the memory inefficiently. Focus on this section and try to reduce the memory issues.

Recap

You analyzed the code for the hotspot function identified in the Bottom-up window and located the hotspot line that generated a high number of CPU Clockticks.

Key Terms and Concepts

• Concept: Event Skid

Next Step

Resolve Issue

Resolve Issue

In the Source pane, you identified that in the multiply1 function the code line 39 resulted in the highest values for the Clockticks event. To solve this issue, do the following:

• Change the multiplication algorithm and, if using the Intel® compiler, enable vectorization.
• Re-run the analysis to verify optimization.

Change Algorithm

NOTE
The proposed solution is one of multiple ways to optimize the memory access and is used for demonstration purposes only.

1. Open the matrix.c file from the Source Files of the matrix project. For this sample, the matrix.c file is used to initialize the functions used in the multiply.c file.
2. In line 90, replace the multiply1 function name with the multiply2 function. This new function uses the loop interchange mechanism that optimizes the memory access in the code.
The proposed optimization assumes you may use the Intel® C++ Compiler to build the code. The Intel compiler helps vectorize the data, which means that it uses SIMD instructions that can work with several data elements simultaneously. If only one source file is used, the Intel compiler enables vectorization automatically. The current sample uses several source files, which is why the multiply2 function uses #pragma ivdep to instruct the compiler to ignore assumed vector dependencies. This information lets the compiler enable the Supplemental Streaming SIMD Extensions 3 (SSSE3).
3. Save files and rebuild the project using the compiler of your choice. If you have the Intel® Composer XE installed, you may use it to build the project with the Intel® C++ Compiler XE. To do this, select Intel Composer XE > Use Intel C++... from the Visual Studio Project menu and then Build > Rebuild matrix.

Verify Optimization

1. From the VTune Amplifier XE toolbar, click the New Analysis button and select Quick Intel(R) Microarchitecture Code Name Nehalem - General Exploration Analysis. VTune Amplifier XE reruns the General Exploration analysis for the updated matrix target and creates a new result, r001ge, that opens automatically.
2. In the r001ge result, click the Summary tab to see the Elapsed time value for the optimized code.

You see that the Elapsed time has been reduced from 56.740 seconds to 9.122 seconds, and the VTune Amplifier XE now identifies only two types of issues for the application performance: high CPI Rate and Retire Stalls.

Recap

You solved the memory access issue for the sample application by interchanging the loops and sped up the execution time. You also considered using the Intel compiler to enable instruction vectorization.
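Loop interchange, as used in this step, can be sketched in C as follows. This is an illustration only and does not reproduce the sample's multiply1 and multiply2 sources: the function names, the matrix size N, and the check_same helper are hypothetical. The idea is that swapping the two inner loops makes the innermost loop walk both b and c along a row, so memory is accessed with stride 1.

```c
#include <assert.h>
#include <stddef.h>

#define N 4 /* hypothetical small size; the sample works on larger matrices */

/* i-j-k order: the inner loop reads b[k][j] down a column (stride N),
   so consecutive iterations touch different cache lines. */
void multiply_naive(double a[N][N], double b[N][N], double c[N][N])
{
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            for (size_t k = 0; k < N; k++)
                c[i][j] += a[i][k] * b[k][j];
}

/* i-k-j order: the inner loop reads b[k][j] and updates c[i][j] along
   a row (stride 1), which suits caches and vectorization better. */
void multiply_interchanged(double a[N][N], double b[N][N], double c[N][N])
{
    for (size_t i = 0; i < N; i++)
        for (size_t k = 0; k < N; k++) {
            /* With the Intel compiler, "#pragma ivdep" could be placed
               before the inner loop, as the tutorial describes. */
            for (size_t j = 0; j < N; j++)
                c[i][j] += a[i][k] * b[k][j];
        }
}

/* Aborts if two results differ. Both orders add the same terms to each
   c[i][j] in the same k order, so the results match exactly. */
void check_same(double x[N][N], double y[N][N])
{
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            assert(x[i][j] == y[i][j]);
}
```

Both functions accumulate into c, so the caller is expected to zero c first. Only the traversal order changes; the arithmetic is the same, which is why the two results can be compared directly.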
Key Terms and Concepts

• Concept: Event-based Sampling Analysis

Next Step

Resolve Next Issue

Optimization Notice

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

Resolve Next Issue

You got a significant performance boost by optimizing the memory access for the multiply1 function. According to the data provided in the Summary window for your updated result, r001ge, you still have high CPI Rate and Retire Stalls issues. You can try to optimize your code further by following the steps below:

• Analyze results after optimization
• Use more advanced algorithms
• Verify optimization

Analyze Results after Optimization

To get more details on the issues that still affect the performance of the matrix application, switch to the Bottom-up window.

You see that the multiply2 function (in fact, the updated multiply1 function) is still a hotspot. Double-click this function to view the source code and click both the Source and Assembly buttons on the toolbar to enable the Source and Assembly panes.

In the Source pane, the VTune Amplifier XE highlights line 53 that took the highest number of Clockticks samples.
This is again the section where matrices are multiplied. The Assembly pane is automatically synchronized with the Source pane. It highlights the basic blocks corresponding to the code line highlighted in the Source pane. If you compiled the application with the Intel® Compiler, you can see that highlighted block 156 includes vectorization instructions added after your previous optimization. All vectorization instructions have the p (packed) postfix (for example, mulpd). You may use the /Qvec-report3 option of the Intel compiler to generate the compiler optimization report and see which loops were not vectorized and why. For more details, see the Intel compiler documentation.

Use More Advanced Algorithms

1. Open the matrix.c file from the Source Files of the matrix project.
2. In line 90, replace the multiply2 function name with the multiply3 function. This function processes the matrix data in blocks.
3. Save the files and rebuild the project.

Verify Optimization

1. From the VTune Amplifier XE toolbar, click the New Analysis button and select Quick Intel(R) Microarchitecture Code Name Nehalem - General Exploration Analysis. VTune Amplifier XE reruns the General Exploration analysis for the updated matrix target and creates a new result, r002ge, that opens automatically.
2. In the r002ge result, click the Summary tab to see the Elapsed time value for the optimized code.

You see that the Elapsed time has been reduced a little, from 9.122 seconds to 8.896 seconds, but the hardware issues identified in the previous run, CPI Rate and Retire Stalls, stayed practically the same. This means that there is more room for improvement and you can try other, more effective, mechanisms of matrix multiplication.

Recap

You tried optimizing the mechanism of matrix multiplication and reduced the application execution time by about 0.2 seconds.
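Working on the matrix data in blocks is commonly implemented as loop tiling. The sketch below shows that general technique in C under stated assumptions: it is not the sample's multiply3 source, and the names multiply_blocked, DIM, BLOCK, and min_size are hypothetical. In practice the block size would be chosen so that the blocks being combined fit in the cache.

```c
#include <assert.h>
#include <stddef.h>

#define DIM 8   /* hypothetical matrix size */
#define BLOCK 4 /* hypothetical block size; tune so blocks fit in cache */

static size_t min_size(size_t x, size_t y)
{
    return x < y ? x : y;
}

/* Tiled multiplication: the three outer loops pick a block of c, a and b;
   the three inner loops multiply within those blocks using the
   row-friendly i-k-j order. The caller is expected to zero c first. */
void multiply_blocked(double a[DIM][DIM], double b[DIM][DIM], double c[DIM][DIM])
{
    assert(BLOCK > 0);
    for (size_t ii = 0; ii < DIM; ii += BLOCK)
        for (size_t kk = 0; kk < DIM; kk += BLOCK)
            for (size_t jj = 0; jj < DIM; jj += BLOCK)
                for (size_t i = ii; i < min_size(ii + BLOCK, DIM); i++)
                    for (size_t k = kk; k < min_size(kk + BLOCK, DIM); k++)
                        for (size_t j = jj; j < min_size(jj + BLOCK, DIM); j++)
                            c[i][j] += a[i][k] * b[k][j];
}
```

The min_size bounds let the same loops handle a matrix size that is not a multiple of the block size. Whether tiling pays off depends on the matrix and cache sizes, which is consistent with the small gain reported in this step.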
Key Terms and Concepts

• Concept: Event-based Sampling Analysis, Event-based Metrics

Next Step

Read Summary

Summary

You have completed the Identifying Hardware Issues tutorial. Here are some important things to remember when using the Intel® VTune™ Amplifier XE to analyze your code for hardware issues:

Step 1. Choose and Build Your Target

• Configure the Microsoft* symbol server and your project properties to get the most accurate results for system and user binaries and to analyze the performance of your application at the code line level.
• Use the Project Properties: Target tab to choose and configure your analysis target. For Visual Studio* projects, the analysis target settings are inherited automatically.

Step 2. Run Analysis

• Use the Analysis Type configuration window to choose, configure, and run the analysis. You may choose a predefined analysis type like the General Exploration type used in this tutorial, or create a new custom analysis type and add events of your choice. For more details on the custom collection, see the Creating a New Analysis Type topic in the product online help.

Step 3. Interpret Results and Resolve the Issue

• Start analyzing the performance of your application from the Summary window to explore the event-based performance metrics for the whole application. Mouse over the yellow help icons to read the metric descriptions. Use the Elapsed time value as your performance baseline.
• Move to the Bottom-up window and analyze the performance per function. Focus on the hotspots - functions that took the highest Clockticks event count. By default, they are located at the top of the table. Analyze the hardware issues detected for the hotspot functions. Hardware issues are highlighted in pink. Mouse over a highlighted value to read the issue description and see the threshold formula.
• Double-click the hotspot function in the Bottom-up pane to open its source code at the code line that took the highest Clockticks event count.
• Consider using the Intel® Compiler, part of the Intel® Composer XE, to vectorize instructions. Explore the compiler documentation for more details.

Optimization Notice

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

More Resources

Getting Help

Intel® VTune™ Amplifier XE provides a number of Getting Started tutorials. These tutorials use a sample application to demonstrate the basic product features and workflows. You can access these documents through the Help menu or by clicking the VTune Amplifier XE icon. From the Visual Studio user interface, select Help > Intel VTune Amplifier XE 2011 > Getting Started Tutorials and explore available tutorials. For the standalone user interface, the tutorials are available via the Help > Getting Started Tutorials menu.

Browsing Help

In the Visual Studio IDE, you can browse and search for topics in different ways:

• Use Help > Contents to open the Contents window and browse the Table of Contents.
• To view help for the VTune Amplifier XE directly, select Help > Intel VTune Amplifier XE 2011 Help.
• Use Help > Index to open the Index window and access an index to VTune Amplifier XE topics. Either type in the keyword you are looking for, or scroll through the list of keywords.
• Use Help > Search to open the Search page and search the full text of topics in the help.

To view help in the standalone user interface, select Intel VTune Amplifier XE 2011 Help from the Help menu.

Locating Intel Topics in the Document Explorer

To filter the documentation so that only the Intel documentation appears, select Help > Contents from the Visual Studio user interface. In the Filtered by: drop-down list, select Intel.

To determine where the currently displayed topic appears in the table of contents (TOC), click the Sync with Table of Contents button on the Visual Studio toolbar to highlight the topic in the Contents pane.

Navigating in the Product Usage Workflow

Where applicable, the VTune Amplifier XE help topics provide a Where am I in the workflow? button. Click the button to view the workflow with a highlight on the stage that this topic discusses.

Activating Intel Search Filters in the Document Explorer

With Microsoft Visual Studio 2005 and 2008, you can include Intel documentation in all search results by checking the Intel search filter box for the Language, Technology, and Content Type categories. You must check the Intel search box for all three categories to include Intel documentation in your searches. Unchecking all three Intel search boxes excludes Intel documentation from search results. The Intel search filters work in combination with other search options for each category.

Using Context-Sensitive Help

Context-sensitive help enables easy access to help topics on active GUI elements. The following context-sensitive help features are available on a product-specific basis:

• ? Help: In Visual Studio, click the ? button in the upper-right corner of the dialog box or pane to get help for the dialog box or pane.
• F1 Help: Press F1 to get help for an active dialog box, property page, pane, or window.
• Dynamic Help: In Visual Studio 2005/2008, select Help > Dynamic Help to open the Dynamic Help window, which displays links to relevant help topics for the current window.

Product Website and Support

The following links provide information and support on Intel software products, including Intel® Parallel Studio XE:

• http://software.intel.com/en-us/articles/tools/
Intel® Software Development Products Knowledge Base.
• http://www.intel.com/software/products/support/
Technical support information, to register your product, or to contact Intel.

For additional support information, see the Technical Support section of your Release Notes.

System Requirements

For detailed information on system requirements, see the Release Notes.

Intel® VTune™ Amplifier XE Tutorials Troubleshooting

Problem: Cannot open samples
The sample projects are Visual Studio* 2005 projects. You may have a problem opening the sample if you have a later version of Visual Studio* software.
Solution: Use the conversion wizard to convert the solution/projects to the newer version.

Problem: Product is not recognized
If you installed a new version of Visual Studio* software, the previously installed Intel® VTune™ Amplifier XE may not appear in the new installation.
Solution 1: If you have the VTune Amplifier XE installation executable file, run the installation program, select Modify, and follow the instructions to reintegrate the VTune Amplifier XE with your new version of Visual Studio* software.
Solution 2:
1. Go to Control Panel > Add or Remove Programs.
2. Select the VTune Amplifier XE and select Modify.
3. Follow the instructions to reintegrate the VTune Amplifier XE with your new version of Visual Studio* software.
Problem: The Project Properties function is disabled
The Intel VTune Amplifier XE 2011 Project Properties option does not appear on the Project menu, and the icon is disabled on the VTune Amplifier XE toolbar.
Solution: Make sure the item highlighted in the Solution Explorer is a valid project recognized by Visual Studio* software or a VTune Amplifier XE result. (The My Amplifier XE Results folder is a virtual project.)

Problem: The Start button is disabled
The Start button on the command toolbar is disabled.
Solution: Make sure you specified an analysis target. If the target is not specified, click the Project Properties button on the command toolbar and enter the target name in the Application to Launch pane. For the General Exploration analysis, the Start button may be disabled if you chose an analysis type for the wrong processor. The selected analysis type should match your processor type.

Intel® VTune™ Amplifier XE 2011 Release Notes

Intel® VTune™ Amplifier XE 2011 Release Notes for Linux
Installation Guide and Release Notes
Document number: 323591-001US
2 November 2011

Contents:
Introduction
What's New
System Requirements
Technical Support
Installation Notes
Issues and Limitations
Attributions
Disclaimer and Legal Information

1 Introduction

The Intel® VTune™ Amplifier XE 2011 provides an integrated performance analysis and tuning environment with a graphical user interface that helps you analyze code performance on systems with IA-32 or Intel® 64 architectures.

This document provides system requirements, installation instructions, issues and limitations, and legal information.

The Intel® VTune™ Amplifier XE 2011 has a standalone graphical user interface (GUI) as well as a command-line interface (CLI).

2 What's New

The Intel® VTune™ Amplifier XE 2011 Update 6 adds:
• Intel® Atom™ processors (code name Saltwell and Cedarview) support, including hardware event-based sampling analysis types and metrics for advanced tuning
•
Bandwidth analysis for the 32nm Intel® processors code name Eagleton
• Inline functions support (controlled by a filter bar mode)
• "Tiny" threads timeline mode
• Red Hat* Enterprise Linux 5.7 support
• Bug fixes

The Intel® VTune™ Amplifier XE 2011 Update 5 adds:
• Project Explorer
• Bandwidth Analysis for the 2nd Generation Intel® Core™ Processor Family (codenamed Sandy Bridge)
• Advanced options for analyzing child processes
• Command line reports with stacks
• Support for analysis of MPI programs that use the Intel® MPI library
• Usability improvements
• Newer Linux OS support: Fedora 15, Ubuntu 11.04, Debian 5, MeeGo 1.2 Gold

The Intel® VTune™ Amplifier XE 2011 Update 4:
• Update 3 sometimes incorrectly presented CPU Time in the thread timeline for Hotspots and Concurrency analysis types. Different scales were used for different threads and, thereby, could confuse a user by presenting low CPU Time in one thread as the same height in the chart as high CPU Time in another thread. The values presented in the tool tip when hovering over the chart were still correct. Update 4 resolves this problem completely.
• Debian* 6.0 support
• Ubuntu* 11.04 support

The Intel® VTune™ Amplifier XE 2011 Update 3:
• 32nm Westmere Family of Processors (codenamed Westmere-EX) support
• Pre-defined analysis for Intel® Atom™ Processor
• Attach/detach to process for the Hotspots, Concurrency, and Locks and Waits analysis types
• Comparison mode in Summary pane

The Intel® VTune™ Amplifier XE 2011 Update 2:
• The 2nd Generation Intel® Core™ Processor Family (codenamed Sandy Bridge) support, including EBS-based analysis types and metrics for advanced tuning
• Fedora* 14 support
• Automatic highlighting and expansion in Bottom-Up and Top-Down panes
• Tooltips for metrics description in grid panes
• Ability to import tb5/6 files from GUI
•
JIT API support for Hotspots, Concurrency, and Locks and Waits analysis types
• Overhead time metric calculation for native threading synchronization

The Intel® VTune™ Amplifier XE 2011 Update 1:
• Red Hat* Enterprise Linux 6 support
• CentOS* 5.5 support
• Ubuntu* 10.04 support
• Data export to CSV file format
• Source / assembly toggling button
• Several bugs were fixed.

3 System Requirements

For an explanation of architecture names, see http://software.intel.com/en-us/articles/intel-architecture-platform-terminology/

Processor requirements
• For general operations with user interface and all data collection except Hardware event-based sampling analysis
  o A PC based on an IA-32 or Intel® 64 architecture processor supporting the Intel® Streaming SIMD Extensions 2 (Intel® SSE2) instructions (Intel® Pentium 4 processor or later, or compatible non-Intel processor).
  o For the best experience, a multi-core or multi-processor system is recommended.
  o Because Intel® VTune™ Amplifier XE requires specific knowledge of assembly-level instructions, its analysis may not operate correctly if a program contains non-Intel® instructions. In this case, run the analysis with a target executable that contains only Intel instructions. After you finish using VTune™ Amplifier XE, you can use the assembler code or optimizing compiler options that provide the non-Intel instructions.
• For Hardware event-based sampling analysis (EBS)
  o EBS analysis makes use of the on-chip Performance Monitoring Unit (PMU) and requires a genuine Intel processor for collection. EBS analysis is supported on Intel® Pentium® M, Intel® Core™ microarchitecture and newer processors (for more precise details, see the list below).
  o EBS analysis is not supported on the Intel® Pentium 4 processor family (Intel® NetBurst® MicroArchitecture) and non-Intel processors.
  o However, the results collected with EBS can be analyzed using any system meeting the less restrictive general operation requirements.
  o EBS analysis requires a non-virtual machine to ensure access to the on-chip PMU. EBS is not supported within a virtual machine environment.
• The list of supported processors is constantly being extended. Here is a partial list of processors where the EBS analysis is enabled:

Mobile processors
  Intel® Atom™ Processor
  Intel® Core™ i7 Mobile Processor Extreme Edition
  Intel® Core™ i7, i5, i3 Mobile Processors
  Intel® Core™2 Extreme Mobile Processor
  Intel® Core™2 Quad Mobile Processor
  Intel® Core™2 Duo Mobile Processor
  Intel® Core™ Duo Processor
  Intel® Core™ Solo Processor
  Intel® Pentium® Mobile Processor

Desktop processors
  Intel® Atom™ Processor
  Intel® Core™ i7 Desktop Processor Extreme Edition
  Intel® Core™ i7, i5, i3 Desktop Processors
  Intel® Core™2 Quad Desktop Processor
  Intel® Core™2 Extreme Desktop Processor
  Intel® Core™2 Duo Desktop Processor

Server and workstation processors
  Intel® Xeon® processors E7-8800/4800/2800 family
  Intel® Xeon® processors E3-1200 family
  Intel® Xeon® processors 65xx/75xx series
  Intel® Xeon® processors 36xx/56xx series
  Intel® Xeon® processors 35xx/55xx series
  Intel® Xeon® processors 34xx series
  Quad-Core Intel® Xeon® processors 7xxx, 5xxx, and 3xxx series
  Dual-Core Intel® Xeon® processors 7xxx, 5xxx, and 3xxx series

System Memory Requirements
• At least 2 GB of RAM

Disk Space Requirements
• 280 MB free disk space required for all product features and all architectures

Software Requirements
•
• Supported Linux* distributions:
  o Red Hat* Enterprise Linux 4 (starting from Update 8)
  o Red Hat* Enterprise Linux 5 and 6
  o CentOS* versions equivalent to Red Hat* Enterprise Linux* versions listed above
  o SUSE* Linux* Enterprise Server (SLES) 10 and 11
  o Fedora* 14 and 15
  o Ubuntu* 10.04, 10.10 † and 11.04 †
  o Debian* 5.0 and 6.0
  o MeeGo* 1.1 and MeeGo* 1.2 Gold ††
  † VTune™ Amplifier XE supports the Ubuntu* 10.10 and Ubuntu* 11.04 default configuration only for event-based sampling analysis in the command line mode. To learn how to enable all other types of analysis and GUI results, please see the solutions described in the Known Issues and Limitations section, items 200197559 and 200197563, of this document.
  †† Please refer to the Intel® AppUp™ SDK Suite for MeeGo* documentation for more information.
• We support all OS distributions above. For your information, VTune™ Amplifier XE was qualified on the builds listed below:
  o Red Hat* Enterprise Linux 4 Update 8
  o Red Hat* Enterprise Linux 5 Update 6 and 7
  o SUSE* Linux Enterprise Server 10 Service Pack 4
  o SUSE* Linux Enterprise Server 11 Service Pack 1
  o Fedora* 14 and 15
  o Ubuntu* 10.04 and 11.04
  o Debian* 5.0 and 6.0
• Supported compilers:
  o Intel® C/C++ Compiler 11 and higher
  o Intel® Fortran Compiler 11 and higher
  o GNU C/C++ Compiler 3.4.6 and higher
• Application coding requirements
  o Supported programming languages:
    ▪ Fortran
    ▪ C
    ▪ C++
  o Concurrency and Locks and Waits analysis types interpret the use of constructs from the following threading methodologies:
    ▪ Intel® Threading Building Blocks
    ▪ POSIX* Threads on Linux*
    ▪ OpenMP*[1]
    ▪ Intel's C/C++ Parallel Language Extensions
• To view PDF documents, use a PDF reader, such as Adobe Reader*.

Notes:
1. VTune™ Amplifier XE supports analysis of applications built with Intel® Fortran Compiler Professional Edition version 11.0 or higher, Intel® C++ Compiler Professional Edition version 11.0 or higher, or GNU C/C++ Compiler 3.4.6.
Applications that use OpenMP* technology and are built with the GNU compiler must link to the OpenMP* compatibility library as supplied by an Intel® compiler.

4 Technical Support

If you did not register your product during installation, please do so at the Intel® Software Development Products Registration Center. Registration entitles you to free technical support, product updates and upgrades for the duration of the support term.

For information about how to find Technical Support, Product Updates, User Forums, FAQs, tips and tricks, and other support information, please visit http://www.intel.com/software/products/support/

Note: If your distributor provides technical support for this product, please contact them for support rather than Intel.

5 Installation Notes

If you are installing the product for the first time, please be sure to have the product serial number available so you can type it in during installation. A valid license is required for installation and use.

This product package can be used to install the software on both IA-32 systems and Intel® 64 systems. The installer determines the system architecture and installs the appropriate files. Both 32-bit and 64-bit versions of the software are automatically installed on an Intel® 64 system.

To begin installation, do the following:
1. gunzip and untar to retrieve the installation packages.
2. Execute the ./install.sh script file (available at the top level in the untarred contents) as a root user. Activation is required.

Note:
1. To install all components to a network-mounted drive or shared file system, execute the following command in place of the one in step 2 above: ./install.sh --SHARED_INSTALL
2. The install can be run as a non-root user, but in this case not all collectors will be available to the user.
3. For successful installation you should have read and write permissions for the /tmp directory.
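
The two conditions in notes 2 and 3 can be checked before the installer is started. The following is a minimal sketch in Python and is not part of the product: the function names are hypothetical, and it checks only what the notes state (root privileges and read/write access to /tmp).

```python
import os
from typing import List


def can_read_write(directory: str) -> bool:
    """Return True if the current user can read, write, and traverse the directory."""
    return os.path.isdir(directory) and os.access(
        directory, os.R_OK | os.W_OK | os.X_OK
    )


def is_root() -> bool:
    """Return True when the effective user ID is 0 (root)."""
    return os.geteuid() == 0


def preinstall_report(tmp_dir: str = "/tmp") -> List[str]:
    """Collect warnings for the conditions in notes 2 and 3; empty means none found."""
    warnings = []
    if not can_read_write(tmp_dir):
        warnings.append("no read/write access to " + tmp_dir)
    if not is_root():
        warnings.append("not running as root: not all collectors will be available")
    return warnings


if __name__ == "__main__":
    for message in preinstall_report():
        print("warning:", message)
```

An empty report means only that these two conditions hold; it says nothing about licensing or activation.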
Installing Collectors on Remote Systems

You can install the command line data collection features of the product on remote systems to reduce overhead and simply collect data remotely. Data collection on a remote system does not require a license; however, viewing of the data cannot be done on the remote system unless a license is present. The results of any data collection that is run on the remote system must then be copied to the system where the regular install was done for analysis, viewing, and reporting. To do this:
1. Copy the CLI_install folder (found at the top level in the untarred product install package) to the remote machine.
2. Execute the ./install.sh script file (this file is located inside the CLI_install folder). No activation will be required.

Default Installation Directories

The default top-level installation directory for this product is:
• /opt/intel/vtune_amplifier_xe_2011/

This product installs into an arrangement of directories shown in the diagram below. Not all directories will be present in a given installation.
• /opt/intel/vtune_amplifier_xe_2011/
  o bin32
  o bin64*
  o config
  o documentation
  o include
  o lib32
  o lib64*
  o man
  o message
  o resources
  o sepdk
  o samples
(*) bin64 and lib64 are available for Intel® 64 architecture install package

Establishing the VTune™ Amplifier XE Environment

The amplxe-vars.sh script is used to establish the VTune™ Amplifier XE environment. The command takes the form: source /amplxe-vars.sh

Advanced Installation Options

VTune™ Amplifier XE uses a kernel driver to enable event-based sampling (EBS) analysis. If you are not using a default kernel on the supported Linux* distributions listed above, use the SEP Driver Kit in VTune™ Amplifier XE to compile drivers for your kernel. If no pre-built drivers are provided for your kernel, the VTune™ Amplifier XE installer will automatically use the SEP Driver Kit to try to build a driver for your kernel.
The driver can also be built manually after the product is installed using the SEP Driver Kit.

Note: additional software may be needed in order to build and load the SEP kernel driver on the Linux* operating system. For details, see the README.txt file in the sepdk/src directory.

When the Advanced installation is chosen, the following options are available:
• Driver install type [ use pre-built driver (default) / build driver / driver kit files only ]
  If no pre-built driver for this system is found, the option will be set to 'build driver'. You may change the option to 'driver kit files only' if you don't want to build/install the driver or want to do it manually after installation.
• Driver access group [ vtune (default) ]
  Setting the driver access group ownership is a security feature and is used to control access to the kernel module. By default the group for accessing the driver is “vtune”. You may set your own group during installation or change it manually after installation by executing './boot-script --group ' from the sepdk/src directory.
• Load driver [ yes (default) ]
  By default, installation loads the driver into the kernel.
• Install boot script [ yes (default) ]
  By default, installation sets up a boot script which loads the driver into the kernel each time the system is rebooted. The boot script can be disabled later by executing './boot-script --uninstall' from the sepdk/src directory.

How to activate your evaluation software after purchasing

Users of evaluation versions of Intel Developer Products have a new tool that allows converting evaluation-licensed products to fully licensed products once the product is purchased and a serial number is obtained. The “Activation Tool” is a utility that allows users of evaluation products to enter a valid product Serial Number to convert the product to fully licensed status.
Run the /opt/intel/ActivationTool/Activate script, and provide your purchased product serial number, either as an argument to the program, or when prompted. For example:
/opt/intel/ActivationTool/Activate ABCD-123AB45C
Be sure to log in or “su” to root if you want the product license to be available to all system users.

Removing the Product

If you want to remove components from an installation, run the uninstall.sh script as root user from the product installation folder.

6 Issues and Limitations

Known Issues and Limitations

• Running time is attributed to a next instruction (200108041)
  o To collect the data about time-consuming running regions of the target, the VTune™ Amplifier XE interrupts executing target threads and attributes the time to the context IP address.
  o Due to the collection mechanism, the captured IP address points to the instruction that occurs AFTER the one that is actually consuming most of the time. This leads to the running time being attributed to the next instruction (or, rarely, to one of the subsequent instructions) in the Assembly view. In rare cases, this can also lead to wrong attribution of running time in the source: the time may be erroneously attributed to the source line AFTER the actual hot line.
  o If the inline mode is ON and the program has small functions inlined at the hotspots, this can cause the running time to be attributed to a wrong function, since the next instruction can belong to a different function in tightly inlined code.
• An application which allocates massive chunks of memory may fail to work under Amplifier (200083850)
  o If a 32-bit application allocates massive chunks of memory (close to 2 GB) in the heap, it may fail to launch under Amplifier while running fine on its own. This happens because Amplifier requires additional memory in the profiled application process for doing the analysis. A possible workaround is to use a larger address space (e.g.
converting the project to 64-bit).
• SEP may crash certain NHM systems when deep sleep states are enabled (200149603)
  o On some Intel® Core™ i7 processor-based systems with C-states enabled, sampling may cause the system to hang due to a known hardware issue (see errata AAJ134 in http://download.intel.com/design/processor/specupdt/320836.pdf). To avoid this, disable the “Cn(ACPI Cn) report to OS” BIOS option before sampling with the VTune Amplifier XE analyzer on Intel Core™ i7 processor-based systems.
• Link to instruction guide: instruction set reference document is not positioned on description of proper instruction. (200091200)
  o The reference information for assembly instructions can be opened in any PDF viewer, but only Adobe Acrobat Reader* supports positioning the instruction reference document on the required page. To ensure correct functionality of this feature, it is recommended that you install the latest available version of Adobe Acrobat Reader.
• Specifying too low "Sampling After Value" for some events may cause system hang due to frequent events triggering during the collection (200093394)
  o Use a reasonable "Sampling After Value" that results in about 1000 events triggering per second. This is statistically sufficient for the data analysis. For more fine-grained analysis of sampling results, decrease the "Sampling After Value" gradually, observing the slowdown in system responsiveness due to frequent interruptions.
• Security-enhanced Linux* is not supported (200155374)
  o Security-enhanced Linux* settings (SELinux) are currently not supported by the Intel® VTune™ Amplifier XE and need to be either disabled or set to permissive for a successful tool suite installation. If your Linux* distribution has SELinux enabled, the following error message will be issued by the installer:
  o Your system is protected with Security-enhanced Linux (SELinux).
    We currently support only "Permissive" mode, which is not found on the system. To rectify this issue, you may either disable SELinux by
    - setting the line "SELINUX=disabled" in your /etc/sysconfig/selinux file
    - adding "selinux=0" kernel argument in lilo.conf or grub.conf files
    or make SELinux mode adjustment by
    - setting the line "SELINUX=permissive" in your /etc/sysconfig/selinux file
    or ask your system administrator to make SELinux mode adjustment. You may need to reboot your system after changing the system parameters. More information about SELinux can be found at http://www.nsa.gov/selinux/
• The tool may not be able to parse correctly certain characters in an application's command arguments passed through a shell script (200155871)
  o Quotes and double quotes in the application's command arguments may not be parsed correctly. To work around the problem, use double quotes and backslashes to escape double quotes inside.
  o Incorrect: 'this "style" text'
  o Correct: "this \"style\" text"
• Event-based sampling collection cannot start if the result directory path contains non-English characters (200185851)
  o When you install the product on a system with language localization, make sure the path to the result directory does not contain non-English characters.
• On Ubuntu* 10.10 systems, Standalone GUI silently disappears when opening the results. (200197559)
  o Recommendation: switch the visual theme to "New wave" or switch to another window manager (e.g. KDE).
• Intel(R) VTune Amplifier XE collectors may fail to run on Ubuntu 10.10 and Ubuntu 11.04 (200197563)
  o Intel(R) VTune Amplifier XE may fail to collect data for Hotspots, Concurrency, and Locks and Waits analysis types on the Ubuntu 10.10 and Ubuntu 11.04 operating systems. Once a collection is started, the following message appears in the output: “Failed to start profiling because the scope of ptrace() system call application is limited.
To enable profiling, please set /proc/sys/kernel/yama/ptrace_scope to 0. See the Release Notes for instructions on enabling it permanently.”
  o To work around this problem for the current session, set the /proc/sys/kernel/yama/ptrace_scope sysctl to 0.
  o To make this change permanent, set the kernel.yama.ptrace_scope value to 0 in the /etc/sysctl.d/10-ptrace.conf file using root permissions and reboot the machine.
• VTune™ Amplifier XE may be killed while opening results on Ubuntu 10.10 or later if no license is provided (200197888)
  o This happens due to checking a license with enabled trusted storage. A possible workaround is to disable the ptrace protection in the OS by using the command: echo 0 | tee /proc/sys/kernel/yama/ptrace_scope
  o However, normally it's expected that a license is provided for the product before using it.
• VTune Amplifier XE collectors may crash or produce corrupted data while profiling stripped binaries. (200165647)
  o VTune Amplifier XE may fail to collect data for Hotspots, Concurrency, and Locks and Waits analysis types if the main executable of an analysis target statically links some symbols from libc.so or libpthread.so (for example, pthread_create). To avoid this, do not strip the main executable. Use the -E linker switch to export the statically linked symbols to the dynamic symbol table of the main executable. For the list of symbols required for correct profiling, see the Analyzing Statically Linked Libraries topic in the online help.
• Hotspots, Concurrency and Locks and Waits analysis types may not work on executables that do not depend on the libpthread.so.0 library. (200208975)
  o There is currently a limitation in the product regarding profiling application targets where the executable does not depend on the libpthread.so.0 library.
    The message "Link libpthread.so to the application statically and restart profiling" appears when profiling an application where the program image does not depend on libpthread.so.0 but then dlopen()-s a shared library which does depend on libpthread.so.0. The collector is not able to follow the program execution and module load/unload, so the collection results are likely to be misleading.
  o A workaround is to set "LD_PRELOAD=libpthread.so.0" before running the collection.
• VTune Amplifier XE collectors may crash on Red Hat Enterprise Linux x64 system while re-attaching to a process. (200212086)
  o VTune Amplifier XE may fail to collect data for Hotspots, Concurrency, and Locks and Waits analysis types when attempting to attach to a 64-bit process on an RHEL6 system after detaching from the same process.
• Event-based profiling results may be incorrect if nmi_watchdog interrupt capability is enabled (200171859)
  o If the nmi_watchdog interrupt capability is enabled on a Linux system, event-based profiling results may be incorrect. For example, when using a pause-resume scenario for event-based analysis on 64-bit Red Hat* Enterprise Linux* 6.1 with this feature enabled, no data will be collected after the collection is resumed. Before running event-based analysis on Linux systems, ensure that the nmi_watchdog interrupt capability, if available, is disabled. To disable the nmi_watchdog interrupt, add the Linux kernel boot parameter 'nmi_watchdog=0' to your system boot loader and then reboot the system.
• Information collected via ITT API is not available when attaching to a process. (200172007)
  o When collecting statistics data using ITT API calls inserted into the source code, as in Frame Analysis or JIT profiling, attaching to a process will not bring the expected results. Use the VTune Amplifier XE analysis to start an application instead of attaching to a process.
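
Three of the limitations above (SELinux mode, ptrace_scope, and nmi_watchdog) depend on system settings that can be inspected before a collection is started. The following is a minimal read-only sketch in Python, not part of the product. The function names are hypothetical. The paths /etc/sysconfig/selinux and /proc/sys/kernel/yama/ptrace_scope are taken from the text above; reading the kernel boot parameters from /proc/cmdline is an assumption of this sketch.

```python
from typing import Optional


def selinux_mode(config_text: str) -> Optional[str]:
    """Return the SELINUX= value from config file text, or None if it is absent."""
    for raw_line in config_text.splitlines():
        line = raw_line.strip()
        if line.startswith("#"):
            continue  # skip comment lines
        if line.startswith("SELINUX="):
            return line.split("=", 1)[1].strip().lower()
    return None


def ptrace_scope_allows_profiling(value_text: str) -> bool:
    """Return True when the ptrace_scope value is 0, as the workaround requires."""
    return value_text.strip() == "0"


def nmi_watchdog_disabled(cmdline: str) -> bool:
    """Return True when the kernel command line contains nmi_watchdog=0."""
    return "nmi_watchdog=0" in cmdline.split()


def read_text(path: str) -> Optional[str]:
    """Read a small text file; return None if it is missing or unreadable."""
    try:
        with open(path) as handle:
            return handle.read()
    except OSError:
        return None


if __name__ == "__main__":
    selinux_text = read_text("/etc/sysconfig/selinux")
    if selinux_text is not None:
        print("SELinux mode in config:", selinux_mode(selinux_text))

    scope_text = read_text("/proc/sys/kernel/yama/ptrace_scope")
    if scope_text is not None:
        print("ptrace_scope allows profiling:",
              ptrace_scope_allows_profiling(scope_text))

    cmdline_text = read_text("/proc/cmdline")  # assumed location of boot parameters
    if cmdline_text is not None:
        print("nmi_watchdog=0 on kernel command line:",
              nmi_watchdog_disabled(cmdline_text))
```

The sketch only reports the current settings; changing them still follows the workarounds given in the individual items and requires root permissions.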
• Do not use -ipo option - it causes the inline debug information to switch off (200260765)
  o If using the Intel® compiler to get performance data on inline functions, use the additional option “-inline-debug-info”, but avoid using the -ipo option. Currently this option disables generating the inline debug information in the compiler.
• Intel® Compiler currently doesn't support function split ranges in debug info, which may lead to wrong performance data attribution in case function ranges overlap (e.g. performance data attributed to one function, but should have been split between two). (200260768)
  o In some cases the Intel® Compiler generates imprecise debug information about ranges of inline functions. This may lead to wrong performance data attribution when the Inline mode is turned on, for example: instead of two functions, performance data is attributed to just one of them.

7 Attributions

Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/

TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION

1. Definitions.

"License" shall mean the terms and conditions for use, reproduction, and distribution as defined by Sections 1 through 9 of this document.

"Licensor" shall mean the copyright owner or entity authorized by the copyright owner that is granting the License.

"Legal Entity" shall mean the union of the acting entity and all other entities that control, are controlled by, or are under common control with that entity. For the purposes of this definition, "control" means (i) the power, direct or indirect, to cause the direction or management of such entity, whether by contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the outstanding shares, or (iii) beneficial ownership of such entity.

"You" (or "Your") shall mean an individual or Legal Entity exercising permissions granted by this License.

"Source" form shall mean the preferred form for making modifications, including but not limited to software source code, documentation source, and configuration files.

"Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, and conversions to other media types.

"Work" shall mean the work of authorship, whether in Source or Object form, made available under the License, as indicated by a copyright notice that is included in or attached to the work (an example is provided in the Appendix below).

"Derivative Works" shall mean any work, whether in Source or Object form, that is based on (or derived from) the Work and for which the editorial revisions, annotations, elaborations, or other modifications represent, as a whole, an original work of authorship. For the purposes of this License, Derivative Works shall not include works that remain separable from, or merely link (or bind by name) to the interfaces of, the Work and Derivative Works thereof.

"Contribution" shall mean any work of authorship, including the original version of the Work and any modifications or additions to that Work or Derivative Works thereof, that is intentionally submitted to Licensor for inclusion in the Work by the copyright owner or by an individual or Legal Entity authorized to submit on behalf of the copyright owner. For the purposes of this definition, "submitted" means any form of electronic, verbal, or written communication sent to the Licensor or its representatives, including but not limited to communication on electronic mailing lists, source code control systems, and issue tracking systems that are managed by, or on behalf of, the Licensor for the purpose of discussing and improving the Work, but excluding communication that is conspicuously marked or otherwise designated in writing by the copyright owner as "Not a Contribution."

"Contributor" shall mean Licensor and any individual or Legal Entity on behalf of whom a Contribution has been received by Licensor and subsequently incorporated within the Work.

2. Grant of Copyright License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare Derivative Works of, publicly display, publicly perform, sublicense, and distribute the Work and such Derivative Works in Source or Object form.

3. Grant of Patent License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Work, where such license applies only to those patent claims licensable by such Contributor that are necessarily infringed by their Contribution(s) alone or by combination of their Contribution(s) with the Work to which such Contribution(s) was submitted. If You institute patent litigation against any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Work or a Contribution incorporated within the Work constitutes direct or contributory patent infringement, then any patent licenses granted to You under this License for that Work shall terminate as of the date such litigation is filed.

4. Redistribution.
You may reproduce and distribute copies of the Work or Derivative Works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions: (a) You must give any other recipients of the Work or Derivative Works a copy of this License; and (b) You must cause any modified files to carry prominent notices stating that You changed the files; and (c) You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark, and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part of the Derivative Works; and (d) If the Work includes a "NOTICE" text file as part of its distribution, then any Derivative Works that You distribute must include a readable copy of the attribution notices contained within such NOTICE file, excluding those notices that do not pertain to any part of the Derivative Works, in at least one of the following places: within a NOTICE text file distributed as part of the Derivative Works; within the Source form or documentation, if provided along with the Derivative Works; or, within a display generated by the Derivative Works, if and wherever such third-party notices normally appear. The contents of the NOTICE file are for informational purposes only and do not modify the License. You may add Your own attribution notices within Derivative Works that You distribute, alongside or as an addendum to the NOTICE text from the Work, provided that such additional attribution notices cannot be construed as modifying the License. 
You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such Derivative Works as a whole, provided Your use, reproduction, and distribution of the Work otherwise complies with the conditions stated in this License.

5. Submission of Contributions. Unless You explicitly state otherwise, any Contribution intentionally submitted for inclusion in the Work by You to the Licensor shall be under the terms and conditions of this License, without any additional terms or conditions. Notwithstanding the above, nothing herein shall supersede or modify the terms of any separate license agreement you may have executed with Licensor regarding such Contributions.

6. Trademarks. This License does not grant permission to use the trade names, trademarks, service marks, or product names of the Licensor, except as required for reasonable and customary use in describing the origin of the Work and reproducing the content of the NOTICE file.

7. Disclaimer of Warranty. Unless required by applicable law or agreed to in writing, Licensor provides the Work (and each Contributor provides its Contributions) on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely responsible for determining the appropriateness of using or redistributing the Work and assume any risks associated with Your exercise of permissions under this License.

8. Limitation of Liability.
In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, unless required by applicable law (such as deliberate and grossly negligent acts) or agreed to in writing, shall any Contributor be liable to You for damages, including any direct, indirect, special, incidental, or consequential damages of any character arising as a result of this License or out of the use or inability to use the Work (including but not limited to damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even if such Contributor has been advised of the possibility of such damages. 9. Accepting Warranty or Additional Liability. While redistributing the Work or Derivative Works thereof, You may choose to offer, and charge a fee for, acceptance of support, warranty, indemnity, or other liability obligations and/or rights consistent with this License. However, in accepting such obligations, You may act only on Your own behalf and on Your sole responsibility, not on behalf of any other Contributor, and only if You agree to indemnify, defend, and hold each Contributor harmless for any liability incurred by, or claims asserted against, such Contributor by reason of your accepting any such warranty or additional liability. 
END OF TERMS AND CONDITIONS

Boost Software License - Version 1.0 - August 17th, 2003

Permission is hereby granted, free of charge, to any person or organization obtaining a copy of the software and accompanying documentation covered by this license (the "Software") to use, reproduce, display, distribute, execute, and transmit the Software, and to prepare derivative works of the Software, and to permit third-parties to whom the Software is furnished to do so, all subject to the following:

The copyright notices in the Software and this entire statement, including the above license grant, this restriction and the following disclaimer, must be included in all copies of the Software, in whole or in part, and all derivative works of the Software, unless such copies or derivative works are solely in the form of machine-executable object code generated by a source language processor.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, TITLE AND NON-INFRINGEMENT. IN NO EVENT SHALL THE COPYRIGHT HOLDERS OR ANYONE DISTRIBUTING THE SOFTWARE BE LIABLE FOR ANY DAMAGES OR OTHER LIABILITY, WHETHER IN CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Libxml2

Except where otherwise noted in the source code (e.g. the files hash.c, list.c and the trio files, which are covered by a similar license but with different Copyright notices) all the files are: Copyright (C) 1998-2003 Daniel Veillard. All Rights Reserved.
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE DANIEL VEILLARD BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Except as contained in this notice, the name of Daniel Veillard shall not be used in advertising or otherwise to promote the sale, use or other dealings in this Software without prior written authorization from him.

Libunwind

Copyright (c) 2002 Hewlett-Packard Co.

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. Except where otherwise noted in the source code (e.g. the files hash.c, list.c and the trio files, which are covered by a similar licence but with different Copyright notices) all the files are: Copyright (C) 1998-2003 Daniel Veillard. All Rights Reserved. Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE DANIEL VEILLARD BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. Except as contained in this notice, the name of Daniel Veillard shall not be used in advertising or otherwise to promote the sale, use or other dealings in this Software without prior written authorization from him. 
PYTHON SOFTWARE FOUNDATION LICENSE VERSION 2
1. This LICENSE AGREEMENT is between the Python Software Foundation ("PSF"), and the Individual or Organization ("Licensee") accessing and otherwise using this software ("Python") in source or binary form and its associated documentation.
2. Subject to the terms and conditions of this License Agreement, PSF hereby grants Licensee a nonexclusive, royalty-free, world-wide license to reproduce, analyze, test, perform and/or display publicly, prepare derivative works, distribute, and otherwise use Python alone or in any derivative version, provided, however, that PSF's License Agreement and PSF's notice of copyright, i.e., "Copyright (c) 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008 Python Software Foundation; All Rights Reserved" are retained in Python alone or in any derivative version prepared by Licensee.
3. In the event Licensee prepares a derivative work that is based on or incorporates Python or any part thereof, and wants to make the derivative work available to others as provided herein, then Licensee hereby agrees to include in any such work a brief summary of the changes made to Python.
4. PSF is making Python available to Licensee on an "AS IS" basis. PSF MAKES NO REPRESENTATIONS OR WARRANTIES, EXPRESS OR IMPLIED. BY WAY OF EXAMPLE, BUT NOT LIMITATION, PSF MAKES NO AND DISCLAIMS ANY REPRESENTATION OR WARRANTY OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE OR THAT THE USE OF PYTHON WILL NOT INFRINGE ANY THIRD PARTY RIGHTS.
5. PSF SHALL NOT BE LIABLE TO LICENSEE OR ANY OTHER USERS OF PYTHON FOR ANY INCIDENTAL, SPECIAL, OR CONSEQUENTIAL DAMAGES OR LOSS AS A RESULT OF MODIFYING, DISTRIBUTING, OR OTHERWISE USING PYTHON, OR ANY DERIVATIVE THEREOF, EVEN IF ADVISED OF THE POSSIBILITY THEREOF.
6. This License Agreement will automatically terminate upon a material breach of its terms and conditions.
7.
Nothing in this License Agreement shall be deemed to create any relationship of agency, partnership, or joint venture between PSF and Licensee. This License Agreement does not grant permission to use PSF trademarks or trade name in a trademark sense to endorse or promote products or services of Licensee, or any third party.
8. By copying, installing or otherwise using Python, Licensee agrees to be bound by the terms and conditions of this License Agreement.

wxWidgets Library
This product includes wxWindows software which can be downloaded from www.wxwidgets.org/downloads.

wxWindows Library Licence, Version 3.1
======================================
Copyright (C) 1998-2005 Julian Smart, Robert Roebling et al
Everyone is permitted to copy and distribute verbatim copies of this licence document, but changing it is not allowed.

WXWINDOWS LIBRARY LICENCE
TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION
This library is free software; you can redistribute it and/or modify it under the terms of the GNU Library General Public Licence as published by the Free Software Foundation; either version 2 of the Licence, or (at your option) any later version. This library is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Library General Public Licence for more details. You should have received a copy of the GNU Library General Public Licence along with this software, usually in a file named COPYING.LIB. If not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA.

EXCEPTION NOTICE
1.
As a special exception, the copyright holders of this library give permission for additional uses of the text contained in this release of the library as licenced under the wxWindows Library Licence, applying either version 3.1 of the Licence, or (at your option) any later version of the Licence as published by the copyright holders of version 3.1 of the Licence document.
2. The exception is that you may use, copy, link, modify and distribute under your own terms, binary object code versions of works based on the Library.
3. If you copy code from files distributed under the terms of the GNU General Public Licence or the GNU Library General Public Licence into a copy of this library, as this licence permits, the exception does not apply to the code that you add in this way. To avoid misleading anyone as to the status of such modified files, you must delete this exception notice from such code and/or adjust the licensing conditions notice accordingly.
4. If you write modifications of your own for this library, it is your choice whether to permit this exception to apply to your modifications. If you do not wish that, you must delete the exception notice from such code and/or adjust the licensing conditions notice accordingly.

/* zlib.h -- interface of the 'zlib' general purpose compression library version 1.2.3, July 18th, 2005
Copyright (C) 1995-2005 Jean-loup Gailly and Mark Adler
This software is provided 'as-is', without any express or implied warranty. In no event will the authors be held liable for any damages arising from the use of this software. Permission is granted to anyone to use this software for any purpose, including commercial applications, and to alter it and redistribute it freely, subject to the following restrictions:
1. The origin of this software must not be misrepresented; you must not claim that you wrote the original software.
If you use this software in a product, an acknowledgment in the product documentation would be appreciated but is not required. 2. Altered source versions must be plainly marked as such, and must not be misrepresented as being the original software. 3. This notice may not be removed or altered from any source distribution. Jean-loup Gailly jloup@gzip.org Mark Adler madler@alumni.caltech.edu */ 8 Disclaimer and Legal Information INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR. Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information. The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. 
Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order. Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm.
Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. See http://www.intel.com/products/processor_number for details.
This document contains information on products in the design phase of development.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance.
BunnyPeople, Celeron, Celeron Inside, Centrino, Centrino Inside, Core Inside, i960, Intel, the Intel logo, Intel Atom, Intel Atom Inside, Intel Core, Intel Inside, Intel Inside logo, Intel NetBurst, Intel NetMerge, Intel NetStructure, Intel SingleDriver, Intel SpeedStep, Intel Sponsors of Tomorrow., the Intel Sponsors of Tomorrow. logo, Intel StrataFlash, Intel Viiv, Intel vPro, Intel XScale, InTru, the InTru logo, InTru soundmark, Itanium, Itanium Inside, MCS, MMX, Moblin, Pentium, Pentium Inside, skoool, the skoool logo, Sound Mark, The Journey Inside, vPro Inside, VTune, Xeon, and Xeon Inside are trademarks of Intel Corporation in the U.S. and other countries.
* Other names and brands may be claimed as the property of others.
Microsoft, Windows, Visual Studio, Visual C++, and the Windows logo are trademarks, or registered trademarks of Microsoft Corporation in the United States and/or other countries.
Java and all Java based trademarks and logos are trademarks or registered trademarks of Sun Microsystems, Inc. in the U.S. and other countries.
Copyright (C) 2010-2011, Intel Corporation. All rights reserved.

Intel® VTune™ Amplifier XE 2011 Release Notes for Windows* OS
Installation Guide and Release Notes
Document number: 323401-001US
2 November 2011

Contents:
Introduction
What’s New
System Requirements
Technical Support
Installation Notes
Issues and Limitations
Attributions
Disclaimer and Legal Information

1 Introduction
The Intel® VTune™ Amplifier XE 2011 provides an integrated performance analysis and tuning environment with graphical user interface that helps you analyze code performance on systems with IA-32 or Intel® 64 architectures. This document provides system requirements, installation instructions, issues and limitations, and legal information. The Intel® VTune™ Amplifier XE 2011 has a standalone graphical user interface (GUI) as well as a command-line interface (CLI). To learn more about this product’s documentation, help, and samples, see the Intel® VTune™ Amplifier XE 2011 Documentation item in the Start menu program folder.

2 What’s New
The Intel® VTune™ Amplifier XE 2011 Update6 adds:
• Intel® Atom™ processors (code name Saltwell and Cedarview) support, including hardware event-based sampling analysis types and metrics for advanced tuning
• Bandwidth analysis for the 32nm Intel® processors code name Eagleton
• Inline functions support (controlled by a filter bar mode)
• “Tiny” threads timeline mode
• Bug fixes
The Intel® VTune™ Amplifier XE 2011 Update5 adds:
• Project Explorer
•
Bandwidth Analysis for the 2nd Generation Intel® Core™ Processor Family (codenamed Sandy Bridge)
• Advanced options for analyzing child processes
• Command line reports with stacks
• Support for analysis of MPI programs that use the Intel® MPI library
• Usability improvements
The Intel® VTune™ Amplifier XE 2011 Update4:
• Update 3 sometimes presented CPU Time incorrectly in the thread timeline for Hotspots and Concurrency analysis types. Different scales were used for different threads, so low CPU Time in one thread could appear at the same height in the chart as high CPU Time in another thread. The values presented in the tooltip when hovering over the chart were still correct. Update 4 resolves this problem completely.
The Intel® VTune™ Amplifier XE 2011 Update3:
• 32nm Westmere Family of Processors (codenamed Westmere-EX) support
• Pre-defined analysis for Intel® Atom™ Processor
• Comparison mode in Summary pane
The Intel® VTune™ Amplifier XE 2011 Update2:
• The 2nd Generation Intel® Core™ Processor Family (codenamed Sandy Bridge) support including EBS based analysis types and metrics for advanced tuning
• Automatic highlighting and expansion in Bottom-Up and Top-Down panes
• Tooltips for metrics description in grid panes
• Ability to import tb5/6 files from GUI
• JIT API support for Hotspots, Concurrency, and Locks and Waits analysis types
• Overhead time metric calculation for native threading synchronization
The Intel® VTune™ Amplifier XE 2011 Update1:
• Data export to CSV file format
• Source / assembly toggling button
• Several bugs were fixed.

3 System Requirements
For an explanation of architecture names, see http://software.intel.com/en-us/articles/intel-architecture-platform-terminology/
Processor requirements
•
For general operations with user interface and all data collection except Hardware event-based sampling analysis
  o A PC based on an IA-32 or Intel® 64 architecture processor supporting the Intel® Streaming SIMD Extensions 2 (Intel® SSE2) instructions (Intel® Pentium 4 processor or later, or compatible non-Intel processor).
  o For the best experience, a multi-core or multi-processor system is recommended.
  o Because Intel® VTune™ Amplifier XE requires specific knowledge of assembly-level instructions, its analysis may not operate correctly if a program contains non-Intel® instructions. In this case, run the analysis with a target executable that contains only Intel instructions. After you finish using VTune™ Amplifier XE you can use the assembler code or optimizing compiler options that provide the non-Intel instructions.
• For Hardware event-based sampling analysis (EBS)
  o EBS analysis makes use of the on-chip Performance Monitoring Unit (PMU) and requires a genuine Intel processor for collection. EBS analysis is supported on Intel® Pentium® M, Intel® Core™ microarchitecture and newer processors (for more precise details, see the list below).
  o EBS analysis is not supported on the Intel® Pentium 4 processor family (Intel® NetBurst® MicroArchitecture) and non-Intel processors.
  o However, the results collected with EBS can be analyzed using any system meeting the less restrictive general operation requirements.
  o EBS analysis requires a non-virtual machine to ensure access to the on-chip PMU. EBS is not supported within a virtual machine environment.
• The list of supported processors is constantly being extended.
Here is a partial list of processors where the EBS analysis is enabled:
Mobile processors
  Intel® Atom™ Processor
  Intel® Core™ i7 Mobile Processor Extreme Edition
  Intel® Core™ i7, i5, i3 Mobile Processors
  Intel® Core™2 Extreme Mobile Processor
  Intel® Core™2 Quad Mobile Processor
  Intel® Core™2 Duo Mobile Processor
  Intel® Core™ Duo Processor
  Intel® Core™ Solo Processor
  Intel® Pentium® Mobile Processor
Desktop processors
  Intel® Atom™ Processor
  Intel® Core™ i7 Desktop Processor Extreme Edition
  Intel® Core™ i7, i5, i3 Desktop Processors
  Intel® Core™2 Quad Desktop Processor
  Intel® Core™2 Extreme Desktop Processor
  Intel® Core™2 Duo Desktop Processor
Server and workstation processors
  Intel® Xeon® processors E7-8800/4800/2800 family
  Intel® Xeon® processors E3-1200 family
  Intel® Xeon® processors 65xx/75xx series
  Intel® Xeon® processors 36xx/56xx series
  Intel® Xeon® processors 35xx/55xx series
  Intel® Xeon® processors 34xx series
  Quad-Core Intel® Xeon® processors 7xxx, 5xxx, and 3xxx series
  Dual-Core Intel® Xeon® processors 7xxx, 5xxx, and 3xxx series
System Memory Requirements
• At least 2 GB of RAM
Disk Space Requirements
• 650 MB free disk space required for all product features and all architectures
Software Requirements
• Supported operating systems:
  o Microsoft* Windows XP* SP2 and SP3
  o Microsoft* Windows XP Professional x64 Edition SP1 and SP2
  o Microsoft* Windows Vista* (Ultimate)
  o Microsoft* Windows 7* SP1
  o Microsoft* Windows Server 2008*
  o Embedded editions not supported
NOTE: In a future major release of this product, support for installation and use on Microsoft Windows Vista will be removed.
• All of the operating systems listed above are supported. For your information, VTune™ Amplifier XE was qualified on the systems listed below:
  o Microsoft* Windows XP* SP2 and SP3
  o Microsoft* Windows Vista* (Ultimate) SP1 and SP2
  o Microsoft* Windows Server 2008* and SP2
  o Microsoft* Windows Server 2008* R2
  o Microsoft* Windows 7* and SP1
•
Supported compilers:
  o Intel® C/C++ Compiler 11 and higher
  o Intel® Fortran Compiler 11 and higher
  o Intel Parallel Composer
  o Microsoft* Visual Studio* C/C++ Compiler
• Supported Microsoft Visual Studio versions:
  o Microsoft* Visual Studio* 2005
  o Microsoft* Visual Studio* 2008
  o Microsoft* Visual Studio* 2010 and SP1
NOTE: In a future major release of this product, support for installation and use with Microsoft Visual Studio 2005 will be removed. Intel recommends that customers migrate to Microsoft Visual Studio 2010* at their earliest convenience.
• Application coding requirements
  o Supported programming languages:
    - Fortran
    - C
    - C++
    - C# (only .NET versions 4.0 and below are supported)
  o Concurrency and Locks and Waits analysis types interpret the use of constructs from the following threading methodologies:
    - Intel® Threading Building Blocks
    - Win32* Threads on Windows*
    - OpenMP*
    - Intel's C/C++ Parallel Language Extensions
• To view PDF documents, use a PDF reader, such as Adobe Reader*.

4 Technical Support
If you did not register your product during installation, please do so at the Intel® Software Development Products Registration Center. Registration entitles you to free technical support, product updates and upgrades for the duration of the support term.
For information about how to find Technical Support, Product Updates, User Forums, FAQs, tips and tricks, and other support information, please visit http://www.intel.com/software/products/support/
Note: If your distributor provides technical support for this product, please contact them for support rather than Intel.

5 Installation Notes
If you are installing the product for the first time, please be sure to have the product serial number available so you can type it in during installation. A valid license is required for installation and use. The installation of VTune™ Amplifier XE removes any earlier installed version of VTune™ Amplifier XE.
The product is a self-extracting executable archive with one IA-32 package you can install on either a 32-bit or 64-bit system. To begin installation, double-click the VTune_Amplifier_XE_2011_update6_setup.exe file as a user with Administrative privileges. This installs the full package (includes the GUI front-end for using the VTune™ Amplifier XE as well as Microsoft* Visual Studio integration). Activation is required.

Installing Collectors on Remote Systems
You can install the command line data collection features of the product on remote systems to reduce overhead and simply collect data remotely. Data collection on a remote system does not require a license; however, viewing of the data cannot be done on the remote system unless a license is present. The results of any data collection that is run on the remote system must then be copied to the system where the regular install was done for analysis, viewing, and reporting. To do this:
1. Unpack the product web image manually using the command:
   VTune_Amplifier_XE_2011_update6_setup.exe --extract-only --silent --extract-folder C:\temp\AmplXE_update6_unpacked
   Use any convenient path for the --extract-folder option. If the --extract-folder option is omitted, the default location for the extracted image is "C:\Program Files (x86)\Intel\Download\VTune_Amplifier_XE_2011_update6_setup" for a 64-bit OS and "C:\Program Files\Intel\Download\VTune_Amplifier_XE_2011_update6_setup" for a 32-bit OS.
2. Copy the folder containing the installation files for the collectors and command line tools to the remote machine. With the example shown above, the location of this folder would be C:\temp\AmplXE_update6_unpacked\Installs\ps_he_cli.*
3. Run the Amplifier_XE.msi with Administrative privileges and follow the instructions. No activation will be required.
4. On a 64-bit remote machine, from the VTune™ Amplifier XE installation location, run and install msvcrt_x86.msi and msvcrt_x64.msi (requires Administrative privileges).
5. On a 32-bit remote machine, from the VTune™ Amplifier XE installation location, run and install msvcrt_x86.msi (requires Administrative privileges).

Default Installation Folders
The default top-level installation folder for this product is:
• C:\Program Files\Intel\VTune Amplifier XE 2011\
If you are installing on a system with a non-English language version of Windows, the name of the Program Files folder may be different. On Intel® 64 architecture systems, the folder name is Program Files (X86) or the equivalent.
This product installs into an arrangement of folders shown in the diagram below. Not all folders will be present in a given installation.
• C:\Program Files\Intel\Amplifier XE 2011\
  o bin32
  o bin64*
  o config
  o documentation
  o include
  o lib32
  o lib64*
  o message
  o resources
  o sepdk
  o samples
(*) bin64 and lib64 are available for Intel® 64 architecture install package

How to activate your evaluation software after purchasing
Users of evaluation versions of Intel Developer Products have a new tool that allows converting evaluation-licensed products to fully licensed products once the product is purchased and a serial number is obtained. The “Activation Tool” is a utility that allows users of evaluation products to enter a valid product Serial Number to convert the product to fully licensed status. Please click Start > All Programs > Intel Parallel Studio XE 2011 > Product Activation, supply a valid product serial number, and click Activate to convert your evaluation software to a fully licensed product.

Changing, Updating and Removing the Product
If you want to add or remove components from an installation, open the Control Panel and select the Add or Remove Programs applet, select “Intel® VTune™ Amplifier XE 2011” and click Change.
To remove the product, select Remove instead of Change. When installing an updated version of the product, you do not need to remove the older version. The installation program removes the old version automatically.
Note: If the SEP driver uninstallation failed during the normal uninstall process, open a Command Prompt window and execute the following commands with Administrative privileges to manually remove the SEP driver from the system:
   cd %windir%\system32\drivers
   dir sep*.sys
   rem unload SEP3 driver from kernel
   net stop sep3_4
   rem delete SEP3 driver from filesystem
   del sep3_4.sys
   rem unload PAX driver from kernel
   net stop sepdal
   rem delete PAX driver from filesystem
   del sepdal.sys

6 Issues and Limitations
Known Issues and Limitations
• Running time is attributed to the next instruction (200108041)
  o To collect the data about time-consuming running regions of the target, the VTune™ Amplifier XE interrupts executing target threads and attributes the time to the context IP address.
  o Due to the collection mechanism, the captured IP address points to the instruction that occurs AFTER the one that is actually consuming most of the time. This leads to the running time being attributed to the next instruction (or, rarely, to one of the subsequent instructions) in the Assembly view. In rare cases, this can also lead to wrong attribution of running time in the source: the time may be erroneously attributed to the source line AFTER the actual hot line.
  o If the inline mode is ON and the program has small functions inlined at the hotspots, this can cause the running time to be attributed to the wrong function, since the next instruction can belong to a different function in tightly inlined code.
• Incorrect timing results when running on a 32-bit virtual machine (200137061)
  o Intel® Amplifier may fail to collect correct timing data when running on a virtual machine with problematic virtualization of time stamp counters.
In this case Amplifier displays a warning message:
  o “Warning: Cannot load data file '.trace' (syncAcquiredHandler: timestamps aren't ascended!)”
• An application which allocates massive chunks of memory may fail to work under Amplifier (200083850)
  o If a 32-bit application allocates massive chunks of memory (close to 2 GB) in the heap, it may fail to launch under Amplifier while running fine on its own. This happens because Amplifier requires additional memory in the profiled application process for doing the analysis. A possible workaround is to use a larger address space (e.g. converting the project to 64-bit).
• SEP may crash certain NHM systems when deep sleep states are enabled (200149603)
  o On some Intel® Core™ i7 processor-based systems with C-states enabled, sampling may cause the system to hang due to a known hardware issue (see errata AAJ134 in http://download.intel.com/design/processor/specupdt/320836.pdf). To avoid this, disable the “Cn(ACPI Cn) report to OS” BIOS option before sampling with the VTune Amplifier XE analyzer on Intel Core™ i7 processor-based systems.
• Link to instruction guide: instruction set reference document is not positioned on description of proper instruction. (200091200)
  o The reference information for assembly instructions can be opened in any PDF viewer, but only Adobe Acrobat Reader* supports positioning the instruction reference document on the required page. To ensure correct functionality of this feature, it is recommended that you install the latest available version of Adobe Acrobat Reader.
• Uninstalling limitation: pin.exe stays running after detaching. (200092295)
  o After attaching to a target to be profiled, the VTune™ Amplifier XE cannot be uninstalled until the target finishes running. The cause is that pin.exe keeps working after detaching from the target and exits only after the profiled application/process execution finishes.
•
Second attach to the same application should print an error and exit immediately. (200092650)
  o The VTune™ Amplifier XE allows running the analysis while the previous one is in progress but does not store any data for the second analysis run.
• Specifying too low a "Sampling After Value" for some events may cause a system hang due to frequent event triggering during the collection (200093394)
  o Use a reasonable "Sampling After Value" that results in about 1000 events triggering per second. This is statistically sufficient for the data analysis. For more fine-grained analysis of sampling results, decrease the "Sampling After Value" gradually while observing the slowdown in system responsiveness due to frequent interruptions.
• Event-based sampling collection cannot start if the result directory path contains non-English characters (200185851)
  o When you install the product on a system with language localization, make sure the path to the result directory does not contain non-English characters.
• Truncated .NET module names may be displayed in results view (200199458)
  o When viewing results collected for a .NET application you may observe truncated .NET module names. Please make sure the system was rebooted after the .NET application was installed and before profiling with Amplifier XE.
• VTune™ Amplifier XE may crash on the analysis of OpenMP enabled binaries compiled with a certain version of Intel Compiler (200199671)
  o On Windows 7 64-bit based systems the Hotspots, Concurrency or Locks and Waits analysis may crash during the analysis of 32-bit binaries compiled with the Intel Compiler v.12.0 (also included in the Composer XE 2011 Update1) with OpenMP enabled. Applications that use 32-bit Intel IPP or MKL libraries and are re-compiled with the 12.0 compiler may be affected as well.
• Intel® Compiler only produces first level of inlines. The nested inlines are not emitted into the debug information.
(200164310)
  o Intel® Compiler currently generates debug information only for the first level of inline functions. So, you cannot see performance data attributed to functions inlined into other inline functions. Instead, this performance data is attributed to the corresponding functions inlined into regular (not inline) functions. This may also cause wrong source line attribution of performance data in the source view.
• VTune™ Amplifier XE does not resolve symbols correctly on Windows XP SP1 operating system (200216358)
  o When VTune™ Amplifier XE is run on the Windows XP Service Pack 1 operating system, symbols may not be resolved correctly and are instead shown as "[foo.dll]" names. This happens because VTune™ Amplifier XE uses a Microsoft DIA library version that requires Service Pack 2 to be installed. Please install the service pack to resolve the issue.
• Information collected via ITT API is not available when attaching to a process. (200172007)
  o When collecting statistics data using ITT API calls inserted into the source code, as in Frame Analysis or JIT-profiling, attaching to a process will not bring the expected results. Use the VTune Amplifier XE analysis to start the application instead of attaching to a process.
• Do not use -ipo option - it causes the inline debug information to switch off (200260765)
  o If using the Intel® compiler to get performance data on inline functions, use the additional option “/debug:inline-debug-info”, but avoid using the -ipo (/Qipo on Windows) option. Currently this option disables generating the inline debug information in the compiler. Note that the Intel compiler integrated into the Microsoft Visual Studio* IDE uses /Qipo by default in the Release configuration.
• Intel® Compiler currently doesn't support function split ranges in debug info which may lead to wrong performance data attribution in case function ranges are overlapped (e.g.
performance data attributed to one function, but should have been split by two). (200260768)
  o In some cases the Intel® Compiler generates imprecise debug information about ranges of inline functions. This may lead to wrong performance data attribution when the Inline mode is turned on, for example: instead of two functions performance data is attributed just to one of them.

7 Attributions
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction, and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all other entities that control, are controlled by, or are under common control with that entity. For the purposes of this definition, "control" means (i) the power, direct or indirect, to cause the direction or management of such entity, whether by contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications, including but not limited to software source code, documentation source, and configuration files.
"Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or Object form, made available under the License, as indicated by a copyright notice that is included in or attached to the work (an example is provided in the Appendix below). "Derivative Works" shall mean any work, whether in Source or Object form, that is based on (or derived from) the Work and for which the editorial revisions, annotations, elaborations, or other modifications represent, as a whole, an original work of authorship. For the purposes of this License, Derivative Works shall not include works that remain separable from, or merely link (or bind by name) to the interfaces of, the Work and Derivative Works thereof. "Contribution" shall mean any work of authorship, including the original version of the Work and any modifications or additions to that Work or Derivative Works thereof, that is intentionally submitted to Licensor for inclusion in the Work by the copyright owner or by an individual or Legal Entity authorized to submit on behalf of the copyright owner. For the purposes of this definition, "submitted" means any form of electronic, verbal, or written communication sent to the Licensor or its representatives, including but not limited to communication on electronic mailing lists, source code control systems, and issue tracking systems that are managed by, or on behalf of, the Licensor for the purpose of discussing and improving the Work, but excluding communication that is conspicuously marked or otherwise designated in writing by the copyright owner as "Not a Contribution." "Contributor" shall mean Licensor and any individual or Legal Entity on behalf of whom a Contribution has been received by Licensor and subsequently incorporated within the Work.Intel® VTune™ Amplifier XE 2011 Release Notes 14 2. Grant of Copyright License. 
Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare Derivative Works of, publicly display, publicly perform, sublicense, and distribute the Work and such Derivative Works in Source or Object form. 3. Grant of Patent License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Work, where such license applies only to those patent claims licensable by such Contributor that are necessarily infringed by their Contribution(s) alone or by combination of their Contribution(s) with the Work to which such Contribution(s) was submitted. If You institute patent litigation against any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Work or a Contribution incorporated within the Work constitutes direct or contributory patent infringement, then any patent licenses granted to You under this License for that Work shall terminate as of the date such litigation is filed. 4. Redistribution. 
You may reproduce and distribute copies of the Work or Derivative Works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions: (a) You must give any other recipients of the Work or Derivative Works a copy of this License; and (b) You must cause any modified files to carry prominent notices stating that You changed the files; and (c) You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark, and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part of the Derivative Works; and (d) If the Work includes a "NOTICE" text file as part of its distribution, then any Derivative Works that You distribute must include a readable copy of the attribution notices contained within such NOTICE file, excluding those notices that do not pertain to any part of the Derivative Works, in at least one of the following places: within a NOTICE text file distributed as part of the Derivative Works; within the Source form or documentation, if provided along with the Derivative Works; or, within a display generated by the Derivative Works, if and wherever such third-party notices normally appear. The contents of the NOTICE file are for informational purposes only and do not modify the License. You may add Your own attribution notices within Derivative Works that You distribute, alongside or as an addendum to the NOTICE text from the Work, provided that such additional attribution notices cannot be construed as modifying the License.
You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such Derivative Works as a whole, provided Your use, reproduction, and distribution of the Work otherwise complies with the conditions stated in this License. 5. Submission of Contributions. Unless You explicitly state otherwise, any Contribution intentionally submitted for inclusion in the Work by You to the Licensor shall be under the terms and conditions of this License, without any additional terms or conditions. Notwithstanding the above, nothing herein shall supersede or modify the terms of any separate license agreement you may have executed with Licensor regarding such Contributions. 6. Trademarks. This License does not grant permission to use the trade names, trademarks, service marks, or product names of the Licensor, except as required for reasonable and customary use in describing the origin of the Work and reproducing the content of the NOTICE file. 7. Disclaimer of Warranty. Unless required by applicable law or agreed to in writing, Licensor provides the Work (and each Contributor provides its Contributions) on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely responsible for determining the appropriateness of using or redistributing the Work and assume any risks associated with Your exercise of permissions under this License. 8. Limitation of Liability. 
In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, unless required by applicable law (such as deliberate and grossly negligent acts) or agreed to in writing, shall any Contributor be liable to You for damages, including any direct, indirect, special, incidental, or consequential damages of any character arising as a result of this License or out of the use or inability to use the Work (including but not limited to damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even if such Contributor has been advised of the possibility of such damages. 9. Accepting Warranty or Additional Liability. While redistributing the Work or Derivative Works thereof, You may choose to offer, and charge a fee for, acceptance of support, warranty, indemnity, or other liability obligations and/or rights consistent with this License. However, in accepting such obligations, You may act only on Your own behalf and on Your sole responsibility, not on behalf of any other Contributor, and only if You agree to indemnify, defend, and hold each Contributor harmless for any liability incurred by, or claims asserted against, such Contributor by reason of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
Boost Software License - Version 1.0 - August 17th, 2003
Permission is hereby granted, free of charge, to any person or organization obtaining a copy of the software and accompanying documentation covered by this license (the "Software") to use, reproduce, display, distribute, execute, and transmit the Software, and to prepare derivative works of the Software, and to permit third-parties to whom the Software is furnished to do so, all subject to the following: The copyright notices in the Software and this entire statement, including the above license grant, this restriction and the
following disclaimer, must be included in all copies of the Software, in whole or in part, and all derivative works of the Software, unless such copies or derivative works are solely in the form of machine-executable object code generated by a source language processor. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, TITLE AND NON-INFRINGEMENT. IN NO EVENT SHALL THE COPYRIGHT HOLDERS OR ANYONE DISTRIBUTING THE SOFTWARE BE LIABLE FOR ANY DAMAGES OR OTHER LIABILITY, WHETHER IN CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Libxml2
Except where otherwise noted in the source code (e.g. the files hash.c, list.c and the trio files, which are covered by a similar license but with different Copyright notices) all the files are: Copyright (C) 1998-2003 Daniel Veillard. All Rights Reserved. Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE DANIEL VEILLARD BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. Except as contained in this notice, the name of Daniel Veillard shall not be used in advertising or otherwise to promote the sale, use or other dealings in this Software without prior written authorization from him.
Libunwind
Copyright (c) 2002 Hewlett-Packard Co. Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. Except where otherwise noted in the source code (e.g. the files hash.c, list.c and the trio files, which are covered by a similar licence but with different Copyright notices) all the files are: Copyright (C) 1998-2003 Daniel Veillard. All Rights Reserved.
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE DANIEL VEILLARD BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. Except as contained in this notice, the name of Daniel Veillard shall not be used in advertising or otherwise to promote the sale, use or other dealings in this Software without prior written authorization from him.
PYTHON SOFTWARE FOUNDATION LICENSE VERSION 2
1. This LICENSE AGREEMENT is between the Python Software Foundation ("PSF"), and the Individual or Organization ("Licensee") accessing and otherwise using this software ("Python") in source or binary form and its associated documentation. 2.
Subject to the terms and conditions of this License Agreement, PSF hereby grants Licensee a nonexclusive, royalty-free, world-wide license to reproduce, analyze, test, perform and/or display publicly, prepare derivative works, distribute, and otherwise use Python alone or in any derivative version, provided, however, that PSF's License Agreement and PSF's notice of copyright, i.e., "Copyright (c) 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008 Python Software Foundation; All Rights Reserved" are retained in Python alone or in any derivative version prepared by Licensee. 3. In the event Licensee prepares a derivative work that is based on or incorporates Python or any part thereof, and wants to make the derivative work available to others as provided herein, then Licensee hereby agrees to include in any such work a brief summary of the changes made to Python. 4. PSF is making Python available to Licensee on an "AS IS" basis. PSF MAKES NO REPRESENTATIONS OR WARRANTIES, EXPRESS OR IMPLIED. BY WAY OF EXAMPLE, BUT NOT LIMITATION, PSF MAKES NO AND DISCLAIMS ANY REPRESENTATION OR WARRANTY OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE OR THAT THE USE OF PYTHON WILL NOT INFRINGE ANY THIRD PARTY RIGHTS. 5. PSF SHALL NOT BE LIABLE TO LICENSEE OR ANY OTHER USERS OF PYTHON FOR ANY INCIDENTAL, SPECIAL, OR CONSEQUENTIAL DAMAGES OR LOSS AS A RESULT OF MODIFYING, DISTRIBUTING, OR OTHERWISE USING PYTHON, OR ANY DERIVATIVE THEREOF, EVEN IF ADVISED OF THE POSSIBILITY THEREOF. 6. This License Agreement will automatically terminate upon a material breach of its terms and conditions. 7. Nothing in this License Agreement shall be deemed to create any relationship of agency, partnership, or joint venture between PSF and Licensee. This License Agreement does not grant permission to use PSF trademarks or trade name in a trademark sense to endorse or promote products or services of Licensee, or any third party. 8. 
By copying, installing or otherwise using Python, Licensee agrees to be bound by the terms and conditions of this License Agreement.
Changes to standard library modules:
====================================
A brief summary of changes made to Python 2.5.2 source: - On Windows*, the code of import, zipimport, and execfile was modified to handle directories containing Unicode characters.
wxWidgets Library
This product includes wxWindows software which can be downloaded from www.wxwidgets.org/downloads.
wxWindows Library Licence, Version 3.1
======================================
Copyright (C) 1998-2005 Julian Smart, Robert Roebling et al
Everyone is permitted to copy and distribute verbatim copies of this licence document, but changing it is not allowed.
WXWINDOWS LIBRARY LICENCE
TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION
This library is free software; you can redistribute it and/or modify it under the terms of the GNU Library General Public Licence as published by the Free Software Foundation; either version 2 of the Licence, or (at your option) any later version. This library is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Library General Public Licence for more details. You should have received a copy of the GNU Library General Public Licence along with this software, usually in a file named COPYING.LIB. If not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA.
EXCEPTION NOTICE
1.
As a special exception, the copyright holders of this library give permission for additional uses of the text contained in this release of the library as licenced under the wxWindows Library Licence, applying either version 3.1 of the Licence, or (at your option) any later version of the Licence as published by the copyright holders of version 3.1 of the Licence document. 2. The exception is that you may use, copy, link, modify and distribute under your own terms, binary object code versions of works based on the Library. 3. If you copy code from files distributed under the terms of the GNU General Public Licence or the GNU Library General Public Licence into a copy of this library, as this licence permits, the exception does not apply to the code that you add in this way. To avoid misleading anyone as to the status of such modified files, you must delete this exception notice from such code and/or adjust the licensing conditions notice accordingly. 4. If you write modifications of your own for this library, it is your choice whether to permit this exception to apply to your modifications. If you do not wish that, you must delete the exception notice from such code and/or adjust the licensing conditions notice accordingly.
/* zlib.h -- interface of the 'zlib' general purpose compression library version 1.2.3, July 18th, 2005 Copyright (C) 1995-2005 Jean-loup Gailly and Mark Adler This software is provided 'as-is', without any express or implied warranty. In no event will the authors be held liable for any damages arising from the use of this software. Permission is granted to anyone to use this software for any purpose, including commercial applications, and to alter it and redistribute it freely, subject to the following restrictions: 1. The origin of this software must not be misrepresented; you must not claim that you wrote the original software.
If you use this software in a product, an acknowledgment in the product documentation would be appreciated but is not required. 2. Altered source versions must be plainly marked as such, and must not be misrepresented as being the original software. 3. This notice may not be removed or altered from any source distribution. Jean-loup Gailly jloup@gzip.org Mark Adler madler@alumni.caltech.edu */
8 Disclaimer and Legal Information
INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR. Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information. The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.
Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order. Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm. Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. See http://www.intel.com/products/processor_number for details. This document contains information on products in the design phase of development. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance. BunnyPeople, Celeron, Celeron Inside, Centrino, Centrino Inside, Core Inside, i960, Intel, the Intel logo, Intel Atom, Intel Atom Inside, Intel Core, Intel Inside, Intel Inside logo, Intel NetBurst, Intel NetMerge, Intel NetStructure, Intel SingleDriver, Intel SpeedStep, Intel Sponsors of Tomorrow., the Intel Sponsors of Tomorrow. logo, Intel StrataFlash, Intel Viiv, Intel vPro, Intel XScale, InTru, the InTru logo, InTru soundmark, Itanium, Itanium Inside, MCS, MMX, Moblin, Pentium, Pentium Inside, skoool, the skoool logo, Sound Mark, The Journey Inside, vPro Inside, VTune, Xeon, and Xeon Inside are trademarks of Intel Corporation in the U.S. and other countries. * Other names and brands may be claimed as the property of others. 
Microsoft, Windows, Visual Studio, Visual C++, and the Windows logo are trademarks, or registered trademarks of Microsoft Corporation in the United States and/or other countries. Java and all Java based trademarks and logos are trademarks or registered trademarks of Sun Microsystems, Inc. in the U.S. and other countries.
Copyright (C) 2010-2011, Intel Corporation. All rights reserved.
Intel(R) Threading Building Blocks Reference Manual
Document Number 315415-014US.
World Wide Web: http://www.intel.com
Legal Information
INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL(R) PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR. Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.
The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order. Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/#/en_US_01. Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. See http://www.intel.com/products/processor_number for details. BlueMoon, BunnyPeople, Celeron, Celeron Inside, Centrino, Centrino Inside, Cilk, Core Inside, E-GOLD, i960, Intel, the Intel logo, Intel AppUp, Intel Atom, Intel Atom Inside, Intel Core, Intel Inside, Intel Insider, the Intel Inside logo, Intel NetBurst, Intel NetMerge, Intel NetStructure, Intel SingleDriver, Intel SpeedStep, Intel Sponsors of Tomorrow., the Intel Sponsors of Tomorrow. logo, Intel StrataFlash, Intel vPro, Intel XScale, InTru, the InTru logo, the InTru Inside logo, InTru soundmark, Itanium, Itanium Inside, MCS, MMX, Moblin, Pentium, Pentium Inside, Puma, skoool, the skoool logo, SMARTi, Sound Mark, The Creators Project, The Journey Inside, Thunderbolt, Ultrabook, vPro Inside, VTune, Xeon, Xeon Inside, X-GOLD, XMM, X-PMU and XPOSYS are trademarks of Intel Corporation in the U.S. and/or other countries. * Other names and brands may be claimed as the property of others. Copyright (C) 2005 - 2011, Intel Corporation. All rights reserved.
Optimization Notice
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors.
These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804Intel(R) Threading Building Blocks iv 315415-014US Revision History Document Number Revision Number Description Revision Date 315415- 014 1.27 Updated the Optimization Notice. 2011-Oct-27 315415- 013 1.26 Moved the flow graph from Appendix D to Section 6 and made a number of updates as it bcomes a fully supported feature. Moved concurrent_priority_queue from Appendix D to Section 5.7 as it becomes fully supported. Added serial subset, memory pools, and parallel_deterministic_reduce to Appendix D. Made other small corrections and additions. 2011-Aug-01 315415- 012 1.25 Moved task and task_group priorities from Appendix D to Section 111.3.8 and 11.6. Updated concurrent_priority_queue documentation in Section D.1 to reflect interface changes. Updated flow graph documentation in D.2 to reflect changes in the interface. Added run-time loader documentation as Section D.3. 2011-July-01 315415- 011 1.24 Fix incorrect cross-reference to Tutorial in Section 11.3.5.3. Clarify left to right properties of parallel_reduce. Add task_group_context syntax and description to parallel algorithms as needed. Add group and change_group method to task. Update description of task_group. Add task and task_group priorities to Community Preview Features as D.3. Add two examples to D.2 and describe body objects. Update overwrite_node, write_once_node and join_node. 
2011-Feb-24 315415-010 1.23 Added graph to Community Preview Features.
2010-Dec-10 315415-009 1.22 Added Community Preview Features Appendix.
2010-Nov-04 315415-008 1.21 Added constructor that accepts Finit for enumerable_thread_specific. Added operator= declarations for enumerable_thread_specific.

Contents
1 Overview
2 General Conventions
2.1 Notation
2.2 Terminology
2.2.1 Concept
2.2.2 Model
2.2.3 CopyConstructible
2.3 Identifiers
2.3.1 Case
2.3.2 Reserved Identifier Prefixes
2.4 Namespaces
2.4.1 tbb Namespace
2.4.2 tbb::flow Namespace
2.4.3 tbb::interfacex Namespace
2.4.4 tbb::internal Namespace
2.4.5 tbb::deprecated Namespace
2.4.6 tbb::strict_ppl Namespace
2.4.7 std Namespace
2.5 Thread Safety
3 Environment
3.1 Version Information
3.1.1 Version Macros
3.1.2 TBB_VERSION Environment Variable
3.1.3 TBB_runtime_interface_version Function
3.2 Enabling Debugging Features
3.2.1 TBB_USE_ASSERT Macro
3.2.2 TBB_USE_THREADING_TOOLS Macro
3.2.3 TBB_USE_PERFORMANCE_WARNINGS Macro
3.3 Feature macros
3.3.1 TBB_DEPRECATED macro
3.3.2 TBB_USE_EXCEPTIONS macro
3.3.3 TBB_USE_CAPTURED_EXCEPTION macro
4 Algorithms
4.1 Splittable Concept
4.1.1 split Class
4.2 Range Concept
4.2.1 blocked_range Template Class
4.2.1.1 size_type
4.2.1.2 blocked_range( Value begin, Value end, size_t grainsize=1 )
4.2.1.3 blocked_range( blocked_range& range, split )
4.2.1.4 size_type size() const
4.2.1.5 bool empty() const
4.2.1.6 size_type grainsize() const
4.2.1.7 bool is_divisible() const
4.2.1.8 const_iterator begin() const
4.2.1.9 const_iterator end() const
4.2.2 blocked_range2d Template Class
4.2.2.1 row_range_type
4.2.2.2 col_range_type
4.2.2.3 blocked_range2d( RowValue row_begin, RowValue row_end, typename row_range_type::size_type row_grainsize, ColValue col_begin, ColValue col_end, typename col_range_type::size_type col_grainsize )
4.2.2.4 blocked_range2d( RowValue row_begin, RowValue row_end, ColValue col_begin, ColValue col_end )
4.2.2.5 blocked_range2d( blocked_range2d& range, split )
4.2.2.6 bool empty() const
4.2.2.7 bool is_divisible() const
4.2.2.8 const row_range_type& rows() const
4.2.2.9 const col_range_type& cols() const
4.2.3 blocked_range3d Template Class
4.3 Partitioners
4.3.1 auto_partitioner Class
4.3.1.1 auto_partitioner()
4.3.1.2 ~auto_partitioner()
4.3.2 affinity_partitioner
4.3.2.1 affinity_partitioner()
4.3.2.2 ~affinity_partitioner()
4.3.3 simple_partitioner Class
4.3.3.1 simple_partitioner()
4.3.3.2 ~simple_partitioner()
4.4 parallel_for Template Function
4.5 parallel_reduce Template Function
4.6 parallel_scan Template Function
4.6.1 pre_scan_tag and final_scan_tag Classes
4.6.1.1 bool is_final_scan()
4.7 parallel_do Template Function
4.7.1 parallel_do_feeder class
4.7.1.1 void add( const Item& item )
4.8 parallel_for_each Template Function
4.9 pipeline Class
4.9.1 pipeline()
4.9.2 ~pipeline()
4.9.3 void add_filter( filter& f )
4.9.4 void run( size_t max_number_of_live_tokens[, task_group_context& group] )
4.9.5 void clear()
4.9.6 filter Class
4.9.6.1 filter( mode filter_mode )
4.9.6.2 ~filter()
4.9.6.3 bool is_serial() const
4.9.6.4 bool is_ordered() const
4.9.6.5 virtual void* operator()( void * item )
4.9.6.6 virtual void finalize( void * item )
4.9.7 thread_bound_filter Class
4.9.7.1 thread_bound_filter(mode filter_mode)
4.9.7.2 result_type try_process_item()
4.9.7.3 result_type process_item()
4.10 parallel_pipeline Function
4.10.1 filter_t Template Class
4.10.1.1 filter_t()
4.10.1.2 filter_t( const filter_t& rhs )
4.10.1.3 template filter_t( filter::mode mode, const Func& f )
4.10.1.4 void operator=( const filter_t& rhs )
4.10.1.5 ~filter_t()
4.10.1.6 void clear()
4.10.1.7 template filter_t make_filter(filter::mode mode, const Func& f)
4.10.1.8 template filter_t operator& (const filter_t& left, const filter_t& right)
4.10.2 flow_control Class
4.11 parallel_sort Template Function
4.12 parallel_invoke Template Function
5 Containers
5.1 Container Range Concept
5.2 concurrent_unordered_map Template Class
5.2.1 Construct, Destroy, Copy
5.2.1.1 explicit concurrent_unordered_map (size_type n = implementation-defined, const hasher& hf = hasher(), const key_equal& eql = key_equal(), const allocator_type& a = allocator_type())
5.2.1.2 template concurrent_unordered_map (InputIterator first, InputIterator last, size_type n = implementation-defined, const hasher& hf = hasher(), const key_equal& eql = key_equal(), const allocator_type& a = allocator_type())
5.2.1.3 concurrent_unordered_map(const unordered_map& m)
5.2.1.4 concurrent_unordered_map(const Alloc& a)
5.2.1.5 concurrent_unordered_map(const unordered_map&, const Alloc& a)
5.2.1.6 ~concurrent_unordered_map()
5.2.1.7 concurrent_unordered_map& operator=(const concurrent_unordered_map& m)
5.2.1.8 allocator_type get_allocator() const
5.2.2 Size and capacity
5.2.2.1 bool empty() const
5.2.2.2 size_type size() const
5.2.2.3 size_type max_size() const
5.2.3 Iterators
5.2.3.1 iterator begin()
5.2.3.2 const_iterator begin() const
5.2.3.3 iterator end()
5.2.3.4 const_iterator end() const
5.2.3.5 const_iterator cbegin() const
5.2.3.6 const_iterator cend() const
5.2.4 Modifiers
5.2.4.1 std::pair insert(const value_type& x)
5.2.4.2 iterator insert(const_iterator hint, const value_type& x)
5.2.4.3 template void insert(InputIterator first, InputIterator last)
5.2.4.4 iterator unsafe_erase(const_iterator position)
5.2.4.5 size_type unsafe_erase(const key_type& k)
5.2.4.6 iterator unsafe_erase(const_iterator first, const_iterator last)
5.2.4.7 void clear()
5.2.4.8 void swap(concurrent_unordered_map& m)
5.2.5 Observers
5.2.5.1 hasher hash_function() const
5.2.5.2 key_equal key_eq() const
5.2.6 Lookup
5.2.6.1 iterator find(const key_type& k)
5.2.6.2 const_iterator find(const key_type& k) const
5.2.6.3 size_type count(const key_type& k) const
5.2.6.4 std::pair equal_range(const key_type& k)
5.2.6.5 std::pair equal_range(const key_type& k) const
5.2.6.6 mapped_type& operator[](const key_type& k)
5.2.6.7 mapped_type& at( const key_type& k )
5.2.6.8 const mapped_type& at(const key_type& k) const
5.2.7 Parallel Iteration
5.2.7.1 const_range_type range() const
5.2.7.2 range_type range()
5.2.8 Bucket Interface
5.2.8.1 size_type unsafe_bucket_count() const
5.2.8.2 size_type unsafe_max_bucket_count() const
5.2.8.3 size_type unsafe_bucket_size(size_type n)
5.2.8.4 size_type unsafe_bucket(const key_type& k) const
5.2.8.5 local_iterator unsafe_begin(size_type n)
5.2.8.6 const_local_iterator unsafe_begin(size_type n) const
5.2.8.7 local_iterator unsafe_end(size_type n)
5.2.8.8 const_local_iterator unsafe_end(size_type n) const
5.2.8.9 const_local_iterator unsafe_cbegin(size_type n) const
5.2.8.10 const_local_iterator unsafe_cend(size_type n) const
5.2.9 Hash policy
5.2.9.1 float load_factor() const
5.2.9.2 float max_load_factor() const
5.2.9.3 void max_load_factor(float z)
5.2.9.4 void rehash(size_type n)
5.3 concurrent_unordered_set Template Class
5.3.1 Construct, Destroy, Copy
5.3.1.1 explicit concurrent_unordered_set (size_type n = implementation-defined, const hasher& hf = hasher(), const key_equal& eql = key_equal(), const allocator_type& a = allocator_type())
5.3.1.2 template concurrent_unordered_set (InputIterator first, InputIterator last, size_type n = implementation-defined, const hasher& hf = hasher(), const key_equal& eql = key_equal(), const allocator_type& a = allocator_type())
5.3.1.3 concurrent_unordered_set(const unordered_set& m)
5.3.1.4 concurrent_unordered_set(const Alloc& a)
5.3.1.5 concurrent_unordered_set(const unordered_set&, const Alloc& a)
5.3.1.6 ~concurrent_unordered_set()
5.3.1.7 concurrent_unordered_set& operator=(const concurrent_unordered_set& m)
5.3.1.8 allocator_type get_allocator() const
5.3.2 Size and capacity
5.3.2.1 bool empty() const
5.3.2.2 size_type size() const
5.3.2.3 size_type max_size() const
5.3.3 Iterators
5.3.3.1 iterator begin()
5.3.3.2 const_iterator begin() const
5.3.3.3 iterator end()
5.3.3.4 const_iterator end() const
5.3.3.5 const_iterator cbegin() const
5.3.3.6 const_iterator cend() const
5.3.4 Modifiers
5.3.4.1 std::pair insert(const value_type& x)
5.3.4.2 iterator insert(const_iterator hint, const value_type& x)
5.3.4.3 template void insert(InputIterator first, InputIterator last)
5.3.4.4 iterator unsafe_erase(const_iterator position)
5.3.4.5 size_type unsafe_erase(const key_type& k)
5.3.4.6 iterator unsafe_erase(const_iterator first, const_iterator last)
5.3.4.7 void clear()
5.3.4.8 void swap(concurrent_unordered_set& m)
5.3.5 Observers
5.3.5.1 hasher hash_function() const
5.3.5.2 key_equal key_eq() const
5.3.6 Lookup
5.3.6.1 iterator find(const key_type& k)
5.3.6.2 const_iterator find(const key_type& k) const
5.3.6.3 size_type count(const key_type& k) const
5.3.6.4 std::pair equal_range(const key_type& k)
5.3.6.5 std::pair equal_range(const key_type& k) const
5.3.7 Parallel Iteration
5.3.7.1 const_range_type range() const
5.3.7.2 range_type range()
5.3.8 Bucket Interface
5.3.8.1 size_type unsafe_bucket_count() const
5.3.8.2 size_type unsafe_max_bucket_count() const
5.3.8.3 size_type unsafe_bucket_size(size_type n)
5.3.8.4 size_type unsafe_bucket(const key_type& k) const
5.3.8.5 local_iterator unsafe_begin(size_type n)
5.3.8.6 const_local_iterator unsafe_begin(size_type n) const
5.3.8.7 local_iterator unsafe_end(size_type n)
5.3.8.8 const_local_iterator unsafe_end(size_type n) const
5.3.8.9 const_local_iterator unsafe_cbegin(size_type n) const
5.3.8.10 const_local_iterator unsafe_cend(size_type n) const
5.3.9 Hash policy
5.3.9.1 float load_factor() const
5.3.9.2 float max_load_factor() const
5.3.9.3 void max_load_factor(float z)
5.3.9.4 void rehash(size_type n)
5.4 concurrent_hash_map Template Class
5.4.1 Whole Table Operations
5.4.1.1 concurrent_hash_map( const allocator_type& a = allocator_type() )
5.4.1.2 concurrent_hash_map( size_type n, const allocator_type& a = allocator_type() )
5.4.1.3 concurrent_hash_map( const concurrent_hash_map& table, const allocator_type& a = allocator_type() )
5.4.1.4 template concurrent_hash_map( InputIterator first, InputIterator last, const allocator_type& a = allocator_type() )
5.4.1.5 ~concurrent_hash_map()
5.4.1.6 concurrent_hash_map& operator= ( concurrent_hash_map& source )
5.4.1.7 void swap( concurrent_hash_map& table )
5.4.1.8 void rehash( size_type n=0 )
5.4.1.9 void clear()
5.4.1.10 allocator_type get_allocator() const
5.4.2 Concurrent Access
5.4.2.1 const_accessor
5.4.2.2 accessor
5.4.3 Concurrent Operations
5.4.3.1 size_type count( const Key& key ) const
5.4.3.2 bool find( const_accessor& result, const Key& key ) const
5.4.3.3 bool find( accessor& result, const Key& key )
5.4.3.4 bool insert( const_accessor& result, const Key& key )
5.4.3.5 bool insert( accessor& result, const Key& key )
5.4.3.6 bool insert( const_accessor& result, const value_type& value )
5.4.3.7 bool insert( accessor& result, const value_type& value )
5.4.3.8 bool insert( const value_type& value )
5.4.3.9 template void insert( InputIterator first, InputIterator last )
5.4.3.10 bool erase( const Key& key )
5.4.3.11 bool erase( const_accessor& item_accessor )
5.4.3.12 bool erase( accessor& item_accessor )
5.4.4 Parallel Iteration
5.4.4.1 const_range_type range( size_t grainsize=1 ) const
5.4.4.2 range_type range( size_t grainsize=1 )
5.4.5 Capacity
5.4.5.1 size_type size() const
5.4.5.2 bool empty() const
5.4.5.3 size_type max_size() const
5.4.5.4 size_type bucket_count() const
5.4.6 Iterators
5.4.6.1 iterator begin()
5.4.6.2 iterator end()
5.4.6.3 const_iterator begin() const
5.4.6.4 const_iterator end() const
5.4.6.5 std::pair equal_range( const Key& key )
5.4.6.6 std::pair equal_range( const Key& key ) const
5.4.7 Global Functions
5.4.7.1 template bool operator==( const concurrent_hash_map& a, const concurrent_hash_map& b)
5.4.7.2 template bool operator!=(const concurrent_hash_map &a, const concurrent_hash_map &b)
5.4.7.3 template void swap(concurrent_hash_map &a, concurrent_hash_map &b)
5.4.8 tbb_hash_compare Class
5.5 concurrent_queue Template Class
5.5.1 concurrent_queue( const Alloc& a = Alloc () )
5.5.2 concurrent_queue( const concurrent_queue& src, const Alloc& a = Alloc() )
5.5.3 template concurrent_queue( InputIterator first, InputIterator last, const Alloc& a = Alloc() )
5.5.4 ~concurrent_queue()
5.5.5 void push( const T& source )
5.5.6 bool try_pop ( T& destination )
5.5.7 void clear()
5.5.8 size_type unsafe_size() const
5.5.9 bool empty() const
5.5.10 Alloc get_allocator() const
5.5.11 Iterators
5.5.11.1 iterator unsafe_begin()
5.5.11.2 iterator unsafe_end()
5.5.11.3 const_iterator unsafe_begin() const
5.5.11.4 const_iterator unsafe_end() const
5.6 concurrent_bounded_queue Template Class
5.6.1 void push( const T& source )
5.6.2 void pop( T& destination )
5.6.3 bool try_push( const T& source )
5.6.4 bool try_pop( T& destination )
5.6.5 size_type size() const
5.6.6 bool empty() const
5.6.7 size_type capacity() const
5.6.8 void set_capacity( size_type capacity )
5.7 concurrent_priority_queue Template Class
5.7.1 concurrent_priority_queue(const allocator_type& a = allocator_type())
5.7.2 concurrent_priority_queue(size_type init_capacity, const allocator_type& a = allocator_type())
5.7.3 concurrent_priority_queue(InputIterator begin, InputIterator end, const allocator_type& a = allocator_type())
5.7.4 concurrent_priority_queue (const concurrent_priority_queue& src, const allocator_type& a = allocator_type())
5.7.5 concurrent_priority_queue& operator=(const concurrent_priority_queue& src)
5.7.6 ~concurrent_priority_queue()
5.7.7 bool empty() const
5.7.8 size_type size() const
5.7.9 void push(const_reference elem)
5.7.10 bool try_pop(reference elem)
5.7.11 void clear()
5.7.12 void swap(concurrent_priority_queue& other)
5.7.13 allocator_type get_allocator() const
5.8 concurrent_vector
5.8.1 Construction, Copy, and Assignment
5.8.1.1 concurrent_vector( const allocator_type& a = allocator_type() )
5.8.1.2 concurrent_vector( size_type n, const_reference t=T(), const allocator_type& a = allocator_type() )
5.8.1.3 template concurrent_vector( InputIterator first, InputIterator last, const allocator_type& a = allocator_type() )
5.8.1.4 concurrent_vector( const concurrent_vector& src )
5.8.1.5 concurrent_vector& operator=( const concurrent_vector& src )
5.8.1.6 template concurrent_vector& operator=( const concurrent_vector& src )
5.8.1.7 void assign( size_type n, const_reference t )
5.8.1.8 template void assign( InputIterator first, InputIterator last )
5.8.2 Whole Vector Operations
5.8.2.1 void reserve( size_type n )
5.8.2.2 void shrink_to_fit()
5.8.2.3 void swap( concurrent_vector& x )
5.8.2.4 void clear()
5.8.2.5 ~concurrent_vector()
5.8.3 Concurrent Growth
5.8.3.1 iterator grow_by( size_type delta, const_reference t=T() )
5.8.3.2 iterator grow_to_at_least( size_type n )
5.8.3.3 iterator push_back( const_reference value )
5.8.4 Access
5.8.4.1 reference operator[]( size_type index )
5.8.4.2 const_reference operator[]( size_type index ) const
5.8.4.3 reference at( size_type index )
5.8.4.4 const_reference at( size_type index ) const
5.8.4.5 reference front()
5.8.4.6 const_reference front() const
5.8.4.7 reference back()
5.8.4.8 const_reference back() const
5.8.5 Parallel Iteration
5.8.5.1 range_type range( size_t grainsize=1 )
5.8.5.2 const_range_type range( size_t grainsize=1 ) const
5.8.6 Capacity
5.8.6.1 size_type size() const
5.8.6.2 bool empty() const
5.8.6.3 size_type capacity() const
5.8.6.4 size_type max_size() const
5.8.7 Iterators
5.8.7.1 iterator begin()
5.8.7.2 const_iterator begin() const
5.8.7.3 iterator end()
5.8.7.4 const_iterator end() const
5.8.7.5 reverse_iterator rbegin()
5.8.7.6 const_reverse_iterator rbegin() const
5.8.7.7 iterator rend()
5.8.7.8 const_reverse_iterator rend()
6 Flow Graph
6.1 graph Class
6.1.1 graph()
6.1.2 ~graph()
6.1.3 void increment_wait_count()
6.1.4 void decrement_wait_count()
6.1.5 template< typename Receiver, typename Body > void run( Receiver &r, Body body )
6.1.6 template< typename Body > void run( Body body )
6.1.7 void wait_for_all()
6.1.8 task *root_task()
6.2 sender Template Class
6.2.1 ~sender()
6.2.2 bool register_successor( successor_type & r ) = 0
6.2.3 bool remove_successor( successor_type & r ) = 0
6.2.4 bool try_get( output_type & )
6.2.5 bool try_reserve( output_type & )
6.2.6 bool try_release( )
6.2.7 bool try_consume( )
6.3 receiver Template Class
6.3.1 ~receiver()
6.3.2 bool register_predecessor( predecessor_type & p )
6.3.3 bool remove_predecessor( predecessor_type & p )
6.3.4 bool try_put( const input_type &v ) = 0
6.4 continue_msg Class
6.5 continue_receiver Class
6.5.1 continue_receiver( int num_predecessors = 0 )
6.5.2 continue_receiver( const continue_receiver& src )
6.5.3 ~continue_receiver( )
6.5.4 bool try_put( const input_type & )
6.5.5 bool register_predecessor( predecessor_type & r )
6.5.6 bool remove_predecessor( predecessor_type & r )
6.5.7 void execute() = 0
6.6 graph_node Class
6.7 continue_node Template Class
6.7.1 template< typename Body> continue_node(graph &g, Body body)
6.7.2 template< typename Body> continue_node(graph &g, int number_of_predecessors, Body body)
6.7.3 continue_node( const continue_node & src )
6.7.4 bool register_predecessor( predecessor_type & r )
6.7.5 bool remove_predecessor( predecessor_type & r )
6.7.6 bool try_put( const input_type & )
6.7.7 bool register_successor( successor_type & r )
6.7.8 bool remove_successor( successor_type & r )
6.7.9 bool try_get( output_type &v )
6.7.10 bool try_reserve( output_type & )
6.7.11 bool try_release( )
6.7.12 bool try_consume( )
6.8 function_node Template Class
6.8.1 template< typename Body> function_node(graph &g, size_t concurrency, Body body)
6.8.2 function_node( const function_node &src )
6.8.3 bool register_predecessor( predecessor_type & p )
6.8.4 bool remove_predecessor( predecessor_type & p )
6.8.5 bool try_put( const input_type &v )
6.8.6 bool register_successor( successor_type & r )
6.8.7 bool remove_successor( successor_type & r )
6.8.8 bool try_get( output_type &v )
6.8.9 bool try_reserve( output_type & )
6.8.10 bool try_release( )
6.8.11 bool try_consume( )
6.9 source_node Class
6.9.1 template< typename Body> source_node(graph &g, Body body, bool is_active=true)
6.9.2 source_node( const source_node &src )
6.9.3 bool register_successor( successor_type & r )
6.9.4 bool remove_successor( successor_type & r )
6.9.5 bool try_get( output_type &v )
6.9.6 bool try_reserve( output_type &v )
6.9.7 bool try_release( )
6.9.8 bool try_consume( )
6.10 overwrite_node Template Class
6.10.1 overwrite_node()
6.10.2 overwrite_node( const overwrite_node &src )
6.10.3 ~overwrite_node()
6.10.4 bool register_predecessor( predecessor_type & )
6.10.5 bool remove_predecessor( predecessor_type &)
6.10.6 bool try_put( const input_type &v )
6.10.7 bool register_successor( successor_type & r )
6.10.8 bool remove_successor( successor_type & r )
6.10.9 bool try_get( output_type &v )
6.10.10 bool try_reserve( output_type & )
6.10.11 bool try_release( )
6.10.12 bool try_consume( )
6.10.13 bool is_valid()
6.10.14 void clear()
6.11 write_once_node Template Class
6.11.1 write_once_node()
6.11.2 write_once_node( const write_once_node &src )
6.11.3 bool register_predecessor( predecessor_type & )
6.11.4 bool remove_predecessor( predecessor_type &)
6.11.5 bool try_put( const input_type &v )
6.11.6 bool register_successor( successor_type & r )
6.11.7 bool remove_successor( successor_type & r )
6.11.8 bool try_get( output_type &v )
6.11.9 bool try_reserve( output_type & )
6.11.10 bool try_release( )
6.11.11 bool try_consume( )
6.11.12 bool is_valid()
6.11.13 void clear()
6.12 broadcast_node Template Class
6.12.1 broadcast_node()
6.12.2 broadcast_node( const broadcast_node &src )
6.12.3 bool register_predecessor( predecessor_type & )
6.12.4 bool remove_predecessor( predecessor_type &)
6.12.5 bool try_put( const input_type &v )
6.12.6 bool register_successor( successor_type & r )
6.12.7 bool remove_successor( successor_type & r )
6.12.8 bool try_get( output_type & )
6.12.9 bool try_reserve( output_type & )
6.12.10 bool try_release( )
6.12.11 bool try_consume( )
6.13 buffer_node Class
6.13.1 buffer_node( graph& g )
6.13.2 buffer_node( const buffer_node &src )
6.13.3 bool register_predecessor( predecessor_type & )
6.13.4 bool remove_predecessor( predecessor_type &)
6.13.5 bool try_put( const input_type &v )
6.13.6 bool register_successor( successor_type & r )
6.13.7 bool remove_successor( successor_type & r )
6.13.8 bool try_get( output_type & v )
6.13.9 bool try_reserve( output_type & v )
6.13.10 bool try_release( )
6.13.11 bool try_consume( )
6.14 queue_node Template Class
6.14.1 queue_node( graph& g )
6.14.2 queue_node( const queue_node &src )
6.14.3 bool register_predecessor( predecessor_type & )
6.14.4 bool remove_predecessor( predecessor_type &)
6.14.5 bool try_put( const input_type &v )
6.14.6 bool register_successor( successor_type & r )
6.14.7 bool remove_successor( successor_type & r )
6.14.8 bool try_get( output_type & v )
6.14.9 bool try_reserve( output_type & v )
6.14.10 bool try_release( )
6.14.11 bool try_consume( )
6.15 priority_queue_node Template Class
6.15.1 priority_queue_node( graph& g)
6.15.2 priority_queue_node( const priority_queue_node &src )
6.15.3 bool register_predecessor( predecessor_type & )
6.15.4 bool remove_predecessor( predecessor_type &)
6.15.5 bool try_put( const input_type &v )
6.15.6 bool register_successor( successor_type &r )
6.15.7 bool remove_successor( successor_type &r )
6.15.8 bool try_get( output_type & v )
6.15.9 bool try_reserve( output_type & v )
6.15.10 bool try_release( )
6.15.11 bool try_consume( )
6.16 sequencer_node Template Class
6.16.1 template<typename Sequencer> sequencer_node( graph& g, const Sequencer& s )
6.16.2 sequencer_node( const sequencer_node &src )
6.16.3 bool register_predecessor( predecessor_type & )
6.16.4 bool remove_predecessor( predecessor_type &)
6.16.5 bool try_put( input_type v )
6.16.6 bool register_successor( successor_type &r )
6.16.7 bool remove_successor( successor_type &r )
6.16.8 bool try_get( output_type & v )
6.16.9 bool try_reserve( output_type & v )
6.16.10 bool try_release( )
6.16.11 bool try_consume( )
6.17 limiter_node Template Class
6.17.1 limiter_node( graph &g, size_t threshold, int number_of_decrement_predecessors )
6.17.2 limiter_node( const limiter_node &src )
6.17.3 bool register_predecessor( predecessor_type& p )
6.17.4 bool remove_predecessor( predecessor_type & r )
6.17.5 bool try_put( input_type &v )
6.17.6 bool register_successor( successor_type & r )
6.17.7 bool remove_successor( successor_type & r )
6.17.8 bool try_get( output_type & )
6.17.9 bool try_reserve( output_type & )
6.17.10 bool try_release( )
6.17.11 bool try_consume( )
6.18 join_node Template Class
6.18.1 join_node( graph &g )
6.18.2 template < typename B0, typename B1, … > join_node( graph &g, B0 b0, B1 b1, … )
6.18.3 join_node( const join_node &src )
6.18.4 input_ports_tuple_type& inputs()
6.18.5 bool register_successor( successor_type & r )
6.18.6 bool remove_successor( successor_type & r )
6.18.7 bool try_get( output_type &v )
6.18.8 bool try_reserve( T & )
6.18.9 bool try_release( )
6.18.10 bool try_consume( )
6.18.11 template typename std::tuple_element::type &input_port(JNT &jn)
6.19 input_port Template Function
6.20 make_edge Template Function
6.21 remove_edge Template Function
6.22 copy_body Template Function
7 Thread Local Storage
7.1 combinable Template Class
7.1.1 combinable()
7.1.2 template<typename FInit> combinable(FInit finit)
7.1.3 combinable( const combinable& other );
7.1.4 ~combinable()
7.1.5 combinable& operator=( const combinable& other )
7.1.6 void clear()
7.1.7 T& local()
7.1.8 T& local( bool& exists )
7.1.9 template<typename FCombine> T combine(FCombine fcombine)
7.1.10 template<typename Func> void combine_each(Func f)
7.2 enumerable_thread_specific Template Class
7.2.1 Whole Container Operations
7.2.1.1 enumerable_thread_specific()
7.2.1.2 enumerable_thread_specific(const enumerable_thread_specific &other)
7.2.1.3 template enumerable_thread_specific( const enumerable_thread_specific& other )
7.2.1.4 template< typename Finit> enumerable_thread_specific(Finit finit)
7.2.1.5 enumerable_thread_specific(const &exemplar)
7.2.1.6 ~enumerable_thread_specific()
7.2.1.7 enumerable_thread_specific& operator=(const enumerable_thread_specific& other);
7.2.1.8 template< typename U, typename Alloc, ets_key_usage_type Cachetype> enumerable_thread_specific& operator=(const enumerable_thread_specific& other);
7.2.1.9 void clear()
7.2.2 Concurrent Operations
7.2.2.1 reference local()
7.2.2.2 reference local( bool& exists )
7.2.2.3 size_type size() const
7.2.2.4 bool empty() const
7.2.3 Combining
7.2.3.1 template<typename FCombine> T combine(FCombine fcombine)
7.2.3.2 template<typename Func> void combine_each(Func f)
7.2.4 Parallel Iteration
7.2.4.1 const_range_type range( size_t grainsize=1 ) const
7.2.4.2 range_type range( size_t grainsize=1 )
7.2.5 Iterators
7.2.5.1 iterator begin()
7.2.5.2 iterator end()
7.2.5.3 const_iterator begin() const
7.2.5.4 const_iterator end() const
7.3 flattened2d Template Class
7.3.1 Whole Container Operations
7.3.1.1 flattened2d( const Container& c )
7.3.1.2 flattened2d( const Container& c, typename Container::const_iterator first, typename Container::const_iterator last )
7.3.2 Concurrent Operations
7.3.2.1 size_type size() const
7.3.3 Iterators
7.3.3.1 iterator begin()
7.3.3.2 iterator end()
7.3.3.3 const_iterator begin() const
7.3.3.4 const_iterator end() const
7.3.4 Utility Functions
8 Memory Allocation
8.1 Allocator Concept
8.2 tbb_allocator Template Class
8.3 scalable_allocator Template Class
8.3.1 C Interface to Scalable Allocator
8.3.1.1 size_t scalable_msize( void* ptr )
8.4 cache_aligned_allocator Template Class
8.4.1 pointer allocate( size_type n, const void* hint=0 )
8.4.2 void deallocate( pointer p, size_type n )
8.4.3 char* _Charalloc( size_type size )
8.5 zero_allocator
8.6 aligned_space Template Class
8.6.1 aligned_space()
8.6.2 ~aligned_space()
8.6.3 T* begin()
8.6.4 T* end()
9 Synchronization
9.1 Mutexes
9.1.1 Mutex Concept
9.1.1.1 C++ 200x Compatibility
9.1.2 mutex Class
9.1.3 recursive_mutex Class
9.1.4 spin_mutex Class
9.1.5 queuing_mutex Class
9.1.6 ReaderWriterMutex Concept
9.1.6.1 ReaderWriterMutex()
9.1.6.2 ~ReaderWriterMutex()
9.1.6.3 ReaderWriterMutex::scoped_lock()
9.1.6.4 ReaderWriterMutex::scoped_lock( ReaderWriterMutex& rw, bool write=true )
9.1.6.5 ReaderWriterMutex::~scoped_lock()
9.1.6.6 void ReaderWriterMutex::scoped_lock::acquire( ReaderWriterMutex& rw, bool write=true )
9.1.6.7 bool ReaderWriterMutex::scoped_lock::try_acquire( ReaderWriterMutex& rw, bool write=true )
9.1.6.8 void ReaderWriterMutex::scoped_lock::release()
9.1.6.9 bool ReaderWriterMutex::scoped_lock::upgrade_to_writer()
9.1.6.10 bool ReaderWriterMutex::scoped_lock::downgrade_to_reader()
9.1.7 spin_rw_mutex Class
9.1.8 queuing_rw_mutex Class
9.1.9 null_mutex Class
9.1.10 null_rw_mutex Class
9.2 atomic Template Class
9.2.1 memory_semantics Enum
9.2.2 value_type fetch_and_add( value_type addend )
9.2.3 value_type fetch_and_increment()
9.2.4 value_type fetch_and_decrement()
9.2.5 value_type compare_and_swap
9.2.6 value_type fetch_and_store( value_type new_value )
9.3 PPL Compatibility
9.3.1 critical_section
9.3.2 reader_writer_lock Class
9.4 C++ 200x Synchronization
10 Timing
10.1 tick_count Class
10.1.1 static tick_count tick_count::now()
10.1.2 tick_count::interval_t operator-( const tick_count& t1, const tick_count& t0 )
10.1.3 tick_count::interval_t Class
10.1.3.1 interval_t()
10.1.3.2 interval_t( double sec )
10.1.3.3 double seconds() const
10.1.3.4 interval_t operator+=( const interval_t& i )
10.1.3.5 interval_t operator-=( const interval_t& i )
10.1.3.6 interval_t operator+ ( const interval_t& i, const interval_t& j )
10.1.3.7 interval_t operator- ( const interval_t& i, const interval_t& j )
11 Task Groups
11.1 task_group Class
11.1.1 task_group()
11.1.2 ~task_group()
11.1.3 template<typename Func> void run( const Func& f )
11.1.4 template<typename Func> void run( task_handle<Func>& handle );
11.1.5 template<typename Func> void run_and_wait( const Func& f )
11.1.6 template<typename Func> void run_and_wait( task_handle<Func>& handle );
11.1.7 task_group_status wait()
11.1.8 bool is_canceling()
11.1.9 void cancel()
11.2 task_group_status Enum
11.3 task_handle Template Class
11.4 make_task Template Function
11.5 structured_task_group Class
11.6 is_current_task_group_canceling Function
12 Task Scheduler
12.1 Scheduling Algorithm
12.2 task_scheduler_init Class
12.2.1 task_scheduler_init( int max_threads=automatic, stack_size_type thread_stack_size=0 )
12.2.2 ~task_scheduler_init()
12.2.3 void initialize( int max_threads=automatic )
12.2.4 void terminate()
12.2.5 int default_num_threads()
12.2.6 bool is_active() const
12.2.7 Mixing with OpenMP
12.3 task Class
12.3.1 task Derivation
12.3.1.1 Processing of execute()
12.3.2 task Allocation
12.3.2.1 new( task::allocate_root( task_group_context& group ) ) T
12.3.2.2 new( task::allocate_root() ) T
12.3.2.3 new( x.allocate_continuation() ) T
12.3.2.4 new( x.allocate_child() ) T
12.3.2.5 new(task::allocate_additional_child_of( y )) T
12.3.3 Explicit task Destruction
12.3.3.1 static void destroy ( task& victim )
12.3.4 Recycling Tasks
12.3.4.1 void recycle_as_continuation()
12.3.4.2 void recycle_as_safe_continuation()
12.3.4.3 void recycle_as_child_of( task& new_successor )
12.3.5 Synchronization
12.3.5.1 void set_ref_count( int count )
12.3.5.2 void increment_ref_count();
12.3.5.3 int decrement_ref_count();
12.3.5.4 void wait_for_all()
12.3.5.5 static void spawn( task& t )
12.3.5.6 static void spawn ( task_list& list )
12.3.5.7 void spawn_and_wait_for_all( task& t )
12.3.5.8 void spawn_and_wait_for_all( task_list& list )
12.3.5.9 static void spawn_root_and_wait( task& root )
12.3.5.10 static void spawn_root_and_wait( task_list& root_list )
12.3.5.11 static void enqueue ( task& )
12.3.6 task Context
12.3.6.1 static task& self()
12.3.6.2 task* parent() const
12.3.6.3 void set_parent(task* p)
12.3.6.4 bool is_stolen_task() const
12.3.6.5 task_group_context* group()
12.3.6.6 void change_group( task_group_context& ctx )
12.3.7 Cancellation
12.3.7.1 bool cancel_group_execution()
12.3.7.2 bool is_cancelled() const
12.3.8 Priorities
12.3.8.1 void enqueue ( task& t, priority_t p ) const
12.3.8.2 void set_group_priority ( priority_t )
12.3.8.3 priority_t group_priority () const
12.3.9 Affinity
12.3.9.1 affinity_id
12.3.9.2 virtual void note_affinity ( affinity_id id )
12.3.9.3 void set_affinity( affinity_id id )
12.3.9.4 affinity_id affinity() const
12.3.10 task Debugging
12.3.10.1 state_type state() const
12.3.10.2 int ref_count() const
12.4 empty_task Class
12.5 task_list Class
12.5.1 task_list()
12.5.2 ~task_list()
12.5.3 bool empty() const
12.5.4 push_back( task& task )
12.5.5 task& pop_front()
12.5.6 void clear()
12.6 task_group_context
12.6.1 task_group_context( kind_t relation_to_parent=bound, uintptr_t traits=default_traits )
12.6.2 ~task_group_context()
12.6.3 bool cancel_group_execution()
12.6.4 bool is_group_execution_cancelled() const
12.6.5 void reset()
12.6.6 void set_priority ( priority_t )
12.6.7 priority_t priority () const
12.7 task_scheduler_observer
12.7.1 task_scheduler_observer()
304 12.7.2 ~task_scheduler_observer() ................................................... 304 12.7.3 void observe( bool state=true ) ............................................... 304 12.7.4 bool is_observing() const........................................................ 304 12.7.5 virtual void on_scheduler_entry( bool is_worker) ....................... 304 12.7.6 virtual void on_scheduler_exit( bool is_worker ) ........................ 305 12.8 Catalog of Recommended task Patterns ................................................. 305 12.8.1 Blocking Style With k Children................................................. 306 12.8.2 Continuation-Passing Style With k Children ............................... 306 12.8.2.1 Recycling Parent as Continuation .............................. 307 12.8.2.2 Recycling Parent as a Child ...................................... 307 12.8.3 Letting Main Thread Work While Child Tasks Run ....................... 308 13 Exceptions ................................................................................................... 310 13.1 tbb_exception.................................................................................... 310 13.2 captured_exception ............................................................................ 311 13.2.1 captured_exception( const char* name, const char* info ) .......... 312 13.3 movable_exception .................................................... 312 13.3.1 movable_exception( const ExceptionData& src ) ........................ 313 13.3.2 ExceptionData& data() throw()................................................ 313 13.3.3 const ExceptionData& data() const throw() ............................... 314 13.4 Specific Exceptions ............................................................................. 314 14 Threads ....................................................................................................... 
316 14.1 thread Class ...................................................................................... 317 14.1.1 thread() ............................................................................... 318 14.1.2 template thread(F f).......................................... 318 14.1.3 template thread(F f, X x)................. 318 14.1.4 template thread(F f, X x, Y y) ....................................................................................... 318 14.1.5 thread& operator=(thread& x) ................................................ 318 14.1.6 ~thread ............................................................................... 319 14.1.7 bool joinable() const .............................................................. 319 14.1.8 void join() ............................................................................ 319 14.1.9 void detach() ........................................................................ 319 14.1.10 id get_id() const.................................................................... 319 14.1.11 native_handle_type native_handle() ........................................ 320 14.1.12 static unsigned hardware_concurrency()................................... 320 14.2 thread::id ......................................................................................... 320 14.3 this_thread Namespace ....................................................................... 321 14.3.1 thread::id get_id() ................................................................ 321 14.3.2 void yield()........................................................................... 321 14.3.3 void sleep_for( const tick_count::interval_t & i)......................... 321Intel(R) Threading Building Blocks xxii 315415-014US 15 References................................................................................................... 
323 Appendix A Compatibility Features ................................................................................... 324 A.1 parallel_while Template Class............................................................... 324 A.1.1 parallel_while().......................................................... 325 A.1.2 ~parallel_while() ....................................................... 326 A.1.3 Template void run( Stream& stream, const Body& body )........................................................................ 326 A.1.4 void add( const value_type& item ).......................................... 326 A.2 Interface for constructing a pipeline filter............................................... 326 A.2.1 filter::filter( bool is_serial )..................................................... 326 A.2.2 filter::serial .......................................................................... 327 A.3 Debugging Macros .............................................................................. 327 A.4 tbb::deprecated::concurrent_queue Template Class .................. 327 A.5 Interface for concurrent_vector ............................................................ 329 A.5.1 void compact()...................................................................... 330 A.6 Interface for class task........................................................................ 330 A.6.1 void recycle _to_reexecute()................................................... 330 A.6.2 Depth interface for class task .................................................. 331 A.7 tbb_thread Class ................................................................................ 331 Appendix B PPL Compatibility .......................................................................................... 332 Appendix C Known Issues ............................................................................................... 
C.1 Windows* OS
Appendix D Community Preview Features
D.1 Flow Graph
D.1.1 or_node Template Class
D.1.2 multioutput_function_node Template Class
D.1.3 split_node Template Class
D.2 Run-time loader
D.2.1 runtime_loader Class
D.3 parallel_deterministic_reduce Template Function
D.4 Scalable Memory Pools
D.4.1 memory_pool Template Class
D.4.2 fixed_pool Class
D.4.3 memory_pool_allocator Template Class
D.5 Serial subset
D.5.1 tbb::serial::parallel_for()

1 Overview

Intel® Threading Building Blocks (Intel® TBB) is a library that supports scalable parallel programming using standard ISO C++ code. It does not require special languages or compilers. It is designed to promote scalable data parallel programming. Additionally, it fully supports nested parallelism, so you can build larger parallel components from smaller parallel components. To use the library, you specify tasks, not threads, and let the library map tasks onto threads in an efficient manner.
Many of the library interfaces employ generic programming, in which interfaces are defined by requirements on types and not specific types. The C++ Standard Template Library (STL) is an example of generic programming. Generic programming enables Intel® Threading Building Blocks to be flexible yet efficient. The generic interfaces enable you to customize components to your specific needs.

The net result is that Intel® Threading Building Blocks enables you to specify parallelism far more conveniently than using raw threads, and at the same time can improve performance.

This document is a reference manual. It is organized for looking up details about syntax and semantics. You should first read the Intel® Threading Building Blocks Getting Started Guide and the Intel® Threading Building Blocks Tutorial to learn how to use the library effectively. The Intel® Threading Building Blocks Design Patterns document is another useful resource.

TIP: Even experienced parallel programmers should read the Intel® Threading Building Blocks Tutorial before using this reference guide because Intel® Threading Building Blocks uses a surprising recursive model of parallelism and generic algorithms.

2 General Conventions

This section describes conventions used in this document.

2.1 Notation

Literal program text appears in Courier font. Algebraic placeholders are in monospace italics. For example, the notation blocked_range<Type> indicates that blocked_range is literal, but Type is a notational placeholder. Real program text replaces Type with a real type, such as in blocked_range<int>.

Class members are summarized by informal class declarations that describe the class as it seems to clients, not how it is actually implemented.
For example, here is an informal declaration of class Foo:

    class Foo {
    public:
        int x();
        int y;
        ~Foo();
    };

The actual implementation might look like:

    namespace internal {
        class FooBase {
        protected:
            int x();
        };

        class Foo_v3: protected FooBase {
        private:
            int internal_stuff;
        public:
            using FooBase::x;
            int y;
        };
    }

    typedef internal::Foo_v3 Foo;

The example shows three cases where the actual implementation departs from the informal declaration:
• Foo is actually a typedef to Foo_v3.
• Method x() is inherited from a protected base class.
• The destructor is an implicit method generated by the compiler.

The informal declarations are intended to show you what you need to know to use the class without the distraction of irrelevant clutter particular to the implementation.

2.2 Terminology

This section describes terminology specific to Intel® Threading Building Blocks (Intel® TBB).

2.2.1 Concept

A concept is a set of requirements on a type. The requirements may be syntactic or semantic. For example, the concept of "sortable" could be defined as a set of requirements that enable an array to be sorted. A type T would be sortable if:
• x < y returns a boolean value, and represents a total order on items of type T.
• swap(x,y) swaps items x and y.

You can write a sorting template function in C++ that sorts an array of any type that is sortable.

Two approaches for defining concepts are valid expressions and pseudo-signatures [1]. The ISO C++ standard follows the valid expressions approach, which shows what the usage pattern looks like for a concept. It has the drawback of relegating important details to notational conventions. This document uses pseudo-signatures, because they are concise and can be copied and pasted as the starting point for an implementation.
For example, Table 1 shows pseudo-signatures for a sortable type T.

Table 1: Pseudo-Signatures for Example Concept "sortable"

    Pseudo-Signature                          Semantics
    bool operator<(const T& x, const T& y)    Compare x and y.
    void swap(T& x, T& y)                     Swap x and y.

[1] See Section 3.2.3 of Concepts for C++0x, available at http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2005/n1758.pdf, for further discussion of valid expressions versus pseudo-signatures.

A real signature may differ from the pseudo-signature that it implements in ways where implicit conversions would deal with the difference. For an example type U, the real signature that implements operator< in Table 1 can be expressed as int operator<( U x, U y ), because C++ permits implicit conversion from int to bool, and implicit conversion from U to (const U&). Similarly, the real signature bool operator<( U& x, U& y ) is acceptable because C++ permits implicit addition of a const qualifier to a reference type.

2.2.2 Model

A type models a concept if it meets the requirements of the concept. For example, type int models the sortable concept in Table 1 if there exists a function swap(x,y) that swaps two int values x and y. The other requirement for sortable, specifically x [...] as tbb::version4::concurrent_hashmap and employs a using directive to inject it into namespace tbb. Your source code should reference it as tbb::concurrent_hashmap.

2.4.4 tbb::internal Namespace

Namespace tbb::internal serves a role similar to tbb::interfacex. It is retained for backwards compatibility with older versions of the library. Your code should never directly reference namespace tbb::internal. Indirect reference via a public typedef provided by the header files is permitted.

2.4.5 tbb::deprecated Namespace

The library uses the namespace tbb::deprecated for deprecated identifiers that have different default meanings in namespace tbb.
Compiling with TBB_DEPRECATED=1 causes such identifiers to replace their counterpart in namespace tbb.

For example, tbb::concurrent_queue underwent changes in Intel® TBB 2.2 that split its functionality into tbb::concurrent_queue and tbb::concurrent_bounded_queue and changed the names of some methods. For the sake of legacy code, the old Intel® TBB 2.1 functionality is retained in tbb::deprecated::concurrent_queue, which is injected into namespace tbb when compiled with TBB_DEPRECATED=1.

2.4.6 tbb::strict_ppl Namespace

The library uses the namespace tbb::strict_ppl for identifiers that are put in namespace Concurrency when tbb/compat/ppl.h is included.

2.4.7 std Namespace

The library implements some C++0x features in namespace std. The library version can be used by including the corresponding header in Table 3.

Table 3: C++0x Features Optionally Defined by Intel® Threading Building Blocks

    Header                           Identifiers Added to std::                   Section
    tbb/compat/condition_variable    defer_lock_t, try_to_lock_t, adopt_lock_t,   9.4
                                     defer_lock, try_to_lock, adopt_lock,
                                     lock_guard, unique_lock, swap [2],
                                     condition_variable, cv_status, timeout,
                                     no_timeout
    tbb/compat/thread                thread, this_thread                          14.1

[2] Adds swap of two unique_lock objects, not the general swap template function.

To prevent accidental linkage with other implementations of these C++ library features, the library defines the identifiers in other namespaces and injects them into namespace std::. This way the "mangled name" seen by the linker will differ from the "mangled name" generated by other implementations.

2.5 Thread Safety

Unless otherwise stated, the thread safety rules for the library are as follows:
• Two threads can invoke a method or function concurrently on different objects, but not the same object.
• It is unsafe for two threads to concurrently invoke methods or functions on the same object.

Descriptions of the classes note departures from this convention.
For example, the concurrent containers are more liberal. By their nature, they do permit some concurrent operations on the same container object.

3 Environment

This section describes features of Intel® Threading Building Blocks (Intel® TBB) that relate to general environment issues.

3.1 Version Information

Intel® TBB has macros, an environment variable, and a function that reveal version and run-time information.

3.1.1 Version Macros

The header tbb/tbb_stddef.h defines macros related to versioning, as described in Table 4. You should not redefine these macros.

Table 4: Version Macros

    Macro                               Description of Value
    TBB_INTERFACE_VERSION               Current interface version. The value is a decimal numeral of the form xyyy where x is the major version number and y is the minor version number.
    TBB_INTERFACE_VERSION_MAJOR         TBB_INTERFACE_VERSION/1000; that is, the major version number.
    TBB_COMPATIBLE_INTERFACE_VERSION    Oldest major interface version still supported.

3.1.2 TBB_VERSION Environment Variable

Set the environment variable TBB_VERSION to 1 to cause the library to print information on stderr. Each line is of the form "TBB: tag value", where tag and value are described in Table 5.

Table 5: Output from TBB_VERSION

    Tag                  Description of Value
    VERSION              Intel® TBB product version number.
    INTERFACE_VERSION    Value of macro TBB_INTERFACE_VERSION when library was compiled.
    BUILD_...            Various information about the machine configuration on which the library was built.
    TBB_USE_ASSERT       Setting of macro TBB_USE_ASSERT.
    DO_ITT_NOTIFY        1 if library can enable instrumentation for Intel® Parallel Studio and Intel® Threading Tools; 0 or undefined otherwise.
    ITT                  yes if library has enabled instrumentation for Intel® Parallel Studio and Intel® Threading Tools, no otherwise. Typically yes only if the program is running under control of Intel® Parallel Studio or Intel® Threading Tools.
    ALLOCATOR            Underlying allocator for tbb::tbb_allocator. It is scalable_malloc if the Intel® TBB malloc library was successfully loaded; malloc otherwise.

CAUTION: This output is implementation specific and may change at any time.

3.1.3 TBB_runtime_interface_version Function

Summary
Function that returns the interface version of the Intel® TBB library that was loaded at runtime.

Syntax
    extern "C" int TBB_runtime_interface_version();

Header
    #include "tbb/tbb_stddef.h"

Description
The value returned by TBB_runtime_interface_version() may differ from the value of TBB_INTERFACE_VERSION obtained at compile time. This can be used to identify whether an application was compiled against a compatible version of the Intel® TBB headers.

In general, the run-time value TBB_runtime_interface_version() must be greater than or equal to the compile-time value of TBB_INTERFACE_VERSION. Otherwise the application may fail to resolve all symbols at run time.

3.2 Enabling Debugging Features

Four macros control certain debugging features. In general, it is useful to compile with these features on for development code, and off for production code, because the features may decrease performance. Table 6 summarizes the macros and their default values. A value of 1 enables the corresponding feature; a value of 0 disables the feature.

Table 6: Debugging Macros

    Macro                           Default Value                                    Feature
    TBB_USE_DEBUG                   Windows* OS: 1 if _DEBUG is defined,             Default value for all other macros in this table.
                                    0 otherwise. All other systems: 0.
    TBB_USE_ASSERT                  TBB_USE_DEBUG                                    Enable internal assertion checking. Can significantly slow performance.
    TBB_USE_THREADING_TOOLS         TBB_USE_DEBUG                                    Enable full support for Intel® Parallel Studio and Intel® Threading Tools.
    TBB_USE_PERFORMANCE_WARNINGS    TBB_USE_DEBUG                                    Enable warnings about performance issues.

3.2.1 TBB_USE_ASSERT Macro

The macro TBB_USE_ASSERT controls whether error checking is enabled in the header files. Define TBB_USE_ASSERT as 1 to enable error checking.
If an error is detected, the library prints an error message on stderr and calls the standard C routine abort. To stop a program when internal error checking detects a failure, place a breakpoint on tbb::assertion_failure.

TIP: On Microsoft Windows* operating systems, debug builds implicitly set TBB_USE_ASSERT to 1 by default.

3.2.2 TBB_USE_THREADING_TOOLS Macro

The macro TBB_USE_THREADING_TOOLS controls support for Intel® Threading Tools:
• Intel® Parallel Inspector
• Intel® Parallel Amplifier
• Intel® Thread Profiler
• Intel® Thread Checker

Define TBB_USE_THREADING_TOOLS as 1 to enable full support for these tools. That is, full support is enabled if error checking is enabled. Leave TBB_USE_THREADING_TOOLS undefined or zero to enable top performance in release builds, at the expense of turning off some support for tools.

3.2.3 TBB_USE_PERFORMANCE_WARNINGS Macro

The macro TBB_USE_PERFORMANCE_WARNINGS controls performance warnings. Define it to be 1 to enable the warnings. Currently, the warnings affected are:
• Some that report poor hash functions for concurrent_hash_map. Enabling the warnings may impact performance.
• Misaligned 8-byte atomic stores on Intel® IA-32 processors.

3.3 Feature macros

Macros in this section control optional features in the library.

3.3.1 TBB_DEPRECATED macro

The macro TBB_DEPRECATED controls deprecated features that would otherwise conflict with non-deprecated use. Define it to be 1 to get deprecated Intel® TBB 2.1 interfaces. Appendix A describes deprecated features.

3.3.2 TBB_USE_EXCEPTIONS macro

The macro TBB_USE_EXCEPTIONS controls whether the library headers use exception-handling constructs such as try, catch, and throw. The headers do not use these constructs when TBB_USE_EXCEPTIONS=0.

For the Microsoft Windows*, Linux*, and MacOS* operating systems, the default value is 1 if exception handling constructs are enabled in the compiler, and 0 otherwise.
CAUTION: The runtime library may still throw an exception when TBB_USE_EXCEPTIONS=0.

3.3.3 TBB_USE_CAPTURED_EXCEPTION macro

The macro TBB_USE_CAPTURED_EXCEPTION controls rethrow of exceptions within the library. Because C++ 1998 does not support catching an exception on one thread and rethrowing it on another thread, the library sometimes resorts to rethrowing an approximation called tbb::captured_exception.
• Define TBB_USE_CAPTURED_EXCEPTION=1 to make the library rethrow an approximation. This is useful for uniform behavior across platforms.
• Define TBB_USE_CAPTURED_EXCEPTION=0 to request rethrow of the exact exception. This setting is valid only on platforms that support the std::exception_ptr feature of C++ 200x. Otherwise a compile-time diagnostic is issued.

The default value is 0 for supported host compilers with std::exception_ptr, and 1 otherwise.

Section 13 describes exception handling and TBB_USE_CAPTURED_EXCEPTION in more detail.

4 Algorithms

Most parallel algorithms provided by Intel® Threading Building Blocks (Intel® TBB) are generic. They operate on all types that model the necessary concepts. Parallel algorithms may be nested. For example, the body of a parallel_for can invoke another parallel_for.

CAUTION: When the body of an outer parallel algorithm invokes another parallel algorithm, it may cause the outer body to be re-entered for a different iteration of the outer algorithm.

For example, if the outer body holds a global lock while calling an inner parallel algorithm, the body will deadlock if the re-entrant invocation attempts to acquire the same global lock. This ill-formed example is a special case of a general rule that code should not hold a lock while calling code written by another author.

4.1 Splittable Concept

Summary
Requirements for a type whose instances can be split into two pieces.

Requirements
Table 7 lists the requirements for a splittable type X with instance x.
Table 7: Splittable Concept

    Pseudo-Signature      Semantics
    X::X(X& x, split)     Split x into x and newly constructed object.

Description
A type is splittable if it has a splitting constructor that allows an instance to be split into two pieces. The splitting constructor takes as arguments a reference to the original object, and a dummy argument of type split, which is defined by the library. The dummy argument distinguishes the splitting constructor from a copy constructor. After the constructor runs, x and the newly constructed object should represent the two pieces of the original x. The library uses splitting constructors in two contexts:
• Partitioning a range into two subranges that can be processed concurrently.
• Forking a body (function object) into two bodies that can run concurrently.

The following model types provide examples.

Model Types
blocked_range (4.2.1) and blocked_range2d (4.2.2) represent splittable ranges. For each of these, splitting partitions the range into two subranges. See the example in Section 4.2.1.3 for the splitting constructor of blocked_range.

The bodies for parallel_reduce (4.5) and parallel_scan (4.6) must be splittable. For each of these, splitting results in two bodies that can be run concurrently.

4.1.1 split Class

Summary
Type for dummy argument of a splitting constructor.

Syntax
    class split;

Header
    #include "tbb/tbb_stddef.h"

Description
An argument of type split is used to distinguish a splitting constructor from a copy constructor.

Members
    namespace tbb {
        class split {
        };
    }

4.2 Range Concept

Summary
Requirements for type representing a recursively divisible set of values.

Requirements
Table 8 lists the requirements for a Range type R.

Table 8: Range Concept

    Pseudo-Signature                Semantics
    R::R( const R& )                Copy constructor.
    R::~R()                         Destructor.
    bool R::empty() const           True if range is empty.
    bool R::is_divisible() const    True if range can be partitioned into two subranges.
    R::R( R& r, split )             Split r into two subranges.

Description
A Range can be recursively subdivided into two parts. It is recommended that the division be into nearly equal parts, but it is not required. Splitting as evenly as possible typically yields the best parallelism. Ideally, a range is recursively splittable until the parts represent portions of work that are more efficient to execute serially rather than split further. The amount of work represented by a Range typically depends upon higher level context, hence a typical type that models a Range should provide a way to control the degree of splitting. For example, the template class blocked_range (4.2.1) has a grainsize parameter that specifies the biggest range considered indivisible.

The constructor that implements splitting is called a splitting constructor. If the set of values has a sense of direction, then by convention the splitting constructor should construct the second part of the range, and update the argument to be the first half. Following this convention causes the parallel_for (4.4), parallel_reduce (4.5), and parallel_scan (4.6) algorithms, when running sequentially, to work across a range in the increasing order typical of an ordinary sequential loop.

Example
The following code defines a type TrivialIntegerRange that models the Range concept. It represents a half-open interval [lower,upper) that is divisible down to a single integer.

    struct TrivialIntegerRange {
        int lower;
        int upper;
        bool empty() const {return lower==upper;}
        bool is_divisible() const {return upper>lower+1;}
        TrivialIntegerRange( TrivialIntegerRange& r, split ) {
            int m = (r.lower+r.upper)/2;
            lower = m;
            upper = r.upper;
            r.upper = m;
        }
    };

TrivialIntegerRange is for demonstration and not very practical, because it lacks a grainsize parameter. Use the library class blocked_range instead.
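To see the splitting convention at work without the library, the sketch below repeats TrivialIntegerRange and splits it recursively on a single thread. It is only an illustration under stated assumptions: the local split struct stands in for tbb::split, and the two-argument constructor and the collect_leaves helper are hypothetical additions that are not part of the library or of the example above.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Local stand-in for tbb::split so the sketch compiles without the library (assumption).
struct split {};

struct TrivialIntegerRange {
    int lower;
    int upper;
    // Convenience constructor added for this sketch only.
    TrivialIntegerRange(int lo, int hi) : lower(lo), upper(hi) {}
    bool empty() const { return lower == upper; }
    bool is_divisible() const { return upper > lower + 1; }
    // Splitting constructor: the new object takes the second half,
    // and r is updated to be the first half.
    TrivialIntegerRange(TrivialIntegerRange& r, split) {
        int m = (r.lower + r.upper) / 2;
        lower = m;
        upper = r.upper;
        r.upper = m;
    }
};

// Hypothetical helper: split r until it is indivisible and record the
// resulting pieces, visiting the first half before the second half.
void collect_leaves(TrivialIntegerRange r,
                    std::vector<std::pair<int, int> >& leaves) {
    if (r.empty()) return;
    if (!r.is_divisible()) {
        leaves.push_back(std::make_pair(r.lower, r.upper));
        return;
    }
    TrivialIntegerRange second(r, split());  // r now holds the first half
    collect_leaves(r, leaves);
    collect_leaves(second, leaves);
}
```

Calling collect_leaves on the interval [0,5) records the five unit intervals [0,1) through [4,5) in increasing order, which is the left-to-right order that the splitting convention is meant to preserve when an algorithm runs sequentially.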
Model Types
Type blocked_range (4.2.1) models a one-dimensional range.
Type blocked_range2d (4.2.2) models a two-dimensional range.
Type blocked_range3d (4.2.3) models a three-dimensional range.
Concept Container Range (5.1) models a container as a range.

4.2.1 blocked_range Template Class

Summary
Template class for a recursively divisible half-open interval.

Syntax
    template<typename Value> class blocked_range;

Header
    #include "tbb/blocked_range.h"

Description
A blocked_range<Value> represents a half-open range [i,j) that can be recursively split. The types of i and j must model the requirements in Table 9. In the table, type D is the type of the expression "j-i". It can be any integral type that is convertible to size_t. Examples that model the Value requirements are integral types, pointers, and STL random-access iterators whose difference can be implicitly converted to a size_t.

A blocked_range models the Range concept (4.2).

Table 9: Value Concept for blocked_range

    Pseudo-Signature                                    Semantics
    Value::Value( const Value& )                        Copy constructor.
    Value::~Value()                                     Destructor.
    void [3] operator=( const Value& )                  Assignment.
    bool operator<( const Value& i, const Value& j )    Value i precedes value j.
    D operator-( const Value& i, const Value& j )       Number of values in range [i,j).
    Value operator+( const Value& i, D k )              kth value after i.

A blocked_range<Value> specifies a grainsize of type size_t. A blocked_range is splittable into two subranges if the size of the range exceeds the grain size. The ideal grain size depends upon the context of the blocked_range<Value>, which typically serves as the range argument to the loop templates parallel_for, parallel_reduce, or parallel_scan. A grainsize that is too small may cause scheduling overhead within the loop templates to swamp the speedup gained from parallelism. A grainsize that is too large may unnecessarily limit parallelism.
For example, if the grain size is so large that the range can be split only once, then the maximum possible parallelism is two.

Here is a suggested procedure for choosing grainsize:
1. Set the grainsize parameter to 10,000. This value is high enough to amortize scheduler overhead sufficiently for practically all loop bodies, but may unnecessarily limit parallelism.
2. Run your algorithm on one processor.
3. Start halving the grainsize parameter and see how much the algorithm slows down as the value decreases. A slowdown of about 5-10% is a good setting for most purposes.

TIP: For a blocked_range [i,j) where j [...] typically appears as a range argument to a loop template. See the examples for parallel_for (4.4), parallel_reduce (4.5), and parallel_scan (4.6).

[3] The return type void in the pseudo-signature denotes that operator= is not required to return a value. The actual operator= can return a value, which will be ignored by blocked_range.

Members
    namespace tbb {
        template<typename Value>
        class blocked_range {
        public:
            // types
            typedef size_t size_type;
            typedef Value const_iterator;

            // constructors
            blocked_range( Value begin, Value end, size_type grainsize=1 );
            blocked_range( blocked_range& r, split );

            // capacity
            size_type size() const;
            bool empty() const;

            // access
            size_type grainsize() const;
            bool is_divisible() const;

            // iterators
            const_iterator begin() const;
            const_iterator end() const;
        };
    }

4.2.1.1 size_type

Description
The type for measuring the size of a blocked_range. The type is always a size_t.

const_iterator

Description
The type of a value in the range. Despite its name, the type const_iterator is not necessarily an STL iterator; it merely needs to meet the Value requirements in Table 9. However, it is convenient to call it const_iterator so that if it is a const_iterator, then the blocked_range behaves like a read-only STL container.
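The effect of grainsize on the amount of available parallelism can be explored with a small stand-alone model. The sketch below is a rough illustration, not the library class: MiniBlockedRange, split_tag, and count_leaves are hypothetical names, and the model assumes only what this section states, namely that a range is divisible when its size exceeds the grainsize and that a split leaves the first half in the argument.

```cpp
#include <cassert>
#include <cstddef>

// Local stand-in for tbb::split (assumption: defined here so no library is needed).
struct split_tag {};

// Hypothetical miniature of blocked_range<int>; it follows the documented
// semantics but is not the library implementation.
class MiniBlockedRange {
    int my_begin;
    int my_end;
    std::size_t my_grainsize;
public:
    MiniBlockedRange(int b, int e, std::size_t g = 1)
        : my_begin(b), my_end(e), my_grainsize(g) {
        assert(g > 0);  // the grainsize must be positive
    }
    // Splitting constructor: the new object becomes [mid,end),
    // and r shrinks to [begin,mid).
    MiniBlockedRange(MiniBlockedRange& r, split_tag)
        : my_begin(r.my_begin + (r.my_end - r.my_begin) / 2),
          my_end(r.my_end),
          my_grainsize(r.my_grainsize) {
        assert(r.is_divisible());
        r.my_end = my_begin;
    }
    std::size_t size() const { return static_cast<std::size_t>(my_end - my_begin); }
    bool empty() const { return !(my_begin < my_end); }
    std::size_t grainsize() const { return my_grainsize; }
    bool is_divisible() const { return my_grainsize < size(); }
    int begin() const { return my_begin; }
    int end() const { return my_end; }
};

// Number of indivisible pieces obtained by splitting r as far as possible.
std::size_t count_leaves(MiniBlockedRange r) {
    if (!r.is_divisible()) return 1;
    MiniBlockedRange second(r, split_tag());
    return count_leaves(r) + count_leaves(second);
}
```

In this model, the range [0,100) yields a single piece with a grainsize of 100, two pieces with a grainsize of 50, and sixteen pieces with a grainsize of 10. The number of pieces is an upper bound on the parallelism a loop template could extract, which is why an overly large grainsize limits speedup.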
4.2.1.2 blocked_range( Value begin, Value end, size_t grainsize=1 )

Requirements
The parameter grainsize must be positive. The debug version of the library raises an assertion failure if this requirement is not met.

Effects
Constructs a blocked_range representing the half-open interval [begin,end) with the given grainsize.

Example
The statement "blocked_range<int> r( 5, 14, 2 );" constructs a range of int that contains the values 5 through 13 inclusive, with a grainsize of 2. Afterwards, r.begin()==5 and r.end()==14.

4.2.1.3 blocked_range( blocked_range& range, split )

Requirements
is_divisible() is true.

Effects
Partitions range into two subranges. The newly constructed blocked_range is approximately the second half of the original range, and range is updated to be the remainder. Each subrange has the same grainsize as the original range.

Example
Let i and j be integers that define a half-open interval [i,j) and let g specify a grain size. The statement blocked_range<int> r(i,j,g) constructs a blocked_range<int> that represents [i,j) with grain size g. Running the statement blocked_range<int> s(r,split()); subsequently causes r to represent [i, i+(j-i)/2) and s to represent [i+(j-i)/2, j), both with grain size g.

4.2.1.4 size_type size() const

Requirements
end() [...] grainsize(); false otherwise.

4.2.1.8 const_iterator begin() const

Returns
Inclusive lower bound on range.

4.2.1.9 const_iterator end() const

Returns
Exclusive upper bound on range.

4.2.2 blocked_range2d Template Class

Summary
Template class that represents a recursively divisible two-dimensional half-open interval.

Syntax
    template<typename RowValue, typename ColValue=RowValue> class blocked_range2d;

Header
    #include "tbb/blocked_range2d.h"

Description
A blocked_range2d represents a half-open two-dimensional range [i0,j0)×[i1,j1). Each axis of the range has its own splitting threshold. The RowValue and ColValue must meet the requirements in Table 9.
A blocked_range2d is splittable if either axis is splittable. A blocked_range2d models the Range concept (4.2).

Members
    namespace tbb {
        template<typename RowValue, typename ColValue=RowValue>
        class blocked_range2d {
        public:
            // Types
            typedef blocked_range<RowValue> row_range_type;
            typedef blocked_range<ColValue> col_range_type;

            // Constructors
            blocked_range2d( RowValue row_begin, RowValue row_end,
                             typename row_range_type::size_type row_grainsize,
                             ColValue col_begin, ColValue col_end,
                             typename col_range_type::size_type col_grainsize );
            blocked_range2d( RowValue row_begin, RowValue row_end,
                             ColValue col_begin, ColValue col_end );
            blocked_range2d( blocked_range2d& r, split );

            // Capacity
            bool empty() const;

            // Access
            bool is_divisible() const;
            const row_range_type& rows() const;
            const col_range_type& cols() const;
        };
    }

Example
The code that follows shows a serial matrix multiply, and the corresponding parallel matrix multiply that uses a blocked_range2d to specify the iteration space.

    #include "tbb/parallel_for.h"
    #include "tbb/blocked_range2d.h"

    using namespace tbb;

    const size_t L = 150;
    const size_t M = 225;
    const size_t N = 300;

    void SerialMatrixMultiply( float c[M][N], float a[M][L], float b[L][N] ) {
        for( size_t i=0; i<M; ++i ) {
            for( size_t j=0; j<N; ++j ) {
                float sum = 0;
                for( size_t k=0; k<L; ++k )
                    sum += a[i][k]*b[k][j];
                c[i][j] = sum;
            }
        }
    }

    class MatrixMultiplyBody2D {
        float (*my_a)[L];
        float (*my_b)[N];
        float (*my_c)[N];
    public:
        void operator()( const blocked_range2d<size_t>& r ) const {
            float (*a)[L] = my_a;
            float (*b)[N] = my_b;
            float (*c)[N] = my_c;
            for( size_t i=r.rows().begin(); i!=r.rows().end(); ++i ){
                for( size_t j=r.cols().begin(); j!=r.cols().end(); ++j ) {
                    float sum = 0;
                    for( size_t k=0; k<L; ++k )
                        sum += a[i][k]*b[k][j];
                    c[i][j] = sum;
                }
            }
        }
        MatrixMultiplyBody2D( float c[M][N], float a[M][L], float b[L][N] ) :
            my_a(a), my_b(b), my_c(c)
        {}
    };

    void ParallelMatrixMultiply( float c[M][N], float a[M][L], float b[L][N] ) {
        parallel_for( blocked_range2d<size_t>(0, M, 16, 0, N, 32),
                      MatrixMultiplyBody2D(c,a,b) );
    }

The blocked_range2d enables the two outermost loops of the serial version to become parallel loops. The parallel_for recursively splits the blocked_range2d until the pieces are no larger than 16×32. It invokes MatrixMultiplyBody2D::operator() on each piece.

4.2.2.1 row_range_type
Description
A blocked_range<RowValue>. That is, the type of the row values.

4.2.2.2 col_range_type
Description
A blocked_range<ColValue>. That is, the type of the column values.

4.2.2.3 blocked_range2d( RowValue row_begin, RowValue row_end, typename row_range_type::size_type row_grainsize, ColValue col_begin, ColValue col_end, typename col_range_type::size_type col_grainsize )
Effects
Constructs a blocked_range2d representing a two-dimensional space of values. The space is the half-open Cartesian product [row_begin,row_end)×[col_begin,col_end), with the given grain sizes for the rows and columns.
Example
The statement "blocked_range2d<char,int> r('a', 'z'+1, 3, 0, 10, 2 );" constructs a two-dimensional space that contains all value pairs of the form (i, j), where i ranges from 'a' to 'z' with a grain size of 3, and j ranges from 0 to 9 with a grain size of 2.

4.2.2.4 blocked_range2d( RowValue row_begin, RowValue row_end, ColValue col_begin, ColValue col_end )
Effects
Same as blocked_range2d(row_begin,row_end,1,col_begin,col_end,1).

4.2.2.5 blocked_range2d( blocked_range2d& range, split )
Effects
Partitions range into two subranges. The newly constructed blocked_range2d is approximately the second half of the original range, and range is updated to be the remainder. Each subrange has the same grain size as the original range. The split is either by rows or by columns. The choice of which axis to split is intended to cause, after repeated splitting, the subranges to approach the aspect ratio of the respective row and column grain sizes. For example, if the row_grainsize is twice col_grainsize, the subranges will tend towards having twice as many rows as columns.

4.2.2.6 bool empty() const
Effects
Determines if range is empty.
Returns
rows().empty()||cols().empty()

4.2.2.7 bool is_divisible() const
Effects
Determines if range can be split into subranges.
Returns
rows().is_divisible()||cols().is_divisible()

4.2.2.8 const row_range_type& rows() const
Returns
Range containing the rows of the value space.
4.2.2.9 const col_range_type& cols() const
Returns
Range containing the columns of the value space.

4.2.3 blocked_range3d Template Class

Summary
Template class that represents a recursively divisible three-dimensional half-open interval.

Syntax
    template<typename PageValue, typename RowValue=PageValue, typename ColValue=RowValue> class blocked_range3d;

Header
    #include "tbb/blocked_range3d.h"

Description
A blocked_range3d is the three-dimensional extension of blocked_range2d.

Members
    namespace tbb {
        template<typename PageValue, typename RowValue=PageValue, typename ColValue=RowValue>
        class blocked_range3d {
        public:
            // Types
            typedef blocked_range<PageValue> page_range_type;
            typedef blocked_range<RowValue> row_range_type;
            typedef blocked_range<ColValue> col_range_type;

            // Constructors
            blocked_range3d( PageValue page_begin, PageValue page_end,
                             typename page_range_type::size_type page_grainsize,
                             RowValue row_begin, RowValue row_end,
                             typename row_range_type::size_type row_grainsize,
                             ColValue col_begin, ColValue col_end,
                             typename col_range_type::size_type col_grainsize );
            blocked_range3d( PageValue page_begin, PageValue page_end,
                             RowValue row_begin, RowValue row_end,
                             ColValue col_begin, ColValue col_end );
            blocked_range3d( blocked_range3d& r, split );

            // Capacity
            bool empty() const;

            // Access
            bool is_divisible() const;
            const page_range_type& pages() const;
            const row_range_type& rows() const;
            const col_range_type& cols() const;
        };
    }

4.3 Partitioners

Summary
A partitioner specifies how a loop template should partition its work among threads.

Description
The default behavior of the loop templates parallel_for (4.4), parallel_reduce (4.5), and parallel_scan (4.6) tries to recursively split a range into enough parts to keep processors busy, not necessarily splitting as finely as possible. An optional partitioner parameter enables other behaviors to be specified, as shown in Table 10. The first column of the table shows how the formal parameter is declared in the loop templates. An affinity_partitioner is passed by non-const reference because it is updated to remember where loop iterations run.
Table 10: Partitioners

Partitioner: const auto_partitioner& (default) [4]
Loop Behavior: Performs sufficient splitting to balance load, not necessarily splitting as finely as Range::is_divisible permits. When used with classes such as blocked_range, the selection of an appropriate grainsize is less important, and often acceptable performance can be achieved with the default grain size of 1.

Partitioner: affinity_partitioner&
Loop Behavior: Similar to auto_partitioner, but improves cache affinity by its choice of mapping subranges to worker threads. It can improve performance significantly when a loop is re-executed over the same data set, and the data set fits in cache.

Partitioner: const simple_partitioner&
Loop Behavior: Recursively splits a range until it is no longer divisible. The Range::is_divisible function is wholly responsible for deciding when recursive splitting halts. When used with classes such as blocked_range, the selection of an appropriate grainsize is critical to enabling concurrency while limiting overheads (see the discussion in Section 4.2.1).

[4] In Intel® TBB 2.1, simple_partitioner was the default. Intel® TBB 2.2 changed the default to auto_partitioner to simplify common usage of the loop templates. To get the old default, compile with the preprocessor symbol TBB_DEPRECATED=1.

4.3.1 auto_partitioner Class

Summary
Specify that a parallel loop should optimize its range subdivision based on work-stealing events.

Syntax
    class auto_partitioner;

Header
    #include "tbb/partitioner.h"

Description
A loop template with an auto_partitioner attempts to minimize range splitting while providing ample opportunities for work-stealing. The range subdivision is initially limited to S subranges, where S is proportional to the number of threads specified by the task_scheduler_init (12.2.1). Each of these subranges is not divided further unless it is stolen by an idle thread. If stolen, it is further subdivided to create additional subranges.
Thus a loop template with an auto_partitioner creates additional subranges only when necessary to balance load.

TIP: When using auto_partitioner and a blocked_range for a parallel loop, the body may be passed a subrange larger than the blocked_range's grainsize. Therefore do not assume that the grainsize is an upper bound on the size of the subrange. Use a simple_partitioner if an upper bound is required.

Members
    namespace tbb {
        class auto_partitioner {
        public:
            auto_partitioner();
            ~auto_partitioner();
        };
    }

4.3.1.1 auto_partitioner()
Construct an auto_partitioner.

4.3.1.2 ~auto_partitioner()
Destroy this auto_partitioner.

4.3.2 affinity_partitioner

Summary
Hint that loop iterations should be assigned to threads in a way that optimizes for cache affinity.

Syntax
    class affinity_partitioner;

Header
    #include "tbb/partitioner.h"

Description
An affinity_partitioner hints that execution of a loop template should assign iterations to the same processors as another execution of the loop (or another loop) with the same affinity_partitioner object. Unlike the other partitioners, it is important that the same affinity_partitioner object be passed to the loop templates to be optimized for affinity. The Tutorial (Section 3.2.3 "Bandwidth and Cache Affinity") discusses affinity effects in detail.

TIP: The affinity_partitioner generally improves performance only when:
• The computation does a few operations per data access.
• The data acted upon by the loop fits in cache.
• The loop, or a similar loop, is re-executed over the same data.
• There are more than two hardware threads available.

Members
    namespace tbb {
        class affinity_partitioner {
        public:
            affinity_partitioner();
            ~affinity_partitioner();
        };
    }

Example
The following example can benefit from cache affinity. The example simulates a one-dimensional additive automaton.
    #include "tbb/blocked_range.h"
    #include "tbb/parallel_for.h"
    #include "tbb/partitioner.h"

    using namespace tbb;

    const int N = 1000000;
    typedef unsigned char Cell;
    Cell Array[2][N];
    int FlipFlop;

    struct TimeStepOverSubrange {
        void operator()( const blocked_range<int>& r ) const {
            int j = r.end();
            const Cell* x = Array[FlipFlop];
            Cell* y = Array[!FlipFlop];
            for( int i=r.begin(); i!=j; ++i )
                y[i] = x[i]^x[i+1];
        }
    };

    void DoAllTimeSteps( int m ) {
        affinity_partitioner ap;
        for( int k=0; k<m; ++k ) {
            parallel_for( blocked_range<int>(0,N-1),
                          TimeStepOverSubrange(),
                          ap );
            FlipFlop ^= 1;
        }
    }

For each time step, the old state of the automaton is read from Array[FlipFlop], and the new state is written into Array[!FlipFlop]. Then FlipFlop flips to make the new state become the old state. The aggregate size of both states is about 2 MByte, which fits in most modern processors' cache. Improvements ranging from 50% to 200% have been observed for this example on 8-core machines, compared with using an auto_partitioner instead.

The affinity_partitioner must live between loop iterations. The example accomplishes this by declaring it outside the loop that executes all iterations. An alternative would be to declare the affinity_partitioner at file scope, which works as long as DoAllTimeSteps itself is not invoked concurrently. The same instance of affinity_partitioner should not be passed to two parallel algorithm templates that are invoked concurrently. Use separate instances instead.

4.3.2.1 affinity_partitioner()
Construct an affinity_partitioner.

4.3.2.2 ~affinity_partitioner()
Destroy this affinity_partitioner.

4.3.3 simple_partitioner Class

Summary
Specify that a parallel loop should recursively split its range until it cannot be subdivided further.

Syntax
    class simple_partitioner;

Header
    #include "tbb/partitioner.h"

Description
A simple_partitioner specifies that a loop template should recursively divide its range until for each subrange r, the condition !r.is_divisible() holds.
This was the default behavior of the loop templates that take a range argument in Intel® TBB 2.1; since Intel® TBB 2.2 the default is auto_partitioner (see Table 10).

TIP: When using simple_partitioner and a blocked_range for a parallel loop, be careful to specify an appropriate grainsize for the blocked_range. The default grainsize is 1, which may make the subranges much too small for efficient execution.

Members
    namespace tbb {
        class simple_partitioner {
        public:
            simple_partitioner();
            ~simple_partitioner();
        };
    }

4.3.3.1 simple_partitioner()
Construct a simple_partitioner.

4.3.3.2 ~simple_partitioner()
Destroy this simple_partitioner.

4.4 parallel_for Template Function

Summary
Template function that performs parallel iteration over a range of values.

Syntax
    template<typename Index, typename Func>
    Func parallel_for( Index first, Index_type last, const Func& f
                       [, task_group_context& group] );

    template<typename Index, typename Func>
    Func parallel_for( Index first, Index_type last, Index step, const Func& f
                       [, task_group_context& group] );

    template<typename Range, typename Body>
    void parallel_for( const Range& range, const Body& body
                       [, partitioner[, task_group_context& group]] );

where the optional partitioner declares any of the partitioners as shown in column 1 of Table 10.

Header
    #include "tbb/parallel_for.h"

Description
A parallel_for(first,last,step,f) represents parallel execution of the loop:

    for( auto i=first; i<last; i+=step ) f(i);

[...]

    struct Average {
        const float* input;
        float* output;
        void operator()( const blocked_range<int>& range ) const {
            for( int i=range.begin(); i!=range.end(); ++i )
                output[i] = (input[i-1]+input[i]+input[i+1])*(1/3.f);
        }
    };

    // Note: Reads input[0..n] and writes output[1..n-1].
    void ParallelAverage( float* output, const float* input, size_t n ) {
        Average avg;
        avg.input = input;
        avg.output = output;
        parallel_for( blocked_range<int>( 1, n ), avg );
    }

Example
This example is more complex and requires familiarity with STL. It shows the power of parallel_for beyond flat iteration spaces. The code performs a parallel merge of two sorted sequences. It works for any sequence with a random-access iterator. The algorithm (Akl 1987) works recursively as follows:
1. If the sequences are too short for effective use of parallelism, do a sequential merge. Otherwise perform steps 2-6.
2. Swap the sequences if necessary, so that the first sequence [begin1,end1) is at least as long as the second sequence [begin2,end2).
3. Set m1 to the middle position in [begin1,end1). Call the item at that location key.
4. Set m2 to where key would fall in [begin2,end2).
5. Merge [begin1,m1) and [begin2,m2) to create the first part of the merged sequence.
6. Merge [m1,end1) and [m2,end2) to create the second part of the merged sequence.

The Intel® Threading Building Blocks implementation of this algorithm uses the range object to perform most of the steps. Predicate is_divisible performs the test in step 1. The splitting constructor performs the swap of step 2 and then steps 3-6. The body object does the sequential merges.

    #include "tbb/parallel_for.h"
    #include <algorithm>

    using namespace tbb;

    template<typename Iterator>
    struct ParallelMergeRange {
        static size_t grainsize;
        Iterator begin1, end1; // [begin1,end1) is 1st sequence to be merged
        Iterator begin2, end2; // [begin2,end2) is 2nd sequence to be merged
        Iterator out;          // where to put merged sequence
        bool empty() const {return (end1-begin1)+(end2-begin2)==0;}
        bool is_divisible() const {
            return std::min( end1-begin1, end2-begin2 ) > grainsize;
        }
        ParallelMergeRange( ParallelMergeRange& r, split ) {
            if( r.end1-r.begin1 < r.end2-r.begin2 ) {
                std::swap(r.begin1,r.begin2);
                std::swap(r.end1,r.end2);
            }
            Iterator m1 = r.begin1 + (r.end1-r.begin1)/2;
            Iterator m2 = std::lower_bound( r.begin2, r.end2, *m1 );
            begin1 = m1;
            begin2 = m2;
            end1 = r.end1;
            end2 = r.end2;
            out = r.out + (m1-r.begin1) + (m2-r.begin2);
            r.end1 = m1;
            r.end2 = m2;
        }
        ParallelMergeRange( Iterator begin1_, Iterator end1_,
                            Iterator begin2_, Iterator end2_,
                            Iterator out_ ) :
            begin1(begin1_), end1(end1_),
            begin2(begin2_), end2(end2_), out(out_)
        {}
    };

    template<typename Iterator>
    size_t ParallelMergeRange<Iterator>::grainsize = 1000;

    template<typename Iterator>
    struct ParallelMergeBody {
        void operator()( ParallelMergeRange<Iterator>& r ) const {
            std::merge( r.begin1, r.end1, r.begin2, r.end2, r.out );
        }
    };

    template<typename Iterator>
    void ParallelMerge( Iterator begin1, Iterator end1,
                        Iterator begin2, Iterator end2, Iterator out ) {
        parallel_for(
            ParallelMergeRange<Iterator>(begin1,end1,begin2,end2,out),
            ParallelMergeBody<Iterator>(),
            simple_partitioner()
        );
    }

Because the algorithm moves many locations, it tends to be bandwidth limited. Speedup varies, depending upon the system.

4.5 parallel_reduce Template Function

Summary
Computes reduction over a range.

Syntax
    template<typename Range, typename Value, typename Func, typename Reduction>
    Value parallel_reduce( const Range& range, const Value& identity,
                           const Func& func, const Reduction& reduction
                           [, partitioner[, task_group_context& group]] );

    template<typename Range, typename Body>
    void parallel_reduce( const Range& range, const Body& body
                          [, partitioner[, task_group_context& group]] );

where the optional partitioner declares any of the partitioners as shown in column 1 of Table 10.

Header
    #include "tbb/parallel_reduce.h"

Description
The parallel_reduce template has two forms. The functional form is designed to be easy to use in conjunction with lambda expressions. The imperative form is designed to minimize copying of data.

The functional form parallel_reduce(range,identity,func,reduction) performs a parallel reduction by applying func to subranges in range and reducing the results using binary operator reduction. It returns the result of the reduction. Parameters func and reduction can be lambda expressions. Table 12 summarizes the type requirements on the types of identity, func, and reduction.

Table 12: Requirements for Func and Reduction

Pseudo-Signature: Value Identity;
Semantics: Left identity element for Func::operator().

Pseudo-Signature: Value Func::operator()(const Range& range, const Value& x)
Semantics: Accumulate result for subrange, starting with initial value x.

Pseudo-Signature: Value Reduction::operator()(const Value& x, const Value& y);
Semantics: Combine results x and y.
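The roles of identity, func, and reduction in Table 12 can be illustrated with a serial emulation. This is only a sketch under stated assumptions, not how the library schedules work: reduce_sketch and sum_sketch are hypothetical names, the range is a plain index interval split at its midpoint, and every leaf starts from the identity value.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Serial emulation of the functional form: split [begin,end) recursively,
// apply func to each leaf starting from identity, and combine neighbouring
// results with reduction (left operand first).
template <typename Value, typename Func, typename Reduction>
Value reduce_sketch(std::size_t begin, std::size_t end, std::size_t grainsize,
                    const Value& identity, const Func& func,
                    const Reduction& reduction) {
    assert(grainsize > 0);
    if (end - begin <= grainsize) {
        return func(begin, end, identity);  // leaf: accumulate from identity
    }
    std::size_t mid = begin + (end - begin) / 2;
    Value left = reduce_sketch(begin, mid, grainsize, identity, func, reduction);
    Value right = reduce_sketch(mid, end, grainsize, identity, func, reduction);
    return reduction(left, right);
}

// Sums a vector using the emulation above.
float sum_sketch(const std::vector<float>& v, std::size_t grainsize) {
    return reduce_sketch<float>(
        0, v.size(), grainsize, 0.0f,
        [&v](std::size_t b, std::size_t e, float init) -> float {
            for (std::size_t i = b; i != e; ++i) init += v[i];
            return init;
        },
        [](float x, float y) -> float { return x + y; });
}
```

Because each leaf result is seeded with identity and results are combined left to right, the final value matches a serial accumulation whenever the operation is associative and identity is a left identity for it.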
The imperative form parallel_reduce(range,body) performs parallel reduction of body over each value in range. Type Range must model the Range concept (4.2). The body must model the requirements in Table 13.

Table 13: Requirements for parallel_reduce Body

Pseudo-Signature: Body::Body( Body&, split );
Semantics: Splitting constructor (4.1). Must be able to run concurrently with operator() and method join.

Pseudo-Signature: Body::~Body()
Semantics: Destructor.

Pseudo-Signature: void Body::operator()(const Range& range);
Semantics: Accumulate result for subrange.

Pseudo-Signature: void Body::join( Body& rhs );
Semantics: Join results. The result in rhs should be merged into the result of this.

A parallel_reduce recursively splits the range into subranges to the point such that is_divisible() is false for each subrange. A parallel_reduce uses the splitting constructor to make one or more copies of the body for each thread. It may copy a body while the body's operator() or method join runs concurrently. You are responsible for ensuring the safety of such concurrency. In typical usage, the safety requires no extra effort.

When worker threads are available (12.2.1), parallel_reduce invokes the splitting constructor for the body. For each such split of the body, it invokes method join in order to merge the results from the bodies. Define join to update this to represent the accumulated result for this and rhs. The reduction operation should be associative, but does not have to be commutative. For a noncommutative operation op, "left.join(right)" should update left to be the result of left op right.

A body is split only if the range is split, but the converse is not necessarily so. Figure 1 diagrams a sample execution of parallel_reduce. The root represents the original body b0 being applied to the half-open interval [0,20). The range is recursively split at each level into two subranges. The grain size for the example is 5, which yields four leaf ranges.
The slash marks (/) denote where copies (b1 and b2) of the body were created by the body splitting constructor. Bodies b0 and b1 each evaluate one leaf. Body b2 evaluates leaves [10,15) and [15,20), in that order. On the way back up the tree, parallel_reduce invokes b0.join(b1) and b0.join(b2) to merge the results of the leaves.

                      b0 [0,20)
            b0 [0,10)           b2 [10,20)
      b0 [0,5)   b1 [5,10)   b2 [10,15)   b2 [15,20)

Figure 1: Execution of parallel_reduce over blocked_range<int>(0,20,5)

Figure 1 shows only one possible execution. Other valid executions include splitting b2 into b2 and b3, or doing no splitting at all. With no splitting, b0 evaluates each leaf in left to right order, with no calls to join. A given body always evaluates one or more subranges in left to right order. For example, in Figure 1, body b2 is guaranteed to evaluate [10,15) before [15,20). You may rely on the left to right property for a given instance of a body. However, you must rely neither on a particular choice of body splitting nor on the subranges processed by a given body object being consecutive. parallel_reduce makes the choice of body splitting nondeterministically.

                      b0 [0,20)
            b0 [0,10)           b0 [10,20)
      b0 [0,5)   b1 [5,10)   b0 [10,15)   b0 [15,20)

Figure 2: Example Where Body b0 Processes Non-consecutive Subranges.

The subranges evaluated by a given body are not consecutive if there is an intervening join. The joined information represents processing of a gap between evaluated subranges. Figure 2 shows such an example. The body b0 performs the following sequence of operations:

    b0( [0,5) )
    b0.join( b1 )    where b1 has already processed [5,10)
    b0( [10,15) )
    b0( [15,20) )

In other words, body b0 gathers information about all the leaf subranges in left to right order, either by directly processing each leaf, or by a join operation on a body that gathered information about one or more leaves in a similar way.
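The execution of Figure 1 can be replayed by hand to see why join must preserve left-to-right order. The sketch below is a standalone illustration and does not use the library: TraceBody, SplitTag, and replay_figure1 are hypothetical names, and string concatenation stands in for an associative but noncommutative operation.

```cpp
#include <cassert>
#include <string>

struct SplitTag {};  // hypothetical stand-in for the library's split tag

// Body that records the leaves it has covered, in order.
struct TraceBody {
    std::string trace;
    TraceBody() {}
    // Splitting constructor: the new body starts from the identity (empty string).
    TraceBody(TraceBody&, SplitTag) {}
    // Accumulate the result for one leaf subrange [begin,end).
    void operator()(int begin, int end) {
        trace += "[" + std::to_string(begin) + "," + std::to_string(end) + ")";
    }
    // this becomes (this op rhs); the operand order matters here.
    void join(TraceBody& rhs) { trace += rhs.trace; }
};

// Replays the particular execution drawn in Figure 1.
std::string replay_figure1() {
    TraceBody b0;
    TraceBody b2(b0, SplitTag());  // copy created for [10,20)
    TraceBody b1(b0, SplitTag());  // copy created for [5,10)
    b0(0, 5);
    b1(5, 10);
    b2(10, 15);
    b2(15, 20);
    b0.join(b1);
    b0.join(b2);
    return b0.trace;
}
```

The replay returns the leaves in their original order, "[0,5)[5,10)[10,15)[15,20)". Swapping the operands inside join would scramble that order, which is why "left.join(right)" must compute left op right.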
When no worker threads are available, parallel_reduce executes sequentially from left to right in the same sense as for parallel_for (4.4). Sequential execution never invokes the splitting constructor or method join.

All overloads can be passed a task_group_context object so that the algorithm's tasks are executed in this group. By default the algorithm is executed in a bound group of its own.

Complexity
If the range and body take O(1) space, and the range splits into nearly equal pieces, then the space complexity is O(P log(N)), where N is the size of the range and P is the number of threads.

Example (Imperative Form)
The following code sums the values in an array.

    #include "tbb/parallel_reduce.h"
    #include "tbb/blocked_range.h"

    using namespace tbb;

    struct Sum {
        float value;
        Sum() : value(0) {}
        Sum( Sum& s, split ) {value = 0;}
        void operator()( const blocked_range<float*>& r ) {
            float temp = value;
            for( float* a=r.begin(); a!=r.end(); ++a ) {
                temp += *a;
            }
            value = temp;
        }
        void join( Sum& rhs ) {value += rhs.value;}
    };

    float ParallelSum( float array[], size_t n ) {
        Sum total;
        parallel_reduce( blocked_range<float*>( array, array+n ), total );
        return total.value;
    }

The example generalizes to reduction for any associative operation op as follows:
• Replace occurrences of 0 with the identity element for op.
• Replace occurrences of += with op= or its logical equivalent.
• Change the name Sum to something more appropriate for op.
The operation may be noncommutative. For example, op could be matrix multiplication.

Example with Lambda Expressions
The following is analogous to the previous example, but written using lambda expressions and the functional form of parallel_reduce.
    #include "tbb/parallel_reduce.h"
    #include "tbb/blocked_range.h"

    using namespace tbb;

    float ParallelSum( float array[], size_t n ) {
        return parallel_reduce(
            blocked_range<float*>( array, array+n ),
            0.f,
            [](const blocked_range<float*>& r, float init)->float {
                for( float* a=r.begin(); a!=r.end(); ++a )
                    init += *a;
                return init;
            },
            []( float x, float y )->float {
                return x+y;
            }
        );
    }

STL generalized numeric operations and function objects can be used to write the example more compactly as follows:

    #include <numeric>
    #include <functional>
    #include "tbb/parallel_reduce.h"
    #include "tbb/blocked_range.h"

    using namespace tbb;

    float ParallelSum( float array[], size_t n ) {
        return parallel_reduce(
            blocked_range<float*>( array, array+n ),
            0.f,
            [](const blocked_range<float*>& r, float value)->float {
                return std::accumulate(r.begin(),r.end(),value);
            },
            std::plus<float>()
        );
    }

4.6 parallel_scan Template Function

Summary
Template function that computes parallel prefix.

Syntax
    template<typename Range, typename Body>
    void parallel_scan( const Range& range, Body& body );

    template<typename Range, typename Body>
    void parallel_scan( const Range& range, Body& body, const auto_partitioner& );

    template<typename Range, typename Body>
    void parallel_scan( const Range& range, Body& body, const simple_partitioner& );

Header
    #include "tbb/parallel_scan.h"

Description
A parallel_scan(range,body) computes a parallel prefix, also known as parallel scan. This computation is an advanced concept in parallel computing that is sometimes useful in scenarios that appear to have inherently serial dependences.

A mathematical definition of the parallel prefix is as follows. Let ⊕ be an associative operation with left-identity element id⊕. The parallel prefix of ⊕ over a sequence x[0], x[1], ..., x[n-1] is a sequence y[0], y[1], y[2], ..., y[n-1] where:
• y[0] = id⊕ ⊕ x[0]
• y[i] = y[i-1] ⊕ x[i] for 0 < i < n

For example, if ⊕ is addition, the parallel prefix corresponds to a running sum. A serial implementation of parallel prefix is:

    T temp = id⊕;
    for( int i=0; i<n; ++i ) {
        temp = temp ⊕ x[i];
        y[i] = temp;
    }

Parallel prefix performs this in parallel by reassociating the application of ⊕ and using two passes. It may invoke ⊕ up to twice as many times as the serial prefix algorithm. Given the right grain size and sufficient hardware threads, it can outperform the serial prefix because even though it does more work, it can distribute the work across more than one hardware thread.

TIP: Because parallel_scan needs two passes, systems with only two hardware threads tend to exhibit small speedup. parallel_scan is best considered a glimpse of a technique for future systems with more than two cores. It is nonetheless of interest because it shows how a problem that appears inherently sequential can be parallelized.

The template parallel_scan implements parallel prefix generically. It requires the signatures described in Table 14.

Table 14: parallel_scan Requirements

Pseudo-Signature: void Body::operator()( const Range& r, pre_scan_tag )
Semantics: Accumulate summary for range r.

Pseudo-Signature: void Body::operator()( const Range& r, final_scan_tag )
Semantics: Compute scan result and summary for range r.

Pseudo-Signature: Body::Body( Body& b, split )
Semantics: Split b so that this and b can accumulate summaries separately. Body *this is object a in the table row below.

Pseudo-Signature: void Body::reverse_join( Body& a )
Semantics: Merge summary accumulated by a into summary accumulated by this, where this was created earlier from a by a's splitting constructor. Body *this is object b in the table row above.

Pseudo-Signature: void Body::assign( Body& b )
Semantics: Assign summary of b to this.

A summary contains enough information such that for two consecutive subranges r and s:
• If r has no preceding subrange, the scan result for s can be computed from knowing s and the summary for r.
• A summary of r concatenated with s can be computed from the summaries of r and s.
For example, if computing a running sum of an array, the summary for a range r is the sum of the array elements corresponding to r.
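The two properties of a summary are what make a two-pass scan possible. The following standalone sketch shows the idea for a running sum over fixed-size chunks. It is an illustration under stated assumptions rather than the library's algorithm: chunked_running_sum is a hypothetical name, the chunking is fixed instead of being chosen by a partitioner, and all loops run serially here even though the per-chunk loops of each pass are independent of one another.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Two-pass running sum over fixed-size chunks.
std::vector<int> chunked_running_sum(const std::vector<int>& x, std::size_t chunk) {
    assert(chunk > 0);
    const std::size_t n = x.size();
    const std::size_t nchunks = (n + chunk - 1) / chunk;

    // Pass 1 (pre-scan): the summary of a chunk is the sum of its elements.
    // No output is written in this pass.
    std::vector<int> summary(nchunks, 0);
    for (std::size_t c = 0; c < nchunks; ++c)
        for (std::size_t i = c * chunk; i < n && i < (c + 1) * chunk; ++i)
            summary[c] += x[i];

    // Combine summaries: carry[c] summarizes everything to the left of chunk c.
    std::vector<int> carry(nchunks, 0);
    for (std::size_t c = 1; c < nchunks; ++c)
        carry[c] = carry[c - 1] + summary[c - 1];

    // Pass 2 (final scan): each chunk starts from its carry and writes y.
    std::vector<int> y(n);
    for (std::size_t c = 0; c < nchunks; ++c) {
        int temp = carry[c];
        for (std::size_t i = c * chunk; i < n && i < (c + 1) * chunk; ++i) {
            temp += x[i];
            y[i] = temp;
        }
    }
    return y;
}
```

Each element is added twice, once per pass, which is the "up to twice as many" invocations of the operation mentioned above. For the input 1, 2, ..., 16 with chunks of four, the last outputs of the four chunks are 10, 36, 78, and 136.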
Figure 3 shows one way that parallel_scan might compute the running sum of an array containing the integers 1-16. Time flows downwards in the diagram. Each color denotes a separate Body object. Summaries are shown in brackets.
1. The first two steps split the original blue body into the pink and yellow bodies. Each body operates on a quarter of the input array in parallel. The last quarter is processed later in step 5.
2. The blue body computes the final scan and summary for 1-4. The pink and yellow bodies compute their summaries by prescanning 5-8 and 9-12 respectively.
3. The pink body computes its summary for 1-8 by performing a reverse_join with the blue body.
4. The yellow body computes its summary for 1-12 by performing a reverse_join with the pink body.
5. The blue, pink, and yellow bodies compute final scans and summaries for portions of the array.
6. The yellow summary is assigned to the blue body. The pink and yellow bodies are destroyed.

Note that two quarters of the array were not prescanned. The parallel_scan template makes an effort to avoid prescanning where possible, to improve performance when there are only a few or no extra worker threads. If no other workers are available, parallel_scan processes the subranges without any pre_scans, by processing the subranges from left to right using final scans. That is why final scans must compute a summary as well as the final scan result. The summary might be needed to process the next subrange if no worker thread has prescanned it yet.
[Figure 3 diagram. Input array: 1 2 3 ... 16. Labelled operations: original body [0]; split [0]; split [0]; final_scan 0 1 3 6 [10]; pre_scan [26]; pre_scan [42]; reverse_join [36]; reverse_join [78]; final_scan 10 15 21 28 [36]; final_scan 36 45 55 66 [78]; final_scan 78 91 105 120 [136]; assign [136].]

Figure 3: Example Execution of parallel_scan

The following code demonstrates how the signatures could be implemented to use parallel_scan to compute the same result as the earlier sequential example involving ⊕.

    using namespace tbb;

    class Body {
        T sum;
        T* const y;
        const T* const x;
    public:
        Body( T y_[], const T x_[] ) : sum(id⊕), x(x_), y(y_) {}
        T get_sum() const {return sum;}

        template<typename Tag>
        void operator()( const blocked_range<int>& r, Tag ) {
            T temp = sum;
            for( int i=r.begin(); i<r.end(); ++i ) {
                temp = temp ⊕ x[i];
                if( Tag::is_final_scan() )
                    y[i] = temp;
            }
            sum = temp;
        }
        Body( Body& b, split ) : x(b.x), y(b.y), sum(id⊕) {}
        void reverse_join( Body& a ) { sum = a.sum ⊕ sum; }
        void assign( Body& b ) { sum = b.sum; }
    };

    float DoParallelScan( T y[], const T x[], int n ) {
        Body body(y,x);
        parallel_scan( blocked_range<int>(0,n), body );
        return body.get_sum();
    }

The definition of operator() demonstrates typical patterns when using parallel_scan.
• A single template defines both versions. Doing so is not required, but usually saves coding effort, because the two versions are usually similar. The library defines static method is_final_scan() to enable differentiation between the versions.
• The prescan variant computes the ⊕ reduction, but does not update y. The prescan is used by parallel_scan to generate look-ahead partial reductions.
• The final scan variant computes the ⊕ reduction and updates y.

The operation reverse_join is similar to the operation join used by parallel_reduce, except that the arguments are reversed. That is, this is the right argument of ⊕. Template function parallel_scan decides if and when to generate parallel work. It is thus crucial that ⊕ is associative and that the methods of Body faithfully represent it. Operations such as floating-point addition that are only approximately associative can be used, with the understanding that the results may be rounded differently depending upon the association used by parallel_scan.
The reassociation may differ between runs even on the same machine. However, if there are no worker threads available, execution associates identically to the serial form shown at the beginning of this section.

If you change the example to use a simple_partitioner, be sure to provide a grainsize. The code below shows how to do this for a grainsize of 1000:

    parallel_scan( blocked_range<int>(0,n,1000), total, simple_partitioner() );

4.6.1 pre_scan_tag and final_scan_tag Classes

Summary
Types that distinguish the phases of parallel_scan.

Syntax
    struct pre_scan_tag;
    struct final_scan_tag;

Header
    #include "tbb/parallel_scan.h"

Description
Types pre_scan_tag and final_scan_tag are dummy types used in conjunction with parallel_scan. See the example in Section 4.6 for how they are used in the signature of operator().

Members
    namespace tbb {
        struct pre_scan_tag {
            static bool is_final_scan();
        };
        struct final_scan_tag {
            static bool is_final_scan();
        };
    }

4.6.1.1 bool is_final_scan()
Returns
True for a final_scan_tag, otherwise false.

4.7 parallel_do Template Function

Summary
Template function that processes work items in parallel.

Syntax
    template<typename InputIterator, typename Body>
    void parallel_do( InputIterator first, InputIterator last, Body body
                      [, task_group_context& group] );

Header
    #include "tbb/parallel_do.h"

Description
A parallel_do(first,last,body) applies a function object body over the half-open interval [first,last). Items may be processed in parallel. Additional work items can be added by body if it has a second argument of type parallel_do_feeder (4.7.1). The function terminates when body(x) returns for all items x that were in the input sequence or added to it by method parallel_do_feeder::add (4.7.1.1).

The requirements for input iterators are specified in Section 24.1 of the ISO C++ standard. Table 15 shows the requirements on type Body.
Table 15: parallel_do Requirements for Body B and its Argument Type T

Pseudo-Signature
    Semantics

B::operator()( cv-qualifiers T& item, parallel_do_feeder<T>& feeder ) const
OR
B::operator()( cv-qualifiers T& item ) const
    Process item. Template parallel_do may concurrently invoke operator() for the same this but different item. The signature with feeder permits additional work items to be added.

T( const T& )
    Copy a work item.

T::~T()
    Destroy a work item.

For example, a unary function object, as defined in Section 20.3 of the C++ standard, models the requirements for B.

CAUTION: Defining both the one-argument and two-argument forms of operator() is not permitted.

TIP: The parallelism in parallel_do is not scalable if all of the items come from an input stream that does not have random access. To achieve scaling, do one of the following:
• Use random access iterators to specify the input stream.
• Design your algorithm such that the body often adds more than one piece of work.
• Use parallel_for instead.

To achieve speedup, the grainsize of B::operator() needs to be on the order of at least ~100,000 clock cycles. Otherwise, the internal overheads of parallel_do swamp the useful work.

The algorithm can be passed a task_group_context object so that its tasks are executed in this group. By default the algorithm is executed in a bound group of its own.

Example
The following code sketches a body with the two-argument form of operator().

    struct MyBody {
        void operator()( item_t item, parallel_do_feeder<item_t>& feeder ) {
            for each new piece of work implied by item do {
                item_t new_item = initializer;
                feeder.add(new_item);
            }
        }
    };

4.7.1 parallel_do_feeder<Item> class

Summary
Inlet into which additional work items for a parallel_do can be fed.

Syntax
    template<typename Item>
    class parallel_do_feeder;

Header
    #include "tbb/parallel_do.h"

Description
A parallel_do_feeder enables the body of a parallel_do to add more work items.
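The termination rule for parallel_do (it returns once the body has returned for every original item and every item added through the feeder) can be illustrated with a single-threaded model. The sketch below is not Intel TBB code and does not run anything in parallel: FeederModel, sequential_do_model, CountDown, and visit_all are hypothetical names invented for this illustration, and only the shape of add( const Item& ) is taken from the manual.

```cpp
#include <cassert>
#include <deque>
#include <iterator>
#include <vector>

// Hypothetical stand-in for parallel_do_feeder<Item>: its only public
// operation is add(), which appends to the pool of pending work.
template <typename Item>
class FeederModel {
    std::deque<Item>& pending;
public:
    explicit FeederModel(std::deque<Item>& q) : pending(q) {}
    void add(const Item& item) { pending.push_back(item); }
};

// Single-threaded model of parallel_do(first,last,body): keep invoking the
// body until no original or added item remains.
template <typename InputIterator, typename Body>
void sequential_do_model(InputIterator first, InputIterator last, Body body) {
    typedef typename std::iterator_traits<InputIterator>::value_type Item;
    std::deque<Item> pending(first, last);
    FeederModel<Item> feeder(pending);
    while (!pending.empty()) {
        Item item = pending.front();
        pending.pop_front();
        body(item, feeder);   // the body may add more work through feeder
    }
}

// Example body (two-argument form): visiting n > 0 adds n-1 as new work.
struct CountDown {
    std::vector<int>* visited;
    void operator()(int item, FeederModel<int>& feeder) const {
        assert(item >= 0);
        visited->push_back(item);
        if (item > 0) feeder.add(item - 1);
    }
};

// Returns every item the body was invoked on, in processing order.
inline std::vector<int> visit_all(const std::vector<int>& seeds) {
    std::vector<int> visited;
    CountDown body = { &visited };
    sequential_do_model(seeds.begin(), seeds.end(), body);
    return visited;
}
```

In the real algorithm the processing order is unspecified and several invocations of the body may run at once, so a body like CountDown would need a concurrency-safe way to record results.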
Only class parallel_do (4.7) can create or destroy a parallel_do_feeder. The only operation other code can perform on a parallel_do_feeder is to invoke method parallel_do_feeder::add.

Members
    namespace tbb {
        template<typename Item>
        struct parallel_do_feeder {
            void add( const Item& item );
        };
    }

4.7.1.1 void add( const Item& item )

Requirements
Must be called from a call to body.operator() created by parallel_do. Otherwise, the termination semantics of method operator() are undefined.

Effects
Adds item to the collection of work items to be processed.

4.8 parallel_for_each Template Function

Summary
Parallel variant of std::for_each.

Syntax
    template<typename InputIterator, typename Func>
    void parallel_for_each( InputIterator first, InputIterator last, const Func& f [, task_group_context& group] );

Header
    #include "tbb/parallel_for_each.h"

Description
A parallel_for_each(first,last,f) applies f to the result of dereferencing every iterator in the range [first,last), possibly in parallel. It is provided for PPL compatibility and is equivalent to parallel_do(first,last,f) without "feeder" functionality.

If the group argument is specified, the algorithm's tasks are executed in this group. By default the algorithm is executed in a bound group of its own.

4.9 pipeline Class

Summary
Class that performs pipelined execution.

Syntax
    class pipeline;

Header
    #include "tbb/pipeline.h"

Description
A pipeline represents pipelined application of a series of filters to a stream of items. Each filter operates in a particular mode: parallel, serial in order, or serial out of order (MacDonald 2004). See class filter (4.9.6) for details.

A pipeline contains one or more filters, denoted here as fi, where i denotes the position of the filter in the pipeline. The pipeline starts with filter f0, followed by f1, f2, etc. The following steps describe how to use class pipeline.
1. Derive each class fi from filter. The constructor for fi specifies its mode as a parameter to the constructor for base class filter (4.9.6.1).
2. Override virtual method filter::operator() to perform the filter's action on the item, and return a pointer to the item to be processed by the next filter. The first filter f0 generates the stream. It should return NULL if there are no more items in the stream. The return value for the last filter is ignored.
3. Create an instance of class pipeline.
4. Create instances of the filters fi and add them to the pipeline, in order from first to last. An instance of a filter can be added at most once to a pipeline. A filter should never be a member of more than one pipeline at a time.
5. Call method pipeline::run. The parameter max_number_of_live_tokens puts an upper bound on the number of stages that will be run concurrently. Higher values may increase concurrency at the expense of more memory consumption from having more items in flight. See the Tutorial, in the section on class pipeline, for more about effective use of max_number_of_live_tokens.

TIP: Given sufficient processors and tokens, the throughput of the pipeline is limited to the throughput of the slowest serial filter.

NOTE: Function parallel_pipeline provides a strongly typed lambda-friendly way to build and run pipelines.

Members
    namespace tbb {
        class pipeline {
        public:
            pipeline();
            ~pipeline();    // See footnote 5.
            void add_filter( filter& f );
            void run( size_t max_number_of_live_tokens [, task_group_context& group] );
            void clear();
        };
    }

4.9.1 pipeline()

Effects
Constructs pipeline with no filters.

4.9.2 ~pipeline()

Effects
Removes all filters from the pipeline and destroys the pipeline.

4.9.3 void add_filter( filter& f )

Effects
Appends filter f to the sequence of filters in the pipeline. The filter f must not already be in a pipeline.

Footnote 5: Though the current implementation declares the destructor virtual, do not rely on this detail.
The virtual nature is deprecated and may disappear in future versions of Intel® TBB.

4.9.4 void run( size_t max_number_of_live_tokens[, task_group_context& group] )

Effects
Runs the pipeline until the first filter returns NULL and each subsequent filter has processed all items from its predecessor. The number of items processed in parallel depends upon the structure of the pipeline and number of available threads. At most max_number_of_live_tokens are in flight at any given time.

A pipeline can be run multiple times. It is safe to add stages between runs. Concurrent invocations of run on the same instance of pipeline are prohibited.

If the group argument is specified, the pipeline's tasks are executed in this group. By default the algorithm is executed in a bound group of its own.

4.9.5 void clear()

Effects
Removes all filters from the pipeline.

4.9.6 filter Class

Summary
Abstract base class that represents a filter in a pipeline.

Syntax
    class filter;

Header
    #include "tbb/pipeline.h"

Description
A filter represents a filter in a pipeline (4.9). There are three modes of filters:
• A parallel filter can process multiple items in parallel and in no particular order.
• A serial_out_of_order filter processes items one at a time, and in no particular order.
• A serial_in_order filter processes items one at a time. All serial_in_order filters in a pipeline process items in the same order.

The mode of a filter is specified by an argument to the constructor. Parallel filters are preferred when practical because they permit parallel speedup. If a filter must be serial, the out of order variant is preferred when practical because it puts fewer constraints on processing order.

Class filter should only be used in conjunction with class pipeline (4.9).

TIP: Use a serial_in_order input filter if there are any subsequent serial_in_order stages that should process items in their input order.
CAUTION: Intel® TBB 2.0 and prior treated parallel input stages as serial. Later versions of Intel® TBB can execute a parallel input stage in parallel, so if you specify such a stage, ensure that its operator() is thread safe.

Members
    namespace tbb {
        class filter {
        public:
            enum mode {
                parallel = implementation-defined,
                serial_in_order = implementation-defined,
                serial_out_of_order = implementation-defined
            };
            bool is_serial() const;
            bool is_ordered() const;
            virtual void* operator()( void* item ) = 0;
            virtual void finalize( void* item ) {}
            virtual ~filter();
        protected:
            filter( mode );
        };
    }

Example
See the example filters MyInputFilter, MyTransformFilter, and MyOutputFilter in the Tutorial (doc/Tutorial.pdf).

4.9.6.1 filter( mode filter_mode )

Effects
Constructs a filter of the specified mode.

NOTE: Intel® TBB 2.1 and prior had a similar constructor with a bool argument is_serial. That constructor exists but is deprecated (Section A.2.1).

4.9.6.2 ~filter()

Effects
Destroys the filter. If the filter is in a pipeline, it is automatically removed from that pipeline.

4.9.6.3 bool is_serial() const

Returns
False if filter mode is parallel; true otherwise.

4.9.6.4 bool is_ordered() const

Returns
True if filter mode is serial_in_order, false otherwise.

4.9.6.5 virtual void* operator()( void * item )

Description
The derived filter should override this method to process an item and return a pointer to an item to be processed by the next filter. The item parameter is NULL for the first filter in the pipeline.

Returns
The first filter in a pipeline should return NULL if there are no more items to process. The result of the last filter in a pipeline is ignored.

4.9.6.6 virtual void finalize( void * item )

Description
A pipeline can be cancelled by user demand or because of an exception. When a pipeline is cancelled, there may be items returned by a filter's operator() that have not yet been processed by the next filter.
When a pipeline is cancelled, the next filter invokes finalize() on each item instead of operator(). In contrast to operator(), method finalize() does not return an item for further processing. A derived filter should override finalize() to perform proper cleanup for an item. A pipeline will not invoke any further methods on the item.

Effects
The default definition has no effect.

4.9.7 thread_bound_filter Class

Summary
Abstract base class that represents a filter in a pipeline that a thread must service explicitly.

Syntax
    class thread_bound_filter;

Header
    #include "tbb/pipeline.h"

Description
A thread_bound_filter is a special kind of filter (4.9.6) that is explicitly serviced by a particular thread. It is useful when a filter must be executed by a particular thread.

CAUTION: Use thread_bound_filter only if you need a filter to be executed on a particular thread. The thread that services a thread_bound_filter must not be the thread that calls pipeline::run().

Members
    namespace tbb {
        class thread_bound_filter: public filter {
        protected:
            thread_bound_filter( mode filter_mode );
        public:
            enum result_type {
                success,
                item_not_available,
                end_of_stream
            };
            result_type try_process_item();
            result_type process_item();
        };
    }

Example
The example below shows a pipeline with two filters where the second filter is a thread_bound_filter serviced by the main thread.
    #include <iostream>
    #include "tbb/pipeline.h"
    #include "tbb/compat/thread"
    #include "tbb/task_scheduler_init.h"

    using namespace tbb;

    char InputString[] = "abcdefg\n";

    class InputFilter: public filter {
        char* my_ptr;
    public:
        void* operator()(void*) {
            if (*my_ptr)
                return my_ptr++;
            else
                return NULL;
        }
        InputFilter() :
            filter( serial_in_order ), my_ptr(InputString) {}
    };

    class OutputFilter: public thread_bound_filter {
    public:
        void* operator()(void* item) {
            std::cout << *(char*)item;
            return NULL;
        }
        OutputFilter() : thread_bound_filter(serial_in_order) {}
    };

    void RunPipeline(pipeline* p) {
        p->run(8);
    }

    int main() {
        // Construct the pipeline
        InputFilter f;
        OutputFilter g;
        pipeline p;
        p.add_filter(f);
        p.add_filter(g);

        // Another thread initiates execution of the pipeline
        std::thread t(RunPipeline,&p);

        // Process the thread_bound_filter with the current thread.
        while (g.process_item()!=thread_bound_filter::end_of_stream)
            continue;

        // Wait for pipeline to finish on the other thread.
        t.join();

        return 0;
    }

The main thread does the following after constructing the pipeline:
1. Start the pipeline on another thread.
2. Service the thread_bound_filter until it reaches end_of_stream.
3. Wait for the other thread to finish.

The pipeline is run on a separate thread because the main thread is responsible for servicing the thread_bound_filter g. The roles of the two threads can be reversed. A single thread cannot do both roles.

4.9.7.1 thread_bound_filter(mode filter_mode)

Effects
Constructs a filter of the specified mode. Section 4.9.6 describes the modes.

4.9.7.2 result_type try_process_item()

Effects
If an item is available and it can be processed without exceeding the token limit, process the item with filter::operator().

Returns
Table 16: Return Values From try_process_item

    Return Value          Description
    success               Applied filter::operator() to one item.
    item_not_available    No item is currently available to process, or the token limit (4.9.4) would be exceeded.
    end_of_stream         No more items will ever arrive at this filter.

4.9.7.3 result_type process_item()

Effects
Like try_process_item, but waits until it can process an item or the end of the stream is reached.

Returns
Either success or end_of_stream. See Table 16 for details.

CAUTION: The current implementation spin waits until it can process an item or reaches the end of the stream.

4.10 parallel_pipeline Function

Summary
Strongly typed interface for pipelined execution.

Syntax
    void parallel_pipeline( size_t max_number_of_live_tokens,
                            const filter_t<void,void>& filter_chain
                            [, task_group_context& group] );

Header
    #include "tbb/pipeline.h"

Description
Function parallel_pipeline is a strongly typed lambda-friendly interface for building and running pipelines. The pipeline has characteristics similar to class pipeline, except that the stages of the pipeline are specified via functors instead of class derivation.

To build and run a pipeline from functors g0, g1, g2,...gn, write:

    parallel_pipeline( max_number_of_live_tokens,
                       make_filter<void,I1>(mode0,g0) &
                       make_filter<I1,I2>(mode1,g1) &
                       make_filter<I2,I3>(mode2,g2) &
                       ...
                       make_filter<In,void>(moden,gn) );

In general, functor gi should define its operator() to map objects of type Ii to objects of type Ii+1. Functor g0 is a special case, because it notifies the pipeline when the end of the input stream is reached. Functor g0 must be defined such that for a flow_control object fc, the expression g0(fc) either returns the next value in the input stream, or if at the end of the input stream, invokes fc.stop() and returns a dummy value.

The value max_number_of_live_tokens has the same meaning as it does for pipeline::run.

If the group argument is specified, the pipeline's tasks are executed in this group. By default the algorithm is executed in a bound group of its own.
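The protocol for the input functor g0 (return the next value, or call fc.stop() and return a dummy value) can be modeled on a single thread as follows. This is a minimal sketch of the contract only, not the library's implementation: FlowControlModel, run_chain_sequentially, and sum_of_squares are hypothetical names, no Intel TBB header is used, and nothing runs in parallel. It assumes a C++11 compiler for the lambda expressions.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for flow_control: it only records that the input
// functor has signalled the end of the input stream.
struct FlowControlModel {
    bool stopped;
    FlowControlModel() : stopped(false) {}
    void stop() { stopped = true; }
};

// Runs a three-stage chain g0 -> g1 -> g2 one item at a time.
// Returns the number of items that travelled through the whole chain.
template <typename G0, typename G1, typename G2>
std::size_t run_chain_sequentially(G0 g0, G1 g1, G2 g2) {
    FlowControlModel fc;
    std::size_t items = 0;
    for (;;) {
        auto item = g0(fc);      // next input value, or a dummy value
        if (fc.stopped) break;   // the dummy value is never forwarded
        g2(g1(item));            // middle stage maps, last stage consumes
        ++items;
    }
    return items;
}

// Sum of the squares of [first,last), written against the model above.
inline float sum_of_squares(const float* first, const float* last) {
    assert(first <= last);
    float sum = 0;
    run_chain_sequentially(
        [&](FlowControlModel& fc) -> const float* {
            if (first < last) return first++;
            fc.stop();
            return nullptr;      // dummy value, discarded by the driver
        },
        [](const float* p) { return (*p) * (*p); },
        [&](float x) { sum += x; });
    return sum;
}
```

The point of the model is the check of fc.stopped immediately after calling g0: whatever g0 returns alongside fc.stop() is discarded, which is why any value of the right type is acceptable as the dummy.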
Example
The following example uses parallel_pipeline to compute the root-mean-square of a sequence defined by [first,last). The example is only for demonstrating syntactic mechanics. It is not a practical way to do the calculation because parallel overhead would be vastly higher than useful work. Operator & requires that the output type of its first filter_t argument matches the input type of its second filter_t argument.

    float RootMeanSquare( float* first, float* last ) {
        float sum=0;
        parallel_pipeline( /*max_number_of_live_token=*/16,
            make_filter<void,float*>(
                filter::serial,
                [&](flow_control& fc)-> float*{
                    if( first<last ) {
                        return first++;
                    } else {
                        fc.stop();
                        return NULL;
                    }
                }
            ) &
            make_filter<float*,float>(
                filter::parallel,
                [](float* p){return (*p)*(*p);}
            ) &
            make_filter<float,void>(
                filter::serial,
                [&](float x) {sum+=x;}
            )
        );
        return sqrt(sum);
    }

See the Intel® Threading Building Blocks Tutorial for a non-trivial example of parallel_pipeline.

4.10.1 filter_t Template Class

Summary
A filter or composite filter used in conjunction with function parallel_pipeline.

Syntax
    template<typename T, typename U> class filter_t;

    template<typename T, typename U, typename Func>
    filter_t<T,U> make_filter( filter::mode mode, const Func& f );

    template<typename T, typename V, typename U>
    filter_t<T,U> operator&( const filter_t<T,V>& left, const filter_t<V,U>& right );

Header
    #include "tbb/pipeline.h"

Description
A filter_t is a strongly typed filter that specifies its input and output types. A filter_t can be constructed from a functor or by composing two filter_t objects with operator&. See Section 4.10 for an example. The same filter_t object can be shared by multiple & expressions.

Members
    namespace tbb {
        template<typename T, typename U>
        class filter_t {
        public:
            filter_t();
            filter_t( const filter_t<T,U>& rhs );
            template<typename Func>
            filter_t( filter::mode mode, const Func& func );
            void operator=( const filter_t<T,U>& rhs );
            ~filter_t();
            void clear();
        };

        template<typename T, typename U, typename Func>
        filter_t<T,U> make_filter( filter::mode mode, const Func& f );

        template<typename T, typename V, typename U>
        filter_t<T,U> operator&( const filter_t<T,V>& left, const filter_t<V,U>& right );
    }

4.10.1.1 filter_t()

Effects
Construct an undefined filter.
CAUTION: The effect of using an undefined filter by operator& or parallel_pipeline is undefined.

4.10.1.2 filter_t( const filter_t<T,U>& rhs )

Effects
Construct a copy of rhs.

4.10.1.3 template<typename Func> filter_t( filter::mode mode, const Func& f )

Effects
Construct a filter_t that uses a copy of functor f to map an input value t of type T to an output value u of type U.

NOTE: When parallel_pipeline uses the filter_t, it computes u by evaluating f(t), unless T is void. In the void case u is computed by the expression u=f(fc), where fc is of type flow_control.

See 4.9.6 for a description of the mode argument.

4.10.1.4 void operator=( const filter_t<T,U>& rhs )

Effects
Update *this to use the functor associated with rhs.

4.10.1.5 ~filter_t()

Effects
Destroy the filter_t.

4.10.1.6 void clear()

Effects
Set *this to an undefined filter.

4.10.1.7 template<typename T, typename U, typename Func> filter_t<T,U> make_filter( filter::mode mode, const Func& f )

Returns
filter_t<T,U>(mode,f)

4.10.1.8 template<typename T, typename V, typename U> filter_t<T,U> operator&( const filter_t<T,V>& left, const filter_t<V,U>& right )

Requires
The output type of left must match the input type of right.

Returns
A filter_t representing the composition of filters left and right. The composition behaves as if the output value of left becomes the input value of right.

4.10.2 flow_control Class

Summary
Enables the first filter in a composite filter to indicate when the end of input has been reached.

Syntax
    class flow_control;

Header
    #include "tbb/pipeline.h"

Description
Template function parallel_pipeline passes a flow_control object fc to the input functor of a filter_t. When the input functor reaches the end of its input, it should invoke fc.stop() and return a dummy value. See Section 4.10 for an example.

Members
    namespace tbb {
        class flow_control {
        public:
            void stop();
        };
    }

4.11 parallel_sort Template Function

Summary
Sort a sequence.
Syntax
    template<typename RandomAccessIterator>
    void parallel_sort( RandomAccessIterator begin, RandomAccessIterator end );

    template<typename RandomAccessIterator, typename Compare>
    void parallel_sort( RandomAccessIterator begin, RandomAccessIterator end, const Compare& comp );

Header
    #include "tbb/parallel_sort.h"

Description
Performs an unstable sort of sequence [begin, end). An unstable sort might not preserve the relative ordering of elements with equal keys. The sort is deterministic; sorting the same sequence will produce the same result each time. The requirements on the iterator and sequence are the same as for std::sort. Specifically, RandomAccessIterator must be a random access iterator, and its value type T must model the requirements in Table 17.

Table 17: Requirements on Value Type T of RandomAccessIterator for parallel_sort

Pseudo-Signature
    Semantics

void swap( T& x, T& y )
    Swap x and y.

bool Compare::operator()( const T& x, const T& y )
    True if x comes before y; false otherwise.

A call parallel_sort(i,j,comp) sorts the sequence [i,j) using the argument comp to determine relative orderings. If comp(x,y) returns true then x appears before y in the sorted sequence.

A call parallel_sort(i,j) is equivalent to parallel_sort(i,j,std::less<T>).

Complexity
parallel_sort is a comparison sort with an average time complexity of O(N log (N)), where N is the number of elements in the sequence. When worker threads are available (12.2.1), parallel_sort creates subtasks that may be executed concurrently, leading to improved execution times.

Example
The following example shows two sorts. The sort of array a uses the default comparison, which sorts in ascending order. The sort of array b sorts in descending order by using std::greater<float> for comparison.
    #include "tbb/parallel_sort.h"
    #include <math.h>

    using namespace tbb;

    const int N = 100000;
    float a[N];
    float b[N];

    void SortExample() {
        for( int i = 0; i < N; i++ ) {
            a[i] = sin((double)i);
            b[i] = cos((double)i);
        }
        parallel_sort(a, a + N);
        parallel_sort(b, b + N, std::greater<float>());
    }

4.12 parallel_invoke Template Function

Summary
Template function that evaluates several functions in parallel.

Syntax (see footnote 6)
    template<typename Func0, typename Func1>
    void parallel_invoke(const Func0& f0, const Func1& f1);

    template<typename Func0, typename Func1, typename Func2>
    void parallel_invoke(const Func0& f0, const Func1& f1, const Func2& f2);

    …

    template<typename Func0, typename Func1 … typename Func9>
    void parallel_invoke(const Func0& f0, const Func1& f1 … const Func9& f9);

Footnote 6: When support for C++0x rvalue references becomes prevalent, the formal parameters may change to rvalue references.

Header
    #include "tbb/parallel_invoke.h"

Description
The expression parallel_invoke(f0,f1...fk) evaluates f0(), f1(),...fk(), possibly in parallel. There can be from 2 to 10 arguments. Each argument must have a type for which operator() is defined. Typically the arguments are either function objects or pointers to functions. Return values are ignored.

Example
The following example evaluates f(), g(), and h() in parallel. Notice how g and h are function objects that can hold local state.

    #include "tbb/parallel_invoke.h"

    using namespace tbb;

    void f();
    extern void bar(int);

    class MyFunctor {
        int arg;
    public:
        MyFunctor(int a) : arg(a) {}
        void operator()() const {bar(arg);}
    };

    void RunFunctionsInParallel() {
        MyFunctor g(2);
        MyFunctor h(3);
        tbb::parallel_invoke(f, g, h );
    }

Example with Lambda Expressions
Here is the previous example rewritten with C++0x lambda expressions, which generate function objects.
    #include "tbb/parallel_invoke.h"

    using namespace tbb;

    void f();
    extern void bar(int);

    void RunFunctionsInParallel() {
        tbb::parallel_invoke(f, []{bar(2);}, []{bar(3);} );
    }

5 Containers

The container classes permit multiple threads to simultaneously invoke certain methods on the same container.

Like STL, Intel® Threading Building Blocks (Intel® TBB) containers are templated with respect to an allocator argument. Each container uses its allocator to allocate memory for user-visible items. A container may use a different allocator for strictly internal structures.

5.1 Container Range Concept

Summary
View set of items in a container as a recursively divisible range.

Requirements
A Container Range is a Range (4.2) with the further requirements listed in Table 18.

Table 18: Requirements on a Container Range R (In Addition to Table 8)

    Pseudo-Signature                     Semantics
    R::value_type                        Item type
    R::reference                         Item reference type
    R::const_reference                   Item const reference type
    R::difference_type                   Type for difference of two iterators
    R::iterator                          Iterator type for range
    R::iterator R::begin()               First item in range
    R::iterator R::end()                 One past last item in range
    R::size_type R::grainsize() const    Grain size

Model Types
Classes concurrent_hash_map (5.4.4) and concurrent_vector (5.8.5) both have member types range_type and const_range_type that model a Container Range.

Use the range types in conjunction with parallel_for (4.4), parallel_reduce (4.5), and parallel_scan (4.6) to iterate over items in a container.

5.2 concurrent_unordered_map Template Class

Summary
Template class for associative container that supports concurrent insertion and traversal.
Syntax
    template <typename Key,
              typename Element,
              typename Hasher = tbb_hash<Key>,
              typename Equality = std::equal_to<Key>,
              typename Allocator = tbb::tbb_allocator<std::pair<const Key, Element> > >
    class concurrent_unordered_map;

Header
    #include "tbb/concurrent_unordered_map.h"

Description
A concurrent_unordered_map supports concurrent insertion and traversal, but not concurrent erasure. The interface has no visible locking. It may hold locks internally, but never while calling user defined code. It has semantics similar to the C++0x std::unordered_map except as follows:
• Methods requiring C++0x language features (such as rvalue references and std::initializer_list) are currently omitted.
• The erase methods are prefixed with unsafe_, to indicate that they are not concurrency safe.
• Bucket methods are prefixed with unsafe_ as a reminder that they are not concurrency safe with respect to insertion.
• The insert methods may create a temporary pair that is destroyed if another thread inserts the same key concurrently.
• Like std::list, insertion of new items does not invalidate any iterators, nor change the order of items already in the map. Insertion and traversal may be concurrent.
• The iterator types iterator and const_iterator are of the forward iterator category.
• Insertion does not invalidate or update the iterators returned by equal_range, so insertion may cause non-equal items to be inserted at the end of the range. However, the first iterator will nonetheless point to the equal item even after an insertion operation.

NOTE: The key differences between classes concurrent_unordered_map and concurrent_hash_map are:
• concurrent_unordered_map: permits concurrent traversal and insertion, no visible locking, closely resembles the C++0x unordered_map.
• concurrent_hash_map: permits concurrent erasure, built-in locking.

CAUTION: As with any form of hash table, keys that are equal must have the same hash code, and the ideal hash function distributes keys uniformly across the hash code space.
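The requirement that equal keys share a hash code can be made concrete with a pair of functors that are written to agree with each other. The sketch below uses std::unordered_map as a stand-in, on the strength of the manual's statement that the two classes have similar semantics; the assumption is that the same functor types could be supplied as the hasher and equality template arguments of concurrent_unordered_map. The names lower_char, CaseInsensitiveEqual, CaseInsensitiveHash, WordCount, and count_word are hypothetical.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <unordered_map>

// Lower-cases one character. The cast to unsigned char avoids undefined
// behavior in std::tolower for negative char values.
inline char lower_char(char c) {
    return static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
}

// Equality functor: two keys are equal if they match when case is ignored.
struct CaseInsensitiveEqual {
    bool operator()(const std::string& a, const std::string& b) const {
        if (a.size() != b.size()) return false;
        for (std::size_t i = 0; i < a.size(); ++i)
            if (lower_char(a[i]) != lower_char(b[i])) return false;
        return true;
    }
};

// Hash functor: computed from the lower-cased characters only, so any two
// keys that compare equal above are guaranteed to hash identically.
struct CaseInsensitiveHash {
    std::size_t operator()(const std::string& s) const {
        std::size_t h = 0;
        for (std::size_t i = 0; i < s.size(); ++i)
            h = h * 31 + static_cast<unsigned char>(lower_char(s[i]));
        return h;
    }
};

typedef std::unordered_map<std::string, int,
                           CaseInsensitiveHash,
                           CaseInsensitiveEqual> WordCount;

// Increments and returns the count stored for word w.
inline int count_word(WordCount& m, const std::string& w) {
    return ++m[w];
}
```

Hashing the raw characters while comparing case-insensitively would break the rule: "TBB" and "tbb" would compare equal but could land in different buckets, so a lookup for one could miss the other.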
Members
In the following synopsis, methods in bold may be concurrently invoked. For example, three different threads can concurrently call methods insert, begin, and size. Their results might be non-deterministic. For example, the result from size might correspond to before or after the insertion.

    template <typename Key,
              typename Element,
              typename Hasher = tbb_hash<Key>,
              typename Equality = std::equal_to<Key>,
              typename Allocator = tbb::tbb_allocator<std::pair<const Key, Element> > >
    class concurrent_unordered_map {
    public:
        // types
        typedef Key key_type;
        typedef std::pair<const Key, Element> value_type;
        typedef Element mapped_type;
        typedef Hasher hasher;
        typedef Equality key_equal;
        typedef Allocator allocator_type;
        typedef typename allocator_type::pointer pointer;
        typedef typename allocator_type::const_pointer const_pointer;
        typedef typename allocator_type::reference reference;
        typedef typename allocator_type::const_reference const_reference;
        typedef implementation-defined size_type;
        typedef implementation-defined difference_type;
        typedef implementation-defined iterator;
        typedef implementation-defined const_iterator;
        typedef implementation-defined local_iterator;
        typedef implementation-defined const_local_iterator;

        // construct/destroy/copy
        explicit concurrent_unordered_map(size_type n = implementation-defined,
            const hasher& hf = hasher(),
            const key_equal& eql = key_equal(),
            const allocator_type& a = allocator_type());
        template <typename InputIterator>
        concurrent_unordered_map(
            InputIterator first, InputIterator last,
            size_type n = implementation-defined,
            const hasher& hf = hasher(),
            const key_equal& eql = key_equal(),
            const allocator_type& a = allocator_type());
        concurrent_unordered_map(const concurrent_unordered_map&);
        concurrent_unordered_map(const Allocator&);
        concurrent_unordered_map(const concurrent_unordered_map&, const Allocator&);
        ~concurrent_unordered_map();

        concurrent_unordered_map& operator=( const concurrent_unordered_map&);

        allocator_type get_allocator() const;

        // size and capacity
        bool empty() const;     // May take linear time!
        size_type size() const; // May take linear time!
        size_type max_size() const;

        // iterators
        iterator begin();
        const_iterator begin() const;
        iterator end();
        const_iterator end() const;
        const_iterator cbegin() const;
        const_iterator cend() const;

        // modifiers
        std::pair<iterator, bool> insert(const value_type& x);
        iterator insert(const_iterator hint, const value_type& x);
        template<typename InputIterator>
        void insert(InputIterator first, InputIterator last);

        iterator unsafe_erase(const_iterator position);
        size_type unsafe_erase(const key_type& k);
        iterator unsafe_erase(const_iterator first, const_iterator last);
        void clear();

        void swap(concurrent_unordered_map&);

        // observers
        hasher hash_function() const;
        key_equal key_eq() const;

        // lookup
        iterator find(const key_type& k);
        const_iterator find(const key_type& k) const;
        size_type count(const key_type& k) const;
        std::pair<iterator, iterator> equal_range(const key_type& k);
        std::pair<const_iterator, const_iterator> equal_range(const key_type& k) const;
        mapped_type& operator[](const key_type& k);
        mapped_type& at( const key_type& k );
        const mapped_type& at(const key_type& k) const;

        // parallel iteration
        typedef implementation-defined range_type;
        typedef implementation-defined const_range_type;
        range_type range();
        const_range_type range() const;

        // bucket interface - for debugging
        size_type unsafe_bucket_count() const;
        size_type unsafe_max_bucket_count() const;
        size_type unsafe_bucket_size(size_type n);
        size_type unsafe_bucket(const key_type& k) const;
        local_iterator unsafe_begin(size_type n);
        const_local_iterator unsafe_begin(size_type n) const;
        local_iterator unsafe_end(size_type n);
        const_local_iterator unsafe_end(size_type n) const;
        const_local_iterator unsafe_cbegin(size_type n) const;
        const_local_iterator unsafe_cend(size_type n) const;

        // hash policy
        float load_factor() const;
        float max_load_factor() const;
        void max_load_factor(float z);
        void rehash(size_type n);
    };

5.2.1 Construct, Destroy, Copy

5.2.1.1 explicit concurrent_unordered_map (size_type n = implementation-defined, const hasher& hf = hasher(), const key_equal& eql = key_equal(), const allocator_type& a = allocator_type())

Effects
Construct empty table with n buckets.

5.2.1.2 template <typename InputIterator> concurrent_unordered_map (InputIterator first, InputIterator last, size_type n = implementation-defined, const hasher& hf = hasher(), const key_equal& eql = key_equal(), const allocator_type& a = allocator_type())

Effects
Construct table with n buckets initialized with value_type(*i) where i is in the half-open interval [first,last).

5.2.1.3 concurrent_unordered_map(const concurrent_unordered_map& m)

Effects
Construct copy of map m.

5.2.1.4 concurrent_unordered_map(const Allocator& a)

Effects
Construct empty map using allocator a.

5.2.1.5 concurrent_unordered_map(const concurrent_unordered_map& m, const Allocator& a)

Effects
Construct copy of map m using allocator a.

5.2.1.6 ~concurrent_unordered_map()

Effects
Destroy the map.

5.2.1.7 concurrent_unordered_map& operator=(const concurrent_unordered_map& m);

Effects
Set *this to a copy of map m.

5.2.1.8 allocator_type get_allocator() const;

Returns
Copy of the allocator associated with *this.

5.2.2 Size and capacity

5.2.2.1 bool empty() const

Returns
size()==0.

5.2.2.2 size_type size() const

Returns
Number of items in *this.

CAUTION: Though the current implementation takes time O(1), possible future implementations might take time O(P), where P is the number of hardware threads.

5.2.2.3 size_type max_size() const

Returns
Upper bound on number of items that *this can hold.

CAUTION: The upper bound may be much higher than what the container can actually hold.

5.2.3 Iterators

Template class concurrent_unordered_map supports forward iterators; that is, iterators that can advance only forwards across a table. Reverse iterators are not supported. Concurrent operations (count, find, insert) do not invalidate any existing iterators that point into the table. Note that an iterator obtained via begin() will no longer point to the first item if insert inserts an item before it.
Methods cbegin and cend follow C++0x conventions. They return const_iterator even if the object is non-const.

5.2.3.1 iterator begin()

Returns
iterator pointing to first item in the map.

5.2.3.2 const_iterator begin() const

Returns
const_iterator pointing to first item in the map.

5.2.3.3 iterator end()

Returns
iterator pointing to immediately past last item in the map.

5.2.3.4 const_iterator end() const

Returns
const_iterator pointing to immediately past last item in the map.

5.2.3.5 const_iterator cbegin() const

Returns
const_iterator pointing to first item in the map.

5.2.3.6 const_iterator cend() const

Returns
const_iterator pointing to immediately after the last item in the map.

5.2.4 Modifiers

5.2.4.1 std::pair<iterator, bool> insert(const value_type& x)

Effects
Constructs copy of x and attempts to insert it into the map. Destroys the copy if the attempt fails because there was already an item with the same key.

Returns
std::pair(iterator,success). The value iterator points to an item in the map with a matching key. The value of success is true if the item was inserted; false otherwise.

5.2.4.2 iterator insert(const_iterator hint, const value_type& x)

Effects
Same as insert(x).

NOTE: The current implementation ignores the hint argument. Other implementations might not ignore it. It exists for similarity with the C++0x class unordered_map. It hints to the implementation about where to start searching. Typically it should point to an item adjacent to where the item will be inserted.

Returns
Iterator pointing to inserted item, or item already in the map with the same key.

5.2.4.3 template<typename InputIterator> void insert(InputIterator first, InputIterator last)

Effects
Does insert(*i) where i is in the half-open interval [first,last).

5.2.4.4 iterator unsafe_erase(const_iterator position)

Effects
Remove item pointed to by position from the map.
Returns: Iterator pointing to item that was immediately after the erased item, or end() if erased item was the last item in the map.

5.2.4.5 size_type unsafe_erase(const key_type& k)
Effects: Remove item with key k if such an item exists.
Returns: 1 if an item was removed; 0 otherwise.

5.2.4.6 iterator unsafe_erase(const_iterator first, const_iterator last)
Effects: Remove *i where i is in the half-open interval [first,last).
Returns: last

5.2.4.7 void clear()
Effects: Remove all items from the map.

5.2.4.8 void swap(concurrent_unordered_map& m)
Effects: Swap contents of *this and m.

5.2.5 Observers

5.2.5.1 hasher hash_function() const
Returns: Hashing functor associated with the map.

5.2.5.2 key_equal key_eq() const
Returns: Key equivalence functor associated with the map.

5.2.6 Lookup

5.2.6.1 iterator find(const key_type& k)
Returns: iterator pointing to item with key equivalent to k, or end() if no such item exists.

5.2.6.2 const_iterator find(const key_type& k) const
Returns: const_iterator pointing to item with key equivalent to k, or end() if no such item exists.

5.2.6.3 size_type count(const key_type& k) const
Returns: Number of items with keys equivalent to k.

5.2.6.4 std::pair<iterator, iterator> equal_range(const key_type& k)
Returns: Range containing all keys in the map that are equivalent to k.

5.2.6.5 std::pair<const_iterator, const_iterator> equal_range(const key_type& k) const
Returns: Range containing all keys in the map that are equivalent to k.

5.2.6.6 mapped_type& operator[](const key_type& k)
Effects: Inserts a new item if item with key equivalent to k is not already present.
Returns: Reference to x.second, where x is the item in map with key equivalent to k.

5.2.6.7 mapped_type& at(const key_type& k)
Effects: Throws exception if item with key equivalent to k is not already present.
Returns: Reference to x.second, where x is the item in map with key equivalent to k.
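The difference between operator[] and at can be sketched with std::unordered_map, used here only as a serial stand-in because it needs no TBB installation. That concurrent_unordered_map behaves the same way is an assumption based on the descriptions above. The helper names size_after_subscript and at_throws are hypothetical. std::unordered_map::at throws std::out_of_range; the text above does not state which exception type concurrent_unordered_map throws.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <unordered_map>

// Hypothetical helper: operator[] inserts a value-initialized item when the
// key is absent. Returns the map size after the lookup.
inline std::size_t size_after_subscript(std::unordered_map<std::string, int>& m,
                                        const std::string& key) {
    int& value = m[key];  // inserts {key, 0} if key is missing
    (void)value;
    return m.size();
}

// Hypothetical helper: at() never inserts; a missing key is reported by an
// exception (std::out_of_range for the standard container).
inline bool at_throws(const std::unordered_map<std::string, int>& m,
                      const std::string& key) {
    try {
        (void)m.at(key);
        return false;
    } catch (const std::out_of_range&) {
        return true;
    }
}
```

In other words, use at when a missing key is an error, and operator[] when a default-constructed mapped value is an acceptable starting point.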
5.2.6.8 const mapped_type& at(const key_type& k) const
Effects: Throws exception if item with key equivalent to k is not already present.
Returns: Const reference to x.second, where x is the item in map with key equivalent to k.

5.2.7 Parallel Iteration

Types const_range_type and range_type model the Container Range concept (5.1). The types differ only in that the bounds for a const_range_type are of type const_iterator, whereas the bounds for a range_type are of type iterator.

5.2.7.1 const_range_type range() const
Returns: const_range_type object representing all keys in the table.

5.2.7.2 range_type range()
Returns: range_type object representing all keys in the table.

5.2.8 Bucket Interface

The bucket interface is intended for debugging. It is not concurrency safe. The mapping of keys to buckets is implementation specific. The interface is similar to the bucket interface for the C++0x class unordered_map, except that the prefix unsafe_ has been added as a reminder that the methods are unsafe to use during concurrent insertion.

Buckets are numbered from 0 to unsafe_bucket_count()-1. To iterate over a bucket use a local_iterator or const_local_iterator.

5.2.8.1 size_type unsafe_bucket_count() const
Returns: Number of buckets.

5.2.8.2 size_type unsafe_max_bucket_count() const
Returns: Upper bound on possible number of buckets.

5.2.8.3 size_type unsafe_bucket_size(size_type n)
Returns: Number of items in bucket n.

5.2.8.4 size_type unsafe_bucket(const key_type& k) const
Returns: Index of bucket where item with key k would be placed.

5.2.8.5 local_iterator unsafe_begin(size_type n)
Returns: local_iterator pointing to first item in bucket n.

5.2.8.6 const_local_iterator unsafe_begin(size_type n) const
Returns: const_local_iterator pointing to first item in bucket n.
5.2.8.7 local_iterator unsafe_end(size_type n)
Returns: local_iterator pointing to immediately after the last item in bucket n.

5.2.8.8 const_local_iterator unsafe_end(size_type n) const
Returns: const_local_iterator pointing to immediately after the last item in bucket n.

5.2.8.9 const_local_iterator unsafe_cbegin(size_type n) const
Returns: const_local_iterator pointing to first item in bucket n.

5.2.8.10 const_local_iterator unsafe_cend(size_type n) const
Returns: const_local_iterator pointing to immediately past last item in bucket n.

5.2.9 Hash policy

5.2.9.1 float load_factor() const
Returns: Average number of elements per bucket.

5.2.9.2 float max_load_factor() const
Returns: Maximum size of a bucket. If insertion of an item causes a bucket to be bigger, the implementation may repartition or increase the number of buckets.

5.2.9.3 void max_load_factor(float z)
Effects: Set maximum size for a bucket to z.

5.2.9.4 void rehash(size_type n)
Requirements: n must be a power of two.
Effects: No effect if current number of buckets is at least n. Otherwise increases number of buckets to n.

5.3 concurrent_unordered_set Template Class

Summary
Template class for a set container that supports concurrent insertion and traversal.

Syntax
    template <typename Key,
              typename Hasher = tbb_hash<Key>,
              typename Equality = std::equal_to<Key>,
              typename Allocator = tbb::tbb_allocator<Key> >
    class concurrent_unordered_set;

Header
    #include "tbb/concurrent_unordered_set.h"

Description
A concurrent_unordered_set supports concurrent insertion and traversal, but not concurrent erasure. The interface has no visible locking. It may hold locks internally, but never while calling user-defined code. It has semantics similar to the C++0x std::unordered_set except as follows:
• Methods requiring C++0x language features (such as rvalue references and std::initializer_list) are currently omitted.
• The erase methods are prefixed with unsafe_, to indicate that they are not concurrency safe.
• Bucket methods are prefixed with unsafe_ as a reminder that they are not concurrency safe with respect to insertion.
• The insert methods may create a temporary pair that is destroyed if another thread inserts the same key concurrently.
• Like std::list, insertion of new items does not invalidate any iterators, nor change the order of items already in the set. Insertion and traversal may be concurrent.
• The iterator types iterator and const_iterator are of the forward iterator category.
• Insertion does not invalidate or update the iterators returned by equal_range, so insertion may cause non-equal items to be inserted at the end of the range. However, the first iterator will nonetheless point to the equal item even after an insertion operation.

CAUTION: As with any form of hash table, keys that are equal must have the same hash code, and the ideal hash function distributes keys uniformly across the hash code space.

Members
In the following synopsis, methods in bold may be concurrently invoked. For example, three different threads can concurrently call methods insert, begin, and size. Their results might be non-deterministic. For example, the result from size might correspond to before or after the insertion.
    template <typename Key,
              typename Hasher = tbb_hash<Key>,
              typename Equality = std::equal_to<Key>,
              typename Allocator = tbb::tbb_allocator<Key> >
    class concurrent_unordered_set {
    public:
        // types
        typedef Key key_type;
        typedef Key value_type;
        typedef Key mapped_type;
        typedef Hasher hasher;
        typedef Equality key_equal;
        typedef Allocator allocator_type;
        typedef typename allocator_type::pointer pointer;
        typedef typename allocator_type::const_pointer const_pointer;
        typedef typename allocator_type::reference reference;
        typedef typename allocator_type::const_reference const_reference;
        typedef implementation-defined size_type;
        typedef implementation-defined difference_type;
        typedef implementation-defined iterator;
        typedef implementation-defined const_iterator;
        typedef implementation-defined local_iterator;
        typedef implementation-defined const_local_iterator;

        // construct/destroy/copy
        explicit concurrent_unordered_set(size_type n = implementation-defined,
            const hasher& hf = hasher(),
            const key_equal& eql = key_equal(),
            const allocator_type& a = allocator_type());
        template <typename InputIterator>
        concurrent_unordered_set(
            InputIterator first, InputIterator last,
            size_type n = implementation-defined,
            const hasher& hf = hasher(),
            const key_equal& eql = key_equal(),
            const allocator_type& a = allocator_type());
        concurrent_unordered_set(const concurrent_unordered_set&);
        concurrent_unordered_set(const Allocator&);
        concurrent_unordered_set(const concurrent_unordered_set&, const Allocator&);
        ~concurrent_unordered_set();
        concurrent_unordered_set& operator=(const concurrent_unordered_set&);
        allocator_type get_allocator() const;

        // size and capacity
        bool empty() const;      // May take linear time!
        size_type size() const;  // May take linear time!
        size_type max_size() const;

        // iterators
        iterator begin();
        const_iterator begin() const;
        iterator end();
        const_iterator end() const;
        const_iterator cbegin() const;
        const_iterator cend() const;

        // modifiers
        std::pair<iterator, bool> insert(const value_type& x);
        iterator insert(const_iterator hint, const value_type& x);
        template <typename InputIterator>
        void insert(InputIterator first, InputIterator last);
        iterator unsafe_erase(const_iterator position);
        size_type unsafe_erase(const key_type& k);
        iterator unsafe_erase(const_iterator first, const_iterator last);
        void clear();
        void swap(concurrent_unordered_set&);

        // observers
        hasher hash_function() const;
        key_equal key_eq() const;

        // lookup
        iterator find(const key_type& k);
        const_iterator find(const key_type& k) const;
        size_type count(const key_type& k) const;
        std::pair<iterator, iterator> equal_range(const key_type& k);
        std::pair<const_iterator, const_iterator> equal_range(const key_type& k) const;

        // parallel iteration
        typedef implementation-defined range_type;
        typedef implementation-defined const_range_type;
        range_type range();
        const_range_type range() const;

        // bucket interface – for debugging
        size_type unsafe_bucket_count() const;
        size_type unsafe_max_bucket_count() const;
        size_type unsafe_bucket_size(size_type n);
        size_type unsafe_bucket(const key_type& k) const;
        local_iterator unsafe_begin(size_type n);
        const_local_iterator unsafe_begin(size_type n) const;
        local_iterator unsafe_end(size_type n);
        const_local_iterator unsafe_end(size_type n) const;
        const_local_iterator unsafe_cbegin(size_type n) const;
        const_local_iterator unsafe_cend(size_type n) const;

        // hash policy
        float load_factor() const;
        float max_load_factor() const;
        void max_load_factor(float z);
        void rehash(size_type n);
    };

5.3.1 Construct, Destroy, Copy

5.3.1.1 explicit concurrent_unordered_set(size_type n = implementation-defined, const hasher& hf = hasher(), const key_equal& eql = key_equal(), const allocator_type& a = allocator_type())
Effects: Construct empty set with n buckets.
5.3.1.2 template <typename InputIterator> concurrent_unordered_set(InputIterator first, InputIterator last, size_type n = implementation-defined, const hasher& hf = hasher(), const key_equal& eql = key_equal(), const allocator_type& a = allocator_type())
Effects: Construct set with n buckets initialized with value_type(*i) where i is in the half-open interval [first,last).

5.3.1.3 concurrent_unordered_set(const concurrent_unordered_set& m)
Effects: Construct copy of set m.

5.3.1.4 concurrent_unordered_set(const Allocator& a)
Effects: Construct empty set using allocator a.

5.3.1.5 concurrent_unordered_set(const concurrent_unordered_set& m, const Allocator& a)
Effects: Construct copy of set m using allocator a.

5.3.1.6 ~concurrent_unordered_set()
Effects: Destroy the set.

5.3.1.7 concurrent_unordered_set& operator=(const concurrent_unordered_set& m);
Effects: Set *this to a copy of set m.

5.3.1.8 allocator_type get_allocator() const;
Returns: Copy of the allocator associated with *this.

5.3.2 Size and capacity

5.3.2.1 bool empty() const
Returns: size()==0.

5.3.2.2 size_type size() const
Returns: Number of items in *this.
CAUTION: Though the current implementation takes time O(1), possible future implementations might take time O(P), where P is the number of hardware threads.

5.3.2.3 size_type max_size() const
Returns: Upper bound on number of items that *this can hold.
CAUTION: The upper bound may be much higher than what the container can actually hold.

5.3.3 Iterators

Template class concurrent_unordered_set supports forward iterators; that is, iterators that can advance only forwards across a set. Reverse iterators are not supported. Concurrent operations (count, find, insert) do not invalidate any existing iterators that point into the set. Note that an iterator obtained via begin() will no longer point to the first item if insert inserts an item before it.

Methods cbegin and cend follow C++0x conventions. They return const_iterator even if the object is non-const.
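The remark that an iterator obtained via begin() can stop pointing to the first item is easiest to see with std::list, the container the class description compares insertion behaviour to. The sketch below is a serial analogy using std::list, not the TBB container, and the function name begin_iterator_goes_stale is hypothetical.

```cpp
#include <cassert>
#include <list>

// Returns true if an iterator taken from begin() before an insertion still
// refers to its original item but is no longer equal to begin().
inline bool begin_iterator_goes_stale() {
    std::list<int> items;
    items.push_back(10);
    std::list<int>::iterator it = items.begin();   // points to 10, the first item
    items.push_front(5);                           // a new item lands before it
    bool still_valid = (*it == 10);                // insertion did not invalidate it
    bool no_longer_first = (it != items.begin());  // begin() now points to 5
    return still_valid && no_longer_first;
}
```

The practical consequence is that code running alongside concurrent insertions should call begin() again whenever it needs the current first item, instead of caching an earlier result.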
5.3.3.1 iterator begin()
Returns: iterator pointing to first item in the set.

5.3.3.2 const_iterator begin() const
Returns: const_iterator pointing to first item in the set.

5.3.3.3 iterator end()
Returns: iterator pointing to immediately past last item in the set.

5.3.3.4 const_iterator end() const
Returns: const_iterator pointing to immediately past last item in the set.

5.3.3.5 const_iterator cbegin() const
Returns: const_iterator pointing to first item in the set.

5.3.3.6 const_iterator cend() const
Returns: const_iterator pointing to immediately after the last item in the set.

5.3.4 Modifiers

5.3.4.1 std::pair<iterator, bool> insert(const value_type& x)
Effects: Constructs copy of x and attempts to insert it into the set. Destroys the copy if the attempt fails because there was already an item with the same key.
Returns: std::pair(iterator,success). The value iterator points to an item in the set with a matching key. The value of success is true if the item was inserted; false otherwise.

5.3.4.2 iterator insert(const_iterator hint, const value_type& x)
Effects: Same as insert(x).
NOTE: The current implementation ignores the hint argument. Other implementations might not ignore it. It exists for similarity with the C++0x class unordered_set. It hints to the implementation about where to start searching. Typically it should point to an item adjacent to where the item will be inserted.
Returns: Iterator pointing to inserted item, or item already in the set with the same key.

5.3.4.3 template <typename InputIterator> void insert(InputIterator first, InputIterator last)
Effects: Does insert(*i) where i is in the half-open interval [first,last).

5.3.4.4 iterator unsafe_erase(const_iterator position)
Effects: Remove item pointed to by position from the set.
Returns: Iterator pointing to item that was immediately after the erased item, or end() if erased item was the last item in the set.
5.3.4.5 size_type unsafe_erase(const key_type& k)
Effects: Remove item with key k if such an item exists.
Returns: 1 if an item was removed; 0 otherwise.

5.3.4.6 iterator unsafe_erase(const_iterator first, const_iterator last)
Effects: Remove *i where i is in the half-open interval [first,last).
Returns: last

5.3.4.7 void clear()
Effects: Remove all items from the set.

5.3.4.8 void swap(concurrent_unordered_set& m)
Effects: Swap contents of *this and m.

5.3.5 Observers

5.3.5.1 hasher hash_function() const
Returns: Hashing functor associated with the set.

5.3.5.2 key_equal key_eq() const
Returns: Key equivalence functor associated with the set.

5.3.6 Lookup

5.3.6.1 iterator find(const key_type& k)
Returns: iterator pointing to item with key equivalent to k, or end() if no such item exists.

5.3.6.2 const_iterator find(const key_type& k) const
Returns: const_iterator pointing to item with key equivalent to k, or end() if no such item exists.

5.3.6.3 size_type count(const key_type& k) const
Returns: Number of items with keys equivalent to k.

5.3.6.4 std::pair<iterator, iterator> equal_range(const key_type& k)
Returns: Range containing all keys in the set that are equivalent to k.

5.3.6.5 std::pair<const_iterator, const_iterator> equal_range(const key_type& k) const
Returns: Range containing all keys in the set that are equivalent to k.

5.3.7 Parallel Iteration

Types const_range_type and range_type model the Container Range concept (5.1). The types differ only in that the bounds for a const_range_type are of type const_iterator, whereas the bounds for a range_type are of type iterator.

5.3.7.1 const_range_type range() const
Returns: const_range_type object representing all keys in the set.

5.3.7.2 range_type range()
Returns: range_type object representing all keys in the set.

5.3.8 Bucket Interface

The bucket interface is intended for debugging. It is not concurrency safe. The mapping of keys to buckets is implementation specific.
The interface is similar to the bucket interface for the C++0x class unordered_set, except that the prefix unsafe_ has been added as a reminder that the methods are unsafe to use during concurrent insertion.

Buckets are numbered from 0 to unsafe_bucket_count()-1. To iterate over a bucket use a local_iterator or const_local_iterator.

5.3.8.1 size_type unsafe_bucket_count() const
Returns: Number of buckets.

5.3.8.2 size_type unsafe_max_bucket_count() const
Returns: Upper bound on possible number of buckets.

5.3.8.3 size_type unsafe_bucket_size(size_type n)
Returns: Number of items in bucket n.

5.3.8.4 size_type unsafe_bucket(const key_type& k) const
Returns: Index of bucket where item with key k would be placed.

5.3.8.5 local_iterator unsafe_begin(size_type n)
Returns: local_iterator pointing to first item in bucket n.

5.3.8.6 const_local_iterator unsafe_begin(size_type n) const
Returns: const_local_iterator pointing to first item in bucket n.

5.3.8.7 local_iterator unsafe_end(size_type n)
Returns: local_iterator pointing to immediately after the last item in bucket n.

5.3.8.8 const_local_iterator unsafe_end(size_type n) const
Returns: const_local_iterator pointing to immediately after the last item in bucket n.

5.3.8.9 const_local_iterator unsafe_cbegin(size_type n) const
Returns: const_local_iterator pointing to first item in bucket n.

5.3.8.10 const_local_iterator unsafe_cend(size_type n) const
Returns: const_local_iterator pointing to immediately past last item in bucket n.

5.3.9 Hash policy

5.3.9.1 float load_factor() const
Returns: Average number of elements per bucket.

5.3.9.2 float max_load_factor() const
Returns: Maximum size of a bucket. If insertion of an item causes a bucket to be bigger, the implementation may repartition or increase the number of buckets.

5.3.9.3 void max_load_factor(float z)
Effects: Set maximum size for a bucket to z.

5.3.9.4 void rehash(size_type n)
Requirements: n must be a power of two.
Effects: No effect if current number of buckets is at least n. Otherwise increases number of buckets to n.

5.4 concurrent_hash_map Template Class

Summary
Template class for associative container with concurrent access.

Syntax
    template <typename Key, typename T,
              typename HashCompare = tbb_hash_compare<Key>,
              typename A = tbb_allocator<std::pair<Key, T> > >
    class concurrent_hash_map;

Header
    #include "tbb/concurrent_hash_map.h"

Description
A concurrent_hash_map maps keys to values in a way that permits multiple threads to concurrently access values. The keys are unordered. There is at most one element in a concurrent_hash_map for each key. The key may have other elements in flight but not in the map as described in Section 5.4.3. The interface resembles typical STL associative containers, but with some differences critical to supporting concurrent access. It meets the Container Requirements of the ISO C++ standard. Types Key and T must model the CopyConstructible concept (2.2.3). Type HashCompare specifies how keys are hashed and compared for equality. It must model the HashCompare concept in Table 19.

Table 19: HashCompare Concept

Pseudo-Signature | Semantics
HashCompare::HashCompare( const HashCompare& ) | Copy constructor.
HashCompare::~HashCompare() | Destructor.
bool HashCompare::equal( const Key& j, const Key& k ) const | True if keys are equal.
size_t HashCompare::hash( const Key& k ) const | Hashcode for key.

CAUTION: As for most hash tables, if two keys are equal, they must hash to the same hash code. That is, for a given HashCompare h and any two keys j and k, the following assertion must hold: "!h.equal(j,k) || h.hash(j)==h.hash(k)". The importance of this property is the reason that concurrent_hash_map makes key equality and hashing function travel together in a single object instead of being separate objects. The hash code of a key must not change while the hash table is non-empty.

CAUTION: Good performance depends on having good pseudo-randomness in the low-order bits of the hash code.
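A type that models the HashCompare concept of Table 19 can be sketched without TBB. The class CaseInsensitiveHashCompare below is a hypothetical example for std::string keys: because equal ignores ASCII case, hash must ignore case as well, otherwise the assertion in the first CAUTION would fail. The multiplicative string hash is an illustrative choice for this sketch, not the hash that TBB uses, and satisfies_invariant is a hypothetical helper that simply spells out the required assertion.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical HashCompare model: keys compare equal ignoring ASCII case,
// so the hash also ignores case to keep equal keys on the same hash code.
// The copy constructor and destructor required by the concept are implicit.
struct CaseInsensitiveHashCompare {
    static char lower(char c) {
        return static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    }
    bool equal(const std::string& j, const std::string& k) const {
        if (j.size() != k.size()) return false;
        for (std::size_t i = 0; i < j.size(); ++i)
            if (lower(j[i]) != lower(k[i])) return false;
        return true;
    }
    std::size_t hash(const std::string& k) const {
        std::size_t h = 0;
        for (std::size_t i = 0; i < k.size(); ++i)
            h = h * 31 + static_cast<unsigned char>(lower(k[i]));
        return h;
    }
};

// The property required of every HashCompare: equal keys share a hash code.
inline bool satisfies_invariant(const CaseInsensitiveHashCompare& h,
                                const std::string& j, const std::string& k) {
    return !h.equal(j, k) || h.hash(j) == h.hash(k);
}
```

Such a type would be supplied as the HashCompare template argument of concurrent_hash_map. Hashing the original, case-sensitive characters while comparing case-insensitively would break the invariant and make lookups miss keys that are present.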
Example
When keys are pointers, simply casting the pointer to a hash code may cause poor performance because the low-order bits of the hash code will always be zero if the pointer points to a type with alignment restrictions. A way to remove this bias is to divide the cast pointer by the size of the type, as shown below.

    size_t MyHashCompare::hash( Key* key ) const {
        return reinterpret_cast<size_t>(key)/sizeof(Key);
    }

Members
    namespace tbb {
        template <typename Key, typename T, typename HashCompare,
                  typename Alloc = tbb_allocator<std::pair<Key,T> > >
        class concurrent_hash_map {
        public:
            // types
            typedef Key key_type;
            typedef T mapped_type;
            typedef std::pair<const Key,T> value_type;
            typedef size_t size_type;
            typedef ptrdiff_t difference_type;
            typedef value_type* pointer;
            typedef const value_type* const_pointer;
            typedef value_type& reference;
            typedef Alloc allocator_type;

            // whole-table operations
            concurrent_hash_map( const allocator_type& a=allocator_type() );
            concurrent_hash_map( size_type n,
                const allocator_type& a = allocator_type() );
            concurrent_hash_map( const concurrent_hash_map&,
                const allocator_type& a=allocator_type() );
            template <typename InputIterator>
            concurrent_hash_map( InputIterator first, InputIterator last,
                const allocator_type& a = allocator_type() );
            ~concurrent_hash_map();
            concurrent_hash_map& operator=( const concurrent_hash_map& );
            void rehash( size_type n=0 );
            void clear();
            allocator_type get_allocator() const;

            // concurrent access
            class const_accessor;
            class accessor;

            // concurrent operations on a table
            bool find( const_accessor& result, const Key& key ) const;
            bool find( accessor& result, const Key& key );
            bool insert( const_accessor& result, const Key& key );
            bool insert( accessor& result, const Key& key );
            bool insert( const_accessor& result, const value_type& value );
            bool insert( accessor& result, const value_type& value );
            bool insert( const value_type& value );
            template <typename I> void insert( I first, I last );
            bool erase( const Key& key );
            bool erase( const_accessor& item_accessor );
            bool erase( accessor& item_accessor );

            // parallel iteration
            typedef implementation-defined range_type;
            typedef implementation-defined const_range_type;
            range_type range( size_t grainsize=1 );
            const_range_type range( size_t grainsize=1 ) const;

            // capacity
            size_type size() const;
            bool empty() const;
            size_type max_size() const;
            size_type bucket_count() const;

            // iterators
            typedef implementation-defined iterator;
            typedef implementation-defined const_iterator;
            iterator begin();
            iterator end();
            const_iterator begin() const;
            const_iterator end() const;
            std::pair<iterator, iterator> equal_range( const Key& key );
            std::pair<const_iterator, const_iterator> equal_range( const Key& key ) const;
        };

        template <typename Key, typename T, typename HashCompare,
                  typename A1, typename A2>
        bool operator==( const concurrent_hash_map<Key,T,HashCompare,A1>& a,
                         const concurrent_hash_map<Key,T,HashCompare,A2>& b );

        template <typename Key, typename T, typename HashCompare,
                  typename A1, typename A2>
        bool operator!=( const concurrent_hash_map<Key,T,HashCompare,A1>& a,
                         const concurrent_hash_map<Key,T,HashCompare,A2>& b );

        template <typename Key, typename T, typename HashCompare, typename A>
        void swap( concurrent_hash_map<Key,T,HashCompare,A>& a,
                   concurrent_hash_map<Key,T,HashCompare,A>& b );
    }

Exception Safety
The following functions must not throw exceptions:
• The hash function
• The destructors for types Key and T.

The following hold true:
• If an exception happens during an insert operation, the operation has no effect.
• If an exception happens during an assignment operation, the container may be in a state where only some of the items were assigned, and methods size() and empty() may return invalid answers.

5.4.1 Whole Table Operations

These operations affect an entire table. Do not concurrently invoke them on the same table.

5.4.1.1 concurrent_hash_map( const allocator_type& a = allocator_type() )
Effects: Constructs empty table.

5.4.1.2 concurrent_hash_map( size_type n, const allocator_type& a = allocator_type() )
Effects: Construct empty table with preallocated buckets for at least n items.
NOTE: In general, thread contention for buckets is inversely related to the number of buckets. If memory consumption is not an issue and P threads will be accessing the concurrent_hash_map, set n=4P.
5.4.1.3 concurrent_hash_map( const concurrent_hash_map& table, const allocator_type& a = allocator_type() )
Effects: Copies a table. The table being copied may have const operations running on it concurrently.

5.4.1.4 template <typename InputIterator> concurrent_hash_map( InputIterator first, InputIterator last, const allocator_type& a = allocator_type() )
Effects: Constructs table containing copies of elements in the half-open interval [first,last).

5.4.1.5 ~concurrent_hash_map()
Effects: Invokes clear(). This method is not safe to execute concurrently with other methods on the same concurrent_hash_map.

5.4.1.6 concurrent_hash_map& operator=( const concurrent_hash_map& source )
Effects: If source and destination (this) table are distinct, clears the destination table and copies all key-value pairs from the source table to the destination table. Otherwise, does nothing.
Returns: Reference to the destination table.

5.4.1.7 void swap( concurrent_hash_map& table )
Effects: Swaps contents and allocators of this and table.

5.4.1.8 void rehash( size_type n=0 )
Effects: Internally, the table is partitioned into buckets. Method rehash reorganizes these internal buckets in a way that may improve performance of future lookups. Raises number of internal buckets to n if n>0 and n exceeds the current number of buckets.
CAUTION: The current implementation never reduces the number of buckets. A future implementation might reduce the number of buckets if n is less than the current number of buckets.
NOTE: The ratio of items to buckets affects time and space usage by a table. A high ratio saves space at the expense of time. A low ratio does the opposite. The default ratio is 0.5 to 1 items per bucket on average.

5.4.1.9 void clear()
Effects: Erases all key-value pairs from the table. Does not hash or compare any keys.
If TBB_USE_PERFORMANCE_WARNINGS is nonzero, issues a performance warning if the randomness of the hashing is poor enough to significantly impact performance.

5.4.1.10 allocator_type get_allocator() const
Returns: Copy of allocator used to construct table.

5.4.2 Concurrent Access

Member classes const_accessor and accessor are called accessors. Accessors allow multiple threads to concurrently access pairs in a shared concurrent_hash_map. An accessor acts as a smart pointer to a pair in a concurrent_hash_map. It holds an implicit lock on a pair until the instance is destroyed or method release is called on the accessor.

Classes const_accessor and accessor differ in the kind of access that they permit.

Table 20: Differences Between const_accessor and accessor

Class | value_type | Implied Lock on pair
const_accessor | const std::pair<const Key,T> | Reader lock – permits shared access with other readers.
accessor | std::pair<const Key,T> | Writer lock – permits exclusive access by a thread. Blocks access by other threads.

Accessors cannot be assigned or copy-constructed, because allowing such would greatly complicate the locking semantics.

5.4.2.1 const_accessor

Summary
Provides read-only access to a pair in a concurrent_hash_map.

Syntax
    template <typename Key, typename T, typename HashCompare, typename A>
    class concurrent_hash_map<Key,T,HashCompare,A>::const_accessor;

Header
    #include "tbb/concurrent_hash_map.h"

Description
A const_accessor permits read-only access to a key-value pair in a concurrent_hash_map.

Members
    namespace tbb {
        template <typename Key, typename T, typename HashCompare, typename A>
        class concurrent_hash_map<Key,T,HashCompare,A>::const_accessor {
        public:
            // types
            typedef const std::pair<const Key,T> value_type;

            // construction and destruction
            const_accessor();
            ~const_accessor();

            // inspection
            bool empty() const;
            const value_type& operator*() const;
            const value_type* operator->() const;

            // early release
            void release();
        };
    }

5.4.2.1.1 bool empty() const
Returns: True if instance points to nothing; false if instance points to a key-value pair.
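The behaviour described for accessors (a smart pointer that holds a lock until it is destroyed or release is called) can be sketched with standard C++ only. Slot, LockedAccessor, acquire, and demo_accessor below are hypothetical names. The sketch uses one std::mutex per item and an exclusive lock, which makes it closer to accessor than to const_accessor, and it is an illustration of the idea, not how TBB implements these classes.

```cpp
#include <cassert>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical stand-in for one key-value pair plus its lock.
struct Slot {
    std::pair<const std::string, int> item;
    std::mutex lock;
    Slot(const std::string& k, int v) : item(k, v) {}
};

// Hypothetical accessor-like class: holds the lock while it points to a pair.
class LockedAccessor {
    Slot* slot_;
public:
    LockedAccessor() : slot_(nullptr) {}
    LockedAccessor(const LockedAccessor&) = delete;             // not copyable,
    LockedAccessor& operator=(const LockedAccessor&) = delete;  // like TBB accessors
    ~LockedAccessor() { release(); }
    void acquire(Slot& s) { release(); s.lock.lock(); slot_ = &s; }
    bool empty() const { return slot_ == nullptr; }
    void release() {
        if (slot_) { slot_->lock.unlock(); slot_ = nullptr; }
    }
    std::pair<const std::string, int>& operator*() const { return slot_->item; }
    std::pair<const std::string, int>* operator->() const { return &slot_->item; }
};

// Walks through the accessor life cycle and returns the updated value.
inline int demo_accessor() {
    Slot s("answer", 41);
    {
        LockedAccessor acc;   // points to nothing, holds no lock
        acc.acquire(s);       // lock taken; acc now points to the pair
        acc->second += 1;     // exclusive access while the lock is held
        acc.release();        // early release; acc points to nothing again
        acc.acquire(s);       // possible only because release() unlocked
    }                         // destructor releases the lock
    s.lock.lock();            // would block forever if the lock had leaked
    s.lock.unlock();
    return s.item.second;
}
```

Tying the lock to the accessor's lifetime is what lets the interface avoid visible lock and unlock calls: leaving the scope is enough to give other threads access to the pair.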
5.4.2.1.2 void release()
Effects: If !empty(), releases the implied lock on the pair, and sets instance to point to nothing. Otherwise does nothing.

5.4.2.1.3 const value_type& operator*() const
Effects: Raises assertion failure if empty() and TBB_USE_ASSERT (3.2.1) is defined as nonzero.
Returns: Const reference to key-value pair.

5.4.2.1.4 const value_type* operator->() const
Returns: &operator*()

5.4.2.1.5 const_accessor()
Effects: Constructs const_accessor that points to nothing.

5.4.2.1.6 ~const_accessor()
Effects: If pointing to key-value pair, releases the implied lock on the pair.

5.4.2.2 accessor

Summary
Class that provides read and write access to a pair in a concurrent_hash_map.

Syntax
    template <typename Key, typename T, typename HashCompare, typename A>
    class concurrent_hash_map<Key,T,HashCompare,A>::accessor;

Header
    #include "tbb/concurrent_hash_map.h"

Description
An accessor permits read and write access to a key-value pair in a concurrent_hash_map. It is derived from a const_accessor, and thus can be implicitly cast to a const_accessor.

Members
    namespace tbb {
        template <typename Key, typename T, typename HashCompare, typename A>
        class concurrent_hash_map<Key,T,HashCompare,A>::accessor:
            concurrent_hash_map<Key,T,HashCompare,A>::const_accessor {
        public:
            typedef std::pair<const Key,T> value_type;
            value_type& operator*() const;
            value_type* operator->() const;
        };
    }

5.4.2.2.1 value_type& operator*() const
Effects: Raises assertion failure if empty() and TBB_USE_ASSERT (3.2.1) is defined as nonzero.
Returns: Reference to key-value pair.

5.4.2.2.2 value_type* operator->() const
Returns: &operator*()

5.4.3 Concurrent Operations

The operations count, find, insert, and erase are the only operations that may be concurrently invoked on the same concurrent_hash_map. These operations search the table for a key-value pair that matches a given key. The find and insert methods each have two variants. One takes a const_accessor argument and provides read-only access to the desired key-value pair. The other takes an accessor argument and provides write access.
Additionally, insert has a variant without any accessor.

CAUTION: The concurrent operations (count, find, insert, and erase) invalidate any iterators pointing into the affected instance, even those with the const qualifier. It is unsafe to use these operations concurrently with any other operation. An exception to this rule is that count and find do not invalidate iterators if no insertions or erasures have occurred after the most recent call to method rehash.

TIP: In serial code, the equal_range method should be used instead of the find method for lookup, since equal_range is faster and does not invalidate iterators.

TIP: If the non-const variant succeeds in finding the key, the consequent write access blocks any other thread from accessing the key until the accessor object is destroyed. Where possible, use the const variant to improve concurrency.

Each map operation in this section returns true if the operation succeeds, false otherwise.

CAUTION: Though there can be at most one occurrence of a given key in the map, there may be other key-value pairs in flight with the same key. These arise from the semantics of the insert and erase methods. The insert methods can create and destroy a temporary key-value pair that is not inserted into a map. The erase methods remove a key-value pair from the map before destroying it, thus permitting another thread to construct a similar key before the old one is destroyed.

TIP: To guarantee that only one instance of a resource exists simultaneously for a given key, use the following technique:
• To construct the resource: Obtain an accessor to the key in the map before constructing the resource.
• To destroy the resource: Obtain an accessor to the key, destroy the resource, and then erase the key using the accessor.

Below is a sketch of how this can be done.
    extern tbb::concurrent_hash_map<Key,Resource,HashCompare> Map;

    void ConstructResource( Key key ) {
        accessor acc;
        if( Map.insert(acc,key) ) {
            // Current thread inserted key and has exclusive access.
            ...construct the resource here...
        }
        // Implicit destruction of acc releases lock
    }

    void DestroyResource( Key key ) {
        accessor acc;
        if( Map.find(acc,key) ) {
            // Current thread found key and has exclusive access.
            ...destroy the resource here...
            // Erase key using accessor.
            Map.erase(acc);
        }
    }

5.4.3.1 size_type count( const Key& key ) const
CAUTION: This method may invalidate previously obtained iterators. In serial code, you can instead use equal_range, which does not have such problems.
Returns: 1 if map contains key; 0 otherwise.

5.4.3.2 bool find( const_accessor& result, const Key& key ) const
Effects: Searches table for pair with given key. If key is found, sets result to provide read-only access to the matching pair.
CAUTION: This method may invalidate previously obtained iterators. In serial code, you can instead use equal_range, which does not have such problems.
Returns: True if key was found; false if key was not found.

5.4.3.3 bool find( accessor& result, const Key& key )
Effects: Searches table for pair with given key. If key is found, sets result to provide write access to the matching pair.
CAUTION: This method may invalidate previously obtained iterators. In serial code, you can instead use equal_range, which does not have such problems.
Returns: True if key was found; false if key was not found.

5.4.3.4 bool insert( const_accessor& result, const Key& key )
Effects: Searches table for pair with given key. If not present, inserts new pair(key,T()) into the table. Sets result to provide read-only access to the matching pair.
Returns: True if new pair was inserted; false if key was already in the map.

5.4.3.5 bool insert( accessor& result, const Key& key )
Effects: Searches table for pair with given key.
If not present, inserts new pair(key,T()) into the table. Sets result to provide write access to the matching pair.
Returns
true if new pair was inserted; false if key was already in the map.

5.4.3.6 bool insert( const_accessor& result, const value_type& value )
Effects
Searches table for pair with given key. If not present, inserts new pair copy-constructed from value into the table. Sets result to provide read-only access to the matching pair.
Returns
true if new pair was inserted; false if key was already in the map.

5.4.3.7 bool insert( accessor& result, const value_type& value )
Effects
Searches table for pair with given key. If not present, inserts new pair copy-constructed from value into the table. Sets result to provide write access to the matching pair.
Returns
true if new pair was inserted; false if key was already in the map.

5.4.3.8 bool insert( const value_type& value )
Effects
Searches table for pair with given key. If not present, inserts new pair copy-constructed from value into the table.
Returns
true if new pair was inserted; false if key was already in the map.
TIP: If you do not need to access the data after insertion, use the form of insert that does not take an accessor; it may work faster and use fewer locks.

5.4.3.9 template<typename InputIterator> void insert( InputIterator first, InputIterator last )
Effects
For each pair p in the half-open interval [first,last), does insert(p). The order of the insertions, or whether they are done concurrently, is unspecified.
CAUTION: The current implementation processes the insertions in order. Future implementations may do the insertions concurrently. If duplicate keys exist in [first,last), do not depend on their insertion order.

5.4.3.10 bool erase( const Key& key )
Effects
Searches table for pair with given key. Removes the matching pair if it exists.
If there is an accessor pointing to the pair, the pair is nonetheless removed from the table but its destruction is deferred until all accessors stop pointing to it.
Returns
true if pair was removed by the call; false if key was not found in the map.

5.4.3.11 bool erase( const_accessor& item_accessor )
Requirements
item_accessor.empty()==false
Effects
Removes pair referenced by item_accessor. Concurrent insertion of the same key creates a new pair in the table.
Returns
true if pair was removed by this thread; false if pair was removed by another thread.

5.4.3.12 bool erase( accessor& item_accessor )
Requirements
item_accessor.empty()==false
Effects
Removes pair referenced by item_accessor. Concurrent insertion of the same key creates a new pair in the table.
Returns
true if pair was removed by this thread; false if pair was removed by another thread.

5.4.4 Parallel Iteration
Types const_range_type and range_type model the Container Range concept (5.1). The types differ only in that the bounds for a const_range_type are of type const_iterator, whereas the bounds for a range_type are of type iterator.
NOTE: Do not call concurrent operations, including count and find, while iterating the table. Use concurrent_unordered_map if concurrent traversal and insertion are required.

5.4.4.1 const_range_type range( size_t grainsize=1 ) const
Effects
Constructs a const_range_type representing all keys in the table. The parameter grainsize is in units of hash table buckets. Each bucket holds about one key-value pair on average.
Returns
const_range_type object for the table.

5.4.4.2 range_type range( size_t grainsize=1 )
Returns
range_type object for the table.

5.4.5 Capacity

5.4.5.1 size_type size() const
Returns
Number of key-value pairs in the table.
NOTE: This method takes constant time, but is slower than for most STL containers.

5.4.5.2 bool empty() const
Returns
size()==0.
NOTE: This method takes constant time, but is slower than for most STL containers.

5.4.5.3 size_type max_size() const
Returns
Inclusive upper bound on number of key-value pairs that the table can hold.

5.4.5.4 size_type bucket_count() const
Returns
Current number of internal buckets. See method rehash for discussion of buckets.

5.4.6 Iterators
Template class concurrent_hash_map supports forward iterators; that is, iterators that can advance only forwards across a table. Reverse iterators are not supported. Concurrent operations (count, find, insert, and erase) invalidate any existing iterators that point into the table. An exception to this rule is that count and find do not invalidate iterators if no insertions or erasures have occurred after the most recent call to method rehash.
NOTE: Do not call concurrent operations, including count and find, while iterating the table. Use concurrent_unordered_map if concurrent traversal and insertion are required.

5.4.6.1 iterator begin()
Returns
iterator pointing to beginning of key-value sequence.

5.4.6.2 iterator end()
Returns
iterator pointing to end of key-value sequence.

5.4.6.3 const_iterator begin() const
Returns
const_iterator pointing to beginning of key-value sequence.

5.4.6.4 const_iterator end() const
Returns
const_iterator pointing to end of key-value sequence.

5.4.6.5 std::pair<iterator, iterator> equal_range( const Key& key );
Returns
Pair of iterators (i,j) such that the half-open range [i,j) contains all pairs in the map (and only such pairs) with keys equal to key. Because the map has no duplicate keys, the half-open range is either empty or contains a single pair.
TIP: This method is a serial alternative to the concurrent count and find methods.

5.4.6.6 std::pair<const_iterator, const_iterator> equal_range( const Key& key ) const;
Description
See 5.4.6.5.

5.4.7 Global Functions
These functions in namespace tbb improve the STL compatibility of concurrent_hash_map.
5.4.7.1 template<typename Key, typename T, typename HashCompare, typename A1, typename A2> bool operator==( const concurrent_hash_map<Key,T,HashCompare,A1>& a, const concurrent_hash_map<Key,T,HashCompare,A2>& b);
Returns
true if a and b contain equal sets of keys and for each pair (k,v1)∈a and pair (k,v2)∈b, the expression bool(v1==v2) is true.

5.4.7.2 template<typename Key, typename T, typename HashCompare, typename A1, typename A2> bool operator!=( const concurrent_hash_map<Key,T,HashCompare,A1>& a, const concurrent_hash_map<Key,T,HashCompare,A2>& b);
Returns
!(a==b)

5.4.7.3 template<typename Key, typename T, typename HashCompare, typename A> void swap( concurrent_hash_map<Key,T,HashCompare,A>& a, concurrent_hash_map<Key,T,HashCompare,A>& b)
Effects
a.swap(b)

5.4.8 tbb_hash_compare Class
Summary
Default HashCompare for concurrent_hash_map.
Syntax
    template<typename Key> struct tbb_hash_compare;
Header
    #include "tbb/concurrent_hash_map.h"
Description
A tbb_hash_compare is the default for the HashCompare argument of template class concurrent_hash_map. The built-in definition relies on operator== and tbb_hasher as shown in the Members description. For your own types, you can define a template specialization of tbb_hash_compare or define an overload of tbb_hasher.
There are built-in definitions of tbb_hasher for the following Key types:
• Types that are convertible to a size_t by static_cast
• Pointer types
• std::basic_string
• std::pair<K1,K2> where K1 and K2 are hashed using tbb_hasher.
Members
    namespace tbb {
        template<typename Key>
        struct tbb_hash_compare {
            static size_t hash(const Key& a) {
                return tbb_hasher(a);
            }
            static bool equal(const Key& a, const Key& b) {
                return a==b;
            }
        };
        template<typename T> size_t tbb_hasher(const T&);
        template<typename T> size_t tbb_hasher(T*);
        template<typename T, typename Traits, typename Alloc>
        size_t tbb_hasher(const std::basic_string<T,Traits,Alloc>&);
        template<typename T1, typename T2>
        size_t tbb_hasher(const std::pair<T1,T2>& );
    };

5.5 concurrent_queue Template Class
Summary
Template class for queue with concurrent operations.
Syntax
    template<typename T, typename Alloc=cache_aligned_allocator<T> > class concurrent_queue;
Header
    #include "tbb/concurrent_queue.h"
Description
A concurrent_queue is a first-in first-out data structure that permits multiple threads to concurrently push and pop items. Its capacity is unbounded[7], subject to memory limitations on the target machine.
The interface is similar to STL std::queue except where it must differ to make concurrent modification safe.

Table 21: Differences Between STL queue and Intel® Threading Building Blocks concurrent_queue

| Feature | STL std::queue | concurrent_queue |
| Access to front and back | Methods front and back | Not present. They would be unsafe while concurrent operations are in progress. |
| size_type | unsigned integral type | signed integral type |
| unsafe_size() | Returns number of items in queue | Returns number of items in queue. May return incorrect value if any push or try_pop operations are concurrently in flight. |
| Copy and pop item unless queue q is empty. | bool b=!q.empty(); if(b) { x=q.front(); q.pop(); } | bool b = q.try_pop(x) |

[7] In Intel® TBB 2.1, a concurrent_queue could be bounded. Intel® TBB 2.2 moves this functionality to concurrent_bounded_queue. Compile with TBB_DEPRECATED=1 to restore the old functionality, or (recommended) use concurrent_bounded_queue instead.

Members
    namespace tbb {
        template<typename T, typename Alloc=cache_aligned_allocator<T> >
        class concurrent_queue {
        public:
            // types
            typedef T value_type;
            typedef T& reference;
            typedef const T& const_reference;
            typedef std::ptrdiff_t size_type;
            typedef std::ptrdiff_t difference_type;
            typedef Alloc allocator_type;

            explicit concurrent_queue(const Alloc& a = Alloc ());
            concurrent_queue(const concurrent_queue& src, const Alloc& a = Alloc());
            template<typename InputIterator>
            concurrent_queue(InputIterator first, InputIterator last, const Alloc& a = Alloc());
            ~concurrent_queue();

            void push( const T& source );
            // Called pop_if_present in Intel® TBB 2.1.
            // Compile with TBB_DEPRECATED=1 to use the old name.
            bool try_pop( T& destination );
            void clear() ;

            size_type unsafe_size() const;
            bool empty() const;
            Alloc get_allocator() const;
            typedef implementation-defined iterator;
            typedef implementation-defined const_iterator;

            // iterators (these are slow and intended only for debugging)
            iterator unsafe_begin();
            iterator unsafe_end();
            const_iterator unsafe_begin() const;
            const_iterator unsafe_end() const;
        };
    }

5.5.1 concurrent_queue( const Alloc& a = Alloc () )
Effects
Constructs empty queue.

5.5.2 concurrent_queue( const concurrent_queue& src, const Alloc& a = Alloc() )
Effects
Constructs a copy of src.

5.5.3 template<typename InputIterator> concurrent_queue( InputIterator first, InputIterator last, const Alloc& a = Alloc() )
Effects
Constructs a queue containing copies of elements in the iterator half-open interval [first,last).

5.5.4 ~concurrent_queue()
Effects
Destroys all items in the queue.

5.5.5 void push( const T& source )
Effects
Pushes a copy of source onto back of the queue.

5.5.6 bool try_pop ( T& destination )
Effects
If value is available, pops it from the queue, assigns it to destination, and destroys the original value. Otherwise does nothing.
Returns
true if value was popped; false otherwise.

5.5.7 void clear()
Effects
Clears the queue. Afterwards unsafe_size()==0.

5.5.8 size_type unsafe_size() const
Returns
Number of items in the queue. If there are concurrent modifications in flight, the value might not reflect the actual number of items in the queue.

5.5.9 bool empty() const
Returns
true if queue has no items; false otherwise.

5.5.10 Alloc get_allocator() const
Returns
Copy of allocator used to construct the queue.

5.5.11 Iterators
A concurrent_queue provides limited iterator support that is intended solely to allow programmers to inspect a queue during debugging. It provides iterator and const_iterator types. Both follow the usual STL conventions for forward iterators. The iteration order is from least recently pushed to most recently pushed. Modifying a concurrent_queue invalidates any iterators that reference it.
CAUTION: The iterators are relatively slow. They should be used only for debugging.

Example
The following program builds a queue with the integers 0..9, and then dumps the queue to standard output. Its overall effect is to print 0 1 2 3 4 5 6 7 8 9.

    #include "tbb/concurrent_queue.h"
    #include <iostream>

    using namespace std;
    using namespace tbb;

    int main() {
        concurrent_queue<int> queue;
        for( int i=0; i<10; ++i )
            queue.push(i);
        typedef concurrent_queue<int>::iterator iter;
        for(iter i(queue.unsafe_begin()); i!=queue.unsafe_end(); ++i)
            cout << *i << " ";
        cout << endl;
        return 0;
    }

5.5.11.1 iterator unsafe_begin()
Returns
iterator pointing to beginning of the queue.

5.5.11.2 iterator unsafe_end()
Returns
iterator pointing to end of the queue.

5.5.11.3 const_iterator unsafe_begin() const
Returns
const_iterator pointing to beginning of the queue.

5.5.11.4 const_iterator unsafe_end() const
Returns
const_iterator pointing to end of the queue.

5.6 concurrent_bounded_queue Template Class
Summary
Template class for bounded dual queue with concurrent operations.
Syntax
    template<typename T, class Alloc=cache_aligned_allocator<T> > class concurrent_bounded_queue;
Header
    #include "tbb/concurrent_queue.h"
Description
A concurrent_bounded_queue is similar to a concurrent_queue, but with the following differences:
• Adds the ability to specify a capacity. The default capacity makes the queue practically unbounded.
• Changes the push operation so that it waits until it can complete without exceeding the capacity.
• Adds a waiting pop operation that waits until it can pop an item.
• Changes the size_type to a signed type.
• Changes the size() operation to return the number of push operations minus the number of pop operations. For example, if there are 3 pop operations waiting on an empty queue, size() returns -3.

Members
To aid comparison, the parts that differ from concurrent_queue are annotated.
    namespace tbb {
        template<typename T, class Alloc=cache_aligned_allocator<T> >
        class concurrent_bounded_queue {
        public:
            // types
            typedef T value_type;
            typedef T& reference;
            typedef const T& const_reference;
            typedef Alloc allocator_type;
            // size_type is signed type
            typedef std::ptrdiff_t size_type;
            typedef std::ptrdiff_t difference_type;

            explicit concurrent_bounded_queue(const allocator_type& a = allocator_type());
            concurrent_bounded_queue( const concurrent_bounded_queue& src, const allocator_type& a = allocator_type());
            template<typename InputIterator>
            concurrent_bounded_queue( InputIterator begin, InputIterator end, const allocator_type& a = allocator_type());
            ~concurrent_bounded_queue();

            // waits until it can push without exceeding capacity.
            void push( const T& source );
            // waits if *this is empty
            void pop( T& destination );

            // skips push if it would exceed capacity.
            // Called push_if_not_full in Intel® TBB 2.1.
            bool try_push( const T& source );
            // Called pop_if_present in Intel® TBB 2.1.
            bool try_pop( T& destination );
            void clear() ;

            // safe to call during concurrent modification, can return negative size.
            size_type size() const;
            bool empty() const;
            size_type capacity() const;
            void set_capacity( size_type capacity );
            allocator_type get_allocator() const;

            typedef implementation-defined iterator;
            typedef implementation-defined const_iterator;

            // iterators (these are slow and intended only for debugging)
            iterator unsafe_begin();
            iterator unsafe_end();
            const_iterator unsafe_begin() const;
            const_iterator unsafe_end() const;
        };
    }

Because concurrent_bounded_queue is similar to concurrent_queue, the following subsections describe only methods that differ.

5.6.1 void push( const T& source )
Effects
Waits until size(), typename Alloc=cache_aligned_allocator > class concurrent_priority_queue;
Header
    #include "tbb/concurrent_priority_queue.h"
Description
A concurrent_priority_queue is a container that permits multiple threads to concurrently push and pop items.
Items are popped in priority order as determined by a template parameter. The queue's capacity is unbounded, subject to memory limitations on the target machine. The interface is similar to STL std::priority_queue except where it must differ to make concurrent modification safe.

Table 43: Differences between STL priority_queue and Intel® Threading Building Blocks concurrent_priority_queue

| Feature | STL std::priority_queue | concurrent_priority_queue |
| Choice of underlying container | Sequence template parameter | No choice of underlying container; allocator choice is provided instead |
| Access to highest priority item | const value_type& top() const | Not available. Unsafe for concurrent container |
| Copy and pop item if present | bool b=!q.empty(); if(b) { x=q.top(); q.pop(); } | bool b = q.try_pop(x); |
| Get number of items in queue | size_type size() const | Same, but may be inaccurate due to pending concurrent push or pop operations |
| Check if there are items in queue | bool empty() const | Same, but may be inaccurate due to pending concurrent push or pop operations |

Members
    namespace tbb {
        template <typename T, typename Compare=std::less<T>, typename A=cache_aligned_allocator<T> >
        class concurrent_priority_queue {
            typedef T value_type;
            typedef T& reference;
            typedef const T& const_reference;
            typedef size_t size_type;
            typedef ptrdiff_t difference_type;
            typedef A allocator_type;

            concurrent_priority_queue(const allocator_type& a = allocator_type());
            concurrent_priority_queue(size_type init_capacity, const allocator_type& a = allocator_type());
            template<typename InputIterator>
            concurrent_priority_queue(InputIterator begin, InputIterator end, const allocator_type& a = allocator_type());
            concurrent_priority_queue(const concurrent_priority_queue& src, const allocator_type& a = allocator_type());
            concurrent_priority_queue& operator=(const concurrent_priority_queue& src);
            ~concurrent_priority_queue();

            bool empty() const;
            size_type size() const;
            void push(const_reference elem);
            bool try_pop(reference elem);
            void clear();
            void swap(concurrent_priority_queue&
                other);
            allocator_type get_allocator() const;
        };
    }

5.7.1 concurrent_priority_queue(const allocator_type& a = allocator_type())
Effects
Constructs empty queue.

5.7.2 concurrent_priority_queue(size_type init_capacity, const allocator_type& a = allocator_type())
Effects
Constructs an empty queue with an initial capacity.

5.7.3 template<typename InputIterator> concurrent_priority_queue(InputIterator begin, InputIterator end, const allocator_type& a = allocator_type())
Effects
Constructs a queue containing copies of elements in the iterator half-open interval [begin, end).

5.7.4 concurrent_priority_queue (const concurrent_priority_queue& src, const allocator_type& a = allocator_type())
Effects
Constructs a copy of src. This operation is not thread-safe and may result in an error or an invalid copy of src if another thread is concurrently modifying src.

5.7.5 concurrent_priority_queue& operator=(const concurrent_priority_queue& src)
Effects
Assigns contents of src to *this. This operation is not thread-safe and may result in an error or an invalid copy of src if another thread is concurrently modifying src.

5.7.6 ~concurrent_priority_queue()
Effects
Destroys all items in the queue, and the container itself, so that it can no longer be used.

5.7.7 bool empty() const
Returns
true if queue has no items; false otherwise. May be inaccurate when concurrent push or try_pop operations are pending. This operation reads shared data and may be reported as a race condition by race detection tools when used concurrently.

5.7.8 size_type size() const
Returns
Number of items in the queue. May be inaccurate when concurrent push or try_pop operations are pending. This operation reads shared data and may be reported as a race condition by race detection tools when used concurrently.

5.7.9 void push(const_reference elem)
Effects
Pushes a copy of elem into the queue. This operation is thread-safe with other push and try_pop operations.
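The interface differences listed in Table 43 can be illustrated without Intel® TBB itself. The following is a minimal sketch in standard C++ only: the class name locked_priority_queue and the function demo_priority_order are hypothetical, and the single mutex is an assumption made for brevity, not a description of how concurrent_priority_queue is implemented. It shows why the test-top-pop sequence of std::priority_queue is folded into one try_pop call: the emptiness check and the removal happen under the same lock, so no other thread can intervene between them.

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <queue>
#include <vector>

// Hypothetical stand-in for illustration only; not TBB's implementation.
template <typename T, typename Compare = std::less<T> >
class locked_priority_queue {
public:
    // Pushes a copy of elem. Safe to call from several threads.
    void push(const T& elem) {
        std::lock_guard<std::mutex> lock(mutex_);
        items_.push(elem);
    }

    // Copies the highest-priority item into elem and removes it.
    // Returns false and leaves elem untouched if the queue is empty.
    bool try_pop(T& elem) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (items_.empty()) return false;
        elem = items_.top();
        items_.pop();
        return true;
    }

private:
    std::mutex mutex_;
    std::priority_queue<T, std::vector<T>, Compare> items_;
};

// Usage sketch: items come out highest priority first.
inline void demo_priority_order() {
    locked_priority_queue<int> q;
    q.push(3);
    q.push(9);
    q.push(5);
    int x = 0;
    while (q.try_pop(x)) {
        assert(x == 9 || x == 5 || x == 3);
    }
}
```

A separate top() accessor is deliberately absent from the sketch, for the same reason the table gives: a reference returned by top() could refer to an item that another thread pops immediately afterwards.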
5.7.10 bool try_pop(reference elem)
Effects
If the queue is not empty, copies the highest priority item from the queue and assigns it to elem, and destroys the popped item in the queue; otherwise, does nothing. This operation is thread-safe with other push and try_pop operations.
Returns
true if an item was popped; false otherwise.

5.7.11 void clear()
Effects
Clears the queue; results in size()==0. This operation is not thread-safe.

5.7.12 void swap(concurrent_priority_queue& other)
Effects
Swaps the queue contents with those of other. This operation is not thread-safe.

5.7.13 allocator_type get_allocator() const
Returns
Copy of allocator used to construct the queue.

5.8 concurrent_vector
Summary
Template class for vector that can be concurrently grown and accessed.
Syntax
    template<typename T, class A=cache_aligned_allocator<T> > class concurrent_vector;
Header
    #include "tbb/concurrent_vector.h"
Description
A concurrent_vector is a container with the following features:
• Random access by index. The index of the first element is zero.
• Multiple threads can grow the container and append new elements concurrently.
• Growing the container does not invalidate existing iterators or indices.
A concurrent_vector meets all requirements for a Container and a Reversible Container as specified in the ISO C++ standard. It does not meet the Sequence requirements due to absence of methods insert() and erase().
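One way to see how growth can leave existing elements in place is to store the elements in a series of separately allocated arrays rather than in one array that is reallocated. The sketch below is a simplified, single-threaded model in standard C++; the class segmented_int_array and the doubling segment sizes are assumptions chosen for illustration and are not taken from the Intel® TBB implementation. Appending allocates a new segment when needed and never copies or moves elements that already exist, so an address obtained earlier stays valid.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical illustration only; the real container's layout and
// growth policy may differ, and this model is not thread-safe.
class segmented_int_array {
public:
    segmented_int_array() : size_(0) {}

    // Appends value and returns its index. Existing elements never move.
    std::size_t push_back(int value) {
        std::size_t index = size_++;
        std::size_t seg = segment_of(index);
        if (seg == segments_.size())
            segments_.push_back(std::unique_ptr<int[]>(new int[segment_size(seg)]));
        segments_[seg][index - segment_base(seg)] = value;
        return index;
    }

    int& operator[](std::size_t index) {
        assert(index < size_);
        std::size_t seg = segment_of(index);
        return segments_[seg][index - segment_base(seg)];
    }

    std::size_t size() const { return size_; }

private:
    // Segment k holds 2^k elements and starts at index 2^k - 1.
    static std::size_t segment_size(std::size_t k) { return std::size_t(1) << k; }
    static std::size_t segment_base(std::size_t k) { return (std::size_t(1) << k) - 1; }
    static std::size_t segment_of(std::size_t index) {
        std::size_t k = 0;
        while (segment_base(k + 1) <= index) ++k;
        return k;
    }

    // Only the table of segment pointers may be reallocated;
    // the element arrays themselves stay where they are.
    std::vector<std::unique_ptr<int[]> > segments_;
    std::size_t size_;
};

// Returns true if the address of element 0 is unchanged after heavy growth.
inline bool first_element_stays_put() {
    segmented_int_array a;
    a.push_back(42);
    int* before = &a[0];
    for (int i = 1; i < 1000; ++i) a.push_back(i);
    return before == &a[0] && *before == 42 && a.size() == 1000;
}
```

The price of this layout is that elements are not contiguous across segments, which is why the iterators of such a container cannot be raw pointers.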
Members
    namespace tbb {
        template<typename T, class A=cache_aligned_allocator<T> >
        class concurrent_vector {
        public:
            typedef size_t size_type;
            // This rebinding follows practice established by both the
            // Microsoft and GNU implementations of std::vector.
            typedef allocator-A-rebound-for-T allocator_type;
            typedef T value_type;
            typedef ptrdiff_t difference_type;
            typedef T& reference;
            typedef const T& const_reference;
            typedef T* pointer;
            typedef const T *const_pointer;
            typedef implementation-defined iterator;
            typedef implementation-defined const_iterator;
            typedef implementation-defined reverse_iterator;
            typedef implementation-defined const_reverse_iterator;

            // Parallel ranges
            typedef implementation-defined range_type;
            typedef implementation-defined const_range_type;
            range_type range( size_t grainsize );
            const_range_type range( size_t grainsize ) const;

            // Constructors
            explicit concurrent_vector( const allocator_type& a = allocator_type() );
            concurrent_vector( const concurrent_vector& x );
            template<typename M>
            concurrent_vector( const concurrent_vector<T, M>& x );
            explicit concurrent_vector( size_type n, const T& t=T(), const allocator_type& a = allocator_type() );
            template<typename InputIterator>
            concurrent_vector(InputIterator first, InputIterator last, const allocator_type& a=allocator_type());
            // Assignment
            concurrent_vector& operator=( const concurrent_vector& x );
            template<class M>
            concurrent_vector& operator=( const concurrent_vector<T, M>& x );
            void assign( size_type n, const T& t );
            template<class InputIterator>
            void assign( InputIterator first, InputIterator last );

            // Concurrent growth operations
            // The return types of the growth methods are different in
            // Intel® TBB 2.2 than in prior versions. See footnotes in the
            // descriptions of the individual methods for details.
            iterator grow_by( size_type delta );
            iterator grow_by( size_type delta, const T& t );
            iterator grow_to_at_least( size_type n );
            iterator push_back( const T& item );

            // Items access
            reference operator[]( size_type index );
            const_reference operator[]( size_type index ) const;
            reference at( size_type index );
            const_reference at( size_type index ) const;
            reference front();
            const_reference front() const;
            reference back();
            const_reference back() const;

            // Storage
            bool empty() const;
            size_type capacity() const;
            size_type max_size() const;
            size_type size() const;
            allocator_type get_allocator() const;

            // Non-concurrent operations on whole container
            void reserve( size_type n );
            void compact();
            void swap( concurrent_vector& vector );
            void clear();
            ~concurrent_vector();

            // Iterators
            iterator begin();
            iterator end();
            const_iterator begin() const;
            const_iterator end() const;
            reverse_iterator rbegin();
            reverse_iterator rend();
            const_reverse_iterator rbegin() const;
            const_reverse_iterator rend() const;

            // C++0x extensions
            const_iterator cbegin() const;
            const_iterator cend() const;
            const_reverse_iterator crbegin() const;
            const_reverse_iterator crend() const;
        };

        // Template functions
        template<typename T, class A1, class A2>
        bool operator==( const concurrent_vector<T, A1>& a, const concurrent_vector<T, A2>& b );
        template<typename T, class A1, class A2>
        bool operator!=( const concurrent_vector<T, A1>& a, const concurrent_vector<T, A2>& b );
        template<typename T, class A1, class A2>
        bool operator<( const concurrent_vector<T, A1>& a, const concurrent_vector<T, A2>& b );
        template<typename T, class A1, class A2>
        bool operator>( const concurrent_vector<T, A1>& a, const concurrent_vector<T, A2>& b );
        template<typename T, class A1, class A2>
        bool operator<=( const concurrent_vector<T, A1>& a, const concurrent_vector<T, A2>& b );
        template<typename T, class A1, class A2>
        bool operator>=( const concurrent_vector<T, A1>& a, const concurrent_vector<T, A2>& b );
        template<typename T, class A>
        void swap( concurrent_vector<T, A>& a, concurrent_vector<T, A>& b );
    }

Exception Safety
Concurrent growing is fundamentally incompatible with ideal exception safety.[13] Nonetheless, concurrent_vector offers a practical level of exception safety.
Element type T must meet the following requirements:
• Its destructor must not throw an exception.
• If its default constructor can throw an exception, its destructor must be non-virtual and work correctly on zero-filled memory.
Otherwise the program's behavior is undefined.
Growth (5.8.3) and vector assignment (5.8.1) append a sequence of elements to a vector. If an exception occurs, the impact on the vector depends upon the cause of the exception:
• If the exception is thrown by the constructor of an element, then all subsequent elements in the appended sequence will be zero-filled.
• Otherwise, the exception was thrown by the vector's allocator. The vector becomes broken.
  Each element in the appended sequence will be in one of three states:
    o constructed
    o zero-filled
    o unallocated in memory
Once a vector becomes broken, care must be taken when accessing it:
• Accessing an unallocated element with method at causes an exception std::range_error. Any other way of accessing an unallocated element has undefined behavior.
• The values of capacity() and size() may be less than expected.
• Access to a broken vector via back() has undefined behavior.
However, the following guarantees hold for broken or unbroken vectors:
• Let k be an index of an unallocated element. Then size()≤capacity()≤k.
• Growth operations never cause size() or capacity() to decrease.
If a concurrent growth operation successfully completes, the appended sequence remains valid and accessible even if a subsequent growth operation fails.

[13] For example, consider P threads each appending N elements. To be perfectly exception safe, these operations would have to be serialized, because each operation has to know that the previous operation succeeded before allocating more indices.

Fragmentation
Unlike a std::vector, a concurrent_vector never moves existing elements when it grows. The container allocates a series of contiguous arrays. The first reservation, growth, or assignment operation determines the size of the first array. Using a small number of elements as initial size incurs fragmentation across cache lines that may increase element access time. The method shrink_to_fit() merges several smaller arrays into a single contiguous array, which may improve access time.

5.8.1 Construction, Copy, and Assignment
Safety
These operations must not be invoked concurrently on the same vector.

5.8.1.1 concurrent_vector( const allocator_type& a = allocator_type() )
Effects
Constructs empty vector using optionally specified allocator instance.
5.8.1.2 concurrent_vector( size_type n, const_reference t=T(), const allocator_type& a = allocator_type() );
Effects
Constructs vector of n copies of t, using optionally specified allocator instance. If t is not specified, each element is default constructed instead of copied.

5.8.1.3 template<typename InputIterator> concurrent_vector( InputIterator first, InputIterator last, const allocator_type& a = allocator_type() )
Effects
Constructs vector that is a copy of the sequence [first,last), making only N calls to the copy constructor of T, where N is the distance between first and last.

5.8.1.4 concurrent_vector( const concurrent_vector& src )
Effects
Constructs copy of src.

5.8.1.5 concurrent_vector& operator=( const concurrent_vector& src )
Effects
Assigns contents of src to *this.
Returns
Reference to left hand side.

5.8.1.6 template<typename M> concurrent_vector& operator=( const concurrent_vector<T, M>& src )
Effects
Assigns contents of src to *this.
Returns
Reference to left hand side.

5.8.1.7 void assign( size_type n, const_reference t )
Effects
Assigns n copies of t.

5.8.1.8 template<class InputIterator> void assign( InputIterator first, InputIterator last )
Effects
Assigns copies of sequence [first,last), making only N calls to the copy constructor of T, where N is the distance between first and last.

5.8.2 Whole Vector Operations
Safety
Concurrent invocation of these operations on the same instance is not safe.

5.8.2.1 void reserve( size_type n )
Effects
Reserves space for at least n elements.
Throws
std::length_error if n>max_size(). It can also throw an exception if the allocator throws an exception.
Safety
If an exception is thrown, the instance remains in a valid state.

5.8.2.2 void shrink_to_fit()[14]
Effects
Compacts the internal representation to reduce fragmentation.

5.8.2.3 void swap( concurrent_vector& x )
Effects
Swaps contents of two vectors. Takes O(1) time.

5.8.2.4 void clear()
Effects
Erases all elements. Afterwards, size()==0.
Does not free internal arrays.[15]
TIP: To free internal arrays, call shrink_to_fit() after clear().

5.8.2.5 ~concurrent_vector()
Effects
Erases all elements and destroys the vector.

[14] Method shrink_to_fit was called compact() in Intel® TBB 2.1. It was renamed to match the C++0x std::vector::shrink_to_fit().
[15] The original release of Intel® TBB 2.1 and its "update 1" freed the arrays. The change in "update 2" reverts back to the behavior of Intel® TBB 2.0. The motivation for not freeing the arrays is to behave similarly to std::vector::clear().

5.8.3 Concurrent Growth
Safety
The methods described in this section may be invoked concurrently on the same vector.

5.8.3.1 iterator grow_by( size_type delta, const_reference t=T() )[16]
Effects
Appends a sequence comprising delta copies of t to the end of the vector. If t is not specified, the new elements are default constructed.
Returns
Iterator pointing to beginning of appended sequence.

5.8.3.2 iterator grow_to_at_least( size_type n )[17]
Effects
Appends minimal sequence of elements such that vector.size()>=n. The new elements are default constructed. Blocks until all elements in range [0..n) are allocated (but not necessarily constructed if they are under construction by a different thread).
TIP: If a thread must know whether construction of an element has completed, consider the following technique. Instantiate the concurrent_vector using a zero_allocator (8.5). Define the constructor T() such that when it completes, it sets a field of T to non-zero. A thread can check whether an item in the concurrent_vector is constructed by checking whether the field is non-zero.
Returns
Iterator that points to beginning of appended sequence, or pointer to (*this)[n] if no elements were appended.

[16] Return type was size_type in Intel® TBB 2.1.
[17] Return type was void in Intel® TBB 2.1.
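The growth methods can be called concurrently because each call first claims its own block of indices and only then fills it. The following sketch models that idea in standard C++ with an atomic counter over preallocated storage. The class reserving_array, its fixed capacity, and the use of plain indices instead of iterators are all assumptions made to keep the example short; they do not describe the Intel® TBB implementation.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model, not TBB code: fixed-capacity storage whose
// logical size is an atomic counter.
class reserving_array {
public:
    explicit reserving_array(std::size_t capacity)
        : data_(capacity, 0), size_(0) {}

    // Claims delta consecutive slots, fills them with t, and returns the
    // index of the first claimed slot. Concurrent callers receive
    // disjoint index ranges because fetch_add is atomic.
    std::size_t grow_by(std::size_t delta, int t = 0) {
        std::size_t first = size_.fetch_add(delta);
        assert(first + delta <= data_.size());  // the sketch never reallocates
        for (std::size_t i = 0; i < delta; ++i)
            data_[first + i] = t;
        return first;
    }

    // May count slots that have been claimed but not yet filled.
    std::size_t size() const { return size_.load(); }

    int operator[](std::size_t i) const { return data_[i]; }

private:
    std::vector<int> data_;
    std::atomic<std::size_t> size_;
};
```

In this model, as in the description of grow_to_at_least, a slot can be counted by size() before its value has been written, which is why a reader that needs to know whether construction has finished must check a marker stored in the element itself.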
5.8.3.3 iterator push_back( const_reference value )[18]
Effects
Appends copy of value to the end of the vector.
Returns
Iterator that points to the copy.

[18] Return type was size_type in Intel® TBB 2.1.

5.8.4 Access
Safety
The methods described in this section may be concurrently invoked on the same vector as methods for concurrent growth (5.8.3). However, the returned reference may be to an element that is being concurrently constructed.

5.8.4.1 reference operator[]( size_type index )
Returns
Reference to element with the specified index.

5.8.4.2 const_reference operator[]( size_type index ) const
Returns
Const reference to element with the specified index.

5.8.4.3 reference at( size_type index )
Returns
Reference to element at specified index.
Throws
std::out_of_range if index >= size().

5.8.4.4 const_reference at( size_type index ) const
Returns
Const reference to element at specified index.
Throws
std::out_of_range if index >= size() or index is for broken portion of vector.

5.8.4.5 reference front()
Returns
(*this)[0]

5.8.4.6 const_reference front() const
Returns
(*this)[0]

5.8.4.7 reference back()
Returns
(*this)[size()-1]

5.8.4.8 const_reference back() const
Returns
(*this)[size()-1]

5.8.5 Parallel Iteration
Types const_range_type and range_type model the Container Range concept (5.1). The types differ only in that the bounds for a const_range_type are of type const_iterator, whereas the bounds for a range_type are of type iterator.

5.8.5.1 range_type range( size_t grainsize=1 )
Returns
Range over entire concurrent_vector that permits read-write access.

5.8.5.2 const_range_type range( size_t grainsize=1 ) const
Returns
Range over entire concurrent_vector that permits read-only access.

5.8.6 Capacity

5.8.6.1 size_type size() const
Returns
Number of elements in the vector.
The result may include elements that are allocated but still under construction by concurrent calls to any of the growth methods (5.8.3).

5.8.6.2 bool empty() const

Returns
size()==0

5.8.6.3 size_type capacity() const

Returns
Maximum size to which vector can grow without having to allocate more memory.

NOTE: Unlike an STL vector, a concurrent_vector does not move existing elements if it allocates more memory.

5.8.6.4 size_type max_size() const

Returns
Highest possible size the vector could reach.

5.8.7 Iterators

Template class concurrent_vector<T> supports random access iterators as defined in Section 24.1.4 of the ISO C++ Standard. Unlike a std::vector, the iterators are not raw pointers. A concurrent_vector<T> meets the reversible container requirements in Table 66 of the ISO C++ Standard.

5.8.7.1 iterator begin()

Returns
iterator pointing to beginning of the vector.

5.8.7.2 const_iterator begin() const

Returns
const_iterator pointing to beginning of the vector.

5.8.7.3 iterator end()

Returns
iterator pointing to end of the vector.

5.8.7.4 const_iterator end() const

Returns
const_iterator pointing to end of the vector.

5.8.7.5 reverse_iterator rbegin()

Returns
reverse_iterator pointing to beginning of reversed vector.

5.8.7.6 const_reverse_iterator rbegin() const

Returns
const_reverse_iterator pointing to beginning of reversed vector.

5.8.7.7 reverse_iterator rend()

Returns
reverse_iterator pointing to end of reversed vector.

5.8.7.8 const_reverse_iterator rend() const

Returns
const_reverse_iterator pointing to end of reversed vector.

6 Flow Graph

Some applications are best expressed as dependencies or messages passed between nodes in a flow graph. These messages may contain data or simply act as signals that a predecessor has completed. The graph class and its associated node classes can be used to express such applications. All graph-related classes and functions are in the tbb::flow namespace.
Primary Components

There are 3 types of components used to implement a graph:

• A graph object
• Nodes
• Edges

The graph object is the owner of the tasks created on behalf of the flow graph. Users can wait on the graph if they need to wait for the completion of all of the tasks related to the flow graph execution. One can also register external interactions with the graph and run tasks under the ownership of the flow graph.

Nodes invoke user-provided function objects or manage messages as they flow to and from other nodes. There are pre-defined nodes that buffer, filter, broadcast or order items as they flow through the graph.

Edges are the connections between the nodes, created by calls to the make_edge function.

Message Passing Protocol

In an Intel® TBB flow graph, edges dynamically switch between a push and pull protocol for passing messages. An Intel® TBB flow graph is a triple G = ( V, S, L ), where V is the set of nodes, S is the set of edges that are currently using a push protocol, and L is the set of edges that are currently using a pull protocol. For each edge (Vi, Vj), Vi is the predecessor / sender and Vj is the successor / receiver. When an edge is in the push set S, messages over the edge are initiated by the sender, which tries to put to the receiver. When an edge is in the pull set L, messages are initiated by the receiver, which tries to get from the sender.

If a message attempt across an edge fails, the edge is moved to the other set. For example, if a put across the edge (Vi, Vj) fails, the edge is removed from the push set S and placed in the pull set L. This dynamic push/pull protocol is the key to performance in a non-preemptive tasking library such as Intel® TBB, where simply repeating failed sends or receives is not an efficient option. Figure 4 summarizes this dynamic protocol.
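The switching rule described above can be sketched as a two-state model of a single edge. This is only an illustration under stated assumptions: the names Protocol, EdgeModel, record_attempt and self_check are hypothetical and are not part of the Intel® TBB API, and the choice of push as the starting state is an assumption of this sketch.

```cpp
#include <cassert>

// Hypothetical model of one edge (Vi, Vj); not part of the TBB API.
enum class Protocol { Push, Pull };

struct EdgeModel {
    // Assumption of this sketch: a new edge starts in the push set S.
    Protocol mode = Protocol::Push;

    // Record the outcome of one message attempt across the edge.
    // A rejected attempt moves the edge to the other set;
    // an accepted attempt leaves the edge where it is.
    void record_attempt(bool accepted) {
        if (!accepted) {
            mode = (mode == Protocol::Push) ? Protocol::Pull : Protocol::Push;
        }
    }
};

// Small self-check that can be called from a test or from main().
inline void self_check() {
    EdgeModel edge;
    edge.record_attempt(false);           // put to Vj rejected
    assert(edge.mode == Protocol::Pull);  // the edge now uses the pull protocol
}
```

The point of the model is that a failed attempt is never retried in the same direction: the edge changes sets, and the other endpoint initiates the next attempt.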
Figure 4: The dynamic push / pull protocol. [Figure not reproduced. It shows two states for an edge (Vs, Vr): "Use Push Protocol" and "Use Pull Protocol". A put to Vr that is accepted keeps the edge in the push state; a put to Vr that is rejected moves it to the pull state. A request from Vs that is accepted keeps the edge in the pull state; a request from Vs that is rejected moves it to the push state.]

Body Objects

Some nodes execute user-provided body objects. These objects can be created by instantiating function objects or lambda expressions. The nodes that use body objects include continue_node, function_node and source_node.

CAUTION: The body objects passed to the flow graph nodes are copied. Therefore updates to member variables will not affect the original object used to construct the node. If the state held within a body object must be inspected from outside of the node, the copy_body function described in 6.22 can be used to obtain an updated copy.

Dependency Flow Graph Example

    #include <cstdio>
    #include <string>
    #include "tbb/flow_graph.h"

    using namespace tbb::flow;

    // Body object shared by all nodes: prints the node's name when invoked.
    struct body {
        std::string my_name;
        body( const char *name ) : my_name(name) {}
        void operator()( continue_msg ) const {
            printf("%s\n", my_name.c_str());
        }
    };

    int main() {
        graph g;

        broadcast_node< continue_msg > start;
        continue_node<continue_msg> a( g, body("A"));
        continue_node<continue_msg> b( g, body("B"));
        continue_node<continue_msg> c( g, body("C"));
        continue_node<continue_msg> d( g, body("D"));
        continue_node<continue_msg> e( g, body("E"));

        make_edge( start, a );
        make_edge( start, b );
        make_edge( a, c );
        make_edge( b, c );
        make_edge( c, d );
        make_edge( a, e );

        // Run the whole graph three times.
        for (int i = 0; i < 3; ++i ) {
            start.try_put( continue_msg() );
            g.wait_for_all();
        }

        return 0;
    }

In this example, five computations A-E are set up with the partial ordering shown in Figure 5. For each edge in the flow graph, the node at the tail of the edge must complete its execution before the node at the head may begin.

NOTE: This is a simple syntactic example only. Since each node in a flow graph may execute as an independent task, the granularity of each node should follow the general guidelines for tasks as described in Section 3.2.3 of the Intel® Threading Building Blocks Tutorial.
Figure 5: A simple dependency graph.

In this example, nodes A-E print out their names. All of these nodes are therefore able to use struct body to construct their body objects.

In function main, the flow graph is set up once and then run three times. All of the nodes in this example pass around continue_msg objects. This type is described in Section 6.4 and is used to communicate that a node has completed its execution.

The first line in function main instantiates a graph object, g. On the next line, a broadcast_node named start is created. Anything passed to this node will be broadcast to all of its successors. The node start is used in the for loop at the bottom of main to launch the execution of the rest of the flow graph.

In the example, five continue_node objects are created, named a through e. Each node is constructed with a reference to graph g and the function object to invoke when it runs. The successor / predecessor relationships are set up by the make_edge calls that follow the declaration of the nodes.

After the nodes and edges are set up, the try_put in each iteration of the for loop results in a broadcast of a continue_msg to both a and b. Both a and b are waiting for a single continue_msg, since they both have only a single predecessor, start.

When they receive the message from start, they execute their body objects. When complete, they each forward a continue_msg to their successors, and so on. The graph uses tasks to execute the node bodies as well as to forward messages between the nodes, allowing computation to execute concurrently when possible.

The classes and functions used in this example are described in detail in the remaining sections in Appendix D.
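The partial ordering in the dependency example can be checked with a small sequential simulation that uses only the C++ standard library. This is a sketch, not Intel® TBB code: run_once and example_edges are hypothetical helpers, the node 'S' stands in for start, and a FIFO queue replaces the task scheduler, so the result is just one of the orders a real run could produce.

```cpp
#include <cassert>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sequential model of one run of a dependency graph.
// Each node fires once it has received a message from every predecessor.
std::string run_once(const std::vector<std::pair<char, char>>& edges, char start) {
    int threshold[256] = {0};  // number of incoming edges per node
    int received[256] = {0};   // messages received since the node last fired
    for (const auto& e : edges) ++threshold[static_cast<unsigned char>(e.second)];
    assert(threshold[static_cast<unsigned char>(start)] == 0);  // start has no predecessors

    std::string order;         // names of the nodes in the order they fired
    std::queue<char> ready;
    ready.push(start);
    while (!ready.empty()) {
        char node = ready.front();
        ready.pop();
        if (node != start) order += node;
        // Forward a message to every successor of the node that just fired.
        for (const auto& e : edges) {
            if (e.first != node) continue;
            unsigned char succ = static_cast<unsigned char>(e.second);
            if (++received[succ] == threshold[succ]) {
                received[succ] = 0;  // reset so the graph can be run again
                ready.push(e.second);
            }
        }
    }
    return order;
}

// The edges created by the make_edge calls in the example ('S' stands for start).
std::vector<std::pair<char, char>> example_edges() {
    return { {'S', 'A'}, {'S', 'B'}, {'A', 'C'}, {'B', 'C'}, {'C', 'D'}, {'A', 'E'} };
}
```

In this model C fires only after both A and B have fired, D fires after C, and E fires after A, which matches the make_edge calls in the example. A parallel execution may interleave independent nodes differently.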
Message Flow Graph Example

    #include <cstdio>
    #include "tbb/flow_graph.h"

    using namespace tbb::flow;

    struct square {
        int operator()(int v) { return v*v; }
    };

    struct cube {
        int operator()(int v) { return v*v*v; }
    };

    // Adds both tuple elements to a running total held by reference.
    class sum {
        int &my_sum;
    public:
        sum( int &s ) : my_sum(s) {}
        int operator()( std::tuple< int, int > v ) {
            my_sum += std::get<0>(v) + std::get<1>(v);
            return my_sum;
        }
    };

    int main() {
        int result = 0;

        graph g;
        broadcast_node<int> input;
        function_node<int,int> squarer( g, unlimited, square() );
        function_node<int,int> cuber( g, unlimited, cube() );
        join_node< std::tuple<int,int>, queueing > join( g );
        function_node<std::tuple<int,int>,int> summer( g, serial, sum(result) );

        make_edge( input, squarer );
        make_edge( input, cuber );
        make_edge( squarer, std::get<0>( join.inputs() ) );
        make_edge( cuber, std::get<1>( join.inputs() ) );
        make_edge( join, summer );

        for (int i = 1; i <= 10; ++i)
            input.try_put(i);
        g.wait_for_all();

        printf("Final result is %d\n", result);
        return 0;
    }

This example calculates the sum of x*x + x*x*x for all x = 1 to 10.

NOTE: This is a simple syntactic example only. Since each node in a flow graph may execute as an independent task, the granularity of each node should follow the general guidelines for tasks as described in Section 3.2.3 of the Intel® Threading Building Blocks Tutorial.

The layout of this example is shown in Figure 6. Each value enters through the broadcast_node<int> input. This node broadcasts the value to both squarer and cuber, which calculate x*x and x*x*x respectively. The output of each of these nodes is put to one of join’s ports. A tuple containing both values is created by join_node< tuple<int,int> > join and forwarded to summer, which adds both values to the running total. Both squarer and cuber allow unlimited concurrency, that is, they each may process multiple values simultaneously. The final summer, which updates a shared total, is only allowed to process a single incoming tuple at a time, eliminating the need for a lock around the shared value.
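As a cross-check on the value the flow graph should print, the same total can be computed sequentially. The helper expected_total below is a hypothetical name introduced only for this sketch and uses no Intel® TBB classes.

```cpp
#include <cassert>

// Sequential reference for the message flow graph example:
// the sum of x*x + x*x*x for x = first .. last (inclusive).
int expected_total(int first, int last) {
    assert(first <= last);
    int total = 0;
    for (int x = first; x <= last; ++x) {
        total += x * x + x * x * x;  // what squarer and cuber contribute for x
    }
    return total;
}
```

For x = 1 to 10 the squares sum to 385 and the cubes sum to 3025, so expected_total(1, 10) is 3410. Because addition is commutative, the order in which tuples reach summer does not change this total.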
The classes square, cube and sum define the three user-defined operations. Each class is used to create a function_node.

In function main, the flow graph is set up and then the values 1 through 10 are put into the node input. All the nodes in this example pass around values of type int. The nodes used in this example are all class templates and therefore can be used with any type that supports copy construction, including pointers and objects.

CAUTION: Values are copied as they pass between nodes and therefore passing around large objects should be avoided. To avoid large copy overheads, pointers to large objects can be passed instead.

Figure 6: A simple message flow graph.

The classes and functions used in this example are described in detail in the remaining sections of Appendix D.

6.1 graph Class

Summary
Class that serves as a handle to a flow graph of nodes and edges.

Syntax
    class graph;

Header
    #include "tbb/flow_graph.h"

Description
A graph object contains a root task that is the parent of all tasks created on behalf of the flow graph and its nodes. It provides methods that can be used to access the root task, to wait for the children of the root task to complete, to explicitly increment or decrement the root task’s reference count, and to run a task as a child of the root task.

CAUTION: Destruction of flow graph nodes before calling wait_for_all on their associated graph object has undefined behavior and can lead to program failure.

Members
    namespace tbb {
    namespace flow {

    class graph {
    public:
        graph();
        ~graph();

        void increment_wait_count();
        void decrement_wait_count();

        template< typename Receiver, typename Body >
        void run( Receiver &r, Body body );
        template< typename Body >
        void run( Body body );
        void wait_for_all();
        task * root_task();
    };

    }
    }

6.1.1 graph()

Effects
Constructs a graph with no nodes.
Instantiates a root task of class empty_task to serve as a parent for all of the tasks generated during runs of the graph. Sets ref_count of the root task to 1.

6.1.2 ~graph()

Effects
Calls wait_for_all on the graph, then destroys the root task.

6.1.3 void increment_wait_count()

Description
Used to register that an external entity may still interact with the graph.

Effects
Increments the ref_count of the root task.

6.1.4 void decrement_wait_count()

Description
Used to unregister an external entity that may have interacted with the graph.

Effects
Decrements the ref_count of the root task.

6.1.5 template< typename Receiver, typename Body > void run( Receiver &r, Body body )

Description
This method can be used to enqueue a task that runs a body and puts its output to a specific receiver. The task is created as a child of the graph’s root task and therefore wait_for_all will not return until this task completes.

Effects
Enqueues a task that invokes r.try_put( body() ). It does not wait for the task to complete. The enqueued task is a child of the root task.

6.1.6 template< typename Body > void run( Body body )

Description
This method enqueues a task that runs as a child of the graph’s root task. Calls to wait_for_all will not return until this enqueued task completes.

Effects
Enqueues a task that invokes body(). It does not wait for the task to complete.

6.1.7 void wait_for_all()

Effect
Blocks until all tasks associated with the root task have completed and the number of decrement_wait_count calls equals the number of increment_wait_count calls. Because it calls wait_for_all on the root graph task, the calling thread may participate in work-stealing while it is blocked.

6.1.8 task *root_task()

Returns
A pointer to the root task of the flow graph.

6.2 sender Template Class

Summary
An abstract base class for nodes that act as message senders.
Syntax
    template< typename T > class sender;

Header
    #include "tbb/flow_graph.h"

Description
The sender template class is an abstract base class that defines the interface for nodes that can act as senders. Default implementations for several functions are provided.

Members
    namespace tbb {
    namespace flow {

    template< typename T >
    class sender {
    public:
        typedef T output_type;
        typedef receiver<T> successor_type;
        virtual ~sender();
        virtual bool register_successor( successor_type &r ) = 0;
        virtual bool remove_successor( successor_type &r ) = 0;
        virtual bool try_get( output_type & ) { return false; }
        virtual bool try_reserve( output_type & ) { return false; }
        virtual bool try_release( ) { return false; }
        virtual bool try_consume( ) { return false; }
    };

    }
    }

6.2.1 ~sender()

Description
The destructor.

6.2.2 bool register_successor( successor_type & r ) = 0

Description
A pure virtual method that describes the interface for adding a successor node to the set of successors for the sender.

Returns
True if the successor is added. False otherwise.

6.2.3 bool remove_successor( successor_type & r ) = 0

Description
A pure virtual method that describes the interface for removing a successor node from the set of successors for a sender.

Returns
True if the successor is removed. False otherwise.

6.2.4 bool try_get( output_type & )

Description
Requests an item from a sender.

Returns
The default implementation returns false.

6.2.5 bool try_reserve( output_type & )

Description
Reserves an item at the sender.

Returns
The default implementation returns false.

6.2.6 bool try_release( )

Description
Releases the reservation held at the sender.

Returns
The default implementation returns false.

6.2.7 bool try_consume( )

Description
Consumes the reservation held at the sender.

Returns
The default implementation returns false.

6.3 receiver Template Class

Summary
An abstract base class for nodes that act as message receivers.
Syntax
    template< typename T > class receiver;

Header
    #include "tbb/flow_graph.h"

Description
The receiver template class is an abstract base class that defines the interface for nodes that can act as receivers. Default implementations for several functions are provided.

Members
    namespace tbb {
    namespace flow {

    template< typename T >
    class receiver {
    public:
        typedef T input_type;
        typedef sender<T> predecessor_type;
        virtual ~receiver();
        virtual bool try_put( const input_type &v ) = 0;
        virtual bool register_predecessor( predecessor_type &p ) {
            return false; }
        virtual bool remove_predecessor( predecessor_type &p ) {
            return false; }
    };

    }
    }

6.3.1 ~receiver()

Description
The destructor.

6.3.2 bool register_predecessor( predecessor_type & p )

Description
Adds a predecessor to the node’s set of predecessors.

Returns
True if the predecessor is added. False otherwise. The default implementation returns false.

6.3.3 bool remove_predecessor( predecessor_type & p )

Description
Removes a predecessor from the node’s set of predecessors.

Returns
True if the predecessor is removed. False otherwise. The default implementation returns false.

6.3.4 bool try_put( const input_type &v ) = 0

Description
A pure virtual method that represents the interface for putting an item to a receiver.

6.4 continue_msg Class

Summary
An empty class that represents a continue message. This class is used to indicate that the sender has completed.

Syntax
    class continue_msg;

Header
    #include "tbb/flow_graph.h"

Members
    namespace tbb {
    namespace flow {

    class continue_msg {};

    }
    }

6.5 continue_receiver Class

Summary
An abstract base class for nodes that act as receivers of continue_msg objects. These nodes call a method execute when the number of try_put calls reaches a threshold that represents the number of known predecessors.
Syntax
    class continue_receiver;

Header
    #include "tbb/flow_graph.h"

Description
This type of node is triggered when its method try_put has been called a number of times that is equal to the number of known predecessors. When triggered, the node calls the method execute, then resets and will fire again when it receives the correct number of try_put calls. This node type is useful for dependency graphs, where each node must wait for its predecessors to complete before executing, but no explicit data is passed across the edge.

Members
    namespace tbb {
    namespace flow {

    class continue_receiver : public receiver< continue_msg > {
    public:
        typedef continue_msg input_type;
        typedef sender< input_type > predecessor_type;
        continue_receiver( int num_predecessors = 0 );
        continue_receiver( const continue_receiver& src );
        virtual ~continue_receiver();
        virtual bool try_put( const input_type &v );
        virtual bool register_predecessor( predecessor_type &p );
        virtual bool remove_predecessor( predecessor_type &p );

    protected:
        virtual void execute() = 0;
    };

    }
    }

6.5.1 continue_receiver( int num_predecessors = 0 )

Effect
Constructs a continue_receiver that is initialized to trigger after receiving num_predecessors calls to try_put.

6.5.2 continue_receiver( const continue_receiver& src )

Effect
Constructs a continue_receiver that has the same initial state that src had after its construction. It does not copy the current count of try_puts received, or the current known number of predecessors. The continue_receiver that is constructed will only have a non-zero threshold if src was constructed with a non-zero threshold.

6.5.3 ~continue_receiver( )

Effect
Destructor.

6.5.4 bool try_put( const input_type & )

Effect
Increments the count of try_put calls received. If the incremented count is equal to the number of known predecessors, a call is made to execute and the internal count of try_put calls is reset to zero.
This method performs as if the call to execute and the updates to the internal count occur atomically.

Returns
True.

6.5.5 bool register_predecessor( predecessor_type & r )

Effect
Increments the number of known predecessors.

Returns
True.

6.5.6 bool remove_predecessor( predecessor_type & r )

Effect
Decrements the number of known predecessors.

CAUTION: The method execute is not called if the count of try_put calls received becomes equal to the number of known predecessors as a result of this call. That is, a call to remove_predecessor will never call execute.

6.5.7 void execute() = 0

Description
A pure virtual method that is called when the number of try_put calls is equal to the number of known predecessors. Must be overridden by the child class.

CAUTION: This method should be very fast or else enqueue a task to offload its work, since this method is called while the sender is blocked on try_put.

6.6 graph_node Class

Summary
A base class for all graph nodes.

Syntax
    class graph_node;

Header
    #include "tbb/flow_graph.h"

Description
The class graph_node is a base class for all flow graph nodes. The virtual destructor allows flow graph nodes to be destroyed through pointers to graph_node. For example, a vector< graph_node * > could be used to hold the addresses of flow graph nodes that will later need to be destroyed.

Members
    namespace tbb {
    namespace flow {

    class graph_node {
    public:
        virtual ~graph_node() {}
    };

    }
    }

6.7 continue_node Template Class

Summary
A template class that is a graph_node, continue_receiver and a sender. It executes a specified body object when triggered and broadcasts the generated value to all of its successors.

Syntax
    template< typename Output > class continue_node;

Header
    #include "tbb/flow_graph.h"

Description
This type is used for nodes that wait for their predecessors to complete before executing, but no explicit data is passed across the incoming edges.
The output of the node can be a continue_msg or a value.

A continue_node maintains an internal threshold, T, and an internal counter, C. If a value for the number of predecessors is provided at construction, then T is set to the provided value and C=0. Otherwise, C=T=0.

At each call to method register_predecessor, the threshold T is incremented. At each call to method remove_predecessor, the threshold T is decremented. The functions make_edge and remove_edge appropriately call register_predecessor and remove_predecessor when edges are added to or removed from a continue_node.

At each call to method try_put, C is incremented. If after the increment, C>=T, then C is reset to 0 and a task is enqueued to broadcast the result of body() to all successors. The increment of C, enqueueing of the task, and the resetting of C are all done atomically with respect to the node. If after the increment, C<T, no task is enqueued.

Table 22: continue_node Body Concept

    Pseudo-Signature                                  Semantics
    B::B( const B& )                                  Copy constructor.
    B::~B()                                           Destructor.
    void¹⁹ operator=( const B& )                      Assignment.
    Output B::operator()(const continue_msg &v) const Perform operation and return value of type Output.

CAUTION: The body object passed to a continue_node is copied. Therefore updates to member variables will not affect the original object used to construct the node. If the state held within a body object must be inspected from outside of the node, the copy_body function described in 6.22 can be used to obtain an updated copy.

Output must be copy-constructible and assignable.
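The threshold and counter rules above can be modeled with a few lines of standard C++. This is a single-threaded sketch under stated assumptions: ContinueCounter and its member names are hypothetical, the model omits the atomicity the real node provides, and put() returns whether the body would be triggered, whereas the real try_put always returns true.

```cpp
#include <cassert>

// Hypothetical single-threaded model of the threshold T and counter C
// described for continue_node. It is not the TBB implementation.
class ContinueCounter {
    int T;      // threshold: number of known predecessors
    int C;      // messages received since the node last fired
    int fired;  // how many times the body would have been enqueued
public:
    explicit ContinueCounter(int number_of_predecessors = 0)
        : T(number_of_predecessors), C(0), fired(0) {
        assert(number_of_predecessors >= 0);
    }

    void register_predecessor() { ++T; }

    // Only adjusts the threshold; it never triggers the body.
    void remove_predecessor() { --T; }

    // Models one try_put call. Returns true if this call would enqueue
    // the body task (model-only return value, see the lead-in).
    bool put() {
        ++C;
        if (C >= T) {
            C = 0;
            ++fired;
            return true;
        }
        return false;
    }

    int times_fired() const { return fired; }
};
```

With two registered predecessors, the first put() leaves the node waiting and the second triggers it, after which the counter starts again from zero.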
Members
    namespace tbb {
    namespace flow {

    template< typename Output >
    class continue_node :
        public graph_node, public continue_receiver,
        public sender<Output> {
    public:
        template<typename Body>
        continue_node( graph &g, Body body );
        template<typename Body>
        continue_node( graph &g, int number_of_predecessors,
                       Body body );
        continue_node( const continue_node& src );

        // continue_receiver
        typedef continue_msg input_type;
        typedef sender<input_type> predecessor_type;
        bool try_put( const input_type &v );
        bool register_predecessor( predecessor_type &p );
        bool remove_predecessor( predecessor_type &p );

        // sender<Output>
        typedef Output output_type;
        typedef receiver<output_type> successor_type;
        bool register_successor( successor_type &r );
        bool remove_successor( successor_type &r );
        bool try_get( output_type &v );
        bool try_reserve( output_type & );
        bool try_release( );
        bool try_consume( );
    };

    }
    }

¹⁹ The return type void in the pseudo-signature denotes that operator= is not required to return a value. The actual operator= can return a value, which will be ignored.

6.7.1 template< typename Body> continue_node(graph &g, Body body)

Effect
Constructs a continue_node that will invoke body.

6.7.2 template< typename Body> continue_node(graph &g, int number_of_predecessors, Body body)

Effect
Constructs a continue_node that will invoke body. The threshold T is initialized to number_of_predecessors.

6.7.3 continue_node( const continue_node & src )

Effect
Constructs a continue_node that has the same initial state that src had after its construction. It does not copy the current count of try_puts received, or the current known number of predecessors. The continue_node that is constructed will have a reference to the same graph object as src, have a copy of the initial body used by src, and only have a non-zero threshold if src was constructed with a non-zero threshold.

CAUTION: The new body object is copy constructed from a copy of the original body provided to src at its construction.
Therefore changes made to member variables in src’s body after the construction of src will not affect the body of the new continue_node.

6.7.4 bool register_predecessor( predecessor_type & r )

Effect
Increments the number of known predecessors.

Returns
True.

6.7.5 bool remove_predecessor( predecessor_type & r )

Effect
Decrements the number of known predecessors.

CAUTION: The body is not called if the count of try_put calls received becomes equal to the number of known predecessors as a result of this call. That is, a call to remove_predecessor will never invoke the body.

6.7.6 bool try_put( const input_type & )

Effect
Increments the count of try_put calls received. If the incremented count is equal to the number of known predecessors, a task is enqueued to execute the body and the internal count of try_put calls is reset to zero. This method performs as if the enqueueing of the body task and the updates to the internal count occur atomically. It does not wait for the execution of the body to complete.

Returns
True.

6.7.7 bool register_successor( successor_type & r )

Effect
Adds r to the set of successors.

Returns
True.

6.7.8 bool remove_successor( successor_type & r )

Effect
Removes r from the set of successors.

Returns
True.

6.7.9 bool try_get( output_type &v )

Description
The continue_node does not contain buffering. Therefore it always rejects try_get calls.

Returns
False.

6.7.10 bool try_reserve( output_type & )

Description
The continue_node does not contain buffering. Therefore it cannot be reserved.

Returns
False.

6.7.11 bool try_release( )

Description
The continue_node does not contain buffering. Therefore it cannot be reserved.

Returns
False.

6.7.12 bool try_consume( )

Description
The continue_node does not contain buffering. Therefore it cannot be reserved.

Returns
False.

6.8 function_node Template Class

Summary
A template class that is a graph_node, receiver<Input> and a sender<Output>.
This node may have concurrency limits as set by the user. By default, a function_node has an internal FIFO buffer at its input. Messages that cannot be immediately processed due to concurrency limits are temporarily stored in this FIFO buffer. A template argument can be used to disable this internal buffer. If the FIFO buffer is disabled, incoming messages will be rejected if they cannot be processed immediately while respecting the concurrency limits of the node.

Syntax
    template < typename Input,
               typename Output = continue_msg,
               graph_buffer_policy = queueing,
               typename Allocator=cache_aligned_allocator<Input> >
    class function_node;

Header
    #include "tbb/flow_graph.h"

Description
A function_node receives messages of type Input at a single input port and generates a single output message of type Output that is broadcast to all successors. Rejection of messages by successors is handled using the protocol in Figure 4.

If graph_buffer_policy==queueing, an internal unbounded input buffer is maintained using memory obtained through an allocator of type Allocator.

A function_node maintains an internal constant threshold T and an internal counter C. At construction, C=0 and T is set to the value passed in to the constructor. The behavior of a call to try_put is determined by the value of T and C as shown in Table 23.

Table 23: Behavior of a call to a function_node’s try_put

Value of threshold T: T == flow::unlimited
Value of counter C: NA
bool try_put( input_type v ): A task is enqueued that broadcasts the result of body(v) to all successors. Returns true.

Value of threshold T: T != flow::unlimited
Value of counter C: C < T
bool try_put( input_type v ): Increments C. A task is enqueued that broadcasts the result of body(v) to all successors and then decrements C. Returns true.

Value of threshold T: T != flow::unlimited
Value of counter C: C >= T
bool try_put( input_type v ): If the template argument graph_buffer_policy==queueing, v is stored in an internal FIFO buffer until C < T.
When C becomes less than T, C is incremented and a task is enqueued that broadcasts the result of body(v) to all successors and then decrements C. Returns true. If the template argument graph_buffer_policy==rejecting and C >= T, returns false.

A function_node has a user-settable concurrency limit. It can have flow::unlimited concurrency, which allows an unlimited number of invocations of the body to execute concurrently. It can have flow::serial concurrency, which allows only a single call of body to execute at a time. The user can also provide a value of type size_t to limit concurrency to a value between 1 and unlimited.

A function_node with graph_buffer_policy==rejecting will maintain a predecessor set as described in Figure 4. If the function_node transitions from a state where C >= T to a state where C < T, it will try to get new messages from its set of predecessors until C >= T or there are no valid predecessors left in the set.

NOTE: A function_node can serve as a terminal node in the graph. The convention is to use an Output of continue_msg and attach no successor.

The Body concept for function_node is shown in Table 24.

Table 24: function_node Body Concept

    Pseudo-Signature                           Semantics
    B::B( const B& )                           Copy constructor.
    B::~B()                                    Destructor.
    void²⁰ operator=( const B& )               Assignment.
    Output B::operator()(const Input &v) const Perform operation on v and return value of type Output.

CAUTION: The body object passed to a function_node is copied. Therefore updates to member variables will not affect the original object used to construct the node. If the state held within a body object must be inspected from outside of the node, the copy_body function described in 6.22 can be used to obtain an updated copy.

Input and Output must be copy-constructible and assignable.
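The rules of Table 23 can be traced with a small single-threaded model. This is a sketch under stated assumptions, not Intel® TBB code: FunctionNodeModel, BufferPolicy, kUnlimited and body_finished are hypothetical names, the value 0 used for kUnlimited is a convention of this model only, and "enqueueing a task" is reduced to counting how many body invocations have started.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Hypothetical stand-ins for the buffer policy and for flow::unlimited.
enum class BufferPolicy { Queueing, Rejecting };
const std::size_t kUnlimited = 0;  // assumption of this model only

// Single-threaded model of the threshold T and counter C of a function_node.
class FunctionNodeModel {
    std::size_t T;           // concurrency threshold
    std::size_t C = 0;       // bodies currently running
    BufferPolicy policy;
    std::deque<int> buffer;  // FIFO input buffer (queueing policy only)
    int started = 0;         // body invocations started so far
public:
    FunctionNodeModel(std::size_t concurrency, BufferPolicy p)
        : T(concurrency), policy(p) {}

    // Follows the three rows of Table 23.
    bool try_put(int v) {
        if (T == kUnlimited) { ++started; return true; }
        if (C < T) { ++C; ++started; return true; }
        if (policy == BufferPolicy::Queueing) { buffer.push_back(v); return true; }
        return false;  // rejecting policy and C >= T
    }

    // Models the end of one body invocation: C is decremented and,
    // if a message is waiting in the buffer, its body is started.
    void body_finished() {
        if (T == kUnlimited) return;
        assert(C > 0);
        --C;
        if (!buffer.empty()) {
            buffer.pop_front();
            ++C;
            ++started;
        }
    }

    int bodies_started() const { return started; }
    std::size_t buffered() const { return buffer.size(); }
};
```

With a threshold of 1, a queueing node accepts a second message and holds it until the first body finishes, while a rejecting node returns false for the second message and accepts a later one once the counter drops below the threshold.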
Members
    namespace tbb {
    namespace flow {

    enum graph_buffer_policy {
        rejecting, reserving, queueing, tag_matching };

    template < typename Input,
               typename Output = continue_msg,
               graph_buffer_policy = queueing,
               typename Allocator=cache_aligned_allocator<Input> >
    class function_node :
        public graph_node, public receiver<Input>,
        public sender<Output> {
    public:
        template<typename Body>
        function_node( graph &g, size_t concurrency, Body body );
        function_node( const function_node &src );

        // receiver<Input>
        typedef Input input_type;
        typedef sender<input_type> predecessor_type;
        bool try_put( const input_type &v );
        bool register_predecessor( predecessor_type &p );
        bool remove_predecessor( predecessor_type &p );

        // sender<Output>
        typedef Output output_type;
        typedef receiver<output_type> successor_type;
        bool register_successor( successor_type &r );
        bool remove_successor( successor_type &r );
        bool try_get( output_type &v );
        bool try_reserve( output_type & );
        bool try_release( );
        bool try_consume( );
    };

    }
    }

²⁰ The return type void in the pseudo-signature denotes that operator= is not required to return a value. The actual operator= can return a value, which will be ignored.

6.8.1 template< typename Body> function_node(graph &g, size_t concurrency, Body body)

Description
Constructs a function_node that will invoke a copy of body. At most concurrency calls to body may be made concurrently.

6.8.2 function_node( const function_node &src )

Effect
Constructs a function_node that has the same initial state that src had when it was constructed. The function_node that is constructed will have a reference to the same graph object as src, will have a copy of the initial body used by src, and have the same concurrency threshold as src. The predecessors and successors of src will not be copied.

CAUTION: The new body object is copy constructed from a copy of the original body provided to src at its construction.
Therefore changes made to member variables in src’s body after the construction of src will not affect the body of the new function_node.

6.8.3 bool register_predecessor( predecessor_type & p )

Effect
Adds p to the set of predecessors.

Returns
true.

6.8.4 bool remove_predecessor( predecessor_type & p )

Effect
Removes p from the set of predecessors.

Returns
true.

6.8.5 bool try_put( const input_type &v )

Effect
See Table 23 for a description of the behavior of try_put.

Returns
true.

6.8.6 bool register_successor( successor_type & r )

Effect
Adds r to the set of successors.

Returns
true.

6.8.7 bool remove_successor( successor_type & r )

Effect
Removes r from the set of successors.

Returns
true.

6.8.8 bool try_get( output_type &v )

Description
A function_node does not contain buffering of its output. Therefore it always rejects try_get calls.

Returns
false.

6.8.9 bool try_reserve( output_type & )

Description
A function_node does not contain buffering of its output. Therefore it cannot be reserved.

Returns
false.

6.8.10 bool try_release( )

Description
A function_node does not contain buffering of its output. Therefore it cannot be reserved.

Returns
false.

6.8.11 bool try_consume( )

Description
A function_node does not contain buffering of its output. Therefore it cannot be reserved.

Returns
false.

6.9 source_node Class

Summary
A template class that is both a graph_node and a sender. This node can have no predecessors. It executes a user-provided body function object to generate messages that are broadcast to all successors. It is a serial node and will never call its body concurrently. It is able to buffer a single item. If no successor accepts an item that it has generated, the message is buffered and will be provided to successors before a new item is generated.
Syntax

    template < typename Output > class source_node;

Header

    #include "tbb/flow_graph.h"

Description
This type of node generates messages of type Output by invoking the user-provided body and broadcasts the result to all of its successors.

Output must be copy-constructible and assignable.

A source_node is a serial node. Calls to body will never be made concurrently.

A source_node will continue to invoke body and broadcast messages until the body returns false or it has no valid successors. A message may be generated and then rejected by all successors. In that case, the message is buffered and will be the next message sent once a successor is added to the node or try_get is called. Calls to try_get will return a buffered message if available or will invoke body to attempt to generate a new message. A call to body is made only when the internal buffer is empty.

Rejection of messages by successors is handled using the protocol in Figure 4.

Table 25: source_node Body Concept

Pseudo-Signature                  Semantics
B::B( const B& )                  Copy constructor.
B::~B()                           Destructor.
void operator=( const B& )        Assignment (see footnote 21).
bool B::operator()(Output &v)     Returns true when it has assigned a new value to v. Returns false when no new values may be generated.

CAUTION: The body object passed to a source_node is copied. Therefore updates to member variables will not affect the original object used to construct the node.

Output must be copy-constructible and assignable.
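As an illustration of the Body concept in Table 25, the following sketch defines a body with the pseudo-signature bool B::operator()(Output &v) and drives it with a plain loop. CountingBody and drain are hypothetical names introduced here; drain is only a stand-in for the node's generate loop, not a TBB function. Because drain takes the body by value, it also shows the effect described in the CAUTION above: the caller's original object is not updated.

```cpp
#include <cassert>
#include <vector>

// A body matching the pseudo-signature bool B::operator()(Output &v),
// with Output = int. It produces next, next+1, ... up to limit-1.
struct CountingBody {
    int next;
    int limit;
    bool operator()(int& v) {
        if (next >= limit) return false;  // no new values may be generated
        v = next++;                       // a new value was assigned to v
        return true;
    }
};

// Hypothetical stand-in for the node's generate loop: invoke the body
// until it returns false. The body is taken by value, so the loop works
// on a copy, as a source_node does.
template <typename Body>
std::vector<int> drain(Body body) {
    std::vector<int> out;
    int v = 0;
    while (body(v)) {
        out.push_back(v);
    }
    return out;
}
```

Calling drain on a CountingBody with next = 0 and limit = 3 yields the values 0, 1 and 2, while the next member of the caller's own object stays at 0.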
Members

    namespace tbb {
    namespace flow {

    template < typename Output >
    class source_node :
        public graph_node, public sender< Output > {
    public:
        typedef Output output_type;
        typedef receiver< output_type > successor_type;

        template< typename Body >
        source_node( graph &g, Body body, bool is_active = true );
        source_node( const source_node &src );
        ~source_node();

        void activate();
        bool register_successor( successor_type &r );
        bool remove_successor( successor_type &r );
        bool try_get( output_type &v );
        bool try_reserve( output_type &v );
        bool try_release( );
        bool try_consume( );
    };

    }
    }

Footnote 21: The return type void in the pseudo-signature denotes that operator= is not required to return a value. The actual operator= can return a value, which will be ignored.

6.9.1 template< typename Body> source_node(graph &g, Body body, bool is_active=true)

Description
Constructs a source_node that will invoke body. By default the node is created in the active state, that is, it will begin generating messages immediately. If is_active is false, messages will not be generated until a call to activate is made.

6.9.2 source_node( const source_node &src )

Description
Constructs a source_node that has the same initial state that src had when it was constructed. The source_node that is constructed will have a reference to the same graph object as src, will have a copy of the initial body used by src, and have the same initial active state as src. The predecessors and successors of src will not be copied.

CAUTION: The new body object is copy constructed from a copy of the original body provided to src at its construction. Therefore changes made to member variables in src’s body after the construction of src will not affect the body of the new source_node.

6.9.3 bool register_successor( successor_type & r )

Effect
Adds r to the set of successors.

Returns
true.

6.9.4 bool remove_successor( successor_type & r )

Effect
Removes r from the set of successors.
Returns
true.

6.9.5 bool try_get( output_type &v )

Description
Will copy the buffered message into v if available or will invoke body to attempt to generate a new message that will be copied into v.

Returns
true if a message is copied to v. false otherwise.

6.9.6 bool try_reserve( output_type &v )

Description
Reserves the source_node if possible. If a message can be buffered and the node is not already reserved, the node is reserved for the caller and the value is copied into v.

Returns
true if the node is reserved for the caller. false otherwise.

6.9.7 bool try_release( )

Description
Releases any reservation held on the source_node. The message held in the internal buffer is retained.

Returns
true.

6.9.8 bool try_consume( )

Description
Releases any reservation held on the source_node and clears the internal buffer.

Returns
true.

6.10 overwrite_node Template Class

Summary
A template class that is a graph_node, receiver and sender. An overwrite_node is a buffer of a single item that can be over-written. The value held in the buffer is initially invalid. Gets from the node are non-destructive.

Syntax

    template < typename T > class overwrite_node;

Header

    #include "tbb/flow_graph.h"

Description
This type of node buffers a single item of type T. The value is initially invalid. A try_put will set the value of the internal buffer, and broadcast the new value to all successors. If the internal value is valid, a try_get will return true and copy the buffer value to the output. If the internal value is invalid, try_get will return false.

Rejection of messages by successors is handled using the protocol in Figure 4.
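The buffer semantics described above (initially invalid, over-writable, non-destructive gets) can be sketched with a small stand-alone class. OverwriteModel is a hypothetical name and the class is a toy model, not the TBB implementation: it leaves out successors and broadcasting entirely, and it additionally assumes that T is default-constructible, which the text above does not require.

```cpp
#include <cassert>

// Toy model of a single-item, over-writable buffer; not TBB code.
// Assumption for this sketch only: T is default-constructible.
template <typename T>
class OverwriteModel {
public:
    OverwriteModel() : value_(), valid_(false) {}

    // Every put replaces the held value and marks it valid.
    bool try_put(const T& v) {
        value_ = v;
        valid_ = true;
        return true;
    }

    // Gets are non-destructive: the value stays in the buffer.
    bool try_get(T& v) const {
        if (!valid_) return false;
        v = value_;
        return true;
    }

    bool is_valid() const { return valid_; }
    void clear() { valid_ = false; }

private:
    T value_;
    bool valid_;
};
```

In this model, two puts followed by two gets return the second value both times, and a get after clear returns false.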
T must be copy-constructible and assignable.

Members

    namespace tbb {
    namespace flow {

    template< typename T >
    class overwrite_node :
        public graph_node, public receiver<T>,
        public sender<T> {
    public:
        overwrite_node();
        overwrite_node( const overwrite_node &src );
        ~overwrite_node();

        // receiver
        typedef T input_type;
        typedef sender<input_type> predecessor_type;
        bool try_put( const input_type &v );
        bool register_predecessor( predecessor_type &p );
        bool remove_predecessor( predecessor_type &p );

        // sender
        typedef T output_type;
        typedef receiver<output_type> successor_type;
        bool register_successor( successor_type &r );
        bool remove_successor( successor_type &r );
        bool try_get( output_type &v );
        bool try_reserve( output_type & );
        bool try_release( );
        bool try_consume( );

        bool is_valid();
        void clear();
    };

    }
    }

6.10.1 overwrite_node()

Effect
Constructs an object of type overwrite_node with an invalid internal buffer item.

6.10.2 overwrite_node( const overwrite_node &src )

Effect
Constructs an object of type overwrite_node with an invalid internal buffer item. The buffered value and list of successors are NOT copied from src.

6.10.3 ~overwrite_node()

Effect
Destroys the overwrite_node.

6.10.4 bool register_predecessor( predecessor_type & )

Description
Never rejects puts and therefore does not need to maintain a list of predecessors.

Returns
false.

6.10.5 bool remove_predecessor( predecessor_type &)

Description
Never rejects puts and therefore does not need to maintain a list of predecessors.

Returns
false.

6.10.6 bool try_put( const input_type &v )

Effect
Stores v in the internal single item buffer. Calls try_put( v ) on all successors.

Returns
true.

6.10.7 bool register_successor( successor_type & r )

Effect
Adds r to the set of successors. If a valid item v is held in the buffer, a task is enqueued to call r.try_put(v).

Returns
true.
6.10.8 bool remove_successor( successor_type & r )

Effect
Removes r from the set of successors.

Returns
true.

6.10.9 bool try_get( output_type &v )

Description
If the internal buffer is valid, assigns the value to v.

Returns
true if v is assigned to. false if v is not assigned to.

6.10.10 bool try_reserve( output_type & )

Description
Does not support reservations.

Returns
false.

6.10.11 bool try_release( )

Description
Does not support reservations.

Returns
false.

6.10.12 bool try_consume( )

Description
Does not support reservations.

Returns
false.

6.10.13 bool is_valid()

Returns
Returns true if the buffer holds a valid value, otherwise returns false.

6.10.14 void clear()

Effect
Invalidates the value held in the buffer.

6.11 write_once_node Template Class

Summary
A template class that is a graph_node, receiver and sender. A write_once_node represents a buffer of a single item that cannot be over-written. The first put to the node sets the value. The value may be cleared explicitly, after which a new value may be set. Gets from the node are non-destructive.

Rejection of messages by successors is handled using the protocol in Figure 4.
T must be copy-constructible and assignable.

Syntax

    template < typename T > class write_once_node;

Header

    #include "tbb/flow_graph.h"

Members

    namespace tbb {
    namespace flow {

    template< typename T >
    class write_once_node :
        public graph_node, public receiver<T>,
        public sender<T> {
    public:
        write_once_node();
        write_once_node( const write_once_node &src );

        // receiver
        typedef T input_type;
        typedef sender<input_type> predecessor_type;
        bool try_put( const input_type &v );
        bool register_predecessor( predecessor_type &p );
        bool remove_predecessor( predecessor_type &p );

        // sender
        typedef T output_type;
        typedef receiver<output_type> successor_type;
        bool register_successor( successor_type &r );
        bool remove_successor( successor_type &r );
        bool try_get( output_type &v );
        bool try_reserve( output_type & );
        bool try_release( );
        bool try_consume( );

        bool is_valid();
        void clear();
    };

    }
    }

6.11.1 write_once_node()

Effect
Constructs an object of type write_once_node with an invalid internal buffer item.

6.11.2 write_once_node( const write_once_node &src )

Effect
Constructs an object of type write_once_node with an invalid internal buffer item. The buffered value and list of successors are NOT copied from src.

6.11.3 bool register_predecessor( predecessor_type & )

Description
Never rejects puts and therefore does not need to maintain a list of predecessors.

Returns
false.

6.11.4 bool remove_predecessor( predecessor_type &)

Description
Never rejects puts and therefore does not need to maintain a list of predecessors.

Returns
false.

6.11.5 bool try_put( const input_type &v )

Effect
Stores v in the internal single item buffer if it does not already contain a valid value. If a new value is set, it calls try_put( v ) on all successors.

Returns
true.

6.11.6 bool register_successor( successor_type & r )

Effect
Adds r to the set of successors. If a valid item v is held in the buffer, a task is enqueued to call r.try_put(v).

Returns
true.
6.11.7 bool remove_successor( successor_type & r )

Effect
Removes r from the set of successors.

Returns
true.

6.11.8 bool try_get( output_type &v )

Description
If the internal buffer is valid, assigns the value to v.

Returns
true if v is assigned to. false if v is not assigned to.

6.11.9 bool try_reserve( output_type & )

Description
Does not support reservations.

Returns
false.

6.11.10 bool try_release( )

Description
Does not support reservations.

Returns
false.

6.11.11 bool try_consume( )

Description
Does not support reservations.

Returns
false.

6.11.12 bool is_valid()

Returns
Returns true if the buffer holds a valid value, otherwise returns false.

6.11.13 void clear()

Effect
Invalidates the value held in the buffer.

6.12 broadcast_node Template Class

Summary
A node that broadcasts incoming messages to all of its successors.

Syntax

    template < typename T > class broadcast_node;

Header

    #include "tbb/flow_graph.h"

Description
A broadcast_node is a graph_node, receiver and sender that broadcasts incoming messages of type T to all of its successors. There is no buffering in the node, so all messages are forwarded immediately to all successors.

Rejection of messages by successors is handled using the protocol in Figure 4.
T must be copy-constructible and assignable.

Members

    namespace tbb {
    namespace flow {

    template< typename T >
    class broadcast_node :
        public graph_node, public receiver<T>,
        public sender<T> {
    public:
        broadcast_node();
        broadcast_node( const broadcast_node &src );

        // receiver
        typedef T input_type;
        typedef sender<input_type> predecessor_type;
        bool try_put( const input_type &v );
        bool register_predecessor( predecessor_type &p );
        bool remove_predecessor( predecessor_type &p );

        // sender
        typedef T output_type;
        typedef receiver<output_type> successor_type;
        bool register_successor( successor_type &r );
        bool remove_successor( successor_type &r );
        bool try_get( output_type &v );
        bool try_reserve( output_type & );
        bool try_release( );
        bool try_consume( );
    };

    }
    }

6.12.1 broadcast_node()

Effect
Constructs an object of type broadcast_node.

6.12.2 broadcast_node( const broadcast_node &src )

Effect
Constructs an object of type broadcast_node. The list of successors is NOT copied from src.

6.12.3 bool register_predecessor( predecessor_type & )

Description
Never rejects puts and therefore does not need to maintain a list of predecessors.

Returns
false.

6.12.4 bool remove_predecessor( predecessor_type &)

Description
Never rejects puts and therefore does not need to maintain a list of predecessors.

Returns
false.

6.12.5 bool try_put( const input_type &v )

Effect
Broadcasts v to all successors.

Returns
Always returns true, even if it was unable to successfully forward the message to any of its successors.

6.12.6 bool register_successor( successor_type & r )

Effect
Adds r to the set of successors.

Returns
true.

6.12.7 bool remove_successor( successor_type & r )

Effect
Removes r from the set of successors.

Returns
true.

6.12.8 bool try_get( output_type & )

Returns
false.

6.12.9 bool try_reserve( output_type & )

Returns
false.

6.12.10 bool try_release( )

Returns
false.

6.12.11 bool try_consume( )

Returns
false.
6.13 buffer_node Class

Summary
An unbounded buffer of messages of type T. Messages are forwarded in arbitrary order.

Syntax

    template< typename T, typename A=cache_aligned_allocator<T> >
    class buffer_node;

Header

    #include "tbb/flow_graph.h"

Description
A buffer_node is a graph_node, receiver and sender that forwards messages in arbitrary order to a single successor in its successor set. Successors are tried in the order that they were registered with the node. If a successor rejects the message, it is removed from the successor list according to the policy in Figure 4 and the next successor in the set is tried. This continues until a successor accepts the message, or all successors have been attempted. Items that are successfully transferred to a successor are removed from the buffer.

A buffer_node is reservable and supports a single reservation at a time. While an item is reserved, other items may still be forwarded to successors and try_get calls will return other non-reserved items if available. While an item is reserved, try_put will still return true and add items to the buffer.

An allocator of type A is used to allocate internal memory for the buffer_node.

T must be copy-constructible and assignable.

Rejection of messages by successors is handled using the protocol in Figure 4.
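The single-reservation rule described above can be sketched with a small stand-alone model. ReservableBuffer is a hypothetical name and the class is not the TBB implementation: it ignores successors, tasks and allocators, and it makes one arbitrary choice that the text leaves open, namely that the reserved item is always the one at index 0 while try_get hands out items from the back.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model of a buffer with a single reservation; not TBB code.
template <typename T>
class ReservableBuffer {
public:
    ReservableBuffer() : reserved_(false) {}

    // Puts are always accepted, even while an item is reserved.
    bool try_put(const T& v) {
        items_.push_back(v);
        return true;
    }

    // Reserve the item at index 0 without removing it.
    bool try_reserve(T& v) {
        if (reserved_ || items_.empty()) return false;
        reserved_ = true;
        v = items_.front();
        return true;
    }

    // Drop the reservation; the item stays in the buffer.
    bool try_release() {
        if (!reserved_) return false;
        reserved_ = false;
        return true;
    }

    // Drop the reservation and remove the reserved item.
    bool try_consume() {
        if (!reserved_) return false;
        items_.erase(items_.begin());
        reserved_ = false;
        return true;
    }

    // Hand out any non-reserved item (here: the one at the back).
    bool try_get(T& v) {
        std::size_t held = reserved_ ? 1 : 0;
        if (items_.size() <= held) return false;
        v = items_.back();
        items_.pop_back();
        return true;
    }

    std::size_t size() const { return items_.size(); }

private:
    std::vector<T> items_;
    bool reserved_;   // true while items_[0] is reserved
};
```

The point of the model is the contrast with the queue-like nodes later in this chapter: here a reservation blocks only the reserved item, so try_get can still succeed for other items.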
Members

    namespace tbb {
    namespace flow {

    template< typename T, typename A=cache_aligned_allocator<T> >
    class buffer_node :
        public graph_node, public receiver<T>,
        public sender<T> {
    public:
        buffer_node( graph &g );
        buffer_node( const buffer_node &src );

        // receiver
        typedef T input_type;
        typedef sender<input_type> predecessor_type;
        bool try_put( const input_type &v );
        bool register_predecessor( predecessor_type &p );
        bool remove_predecessor( predecessor_type &p );

        // sender
        typedef T output_type;
        typedef receiver<output_type> successor_type;
        bool register_successor( successor_type &r );
        bool remove_successor( successor_type &r );
        bool try_get( output_type &v );
        bool try_reserve( output_type & );
        bool try_release( );
        bool try_consume( );
    };

    }
    }

6.13.1 buffer_node( graph& g )

Effect
Constructs an empty buffer_node that belongs to graph g.

6.13.2 buffer_node( const buffer_node &src )

Effect
Constructs an empty buffer_node that belongs to the same graph g as src. The list of predecessors, the list of successors and the messages in the buffer are NOT copied.

6.13.3 bool register_predecessor( predecessor_type & )

Description
Never rejects puts and therefore does not need to maintain a list of predecessors.

Returns
false.

6.13.4 bool remove_predecessor( predecessor_type &)

Description
Never rejects puts and therefore does not need to maintain a list of predecessors.

Returns
false.

6.13.5 bool try_put( const input_type &v )

Effect
Adds v to the buffer. If v is the only item in the buffer, a task is also enqueued to forward the item to a successor.

Returns
true.

6.13.6 bool register_successor( successor_type & r )

Effect
Adds r to the set of successors.

Returns
true.

6.13.7 bool remove_successor( successor_type & r )

Effect
Removes r from the set of successors.

Returns
true.

6.13.8 bool try_get( output_type & v )

Returns
Returns true if an item can be removed from the buffer and assigned to v.
Returns false if there is no non-reserved item currently in the buffer.

6.13.9 bool try_reserve( output_type & v )

Effect
Assigns a newly reserved item to v if there is no reservation currently held and there is at least one item available in the buffer. If a new reservation is made, the buffer is marked as reserved.

Returns
Returns true if v has been assigned a newly reserved item. Returns false otherwise.

6.13.10 bool try_release( )

Effect
Releases the reservation on the buffer. The item that was returned in the last successful call to try_reserve remains in the buffer.

Returns
Returns true if the buffer is currently reserved and false otherwise.

6.13.11 bool try_consume( )

Effect
Releases the reservation on the buffer. The item that was returned in the last successful call to try_reserve is removed from the buffer.

Returns
Returns true if the buffer is currently reserved and false otherwise.

6.14 queue_node Template Class

Summary
An unbounded buffer of messages of type T. Messages are forwarded in first-in first-out (FIFO) order.

Syntax

    template < typename T, typename A=cache_aligned_allocator<T> >
    class queue_node;

Header

    #include "tbb/flow_graph.h"

Description
A queue_node is a graph_node, receiver and sender that forwards messages in first-in first-out (FIFO) order to a single successor in its successor set. Successors are tried in the order that they were registered with the node. If a successor rejects the message, it is removed from the successor list as described by the policy in Figure 4 and the next successor in the set is tried. This continues until a successor accepts the message, or all successors have been attempted. Items that are successfully transferred to a successor are removed from the buffer.

A queue_node is reservable and supports a single reservation at a time. While the queue_node is reserved, no other items will be forwarded to successors and all try_get calls will return false.
While reserved, try_put will still return true and add items to the queue_node.

An allocator of type A is used to allocate internal memory for the queue_node.

T must be copy-constructible and assignable.

Rejection of messages by successors is handled using the protocol in Figure 4.

Members

    namespace tbb {
    namespace flow {

    template < typename T, typename A=cache_aligned_allocator<T> >
    class queue_node : public buffer_node<T,A> {
    public:
        queue_node( graph &g );
        queue_node( const queue_node &src );

        // receiver
        typedef T input_type;
        typedef sender<input_type> predecessor_type;
        bool try_put( input_type v );
        bool register_predecessor( predecessor_type &p );
        bool remove_predecessor( predecessor_type &p );

        // sender
        typedef T output_type;
        typedef receiver<output_type> successor_type;
        bool register_successor( successor_type &r );
        bool remove_successor( successor_type &r );
        bool try_get( output_type &v );
        bool try_reserve( output_type & );
        bool try_release( );
        bool try_consume( );
    };

    }
    }

6.14.1 queue_node( graph& g )

Effect
Constructs an empty queue_node that belongs to graph g.

6.14.2 queue_node( const queue_node &src )

Effect
Constructs an empty queue_node that belongs to the same graph g as src. The list of predecessors, the list of successors and the messages in the buffer are NOT copied.

6.14.3 bool register_predecessor( predecessor_type & )

Description
Never rejects puts and therefore does not need to maintain a list of predecessors.

Returns
false.

6.14.4 bool remove_predecessor( predecessor_type &)

Description
Never rejects puts and therefore does not need to maintain a list of predecessors.

Returns
false.

6.14.5 bool try_put( const input_type &v )

Effect
Adds v to the queue_node. If v is the only item in the queue_node, a task is enqueued to forward the item to a successor.

Returns
true.

6.14.6 bool register_successor( successor_type & r )

Effect
Adds r to the set of successors.

Returns
true.

6.14.7 bool remove_successor( successor_type & r )

Effect
Removes r from the set of successors.
Returns
true.

6.14.8 bool try_get( output_type & v )

Returns
Returns true if an item can be removed from the front of the queue_node and assigned to v. Returns false if there is no item currently in the queue_node or if the node is reserved.

6.14.9 bool try_reserve( output_type & v )

Effect
If the call returns true, the node is reserved and will forward no more messages until the reservation has been released or consumed.

Returns
Returns true if there is an item in the queue_node and the node is not currently reserved. If an item can be returned, it is assigned to v. Returns false if there is no item currently in the queue_node or if the node is reserved.

6.14.10 bool try_release( )

Effect
Releases the reservation on the node. The item that was returned in the last successful call to try_reserve remains in the queue_node.

Returns
Returns true if the node is currently reserved and false otherwise.

6.14.11 bool try_consume( )

Effect
Releases the reservation on the queue_node. The item that was returned in the last successful call to try_reserve is popped from the front of the queue.

Returns
Returns true if the queue_node is currently reserved and false otherwise.

6.15 priority_queue_node Template Class

Summary
An unbounded buffer of messages of type T. Messages are forwarded in priority order.

Syntax

    template< typename T, typename Compare = std::less<T>,
              typename A=cache_aligned_allocator<T> >
    class priority_queue_node;

Header

    #include "tbb/flow_graph.h"

Description
A priority_queue_node is a graph_node, receiver and sender that forwards messages in priority order to a single successor in its successor set. Successors are tried in the order that they were registered with the node. If a successor rejects the message, it is removed from the successor list as described by the policy in Figure 4 and the next successor in the set is tried.
This continues until a successor accepts the message, or all successors have been attempted. Items that are successfully transferred to a successor are removed from the buffer.

The next message to be forwarded has the largest priority as determined by Compare.

A priority_queue_node is reservable and supports a single reservation at a time. While the priority_queue_node is reserved, no other items will be forwarded to successors and all try_get calls will return false. While reserved, try_put will still return true and add items to the priority_queue_node.

An allocator of type A is used to allocate internal memory for the priority_queue_node.

T must be copy-constructible and assignable.

Rejection of messages by successors is handled using the protocol in Figure 4.

Members

    namespace tbb {
    namespace flow {

    template< typename T, typename Compare = std::less<T>,
              typename A=cache_aligned_allocator<T> >
    class priority_queue_node : public queue_node<T> {
    public:
        typedef size_t size_type;
        priority_queue_node( graph &g );
        priority_queue_node( const priority_queue_node &src );
        ~priority_queue_node();

        // receiver
        typedef T input_type;
        typedef sender<input_type> predecessor_type;
        bool try_put( input_type v );
        bool register_predecessor( predecessor_type &p );
        bool remove_predecessor( predecessor_type &p );

        // sender
        typedef T output_type;
        typedef receiver<output_type> successor_type;
        bool register_successor( successor_type &r );
        bool remove_successor( successor_type &r );
        bool try_get( output_type &v );
        bool try_reserve( output_type & );
        bool try_release( );
        bool try_consume( );
    };

    }
    }

6.15.1 priority_queue_node( graph& g)

Effect
Constructs an empty priority_queue_node that belongs to graph g.

6.15.2 priority_queue_node( const priority_queue_node &src )

Effect
Constructs an empty priority_queue_node that belongs to the same graph g as src. The list of predecessors, the list of successors and the messages in the buffer are NOT copied.
6.15.3 bool register_predecessor( predecessor_type & )

Description
Never rejects puts and therefore does not need to maintain a list of predecessors.

Returns
false.

6.15.4 bool remove_predecessor( predecessor_type &)

Description
Never rejects puts and therefore does not need to maintain a list of predecessors.

Returns
false.

6.15.5 bool try_put( const input_type &v )

Effect
Adds v to the priority_queue_node. If v’s priority is the largest of all of the currently buffered messages, a task is enqueued to forward the item to a successor.

Returns
true.

6.15.6 bool register_successor( successor_type &r )

Effect
Adds r to the set of successors.

Returns
true.

6.15.7 bool remove_successor( successor_type &r )

Effect
Removes r from the set of successors.

Returns
true.

6.15.8 bool try_get( output_type & v )

Returns
Returns true if a message is available in the node and the node is not currently reserved. Otherwise returns false. If the node returns true, the message with the largest priority will have been copied to v.

6.15.9 bool try_reserve( output_type & v )

Effect
If the call returns true, the node is reserved and will forward no more messages until the reservation has been released or consumed.

Returns
Returns true if a message is available in the node and the node is not currently reserved. Otherwise returns false. If the node returns true, the message with the largest priority will have been copied to v.

6.15.10 bool try_release( )

Effect
Releases the reservation on the node. The item that was returned in the last successful call to try_reserve remains in the priority_queue_node.

Returns
Returns true if the buffer is currently reserved and false otherwise.

6.15.11 bool try_consume( )

Effect
Releases the reservation on the node. The item that was returned in the last successful call to try_reserve is removed from the priority_queue_node.

Returns
Returns true if the buffer is currently reserved and false otherwise.
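To make "largest priority as determined by Compare" concrete, the sketch below uses std::priority_queue from the C++ standard library as a stand-in, on the assumption that Compare is interpreted the same way there (with std::less, the largest element comes out first). forwarding_order is a hypothetical helper introduced for illustration; it only computes the order in which buffered items would be taken and does not model successors or reservations.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <vector>

// Returns the order in which items put into a priority-ordered buffer
// would be taken out again. Stand-in built on std::priority_queue;
// not TBB code.
template <typename T, typename Compare = std::less<T> >
std::vector<T> forwarding_order(const std::vector<T>& puts) {
    std::priority_queue<T, std::vector<T>, Compare> pq;
    for (std::size_t i = 0; i < puts.size(); ++i) {
        pq.push(puts[i]);
    }
    std::vector<T> order;
    while (!pq.empty()) {
        order.push_back(pq.top());  // item with the largest priority
        pq.pop();
    }
    return order;
}
```

With the default std::less, the puts 3, 1, 2 come out as 3, 2, 1. Passing std::greater as Compare reverses this to 1, 2, 3, which is how a smallest-first order would be requested under the same assumption.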
6.16 sequencer_node Template Class

Summary
An unbounded buffer of messages of type T. Messages are forwarded in sequence order.

Syntax

    template< typename T, typename A=cache_aligned_allocator<T> >
    class sequencer_node;

Header

    #include "tbb/flow_graph.h"

Description
A sequencer_node is a graph_node, receiver and sender that forwards messages in sequence order to a single successor in its successor set. Successors are tried in the order that they were registered with the node. If a successor rejects the message, it is removed from the successor list as described by the policy in Figure 4 and the next successor in the set is tried. This continues until a successor accepts the message, or all successors have been attempted. Items that are successfully transferred to a successor are removed from the buffer.

Each item that passes through a sequencer_node is ordered by its sequencer order number. These sequence order numbers range from 0 … N, where N is the largest integer representable by the size_t type. An item’s sequencer order number is determined by passing the item to a user-provided function object that models the Sequencer Concept shown in Table 26.

Table 26: sequencer_node Sequencer Concept

Pseudo-Signature                        Semantics
S::S( const S& )                        Copy constructor.
S::~S()                                 Destructor.
void operator=( const S& )              Assignment (see footnote 22).
size_t S::operator()( const T &v )      Returns the sequence number for the provided message v.

Footnote 22: The return type void in the pseudo-signature denotes that operator= is not required to return a value. The actual operator= can return a value, which will be ignored.

A sequencer_node is reservable and supports a single reservation at a time. While a sequencer_node is reserved, no other items will be forwarded to successors and all try_get calls will return false. While reserved, try_put will still return true and add items to the sequencer_node.
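The Sequencer concept and the resulting in-order release can be sketched as follows. Packet, PacketSequencer and SequencerModel are hypothetical names introduced for illustration; the model is single-threaded, ignores successors and reservations, and assumes that each sequence number is put at most once.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>

// Hypothetical message type that carries its own sequence number.
struct Packet {
    std::size_t seq;
    int payload;
};

// Models the Sequencer concept: size_t S::operator()( const T &v ).
struct PacketSequencer {
    std::size_t operator()(const Packet& p) const { return p.seq; }
};

// Toy model of in-order release; not TBB code.
template <typename T, typename Sequencer>
class SequencerModel {
public:
    explicit SequencerModel(const Sequencer& s) : seq_(s), next_(0) {}

    // Items may arrive in any order and are held until their turn.
    bool try_put(const T& v) {
        held_.insert(std::make_pair(seq_(v), v));
        return true;
    }

    // Succeeds only when the next item in sequence order is present.
    bool try_get(T& v) {
        typename std::map<std::size_t, T>::iterator it = held_.find(next_);
        if (it == held_.end()) return false;
        v = it->second;
        held_.erase(it);
        ++next_;
        return true;
    }

private:
    Sequencer seq_;                   // copy of the user-provided sequencer
    std::size_t next_;                // next sequence number to release
    std::map<std::size_t, T> held_;   // items waiting for their turn
};
```

If item 1 is put before item 0, try_get fails until item 0 arrives, after which the two items come out in the order 0, 1.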
An allocator of type A is used to allocate internal memory for the sequencer_node.

T must be copy-constructible and assignable.

Rejection of messages by successors is handled using the protocol in Figure 4.

Members

    namespace tbb {
    namespace flow {

    template< typename T, typename A=cache_aligned_allocator<T> >
    class sequencer_node : public queue_node<T> {
    public:
        template< typename Sequencer >
        sequencer_node( graph &g, const Sequencer& s );
        sequencer_node( const sequencer_node &src );

        // receiver
        typedef T input_type;
        typedef sender<input_type> predecessor_type;
        bool try_put( input_type v );
        bool register_predecessor( predecessor_type &p );
        bool remove_predecessor( predecessor_type &p );

        // sender
        typedef T output_type;
        typedef receiver<output_type> successor_type;
        bool register_successor( successor_type &r );
        bool remove_successor( successor_type &r );
        bool try_get( output_type &v );
        bool try_reserve( output_type & );
        bool try_release( );
        bool try_consume( );
    };

    }
    }

6.16.1 template<typename Sequencer> sequencer_node( graph& g, const Sequencer& s )

Effect
Constructs an empty sequencer_node that belongs to graph g and uses s to compute sequence numbers for items.

6.16.2 sequencer_node( const sequencer_node &src )

Effect
Constructs an empty sequencer_node that belongs to the same graph g as src and will use a copy of the Sequencer s used to construct src. The list of predecessors, the list of successors and the messages in the buffer are NOT copied.

CAUTION: The new Sequencer object is copy constructed from a copy of the original Sequencer object provided to src at its construction. Therefore changes made to member variables in src’s object will not affect the Sequencer of the new sequencer_node.

6.16.3 bool register_predecessor( predecessor_type & )

Description
Never rejects puts and therefore does not need to maintain a list of predecessors.

Returns
false.
6.16.4 bool remove_predecessor( predecessor_type & )

Description
Never rejects puts and therefore does not need to maintain a list of predecessors.

Returns
false.

6.16.5 bool try_put( input_type v )

Effect
Adds v to the sequencer_node. If v's sequence number is the next item in the sequence, a task is enqueued to forward the item to a successor.

Returns
true.

6.16.6 bool register_successor( successor_type &r )

Effect
Adds r to the set of successors.

Returns
true.

6.16.7 bool remove_successor( successor_type &r )

Effect
Removes r from the set of successors.

Returns
true.

6.16.8 bool try_get( output_type & v )

Returns
true if the next item in the sequence is available in the sequencer_node. If so, it is removed from the node and assigned to v. Returns false if the next item in sequencer order is not available or if the node is reserved.

6.16.9 bool try_reserve( output_type & v )

Effect
If the call returns true, the node is reserved and will forward no more messages until the reservation has been released or consumed.

Returns
true if the next item in sequencer order is available in the sequencer_node. If so, the item is assigned to v, but is not removed from the sequencer_node. Returns false if the next item in sequencer order is not available or if the node is reserved.

6.16.10 bool try_release( )

Effect
Releases the reservation on the node. The item that was returned in the last successful call to try_reserve remains in the sequencer_node.

Returns
true if the buffer is currently reserved and false otherwise.

6.16.11 bool try_consume( )

Effect
Releases the reservation on the node. The item that was returned in the last successful call to try_reserve is removed from the sequencer_node.

Returns
true if the buffer is currently reserved and false otherwise.

6.17 limiter_node Template Class

Summary
A node that counts and limits the number of messages that pass through it.
Syntax
    template < typename T > class limiter_node;

Header
    #include "tbb/flow_graph.h"

Description
A limiter_node is a graph_node, receiver<T> and sender<T> that broadcasts messages to all of its successors. It keeps a counter C of the number of broadcasts it makes and does not accept new messages once its user-specified threshold T is reached. The internal count of broadcasts C can be decremented through use of its embedded continue_receiver decrement.

The behavior of a call to a limiter_node's try_put is shown in Table 27.

Table 27: Behavior of a call to a limiter_node's try_put

    Value of counter C    bool try_put( input_type v )
    C < T                 C is incremented and v is broadcast to all successors. If no
                          successor accepts the message, C is decremented. Returns true
                          if the message was successfully broadcast to at least one
                          successor and false otherwise.
    C == T                Returns false.

When try_put is called on the member object decrement, the limiter_node will try to get a message from one of its known predecessors and forward that message to all of its successors. If it cannot obtain a message from a predecessor, it will decrement C. Rejection of messages by successors and failed gets from predecessors are handled using the protocol in Figure 4.

T must be copy-constructible and assignable.
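The counter logic in Table 27 can be sketched with the standard library alone. The following is a hypothetical model, not the TBB implementation: LimiterModel and accepted_out_of_five are names introduced here, successors are plain callables that return true when they accept a message, and decrement() only covers the case where no predecessor has a message to pull. Guarding the counter against going below zero is an assumption of the sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical model of the counter logic in Table 27; not the TBB implementation.
template <typename T>
class LimiterModel {
public:
    typedef std::function<bool(const T&)> Successor;

    explicit LimiterModel(std::size_t threshold)
        : threshold_(threshold), count_(0) {
        assert(threshold > 0);
    }

    void add_successor(Successor s) { successors_.push_back(s); }

    // C == T: reject. C < T: count the broadcast, then undo the count
    // if no successor accepted the message.
    bool try_put(const T& v) {
        if (count_ >= threshold_) return false;
        ++count_;
        bool accepted = false;
        for (std::size_t i = 0; i < successors_.size(); ++i)
            if (successors_[i](v)) accepted = true;  // every successor is offered v
        if (!accepted) --count_;
        return accepted;
    }

    // Stands in for a put to the decrement port when no predecessor
    // has a message to pull (assumption: the count never drops below zero).
    void decrement() { if (count_ > 0) --count_; }

    std::size_t count() const { return count_; }

private:
    std::size_t threshold_;
    std::size_t count_;
    std::vector<Successor> successors_;
};

// Offers the values 0..4 to a limiter with threshold 2 and
// returns how many of them were accepted.
int accepted_out_of_five() {
    std::vector<int> seen;
    LimiterModel<int> limiter(2);
    limiter.add_successor([&seen](const int& v) { seen.push_back(v); return true; });
    int accepted = 0;
    for (int i = 0; i < 5; ++i)
        if (limiter.try_put(i)) ++accepted;
    return accepted;
}
```

With a threshold of 2, only the first two puts are accepted; a later call to decrement() would open one more slot.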
Members
    namespace tbb {
    namespace flow {

    template< typename T >
    class limiter_node : public graph_node, public receiver<T>,
                         public sender<T> {
    public:
        limiter_node( graph &g, size_t threshold,
                      int number_of_decrement_predecessors = 0 );
        limiter_node( const limiter_node &src );

        // a continue_receiver
        implementation-dependent-type decrement;

        // receiver<T>
        typedef T input_type;
        typedef sender<input_type> predecessor_type;
        bool try_put( const input_type &v );
        bool register_predecessor( predecessor_type &p );
        bool remove_predecessor( predecessor_type &p );

        // sender<T>
        typedef T output_type;
        typedef receiver<output_type> successor_type;
        bool register_successor( successor_type &r );
        bool remove_successor( successor_type &r );
        bool try_get( output_type &v );
        bool try_reserve( output_type & );
        bool try_release( );
        bool try_consume( );
    };

    }
    }

6.17.1 limiter_node( graph &g, size_t threshold, int number_of_decrement_predecessors )

Description
Constructs a limiter_node that allows up to threshold items to pass through before rejecting try_puts. Optionally a number_of_decrement_predecessors value can be supplied. This value is passed on to the continue_receiver decrement's constructor.

6.17.2 limiter_node( const limiter_node &src )

Description
Constructs a limiter_node that has the same initial state that src had at its construction. The new limiter_node will belong to the same graph g as src, have the same threshold, and have the same initial number_of_decrement_predecessors. The list of predecessors, the list of successors and the current count of broadcasts, C, are NOT copied from src.

6.17.3 bool register_predecessor( predecessor_type& p )

Description
Adds a predecessor that can be pulled from once the broadcast count falls below the threshold.

Effect
Adds p to the set of predecessors.

Returns
true.

6.17.4 bool remove_predecessor( predecessor_type & r )

Effect
Removes r from the set of predecessors.

Returns
true.
6.17.5 bool try_put( const input_type &v )

Effect
If the broadcast count is below the threshold, v is broadcast to all successors. For each successor s, if s.try_put( v ) == false && s.register_predecessor( *this ) == true, then s is removed from the set of successors. Otherwise, s will remain in the set of successors.

Returns
true if v is broadcast; false if v is not broadcast because the threshold has been reached.

6.17.6 bool register_successor( successor_type & r )

Effect
Adds r to the set of successors.

Returns
true.

6.17.7 bool remove_successor( successor_type & r )

Effect
Removes r from the set of successors.

Returns
true.

6.17.8 bool try_get( output_type & )

Description
Does not contain buffering and therefore cannot be pulled from.

Returns
false.

6.17.9 bool try_reserve( output_type & )

Description
Does not support reservations.

Returns
false.

6.17.10 bool try_release( )

Description
Does not support reservations.

Returns
false.

6.17.11 bool try_consume( )

Description
Does not support reservations.

Returns
false.

6.18 join_node Template Class

Summary
A node that creates a tuple from a set of messages received at its input ports and broadcasts the tuple to all of its successors. The class join_node supports three buffering policies at its input ports: reserving, queueing and tag_matching. By default, join_node input ports use the queueing policy.

Syntax
    template<typename OutputTuple, graph_buffer_policy JP = queueing>
    class join_node;

Header
    #include "tbb/flow_graph.h"

Description
A join_node is a graph_node and a sender< std::tuple< T0, T1, … > >. It contains a tuple of input ports, each of which is a receiver<Ti> for each of the T0 .. TN in OutputTuple. It supports multiple input receivers with distinct types and broadcasts a tuple of received messages to all of its successors. All input ports of a join_node must use the same buffering policy. The behavior of a join_node based on its buffering policy is shown in Table 28.
Table 28: Behavior of a join_node based on the buffering policy of its input ports.

    Buffering Policy: queueing
    Behavior: As each input port is put to, the incoming message is added to an unbounded first-in first-out queue in the port. When there is at least one message at each input port, the join_node broadcasts a tuple containing the head of each queue to all successors. If at least one successor accepts the tuple, the head of each input port's queue is removed; otherwise, the messages remain in their respective input port queues.

    Buffering Policy: reserving
    Behavior: As each input port is put to, the join_node marks that an input may be available at that port and returns false. When all ports have been marked as possibly available, the join_node will try to reserve a message at each port from their known predecessors. If it is unable to reserve a message at a port, it un-marks that port, and releases all previously acquired reservations. If it is able to reserve a message at all ports, it broadcasts a tuple containing these messages to all successors. If at least one successor accepts the tuple, the reservations are consumed; otherwise, they are released.

    Buffering Policy: tag_matching
    Behavior: As each input port is put to, a user-provided function object is applied to the message to obtain its tag. The message is then added to a hash table at the input port, using the tag as the key. When there is a message at each input port for a given tag, the join_node broadcasts a tuple containing the matching messages to all successors. If at least one successor accepts the tuple, the messages are removed from each input port's hash table; otherwise, the messages remain in their respective input ports.

Rejection of messages by successors of the join_node and failed gets from predecessors of the input ports are handled using the protocol in Figure 4.

The function template input_port described in 6.19 simplifies the syntax for getting a reference to a specific input port.
OutputTuple must be a std::tuple<T0,T1,…> where each element is copy-constructible and assignable.

Example
    #include <cstdio>
    #include <tuple>
    #include "tbb/flow_graph.h"

    using namespace tbb::flow;

    int main() {
        graph g;

        // f1 doubles an int, f2 halves a float.
        function_node<int,int>
            f1( g, unlimited, [](const int &i) { return 2*i; } );
        function_node<float,float>
            f2( g, unlimited, [](const float &f) { return f/2; } );

        // j pairs one output of f1 with one output of f2.
        join_node< std::tuple<int,float> > j(g);

        // f3 adds the two tuple elements and prints the sum.
        function_node< std::tuple<int,float> >
            f3( g, unlimited,
                []( const std::tuple<int,float> &t ) {
                    printf("Result is %f\n",
                           std::get<0>(t) + std::get<1>(t));
                } );

        make_edge( f1, input_port<0>(j) );
        make_edge( f2, input_port<1>(j) );
        make_edge( j, f3 );

        f1.try_put( 3 );
        f2.try_put( 3 );
        g.wait_for_all();
        return 0;
    }

In the example above, three function_node objects are created: f1 multiplies an int i by 2, f2 divides a float f by 2, and f3 receives a std::tuple<int,float> t, adds its elements together and prints the result. The join_node j combines the output of f1 and f2 and forwards the resulting tuple to f3. This example is purely a syntactic demonstration since there is very little work in the nodes.
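The queueing policy from Table 28 can also be pictured without TBB. The sketch below is a hypothetical two-port model, not the TBB implementation: QueueingJoinModel and joined_sum are names introduced here, each port is a std::deque, and a tuple is produced on request rather than broadcast, so the rule about a successor having to accept the tuple is not modelled.

```cpp
#include <deque>
#include <tuple>

// Hypothetical two-port model of the queueing policy; not the TBB implementation.
template <typename T0, typename T1>
class QueueingJoinModel {
public:
    // Each put appends to the FIFO queue of its own port.
    void put0(const T0& v) { q0_.push_back(v); }
    void put1(const T1& v) { q1_.push_back(v); }

    // Succeeds only when every port holds at least one message;
    // the head of each queue is then removed and returned as a tuple.
    bool try_get(std::tuple<T0, T1>& out) {
        if (q0_.empty() || q1_.empty()) return false;
        out = std::make_tuple(q0_.front(), q1_.front());
        q0_.pop_front();
        q1_.pop_front();
        return true;
    }

private:
    std::deque<T0> q0_;
    std::deque<T1> q1_;
};

// Mirrors the values in the example above: port 0 receives 2*3, port 1 receives 3/2.
// Returns the sum of the joined tuple, or -1 if the model misbehaves.
float joined_sum() {
    QueueingJoinModel<int, float> j;
    std::tuple<int, float> t;
    j.put0(2 * 3);
    if (j.try_get(t)) return -1.0f;   // port 1 is still empty, so no tuple yet
    j.put1(3.0f / 2);
    if (!j.try_get(t)) return -1.0f;  // both ports now hold a message
    return std::get<0>(t) + std::get<1>(t);
}
```

The first try_get fails because only one port has a message, which is the condition Table 28 gives for the queueing policy to hold messages back.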
Members
    namespace tbb {
    namespace flow {

    enum graph_buffer_policy {
        rejecting, reserving, queueing, tag_matching };

    template<typename OutputTuple, graph_buffer_policy JP = queueing>
    class join_node :
        public graph_node, public sender< OutputTuple > {
    public:
        typedef OutputTuple output_type;
        typedef receiver<output_type> successor_type;
        implementation-dependent-tuple input_ports_tuple_type;

        join_node(graph &g);
        join_node(const join_node &src);
        input_ports_tuple_type &inputs();

        bool register_successor( successor_type &r );
        bool remove_successor( successor_type &r );
        bool try_get( output_type &v );
        bool try_reserve( output_type & );
        bool try_release( );
        bool try_consume( );
    };

    //
    // Specialization for tag_matching
    //
    template<typename OutputTuple>
    class join_node<OutputTuple, tag_matching> :
        public graph_node, public sender< OutputTuple > {
    public:
        // Has the same methods as previous join_node,
        // but has constructors to specify the tag_matching
        // function objects
        template<typename B0, typename B1>
        join_node(graph &g, B0 b0, B1 b1);

        // Constructors are defined similarly for
        // 3 through 10 elements …
    };

    }
    }

6.18.1 join_node( graph &g )

Effect
Creates a join_node that will enqueue tasks using the root task in g.

6.18.2 template < typename B0, typename B1, … > join_node( graph &g, B0 b0, B1 b1, … )

Description
A constructor only available in the tag_matching specialization of join_node.

Effect
Creates a join_node that uses the function objects b0, b1, …, bN to determine the tags for the input ports 0 through N. It will enqueue tasks using the root task in g.

6.18.3 join_node( const join_node &src )

Effect
Creates a join_node that has the same initial state that src had at its construction. The list of predecessors, messages in the input ports, and successors are NOT copied.

6.18.4 input_ports_tuple_type& inputs()

Returns
A std::tuple of receivers. Each element inherits from tbb::receiver<T> where T is the type of message expected at that input. Each tuple element can be used like any other flow::receiver<T>.
The behavior of the ports based on the selected join_node policy is shown in Table 28.

6.18.5 bool register_successor( successor_type & r )

Effect
Adds r to the set of successors.

Returns
true.

6.18.6 bool remove_successor( successor_type & r )

Effect
Removes r from the set of successors.

Returns
true.

6.18.7 bool try_get( output_type &v )

Description
Attempts to generate a tuple based on the buffering policy of the join_node.

Returns
If it can successfully generate a tuple, it copies it to v and returns true. Otherwise it returns false.

6.18.8 bool try_reserve( output_type & )

Description
A join_node cannot be reserved.

Returns
false.

6.18.9 bool try_release( )

Description
A join_node cannot be reserved.

Returns
false.

6.18.10 bool try_consume( )

Description
A join_node cannot be reserved.

Returns
false.

6.18.11 template<size_t N, typename JNT> typename std::tuple_element<N, typename JNT::input_ports_tuple_type>::type &input_port(JNT &jn)

Description
Equivalent to calling std::get<N>( jn.inputs() ).

Returns
The Nth input port for join_node jn.

6.19 input_port Template Function

Summary
A template function that, given a join_node or or_node, returns a reference to a specific input port.

Syntax
    template<size_t N, typename NT>
    typename std::tuple_element<N,
        typename NT::input_ports_tuple_type>::type&
    input_port(NT &n);

Header
    #include "tbb/flow_graph.h"

6.20 make_edge Template Function

Summary
A template function that adds an edge between a sender<T> and a receiver<T>.

Syntax
    template< typename T >
    inline void make_edge( sender<T> &p, receiver<T> &s );

Header
    #include "tbb/flow_graph.h"

6.21 remove_edge Template Function

Summary
A template function that removes an edge between a sender<T> and a receiver<T>.

Syntax
    template< typename T >
    void remove_edge( sender<T> &p, receiver<T> &s );

Header
    #include "tbb/flow_graph.h"

6.22 copy_body Template Function

Summary
A template function that returns a copy of the body function object from a continue_node or function_node.
Syntax
    template< typename Body, typename Node >
    Body copy_body( Node &n );

Header
    #include "tbb/flow_graph.h"

7 Thread Local Storage

Intel® Threading Building Blocks (Intel® TBB) provides two template classes for thread local storage. Both provide a thread-local element per thread. Both lazily create the elements on demand. They differ in their intended use models:

• combinable provides thread-local storage for holding per-thread subcomputations that will later be reduced to a single result. It is PPL compatible.
• enumerable_thread_specific provides thread-local storage that acts like an STL container with one element per thread. The container permits iterating over the elements using the usual STL iteration idioms.

This chapter also describes template class flattened2d, which assists a common idiom where an enumerable_thread_specific represents a container partitioned across threads.

7.1 combinable Template Class

Summary
Template class for holding thread-local values during a parallel computation that will be merged into a final result.

Syntax
    template<typename T> class combinable;

Header
    #include "tbb/combinable.h"

Description
A combinable<T> provides each thread with its own local instance of type T.

Members
    namespace tbb {
        template <typename T>
        class combinable {
        public:
            combinable();

            template <typename FInit>
            combinable(FInit finit);

            combinable(const combinable& other);

            ~combinable();

            combinable& operator=( const combinable& other);
            void clear();

            T& local();
            T& local(bool & exists);

            template<typename FCombine> T combine(FCombine fcombine);
            template<typename Func> void combine_each(Func f);
        };
    }

7.1.1 combinable()

Effects
Constructs combinable such that any thread-local instances of T will be created using default construction.

7.1.2 template<typename FInit> combinable(FInit finit)

Effects
Constructs combinable such that any thread-local element will be created by copying the result of finit().

NOTE: The expression finit() must be safe to evaluate concurrently by multiple threads.
It is evaluated each time a thread-local element is created.

7.1.3 combinable( const combinable& other );

Effects
Construct a copy of other, so that it has copies of each element in other with the same thread mapping.

7.1.4 ~combinable()

Effects
Destroy all thread-local elements in *this.

7.1.5 combinable& operator=( const combinable& other )

Effects
Set *this to be a copy of other.

7.1.6 void clear()

Effects
Remove all elements from *this.

7.1.7 T& local()

Effects
If thread-local element does not exist, create it.

Returns
Reference to thread-local element.

7.1.8 T& local( bool& exists )

Effects
Similar to local(), except that exists is set to true if an element was already present for the current thread; false otherwise.

Returns
Reference to thread-local element.

7.1.9 template<typename FCombine> T combine(FCombine fcombine)

Requires
Parameter fcombine should be an associative binary functor with the signature T(T,T) or T(const T&,const T&).

Effects
Computes reduction over all elements using binary functor fcombine. If there are no elements, creates the result using the same rules as for creating a thread-local element.

Returns
Result of the reduction.

7.1.10 template<typename Func> void combine_each(Func f)

Requires
Parameter f should be a unary functor with the signature void(T) or void(const T&).

Effects
Evaluates f(x) for each instance x of T in *this.

7.2 enumerable_thread_specific Template Class

Summary
Template class for thread local storage.

Syntax
    enum ets_key_usage_type {
        ets_key_per_instance,
        ets_no_key
    };

    template <typename T,
              typename Allocator=cache_aligned_allocator<T>,
              ets_key_usage_type ETS_key_type=ets_no_key>
    class enumerable_thread_specific;

Header
    #include "tbb/enumerable_thread_specific.h"

Description
An enumerable_thread_specific provides thread local storage (TLS) for elements of type T. An enumerable_thread_specific acts as a container by providing iterators and ranges across all of the thread-local elements.

The thread-local elements are created lazily.
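Lazy per-thread creation, together with the combine reduction described in 7.1.9, can be modelled with the standard library. This is a sketch under stated assumptions and not how TBB implements either class: PerThreadModel and sum_over_threads are hypothetical names, elements live in a mutex-protected std::map keyed by std::thread::id, and a new element is always default constructed.

```cpp
#include <cstddef>
#include <map>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical model of lazily created per-thread elements; not the TBB implementation.
template <typename T>
class PerThreadModel {
public:
    // Returns the calling thread's element, default-constructing it on first use.
    // References into a std::map stay valid when other elements are inserted.
    T& local() {
        std::lock_guard<std::mutex> lock(mutex_);
        return slots_[std::this_thread::get_id()];
    }

    // Folds all existing elements with an associative binary functor.
    // With no elements, the result is a default-constructed T.
    template <typename FCombine>
    T combine(FCombine f) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (slots_.empty()) return T();
        typename std::map<std::thread::id, T>::const_iterator it = slots_.begin();
        T result = it->second;
        for (++it; it != slots_.end(); ++it) result = f(result, it->second);
        return result;
    }

private:
    std::mutex mutex_;
    std::map<std::thread::id, T> slots_;
};

// Four threads each add 1000 to their own element; the fold yields the total.
int sum_over_threads() {
    PerThreadModel<int> counters;
    std::vector<std::thread> workers;
    for (int t = 0; t < 4; ++t)
        workers.emplace_back([&counters] {
            int& mine = counters.local();   // created on this thread's first access
            for (int i = 0; i < 1000; ++i) ++mine;
        });
    for (std::size_t t = 0; t < workers.size(); ++t) workers[t].join();
    return counters.combine([](int a, int b) { return a + b; });
}
```

Only the lookup is synchronized; each thread then updates its own element without locking, which is the point of keeping per-thread subcomputations and reducing them once at the end.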
A freshly constructed enumerable_thread_specific has no elements. When a thread requests access to an enumerable_thread_specific, it creates an element corresponding to that thread. The number of elements is equal to the number of distinct threads that have accessed the enumerable_thread_specific and not the number of threads in use by the application. Clearing an enumerable_thread_specific removes all of its elements.

The ETS_key_usage_type parameter can be used to select between an implementation that consumes no native TLS keys and a specialization that offers higher performance but consumes 1 native TLS key per enumerable_thread_specific instance. If no ETS_key_usage_type parameter is provided, ets_no_key is used by default.

CAUTION: The number of native TLS keys is limited and can be fairly small, for example 64 or 128. Therefore it is recommended to restrict the use of the ets_key_per_instance specialization to only the most performance critical cases.

Example
The following code shows a simple example usage of enumerable_thread_specific. The number of calls to null_parallel_for_body::operator() and total number of iterations executed are counted by each thread that participates in the parallel_for, and these counts are printed at the end of main.
    #include <cstdio>
    #include <utility>

    #include "tbb/task_scheduler_init.h"
    #include "tbb/enumerable_thread_specific.h"
    #include "tbb/parallel_for.h"
    #include "tbb/blocked_range.h"

    using namespace tbb;

    // first: number of calls to operator(); second: iterations executed
    typedef enumerable_thread_specific< std::pair<int,int> >
        CounterType;

    CounterType MyCounters (std::make_pair(0,0));

    struct Body {
        void operator()(const tbb::blocked_range<int> &r) const {
            CounterType::reference my_counter = MyCounters.local();
            ++my_counter.first;
            for (int i = r.begin(); i != r.end(); ++i)
                ++my_counter.second;
        }
    };

    int main() {
        parallel_for( blocked_range<int>(0, 100000000), Body());

        for (CounterType::const_iterator i = MyCounters.begin();
             i != MyCounters.end(); ++i)
        {
            printf("Thread stats:\n");
            printf(" calls to operator(): %d", i->first);
            printf(" total # of iterations executed: %d\n\n",
                   i->second);
        }
    }

Example with Lambda Expressions
Class enumerable_thread_specific has a method combine(f) that does reduction using binary functor f, which can be written using a lambda expression.
For example, the previous example can be extended to sum the thread-local values by adding the following lines to the end of function main:

    std::pair<int,int> sum =
        MyCounters.combine([](std::pair<int,int> x,
                              std::pair<int,int> y) {
            return std::make_pair(x.first+y.first,
                                  x.second+y.second);
        });
    printf("Total calls to operator() = %d, "
           "total iterations = %d\n", sum.first, sum.second);

Members
    namespace tbb {
        template <typename T,
                  typename Allocator=cache_aligned_allocator<T>,
                  ets_key_usage_type ETS_key_type=ets_no_key >
        class enumerable_thread_specific {
        public:
            // Basic types
            typedef Allocator allocator_type;
            typedef T value_type;
            typedef T& reference;
            typedef const T& const_reference;
            typedef T* pointer;
            typedef implementation-dependent size_type;
            typedef implementation-dependent difference_type;

            // Iterator types
            typedef implementation-dependent iterator;
            typedef implementation-dependent const_iterator;

            // Parallel range types
            typedef implementation-dependent range_type;
            typedef implementation-dependent const_range_type;

            // Whole container operations
            enumerable_thread_specific();
            enumerable_thread_specific(
                const enumerable_thread_specific &other
            );
            template<typename U, typename Alloc,
                     ets_key_usage_type Cachetype>
            enumerable_thread_specific(
                const enumerable_thread_specific<U, Alloc,
                                                 Cachetype>& other );
            template <typename Finit>
            enumerable_thread_specific( Finit finit );
            enumerable_thread_specific(const T &exemplar);
            ~enumerable_thread_specific();
            enumerable_thread_specific&
            operator=(const enumerable_thread_specific& other);
            template<typename U, typename Alloc,
                     ets_key_usage_type Cachetype>
            enumerable_thread_specific&
            operator=(
                const enumerable_thread_specific<U, Alloc,
                                                 Cachetype>& other
            );
            void clear();

            // Concurrent operations
            reference local();
            reference local(bool& exists);
            size_type size() const;
            bool empty() const;

            // Combining
            template<typename FCombine> T combine(FCombine fcombine);
            template<typename Func> void combine_each(Func f);

            // Parallel iteration
            range_type range( size_t grainsize=1 );
            const_range_type range( size_t grainsize=1 ) const;

            // Iterators
            iterator begin();
            iterator end();
            const_iterator begin() const;
            const_iterator end() const;
        };
    }

7.2.1 Whole Container Operations

Safety
These operations must not be invoked concurrently on the same instance of enumerable_thread_specific.

7.2.1.1 enumerable_thread_specific()

Effects
Constructs an enumerable_thread_specific where each local copy will be default constructed.

7.2.1.2 enumerable_thread_specific(const enumerable_thread_specific &other)

Effects
Copy construct an enumerable_thread_specific. The values are copy constructed from the values in other and have same thread correspondence.

7.2.1.3 template<typename U, typename Alloc, ets_key_usage_type Cachetype> enumerable_thread_specific( const enumerable_thread_specific<U, Alloc, Cachetype>& other )

Effects
Copy construct an enumerable_thread_specific. The values are copy constructed from the values in other and have same thread correspondence.

7.2.1.4 template< typename Finit> enumerable_thread_specific(Finit finit)

Effects
Constructs enumerable_thread_specific such that any thread-local element will be created by copying the result of finit().

NOTE: The expression finit() must be safe to evaluate concurrently by multiple threads. It is evaluated each time a thread-local element is created.

7.2.1.5 enumerable_thread_specific(const T &exemplar)

Effects
Constructs an enumerable_thread_specific where each local copy will be copy constructed from exemplar.

7.2.1.6 ~enumerable_thread_specific()

Effects
Destroys all elements in *this. Destroys any native TLS keys that were created for this instance.

7.2.1.7 enumerable_thread_specific& operator=(const enumerable_thread_specific& other);

Effects
Set *this to be a copy of other.

7.2.1.8 template< typename U, typename Alloc, ets_key_usage_type Cachetype> enumerable_thread_specific& operator=(const enumerable_thread_specific<U, Alloc, Cachetype>& other);

Effects
Set *this to be a copy of other.

NOTE: The allocator and key usage specialization is unchanged by this call.

7.2.1.9 void clear()

Effects
Destroys all elements in *this. Destroys and then recreates any native TLS keys used in the implementation.
NOTE: In the current implementation, there is no performance advantage of using clear instead of destroying and reconstructing an enumerable_thread_specific.

7.2.2 Concurrent Operations

7.2.2.1 reference local()

Returns
A reference to the element of *this that corresponds to the current thread.

Effects
If there is no current element corresponding to the current thread, then constructs a new element. A new element is copy-constructed if an exemplar was provided to the constructor for *this, otherwise a new element is default constructed.

7.2.2.2 reference local( bool& exists )

Effects
Similar to local(), except that exists is set to true if an element was already present for the current thread; false otherwise.

Returns
Reference to thread-local element.

7.2.2.3 size_type size() const

Returns
The number of elements in *this. The value is equal to the number of distinct threads that have called local() after *this was constructed or most recently cleared.

7.2.2.4 bool empty() const

Returns
size()==0

7.2.3 Combining

The methods in this section iterate across the entire container.

7.2.3.1 template<typename FCombine> T combine(FCombine fcombine)

Requires
Parameter fcombine should be an associative binary functor with the signature T(T,T) or T(const T&,const T&).

Effects
Computes reduction over all elements using binary functor fcombine. If there are no elements, creates the result using the same rules as for creating a thread-local element.

Returns
Result of the reduction.

7.2.3.2 template<typename Func> void combine_each(Func f)

Requires
Parameter f should be a unary functor with the signature void(T) or void(const T&).

Effects
Evaluates f(x) for each instance x of T in *this.

7.2.4 Parallel Iteration

Types const_range_type and range_type model the Container Range concept (5.1). The types differ only in that the bounds for a const_range_type are of type const_iterator, whereas the bounds for a range_type are of type iterator.
7.2.4.1 const_range_type range( size_t grainsize=1 ) const

Returns
A const_range_type representing all elements in *this. The parameter grainsize is in units of elements.

7.2.4.2 range_type range( size_t grainsize=1 )

Returns
A range_type representing all elements in *this. The parameter grainsize is in units of elements.

7.2.5 Iterators

Template class enumerable_thread_specific supports random access iterators, which enable iteration over the set of all elements in the container.

7.2.5.1 iterator begin()

Returns
iterator pointing to beginning of the set of elements.

7.2.5.2 iterator end()

Returns
iterator pointing to end of the set of elements.

7.2.5.3 const_iterator begin() const

Returns
const_iterator pointing to beginning of the set of elements.

7.2.5.4 const_iterator end() const

Returns
const_iterator pointing to the end of the set of elements.

7.3 flattened2d Template Class

Summary
Adaptor that provides a flattened view of a container of containers.

Syntax
    template<typename Container>
    class flattened2d;

    template <typename Container>
    flattened2d<Container> flatten2d(const Container &c);

    template <typename Container>
    flattened2d<Container> flatten2d(
        const Container &c,
        const typename Container::const_iterator b,
        const typename Container::const_iterator e);

Header
    #include "tbb/enumerable_thread_specific.h"

Description
A flattened2d provides a flattened view of a container of containers. Iterating from begin() to end() visits all of the elements in the inner containers. This can be useful when traversing an enumerable_thread_specific whose elements are containers.

The utility function flatten2d creates a flattened2d object from a container.

Example
The following code shows a simple example usage of flatten2d and flattened2d. Each thread collects the values of i that are evenly divisible by K in a thread-local vector. In main, the results are printed by using a flattened2d to simplify the traversal of all of the elements in all of the local vectors.
    #include <iostream>
    #include <utility>
    #include <vector>

    #include "tbb/task_scheduler_init.h"
    #include "tbb/enumerable_thread_specific.h"
    #include "tbb/parallel_for.h"
    #include "tbb/blocked_range.h"

    using namespace tbb;

    // A VecType has a separate std::vector<int> per thread
    typedef enumerable_thread_specific< std::vector<int> > VecType;
    VecType MyVectors;
    int K = 1000000;

    struct Func {
        void operator()(const blocked_range<int>& r) const {
            VecType::reference v = MyVectors.local();
            for (int i=r.begin(); i!=r.end(); ++i)
                if( i%K==0 )
                    v.push_back(i);
        }
    };

    int main() {
        parallel_for(blocked_range<int>(0, 100000000),
                     Func());

        // Traverse every element of every thread-local vector.
        flattened2d<VecType> flat_view = flatten2d( MyVectors );
        for( flattened2d<VecType>::const_iterator
                 i = flat_view.begin(); i != flat_view.end(); ++i)
            std::cout << *i << std::endl;
        return 0;
    }

Members
    namespace tbb {

        template<typename Container>
        class flattened2d {

        public:
            // Basic types
            typedef implementation-dependent size_type;
            typedef implementation-dependent difference_type;
            typedef implementation-dependent allocator_type;
            typedef implementation-dependent value_type;
            typedef implementation-dependent reference;
            typedef implementation-dependent const_reference;
            typedef implementation-dependent pointer;
            typedef implementation-dependent const_pointer;

            typedef implementation-dependent iterator;
            typedef implementation-dependent const_iterator;

            flattened2d( const Container& c );

            flattened2d( const Container& c,
                         typename Container::const_iterator first,
                         typename Container::const_iterator last );

            iterator begin();
            iterator end();
            const_iterator begin() const;
            const_iterator end() const;

            size_type size() const;
        };

        template <typename Container>
        flattened2d<Container> flatten2d(const Container &c);

        template <typename Container>
        flattened2d<Container> flatten2d(
            const Container &c,
            const typename Container::const_iterator first,
            const typename Container::const_iterator last);
    }

7.3.1 Whole Container Operations

Safety
These operations must not be invoked concurrently on the same flattened2d.
7.3.1.1 flattened2d( const Container& c )

Effects
Constructs a flattened2d representing the sequence of elements in the inner containers contained by outer container c.

7.3.1.2 flattened2d( const Container& c, typename Container::const_iterator first, typename Container::const_iterator last )

Effects
Constructs a flattened2d representing the sequence of elements in the inner containers in the half-open interval [first, last) of Container c.

7.3.2 Concurrent Operations

Safety
These operations may be invoked concurrently on the same flattened2d.

7.3.2.1 size_type size() const

Returns
The sum of the sizes of the inner containers that are viewable in the flattened2d.

7.3.3 Iterators

Template class flattened2d supports forward iterators only.

7.3.3.1 iterator begin()

Returns
iterator pointing to beginning of the set of local copies.

7.3.3.2 iterator end()

Returns
iterator pointing to end of the set of local copies.

7.3.3.3 const_iterator begin() const

Returns
const_iterator pointing to beginning of the set of local copies.

7.3.3.4 const_iterator end() const

Returns
const_iterator pointing to the end of the set of local copies.

7.3.4 Utility Functions

template <typename Container> flattened2d<Container> flatten2d(const Container &c, const typename Container::const_iterator b, const typename Container::const_iterator e)

Returns
Constructs and returns a flattened2d that provides iterators that traverse the elements in the containers within the half-open range [b, e) of Container c.

template <typename Container> flattened2d<Container> flatten2d( const Container &c )

Returns
Constructs and returns a flattened2d that provides iterators that traverse the elements in all of the containers within Container c.

8 Memory Allocation

This section describes classes related to memory allocation.
8.1 Allocator Concept

The allocator concept for allocators in Intel® Threading Building Blocks is similar to the "Allocator requirements" in Table 32 of the ISO C++ Standard, but with further guarantees required by the ISO C++ Standard (Section 20.1.5 paragraph 4) for use with ISO C++ containers. Table 29 summarizes the allocator concept. Here, A and B represent instances of the allocator class.

Table 29: Allocator Concept

    Pseudo-Signature
        Semantics

    typedef T* A::pointer
        Pointer to T.
    typedef const T* A::const_pointer
        Pointer to const T.
    typedef T& A::reference
        Reference to T.
    typedef const T& A::const_reference
        Reference to const T.
    typedef T A::value_type
        Type of value to be allocated.
    typedef size_t A::size_type
        Type for representing number of values.
    typedef ptrdiff_t A::difference_type
        Type for representing pointer difference.
    template<typename U> struct rebind { typedef A<U> A::other; };
        Rebind to a different type U.
    A() throw()
        Default constructor.
    A( const A& ) throw()
        Copy constructor.
    template<typename U> A( const A<U>& )
        Rebinding constructor.
    ~A() throw()
        Destructor.
    T* A::address( T& x ) const
        Take address.
    const T* A::const_address( const T& x ) const
        Take const address.
    T* A::allocate( size_type n, const void* hint=0 )
        Allocate space for n values.
    void A::deallocate( T* p, size_t n )
        Deallocate n values.
    size_type A::max_size() const throw()
        Maximum plausible argument to method allocate.
    void A::construct( T* p, const T& value )
        new(p) T(value)
    void A::destroy( T* p )
        p->T::~T()
    bool operator==( const A&, const B& )
        Return true.
    bool operator!=( const A&, const B& )
        Return false.

Model Types
Template classes tbb_allocator (8.2), scalable_allocator (8.3), cache_aligned_allocator (8.4), and zero_allocator (8.5) model the Allocator concept.

8.2 tbb_allocator Template Class

Summary
Template class for scalable memory allocation if available; possibly non-scalable otherwise.
Syntax
    template<typename T> class tbb_allocator

Header
    #include "tbb/tbb_allocator.h"

Description
A tbb_allocator allocates and frees memory via the Intel® TBB malloc library if it is available; otherwise it reverts to using malloc and free.

TIP: Set the environment variable TBB_VERSION to 1 to find out if the Intel® TBB malloc library is being used. Details are in Section 3.1.2.

8.3 scalable_allocator Template Class

Summary
Template class for scalable memory allocation.

Syntax
    template<typename T> class scalable_allocator;

Header
    #include "tbb/scalable_allocator.h"

Description
A scalable_allocator allocates and frees memory in a way that scales with the number of processors. A scalable_allocator models the allocator requirements described in Table 29. Using a scalable_allocator in place of std::allocator may improve program performance. Memory allocated by a scalable_allocator should be freed by a scalable_allocator, not by a std::allocator.

CAUTION: The scalable_allocator requires that the tbbmalloc library be available. If the library is missing, calls to the scalable allocator fail. In contrast, tbb_allocator falls back on malloc and free if the tbbmalloc library is missing.

Members
See Allocator concept (8.1).

Acknowledgement
The scalable memory allocator incorporates McRT technology developed by Intel’s PSL CTG team.

8.3.1 C Interface to Scalable Allocator

Summary
Low level interface for scalable memory allocation.

Syntax
    extern "C" {
        // Scalable analogs of C memory allocator
        void* scalable_malloc( size_t size );
        void scalable_free( void* ptr );
        void* scalable_calloc( size_t nobj, size_t size );
        void* scalable_realloc( void* ptr, size_t size );

        // Analog of _msize/malloc_size/malloc_usable_size.
        size_t scalable_msize( void* ptr );

        // Scalable analog of posix_memalign
        int scalable_posix_memalign( void** memptr, size_t alignment, size_t size );

        // Aligned allocation
        void* scalable_aligned_malloc( size_t size, size_t alignment );
        void scalable_aligned_free( void* ptr );
        void* scalable_aligned_realloc( void* ptr, size_t size, size_t alignment );
    }

Header
    #include "tbb/scalable_allocator.h"

Description
These functions provide a C level interface to the scalable allocator. Each routine scalable_x behaves analogously to library function x. The routines form the two families shown in Table 30. Storage allocated by a scalable_x function in one family must be freed or resized by a scalable_x function in the same family, not by a C standard library function. Likewise, storage allocated by a C standard library function should not be freed or resized by a scalable_x function.

Table 30: C Interface to Scalable Allocator

Family  Allocation Routine          Deallocation Routine    Analogous Library
1       scalable_malloc             scalable_free           C standard library
        scalable_calloc
        scalable_realloc
        scalable_posix_memalign                             POSIX* (see note)
2       scalable_aligned_malloc     scalable_aligned_free   Microsoft* C run-time library
        scalable_aligned_realloc

Note: See "The Open Group* Base Specifications Issue 6", IEEE* Std 1003.1, 2004 Edition for the definition of posix_memalign.

8.3.1.1 size_t scalable_msize( void* ptr )

Returns
The usable size of the memory block pointed to by ptr if it was allocated by the scalable allocator. Returns zero if ptr does not point to such a block.

8.4 cache_aligned_allocator Template Class

Summary
Template class for allocating memory in a way that avoids false sharing.

Syntax
    template<typename T> class cache_aligned_allocator;

Header
    #include "tbb/cache_aligned_allocator.h"

Description
A cache_aligned_allocator allocates memory on cache line boundaries, in order to avoid false sharing.
False sharing occurs when logically distinct items occupy the same cache line, which can hurt performance if multiple threads attempt to access the different items simultaneously. Even though the items are logically separate, the processor hardware may have to transfer the cache line between the processors as if they were sharing a location. The net result can be much more memory traffic than if the logically distinct items were on different cache lines.

A cache_aligned_allocator models the allocator requirements described in Table 29. It can be used to replace a std::allocator. Used judiciously, cache_aligned_allocator can improve performance by reducing false sharing. However, it is sometimes an inappropriate replacement, because the benefit of allocating on a cache line comes at the price that cache_aligned_allocator implicitly adds pad memory. The padding is typically 128 bytes. Hence allocating many small objects with cache_aligned_allocator may increase memory usage.

Members
    namespace tbb {
        template<typename T>
        class cache_aligned_allocator {
        public:
            typedef T* pointer;
            typedef const T* const_pointer;
            typedef T& reference;
            typedef const T& const_reference;
            typedef T value_type;
            typedef size_t size_type;
            typedef ptrdiff_t difference_type;
            template<typename U> struct rebind {
                typedef cache_aligned_allocator<U> other;
            };

        #if _WIN64
            char* _Charalloc( size_type size );
        #endif /* _WIN64 */

            cache_aligned_allocator() throw();
            cache_aligned_allocator( const cache_aligned_allocator& ) throw();
            template<typename U>
            cache_aligned_allocator( const cache_aligned_allocator<U>& ) throw();
            ~cache_aligned_allocator();

            pointer address(reference x) const;
            const_pointer address(const_reference x) const;

            pointer allocate( size_type n, const void* hint=0 );
            void deallocate( pointer p, size_type );
            size_type max_size() const throw();

            void construct( pointer p, const T& value );
            void destroy( pointer p );
        };

        template<>
        class cache_aligned_allocator<void> {
        public:
            typedef void* pointer;
            typedef const void* const_pointer;
            typedef void value_type;
            template<typename U> struct rebind {
                typedef cache_aligned_allocator<U> other;
            };
        };

        template<typename T, typename U>
        bool operator==( const cache_aligned_allocator<T>&,
                         const cache_aligned_allocator<U>& );

        template<typename T, typename U>
        bool operator!=( const cache_aligned_allocator<T>&,
                         const cache_aligned_allocator<U>& );
    }

For the sake of brevity, the following subsections describe only those methods that differ significantly from the corresponding methods of std::allocator.

8.4.1 pointer allocate( size_type n, const void* hint=0 )

Effects
Allocates memory for n values of type T on a cache-line boundary. The allocation may include extra hidden padding.

Returns
Pointer to the allocated memory.

8.4.2 void deallocate( pointer p, size_type n )

Requirements
Pointer p must be the result of method allocate(n). The memory must not have been already deallocated.

Effects
Deallocates memory pointed to by p. The deallocation also deallocates any extra hidden padding.

8.4.3 char* _Charalloc( size_type size )

NOTE: This method is provided only on 64-bit Windows* OS platforms. It is a non-ISO method that exists for backwards compatibility with versions of Windows* containers that seem to require it. Please do not use it directly.

8.5 zero_allocator

Summary
Template class for allocator that returns zeroed memory.

Syntax
    template <typename T, template<typename U> class Alloc = tbb_allocator>
    class zero_allocator: public Alloc<T>;

Header
    #include "tbb/tbb_allocator.h"

Description
A zero_allocator allocates zeroed memory. A zero_allocator<T,A> can be instantiated for any class A that models the Allocator concept. The default for A is tbb_allocator. A zero_allocator forwards allocation requests to A and zeros the allocation before returning it.
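The forward-then-zero behavior described above can be sketched with the standard library alone. This is a minimal illustration, not the TBB implementation: the name zeroing_allocator and the helper fresh_block_is_zeroed are hypothetical, and std::allocator stands in for tbb_allocator as the default base.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>

// Hypothetical sketch (not the TBB class): an allocator that forwards to a
// base allocator template and zeros the storage before returning it.
// std::allocator is an assumed stand-in for tbb_allocator.
template<typename T, template<typename U> class Alloc = std::allocator>
class zeroing_allocator : public Alloc<T> {
public:
    typedef Alloc<T> base_allocator_type;
    typedef T value_type;
    typedef T* pointer;
    typedef std::size_t size_type;

    template<typename U> struct rebind {
        typedef zeroing_allocator<U, Alloc> other;
    };

    zeroing_allocator() {}
    template<typename U>
    zeroing_allocator(const zeroing_allocator<U, Alloc>&) {}

    // Forward the request to the base allocator, then zero the raw bytes.
    pointer allocate(size_type n) {
        pointer p = base_allocator_type::allocate(n);
        std::memset(static_cast<void*>(p), 0, n * sizeof(value_type));
        return p;
    }
};

// Hypothetical helper: allocates n ints and reports whether all read as zero.
bool fresh_block_is_zeroed(std::size_t n) {
    zeroing_allocator<int> a;
    int* p = a.allocate(n);
    bool all_zero = true;
    for (std::size_t i = 0; i < n; ++i)
        all_zero = all_zero && (p[i] == 0);
    a.deallocate(p, n);   // deallocate is inherited from the base allocator
    return all_zero;
}
```

Only allocate is overridden; deallocation and the remaining allocator members come from the base class, which mirrors the "public Alloc" inheritance shown in the Syntax entry.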
Members
    namespace tbb {
        template <typename T, template<typename U> class Alloc = tbb_allocator>
        class zero_allocator : public Alloc<T> {
        public:
            typedef Alloc<T> base_allocator_type;
            typedef typename base_allocator_type::value_type value_type;
            typedef typename base_allocator_type::pointer pointer;
            typedef typename base_allocator_type::const_pointer const_pointer;
            typedef typename base_allocator_type::reference reference;
            typedef typename base_allocator_type::const_reference const_reference;
            typedef typename base_allocator_type::size_type size_type;
            typedef typename base_allocator_type::difference_type difference_type;
            template<typename U> struct rebind {
                typedef zero_allocator<U, Alloc> other;
            };

            zero_allocator() throw() { }
            zero_allocator(const zero_allocator &a) throw();
            template<typename U>
            zero_allocator(const zero_allocator<U> &a) throw();

            pointer allocate(const size_type n, const void* hint=0);
        };
    }

8.6 aligned_space Template Class

Summary
Uninitialized memory space for an array of a given type.

Syntax
    template<typename T, size_t N> class aligned_space;

Header
    #include "tbb/aligned_space.h"

Description
An aligned_space occupies enough memory and is sufficiently aligned to hold an array T[N]. The client is responsible for initializing or destroying the objects. An aligned_space is typically used as a local variable or field in scenarios where a block of fixed-length uninitialized memory is needed.

Members
    namespace tbb {
        template<typename T, size_t N>
        class aligned_space {
        public:
            aligned_space();
            ~aligned_space();
            T* begin();
            T* end();
        };
    }

8.6.1 aligned_space()

Effects
None. Does not invoke constructors.

8.6.2 ~aligned_space()

Effects
None. Does not invoke destructors.

8.6.3 T* begin()

Returns
Pointer to beginning of storage.

8.6.4 T* end()

Returns
begin()+N

9 Synchronization

The library supports mutual exclusion and atomic operations.

9.1 Mutexes

Mutexes provide MUTual EXclusion of threads from sections of code.
In general, strive for designs that minimize the use of explicit locking, because it can lead to serial bottlenecks. If explicit locking is necessary, try to spread it out so that multiple threads usually do not contend to lock the same mutex.

9.1.1 Mutex Concept

The mutexes and locks here have relatively spartan interfaces that are designed for high performance. The interfaces enforce the scoped locking pattern, which is widely used in C++ libraries because it:

1. Does not require the programmer to remember to release the lock.
2. Releases the lock if an exception is thrown out of the mutual exclusion region protected by the lock.

There are two parts to the pattern: a mutex object and a lock object. Construction of the lock object acquires a lock on the mutex, and destruction of the lock object releases the lock. Here’s an example:

    {
        // Construction of myLock acquires lock on myMutex
        M::scoped_lock myLock( myMutex );
        ... actions to be performed while holding the lock ...
        // Destruction of myLock releases lock on myMutex
    }

If the actions throw an exception, the lock is automatically released as the block is exited.

Table 31 shows the requirements for the Mutex concept for a mutex type M.

Table 31: Mutex Concept

Pseudo-Signature
    Semantics

M()
    Construct unlocked mutex.
~M()
    Destroy unlocked mutex.
typename M::scoped_lock
    Corresponding scoped-lock type.
M::scoped_lock()
    Construct lock without acquiring mutex.
M::scoped_lock(M&)
    Construct lock and acquire lock on mutex.
M::~scoped_lock()
    Release lock (if acquired).
M::scoped_lock::acquire(M&)
    Acquire lock on mutex.
bool M::scoped_lock::try_acquire(M&)
    Try to acquire lock on mutex. Return true if lock acquired, false otherwise.
M::scoped_lock::release()
    Release lock.
static const bool M::is_rw_mutex
    True if mutex is reader-writer mutex; false otherwise.
static const bool M::is_recursive_mutex
    True if mutex is recursive mutex; false otherwise.
static const bool M::is_fair_mutex
    True if mutex is fair; false otherwise.

Table 32 summarizes the classes that model the Mutex concept.

Table 32: Mutexes that Model the Mutex Concept

                    Scalable        Fair            Reentrant   Long Wait   Size
mutex               OS dependent    OS dependent    No          Blocks      ≥ 3 words
recursive_mutex     OS dependent    OS dependent    Yes         Blocks      ≥ 3 words
spin_mutex          No              No              No          Yields      1 byte
queuing_mutex       Yes             Yes             No          Yields      1 word
spin_rw_mutex       No              No              No          Yields      1 word
queuing_rw_mutex    Yes             Yes             No          Yields      1 word
null_mutex          -               Yes             Yes         -           empty
null_rw_mutex       -               Yes             Yes         -           empty

See the Tutorial, Section 6.1.1, for a discussion of the mutex properties and the rationale for null mutexes.

9.1.1.1 C++ 200x Compatibility

Classes mutex, recursive_mutex, spin_mutex, and spin_rw_mutex support the C++ 200x interfaces described in Table 33.

Table 33: C++ 200x Methods Available for Some Mutexes

Pseudo-Signature
    Semantics

void M::lock()
    Acquire lock.
bool M::try_lock()
    Try to acquire lock on mutex. Return true if lock acquired, false otherwise.
void M::unlock()
    Release lock.
class lock_guard<M>
class unique_lock<M>
    See Section 9.4.

Classes mutex and recursive_mutex also provide the C++ 200x idiom for accessing their underlying OS handles, as described in Table 34.

Table 34: Native handle interface (M is mutex or recursive_mutex)

Pseudo-Signature
    Semantics

M::native_handle_type
    Native handle type:
        Windows* operating system: LPCRITICAL_SECTION
        Other operating systems: (pthread_mutex*)
native_handle_type M::native_handle()
    Get underlying native handle of mutex M.

As an extension to C++ 200x, class spin_rw_mutex also has methods read_lock() and try_read_lock() for corresponding operations that acquire reader locks.

9.1.2 mutex Class

Summary
Class that models Mutex Concept using underlying OS locks.

Syntax
    class mutex;

Header
    #include "tbb/mutex.h"

Description
A mutex models the Mutex Concept (9.1.1).
It is a wrapper around OS calls that provide mutual exclusion. The advantages of using mutex instead of the OS calls are:

• Portable across all operating systems supported by Intel® Threading Building Blocks.
• Releases the lock if an exception is thrown from the protected region of code.

Members
See Mutex Concept (9.1.1).

9.1.3 recursive_mutex Class

Summary
Class that models Mutex Concept using underlying OS locks and permits recursive acquisition.

Syntax
    class recursive_mutex;

Header
    #include "tbb/recursive_mutex.h"

Description
A recursive_mutex is similar to a mutex (9.1.2), except that a thread may acquire multiple locks on it. The thread must release all locks on a recursive_mutex before any other thread can acquire a lock on it.

Members
See Mutex Concept (9.1.1).

9.1.4 spin_mutex Class

Summary
Class that models Mutex Concept using a spin lock.

Syntax
    class spin_mutex;

Header
    #include "tbb/spin_mutex.h"

Description
A spin_mutex models the Mutex Concept (9.1.1). A spin_mutex is not scalable, fair, or recursive. It is ideal when the lock is lightly contended and is held for only a few machine instructions. If a thread has to wait to acquire a spin_mutex, it busy waits, which can degrade system performance if the wait is long. However, if the wait is typically short, a spin_mutex can significantly improve performance compared to other mutexes.

Members
See Mutex Concept (9.1.1).

9.1.5 queuing_mutex Class

Summary
Class that models Mutex Concept that is fair and scalable.

Syntax
    class queuing_mutex;

Header
    #include "tbb/queuing_mutex.h"

Description
A queuing_mutex models the Mutex Concept (9.1.1). A queuing_mutex is scalable, in the sense that if a thread has to wait to acquire the mutex, it spins on its own local cache line. A queuing_mutex is fair. Threads acquire a lock on a mutex in the order that they request it. A queuing_mutex is not recursive.
The current implementation does busy-waiting, so using a queuing_mutex may degrade system performance if the wait is long.

Members
See Mutex Concept (9.1.1).

9.1.6 ReaderWriterMutex Concept

The ReaderWriterMutex concept extends the Mutex Concept to include the notion of reader-writer locks. It introduces a boolean parameter write that specifies whether a writer lock (write=true) or reader lock (write=false) is being requested. Multiple reader locks can be held simultaneously on a ReaderWriterMutex if it does not have a writer lock on it. A writer lock on a ReaderWriterMutex excludes all other threads from holding a lock on the mutex at the same time.

Table 35 shows the requirements for a ReaderWriterMutex RW. They form a superset of the Mutex Concept (9.1.1).

Table 35: ReaderWriterMutex Concept

Pseudo-Signature
    Semantics

RW()
    Construct unlocked mutex.
~RW()
    Destroy unlocked mutex.
typename RW::scoped_lock
    Corresponding scoped-lock type.
RW::scoped_lock()
    Construct lock without acquiring mutex.
RW::scoped_lock(RW&, bool write=true)
    Construct lock and acquire lock on mutex.
RW::~scoped_lock()
    Release lock (if acquired).
RW::scoped_lock::acquire(RW&, bool write=true)
    Acquire lock on mutex.
bool RW::scoped_lock::try_acquire(RW&, bool write=true)
    Try to acquire lock on mutex. Return true if lock acquired, false otherwise.
RW::scoped_lock::release()
    Release lock.
bool RW::scoped_lock::upgrade_to_writer()
    Change reader lock to writer lock.
bool RW::scoped_lock::downgrade_to_reader()
    Change writer lock to reader lock.
static const bool RW::is_rw_mutex = true
    True.
static const bool RW::is_recursive_mutex
    True if mutex is recursive; false otherwise. For all current reader-writer mutexes, false.
static const bool RW::is_fair_mutex
    True if mutex is fair; false otherwise.

The following subsections explain the semantics of the ReaderWriterMutex concept in detail.
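To make the reader/writer exclusion rules concrete, here is a minimal sketch of a class whose scoped_lock follows part of the pseudo-signatures in Table 35. It is an assumption-laden illustration built on std::atomic, not any TBB mutex: tiny_rw_mutex and demo_reader_writer_rules are hypothetical names, upgrade_to_writer and downgrade_to_reader are omitted, and the lock is neither fair nor scalable.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical sketch: a spin-based reader-writer mutex.
// state == 0: unlocked; state > 0: that many readers; state == -1: one writer.
class tiny_rw_mutex {
    std::atomic<int> state;
public:
    tiny_rw_mutex() : state(0) {}

    class scoped_lock {
        tiny_rw_mutex* mutex;   // null when no lock is held
        bool is_writer;
        scoped_lock(const scoped_lock&) = delete;
        scoped_lock& operator=(const scoped_lock&) = delete;
    public:
        scoped_lock() : mutex(0), is_writer(false) {}
        scoped_lock(tiny_rw_mutex& m, bool write = true)
            : mutex(0), is_writer(false) { acquire(m, write); }
        ~scoped_lock() { if (mutex) release(); }

        // Busy-waits until the requested kind of lock is obtained.
        void acquire(tiny_rw_mutex& m, bool write = true) {
            while (!try_acquire(m, write)) { /* spin */ }
        }

        // One attempt: a writer needs state 0; a reader needs state >= 0.
        bool try_acquire(tiny_rw_mutex& m, bool write = true) {
            int s = m.state.load();
            if (write) {
                if (s != 0 || !m.state.compare_exchange_strong(s, -1))
                    return false;
            } else {
                if (s < 0 || !m.state.compare_exchange_strong(s, s + 1))
                    return false;
            }
            mutex = &m;
            is_writer = write;
            return true;
        }

        void release() {
            if (is_writer) mutex->state.store(0);
            else mutex->state.fetch_sub(1);
            mutex = 0;
        }
    };
};

// Single-threaded walk through the rules: readers share, a writer is exclusive.
bool demo_reader_writer_rules() {
    tiny_rw_mutex m;
    tiny_rw_mutex::scoped_lock r1(m, false);             // first reader
    tiny_rw_mutex::scoped_lock r2;
    bool second_reader_ok = r2.try_acquire(m, false);    // readers coexist
    tiny_rw_mutex::scoped_lock w;
    bool writer_blocked = !w.try_acquire(m, true);       // excluded by readers
    r1.release();
    r2.release();
    bool writer_ok = w.try_acquire(m, true);             // mutex is free now
    return second_reader_ok && writer_blocked && writer_ok;
}
```

The write flag defaults to true, as in the table, so code written against the plain Mutex concept gets exclusive locks without modification.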
Model Types
Classes spin_rw_mutex (9.1.7) and queuing_rw_mutex (9.1.8) model the ReaderWriterMutex concept.

9.1.6.1 ReaderWriterMutex()

Effects
Constructs unlocked ReaderWriterMutex.

9.1.6.2 ~ReaderWriterMutex()

Effects
Destroys unlocked ReaderWriterMutex. The effect of destroying a locked ReaderWriterMutex is undefined.

9.1.6.3 ReaderWriterMutex::scoped_lock()

Effects
Constructs a scoped_lock object that does not hold a lock on any mutex.

9.1.6.4 ReaderWriterMutex::scoped_lock( ReaderWriterMutex& rw, bool write=true )

Effects
Constructs a scoped_lock object that acquires a lock on mutex rw. The lock is a writer lock if write is true; a reader lock otherwise.

9.1.6.5 ReaderWriterMutex::~scoped_lock()

Effects
If the object holds a lock on a ReaderWriterMutex, releases the lock.

9.1.6.6 void ReaderWriterMutex::scoped_lock::acquire( ReaderWriterMutex& rw, bool write=true )

Effects
Acquires a lock on mutex rw. The lock is a writer lock if write is true; a reader lock otherwise.

9.1.6.7 bool ReaderWriterMutex::scoped_lock::try_acquire( ReaderWriterMutex& rw, bool write=true )

Effects
Attempts to acquire a lock on mutex rw. The lock is a writer lock if write is true; a reader lock otherwise.

Returns
true if the lock is acquired, false otherwise.

9.1.6.8 void ReaderWriterMutex::scoped_lock::release()

Effects
Releases lock. The effect is undefined if no lock is held.

9.1.6.9 bool ReaderWriterMutex::scoped_lock::upgrade_to_writer()

Effects
Changes reader lock to a writer lock. The effect is undefined if the object does not already hold a reader lock.

Returns
false if lock was released in favor of another upgrade request and then reacquired; true otherwise.

9.1.6.10 bool ReaderWriterMutex::scoped_lock::downgrade_to_reader()

Effects
Changes writer lock to a reader lock. The effect is undefined if the object does not already hold a writer lock.
Returns
false if lock was released and reacquired; true otherwise. Intel's current implementations for spin_rw_mutex and queuing_rw_mutex always return true. Different implementations might sometimes return false.

9.1.7 spin_rw_mutex Class

Summary
Class that models ReaderWriterMutex Concept that is unfair and not scalable.

Syntax
    class spin_rw_mutex;

Header
    #include "tbb/spin_rw_mutex.h"

Description
A spin_rw_mutex models the ReaderWriterMutex Concept (9.1.6). A spin_rw_mutex is not scalable, fair, or recursive. It is ideal when the lock is lightly contended and is held for only a few machine instructions. If a thread has to wait to acquire a spin_rw_mutex, it busy waits, which can degrade system performance if the wait is long. However, if the wait is typically short, a spin_rw_mutex can significantly improve performance compared to other mutexes.

Members
See ReaderWriterMutex concept (9.1.6).

9.1.8 queuing_rw_mutex Class

Summary
Class that models ReaderWriterMutex Concept that is fair and scalable.

Syntax
    class queuing_rw_mutex;

Header
    #include "tbb/queuing_rw_mutex.h"

Description
A queuing_rw_mutex models the ReaderWriterMutex Concept (9.1.6). A queuing_rw_mutex is scalable, in the sense that if a thread has to wait to acquire the mutex, it spins on its own local cache line. A queuing_rw_mutex is fair. Threads acquire a lock on a queuing_rw_mutex in the order that they request it. A queuing_rw_mutex is not recursive.

Members
See ReaderWriterMutex concept (9.1.6).

9.1.9 null_mutex Class

Summary
Class that models Mutex Concept but does nothing.

Syntax
    class null_mutex;

Header
    #include "tbb/null_mutex.h"

Description
A null_mutex models the Mutex Concept (9.1.1) syntactically, but does nothing. It is useful for instantiating a template that expects a Mutex, but no mutual exclusion is actually needed for that instance.

Members
See Mutex Concept (9.1.1).
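The use case for a do-nothing mutex can be sketched as follows: a template takes the mutex type as a parameter, and the caller picks a real lock or a no-op lock per instance. Every name here (sketch_spin_mutex, sketch_null_mutex, counter, and the two helper functions) is hypothetical and built on the standard library only; the classes model just the scoped_lock constructor and destructor from the Mutex concept, not the full interface.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical spin lock modeling the scoped-lock part of the Mutex concept.
class sketch_spin_mutex {
    std::atomic_flag flag;
public:
    sketch_spin_mutex() { flag.clear(); }
    class scoped_lock {
        sketch_spin_mutex& m;
    public:
        explicit scoped_lock(sketch_spin_mutex& mutex) : m(mutex) {
            // Busy-wait until the flag was previously clear.
            while (m.flag.test_and_set(std::memory_order_acquire)) { }
        }
        ~scoped_lock() { m.flag.clear(std::memory_order_release); }
    };
};

// Hypothetical no-op mutex: same syntax, no synchronization at all.
class sketch_null_mutex {
public:
    class scoped_lock {
    public:
        explicit scoped_lock(sketch_null_mutex&) {}
        ~scoped_lock() {}
    };
};

// A counter template that expects a type modeling the Mutex concept.
template<typename Mutex>
class counter {
    Mutex mutex;
    long value;
public:
    counter() : value(0) {}
    long increment() {
        typename Mutex::scoped_lock lock(mutex);  // released at end of scope
        return ++value;
    }
};

// Instance that could be shared between threads: pays for real locking.
long count_to_three_locked() {
    counter<sketch_spin_mutex> c;
    c.increment();
    c.increment();
    return c.increment();
}

// Instance private to one thread: same template, locking compiled away.
long count_to_three_unlocked() {
    counter<sketch_null_mutex> c;
    c.increment();
    c.increment();
    return c.increment();
}
```

The body of counter::increment is identical for both instantiations, which is the point of a syntactic-only model: the template author writes the locking once and the user decides whether it costs anything.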
9.1.10 null_rw_mutex Class

Summary
Class that models ReaderWriterMutex Concept but does nothing.

Syntax
    class null_rw_mutex;

Header
    #include "tbb/null_rw_mutex.h"

Description
A null_rw_mutex models the ReaderWriterMutex Concept (9.1.6) syntactically, but does nothing. It is useful for instantiating a template that expects a ReaderWriterMutex, but no mutual exclusion is actually needed for that instance.

Members
See ReaderWriterMutex concept (9.1.6).

9.2 atomic Template Class

Summary
Template class for atomic operations.

Syntax
    template<typename T> atomic;

Header
    #include "tbb/atomic.h"

Description
An atomic<T> supports atomic read, write, fetch-and-add, fetch-and-store, and compare-and-swap. Type T may be an integral type, enumeration type, or a pointer type. When T is a pointer type, arithmetic operations are interpreted as pointer arithmetic. For example, if x has type atomic<float*> and a float occupies four bytes, then ++x advances x by four bytes. Arithmetic on atomic<T> is not allowed if T is an enumeration type, void*, or bool.

Some of the methods have template method variants that permit more selective memory fencing. On IA-32 and Intel® 64 architecture processors, they have the same effect as the non-templated variants. On IA-64 architecture (Itanium®) processors, they may improve performance by allowing the memory subsystem more latitude in the ordering of reads and writes. Table 36 shows the fencing for the non-template form.

Table 36: Operation Order Implied by Non-Template Methods

Kind: acquire
    Description: Operations after the atomic operation never move over it.
    Default for: read
Kind: release
    Description: Operations before the atomic operation never move over it.
    Default for: write
Kind: sequentially consistent
    Description: Operations on either side never move over it and furthermore, the sequentially consistent atomic operations have a global order.
    Default for: fetch_and_store, fetch_and_add, compare_and_swap

CAUTION: The copy constructor for class atomic<T> is not atomic. To atomically copy an atomic<T>, default-construct the copy first and assign to it. Below is an example that shows the difference.

    atomic<T> y(x);    // Not atomic
    atomic<T> z;
    z=x;               // Atomic assignment

The copy constructor is not atomic because it is compiler generated. Introducing any non-trivial constructors might remove an important property of atomic<T>: namespace scope instances are zero-initialized before namespace scope dynamic initializers run. This property can be essential for code executing early during program startup.

To create an atomic<T> with a specific value, default-construct it first, and afterwards assign a value to it.

Members
    namespace tbb {
        enum memory_semantics {
            acquire,
            release
        };

        struct atomic<T> {
            typedef T value_type;

            template<memory_semantics M>
            value_type compare_and_swap( value_type new_value, value_type comparand );
            value_type compare_and_swap( value_type new_value, value_type comparand );

            template<memory_semantics M>
            value_type fetch_and_store( value_type new_value );
            value_type fetch_and_store( value_type new_value );

            operator value_type() const;
            value_type operator=( value_type new_value );
            atomic<T>& operator=( const atomic<T>& value );

            // The following members exist only if T is an integral
            // or pointer type.

            template<memory_semantics M>
            value_type fetch_and_add( value_type addend );
            value_type fetch_and_add( value_type addend );

            template<memory_semantics M>
            value_type fetch_and_increment();
            value_type fetch_and_increment();

            template<memory_semantics M>
            value_type fetch_and_decrement();
            value_type fetch_and_decrement();

            value_type operator+=(value_type);
            value_type operator-=(value_type);
            value_type operator++();
            value_type operator++(int);
            value_type operator--();
            value_type operator--(int);
        };
    }

So that an atomic<T*> can be used like a pointer to T, the specialization atomic<T*> also defines:

    T* operator->() const;

9.2.1 memory_semantics Enum

Description
Defines values used to select the template variants that permit more selective control over visibility of operations (see Table 36).

9.2.2 value_type fetch_and_add( value_type addend )

Effects
Let x be the value of *this. Atomically updates x = x + addend.

Returns
Original value of x.

9.2.3 value_type fetch_and_increment()

Effects
Let x be the value of *this. Atomically updates x = x + 1.

Returns
Original value of x.

9.2.4 value_type fetch_and_decrement()

Effects
Let x be the value of *this. Atomically updates x = x - 1.

Returns
Original value of x.

9.2.5 value_type compare_and_swap( value_type new_value, value_type comparand )

Effects
Let x be the value of *this. Atomically compares x with comparand, and if they are equal, sets x=new_value.

Returns
Original value of x.

9.2.6 value_type fetch_and_store( value_type new_value )

Effects
Let x be the value of *this. Atomically exchanges old value of x with new_value.

Returns
Original value of x.

9.3 PPL Compatibility

Classes critical_section and reader_writer_lock exist for compatibility with the Microsoft Parallel Patterns Library (PPL). They do not follow all of the conventions of other mutexes in Intel® Threading Building Blocks.

9.3.1 critical_section

Summary
A PPL-compatible mutex.
Syntax
    class critical_section;

Header
    #include "tbb/critical_section.h"

Description
A critical_section implements a PPL critical_section. Its functionality is a subset of the functionality of a tbb::mutex.

Members
    namespace tbb {
        class critical_section {
        public:
            critical_section();
            ~critical_section();
            void lock();
            bool try_lock();
            void unlock();

            class scoped_lock {
            public:
                scoped_lock( critical_section& mutex );
                ~scoped_lock();
            };
        };
    }

9.3.2 reader_writer_lock Class

Summary
A PPL-compatible reader-writer mutex that is scalable and gives preference to writers.

Syntax
    class reader_writer_lock;

Header
    #include "tbb/reader_writer_lock.h"

Description
A reader_writer_lock implements a PPL-compatible reader-writer mutex. A reader_writer_lock is scalable and nonrecursive. The implementation handles lock requests on a first-come first-serve basis except that writers have preference over readers. Waiting threads busy wait, which can degrade system performance if the wait is long. However, if the wait is typically short, a reader_writer_lock can provide performance competitive with other mutexes.

A reader_writer_lock models part of the ReaderWriterMutex Concept (9.1.6) and part of the C++ 200x compatibility interface (9.1.1.1). The major differences are:

• The scoped interfaces support only strictly scoped locks. For example, the method scoped_lock::release() is not supported.
• Reader locking has a separate interface. For example, there is a separate scoped interface scoped_lock_read for reader locking, instead of a flag to distinguish the reader cases as in the ReaderWriterMutex Concept.
Members
    namespace tbb {
        class reader_writer_lock {
        public:
            reader_writer_lock();
            ~reader_writer_lock();
            void lock();
            void lock_read();
            bool try_lock();
            bool try_lock_read();
            void unlock();

            class scoped_lock {
            public:
                scoped_lock( reader_writer_lock& mutex );
                ~scoped_lock();
            };

            class scoped_lock_read {
            public:
                scoped_lock_read( reader_writer_lock& mutex );
                ~scoped_lock_read();
            };
        };
    }

Table 37 summarizes the semantics.

Table 37: reader_writer_lock Members Summary

Member
    Semantics

reader_writer_lock()
    Construct unlocked mutex.
~reader_writer_lock()
    Destroy unlocked mutex.
void reader_writer_lock::lock()
    Acquire write lock on mutex.
void reader_writer_lock::lock_read()
    Acquire read lock on mutex.
bool reader_writer_lock::try_lock()
    Try to acquire write lock on mutex. Returns true if lock acquired, false otherwise.
bool reader_writer_lock::try_lock_read()
    Try to acquire read lock on mutex. Returns true if lock acquired, false otherwise.
reader_writer_lock::unlock()
    Release lock.
reader_writer_lock::scoped_lock( reader_writer_lock& m )
    Acquire write lock on mutex m.
reader_writer_lock::~scoped_lock()
    Release write lock (if acquired).
reader_writer_lock::scoped_lock_read( reader_writer_lock& m )
    Acquire read lock on mutex m.
reader_writer_lock::~scoped_lock_read()
    Release read lock (if acquired).

9.4 C++ 200x Synchronization

Intel® TBB approximates a portion of C++ 200x interfaces for condition variables and scoped locking. The approximation is based on the C++0x working draft N3000. The major differences are:

• The implementation uses the tbb::tick_count interface instead of the C++ 200x <chrono> interface.
• The implementation throws std::runtime_error instead of a C++ 200x std::system_error.
• The implementation omits or approximates features requiring C++ 200x language support such as constexpr or explicit operators.
• The implementation works in conjunction with tbb::mutex wherever the C++ 200x specification calls for a std::mutex. See 9.1.1.1 for more about C++ 200x mutex support in Intel® TBB.

See the working draft N3000 for detailed descriptions of the members.

CAUTION: Implementations may change if the C++ 200x specification changes.

CAUTION: When support for std::system_error becomes available, implementations may throw std::system_error instead of std::runtime_error.

The library defines the C++ 200x interfaces in namespace std, not namespace tbb, as explained in Section 2.4.7.

Header
    #include "tbb/compat/condition_variable"

Members
    namespace std {
        struct defer_lock_t { };
        struct try_to_lock_t { };
        struct adopt_lock_t { };
        const defer_lock_t defer_lock = {};
        const try_to_lock_t try_to_lock = {};
        const adopt_lock_t adopt_lock = {};

        template<typename M>
        class lock_guard {
        public:
            typedef M mutex_type;
            explicit lock_guard(mutex_type& m);
            lock_guard(mutex_type& m, adopt_lock_t);
            ~lock_guard();
        };

        template<typename M>
        class unique_lock: no_copy {
        public:
            typedef M mutex_type;

            unique_lock();
            explicit unique_lock(mutex_type& m);
            unique_lock(mutex_type& m, defer_lock_t);
            unique_lock(mutex_type& m, try_to_lock_t);
            unique_lock(mutex_type& m, adopt_lock_t);
            unique_lock(mutex_type& m, const tick_count::interval_t &i);
            ~unique_lock();

            void lock();
            bool try_lock();
            bool try_lock_for( const tick_count::interval_t &i );
            void unlock();

            void swap(unique_lock& u);
            mutex_type* release();

            bool owns_lock() const;
            operator bool() const;
            mutex_type* mutex() const;
        };

        template<typename M>
        void swap(unique_lock<M>& x, unique_lock<M>& y);

        enum cv_status {no_timeout, timeout};

        class condition_variable : no_copy {
        public:
            condition_variable();
            ~condition_variable();

            void notify_one();
            void notify_all();

            void wait(unique_lock<mutex>& lock);

            template <class Predicate>
            void wait(unique_lock<mutex>& lock, Predicate pred);

            cv_status wait_for(unique_lock<mutex>& lock,
                               const tick_count::interval_t& i);

            template<typename Predicate>
            bool wait_for(unique_lock<mutex>& lock,
                          const tick_count::interval_t &i,
                          Predicate pred);

            typedef implementation-defined native_handle_type;
            native_handle_type native_handle();
        };
    } // namespace std

10 Timing

Parallel programming is about speeding up wall clock time, which is the real time that it takes a program to run. Unfortunately, some of the obvious wall clock timing routines provided by operating systems do not always work reliably across threads, because the hardware thread clocks are not synchronized. The library provides support for timing across threads. The routines are wrappers around operating system services that we have verified as safe to use across threads.

10.1 tick_count Class

Summary
Class for computing wall-clock times.

Syntax
    class tick_count;

Header
    #include "tbb/tick_count.h"

Description
A tick_count is an absolute timestamp. Two tick_count objects may be subtracted to compute a relative time tick_count::interval_t, which can be converted to seconds.

Example
    #include <cstdio>
    #include "tbb/tick_count.h"
    using namespace tbb;

    void Foo() {
        tick_count t0 = tick_count::now();
        // ...action being timed...
        tick_count t1 = tick_count::now();
        printf("time for action = %g seconds\n", (t1-t0).seconds() );
    }

Members
    namespace tbb {
        class tick_count {
        public:
            class interval_t;
            static tick_count now();
        };

        tick_count::interval_t operator-( const tick_count& t1,
                                          const tick_count& t0 );
    } // tbb

10.1.1 static tick_count tick_count::now()

Returns
Current wall clock timestamp.

CAUTION: On Microsoft Windows* operating systems, the current implementation uses the function QueryPerformanceCounter. Some systems may have bugs in their basic input/output system (BIOS) or hardware abstraction layer (HAL) that cause different processors to return different results.

10.1.2 tick_count::interval_t operator-( const tick_count& t1, const tick_count& t0 )

Returns
Relative time that t1 occurred after t0.
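The now / subtract / seconds pattern can also be tried without the library. The sketch below uses std::chrono::steady_clock as an assumed stand-in for tick_count; it makes no claim about how tick_count is implemented, and seconds_for and time_small_loop are hypothetical helper names.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: runs an action and returns its wall-clock duration
// in seconds, mirroring (t1 - t0).seconds() from the example above.
template<typename Action>
double seconds_for(Action action) {
    typedef std::chrono::steady_clock clock;
    clock::time_point t0 = clock::now();      // absolute timestamp
    action();
    clock::time_point t1 = clock::now();      // absolute timestamp
    // Subtracting two timestamps yields a relative duration.
    return std::chrono::duration<double>(t1 - t0).count();
}

// Times a small loop; volatile keeps the loop from being optimized away.
double time_small_loop() {
    volatile long sink = 0;
    return seconds_for([&] {
        for (long i = 0; i < 100000; ++i)
            sink = sink + i;
    });
}
```

As with tick_count, the absolute timestamps are only meaningful relative to each other; the useful quantity is the difference.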
10.1.3 tick_count::interval_t Class

Summary
Class for relative wall-clock time.

Syntax
    class tick_count::interval_t;

Header
    #include "tbb/tick_count.h"

Description
A tick_count::interval_t represents relative wall clock duration.

Members
    namespace tbb {
        class tick_count::interval_t {
        public:
            interval_t();
            explicit interval_t( double sec );
            double seconds() const;
            interval_t operator+=( const interval_t& i );
            interval_t operator-=( const interval_t& i );
        };
        tick_count::interval_t operator+( const tick_count::interval_t& i,
                                          const tick_count::interval_t& j );
        tick_count::interval_t operator-( const tick_count::interval_t& i,
                                          const tick_count::interval_t& j );
    } // namespace tbb

10.1.3.1 interval_t()

Effects
Constructs interval_t representing zero time duration.

10.1.3.2 interval_t( double sec )

Effects
Constructs interval_t representing the specified number of seconds.

10.1.3.3 double seconds() const

Returns
Time interval measured in seconds.

10.1.3.4 interval_t operator+=( const interval_t& i )

Effects
*this = *this + i

Returns
Reference to *this.

10.1.3.5 interval_t operator-=( const interval_t& i )

Effects
*this = *this - i

Returns
Reference to *this.

10.1.3.6 interval_t operator+ ( const interval_t& i, const interval_t& j )

Returns
interval_t representing the sum of intervals i and j.

10.1.3.7 interval_t operator- ( const interval_t& i, const interval_t& j )

Returns
interval_t representing the difference of intervals i and j.

11 Task Groups

This chapter covers the high-level interface to the task scheduler. Chapter 12 covers the low-level interface. The high-level interface lets you easily create groups of potentially parallel tasks from functors or lambda expressions. The low-level interface permits more detailed control, such as control over exception propagation and affinity.

Summary
High-level interface for running functions in parallel.
Syntax
    template<typename Func> task_handle;
    template<typename Func> task_handle<Func> make_task( const Func& f );
    enum task_group_status;
    class task_group;
    class structured_task_group;
    bool is_current_task_group_canceling();

Header
    #include "tbb/task_group.h"

Requirements
Functor arguments for various methods in this chapter should meet the requirements in Table 38.

Table 38: Requirements on functor arguments

    Pseudo-Signature                   Semantics
    Func::Func (const Func&)           Copy constructor.
    Func::~Func ()                     Destructor.
    void Func::operator()() const;     Evaluate functor.

11.1 task_group Class

Description
A task_group represents concurrent execution of a group of tasks. Tasks may be dynamically added to the group as it is executing.

Example with Lambda Expressions
    #include "tbb/task_group.h"
    using namespace tbb;

    int Fib(int n) {
        if( n<2 ) {
            return n;
        } else {
            int x, y;
            task_group g;
            g.run([&]{x=Fib(n-1);}); // spawn a task
            g.run([&]{y=Fib(n-2);}); // spawn another task
            g.wait();                // wait for both tasks to complete
            return x+y;
        }
    }

CAUTION: Creating a large number of tasks for a single task_group is not scalable, because task creation becomes a serial bottleneck. If creating more than a small number of concurrent tasks, consider using parallel_for (4.4) or parallel_invoke (4.12) instead, or structure the spawning as a recursive tree.

Members
    namespace tbb {
        class task_group {
        public:
            task_group();
            ~task_group();
            template<typename Func>
            void run( const Func& f );
            template<typename Func>
            void run( task_handle<Func>& handle );
            template<typename Func>
            void run_and_wait( const Func& f );
            template<typename Func>
            void run_and_wait( task_handle<Func>& handle );
            task_group_status wait();
            bool is_canceling();
            void cancel();
        };
    }

11.1.1 task_group()

Constructs an empty task group.

11.1.2 ~task_group()

Requires
Method wait must be called before destroying a task_group, otherwise the destructor throws an exception.

11.1.3 template<typename Func> void run( const Func& f )

Effects
Spawn a task that computes f() and return immediately.
11.1.4 template<typename Func> void run ( task_handle<Func>& handle );

Effects
Spawn a task that computes handle() and return immediately.

11.1.5 template<typename Func> void run_and_wait( const Func& f )

Effects
Equivalent to {run(f); wait();}, but guarantees that f runs on the current thread.

NOTE: Template method run_and_wait is intended to be more efficient than separate calls to run and wait.

11.1.6 template<typename Func> void run_and_wait( task_handle<Func>& handle );

Effects
Equivalent to {run(handle); wait();}, but guarantees that handle() runs on the current thread.

NOTE: Template method run_and_wait is intended to be more efficient than separate calls to run and wait.

11.1.7 task_group_status wait()

Effects
Wait for all tasks in the group to complete or be cancelled.

11.1.8 bool is_canceling()

Returns
True if this task group is cancelling its tasks.

11.1.9 void cancel()

Effects
Cancel all tasks in this task_group.

11.2 task_group_status Enum

A task_group_status represents the status of a task_group.

Members
    namespace tbb {
        enum task_group_status {
            not_complete, // Not cancelled and not all tasks in group have completed.
            complete,     // Not cancelled and all tasks in group have completed
            canceled      // Task group received cancellation request
        };
    }

11.3 task_handle Template Class

Summary
Template class used to wrap a function object in conjunction with class structured_task_group.

Description
Class task_handle is used primarily in conjunction with class structured_task_group. For the sake of uniformity, class task_group also accepts task_handle arguments.

Members
    template<typename Func>
    class task_handle {
    public:
        task_handle( const Func& f );
        void operator()() const;
    };

11.4 make_task Template Function

Summary
Template function for creating a task_handle from a function or functor.
Syntax
    template<typename Func>
    task_handle<Func> make_task( const Func& f );

Returns
task_handle<Func>(f)

11.5 structured_task_group Class

Description
A structured_task_group is like a task_group, but has only a subset of the functionality. It may permit performance optimizations in the future. The restrictions are:

o Methods run and run_and_wait take only task_handle arguments, not general functors.
o Methods run and run_and_wait do not copy their task_handle arguments. The caller must not destroy those arguments until after wait or run_and_wait returns.
o Methods run, run_and_wait, cancel, and wait should be called only by the thread that created the structured_task_group.
o Method wait (or run_and_wait) should be called only once on a given instance of structured_task_group.

Example
The function fork_join below evaluates f1() and f2(), in parallel if resources permit.

    #include "tbb/task_group.h"
    using namespace tbb;

    template<typename Func1, typename Func2>
    void fork_join( const Func1& f1, const Func2& f2 ) {
        structured_task_group group;
        task_handle<Func1> h1(f1);
        group.run(h1);  // spawn a task
        task_handle<Func2> h2(f2);
        group.run(h2);  // spawn another task
        group.wait();   // wait for both tasks to complete
        // now safe to destroy h1 and h2
    }

Members
    namespace tbb {
        class structured_task_group {
        public:
            structured_task_group();
            ~structured_task_group();
            template<typename Func>
            void run( task_handle<Func>& handle );
            template<typename Func>
            void run_and_wait( task_handle<Func>& handle );
            task_group_status wait();
            bool is_canceling();
            void cancel();
        };
    }

11.6 is_current_task_group_canceling Function

Returns
True if the innermost task group executing on this thread is cancelling its tasks.

12 Task Scheduler

Intel Threading Building Blocks (Intel® TBB) provides a task scheduler, which is the engine that drives the algorithm templates (Section 4) and task groups (Section 11). You may also call it directly.
Using tasks is often simpler and more efficient than using threads, because the task scheduler takes care of a lot of details.

The tasks are quanta of computation. The scheduler maps these onto physical threads. The mapping is non-preemptive. Each task has a method execute(). Once a thread starts running execute(), the task is bound to that thread until execute() returns. During that time, the thread services other tasks only when it waits on its predecessor tasks, at which time it may run the predecessor tasks, or if there are no pending predecessor tasks, the thread may service tasks created by other threads.

The task scheduler is intended for parallelizing computationally intensive work. Because task objects are not scheduled preemptively, they should generally avoid making calls that might block for long periods, because meanwhile that thread is precluded from servicing other tasks.

CAUTION: There is no guarantee that potentially parallel tasks actually execute in parallel, because the scheduler adjusts actual parallelism to fit available worker threads. For example, given a single worker thread, the scheduler creates no actual parallelism. Consequently, it is generally unsafe to use tasks in a producer-consumer relationship, because there is no guarantee that the consumer runs at all while the producer is running.

Potential parallelism is typically generated by a split/join pattern. Two basic patterns of split/join are supported. The most efficient is continuation-passing form, in which the programmer constructs an explicit “continuation” task. The parent task creates child tasks and specifies a continuation task to be executed when the children complete. The continuation inherits the parent’s ancestor. The parent task then exits; it does not block on its children. The children subsequently run, and after they (or their continuations) finish, the continuation task starts running. Figure 7 shows the steps. The running tasks at each step are shaded.
[Figure 7: Continuation-passing Style (diagram of parent, continuation, and child tasks at each step)]

Explicit continuation passing is efficient, because it decouples the thread’s stack from the tasks. However, it is more difficult to program. A second pattern is "blocking style", which uses implicit continuations. It is sometimes less efficient in performance, but more convenient to program. In this pattern, the parent task blocks until its children complete, as shown in Figure 8.

[Figure 8: Blocking Style (diagram of parent and child tasks at each step)]

The convenience comes with a price. Because the parent blocks, its thread’s stack cannot be popped yet. The thread must be careful about what work it takes on, because continually stealing and blocking could cause the stack to grow without bound. To solve this problem, the scheduler constrains a blocked thread such that it never executes a task that is less deep than its deepest blocked task. This constraint may impact performance because it limits available parallelism, and tends to cause threads to select smaller (deeper) subtrees than they would otherwise choose.

12.1 Scheduling Algorithm

The scheduler employs a technique known as work stealing. Each thread keeps a "ready pool" of tasks that are ready to run. The ready pool is structured as a deque (double-ended queue) of task objects that were spawned. Additionally, there is a shared queue of task objects that were enqueued. The distinction between spawning a task and enqueuing a task affects when the scheduler runs the task.

After completing a task t, a thread chooses its next task according to the first applicable rule below:

1. The task returned by t.execute().
2. The successor of t if t was its last completed predecessor.
3. A task popped from the end of the thread’s own deque.
4. A task with affinity for the thread.
5. A task popped from approximately the beginning of the shared queue.
6. A task popped from the beginning of another randomly chosen thread’s deque.

When a thread spawns a task, it pushes it onto the end of its own deque. Hence rule (3) above gets the task most recently spawned by the thread, whereas rule (6) gets the least recently spawned task of another thread.

When a thread enqueues a task, it pushes it onto the end of the shared queue. Hence rule (5) gets one of the less recently enqueued tasks, regardless of which thread enqueued it. This is in contrast to spawned tasks, where by rule (3) a thread prefers its own most recently spawned task.

Note the “approximately” in rule (5). For scalability reasons, the shared queue does not guarantee precise first-in first-out behavior. If strict first-in first-out behavior is desired, put the real work in a separate queue, and create tasks that pull work from that queue. The chapter “Non-Preemptive Priorities” in the Intel® TBB Design Patterns manual explains the technique.

It is important to understand the implications of spawning versus enqueuing for nested parallelism.

• Spawned tasks emphasize locality. Enqueued tasks emphasize fairness.
• For nested parallelism, spawned tasks tend towards depth-first execution, whereas enqueued tasks cause breadth-first execution. Because the space demands of breadth-first execution can be exponentially higher than depth-first execution, enqueued tasks should be used with care.
• A spawned task might never be executed until a thread explicitly waits on the task to complete. An enqueued task will eventually run if all previously enqueued tasks complete. In the case where there would ordinarily be no other worker thread to execute an enqueued task, the scheduler creates an extra worker.

In general, use spawned tasks unless there is a clear reason to use an enqueued task. Spawned tasks yield the best balance between locality of reference, space efficiency, and parallelism.
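The ordering of rules (3), (5), and (6) can be made concrete with a small model. The following is a simplified, single-threaded sketch under stated assumptions: tasks are plain integer ids, there is no locking, the shared queue is exactly first-in first-out (the real one is only approximately so), the victim in rule (6) is the first non-empty deque rather than a random one, and rules (1), (2), and (4) are omitted. The type SchedulerModel and its methods are hypothetical and are not part of Intel® TBB.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Hypothetical single-threaded model of the ready pools; not TBB code.
struct SchedulerModel {
    std::vector<std::deque<int> > deques;  // one deque of spawned task ids per thread
    std::deque<int> shared;                // shared queue of enqueued task ids

    explicit SchedulerModel(std::size_t threads) : deques(threads) {}

    // Spawning pushes onto the end of the spawning thread's own deque.
    void spawn(std::size_t thread, int task) { deques[thread].push_back(task); }

    // Enqueuing pushes onto the end of the shared queue.
    void enqueue(int task) { shared.push_back(task); }

    // Applies rules 3, 5, and 6 in order; returns false if no task is available.
    bool next_task(std::size_t thread, int& task) {
        std::deque<int>& own = deques[thread];
        if (!own.empty()) {                    // rule 3: most recently spawned, own deque
            task = own.back();
            own.pop_back();
            return true;
        }
        if (!shared.empty()) {                 // rule 5: least recently enqueued
            task = shared.front();
            shared.pop_front();
            return true;
        }
        for (std::size_t v = 0; v < deques.size(); ++v) {
            if (v != thread && !deques[v].empty()) {   // rule 6: steal the oldest task
                task = deques[v].front();
                deques[v].pop_front();
                return true;
            }
        }
        return false;
    }
};
```

In this model a thread that spawned tasks 1, 2, 3 runs task 3 first (depth-first, good locality), while an idle thread takes an enqueued task before stealing, and then steals task 1, the oldest and typically largest piece of work.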
The algorithm for spawned tasks is similar to the work-stealing algorithm used by Cilk (Blumofe 1995). The notion of work-stealing dates back to the 1980s (Burton 1981). The thread affinity support is more recent (Acar 2000).

12.2 task_scheduler_init Class

Summary
Class that explicitly represents a thread's interest in task scheduling services.

Syntax
    class task_scheduler_init;

Header
    #include "tbb/task_scheduler_init.h"

Description
Using task_scheduler_init is optional in Intel® TBB 2.2. By default, Intel® TBB 2.2 automatically creates a task scheduler the first time that a thread uses task scheduling services and destroys it when the last such thread exits.

An instance of task_scheduler_init can be used to control the following aspects of the task scheduler:

• When the task scheduler is constructed and destroyed.
• The number of threads used by the task scheduler.
• The stack size for worker threads.

To override the automatic defaults for task scheduling, a task_scheduler_init must become active before the first use of task scheduling services.

A task_scheduler_init is either "active" or "inactive". The default constructor for a task_scheduler_init activates it, and the destructor deactivates it. To defer activation, pass the value task_scheduler_init::deferred to the constructor. Such a task_scheduler_init may be activated later by calling method initialize. Destruction of an active task_scheduler_init implicitly deactivates it. To deactivate it earlier, call method terminate.

An optional parameter to the constructor and method initialize allows you to specify the number of threads to be used for task execution. This parameter is useful for scaling studies during development, but should not be set for production use.
TIP: The reason for not specifying the number of threads in production code is that in a large software project, there is no way for various components to know how many threads would be optimal for other components. Hardware threads are a shared global resource. It is best to leave the decision of how many threads to use to the task scheduler.

To minimize time overhead, it is best to rely upon automatic creation of the task scheduler, or create a single task_scheduler_init object whose activation spans all uses of the library's task scheduler. A task_scheduler_init is not assignable or copy-constructible.

Example
    // Sketch of one way to do a scaling study
    #include <iostream>
    #include "tbb/task_scheduler_init.h"
    #include "tbb/tick_count.h"
    using namespace tbb;
    using namespace std;

    int main() {
        int n = task_scheduler_init::default_num_threads();
        for( int p=1; p<=n; ++p ) {
            // Construct task scheduler with p threads
            task_scheduler_init init(p);
            tick_count t0 = tick_count::now();
            // ... execute parallel algorithm using task or
            // template algorithm here...
            tick_count t1 = tick_count::now();
            double t = (t1-t0).seconds();
            cout << "time = " << t << " with " << p << " threads\n";
            // Implicitly destroy task scheduler.
        }
        return 0;
    }

Members
    namespace tbb {
        typedef unsigned-integral-type stack_size_type;

        class task_scheduler_init {
        public:
            static const int automatic = implementation-defined;
            static const int deferred = implementation-defined;
            task_scheduler_init( int max_threads=automatic,
                                 stack_size_type thread_stack_size=0 );
            ~task_scheduler_init();
            void initialize( int max_threads=automatic );
            void terminate();
            static int default_num_threads();
            bool is_active() const;
        };
    } // namespace tbb

12.2.1 task_scheduler_init( int max_threads=automatic, stack_size_type thread_stack_size=0 )

Requirements
The value max_threads shall be one of the values in Table 39.

Effects
If max_threads==task_scheduler_init::deferred, nothing happens, and the task_scheduler_init remains inactive.
Otherwise, the task_scheduler_init is activated as follows. If the thread has no other active task_scheduler_init objects, the thread allocates internal thread-specific resources required for scheduling task objects. If there were no threads with active task_scheduler_init objects yet, then internal worker threads are created as described in Table 39. These workers sleep until needed by the task scheduler. Each worker created by the scheduler has an implicit active task_scheduler_init object.

NOTE: As of TBB 3.0, it is meaningful for the parameter max_threads to differ for different calling threads. For example, if thread A specifies max_threads=3 and thread B specifies max_threads=7, then A is limited to having 2 workers, but B can have up to 6 workers. Since workers may be shared between A and B, the total number of worker threads created by the scheduler could be 6.

NOTE: Some implementations create more workers than necessary. However, the excess workers remain asleep unless needed.

The optional parameter thread_stack_size specifies the stack size of each worker thread. A value of 0 specifies use of a default stack size. The first active task_scheduler_init establishes the stack size for all worker threads.

Table 39: Values for max_threads

    max_threads                       Semantics
    task_scheduler_init::automatic    Let library determine max_threads based on
                                      hardware configuration.
    task_scheduler_init::deferred     Defer activation actions.
    positive integer                  Request that up to max_threads-1 worker
                                      threads work on behalf of the calling
                                      thread at any one time.

12.2.2 ~task_scheduler_init()

Effects
If the task_scheduler_init is inactive, nothing happens. Otherwise, the task_scheduler_init is deactivated as follows. If the thread has no other active task_scheduler_init objects, the thread deallocates internal thread-specific resources required for scheduling task objects.
If no existing thread has any active task_scheduler_init objects, then the internal worker threads are terminated.

12.2.3 void initialize( int max_threads=automatic )

Requirements
The task_scheduler_init shall be inactive.

Effects
Similar to constructor (12.2.1).

12.2.4 void terminate()

Requirements
The task_scheduler_init shall be active.

Effects
Deactivates the task_scheduler_init without destroying it. The description of the destructor (12.2.2) specifies what deactivation entails.

12.2.5 int default_num_threads()

Returns
One more than the number of worker threads that task_scheduler_init creates by default.

12.2.6 bool is_active() const

Returns
True if *this is active as described in Section 12.2; false otherwise.

12.2.7 Mixing with OpenMP

Mixing OpenMP with Intel® Threading Building Blocks is supported. Performance may be less than a pure OpenMP or pure Intel® Threading Building Blocks solution if the two forms of parallelism are nested.

An OpenMP parallel region that plans to use the task scheduler should create a task_scheduler_init inside the parallel region, because the parallel region may create new threads unknown to Intel® Threading Building Blocks. Each of these new OpenMP threads, like native threads, must create a task_scheduler_init object before using Intel® Threading Building Blocks algorithms. The following example demonstrates how to do this.

    void OpenMP_Calls_TBB( int n ) {
    #pragma omp parallel
        {
            task_scheduler_init init;
    #pragma omp for
            for( int i=0; i

refcount, and if it becomes zero, puts the successor into the ready pool.
   c. Frees the memory of the task for reuse.
3. If the task has been marked for recycling:
   a. If marked by recycle_to_reexecute (deprecated), puts the task back into the ready pool.
   b. Otherwise it was marked by recycle_as_child or recycle_as_continuation.
12.3.2 task Allocation

Always allocate memory for task objects using one of the special overloaded new operators. The allocation methods do not construct the task. Instead, they return a proxy object that can be used as an argument to an overloaded version of operator new provided by the library.

In general, the allocation methods must be called before any of the tasks allocated are spawned. The exception to this rule is allocate_additional_child_of(t), which can be called even if task t is already running. The proxy types are defined by the implementation. The only guarantee is that the phrase “new(proxy) T(...)” allocates and constructs a task of type T. Because these methods are used idiomatically, the headings in the subsection show the idiom, not the declaration. The argument this is typically implicit, but shown explicitly in the headings to distinguish instance methods from static methods.

TIP: Allocating tasks larger than 216 bytes might be significantly slower than allocating smaller tasks. In general, task objects should be small lightweight entities.

12.3.2.1 new( task::allocate_root( task_group_context& group ) ) T

Allocate a task of type T with the specified cancellation group. Figure 10 summarizes the state transition.

[Figure 10: Effect of task::allocate_root() (diagram labels: null, result, 0)]

Use method spawn_root_and_wait (12.3.5.9) to execute the task.

12.3.2.2 new( task::allocate_root() ) T

Like new(task::allocate_root(task_group_context&)) except that the cancellation group is the current innermost cancellation group.

12.3.2.3 new( x.allocate_continuation() ) T

Allocates and constructs a task of type T, and transfers the successor from x to the new task. No reference counts change. Figure 11 summarizes the state transition.
[Figure 11: Effect of allocate_continuation() (diagram labels: x, result, successor, null, refcount, 0)]

12.3.2.4 new( x.allocate_child() ) T

Effects
Allocates a task with this as its successor. Figure 12 summarizes the state transition.

[Figure 12: Effect of allocate_child() (diagram labels: x, result, successor, refcount, 0)]

If using explicit continuation passing, then the continuation, not the successor, should call the allocation method, so that successor is set correctly.

If the number of tasks is not a small fixed number, consider building a task_list (12.5) of the predecessors first, and spawning them with a single call to task::spawn (12.3.5.5). If a task must spawn some predecessors before all are constructed, it should use task::allocate_additional_child_of(*this) instead, because that method atomically increments refcount, so that the additional predecessor is properly accounted. However, if doing so, the task must protect against premature zeroing of refcount by using a blocking-style task pattern.

12.3.2.5 new(task::allocate_additional_child_of( y )) T

Effects
Allocates a task as a predecessor of another task y. Task y may be already running or have other predecessors running. Figure 13 summarizes the state transition.

[Figure 13: Effect of allocate_additional_child_of(successor) (diagram labels: y, result, refcount, refcount+1, 0)]

Because y may already have running predecessors, the increment of y.refcount is atomic (unlike the other allocation methods, where the increment is not atomic). When adding a predecessor to a task with other predecessors running, it is up to the programmer to ensure that the successor’s refcount does not prematurely reach 0 and trigger execution of the successor before the new predecessor is added.

12.3.3 Explicit task Destruction

Usually, a task is automatically destroyed by the scheduler after its method execute returns.
But sometimes task objects are used idiomatically (such as for reference counting) without ever running execute. Such tasks should be disposed of with method destroy.

12.3.3.1 static void destroy ( task& victim )

Requirements
The refcount of victim must be zero. This requirement is checked in the debug version of the library.

Effects
Calls destructor and deallocates memory for victim. If victim.parent is not null, atomically decrements victim.parent->refcount. The parent is not put into the ready pool if its refcount becomes zero. Figure 14 summarizes the state transition.

[Figure 14: Effect of destroy(victim) (diagram labels: victim, successor (can be null), refcount, refcount-1)]

12.3.4 Recycling Tasks

It is often more efficient to recycle a task object rather than reallocate one from scratch. Often the parent can become the continuation, or one of the predecessors.

CAUTION: Overlap rule: A recycled task t must not be put in jeopardy of having t.execute() rerun while the previous invocation of t.execute() is still running. The debug version of the library detects some violations of this rule.

For example, t.execute() should never spawn t directly after recycling it. Instead, t.execute() should return a pointer to t, so that t is spawned after t.execute() completes.

12.3.4.1 void recycle_as_continuation()

Requirements
Must be called while method execute() is running.

The refcount for the recycled task should be set to n, where n is the number of predecessors of the continuation task.

CAUTION: The caller must guarantee that the task’s refcount does not become zero until after method execute() returns, otherwise the overlap rule is broken. If the guarantee is not possible, use method recycle_as_safe_continuation() instead, and set the refcount to n+1.

The race can occur for a task t when:

• t.execute() recycles t as a continuation.
• The continuation has predecessors that all complete before t.execute() returns.

Hence the recycled t will be implicitly respawned with the original t.execute() still running, which breaks the overlap rule.

Patterns that use recycle_as_continuation() typically avoid the race by making t.execute() return a pointer to one of the predecessors instead of explicitly spawning that predecessor. The scheduler implicitly spawns that predecessor after t.execute() returns, thus guaranteeing that the recycled t does not rerun prematurely.

Effects
Causes this to not be destroyed when method execute() returns.

12.3.4.2 void recycle_as_safe_continuation()

Requirements
Must be called while method execute() is running.

The refcount for the recycled task should be set to n+1, where n is the number of predecessors of the continuation task. The additional +1 represents the task to be recycled.

Effects
Causes this to not be destroyed when method execute() returns.

This method avoids the race discussed for recycle_as_continuation because the additional +1 in the refcount prevents the continuation from executing until the original invocation of execute() completes.

12.3.4.3 void recycle_as_child_of( task& new_successor )

Requirements
Must be called while method execute() is running.

Effects
Causes this to become a predecessor of new_successor, and not be destroyed when method execute() returns.

12.3.5 Synchronization

Spawning a task t either causes the calling thread to invoke t.execute(), or causes t to be put into the ready pool. Any thread participating in task scheduling may then acquire the task and invoke t.execute(). Section 12.1 describes the structure of the ready pool.

The calls that spawn come in two forms:

• Spawn a single task.
• Spawn multiple task objects specified by a task_list and clear task_list.

Some calls distinguish between spawning root tasks and non-root tasks.
A root task is one that was created using method allocate_root.

Important
A task should not spawn any predecessor task until it has called method set_ref_count to indicate both the number of predecessors and whether it intends to use one of the “wait_for_all” methods.

12.3.5.1 void set_ref_count( int count )

Requirements
count>=0.[26] If the intent is to subsequently spawn n predecessors and wait, then count should be n+1. Otherwise count should be n.

Effects
Sets the refcount attribute to count.

[26] Intel® TBB 2.1 had the stronger requirement count>0.

12.3.5.2 void increment_ref_count();

Effects
Atomically increments refcount attribute.

12.3.5.3 int decrement_ref_count();

Effects
Atomically decrements refcount attribute.

Returns
New value of refcount attribute.

NOTE: Explicit use of increment_ref_count and decrement_ref_count is typically necessary only when a task has more than one immediate successor task. Section 11.6 of the Tutorial ("General Acyclic Graphs of Tasks") explains more.

12.3.5.4 void wait_for_all()

Requirements
refcount=n+1, where n is the number of predecessors that are still running.

Effects
Executes tasks in ready pool until refcount is 1. Afterwards, leaves refcount=1 if the task’s task_group_context specifies concurrent_wait, otherwise sets refcount to 0.[27] Figure 15 summarizes the state transitions.

Also, wait_for_all() automatically resets the cancellation state of the task_group_context implicitly associated with the task (12.6), when all of the following conditions hold:

• The task was allocated without specifying a context.
• The calling thread is a user-created thread, not an Intel® TBB worker thread.
• It is the outermost call to wait_for_all() by the thread.

TIP: Under such conditions there is no way to know afterwards if the task_group_context was cancelled. Use an explicit task_group_context if you need to know.
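The n+1 convention for set_ref_count followed by wait_for_all can be traced in a small model. The following is a minimal single-threaded sketch, assuming the default (non-concurrent_wait) behavior in which refcount ends at 0; the type ParentModel, its ready vector, and the helper sum_with_children are hypothetical names introduced for illustration and do not reproduce tbb::task or its scheduler.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical model of the refcount protocol; not the tbb::task class.
struct ParentModel {
    int refcount = 0;
    std::vector<std::function<void()> > ready;   // stand-in for the ready pool

    void set_ref_count(int count) { refcount = count; }

    // Queues a child whose completion decrements the parent's refcount.
    void spawn(std::function<void()> body) {
        ready.push_back([this, body] { body(); --refcount; });
    }

    // Runs queued children until only the extra "+1" held for the wait
    // remains, then resets refcount to 0 (the default behavior).
    void wait_for_all() {
        while (refcount > 1 && !ready.empty()) {
            std::function<void()> child = ready.back();
            ready.pop_back();
            child();
        }
        assert(refcount == 1);
        refcount = 0;
    }
};

// Sums 1..n using n child "tasks" under the n+1 convention.
int sum_with_children(int n) {
    ParentModel parent;
    int total = 0;
    parent.set_ref_count(n + 1);          // n children plus one for the wait
    for (int i = 1; i <= n; ++i)
        parent.spawn([&total, i] { total += i; });
    parent.wait_for_all();
    return total;
}
```

In this model, passing n instead of n+1 to set_ref_count would make the loop stop while one child is still queued, which is the kind of mistake the Important note above guards against.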
[27] For the sake of backward compatibility, the default for task_group_context is not concurrent_wait, and hence the default is to set refcount=0.

[Figure 15: Effect of wait_for_all (this.refcount goes from n+1 to k, with n previously spawned predecessors that are still running; k = 0 by default, k = 1 if the corresponding task_group_context specifies concurrent_wait)]

12.3.5.5 static void spawn( task& t )

Effects
Puts task t into the ready pool and immediately returns.

If the successor of t is not null, then set_ref_count must be called on that successor before spawning any child tasks, because once the child tasks commence, their completion will cause successor.refcount to be decremented asynchronously. The debug version of the library often detects when a required call to set_ref_count is not made, or is made too late.

12.3.5.6 static void spawn ( task_list& list )

Effects
Equivalent to executing spawn on each task in list and clearing list, but may be more efficient. If list is empty, there is no effect.

NOTE: Spawning a long linear list of tasks can introduce a bottleneck, because tasks are stolen individually. Instead, consider using a recursive pattern or a parallel loop template to create many pieces of independent work.

12.3.5.7 void spawn_and_wait_for_all( task& t )

Requirements
Any other predecessors of this must already be spawned. The task t must have a non-null attribute successor. There must be a chain of successor links from t to the calling task. Typically, this chain contains a single link. That is, t is typically an immediate predecessor of this.

Effects
Similar to {spawn(t); wait_for_all();}, but often more efficient. Furthermore, it guarantees that task t is executed by the current thread. This constraint can sometimes simplify synchronization. Figure 16 illustrates the state transitions. It is similar to Figure 15, with task t being the nth task.
[Figure 16: Effect of spawn_and_wait_for_all. The refcount of this goes from n+1, with task t and n-1 previously spawned predecessors still running, to k, where k = 0 by default and k = 1 if the corresponding task_group_context specifies concurrent_wait.]

12.3.5.8 void spawn_and_wait_for_all( task_list& list )

Effects
Similar to {spawn(list); wait_for_all();}, but often more efficient.

12.3.5.9 static void spawn_root_and_wait( task& root )

Requirements
The memory for task root was allocated by task::allocate_root().

Effects
Sets parent attribute of root to an undefined value and executes root as described in Section 12.3.1.1. Destroys root afterwards unless root was recycled.

12.3.5.10 static void spawn_root_and_wait( task_list& root_list )

Requirements
Each task object t in root_list must meet the requirements in Section 12.3.5.9.

Effects
For each task object t in root_list, performs spawn_root_and_wait(t), possibly in parallel. Section 12.3.5.9 describes the actions of spawn_root_and_wait(t).

12.3.5.11 static void enqueue ( task& )

Effects
The task is scheduled for eventual execution by a worker thread even if no thread ever explicitly waits for the task to complete. If the total number of worker threads is zero, a special additional worker thread is created to execute enqueued tasks.

Enqueued tasks are processed in roughly, but not precisely, first-come first-serve order.

CAUTION: Using enqueued tasks for recursive parallelism can cause high memory usage, because the recursion will expand in a breadth-first manner. Use ordinary spawning for recursive parallelism.

CAUTION: Explicitly waiting on an enqueued task should be avoided, because other enqueued tasks from unrelated parts of the program might have to be processed first. The recommended pattern for using an enqueued task is to have it asynchronously signal its completion, for example, by posting a message back to the thread that enqueued it.
See the Intel® Threading Building Blocks Design Patterns manual for such an example.

12.3.6 task Context

These methods expose relationships between task objects, and between task objects and the underlying physical threads.

12.3.6.1 static task& self()

Returns
Reference to innermost task that the calling thread is running. A task is considered "running" if its methods execute(), note_affinity(), or destructor are running. If the calling thread is a user-created thread that is not running any task, self() returns a reference to an implicit dummy task associated with the thread.

12.3.6.2 task* parent() const

Returns
Value of the attribute successor. The result is an undefined value if the task was allocated by allocate_root and is currently running under control of spawn_root_and_wait.

12.3.6.3 void set_parent(task* p)

Requirements
Both tasks must be in the same task group. For example, for task t, t.group() == p->group().

Effects
Sets parent task pointer to specified value p.

12.3.6.4 bool is_stolen_task() const

Returns
true if task is running on a thread different than the thread that spawned it.

NOTE: Tasks enqueued with task::enqueue() are never reported as stolen.

12.3.6.5 task_group_context* group()

Returns
Descriptor of the task group to which this task belongs.

12.3.6.6 void change_group( task_group_context& ctx )

Effects
Moves the task from its current task group into the one specified by the ctx argument.

12.3.7 Cancellation

A task is a quantum of work that is cancelled or executes to completion. A cancelled task skips its method execute() if that method has not yet started. Otherwise cancellation has no direct effect on the task. A task can poll task::is_cancelled() to see if cancellation was requested after it started running. Tasks are cancelled in groups as explained in Section 12.6.

12.3.7.1 bool cancel_group_execution()

Effects
Requests cancellation of all tasks in its group and its subordinate groups.
Returns
False if the task's group already received a cancellation request; true otherwise.

12.3.7.2 bool is_cancelled() const

Returns
True if task's group has received a cancellation request; false otherwise.

12.3.8 Priorities

Priority levels can be assigned to individual tasks or task groups. The library supports three levels {low, normal, high} and two kinds of priority:
• Static priority for enqueued tasks.
• Dynamic priority for task groups.

The former is specified by an optional argument of the task::enqueue() method, affects a specific task only, and cannot be changed afterwards. Tasks with higher priority are dequeued before tasks with lower priorities.

The latter affects all the tasks in a group and can be changed at any time either via the associated task_group_context object or via any task belonging to the group. The priority-related methods in task_group_context are described in Section 12.6.

The task scheduler tracks the highest priority of ready tasks (both enqueued and spawned), and postpones execution of tasks with lower priority until all higher priority tasks are executed. By default all tasks and task groups are created with normal priority.

NOTE: Priority changes may not come into effect immediately in all threads. So it is possible that lower priority tasks are still being executed for some time even in the presence of higher priority ones.

When several user threads (masters) concurrently execute parallel algorithms, the pool of worker threads is partitioned between them proportionally to the requested concurrency levels. In the presence of tasks with different priorities, the pool of worker threads is proportionally divided among the masters with the highest priority first. Only after fully satisfying the requests of these higher priority masters will the remaining threads be provided to the other masters.
Though masters with lower priority tasks may be left without workers, the master threads are never stalled themselves. Task priorities also do not affect and are not affected by OS thread priority settings.

NOTE: Worker thread migration from one master thread to another may not happen immediately.

Related constants and methods

    namespace tbb {
        enum priority_t {
            priority_normal = implementation-defined,
            priority_low = implementation-defined,
            priority_high = implementation-defined
        };
        class task {
            // . . .
            static void enqueue( task&, priority_t );
            void set_group_priority ( priority_t );
            priority_t group_priority () const;
            // . . .
        };
    }

12.3.8.1 static void enqueue ( task& t, priority_t p )

Effects
Enqueues task t at the priority level p.

NOTE: Priority of an enqueued task does not affect priority of the task group from whose scope task::enqueue() is invoked (that is, the group to which the task returned by the task::self() method belongs).

12.3.8.2 void set_group_priority ( priority_t )

Effects
Changes priority of the task group to which this task belongs.

12.3.8.3 priority_t group_priority () const

Returns
Priority of the task group to which this task belongs.

12.3.9 Affinity

These methods enable optimizing for cache affinity. They enable you to hint that a later task should run on the same thread as another task that was executed earlier. To do this:
1. In the earlier task, override note_affinity(id) with a definition that records id.
2. Before spawning the later task, run set_affinity(id) using the id recorded in step 1.

The id is a hint and may be ignored by the scheduler.

12.3.9.1 affinity_id

The type task::affinity_id is an implementation-defined unsigned integral type. A value of 0 indicates no affinity. Other values represent affinity to a particular thread. Do not assume anything about non-zero values. The mapping of non-zero values to threads is internal to the Intel® TBB implementation.
12.3.9.2 virtual void note_affinity ( affinity_id id )

The task scheduler invokes note_affinity before invoking execute() when:
• The task has no affinity, but will execute on a thread different than the one that spawned it.
• The task has affinity, but will execute on a thread different than the one specified by the affinity.

You can override this method to record the id, so that it can be used as the argument to set_affinity(id) for a later task.

Effects
The default definition has no effect.

12.3.9.3 void set_affinity( affinity_id id )

Effects
Sets affinity of this task to id. The id should be either 0 or obtained from note_affinity.

12.3.9.4 affinity_id affinity() const

Returns
Affinity of this task as set by set_affinity.

12.3.10 task Debugging

Methods in this subsection are useful for debugging. They may change in future implementations.

12.3.10.1 state_type state() const

CAUTION: This method is intended for debugging only. Its behavior or performance may change in future implementations. The definition of task::state_type may change in future implementations. This information is being provided because it can be useful for diagnosing problems during debugging.

Returns
Current state of the task. Table 41 describes valid states. Any other value is the result of memory corruption, such as using a task whose memory has been deallocated.

Table 41: Values Returned by task::state()

Value       Description
allocated   Task is freshly allocated or recycled.
ready       Task is in ready pool, or is in process of being transferred to/from there.
executing   Task is running, and will be destroyed after method execute() returns.
freed       Task is on internal free list, or is in process of being transferred to/from there.
reexecute   Task is running, and will be respawned after method execute() returns.

Figure 17 summarizes possible state transitions for a task.
[Figure 17: Typical task::state() Transitions. The figure shows the states freed, allocated, ready, executing, and reexecute, with transitions labeled allocate_...(t), spawn(t), spawn_and_wait_for_all(t), (implicit), return from t.execute(), t.recycle_to_reexecute, t.recycle_as..., and destroy(t), plus storage taken from and returned to the heap.]

12.3.10.2 int ref_count() const

CAUTION: This method is intended for debugging only. Its behavior or performance may change in future implementations.

Returns
The value of the attribute refcount.

12.4 empty_task Class

Summary
Subclass of task that represents doing nothing.

Syntax
    class empty_task;

Header
    #include "tbb/task.h"

Description
An empty_task is a task that does nothing. It is useful as a continuation of a parent task when the continuation should do nothing except wait for its predecessors to complete.

Members
    namespace tbb {
        class empty_task: public task {
            /*override*/ task* execute() {return NULL;}
        };
    }

12.5 task_list Class

Summary
List of task objects.

Syntax
    class task_list;

Header
    #include "tbb/task.h"

Description
A task_list is a list of references to task objects. The purpose of task_list is to allow a task to create a list of tasks and spawn them all at once via the method task::spawn(task_list&), as described in 12.3.5.6.

A task can belong to at most one task_list at a time, and on that task_list at most once. A task that has been spawned, but not started running, must not belong to a task_list. A task_list cannot be copy-constructed or assigned.

Members
    namespace tbb {
        class task_list {
        public:
            task_list();
            ~task_list();
            bool empty() const;
            void push_back( task& task );
            task& pop_front();
            void clear();
        };
    }

12.5.1 task_list()

Effects
Constructs an empty list.

12.5.2 ~task_list()

Effects
Destroys the list. Does not destroy the task objects.

12.5.3 bool empty() const

Returns
True if list is empty; false otherwise.
12.5.4 void push_back( task& task )

Effects
Inserts a reference to task at back of the list.

12.5.5 task& pop_front()

Effects
Removes a task reference from front of list.

Returns
The reference that was removed.

12.5.6 void clear()

Effects
Removes all task references from the list. Does not destroy the task objects.

12.6 task_group_context

Summary
A cancellable group of tasks.

Syntax
    class task_group_context;

Header
    #include "tbb/task.h"

Description
A task_group_context represents a group of tasks that can be cancelled or have their priority level set together. All tasks belong to some group. A task can be a member of only one group at any moment.

A root task is associated with a group by passing a task_group_context object into the task::allocate_root() call. A child task automatically joins its parent task's group. A task can be moved into another group using the task::change_group() method.

The task_group_context objects form a forest of trees. Each tree's root is a task_group_context constructed as isolated.

A task_group_context is cancelled explicitly by request, or implicitly when an exception is thrown out of a task. Canceling a task_group_context causes the entire subtree rooted at it to be cancelled.

The priorities for all the tasks in a group can be changed at any time either via the associated task_group_context object, or via any task belonging to the group. Priority changes propagate into the child task groups similarly to cancellation. The effect of priorities on task execution is described in Section 12.3.8.

Each user thread that creates a task_scheduler_init (12.2) implicitly has an isolated task_group_context that acts as the root of its initial tree. This context is associated with the dummy task returned by task::self() when the user thread is not running any task (12.3.6.1).
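The forest structure and subtree cancellation described above can be pictured with a toy model in standard C++. This is a sketch only: ContextModel and cancel_left_subtree are hypothetical names, the model is single-threaded, and it makes no claim about how the library actually stores or propagates cancellation. A context constructed with a parent plays the role of a bound context; one constructed without a parent plays the role of an isolated context.

```cpp
#include <initializer_list>
#include <string>
#include <vector>

// Hypothetical model of one node in the forest of contexts; not a TBB type.
struct ContextModel {
    bool cancelled = false;
    std::vector<ContextModel*> children;  // non-owning links to child contexts

    // Passing a parent models "bound"; passing nothing models "isolated".
    explicit ContextModel(ContextModel* parent = nullptr) {
        if (parent) parent->children.push_back(this);
    }

    // Cancels this context and the entire subtree rooted at it.
    // Returns false if this context was already cancelled.
    bool cancel() {
        if (cancelled) return false;
        cancelled = true;
        for (ContextModel* child : children) child->cancel();
        return true;
    }
};

// Builds root -> {left -> leaf, right}, cancels `left`, and returns the
// cancelled flags of root, left, leaf, right as a string of '0'/'1'.
std::string cancel_left_subtree() {
    ContextModel root;          // isolated root of the tree
    ContextModel left(&root);   // bound to root
    ContextModel leaf(&left);   // bound to left
    ContextModel right(&root);  // bound to root

    left.cancel();

    std::string flags;
    for (const ContextModel* c : {&root, &left, &leaf, &right})
        flags += (c->cancelled ? '1' : '0');
    return flags;
}
```

In this model, cancelling left marks left and leaf but leaves root and right untouched, which matches the statement that cancellation affects only the subtree rooted at the cancelled context.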
Members
    namespace tbb {
        class task_group_context {
        public:
            enum kind_t {
                isolated = implementation-defined,
                bound = implementation-defined
            };
            enum traits_type {
                exact_exception = implementation-defined,
                concurrent_wait = implementation-defined,
    #if TBB_USE_CAPTURED_EXCEPTION
                default_traits = 0
    #else
                default_traits = exact_exception
    #endif /* !TBB_USE_CAPTURED_EXCEPTION */
            };
            task_group_context( kind_t relation_with_parent = bound,
                                uintptr_t traits = default_traits );
            ~task_group_context();
            void reset();
            bool cancel_group_execution();
            bool is_group_execution_cancelled() const;
            void set_priority ( priority_t );
            priority_t priority () const;
        };
    }

12.6.1 task_group_context( kind_t relation_to_parent=bound, uintptr_t traits=default_traits )

Effects
Constructs an empty task_group_context. If relation_to_parent is bound, the task_group_context will become a child of the innermost running task's group when it is first passed into the call to task::allocate_root(task_group_context&). If this call is made directly from the user thread, the effect will be as if relation_to_parent were isolated. If relation_to_parent is isolated, it has no parent task_group_context.

The traits argument should be the bitwise OR of traits_type values. The flag exact_exception controls how precisely exceptions are transferred between threads. See Section 13 for details. The flag concurrent_wait controls the reference-counting behavior of methods task::wait_for_all and task::spawn_and_wait_for_all.

12.6.2 ~task_group_context()

Effects
Destroys an empty task_group_context. It is a programmer error if there are still extant tasks in the group.

12.6.3 bool cancel_group_execution()

Effects
Requests that tasks in group be cancelled.

Returns
False if group is already cancelled; true otherwise. If concurrently called by multiple threads, exactly one call returns true and the rest return false.
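The guarantee that exactly one concurrent caller sees true is the behavior of an atomic test-and-set. The following standard C++ sketch models it with std::atomic<bool>::exchange. CancelFlagModel and count_successful_cancels are hypothetical names, and the sketch is an assumption about one way to obtain the guarantee, not a description of the library's implementation.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical model of a group's cancellation flag; not a TBB type.
class CancelFlagModel {
    std::atomic<bool> cancelled_{false};
public:
    // exchange() returns the previous value, so only the call that flips
    // the flag from false to true returns true.
    bool cancel_group_execution() { return !cancelled_.exchange(true); }

    bool is_group_execution_cancelled() const { return cancelled_.load(); }
};

// Lets `callers` threads request cancellation concurrently and returns
// how many of the calls returned true.
int count_successful_cancels(int callers) {
    CancelFlagModel group;
    std::atomic<int> winners{0};

    std::vector<std::thread> threads;
    for (int i = 0; i < callers; ++i) {
        threads.emplace_back([&] {
            if (group.cancel_group_execution()) winners.fetch_add(1);
        });
    }
    for (auto& t : threads) t.join();

    assert(group.is_group_execution_cancelled() == (callers > 0));
    return winners.load();
}
```

However many threads race, count_successful_cancels reports a single winner, so code that must react once to a cancellation can key off the true return value.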
12.6.4 bool is_group_execution_cancelled() const

Returns
True if group has received cancellation.

12.6.5 void reset()

Effects
Reinitializes this to uncancelled state.

CAUTION: This method is only safe to call once all tasks associated with the group's subordinate groups have completed. This method must not be invoked concurrently by multiple threads.

12.6.6 void set_priority ( priority_t )

Effects
Changes priority of the task group.

12.6.7 priority_t priority () const

Returns
Priority of the task group.

12.7 task_scheduler_observer

Summary
Class that represents thread's interest in task scheduling services.

Syntax
    class task_scheduler_observer;

Header
    #include "tbb/task_scheduler_observer.h"

Description
A task_scheduler_observer permits clients to observe when a thread starts or stops participating in task scheduling. You typically derive your own observer class from task_scheduler_observer, and override virtual methods on_scheduler_entry or on_scheduler_exit. An instance is in one of two states: observing or not observing. Remember to call observe() to enable observation.

Members
    namespace tbb {
        class task_scheduler_observer {
        public:
            task_scheduler_observer();
            virtual ~task_scheduler_observer();
            void observe( bool state=true );
            bool is_observing() const;
            virtual void on_scheduler_entry( bool is_worker ) {}
            virtual void on_scheduler_exit( bool is_worker ) {}
        };
    }

12.7.1 task_scheduler_observer()

Effects
Constructs instance with observing disabled.

12.7.2 ~task_scheduler_observer()

Effects
Disables observing. Waits for extant invocations of on_scheduler_entry or on_scheduler_exit to complete.

12.7.3 void observe( bool state=true )

Effects
Enables observing if state is true; disables observing if state is false.

12.7.4 bool is_observing() const

Returns
True if observing is enabled; false otherwise.
12.7.5 virtual void on_scheduler_entry( bool is_worker )

Description
The task scheduler invokes this method on each thread that starts participating in task scheduling, if observing is enabled. If observing is enabled after threads started participating, then this method is invoked once for each such thread, before it executes the first task it steals afterwards.

The flag is_worker is true if the thread was created by the task scheduler; false otherwise.

NOTE: If a thread enables observing before spawning a task, it is guaranteed that the thread that executes the task will have invoked on_scheduler_entry before executing the task.

Effects
The default behavior does nothing.

12.7.6 virtual void on_scheduler_exit( bool is_worker )

Description
The task scheduler invokes this method when a thread stops participating in task scheduling, if observing is enabled.

CAUTION: Sometimes on_scheduler_exit is invoked for a thread but not on_scheduler_entry. This situation can arise if a thread never steals a task.

CAUTION: A process does not wait for Intel® TBB worker threads to clean up. Thus a process can terminate before on_scheduler_exit is invoked.

Effects
The default behavior does nothing.

12.8 Catalog of Recommended task Patterns

This section catalogues recommended task patterns. In each pattern, class T is assumed to derive from class task. Subtasks are labeled t1, t2, ... tk. The subscripts indicate the order in which the subtasks execute if no parallelism is available. If parallelism is available, the subtask execution order is non-deterministic, except that t1 is guaranteed to be executed by the spawning thread.

Recursive task patterns are recommended for efficient scalable parallelism, because they allow the task scheduler to unfold potential parallelism to match available parallelism. A recursive task pattern begins by creating a root task t0 and running it as follows.
    T& t0 = *new(task::allocate_root()) T(...);
    task::spawn_root_and_wait(t0);

The root task's method execute() recursively creates more tasks as described in subsequent subsections.

12.8.1 Blocking Style With k Children

The following shows the recommended style for a recursive task of type T where each level spawns k children.

    task* T::execute() {
        if( not recursing any further ) {
            ...
        } else {
            set_ref_count(k+1);
            task& tk = *new(allocate_child()) T(...); spawn(tk);
            task& tk-1 = *new(allocate_child()) T(...); spawn(tk-1);
            ...
            task& t1 = *new(allocate_child()) T(...);
            spawn_and_wait_for_all(t1);
        }
        return NULL;
    }

Child construction and spawning may be reordered if convenient, as long as a task is constructed before it is spawned.

The key points of the pattern are:
• The call to set_ref_count uses k+1 as its argument. The extra 1 is critical.
• Each task is allocated by allocate_child.
• The call spawn_and_wait_for_all combines spawning and waiting. A more uniform but slightly less efficient alternative is to spawn all tasks with spawn and wait by calling wait_for_all.

12.8.2 Continuation-Passing Style With k Children

There are two recommended styles. They differ in whether it is more convenient to recycle the parent as the continuation or as a child. The decision should be based upon whether the continuation or child acts more like the parent.

Optionally, as shown in the following examples, the code can return a pointer to one of the children instead of spawning it. Doing so causes the child to execute immediately after the parent returns. This option often improves efficiency because it skips the pointless overhead of putting the task into the task pool and taking it back out.

12.8.2.1 Recycling Parent as Continuation

This style is useful when the continuation needs to inherit much of the state of the parent and the child does not need the state. The continuation must have the same type as the parent.
    task* T::execute() {
        if( not recursing any further ) {
            ...
            return NULL;
        } else {
            set_ref_count(k);
            recycle_as_continuation();
            task& tk = *new(allocate_child()) T(...); spawn(tk);
            task& tk-1 = *new(allocate_child()) T(...); spawn(tk-1);
            ...
            // Return pointer to first child instead of spawning it,
            // to remove unnecessary overhead.
            task& t1 = *new(allocate_child()) T(...);
            return &t1;
        }
    }

The key points of the pattern are:
• The call to set_ref_count uses k as its argument. There is no extra +1 as there is in blocking style discussed in Section 12.8.1.
• Each child task is allocated by allocate_child.
• The continuation is recycled from the parent, and hence gets the parent's state without doing copy operations.

12.8.2.2 Recycling Parent as a Child

This style is useful when the child inherits much of its state from a parent and the continuation does not need the state of the parent. The child must have the same type as the parent. In the example, C is the type of the continuation, and must derive from class task. If C does nothing except wait for all children to complete, then C can be the class empty_task (12.4).

    task* T::execute() {
        if( not recursing any further ) {
            ...
            return NULL;
        } else {
            // Construct continuation
            C& c = *new( allocate_continuation() ) C(...);
            c.set_ref_count(k);
            // Allocate and spawn children tk down to t2
            task& tk = *new(c.allocate_child()) T(...); spawn(tk);
            task& tk-1 = *new(c.allocate_child()) T(...); spawn(tk-1);
            ...
            task& t2 = *new(c.allocate_child()) T(...); spawn(t2);
            // Recycle self as first child: task t1 is our recycled self.
            recycle_as_child_of(c);
            ... update fields of *this to subproblem to be solved by t1 ...
            return this;
        }
    }

The key points of the pattern are:
• The call to set_ref_count uses k as its argument. There is no extra +1 as there is in blocking style discussed in Section 12.8.1.
• Each child task except for t1 is allocated by c.allocate_child.
It is critical to use c.allocate_child, and not (*this).allocate_child; otherwise the task graph will be wrong.
• Task t1 is recycled from the parent, and hence gets the parent's state without performing copy operations. Do not forget to update the state to represent a child subproblem; otherwise infinite recursion will occur.

12.8.3 Letting Main Thread Work While Child Tasks Run

Sometimes it is desirable to have the main thread continue execution while child tasks are running. The following pattern does this by using a dummy empty_task (12.4).

    task* dummy = new( task::allocate_root() ) empty_task;
    dummy->set_ref_count(k+1);
    task& tk = *new( dummy->allocate_child() ) T; dummy->spawn(tk);
    task& tk-1 = *new( dummy->allocate_child() ) T; dummy->spawn(tk-1);
    ...
    task& t1 = *new( dummy->allocate_child() ) T; dummy->spawn(t1);
    ...do any other work...
    dummy->wait_for_all();
    dummy->destroy(*dummy);

The key points of the pattern are:
• The dummy task is a placeholder and never runs.
• The call to set_ref_count uses k+1 as its argument.
• The dummy task must be explicitly destroyed.

13 Exceptions

Intel® Threading Building Blocks (Intel® TBB) propagates exceptions along logical paths in a tree of tasks. Because these paths cross between thread stacks, support for moving an exception between stacks is necessary.

When an exception is thrown out of a task, it is caught inside the Intel® TBB run-time and handled as follows:
1. If the cancellation group for the task has already been cancelled, the exception is ignored.
2. Otherwise the exception or an approximation of it is captured.
3. The captured exception is rethrown from the root of the cancellation group after all tasks in the group have completed or have been successfully cancelled.
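The capture-and-rethrow steps above can be sketched with the std::exception_ptr facility of standard C++11. This is a minimal illustration, not the Intel® TBB run-time: run_and_rethrow and message_seen_by_caller are hypothetical names, a plain std::thread stands in for a task running on another thread, and the cancellation-group bookkeeping of step 1 is left out.

```cpp
#include <exception>
#include <stdexcept>
#include <string>
#include <thread>

// Runs `job` on another thread. If the job throws, the exact exception is
// captured there and rethrown on the calling thread after the job ends.
template <typename Job>
void run_and_rethrow(Job job) {
    std::exception_ptr captured;  // stays empty if nothing is thrown

    std::thread worker([&] {
        try {
            job();
        } catch (...) {
            captured = std::current_exception();  // capture on the worker's stack
        }
    });
    worker.join();  // wait for the work to complete before rethrowing

    if (captured) std::rethrow_exception(captured);  // rethrow on the caller's stack
}

// Throws std::out_of_range on the worker thread and returns the message
// observed by the handler on the calling thread.
std::string message_seen_by_caller() {
    try {
        run_and_rethrow([] { throw std::out_of_range("index 7 is out of range"); });
    } catch (const std::out_of_range& e) {
        return e.what();
    }
    return "no exception";
}
```

Because the exception object itself is transferred, the handler on the calling thread can catch it by its original type. Without such a facility, only an approximation such as a name and a message string can cross the thread boundary.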
The exact exception is captured when both of the following conditions are true:
• The task's task_group_context was created in a translation unit compiled with TBB_USE_CAPTURED_EXCEPTION=0.
• The Intel® TBB library was built with a compiler that supports the std::exception_ptr feature of C++ 200x.

Otherwise an approximation of the original exception x is captured as follows:
1. If x is a tbb_exception, it is captured by x.move().
2. If x is a std::exception, it is captured as a tbb::captured_exception(typeid(x).name(),x.what()).
3. Otherwise x is captured as a tbb::captured_exception with implementation-specified value for name() and what().

13.1 tbb_exception

Summary
Exception that can be moved to another thread.

Syntax
    class tbb_exception;

Header
    #include "tbb/tbb_exception.h"

Description
In a parallel environment, exceptions sometimes have to be propagated across threads. Class tbb_exception subclasses std::exception to add support for such propagation.

Members
    namespace tbb {
        class tbb_exception: public std::exception {
            virtual tbb_exception* move() = 0;
            virtual void destroy() throw() = 0;
            virtual void throw_self() = 0;
            virtual const char* name() const throw() = 0;
            virtual const char* what() const throw() = 0;
        };
    }

Derived classes should define the abstract virtual methods as follows:
• move() should create a pointer to a copy of the exception that can outlive the original. It may move the contents of the original.
• destroy() should destroy a copy created by move().
• throw_self() should throw *this.
• name() typically returns the RTTI name of the originally intercepted exception.
• what() returns a null-terminated string describing the exception.

13.2 captured_exception

Summary
Class used by Intel® TBB to capture an approximation of an exception.
Syntax
    class captured_exception;

Header
    #include "tbb/tbb_exception.h"

Description
When a task throws an exception, sometimes Intel® TBB converts the exception to a captured_exception before propagating it. The conditions for conversion are described in Section 13.

Members
    namespace tbb {
        class captured_exception: public tbb_exception {
            captured_exception(const captured_exception& src);
            captured_exception(const char* name, const char* info);
            ~captured_exception() throw();
            captured_exception& operator=(const captured_exception&);
            captured_exception* move() throw();
            void destroy() throw();
            void throw_self();
            const char* name() const throw();
            const char* what() const throw();
        };
    }

Only the additions that captured_exception makes to tbb_exception are described here. Section 13.1 describes the rest of the interface.

13.2.1 captured_exception( const char* name, const char* info )

Effects
Constructs a captured_exception with the specified name and info.

13.3 movable_exception

Summary
Subclass of tbb_exception interface that supports propagating copy-constructible data.

Syntax
    template<typename ExceptionData> class movable_exception;

Header
    #include "tbb/tbb_exception.h"

Description
This template provides a convenient way to implement a subclass of tbb_exception that propagates arbitrary copy-constructible data.

Members
    namespace tbb {
        template<typename ExceptionData>
        class movable_exception: public tbb_exception {
        public:
            movable_exception( const ExceptionData& src );
            movable_exception( const movable_exception& src ) throw();
            ~movable_exception() throw();
            movable_exception& operator=( const movable_exception& src );
            ExceptionData& data() throw();
            const ExceptionData& data() const throw();
            movable_exception* move() throw();
            void destroy() throw();
            void throw_self();
            const char* name() const throw();
            const char* what() const throw();
        };
    }

Only the additions that movable_exception makes to tbb_exception are described here.
Section 13.1 describes the rest of the interface.

13.3.1 movable_exception( const ExceptionData& src )

Effects
Constructs a movable_exception containing a copy of src.

13.3.2 ExceptionData& data() throw()

Returns
Reference to contained data.

13.3.3 const ExceptionData& data() const throw()

Returns
Const reference to contained data.

13.4 Specific Exceptions

Summary
Exceptions thrown by other library components.

Syntax
    class bad_last_alloc;
    class improper_lock;
    class invalid_multiple_scheduling;
    class missing_wait;

Header
    #include "tbb/tbb_exception.h"

Description
Table 42 describes when the exceptions are thrown.

Table 42: Classes for Specific Exceptions.

Exception                     Thrown when...
bad_last_alloc                • A pop operation on a concurrent_queue or concurrent_bounded_queue corresponds to a push that threw an exception.
                              • An operation on a concurrent_vector cannot be performed because a prior operation threw an exception.
improper_lock                 A thread attempts to lock a critical_section or reader_writer_lock that it has already locked.
invalid_multiple_scheduling   A task_group or structured_task_group attempts to run a task_handle twice.
missing_wait                  A task_group or structured_task_group is destroyed before method wait() is invoked.

Members
    namespace tbb {
        class bad_last_alloc: public std::bad_alloc {
        public:
            const char* what() const throw();
        };
        class improper_lock: public std::exception {
        public:
            const char* what() const throw();
        };
        class invalid_multiple_scheduling: public std::exception {
        public:
            const char* what() const throw();
        };
        class missing_wait: public std::exception {
        public:
            const char* what() const throw();
        };
    }

14 Threads

Intel® Threading Building Blocks (Intel® TBB) provides a wrapper around the platform's native threads, based upon the N3000 working draft for C++ 200x. Using this wrapper has two benefits:
• It makes threaded code portable across platforms.
• It eases later migration to ISO C++ 200x threads.

The library defines the wrapper in namespace std, not namespace tbb, as explained in Section 2.4.7 (see footnote 28).

The significant departures from N3000 are shown in Table 43.

Table 43: Differences Between N3000 and Intel® TBB Thread Class

N3000: template<class Rep, class Period> std::this_thread::sleep_for( const chrono::duration<Rep, Period>& rel_time )
Intel® TBB: std::this_thread::sleep_for( tick_count::interval_t )

N3000: rvalue reference parameters
Intel® TBB: Parameter changed to plain value, or function removed, as appropriate.

N3000: constructor for std::thread takes arbitrary number of arguments.
Intel® TBB: constructor for std::thread takes 0-3 arguments.

The other changes are for compatibility with the current C++ standard or Intel® TBB. For example, constructors that have an arbitrary number of arguments require the variadic template features of C++ 200x.

CAUTION: Threads are heavyweight entities on most systems, and running too many threads on a system can seriously degrade performance. Consider using a task-based solution instead if practical.

Footnote 28: In Intel® TBB 2.2, the class was tbb::tbb_thread. Appendix A.7 explains the changes.

14.1 thread Class

Summary
Represents a thread of execution.

Syntax
    class thread;

Header
    #include "tbb/compat/thread"

Description
Class thread provides a platform independent interface to native threads. An instance represents a thread. A platform-specific thread handle can be obtained via method native_handle().
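Because the wrapper follows the N3000 draft, basic usage looks the same as with the std::thread that was later standardized in C++11. The sketch below uses the standard <thread> header rather than "tbb/compat/thread", which is an assumption made so that the example builds without Intel® TBB; JoinReport and add_on_thread are hypothetical names. It uses the three-argument constructor form and shows how joinable() changes across join().

```cpp
#include <thread>

// Hypothetical record of what the example observed.
struct JoinReport {
    int result;            // value computed on the new thread
    bool joinable_before;  // joinable() before join()
    bool joinable_after;   // joinable() after join()
};

// Evaluates f(x, y) on a new thread, where f adds its two arguments.
JoinReport add_on_thread(int x, int y) {
    int result = 0;

    // Constructor form thread(F f, X x, Y y): the new thread evaluates f(x, y).
    std::thread t([&result](int a, int b) { result = a + b; }, x, y);

    JoinReport report;
    report.joinable_before = t.joinable();  // t represents a thread of execution
    t.join();                               // wait for the thread to complete
    report.joinable_after = t.joinable();   // t no longer represents a thread
    report.result = result;                 // safe to read after join()
    return report;
}
```

Reading result only after join() matters: join() is the point at which the new thread's writes become visible to the caller without further synchronization.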
Members
    namespace std {
        class thread {
        public:
    #if _WIN32||_WIN64
            typedef HANDLE native_handle_type;
    #else
            typedef pthread_t native_handle_type;
    #endif // _WIN32||_WIN64
            class id;
            thread();
            template<typename F> explicit thread(F f);
            template<typename F, typename X> thread(F f, X x);
            template<typename F, typename X, typename Y> thread(F f, X x, Y y);
            thread& operator=( thread& x );
            ~thread();
            bool joinable() const;
            void join();
            void detach();
            id get_id() const;
            native_handle_type native_handle();
            static unsigned hardware_concurrency();
        };
    }

14.1.1 thread()

Effects
Constructs a thread that does not represent a thread of execution, with get_id()==id().

14.1.2 template<typename F> thread(F f)

Effects
Constructs a thread that evaluates f().

14.1.3 template<typename F, typename X> thread(F f, X x)

Effects
Constructs a thread that evaluates f(x).

14.1.4 template<typename F, typename X, typename Y> thread(F f, X x, Y y)

Effects
Constructs a thread that evaluates f(x,y).

14.1.5 thread& operator=(thread& x)

Effects
If joinable(), calls detach(). Then assigns the state of x to *this and sets x to default constructed state.

CAUTION: Assignment moves the state instead of copying it.

14.1.6 ~thread()

Effects
if( joinable() ) detach().

14.1.7 bool joinable() const

Returns
get_id()!=id()

14.1.8 void join()

Requirements
joinable()==true

Effects
Waits for the thread to complete. Afterwards, joinable()==false.

14.1.9 void detach()

Requirements
joinable()==true

Effects
Sets *this to default constructed state and returns without blocking. The thread represented by *this continues execution.

14.1.10 id get_id() const

Returns
id of the thread, or a default-constructed id if *this does not represent a thread.

14.1.11 native_handle_type native_handle()

Returns
Native thread handle. The handle is a HANDLE on Windows* operating systems and a pthread_t on Linux* and Mac OS* X operating systems. For these systems, native_handle() returns 0 if joinable()==false.

14.1.12 static unsigned hardware_concurrency()

Returns
The number of hardware threads.
For example, 4 on a system with a single Intel® Core™2 Quad processor.

14.2 thread::id

Summary
Unique identifier for a thread.

Syntax
    class thread::id;

Header
    #include "tbb/compat/thread"

Description
A thread::id is an identifier value for a thread that remains unique over the thread’s lifetime. A special value thread::id() represents no thread of execution. The instances are totally ordered.

Members
    namespace tbb {
        class thread::id {
        public:
            id();
        };
        template<typename charT, typename traits>
        std::basic_ostream<charT, traits>&
        operator<<( std::basic_ostream<charT, traits>& out, thread::id id );
        bool operator==(thread::id x, thread::id y);
        bool operator!=(thread::id x, thread::id y);
        bool operator<(thread::id x, thread::id y);
        bool operator<=(thread::id x, thread::id y);
        bool operator>(thread::id x, thread::id y);
        bool operator>=(thread::id x, thread::id y);
    } // namespace tbb

14.3 this_thread Namespace

Description
Namespace this_thread contains global functions related to threading.

Members
    namespace tbb {
        namespace this_thread {
            thread::id get_id();
            void yield();
            void sleep_for( const tick_count::interval_t& );
        }
    }

14.3.1 thread::id get_id()

Returns
Id of the current thread.

14.3.2 void yield()

Effects
Offers to suspend the current thread so that another thread may run.

14.3.3 void sleep_for( const tick_count::interval_t & i)

Effects
Current thread blocks for at least time interval i.

Example
    using namespace tbb;
    void Foo() {
        // Sleep 30 seconds
        this_thread::sleep_for( tick_count::interval_t(30) );
    }

15 References

Umut A. Acar, Guy E. Blelloch, Robert D. Blumofe. The Data Locality of Work Stealing. ACM Symposium on Parallel Algorithms and Architectures (2000):1-12.

Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. Cilk: An Efficient Multithreaded Runtime System. Proceedings of the 5th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (July 1995):207–216.
Working Draft, Standard for Programming Language C++. WG21 document N3000.

Steve MacDonald, Duane Szafron, and Jonathan Schaeffer. Rethinking the Pipeline as Object-Oriented States with Transformations. 9th International Workshop on High-Level Parallel Programming Models and Supportive Environments (April 2004):12-21.

W.F. Burton and R.M. Sleep. Executing functional programs on a virtual tree of processors. Proceedings of the 1981 Conference on Functional Programming Languages and Computer Architecture (October 1981):187-194.

ISO/IEC 14882, Programming Languages – C++.

Ping An, Alin Jula, Silvius Rus, Steven Saunders, Tim Smith, Gabriel Tanase, Nathan Thomas, Nancy Amato, Lawrence Rauchwerger. STAPL: An Adaptive, Generic Parallel C++ Library. Workshop on Language and Compilers for Parallel Computing (LCPC 2001), Cumberland Falls, Kentucky, Aug 2001. Lecture Notes in Computer Science 2624 (2003):193-208.

S. G. Akl and N. Santoro. Optimal Parallel Merging and Sorting Without Memory Conflicts. IEEE Transactions on Computers, Vol. C-36 No. 11, Nov. 1987.

Appendix A Compatibility Features

This appendix describes features of Intel® Threading Building Blocks (Intel® TBB) that remain for compatibility with previous versions. These features are deprecated and may disappear in future versions of Intel® TBB. Some of these features are available only if the preprocessor symbol TBB_DEPRECATED is non-zero.

A.1 parallel_while Template Class

Summary
Template class that processes work items.

TIP: This class is deprecated. Use parallel_do (4.7) instead.

Syntax
    template<typename Body> class parallel_while;

Header
    #include "tbb/parallel_while.h"

Description
A parallel_while performs parallel iteration over items. The processing to be performed on each item is defined by a function object of type Body. The items are specified in two ways:
• A stream of items.
• Additional items that are added while the stream is being processed.
Table 44 shows the requirements on the stream and body.

Table 44: parallel_while Requirements for Stream S and Body B

Pseudo-Signature                                     Semantics
bool S::pop_if_present( B::argument_type& item )     Get next stream item. parallel_while does not concurrently invoke the method on the same this.
B::operator()( B::argument_type& item ) const        Process item. parallel_while may concurrently invoke the operator for the same this but different item.
B::argument_type()                                   Default constructor.
B::argument_type( const B::argument_type& )          Copy constructor.
~B::argument_type()                                  Destructor.

For example, a unary function object, as defined in Section 20.3 of the C++ standard, models the requirements for B. A concurrent_queue (5.5) models the requirements for S.

TIP: To achieve speedup, the grainsize of B::operator() needs to be on the order of at least ~10,000 instructions. Otherwise, the internal overheads of parallel_while swamp the useful work. The parallelism in parallel_while is not scalable if all the items come from the input stream. To achieve scaling, design your algorithm such that method add often adds more than one piece of work.

Members
    namespace tbb {
        template<typename Body>
        class parallel_while {
        public:
            parallel_while();
            ~parallel_while();
            typedef typename Body::argument_type value_type;
            template<typename Stream>
            void run( Stream& stream, const Body& body );
            void add( const value_type& item );
        };
    }

A.1.1 parallel_while()

Effects
Constructs a parallel_while that is not yet running.

A.1.2 ~parallel_while()

Effects
Destroys a parallel_while.

A.1.3 template<typename Stream> void run( Stream& stream, const Body& body )

Effects
Applies body to each item in stream and any other items that are added by method add. Terminates when both of the following conditions become true:
• stream.pop_if_present returned false.
• body(x) returned for all items x generated from the stream or method add.
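The termination rule for run can be illustrated with a serial model. The sketch below is not the Intel® TBB implementation and uses no TBB headers; SerialWhile, VectorStream, HalveBody, and run_demo are hypothetical names introduced only for illustration. It processes items from the stream via pop_if_present, then items queued by add, and returns once both sources are exhausted.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Hypothetical stream that models requirement S from Table 44.
struct VectorStream {
    std::vector<int> items;
    std::size_t next;
    VectorStream() : next(0) {}
    bool pop_if_present(int& item) {
        if (next == items.size()) return false;   // stream exhausted
        item = items[next++];
        return true;
    }
};

// Serial stand-in for parallel_while<Body>; illustrative only.
template <typename Body>
class SerialWhile {
    std::deque<typename Body::argument_type> added_;
public:
    typedef typename Body::argument_type value_type;

    void add(const value_type& item) { added_.push_back(item); }

    template <typename Stream>
    void run(Stream& stream, const Body& body) {
        value_type item;
        for (;;) {
            if (stream.pop_if_present(item)) { body(item); continue; }
            if (added_.empty()) break;   // stream empty and no added work left
            item = added_.front();
            added_.pop_front();
            body(item);
        }
    }
};

// Body that records each item and adds item/2 as extra work while item > 1.
struct HalveBody {
    typedef int argument_type;
    SerialWhile<HalveBody>* owner;
    std::vector<int>* seen;
    void operator()(int& item) const {
        assert(owner && seen);
        seen->push_back(item);
        if (item > 1) owner->add(item / 2);
    }
};

// Runs the model on the stream {8, 3} and returns the items in processing order.
std::vector<int> run_demo() {
    VectorStream s;
    s.items.push_back(8);
    s.items.push_back(3);
    std::vector<int> seen;
    SerialWhile<HalveBody> w;
    HalveBody b = { &w, &seen };
    w.run(s, b);
    return seen;
}
```

In this model the two stream items produce four further items through add, so the body runs six times before run returns. A parallel implementation may process the same items in a different order.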
A.1.4 void add( const value_type& item )

Requirements
Must be called from a call to body.operator() created by parallel_while. Otherwise, the termination semantics of method run are undefined.

Effects
Adds item to the collection of items to be processed.

A.2 Interface for constructing a pipeline filter

The interface for constructing a filter evolved over several releases of Intel® TBB. The two following subsections describe obsolete aspects of the interface.

A.2.1 filter::filter( bool is_serial )

Effects
Constructs a serial in order filter if is_serial is true, or a parallel filter if is_serial is false. This deprecated constructor is superseded by the constructor filter( filter::mode ) described in Section 4.9.6.1.

A.2.2 filter::serial

The filter mode value filter::serial is now named filter::serial_in_order. The new name distinguishes it more clearly from the mode filter::serial_out_of_order.

A.3 Debugging Macros

The names of the debugging macros have changed as shown in Table 45. If you define the old macros, Intel® TBB sets each undefined new macro in a way that duplicates the behavior of the old macro settings.

The old TBB_DO_ASSERT enabled assertions, full support for Intel® Threading Tools, and performance warnings. These three distinct capabilities are now controlled by three separate macros as described in Section 3.2.

TIP: To enable all three capabilities with a single macro, define TBB_USE_DEBUG to be 1. If you had code under “#if TBB_DO_ASSERT” that should be conditionally included only when assertions are enabled, use “#if TBB_USE_ASSERT” instead.

Table 45: Deprecated Macros

Deprecated Macro           New Macro
TBB_DO_ASSERT              TBB_USE_DEBUG or TBB_USE_ASSERT, depending on context.
TBB_DO_THREADING_TOOLS     TBB_USE_THREADING_TOOLS

A.4 tbb::deprecated::concurrent_queue Template Class

Summary
Template class for queue with concurrent operations. This is the concurrent_queue supported in Intel® TBB 2.1 and prior.
New code should use the Intel® TBB 2.2 unbounded concurrent_queue or concurrent_bounded_queue.

Syntax
    template<typename T, typename Alloc=cache_aligned_allocator<T> >
    class concurrent_queue;

Header
    #include "tbb/concurrent_queue.h"

Description
A tbb::deprecated::concurrent_queue is a bounded first-in first-out data structure that permits multiple threads to concurrently push and pop items. The default bounds are large enough to make the queue practically unbounded, subject to memory limitations on the target machine.

NOTE: Compile with TBB_DEPRECATED=1 to inject tbb::deprecated::concurrent_queue into namespace tbb.

Consider eventually migrating to the new queue classes.
• Use the new tbb::concurrent_queue if you need only the non-blocking operations (push and try_pop) for modifying the queue.
• Otherwise use the new tbb::concurrent_bounded_queue. It supports both blocking operations (push and pop) and non-blocking operations.

In both cases, use the new method names in Table 46.

Table 46: Method Name Changes for Concurrent Queues

Method in tbb::deprecated::concurrent_queue    Equivalent method in tbb::concurrent_queue or tbb::concurrent_bounded_queue
pop_if_present                                 try_pop
push_if_not_full                               try_push (not available in tbb::concurrent_queue)
begin                                          unsafe_begin
end                                            unsafe_end

Members
    namespace tbb {
        namespace deprecated {
            template<typename T, typename Alloc=cache_aligned_allocator<T> >
            class concurrent_queue {
            public:
                // types
                typedef T value_type;
                typedef T& reference;
                typedef const T& const_reference;
                typedef std::ptrdiff_t size_type;
                typedef std::ptrdiff_t difference_type;

                concurrent_queue(const Alloc& a = Alloc());
                concurrent_queue(const concurrent_queue& src, const Alloc& a = Alloc());
                template<typename InputIterator>
                concurrent_queue(InputIterator first, InputIterator last, const Alloc& a = Alloc());
                ~concurrent_queue();

                void push(const T& source);
                bool push_if_not_full(const T& source);
                void pop(T& destination);
                bool pop_if_present(T& destination);
                void clear();

                size_type size() const;
                bool empty() const;
                size_t capacity() const;
                void set_capacity(size_type capacity);
                Alloc get_allocator() const;

                typedef implementation-defined iterator;
                typedef implementation-defined const_iterator;

                // iterators (these are slow and intended only for debugging)
                iterator begin();
                iterator end();
                const_iterator begin() const;
                const_iterator end() const;
            };
        }
    #if TBB_DEPRECATED
        using deprecated::concurrent_queue;
    #else
        using strict_ppl::concurrent_queue;
    #endif
    }

A.5 Interface for concurrent_vector

The return type of methods grow_by and grow_to_at_least changed in Intel® TBB 2.2. Compile with the preprocessor symbol TBB_DEPRECATED set to nonzero to get the old methods.

Table 47: Change in Return Types

Method                        Deprecated Return Type    New Return Type
grow_by (5.8.3.1)             size_type                 iterator
grow_to_at_least (5.8.3.2)    void                      iterator
push_back (5.8.3.3)           size_type                 iterator

A.5.1 void compact()

Effects
Same as shrink_to_fit() (5.8.2.2).

A.6 Interface for class task

Some methods of class task are deprecated because they have obsolete or redundant functionality.

Deprecated Members of class task
    namespace tbb {
        class task {
        public:
            ...
            void recycle_to_reexecute();
            // task depth
            typedef implementation-defined-signed-integral-type depth_type;
            depth_type depth() const {return 0;}
            void set_depth( depth_type new_depth ) {}
            void add_to_depth( int delta ) {}
            ...
        };
    }

A.6.1 void recycle_to_reexecute()

Intel® TBB 3.0 deprecated method recycle_to_reexecute because it is redundant. Replace a call t->recycle_to_reexecute() with the following sequence:
    t->set_refcount(1);
    t->recycle_as_safe_continuation();

A.6.2 Depth interface for class task

Intel® TBB 2.2 eliminated the notion of task depth that was present in prior versions of Intel® TBB. The members of class task that related to depth have been retained under TBB_DEPRECATED, but do nothing.

A.7 tbb_thread Class

Intel® TBB 3.0 introduces a header tbb/compat/thread that defines class std::thread.
Prior versions had a header tbb/tbb_thread.h that defined class tbb_thread. The old header and names are still available, but deprecated in favor of the replacements shown in Table 48.

Table 48: Replacements for Deprecated Names

Entity         Deprecated                      Replacement
Header         tbb/tbb_thread.h                tbb/compat/thread
Identifiers    tbb::tbb_thread                 std::thread
               tbb::this_tbb_thread            std::this_thread
               tbb::this_tbb_thread::sleep     std::this_thread::sleep_for

Most of the changes reflect a change in the way that the library implements C++ 200x features (2.4.7). The change from sleep to sleep_for reflects a change in the C++ 200x working draft.

Appendix B PPL Compatibility

Intel® Threading Building Blocks (Intel® TBB) 2.2 introduces features based on joint discussions between the Microsoft Corporation and Intel Corporation. The features establish some degree of compatibility between Intel® TBB and Microsoft Parallel Patterns Library (PPL) development software.

Table 49 lists the features. Each feature appears in namespace tbb. Each feature can be injected into namespace Concurrency by including the file "tbb/compat/ppl.h".

Table 49: PPL Compatibility Features

Section    Feature
4.4        parallel_for( first,last,f )
4.4        parallel_for( first,last,step,f )
4.8        parallel_for_each
4.12       parallel_invoke
9.3.1      critical_section
9.3.2      reader_writer_lock
11.3       task_handle
11.2       task_group_status
11.1.1     task_group
11.4       make_task
11.5       structured_task_group
11.6       is_current_task_group_cancelling
13.4       improper_lock
13.4       invalid_multiple_scheduling
13.4       missing_wait

For parallel_for, only the variants listed in the table are injected into namespace Concurrency.

CAUTION: Because of different environments and evolving specifications, the behavior of the features can differ between the Intel® TBB and PPL implementations.
Appendix C Known Issues

This section explains known issues with using Intel® Threading Building Blocks (Intel® TBB).

C.1 Windows* OS

Some Intel® TBB header files necessarily include the header file <windows.h>, which by default defines the macros min and max, and consequently breaks the ISO C++ header files <algorithm> and <limits>. Defining the preprocessor symbol NOMINMAX causes <windows.h> to not define the offending macros. Thus programs using Intel® TBB and either of the aforementioned ISO C++ headers should be compiled with /DNOMINMAX as a compiler argument.

Appendix D Community Preview Features

This section provides documentation for Community Preview (CP) features.

What is a Community Preview Feature?

A Community Preview feature is a component of Intel® Threading Building Blocks (Intel® TBB) that is being introduced to gain early feedback from developers. Comments, questions and suggestions related to Community Preview features are encouraged and should be submitted to the forums at www.threadingbuildingblocks.org.

The key properties of a CP feature are:
• It must be explicitly enabled. It is off by default.
• It is intended to have a high quality implementation.
• There is no guarantee of future existence or compatibility.
• It may have limited or no support in tools such as correctness analyzers, profilers and debuggers.

CAUTION: A CP feature is subject to change in the future. It may be removed or radically altered in future releases of the library. Changes to a CP feature do NOT require the usual deprecation and deletion process. Using a CP feature in a production code base is therefore strongly discouraged.

Enabling a Community Preview Feature

A Community Preview feature may be defined completely in header files or it may require some additional support defined in a library. For a CP feature that is contained completely in header files, a feature-specific macro must be defined before inclusion of the header files.
Example
    #define TBB_PREVIEW_FOO 1
    #include "tbb/foo.h"

If a CP feature requires support from a library, then an additional library must be linked with the application.

The use of separate headers, feature-specific macros and separate libraries mitigates the impact of Community Preview features on other product features.

NOTE: Unless a CP feature is explicitly enabled using the above mechanisms, it will have no impact on the application.

D.1 Flow Graph

This section describes Flow Graph nodes that are available as Community Preview features.

D.1.1 or_node Template Class

Summary
A node that broadcasts messages received at its input ports to all of its successors. Each input port pi is a receiver<Ti>. The messages are broadcast individually as they are received at each port. The output message type is a struct that contains an index number that identifies the port on which the message arrived and a tuple of the input types where the value is stored.

Syntax
    template<typename InputTuple> class or_node;

Header
    #define TBB_PREVIEW_GRAPH_NODES 1
    #include "tbb/flow_graph.h"

Description
An or_node is a graph_node and sender< or_node<InputTuple>::output_type >. It contains a tuple of input ports, each of which is a receiver<Ti> for each of the T0 .. TN in InputTuple. It supports multiple input receivers with distinct types and broadcasts each received message to all of its successors. Unlike a join_node, each message is broadcast individually to all successors of the or_node as it arrives at an input port. The incoming messages are wrapped in a struct that contains the index of the port on which the message arrived and a tuple of the input types where the received value is stored.

The function template input_port described in 6.19 simplifies the syntax for getting a reference to a specific input port.

Rejection of messages by successors of the or_node is handled using the protocol in Figure 4. The input ports never reject incoming messages.
InputTuple must be a std::tuple<T0,T1,...> where each element is copy-constructible and assignable.

Example
    #include <cstdio>
    #define TBB_PREVIEW_GRAPH_NODES 1
    #include "tbb/flow_graph.h"
    using namespace tbb::flow;

    int main() {
        graph g;
        function_node<int,int> f1( g, unlimited,
            [](const int &i) { return 2*i; } );
        function_node<float,float> f2( g, unlimited,
            [](const float &f) { return f/2; } );

        typedef or_node< std::tuple<int,float> > my_or_type;
        my_or_type o;

        function_node< my_or_type::output_type > f3( g, unlimited,
            []( const my_or_type::output_type &v ) {
                if (v.indx == 0) {
                    printf("Received an int %d\n", std::get<0>(v.result));
                } else {
                    printf("Received a float %f\n", std::get<1>(v.result));
                }
            } );

        make_edge( f1, input_port<0>(o) );
        make_edge( f2, input_port<1>(o) );
        make_edge( o, f3 );

        f1.try_put( 3 );
        f2.try_put( 3 );
        g.wait_for_all();
        return 0;
    }

In the example above, three function_node objects are created: f1 multiplies an int i by 2, f2 divides a float f by 2, and f3 prints the values from f1 and f2 as they arrive. The or_node o wraps the output of f1 and f2 and forwards each result to f3. This example is purely a syntactic demonstration since there is very little work in the nodes.

Members
    namespace tbb {
        namespace flow {
            template<typename InputTuple>
            class or_node : public graph_node,
                            public sender< impl-dependent-output-type > {
            public:
                typedef struct { size_t indx; InputTuple result; } output_type;
                typedef receiver<output_type> successor_type;
                typedef implementation-dependent-tuple input_ports_tuple_type;

                or_node();
                or_node(const or_node &src);
                input_ports_tuple_type &inputs();
                bool register_successor( successor_type &r );
                bool remove_successor( successor_type &r );
                bool try_get( output_type &v );
                bool try_reserve( output_type & );
                bool try_release( );
                bool try_consume( );
            };
        }
    }

D.1.1.1 or_node( )

Effect
Constructs an or_node.

D.1.1.2 or_node( const or_node &src )

Effect
Constructs an or_node. The list of predecessors, messages in the input ports, and successors are NOT copied.
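The index-plus-tuple wrapper that or_node uses as its output_type can be modeled in plain C++. This is a sketch only: TaggedMessage and wrap are hypothetical names, no TBB headers are involved, and the assumption that only the tuple element selected by indx carries a meaningful value is taken from how the example above reads v.result.

```cpp
#include <cassert>
#include <cstddef>
#include <tuple>

// Hypothetical stand-in for or_node<InputTuple>::output_type.
template <typename InputTuple>
struct TaggedMessage {
    std::size_t indx;    // index of the port on which the message arrived
    InputTuple result;   // assumed: only element number indx is meaningful
};

// Models the wrapping of a value v that arrives at input port N.
template <std::size_t N, typename InputTuple>
TaggedMessage<InputTuple>
wrap(const typename std::tuple_element<N, InputTuple>::type& v) {
    TaggedMessage<InputTuple> m;   // the tuple value-initializes its elements
    m.indx = N;
    assert(m.indx < std::tuple_size<InputTuple>::value);
    std::get<N>(m.result) = v;     // store the value in the matching slot
    return m;
}
```

A consumer switches on indx and then reads the matching element with std::get, which is the same pattern the body of f3 follows in the example.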
D.1.1.3 input_ports_tuple_type& inputs()

Returns
A std::tuple of receivers. Each element inherits from tbb::receiver<T> where T is the type of message expected at that input. Each tuple element can be used like any other flow::receiver<T>.

D.1.1.4 bool register_successor( successor_type & r )

Effect
Adds r to the set of successors.

Returns
true.

D.1.1.5 bool remove_successor( successor_type & r )

Effect
Removes r from the set of successors.

Returns
true.

D.1.1.6 bool try_get( output_type &v )

Description
An or_node contains no buffering and therefore does not support gets.

Returns
false.

D.1.1.7 bool try_reserve( output_type & )

Description
An or_node contains no buffering and therefore cannot be reserved.

Returns
false.

D.1.1.8 bool try_release( )

Description
An or_node contains no buffering and therefore cannot be reserved.

Returns
false.

D.1.1.9 bool try_consume( )

Description
An or_node contains no buffering and therefore cannot be reserved.

Returns
false.

D.1.2 multioutput_function_node Template Class

Summary
A template class that is a receiver<InputType> and has a tuple of sender<T> outputs. This node may have concurrency limits as set by the user. When the concurrency limit allows, it executes the user-provided body on incoming messages. The body may create one or more output messages and broadcast them to successors.

Syntax
    template < typename InputType, typename OutputTuple >
    class multioutput_function_node;

Header
    #define TBB_PREVIEW_GRAPH_NODES 1
    #include "tbb/flow_graph.h"

Description
This type is used for nodes that receive messages at a single input port and may generate one or more messages that are broadcast to successors.

A multioutput_function_node maintains an internal constant threshold T and an internal counter C. At construction, C=0 and T is set to the value passed in to the constructor. The behavior of a call to try_put is determined by the value of T and C as shown in Table 50.
Table 50: Behavior of a call to a multioutput_function_node’s try_put

Value of threshold T     Value of counter C    bool try_put( input_type v )
T == flow::unlimited     NA                    A task is enqueued that executes body(v). Returns true.
T != flow::unlimited     C < T                 Increments C. A task is enqueued that executes body(v) and then decrements C. Returns true.
T != flow::unlimited     C >= T                Returns false.

A multioutput_function_node has a user-settable concurrency limit. It can have flow::unlimited concurrency, which allows an unlimited number of copies of the node to execute concurrently. It can have flow::serial concurrency, which allows only a single copy of the node to execute concurrently. The user can also provide a value of type size_t to limit concurrency to a value between 1 and unlimited.

The Body concept for multioutput_function_node is shown in Table 51.

Table 51: multioutput_function_node Body Concept

Pseudo-Signature                                                    Semantics
B::B( const B& )                                                    Copy constructor.
B::~B()                                                             Destructor.
void operator=( const B& )                                          Assignment.
void B::operator()(const InputType &v, output_ports &p)             Perform operation on v. May call try_put on zero or more output_ports. May call try_put on output_ports multiple times.

Example
The example below shows a multioutput_function_node that separates a stream of integers into odd and even, placing each in the appropriate output queue.

The Body method will receive as parameters a read-only reference to the input value and a reference to the tuple of output ports. The Body method may put items to one or more output ports.
The output ports of the multioutput_function_node can be connected to other graph nodes using the make_edge method or by using register_successor:

    #define TBB_PREVIEW_GRAPH_NODES 1
    #include "tbb/flow_graph.h"
    using namespace tbb::flow;

    typedef multioutput_function_node<int, std::tuple<int,int> > multi_node;

    struct MultiBody {
        void operator()(const int &i, multi_node::output_ports_type &op) {
            if(i % 2)
                std::get<1>(op).put(i); // put to odd queue
            else
                std::get<0>(op).put(i); // put to even queue
        }
    };

    int main() {
        graph g;
        queue_node<int> even_queue(g);
        queue_node<int> odd_queue(g);
        multi_node node1(g,unlimited,MultiBody());
        output_port<0>(node1).register_successor(even_queue);
        make_edge(output_port<1>(node1), odd_queue);
        for(int i = 0; i < 1000; ++i) {
            node1.try_put(i);
        }
        g.wait_for_all();
    }

Members
    namespace tbb {
        template< typename InputType, typename OutputTuple, graph_buffer_policy=queueing, A>
        class multioutput_function_node :
            public graph_node, public receiver<InputType>
        {
        public:
            typedef (input_queue) queue_type;

            template<typename Body>
            multioutput_function_node( graph &g, size_t concurrency, Body body, queue_type *q = NULL );
            multioutput_function_node( const multioutput_function_node &other, queue_type *q = NULL );
            ~multioutput_function_node();

            typedef InputType input_type;
            typedef sender<input_type> predecessor_type;
            bool try_put( input_type v );
            bool register_predecessor( predecessor_type &p );
            bool remove_predecessor( predecessor_type &p );

            typedef OutputType tuple_port_types;
            typedef (tuple of sender) output_ports_type;
        };

        template<size_t N, typename MFN>
        (output port &) output_port(MFN &node);
    }

D.1.2.1 template< typename Body> multioutput_function_node(graph &g, size_t concurrency, Body body, queue_type *q = NULL)

Description
Constructs a multioutput_function_node that will invoke body. At most concurrency calls to the body may be made concurrently.
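The accept-or-reject rule that Table 50 gives for try_put, with threshold T and counter C, can be modeled with a small counter class. This is a single-threaded sketch rather than library code: ConcurrencyGate, try_enter, leave, and kUnlimited are hypothetical names, and a real node would need atomic or locked updates of the counter.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical sentinel standing in for flow::unlimited in this model.
const std::size_t kUnlimited = 0;

class ConcurrencyGate {
    const std::size_t threshold_;   // T: fixed at construction
    std::size_t running_;           // C: bodies currently executing
public:
    explicit ConcurrencyGate(std::size_t t) : threshold_(t), running_(0) {}

    // Models the decision made by try_put: true means a task would be enqueued.
    bool try_enter() {
        if (threshold_ == kUnlimited) return true;   // row 1: always accept
        if (running_ >= threshold_) return false;    // row 3: reject
        ++running_;                                  // row 2: count and accept
        return true;
    }

    // Models the decrement performed after body(v) finishes.
    void leave() {
        if (threshold_ != kUnlimited) {
            assert(running_ > 0);
            --running_;
        }
    }

    std::size_t running() const { return running_; }
};
```

With a threshold of 2, the third call to try_enter fails until one of the first two callers invokes leave. With kUnlimited the counter is never touched.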
D.1.2.2 template< typename Body> multioutput_function_node(multioutput_function_node const & other, queue_type *q = NULL)

Effect
Constructs a copy of a multioutput_function_node with an optional input queue.

D.1.2.3 bool register_predecessor( predecessor_type & p )

Effect
Adds p to the set of predecessors.

Returns
true.

D.1.2.4 bool remove_predecessor( predecessor_type & p )

Effect
Removes p from the set of predecessors.

Returns
true.

D.1.2.5 bool try_put( input_type v )

Effect
If fewer copies of the node exist than the allowed concurrency, a task is spawned to execute body on v. The body may put results to one or more successors in the tuple of output ports.

Returns
true.

D.1.2.6 (output port &) output_port(node)

Returns
A reference to port N of the multioutput_function_node node.

D.1.3 split_node Template Class

Summary
A template class that is a receiver<InputType> and has a tuple of sender<T> outputs. A split_node is a multioutput_function_node with a body that sends each element of the incoming tuple to the output port that matches the element’s index in the incoming tuple. This node has unlimited concurrency.

Syntax
    template < typename InputType >
    class split_node;

Header
    #define TBB_PREVIEW_GRAPH_NODES 1
    #include "tbb/flow_graph.h"

Description
This type is used for nodes that receive tuples at a single input port and generate a message from each element of the tuple, passing each to its corresponding output port.

A split_node has unlimited concurrency, no buffering, and behaves as a broadcast_node with multiple output ports.

Example
The example below shows a split_node that separates a stream of tuples of integers, placing each element of the tuple in the appropriate output queue.
The output ports of the split_node can be connected to other graph nodes using the make_edge method or by using register_successor:

    #define TBB_PREVIEW_GRAPH_NODES 1
    #include "tbb/flow_graph.h"
    using namespace tbb::flow;

    typedef split_node< std::tuple<int,int> > s_node;

    int main() {
        typedef std::tuple<int,int> int_tuple_type;
        graph g;
        queue_node<int> first_queue(g);
        queue_node<int> second_queue(g);
        s_node node1(g);
        output_port<0>(node1).register_successor(first_queue);
        make_edge(output_port<1>(node1), second_queue);
        for(int i = 0; i < 1000; ++i) {
            node1.try_put(int_tuple_type(2*i,2*i+1));
        }
        g.wait_for_all();
    }

Members
    namespace tbb {
        template< typename InputType, A >
        class split_node : public multioutput_function_node
        {
        public:
            split_node( graph &g );
            split_node( const split_node &other );
            ~split_node();

            typedef InputType input_type;
            typedef sender<input_type> predecessor_type;
            bool try_put( input_type v );
            bool register_predecessor( predecessor_type &p );
            bool remove_predecessor( predecessor_type &p );

            typedef OutputType tuple_port_types;
            typedef (tuple of sender) output_ports_type;
        };

        template<size_t N, typename MFN>
        (output port &) output_port(MFN &node);
    }

D.1.3.1 split_node(graph &g)

Description
Constructs a split_node.

D.1.3.2 split_node(split_node const & other)

Effect
Constructs a copy of a split_node.

D.1.3.3 bool register_predecessor( predecessor_type & p )

Effect
Adds p to the set of predecessors.

Returns
true.

D.1.3.4 bool remove_predecessor( predecessor_type & p )

Effect
Removes p from the set of predecessors.

Returns
true.

D.1.3.5 bool try_put( input_type v )

Effect
Forwards each element of the input tuple v to the corresponding output port.

Returns
true.

D.1.3.6 (output port &) output_port(node)

Returns
A reference to port N of the split_node.

D.2 Run-time loader

Summary
The run-time loader is a mechanism that provides additional run-time control over the version of the Intel® Threading Building Blocks (Intel® TBB) dynamic library used by an application, plug-in, or another library.
Header
    #define TBB_PREVIEW_RUNTIME_LOADER 1
    #include "tbb/runtime_loader.h"

Library

OS         Release build    Debug build
Windows    tbbproxy.lib     tbbproxy_debug.lib

Description
The run-time loader consists of a class and a static library that can be linked with an application, library, or plug-in to provide better run-time control over the version of Intel® TBB used. The class allows loading a desired version of the dynamic library at run time with an explicit list of directories for library search. The static library provides stubs for functions and methods to resolve link-time dependencies, which are then dynamically substituted with the proper functions and methods from a loaded Intel® TBB library.

All instances of class runtime_loader in the same module (i.e. exe or dll) share certain global state. The most noticeable piece of this state is the loaded Intel® TBB library. The implications of that are:
• Only one Intel® TBB library per module can be loaded.
• If one runtime_loader instance has already loaded a library, another one created by the same module will not load another one. If the loaded library is suitable for the second instance, both will use it cooperatively, otherwise an error will be reported (details below).
• If different versions of the library are requested by different modules, those can be loaded, but may result in processor oversubscription.
• runtime_loader objects are not thread-safe and may work incorrectly if used concurrently.

NOTE: If an application or a library uses runtime_loader, it should be linked with one of the above specified libraries instead of a normal Intel® TBB library.
Example

    #define TBB_PREVIEW_RUNTIME_LOADER 1
    #include "tbb/runtime_loader.h"
    #include "tbb/parallel_for.h"
    #include <cstddef>   // for NULL

    char const * path[] = { "c:\\myapp\\lib\\ia32", NULL };

    int main() {
        tbb::runtime_loader loader( path );
        if( loader.status()!=tbb::runtime_loader::ec_ok )
            return -1;
        // The loader does not impact how TBB is used
        tbb::parallel_for(0, 10, ParallelForBody());
        return 0;
    }

In this example, the Intel® Threading Building Blocks (Intel® TBB) library will be loaded from the c:\myapp\lib\ia32 directory. No explicit requirements for a version are specified, so the minimal suitable version is the version used to compile the example, and any higher version is suitable as well. If the library is successfully loaded, it can be used in the normal way.

D.2.1 runtime_loader Class
Summary
Class for run-time control over the loading of an Intel® Threading Building Blocks dynamic library.

Syntax

    class runtime_loader;

Members

    namespace tbb {
    class runtime_loader {
        // Error codes.
        enum error_code {
            ec_ok,       // No errors.
            ec_bad_call, // Invalid function call.
            ec_bad_arg,  // Invalid argument passed.
            ec_bad_lib,  // Invalid library found.
            ec_bad_ver,  // The library found is not suitable.
            ec_no_lib    // No library found.
        };
        // Error mode constants.
        enum error_mode {
            em_status,   // Save status of operation and continue.
            em_throw,    // Throw an exception of error_code type.
            em_abort     // Print message to stderr, and abort().
        };
        runtime_loader( error_mode mode = em_abort );
        runtime_loader(
            char const *path[],                   // List of directories to search in.
            int min_ver = TBB_INTERFACE_VERSION,  // Minimal suitable version
            int max_ver = INT_MAX,                // Maximal suitable version
            error_mode mode = em_abort            // Error mode for this instance.
        );
        ~runtime_loader();
        error_code load(
            char const * path[],
            int min_ver = TBB_INTERFACE_VERSION,
            int max_ver = INT_MAX
        );
        error_code status();
    };
    }

D.2.1.1 runtime_loader( error_mode mode = em_abort )
Effects
Initialize runtime_loader but do not load a library.

D.2.1.2 runtime_loader(char const * path[], int min_ver = TBB_INTERFACE_VERSION, int max_ver = INT_MAX, error_mode mode = em_abort )
Requirements
The last element of path[] must be NULL.
Effects
Initialize runtime_loader and load Intel® TBB (see load() for details). If the error mode is em_status, the method status() can be used to check whether the library was loaded or not. If the error mode is em_throw, a failure throws an exception of type error_code. If the error mode is em_abort, a failure prints a message to stderr and aborts execution.

D.2.1.3 error_code load(char const * path[], int min_ver = TBB_INTERFACE_VERSION, int max_ver = INT_MAX)
Requirements
The last element of path[] must be NULL.
Effects
Load a suitable version of an Intel® TBB dynamic library from one of the specified directories.

TIP: The method searches for a library in directories specified in the path[] array. When a library is found, it is loaded and its interface version (as returned by TBB_runtime_interface_version()) is checked. If the version does not meet the requirements specified by min_ver and max_ver, the library is unloaded. The search continues in the next specified path, until a suitable version of the Intel® TBB library is found or the array of paths ends with NULL. It is recommended to use default values for min_ver and max_ver.

CAUTION: For security reasons, avoid using relative directory names such as current ("."), parent ("..") or any other relative directory (like "lib") when searching for a library. Use only absolute directory names (as shown in the example above); if necessary, construct absolute names at run time.
Neglecting these rules may cause your program to execute third-party malicious code. (See http://www.microsoft.com/technet/security/advisory/2269637.mspx for details.)

Returns
• ec_ok - a suitable version was successfully loaded.
• ec_bad_call - this runtime_loader instance has already been used to load a library.
• ec_bad_lib - a library was found but it appears invalid.
• ec_bad_arg - min_ver and/or max_ver is negative or zero, or min_ver > max_ver.
• ec_bad_ver - an unsuitable version has already been loaded by another instance.
• ec_no_lib - no suitable version was found.

D.2.1.4 error_code status()
Returns
If error mode is em_status, the function returns status of the last operation.

D.3 parallel_deterministic_reduce Template Function
Summary
Computes reduction over a range, with deterministic split/join behavior.

Syntax

    template<typename Range, typename Value, typename Func, typename Reduction>
    Value parallel_deterministic_reduce( const Range& range,
                                         const Value& identity,
                                         const Func& func,
                                         const Reduction& reduction
                                         [, task_group_context& group] );

    template<typename Range, typename Body>
    void parallel_deterministic_reduce( const Range& range,
                                        const Body& body
                                        [, task_group_context& group] );

Header

    #define TBB_PREVIEW_DETERMINISTIC_REDUCE 1
    #include "tbb/parallel_reduce.h"

Description
The parallel_deterministic_reduce template is very similar to the parallel_reduce template. It also has the functional and imperative forms and has similar requirements for Func and Reduction (Table 12) and Body (Table 13).

Unlike parallel_reduce, parallel_deterministic_reduce has deterministic behavior with regard to splits of both Body and Range and joins of the bodies. For the functional form, this means Func is applied to a deterministic set of Ranges, and Reduction merges partial results in a deterministic order. To achieve that, parallel_deterministic_reduce always uses simple_partitioner, because other partitioners may react to random work-stealing behavior (see 4.3.1).
So the template declaration does not have a partitioner argument. parallel_deterministic_reduce always invokes the Body splitting constructor for each range split.

    b0 [0,20)
    b0 [0,10)    b2 [10,20)
    b0 [0,5)    b1 [5,10)    b2 [10,15)    b3 [15,20)

Figure 18: Execution of parallel_deterministic_reduce over blocked_range<int>(0,20,5)

As a result, parallel_deterministic_reduce recursively splits a range until it is no longer divisible, and creates a new body (by calling the Body splitting constructor) for each new subrange. As with parallel_reduce, for each body split the method join is invoked in order to merge the results from the bodies. Figure 18 shows the execution of parallel_deterministic_reduce over a sample range, with the slash marks (/) denoting where new instances of the body were created.

Therefore, for given arguments, parallel_deterministic_reduce executes the same set of split and join operations no matter how many threads participate in execution and how tasks are mapped to the threads. If the user-provided functions are also deterministic (i.e. different runs with the same input result in the same output), then multiple calls to parallel_deterministic_reduce will produce the same result. Note however that the result might differ from that obtained with an equivalent sequential (linear) algorithm.

CAUTION: Since simple_partitioner is always used, be careful to specify an appropriate grainsize (see simple_partitioner class).

Complexity
If the range and body take O(1) space, and the range splits into nearly equal pieces, then the space complexity is O(P log(N)), where N is the size of the range and P is the number of threads.

Example
The example from the parallel_reduce section can be easily modified to use parallel_deterministic_reduce. It is sufficient to define the TBB_PREVIEW_DETERMINISTIC_REDUCE macro and rename parallel_reduce to parallel_deterministic_reduce; a partitioner, if any, should be removed to prevent a compilation error.
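The fixed split tree described above can be imitated without Intel® TBB. The following is only a sketch in standard C++, not the library's implementation: the name deterministic_sum and its right_first flag are hypothetical, and the flag merely stands in for a different task schedule. The range is halved until a piece is no larger than the grain size, and partial results are always joined as left + right, so the floating-point result does not depend on which half is evaluated first.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper, not part of Intel TBB: sums [first, last) over a fixed
// binary split tree. The tree shape depends only on the range size and grain.
// right_first changes which half is evaluated first (a stand-in for a
// different schedule), but never which values are added together.
float deterministic_sum(const float* first, const float* last,
                        std::size_t grain, bool right_first) {
    const std::size_t n = static_cast<std::size_t>(last - first);
    if (n <= grain)                       // leaf: range is no longer divisible
        return std::accumulate(first, last, 0.0f);
    const float* mid = first + n / 2;     // split point is a pure function of n
    float left, right;
    if (right_first) {
        right = deterministic_sum(mid, last, grain, right_first);
        left = deterministic_sum(first, mid, grain, right_first);
    } else {
        left = deterministic_sum(first, mid, grain, right_first);
        right = deterministic_sum(mid, last, grain, right_first);
    }
    return left + right;                  // the join is always left + right
}
```

Calling deterministic_sum on the same data with right_first set to true and to false yields the same float, although that value may differ from a plain left-to-right accumulation.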
A grain size may need to be specified for blocked_range if performance suffers.

    #define TBB_PREVIEW_DETERMINISTIC_REDUCE 1
    #include <numeric>
    #include <functional>
    #include "tbb/parallel_reduce.h"
    #include "tbb/blocked_range.h"
    using namespace tbb;

    float ParallelSum( float array[], size_t n ) {
        size_t grain_size = 1000;
        return parallel_deterministic_reduce(
            blocked_range<float*>( array, array+n, grain_size ),
            0.f,
            [](const blocked_range<float*>& r, float value)->float {
                return std::accumulate(r.begin(),r.end(),value);
            },
            std::plus<float>()
        );
    }

D.4 Scalable Memory Pools
Memory pools allocate and free memory from a specified region or underlying allocator, providing thread-safe, scalable operations. Table 52 summarizes the memory pool concept. Here, P represents an instance of the memory pool class.

Table 52: Memory Pool Concept

    Pseudo-Signature                          Semantics
    ~P() throw();                             Destructor. Frees all the memory of allocated objects.
    void P::recycle();                        Frees all the memory of allocated objects.
    void* P::malloc(size_t n);                Returns pointer to n bytes allocated from memory pool.
    void P::free(void* ptr);                  Frees memory object specified via ptr pointer.
    void* P::realloc(void* ptr, size_t n);    Reallocates memory object pointed to by ptr to n bytes.

Model Types
Template class memory_pool (D.4.1) and class fixed_pool (D.4.2) model the Memory Pool concept.

D.4.1 memory_pool Template Class
Summary
Template class for scalable memory allocation from memory blocks provided by an underlying allocator.

CAUTION: If the underlying allocator refers to another scalable memory pool, the inner pool (or pools) must be destroyed before the outer pool is destroyed or recycled.

Syntax

    template <typename Alloc> class memory_pool;

Header

    #define TBB_PREVIEW_MEMORY_POOL 1
    #include "tbb/memory_pool.h"

Description
A memory_pool allocates and frees memory in a way that scales with the number of processors. The memory is obtained as big chunks from an underlying allocator specified by the template argument.
The latter must satisfy the subset of requirements described in Table 29 with allocate, deallocate, and value_type valid for sizeof(value_type)>0. A memory_pool models the Memory Pool concept described in Table 52.

Example

    #define TBB_PREVIEW_MEMORY_POOL 1
    #include "tbb/memory_pool.h"
    ...
    tbb::memory_pool<std::allocator<char> > my_pool;
    void* my_ptr = my_pool.malloc(10);
    my_pool.free(my_ptr);

The code above provides a simple example of allocation from an extensible memory pool.

Members

    namespace tbb {
    template <typename Alloc>
    class memory_pool : no_copy {
    public:
        memory_pool(const Alloc &src = Alloc()) throw(std::bad_alloc);
        ~memory_pool();
        void recycle();
        void *malloc(size_t size);
        void free(void* ptr);
        void *realloc(void* ptr, size_t size);
    };
    }

D.4.1.1 memory_pool(const Alloc &src = Alloc())
Effects
Constructs a memory pool with an instance of the underlying memory allocator of type Alloc copied from src. Throws a bad_alloc exception if the runtime fails to construct an instance of the class.

D.4.2 fixed_pool Class
Summary
Class for scalable memory allocation from a buffer of fixed size.

Syntax

    class fixed_pool;

Header

    #define TBB_PREVIEW_MEMORY_POOL 1
    #include "tbb/memory_pool.h"

Description
A fixed_pool allocates and frees memory in a way that scales with the number of processors. All the memory available for the allocation is initially passed through arguments of the constructor. A fixed_pool models the Memory Pool concept described in Table 52.

Example

    #define TBB_PREVIEW_MEMORY_POOL 1
    #include "tbb/memory_pool.h"
    ...
    char buf[1024*1024];
    tbb::fixed_pool my_pool(buf, 1024*1024);
    void* my_ptr = my_pool.malloc(10);
    my_pool.free(my_ptr);

The code above provides a simple example of allocation from a fixed pool.
Members

    namespace tbb {
    class fixed_pool : no_copy {
    public:
        fixed_pool(void *buffer, size_t size) throw(std::bad_alloc);
        ~fixed_pool();
        void recycle();
        void *malloc(size_t size);
        void free(void* ptr);
        void *realloc(void* ptr, size_t size);
    };
    }

D.4.2.1 fixed_pool(void *buffer, size_t size)
Effects
Constructs a memory pool to manage the size bytes of memory pointed to by buffer. Throws a bad_alloc exception if the runtime fails to construct an instance of the class.

D.4.3 memory_pool_allocator Template Class
Summary
Template class that provides the C++ allocator interface for memory pools.

Syntax

    template class memory_pool_allocator;

Header

    #define TBB_PREVIEW_MEMORY_POOL 1
    #include "tbb/memory_pool.h"

Description
A memory_pool_allocator models the allocator requirements described in Table 29, except for the default constructor, which is excluded from the class. Instead, it provides a constructor that links it with an instance of the memory_pool or fixed_pool class, which actually allocates and deallocates memory. The class is mainly intended to enable memory pools within STL containers.

Example

    #define TBB_PREVIEW_MEMORY_POOL 1
    #include "tbb/memory_pool.h"
    ...
    typedef tbb::memory_pool_allocator<int> pool_allocator_t;
    // Extra parentheses keep this from being parsed as a function declaration
    std::list<int, pool_allocator_t> my_list( (pool_allocator_t( my_pool )) );

The code above provides a simple example of construction of a container that uses a memory pool.
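The way an allocator adapter forwards container requests to a pool can be sketched in standard C++ alone. Everything below is hypothetical and for illustration only: bump_pool is a deliberately naive stand-in for a fixed-size pool (its free() does nothing, and it is not thread-safe), and pool_allocator is a minimal C++11 allocator, not the memory_pool_allocator class. Like the class described above, the adapter has no default constructor and must be bound to a pool instance.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <new>

// Hypothetical stand-in for a fixed-size pool: a bump allocator over a
// caller-supplied buffer. free() is a no-op; a real pool would reuse memory.
class bump_pool {
    char* cur_;
    char* end_;
public:
    bump_pool(void* buffer, std::size_t size)
        : cur_(static_cast<char*>(buffer)), end_(cur_ + size) {}
    void* malloc(std::size_t n) {
        // Round the request up so later allocations stay aligned,
        // assuming the buffer itself is aligned for std::max_align_t.
        const std::size_t a = alignof(std::max_align_t);
        n = (n + a - 1) / a * a;
        if (n > static_cast<std::size_t>(end_ - cur_)) return nullptr;
        void* p = cur_;
        cur_ += n;
        return p;
    }
    void free(void*) {}
};

// Hypothetical minimal allocator that forwards to a bump_pool.
template <typename T>
class pool_allocator {
    bump_pool* pool_;
public:
    typedef T value_type;
    explicit pool_allocator(bump_pool& pool) : pool_(&pool) {}
    // Converting constructor: lets the container rebind to its node type
    template <typename U>
    pool_allocator(const pool_allocator<U>& other) : pool_(other.pool()) {}
    T* allocate(std::size_t n) {
        void* p = pool_->malloc(n * sizeof(T));
        if (!p) throw std::bad_alloc();
        return static_cast<T*>(p);
    }
    void deallocate(T* p, std::size_t) { pool_->free(p); }
    bump_pool* pool() const { return pool_; }
};

// Two allocators are interchangeable when they share the same pool
template <typename T, typename U>
bool operator==(const pool_allocator<T>& a, const pool_allocator<U>& b) {
    return a.pool() == b.pool();
}
template <typename T, typename U>
bool operator!=(const pool_allocator<T>& a, const pool_allocator<U>& b) {
    return !(a == b);
}
```

A container is then declared as std::list<int, pool_allocator<int> > my_list{pool_allocator<int>(pool)}; the braces avoid the declaration being read as a function declaration, and every list node ends up inside the buffer handed to the pool.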
Members

    namespace tbb {
    template class memory_pool_allocator {
    public:
        typedef T value_type;
        typedef value_type* pointer;
        typedef const value_type* const_pointer;
        typedef value_type& reference;
        typedef const value_type& const_reference;
        typedef size_t size_type;
        typedef ptrdiff_t difference_type;
        template struct rebind {
            typedef memory_pool_allocator other;
        };
        memory_pool_allocator(memory_pool &pool) throw();
        memory_pool_allocator(fixed_pool &pool) throw();
        memory_pool_allocator(const memory_pool_allocator& src) throw();
        template memory_pool_allocator(const memory_pool_allocator& src) throw();
        pointer address(reference x) const;
        const_pointer address(const_reference x) const;
        pointer allocate( size_type n, const void* hint=0);
        void deallocate( pointer p, size_type );
        size_type max_size() const throw();
        void construct( pointer p, const T& value );
        void destroy( pointer p );
    };

    template<> class memory_pool_allocator {
    public:
        typedef void* pointer;
        typedef const void* const_pointer;
        typedef void value_type;
        template struct rebind {
            typedef memory_pool_allocator other;
        };
        memory_pool_allocator(memory_pool &pool) throw();
        memory_pool_allocator(fixed_pool &pool) throw();
        memory_pool_allocator(const memory_pool_allocator& src) throw();
        template memory_pool_allocator(const memory_pool_allocator& src) throw();
    };

    template inline bool operator==( const memory_pool_allocator& a, const memory_pool_allocator& b);
    template inline bool operator!=( const memory_pool_allocator& a, const memory_pool_allocator& b);
    }

D.4.3.1 memory_pool_allocator(memory_pool &pool)
Effects
Constructs memory pool allocator serviced by memory_pool instance pool.

D.4.3.2 memory_pool_allocator(fixed_pool &pool)
Effects
Constructs memory pool allocator serviced by fixed_pool instance pool.

D.5 Serial subset
Summary
A subset of the parallel algorithms is provided for modeling serial execution.
Currently only a serial version of tbb::parallel_for() is available.

D.5.1 tbb::serial::parallel_for()
Header

    #define TBB_PREVIEW_SERIAL_SUBSET 1
    #include "tbb/parallel_for.h"

Motivation
Sometimes it is useful, for example while debugging, to execute certain parallel_for() invocations serially while having other invocations of parallel_for() executed in parallel.

Description
The tbb::serial::parallel_for function implements the tbb::parallel_for API using a serial implementation underneath. Users who want sequential execution of a certain parallel_for() invocation need to define the TBB_PREVIEW_SERIAL_SUBSET macro before including parallel_for.h and prefix the selected parallel_for() with tbb::serial::.

Internally, the serial implementation uses the same principle of recursive decomposition, but instead of spawning tasks, it does recursion "for real", i.e. the body function calls itself twice with two halves of its original range.

Example

    #define TBB_PREVIEW_SERIAL_SUBSET 1
    #include "tbb/parallel_for.h"

    void Foo() {
        // . . .
        tbb::serial::parallel_for( . . . );
        tbb::parallel_for( . . . );
        // . . .
    }

Intel® Threading Building Blocks Design Patterns
Document Number 323512-005US
World Wide Web: http://www.intel.com

Legal Information
INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL(R) PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR. Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information. The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order. Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/#/en_US_01 0H . Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. See http://www.intel.com/products/processor_number for details. BlueMoon, BunnyPeople, Celeron, Celeron Inside, Centrino, Centrino Inside, Cilk, Core Inside, E-GOLD, i960, Intel, the Intel logo, Intel AppUp, Intel Atom, Intel Atom Inside, Intel Core, Intel Inside, Intel Insider, the Intel Inside logo, Intel NetBurst, Intel NetMerge, Intel NetStructure, Intel SingleDriver, Intel SpeedStep, Intel Sponsors of Tomorrow., the Intel Sponsors of Tomorrow. 
logo, Intel StrataFlash, Intel vPro, Intel XScale, InTru, the InTru logo, the InTru Inside logo, InTru soundmark, Itanium, Itanium Inside, MCS, MMX, Moblin, Pentium, Pentium Inside, Puma, skoool, the skoool logo, SMARTi, Sound Mark, The Creators Project, The Journey Inside, Thunderbolt, Ultrabook, vPro Inside, VTune, Xeon, Xeon Inside, X-GOLD, XMM, X-PMU and XPOSYS are trademarks of Intel Corporation in the U.S. and/or other countries.* Other names and brands may be claimed as the property of others. Copyright (C) 2010 - 2011, Intel Corporation. All rights reserved. Introduction Design Patterns iii Optimization Notice Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804 Revision History Version Version Information Date 1.05 Updated the Optimization Notice. 2011-Oct-27 1.04 Added Optimization Notice. 2011-Aug-1 1.02 Correct lazy initialization examples. 2010-Sep-7 1.01 Change enqueue_self to enqeue. 2010-May-25 1.00 Initial version. 
2010-Apr-4

Contents
1 Introduction
2 Agglomeration
3 Elementwise
4 Odd-Even Communication
5 Wavefront
6 Reduction
7 Divide and Conquer
8 GUI Thread
9 Non-Preemptive Priorities
10 Local Serializer
11 Fenced Data Transfer
12 Lazy Initialization
13 Reference Counting
14 Compare and Swap Loop
General References

1 Introduction
This document is a “cookbook” of some common parallel
programming patterns and how to implement them in Intel® Threading Building Blocks (Intel® TBB). A cookbook will not make you a great chef, but provides a collection of recipes that others have found useful.

Like most cookbooks, this document assumes that you know how to use basic tools. The Intel® Threading Building Blocks (Intel® TBB) Tutorial is a good place to learn the basic tools. This document is a guide to which tools to use when.

A design pattern description is much more than a rote coding recipe. The description of each pattern has the following format:
• Problem – describes the problem to be solved.
• Context – describes contexts in which the problem arises.
• Forces – considerations that drive use of the pattern.
• Solution – describes how to implement the pattern.
• Example – presents an example implementation.

Variations and examples are sometimes discussed. The code examples are intended to emphasize key points and are not full-fledged code. Examples may omit obvious const overloads of non-const methods.

Much of the nomenclature and examples are adapted from Web pages created by Eun-Gyu Kim and Marc Snir, and the Berkeley parallel patterns wiki. See links in the General References section.

For brevity, some of the code examples use C++0x lambda expressions. It is straightforward, albeit sometimes tedious, to translate such lambda expressions into equivalent C++98 code. See the section "Lambda Expressions" in the Intel® TBB tutorial on how to enable lambda expressions in the Intel® Compiler or how to do the translation by hand.

2 Agglomeration
Problem
Parallelism is so fine grained that the overhead of parallel scheduling or communication swamps the useful work.

Context
Many algorithms permit parallelism at a very fine grain, on the order of a few instructions per task. But synchronization between threads usually requires orders of magnitude more cycles.
For example, elementwise addition of two arrays can be done fully in parallel, but if each scalar addition is scheduled as a separate task, most of the time will be spent doing synchronization instead of useful addition.

Forces
• Individual computations can be done in parallel, but are small. For practical use of Intel® Threading Building Blocks (Intel® TBB), "small" here means less than 10,000 clock cycles.
• The parallelism is for the sake of performance and not required for semantic reasons.

Solution
Group the computations into blocks. Evaluate computations within a block serially.

The block size should be chosen to be large enough to amortize parallel overhead. Too large a block size may limit parallelism or load balancing because the number of blocks becomes too small to distribute work evenly across processors.

The choice of block topology is typically driven by two concerns:
• Minimizing synchronization between blocks.
• Minimizing cache traffic between blocks.

If the computations are completely independent, then the blocks will be independent too, and then only cache traffic issues must be considered.

If the loop is “small”, on the order of less than 10,000 clock cycles, then it may be impractical to parallelize at all, because the optimal agglomeration might be a single block.

Examples
Intel® TBB loop templates such as tbb::parallel_for that take a range argument support automatic agglomeration.

When agglomerating, think about cache effects. Avoid having cache lines cross between groups if possible.

There may be boundary to interior ratio effects. For example, if the computations form a 2D grid, and communicate only with nearest neighbors, then the computation per block grows quadratically (with the block’s area), but the cross-block communication grows linearly (with the block’s perimeter). Figure 1 shows four different ways to agglomerate an 8×8 grid.
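The boundary to interior ratio can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function name boundary_to_interior is hypothetical, and it assumes that work is proportional to the block's area in cells and communication to its perimeter in cell edges.

```cpp
#include <cassert>

// Hypothetical helper: ratio of boundary length to interior work for a
// w-by-h block of grid cells, assuming work ~ area and communication ~ perimeter.
double boundary_to_interior(int w, int h) {
    return 2.0 * (w + h) / (static_cast<double>(w) * h);
}
```

For blocks of 16 cells, a square 4×4 block gives a ratio of 1.0, while an elongated 2×8 block gives 1.25, so under these assumptions the square block communicates less per unit of work.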
If doing such analysis, be careful to consider that information is transferred in cache line units. For a given area, the perimeter may be minimized when the block is square with respect to the underlying grid of cache lines, not square with respect to the logical grid.

Figure 1: Four different agglomerations of an 8×8 grid.

Also consider vectorization. Blocks that contain long contiguous subsets of data may better enable vectorization.

For recursive computations, most of the work is towards the leaves, so the solution is to treat subtrees as groups, as shown in Figure 2.

Figure 2: Agglomeration of a recursive computation

Often such an agglomeration is achieved by recursing serially once some threshold is reached. For example, a recursive sort might solve sub-problems in parallel only if they are above a certain threshold size.

Reference
Ian Foster introduced the term "agglomeration" in his book Designing and Building Parallel Programs. There agglomeration is part of a four step “PCAM” design method:
1. Partitioning – break the program into the smallest tasks possible.
2. Communication – figure out what communication is required between tasks. When using Intel® TBB, communication is usually cache line transfers. Though they are automatic, understanding which ones happen between tasks helps guide the agglomeration step.
3. Agglomeration – combine tasks into larger tasks. His book has an extensive list of considerations that is worth reading.
4. Mapping – map tasks onto processors. The Intel® TBB task scheduler does this step for you.

3 Elementwise
Problem
Initiate similar independent computations across items in a data set, and wait until all complete.

Context
Many serial algorithms sweep over a set of items and do an independent computation on each item. However, if some kind of summary information is collected, use the Reduction pattern instead.
Forces
No information is carried or merged between the computations.

Solution
If the number of items is known in advance, use tbb::parallel_for. If not, consider using tbb::parallel_do.

Use agglomeration if the individual computations are small relative to scheduler overheads.

If the pattern is followed by a reduction on the same data, consider doing the elementwise operation as part of the reduction, so that the combination of the two patterns is accomplished in a single sweep instead of two sweeps. Doing so may improve performance by reducing traffic through the memory hierarchy.

Example
Convolution is often used in signal processing. The convolution of a filter c and signal x is computed as:

    $y_i = \sum_j c_j x_{i-j}$

Serial code for this computation might look like:

    // Assumes c[0..clen-1] and x[1-clen..xlen-1] are defined
    for( int i=0; i<xlen+clen-1; ++i ) {
        float tmp = 0;
        for( int j=0; j<clen; ++j )
            tmp += c[j]*x[i-j];
        y[i] = tmp;
    }

[...]

    tbb::parallel_for(
        tbb::blocked_range<int>(0,xlen+clen-1,1000),
        [=]( tbb::blocked_range<int> r ) {
            int end = r.end();
            for( int i=r.begin(); i!=end; ++i ) {
                float tmp = 0;
                for( int j=0; j<clen; ++j )
                    tmp += c[j]*x[i-j];
                y[i] = tmp;
            }
        }
    );

[...] by Eun-Gyu Kim and Marc Snir describes the pattern.

5 Wavefront
Problem
Perform computations on items in a data set, where the computation on an item uses results from computations on predecessor items. See the reference for a discussion.

Context
The dependences between computations form an acyclic graph.

Forces
• Dependence constraints between items form an acyclic graph.
• The number of immediate predecessors in the graph is known in advance, or can be determined some time before the last predecessor completes.

Solution
The solution is a parallel variant of topological sorting, using tbb::parallel_do to process items. Associate an atomic counter with each item. Initialize each counter to the number of predecessors. Invoke tbb::parallel_do to process the items that have no predecessors (have counts of zero). After an item is processed, decrement the counters of its successors.
If a successor's counter reaches zero, add that successor to the tbb::parallel_do via a "feeder".

If the number of predecessors for an item cannot be determined in advance, treat the information "know number of predecessors" as an additional predecessor. When the number of predecessors becomes known, treat this conceptual predecessor as completed.

If the overhead of counting individual items is excessive, aggregate items into blocks, and do the wavefront over the blocks.

Example
Below is a serial kernel for the longest common subsequence algorithm. The parameters are strings x and y with respective lengths xlen and ylen.

    int F[MAX_LEN+1][MAX_LEN+1];

    void SerialLCS( const char* x, size_t xlen, const char* y, size_t ylen ) {
        for( size_t i=1; i<=xlen; ++i )
            for( size_t j=1; j<=ylen; ++j )
                F[i][j] = x[i-1]==y[j-1] ? F[i-1][j-1]+1 : max(F[i][j-1],F[i-1][j]);
    }

The kernel sets F[i][j] to the length of the longest common subsequence shared by x[0..i-1] and y[0..j-1]. It assumes that F[0][0..ylen] and F[0..xlen][0] have already been initialized to zero.

Figure 3 shows the data dependences for calculating F[i][j].

    F[i-1][j-1]    F[i-1][j]
    F[i][j-1]      F[i][j]

Figure 3: Data dependences for longest common subsequence calculation.

As Figure 4 shows, the gray diagonal dependence is the transitive closure of other dependences. Thus for parallelization purposes it is a redundant dependence that can be ignored.

    F[i-1][j-1]    F[i-1][j]
    F[i][j-1]      F[i][j]

Figure 4: Diagonal dependence is redundant.

It is generally good to remove redundant dependences from consideration, because the atomic counting incurs a cost for each dependence considered.

Another consideration is grain size. Scheduling each F[i][j] element calculation separately is prohibitively expensive. A good solution is to aggregate the elements into contiguous blocks, and process the contents of a block serially. The blocks have the same dependence pattern, but at a block scale.
Hence scheduling overheads can be amortized over blocks.

The parallel code follows. Each block consists of N×N elements. Each block has an associated atomic counter. Array Count organizes these counters for easy lookup. The code initializes the counters and then rolls a wavefront using parallel_do, starting with the block at the origin since it has no predecessors.

    const int N = 64;
    tbb::atomic<char> Count[MAX_LEN/N+1][MAX_LEN/N+1];

    void ParallelLCS( const char* x, size_t xlen, const char* y, size_t ylen ) {
        // Initialize predecessor counts for blocks.
        size_t m = (xlen+N-1)/N;
        size_t n = (ylen+N-1)/N;
        for( size_t i=0; i<m; ++i )
            for( size_t j=0; j<n; ++j )
                Count[i][j] = (i>0)+(j>0);
        // Roll the wavefront from the origin.
        typedef pair<size_t,size_t> block;
        block origin(0,0);
        tbb::parallel_do( &origin, &origin+1,
            [=]( const block& b, tbb::parallel_do_feeder<block>& feeder ) {
                // Extract bounds on block
                size_t bi = b.first;
                size_t bj = b.second;
                size_t xl = N*bi+1;
                size_t xu = min(xl+N,xlen+1);
                size_t yl = N*bj+1;
                size_t yu = min(yl+N,ylen+1);
                // Process the block
                for( size_t i=xl; i<xu; ++i )
                    for( size_t j=yl; j<yu; ++j )
                        F[i][j] = x[i-1]==y[j-1] ? F[i-1][j-1]+1 : max(F[i][j-1],F[i-1][j]);
                // Account for successors: decrement each successor's counter
                // and feed the successor once its counter reaches zero.
                if( bj+1<n && --Count[bi][bj+1]==0 )
                    feeder.add( block(bi,bj+1) );
                if( bi+1<m && --Count[bi+1][bj]==0 )
                    feeder.add( block(bi+1,bj) );
            }
        );
    }

[...] by Eun-Gyu Kim and Marc Snir.

6 Reduction

Problem
Perform an associative reduction operation across a data set.

Context
Many serial algorithms sweep over a set of items to collect summary information.

Forces
The summary can be expressed as an associative operation over the data set, or at least is close enough to associative that reassociation does not matter.

Solution
Two solutions exist in Intel® Threading Building Blocks (Intel® TBB). The choice of which to use depends upon several considerations:
• Is the operation commutative as well as associative?
• Are instances of the reduction type expensive to construct and destroy? For example, a floating-point number is inexpensive to construct. A sparse floating-point matrix might be very expensive to construct.

Use tbb::parallel_reduce when the objects are inexpensive to construct.
It works even if the reduction operation is not commutative. The Intel® TBB Tutorial describes how to use tbb::parallel_reduce for basic reductions.

Use tbb::parallel_for and tbb::combinable if the reduction operation is commutative and instances of the type are expensive.

If the operation is not precisely associative but a precisely deterministic result is required, use recursive reduction and parallelize it using tbb::parallel_invoke.

Examples
The examples presented here illustrate the various solutions and some tradeoffs.

The first example uses tbb::parallel_reduce to do a + reduction over a sequence of values of type T. The sequence is defined by a half-open interval [first,last).

    T AssocReduce( const T* first, const T* last, T identity ) {
        return tbb::parallel_reduce(
            // Index range for reduction
            tbb::blocked_range<const T*>(first,last),
            // Identity element
            identity,
            // Reduce a subrange and partial sum
            [&]( tbb::blocked_range<const T*> r, T partial_sum )->T {
                return std::accumulate( r.begin(), r.end(), partial_sum );
            },
            // Reduce two partial sums
            std::plus<T>()
        );
    }

The third and fourth arguments to this form of parallel_reduce are a built-in form of the agglomeration pattern. If there is an elementwise action to be performed before the reduction, incorporating it into the third argument (reduction of a subrange) may improve performance because of better locality of reference.

The second example assumes that + is commutative on T. It is a good solution when T objects are expensive to construct.

    T CombineReduce( const T* first, const T* last, T identity ) {
        // combinable takes an initializer functor that produces each thread's starting value.
        tbb::combinable<T> sum( [=]{ return identity; } );
        tbb::parallel_for(
            tbb::blocked_range<const T*>(first,last),
            [&]( tbb::blocked_range<const T*> r ) {
                sum.local() += std::accumulate(r.begin(), r.end(), identity);
            }
        );
        return sum.combine( []( const T& x, const T& y ) {return x+y;} );
    }

Sometimes it is desirable to destructively use the partial results to generate the final result.
For example, if the partial results are lists, they can be spliced together to form the final result. In that case use class tbb::enumerable_thread_specific instead of combinable. The ParallelFindCollisions example in Chapter 7 demonstrates the technique.

Floating-point addition and multiplication are almost associative. Reassociation can cause changes because of rounding effects. The techniques shown so far reassociate terms non-deterministically. Fully deterministic parallel reduction for a not quite associative operation requires using deterministic reassociation. The code below demonstrates this in the form of a template that does a + reduction over a sequence of values of type T.

    template<typename T>
    T RepeatableReduce( const T* first, const T* last, T identity ) {
        if( last-first<=1000 ) {
            // Use serial reduction
            return std::accumulate( first, last, identity );
        } else {
            // Do parallel divide-and-conquer reduction
            const T* mid = first+(last-first)/2;
            T left, right;
            tbb::parallel_invoke(
                [&]{left=RepeatableReduce(first,mid,identity);},
                [&]{right=RepeatableReduce(mid,last,identity);}
            );
            return left+right;
        }
    }

The outer if-else is an instance of the agglomeration pattern for recursive computations. The reduction graph, though not a strict binary tree, is fully deterministic. Thus the result will always be the same for a given input sequence, assuming all threads do identical floating-point rounding.

The final example shows how a problem that typically is not viewed as a reduction can be parallelized by viewing it as a reduction. The problem is retrieving floating-point exception flags for a computation across a data set.
The serial code might look something like:

    feclearexcept(FE_ALL_EXCEPT);
    for( int i=0; i<N; ++i )
        C[i] = A[i]/B[i];
    int flags = fetestexcept(FE_ALL_EXCEPT);
    if (flags & FE_DIVBYZERO) ...;
    if (flags & FE_OVERFLOW) ...;
    ...

[...]

    struct ComputeChunk {
        int flags;  // Floating-point exceptions seen so far.
        ComputeChunk() : flags(0) { feclearexcept(FE_ALL_EXCEPT); }
        // Splitting constructor, called by parallel_reduce when it splits a range.
        ComputeChunk( const ComputeChunk&, tbb::split ) : flags(0) { feclearexcept(FE_ALL_EXCEPT); }
        // Operates on one subrange and collects its floating-point exception flags.
        void operator()( tbb::blocked_range<int>& r ) {
            int end=r.end();
            for( int i=r.begin(); i!=end; ++i )
                C[i] = A[i]/B[i];
            // It is critical to do |= here, not =, because otherwise we
            // might lose earlier exceptions from the same thread.
            flags |= fetestexcept(FE_ALL_EXCEPT);
        }
        // Called by parallel_reduce when joining results from two subranges.
        void join( ComputeChunk& other ) {
            flags |= other.flags;
        }
    };

Then invoke it as follows:

    // Construction of cc implicitly resets FP exception state.
    ComputeChunk cc;
    tbb::parallel_reduce( tbb::blocked_range<int>(0,N), cc );
    if (cc.flags & FE_DIVBYZERO) ...;
    if (cc.flags & FE_OVERFLOW) ...;
    ...

7 Divide and Conquer

Problem
Parallelize a divide and conquer algorithm.

Context
Divide and conquer is widely used in serial algorithms. Common examples are quicksort and mergesort.

Forces
• The problem can be transformed into subproblems that can be solved independently.
• Splitting the problem or merging solutions is relatively cheap compared to the cost of solving the subproblems.

Solution
There are several ways to implement divide and conquer in Intel® Threading Building Blocks (Intel® TBB). The best choice depends upon circumstances.
• If division always yields the same number of subproblems, use recursion and tbb::parallel_invoke.
• If the number of subproblems varies, use recursion and tbb::task_group.
• If ultimate efficiency and scalability are important, use tbb::task and continuation passing style.

Example
Quicksort is a classic divide-and-conquer algorithm. It divides a sorting problem into two subsorts. A simple serial version looks like (see footnote 1):

    void SerialQuicksort( T* begin, T* end ) {
        // Footnote 1: Production quality quicksort implementations typically use more
        // sophisticated pivot selection, explicit stacks instead of recursion, and some
        // other sorting algorithm for small subsorts.
        // The simple algorithm is used here to focus on exposition of the parallel pattern.
        if( end-begin>1 ) {
            using namespace std;
            T* mid = partition( begin+1, end, bind2nd(less<T>(),*begin) );
            swap( *begin, mid[-1] );
            SerialQuicksort( begin, mid-1 );
            SerialQuicksort( mid, end );
        }
    }

The number of subsorts is fixed at two, so tbb::parallel_invoke provides a simple way to parallelize it. The parallel code is shown below:

    void ParallelQuicksort( T* begin, T* end ) {
        if( end-begin>1 ) {
            using namespace std;
            T* mid = partition( begin+1, end, bind2nd(less<T>(),*begin) );
            swap( *begin, mid[-1] );
            tbb::parallel_invoke(
                [=]{ParallelQuicksort( begin, mid-1 );},
                [=]{ParallelQuicksort( mid, end );}
            );
        }
    }

Eventually the subsorts become small enough that serial execution is more efficient. The following variation sorts ranges of fewer than 500 elements using the earlier serial code. The changed parts are the threshold test and the else branch.

    void ParallelQuicksort( T* begin, T* end ) {
        if( end-begin>=500 ) {
            using namespace std;
            T* mid = partition( begin+1, end, bind2nd(less<T>(),*begin) );
            swap( *begin, mid[-1] );
            tbb::parallel_invoke(
                [=]{ParallelQuicksort( begin, mid-1 );},
                [=]{ParallelQuicksort( mid, end );}
            );
        } else {
            SerialQuicksort( begin, end );
        }
    }

The change is an instance of the Agglomeration pattern.

The next example considers a problem where there are a variable number of subproblems. The problem involves a tree-like description of a mechanical assembly. There are two kinds of nodes:
• Leaf nodes represent individual parts.
• Internal nodes represent groups of parts.

The problem is to find all nodes that collide with a target node. The following code shows a serial solution that walks the tree. It records in Hits any nodes that collide with Target.
    std::list<Node*> Hits;
    Node* Target;

    void SerialFindCollisions( Node& x ) {
        if( x.is_leaf() ) {
            if( x.collides_with( *Target ) )
                Hits.push_back(&x);
        } else {
            for( Node::const_iterator y=x.begin(); y!=x.end(); ++y )
                SerialFindCollisions(*y);
        }
    }

A parallel version is shown below.

    typedef tbb::enumerable_thread_specific<std::list<Node*> > LocalList;
    LocalList LocalHits;
    Node* Target;    // Target node

    void ParallelWalk( Node& x ) {
        if( x.is_leaf() ) {
            if( x.collides_with( *Target ) )
                LocalHits.local().push_back(&x);
        } else {
            // Recurse on each child y of x in parallel
            tbb::task_group g;
            for( Node::const_iterator y=x.begin(); y!=x.end(); ++y )
                g.run( [=]{ParallelWalk(*y);} );
            // Wait for recursive calls to complete
            g.wait();
        }
    }

    void ParallelFindCollisions( Node& x ) {
        ParallelWalk(x);
        for(LocalList::iterator i=LocalHits.begin(); i!=LocalHits.end(); ++i)
            Hits.splice( Hits.end(), *i );
    }

The recursive walk is parallelized using class task_group to do recursive calls in parallel.

There is another significant change because of the parallelism that is introduced. Because it would be unsafe to update Hits concurrently, the parallel walk uses variable LocalHits to accumulate results. Because it is of type enumerable_thread_specific, each thread accumulates its own private result. The results are spliced together into Hits after the walk completes.

The results will not be in the same order as the original serial code.

If parallel overhead is high, use the agglomeration pattern. For example, use the serial walk for subtrees under a certain threshold.

8 GUI Thread

Problem
A user interface thread must remain responsive to user requests, and must not get bogged down in long computations.

Context
Graphical user interfaces often have a dedicated thread (“GUI thread”) for servicing user interactions.
The thread must remain responsive to user requests even while the application has long computations running. For example, the user might want to press a “cancel” button to stop the long running computation. If the GUI thread takes part in the long running computation, it will not be able to respond to user requests.

Forces
• The GUI thread services an event loop.
• The GUI thread needs to offload work onto other threads without waiting for the work to complete.
• The GUI thread must be responsive to the event loop and not become dedicated to doing the offloaded work.

Related
• Non-Preemptive Priorities
• Local Serializer

Solution
The GUI thread offloads the work by firing off a task to do it using method task::enqueue. When finished, the task posts an event to the GUI thread to indicate that the work is done. The semantics of enqueue cause the task to eventually run on a worker thread distinct from the calling thread. The method is a new feature in Intel® Threading Building Blocks (Intel® TBB) 3.0.

Figure 5 sketches the communication paths. Items in black are executed by the GUI thread; items in blue are executed by another thread.

Figure 5: GUI Thread pattern. (Elements: message loop, task::enqueue, task::execute, post event.)

Example
The example is for the Microsoft Windows* operating systems, though similar principles apply to any GUI using an event loop idiom. For each event, the GUI thread calls a user-defined function WndProc to process the event.

    // Event posted from enqueued task when it finishes its work.
    const UINT WM_POP_FOO = WM_USER+0;

    // Queue for transmitting results from enqueued task to GUI thread.
    tbb::concurrent_queue<Foo> ResultQueue;

    // GUI thread’s private copy of most recently computed result.
    Foo CurrentResult;

    LRESULT CALLBACK WndProc(HWND hWnd, UINT msg, WPARAM wParam, LPARAM lParam) {
        switch(msg) {
            case WM_COMMAND:
                switch (LOWORD(wParam)) {
                    case IDM_LONGRUNNINGWORK:
                        // User requested a long computation. Delegate it to another thread.
                        LaunchLongRunningWork(hWnd);
                        break;
                    case IDM_EXIT:
                        DestroyWindow(hWnd);
                        break;
                    default:
                        return DefWindowProc(hWnd, msg, wParam, lParam);
                }
                break;
            case WM_POP_FOO:
                // There is another result in ResultQueue for me to grab.
                ResultQueue.try_pop(CurrentResult);
                // Update the window with the latest result.
                RedrawWindow( hWnd, NULL, NULL, RDW_ERASE|RDW_INVALIDATE );
                break;
            case WM_PAINT:
                // Repaint the window using CurrentResult
                break;
            case WM_DESTROY:
                PostQuitMessage(0);
                break;
            default:
                return DefWindowProc( hWnd, msg, wParam, lParam );
        }
        return 0;
    }

The GUI thread processes long computations as follows:
1. The GUI thread calls LaunchLongRunningWork, which hands off the work to a worker thread and immediately returns.
2. The GUI thread continues servicing the event loop. If it has to repaint the window, it uses the value of CurrentResult, which is the most recent Foo that it has seen. When a worker finishes the long computation, it pushes the result into ResultQueue, and posts a message WM_POP_FOO to the GUI thread.
3. The GUI thread services a WM_POP_FOO message by popping an item from ResultQueue into CurrentResult. The try_pop always succeeds because there is exactly one WM_POP_FOO message for each item in ResultQueue.

Routine LaunchLongRunningWork creates a root task and launches it using method task::enqueue. The task is a root task because it has no successor task waiting on it.

    class LongTask: public tbb::task {
        HWND hWnd;
        tbb::task* execute() {
            // Do long computation
            Foo x = /* result of long computation */;
            ResultQueue.push( x );
            // Notify GUI thread that result is available.
            PostMessage(hWnd,WM_POP_FOO,0,0);
            return NULL;
        }
    public:
        LongTask( HWND hWnd_ ) : hWnd(hWnd_) {}
    };

    void LaunchLongRunningWork( HWND hWnd ) {
        LongTask* t = new( tbb::task::allocate_root() ) LongTask(hWnd);
        tbb::task::enqueue(*t);
    }

It is essential to use method task::enqueue and not method task::spawn. The reason is that method enqueue ensures that the task eventually executes when resources permit, even if no thread explicitly waits on the task. In contrast, method spawn may postpone execution of the task until it is explicitly waited upon.

The example uses a concurrent_queue for workers to communicate results back to the GUI thread. Since only the most recent result matters in the example, an alternative would be to use a shared variable protected by a mutex. However, doing so would block the worker while the GUI thread was holding a lock on the mutex, and vice versa. Using concurrent_queue provides a simple robust solution.

If two long computations are in flight, there is a chance that the first computation completes after the second one. If displaying the result of the most recently requested computation is important, then associate a request serial number with the computation. The GUI thread can pop from ResultQueue into a temporary variable, check the serial number, and update CurrentResult only if doing so advances the serial number.

See Non-Preemptive Priorities for how to implement priorities. See Local Serializer for how to force serial ordering of certain tasks.

9 Non-Preemptive Priorities

Problem
Choose the next work item to do, based on priorities.

Context
The scheduler in Intel® Threading Building Blocks (Intel® TBB) chooses tasks using rules based on scalability concerns. The rules are based on the order in which tasks were spawned or enqueued, and are oblivious to the contents of tasks.
However, sometimes it is best to choose work based on some kind of priority relationship.

Forces
• Given multiple work items, there is a rule for which item should be done next that is not the default Intel® TBB rule.
• Preemptive priorities are not necessary. If a higher priority item appears, it is not necessary to immediately stop lower priority items in flight. If preemptive priorities are necessary, then non-preemptive tasking is inappropriate. Use threads instead.

Solution
Put the work in a shared work pile. Decouple tasks from specific work, so that task execution chooses the actual piece of work to be selected from the pile.

Example
The following example implements three priority levels. The user interface for it and top-level implementation follow:

    enum Priority {
        P_High,
        P_Medium,
        P_Low
    };

    template<typename Func>
    void EnqueueWork( Priority p, Func f ) {
        WorkItem* item = new ConcreteWorkItem<Func>( p, f );
        ReadyPile.add(item);
    }

The caller provides a priority p and a functor f to routine EnqueueWork. The functor may be the result of a lambda expression. EnqueueWork packages f as a WorkItem and adds it to global object ReadyPile.

Class WorkItem provides a uniform interface for running functors of unknown type:

    // Abstract base class for a prioritized piece of work.
    class WorkItem {
    public:
        WorkItem( Priority p ) : priority(p) {}
        // Derived class defines the actual work.
        virtual void run() = 0;
        const Priority priority;
    };

    template<typename Func>
    class ConcreteWorkItem: public WorkItem {
        Func f;
        /*override*/ void run() {
            f();
            delete this;
        }
    public:
        ConcreteWorkItem( Priority p, const Func& f_ ) :
            WorkItem(p), f(f_)
        {}
    };

Class ReadyPileType contains the core pattern.
It maintains a collection of work and fires off tasks that choose work from the collection:

    class ReadyPileType {
        // One queue for each priority level
        tbb::concurrent_queue<WorkItem*> level[P_Low+1];
    public:
        void add( WorkItem* item ) {
            level[item->priority].push(item);
            tbb::task::enqueue(*new(tbb::task::allocate_root()) RunWorkItem);
        }
        void runNextWorkItem() {
            // Scan queues in priority order for an item.
            WorkItem* item=NULL;
            for( int i=P_High; i<=P_Low; ++i )
                if( level[i].try_pop(item) )
                    break;
            assert(item);
            item->run();
        }
    };

    ReadyPileType ReadyPile;

The task enqueued by add(item) does not necessarily execute that item. The task executes runNextWorkItem(), which may find a higher priority item. There is one task for each item, but the mapping resolves when the task actually executes, not when it is created.

Here are the details of class RunWorkItem:

    class RunWorkItem: public tbb::task {
        /*override*/tbb::task* execute(); // Private override of virtual method
    };
    ...
    tbb::task* RunWorkItem::execute() {
        ReadyPile.runNextWorkItem();
        return NULL;
    }

RunWorkItem objects are fungible. They enable the Intel® TBB scheduler to choose when to do a work item, not which work item to do. The override of virtual method task::execute is private because all calls to it are dispatched via base class task.

Other priority schemes can be implemented by changing the internals of ReadyPileType. A priority queue could be used to implement very fine-grained priorities.

The scalability of the pattern is limited by the scalability of ReadyPileType. Ideally, scalable concurrent containers should be used for it.

10 Local Serializer

Context
Consider an interactive program. To maximize concurrency and responsiveness, operations requested by the user can be implemented as tasks. The order of operations can be important. For example, suppose the program presents editable text to the user.
There might be operations to select text and delete selected text. Reversing the order of “select” and “delete” operations on the same buffer would be bad. However, commuting operations on different buffers might be okay. Hence the goal is to establish serial ordering of tasks associated with a given object, but not constrain ordering of tasks between different objects.

Forces
• Operations associated with a certain object must be performed in serial order.
• Serializing with a lock would be wasteful because threads would be waiting at the lock when they could be doing useful work elsewhere.

Solution
Sequence the work items using a FIFO (first-in first-out) structure. Always keep an item in flight if possible. If no item is in flight when a work item appears, put the item in flight. Otherwise, push the item onto the FIFO. When the current item in flight completes, pop another item from the FIFO and put it in flight.

The logic can be implemented without mutexes, by using concurrent_queue for the FIFO and atomic<int> to count the number of items waiting and in flight. The example explains the accounting in detail.

Example
The following example builds on the Non-Preemptive Priorities example to implement local serialization in addition to priorities. It implements three priority levels and local serializers. The user interface for it follows:

    enum Priority {
        P_High,
        P_Medium,
        P_Low
    };

    template<typename Func>
    void EnqueueWork( Priority p, Func f, Serializer* s=NULL );

Template function EnqueueWork causes functor f to run when the three constraints in Table 1 are met.

Table 1: Implementation of Constraints

    Constraint                                          Resolved by class...
    Any prior work for the Serializer has completed.    Serializer
    A thread is available.                              RunWorkItem
    No higher priority work is ready to run.            ReadyPileType

Constraints on a given functor are resolved from top to bottom in the table. The first constraint does not exist when s is NULL.
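The counting scheme from the Solution, which enforces the first constraint in Table 1, can be sketched without Intel TBB. The following is only an illustration under stated assumptions, not the implementation used in this example: MiniSerializer and serializer_preserves_order are hypothetical names, a mutex-guarded std::queue stands in for tbb::concurrent_queue (the standard library has no concurrent queue), std::atomic<int> stands in for the atomic counter, and a ready item runs directly on the thread that finds it ready instead of being forwarded to a ready pile.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical mini-serializer: runs added items one at a time, in FIFO order.
class MiniSerializer {
    std::mutex queue_mutex;                  // guards fifo only (assumption: stand-in for a concurrent queue)
    std::queue<std::function<void()>> fifo;  // items waiting or about to run
    std::atomic<int> count{0};               // queued items plus the item in flight

    // Run items on the calling thread until the count drops to zero.
    void runUntilIdle() {
        do {
            std::function<void()> f;
            {
                std::lock_guard<std::mutex> lock(queue_mutex);
                f = std::move(fifo.front());
                fifo.pop();
            }
            f();
            // Any transition other than 1 -> 0 means another item is waiting.
        } while( --count!=0 );
    }
public:
    void add( std::function<void()> f ) {
        {
            std::lock_guard<std::mutex> lock(queue_mutex);
            fifo.push(std::move(f));
        }
        // A transition 0 -> 1 means nothing was in flight, so this thread starts running.
        if( ++count==1 ) runUntilIdle();
    }
};

// Several producers add items that append to an unsynchronized vector.
// The appends are safe only because the serializer never runs two items at once.
bool serializer_preserves_order( int nthreads, int nitems ) {
    MiniSerializer s;
    std::vector<std::pair<int,int>> log;
    std::vector<std::thread> producers;
    for( int t=0; t<nthreads; ++t )
        producers.emplace_back( [&s,&log,t,nitems] {
            for( int k=0; k<nitems; ++k )
                s.add( [&log,t,k]{ log.push_back(std::make_pair(t,k)); } );
        } );
    for( auto& p : producers ) p.join();
    if( log.size()!=static_cast<std::size_t>(nthreads)*nitems ) return false;
    // Each producer's items must appear in the order that producer added them.
    std::vector<int> next(nthreads,0);
    for( const auto& e : log ) {
        if( e.second!=next[e.first] ) return false;
        ++next[e.first];
    }
    return true;
}
```

Pushing the item before incrementing the counter is what guarantees that a thread observing a non-zero count always finds something in the queue. Running items directly on the discovering thread is the simplest variant; it trades fairness for locality.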
The implementation of EnqueueWork packages the functor in a SerializedWorkItem and routes it to the class that enforces the first relevant constraint between pieces of work.

    template<typename Func>
    void EnqueueWork( Priority p, Func f, Serializer* s=NULL ) {
        WorkItem* item = new SerializedWorkItem<Func>( p, f, s );
        if( s )
            s->add(item);
        else
            ReadyPile.add(item);
    }

A SerializedWorkItem is derived from a WorkItem, which serves as a way to pass around a prioritized piece of work without knowing further details of the work.

    // Abstract base class for a prioritized piece of work.
    class WorkItem {
    public:
        WorkItem( Priority p ) : priority(p) {}
        // Derived class defines the actual work.
        virtual void run() = 0;
        const Priority priority;
    };

    template<typename Func>
    class SerializedWorkItem: public WorkItem {
        Serializer* serializer;
        Func f;
        /*override*/ void run() {
            f();
            Serializer* s = serializer;
            // Destroy f before running Serializer’s next functor.
            delete this;
            if( s )
                s->noteCompletion();
        }
    public:
        SerializedWorkItem( Priority p, const Func& f_, Serializer* s ) :
            WorkItem(p), serializer(s), f(f_)
        {}
    };

Base class WorkItem is the same as class WorkItem in the example for Non-Preemptive Priorities. The notion of serial constraints is completely hidden from the base class, thus permitting the framework to be extended to other kinds of constraints or lack of constraints. Class SerializedWorkItem is essentially ConcreteWorkItem from the other example, extended with a Serializer aspect.

Virtual method run() is invoked when it becomes time to run the functor. It performs three steps:
1. Run the functor.
2. Destroy the functor.
3. Notify the Serializer that the functor completed, thus unconstraining the next waiting functor.

Step 3 is the difference from the operation of ConcreteWorkItem::run. Step 2 could be done after step 3 in some contexts to increase concurrency slightly.
However, the presented order is recommended because if step 2 takes non-trivial time, it likely has side effects that should complete before the next functor runs.

Class Serializer implements the core of the Local Serializer pattern:

    class Serializer {
        tbb::concurrent_queue<WorkItem*> queue;
        tbb::atomic<int> count;    // Count of queued items and in-flight item
        void moveOneItemToReadyPile() {    // Transfer item from queue to ReadyPile
            WorkItem* item;
            queue.try_pop(item);
            ReadyPile.add(item);
        }
    public:
        void add( WorkItem* item ) {
            queue.push(item);
            if( ++count==1 )
                moveOneItemToReadyPile();
        }
        void noteCompletion() {    // Called when WorkItem completes.
            if( --count!=0 )
                moveOneItemToReadyPile();
        }
    };

The class maintains two members:
• A queue of WorkItem waiting for prior work to complete.
• A count of queued or in-flight work.

Mutexes are avoided by using concurrent_queue and atomic<int> along with careful ordering of operations. The transitions of count are the key to understanding how class Serializer works.
• If method add increments count from 0 to 1, this indicates that no other work is in flight and thus the work should be moved to the ReadyPile.
• If method noteCompletion decrements count and it is not from 1 to 0, then the queue is non-empty and another item in the queue should be moved to ReadyPile.

Class ReadyPileType is explained in the example for Non-Preemptive Priorities.

If priorities are not necessary, there are two variations on method moveOneItemToReadyPile, with different implications.
• The method could directly invoke item->run(). This approach has relatively low overhead and high thread locality for a given Serializer. But it is unfair. If the Serializer has a continual stream of tasks, the thread operating on it will keep servicing those tasks to the exclusion of others.
• The method could invoke task::enqueue to enqueue a task that invokes item->run().
Doing so introduces higher overhead and less locality than the first approach, but avoids starvation.

The conflict between fairness and maximum locality is fundamental. The best resolution depends upon circumstance.

The pattern generalizes to constraints on work items more general than those maintained by class Serializer. A generalized Serializer::add determines if a work item is unconstrained, and if so, runs it immediately. A generalized Serializer::noteCompletion runs all previously constrained items that have become unconstrained by the completion of the current work item. The term “run” means to run the work immediately or, if there are more constraints, to forward the work to the next constraint resolver.

11 Fenced Data Transfer

Problem
Write a message to memory and have another processor read it on hardware that does not have a sequentially consistent memory model.

Context
The problem normally arises only when unsynchronized threads concurrently act on a memory location, or are using reads and writes to create synchronization. High level synchronization constructs normally include mechanisms that prevent unwanted reordering.

Modern hardware and compilers can reorder memory operations in a way that preserves the order of a thread's operations from its own viewpoint, but not as observed by other threads. A common serial idiom is to write a message and mark it as ready to read, as shown in the following code:

    bool Ready;
    std::string Message;

    void Send( const std::string& src ) {    // Executed by thread 1
        Message=src;
        Ready = true;
    }

    bool Receive( std::string& dst ) {    // Executed by thread 2
        bool result = Ready;
        if( result ) dst=Message;
        return result;    // Return true if message was received.
    }

Two key assumptions of the code are:
a. Ready does not become true until Message is written.
b. Message is not read until Ready becomes true.

These assumptions are trivially true on uniprocessor hardware.
However, they may break on multiprocessor hardware. Reordering by the hardware or compiler can cause the sender's writes to appear out of order to the receiver (thus breaking condition a) or the receiver's reads to appear out of order (thus breaking condition b).

Forces
• Creating synchronization via raw reads and writes.

Related
• Lazy Initialization

Solution
Change the type of the flag that indicates when the message is ready from bool to tbb::atomic<bool>. Here is the previous example with that change; only the declaration of Ready differs.

    tbb::atomic<bool> Ready;
    std::string Message;

    void Send( const std::string& src ) {    // Executed by thread 1
        Message=src;
        Ready = true;
    }

    bool Receive( std::string& dst ) {    // Executed by thread 2
        bool result = Ready;
        if( result ) dst=Message;
        return result;    // Return true if message was received.
    }

A write to a tbb::atomic value has release semantics, which means that all of its prior writes will be seen before the releasing write. A read from a tbb::atomic value has acquire semantics, which means that all of its subsequent reads will happen after the acquiring read. The implementation of tbb::atomic ensures that both the compiler and the hardware observe these ordering constraints.

Variations
Higher level synchronization constructs normally include the necessary acquire and release fences. For example, mutexes are normally implemented such that acquisition of a lock has acquire semantics and release of a lock has release semantics. Thus a thread that acquires a lock on a mutex always sees any memory writes done by another thread before it released a lock on that mutex.

Non Solutions
Mistaken solutions are so often proposed that it is worth understanding why they are wrong.

One common mistake is to assume that declaring the flag with the volatile keyword solves the problem.
Though the volatile keyword forces a write to happen immediately, it generally has no effect on the visible ordering of that write with respect to other memory operations. An exception to this rule is the Intel® Itanium® processor family, whose processors by convention assign acquire semantics to volatile reads and release semantics to volatile writes.

Another mistake is to assume that conditionally executed code cannot happen before the condition is tested. However, the compiler or hardware may speculatively hoist the conditional code above the condition.

Similarly, it is a mistake to assume that a processor cannot read the target of a pointer before reading the pointer. A modern processor does not read individual values from main memory. It reads cache lines. The target of a pointer may be in a cache line that has already been read before the pointer was read, thus giving the appearance that the processor presciently read the pointer target.

12 Lazy Initialization

Problem
Perform an initialization the first time it is needed.

Context
Initializing data structures lazily is a common technique. Not only does it avoid the cost of initializing unused data structures, it is often a more convenient way to structure a program.

Forces
• Threads share access to an object.
• The object should not be created until the first access.

The second force covers several possible motivations:
• The object is expensive to create and creating it early would slow down program startup.
• It is not used in every run of the program.
• Early initialization would require adding code where it is undesirable for readability or structural reasons.

Related
• Fenced Data Transfer

Solutions
A parallel solution is substantially trickier, because it must deal with several concurrency issues.
Races: If two threads attempt to access the object simultaneously for the first time, and thus cause creation of the object, the race must be resolved in a way that both threads end up with a reference to the same object of type T.

Memory leaks: In the event of a race, the implementation must ensure that any extra transient T objects are cleaned up.

Memory consistency: If thread X executes value=new T(), all other threads must see the stores made by new T() as occurring before the assignment to value.

Deadlock: What if the constructor of T() requires acquiring a lock, but the current holder of that lock is also racing to access the object for the first time?

There are two solutions. One is based on double-check locking. The other relies on compare-and-swap. Because the tradeoffs and issues are subtle, most of the discussion is in the following examples section.

Examples
An Intel® TBB implementation of the “double-check” pattern is shown below:

    template<typename T, typename Mutex>
    class lazy {
        tbb::atomic<T*> value;
        Mutex mut;
    public:
        lazy() : value() {} // Initializes value to NULL
        ~lazy() {delete value;}
        T& get() {
            if( !value ) { // Read of value has acquire semantics.
                typename Mutex::scoped_lock lock(mut);
                if( !value ) value = new T(); // Write of value has release semantics
            }
            return *value;
        }
    };

The name comes from the way that the pattern deals with races. There is one check done without locking and one check done after locking. The first check handles the presumably common case that the initialization has already been done, without any locking. The second check deals with cases where two threads both see an uninitialized value, and both try to acquire the lock. In that case, the second thread to acquire the lock will see that the initialization has already occurred.

If T() throws an exception, the solution is still correct, because value will still be NULL and the mutex is unlocked when the object lock is destroyed.

The solution correctly addresses memory consistency issues.
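For readers working with standard C++11 facilities rather than tbb::atomic, the double-check pattern described above might be sketched as follows. This is only an illustration under stated assumptions: the class name lazy_dcl is hypothetical, and std::atomic (with explicit acquire and release ordering) and std::mutex are swapped in for tbb::atomic and the Mutex parameter of the original template.

```cpp
#include <atomic>
#include <mutex>

// Hypothetical C++11 analogue of the lazy<T> template; not part of Intel TBB.
// std::atomic and std::mutex stand in for tbb::atomic and Mutex.
template<typename T>
class lazy_dcl {
    std::atomic<T*> value{nullptr};
    std::mutex mut;
public:
    lazy_dcl() = default;
    lazy_dcl(const lazy_dcl&) = delete;
    lazy_dcl& operator=(const lazy_dcl&) = delete;
    ~lazy_dcl() { delete value.load(std::memory_order_relaxed); }

    T& get() {
        // First check, done without locking. The acquire load pairs with
        // the release store below, so a non-null pointer implies that the
        // fully constructed T is visible to this thread.
        T* p = value.load(std::memory_order_acquire);
        if (!p) {
            std::lock_guard<std::mutex> lock(mut);
            // Second check, done while holding the lock. The mutex already
            // orders this load after any earlier store made under the lock.
            p = value.load(std::memory_order_relaxed);
            if (!p) {
                p = new T();  // if this throws, value stays null and the lock is released
                value.store(p, std::memory_order_release);
            }
        }
        return *p;
    }
};
```

A caller would simply write `lazy_dcl<Widget> w; w.get();` for some default-constructible type Widget (a placeholder name); every call to get() returns a reference to the same object.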
A write to a tbb::atomic value has release semantics, which means that all of the writing thread's prior writes will be seen before the releasing write. A read from a tbb::atomic value has acquire semantics, which means that all of the reading thread's subsequent reads will happen after the acquiring read. Both of these properties are critical to the solution. The releasing write ensures that the construction of T() is seen to occur before the assignment to value. The acquiring read ensures that when the caller reads from *value, the reads occur after the "if(!value)" check. The release/acquire is essentially the Fenced Data Transfer pattern, where the “message” is the fully constructed instance T(), and the “ready” flag is the pointer value.

The solution described involves blocking threads while initialization occurs. Hence it can suffer the usual pathologies associated with blocking. For example, if the thread that first acquires the lock is suspended by the OS, all other threads will have to wait until that thread resumes. A lock-free variation avoids this problem by making all contending threads attempt initialization, and atomically deciding which attempt succeeds.

An Intel® TBB implementation of the non-blocking variant follows. It also uses double-check, but without a lock.

    template<typename T>
    class lazy {
        tbb::atomic<T*> value;
    public:
        lazy() : value() {} // Initializes value to NULL
        ~lazy() {delete value;}
        T& get() {
            if( !value ) {
                T* tmp = new T();
                if( value.compare_and_swap(tmp,NULL)!=NULL )
                    // Another thread installed the value, so throw away mine.
                    delete tmp;
            }
            return *value;
        }
    };

The second check is performed by the expression value.compare_and_swap(tmp,NULL)!=NULL, which conditionally assigns value=tmp if value==NULL, and evaluates to true if the old value was not NULL, that is, if another thread had already installed its object. Thus if multiple threads attempt simultaneous initialization, the first thread to execute the compare_and_swap will set value to point to its T object.
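The lock-free variant can be sketched with std::atomic in the same hedged spirit. The class name lazy_cas is hypothetical, and std::atomic is an assumed substitute for tbb::atomic. Note the difference in calling convention: tbb::atomic's compare_and_swap takes the new value first and returns the old value, whereas std::atomic's compare_exchange_strong takes the expected value (by reference) first and returns a bool.

```cpp
#include <atomic>

// Hypothetical C++11 analogue of the non-blocking lazy<T>; not part of Intel TBB.
template<typename T>
class lazy_cas {
    std::atomic<T*> value{nullptr};
public:
    lazy_cas() = default;
    lazy_cas(const lazy_cas&) = delete;
    lazy_cas& operator=(const lazy_cas&) = delete;
    ~lazy_cas() { delete value.load(std::memory_order_relaxed); }

    T& get() {
        // First check: acquire load, so a non-null result implies the
        // constructed object is visible to this thread.
        T* p = value.load(std::memory_order_acquire);
        if (!p) {
            T* tmp = new T();        // every contender builds a candidate
            T* expected = nullptr;
            // Second check: install tmp only if value is still null.
            if (value.compare_exchange_strong(expected, tmp,
                                              std::memory_order_acq_rel,
                                              std::memory_order_acquire)) {
                p = tmp;             // this thread's object was installed
            } else {
                delete tmp;          // another thread won; discard the candidate
                p = expected;        // expected now holds the winner's pointer
            }
        }
        return *p;
    }
};
```

One design consequence is worth keeping in mind: under contention several T objects may be constructed and all but one destroyed, so this variant suits types whose construction is cheap and free of side effects.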
Other contenders that execute the compare_and_swap will get back a non-NULL pointer, and know that they should delete their transient T objects.

As with the locking solution, memory consistency issues are addressed by the semantics of tbb::atomic. The first check has acquire semantics and the compare_and_swap has both acquire and release semantics.

Reference
A sophisticated way to avoid the acquire fence for a read is Mike Burrows's algorithm.

13 Reference Counting

Problem
Destroy an object when it will no longer be used.

Context
Often it is desirable to destroy an object when it is known that it will not be used in the future. Reference counting is a common serial solution that extends to parallel programming if done carefully.

Forces
• If there are cycles of references, basic reference counting is insufficient unless the cycle is explicitly broken.
• Atomic counting is relatively expensive in hardware.

Solution
Thread-safe reference counting is like serial reference counting, except that the increment/decrement is done atomically, and the decrement and test "count is zero?" must act as a single atomic operation. The following example uses tbb::atomic to achieve this.

    template<typename T>
    class counted {
        tbb::atomic<size_t> my_count;
        T value;
    public:
        // Construct object with a single reference to it.
        counted() {my_count=1;}
        // Add reference
        void add_ref() {++my_count;}
        // Remove reference. Return true if it was the last reference.
        bool remove_ref() {return --my_count==0;}
        // Get reference to underlying object
        T& get() {
            assert(my_count>0);
            return value;
        }
    };

It is incorrect to use a separate read for testing if the count is zero. The following code would be an incorrect implementation of method remove_ref() because two threads might both execute the decrement, and then both read my_count as zero.
Hence two callers would both be told incorrectly that they had removed the last reference.

    --my_count;
    return my_count==0; // WRONG!

The decrement may need to have a release fence so that any pending writes complete before the object is deleted.

There is no simple way to atomically copy a pointer and increment its reference count, because there will be a timing hole between the copying and the increment where the reference count is too low, and thus another thread might decrement the count to zero and delete the object. Two ways to address the problem are “hazard pointers” and “pass the buck”. See the references at the end of this chapter for details.

Variations
Atomic increment/decrement can be more than an order of magnitude more expensive than ordinary increment/decrement. The serial optimization of eliminating redundant increment/decrement operations becomes more important with atomic reference counts.

Weighted reference counting can be used to reduce costs if the pointers are unshared but the referent is shared. Associate a weight with each pointer. The reference count is the sum of the weights. A pointer x can be copied as a pointer x' without updating the reference count by splitting the original weight between x and x'. If the weight of x is too low to split, then first add a constant W to the reference count and weight of x.

References
D. Bacon and V.T. Rajan, “Concurrent Cycle Collection in Reference Counted Systems” in Proc. European Conf. on Object-Oriented Programming (June 2001). Describes a garbage collector based on reference counting that does collect cycles.

M. Michael, “Hazard Pointers: Safe Memory Reclamation for Lock-Free Objects” in IEEE Transactions on Parallel and Distributed Systems (June 2004). Describes the “hazard pointer” technique.

M. Herlihy, V. Luchangco, and M.
Moir, “The Repeat Offender Problem: A Mechanism for Supporting Dynamic-Sized, Lock-Free Data Structures” in Proceedings of the 16th International Symposium on Distributed Computing (Oct. 2002). Describes the “pass the buck” technique.

14 Compare and Swap Loop

Problem
Atomically update a scalar value so that a predicate is satisfied.

Context
Often a shared variable must be updated atomically, by a transform that maps its old value to a new value. The transform might be a transition of a finite state machine, or recording global knowledge. For instance, the shared variable might be recording the maximum value that any thread has seen so far.

Forces
• The variable is read and updated by multiple threads.
• The hardware implements “compare and swap” for a variable of that type.
• Protecting the update with a mutex is to be avoided.

Related
Reduction
Reference Counting

Solution
The solution is to atomically snapshot the current value, and then use atomic<T>::compare_and_swap to update it. Retry until the compare_and_swap succeeds. In some cases it may be possible to exit before the compare_and_swap succeeds because the current value meets some condition.

The template below does the update x=F(x) atomically.

    // Atomically perform x=F(x).
    template<typename F, typename T>
    void AtomicUpdate( atomic<T>& x, F f ) {
        T o;
        do {
            // Take a snapshot
            o = x;
            // Attempt to install new value computed from snapshot
        } while( x.compare_and_swap(f(o),o)!=o );
    }

It is critical to take a snapshot and use it for intermediate calculations, because the value of x may be changed by other threads in the meantime. Note that the snapshot o is declared outside the loop so that it is in scope in the loop condition, and that compare_and_swap takes the new value first and the expected old value second.

The following code shows how the template might be used to maintain a global maximum of any value seen by RecordMax.
    // Atomically perform UpperBound = max(UpperBound,y)
    void RecordMax( int y ) {
        extern atomic<int> UpperBound;
        AtomicUpdate(UpperBound, [&](int value){return std::max(value,y);} );
    }

When y is not going to increase UpperBound, the call to AtomicUpdate will waste time doing the redundant operation compare_and_swap(o,o). In general, this kind of redundancy can be eliminated by making the loop in AtomicUpdate exit early if F(o)==o. In this particular case where F==std::max, that test can be further simplified. The following custom version of RecordMax has the simplified test.

    // Atomically perform UpperBound = max(UpperBound,y)
    void RecordMax( int y ) {
        extern atomic<int> UpperBound;
        int o;
        do {
            // Take a snapshot
            o = UpperBound;
            // Quit if snapshot meets condition.
            if( o>=y ) break;
            // Attempt to install new value.
        } while( UpperBound.compare_and_swap(y,o)!=o );
    }

Because all participating threads modify a common location, the performance of a compare and swap loop can be poor under high contention. Thus the applicability of more efficient patterns should be considered first. In particular:
• If the overall purpose is a reduction, use the Reduction pattern instead.
• If the update is addition or subtraction, use atomic<T>::fetch_and_add. If the update is addition or subtraction by one, use atomic<T>::operator++ or atomic<T>::operator--. These methods typically employ direct hardware support that avoids a compare and swap loop.

CAUTION: If you use compare_and_swap to update links in a linked structure, be sure you understand whether the “ABA problem” is an issue. See the Internet for discourses on the subject.

General References
This section lists general references. References specific to a pattern are listed at the end of the chapter for the pattern.
• E. Gamma, R. Helm, R. Johnson, J. Vlissides. Design Patterns (1995).
• Berkeley Pattern Language for Parallel Programming, http://parlab.eecs.berkeley.edu/wiki/patterns
• T. Mattson, B. Sanders, B.
Massingill. Patterns for Parallel Programming (2005).
• ParaPLoP 2009, http://www.upcrc.illinois.edu/workshops/paraplop09/program.html
• ParaPLoP 2010, http://www.upcrc.illinois.edu/workshops/paraplop10/program.html
• Eun-Gyu Kim and Marc Snir, “Parallel Programming Patterns”, http://www.cs.illinois.edu/homes/snir/PPP/index.html

Intel® Parallel Studio 2011 SP1 Installation Guide and Release Notes
Document number: 321604-003US
24 July 2011

Table of Contents
1 Introduction
1.1 Change History
1.2 Product Contents
1.3 What’s New
1.4 System Requirements
1.5 Documentation
1.6 Samples
1.7 Technical Support
2 Installation
2.1 Pre-Installation Steps
2.1.1 Configure Microsoft Visual Studio for 64-bit Applications
2.1.2 Installation on Microsoft Windows Vista* or Windows 7*
2.2 Installation
2.2.1 Activation of Purchase after Evaluation Using the Intel Activation Tool
2.3 Installation Folders
2.4 Installation Known Issues
2.4.1 Installation Path Too Long or Filename Too Long
2.4.2 Additional Steps to Install Documentation for Microsoft Visual Studio 2010
2.4.3 Error Message "HelpLibAgent.exe has stopped working" When Uninstalling Intel Parallel Studio 2011
2.4.4 Unicode Characters in License Path
2.4.5 Documentation Issue with Multiple Visual Studio Versions
3 Disclaimer and Legal Information

1 Introduction
This document describes system requirements and how to install Intel® Parallel Studio 2011 SP1. Additional release notes for each component, with details of changes and additional technical information, can be found after installation, in the respective components’ Documentation folder.
First-time users should read “Intel® Parallel Studio Getting Started” by clicking on the “Getting Started” link at the lower left of the install window, or read The Intel® Parallel Studio Getting Started Tutorial that is available after installation at Start > All Programs > Intel Parallel Studio 2011 > Getting Started > Parallel Studio Getting Started Tutorial.

1.1 Change History
This section highlights important changes in product updates.

Update 2
• Intel® Parallel Composer 2011 Update 3
• Intel® Parallel Amplifier 2011 Update 2
• Intel® Parallel Advisor 2011 Update 2
• Intel® Parallel Inspector 2011 Update 2
• Corrections to reported problems

Update 1
• Intel® Parallel Composer 2011 Update 1
• Intel® Parallel Amplifier 2011 Update 1
• Intel® Parallel Advisor 2011 Update 1
• Intel® Parallel Inspector 2011 Update 1
• Corrections to reported problems

Product Release
• Initial product release

1.2 Product Contents
Intel® Parallel Studio 2011 SP1 includes the following components:
• Intel® Parallel Composer 2011 Update 6
• Intel® Parallel Inspector 2011 Update 3
• Intel® Parallel Amplifier 2011 Update 3
• Intel® Parallel Advisor 2011 Update 3
• Integration into Microsoft* development environments
• Sample programs
• On-disk documentation

1.3 What’s New
For details on what is new in the product components, please see the individual components’ release notes.

1.4 System Requirements
For an explanation of architecture names, see http://software.intel.com/en-us/articles/intelarchitecture-platform-terminology/
• A PC based on an IA-32 or Intel® 64 architecture processor supporting the Intel® Streaming SIMD Extensions 2 (Intel® SSE2) instructions (Intel® Pentium 4 processor or later, or compatible non-Intel processor)
  o Incompatible or proprietary instructions in non-Intel processors may cause the analysis capabilities of this product to function incorrectly.
    Any attempt to analyze code not supported by Intel® processors may lead to failures in this product.
  o For the best experience, a multi-core or multi-processor system is recommended
• 2GB RAM
• 4GB free disk space for all product features and architectures
• Microsoft Windows XP*, Microsoft Windows Vista*, Microsoft Windows 7* (32-bit or “x64” editions), or Microsoft Windows HPC Server 2008* (“x64” edition only). Embedded editions of any of these operating systems are not supported.
• When installed on Microsoft Windows Server 2008, one of:
  o Microsoft Visual Studio 2010* with C++ and “x64 Compiler and Tools” components installed [1]
  o Microsoft Visual Studio 2008* Standard Edition (or higher edition) SP1 with C++ and “x64 Compiler and Tools” components installed [1]
• When installed on Microsoft Windows XP, Windows Vista or Windows Server 2003, one of:
  o Microsoft Visual Studio 2010* with C++ and “x64 Compiler and Tools” components installed [1]
  o Microsoft Visual Studio 2008* Standard Edition (or higher edition) with C++ and “x64 Compiler and Tools” components installed [1]
  o Microsoft Visual Studio 2005* Standard Edition (or higher edition) with C++ and “x64 Compiler and Tools” components installed [1]
• Application coding requirements:
  o Programming Language: C or C++ (native, not managed code) [4]
  o Threading methodologies supported by the analysis tools:
    - Intel® Cilk™ Plus
    - Intel® Threading Building Blocks
    - Win32* Threads
    - OpenMP* [4]
• To read the on-disk documentation, Adobe Reader* 7.0 or later

Notes:
1. Microsoft Visual Studio 2005 and 2008 Standard Edition install the “x64 Compiler and Tools” component by default; the Professional and higher editions require a “Custom” install to select it. Microsoft Visual Studio 2010 includes x64 support by default.
2. The default for the Intel® compilers is to build IA-32 architecture applications that require a processor supporting the Intel® SSE2 instructions, for example, the Intel® Pentium® 4 processor. A compiler option is available to generate code that will run on any IA-32 architecture processor. However, if your application uses Intel® Integrated Performance Primitives or Intel® Threading Building Blocks, executing the application will require a processor supporting the Intel® SSE2 instructions.
3. Applications built with Intel® Parallel Composer can be run on the same Windows versions as specified above for development. Applications may also run on non-embedded 32-bit versions of Microsoft Windows earlier than Windows XP, though Intel does not test these for compatibility. Your application may depend on a Win32 API routine not present in older versions of Windows. You are responsible for testing application compatibility. You may need to copy certain run-time DLLs onto the target system to run your application.
4. The analysis tools support analysis of applications built with Intel® Parallel Composer, Intel® C++ Compiler version 10.0 or higher, and/or Microsoft Visual C++ 2005, 2008 or 2010. Applications that use OpenMP and are built with the Microsoft compiler must link to the OpenMP “compatibility library” as supplied by an Intel compiler.

1.5 Documentation
Product documentation for each component of Intel® Parallel Studio SP1 can be found in the component’s folder. In addition, “Getting Started” documentation can be found in the Documentation folder under Parallel Studio 2011.

Optimization Notice
Intel compilers, associated libraries and associated development tools may include or utilize options that optimize for instruction sets that are available in both Intel and non-Intel microprocessors (for example SIMD instruction sets), but do not optimize equally for non-Intel microprocessors.
In addition, certain compiler options for Intel compilers, including some that are not specific to Intel micro-architecture, are reserved for Intel microprocessors. For a detailed description of Intel compiler options, including the instruction sets and specific microprocessors they implicate, please refer to the “Intel Compiler User and Reference Guides” under “Compiler Options." Many library routines that are part of Intel compiler products are more highly optimized for Intel microprocessors than for other microprocessors. While the compilers and libraries in Intel compiler products offer optimizations for both Intel and Intel-compatible microprocessors, depending on the options you select, your code and other factors, you likely will get extra performance on Intel microprocessors.

Intel compilers, associated libraries and associated development tools may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include Intel® Streaming SIMD Extensions 2 (Intel® SSE2), Intel® Streaming SIMD Extensions 3 (Intel® SSE3), and Supplemental Streaming SIMD Extensions 3 (SSSE3) instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors.

While Intel believes our compilers and libraries are excellent choices to assist in obtaining the best performance on Intel and non-Intel microprocessors, Intel recommends that you evaluate other compilers and libraries to determine which best meet your requirements. We hope to win your business by striving to offer the best performance of any compiler or library; please let us know if you find we do not.
Notice revision #20110307

1.6 Samples
A series of samples to help introduce you to Intel® Parallel Studio 2011 SP1 can be found in the Samples folder. The samples are provided as a ZIP archive which should be unpacked to a writable folder of your choice. Each component has additional samples under its respective folder.

1.7 Technical Support
If you did not register your product during installation, please do so at the Intel® Software Development Products Registration Center. Registration entitles you to free technical support, product updates and upgrades for the duration of the support term.

For information about how to find Technical Support, Product Updates, User Forums, FAQs, tips and tricks, and other support information, please visit http://www.intel.com/software/products/support

Note: If your distributor provides technical support for this product, please contact them for support rather than Intel.

2 Installation

2.1 Pre-Installation Steps

2.1.1 Configure Microsoft Visual Studio for 64-bit Applications
If you are using Microsoft Visual Studio 2005* or 2008 and will be developing 64-bit applications (for the Intel® 64 architecture) you may need to change the configuration of Visual Studio to add 64-bit support.

If you are using Visual Studio 2005/2008 Standard Edition, or Visual Studio 2010 Professional Edition or higher, no configuration is needed to build Intel® 64 architecture applications. For other editions:
1. From Control Panel > Add or Remove Programs, select “Microsoft Visual Studio 2005” (or 2008) > Change/Remove. The Visual Studio Maintenance Mode window will appear. Click Next.
2. Click Add or Remove Features.
3. Under “Select features to install”, expand Language Tools > Visual C++.
4. If the box “X64 Compiler and Tools” is not checked, check it, then click Update. If the box is already checked, click Cancel.
2.1.2 Installation on Microsoft Windows Vista* or Windows 7*
On Microsoft Windows Vista or Windows 7, Microsoft Visual Studio 2005 users should install Visual Studio 2005 Service Pack 1 (VS 2005 SP1) as well as the Visual Studio 2005 Service Pack 1 Update for Windows Vista, which is linked to from the VS 2005 SP1 page. After installing these updates, you must ensure that Visual Studio runs with Administrator permissions, otherwise you will be unable to use the Intel compiler. For more information, please see Microsoft's Visual Studio on Windows Vista page (http://msdn2.microsoft.com/enus/vstudio/aa948853.aspx) and related documents.

2.2 Installation
The installation of the product requires a valid license file or serial number. If you are evaluating the product, you can also choose the “Evaluate this product (no serial number required)” option during installation.

To begin installation, insert the first product DVD in your computer’s DVD-ROM drive; the installation should start automatically. If it does not, open the top-level folder of the DVD-ROM drive in Windows Explorer and double-click on setup.exe.

If you received your product as a downloadable file, double-click on the executable file (.EXE) to begin installation.

You do not need to uninstall previous versions or updates before installing a newer version; the new version will replace the older version.

2.2.1 Activation of Purchase after Evaluation Using the Intel Activation Tool
Note for evaluation customers: a new tool, the Intel Activation Tool (“ActivationTool.exe”), is included in this product release and installed at “[Common Files]\Intel\Parallel Studio 2011\Activation\”.

If you installed the product using an Evaluation license or serial number, or using the “Evaluate this product (no serial number required)” option during installation, and then purchased the product, you can activate your purchase using the Intel Activation Tool at Start > All Programs > Intel Parallel Studio 2011 > Product Activation.
It will convert your evaluation software to a fully licensed product.

2.3 Installation Folders
The product installs into a folder arrangement as shown below. Not all folders will be present in a given installation. If other Intel® Parallel Studio tools are installed, they will share the top-level installation folder.
• C:\Program Files\Intel\Parallel Studio 2011\
  o Documentation
  o Samples
  o Advisor
  o Amplifier
  o Composer SP1
  o Inspector

If you are installing on a system with a non-English language version of Windows, the name of the Program Files folder may be different. On Intel® 64 architecture systems, the folder name is Program Files (X86) or the equivalent.

2.4 Installation Known Issues

2.4.1 Installation Path Too Long or Filename Too Long
During installation, if the length of the full installation path of any installed file including the filename exceeds 256 characters, the installation will stop with an error message. One possible error message is:

    Error 1304. Error writing to file: d:\Program Files\Development Tools\Intel\Parallel Studio 2011\Composer\Documentation\en_US\ipp\ipp_manual\IPPI\ippi_ch16\functn_YCrCb411ToYCbCr422_EdgeDV_YCrCb411ToYCbCr422_ZoomOut2_EdgeDV_YCrCb411ToYCbCr422_ZoomOut4_EdgeDV_YCrCb411ToYCbCr422_ZoomOut8_EdgeDV.htm

This can occur because the user has specified a long custom installation root directory. Try shortening this path if you run into this error. Note that this may require reinstallation of other Parallel Studio 2011 SP1 products.

2.4.2 Additional Steps to Install Documentation for Microsoft Visual Studio 2010
When installing Intel Parallel Studio 2011 SP1 on a system with Microsoft Visual Studio 2010 for the first time, you will be asked to initialize the “Local Store” for documentation for Visual Studio 2010 if it was not done before.
The "Help Library Manager" will register the Intel Parallel Studio 2011 SP1 help documentation within Visual Studio 2010. Please follow the instructions of the "Help Library Manager" installation wizard to install the Intel Parallel Studio 2011 SP1 help documentation for Visual Studio 2010.

This step is only needed once. When you install Intel Parallel Studio updates in the future, you will not be required to re-register the documentation through the “Help Library Manager”.

2.4.3 Error Message "HelpLibAgent.exe has stopped working" When Uninstalling Intel Parallel Studio 2011
When installing or uninstalling Intel Parallel Studio 2011 SP1 on a system with Visual Studio 2010, you may see the error message “HelpLibAgent.exe has stopped working”. This error does not prevent the installation or uninstallation of Intel Parallel Studio. It is an issue with a third-party tool. When there is a fix, the Release Notes will be updated. Please visit http://software.intel.com/en-us/articles/installation-error-helplibagentexe-has-stopped-workingwhen-uninstalling-intel-parallel-studio-2011/ for the latest update on this issue.

2.4.4 Unicode Characters in License Path
During installation, Intel® software cannot handle Unicode characters in license paths and the names of the licenses. Intel® software tries to find licenses in the standard location (%CommonProgramFiles%\Intel\Licenses, most commonly C:\Program Files\Common Files\Intel\Licenses on 32-bit systems and C:\Program Files (x86)\Common Files\Intel\Licenses on 64-bit systems).

Do not place licenses in folders or paths containing localized characters. For example: C:\????\?????.

Do not rename licenses obtained from Intel using localized characters. For example ???????.lic.

Do not set the INTEL_LICENSE_FILE environment variable to contain directory paths and license names containing localized characters.
Keep licenses either in the standard location (see above), or use ASCII characters in directory names and license names. For example: C:\Intel\Licenses and License.lic.

2.4.5 Documentation Issue with Multiple Visual Studio Versions
If you have both Microsoft Visual Studio* 2005 and 2008 installed on your system and integrate Intel® Parallel Studio 2011 SP1 into both versions, removing the integration from one of the versions will remove the integrated Intel® Parallel Studio documentation from both. To re-install the documentation:

For Intel® Parallel Composer 2011:
1. Use the Control Panel to select the product.
   • For Windows XP* users: Select Control Panel > Add/Remove Programs.
   • For Windows 7* users: Select Control Panel > Programs and Features.
   • For Windows Vista* users: Select Control Panel > Programs.
2. With the product selected, click the Change/Remove button and choose Modify mode.
3. In the Select Components dialog box, unselect “Integrated Documentation”; this will remove the documentation.
4. Repeat steps 1 and 2.
5. In the Select Components dialog box, select “Integrated Documentation” to install the documentation again.

For Intel® Parallel Advisor 2011, Intel® Parallel Amplifier 2011, Intel® Parallel Inspector 2011:

First option:
1. Open the Intel® Parallel Studio command prompt (Start Menu\Programs\Intel Parallel Studio 2011\Command Prompt; you can choose any shortcut here, for example, “IA-32 Visual Studio 2005 mode”).
2. Remove the integration for the Visual Studio version that is missing integrated help. For example:
   • “ampl-vsreg -d 2005” for removing the Amplifier integration with VS2005
   • “insp-vsreg -d 2008” for removing the Inspector integration with VS2008
   • “advi-vsreg -d 2005” for removing the Advisor integration with VS2005
3. Restore the integration. For example:
   • “ampl-vsreg -i 2005” for adding the Amplifier integration with VS2005
   • “insp-vsreg -i 2008” for adding the Inspector integration with VS2008
   • “advi-vsreg -i 2005” for adding the Advisor integration with VS2005

Second option:
1. Uninstall the product.
2. Install it again with the desired Visual Studio integration selected.

3 Disclaimer and Legal Information
INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL(R) PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR.

Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications.
Current characterized errata are available on request. Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.

Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or by visiting Intel's Web Site.

Celeron, Centrino, Intel, Intel logo, Intel386, Intel486, Intel Atom, Intel Core, Itanium, MMX, Pentium, VTune, and Xeon are trademarks of Intel Corporation in the U.S. and other countries.

* Other names and brands may be claimed as the property of others.

Copyright © 2011 Intel Corporation. All Rights Reserved.

Intel® Math Kernel Library
Vector Statistical Library Notes

Document Number: 310714-023US

Copyright © 2003–2011, Intel Corporation. All Rights Reserved

Contents

1 Legal Information
2 Revision History
3 About This Library
4 About This Document
  4.1 Conventions
5 Introduction
6 Randomness and Scientific Experiment
7 Random Numbers
8 Figures of Merit for Random Number Generators
  8.1 Uniform Probability Distribution and Basic Pseudo- and Quasi-Random Number Generators
  8.2 Figures of Merit for General (Non-Uniform) Distribution Generators
9 VSL Structure
  9.1 Why Vector Type Generators?
  9.2 Basic Generators
  9.3 Random Streams and RNGs in Parallel Computation
    9.3.1 Initializing Basic Generator
    9.3.2 Creating and Initializing Random Streams
    9.3.3 Creating Random Stream Copy and Copying Stream State
    9.3.4 Saving and Restoring Random Streams
    9.3.5 Independent Streams. Leapfrogging and Block-Splitting
    9.3.6 Abstract Basic Random Number Generators. Abstract Streams
  9.4 Generating Methods for Random Numbers of Non-Uniform Distribution
    9.4.1 Inverse Transformation
    9.4.2 Acceptance/Rejection
    9.4.3 Mixture of Distributions
    9.4.4 Special Properties
  9.5 Accurate and Fast Modes of Random Number Generation
  9.6 Example of VSL Use
10 Testing of Basic Random Number Generators
  10.1 BRNG Implementations and Categories
    10.1.1 First Category
    10.1.2 Second Category
    10.1.3 Third Category
  10.2 Interpreting Test Results
    10.2.1 One-Level (Threshold) Testing
    10.2.2 Two-Level Testing
  10.3 BRNG Tests Description
    10.3.1 3D Spheres Test
    10.3.2 Birthday Spacing Test
    10.3.3 Bitstream Test
    10.3.4 Rank of 31x31 Binary Matrices Test
    10.3.5 Rank of 32x32 Binary Matrices Test
    10.3.6 Rank of 6x8 Binary Matrices Test
    10.3.7 Count-the-1's Test (Stream of Bits)
    10.3.8 Count-the-1's Test (Stream of Specific Bytes)
    10.3.9 Craps Test
    10.3.10 Parking Lot Test
    10.3.11 2D Self-Avoiding Random Walk Test
    10.3.12 Template Test
  10.4 BRNG Properties and Testing Results
    10.4.1 MCG31m1
    10.4.2 R250
    10.4.3 MRG32k3a
    10.4.4 MCG59
    10.4.5 WH
    10.4.6 MT19937
    10.4.7 SFMT19937
    10.4.8 MT2203
    10.4.9 SOBOL
    10.4.10 NIEDERREITER
11 Testing of Distribution Random Number Generators
  11.1 Interpreting Test Results
  11.2 Description of Distribution Generator Tests
    11.2.1 Confidence Test
    11.2.2 Distribution Moments Test
    11.2.3 Chi-Squared Goodness-of-Fit Test
    11.2.4 Performance
  11.3 Continuous Distribution Functions
    11.3.1 Uniform (VSL_RNG_METHOD_UNIFORM_STD/VSL_RNG_METHOD_UNIFORM_STD_ACCURATE)
    11.3.2 Gaussian (VSL_RNG_METHOD_GAUSSIAN_BOXMULLER)
    11.3.3 Gaussian (VSL_RNG_METHOD_GAUSSIAN_BOXMULLER2)
    11.3.4 Gaussian (VSL_RNG_METHOD_GAUSSIAN_ICDF)
    11.3.5 GaussianMV (VSL_RNG_METHOD_GAUSSIANMV_BOXMULLER)
    11.3.6 GaussianMV (VSL_RNG_METHOD_GAUSSIANMV_BOXMULLER2)
    11.3.7 GaussianMV (VSL_RNG_METHOD_GAUSSIANMV_ICDF)
    11.3.8 Exponential (VSL_RNG_METHOD_EXPONENTIAL_ICDF/VSL_RNG_METHOD_EXPONENTIAL_ICDF_ACCURATE)
    11.3.9 Laplace (VSL_RNG_METHOD_LAPLACE_ICDF)
    11.3.10 Weibull (VSL_RNG_METHOD_WEIBULL_ICDF/VSL_RNG_METHOD_WEIBULL_ICDF_ACCURATE)
    11.3.11 Cauchy (VSL_RNG_METHOD_CAUCHY_ICDF)
    11.3.12 Rayleigh (VSL_RNG_METHOD_RAYLEIGH_ICDF/VSL_RNG_METHOD_RAYLEIGH_ICDF_ACCURATE)
    11.3.13 Lognormal (VSL_RNG_METHOD_LOGNORMAL_BOXMULLER2/VSL_RNG_METHOD_LOGNORMAL_BOXMULLER2_ACCURATE)
    11.3.14 Gumbel (VSL_RNG_METHOD_GUMBEL_ICDF)
    11.3.15 Gamma (VSL_RNG_METHOD_GAMMA_GNORM/VSL_RNG_METHOD_GAMMA_GNORM_ACCURATE)
    11.3.16 Beta (VSL_RNG_METHOD_BETA_CJA/VSL_RNG_METHOD_BETA_CJA_ACCURATE)
  11.4 Discrete Distribution Functions
    11.4.1 Uniform (VSL_RNG_METHOD_UNIFORM_STD)
    11.4.2 UniformBits (VSL_RNG_METHOD_UNIFORMBITS_STD)
    11.4.3 UniformBits32 (VSL_RNG_METHOD_UNIFORMBITS32_STD)
    11.4.4 UniformBits64 (VSL_RNG_METHOD_UNIFORMBITS64_STD)
    11.4.5 Bernoulli (VSL_RNG_METHOD_BERNOULLI_ICDF)
    11.4.6 Geometric (VSL_RNG_METHOD_GEOMETRIC_ICDF)
    11.4.7 Binomial (VSL_RNG_METHOD_BINOMIAL_BTPE)
    11.4.8 Hypergeometric (VSL_RNG_METHOD_HYPERGEOMETRIC_H2PE)
    11.4.9 Poisson (VSL_RNG_METHOD_POISSON_PTPE)
    11.4.10 Poisson (VSL_RNG_METHOD_POISSON_POISNORM)
    11.4.11 PoissonV (VSL_RNG_METHOD_POISSONV_POISNORM)
    11.4.12 NegBinomial (VSL_RNG_METHOD_NEGBINOMIAL_NBAR)
Bibliography

Legal Information

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL(R) PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR.

Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.
Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or by going to http://www.intel.com/design/literature.htm

Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. See http://www.intel.com/products/processor_number for details.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

BunnyPeople, Celeron, Celeron Inside, Centrino, Centrino Inside, Cilk, Core Inside, i960, Intel, the Intel logo, Intel AppUp, Intel Atom, Intel Atom Inside, Intel Core, Intel Inside, Intel Inside logo, Intel NetBurst, Intel NetMerge, Intel NetStructure, Intel SingleDriver, Intel SpeedStep, Intel Sponsors of Tomorrow., the Intel Sponsors of Tomorrow. logo, Intel StrataFlash, Intel Viiv, Intel vPro, Intel XScale, InTru, the InTru logo, InTru soundmark, Itanium, Itanium Inside, MCS, MMX, Moblin, Pentium, Pentium Inside, skoool, the skoool logo, Sound Mark, The Journey Inside, vPro Inside, VTune, Xeon, and Xeon Inside are trademarks of Intel Corporation in the U.S. and other countries.

* Other names and brands may be claimed as the property of others.

Copyright © 2003-2011, Intel Corporation. All rights reserved.

Revision History

Revision Number  Description                                                                                Revision Date
1.0              Original version of the VSL Notes. Documents Intel® Math Kernel Library release 6.0 Gold.  02/03
2.0              Documents Intel® Math Kernel Library release 6.0 Gold + minor corrections                  03/03
3.0              Documents Intel MKL release 6.1 Gold.                                                      07/03
4.0              Documents Intel MKL release 7.0 Beta.                                                      11/03
5.0              Documents Intel MKL release 7.0 Gold.                                                      04/04
6.0              Documents Intel MKL release 7.0.1.                                                         07/04
7.0              Documents Intel MKL release 8.0 Beta.                                                      03/05
8.0              Documents Intel MKL release 8.0 Gold.                                                      08/05
-009             Documents Intel MKL release 8.1 Gold.                                                      03/06
-010             Documents Intel MKL release 9.0 Beta.                                                      05/06
-011             Documents Intel MKL release 9.0 Gold.                                                      09/06
-012             Documents Intel MKL release 9.1 Beta.                                                      01/07
-013             Documents Intel MKL release 9.1 Gold.                                                      03/07
-014             Documents Intel MKL release 10.0 Beta.                                                     07/07
-015             Documents Intel MKL release 10.0 Gold.                                                     09/07
-016             Documents Intel MKL release 10.1 Beta.                                                     04/08
-017             Documents Intel MKL release 10.1 Gold.                                                     08/08
-018             Documents Intel MKL release 10.2 Beta.                                                     01/09
-019             Documents Intel MKL release 10.2.                                                          04/09
-020             Documents Intel MKL release 10.3.                                                          08/10
-021             Documents Intel MKL release 10.3.3.                                                        02/11
-022             Documents Intel MKL release 10.3.5.                                                        07/11
-023             Documents Intel MKL release 10.3.7.                                                        10/11

1 About This Library

Vector Statistical Library (VSL) is designed for pseudorandom and quasi-random vector generation and for convolution and correlation mathematical operations. VSL is an integral part of Intel® Math Kernel Library (Intel® MKL). VSL provides a number of generator subroutines implementing commonly used continuous and discrete distributions. All of them are based on the highly optimized Basic Random Number Generators (BRNGs) and on VML, the library of vector transcendental functions, which help improve their performance.

Optimization Notice

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

2 About This Document

This document includes a brief conceptual overview of random number generation problems, the product and its capabilities, with a focus on interpretation of results and the related generator figures of merit, as well as task-oriented, procedural, and reference information. In contrast to the Intel MKL Reference Manual, VSL Notes substantially expand on the concept of random number generation and its application as well as on the related notions and issues. The document provides extensive comparative analysis of the library generators and describes the basic tests applied.
Apart from the VSL distribution generators and service subroutines, dealt with in the Intel MKL Reference Manual, the VSL Notes also describe testing of distribution generators.

Those interested in general issues related to random number generators, their quality, and their applications in computer simulation should refer to the Randomness and Scientific Experiment, Random Numbers, and Figures of Merit for Random Number Generators sections, which briefly cover the relevant matters and provide references for further study.

The VSL Structure section covers the concept underlying VSL, the library structure, and the potential for functionality enhancement. VSL is a library of high-performance random number generators. The section describes the factors that optimize the VSL generators for Intel® processors. Special attention is given to VSL ease of use and other advantages in parallel programming.

The Testing of Basic Random Number Generators and Testing of Distribution Random Number Generators sections describe a number of tests for the VSL generators of various probability distributions. See http://software.intel.com/sites/products/documentation/hpc/mkl/vsl/vsl_performance_data.htm for the latest test results.

2.1 Conventions

The following mathematical notation is used throughout the document:

  Bitwise exclusive OR.
& Bitwise AND.
| Bitwise OR.

3 Introduction

This document does not purport to cover the fundamentals of mathematical statistics and probability theory, nor those of number theory and statistical simulation. The books and articles listed in the Bibliography section cover most of these issues. What you will find below is a brief overview of issues pertaining to random number generation, interpretation of the results, and the related notion of quality random number generation.
To some extent, it is an attempt to justify 'the fall' of many people engaged in solving problems of randomness simulation, that is, the fall John von Neumann meant when he wrote: "Anyone who considers arithmetical methods of producing random digits is, of course, in a state of sin". Still, more and more researchers in a variety of scientific fields are getting involved in this kind of simulation depravity, as simulation is becoming increasingly valuable across scientific disciplines.

Computer simulation has become a new and de facto commonly recognized approach to scientific research alongside conventional experimentation. The latter harshly restricts the mathematical model, which can be only as sophisticated as the available conventional research methods permit. As for computer simulation, with ever-growing computing power the achievable complexity of a mathematical model has come to depend almost exclusively on our own understanding of the phenomena we try to model. This is arguably the key factor in the great success that computer simulation has achieved recently.

4 Randomness and Scientific Experiment

A precise definition of what the word 'random' means can hardly be given, even though everyday life provides a variety of examples of 'randomness'. Randomness is closely related to the unpredictability of observation results and the impossibility of predicting them with sufficient accuracy.

The nature of randomness is based on a lack of exhaustive information about the phenomenon under observation. As soon as we learn the origin of that phenomenon, we no longer consider it accidental or random. On the other hand, a random phenomenon, whose origin has been revealed, loses nothing of its random character. We may characterize randomness as a type of relation stipulated by conditions that are inessential, superfluous, and extraneous to this particular phenomenon. Thus, knowledge is incomplete by definition, as it is impossible to allow for all sorts of immaterial relations.

Since our knowledge is incomplete (and it is something that can hardly be helped), the observation results may prove impossible to predict with great accuracy. For instance, the initial state of the objects under observation may change imperceptibly for our instruments, but these small changes may cause significant alterations in the final results. The sophisticated nature of the observed phenomenon may make accurate computation impossible in practice, if not in theory. Finally, even minor uncontrollable disturbing factors may cause serious deviations from a hypothetical "true value".

Nevertheless, for all the likelihood of "irregularities" and "deviations", observational or experimental results still reveal a certain typical regularity, named statistical stability. Various forms of statistical stability are formulated as specific rules that mathematical statistics calls laws of large numbers. In fact, it is this stability that the mathematical theory underlying the mathematical model of random phenomena is based upon. This theory is well known as the theory of probability.

5 Random Numbers

A set of distinctive features characterizes experimental observations. Many such features are of a purely quantitative nature (results of measurements, calculations, and the like), but some of them are mainly qualitative (for example, the color of the object, occurrence or non-occurrence, and so on). In the latter case results may also be presented as quantitative if some appropriate conventions have been developed and applied (though this may prove to be a rather tricky task). Thus, even when the result is a particular quality feature, it can be expressed by a certain number, which, if the result is a random phenomenon, is called a random number.

Numerical methods consider random numbers not only as data from experimental observations.
Since the emergence of computers, imitating huge quantities of random numbers has also become of great interest in various computational areas [Knuth81]. For historical reasons, methods that utilize random numbers to perform a simulation of phenomena are called Monte Carlo methods. Monte Carlo became a tool to perform the most complex simulations in natural and social sciences, financial analysis, physics of turbulence, rarefied gas and fluid simulations, high-energy physics, chemical kinetics and combustion, radiation transport problems, and photorealistic rendering. Monte Carlo methods are intended for various numerical problems such as solving ordinary stochastic differential equations, ordinary differential equations with random entries, boundary value problems for partial differential equations, integral equations, and evaluation of high-dimensional integrals including path-dependent integrals. Monte Carlo methods also include simulation of random variables and order statistics, stochastic processes, as well as random samplings and permutations.

For various reasons [Brat87], random number generation based on completely deterministic algorithms has become most common. It is obvious, however, that numbers obtained in a strictly deterministic way cannot be considered truly random, as they only imitate randomness and are, in fact, pseudo-random. Ideally, pseudo-random numbers imitate 'truly' random ones so well that, without knowing the method of pseudo-random number generation and judging only by the output sequence, it is impossible to distinguish it within a reasonable time from a 'truly' random sequence with more than 50% probability [L’Ecu94].

The output sequence of most pseudorandom number generators is easily predictable. This is acceptable because a number of practical applications do not require strict unpredictability. However, there are certain applications for which most existing pseudorandom generators are useless and at times simply dangerous.
Among them, for example, are applications dealing with the geometrical behavior of large random vectors. Most existing generators should never be used for cryptographic purposes.

Pseudorandom number generators imitate finite sequences of independent identically distributed (i.i.d.) random numbers. However, some numerical methods do not really require independence between random numbers in a sequence. For such methods (numerical integration and optimization, for example) the most important thing is to fill some space with numbers as close to a given distribution as possible, at the expense of independence. Such sequences do not look random at all. For historical reasons they are called quasi-random (or low discrepancy) sequences, the respective generators are called quasi-random number generators, and Monte Carlo methods dealing with quasi-random numbers are called Quasi-Monte Carlo methods.

Hereinafter, the term 'random number generator', or RNG, refers to both pseudo- and quasi-random number generators, unless we want to emphasize the fact that a generator produces precisely a pseudo- or quasi-random sequence.

6 Figures of Merit for Random Number Generators

This section discusses figures of merit for Basic Pseudo- and Quasi-Random Number Generators as well as for General (Non-Uniform) Distribution Generators.

6.1 Uniform Probability Distribution and Basic Pseudo- and Quasi-Random Number Generators

Among the great variety of probability distributions, special emphasis should be laid upon a uniform distribution over a certain set U of large cardinality. Firstly, such a distribution is the most convenient for analysis. Secondly, a random number generator of uniform distribution can always serve as a basis for an RNG of any other distribution type. That is why we use the term basic generators in reference to pseudorandom number generators of uniform distribution.

So the observational output sequence of a basic generator should ideally possess the same properties as a sequence of independent variates evenly distributed over a set U, that is, it should be able to pass various statistical tests for uniformity and independence. A pseudorandom number generator, however, is unable to pass all sorts of statistical tests, as it is an a priori fact that the output sequence of such a generator is anything but random. In other words, a fairly powerful statistical test can always be created for any individual basic RNG, which the said generator will definitely fail.

The situation may not look so desperate if we consider the time required to detect 'non-randomness' in the generator. It makes sense to consider only those statistical tests that work within a 'reasonable' period of time. What time period exactly is 'reasonable'? No direct answer is possible here, as it depends on the sphere of generator application. For example, 'reasonable' time in cryptography may be measured in years of testing conducted on a powerful cluster, while it may be significantly shorter for most other applications.
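As an illustration of what a test for uniformity can look like, the sketch below computes Pearson's chi-squared statistic for samples in [0, 1) split into equal-width bins. It is only a minimal example under stated assumptions: the helper name `chi_squared_uniformity` is hypothetical, and Python's built-in `random.Random` stands in for a basic generator; it is not one of the VSL BRNGs and this is not the test procedure used for VSL.

```python
import random


def chi_squared_uniformity(samples, bins):
    """Return (statistic, counts) for Pearson's chi-squared test of
    samples in [0, 1) against the uniform law over `bins` equal cells."""
    if bins < 2 or not samples:
        raise ValueError("need at least two bins and one sample")
    counts = [0] * bins
    for u in samples:
        if not 0.0 <= u < 1.0:
            raise ValueError("sample outside [0, 1)")
        # Map u to its cell; min() guards against any rounding at the top edge.
        counts[min(int(u * bins), bins - 1)] += 1
    expected = len(samples) / bins  # same expected count in every cell
    statistic = sum((c - expected) ** 2 / expected for c in counts)
    return statistic, counts


# Stand-in generator (assumption): Python's Mersenne Twister, not a VSL BRNG.
rng = random.Random(2011)
samples = [rng.random() for _ in range(100_000)]
statistic, _ = chi_squared_uniformity(samples, bins=64)
print(f"chi-squared statistic, 63 degrees of freedom: {statistic:.2f}")
```

Under the uniformity hypothesis the statistic approximately follows a chi-squared distribution with bins - 1 degrees of freedom, so values far above or far below that number of degrees of freedom are suspicious. A single such test checks only one aspect of uniformity and says nothing about independence.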
Note: At present, VSL contains general-purpose random number generators that are not intended for cryptographic applications. Cryptographic RNGs are too slow for other fields; most applications there benefit from simpler (and faster) generators: linear congruential, multiple recursive, feedback-shift-register, add-with-carry, etc.

To summarize, checking the quality of basic RNGs requires a 'reasonable' set, or battery, of statistical tests. Ideally, the choice of tests depends on the types of problems the generator is intended to solve. A suitable test battery for a general-purpose RNG library is fairly hard to choose, as the tests it includes should be versatile and sufficient for many simulation tasks. The DIEHARD Battery of Tests by G. Marsaglia [Mars95] is an example of a good set of empirical tests for basic generators. Still, a specific application type may require more complete generator testing.

While duly recognizing the importance and usefulness of empirical testing, we should emphasize the significance of theoretical methods for estimating the quality of basic generators. Theoretical research serves as the basis for a better understanding of a generator’s properties: its period length, lattice structure, discrepancy, equidistribution, etc. Theoretical evaluation is the first stage in rejecting admittedly bad generators. Empirical tests should be applied only to make sure the remaining generators are of acceptable quality. What makes empirical testing just as important is the fact that most results obtained by theoretical analysis refer to a basic generator used over the entire period, while in practice only a small fraction of the period is (and should be!) used.
Good behavior of k-dimensional random number vectors over the entire period provides greater confidence (yet not a proof) that similarly good statistical behavior will be observed over a smaller portion of the period [L’Ecu94].

The period of a basic generator is one of the most important features that characterize its quality. For example, one of the VSL BRNGs, the multiplicative congruential generator MCG31m1, has a period length of about $2^{31}$, while its efficiency amounts to about one processor cycle per real number on the Intel® Itanium® 2 processor. Therefore, at a processor frequency of 1 GHz, the entire period is covered within slightly more than 2 seconds. Taking into consideration that good statistical behavior of the generator is observed only over a fraction of its period (B.D. Ripley [Ripley87] recommends using no more than the square root of the period length), we may assert that such a period length is unacceptable. Such generators, however, may still be useful in certain Monte Carlo applications (mostly due to their speed, the small amount of memory needed to keep the generator state, and the efficient methods available for generating random subsequences) when a relatively small quantity of random numbers is needed. For example, while estimating a global solution to an integral equation through the Monte Carlo method, the same random numbers should be used for different parameters [Mikh2000]. In any case, modern computational capacities require BRNGs with a period length of at least $2^{60}$. All the other VSL BRNGs meet these requirements.

Pseudorandom number generators are commonly recursive integer sequences in modular arithmetic, for example:

$x_n = a_1 x_{n-1} + a_2 x_{n-2} + \dots + a_k x_{n-k} \pmod{m}$

Theoretical research aims at selecting values for the parameters $k$, $a_i$, $m$ that provide good quality properties of the output sequence in terms of period length, lattice structure, discrepancy, equidistribution, etc.
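A first-order instance of such a recursive sequence in modular arithmetic, $x_n = a x_{n-1} \pmod{m}$ with output $u_n = x_n / m$, can be sketched in C as follows. This is only a minimal illustration: the function names are hypothetical, and the multiplier 16807 with the prime modulus $2^{31} - 1$ is the textbook Park-Miller choice, assumed here for demonstration; it is not claimed to be the parameter set of any VSL generator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative parameters (Park-Miller), not the constants of a VSL BRNG. */
#define MCG_A 16807ULL
#define MCG_M 2147483647ULL   /* 2^31 - 1, a prime modulus */

/* One step of the recurrence x_n = a * x_{n-1} (mod m).
   The 64-bit product cannot overflow because both factors are below 2^31. */
uint32_t mcg_next(uint32_t x)
{
    return (uint32_t)((MCG_A * (uint64_t)x) % MCG_M);
}

/* Advance the state in place and return u_n = x_n / m, a value in (0, 1). */
double mcg_uniform(uint32_t *state)
{
    assert(state != NULL);
    *state = mcg_next(*state);
    return (double)*state / (double)MCG_M;
}
```

With a state of only 31 bits, a generator of this kind exhausts its whole period after about $2^{31}$ calls, which is why the period length matters so much for its quality.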
In particular, if $m$ is a prime number and proper coefficients $a_i$ are selected, a period length of order $m^k$ may be obtained. Nevertheless, $m$ is often taken as $2^p$ ($p > 1$) because reduction modulo such $m$ is efficient. Some authors do not recommend using $m$ in the form of a power of 2 (see, for example, D. Knuth [Knuth81], P. L’Ecuyer [L’Ecu94]), as the lower bits of the generated random numbers prove to be non-random. For most Monte Carlo applications, however, this is immaterial. Moreover, even if $m$ is a prime number, great care should be taken when selecting random bits in the output sequence.

For the same reasons, quasi-random number generators that fill a hypercube as evenly as possible are also called Basic Random Number Generators in VSL. Quasi-random sequences filling space according to a non-uniform distribution can be generated by transforming a sequence produced by a basic quasi-random number generator. In most cases, tests designed for pseudorandom number generators cannot be used for quasi-random number generators. Special batteries of tests should be designed for basic quasi-random number generators.

6.2 Figures of Merit for General (Non-Uniform) Distribution Generators

First and foremost, the quality of a general distribution generator greatly depends on the quality of the underlying BRNG. Several basic approaches to testing general distribution generators may be singled out.

Random number distributions can be described with a number of measures: probability moments, central and absolute moments, quantiles, mode, scattering, skewness, and excess (kurtosis) coefficients, etc. All the ordinary sample characteristics converge in probability to the corresponding measures of the distribution when the sample size tends to infinity [Cram46]. Commonly, the characteristics based on the distribution moments are asymptotically normal for large sample sizes.
Some classes of sample characteristics that are not based on sampling moments are also asymptotically normal, while others have quite different asymptotic behavior. In any case, when the limit probability distribution is known, it is possible to build a statistical test to check whether a particular sample characteristic agrees with the corresponding measure of the distribution. Of greatest practical value for simulation purposes are the sample mean and variance, which are the main measures of the distribution bias and scattering. All the VSL random number generators undergo testing for agreement between distribution sampling moments (mean and variance) and theoretical values calculated for various sample sizes and distribution parameters.

Another class of valuable tests checks how well the sample distribution function agrees with the theoretical one. The most important among them are the Pearson chi-square goodness-of-fit test (for discrete and continuous distributions) and the Kolmogorov-Smirnov goodness-of-fit test (for continuous distributions). Every VSL distribution is tested with the Pearson chi-square test over various sample sizes and distribution parameters.

It may be useful to transform the sequence being tested into one of the standard distributions, for example, a uniform, normal, or multidimensional normal distribution. The transformed sequence is then tested using a set of statistical tests specific to the distribution to which the sequence was transformed.

Tests that are based on simulation are in fact real Monte Carlo applications. Their choice is largely arbitrary and should be made in accordance with the generator’s field of application, the only requirement being the ability to verify the results obtained against the theoretical value.
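As an illustration of the Pearson chi-square goodness-of-fit approach mentioned above, the following C sketch computes the chi-square statistic for a sample that is supposed to be uniform on [0, 1), using k equal-width bins. The function name, the bin limit, and the binning scheme are assumptions made for this example; it does not reproduce the test procedure actually applied to the VSL generators.

```c
#include <assert.h>
#include <stddef.h>

/* Pearson chi-square statistic for n samples u[i] in [0, 1),
   counted into k equal-width bins (k is limited to 256 here). */
double chi_square_uniform(const double *u, size_t n, size_t k)
{
    enum { MAX_BINS = 256 };
    size_t counts[MAX_BINS] = {0};

    assert(u != NULL && n > 0 && k > 0 && k <= MAX_BINS);

    /* Count how many samples fall into each bin. */
    for (size_t i = 0; i < n; ++i) {
        assert(u[i] >= 0.0 && u[i] < 1.0);
        size_t b = (size_t)(u[i] * (double)k);
        if (b >= k)
            b = k - 1;          /* guard against rounding at the upper edge */
        counts[b]++;
    }

    /* Sum (observed - expected)^2 / expected over all bins. */
    double expected = (double)n / (double)k;
    double chi2 = 0.0;
    for (size_t b = 0; b < k; ++b) {
        double d = (double)counts[b] - expected;
        chi2 += d * d / expected;
    }
    return chi2;
}
```

For a truly uniform sample, the returned value approximately follows a chi-square distribution with k - 1 degrees of freedom, so unusually large (or unusually small) values signal disagreement with the theoretical distribution.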
A good example of a simulation-based test, used in checking the VSL generators for quality, is the self-avoiding random walk [Ziff98].

7 VSL Structure

VSL in the current Intel MKL version contains a set of generators for the general probability distributions most commonly used in simulations, such as uniform, normal (Gaussian), exponential, Poisson, etc. Non-uniform distributions are generated using various transformation techniques applied to the output of a basic (either pseudorandom or quasi-random) RNG. To generate random numbers of a given probability distribution, you can either choose one of the available VSL basic generators or register your own basic random number generator. To enhance performance, all the VSL BRNGs are highly optimized for various architectures of Intel processors. Besides, VSL provides a number of different techniques for transforming uniformly distributed random numbers into a sequence of the required distribution.

All the random number generators implemented in VSL are of vector type.
Unlike scalar type generators (for example, the standard rand() function), whose output is a single successive random number, vector generators produce a vector of n successive random numbers of a given distribution with given parameters. VSL is a thread-safe library convenient for parallel computing on a great variety of parallel system configurations.

A random stream is a basic notion in the RNG subcomponent of VSL. The mechanism of streams provides simultaneous generation of several random number sequences produced by one or more basic generators, as well as splitting of the original sequence into several subsequences by the leapfrog and block-split methods. Several random streams are useful not only in parallel applications but in sequential programs as well.

7.1 Why Vector Type Generators?

Due to architectural features of modern computers, vector type library subroutines often perform much more efficiently than scalar type routines. The reason is that the overhead of a subroutine call is often comparable with the total time required for the computation itself. Certainly, there are subroutines where the overhead is negligible in comparison with the total computation time. However, this is not usually the case with highly optimized RNGs. To reduce the overhead, all VSL random number generator subroutines are of vector type. You are free to call a vector random number generator subroutine to generate just one random number; however, such use is hardly efficient.

On the one hand, vector type random number generators sometimes require more careful programming. The reward in this case is a substantial speedup in overall application performance. On the other hand, VSL provides a number of services to make vector programming as natural as possible. See the sections "Independent Streams. Leapfrogging and Block-Splitting" and "Abstract Basic Random Number Generators. Abstract Streams" for further discussion.
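The difference between a scalar and a vector interface can be sketched in C as follows. The type and function names are hypothetical, and the toy multiplicative congruential recurrence (multiplier 16807, modulus $2^{31} - 1$) is an assumption chosen only to keep the example short; the sketch shows the calling pattern, not how VSL generators are implemented.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy generator state; the recurrence parameters are illustrative only. */
typedef struct {
    uint32_t x;
} vec_state;

/* Scalar interface: one random number per call, so the call overhead
   and the state load/store are paid for every single number. */
double vec_scalar_next(vec_state *s)
{
    assert(s != NULL);
    s->x = (uint32_t)((16807ULL * s->x) % 2147483647ULL);
    return (double)s->x / 2147483647.0;
}

/* Vector interface: n random numbers per call, so the call overhead is
   paid once per block and the state can stay in a local variable. */
void vec_fill(vec_state *s, size_t n, double *out)
{
    assert(s != NULL && (out != NULL || n == 0));
    uint32_t x = s->x;
    for (size_t i = 0; i < n; ++i) {
        x = (uint32_t)((16807ULL * x) % 2147483647ULL);
        out[i] = (double)x / 2147483647.0;
    }
    s->x = x;
}
```

Both functions produce the same sequence from the same starting state; only the number of calls differs. Calling the vector function with n = 1 is legal but gives up the benefit of amortizing the overhead.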
Disregarding possible programming issues, the vector type interface is quite natural for Monte Carlo methods, because Monte Carlo requires a lot of random numbers rather than just one.

7.2 Basic Generators

As indicated above, the basic generators serve to obtain random numbers of various statistical distributions. Non-uniform distribution generators strongly depend on the quality of the underlying basic generators. Besides, as already mentioned, at present there is no basic generator that would be fully adequate for every application. Many of the current generators are useless or simply dangerous for a certain category of tasks. In a number of applications quality requirements for RNGs prevail over other requirements, such as speed, memory use, etc. In other tasks quality requirements are not that stringent, and speed or efficiency in generating random number subsequences is of higher importance. Some applications use random numbers as real numbers, while others treat random numbers as a bit stream. Note that even if a basic generator has trouble providing true randomness in the lower bits, it is not necessarily inadequate for applications using variates as real numbers.

All of the above arguments show that a library of general-purpose RNGs should provide a set of several different basic generators, both pseudo- and quasi-random. Besides, such a library should provide an option of including new basic generators that you may find preferable.

VSL provides a variety of basic pseudo- and quasi-random number generators, while also allowing you to register user-defined basic generators and to utilize random numbers generated externally, for example, from a physical source of random numbers [Jun99]. See the section "Abstract Basic Random Number Generators. Abstract Streams" for details.

One of the important issues for computational experimentation is verification of the results.
Typically, a researcher is unable to verify the output since the solution is simply unknown. Without going into the details of verification for sophisticated simulation systems, we note that any verification process involves testing each structural element of the system. A random number generator, being one such structural element, may produce inadequate results. Therefore, to obtain more reliable results of the experiment, many authors recommend using several different basic generators in a series of computational experiments. This is yet another argument for including several BRNGs of different types in a library.

VSL provides the following basic pseudorandom number generators:

• MCG31m1. A 31-bit multiplicative congruential generator.
• R250. A generalized feedback shift register generator.
• MRG32k3a. A combined multiple recursive generator with two components of order 3.
• MCG59. A 59-bit multiplicative congruential generator.
• WH. A set of 273 Wichmann-Hill combined multiplicative congruential generators (j = 1, 2, ..., 273).

[Equations defining the generators above omitted.]

Note: The variables $x_n$, $y_n$, $z_n$, $w_n$ in the above equations define a successive member of the integer subsequence set by the recursion. The variable $u_n$ is the generator real output normalized to the interval (0, 1).

• MT19937. Mersenne Twister pseudorandom number generator. [Recurrence, matrix format, and parameter values omitted.]
• SFMT19937. SIMD-oriented Fast Mersenne Twister pseudorandom number generator. The recurrence operates on 128-bit integers and uses four sparse 128 x 128 binary matrices, whose operations are defined as follows: a left shift of a 128-bit integer followed by an exclusive-or operation; a right shift of each 32-bit integer in a quadruple followed by an and-operation with a quadruple of 32-bit masks; a right shift of a 128-bit integer; and a left shift of each 32-bit integer in a quadruple. [Recurrence, shift amounts, and mask values omitted.]
• MT2203. A set of 6024 Mersenne Twister pseudorandom number generators. [Recurrence, matrix format, and parameter values omitted.]

In addition, two basic quasi-random number generators are available in VSL.

• SOBOL (with Antonov-Saleev [Ant79] modification). A 32-bit Gray code-based generator producing low-discrepancy sequences for dimensions $1 \le s \le 40$. [Recurrence omitted.]

Note 1: The value $c$ is the position of the rightmost zero bit in $n-1$; $x_n$ is an s-dimensional vector of 32-bit values. The s-dimensional vectors calculated during random stream initialization are called direction numbers. The vector $u_n$ is the generator output normalized to the unit hypercube $(0, 1)^s$.

Note 2: Initialization parameters for SOBOL supported by VSL provide default dimensions $1 \le s \le 40$. You can also pass user-defined initialization parameters into the generator and obtain quasi-random vectors of the desired dimension.

• NIEDERREITER (with Antonov-Saleev [Ant79] modification). A 32-bit Gray code-based generator producing low-discrepancy sequences for dimensions $1 \le s \le 318$. [Recurrence omitted.]

Note: Initialization parameters for NIEDERREITER supported by VSL provide default dimensions $1 \le s \le 318$. You can also pass user-defined parameters into the generator and obtain quasi-random vectors of the desired dimension.

• ABSTRACT. Abstract source of random numbers. See the section "Abstract Basic Random Number Generators. Abstract Streams" for details.
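The Gray code-based update described in Note 1 for SOBOL (each new point is obtained from the previous one by an exclusive-or with the direction number selected by the rightmost zero bit of n-1) can be sketched in C for the one-dimensional case. The names are hypothetical, and the direction numbers $2^{32-c}$ are the simplest one-dimensional choice, assumed here for illustration; the direction numbers that VSL uses by default are not reproduced.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 1-based position c of the rightmost zero bit of n (c = 1 for even n). */
unsigned rightmost_zero_bit(uint32_t n)
{
    unsigned c = 1;
    while (n & 1u) {
        n >>= 1;
        ++c;
    }
    return c;
}

/* State of a one-dimensional Gray code sequence:
   x is the current 32-bit value, n counts the points generated so far. */
typedef struct {
    uint32_t x;
    uint32_t n;
} gray_seq;

/* Produce the next point: x_n = x_{n-1} XOR v_c, where c is taken from
   n-1 and v_c = 2^(32-c) is the assumed direction number for dimension 1.
   Returns u_n = x_n / 2^32. */
double gray_seq_next(gray_seq *s)
{
    assert(s != NULL);
    unsigned c = rightmost_zero_bit(s->n);   /* s->n currently holds n-1 */
    assert(c <= 32);                         /* at most 2^32 - 1 points */
    s->x ^= (uint32_t)1 << (32 - c);
    s->n += 1;
    return (double)s->x / 4294967296.0;
}
```

Starting from a zeroed state, the first points are 0.5, 0.75, 0.25, 0.375: each call changes the state with a single exclusive-or, which is what makes the Gray code ordering cheap to generate.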
Below we discuss each basic generator in more detail and provide references for further reading.

7.2.1.1 MCG31m1

32-bit linear congruential generators, which include MCG31m1 [L’Ecuyer99], are still used as default RNGs in various systems, mostly due to simplicity of implementation, speed of operation, and compatibility with earlier versions of the systems. However, their period lengths do not meet the requirements for modern basic random number generators. Nevertheless, MCG31m1 possesses good statistical properties and may be used to advantage in generating random numbers of various distribution types for relatively small samples.

7.2.1.2 R250

R250 is a generalized feedback shift register generator. Feedback shift register generators have an extensive theoretical footing and were first considered as RNGs for cryptographic and communications applications. Generator R250, proposed in [Kirk81], is fast and simple to implement. It is common in the field of physics. However, the generator fails a number of tests, a 2D self-avoiding random walk [Ziff98] being an example.

7.2.1.3 MRG32k3a

The combined generator MRG32k3a [L’Ecu99] meets the requirements for modern RNGs: good multidimensional uniformity, fairly large period, etc. Besides, being optimized for various Intel® architectures, this generator rivals the other VSL BRNGs in speed.

7.2.1.4 MCG59

The multiplicative congruential generator MCG59 is one of the two basic generators implemented in the NAG Numerical Libraries [NAG] (see www.nag.co.uk). Since the modulus of this generator is not prime, its period length is not $2^{59}$ but just $2^{57}$, provided the seed is an odd number. A drawback of such generators is well known (see, for example, [Knuth81], [L’Ecu94]): the lower bits of the output sequence are not random, so breaking numbers down into their bit patterns and using individual bits may cause trouble.
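The weakness of the lower bits for a power-of-two modulus can be observed directly. The C sketch below iterates $x_n = a x_{n-1} \pmod{2^{59}}$ and measures after how many steps the lowest bits of the state repeat. The function names are hypothetical, and the multiplier is left as a parameter; the value $13^{13}$ used in the usage note is an assumption made for illustration (any multiplier congruent to 5 modulo 8 behaves the same way).

```c
#include <assert.h>
#include <stdint.h>

#define POW2_BITS 59
#define POW2_MASK ((UINT64_C(1) << POW2_BITS) - 1)

/* One step of x_n = a * x_{n-1} (mod 2^59). Unsigned 64-bit multiplication
   wraps modulo 2^64, and masking then reduces the result modulo 2^59. */
uint64_t pow2_mcg_next(uint64_t x, uint64_t a)
{
    return (a * x) & POW2_MASK;
}

/* Number of steps after which the lowest `bits` bits of the state
   return to their initial pattern (the seed is expected to be odd). */
unsigned low_bits_period(uint64_t seed, uint64_t a, unsigned bits)
{
    assert(bits >= 1 && bits <= 16);
    uint64_t mask = (UINT64_C(1) << bits) - 1;
    uint64_t x = seed;
    unsigned steps = 0;
    do {
        x = pow2_mcg_next(x, a);
        ++steps;
    } while ((x & mask) != (seed & mask));
    return steps;
}
```

For example, with the assumed multiplier 302875106592253 (that is, $13^{13}$) and seed 1, the two lowest bits never change and the three lowest bits repeat every 2 steps, whereas the eight lowest bits repeat every 64 steps. Applications that use the variates as real numbers are dominated by the upper bits and are far less affected.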
Besides, block-splitting of the MCG59 sequence over the entire period into $2^d$ similar blocks results in full coincidence of these blocks in the $d$ lower bits (see, for instance, [Knuth81], [L’Ecu94]).

7.2.1.5 WH

WH is a set of 273 different basic generators. It is the second basic generator in the NAG libraries. The constants $a_{i,j}$ are in the range 112 to 127 and the constants $m_{i,j}$ are prime numbers in the range 16718909 to 16776971, which are close to $2^{24}$. These constants have been chosen so that they give good results with the spectral test, see [Knuth81] and [MacLaren89]. The period of each Wichmann-Hill generator would be at least $2^{92}$ if it were not for common factors between $(m_{1,j} - 1)$, $(m_{2,j} - 1)$, $(m_{3,j} - 1)$, and $(m_{4,j} - 1)$. However, each generator should still have a period of at least $2^{80}$. Further discussion of the properties of these generators is given in [MacLaren89], which shows that the generated pseudorandom sequences are essentially independent of one another according to the spectral test.

7.2.1.6 MT19937

The Mersenne Twister pseudorandom number generator [Matsum98] is a modification of the twisted generalized feedback shift register generator proposed in [Matsum92], [Matsum94]. Properties of the algorithm (a period length equal to $2^{19937} - 1$ and 623-dimensional equidistribution up to 32-bit accuracy) make this generator applicable for simulations in various fields of science and engineering. The initialization procedure is essentially the same as described in [MT2002].

7.2.1.7 MT2203

The set of 6024 MT2203 pseudorandom number generators is an addition to the MT19937 generator intended for large-scale Monte Carlo simulations performed on distributed multiprocessor systems. Parameters of the MT2203 generators are calculated using the methodology described in [Matsum2000], which provides mutual independence of the corresponding random number sequences.
Every MT2203 generator has a period length equal to $2^{2203} - 1$ and possesses 68-dimensional equidistribution up to 32-bit accuracy. The initialization procedure is essentially the same as described in [MT2002].

7.2.1.8 SFMT19937

The SIMD-oriented Fast Mersenne Twister pseudorandom number generator [Saito08] is analogous to the MT19937 generator and makes use of Single Instruction Multiple Data (SIMD) and multi-stage pipelining CPU features. The SFMT19937 generator has a period that is a multiple of $2^{19937} - 1$ and a better equidistribution property than MT19937.

7.2.1.9 SOBOL

Bratley and Fox [Brat88] provide an implementation of the Sobol quasi-random number generator. The VSL implementation allows generating Sobol’s low-discrepancy sequences of length up to $2^{32}$. This implementation also allows for registration of user-defined parameters (direction numbers, or initial direction numbers and primitive polynomials) during initialization, which makes it possible to obtain quasi-random vectors of any dimension. If you do not supply user-defined parameters, the default values are used for generation of quasi-random vectors. The default dimension of quasi-random vectors can vary from 1 to 40 inclusive.

7.2.1.10 NIEDERREITER

According to the results of Bratley, Fox, and Niederreiter [Brat92], Niederreiter’s sequences have the best known theoretical asymptotic properties. The VSL implementation allows generating Niederreiter’s low-discrepancy sequences of length up to $2^{32}$. This implementation also allows for registration of user-defined parameters (irreducible polynomials or direction numbers), which makes it possible to obtain quasi-random vectors of any dimension. If you do not supply user-defined parameters, the default values are used for generation of quasi-random vectors. The default dimension of quasi-random vectors can vary from 1 to 318 inclusive.
VSL provides an option of registering one or more new basic generators that you see as preferable or more reliable. Use them in the same way as the BRNGs available with VSL. The registration procedure makes it easy to include a variety of user-designed generators.

7.2.1.11 ABSTRACT

Abstract basic generators are designed to allow VSL distribution generators to be used with underlying uniform random numbers that are already generated. There are several cases when this feature might be useful:

• random numbers of the uniform distribution are generated externally [Mars95] (for example, in a physical device [Jun99]);
• you want to study the system using the same uniform random sequence but under different distribution parameters [Mikh2000]. It is unnecessary to regenerate the uniform random numbers for each set of parameters you want to investigate.

There might be other cases when abstract basic generators are useful. See the section "Abstract Basic Random Number Generators. Abstract Streams" for further reading. Because of the specific nature of abstract basic generators, the vslNewStream and vslNewStreamEx functions cannot be used to create abstract streams. The special vsliNewAbstractStream, vslsNewAbstractStream, and vsldNewAbstractStream functions are provided to initialize integer, single precision, and double precision abstract streams, respectively.

Each of the VSL basic generators consists of 4 subroutines:

• Stream Initialization Subroutine. See the section Random Streams and RNGs in Parallel Computation for details.
• Integer Output Generation Subroutine. Every generated integral value (within certain bounds) may be considered a random bit vector. For details on randomness of individual bits or bit groups, see Basic Random Generator Properties and Testing Results.
• Single Precision Floating-Point Random Number Vector Generation Subroutine. The subroutine generates a real arithmetic vector of uniform distribution over the interval [a, b].
• Double Precision Floating-Point Random Number Vector Generation Subroutine. The subroutine generates a real arithmetic vector of uniform distribution over the interval [a, b].

7.3 Random Streams and RNGs in Parallel Computation

This section describes the usage model for random streams and RNGs, including their creation, initialization, copying, saving, and restoring.

7.3.1 Initializing Basic Generator

To obtain a random number sequence from a given basic generator, you should assign initial, or seed, values. The assigning procedure is called generator initialization (the analogous C language function is srand(seed), declared in stdlib.h). Different types of basic generators require a different number of initial values. For example, the seed for MCG31m1 is an integer within the range from 1 to $2^{31} - 2$, the initial values for MRG32k3a are a set of two triples of 32-bit integers, and the seed for MCG59 is an integer within the range from 1 to $2^{59} - 1$. In contrast to the pseudorandom number generators, quasi-random generators require the dimension parameter on input. Thus, each BRNG, including those registered by the user, requires an individual initialization function. However, requiring individual initialization functions within the library interface would limit the versatility of the routines. The basic concept of VSL is to provide an interface with a universal mechanism for generator initialization, while hiding the details of the initialization process from the user. (Nevertheless, the initialization process is clearly documented in VSL Notes for each library basic generator.) In line with this concept, VSL offers two subroutines to initialize any basic generator (see the functions of random stream creation and initialization in the Random Streams section). These initialization functions can also be used to initialize user-supplied basic generators.
One of the subroutines initializes a given basic generator using one 32-bit initial value, which is traditionally called the seed. If the generator requires more than one 32-bit seed, VSL initializes the remaining initial values on the basis of the original seed. Thus, generator R250, which requires 250 initial 32-bit values, is initialized using one 32-bit seed by the method described in [Kirk81]. The second subroutine is a generalization of the first one. It initializes a basic generator by passing an array of n 32-bit initial values. If the number of initial values n is insufficient to initialize a given basic generator, the missing initial values are set to default values. Conversely, if the number of initial values n is excessive, the redundant values are ignored. For details on the initialization procedure see Basic Random Generator Properties and Testing Results.

When calling initialization functions you do not need to check whether the passed initial values are acceptable for a given basic generator. If the passed seeds are unacceptable, the initialization procedure replaces them with values acceptable for the given type of BRNG. See Basic Random Generator Properties and Testing Results for details on acceptable initial values.

If you add a new basic generator to VSL, you should implement an appropriate initialization function that supports the above mechanism of passing initial values and, if required, the leapfrog and block-splitting techniques.

7.3.2 Creating and Initializing Random Streams

VSL assumes that at any moment during program operation you may simultaneously use several random number subsequences generated by one or more basic generators. Consider the following scenarios:

• The simulation system has several independent structural blocks of random number generation (for example, one block generates random numbers of normal distribution, another generates uniformly distributed numbers, etc.). Each of the blocks should generate an independent random number sequence, that is, each block is assigned an individual stream that generates random numbers of a given distribution.
• It is necessary to study correlation properties of the simulation output with different distribution parameters. In this case it is natural to assign an individual random number stream (subsequence) to each set of parameters. For example, see [Mikh2000].
• Each parallel process (computational node) requires an independent random number subsequence of a given distribution, that is, a random number stream.

A random stream is an abstract source of random numbers. By linking such a stream to a specific basic generator and assigning specific initial values, we predetermine the random number sequence produced by this particular stream. In VSL a universal stream state descriptor identifies every random number stream (in the C language this is just a pointer to a structure). The descriptor specifies the dynamically allocated memory space that contains information on the respective basic generator and its current state, as well as some additional data necessary for the leapfrog and/or skip-ahead method.

VSL has two stream creation and initialization functions:

vslNewStream( stream, brng, seed )
vslNewStreamEx( stream, brng, n, params )

Each of these subroutines allocates memory space to store information on the basic generator brng, its current state, etc., and then calls the initialization function of the basic generator brng, which fills the fields of the generator current state with relevant initial values. The initial values are defined either by one 32-bit value seed (for vslNewStream) or by an array of n 32-bit initial values params (for vslNewStreamEx). The output of vslNewStream and vslNewStreamEx is the pointer stream, that is, the stream state descriptor.
You can create any number of streams through multiple calls of the vslNewStream or vslNewStreamEx functions. For example, you can generate several thread-safe streams that are linked to the same basic generator. The generated streams are further identified by their stream state descriptors.

Although a random number stream is a source of random numbers produced by a basic generator, that is, a generator of uniform distribution, you can generate random numbers of non-uniform distribution using streams. To do this, the stream state descriptor is passed to the transformation function that generates random numbers of a given distribution. Each function uses the stream state descriptor to produce random numbers of a uniform distribution, which are further transformed into sequences of the required distribution. See the section Generating Methods for Random Numbers of Non-Uniform Distribution for details.

When a given random number stream is no longer needed, delete it by calling the vslDeleteStream function:

vslDeleteStream( stream )

This function frees the memory space related to the stream state descriptor stream. After that, the descriptor can no longer be used.

7.3.3 Creating Random Stream Copy and Copying Stream State

VSL provides an option of producing an exact copy of a generated stream by calling the vslCopyStream function:

vslCopyStream( newstream, srcstream )

A new stream newstream is created with parameters (stream descriptive information) that are exactly the same as those of the source stream srcstream at the moment of calling vslCopyStream. The stream state of newstream is exactly the same as that of srcstream, and both streams generate random numbers using the same basic generator.
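The life cycle described above (create a stream, pass its descriptor to a distribution generator, copy it, delete it) can be modeled with a toy descriptor in C. All names in this sketch (toy_stream, toy_new_stream, toy_copy_stream, toy_delete_stream, toy_bernoulli) are hypothetical and are not the VSL API; the underlying recurrence and the Bernoulli transformation are assumptions chosen to keep the example self-contained.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define TOY_M 2147483647u   /* 2^31 - 1, illustrative modulus */
#define TOY_A 16807u        /* illustrative multiplier */

/* Toy stream descriptor: dynamically allocated generator state. */
typedef struct {
    uint32_t state;
} toy_stream;

/* Create a stream from one 32-bit seed; an unacceptable seed is replaced
   with an acceptable one instead of being rejected. */
toy_stream *toy_new_stream(uint32_t seed)
{
    toy_stream *s = malloc(sizeof *s);
    if (s != NULL) {
        s->state = seed % TOY_M;
        if (s->state == 0)
            s->state = 1;
    }
    return s;
}

/* Create an exact copy of a stream, including its current state. */
toy_stream *toy_copy_stream(const toy_stream *src)
{
    assert(src != NULL);
    toy_stream *s = malloc(sizeof *s);
    if (s != NULL)
        *s = *src;
    return s;
}

/* Free the stream and invalidate the caller's descriptor. */
void toy_delete_stream(toy_stream **s)
{
    assert(s != NULL);
    free(*s);
    *s = NULL;
}

/* Basic generator step: uniform variate in (0, 1). */
static double toy_uniform(toy_stream *s)
{
    s->state = (uint32_t)(((uint64_t)TOY_A * s->state) % TOY_M);
    return (double)s->state / (double)TOY_M;
}

/* Distribution generator: fills r with n Bernoulli(p) variates obtained
   by transforming uniform variates drawn from the stream. */
void toy_bernoulli(toy_stream *s, size_t n, int *r, double p)
{
    assert(s != NULL && (r != NULL || n == 0));
    for (size_t i = 0; i < n; ++i)
        r[i] = (toy_uniform(s) < p) ? 1 : 0;
}
```

Because a copy carries the same state as its source, the two streams produce identical sequences from the moment of copying, which is the behavior described for vslCopyStream.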
Another service function, vslCopyStreamState, copies the current state of the stream:

    vslCopyStreamState( deststream, srcstream )

The streams srcstream and deststream are assumed to have been created by one of the above methods, and both streams must be related to the same basic generator. The function vslCopyStreamState copies the information about the current stream state from srcstream into deststream. Other stream-related information remains unchanged.

7.3.4 Saving and Restoring Random Streams

Typically, to get one more correct decimal digit in Monte Carlo, you need to increase the sample by a factor of 100. That makes Monte Carlo applications computationally expensive. Some of them take days or weeks, while others may take several months of computations. For such applications, saving intermediate results to a file is essential so as to be able to continue the computation from those results if the application is terminated intentionally or abnormally. In the case of basic generators, saving intermediate results means that the BRNG state and other descriptive data, if any, should be saved to a binary file. Since the BRNG state is not directly accessible to the user, who operates with the random stream descriptor only, VSL provides routines to save and restore random stream descriptive data to and from binary files:

    errstatus = vslSaveStreamF( stream, fname )
    errstatus = vslLoadStreamF( &stream, fname )

The binary file name is specified by the fname parameter. In the vslSaveStreamF function, the stream input parameter specifies a valid random stream to be written. In vslLoadStreamF, stream is the output parameter that specifies a random stream created on the basis of the binary file data. Each of these functions returns the error status of the operation. A non-zero value indicates an error.

7.3.5 Independent Streams. Leapfrogging and Block-Splitting

One of the basic requirements for random number streams is their mutual independence and lack of intercorrelation. Even if you want random number samplings to be correlated, such correlation should be controllable. The independence of streams is provided through a number of methods. We discuss three of them, all supported by VSL, in greater detail.

• For each of the streams you may use the same type of generator (for example, linear congruential generators), but choose their parameters in such a way as to produce independent output random number sequences. The Mersenne Twister generator is a good example here. It has 1024 parameter sets, which ensure that the resulting subsequences are independent (see [Matsum2000] for details). Another example is the WH generator, capable of creating up to 273 random number streams. The produced sequences are independent according to the spectral test (see [Knuth81] for the spectral test details).
• Split the original sequence into k non-overlapping blocks, where k is the number of independent streams. Each of the streams generates random numbers only from the corresponding block. This method is known as block-splitting or skipping-ahead.
• Split the original sequence into k disjoint subsequences, where k is the number of independent streams, in such a way that the first stream generates the random numbers $x_1, x_{k+1}, x_{2k+1}, x_{3k+1}, \ldots$, the second stream generates the random numbers $x_2, x_{k+2}, x_{2k+2}, x_{3k+2}, \ldots$, and, finally, the kth stream generates the random numbers $x_k, x_{2k}, x_{3k}, \ldots$ This method is known as leapfrogging. Note, however, that multidimensional uniformity properties of each subsequence deteriorate seriously as k grows. The method may be recommended if k is fairly small. Karl Entacher presents data on inadequate subsequences produced by some commonly used linear congruential generators [Ent98].
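The index arithmetic behind block-splitting and leapfrogging can be sketched without VSL at all. The following is a minimal illustration that assumes a toy multiplicative congruential generator; the names toy_next, toy_pow, toy_skip_ahead and toy_leapfrog and the constants TOY_A and TOY_M are hypothetical and are not part of the VSL interface. Elements are counted from 0 here, so element j of the sequence is the state reached after j steps from the seed.

```c
#include <assert.h>
#include <stdint.h>

/* Toy generator x_{n+1} = TOY_A * x_n mod TOY_M (illustrative parameters,
   not taken from VSL). All states stay below 2^31, so products fit in 64 bits. */
#define TOY_A 16807ULL
#define TOY_M 2147483647ULL /* 2^31 - 1 */

/* Advance the state by one step. */
uint64_t toy_next(uint64_t x)
{
    return (TOY_A * x) % TOY_M;
}

/* TOY_A^n mod TOY_M by repeated squaring: skipping n steps costs O(log n). */
uint64_t toy_pow(uint64_t n)
{
    uint64_t result = 1, base = TOY_A;
    while (n > 0) {
        if (n & 1) result = (result * base) % TOY_M;
        base = (base * base) % TOY_M;
        n >>= 1;
    }
    return result;
}

/* Block-splitting: the state reached after skipping nskip elements. */
uint64_t toy_skip_ahead(uint64_t x, uint64_t nskip)
{
    return (toy_pow(nskip) * x) % TOY_M;
}

/* Leapfrogging: write elements j, j+k, j+2k, ... of the sequence that
   starts at seed into out[0..count-1]. Stream j of k uses stride k. */
void toy_leapfrog(uint64_t seed, unsigned j, unsigned k,
                  uint64_t out[], unsigned count)
{
    assert(k > 0 && j < k);
    uint64_t stride = toy_pow(k);
    uint64_t x = toy_skip_ahead(seed, j);
    for (unsigned i = 0; i < count; i++) {
        out[i] = x;
        x = (stride * x) % TOY_M;
    }
}
```

For a generator of this form both operations reduce to multiplying the state by a power of the multiplier, which is why they are cheap for congruential generators; other generator families need their own skip-ahead technique.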
VSL supports all of the above methods; the leapfrog and skip-ahead (block-splitting) methods deserve special attention. VSL implements block-splitting through the function vslSkipAheadStream:

    vslSkipAheadStream( stream, nskip )

The function changes the current state of the stream stream so that, on the next call of the generator, the output subsequence begins with the element $x_{nskip}$ rather than with the current element $x_0$. Thus, to split the initial sequence into nstreams blocks of nskip elements each, implement the following sequence of operations:

Option 1

    VSLStreamStatePtr stream[nstreams];
    int k;
    for ( k=0; k< [...]

    [...] >a
        /* Get successive non-uniform random number */
        w := Nonuniform()   // get successive uniform random number from BRNG
                            // and transform it to non-uniform random number
        /* Return i-th result */
        r[i] := g(u,v,w)
    end do

Minimizing control flow dependency is one of the valuable means of boosting performance on modern processor architectures. In particular, this means that you should try to generate and process random numbers as vectors rather than as scalars:

1. Generate vector U of pairs (u, v).
2. Applying the "good candidate" criterion f(u,v)>a, form a new vector V that consists of "good" candidates only.
3. Get vector W of non-uniform random numbers w.
4. Get vector R of results g(u,v,w).

Note that steps 1-4 do not preserve the original order in which the underlying uniform random numbers are used. If you need to keep the original order, consider the example below. Suppose that one underlying uniform random number is required per non-uniform.
So underlying uniform random numbers are utilized as follows:

[figure not available]

To keep the original order of underlying uniform random number utilization, yet apply the vector random number generator effectively, pack "good" candidates into one buffer while packing random numbers to be used in the non-uniform transformation into another buffer:

[figure not available]

To apply a non-uniform distribution transformation, that is, to use a VSL distribution generator, to $x_7, x_{10}, x_{17}, x_{22}, \ldots$ stored in a buffer W, you need to create an abstract stream that is associated with buffer W.

Types of Abstract Basic Random Number Generators

VSL provides three types of abstract basic random number generators intended for:
• integer-valued buffers
• single precision floating-point buffers
• double precision floating-point buffers

The corresponding abstract stream initialization subroutines are:

    vsliNewAbstractStream( &stream, n, ibuf, icallback );
    vslsNewAbstractStream( &stream, n, sbuf, a, b, scallback );
    vsldNewAbstractStream( &stream, n, dbuf, a, b, dcallback );

Each of these routines creates a new abstract stream stream and associates it with a corresponding cyclic buffer [i,s,d]buf of length n. Data in floating-point buffers is supposed to have uniform distribution over the (a,b) interval. An obligatory parameter is a user-provided callback function [i,s,d]callback that updates the associated buffer when it no longer holds enough random numbers for the distribution generator. A user-provided callback function has the following format:

    int MyUpdateFunc( VSLStreamStatePtr stream, int* n, buf, int* nmin, int* nmax, int* idx )
    {
        ...
        /* Update buf[] starting from index idx */
        ...
        return nupdated;
    }

For Fortran-interface compatibility, all parameters are passed by reference. The function renews the buffer buf of size n starting from position idx. Note that the buffer is considered cyclic and that the index idx varies from 0 to n-1.
The minimal number of buffer entries to be updated is nmin. The maximum number of buffer entries that can be updated is nmax. To minimize callback call overhead, update as many entries as possible (that is, nmax entries) if the specifics of the algorithm allow this.

If you utilize multiple abstract streams, you do not need to create multiple callback functions. Instead, you may have one callback function and distinguish a particular abstract stream and a particular buffer using the stream and buf parameters respectively.

The callback function should return the quantity of numbers that have actually been updated. Typically, the return value is a number between nmin and nmax. If the callback function returns 0 or a number greater than nmax, the abstract basic generator reports an error. It is allowable, however, to update fewer than nmin numbers (but more than 0). In this case, the corresponding abstract generator calls the callback function again until at least nmin numbers are updated. Of course, this is inefficient, but it may still be useful if nmin numbers are not yet available at the moment of the callback function call.

The respective pointers to the callback functions are defined as follows:

    typedef int (*iUpdateFuncPtr)( VSLStreamStatePtr stream, int* n, unsigned int ibuf[], int* nmin, int* nmax, int* idx );
    typedef int (*dUpdateFuncPtr)( VSLStreamStatePtr stream, int* n, double dbuf[], int* nmin, int* nmax, int* idx );
    typedef int (*sUpdateFuncPtr)( VSLStreamStatePtr stream, int* n, float sbuf[], int* nmin, int* nmax, int* idx );

On the user level, an abstract stream looks like a usual random stream and can be used with any service and distribution generator routines. In many cases, however, abstract streams require more careful programming. For instance, it is good practice to check the distribution generator status to determine whether the callback function has successfully updated the buffer.
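The callback contract described above (cyclic indexing from idx, at most nmax entries per call, the number of updated entries as the return value) can be sketched in isolation. The sketch below compiles without MKL, so StreamHandle is only a stand-in for VSLStreamStatePtr, and my_update, source_pos and source_left are hypothetical names that model a data source able to deliver a limited number of values.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for VSLStreamStatePtr so that the sketch builds without MKL
   (an assumption made for illustration only). */
typedef void *StreamHandle;

/* Hypothetical data source: delivers the values 0.0005, 0.0015, ... in (0,1)
   and can run dry, which is modelled by source_left. */
int source_pos = 0;
int source_left = 0;

/* Same parameter layout as dUpdateFuncPtr: everything is passed by reference.
   Writes up to *nmax values into the cyclic buffer starting at *idx and
   returns the number of entries actually updated. */
int my_update(StreamHandle stream, int *n, double dbuf[],
              int *nmin, int *nmax, int *idx)
{
    (void)stream; /* a single callback could tell streams apart with this */
    (void)nmin;   /* this sketch always delivers as much as it can */
    assert(*n > 0 && *idx >= 0 && *idx < *n);

    int count = (*nmax < source_left) ? *nmax : source_left;
    for (int i = 0; i < count; i++) {
        /* wrap around the end of the buffer */
        dbuf[(*idx + i) % *n] = (source_pos % 1000 + 0.5) / 1000.0;
        source_pos++;
    }
    source_left -= count;
    return count; /* 0 tells the caller that no data was available */
}
```

Returning fewer than nmin entries is still within the contract as long as the count is positive; returning 0, as this sketch does once the source is exhausted, is what makes the abstract generator report an error.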
Another important note is that a buffer associated with an abstract stream must not be updated manually, that is, outside of a callback function. In particular, this means that the buffer should not be filled with numbers before the abstract stream is initialized with the vsl[i,s,d]NewAbstractStream function call.

The type of the abstract stream to be created should also be chosen carefully. This type depends on the particular distribution generator routine. For instance, all single precision continuous distribution generator routines utilize abstract streams associated with single precision buffers, while double precision distribution generators utilize abstract streams associated with double precision buffers. Most discrete distribution generators utilize abstract streams that are associated with either single or double precision buffers. See the following table to choose the appropriate type of an abstract stream:

    Type of Discrete Distribution              Type of Abstract Stream
    Uniform                                    double precision
    UniformBits                                integer
    Bernoulli                                  single precision
    Geometric                                  single precision
    Binomial                                   double precision
    Hypergeometric                             double precision
    Poisson (VSL_METHOD_IPOISSON_POISNORM)     single precision
    Poisson (VSL_METHOD_IPOISSON_PTPE)         single and double precision
    PoissonV                                   single precision
    NegBinomial                                double precision

The following example demonstrates generation of random numbers of the Poisson distribution with parameter λ = 3 using an abstract stream. Random numbers are assumed to be uniform integers from 0 to 2^31-1 and are stored in the ran_nums.txt file. In the callback function, the numbers are transformed to double precision format and normalized to the (0,1) interval.
    #include <stdio.h>
    #include "mkl_vsl.h"

    #define METHOD VSL_METHOD_IPOISSON_PTPE
    #define N      4500
    #define DBUFN  1000
    #define M      0x7FFFFFFF /* 2^31-1 */

    static FILE* fp;

    /* Callback: refill the cyclic buffer with numbers read from the file */
    int MydUpdateFunc(VSLStreamStatePtr stream, int* n, double dbuf[],
                      int* nmin, int* nmax, int* idx)
    {
        int i;
        unsigned int num;
        double c;

        c = 1.0 / (double)M;
        for ( i = 0; i < *nmax; i++ )
        {
            /* Stop at end of file or on malformed input */
            if ( fscanf(fp, "%u", &num) != 1 ) break;
            /* Normalize the integer to the (0,1) interval */
            dbuf[(*idx+i) % (*n)] = (double)num * c;
        }
        return i;
    }

    int main()
    {
        int errcode;
        double lambda, a, b;
        double dBuffer[DBUFN];
        int r[N];
        VSLStreamStatePtr stream;

        /* Boundaries of the distribution interval */
        a = 0.0;
        b = 1.0;
        /* Parameter of the Poisson distribution */
        lambda = 3.0;

        fp = fopen("ran_nums.txt", "r");
        if ( fp == NULL ) return 1;

        /***** Initialize stream *****/
        vsldNewAbstractStream( &stream, DBUFN, dBuffer, a, b, MydUpdateFunc );

        /***** Call RNG *****/
        errcode = viRngPoisson( METHOD, stream, N, r, lambda );
        if (errcode == VSL_ERROR_OK)
        {
            /* Process vector of the Poisson distributed random numbers */
            ...
        }
        else
        {
            /* Process error */
            ...
        }
        ...
        vslDeleteStream( &stream );
        fclose(fp);
        return 0;
    }

7.4 Generating Methods for Random Numbers of Non-Uniform Distribution

You can use a source of uniformly distributed random numbers to generate both discrete and continuous distributions. This is implemented through a number of methods briefly described below.

7.4.1 Inverse Transformation

The probability distribution of a one-dimensional variate X may be most generally presented in terms of the cumulative distribution function (CDF):

$F(x) = \Pr(X \le x)$.

Any CDF is defined on the whole real axis and is monotonically increasing, where $F(-\infty) = 0$ and $F(+\infty) = 1$. In the case of a continuous distribution, the cumulative distribution function F(x) is continuous.
In what follows, we assume that F(x) is strictly increasing, though assuming a function that is strictly increasing only on a limited number of intervals leads to trivial complications and generalizations of what follows. Assuming the CDF strictly increases, the following single-valued inverse function should exist:

$G(u) = F^{-1}(u), \quad 0 < u < 1$.

It is easy to prove that, if U is a variate with a uniform distribution on the interval (0, 1), then the variate $X = G(U)$ is of F(x) distribution. Thus, the inverse transformation method can be implemented as follows:

1. Generate a uniformly distributed random number u meeting the requirement 0 < u < 1.
2. Assume x = G(u) as a random number of the distribution F(x).

The only drawback of this approach is that G(u) in closed form is often hard to find, while a numerical solution of the equation $F(x) = u$ to calculate x is, as a rule, excessively time consuming.

For discrete distributions, the CDF is a step function, and the inverse transformation method is still applicable. For simplicity, let us assume that the distribution has probability mass points k = 0, 1, 2, ... with probabilities $p_k$. Then the distribution function is the sum

$F(x) = \sum_{k=0}^{\lfloor x \rfloor} p_k$,

where $\lfloor x \rfloor$ is the maximum integer that does not exceed x. If a continuous function G(u) exists in closed form so that

$F(\lfloor G(u) \rfloor - 1) \le u < F(\lfloor G(u) \rfloor)$,

and G(u) is monotone, then generation of random numbers of the distribution F(x) can be implemented as follows:

1. Generate a uniformly distributed random number u meeting the requirement 0 < u < 1.
2. Assume k = floor(G(u)) as a random number of the distribution F(x).

For example, for the geometric distribution $p_k = p(1-p)^k$. Then G(u) does exist, as is easy to prove: $G(u) = \ln(1-u) / \ln(1-p)$. However, in most cases finding the closed form for the G(u) function is too hard. An acceptable solution may be found using numerical search for k proceeding from $F(k-1) \le u < F(k)$. With tabulated values of F(k), the task is reduced to table lookup.
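As an illustration of the table-lookup variant for a discrete distribution, the sketch below tabulates F(k) for a geometric distribution with mass points k = 0, 1, 2, ... and finds the smallest k for which u < F(k). It is a sketch under stated assumptions: the function names build_geometric_cdf and inverse_by_table are hypothetical, the probabilities are taken as p(1-p)^k, the table is truncated at a fixed size, and a binary search is used because the tabulated F(k) is nondecreasing.

```c
#include <assert.h>

/* Tabulate F(k) = p_0 + ... + p_k for the geometric probabilities
   p_k = p (1-p)^k, k = 0 .. size-1 (truncated table, illustration only). */
void build_geometric_cdf(double p, double cdf[], int size)
{
    assert(p > 0.0 && p < 1.0 && size > 0);
    double pk = p;   /* p_0 */
    double sum = 0.0;
    for (int k = 0; k < size; k++) {
        sum += pk;
        cdf[k] = sum;
        pk *= (1.0 - p); /* p_{k+1} = p_k (1-p) */
    }
}

/* Inverse transformation by table lookup: the smallest k with u < cdf[k].
   If u lies beyond the truncated table, the last index is returned. */
int inverse_by_table(const double cdf[], int size, double u)
{
    assert(size > 0 && u > 0.0 && u < 1.0);
    int lo = 0, hi = size - 1;
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        if (u < cdf[mid])
            hi = mid;     /* the answer is mid or to the left of it */
        else
            lo = mid + 1; /* the answer is to the right of mid */
    }
    return lo;
}
```

With p = 0.5 the table starts 0.5, 0.75, 0.875, 0.9375, so u = 0.6 maps to k = 1 and u = 0.9 maps to k = 3. The search cost depends only on the table size, which is the trade-off against a closed-form G(u).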
As F(k) is a monotonically increasing function, you may use search algorithms that are considerably more efficient than exhaustive search. The efficiency depends solely on the size of the table. The inverse transformation method can also be applied to s-dimensional quasi-random vectors. The resulting quasi-random sequence has the required s-dimensional non-uniform distribution.

7.4.2 Acceptance/Rejection

The cumulative distribution function, let alone its inverse, is very often much more complex computationally than the probability density function (for continuous distributions) or the probability mass function (for discrete distributions). Therefore, methods based on the use of density (mass) functions are often more efficient than the inverse transformation method.

We will consider the case of a continuous probability distribution, although this technique is just as effective for discrete distributions. Suppose we need to generate random numbers x with distribution density f(x). Apart from the variate X, let us consider the variate Y with the density g(x), which has a fast method of random number generation, and the constant c such that

$f(x) \le c\,g(x)$ for all x.

Then it is easy to conclude that the following algorithm generates random numbers x with the distribution F(x):

1. Generate a random number y with the distribution density g(x).
2. Generate a random number u (independent of y) that is uniformly distributed over the interval (0, 1).
3. If $u \le f(y) / (c\,g(y))$, accept y as a random number x with the distribution F(x); else go back to Step 1.

The efficiency of this method greatly depends on the complexity of random number generation with distribution density g(x), on the computational complexity of the functions f(x) and g(x), and on the value of the constant c. The closer c is to 1, the less often the generated y has to be rejected.
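The three steps above can be sketched for one concrete case. The example assumes a target density f(x) = 2x on (0, 1), a uniform proposal density g(x) = 1 on (0, 1), and c = 2, so that the acceptance condition u <= f(y)/(c g(y)) reduces to u <= y. The uniform source is a toy congruential generator with illustrative parameters; ToyRng, toy_uniform, sample_linear_density and sample_mean are hypothetical names, not VSL functions.

```c
#include <assert.h>
#include <stdint.h>

/* Toy uniform source on (0,1); the state stays in [1, 2^31 - 2]. */
typedef struct { uint64_t state; } ToyRng;

double toy_uniform(ToyRng *rng)
{
    rng->state = (16807ULL * rng->state) % 2147483647ULL;
    return (double)rng->state / 2147483647.0;
}

/* One random number with density f(x) = 2x on (0,1), using
   g(x) = 1 on (0,1) and c = 2, so that f(x) <= c g(x) everywhere. */
double sample_linear_density(ToyRng *rng)
{
    for (;;) {
        double y = toy_uniform(rng); /* step 1: candidate with density g */
        double u = toy_uniform(rng); /* step 2: independent uniform number */
        if (u <= y)                  /* step 3: u <= f(y) / (c g(y)) = y */
            return y;
    }
}

/* Average of count accepted numbers; the exact mean of f is 2/3. */
double sample_mean(uint64_t seed, int count)
{
    assert(seed > 0 && seed < 2147483647ULL && count > 0);
    ToyRng rng = { seed };
    double sum = 0.0;
    for (int i = 0; i < count; i++)
        sum += sample_linear_density(&rng);
    return sum / count;
}
```

With c = 2 about half of the candidates are rejected on average, which illustrates why a proposal density with c close to 1 is preferable.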
Note: Since quasi-random sequences are non-random, great care should be taken when using quasi-random basic generators with the acceptance/rejection methods.

7.4.3 Mixture of Distributions

Sometimes it may be useful to split the initial distribution into several simpler distributions:

$F(x) = \sum_{i} p_i F_i(x)$,

so that random numbers for each of the distributions $F_i(x)$ are easy to generate. Then the appropriate algorithm may be as follows:

1. Generate a random number i with the probability $p_i$.
2. Generate a random number y (independent of i) with the distribution $F_i(x)$.
3. Accept y as a random number x with the distribution F(x).

This technique is most common in the acceptance/rejection method, when a density g(x) that approximates the function f(x) well enough over the whole range of acceptable x values is hard to find. In this case, the range is divided into sections so that g(x) looks relatively simple in each of the sub-ranges.

Note: Since quasi-random sequences are non-random, great care should be taken when using quasi-random basic generators with the mixture methods.

7.4.4 Special Properties

The most efficient algorithms, though based on the general methods described in the previous sections, should nevertheless make use of special properties of distributions, if possible. For example, the inverse transformation method is not directly applicable to the normal distribution. However, use of polar coordinates for a pair of independent normal variates makes it possible to develop an efficient method of random number generation based on 2D inverse transformation, which is known as the Box-Muller method:

$x_1 = \sqrt{-2 \ln u_1}\,\sin(2\pi u_2)$,
$x_2 = \sqrt{-2 \ln u_1}\,\cos(2\pi u_2)$.

Generating s-dimensional normally distributed quasi-random sequences with 2D inverse transformation (VSL name is the Box-Muller2 method), when s is odd, seems to be problematic because quasi-random numbers are generated in pairs.
One of the options is to generate (s+1)-dimensional normally distributed quasi-random numbers from (s+1)-dimensional quasi-random numbers produced by a basic quasi-random generator and then ignore the last dimension. Another option is to use the method that produces one normally distributed number from two uniform ones (VSL name is the Box-Muller method). In this case, to generate s-dimensional normally distributed quasi-random numbers, use 2s-dimensional quasi-random numbers produced by a basic quasi-random generator.

For a binomial distribution with parameters m, p, the probability mass function is found as follows:

$P(X = k) = \binom{m}{k} p^k (1-p)^{m-k}, \quad k = 0, 1, \ldots, m$.

For p > 0.5, it is convenient to make use of the fact that the probability of k with parameters m, p equals the probability of m-k with parameters m, 1-p.

To summarize, we note that a uniform distribution can be converted to a general distribution by a number of methods. Also, two different transformation techniques applied to one and the same uniform sequence produce two different sequences of a general distribution, though possessing the same statistical properties. Let us consider a simple example. If U1, U2 are two independent random values uniformly distributed over the interval (0, 1), that is, with the distribution function F(x) = x, 0 < x < 1, then the variate X = max(U1, U2) has the distribution F(x)·F(x) = x^2. Thus, the random number x1 with the distribution of the maximum of two independent uniform variates may be derived either from a pair of uniformly distributed random numbers u1, u2 as x1 = max(u1, u2), or from one uniform random number u1 as x1 = sqrt(u1) by applying the inverse transformation method. It is obvious that applying the two methods to one and the same sequence u1, u2, u3, ... gives two entirely different sequences $x_i$.

Transformation into non-uniform distribution sequences may be accomplished in a variety of ways, and as a rule no single method is the fastest or the most accurate.
The inverse transformation method may be preferable to the acceptance/rejection method for some applications and architectures, while the reverse preference is common for others. Taking this into account, the VSL interface provides different options of random number generation for one and the same probability distribution. For example, Poisson distributed random numbers may be generated by two different methods: the first, known as PTPE [Schmeiser81], is based on acceptance/rejection and mixture of distributions techniques, while the second one is implemented through transformation of normally distributed random numbers. The method number selects a method for a specified generator, for example:

    viRngPoisson( VSL_METHOD_IPOISSON_PTPE, stream, n, r, lambda )

calls the PTPE method by passing the method number VSL_METHOD_IPOISSON_PTPE, and

    viRngPoisson( VSL_METHOD_IPOISSON_POISNORM, stream, n, r, lambda )

calls the transformation from normally distributed random numbers by passing the method number VSL_METHOD_IPOISSON_POISNORM.

For details on methods to be used for specific distributions, see the Continuous Distribution Functions and Discrete Distribution Functions sections.

7.5 Accurate and Fast Modes of Random Number Generation

When using the distribution generators in an application, you can expect the obtained random numbers to belong to the definitional domain of the corresponding distribution irrespective of its parameters. For example, random numbers $r_i$ uniformly distributed on [a, b) and obtained as output of the relevant generator are assumed to satisfy the following condition: $a \le r_i < b$ for all indices i and for all values of a and b. However, due to the specifics of floating-point calculations and rounding modes, some continuous distribution generators may produce random numbers lying beyond the definitional domain for some particular values of distribution parameters. This is unacceptable in applications for which accuracy of calculations is highly critical.
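The rounding effect can be reproduced with a few lines of single precision arithmetic. The sketch below assumes a naive scaling r = a + (b - a)·u of a uniform number u from [0, 1); this formula and the name scale_uniform are illustrative assumptions and are not a description of how any VSL generator is implemented.

```c
#include <assert.h>

/* Naive mapping of u in [0,1) to the interval [a,b) in single precision. */
float scale_uniform(float a, float b, float u)
{
    assert(a < b && u >= 0.0f && u < 1.0f);
    /* volatile forces the result to be rounded to float precision */
    volatile float r = a + (b - a) * u;
    return r;
}
```

For a = 1, b = 2 and the largest single precision number below 1, u = 0x1.fffffep-1f (that is, 1 - 2^-24), the exact sum 2 - 2^-24 lies halfway between two adjacent single precision numbers and is rounded to 2.0f under the default round-to-nearest mode, so the result equals b and falls outside [a, b). With a = 0 and b = 1 the same u is returned unchanged and stays inside the interval, which shows that the effect depends on the distribution parameters.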
To resolve this issue, VSL defines two modes of random number generation: accurate and fast. The generation mode is selected in the call of the distribution generator by specifying the value of the method parameter. For example, accurate generation of single precision floating-point numbers from the distribution uniform on the interval [a, b) looks like this in C:

    ...
    status = vsRngUniform( VSL_METHOD_SUNIFORM_STD_ACCURATE, stream, n, r, a, b );
    ...

So, if a Monte Carlo application uses several distribution generators, each of them can be called in the preferable mode. When used in accurate mode, the generators produce random numbers that belong to the definitional domain for all parameter values of the distribution. See the table below for a list of generators supporting the accurate mode of calculations.

    Type of Distribution    Data Types
    Uniform                 s,d
    Exponential             s,d
    Weibull                 s,d
    Rayleigh                s,d
    Lognormal               s,d
    Gamma                   s,d
    Beta                    s,d

The distribution generators used in the fast mode produce numbers beyond the definitional domain in relatively rare cases. The application should set the accurate mode if all generated random numbers are expected to belong to the definitional domain irrespective of distribution parameter values. Use of the accurate mode may slightly degrade the performance of random number generation.

7.6 Example of VSL Use

A typical algorithm for VSL generators is as follows:

1. Create and initialize stream/streams. Functions vslNewStream, vslNewStreamEx, vslCopyStream, vslCopyStreamState, vslLeapfrogStream, vslSkipAheadStream.
2. Call one or more RNGs.
3. Process the output.
4. Delete the stream/streams. Function vslDeleteStream.

Note: You may reiterate steps 2-3. Random number streams may be generated for different threads.

The following example demonstrates generation of two random streams. The first of them is the output of the basic generator MCG31m1 and the second one is the output of the basic generator R250. The seeds are equal to 1 for each of the streams.
The first stream is used to generate 1,000 normally distributed random numbers in blocks of 100 random numbers with parameters a = 5 and sigma = 2. The second stream is used to produce 1,000 exponentially distributed random numbers in blocks of 100 random numbers with parameters a = -3 and beta = 2. Delete the streams after completing the generation. The purpose is to calculate the sample mean for normal and exponential distributions with the given parameters.

    #include <stdio.h>
    #include "mkl.h"

    int main(void)
    {
        float rn[100], re[100]; /* buffers for random numbers */
        float sn, se;           /* averages */
        VSLStreamStatePtr streamn, streame;
        int i, j;

        /* Initializing */
        sn = 0.0f;
        se = 0.0f;
        vslNewStream( &streamn, VSL_BRNG_MCG31, 1 );
        vslNewStream( &streame, VSL_BRNG_R250, 1 );

        /* Generating */
        for ( i=0; i<10; i++ )
        {
            vsRngGaussian( VSL_RNG_METHOD_GAUSSIAN_BOXMULLER2, streamn, 100, rn, 5.0f, 2.0f );
            vsRngExponential( VSL_RNG_METHOD_EXPONENTIAL_ICDF, streame, 100, re, -3.0f, 2.0f );
            for ( j=0; j<100; j++ )
            {
                sn += rn[j];
                se += re[j];
            }
        }
        sn /= 1000.0f;
        se /= 1000.0f;

        /* Deleting the streams */
        vslDeleteStream( &streamn );
        vslDeleteStream( &streame );

        /* Printing results */
        printf( "Sample mean of normal distribution = %f\n", sn );
        printf( "Sample mean of exponential distribution = %f\n", se );
        return 0;
    }

When you call a generator of random numbers of normal (Gaussian) distribution, use the named constant VSL_RNG_METHOD_GAUSSIAN_BOXMULLER2 to invoke the Box-Muller2 generation method. In the case of a generator of exponential distribution, assign the method by the named constant VSL_RNG_METHOD_EXPONENTIAL_ICDF.

The following example generates 100 three-dimensional quasi-random vectors in the (2, 3)^3 hypercube using the SOBOL BRNG.
    #include <stdio.h>
    #include "mkl.h"

    int main(void)
    {
        float r[100][3]; /* buffer for quasi-random numbers */
        VSLStreamStatePtr stream;

        /* Initializing */
        vslNewStream( &stream, VSL_BRNG_SOBOL, 3 );

        /* Generating */
        vsRngUniform( VSL_RNG_METHOD_UNIFORM_STD, stream, 100*3, (float*)r, 2.0f, 3.0f );

        /* Deleting the streams */
        vslDeleteStream( &stream );
        return 0;
    }

8 Testing of Basic Random Number Generators

This section provides information on testing the Basic Random Number Generators (BRNG), including some details on BRNG properties and categories, as well as on interpretation of test results.

8.1 BRNG Implementations and Categories

Optimization Notice
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804

Three implementations are available for every basic generator:
• integer implementation (output is a 32-bit integer sequence)
• real (single precision)
• real (double precision)

You can use the basic generator integer output to obtain random bits or groups of bits. However, when you interpret the output of a generator, you should take into consideration the characteristics of each basic generator in general and its bit precision in particular. For detailed information on implementations of each basic generator, see Basic Random Generator Properties and Testing Results.
All VSL basic generators are tested by a number of specially designed empirical tests. These tests are applied either to floating-point sequences or to integer-valued sequences. The set of tests for basic generators can be divided into three categories:
• tests to analyze the randomness of bits/groups of bits
• tests to analyze the randomness of real random numbers normalized to the interval (0, 1)
• tests to analyze conformance to the template

8.1.1 First Category

You can only use the first category tests to evaluate the basic generator integer implementation. The function viRngUniformBits corresponds to the integer implementation on the interface level. The testing in this category is made with regard to the characteristics of each basic generator and its bit precision in particular. You can subsequently use the results of the tests to decide if you can apply this particular basic generator to obtain random bits or groups of bits. A failed test does not mean that the generator is bad but rather that the interpretation of the integer output as a stream of random bits may result in an inadequate simulation outcome. Also, this category includes a set of tests to determine the degree of randomness of upper, medium and lower bits. For example, upper bits may prove to be much more random than lower ones. Thus some tests may indicate which bits or groups of bits are better for use as random ones.

8.1.2 Second Category

The second category contains different tests for basic generator normalized output. You can apply all these tests to the real implementation of both single and double precision. Moreover, in most cases, the testing results are identical for both implementations, which indicates that non-randomness of lower bits in the original integer sequence does not have practical influence on the randomness of the real basic generator output normalized to the (0, 1) interval.
The functions vsRngUniform and vdRngUniform, for single and double precision respectively, correspond to real implementations on the interface level.

8.1.3 Third Category

The third category contains tests to check how a basic generator output conforms to the template. Variations of the template tests check if the leapfrog and skip-ahead methods generate subsequences of random numbers correctly. These tests are particularly important because, if any current member of the integer sequence differs from the template in a single bit only, the resulting sequence will be totally different from the template sequence. Also, the statistical properties of such a sequence are worse than those of the template sequence. This assumption is based on the fact that among a variety of sequences there is only a very small number of "sufficiently random" sequences. As Knuth suggests, "random numbers should not be generated with a method chosen at random" [Knuth81]. However, situations are possible where the random choice of the method of generation is not a result of personal preference but rather the curse of a bug.

8.2 Interpreting Test Results

Testing a generator for all possible seeds and sampling sizes is hardly practicable. Therefore we actually test only a few subsequences of various lengths. Testing a random number sequence $u_1, u_2, \ldots, u_n$ gives a p-value that falls within the range from 0 to 1. Being a function of a random sampling, this p-value is a random number itself. For the sequence $u_1, u_2, \ldots, u_n$ of truly random numbers, the resulting p-value is supposed to be uniformly distributed over the interval (0, 1). Significant p-value deviation from the theoretical uniform distribution may indicate a defect in the tested sequence. For example, we may consider the sequence $u_1, u_2, \ldots, u_n$ suspicious if the resulting p-value falls outside the interval (0.01, 0.99). The chance to reject a 'good' sequence in this case is 2%.
Multiple testing of different subsequences of the sequence makes the statistical conclusion about the sequence randomness more substantiated, and there are several options for arriving at such a conclusion.

8.2.1 One-Level (Threshold) Testing

When we test K subsequences $u_1, u_2, \ldots, u_n$; $u_{n+1}, u_{n+2}, \ldots, u_{2n}$; ...; $u_{(K-1)n+1}, u_{(K-1)n+2}, \ldots, u_{Kn}$ of the original sequence, we compute p-values $p_1, p_2, \ldots, p_K$. For a subsequence $u_{(j-1)n+1}, u_{(j-1)n+2}, \ldots, u_{jn}$ the test j is failed if the value $p_j$ falls outside the interval $(p_l, p_h) \subset (0, 1)$. We consider the sequence $u_1, u_2, \ldots, u_{Kn}$ suspicious when r or more test iterations failed. We have conducted threshold testing for the VSL generators with 10 iterations (K=10), the interval $(p_l, p_h)$ equal to (0.05, 0.95), and r = 5. The chance to reject a 'good' sequence in this case is 0.16349374% ≈ 0.2%.

8.2.2 Two-Level Testing

When we test K subsequences $u_1, u_2, \ldots, u_n$; $u_{n+1}, u_{n+2}, \ldots, u_{2n}$; ...; $u_{(K-1)n+1}, u_{(K-1)n+2}, \ldots, u_{Kn}$ of the original sequence, we compute p-values $p_1, p_2, \ldots, p_K$. Since the resulting p-values for the sequence $u_1, u_2, \ldots, u_{Kn}$ of truly random numbers are supposed to be uniformly distributed over the interval (0, 1), we may subject those p-values to any uniformity test, thus obtaining the p-value $q_1$ of the second level. After going through this procedure L times, we obtain L p-values of the second level $q_1, q_2, \ldots, q_L$ that we subject to threshold testing. We have conducted threshold second level testing for the VSL generators with 10 iterations (L=10) and applied the Kolmogorov-Smirnov goodness-of-fit test with Anderson-Darling statistics to evaluate the uniformity of $p_1, p_2, \ldots, p_K$.

8.3 BRNG Tests Description

Most of the empirical tests that are used for testing the VSL BRNGs are well documented (for example, see [Mars95], [Ziff98]).
Nevertheless, we find it useful to describe them and the testing procedure in greater detail here, since tests may vary in their applicability and implementation for a particular basic generator. We also provide the figures of merit used to decide between pass and failure in one- or two-level testing. For the ideas underlying such criteria, see the Interpreting Test Results section.

8.3.1 3D Spheres Test

8.3.1.1 Test Purpose

The test uses simulation to evaluate the randomness of triplets of sequential, uniformly distributed random numbers. The stable response is the volume of a sphere whose radius is equal to the minimal distance between the generated 3D points.

8.3.1.2 First Level Test

The test generates the vector $u_i$ of 12,000 random numbers (i = 0, 1, ..., 11999), which are uniformly distributed in the (0, 1000) interval. The test forms 4,000 triplets of random numbers $x_k = (u_{3k}, u_{3k+1}, u_{3k+2})$ (k = 0, 1, ..., 3999) situated in the cube $R = (0, 1000) \times (0, 1000) \times (0, 1000)$. Further, the test calculates $d_{\min} = \min_{l \ne k} d(x_k, x_l)$, where d(x, y) is the Euclidean distance between x and y. In this case, the volume of the sphere of radius $d_{\min}$ should have a distribution close to the exponential one with parameters $a = 0$, $\beta = 40\pi$. Thus, the distribution of the value $p = 1 - \exp(-d_{\min}^3/30)$ should be close to the uniform distribution. The p-value is the result of the first level test.

8.3.1.3 Second Level Test

The second level test performs the first level test ten times. The p-value $p_j$, j = 1, 2, ..., 10, is the result of each first level test. The test applies the Kolmogorov-Smirnov goodness-of-fit test with the Anderson-Darling statistic to the obtained set of $p_j$ (j = 1, 2, ..., 10). If the resulting p-value is p < 0.05 or p > 0.95, the test fails.

8.3.1.4 Final Result Interpretation

The final result is the FAIL percentage of the failed second level tests. The test performs the second level test ten times. The acceptable result is the value of FAIL < 50%.
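The first level of the 3D Spheres Test can be sketched in a few lines. The example below is a rough illustration, not the VSL test code: it draws the points from Python's `random.Random` rather than from a VSL basic generator, and the helper names `min_distance` and `spheres_first_level` are hypothetical. The exponent $d_{\min}^3/30$ is the sphere volume $\tfrac{4}{3}\pi d_{\min}^3$ divided by $\beta = 40\pi$.

```python
import math
import random

def min_distance(points):
    """Smallest Euclidean distance between any two points.

    Points are sorted by x so that the inner scan can stop as soon as
    the x-gap alone is no smaller than the best distance found so far.
    """
    pts = sorted(points)
    n = len(pts)
    best = math.inf
    for i, p in enumerate(pts):
        j = i + 1
        while j < n and pts[j][0] - p[0] < best:
            best = min(best, math.dist(p, pts[j]))
            j += 1
    return best

def spheres_first_level(rng, n_points=4000, side=1000.0):
    """First-level p-value p = 1 - exp(-d_min**3 / 30)."""
    # Each triplet consumes three consecutive uniform numbers.
    points = [(rng.uniform(0.0, side), rng.uniform(0.0, side), rng.uniform(0.0, side))
              for _ in range(n_points)]
    d_min = min_distance(points)
    return 1.0 - math.exp(-d_min**3 / 30.0)

# Stand-in generator (an assumption of this sketch, not a VSL BRNG).
p = spheres_first_level(random.Random(12345))
print(p)
```

Repeating the call ten times and testing the ten p-values for uniformity gives the second level described above.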
8.3.1.5 Tested Generators

Function Name       Application
vsRngUniform        applicable
vdRngUniform        applicable
viRngUniform        not applicable
viRngUniformBits    applicable

Note: For the function viRngUniformBits, the test transforms the integer output into real output within the interval (0, 1). For detailed information about the normalization of the integer output, see the description of the given basic generator.

8.3.2 Birthday Spacing Test

8.3.2.1 Test Purpose

The test uses simulation to evaluate the randomness of groups of 24 sequential bits of the integer output of a basic generator. The test analyzes all possible groups of this kind, for example, bits 0 to 23, bits 1 to 24, and so on.

8.3.2.2 First Level Test

The first level test selects at random $m = 2^{10}$ "birthdays" from a "year" of $n = 2^{24}$ days. Then the test computes the spacing between the birthdays for each pair of sequential birthdays. The test then uses the spacings to determine the K value, that is, the number of pairs of sequential birthdays with the spacing of more than one day. In this case K should have a distribution close to the Poisson distribution with parameter $\lambda = 16$. The first level test determines 200 values $K_j$ (j = 1, 2, ..., 200). To obtain the p-value p, the test applies the chi-square goodness-of-fit test to the determined values.

The integer output is interpreted differently for each basic generator.

BRNG        Integer Output Interpretation
MCG31m1     Array of 32-bit integers. Bits used in each integer: 0-30. NB=31, WS=32.
R250        Array of 32-bit integers. Bits used in each integer: 0-31. NB=32, WS=32.
MRG32k3a    Array of 32-bit integers. Bits used in each integer: 0-31. NB=32, WS=32.
MCG59       Array of 64-bit integers. Bits used in each integer: 0-58. NB=59, WS=64.
WH          Array of quadruples of 32-bit integers. Bits used in each integer: 0-23. NB=24, WS=32.
MT19937     Array of 32-bit integers. Bits used in each integer: 0-31. NB=32, WS=32.
MT2203      Array of 32-bit integers. Bits used in each integer: 0-31. NB=32, WS=32.
SFMT19937   Array of quadruples of 32-bit integers. Bits used in each integer: 0-31. NB=32, WS=32.

The test generates the dates of the birthdays in the following way:

1. Selects the bits $b_s, b_{s+1}, \ldots, b_{s+23}$ from the next WS-bit integer of the integer output of viRngUniformBits.
2. Treats the selected bits as a 24-bit integer, that is, the number of the day on which the next birthday falls, and thus generates a birthday.
3. Performs steps 1 and 2 m times to generate m birthdays, given that the year consists of n days.

The legitimate values of s are different for each basic generator (see the table above): $0 \le s \le NB - 24$.

8.3.2.3 Second Level Test

The second level test performs the first level test ten times for the fixed s. The test applies the Kolmogorov-Smirnov goodness-of-fit test with the Anderson-Darling statistic to the obtained set of $p_j$ (j = 1, 2, ..., 10). If the resulting p-value is p < 0.05 or p > 0.95, the test fails for the given s.

8.3.2.4 Final Result Interpretation

The second level test is performed ten times for each $0 \le s \le NB - 24$. The test computes the FAIL percentage of the failed second level tests. The final result is the minimal percentage of the failed tests, $\mathrm{FAIL} = \min(\mathrm{FAIL}_0, \mathrm{FAIL}_1, \ldots, \mathrm{FAIL}_{NB-24})$ for $0 \le s \le NB - 24$. The acceptable result is the value of FAIL < 50%. Thus, the test determines whether it is possible to select 24 random bits from every element of the integer output of the generator.

• The integer output of the WH generator consists of quadruples of 32-bit values $(x_i, y_i, z_i, w_i)$. In each 32-bit value only the lower 24 bits are significant.
• The second level test is performed ten times for the $x_i$ element. Then the test computes the $\mathrm{FAIL}_x$ percentage of the failed second level tests.
• The second level test is performed ten times for the $y_i$ element. Then the test computes the $\mathrm{FAIL}_y$ percentage of the failed second level tests.
• The test performs the same procedure to compute the $\mathrm{FAIL}_z$ and $\mathrm{FAIL}_w$ values.

The final result is the minimal percentage of the failed tests, $\mathrm{FAIL} = \min(\mathrm{FAIL}_x, \mathrm{FAIL}_y, \mathrm{FAIL}_z, \mathrm{FAIL}_w)$. The acceptable result is the value of FAIL < 50%. The test determines whether it is possible to select 24 random bits from the fixed element x, y, z, or w of each element of the integer output of the generator.

8.3.2.5 Tested Generators

Function Name       Application
vsRngUniform        not applicable
vdRngUniform        not applicable
viRngUniform        not applicable
viRngUniformBits    applicable

8.3.3 Bitstream Test

8.3.3.1 Test Purpose

The test uses simulation to check whether the integer output of the basic generator can be interpreted as a sequence of random bits.

Note: The bit precision of a basic generator determines how the sequence of random bits is formed. For example, only the 59 lower bits take part in forming the bit stream for the MCG59 generator, and only the 31 lower bits for the MCG31m1 generator.

8.3.3.2 First Level Test

The first level test initially forms the sequence of bits $b_0, b_1, b_2, \ldots$ from the integer output of the basic generator and then forms 20-bit overlapping words $w_0 = b_0 b_1 \ldots b_{19}$, $w_1 = b_1 b_2 \ldots b_{20}$, ... from the sequence. From the total number of $2^{21}$ formed words, the test computes the quantity K of missing 20-bit words. For a truly random sequence, the distribution of the K statistic should be very close to normal with mean $a = 141{,}909$ and standard deviation $\sigma = 428$. The test denotes the cumulative distribution function of the normal distribution with these parameters as F(x). As a result, the distribution of the p-value $p = F(K)$ should be uniform within the interval (0, 1).

BRNG        Integer Output Interpretation
MCG31m1     Array of 32-bit integers. Bits used in each integer: 0-30. NB=31, WS=32.
R250        Array of 32-bit integers. Bits used in each integer: 0-31. NB=32, WS=32.
MRG32k3a    Array of 32-bit integers. Bits used in each integer: 0-31. NB=32, WS=32.
MCG59       Array of 64-bit integers. Bits used in each integer: 0-58. NB=59, WS=64.
WH          Array of quadruples of 32-bit integers. Bits used in each integer: 0-23. NB=24, WS=32.
MT19937     Array of 32-bit integers. Bits used in each integer: 0-31. NB=32, WS=32.
MT2203      Array of 32-bit integers. Bits used in each integer: 0-31. NB=32, WS=32.
SFMT19937   Array of quadruples of 32-bit integers. Bits used in each integer: 0-31. NB=32, WS=32.

The test selects only the NB lower bits from each WS-bit integer to form a bit sequence. For the WH generator, the test selects only the NB lower bits from each of the four WS-bit elements.

8.3.3.3 Second Level Test

The second level test performs the first level test 20 times. The result of each first level test is the p-value $p_j$, j = 1, 2, ..., 20. The test applies the Kolmogorov-Smirnov goodness-of-fit test with the Anderson-Darling statistic to the obtained set of $p_j$ (j = 1, 2, ..., 20). If the resulting p-value is p < 0.05 or p > 0.95, the test fails.

8.3.3.4 Final Result Interpretation

The final result of the test is the FAIL percentage of the failed second level tests. The second level test is performed ten times. The acceptable result is the value of FAIL < 50%.

8.3.3.5 Tested Generators

Function Name       Application
vsRngUniform        not applicable
vdRngUniform        not applicable
viRngUniform        not applicable
viRngUniformBits    applicable

The lower bits are not random for multiplicative congruential generators whose modulus is a power of two (for example, MCG59); thus, the Bitstream Test fails for such generators.
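The low-bit defect of power-of-two moduli is easy to see on a multiplicative congruential generator $x_{n+1} = a x_n \bmod 2^m$ with an odd multiplier a and an odd seed: every state is odd, so bit 0 is always 1, and because $a^2 \equiv 1 \pmod 8$ for any odd a, the three lowest bits repeat with a period of at most 2. The sketch below uses m = 59 and a = 13^13, the parameters usually quoted for MCG59; treat them as an assumption, since this section does not list them. The helper `mcg_states` is hypothetical.

```python
def mcg_states(seed, multiplier, modulus_bits, count):
    """Yield `count` states of x -> multiplier * x mod 2**modulus_bits."""
    mask = (1 << modulus_bits) - 1
    x = seed & mask
    for _ in range(count):
        x = (x * multiplier) & mask
        yield x

# Assumed parameters for illustration: a = 13**13, modulus 2**59, seed 1.
states = list(mcg_states(seed=1, multiplier=13**13, modulus_bits=59, count=1000))

lowest_bit = {x & 1 for x in states}        # set of values taken by bit 0
low_three_bits = [x & 7 for x in states]    # values of bits 0-2

print(lowest_bit)           # a single value: bit 0 never changes
print(low_three_bits[:6])   # alternates between two values
```

A bit stream that includes such bits cannot look random, which is consistent with the failure noted above.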
8.3.4 Rank of 31x31 Binary Matrices Test

8.3.4.1 Test Purpose

The test evaluates the randomness of 31-bit groups of 31 sequential random numbers of the integer output. The stable response is the rank of the binary matrix composed of the random numbers. For generators with more than 31 bits of precision, the test iterates over all possible 31-bit groups of bits (0-30, 1-31, ...).

8.3.4.2 First Level Test

The first level test selects, with s fixed, groups of bits $b_s, b_{s+1}, \ldots, b_{s+30}$ from each element of the integer output and forms a 31x31 binary matrix from 31 such groups. The first level test composes 40,000 such matrices out of sequential elements of the integer output of the generator. Then the test computes the number of matrices of rank 31, the number of matrices of rank 30, the number of matrices of rank 29, and the number of matrices of rank less than 29. For a truly random sequence, the probability of composing a matrix of rank 31 is 0.289, of rank 30 is 0.578, of rank 29 is 0.128, and of rank less than 29 is 0.005. Therefore, the test divides all possible matrix ranks into four groups. The test forms a V statistic, which has a chi-square distribution with three degrees of freedom, for these four groups. Then the first level test applies the chi-square goodness-of-fit test to the groups. The testing result is the p-value.

Note: The acceptable values of s are specific for each basic generator. The test is not applicable to the basic generator WH.

BRNG        Integer Output Interpretation
MCG31m1     Array of 32-bit integers. Bits used in each integer: 0-30. NB=31, WS=32.
R250        Array of 32-bit integers. Bits used in each integer: 0-31. NB=32, WS=32.
MRG32k3a    Array of 32-bit integers. Bits used in each integer: 0-31. NB=32, WS=32.
MCG59       Array of 64-bit integers. Bits used in each integer: 0-58. NB=59, WS=64.
WH          Array of quadruples of 32-bit integers. Bits used in each integer: 0-23. NB=24, WS=32.
MT19937     Array of 32-bit integers. Bits used in each integer: 0-31. NB=32, WS=32.
MT2203      Array of 32-bit integers. Bits used in each integer: 0-31. NB=32, WS=32.
SFMT19937   Array of quadruples of 32-bit integers. Bits used in each integer: 0-31. NB=32, WS=32.

The test selects only the NB lower bits from each WS-bit integer to form a bit sequence.

8.3.4.3 Second Level Test

The second level test performs the first level test ten times for the fixed s. The result is the set of p-values $p_j$, j = 1, 2, ..., 10. The test applies the Kolmogorov-Smirnov goodness-of-fit test with the Anderson-Darling statistic to the obtained set of $p_j$, j = 1, 2, ..., 10. If the resulting p-value is p < 0.05 or p > 0.95, the test fails for the given s.

8.3.4.4 Final Result Interpretation

The second level test is performed ten times for each $0 \le s \le NB - 31$. The test computes the FAIL percentage of the failed second level tests. The final result is the minimal percentage of the failed tests, $\mathrm{FAIL} = \min(\mathrm{FAIL}_0, \mathrm{FAIL}_1, \ldots, \mathrm{FAIL}_{NB-31})$ for $0 \le s \le NB - 31$. The acceptable result is the value of FAIL < 50%. Therefore, the test indicates whether it is possible to single out at least 31 random bits from each element of the generator integer output such that 31 random numbers of 31 bits each behave randomly enough under this particular test.

8.3.4.5 Tested Generators

Function Name       Application
vsRngUniform        not applicable
vdRngUniform        not applicable
viRngUniform        not applicable
viRngUniformBits    applicable

The Rank of 31x31 Binary Matrices Test cannot be applied to the WH generator, as each element of this generator has only 24 bits.

8.3.5 Rank of 32x32 Binary Matrices Test

8.3.5.1 Test Purpose

The test evaluates the randomness of 32-bit groups of 32 sequential random numbers of the integer output.
The stable response is the rank of the binary matrix composed of the random numbers. For generators with more than 32 bits of precision, the test iterates over all possible 32-bit groups of bits (0-31, 1-32, ...).

8.3.5.2 First Level Test

The first level test selects, with s fixed, groups of bits $b_s, b_{s+1}, \ldots, b_{s+31}$ from each element of the integer output. Then it forms a 32x32 binary matrix from 32 such groups. The first level test composes 40,000 such matrices out of sequential elements of the integer output of the generator. Then the test computes the number of matrices of rank 32, the number of matrices of rank 31, the number of matrices of rank 30, and the number of matrices of rank less than 30. For a truly random sequence, the probability of composing a matrix of rank 32 is 0.289, of rank 31 is 0.578, of rank 30 is 0.128, and of rank less than 30 is 0.005. Therefore, the test divides all possible matrix ranks into four groups. The test forms a V statistic, which has a chi-square distribution with three degrees of freedom, for these four groups. Then the first level test applies the chi-square goodness-of-fit test to the groups. The testing result is the p-value.

Note: The acceptable values of s are specific for each basic generator. The test is not applicable to the basic generators MCG31m1 and WH.

BRNG        Integer Output Interpretation
MCG31m1     Array of 32-bit integers. Bits used in each integer: 0-30. NB=31, WS=32.
R250        Array of 32-bit integers. Bits used in each integer: 0-31. NB=32, WS=32.
MRG32k3a    Array of 32-bit integers. Bits used in each integer: 0-31. NB=32, WS=32.
MCG59       Array of 64-bit integers. Bits used in each integer: 0-58. NB=59, WS=64.
WH          Array of quadruples of 32-bit integers. Bits used in each integer: 0-23. NB=24, WS=32.
MT19937     Array of 32-bit integers. Bits used in each integer: 0-31. NB=32, WS=32.
MT2203      Array of 32-bit integers. Bits used in each integer: 0-31. NB=32, WS=32.
SFMT19937   Array of quadruples of 32-bit integers. Bits used in each integer: 0-31. NB=32, WS=32.

The test selects only the NB lower bits from each WS-bit integer to form a bit sequence.

8.3.5.3 Second Level Test

The second level test performs the first level test ten times for the fixed s. The result is the set of p-values $p_j$, j = 1, 2, ..., 10. The test applies the Kolmogorov-Smirnov goodness-of-fit test with the Anderson-Darling statistic to the obtained set of $p_j$, j = 1, 2, ..., 10. If the resulting p-value is p < 0.05 or p > 0.95, the test fails for the given s.

8.3.5.4 Final Result Interpretation

The second level test is performed ten times for each $0 \le s \le NB - 32$. The test computes the FAIL percentage of the failed second level tests. The final result is the minimal percentage of the failed tests, $\mathrm{FAIL} = \min(\mathrm{FAIL}_0, \mathrm{FAIL}_1, \ldots, \mathrm{FAIL}_{NB-32})$ for $0 \le s \le NB - 32$. The acceptable result is the value of FAIL < 50%. Therefore, the test indicates whether it is possible to single out at least 32 random bits from each element of the generator integer output such that 32 random numbers of 32 bits each behave randomly enough under this particular test.

8.3.5.5 Tested Generators

Function Name       Application
vsRngUniform        not applicable
vdRngUniform        not applicable
viRngUniform        not applicable
viRngUniformBits    applicable

The Rank of 32x32 Binary Matrices Test cannot be applied to the WH generator, as each element of this generator has only 24 bits. It cannot be applied to the MCG31m1 generator either, as each element of this generator has only 31 bits.

8.3.6 Rank of 6x8 Binary Matrices Test

8.3.6.1 Test Purpose

The test evaluates the randomness of 8-bit groups of 6 sequential random numbers of the integer output. The stable response is the rank of the binary matrix composed of the random numbers.
The test checks all possible 8-bit groups: 0-7, 1-8, ...

8.3.6.2 First Level Test

The first level test selects, with s fixed, groups of bits $b_s, b_{s+1}, \ldots, b_{s+7}$ from each element of the integer output and forms a 6x8 binary matrix from 6 such groups. The first level test composes 100,000 such matrices out of sequential elements of the integer output of the generator. Then the test computes the number of matrices of rank 6, the number of matrices of rank 5, and the number of matrices of rank less than 5. For a truly random sequence, the probability of composing a matrix of rank 6 is 0.773, of rank 5 is 0.217, and of rank less than 5 is 0.010. Therefore, the test divides all possible matrix ranks into three groups. The test forms a V statistic, which has a chi-square distribution with two degrees of freedom, for these three groups. Then the first level test applies the chi-square goodness-of-fit test to the groups. The testing result is the p-value.

Note: The acceptable values of s are specific for each basic generator. For the WH and SFMT19937 basic generators, the test checks each of the four elements of the integer output.

BRNG        Integer Output Interpretation
MCG31m1     Array of 32-bit integers. Bits used in each integer: 0-30. NB=31, WS=32.
R250        Array of 32-bit integers. Bits used in each integer: 0-31. NB=32, WS=32.
MRG32k3a    Array of 32-bit integers. Bits used in each integer: 0-31. NB=32, WS=32.
MCG59       Array of 64-bit integers. Bits used in each integer: 0-58. NB=59, WS=64.
WH          Array of quadruples of 32-bit integers. Bits used in each integer: 0-23. NB=24, WS=32.
MT19937     Array of 32-bit integers. Bits used in each integer: 0-31. NB=32, WS=32.
MT2203      Array of 32-bit integers. Bits used in each integer: 0-31. NB=32, WS=32.
SFMT19937   Array of quadruples of 32-bit integers. Bits used in each integer: 0-31. NB=32, WS=32.

The test selects only the NB lower bits from each WS-bit integer to form a bit sequence.

8.3.6.3 Second Level Test

The second level test performs the first level test ten times for the fixed s. The result is a set of p-values $p_j$, j = 1, 2, ..., 10. The test applies the Kolmogorov-Smirnov goodness-of-fit test with the Anderson-Darling statistic to the obtained set of $p_j$, j = 1, 2, ..., 10. If the resulting p-value is p < 0.05 or p > 0.95, the test fails for the given s.

8.3.6.4 Final Result Interpretation

The second level test is performed ten times for each $0 \le s \le NB - 8$. The test computes the FAIL percentage of the failed second level tests. The final result is the minimal percentage of the failed tests, $\mathrm{FAIL} = \min(\mathrm{FAIL}_0, \mathrm{FAIL}_1, \ldots, \mathrm{FAIL}_{NB-8})$ for $0 \le s \le NB - 8$. The acceptable result is the value of FAIL < 50%. Therefore, the test indicates whether it is possible to single out at least 8 random bits from each element of the generator integer output such that six random numbers of eight bits each behave randomly enough under this particular test.

8.3.6.5 Tested Generators

Function Name       Application
vsRngUniform        not applicable
vdRngUniform        not applicable
viRngUniform        not applicable
viRngUniformBits    applicable

The Rank of 6x8 Binary Matrices Test checks each element of the WH generator separately, as its elements are produced by different multiplicative generators.

8.3.7 Count-the-1's Test (Stream of Bits)

8.3.7.1 Test Purpose

The test evaluates the randomness of a sequence of overlapping random five-letter words. The letters of the five-letter words occur with specified probabilities. The test forms the random letters from the integer output of the basic generator. The test regards the integer output as a sequence of bits.

8.3.7.2 First Level Test

The first level test assumes that the integer output is a sequence of random bits.
The test interprets this bit sequence as a sequence of bytes, that is, a sequence of 8-bit integer numbers. The number of 1's in every random byte should have a binomial distribution with parameters m = 8, p = 1/2. Therefore, the probability of getting k 1's in a byte is equal to $\binom{8}{k} 2^{-8}$. The first level test considers a random variable c that takes five possible values:

c = 0, if the number of 1's in a random byte is less than three,
c = 1, if the number of 1's in a random byte is three,
c = 2, if the number of 1's in a random byte is four,
c = 3, if the number of 1's in a random byte is five,
c = 4, if the number of 1's in a random byte is more than five.

The probability distribution of c is the following: $P(c=0) = 37/256$, $P(c=1) = 56/256$, $P(c=2) = 70/256$, $P(c=3) = 56/256$, $P(c=4) = 37/256$. The test interprets c as a selection of a random letter from the alphabet {a, b, c, d, e} with the probabilities 37/256, 56/256, 70/256, 56/256, 37/256 respectively. Thus, the sequence of random bytes $b_0, b_1, b_2, \ldots$ corresponds to a sequence of random letters $l_0, l_1, l_2, \ldots$. From this sequence, the test forms overlapping words of length four, $v_1 = l_1 l_2 l_3 l_4$, $v_2 = l_2 l_3 l_4 l_5$, ..., and of length five, $w_1 = l_1 l_2 l_3 l_4 l_5$, $w_2 = l_2 l_3 l_4 l_5 l_6$, .... The test computes the frequencies of each of the 625 possible four-letter words and each of the 3,125 possible five-letter words among the 2,560,000 words obtained. From these frequencies, the test forms the chi-square statistics $V_1$ and $V_2$ for the four- and five-letter words respectively. The test takes into account the covariance of the frequencies of the four-letter and five-letter words and performs the chi-square test for the $V_2 - V_1$ statistic. The $V_2 - V_1$ statistic is asymptotically normal with mean $a = 2500$ and standard deviation $\sigma = 70.71$. The result of the first level test is the p-value.

BRNG        Integer Output Interpretation
MCG31m1     Array of 32-bit integers. Bits used in each integer: 0-30. NB=31, WS=32.
R250        Array of 32-bit integers. Bits used in each integer: 0-31. NB=32, WS=32.
MRG32k3a    Array of 32-bit integers. Bits used in each integer: 0-31. NB=32, WS=32.
MCG59       Array of 64-bit integers. Bits used in each integer: 0-58. NB=59, WS=64.
WH          Array of quadruples of 32-bit integers. Bits used in each integer: 0-23. NB=24, WS=32.
MT19937     Array of 32-bit integers. Bits used in each integer: 0-31. NB=32, WS=32.
MT2203      Array of 32-bit integers. Bits used in each integer: 0-31. NB=32, WS=32.
SFMT19937   Array of quadruples of 32-bit integers. Bits used in each integer: 0-31. NB=32, WS=32.

The test selects only the NB lower bits from each WS-bit integer to form a bit sequence.

8.3.7.3 Second Level Test

The second level test performs the first level test ten times. The test applies the Kolmogorov-Smirnov goodness-of-fit test with the Anderson-Darling statistic to the obtained p-values $p_j$, j = 1, 2, ..., 10. If the resulting p-value is p < 0.05 or p > 0.95, the test fails.

8.3.7.4 Final Result Interpretation

The second level test is performed ten times. The test computes the FAIL percentage of the failed second level tests. The acceptable result is the value of FAIL < 50%.

8.3.7.5 Tested Generators

Function Name       Application
vsRngUniform        not applicable
vdRngUniform        not applicable
viRngUniform        not applicable
viRngUniformBits    applicable

The WH and SFMT19937 generators use all four elements to form a bit sequence.

8.3.8 Count-the-1's Test (Stream of Specific Bytes)

8.3.8.1 Test Purpose

The test evaluates the randomness of a sequence of overlapping random five-letter words. The letters of the five-letter words occur with specified probabilities. The test forms the random letters from the integer output of the basic generator.
The test selects only 8 sequential bits from each element, starting with a certain fixed bit s.

8.3.8.2 First Level Test

The test selects the bits $d_s, d_{s+1}, \ldots, d_{s+7}$, which determine the next random byte, from each element of the integer output, where $0 \le s \le NB - 8$ (see the table below). The number of 1's in every random byte should have a binomial distribution with parameters m = 8, p = 1/2. Therefore, the probability of getting k 1's in a byte is equal to $\binom{8}{k} 2^{-8}$. The first level test considers a random variable c that takes five possible values:

c = 0, if the number of 1's in a random byte is less than three,
c = 1, if the number of 1's in a random byte is three,
c = 2, if the number of 1's in a random byte is four,
c = 3, if the number of 1's in a random byte is five,
c = 4, if the number of 1's in a random byte is more than five.

The probability distribution of c is the following: $P(c=0) = 37/256$, $P(c=1) = 56/256$, $P(c=2) = 70/256$, $P(c=3) = 56/256$, $P(c=4) = 37/256$. The test interprets c as a selection of a random letter from the alphabet {a, b, c, d, e} with the respective probabilities 37/256, 56/256, 70/256, 56/256, 37/256. Thus, the sequence of random bytes $b_0, b_1, b_2, \ldots$ corresponds to a sequence of random letters $l_0, l_1, l_2, \ldots$. From this sequence, the test forms overlapping words of length four, $v_1 = l_1 l_2 l_3 l_4$, $v_2 = l_2 l_3 l_4 l_5$, ..., and of length five, $w_1 = l_1 l_2 l_3 l_4 l_5$, $w_2 = l_2 l_3 l_4 l_5 l_6$, .... The test computes the frequencies of each of the 625 possible four-letter words and each of the 3,125 possible five-letter words among the 256,000 words obtained. From these frequencies, the test forms the chi-square statistics $V_1$ and $V_2$ for the four- and five-letter words respectively. The test takes into account the covariance of the frequencies of the four-letter and five-letter words and performs the chi-square test for the $V_2 - V_1$ statistic. The $V_2 - V_1$ statistic is asymptotically normal with mean $a = 2500$ and standard deviation $\sigma = 70.71$. The result of the first level test is the p-value.

BRNG        Integer Output Interpretation
MCG31m1     Array of 32-bit integers. Bits used in each integer: 0-30. NB=31, WS=32.
R250        Array of 32-bit integers. Bits used in each integer: 0-31. NB=32, WS=32.
MRG32k3a    Array of 32-bit integers. Bits used in each integer: 0-31. NB=32, WS=32.
MCG59       Array of 64-bit integers. Bits used in each integer: 0-58. NB=59, WS=64.
WH          Array of quadruples of 32-bit integers. Bits used in each integer: 0-23. NB=24, WS=32.
MT19937     Array of 32-bit integers. Bits used in each integer: 0-31. NB=32, WS=32.
MT2203      Array of 32-bit integers. Bits used in each integer: 0-31. NB=32, WS=32.
SFMT19937   Array of quadruples of 32-bit integers. Bits used in each integer: 0-31. NB=32, WS=32.

8.3.8.3 Second Level Test

The second level test performs the first level test ten times for the fixed s. The test applies the Kolmogorov-Smirnov goodness-of-fit test with the Anderson-Darling statistic to the obtained p-values $p_j$, j = 1, 2, ..., 10. If the resulting p-value is p < 0.05 or p > 0.95, the test fails for the given s.

8.3.8.4 Final Result Interpretation

The second level test is performed ten times for each $0 \le s \le NB - 8$. The test computes the FAIL percentage of the failed second level tests. The final result is the minimal percentage of the failed tests, $\mathrm{FAIL} = \min(\mathrm{FAIL}_0, \mathrm{FAIL}_1, \ldots, \mathrm{FAIL}_{NB-8})$. The acceptable result is the value of FAIL < 50%. Therefore, the test determines whether it is possible to select at least 8 random bits from each element of the integer output of the generator.

8.3.8.5 Tested Generators

Function Name       Application
vsRngUniform        not applicable
vdRngUniform        not applicable
viRngUniform        not applicable
viRngUniformBits    applicable

For the WH and SFMT19937 generators, the test checks each of the four elements separately.
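The letter probabilities used by both Count-the-1's tests follow from the binomial distribution of the number of 1's in a byte (m = 8, p = 1/2). The sketch below enumerates all 256 byte values to obtain them; the function names are hypothetical and serve only to illustrate the mapping from bytes to the values c = 0, ..., 4.

```python
from math import comb

def ones_count_probability(k, bits=8):
    """P(exactly k ones in a random byte) = C(8, k) / 2**8."""
    return comb(bits, k) / 2**bits

def letter_of(byte):
    """Map a byte to c in {0, ..., 4} according to its number of 1 bits."""
    ones = bin(byte & 0xFF).count("1")
    # Fewer than three ones -> 0, three -> 1, four -> 2, five -> 3, more -> 4.
    return min(max(ones, 2), 6) - 2

# Count how many of the 256 byte values map to each letter.
letter_counts = [0] * 5
for byte in range(256):
    letter_counts[letter_of(byte)] += 1

print(letter_counts)   # [37, 56, 70, 56, 37], each out of 256
```

Dividing the counts by 256 gives the probabilities of the letters a, b, c, d, e.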
8.3.9 Craps Test

8.3.9.1 Test Purpose

The test evaluates the randomness of an output sequence of uniformly distributed random numbers that imitates dice tossing in the game of craps. The stable response is the number of tosses of the pair of dice necessary to complete the game and the frequency of wins in the game.

8.3.9.2 First Level Test

From the output sequence of random numbers, the test forms a sequence of random numbers that take the values from 1 to 6 with equal probability. The test treats every number as the number of spots on the face of a die. Thus the test regards a pair of numbers as the result of a toss of two dice. If on the first throw the sum of the spots on the faces of the dice equals 7 or 11, it is a win; if the sum equals 2, 3, or 12, it is a loss. In other cases additional throws are necessary to determine the result of the game. The test performs additional throws until the sum of the spots equals 7 or coincides with the sum thrown on the first throw. If the sum equals 7, it is a loss; otherwise, it is a win.

The theoretical probability of a win is 244/495, that is, a little less than 0.5. Further, the number of wins in K repetitions of the game, where K = 200,000, has a distribution very close to normal with mean $a = K \cdot 244/495$ and standard deviation $\sigma = \sqrt{a \cdot 251/495}$. The number of throws necessary to complete the game can take the values 1, 2, .... Over K repetitions of the game, the test computes the frequencies of getting c = 1, c = 2, ..., c = 20, c > 20. Based on these frequencies, the test forms the chi-square statistic V, which has a chi-square distribution with 20 degrees of freedom. The result of the first level test is the pair of p-values p and q for the number of tosses and the frequency of wins respectively.

8.3.9.3 Second Level Test

The test performs the first level test ten times.
The result of each iteration of the first level test is the pair of p-values $p_j$ and $q_j$, j = 1, 2, ..., 10. The test applies the Kolmogorov-Smirnov goodness-of-fit test with the Anderson-Darling statistic to the obtained p-values $p_j$, j = 1, 2, ..., 10. If the resulting p-value is p < 0.05 or p > 0.95, the test fails. Similarly, the test applies the Kolmogorov-Smirnov goodness-of-fit test with the Anderson-Darling statistic to the obtained p-values $q_j$, j = 1, 2, ..., 10. If the resulting p-value is q < 0.05 or q > 0.95, the test fails. The test passes in all other cases.

8.3.9.4 Final Result Interpretation

The final result of the test is the FAIL percentage of the failed second level tests. The test performs the second level test ten times. The acceptable result is the value of FAIL < 50%.

8.3.9.5 Tested Generators

Function Name       Application
vsRngUniform        applicable
vdRngUniform        applicable
viRngUniform        applicable
viRngUniformBits    applicable

8.3.10 Parking Lot Test

8.3.10.1 Test Purpose

The test evaluates the randomness of two-dimensional random points uniformly distributed in a square with a side of length 100. The stable response is the number of successfully "parked" points out of 12,000 random two-dimensional points.

8.3.10.2 First Level Test

The test considers the next random point (x, y) successfully "parked" if it is far enough from every previously "parked" point. The sufficient distance between the points $(x_1, y_1)$ and $(x_2, y_2)$ is . Numerous experiments show that, out of 12,000 truly random points, only 3,523 points park successfully on average. Moreover, the number K of points successfully parked after 12,000 attempts has a close-to-normal distribution with mean $a = 3{,}523$ and standard deviation $\sigma = 21.9$. Consequently, $(K - a)/\sigma$ should have a distribution close to the standard normal one with the cumulative distribution function $\Phi(x)$. The result of the test is the p-value $p = \Phi((K - a)/\sigma)$.
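The conversion from the parked-point count K to a p-value can be sketched as follows, assuming the p-value is the standard normal cumulative distribution function evaluated at the standardized count, which is how the paragraph above reads. The function names are hypothetical; the normal CDF is computed from the error function in Python's standard library.

```python
import math

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def parking_lot_p_value(parked, mean=3523.0, sigma=21.9):
    """p-value for a count of successfully parked points (normal approximation)."""
    return normal_cdf((parked - mean) / sigma)

print(parking_lot_p_value(3523))   # count equal to the mean
print(parking_lot_p_value(3560))   # count well above the mean
```

Counts far below or far above the mean give p-values close to 0 or 1 respectively.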
8.3.10.3 Second Level Test

The test performs the first level test ten times. The result of each iteration of the first level test is the p-value p_j, j = 1, 2, ..., 10. The test applies the Kolmogorov-Smirnov goodness-of-fit test with Anderson-Darling statistics to the obtained p-values p_j, j = 1, 2, ..., 10. If the resulting p-value is p < 0.05 or p > 0.95, the test fails.

8.3.10.4 Final Result Interpretation

The final result of the test is the percentage FAIL of the failed second level tests. The test performs the second level test ten times. The acceptable result is the value of FAIL < 50%.

8.3.10.5 Tested Generators

Function Name | Application
vsRngUniform | applicable
vdRngUniform | applicable
viRngUniform | not applicable
viRngUniformBits | applicable

8.3.11 2D Self-Avoiding Random Walk Test

8.3.11.1 Test Purpose

The test evaluates the randomness of the output vector of the generator. The stable response is the frequency with which a point walking randomly along the sites of a lattice reaches the upper side of the lattice.

8.3.11.2 First Level Test

A random particle walks along the sites of a square lattice. With each new step, the particle moves one step forward diagonally in one of the possible directions. The square lattice has two types of sides: the lower and left-hand sides are totally reflecting, while the upper and right-hand sides are totally absorbing. When the particle reaches the lower or left-hand side, the vector of the movement direction makes a 90-degree bend. The upper and right-hand sides absorb the particle when it reaches them, and the walking process completes. The particle starts its movement from the lower left-hand site of the lattice in the northeast direction. If the particle encounters an unvisited site, it turns the direction vector by 90 degrees clockwise or counter-clockwise, each with probability 1/2, and continues the walking process.
If the particle encounters an already visited site of the lattice, it chooses the movement direction so that it does not retrace any part of the path already passed. Due to the symmetry of the task, the upper and the right-hand sides should absorb the particle with equal probability. The test determines the frequency of reaching the upper side of the lattice from the results of 500 iterations of the walking process. If M is the number of attempts in which the particle reaches the upper side, then (M - 250)/sqrt(125) has a close to standard normal distribution Phi(x). The result of the first level test is the p-value p = Phi((M - 250)/sqrt(125)).

8.3.11.3 Second Level Test

The test performs the first level test ten times. The result of each iteration of the first level test is the p-value p_j, j = 1, 2, ..., 10. The test applies the Kolmogorov-Smirnov goodness-of-fit test with Anderson-Darling statistics to the obtained p-values p_j, j = 1, 2, ..., 10. If the resulting p-value is p < 0.05 or p > 0.95, the test fails.

8.3.11.4 Final Result Interpretation

The final result of the test is the percentage FAIL of the failed second level tests. The test performs the second level test ten times. The acceptable result is the value of FAIL < 50%.

8.3.11.5 Tested Generators

Function Name | Application
vsRngUniform | applicable
vdRngUniform | applicable
viRngUniform | not applicable
viRngUniformBits | applicable

8.3.12 Template Test

8.3.12.1 Test Purpose

The test evaluates the conformity of the generator output with the template sequence of random numbers. The test forms the specified integer output sequence from the recurrence and the specified initial conditions. The parameters of the recurrences are selected so that the output sequences possess "good" properties (good multidimensional uniformity, large period, etc.). If any member of the sequence is computed incorrectly, the subsequent members of the sequence are computed incorrectly as well.
Moreover, if a member differs from the correct (template) value in a single bit, the subsequent members of the sequence may differ significantly from the template sequence. The quality of the resulting sequence is then very likely to be much worse than the quality of the template sequence. That is why all the basic generators of the VSL undergo thorough tests for conformity with the template sequences. The test also checks the real output of the basic generators, uniformly distributed over an interval, for conformity with the template output. Obviously, the output sequences are different for single and double precision real arithmetic. Unlike the integer output, where every member should coincide bitwise with the template member, the real output members need not coincide exactly. The lower bits of the mantissa of the real output do not influence randomness; it is the upper bits that determine the quality of the output sequence. For example, the coincidence of the upper binary digits of the mantissa is sufficient for most applications. (See the chapter Spectral Test in [Knuth81]). This test is also used to validate the VSL basic quasi-random number generators.

8.3.12.2 Final Result Interpretation

The final result is the number of sequence members that do not coincide with the template members. The value should be equal to 0. For real sequences the test assumes that a sequence member coincides with the template member if at least 8 upper binary digits of the mantissa coincide.

8.3.12.3 Tested Generators

Function Name | Application
vsRngUniform | applicable
vdRngUniform | applicable
viRngUniform | not applicable
viRngUniformBits | applicable

8.4 BRNG Properties and Testing Results

This section contains the empirical testing results for the VSL basic generators, obtained with the tests described in the BRNG Tests Description section, together with other information on the properties of the basic generators and the rules of the output vector interpretation.
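
The result tables in this section all rely on the same threshold rule stated in the test descriptions: a run fails when its p-value falls outside [0.05, 0.95], and the reported result is acceptable when the percentage FAIL of failed runs is below 50%. A minimal sketch of that rule follows; the function names are hypothetical and are not part of VSL, and the sketch handles a single p-value per run (the Craps Test, which yields a pair p and q, would apply the check to both).

```c
#include <assert.h>
#include <stddef.h>

/* A run fails when its p-value lies outside the interval [0.05, 0.95]. */
int run_failed(double p)
{
    return p < 0.05 || p > 0.95;
}

/* Percentage (0..100) of failed runs among n p-values; n must be positive. */
double fail_percentage(const double *p, size_t n)
{
    size_t failed = 0;
    assert(n > 0);
    for (size_t i = 0; i < n; ++i)
        failed += (size_t)run_failed(p[i]);
    return 100.0 * (double)failed / (double)n;
}

/* The tables report OK when FAIL < 50%, and FAIL otherwise. */
int result_ok(const double *p, size_t n)
{
    return fail_percentage(p, n) < 50.0;
}
```

With ten runs, as used throughout this section, result_ok returns nonzero when fewer than 5 of the 10 p-values fall outside the interval.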
8.4.1 MCG31m1

This is a 31-bit multiplicative congruential generator:

x_n = a*x_{n-1} (mod m),
u_n = x_n / m,
a = 1132489760, m = 2^{31} - 1.

MCG31m1 belongs to the linear congruential generators with a period length of approximately 2^{31}. Such generators are still used as default random number generators in various software systems, mainly due to the simplicity of portable implementations, speed, and compatibility with earlier versions of those systems. However, their period length does not meet the requirements for modern basic generators. Still, the MCG31m1 generator possesses good statistical properties, and you may successfully use it to generate random numbers of different distributions for small samples.

8.4.1.1 Real Implementation (Single and Double Precision)

The output vector is the sequence of the floating-point values [formula missing].

8.4.1.2 Integer Implementation

The output vector of 32-bit integers is [formula missing].

8.4.1.3 Stream Initialization by Function vslNewStream

MCG31m1 generates the stream and initializes it using the 32-bit input parameter seed:
• Assume x_0 = seed mod 0x7FFFFFFF.
• If x_0 = 0, assume x_0 = 1.

8.4.1.4 Stream Initialization by Function vslNewStreamEx

MCG31m1 generates the stream and initializes it using the array params[] of n 32-bit integers:
• If n = 0, assume x_0 = 1.
• Otherwise, assume x_0 = params[0] mod 0x7FFFFFFF. If x_0 = 0, assume x_0 = 1.

8.4.1.5 Subsequences Selection Methods

vslSkipAheadStream | supported
vslLeapfrogStream | supported

8.4.1.6 Generator Period

2^{31} - 2, approximately 2.1*10^9.

8.4.1.7 Lattice Structure

M8 = 0.72771, M16 = 0.61996, M32 = 0.61996 (for more details see [L’Ecu94]).
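
The seeding rules of sections 8.4.1.3 and 8.4.1.4 can be sketched in a few lines of C. This is an illustration of the rules as stated above, not the library's code: the function names are hypothetical, and the multiplier used in the single recurrence step is an assumption (the commonly published value for this generator), since only the modulus 0x7FFFFFFF appears explicitly in the initialization rules.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MCG31M1_M 0x7FFFFFFFu  /* modulus 2^31 - 1, taken from the seeding rules */
#define MCG31M1_A 1132489760u  /* multiplier: an assumption, see the lead-in */

/* vslNewStream rule: x0 = seed mod 0x7FFFFFFF; a zero state is replaced by 1. */
uint32_t mcg31m1_seed(uint32_t seed)
{
    uint32_t x0 = seed % MCG31M1_M;
    return x0 == 0 ? 1u : x0;
}

/* vslNewStreamEx rule: n = 0 gives x0 = 1; otherwise only params[0] is used. */
uint32_t mcg31m1_seed_ex(size_t n, const uint32_t *params)
{
    if (n == 0)
        return 1u;
    return mcg31m1_seed(params[0]);
}

/* One step x -> a*x mod m; the 64-bit product avoids overflow. */
uint32_t mcg31m1_next(uint32_t x)
{
    return (uint32_t)(((uint64_t)MCG31M1_A * x) % MCG31M1_M);
}
```

Replacing a zero state by 1 matters because zero is a fixed point of a multiplicative recurrence: a stream seeded with 0 would otherwise produce only zeros.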
8.4.1.8 Empirical Testing Results Summary

Test Name | vsRngUniform | vdRngUniform | viRngUniform | viRngUniformBits
3D Spheres Test | OK (10% errors) | OK (10% errors) | N/A | OK (10% errors)
Birthday Spacing Test | N/A | N/A | N/A | OK (0% errors)
Bitstream Test | N/A | N/A | N/A | OK (10% errors)
Rank of 31x31 Binary Matrices Test | N/A | N/A | N/A | OK (10% errors)
Rank of 32x32 Binary Matrices Test | N/A | N/A | N/A | N/A
Rank of 6x8 Binary Matrices Test | N/A | N/A | N/A | OK (0% errors)
Counts-the-1’s Test (stream of bits) | N/A | N/A | N/A | OK (20% errors)
Counts-the-1’s Test (stream of specific bytes) | N/A | N/A | N/A | OK (0% errors)
Craps Test | OK (20% errors) | OK (20% errors) | OK (20% errors) | OK (20% errors)
Parking Lot Test | OK (10% errors) | OK (10% errors) | N/A | OK (10% errors)
2D Self-Avoiding Random Walk Test | OK (20% errors) | OK (20% errors) | N/A | OK (20% errors)

Note:
• N/A means that the test is not applicable to this function.
• The tabulated data is obtained using the one-level (threshold) testing technique. The OK result indicates FAIL < 50%, that is, FAILs occur in fewer than 5 runs out of 10. A run fails when its p-value falls outside the interval [0.05, 0.95].
• The stream tested is generated by calling the function vslNewStream with seed=7,777,777.

8.4.2 R250

This is a generalized feedback shift register generator:

x_n = x_{n-147} XOR x_{n-250},
u_n = x_n / 2^{32}.

Feedback shift register generators have an ample theoretical foundation and were first intended for cryptographic and communication applications. Physicists widely use the R250 generator because it is simple and fast in implementation. However, this generator fails some types of tests, one of which is the 2D Self-Avoiding Random Walk Test.
8.4.2.1 Real Implementation (Single and Double Precision)

The output vector is the sequence of the floating-point values [formula missing].

8.4.2.2 Integer Implementation

The output vector of 32-bit integers is [formula missing].

8.4.2.3 Stream Initialization by Function vslNewStream

R250 generates the stream and initializes it using the 32-bit input integer parameter seed. The stream state is the array of 250 32-bit integers x_{-250}, ..., x_{-1}, initialized in the following way:
• If seed = 0, assume seed = 1. Assume x_{-250} = seed.
• Initialize the remaining values according to the recurrence [formula missing].
• Interpret the values [formula missing] as a binary matrix of size 32x32 and perform the following: set the diagonal bits to 1 and the under-diagonal bits to 0.

8.4.2.4 Stream Initialization by Function vslNewStreamEx

R250 generates the stream and initializes it using the array params[] of n 32-bit integers:
• If n = 0, assume seed = 1 and perform the initialization as described in the above section on stream initialization by the function vslNewStream.
• Otherwise, assume x_{k-250} = params[k], k = 0, 1, ..., 249.

8.4.2.5 Subsequences Selection Methods

vslSkipAheadStream | not supported
vslLeapfrogStream | not supported

8.4.2.6 Generator Period

2^{250} - 1, approximately 1.8*10^75.
8.4.2.7 Empirical Testing Results Summary

Test Name | vsRngUniform | vdRngUniform | viRngUniform | viRngUniformBits
3D Spheres Test | OK (0% errors) | OK (0% errors) | N/A | OK (0% errors)
Birthday Spacing Test | N/A | N/A | N/A | OK (0% errors)
Bitstream Test | N/A | N/A | N/A | OK (25% errors)
Rank of 31x31 Binary Matrices Test | N/A | N/A | N/A | OK (10% errors)
Rank of 32x32 Binary Matrices Test | N/A | N/A | N/A | OK (0% errors)
Rank of 6x8 Binary Matrices Test | N/A | N/A | N/A | OK (0% errors)
Counts-the-1’s Test (stream of bits) | N/A | N/A | N/A | OK (30% errors)
Counts-the-1’s Test (stream of specific bytes) | N/A | N/A | N/A | OK (0% errors)
Craps Test | OK (20% errors) | OK (20% errors) | OK (20% errors) | OK (20% errors)
Parking Lot Test | OK (0% errors) | OK (0% errors) | N/A | OK (0% errors)
2D Self-Avoiding Random Walk Test | FAIL (70% errors) | FAIL (80% errors) | N/A | FAIL (80% errors)

Note:
• N/A means that the test is not applicable to this function.
• The tabulated data is obtained using the one-level (threshold) testing technique. The OK result indicates FAIL < 50%, that is, FAILs occur in fewer than 5 runs out of 10. A run fails when its p-value falls outside the interval [0.05, 0.95].
• The stream tested is generated by calling the function vslNewStream with seed=7,777,777.

8.4.3 MRG32k3a

This is a 32-bit combined multiple recursive generator with 2 components of order 3:

x_n = a11*x_{n-1} + a12*x_{n-2} + a13*x_{n-3} (mod m1),
y_n = a21*y_{n-1} + a22*y_{n-2} + a23*y_{n-3} (mod m2),
z_n = x_n - y_n (mod m1),
u_n = z_n / m1,
a11 = 0, a12 = 1403580, a13 = -810728, m1 = 2^{32} - 209,
a21 = 527612, a22 = 0, a23 = -1370589, m2 = 2^{32} - 22853.

The MRG32k3a combined generator meets the requirements for modern RNGs, such as good multidimensional uniformity and a long period. Optimization for various Intel® architectures makes it competitive with the other VSL basic generators in terms of speed.

8.4.3.1 Real Implementation (Single and Double Precision)

The output vector is the sequence of the floating-point values [formula missing].

8.4.3.2 Integer Implementation

The output vector of 32-bit integers is [formula missing].

8.4.3.3 Stream Initialization by Function vslNewStream

MRG32k3a generates the stream and initializes it using the 32-bit input integer parameter seed.
The stream state is the two triplets of 32-bit integers (x_{-3}, x_{-2}, x_{-1}) and (y_{-3}, y_{-2}, y_{-1}), initialized in the following way:
• Assume x_{-3} = seed.
• Assume the other values equal to 1, that is, x_{-2} = x_{-1} = y_{-3} = y_{-2} = y_{-1} = 1.

8.4.3.4 Stream Initialization by Function vslNewStreamEx

MRG32k3a generates the stream and initializes it using the array params[] of n 32-bit integers. In each case below, the state values that are not listed are assumed equal to 1:
• If n = 0, assume x_{-3} = x_{-2} = x_{-1} = y_{-3} = y_{-2} = y_{-1} = 1.
• If n = 1, assume x_{-3} = params[0] mod m1.
• If n = 2, assume x_{-3} = params[0] mod m1, x_{-2} = params[1] mod m1.
• If n = 3, assume x_{-3} = params[0] mod m1, x_{-2} = params[1] mod m1, x_{-1} = params[2] mod m1. If the values prove to be x_{-3} = x_{-2} = x_{-1} = 0, assume x_{-3} = 1.
• If n = 4, assume x_{-3} = params[0] mod m1, x_{-2} = params[1] mod m1, x_{-1} = params[2] mod m1, y_{-3} = params[3] mod m2. If the values prove to be x_{-3} = x_{-2} = x_{-1} = 0, assume x_{-3} = 1.
• If n = 5, assume x_{-3} = params[0] mod m1, x_{-2} = params[1] mod m1, x_{-1} = params[2] mod m1, y_{-3} = params[3] mod m2, y_{-2} = params[4] mod m2. If the values prove to be x_{-3} = x_{-2} = x_{-1} = 0, assume x_{-3} = 1.
• If n = 6, assume x_{-3} = params[0] mod m1, x_{-2} = params[1] mod m1, x_{-1} = params[2] mod m1, y_{-3} = params[3] mod m2, y_{-2} = params[4] mod m2, y_{-1} = params[5] mod m2. If the values prove to be x_{-3} = x_{-2} = x_{-1} = 0, assume x_{-3} = 1. If the values prove to be y_{-3} = y_{-2} = y_{-1} = 0, assume y_{-3} = 1.

8.4.3.5 Subsequences Selection Methods

vslSkipAheadStream | supported
vslLeapfrogStream | not supported

8.4.3.6 Generator Period

Approximately 2^{191}, that is, about 3.1*10^57.

8.4.3.7 Lattice Structure

M8 = 0.68561, M16 = 0.63940, M32 = 0.63359.
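
The case-by-case rules of section 8.4.3.4 collapse into one loop followed by two zero checks, which the sketch below tries to show. It is an illustration only: the function name and the state layout are hypothetical, the moduli are the commonly published MRG32k3a values (an assumption as far as the initialization rules themselves go), unlisted state values default to 1, and values of n above 6 are clamped to 6, which the rules above do not cover.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MRG_M1 4294967087u  /* 2^32 - 209:   assumed value of m1 */
#define MRG_M2 4294944443u  /* 2^32 - 22853: assumed value of m2 */

/*
 * state[0..2] holds x_{-3}, x_{-2}, x_{-1}; state[3..5] holds y_{-3}, y_{-2}, y_{-1}.
 * params may be NULL when n is 0.
 */
void mrg32k3a_init_ex(uint32_t state[6], size_t n, const uint32_t *params)
{
    if (n > 6)
        n = 6;  /* assumption: extra elements are ignored */

    for (size_t i = 0; i < 6; ++i) {
        uint32_t m = (i < 3) ? MRG_M1 : MRG_M2;
        /* Supplied values are reduced modulo m1 or m2; missing ones default to 1. */
        state[i] = (i < n) ? params[i] % m : 1u;
    }

    /* An all-zero triplet would lock its component at zero, so repair it. */
    if (state[0] == 0 && state[1] == 0 && state[2] == 0)
        state[0] = 1u;
    if (state[3] == 0 && state[4] == 0 && state[5] == 0)
        state[3] = 1u;
}
```

Running the zero checks unconditionally is equivalent to the per-case wording: whenever fewer than three values of a triplet are supplied, at least one default of 1 is present and the check cannot trigger.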
8.4.3.8 Empirical Testing Results Summary

Test Name | vsRngUniform | vdRngUniform | viRngUniform | viRngUniformBits
3D Spheres Test | OK (10% errors) | OK (10% errors) | N/A | OK (10% errors)
Birthday Spacing Test | N/A | N/A | N/A | OK (0% errors)
Bitstream Test | N/A | N/A | N/A | OK (20% errors)
Rank of 31x31 Binary Matrices Test | N/A | N/A | N/A | OK (20% errors)
Rank of 32x32 Binary Matrices Test | N/A | N/A | N/A | OK (10% errors)
Rank of 6x8 Binary Matrices Test | N/A | N/A | N/A | OK (0% errors)
Counts-the-1’s Test (stream of bits) | N/A | N/A | N/A | OK (20% errors)
Counts-the-1’s Test (stream of specific bytes) | N/A | N/A | N/A | OK (0% errors)
Craps Test | OK (20% errors) | OK (20% errors) | OK (20% errors) | OK (20% errors)
Parking Lot Test | OK (10% errors) | OK (10% errors) | N/A | OK (10% errors)
2D Self-Avoiding Random Walk Test | OK (20% errors) | OK (20% errors) | N/A | OK (20% errors)

Note:
• N/A means that the test is not applicable to this function.
• The tabulated data is obtained using the one-level (threshold) testing technique. The OK result indicates FAIL < 50%, that is, FAILs occur in fewer than 5 runs out of 10. A run fails when its p-value falls outside the interval [0.05, 0.95].
• The stream tested is generated by calling the function vslNewStream with seed=7,777,777.

8.4.4 MCG59

This is a 59-bit multiplicative congruential generator:

x_n = a*x_{n-1} (mod m),
u_n = x_n / m,
a = 13^{13}, m = 2^{59}.

The multiplicative congruential generator MCG59 is one of the two basic generators implemented in the NAG Numerical Libraries. As the modulus of the generator is not prime, the length of its period is not 2^{59} but only 2^{57}, provided that the initial value (seed) is odd. The drawback of such generators is well known (see, for example, [Cram46], [Ent98]): the lower bits of the generated sequence of pseudo-random numbers are not random, so breaking numbers down into their bit patterns and using individual bits may cause trouble.
Besides, block-splitting the sequence of an entire period into 2^d blocks of equal length leads to their full identity in the d lower bits.

8.4.4.1 Real Implementation (Single and Double Precision)

The output vector is the sequence of the floating-point values [formula missing].

8.4.4.2 Integer Implementation

The output vector of 32-bit integers is [formula missing]. Thus, the output vector stores every 59-bit member of the integer output as two 32-bit integers. For example, to get a vector of n 59-bit integers, the output array should be large enough to store 2n 32-bit numbers.

8.4.4.3 Stream Initialization by Function vslNewStream

MCG59 generates the stream and initializes it using the 32-bit input integer parameter seed:
• Assume x_0 = seed mod 2^{59}.
• If x_0 = 0, assume x_0 = 1.

8.4.4.4 Stream Initialization by Function vslNewStreamEx

MCG59 generates the stream and initializes it using the array params[] of n 32-bit integers:
• If n = 0, assume x_0 = 1.
• If n = 1, assume seed = params[0] and follow the instructions described in the above section on stream initialization by the function vslNewStream.
• Otherwise, assume seed = params[0] + 2^{32}*params[1] and follow the instructions described in the above section on stream initialization by the function vslNewStream.

8.4.4.5 Subsequences Selection Methods

vslSkipAheadStream | supported
vslLeapfrogStream | supported

8.4.4.6 Generator Period

2^{57}, approximately 1.4*10^17.

8.4.4.7 Lattice Structure

S2 = 0.84; S3 = 0.73; S4 = 0.74; S5 = 0.58; S6 = 0.63; S7 = 0.52; S8 = 0.55; S9 = 0.56.
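
The MCG59 seeding rules of sections 8.4.4.3 and 8.4.4.4 combine up to two 32-bit words into one 59-bit initial state. A minimal sketch of that combination follows; the function names are hypothetical and the code only mirrors the rules as written above, it is not taken from the library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Low 59 bits set; reducing modulo 2^59 is a bitwise AND with this mask. */
#define MCG59_MASK ((UINT64_C(1) << 59) - 1)

/* vslNewStream rule: x0 = seed mod 2^59; a zero state is replaced by 1. */
uint64_t mcg59_seed(uint64_t seed)
{
    uint64_t x0 = seed & MCG59_MASK;
    return x0 == 0 ? UINT64_C(1) : x0;
}

/* vslNewStreamEx rule: 0, 1, or 2 (or more) 32-bit words form the seed. */
uint64_t mcg59_seed_ex(size_t n, const uint32_t *params)
{
    if (n == 0)
        return UINT64_C(1);
    if (n == 1)
        return mcg59_seed(params[0]);
    /* seed = params[0] + 2^32 * params[1]; the sum always fits in 64 bits. */
    return mcg59_seed((uint64_t)params[0] + ((uint64_t)params[1] << 32));
}
```

Since the period of 2^{57} is reached only for an odd initial value, a caller that needs the full period may want to pass an odd params[0]; the sketch does not enforce this because the rules above do not.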
8.4.4.8 Empirical Testing Results Summary

Test Name | vsRngUniform | vdRngUniform | viRngUniform | viRngUniformBits
3D Spheres Test | OK (10% errors) | OK (10% errors) | N/A | OK (10% errors)
Birthday Spacing Test | N/A | N/A | N/A | OK (0% errors)
Bitstream Test | N/A | N/A | N/A | OK (45% errors)
Rank of 31x31 Binary Matrices Test | N/A | N/A | N/A | OK (0% errors)
Rank of 32x32 Binary Matrices Test | N/A | N/A | N/A | OK (0% errors)
Rank of 6x8 Binary Matrices Test | N/A | N/A | N/A | OK (0% errors)
Counts-the-1’s Test (stream of bits) | N/A | N/A | N/A | FAIL (100% errors)
Counts-the-1’s Test (stream of specific bytes) | N/A | N/A | N/A | OK (0% errors)
Craps Test | OK (10% errors) | OK (10% errors) | OK (10% errors) | OK (10% errors)
Parking Lot Test | OK (20% errors) | OK (20% errors) | N/A | OK (20% errors)
2D Self-Avoiding Random Walk Test | OK (20% errors) | OK (10% errors) | N/A | OK (10% errors)

Note:
• N/A means that the test is not applicable to this function.
• The tabulated data is obtained using the one-level (threshold) testing technique. The OK result indicates FAIL < 50%, that is, FAILs occur in fewer than 5 runs out of 10. A run fails when its p-value falls outside the interval [0.05, 0.95].
• The stream tested is generated by calling the function vslNewStream with seed=7,777,777.

[1] The generator fails the test for bit groups 0-23, 1-24, 2-25, 3-26, 5-28.
[2] The generator fails the test for bit groups 0-30, 1-31.
[3] The generator fails the test for bit groups 0-31, 1-32.
[4] The generator fails the test for bit groups 0-7, ..., 9-16, 11-18, 32-39, ..., 37-44, 39-46, ..., 41-48.
[5] The generator fails the test for bit groups 0-7, ..., 11-18, 13-20, ..., 15-22.

8.4.5 WH

This is a set of 273 Wichmann-Hill’s combined multiplicative congruential generators (j = 1, 2, ..., 273):

x_n = a_{1,j}*x_{n-1} (mod m_{1,j}),
y_n = a_{2,j}*y_{n-1} (mod m_{2,j}),
z_n = a_{3,j}*z_{n-1} (mod m_{3,j}),
w_n = a_{4,j}*w_{n-1} (mod m_{4,j}),
u_n = (x_n/m_{1,j} + y_n/m_{2,j} + z_n/m_{3,j} + w_n/m_{4,j}) mod 1.

WH is a set of 273 different basic generators. This generator is the second basic generator in the NAG libraries.
The constants a_{i,j} range from 112 to 127; the constants m_{i,j} are prime numbers ranging from 16,718,909 to 16,776,971, close to 2^{24}. These constants should show good results in the spectral test (see Knuth [Knuth81] and MacLaren [MacLaren89]). The period of each Wichmann-Hill generator would be equal to 2^{92} were it not for common factors between (m_{1,j} - 1), (m_{2,j} - 1), (m_{3,j} - 1), and (m_{4,j} - 1). However, each generator should still have a period of at least 2^{80}. The generated pseudo-random sequences are essentially independent of one another according to the spectral test (for detailed information about the properties of these generators see [MacLaren89]).

8.4.5.1 Real Implementation (Single and Double Precision)

The output vector is the sequence of the floating-point values [formula missing].

8.4.5.2 Integer Implementation

The output vector of 32-bit integers is [formula missing]. Thus, the output vector stores every quadruple (x, y, z, w) of members of the integer output as four 32-bit integers. For example, to get a vector of n quadruples (x, y, z, w), the output array should be large enough to store 4n 32-bit numbers.

8.4.5.3 Stream Initialization by Function vslNewStream

WH generates the stream and initializes it using the 32-bit input integer parameter seed:
• Assume x_0 = seed mod m1. If x_0 = 0, assume x_0 = 1.
• Assume y_0 = 1, z_0 = 1, w_0 = 1.

The WH generator is a set of 273 basic generators. To select a WH generator, add an offset to the named constant VSL_BRNG_WH: VSL_BRNG_WH+0, VSL_BRNG_WH+1, ..., VSL_BRNG_WH+272. The following example illustrates the initialization of the seventh (of 273) WH generator:

vslNewStream(&stream, VSL_BRNG_WH+6, seed);

8.4.5.4 Stream Initialization by Function vslNewStreamEx

WH generates the stream and initializes it using the array params[] of n 32-bit integers:
• If n = 0, assume x_0 = 1, y_0 = 1, z_0 = 1, w_0 = 1.
• If n = 1, assume x_0 = params[0] mod m1, y_0 = 1, z_0 = 1, w_0 = 1. If x_0 = 0, assume x_0 = 1.
• If n = 2, assume x_0 = params[0] mod m1, y_0 = params[1] mod m2, z_0 = 1, w_0 = 1. If x_0 = 0, assume x_0 = 1. If y_0 = 0, assume y_0 = 1.
• If n = 3, assume x_0 = params[0] mod m1, y_0 = params[1] mod m2, z_0 = params[2] mod m3, w_0 = 1. If x_0 = 0, assume x_0 = 1. If y_0 = 0, assume y_0 = 1. If z_0 = 0, assume z_0 = 1.
• If n = 4, assume x_0 = params[0] mod m1, y_0 = params[1] mod m2, z_0 = params[2] mod m3, w_0 = params[3] mod m4. If x_0 = 0, assume x_0 = 1. If y_0 = 0, assume y_0 = 1. If z_0 = 0, assume z_0 = 1. If w_0 = 0, assume w_0 = 1.

8.4.5.5 Subsequences Selection Methods

vslSkipAheadStream | supported
vslLeapfrogStream | supported

8.4.5.6 Generator Period

[formula missing]

8.4.5.7 Empirical Testing Results Summary

Test Name | vsRngUniform | vdRngUniform | viRngUniform | viRngUniformBits
3D Spheres Test | OK (0% errors) | OK (0% errors) | N/A | OK (0% errors)
Birthday Spacing Test | N/A | N/A | N/A | FAIL (60% errors)
Bitstream Test | N/A | N/A | N/A | OK (10% errors)
Rank of 31x31 Binary Matrices Test | N/A | N/A | N/A | N/A
Rank of 32x32 Binary Matrices Test | N/A | N/A | N/A | N/A
Rank of 6x8 Binary Matrices Test | N/A | N/A | N/A | OK (0% errors)
Counts-the-1’s Test (stream of bits) | N/A | N/A | N/A | OK (10% errors)
Counts-the-1’s Test (stream of specific bytes) | N/A | N/A | N/A | OK (0% errors)
Craps Test | OK (20% errors) | OK (20% errors) | OK (20% errors) | OK (10% errors)
Parking Lot Test | OK (10% errors) | OK (10% errors) | N/A | OK (10% errors)
2D Self-Avoiding Random Walk Test | OK (10% errors) | OK (0% errors) | N/A | OK (20% errors)

Note:
• N/A means that the test is not applicable to this function.
• The tabulated data is obtained using the one-level (threshold) testing technique. The OK result indicates FAIL < 50%, that is, FAILs occur in fewer than 5 runs out of 10. A run fails when its p-value falls outside the interval [0.05, 0.95].
• The stream tested is generated by calling the function vslNewStream with seed=7,777,777.

8.4.6 MT19937

This is a Mersenne Twister pseudorandom number generator: [formulas missing].
Matrix A (32x32) has the following format: [matrix missing], where the 32-bit vector has the value [value missing].

The Mersenne Twister pseudorandom number generator MT19937 is a modification of the twisted generalized feedback shift register generator [Matsum92], [Matsum94]. MT19937 has a period length of 2^{19937} - 1 and is 623-dimensionally equidistributed up to 32-bit accuracy. These properties make the generator applicable for simulations in various fields of science and engineering. The initialization procedure is essentially the same as described in [MT2002]. The state of the generator is represented by 624 32-bit unsigned integer numbers.

8.4.6.1 Real Implementation (Single and Double Precision)

The output vector is the sequence of the floating-point values [formula missing].

8.4.6.2 Integer Implementation

The output vector of 32-bit integers is [formula missing].

8.4.6.3 Stream Initialization by Function vslNewStream

MT19937 generates the stream and initializes it using the 32-bit unsigned integer input parameter seed. The stream state, that is, the array of 624 32-bit integers, is initialized by the procedure described in [MT2002], based on the seed value.

8.4.6.4 Stream Initialization by Function vslNewStreamEx

MT19937 generates the stream and initializes it using the array params[] of n 32-bit unsigned integers:
• If n = 1, perform initialization as described in [MT2002] using the array params[] on input.
• If n = 0, assume params[0] = 1, n = 1 and perform initialization as described in the previous item.

8.4.6.5 Subsequences Selection Methods

vslSkipAheadStream | not supported
vslLeapfrogStream | not supported

8.4.6.6 Generator Period

2^{19937} - 1.
8.4.6.7 Empirical Testing Results Summary

Test Name | vsRngUniform | vdRngUniform | viRngUniform | viRngUniformBits
3D Spheres Test | OK (0% errors) | OK (0% errors) | N/A | OK (0% errors)
Birthday Spacing Test | N/A | N/A | N/A | OK (10% errors)
Bitstream Test | N/A | N/A | N/A | OK (10% errors)
Rank of 31x31 Binary Matrices Test | N/A | N/A | N/A | OK (10% errors)
Rank of 32x32 Binary Matrices Test | N/A | N/A | N/A | OK (0% errors)
Rank of 6x8 Binary Matrices Test | N/A | N/A | N/A | OK (0% errors)
Counts-the-1’s Test (stream of bits) | N/A | N/A | N/A | OK (20% errors)
Counts-the-1’s Test (stream of specific bytes) | N/A | N/A | N/A | OK (0% errors)
Craps Test | OK (30% errors) | OK (30% errors) | OK (30% errors) | OK (30% errors)
Parking Lot Test | OK (0% errors) | OK (0% errors) | N/A | OK (0% errors)
2D Self-Avoiding Random Walk Test | OK (0% errors) | OK (10% errors) | N/A | OK (10% errors)

Note:
• N/A means that the test is not applicable to this function.
• The tabulated data is obtained using the one-level (threshold) testing technique. The OK result indicates FAIL < 50%, that is, FAILs occur in fewer than 5 runs out of 10. A run fails when its p-value falls outside the interval [0.05, 0.95].
• The stream tested is generated by calling the function vslNewStream with seed=7,777,777.

8.4.7 SFMT19937

This is a SIMD-oriented Fast Mersenne Twister pseudorandom number generator: [formulas missing], where the state elements are 128-bit integers, and the operations are defined as follows:
• left shift of a 128-bit integer followed by an exclusive-or operation;
• right shift of each 32-bit integer in a quadruple followed by an and-operation with a quadruple of 32-bit masks, mask = (0xBFFFFFF6 0xBFFAFFFF 0xDDFECB7F 0xDFFFFFEF);
• right shift of a 128-bit integer;
• left shift of each 32-bit integer in a quadruple;
• k-th 32-bit integer in a quadruple.
8.4.7.1 Real Implementation (Single and Double Precision)

The output vector is the sequence of the floating-point values [formula missing].

8.4.7.2 Integer Implementation

The output vector of 32-bit integers is [formula missing], where each element is the k-th 32-bit integer member of a quadruple. Thus, the output vector stores every quadruple (128-bit integer) of the integer output as four 32-bit integers. For example, to get a vector of n quadruples, the output array should be large enough to store 4n 32-bit numbers.

8.4.7.3 Stream Initialization by Function vslNewStream

SFMT19937 generates the stream and initializes it using the 32-bit unsigned integer input parameter seed. The stream state, that is, the array of 156 128-bit integers (624 32-bit integers), is initialized by the procedure described in [Saito08], based on the seed value.

8.4.7.4 Stream Initialization by Function vslNewStreamEx

SFMT19937 generates the stream and initializes it using the array params[] of n 32-bit unsigned integers:
• If n = 1, perform initialization as described in [Saito08] using the array params[] on input.
• If n = 0, assume params[0] = 1, n = 1 and perform initialization as described in the previous item.

8.4.7.5 Subsequences Selection Methods

vslSkipAheadStream | not supported
vslLeapfrogStream | not supported

8.4.7.6 Generator Period

[formula missing]
8.4.7.7 Empirical Testing Results Summary

Test Name | vsRngUniform | vdRngUniform | viRngUniform | viRngUniformBits
3D Spheres Test | OK (30% errors) | OK (30% errors) | N/A | OK (40% errors)
Birthday Spacing Test | N/A | N/A | N/A | OK (0% errors)
Bitstream Test | N/A | N/A | N/A | OK (10% errors)
Rank of 31x31 Binary Matrices Test | N/A | N/A | N/A | OK (0% errors)
Rank of 32x32 Binary Matrices Test | N/A | N/A | N/A | OK (0% errors)
Rank of 6x8 Binary Matrices Test | N/A | N/A | N/A | OK (0% errors)
Counts-the-1’s Test (stream of bits) | N/A | N/A | N/A | OK (10% errors)
Counts-the-1’s Test (stream of specific bytes) | N/A | N/A | N/A | OK (0% errors)
Craps Test | OK (20% errors) | OK (20% errors) | OK (20% errors) | OK (10% errors)
Parking Lot Test | OK (30% errors) | OK (30% errors) | N/A | OK (0% errors)
2D Self-Avoiding Random Walk Test | OK (0% errors) | OK (20% errors) | N/A | OK (10% errors)

Note:
• N/A means that the test is not applicable to this function.
• The tabulated data is obtained using the one-level (threshold) testing technique. The OK result indicates FAIL < 50%, that is, FAILs occur in fewer than 5 runs out of 10. A run fails when its p-value falls outside the interval [0.05, 0.95].
• The stream tested is generated by calling the function vslNewStream with seed=7,777,777.

8.4.8 MT2203

This is a set of 6024 Mersenne Twister pseudorandom number generators (j = 1, ..., 6024): [formulas missing].

The matrix (32x32) has the following format: [matrix missing], with the 32-bit vector [value missing].

The set of 6024 basic pseudorandom number generators MT2203 is a natural addition to the MT19937 generator. MT2203 generators are intended for use in large scale Monte Carlo simulations performed on multi-processor computer systems. These generators possess a smaller period length, but the period of 2^{2203} - 1 is big enough to meet the requirements of modern Monte Carlo problems. MT2203 produces up to 6024 independent random number sequences. The parameters have been carefully chosen according to the method described in [Matsum2000].
8.4.8.1 Real Implementation (Single and Double Precision)

The output vector is the sequence of the floating-point values [formula missing].

8.4.8.2 Integer Implementation

The output vector of 32-bit integers is [formula missing].

8.4.8.3 Stream Initialization by Function vslNewStream

MT2203 generates the stream and initializes it using the 32-bit unsigned integer input parameter seed. The stream state, that is, the array of 69 32-bit integers, is initialized by the procedure described in [MT2002], based on the seed value.

The MT2203 generator is a set of 6024 basic generators. To select an MT2203 generator, add an offset to the named constant VSL_BRNG_MT2203, for example, VSL_BRNG_MT2203+0, VSL_BRNG_MT2203+1, ... . The following example illustrates the initialization of the 10th (of 6024) MT2203 generator:

vslNewStream(&stream, VSL_BRNG_MT2203+9, seed);

8.4.8.4 Stream Initialization by Function vslNewStreamEx

MT2203 generates the stream and initializes it using the array params[] of n 32-bit unsigned integers:
• If n = 1, perform initialization as described in [MT2002] using the array params[] on input.
• If n = 0, assume params[0] = 1, n = 1 and perform initialization as described in the previous item.

8.4.8.5 Subsequences Selection Methods

vslSkipAheadStream | not supported
vslLeapfrogStream | not supported

8.4.8.6 Generator Period

2^{2203} - 1.
8.4.8.7 Empirical Testing Results Summary

Test Name | vsRngUniform | vdRngUniform | viRngUniform | viRngUniformBits
3D Spheres Test | OK (20% errors) | OK (20% errors) | N/A | OK (20% errors)
Birthday Spacing Test | N/A | N/A | N/A | OK (0% errors)
Bitstream Test | N/A | N/A | N/A | OK (15% errors)
Rank of 31x31 Binary Matrices Test | N/A | N/A | N/A | OK (10% errors)
Rank of 32x32 Binary Matrices Test | N/A | N/A | N/A | OK (0% errors)
Rank of 6x8 Binary Matrices Test | N/A | N/A | N/A | OK (0% errors)
Counts-the-1’s Test (stream of bits) | N/A | N/A | N/A | OK (0% errors)
Counts-the-1’s Test (stream of specific bytes) | N/A | N/A | N/A | OK (0% errors)
Craps Test | OK (20% errors) | OK (20% errors) | OK (20% errors) | OK (20% errors)
Parking Lot Test | OK (0% errors) | OK (0% errors) | N/A | OK (0% errors)
2D Self-Avoiding Random Walk Test | OK (10% errors) | OK (0% errors) | N/A | OK (0% errors)

Note:
• N/A means that the test is not applicable to this function.
• The tabulated data is obtained using the one-level (threshold) testing technique. The OK result indicates FAIL < 50%, that is, FAILs occur in fewer than 5 runs out of 10. A run fails when its p-value falls outside the interval [0.05, 0.95].
• The stream tested is generated by calling the function vslNewStream with seed=7,777,777.

8.4.9 SOBOL

This is a 32-bit Gray code-based quasi-random number generator:

x_n = x_{n-1} XOR v_c,
u_n = x_n / 2^{32}.

Note: The value c is the position of the rightmost zero bit in n-1; x_n is an s-dimensional vector of 32-bit values. The s-dimensional vectors v_i (calculated during random stream initialization) are called direction numbers. The vector u_n is the generator output normalized to the unit hypercube (0,1)^s.

Bratley and Fox [Brat87] provide an implementation of the SOBOL quasi-random number generator. The VSL implementation allows generating SOBOL’s low-discrepancy sequences of length up to 2^{32}.
This implementation also admits registration of user-defined parameters (direction numbers and primitive polynomials) during the initialization, which allows obtaining quasi-random vectors of any dimension. If you do not supply user-defined parameters, the default values are used for generation of quasi-random vectors. The default dimension of quasi-random vectors can vary from 1 to 40 inclusive.

8.4.9.1 Real Implementation (Single and Double Precision)

The output vector is the sequence of floating-point values in which the first s elements correspond to the first s-dimensional quasi-random vector, the next s elements correspond to the second vector, and so on.

8.4.9.2 Integer Implementation

The output vector is the sequence of 32-bit integers in which the first s elements correspond to the first s-dimensional quasi-random vector, the next s elements correspond to the second vector, and so on.

8.4.9.3 Stream Initialization by Function vslNewStream

SOBOL generates the stream and initializes it using the input 32-bit parameter seed (dimension dimen of a quasi-random vector):

• Assume dimen = seed.
• If dimen < 1 or dimen > 40, assume dimen = 1.

8.4.9.4 Stream Initialization by Function vslNewStreamEx

SOBOL generates the stream and initializes it using the array params[] of n 32-bit integers to set the dimension dimen of a quasi-random vector as well as to pass other generator-related parameters, for example, initial direction numbers and primitive polynomials. Direction numbers can also be passed using the array. The general interface for passing stream initialization parameters of SOBOL via the params[] array has the following format:

    Position in params[]             Contents
    0                                dimen
    1                                Parameter Class Indicators
    2                                Initial Values Subclass Indicators
    3...2+dimen                      Primitive polynomials
    3+dimen                          Maximum degree of primitive polynomial, maxdeg
    4+dimen...dimen*(maxdeg+1)+3     Initial direction numbers

The dimension parameter params[0] is obligatory, and can be initialized as follows:

    params[0] = dimen;

The other elements of params intended for passing additional user-supplied data are optional.
For example, if they are not present, then default tables of direction numbers are used for generation of quasi-random vectors. VSL default tables of direction numbers allow generating quasi-random sequences for dimensions up to 40. If you want to generate quasi-random vectors of greater dimension or obtain another sequence, you may register a set of your own primitive polynomials and/or a table of initial direction numbers. To do this, set the Parameter Class Indicators field (params[1]) to VSL_USER_QRNG_INITIAL_VALUES:

    params[1] = VSL_USER_QRNG_INITIAL_VALUES;

Further, specify in the Initial Values Subclass Indicators field (params[2]) whether you want to supply primitive polynomials, initial direction numbers, or both, by setting the corresponding indicators. In the example below both direction numbers and primitive polynomials indicators are set:

    params[2] = VSL_USER_INIT_DIRECTION_NUMBERS | VSL_USER_PRIMITIVE_POLYMS;

If you want to provide just initial direction numbers, do it as follows:

    params[2] = VSL_USER_INIT_DIRECTION_NUMBERS;

Similarly, you can indicate that only primitive polynomials are passed to the library:

    params[2] = VSL_USER_PRIMITIVE_POLYMS;

Note: For dimensions greater than 40, both the primitive polynomials and the table of initial direction numbers must be provided.

The remainder of the params array is used to pass primitive polynomials and/or initial direction numbers. Primitive polynomials are packed as unsigned integers; initial direction numbers for SOBOL are assumed to be a two-dimensional table. In this table the i-th row corresponds to the i-th dimension, and the number of columns equals the maximum degree of primitive polynomials maxdeg. The number of polynomials (and the number of rows in the table) depends on the initialization mode for the first dimension.
In the default initialization mode (see [Brat88] for details) it is enough to pass dimen-1 primitive polynomials into the library (correspondingly, the number of rows in the table of initial direction numbers also equals dimen-1). To override default initialization for the first dimension, set the VSL_QRNG_OVERRIDE_1ST_DIM_INIT indicator in params[2]:

    params[2] = params[2] | VSL_QRNG_OVERRIDE_1ST_DIM_INIT;

and pass a complete set of polynomials and/or initial direction numbers (dimen primitive polynomials and the table of initial direction numbers with dimen rows). If you pass only primitive polynomials or only initial direction numbers, the default initialization for the first dimension is always assumed (the number of polynomials and the number of rows in the table of initial direction numbers equals s-1). If both arrays are passed to the generator, organize the data in the correct order: first the polynomials, second the maximum degree of primitive polynomials, and finally the initial direction numbers, as in the example below:

    unsigned int uSobolIrredPoly[dimen] = {...};
    unsigned int uSobolMInit[dimen][maxdeg] = {...};
    ...
    params[0] = dimen;
    params[1] = VSL_USER_QRNG_INITIAL_VALUES;
    params[2] = VSL_USER_INIT_DIRECTION_NUMBERS | VSL_USER_PRIMITIVE_POLYMS;
    params[2] = params[2] | VSL_QRNG_OVERRIDE_1ST_DIM_INIT;
    for ( i = 0; i < dimen; i++ ) params[i+3] = uSobolIrredPoly[i];
    params[3+dimen] = maxdeg;
    k = 4+dimen;
    for ( i = 0; i < dimen; i++ )
    {
        for ( j = 0; j < maxdeg; j++ )
        {
            params[k++] = uSobolMInit[i][j];
        }
    }

Replacement of default initial values for SOBOL with user-provided values can be done as shown in the example below:

    ...
    // dimen = 10
    unsigned int uSobolMInit[dimen-1][maxdeg] = {...};
    params[0] = dimen;
    params[1] = VSL_USER_QRNG_INITIAL_VALUES;
    params[2] = VSL_USER_INIT_DIRECTION_NUMBERS;
    params[3] = maxdeg;
    k = 4;
    for ( i = 0; i < dimen-1; i++ )
    {
        for ( j = 0; j < maxdeg; j++ )
        {
            params[k++] = uSobolMInit[i][j];
        }
    }

You can also calculate a table of direction numbers using your own initial direction numbers and primitive polynomials and pass this array to the generator. The interface for registration of the direction numbers is as follows:

    Position in params[]     Contents
    0                        dimen
    1                        Parameter Class Indicators
    2                        Initial Values Subclass Indicators
    3...dimen*32+2           Direction numbers

As earlier, the dimension parameter params[0] and the Parameter Class Indicators field (params[1]) can be initialized as follows:

    params[0] = dimen;
    params[1] = VSL_USER_QRNG_INITIAL_VALUES;

Further, you need to initialize the Initial Values Subclass Indicators field (params[2]):

    params[2] = VSL_USER_DIRECTION_NUMBERS;

Direction numbers are assumed to be a dimen x 32 table of unsigned integers and can be passed to the generator in the following way:

    unsigned int uSobolV[dimen][32] = {...};
    params[0] = dimen;
    params[1] = VSL_USER_QRNG_INITIAL_VALUES;
    params[2] = VSL_USER_DIRECTION_NUMBERS;
    k = 3;
    for ( i = 0; i < dimen; i++ )
    {
        for ( j = 0; j < 32; j++ )
        {
            params[k++] = uSobolV[i][j];
        }
    }

In short, the SOBOL stream initialization is as follows:

• If n = 0, assume dimen = 1.
• If n = 1, dimen = params[0]. If dimen < 1 or dimen > 40, assume dimen = 1.
• If n > 1, initialize the SOBOL quasi-random stream by means of user-defined primitive polynomials and initial direction numbers, or direction numbers. If externally defined parameters of the generator are packed incorrectly, initialize the stream using default tables of direction numbers.
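The params[] layout for passing both primitive polynomials and initial direction numbers can be checked with a small packing helper. The following is only a sketch: the function names sobol_params_len and sobol_pack_params are hypothetical and not part of VSL, the indicator values are passed in by the caller instead of being taken from the VSL headers, and a complete set of dimen polynomials and dimen table rows is assumed (the case where the first-dimension initialization is overridden).

```c
#include <assert.h>
#include <stddef.h>

/* Number of params[] entries needed for dimen polynomials plus a
   dimen x maxdeg table: the last position is dimen*(maxdeg+1)+3. */
size_t sobol_params_len(size_t dimen, size_t maxdeg)
{
    return dimen * (maxdeg + 1) + 4;
}

/* Hypothetical helper (not a VSL function): packs the fields in the order
   dimension, class indicator, subclass indicator, polynomials, maxdeg,
   initial direction numbers (row-major, one row per dimension).
   Returns the number of entries written, or 0 if the buffer is too small. */
size_t sobol_pack_params(unsigned int *params, size_t capacity,
                         unsigned int dimen, unsigned int maxdeg,
                         unsigned int class_ind, unsigned int subclass_ind,
                         const unsigned int *polys, const unsigned int *minit)
{
    size_t need = sobol_params_len(dimen, maxdeg);
    size_t k = 0, i;

    if (capacity < need) return 0;

    params[k++] = dimen;         /* position 0 */
    params[k++] = class_ind;     /* position 1 */
    params[k++] = subclass_ind;  /* position 2 */
    for (i = 0; i < dimen; i++)  /* positions 3...2+dimen */
        params[k++] = polys[i];
    params[k++] = maxdeg;        /* position 3+dimen */
    for (i = 0; i < (size_t)dimen * maxdeg; i++)
        params[k++] = minit[i];  /* positions 4+dimen...dimen*(maxdeg+1)+3 */

    assert(k == need);
    return k;
}
```

The returned count is the value to pass as n to vslNewStreamEx together with the filled array; in real use class_ind and subclass_ind would be built from the VSL_USER_* constants described above.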
8.4.9.5 Subsequences Selection Methods

vslSkipAheadStream supported
vslLeapfrogStream supported

Note:
• The skip-ahead method skips individual components of quasi-random vectors rather than whole s-dimensional vectors. Hence, to skip N s-dimensional quasi-random vectors, call the vslSkipAheadStream subroutine with the parameter nskip equal to N×s.
• The leapfrog method works with individual components of quasi-random vectors rather than with s-dimensional vectors. In addition, its functionality allows picking out a fixed quasi-random component only. In other words, the nstreams parameter should be equal to the predefined constant VSL_QRNG_LEAPFROG_COMPONENTS, and the k parameter should indicate the index of a component of s-dimensional quasi-random vectors to be picked out (0 ≤ k < s).

8.4.9.6 Generator Period

8.4.9.7 Dimensions

Dimensions 1 ≤ s ≤ 40 are the default set of dimensions; user-defined dimensions are available.

8.4.10 NIEDERREITER

This is a 32-bit Gray code-based quasi-random number generator.

Note: The value c is the rightmost zero bit in n-1; the generator state is an s-dimensional vector of 32-bit values. The s-dimensional vectors calculated during random stream initialization are called direction numbers. The generator output is the state vector normalized to the unit hypercube.

According to the results of Bratley, Fox, and Niederreiter [Brat92], Niederreiter sequences have the best known theoretical asymptotic properties. The VSL implementation allows generating Niederreiter low-discrepancy sequences of length up to 2^32. This implementation also allows for registration of user-defined parameters (irreducible polynomials or direction numbers), which allows obtaining quasi-random vectors of any dimension. If you do not supply user-defined parameters, the default values are used for generation of quasi-random vectors. The default dimension of quasi-random vectors can vary from 1 to 318 inclusive.
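The role of the value c in a Gray code-based generator can be illustrated for a single component. The sketch below assumes the usual Gray-code update, in which the 32-bit state is XOR-ed with direction number number c and then divided by 2^32; this form, the function names, and the one-dimensional direction numbers used in the usage note are assumptions for illustration, not a description of the VSL internals.

```c
#include <assert.h>
#include <stdint.h>

/* Index (0 = least significant) of the rightmost zero bit of m. */
unsigned rightmost_zero_bit(uint32_t m)
{
    unsigned c = 0;
    while (m & 1u) {
        m >>= 1;
        c++;
    }
    return c;
}

/* One assumed Gray-code step for a single component: XOR the previous
   state with direction number v[c], where c is the rightmost zero bit
   of n-1 and n >= 1 is the index of the point being generated. */
uint32_t gray_code_step(uint32_t state, uint32_t n, const uint32_t v[32])
{
    unsigned c;
    assert(n >= 1);
    c = rightmost_zero_bit(n - 1u);
    assert(c < 32);
    return state ^ v[c];
}

/* Normalize a 32-bit state to the unit interval [0, 1). */
double to_unit_interval(uint32_t state)
{
    return (double)state / 4294967296.0;
}
```

With the illustrative direction numbers v[j] = 2^(31-j) and a zero initial state, successive steps for n = 1, 2, 3, 4 give 0.5, 0.75, 0.25, 0.375, which is the base-2 radical-inverse sequence visited in Gray-code order.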
8.4.10.1 Real Implementation (Single and Double Precision)

The output vector is the sequence of floating-point values in which the first s elements correspond to the first s-dimensional quasi-random vector, the next s elements correspond to the second vector, and so on.

8.4.10.2 Integer Implementation

The output vector is the sequence of 32-bit integers in which the first s elements correspond to the first s-dimensional quasi-random vector, the next s elements correspond to the second vector, and so on.

8.4.10.3 Stream Initialization by Function vslNewStream

NIEDERREITER generates the stream and initializes it using the input 32-bit parameter seed (dimension dimen of a quasi-random vector):

• Assume dimen = seed.
• If dimen < 1 or dimen > 318, assume dimen = 1.

8.4.10.4 Stream Initialization by Function vslNewStreamEx

NIEDERREITER generates the stream and initializes it using the array params[] of n 32-bit integers to set the dimension dimen of a quasi-random vector as well as to pass other generator-related parameters, for example, irreducible polynomials or direction numbers (matrix of the generator). The general interface for passing the polynomials via the params[] array has the following format:

    Position in params[]     Contents
    0                        dimen
    1                        Parameter Class Indicators
    2                        Initial Values Subclass Indicators
    3...2+dimen              Irreducible polynomials

The dimension parameter params[0] is obligatory, and can be initialized as follows:

    params[0] = dimen;

The other elements of params intended for passing additional user-supplied data are optional. For example, if they are not present, then the default table of irreducible polynomials is used for generation of quasi-random vectors. VSL default tables of the polynomials allow generating quasi-random sequences for dimensions up to 318. If you want to generate quasi-random vectors of greater dimension or obtain another sequence, you may register a set of your own irreducible polynomials.
To do this, set the Parameter Class Indicators field (params[1]) to VSL_USER_QRNG_INITIAL_VALUES:

    params[1] = VSL_USER_QRNG_INITIAL_VALUES;

Further, indicate in the Initial Values Subclass Indicators field (params[2]) that you want to supply irreducible polynomials:

    params[2] = VSL_USER_IRRED_POLYMS;

The remainder of the params array is used to pass irreducible polynomials. They are packed as unsigned integers and set serially into the corresponding positions of the params array, as shown in the example below (the number of polynomials equals the dimension dimen):

    unsigned int uNiederrIrredPoly[dimen] = {...};
    ...
    params[0] = dimen;
    params[1] = VSL_USER_QRNG_INITIAL_VALUES;
    params[2] = VSL_USER_IRRED_POLYMS;
    for ( i = 0; i < dimen; i++ ) params[i+3] = uNiederrIrredPoly[i];

You can also calculate direction numbers (matrix of the generator) using your own irreducible polynomials and pass this table to the generator. The interface for registration of the direction numbers is as follows:

    Position in params[]     Contents
    0                        dimen
    1                        Parameter Class Indicators
    2                        Initial Values Subclass Indicators
    3...dimen*32+2           Direction numbers

As earlier, the dimension parameter params[0] and the Parameter Class Indicators field (params[1]) can be initialized as follows:

    params[0] = dimen;
    params[1] = VSL_USER_QRNG_INITIAL_VALUES;

Further, you need to initialize the Initial Values Subclass Indicators field (params[2]):

    params[2] = VSL_USER_DIRECTION_NUMBERS;

Direction numbers are assumed to be a dimen x 32 table of unsigned integers and can be passed to the generator in the following way:

    unsigned int uNiederrCJ[dimen][32] = {...};
    params[0] = dimen;
    params[1] = VSL_USER_QRNG_INITIAL_VALUES;
    params[2] = VSL_USER_DIRECTION_NUMBERS;
    k = 3;
    for ( i = 0; i < dimen; i++ )
    {
        for ( j = 0; j < 32; j++ )
        {
            params[k++] = uNiederrCJ[i][j];
        }
    }

In short, NIEDERREITER stream initialization is as follows:

• If n = 0, assume dimen = 1.
• If n = 1, dimen = params[0]. If dimen < 1 or dimen > 318, assume dimen = 1.
• If n > 1, initialize the NIEDERREITER quasi-random stream by means of user-defined polynomials. If externally defined parameters of the generator are packed incorrectly, initialize the stream by setting the dimension to 1 and using default tables of irreducible polynomials.

8.4.10.5 Subsequences Selection Methods

vslSkipAheadStream supported
vslLeapfrogStream supported

Note:
• The skip-ahead method skips individual components of quasi-random vectors rather than whole s-dimensional vectors. Hence, to skip N s-dimensional quasi-random vectors, call the vslSkipAheadStream subroutine with the parameter nskip equal to N×s.
• The leapfrog method works with individual components of quasi-random vectors rather than with s-dimensional vectors. In addition, its functionality allows picking out a fixed quasi-random component only. In other words, the nstreams parameter should be equal to the predefined constant VSL_QRNG_LEAPFROG_COMPONENTS, and the k parameter should indicate the index of a component of s-dimensional quasi-random vectors to be picked out (0 ≤ k < s).

8.4.10.6 Generator Period

8.4.10.7 Dimensions

Dimensions 1 ≤ s ≤ 318 are the default set of dimensions; user-defined dimensions are available.

9 Testing of Distribution Random Number Generators

Optimization Notice

Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors.
Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

VSL generators are tested with a testing suite comprising a set of tests to control the quality of random number sequences of general discrete and continuous distributions. Random numbers of discrete and continuous distributions are generated by transforming random numbers of uniform distribution. A source of uniformly distributed random numbers is a random stream produced by a basic generator. The quality of random number sequences with non-uniform distribution greatly depends on the quality of the respective basic generator. Therefore, generators of discrete and continuous distributions are tested for each individual basic generator.

VSL can provide several methods of random number generation for any probability distribution. For example, two methods are implemented for the Poisson distribution: the PTPE acceptance/rejection algorithm and the PoisNorm inverse transformation algorithm, based on transformation of the normal distribution. The generator is tested for each of the implemented methods.

VSL offers two different implementations for each of the continuous distributions:

• single-precision real arithmetic
• double-precision real arithmetic.

The single-precision generator implementation is, as a rule, faster than the double-precision implementation. Moreover, the single-precision implementation is quite sufficient for most applications. VSL offers only one implementation for discrete distributions.

Apart from the above-mentioned factors, the quality of RNGs also depends on distribution parameters. For example, different transformation techniques may be used for different parameters. Therefore, generators are also tested for different parameter sets.

9.1 Interpreting Test Results

Test results for general distribution generators are interpreted almost in the same way as for basic generators.
For reliable results, either one-level (threshold) or two-level testing is performed.

9.2 Description of Distribution Generator Tests

This section describes the available Distribution Generator Tests:

• Confidence Test
• Distribution Moments Test
• Chi-Squared Goodness-of-Fit Test

9.2.1 Confidence Test

9.2.1.1 Test Purpose

The test checks how well each output member corresponds to the valid range of possible values. For example, for an exponential distribution with parameters a and β all the output members x_i should lie within the range x_i ≥ a. A value x_i < a is impossible, that is, the fact that the variate X of exponential distribution with parameters a and β acquires a value less than a is an impossible event (not to be confused with a null event). Any output member lying outside the valid range constitutes an error.

Such a test is necessary because statistical tests (for example, the distribution moments test or the chi-squared test) are unable to detect a small number (compared with the total sample size) of x_i values falling outside the valid range.

9.2.1.2 Interpreting Final Results

The test gives a certain quantity K of random numbers that lie outside the valid range of values. The test is considered passed if K = 0, and failed otherwise.

9.2.2 Distribution Moments Test

9.2.2.1 Test Purpose

The test verifies that sample moments of a given distribution agree with theoretical moments. Sample mean (first order moment) and sample variance (central moment of the second order) are considered as stable response.

9.2.2.2 First Level Test

The generated random number sequence is used to compute the sample mean M and the sample variance D, which are asymptotically normally distributed. Proceeding from this asymptotic behavior, p-values for the mean and for the variance are found using the values of M and D.

9.2.2.3 Second Level Test

The first level test is run 10 times, each run j = 1, 2, ... , 10 producing a pair of p-values, one for the sample mean and one for the sample variance.
The Kolmogorov-Smirnov goodness-of-fit test with Anderson-Darling’s statistics is applied to the 10 p-values obtained for the sample mean. If the resulting p-value p_M < 0.05 or p_M > 0.95, the test is considered failed for the sample mean. The same procedure is performed for the 10 p-values obtained for the sample variance, and if the resulting p-value p_D < 0.05 or p_D > 0.95, the test is considered failed for the sample variance.

9.2.2.4 Interpreting Final Results

10 runs of the second level test provide the percentage FAIL_M of failed tests for the sample mean and the percentage FAIL_D of failed tests for the sample variance. The final result of the test is the percentage FAIL = max(FAIL_M, FAIL_D). The value of FAIL < 50% is considered acceptable.

9.2.3 Chi-Squared Goodness-of-Fit Test

9.2.3.1 Test Purpose

The test verifies that the sample distribution function agrees with the hypothesized distribution. A chi-squared statistic V with the number of degrees of freedom equal to the number of partition intervals minus one is considered a stable response.

9.2.3.2 First Level Test

For a given parameter set and a given sample size the test computes the partition of the distribution domain into disjoint intervals so that the a priori quantity of random numbers from each interval is of order 100. The test computes the actual number of random values within each interval of the generated sample and then calculates the chi-squared statistic V. Since V asymptotically has the chi-squared distribution F_{k-1}(x) with k - 1 degrees of freedom, where k is the number of intervals, the p-value, which is equal to F_{k-1}(V), should have a distribution that is close to uniform.

9.2.3.3 Second Level Test

The first level test is run 10 times, each run j = 1, 2, ... , 10 producing a p-value. The Kolmogorov-Smirnov goodness-of-fit test with Anderson-Darling’s statistics is applied to the 10 obtained p-values.
If the resulting p-value p_M < 0.05 or p_M > 0.95, the test is considered failed.

9.2.3.4 Interpreting Final Results

The final result of the test is the percentage FAIL of failed second level tests. The second level test is run 10 times. The value of FAIL < 50% is considered acceptable.

9.2.4 Performance

The following factors influence the performance of an RNG of a given distribution:

• architecture and configuration of the hardware and software
• performance of the underlying BRNG
• method of transformation
• number of random numbers to be generated (size of the output vector)
• parameters of a given probability distribution.

VSL random number generators are optimized for Intel(R) Xeon(R) Processor X7560 and Intel(R) Xeon(R) Processor X5670. For more details on performance, see the Vector Statistical Library (VSL) Performance Data document available at http://software.intel.com/en-us/articles/intel-mathkernel-library-documentation/. For earlier Intel processors VSL generators are fully functional, yet not specifically optimized.

The value of CPE (Clocks Per Element), which is independent of the processor clock rate, is selected as the unit of measurement. For example, if the generator performance is equal to 10 CPE and the processor rate is 1 GHz, then the generator produces 10^8 random numbers per second (10^9 clocks per second divided by 10 clocks per element).

The VSL BRNGs differ from each other in speed; therefore, data on performance of general (discrete and continuous) distribution generators is given separately for each BRNG used as an underlying generator to produce uniformly distributed random numbers.

Performance of a general distribution generator also depends on the method chosen for transforming a uniform distribution to a given non-uniform one. This requires specifying the applied transformation method as well.

The length of a generated vector is another factor influencing the performance of the VSL vector type generators.
Calling generators on short vector lengths may prove highly inefficient. See the figure for the typical interdependence between the generator performance and the vector length.

Finally, the generator performance may vary according to probability distribution parameters. The tables provide performance data only for fixed parameter values (or fixed intervals of parameter variations). Table footnotes contain the parameters with which a given performance is obtained. For some transformation methods the performance is approximately the same over a wide range of parameters; such methods are called uniformly fast. For others the performance may vary considerably with the distribution parameters, for example, in the PTPE method for an RNG of Poisson distribution. When the latter is the case, graphs of interdependence between the performance and the distribution parameters are provided.

9.3 Continuous Distribution Functions

This section describes VSL Continuous Distribution Functions:

• Uniform (VSL_RNG_METHOD_UNIFORM_STD/VSL_RNG_METHOD_UNIFORM_STD_ACCURATE)
• Gaussian (VSL_RNG_METHOD_GAUSSIAN_BOXMULLER)
• Gaussian (VSL_RNG_METHOD_GAUSSIAN_BOXMULLER2)
• Gaussian (VSL_RNG_METHOD_GAUSSIAN_ICDF)
• GaussianMV (VSL_RNG_METHOD_GAUSSIANMV_BOXMULLER)
• GaussianMV (VSL_RNG_METHOD_GAUSSIANMV_BOXMULLER2)
• GaussianMV (VSL_RNG_METHOD_GAUSSIANMV_ICDF)
• Exponential (VSL_RNG_METHOD_EXPONENTIAL_ICDF/VSL_RNG_METHOD_EXPONENTIAL_ICDF_ACCURATE)
• Laplace (VSL_RNG_METHOD_LAPLACE_ICDF)
• Weibull (VSL_RNG_METHOD_WEIBULL_ICDF/VSL_RNG_METHOD_WEIBULL_ICDF_ACCURATE)
• Cauchy (VSL_RNG_METHOD_CAUCHY_ICDF)
• Rayleigh (VSL_RNG_METHOD_RAYLEIGH_ICDF/VSL_RNG_METHOD_RAYLEIGH_ICDF_ACCURATE)
• Lognormal (VSL_RNG_METHOD_LOGNORMAL_BOXMULLER2/VSL_RNG_METHOD_LOGNORMAL_BOXMULLER2_ACCURATE)
• Gumbel (VSL_RNG_METHOD_GUMBEL_ICDF)
• Gamma (VSL_RNG_METHOD_GAMMA_GNORM/VSL_RNG_METHOD_GAMMA_GNORM_ACCURATE)
• Beta (VSL_RNG_METHOD_BETA_CJA/VSL_RNG_METHOD_BETA_CJA_ACCURATE)
9.3.1 Uniform (VSL_RNG_METHOD_UNIFORM_STD/VSL_RNG_METHOD_UNIFORM_STD_ACCURATE)

Random number generator of uniform distribution over the real interval [a,b]. You may identify the underlying BRNG by passing the random stream descriptor stream as a parameter. The Uniform function then calls the real implementation (of single precision for vsRngUniform and of double precision for vdRngUniform) of this basic generator.

See http://software.intel.com/sites/products/documentation/hpc/mkl/vsl/vsl_performance_data.htm for test results summary.

9.3.2 Gaussian (VSL_RNG_METHOD_GAUSSIAN_BOXMULLER)

Random number generator of normal (Gaussian) distribution with the parameters a and σ. You may obtain any successive random number x of the standard normal distribution according to the formula (for details, see [Box58])

    $x = \sqrt{-2 \ln u_1}\, \sin(2\pi u_2)$,

where u1, u2 are a pair of successive random numbers uniformly distributed over the interval (0, 1). The random number x of the standard normal distribution is transformed to the random number y of the normal distribution with the parameters a and σ by scaling and the shift y = σx + a.

See http://software.intel.com/sites/products/documentation/hpc/mkl/vsl/vsl_performance_data.htm for test results summary.

9.3.3 Gaussian (VSL_RNG_METHOD_GAUSSIAN_BOXMULLER2)

Random number generator of normal (Gaussian) distribution with the parameters a and σ. You may produce a successive pair of the random numbers x1, x2 of the standard normal distribution according to the formula (for details, see [Box58])

    $x_1 = \sqrt{-2 \ln u_1}\, \sin(2\pi u_2)$,
    $x_2 = \sqrt{-2 \ln u_1}\, \cos(2\pi u_2)$,

where u1, u2 are a pair of successive random numbers uniformly distributed over the interval (0, 1). The random number x of the standard normal distribution is transformed to the random number y of the normal distribution with the parameters a and σ by scaling and the shift y = σx + a.

In VSL you can safely call this method even when the random numbers are generated in blocks whose size is not a multiple of 2. Consider the following example. Suppose you use the method VSL_METHOD_DGAUSSIAN_BOXMULLER2 to generate a pair of random numbers of the standard normal distribution.

Option 1.
Single call of method VSL_METHOD_DGAUSSIAN_BOXMULLER2 with the vector length equal to 2:

    ...
    double x[2];
    ...
    vdRngGaussian(VSL_RNG_METHOD_GAUSSIAN_BOXMULLER2, stream, 2, x, 0.0, 1.0);
    ...

In this case, you generate the random numbers x[0], x[1] by the formula

    $x[0] = \sqrt{-2 \ln u_1}\, \sin(2\pi u_2)$,
    $x[1] = \sqrt{-2 \ln u_1}\, \cos(2\pi u_2)$.

Option 2. Double call of the method VSL_METHOD_DGAUSSIAN_BOXMULLER2 with the vector length equal to 1:

    ...
    double x[2];
    ...
    vdRngGaussian(VSL_RNG_METHOD_GAUSSIAN_BOXMULLER2, stream, 1, &x[0], 0.0, 1.0);
    vdRngGaussian(VSL_RNG_METHOD_GAUSSIAN_BOXMULLER2, stream, 1, &x[1], 0.0, 1.0);
    ...

At the first call of vdRngGaussian you produce the random number x[0] by the formula

    $x[0] = \sqrt{-2 \ln u_1}\, \sin(2\pi u_2)$.

At the second call of vdRngGaussian the vector length, over which you initially called the function to generate the random stream, is recognized as odd (equal to 1 in this case). Then the random number x[1] is generated by the formula

    $x[1] = \sqrt{-2 \ln u_1}\, \cos(2\pi u_2)$

and not by the formula

    $x[1] = \sqrt{-2 \ln u_3}\, \sin(2\pi u_4)$,

as it might be supposed.

See http://software.intel.com/sites/products/documentation/hpc/mkl/vsl/vsl_performance_data.htm for test results summary.

9.3.4 Gaussian (VSL_RNG_METHOD_GAUSSIAN_ICDF)

Random number generator of normal (Gaussian) distribution with the parameters a and σ. You may obtain any successive random number x of the standard normal distribution by the inverse transformation method from a random number u uniformly distributed over the interval (-1, 1), using the function inverse to the error function.

The random number x of the standard normal distribution is transformed to the random number y of the normal distribution with the parameters a and σ by scaling and the shift y = σx + a.

See http://software.intel.com/sites/products/documentation/hpc/mkl/vsl/vsl_performance_data.htm for test results summary.

9.3.5 GaussianMV (VSL_RNG_METHOD_GAUSSIANMV_BOXMULLER)

Random number generator of d-variate (correlated) normal distribution with the parameters a and T.
You may obtain any successive random vector y according to the formula y = Tx + a, where x is a d-dimensional vector of random numbers from the standard normal distribution and T is a lower triangular d×d matrix, the Cholesky factor of the variance-covariance matrix. Random numbers from the standard normal distribution are generated by the method VSL_RNG_METHOD_GAUSSIAN_BOXMULLER.

See http://software.intel.com/sites/products/documentation/hpc/mkl/vsl/vsl_performance_data.htm for test results summary and performance graphs.

9.3.6 GaussianMV (VSL_RNG_METHOD_GAUSSIANMV_BOXMULLER2)

Random number generator of d-variate (correlated) normal distribution with the parameters a and T. You may obtain any successive random vector y according to the formula y = Tx + a, where x is a d-dimensional vector of random numbers from the standard normal distribution and T is a lower triangular d×d matrix, the Cholesky factor of the variance-covariance matrix. Random numbers from the standard normal distribution are generated by the method VSL_RNG_METHOD_GAUSSIAN_BOXMULLER2.

See http://software.intel.com/sites/products/documentation/hpc/mkl/vsl/vsl_performance_data.htm for test results summary and performance graphs.

9.3.7 GaussianMV (VSL_RNG_METHOD_GAUSSIANMV_ICDF)

Random number generator of d-variate (correlated) normal distribution with the parameters a and T. You may obtain any successive random vector y according to the formula y = Tx + a, where x is a d-dimensional vector of random numbers from the standard normal distribution and T is a lower triangular d×d matrix, the Cholesky factor of the variance-covariance matrix. Random numbers from the standard normal distribution are generated by the method VSL_RNG_METHOD_GAUSSIAN_ICDF.

See http://software.intel.com/sites/products/documentation/hpc/mkl/vsl/vsl_performance_data.htm for test results summary and performance graphs.
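The affine transformation shared by the three GaussianMV methods, a lower triangular Cholesky factor applied to a standard normal vector followed by a shift, can be sketched in a few lines. The function name gaussian_mv_transform is hypothetical, and the row-major full storage of T is an assumption made for illustration; it does not describe how VSL stores or applies the matrix internally.

```c
#include <assert.h>
#include <stddef.h>

/* Computes y = T*x + a for a lower triangular d x d matrix T stored in
   full row-major form (entries above the diagonal are never read).
   x holds d standard normal variates; a is the mean vector. */
void gaussian_mv_transform(size_t d, const double *T, const double *a,
                           const double *x, double *y)
{
    size_t i, j;

    assert(d > 0);
    for (i = 0; i < d; i++) {
        double sum = a[i];
        for (j = 0; j <= i; j++) {      /* only the lower triangle */
            sum += T[i * d + j] * x[j];
        }
        y[i] = sum;
    }
}
```

Because T is the Cholesky factor of the variance-covariance matrix, the resulting vector y has that covariance and mean a whenever the components of x are independent standard normal variates, regardless of which Gaussian method produced them.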
9.3.8 Exponential (VSL_RNG_METHOD_EXPONENTIAL_ICDF/VSL_RNG_METHOD_EXPONENTIAL_ICDF_ACCURATE)

Random number generator of the exponential distribution with the parameters a and β. You may generate any successive random number x of the exponential distribution by the inverse transformation method from u, a successive random number of a uniform distribution over the interval (0, 1).

See http://software.intel.com/sites/products/documentation/hpc/mkl/vsl/vsl_performance_data.htm for test results summary.

9.3.9 Laplace (VSL_RNG_METHOD_LAPLACE_ICDF)

Random number generator of the Laplace distribution with the parameters a and β. You may generate any successive random number x of the Laplace distribution by the inverse transformation method from u1, u2, a pair of successive random numbers of a uniform distribution over the interval (0, 1).

See http://software.intel.com/sites/products/documentation/hpc/mkl/vsl/vsl_performance_data.htm for test results summary.

9.3.10 Weibull (VSL_RNG_METHOD_WEIBULL_ICDF/VSL_RNG_METHOD_WEIBULL_ICDF_ACCURATE)

Random number generator of the Weibull distribution with the parameters α, a, and β. You may generate any successive random number x of the Weibull distribution by the inverse transformation method from u, a successive random number of a uniform distribution over the interval (0, 1).

See http://software.intel.com/sites/products/documentation/hpc/mkl/vsl/vsl_performance_data.htm for test results summary.

9.3.11 Cauchy (VSL_RNG_METHOD_CAUCHY_ICDF)

Random number generator of the Cauchy distribution with the parameters a and β. You may generate any successive random number x of the Cauchy distribution by the inverse transformation method from u, a successive random number of a uniform distribution over the interval (-π/2, π/2).
See http://software.intel.com/sites/products/documentation/hpc/mkl/vsl/vsl_performance_data.htm for test results summary.

9.3.12 Rayleigh (VSL_RNG_METHOD_RAYLEIGH_ICDF/VSL_RNG_METHOD_RAYLEIGH_ICDF_ACCURATE)

Random number generator of the Rayleigh distribution with the parameters a and β. You may generate any successive random number x of the Rayleigh distribution by the inverse transformation method from the formula $x = a + \beta \sqrt{-\ln u}$, where u is a successive random number of a uniform distribution over the interval (0, 1).

See http://software.intel.com/sites/products/documentation/hpc/mkl/vsl/vsl_performance_data.htm for test results summary.

9.3.13 Lognormal (VSL_RNG_METHOD_LOGNORMAL_BOXMULLER2/VSL_RNG_METHOD_LOGNORMAL_BOXMULLER2_ACCURATE)

Random number generator of the lognormal distribution with the parameters a, σ, b, and β. You may generate any successive random number x of the lognormal distribution from the formula $x = b + \beta \exp(y)$, where y is a successive random number of a normal (Gaussian) distribution with the parameters a and σ. The random numbers of the normal distribution are generated using the method VSL_RNG_METHOD_GAUSSIAN_BOXMULLER2.

See http://software.intel.com/sites/products/documentation/hpc/mkl/vsl/vsl_performance_data.htm for test results summary.

9.3.14 Gumbel (VSL_RNG_METHOD_GUMBEL_ICDF)

Random number generator of the Gumbel distribution with the parameters a and β. You may generate any successive random number x of the Gumbel distribution from the formula $x = a + \beta \ln(y)$, where y is a successive random number of an exponential distribution with the parameters a=0 and β=1. The random numbers of the exponential distribution are generated using the method VSL_RNG_METHOD_EXPONENTIAL_ICDF.
See http://software.intel.com/sites/products/documentation/hpc/mkl/vsl/vsl_performance_data.htm for test results summary.

9.3.15 Gamma (VSL_RNG_METHOD_GAMMA_GNORM/VSL_RNG_METHOD_GAMMA_GNORM_ACCURATE)

Random number generator of the gamma distribution with the parameters shape α, offset a, and scale factor β. You may generate any successive random number of the standard gamma distribution (a=0, β=1) as follows:
• If α > 1, a gamma distributed random number can be generated as a cube of a properly scaled normal random number [Mars2000]. The algorithm is based on the acceptance/rejection method using the squeeze technique.
• If α < 1, a gamma distributed random number is generated using two acceptance/rejection based algorithms:
  - if α < 0.6, a gamma distributed random number is obtained by transformation of an exponential power distributed random number [Dev86],
  - otherwise, the rejection method from the Weibull distribution is used [Vad77], [Dev86].

Note that when α=1 the gamma distribution reduces to the exponential distribution with the parameters a, β. The random numbers of the exponential distribution are generated using the method VSL_RNG_METHOD_EXPONENTIAL_ICDF. The gamma distributed random number with the parameters α, a, and β is obtained from the standard gamma distributed random number by scaling with β and shifting by a.

See http://software.intel.com/sites/products/documentation/hpc/mkl/vsl/vsl_performance_data.htm for test results summary.

9.3.16 Beta (VSL_RNG_METHOD_BETA_CJA/VSL_RNG_METHOD_BETA_CJA_ACCURATE)

Random number generator of the beta distribution with two shape parameters p and q, offset a, and scale factor β.
You may generate any successive random number of the standard beta distribution (a=0, β=1) as follows:
• If min(p, q) > 1, the Cheng algorithm is used (for details, see [Cheng78]).
• If max(p, q) < 1, a composition of two algorithms is applied: if $q + K p^2 + C \le 0$, where K = 0.852..., C = −0.956..., the Jöhnk algorithm is used (for details, see [Jöhnk64]); otherwise the Atkinson switching algorithm is used (for details, see [Atkin79]).
• If min(p, q) < 1 and max(p, q) > 1, the random numbers are generated using the switching algorithm of Atkinson (for details, see [Atkin79]).
• If p=1 or q=1, the inverse transformation method is used.
• If p=1 and q=1, the standard beta distribution reduces to the uniform distribution over the interval (0, 1). The random numbers of the uniform distribution are generated using the VSL_RNG_METHOD_UNIFORM_STD method.

The algorithms of Cheng and Atkinson use the acceptance/rejection technique. The beta distributed random number with the parameters p, q, a, and β is obtained from the standard beta distributed random number by scaling with β and shifting by a.

See http://software.intel.com/sites/products/documentation/hpc/mkl/vsl/vsl_performance_data.htm for test results summary.

9.4 Discrete Distribution Functions

This section describes VSL Discrete Distribution Functions:
• Uniform (VSL_RNG_METHOD_UNIFORM_STD)
• UniformBits (VSL_RNG_METHOD_UNIFORMBITS_STD)
• UniformBits32 (VSL_RNG_METHOD_UNIFORMBITS32_STD)
• UniformBits64 (VSL_RNG_METHOD_UNIFORMBITS64_STD)
• Bernoulli (VSL_RNG_METHOD_BERNOULLI_ICDF)
• Geometric (VSL_RNG_METHOD_GEOMETRIC_ICDF)
• Binomial (VSL_RNG_METHOD_BINOMIAL_BTPE)
• Hypergeometric (VSL_RNG_METHOD_HYPERGEOMETRIC_H2PE)
• Poisson (VSL_RNG_METHOD_POISSON_PTPE)
• Poisson (VSL_RNG_METHOD_POISSON_POISNORM)
• PoissonV (VSL_RNG_METHOD_POISSONV_POISNORM)
• NegBinomial (VSL_RNG_METHOD_NEGBINOMIAL_NBAR)

9.4.1 Uniform (VSL_RNG_METHOD_UNIFORM_STD)

Uniform discrete distribution over the integer interval [a, b).
You may generate any successive random number k of the uniform distribution by the formula $k = \lfloor u \rfloor$, where u is a successive random number of a uniform (continuous) distribution over the interval [a, b), and $\lfloor x \rfloor$ stands for the operation floor(x) that produces the maximum integer that does not exceed x.

See http://software.intel.com/sites/products/documentation/hpc/mkl/vsl/vsl_performance_data.htm for test results summary.

9.4.2 UniformBits (VSL_RNG_METHOD_UNIFORMBITS_STD)

A random number generator of uniform distribution that produces an integer sequence (not normalized to the interval (0, 1)). You identify the underlying BRNG by passing the random stream descriptor stream as a parameter. The UniformBits function then calls the integer implementation of this basic generator. Basic generators differ in bit capacity and in the structure of the integer output, so you should take care to interpret the output integer array of the function viRngUniformBits correctly. The following table provides rules for interpreting the 32-bit integer output r[i] for each VSL basic generator.

[Table: for each basic generator (MCG31m1, R250, MRG32k3a, MCG59, WH, MT19937, MT2203, SFMT19937, SOBOL, NIEDERR), the integer recurrence and the interpretation of the 32-bit integer output array r[i] after calling viRngUniformBits. For SOBOL and NIEDERR, s is the dimension of the quasi-random vector.]

Notes:
• The lower 32 bits of the 64-bit unsigned integer x are $x \bmod 2^{32}$.
• The upper 32 bits of the 64-bit unsigned integer x are $\lfloor x / 2^{32} \rfloor$.
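The two notes above, on taking the lower and upper 32 bits of a 64-bit unsigned integer, can be sketched in C as follows. The helper names lower32, upper32, and split64 are hypothetical, and the order in which split64 stores the two halves is an assumption made only for illustration; consult the table above for the layout that applies to a particular basic generator.

```c
#include <stddef.h>
#include <stdint.h>

/* Lower 32 bits of a 64-bit unsigned integer: x mod 2^32. */
uint32_t lower32(uint64_t x)
{
    return (uint32_t)(x & 0xFFFFFFFFu);
}

/* Upper 32 bits of a 64-bit unsigned integer: floor(x / 2^32). */
uint32_t upper32(uint64_t x)
{
    return (uint32_t)(x >> 32);
}

/* Hypothetical helper: unpacks n 64-bit values into 2n 32-bit words,
 * lower half first (the ordering is an assumption of this sketch). */
void split64(const uint64_t *in, size_t n, uint32_t *out)
{
    for (size_t i = 0; i < n; ++i) {
        out[2 * i]     = lower32(in[i]);
        out[2 * i + 1] = upper32(in[i]);
    }
}
```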
So, when you generate an integer sequence of n elements, the output array r[i] of the function viRngUniformBits comprises:
• n elements for the basic generators MCG31m1, R250, MRG32k3a, MT19937, MT2203, SOBOL, and NIEDERR
• 2n elements for the basic generator MCG59
• 4n elements for the basic generators WH and SFMT19937.

You may use the integer output, in particular, for fast generation of bit vectors. However, in this case some bits (or groups of them) may happen to be non-random. For example, lower bits produced by linear congruential generators are less random than their higher bits. Note that quasi-random numbers are not random at all. Thoroughly check the integer output bits and bit groups for randomness before forming bit vectors from the r[i] array.

See http://software.intel.com/sites/products/documentation/hpc/mkl/vsl/vsl_performance_data.htm for test results summary.

9.4.3 UniformBits32 (VSL_RNG_METHOD_UNIFORMBITS32_STD)

A random number generator that produces uniformly distributed bits in 32-bit chunks. Some basic random number generators produce integers in which not all of the bits are uniformly distributed, for example:
• The least significant bits in the integers produced by the MCG59 BRNG are less random: the lower four bits form a congruential sequence of period at most 16, and the least significant bit is either constant or strictly alternating (see, for example, [Knuth81]).
• By design, some BRNGs do not generate the most significant bits and set them to zero: MCG31m1 is a 31-bit generator, and MCG59 is a 59-bit generator.

The UniformBits32 function transforms the underlying BRNG integer recurrence so that all bits in 32-bit chunks are uniformly distributed.
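To make the remark about weak low-order bits concrete, here is a small C sketch of a multiplicative congruential recurrence modulo 2^59 that measures how quickly its lowest bits repeat. The multiplier 13^13 is the value commonly published for MCG59 and is treated here as an assumption; the function names are hypothetical, and the code does not call or reproduce the VSL implementation.

```c
#include <stdint.h>

/* Mask that reduces a 64-bit product modulo 2^59. */
#define MCG_MODULUS_MASK ((UINT64_C(1) << 59) - 1)

/* One step of x <- a*x mod 2^59. The product wraps modulo 2^64,
 * and masking afterwards still yields the residue modulo 2^59. */
uint64_t mcg_next(uint64_t x, uint64_t a)
{
    return (a * x) & MCG_MODULUS_MASK;
}

/* Returns 13^13 (assumed multiplier; it fits in 64 bits). */
uint64_t mcg_multiplier(void)
{
    uint64_t a = 1;
    for (int i = 0; i < 13; ++i)
        a *= 13;
    return a;
}

/* Number of steps until the lowest `bits` bits of the state return to
 * their starting value; returns 0 if they never do within 2^bits steps
 * (possible only for an even multiplier). Requires bits < 64. */
unsigned long low_bits_period(uint64_t seed, uint64_t a, unsigned bits)
{
    const uint64_t mask = (UINT64_C(1) << bits) - 1;
    const uint64_t start = seed & mask;
    uint64_t x = seed;
    for (uint64_t steps = 1; steps <= mask + 1; ++steps) {
        x = mcg_next(x, a);
        if ((x & mask) == start)
            return (unsigned long)steps;
    }
    return 0;
}
```

Calling low_bits_period with bits set to 1 or 4 and an odd seed shows how short these cycles are compared with the period of the full state, which is why the lowest bits should not be used directly as random bits.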
This function does not support the following VSL BRNGs:
• VSL_BRNG_MCG31
• VSL_BRNG_R250
• VSL_BRNG_MRG32K3A
• VSL_BRNG_WH
• VSL_BRNG_SOBOL
• VSL_BRNG_NIEDERR
• VSL_BRNG_IABSTRACT
• VSL_BRNG_DABSTRACT
• VSL_BRNG_SABSTRACT

9.4.4 UniformBits64 (VSL_RNG_METHOD_UNIFORMBITS64_STD)

A random number generator that produces uniformly distributed bits in 64-bit chunks. The generator addresses the same BRNG issues as its 32-bit counterpart, UniformBits32. The UniformBits64 function transforms the underlying BRNG integer recurrence so that all bits in 64-bit chunks are uniformly distributed.

This function does not support the following VSL BRNGs:
• VSL_BRNG_MCG31
• VSL_BRNG_R250
• VSL_BRNG_MRG32K3A
• VSL_BRNG_WH
• VSL_BRNG_SOBOL
• VSL_BRNG_NIEDERR
• VSL_BRNG_IABSTRACT
• VSL_BRNG_DABSTRACT
• VSL_BRNG_SABSTRACT

9.4.5 Bernoulli (VSL_RNG_METHOD_BERNOULLI_ICDF)

Bernoulli distribution with the parameter p. You may generate any successive random number k of the Bernoulli distribution by the inverse transformation method from u, where u is a successive random number of a uniform distribution over the interval [0, 1).

See http://software.intel.com/sites/products/documentation/hpc/mkl/vsl/vsl_performance_data.htm for test results summary.

9.4.6 Geometric (VSL_RNG_METHOD_GEOMETRIC_ICDF)

Geometric distribution with the parameter p. You may generate any successive random number k of the geometric distribution by the inverse transformation method from u, where u is a successive random number of a uniform distribution over the interval [0, 1).

See http://software.intel.com/sites/products/documentation/hpc/mkl/vsl/vsl_performance_data.htm for test results summary.

9.4.7 Binomial (VSL_RNG_METHOD_BINOMIAL_BTPE)

Binomial distribution with the parameters ntrial and p. Depending on the values of ntrial and p, random numbers of the binomial distribution are generated either by the BTPE method (see [Kach88] for details) or by a combination of inverse transformation and table lookup methods.
The BTPE method is a variation of the acceptance/rejection method that uses linear (on the fractions close to the distribution mode) and exponential (at the distribution tails) functions as majorizing functions. To avoid time-consuming acceptance/rejection checks, areas with zero probability of rejection are introduced and the squeezing technique is applied.

See http://software.intel.com/sites/products/documentation/hpc/mkl/vsl/vsl_performance_data.htm for test results summary and performance graphs.

9.4.8 Hypergeometric (VSL_RNG_METHOD_HYPERGEOMETRIC_H2PE)

Hypergeometric distribution with the parameters l, s, and m. Depending on the values of l, s, and m, the random numbers are generated either by the H2PE method (see [Kach85] for details) or by the inverse transformation method in combination with the table lookup method.

The H2PE method is a variation of the acceptance/rejection method that uses constant (on the fraction close to the distribution mode) and exponential (at the distribution tails) functions as majorizing functions. To avoid time-consuming acceptance/rejection checks, the squeezing technique is applied.

See http://software.intel.com/sites/products/documentation/hpc/mkl/vsl/vsl_performance_data.htm for test results summary and performance graphs.

9.4.9 Poisson (VSL_RNG_METHOD_POISSON_PTPE)

Poisson distribution with the parameter λ. Depending on the value of λ, random numbers are generated either by the PTPE method (see [Schmeiser81] for details) or by a combination of inverse transformation and table lookup methods.

The PTPE method is a variation of the acceptance/rejection method that uses linear (on the fraction close to the distribution mode) and exponential (at the distribution tails) functions as majorizing functions.
To avoid time-consuming acceptance/rejection checks, areas with zero probability of rejection are introduced and the squeezing technique is applied.

See http://software.intel.com/sites/products/documentation/hpc/mkl/vsl/vsl_performance_data.htm for test results summary and performance graphs.

9.4.10 Poisson (VSL_RNG_METHOD_POISSON_POISNORM)

Poisson distribution with the parameter λ. Depending on the value of λ, the random numbers are generated either by a combination of inverse transformation and table lookup methods or through transformation of normally distributed random numbers. The VSL_RNG_METHOD_GAUSSIAN_BOXMULLER2 method is used to generate random numbers of normal distribution.

See http://software.intel.com/sites/products/documentation/hpc/mkl/vsl/vsl_performance_data.htm for test results summary and performance graphs.

9.4.11 PoissonV (VSL_RNG_METHOD_POISSONV_POISNORM)

Poisson distribution with the parameter λ. Depending on the value of λ, the random numbers are generated either by the inverse transformation method or through transformation of normally distributed random numbers. The VSL_RNG_METHOD_GAUSSIAN_BOXMULLER2 method is used to generate random numbers of normal distribution.

See http://software.intel.com/sites/products/documentation/hpc/mkl/vsl/vsl_performance_data.htm for test results summary and performance graphs.

9.4.12 NegBinomial (VSL_RNG_METHOD_NEGBINOMIAL_NBAR)

Negative binomial distribution with the parameters a and p. Depending on the values of a and p, the random numbers are generated either by the NBAR method or by a combination of inverse transformation and table lookup methods.

The NBAR method is a variation of the acceptance/rejection method that uses constant and linear functions (on the fraction close to the distribution mode) and exponential functions (at the distribution tails) as majorizing functions.
To ensure that the majorizing functions are close to the normalized probability mass function, five 2D figures are formed from the majorizing and minorizing functions as well as from other auxiliary curves. To avoid time-consuming acceptance/rejection checks, areas with zero probability of rejection are introduced.

See http://software.intel.com/sites/products/documentation/hpc/mkl/vsl/vsl_performance_data.htm for test results summary and performance graphs.

Bibliography

[Ant79] Antonov, I.A., and Saleev, V.M. An economic method of computing LPτ-sequences. USSR Comput. Math. Math. Phys., 19, 252-256, 1979.
[Atkin79] Atkinson, A.C. A family of switching algorithms for the computer generation of beta random variables. Biometrika, 66, 1, 141-145, 1979.
[Box58] Box, G.E.P., and Muller, M.E. A Note on the Generation of Random Normal Deviates. Ann. Math. Stat., 29, 610-611, 1958.
[Brat87] Bratley, P., Fox, B.L., and Schrage, L.E. A Guide to Simulation, 2nd Edition, Springer-Verlag, New York, 1987.
[Brat88] Bratley, P., and Fox, B.L. ALGORITHM 659: Implementing Sobol’s Quasirandom Sequence Generator. ACM Transactions on Mathematical Software, Vol. 14, No. 1, 88-100, March 1988.
[Brat92] Bratley, P., Fox, B.L., and Niederreiter, H. Implementation and Tests of Low-Discrepancy Sequences. ACM Transactions on Modeling and Computer Simulation, Vol. 2, No. 3, 195-213, July 1992.
[Cheng78] Cheng, R.C.H. Generating Beta Variates with Nonintegral Shape Parameters. Communications of the ACM, 21, 4, 317-322, 1978.
[Cram46] Cramer, H. Mathematical Methods of Statistics. Cambridge, 1946.
[Dev86] Devroye, L. Non-Uniform Random Variate Generation, Springer-Verlag, New York, 1986.
[Ent98] Entacher, Karl. Bad Subsequences of Well-Known Linear Congruential Pseudorandom Number Generators. ACM Transactions on Modeling and Computer Simulation, Vol. 8, No. 1, 61-70, January 1998.
[Jöhnk64] Jöhnk, M.D.
Erzeugung von Betaverteilten und Gammaverteilten Zufallszahlen. Metrika, 8, 5-15, 1964.
[Jun99] Jun, B., and Kocher, P. The Intel Random Number Generator. White paper prepared for Intel Corp., Cryptography Research, Inc., April 1999.
[Kach88] Kachitvichyanukul, V., and Schmeiser, B.W. Binomial random variate generation. Communications of the ACM, Volume 31, Issue 2, February 1988.
[Kach85] Kachitvichyanukul, V., and Schmeiser, B.W. Computer generation of hypergeometric random variates. J. Stat. Comput. Simul., 22, 1, 127-145, 1985.
[Kirk81] Kirkpatrick, S., and Stoll, E. A Very Fast Shift-Register Sequence Random Number Generator. Journal of Computational Physics, V. 40, 517-526, 1981.
[Knuth81] Knuth, Donald E. The Art of Computer Programming, Volume 2, Seminumerical Algorithms, 2nd edition, Addison-Wesley Publishing Company, Reading, Massachusetts, 1981.
[L’Ecu94] L’Ecuyer, Pierre. Uniform Random Number Generators. Annals of Operations Research, 53, 77-120, 1994.
[L’Ecu99] L’Ecuyer, P. Good Parameter Sets for Combined Multiple Recursive Random Number Generators. Operations Research, 47, 1, 159-164, 1999.
[L’Ecuyer99] L’Ecuyer, Pierre. Tables of Linear Congruential Generators of Different Sizes and Good Lattice Structure. Mathematics of Computation, 68, 249-260, 1999.
[MacLaren89] MacLaren, N.M. The Generation of Multiple Independent Sequences of Pseudorandom Numbers. Applied Statistics, 38, 351-359, 1989.
[Mars95] Marsaglia, G. The Marsaglia Random Number CDROM, including the DIEHARD Battery of Tests of Randomness, Department of Statistics, Florida State University, Tallahassee, Florida, 1995.
[Mars2000] Marsaglia, G., and Tsang, W.W. A simple method for generating gamma variables. ACM Transactions on Mathematical Software, Vol. 26, No. 3, Pages 363-372, September 2000.
[Matsum92] Matsumoto, M., and Kurita, Y. Twisted GFSR generators. ACM Transactions on Modeling and Computer Simulation, Vol. 2, No.
3, Pages 179-194, July 1992.
[Matsum94] Matsumoto, M., and Kurita, Y. Twisted GFSR generators II. ACM Transactions on Modeling and Computer Simulation, Vol. 4, No. 3, Pages 254-266, July 1994.
[Matsum98] Matsumoto, M., and Nishimura, T. Mersenne Twister: A 623-Dimensionally Equidistributed Uniform Pseudo-Random Number Generator. ACM Transactions on Modeling and Computer Simulation, Vol. 8, No. 1, Pages 3-30, January 1998.
[Matsum2000] Matsumoto, M., and Nishimura, T. Dynamic Creation of Pseudorandom Number Generators, 56-69, in: Monte Carlo and Quasi-Monte Carlo Methods 1998, Ed. Niederreiter, H. and Spanier, J., Springer 2000, http://www.math.sci.hiroshima-u.ac.jp/%7Em-mat/MT/DC/dc.html.
[Mikh2000] Mikhailov, G.A. Weight Monte Carlo Methods, Novosibirsk: SB RAS Publ., 2000 (In Russian).
[MT2002] http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/MT2002/emt19937ar.html
[NAG] Numerical Algorithms Group, www.nag.co.uk.
[Ripley87] Ripley, B.D. Stochastic Simulation, Wiley, New York, 1987.
[Saito08] Saito, M., and Matsumoto, M. SIMD-oriented Fast Mersenne Twister: a 128-bit Pseudorandom Number Generator. Monte Carlo and Quasi-Monte Carlo Methods 2006, Springer, pp. 607-622, 2008, http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/ARTICLES/earticles.html
[Schmeiser81] Schmeiser, Bruce, and Kachitvichyanukul, Voratas. Poisson Random Variate Generation. Research Memorandum 81-4, School of Industrial Engineering, Purdue University, 1981.
[Vad77] Vaduva, I. On computer generation of gamma random variables by rejection and composition procedures. Mathematische Operationsforschung und Statistik, Series Statistics, vol. 8, 545-576, 1977.
[Ziff98] Ziff, Robert M. Four-tap shift-register-sequence random-number generators. Computers in Physics, Vol. 12, No. 4, Jul/Aug 1998.
Intel® C++ Composer XE 2011 Getting Started Tutorials

Document Number: 323648-001US
World Wide Web: http://developer.intel.com
Legal Information

Contents

Legal Information
Introducing the Intel® C++ Composer XE 2011
Prerequisites
Chapter 1: Navigation Quick Start
  Starting the Intel® C++ Compiler from the Eclipse* IDE
  Starting the Intel® C++ Compiler from the Command Line
  Starting the Intel® Debugger
Chapter 2: Tutorial: Intel® C++ Compiler
  Using Auto Vectorization
    Introduction to Auto-vectorization
    Establishing a Performance Baseline
    Generating a Vectorization Report
    Improving Performance by Pointer Disambiguation
    Improving Performance by Aligning Data
    Improving Performance with Interprocedural Optimization
    Additional Exercises
  Using Guided Auto-parallelization
    Introduction to Guided Auto-parallelization
    Preparing the Project for Guided Auto-parallelization
    Running Guided Auto-parallelization
    Analyzing Guided Auto-parallelization Reports
    Implementing
Guided Auto-parallelization Recommendations

Legal Information

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL(R) PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR.

Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order. Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or by visiting Intel's Web Site.
Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. See http://www.intel.com/products/processor_number for details.

Centrino, Cilk, Intel, Intel Atom, Intel Core, Intel NetBurst, Itanium, MMX, Pentium, Xeon are trademarks of Intel Corporation in the U.S. and/or other countries.

* Other names and brands may be claimed as the property of others.

Portions Copyright © 2001, Hewlett-Packard Development Company, L.P.
Copyright © 2011, Intel Corporation. All rights reserved.

Introducing the Intel® C++ Composer XE 2011

This guide shows you how to start the Intel® C++ Composer XE 2011 and begin debugging code using the Intel® Debugger.

The Intel® C++ Composer XE 2011 is a comprehensive set of software development tools that includes the following components:
• Intel® C++ Compiler
• Intel® Integrated Performance Primitives
• Intel® Threading Building Blocks
• Intel® Math Kernel Library
• Intel® Debugger

Prerequisites

You need the following tools, skills, and knowledge to effectively use these tutorials.

Required Skills and Knowledge

These tutorials are designed for developers with a basic understanding of the Linux* operating system, including how to:
• install the Intel® C++ Composer XE 2011 on a supported Linux distribution. See the Release Notes.
• open a Linux shell and execute fundamental commands including make.
• compile and link C/C++ source files.

Chapter 1: Navigation Quick Start

Starting the Intel® C++ Compiler from the Eclipse* IDE

The Intel® C++ Compiler XE 12.1 for Linux* OS compiles C and C++ source files on Linux* operating systems. The compiler is supported on IA-32 and Intel® 64 architectures.
You must first install and configure Eclipse on your system; then you can configure Eclipse to use the Intel® C++ Compiler XE 12.1. See the Getting Started section in the compiler documentation for current information about compiling applications with Eclipse*. The Using Eclipse* section provides detailed information about configuring and using Eclipse with the Intel® C/C++ Compilers.

Starting the Intel® C++ Compiler from the Command Line

The Intel® C++ Compiler XE 12.1 for Linux* OS compiles C and C++ source files on Linux* operating systems. The compiler is supported on IA-32 and Intel® 64 architectures.

Start using the compiler by performing the following steps:
1. Open a terminal session.
2. Set the environment variables for the compiler.
3. Invoke the compiler.

One way to set the environment variables prior to invoking the compiler is to "source" the compiler environment script, compilervars.sh (or compilervars.csh):

source <install-dir>/bin/compilervars.sh <arch>

where <install-dir> is the directory structure containing the compiler /bin directory, and <arch> is the architecture argument listed below.

The environment script takes an argument based on architecture. Valid arguments are as follows:
• ia32: Compilers and libraries for IA-32 architectures only
• intel64: Compilers and libraries for Intel® 64 architectures only

To compile C source files, use a command similar to the following:

icc my_source_file.c

To compile C++ source files, use a command similar to the following:

icpc my_source_file.cpp

Following successful compilation, an executable is created in the current directory.
Starting the Intel® Debugger

The Intel® Debugger (IDB) is a full-featured symbolic source code application debugger that helps programmers:
• Debug programs
• Disassemble and examine machine code and examine machine register values
• Debug programs with shared libraries
• Debug multithreaded applications

The debugger features include:
• C/C++ language support
• Assembler language support
• Access to the registers your application accesses
• Bitfield editor to modify registers
• MMU support

On Linux*, you can use the Intel Debugger from a Java* GUI application or the command line.
• To start the GUI for the Intel Debugger, execute the idb command from a Linux shell.
• To start the command-line invocation of the Intel Debugger, execute the idbc command from a Linux shell.

Chapter 2: Tutorial: Intel® C++ Compiler

Using Auto Vectorization

Introduction to Auto-vectorization

For the Intel® C++ Compiler, vectorization is the unrolling of a loop combined with the generation of packed SIMD instructions. Because the packed instructions operate on more than one data element at a time, the loop can execute more efficiently. It is sometimes referred to as auto-vectorization to emphasize that the compiler automatically identifies and optimizes suitable loops on its own.

Using the -vec (Linux* OS) or the /Qvec (Windows* OS) option enables vectorization at default optimization levels for both Intel® microprocessors and non-Intel microprocessors. Vectorization may call library routines that can result in a greater performance gain on Intel microprocessors than on non-Intel microprocessors. The vectorization can also be affected by certain options, such as /arch or /Qx (Windows) or -m or -x (Linux and Mac OS X).

Vectorization is enabled with the Intel C++ Compiler at optimization levels of -O2 and higher.
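As a minimal sketch of what a vectorization candidate looks like, the first C function below has unit-stride accesses and no dependence between iterations when x and y do not overlap, while the second carries a value from one iteration to the next and therefore cannot simply be split into independent SIMD lanes. Both functions and their names are illustrative assumptions and are not taken from the tutorial's sample files; whether a particular compiler vectorizes the first loop depends on its options and target.

```c
#include <stddef.h>

/* Candidate for auto-vectorization: each iteration reads x[i] and y[i]
 * and writes y[i]; iterations are independent if x and y do not overlap. */
void scale_add(size_t n, float alpha, const float *x, float *y)
{
    for (size_t i = 0; i < n; ++i)
        y[i] = alpha * x[i] + y[i];
}

/* Loop-carried dependence: iteration i needs the result of iteration
 * i-1, so the iterations cannot be executed as independent lanes. */
void prefix_sum(size_t n, float *a)
{
    for (size_t i = 1; i < n; ++i)
        a[i] += a[i - 1];
}
```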
Many loops are vectorized automatically, but in cases where this doesn't happen, you may be able to vectorize loops by making simple code modifications. In this tutorial, you will:
• establish a performance baseline
• generate a vectorization report
• improve performance by pointer disambiguation
• improve performance by aligning data
• improve performance using Interprocedural Optimization

Locating the Samples

To begin this tutorial, locate the source files in the product's Samples directory:

/Samples//C++/vec_samples/

Use these files for this tutorial:
• Driver.c
• Multiply.c
• Multiply.h

Establishing a Performance Baseline

To set a performance baseline for the improvements that follow in this tutorial, compile your sources with these compiler options:

icc -O1 -std=c99 -DNOFUNCCALL Multiply.c Driver.c -o MatVector

Execute MatVector and record the execution time reported in the output. This is the baseline against which subsequent improvements will be measured.

This example uses a variable length array (VLA), and therefore must be compiled with the -std=c99 option.

Generating a Vectorization Report

A vectorization report tells you whether the loops in your code were vectorized, and if not, explains why not.

Because vectorization is off at -O1, the compiler does not generate a vectorization report, so recompile at -O2 (default optimization):

icc -std=c99 -DNOFUNCCALL -vec-report1 Multiply.c Driver.c -o MatVector

Record the new execution time. The reduction in time is mostly due to auto-vectorization of the inner loop at line 150 noted in the vectorization report:

Driver.c(150) (col. 4): remark: LOOP WAS VECTORIZED.
Driver.c(164) (col. 2): remark: LOOP WAS VECTORIZED.
Driver.c(81) (col. 2): remark: LOOP WAS VECTORIZED.

The -vec-report2 option returns a list that also includes loops that were not vectorized, along with the reason why the compiler did not vectorize them.
icc -std=c99 -DNOFUNCCALL -vec-report2 Multiply.c Driver.c -o MatVector

The vectorization report indicates that the loop at line 45 in Multiply.c did not vectorize because it is not the innermost loop of the loop nest. Two versions of the innermost loop at line 55 were generated, but neither version was vectorized.

Multiply.c(45) (col. 2): remark: loop was not vectorized: not inner loop.
Multiply.c(55) (col. 3): remark: loop was not vectorized: existence of vector dependence.
Multiply.c(55) (col. 3): remark: loop skipped: multiversioned.
Driver.c(140) (col. 2): remark: loop was not vectorized: not inner loop.
Driver.c(140) (col. 2): remark: loop was not vectorized: vectorization possible but seems inefficient.
Driver.c(141) (col. 2): remark: loop was not vectorized: vectorization possible but seems inefficient.
Driver.c(145) (col. 2): remark: loop was not vectorized: not inner loop.
Driver.c(148) (col. 3): remark: loop was not vectorized: not inner loop.
Driver.c(150) (col. 4): remark: LOOP WAS VECTORIZED.
Driver.c(164) (col. 2): remark: LOOP WAS VECTORIZED.
Driver.c(81) (col. 2): remark: LOOP WAS VECTORIZED.
Driver.c(69) (col. 2): remark: loop was not vectorized: vectorization possible but seems inefficient.
Driver.c(54) (col. 2): remark: loop was not vectorized: not inner loop.
Driver.c(55) (col. 3): remark: loop was not vectorized: vectorization possible but seems inefficient.

NOTE. For more information on the -vec-report compiler option, see the Compiler Options section in the Compiler User and Reference Guide.

Improving Performance by Pointer Disambiguation

Two pointers are aliased if both point to the same memory location. Storing to memory using a pointer that might be aliased may prevent some optimizations. For example, it may create a dependency between loop iterations that would make vectorization unsafe.
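The effect of aliasing on a loop can be seen in the following C sketch. The functions add_one and add_one_noalias are hypothetical examples rather than code from Multiply.c. When the destination overlaps the source shifted by one element, each iteration reads the value stored by the previous one, which is exactly the kind of dependence a compiler must allow for unless it is told otherwise, here with the C99 restrict qualifier.

```c
#include <stddef.h>

/* Without restrict, the compiler must assume dst and src may overlap,
 * so a store to dst[i] might change a later src[i + k]. */
void add_one(size_t n, float *dst, const float *src)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[i] + 1.0f;
}

/* With C99 restrict, the caller promises that dst and src do not
 * overlap; passing overlapping arrays here is undefined behavior. */
void add_one_noalias(size_t n, float *restrict dst,
                     const float *restrict src)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[i] + 1.0f;
}
```

Calling add_one(4, buf + 1, buf) on a zero-filled array of five floats produces 0, 1, 2, 3, 4, because every iteration depends on the store made by the one before it. The same call with separate input and output arrays produces all ones, and only that non-overlapping case is valid for add_one_noalias.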
Sometimes, the compiler can generate both a vectorized and a non-vectorized version of a loop and test for aliasing at runtime to select the appropriate code path. If you know that pointers do not alias and inform the compiler, it can avoid the runtime check and generate a single vectorized code path.

In Multiply.c, the compiler generates runtime checks to determine whether or not the pointer b in function matvec(FTYPE a[][COLWIDTH], FTYPE b[], FTYPE x[]) is aliased to either a or x. If Multiply.c is compiled with the NOALIAS macro, the restrict qualifier of the argument b informs the compiler that the pointer does not alias with any other pointer, and in particular that the array b does not overlap with a or x.

NOTE. The restrict qualifier requires the use of either the -restrict compiler option for .c or .cpp files, or the -std=c99 compiler option for .c files.

Replace the NOFUNCCALL macro with NOALIAS.

    icc -std=c99 -vec-report2 -DNOALIAS Multiply.c Driver.c -o MatVector

This conditional compilation replaces the loop in the main program with a function call. Execute MatVector and record the execution time reported in the output.

    Multiply.c(45) (col. 2): remark: loop was not vectorized: not inner loop.
    Multiply.c(55) (col. 3): remark: LOOP WAS VECTORIZED.
    Driver.c(140) (col. 2): remark: loop was not vectorized: not inner loop.
    Driver.c(140) (col. 2): remark: loop was not vectorized: vectorization possible but seems inefficient.
    Driver.c(141) (col. 2): remark: loop was not vectorized: vectorization possible but seems inefficient.
    Driver.c(164) (col. 2): remark: LOOP WAS VECTORIZED.
    Driver.c(145) (col. 2): remark: loop was not vectorized: nonstandard loop is not a vectorization candidate.
    Driver.c(81) (col. 2): remark: LOOP WAS VECTORIZED.
    Driver.c(69) (col. 2): remark: loop was not vectorized: vectorization possible but seems inefficient.
    Driver.c(54) (col. 2): remark: loop was not vectorized: not inner loop.
    Driver.c(55) (col. 3): remark: loop was not vectorized: vectorization possible but seems inefficient.

Now that the compiler has been told that the arrays do not overlap, it knows that it is safe to vectorize the loop.

Improving Performance by Aligning Data

The vectorizer can generate faster code when operating on aligned data. In this activity you will improve performance by aligning the arrays a, b, and x in Driver.c on a 16-byte boundary so that the vectorizer can use aligned load instructions for all arrays rather than the slower unaligned load instructions and can avoid runtime tests of alignment. Using the ALIGNED macro will modify the declarations of a, b, and x in Driver.c using the __attribute keyword, which has the following syntax:

    float array[30] __attribute((aligned(base, [offset])));

This instructs the compiler to create an array that is aligned on a "base"-byte boundary with an "offset" (Default=0) in bytes from that boundary. Example:

    FTYPE a[ROW][COLWIDTH] __attribute((aligned(16)));

In addition, the row length of the matrix, a, needs to be padded out to be a multiple of 16 bytes, so that each individual row of a is 16-byte aligned.

To derive the maximum benefit from this alignment, we also need to tell the vectorizer it can safely assume that the arrays in Multiply.c are aligned by using #pragma vector aligned.

NOTE. If you use #pragma vector aligned, you must be sure that all the arrays or subarrays in the loop are 16-byte aligned. Otherwise, you may get a runtime error. Aligning data may still give a performance benefit even if #pragma vector aligned is not used. See the code under the ALIGNED macro in Multiply.c.

If your compilation targets the Intel® AVX instruction set, you should try to align data on a 32-byte boundary. This may result in improved performance. In this case, #pragma vector aligned advises the compiler that the data is 32-byte aligned.
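
As a rough illustration of the aligned declarations described above, the sketch below declares three arrays on a 16-byte boundary and checks their addresses at runtime. It is not the code from Driver.c: the dimensions ROWS and COLS, the helper is_aligned, and the function arrays_are_aligned are assumptions made for the example, and it uses the GCC-style spelling __attribute__((aligned(16))) on the assumption that the compiler in use accepts it.

```c
#include <stdint.h>

/* Hypothetical dimensions, chosen only for this sketch. */
#define ROWS 4
#define COLS 8

/* Arrays requested on a 16-byte boundary (GCC-style attribute spelling). */
static double a[ROWS][COLS] __attribute__((aligned(16)));
static double b[ROWS] __attribute__((aligned(16)));
static double x[COLS] __attribute__((aligned(16)));

/* Returns 1 if p lies on a boundary-byte boundary, 0 otherwise. */
int is_aligned(const void *p, uintptr_t boundary)
{
    return ((uintptr_t)p % boundary) == 0;
}

/* Returns 1 if all three arrays start on a 16-byte boundary. */
int arrays_are_aligned(void)
{
    return is_aligned(a, 16) && is_aligned(b, 16) && is_aligned(x, 16);
}
```

A check of this kind can be useful before relying on #pragma vector aligned, since that pragma is a promise to the compiler rather than something it verifies.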
Recompile the program after adding the ALIGNED macro to ensure consistently aligned data.

    icc -std=c99 -vec-report2 -DNOALIAS -DALIGNED Multiply.c Driver.c -o MatVector

    Multiply.c(45) (col. 2): remark: loop was not vectorized: not inner loop.
    Multiply.c(55) (col. 3): remark: LOOP WAS VECTORIZED.
    Driver.c(140) (col. 2): remark: loop was not vectorized: not inner loop.
    Driver.c(140) (col. 2): remark: loop was not vectorized: vectorization possible but seems inefficient.
    Driver.c(140) (col. 2): remark: LOOP WAS VECTORIZED.
    Driver.c(141) (col. 2): remark: loop was not vectorized: vectorization possible but seems inefficient.
    Driver.c(164) (col. 2): remark: LOOP WAS VECTORIZED.
    Driver.c(145) (col. 2): remark: loop was not vectorized: nonstandard loop is not a vectorization candidate.
    Driver.c(81) (col. 2): remark: LOOP WAS VECTORIZED.
    Driver.c(69) (col. 2): remark: loop was not vectorized: vectorization possible but seems inefficient.
    Driver.c(72) (col. 3): remark: LOOP WAS VECTORIZED.
    Driver.c(54) (col. 2): remark: loop was not vectorized: not inner loop.
    Driver.c(55) (col. 3): remark: loop was not vectorized: vectorization possible but seems inefficient.
    Driver.c(60) (col. 3): remark: loop was not vectorized: not inner loop.
    Driver.c(61) (col. 4): remark: LOOP WAS VECTORIZED.

Now, run the executable and record the execution time.

Improving Performance with Interprocedural Optimization

The compiler may be able to perform additional optimizations if it is able to optimize across source file boundaries. These may include, but are not limited to, function inlining. This is enabled with the -ipo option.

Recompile the program using the -ipo option to enable interprocedural optimization.

    icc -std=c99 -vec-report2 -DNOALIAS -DALIGNED -ipo Multiply.c Driver.c -o MatVector

Note that the vectorization messages now appear at the point of inlining in Driver.c (line 155).

    Driver.c(145) (col. 2): remark: loop was not vectorized: not inner loop.
    Driver.c(155) (col. 3): remark: loop was not vectorized: not inner loop.
    Driver.c(155) (col. 3): remark: LOOP WAS VECTORIZED.
    Driver.c(164) (col. 2): remark: LOOP WAS VECTORIZED.
    Driver.c(54) (col. 2): remark: loop was not vectorized: not inner loop.
    Driver.c(55) (col. 3): remark: loop was not vectorized: vectorization possible but seems inefficient.
    Driver.c(60) (col. 3): remark: LOOP WAS VECTORIZED.
    Driver.c(69) (col. 2): remark: loop was not vectorized: vectorization possible but seems inefficient.

Now, run the executable and record the execution time.

Additional Exercises

The previous examples made use of double precision arrays. They may be built instead with single precision arrays by adding the macro FTYPE=float. The non-vectorized versions of the loop execute only slightly faster than the double precision versions; however, the vectorized versions are substantially faster. This is because a packed SIMD instruction operating on a 16-byte vector register operates on four single precision data elements at once instead of two double precision data elements.

NOTE. In the example with data alignment, you will need to set COLBUF=3 to ensure 16-byte alignment for each row of the matrix a. Otherwise, #pragma vector aligned will cause the program to fail.

This completes the tutorial for auto-vectorization, where you have seen how the compiler can optimize performance with various vectorization techniques.

Using Guided Auto-parallelization

Introduction to Guided Auto-parallelization

Guided Auto-parallelization (GAP) is a feature of the Intel® C++ Compiler that offers selective advice and, when correctly applied, results in auto-vectorization or auto-parallelization for serially-coded applications. Using the -guide option with your normal compiler options at -O2 or higher is sufficient to enable the GAP technology to generate the advice for auto-vectorization.
Using -guide in conjunction with -parallel will enable the compiler to generate advice for auto-parallelization.

In this tutorial, you will:
1. prepare the project for Guided Auto-parallelization.
2. run Guided Auto-parallelization.
3. analyze Guided Auto-parallelization reports.
4. implement Guided Auto-parallelization recommendations.

Preparing the Project for Guided Auto-parallelization

To begin this tutorial, open the source file archive located at:

    /Samples//C++/guided_auto_parallel.tar.gz

The following files are included:
• Makefile
• main.cpp
• main.h
• scalar_dep.cpp
• scalar_dep.h

Copy these files to a directory on your system where you have write and execute permissions.

Running Guided Auto-parallelization

You can use the -guide option to generate GAP advice. From a directory where you can compile the sample program, execute make gap_vec_report from the command-line, or execute:

    icpc -c -guide scalar_dep.cpp

The GAP Report appears in the compiler output. GAP reports are encapsulated with GAP REPORT LOG OPENED and END OF GAP REPORT LOG.

    GAP REPORT LOG OPENED
    remark #30761: Add -parallel option if you want the compiler to generate recommendations for improving auto-parallelization.
    scalar_dep.cpp(51): remark #30515: (VECT) Loop at line 51 cannot be vectorized due to conditional assignment(s) into the following variable(s): b. This loop will be vectorized if the variable(s) become unconditionally initialized at the top of every iteration. [VERIFY] Make sure that the value(s) of the variable(s) read in any iteration of the loop must have been written earlier in the same iteration.
    Number of advice-messages emitted for this compilation session: 1.
    END OF GAP REPORT LOG

Analyzing Guided Auto-parallelization Reports

Analyze the output generated by GAP analysis and determine whether or not the specific suggestions are appropriate for the specified source code.
For this sample tutorial, GAP generates output for the loop in scalar_dep.cpp:

    for (i=0; i<n; i++) {
        if (A[i] > 0) {b=A[i]; A[i] = 1 / A[i]; }
        if (A[i] > 1) {A[i] += b;}
    }

In this example, the GAP Report generates a recommendation (remark #30761) to add the -parallel option to improve auto-parallelization. Remark #30515 indicates that if variable b can be unconditionally assigned, the compiler will be able to vectorize the loop.

    GAP REPORT LOG OPENED
    remark #30761: Add -parallel option if you want the compiler to generate recommendations for improving auto-parallelization.
    scalar_dep.cpp(51): remark #30515: (VECT) Loop at line 51 cannot be vectorized due to conditional assignment(s) into the following variable(s): b. This loop will be vectorized if the variable(s) become unconditionally initialized at the top of every iteration. [VERIFY] Make sure that the value(s) of the variable(s) read in any iteration of the loop must have been written earlier in the same iteration.
    Number of advice-messages emitted for this compilation session: 1.
    END OF GAP REPORT LOG

Implementing Guided Auto-parallelization Recommendations

The GAP Report in this example recommends using the -parallel option to enable parallelization. From the command-line, execute make gap_par_report, or run the following:

    icpc -c -guide -parallel scalar_dep.cpp

The compiler emits the following:

    GAP REPORT LOG OPENED ON Wed Jul 28 14:33:09 2010
    scalar_dep.cpp(51): remark #30523: (PAR) Loop at line 51 cannot be parallelized due to conditional assignment(s) into the following variable(s): b. This loop will be parallelized if the variable(s) become unconditionally initialized at the top of every iteration. [VERIFY] Make sure that the value(s) of the variable(s) read in any iteration of the loop must have been written earlier in the same iteration. [ALTERNATIVE] Another way is to use "#pragma parallel private(b)" to parallelize the loop. [VERIFY] The same conditions described previously must hold.
    scalar_dep.cpp(51): remark #30525: (PAR) If the trip count of the loop at line 51 is greater than 188, then use "#pragma loop count min(188)" to parallelize this loop. [VERIFY] Make sure that the loop has a minimum of 188 iterations.
    Number of advice-messages emitted for this compilation session: 2.
    END OF GAP REPORT LOG

In the GAP Report, remark #30523 indicates that the loop at line 51 cannot be parallelized because the variable b is conditionally assigned. Remark #30525 indicates that the loop trip count must be greater than 188 for the compiler to parallelize the loop.

Apply the necessary changes after verifying that the GAP recommendations are appropriate and do not change the semantics of the program.

For this loop, the conditional compilation enables parallelization and vectorization of the loop as recommended by GAP:

    #ifdef TEST_GAP
    #pragma loop count min (188)
        for (i=0; i<n; i++) {
            b = A[i];
            if (A[i] > 0) {A[i] = 1 / A[i];}
            if (A[i] > 1) {A[i] += b;}
        }
    #else
        for (i=0; i<n; i++) {
            if (A[i] > 0) {b=A[i]; A[i] = 1 / A[i]; }
            if (A[i] > 1) {A[i] += b;}
        }
    #endif
    }

To verify that the loop is parallelized and vectorized:
• Add the compiler options -vec-report1 -par-report1.
• Add the conditional definition TEST_GAP to compile the appropriate code path.

From the command-line, execute make final, or run the following:

    icpc -c -parallel -DTEST_GAP -vec-report1 -par-report1 scalar_dep.cpp

The compiler's -vec-report and -par-report options emit the following output, confirming that the program is vectorized and parallelized:

    scalar_dep.cpp(43) (col. 3): remark: LOOP WAS AUTO-PARALLELIZED.
    scalar_dep.cpp(43) (col. 3): remark: LOOP WAS VECTORIZED.
    scalar_dep.cpp(43) (col. 3): remark: LOOP WAS VECTORIZED.

For more information on using the -guide, -vec-report, and -par-report compiler options, see the Compiler Options section in the Compiler User Guide and Reference.
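
The [VERIFY] step in the GAP advice amounts to checking that making the assignment to b unconditional does not change the results. The sketch below is one way to convince yourself of that for this loop; it is written in plain C rather than copied from scalar_dep.cpp, and the function names scalar_dep_conditional and scalar_dep_unconditional are hypothetical. In the conditional form, b is only read when A[i] > 1 after the first test, which can only happen if the first test assigned it in the same iteration, so both forms should agree.

```c
/* Conditional form: b is assigned only when A[i] > 0. The initializer is
   added here only so that the sketch has no uninitialized variable. */
void scalar_dep_conditional(double *A, int n)
{
    double b = 0.0;
    for (int i = 0; i < n; i++) {
        if (A[i] > 0) { b = A[i]; A[i] = 1 / A[i]; }
        if (A[i] > 1) { A[i] += b; }
    }
}

/* Unconditional form: b is initialized at the top of every iteration,
   so no value of b is carried from one iteration to the next. */
void scalar_dep_unconditional(double *A, int n)
{
    for (int i = 0; i < n; i++) {
        double b = A[i];
        if (A[i] > 0) { A[i] = 1 / A[i]; }
        if (A[i] > 1) { A[i] += b; }
    }
}
```

Running both functions on copies of the same input and comparing the arrays element by element is a simple way to check the rewrite before adding the pragmas that GAP suggests.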
This completes the tutorial for Guided Auto-parallelization, where you have seen how the compiler can guide you to an optimized solution through auto-parallelization.

Intel® C++ Composer XE 2011 Getting Started Tutorials
Document Number: 323649-001US
World Wide Web: http://developer.intel.com

Contents

Legal Information
Introducing the Intel® C++ Composer XE 2011
Prerequisites
Chapter 1: Navigation Quick Start
    Getting Started with the Intel® C++ Composer XE 2011
    Starting the Intel® Debugger
Chapter 2: Tutorial: Intel® C++ Compiler
    Using Auto Vectorization
        Introduction to Auto-vectorization
        Establishing a Performance Baseline
        Generating a Vectorization Report
        Improving Performance by Pointer Disambiguation
        Improving Performance by Aligning Data
        Improving Performance with Interprocedural Optimization
        Additional Exercises
    Using Guided Auto-parallelization
        Introduction to Guided Auto-parallelization
        Preparing the Project for Guided Auto-parallelization
        Running Guided Auto-parallelization
        Analyzing Guided Auto-parallelization Reports
        Implementing Guided Auto-parallelization Recommendations

Legal Information

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL(R) PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR.

Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.
Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or by visiting Intel's Web Site.

Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. See http://www.intel.com/products/processor_number for details.

Centrino, Cilk, Intel, Intel Atom, Intel Core, Intel NetBurst, Itanium, MMX, Pentium, Xeon are trademarks of Intel Corporation in the U.S. and/or other countries.

* Other names and brands may be claimed as the property of others.

Portions Copyright (C) 2001, Hewlett-Packard Development Company, L.P.
Copyright © 2011, Intel Corporation. All rights reserved.

Introducing the Intel® C++ Composer XE 2011

This guide shows you how to start the Intel® C++ Composer XE 2011 and begin debugging code using the Intel® Debugger.

The Intel® C++ Composer XE 2011 is a comprehensive set of software development tools that includes the following components:
• Intel® C++ Compiler
• Intel® Integrated Performance Primitives
• Intel® Threading Building Blocks
• Intel® Math Kernel Library
• Intel® Debugger

Prerequisites

You need the following tools, skills, and knowledge to effectively use these tutorials.

Required Skills and Knowledge

These tutorials are designed for developers with a basic understanding of Mac OS* X, including how to:
• install the Intel® C++ Composer XE 2011 on a supported Mac OS* X version. See the Release Notes.
• open a Mac OS* X command-line shell and execute fundamental commands including make.
• compile and link C/C++ source files.
Navigation Quick Start

Getting Started with the Intel® C++ Composer XE 2011

The Intel® C++ Compiler XE 12.1 for Mac OS* X compiles C and C++ source files on Mac OS* X operating systems. The compiler is supported on IA-32 and Intel® 64 architectures.

You can use the Intel C++ Compiler XE 12.1 in the Xcode* integrated development environment or from the command line. This tutorial assumes you are using Xcode*, but supplies general instructions for starting the compiler from a command line.

Using the Compiler in Xcode*

You must first create or choose an existing C or C++ Xcode* project. These instructions assume you are creating a new project.
1. Launch Xcode.
2. Choose New Project from the File menu. When the New Project Assistant window appears, select a project template under Application; for example, select Command Line Tool. Click Choose.
3. Click Next, then name your project (hello_world, for example) and specify a save location. Click Save.
4. From within the project, highlight the target you want to change in the Groups & Files list under the Target group.
5. Double-click the target you want to change in the Groups & Files list under the Target group.
6. In the Target Info window, click Rules.
7. To add a new rule, click the + button at the bottom, left-hand corner of the Target Info window. From the new Rule section:
   • under Process, choose C++ source files
   • under Using, choose Intel® C++ Compiler XE 12.1
8. Choose Build from the Build menu or click the Build and Go button in the toolbar.

To view the results of your build, choose Build Results from the Build menu in the Xcode toolbar.

See the Building Applications with Xcode* section in the compiler documentation for more information about using the compiler with the Xcode integrated development environment.

Using the Compiler from the Command Line

Start the compiler from a command line by performing the following steps:
1. Open a terminal session.
2. Set the environment variables for the compiler.
3. Invoke the compiler.

One way to set the environment variables prior to invoking the compiler is to "source" the compiler environment script, compilervars.sh (or compilervars.csh):

    source /bin/compilervars.sh

Here the path in front of /bin is the directory structure containing the compiler /bin directory, and the script is followed by one of the architecture arguments listed below.

The environment script takes an argument based on architecture. Valid arguments are as follows:
• ia32: Compilers and libraries for IA-32 architectures only
• intel64: Compilers and libraries for Intel® 64 architectures only

To compile C source files, use a command similar to the following:

    icc my_source_file.c

To compile C++ source files, use a command similar to the following:

    icpc my_source_file.cpp

Following successful compilation, an executable is created in the current directory.

Starting the Intel® Debugger

The Intel® Debugger (IDB) is a full-featured symbolic source code application debugger that helps programmers:
• Debug programs
• Disassemble and examine machine code and examine machine register values
• Debug programs with shared libraries
• Debug multithreaded applications

The debugger features include:
• C/C++ language support
• Assembler language support
• Access to the registers your application accesses
• Bitfield editor to modify registers
• MMU support

Starting the Intel® Debugger

On Mac OS* X, you can use the Intel Debugger only from the command-line. To start the command-line invocation of the Intel Debugger, execute the idb command.

Tutorial: Intel® C++ Compiler

Using Auto Vectorization

Introduction to Auto-vectorization

For the Intel® C++ Compiler, vectorization is the unrolling of a loop combined with the generation of packed SIMD instructions. Because the packed instructions operate on more than one data element at a time, the loop can execute more efficiently.
It is sometimes referred to as auto-vectorization to emphasize that the compiler automatically identifies and optimizes suitable loops on its own.

Using the -vec (Linux* OS) or the /Qvec (Windows* OS) option enables vectorization at default optimization levels for both Intel® microprocessors and non-Intel microprocessors. Vectorization may call library routines that can result in a greater performance gain on Intel microprocessors than on non-Intel microprocessors. The vectorization can also be affected by certain options, such as /arch or /Qx (Windows* OS) or -m or -x (Linux and Mac OS X).

Vectorization is enabled with the Intel C++ Compiler at optimization levels of -O2 and higher. Many loops are vectorized automatically, but in cases where this doesn't happen, you may be able to vectorize loops by making simple code modifications. In this tutorial, you will:
• establish a performance baseline
• generate a vectorization report
• improve performance by pointer disambiguation
• improve performance by aligning data
• improve performance using Interprocedural Optimization

Locating the Samples

To begin this tutorial, locate the source files in the product's Samples directory:

    /Samples//C++/vec_samples/

Use these files for this tutorial:
• Driver.c
• Multiply.c
• Multiply.h

Establishing a Performance Baseline

To set a performance baseline for the improvements that follow in this tutorial, compile your sources with these compiler options:

    icc -O1 -std=c99 -DNOFUNCCALL Multiply.c Driver.c -o MatVector

Execute MatVector and record the execution time reported in the output. This is the baseline against which subsequent improvements will be measured.

This example uses a variable length array (VLA) and therefore must be compiled with the -std=c99 option.

Generating a Vectorization Report

A vectorization report tells you whether the loops in your code were vectorized, and if not, explains why not.
Because vectorization is off at -O1, the compiler does not generate a vectorization report, so recompile at -O2 (default optimization):

    icc -std=c99 -DNOFUNCCALL -vec-report1 Multiply.c Driver.c -o MatVector

Record the new execution time. The reduction in time is mostly due to auto-vectorization of the inner loop at line 150 noted in the vectorization report:

    Driver.c(150) (col. 4): remark: LOOP WAS VECTORIZED.
    Driver.c(164) (col. 2): remark: LOOP WAS VECTORIZED.
    Driver.c(81) (col. 2): remark: LOOP WAS VECTORIZED.

The -vec-report2 option returns a list that also includes loops that were not vectorized, along with the reason why the compiler did not vectorize them.

    icc -std=c99 -DNOFUNCCALL -vec-report2 Multiply.c Driver.c -o MatVector

The vectorization report indicates that the loop at line 45 in Multiply.c did not vectorize because it is not the innermost loop of the loop nest. Two versions of the innermost loop at line 55 were generated, but neither version was vectorized.

    Multiply.c(45) (col. 2): remark: loop was not vectorized: not inner loop.
    Multiply.c(55) (col. 3): remark: loop was not vectorized: existence of vector dependence.
    Multiply.c(55) (col. 3): remark: loop skipped: multiversioned.
    Driver.c(140) (col. 2): remark: loop was not vectorized: not inner loop.
    Driver.c(140) (col. 2): remark: loop was not vectorized: vectorization possible but seems inefficient.
    Driver.c(141) (col. 2): remark: loop was not vectorized: vectorization possible but seems inefficient.
    Driver.c(145) (col. 2): remark: loop was not vectorized: not inner loop.
    Driver.c(148) (col. 3): remark: loop was not vectorized: not inner loop.
    Driver.c(150) (col. 4): remark: LOOP WAS VECTORIZED.
    Driver.c(164) (col. 2): remark: LOOP WAS VECTORIZED.
    Driver.c(81) (col. 2): remark: LOOP WAS VECTORIZED.
    Driver.c(69) (col. 2): remark: loop was not vectorized: vectorization possible but seems inefficient.
    Driver.c(54) (col. 2): remark: loop was not vectorized: not inner loop.
    Driver.c(55) (col. 3): remark: loop was not vectorized: vectorization possible but seems inefficient.

NOTE. For more information on the -vec-report compiler option, see the Compiler Options section in the Compiler User and Reference Guide.

Improving Performance by Pointer Disambiguation

Two pointers are aliased if both point to the same memory location. Storing to memory using a pointer that might be aliased may prevent some optimizations. For example, it may create a dependency between loop iterations that would make vectorization unsafe. Sometimes, the compiler can generate both a vectorized and a non-vectorized version of a loop and test for aliasing at runtime to select the appropriate code path. If you know that pointers do not alias and inform the compiler, it can avoid the runtime check and generate a single vectorized code path.

In Multiply.c, the compiler generates runtime checks to determine whether or not the pointer b in function matvec(FTYPE a[][COLWIDTH], FTYPE b[], FTYPE x[]) is aliased to either a or x. If Multiply.c is compiled with the NOALIAS macro, the restrict qualifier of the argument b informs the compiler that the pointer does not alias with any other pointer, and in particular that the array b does not overlap with a or x.

NOTE. The restrict qualifier requires the use of either the -restrict compiler option for .c or .cpp files, or the -std=c99 compiler option for .c files.

Replace the NOFUNCCALL macro with NOALIAS.

    icc -std=c99 -vec-report2 -DNOALIAS Multiply.c Driver.c -o MatVector

This conditional compilation replaces the loop in the main program with a function call. Execute MatVector and record the execution time reported in the output.

    Multiply.c(45) (col. 2): remark: loop was not vectorized: not inner loop.
    Multiply.c(55) (col. 3): remark: LOOP WAS VECTORIZED.
    Driver.c(140) (col. 2): remark: loop was not vectorized: not inner loop.
    Driver.c(140) (col. 2): remark: loop was not vectorized: vectorization possible but seems inefficient.
    Driver.c(141) (col. 2): remark: loop was not vectorized: vectorization possible but seems inefficient.
    Driver.c(164) (col. 2): remark: LOOP WAS VECTORIZED.
    Driver.c(145) (col. 2): remark: loop was not vectorized: nonstandard loop is not a vectorization candidate.
    Driver.c(81) (col. 2): remark: LOOP WAS VECTORIZED.
    Driver.c(69) (col. 2): remark: loop was not vectorized: vectorization possible but seems inefficient.
    Driver.c(54) (col. 2): remark: loop was not vectorized: not inner loop.
    Driver.c(55) (col. 3): remark: loop was not vectorized: vectorization possible but seems inefficient.

Now that the compiler has been told that the arrays do not overlap, it knows that it is safe to vectorize the loop.

Improving Performance by Aligning Data

The vectorizer can generate faster code when operating on aligned data. In this activity you will improve performance by aligning the arrays a, b, and x in Driver.c on a 16-byte boundary so that the vectorizer can use aligned load instructions for all arrays rather than the slower unaligned load instructions and can avoid runtime tests of alignment. Using the ALIGNED macro will modify the declarations of a, b, and x in Driver.c using the __attribute keyword, which has the following syntax:

    float array[30] __attribute((aligned(base, [offset])));

This instructs the compiler to create an array that is aligned on a "base"-byte boundary with an "offset" (Default=0) in bytes from that boundary. Example:

    FTYPE a[ROW][COLWIDTH] __attribute((aligned(16)));

In addition, the row length of the matrix, a, needs to be padded out to be a multiple of 16 bytes, so that each individual row of a is 16-byte aligned.

To derive the maximum benefit from this alignment, we also need to tell the vectorizer it can safely assume that the arrays in Multiply.c are aligned by using #pragma vector aligned.

NOTE.
If you use #pragma vector aligned, you must be sure that all the arrays or subarrays in the loop are 16-byte aligned. Otherwise, you may get a runtime error. Aligning data may still give a performance benefit even if #pragma vector aligned is not used. See the code under the ALIGNED macro in Multiply.c.

If your compilation targets the Intel® AVX instruction set, you should try to align data on a 32-byte boundary. This may result in improved performance. In this case, #pragma vector aligned advises the compiler that the data is 32-byte aligned.

Recompile the program after adding the ALIGNED macro to ensure consistently aligned data.

    icc -std=c99 -vec-report2 -DNOALIAS -DALIGNED Multiply.c Driver.c -o MatVector

    Multiply.c(45) (col. 2): remark: loop was not vectorized: not inner loop.
    Multiply.c(55) (col. 3): remark: LOOP WAS VECTORIZED.
    Driver.c(140) (col. 2): remark: loop was not vectorized: not inner loop.
    Driver.c(140) (col. 2): remark: loop was not vectorized: vectorization possible but seems inefficient.
    Driver.c(140) (col. 2): remark: LOOP WAS VECTORIZED.
    Driver.c(141) (col. 2): remark: loop was not vectorized: vectorization possible but seems inefficient.
    Driver.c(164) (col. 2): remark: LOOP WAS VECTORIZED.
    Driver.c(145) (col. 2): remark: loop was not vectorized: nonstandard loop is not a vectorization candidate.
    Driver.c(81) (col. 2): remark: LOOP WAS VECTORIZED.
    Driver.c(69) (col. 2): remark: loop was not vectorized: vectorization possible but seems inefficient.
    Driver.c(72) (col. 3): remark: LOOP WAS VECTORIZED.
    Driver.c(54) (col. 2): remark: loop was not vectorized: not inner loop.
    Driver.c(55) (col. 3): remark: loop was not vectorized: vectorization possible but seems inefficient.
    Driver.c(60) (col. 3): remark: loop was not vectorized: not inner loop.
    Driver.c(61) (col. 4): remark: LOOP WAS VECTORIZED.

Now, run the executable and record the execution time.
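
The row padding mentioned in this section is simple arithmetic: the number of columns is rounded up until one row occupies a whole number of 16-byte chunks. The helper below is a hypothetical sketch of that calculation, not code from the samples; padded_cols is an invented name, and the column count of 101 used in the note that follows is only an illustrative value.

```c
#include <stddef.h>

/* Smallest column count >= cols for which one row (cols * elem_size bytes)
   is a multiple of boundary bytes. Assumes boundary is a multiple of
   elem_size, as with 4- or 8-byte elements and a 16-byte boundary. */
size_t padded_cols(size_t cols, size_t elem_size, size_t boundary)
{
    size_t per_chunk = boundary / elem_size;   /* elements per aligned chunk */
    return (cols + per_chunk - 1) / per_chunk * per_chunk;
}
```

For example, with 8-byte elements a row of 101 columns would be padded to 102, and with 4-byte elements to 104, so that the start of every row falls on a 16-byte boundary when the array itself is 16-byte aligned.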
Improving Performance with Interprocedural Optimization

The compiler may be able to perform additional optimizations if it is able to optimize across function and source file boundaries. These may include, but are not limited to, function inlining. This is enabled with the -ipo option.

Recompile the program using the -ipo option to enable interprocedural optimization.

icc -std=c99 -vec-report2 -DNOALIAS -DALIGNED -ipo Multiply.c Driver.c -o MatVector

Note that the vectorization messages now appear at the point of inlining in Driver.c (line 155).

Driver.c(145) (col. 2): remark: loop was not vectorized: not inner loop.
Driver.c(155) (col. 3): remark: loop was not vectorized: not inner loop.
Driver.c(155) (col. 3): remark: LOOP WAS VECTORIZED.
Driver.c(164) (col. 2): remark: LOOP WAS VECTORIZED.
Driver.c(54) (col. 2): remark: loop was not vectorized: not inner loop.
Driver.c(55) (col. 3): remark: loop was not vectorized: vectorization possible but seems inefficient.
Driver.c(60) (col. 3): remark: LOOP WAS VECTORIZED.
Driver.c(69) (col. 2): remark: loop was not vectorized: vectorization possible but seems inefficient.

Now, run the executable and record the execution time.

Additional Exercises

The previous examples made use of double precision arrays. They may be built instead with single precision arrays by defining the macro FTYPE=float. The non-vectorized versions of the loop execute only slightly faster than the double precision version; however, the vectorized versions are substantially faster. This is because a packed SIMD instruction operating on a 16-byte vector register operates on four single precision data elements at once instead of two double precision data elements.

NOTE. In the example with data alignment, you will need to set COLBUF=3 to ensure 16-byte alignment for each row of the matrix a. Otherwise, #pragma vector aligned will cause the program to fail.
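The arithmetic behind the two statements above (four floats or two doubles per 16-byte register, and a padding of three elements for single precision) can be sketched in C as follows. The function names are hypothetical. The row length of 101 elements used in the usage note is an assumed value chosen for illustration; the text above does not state the actual row length.

```c
#include <assert.h>
#include <stddef.h>

/* Number of elements of size elem_size that fit in a vector register
   of vec_bytes bytes. */
size_t elems_per_vector(size_t vec_bytes, size_t elem_size)
{
    return vec_bytes / elem_size;
}

/* Smallest number of padding elements that makes a row of cols elements
   a whole multiple of align bytes. Requires align to be a multiple of
   elem_size. */
size_t pad_elems(size_t cols, size_t elem_size, size_t align)
{
    assert(align % elem_size == 0);
    size_t per_block = align / elem_size;  /* elements per aligned block */
    size_t rem = cols % per_block;         /* elements in the last partial block */
    return rem == 0 ? 0 : per_block - rem;
}
```

Under the assumed row length of 101, pad_elems(101, sizeof(float), 16) evaluates to 3, which is consistent with the COLBUF=3 setting in the note, while pad_elems(101, sizeof(double), 16) evaluates to 1.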
This completes the tutorial for auto-vectorization, where you have seen how the compiler can optimize performance with various vectorization techniques.

Using Guided Auto-parallelization

Introduction to Guided Auto-parallelization

Guided Auto-parallelization (GAP) is a feature of the Intel® C++ Compiler that offers selective advice which, when correctly applied, results in auto-vectorization or auto-parallelization of serially coded applications. Using the -guide option with your normal compiler options at -O2 or higher is sufficient to enable the GAP technology to generate advice for auto-vectorization. Using -guide in conjunction with -parallel enables the compiler to generate advice for auto-parallelization.

In this tutorial, you will:
1. prepare the project for Guided Auto-parallelization.
2. run Guided Auto-parallelization.
3. analyze Guided Auto-parallelization reports.
4. implement Guided Auto-parallelization recommendations.

Preparing the Project for Guided Auto-parallelization

To begin this tutorial, open the source file archive located at:

/Samples//C++/guided_auto_parallel.tar.gz

The following files are included:
• Makefile
• main.cpp
• main.h
• scalar_dep.cpp
• scalar_dep.h

Copy these files to a directory on your system where you have write and execute permissions.

Running Guided Auto-parallelization

You can use the -guide option to generate GAP advice. From a directory where you can compile the sample program, execute make gap_vec_report from the command line, or execute:

icpc -c -guide scalar_dep.cpp

The GAP Report appears in the compiler output. GAP reports are delimited by GAP REPORT LOG OPENED and END OF GAP REPORT LOG.

GAP REPORT LOG OPENED
remark #30761: Add -parallel option if you want the compiler to generate recommendations for improving auto-parallelization.
scalar_dep.cpp(51): remark #30515: (VECT) Loop at line 51 cannot be vectorized due to conditional assignment(s) into the following variable(s): b. This loop will be vectorized if the variable(s) become unconditionally initialized at the top of every iteration. [VERIFY] Make sure that the value(s) of the variable(s) read in any iteration of the loop must have been written earlier in the same iteration.
Number of advice-messages emitted for this compilation session: 1.
END OF GAP REPORT LOG

Analyzing Guided Auto-parallelization Reports

Analyze the output generated by GAP analysis and determine whether or not the specific suggestions are appropriate for the specified source code. For this sample tutorial, GAP generates output for the loop in scalar_dep.cpp:

for (i=0; i 0) {b=A[i]; A[i] = 1 / A[i]; } if (A[i] > 1) {A[i] += b;} }

In this example, the GAP Report includes a recommendation (remark #30761) to add the -parallel option in order to obtain advice on auto-parallelization. Remark #30515 indicates that if variable b can be unconditionally assigned, the compiler will be able to vectorize the loop.

Implementing Guided Auto-parallelization Recommendations

The GAP Report in this example recommends using the -parallel option to enable parallelization.
From the command line, execute make gap_par_report, or run the following:

icpc -c -guide -parallel scalar_dep.cpp

The compiler emits the following:

GAP REPORT LOG OPENED ON Wed Jul 28 14:33:09 2010
scalar_dep.cpp(51): remark #30523: (PAR) Loop at line 51 cannot be parallelized due to conditional assignment(s) into the following variable(s): b. This loop will be parallelized if the variable(s) become unconditionally initialized at the top of every iteration. [VERIFY] Make sure that the value(s) of the variable(s) read in any iteration of the loop must have been written earlier in the same iteration. [ALTERNATIVE] Another way is to use "#pragma parallel private(b)" to parallelize the loop. [VERIFY] The same conditions described previously must hold.
scalar_dep.cpp(51): remark #30525: (PAR) If the trip count of the loop at line 51 is greater than 188, then use "#pragma loop count min(188)" to parallelize this loop. [VERIFY] Make sure that the loop has a minimum of 188 iterations.
Number of advice-messages emitted for this compilation session: 2.
END OF GAP REPORT LOG

In the GAP Report, remark #30523 indicates that the loop at line 51 cannot be parallelized because the variable b is conditionally assigned. Remark #30525 indicates that the loop trip count must be greater than 188 for the compiler to parallelize the loop.

Apply the necessary changes after verifying that the GAP recommendations are appropriate and do not change the semantics of the program. For this loop, the following conditional compilation enables parallelization and vectorization of the loop as recommended by GAP:

#ifdef TEST_GAP
#pragma loop count min (188)
for (i=0; i 0) {A[i] = 1 / A[i];} if (A[i] > 1) {A[i] += b;} }
#else
for (i=0; i 0) {b=A[i]; A[i] = 1 / A[i]; } if (A[i] > 1) {A[i] += b;} }
#endif
}

To verify that the loop is parallelized and vectorized:
• Add the compiler options -vec-report1 -par-report1.
• Add the conditional definition TEST_GAP to compile the appropriate code path.
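The change that GAP recommends for the TEST_GAP code path can be sketched in C as follows. This is a reconstruction under stated assumptions, not the sample source: the loop bound n, the element type double, the function names, and the helper variants_agree are hypothetical, and the Intel-specific pragma is guarded so that other compilers skip it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Conditional assignment into b, as in the loop that GAP reported on.
   The bound n and the function name are hypothetical. */
void scalar_dep_original(double *A, int n)
{
    double b = 0.0;  /* initialized only to keep this sketch well defined */
    for (int i = 0; i < n; i++) {
        if (A[i] > 0) { b = A[i]; A[i] = 1 / A[i]; }
        if (A[i] > 1) { A[i] += b; }
    }
}

/* b is assigned at the top of every iteration, so no value of b is
   carried from one iteration to the next. */
void scalar_dep_gap(double *A, int n)
{
    double b;
    /* Trip-count hint quoted from the GAP advice; Intel compilers only. */
#ifdef __INTEL_COMPILER
#pragma loop count min(188)
#endif
    for (int i = 0; i < n; i++) {
        b = A[i];
        if (A[i] > 0) { A[i] = 1 / A[i]; }
        if (A[i] > 1) { A[i] += b; }
    }
}

/* Returns 1 if both variants produce identical results for the same
   input of at most 64 elements. */
int variants_agree(const double *in, int n)
{
    double x[64], y[64];
    assert(n >= 0 && n <= 64);
    memcpy(x, in, (size_t)n * sizeof(double));
    memcpy(y, in, (size_t)n * sizeof(double));
    scalar_dep_original(x, n);
    scalar_dep_gap(y, n);
    return memcmp(x, y, (size_t)n * sizeof(double)) == 0;
}
```

The two variants agree because b is read only when A[i] exceeds 1 after the first test. That can happen only if A[i] was positive on entry, which is exactly the case in which the original loop has just assigned b. This is the [VERIFY] condition stated in the GAP report.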
From the command line, execute make final, or run the following:

icpc -c -parallel -DTEST_GAP -vec-report1 -par-report1 scalar_dep.cpp

The compiler's -vec-report and -par-report options emit the following output, confirming that the program is vectorized and parallelized:

scalar_dep.cpp(43) (col. 3): remark: LOOP WAS AUTO-PARALLELIZED.
scalar_dep.cpp(43) (col. 3): remark: LOOP WAS VECTORIZED.
scalar_dep.cpp(43) (col. 3): remark: LOOP WAS VECTORIZED.

For more information on using the -guide, -vec-report, and -par-report compiler options, see the Compiler Options section in the Compiler User Guide and Reference.

This completes the tutorial for Guided Auto-parallelization, where you have seen how the compiler can guide you to an optimized solution through auto-parallelization.

Intel® C++ Composer XE 2011 Getting Started Tutorials
Document Number: 323647-001US
World Wide Web: http://developer.intel.com

Contents

Legal Information
Introducing the Intel® C++ Composer XE 2011
Prerequisites
Chapter 1: Navigation Quick Start
  Starting the Intel® C++ Compiler from the Microsoft Visual Studio* IDE
  Switching between the Installed Compilers
  Starting the Intel® C++ Compiler from the Command Line
  Starting the Intel® Parallel Debugger Extension
Chapter 2: Tutorial: Intel® C++ Compiler
  Using Auto Vectorization
    Introduction to Auto-vectorization
    Establishing a Performance Baseline
    Generating a Vectorization Report
    Improving Performance by Pointer Disambiguation
    Improving Performance by Aligning Data
    Improving Performance with Interprocedural Optimization
    Additional Exercises
  Using Guided Auto-parallelization
    Introduction to Guided Auto-parallelization
    Preparing the Project for Guided Auto-parallelization
    Running Guided Auto-parallelization
    Analyzing Guided Auto-parallelization Reports
    Implementing Guided Auto-parallelization Recommendations
  Threading Your Applications
    Learning Objectives
    Threading Your Application

Legal Information

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL(R) PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR. Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information. The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order. Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or by visiting Intel's Web Site. Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. See http://www.intel.com/products/processor_number for details. Centrino, Cilk, Intel, Intel Atom, Intel Core, Intel NetBurst, Itanium, MMX, Pentium, Xeon are trademarks of Intel Corporation in the U.S. and/or other countries. * Other names and brands may be claimed as the property of others. Portions Copyright (C) 2001, Hewlett-Packard Development Company, L.P. Microsoft product screen shot(s) reprinted with permission from Microsoft Corporation. Copyright © 2011, Intel Corporation. All rights reserved. 
Introducing the Intel® C++ Composer XE 2011

This guide shows you how to start the Intel® C++ Composer XE 2011 and begin debugging code using the Intel® Parallel Debugger Extension.

The Intel® C++ Composer XE 2011 is a comprehensive set of software development tools that includes the following components:
• Intel® C++ Compiler
• Intel® Integrated Performance Primitives
• Intel® Threading Building Blocks
• Intel® Math Kernel Library
• Intel® Parallel Debugger Extension

Check http://software.intel.com/en-us/articles/intel-software-product-tutorials/ for the following:
• ShowMe video for using Intel® C++ Composer XE with Microsoft Visual Studio*

Prerequisites

You need the following tools, skills, and knowledge to effectively use these tutorials.

NOTE. Although the instructions and screen captures in these tutorials refer to the Visual Studio* 2005 integrated development environment (IDE), you can use these tutorials with later versions of Visual Studio.

Required Tools

You need the following tools to use these tutorials:
• Microsoft Visual Studio 2005 or later.
• Intel® C++ Composer XE 2011.
• Sample code included with the Intel® C++ Composer XE 2011.

NOTE.
• Samples are non-deterministic. Your results may vary from the examples shown throughout these tutorials.
• Samples are designed only to illustrate features and do not represent best practices for creating multithreaded code.

Required Skills and Knowledge

These tutorials are designed for developers with a basic understanding of Microsoft Visual Studio, including how to:
• open a project/solution.
• access the Document Explorer (valid in Microsoft Visual Studio 2005/2008).
• display the Solution Explorer.
• compile and link a project.
• ensure a project compiled successfully.
Chapter 1: Navigation Quick Start

Starting the Intel® C++ Compiler from the Microsoft Visual Studio* IDE

The Intel® C++ Composer XE 2011 integrates into the following versions of the Microsoft Visual Studio* Integrated Development Environment (IDE):
• Microsoft Visual Studio 2010*
• Microsoft Visual Studio 2008*
• Microsoft Visual Studio 2005*

Using the Intel® C++ Composer XE 2011 from Microsoft Visual Studio* IDE

To use the Intel® C++ Compiler, do the following:
1. Launch Microsoft Visual Studio*.
2. Open or create a Visual Studio solution in the Solution Explorer pane.
3. From the Project menu, select Intel C++ Compiler XE > Use Intel C++.
4. Click OK in the Confirmation dialog box. This configures the solution to use the Intel® C++ Compiler. (Visual Studio 2008 or Visual Studio 2005: you can also configure the solution to use the Intel® C++ Compiler by clicking the toolbar icon. Visual Studio 2010: you can use Project > Properties > General > Platform Toolset to select the Intel C++ Compiler. This method is equivalent to using the Use Intel C++ menu item, except that you can make the selection in individual build configurations.)
5. Select Rebuild Solution from the Visual Studio Build menu. The results of the compilation display in the Output window.

Setting Intel® C++ Compiler Options

1. Select Project > Properties. The Property Pages for your solution display.
2. Locate C/C++ in the list and expand the heading.
3. Step through the available properties to select your configuration.

Compatibility

The Intel® C++ Compiler processes C and C++ language source files. The Intel® C++ Compiler is fully source- and binary-compatible (native code only) with the Microsoft Visual Studio* C++ compiler. The Intel C++ Compiler supports only the native C++ project types provided by the Visual Studio development environment.
Project types with .NET attributes, such as the ones below, cannot be converted to an Intel C++ project:
• Empty Project (.NET)
• Class Library (.NET)
• Console Application (.NET)
• Windows Control Library (.NET)
• Windows Forms Application (.NET)
• Windows Service (.NET)

Refer to the User and Reference Guides for the full list of unsupported features.

Switching between the Installed Compilers

Switching to the Intel® C++ Composer XE 2011

To switch to the Intel® C++ Compiler, do the following:
1. Launch Microsoft Visual Studio*.
2. Open the solution.
3. From the Project menu, select Intel C++ Compiler XE > Use Intel C++.
4. Click OK in the Confirmation dialog box. This configures the solution to use the Intel® C++ Compiler. (Visual Studio 2008 or Visual Studio 2005: you can also configure the solution to use the Intel® C++ Compiler by clicking the toolbar icon. Visual Studio 2010: you can use Project > Properties > General > Platform Toolset to select the Intel C++ Compiler. This method is equivalent to using the Use Intel C++ menu item, except that you can make the selection in individual build configurations.)

Switching to the Microsoft Visual Studio* C++ Compiler

If you are using the Intel® C++ Compiler, you can switch to the Visual C++ Compiler at any time. Switch compilers by doing the following:
1. Launch Microsoft Visual Studio*.
2. Open the solution.
3. From the Project drop-down menu, select Intel C++ Compiler XE > Use Visual C++.

This action updates the solution file to use the Microsoft Visual Studio C++ compiler. All configurations of affected projects are automatically cleaned unless you select Do not clean project(s). If you choose not to clean projects, you will need to rebuild updated projects to ensure all source files are compiled with the new compiler.

Starting the Intel® C++ Compiler from the Command Line

Follow these steps to invoke the Intel® C++ Compiler from the command line:
1. Open a command prompt from the Start > All Programs menu: Intel Parallel Studio XE 2011 > Command Prompt (or Intel Parallel Studio 2011 > Command Prompt).
2. Invoke the compiler as follows:

icl [options... ] inputfile(s) [/link link_options]

Use the command icl /help to display all available compiler options.

Starting the Intel® Parallel Debugger Extension

The Intel® Parallel Debugger Extension for Microsoft Visual Studio* is a debugging add-on for the Intel® Compiler's parallel code development features. It helps you develop parallelism in applications based on the Intel® OpenMP* runtime environment.

The Intel® Parallel Debugger Extension provides:
• A new Microsoft Visual Studio* toolbar
• An extension to the Microsoft Visual Studio* Debug menu
• A set of new views and dialogs that are invoked from the toolbar or the menu tree

The debugger features include:
• C/C++ language support
• Assembler language support
• Access to the registers your application accesses
• Bitfield editor to modify registers
• MMU support

Preparing Applications for Parallel Debugging

To use parallel debugging features, such as analyzing shared data or breaking at re-entrant function calls, you must first enable parallel debug instrumentation in the compiler. To enable the parallel debug instrumentation:
1. Open your application project in Microsoft Visual Studio*.
2. Select Project > Properties... from the menu. The Projectname Property Pages dialog box opens.
3. Enable Parallel debug checking.
   1. Select Configuration Properties > C/C++ > Debug in the left pane.
   2. Under Enable Parallel Debug Checks, select Yes (/debug:parallel).
4. Click OK.
5. Rebuild your application.

Your application is now instrumented for parallel debugging using the features of the Intel® Parallel Debugger Extension.
Chapter 2: Tutorial: Intel® C++ Compiler

Using Auto Vectorization

Introduction to Auto-vectorization

For the Intel® C++ Compiler, vectorization is the unrolling of a loop combined with the generation of packed SIMD instructions. Because the packed instructions operate on more than one data element at a time, the loop can execute more efficiently. It is sometimes referred to as auto-vectorization to emphasize that the compiler automatically identifies and optimizes suitable loops on its own.

Using the -vec (Linux* OS) or the /Qvec (Windows* OS) option enables vectorization at default optimization levels for both Intel® microprocessors and non-Intel microprocessors. Vectorization may call library routines that can result in a greater performance gain on Intel microprocessors than on non-Intel microprocessors. Vectorization can also be affected by certain options, such as /arch or /Qx (Windows), or -m or -x (Linux and Mac OS X).

Vectorization is enabled with the Intel C++ Compiler at optimization levels of /O2 and higher. Many loops are vectorized automatically, but in cases where this doesn't happen, you may be able to vectorize loops by making simple code modifications. In this tutorial, you will:
• establish a performance baseline
• generate a vectorization report
• improve performance by pointer disambiguation
• improve performance by aligning data
• improve performance using Interprocedural Optimization

Locating the Samples

To begin this tutorial, open the vec_samples.zip archive in the product's Samples directory:

\Samples\\C++\vec_samples.zip

Use these files for this tutorial:
• matrix_vector_multiplication_c.sln
• matrix_vector_multiplication_c.vcproj
• Driver.c
• Multiply.c
• Multiply.h

Open the Microsoft Visual Studio solution file, matrix_vector_multiplication_c.sln, and follow the steps below to prepare the project for the vectorization exercises in this tutorial:
1. Convert to an Intel project by right-clicking on the matrix_vector_multiplication_c project and selecting Intel C++ Composer XE > Use Intel C++. Click OK in the Confirmation dialog.
2. Change the Active solution configuration to Release using Build > Configuration Manager.
3. Clean the solution by selecting Build > Clean Solution.

Establishing a Performance Baseline

To set a performance baseline for the improvements that follow in this tutorial, build your project with these settings:
1. Select Project > Properties > C/C++ > Optimization > General > Optimization > Minimize Size (/O1).
2. Select Project > Properties > C/C++ > Optimization > Intel Specific > Interprocedural Optimization > No.
3. Add the preprocessor definition NOFUNCCALL by selecting Project > Properties > C/C++ > Preprocessor > Preprocessor Definitions, then adding NOFUNCCALL to the existing list of preprocessor definitions.
4. Select Project > Properties > C/C++ > Language > Intel Specific > Enable C99 Support > Yes. This example uses a variable length array (VLA) and therefore must be compiled with the /Qstd=c99 option.
5. Rebuild the project, then run the executable (Debug > Start Without Debugging) and record the execution time reported in the output. This is the baseline against which subsequent improvements will be measured.

Generating a Vectorization Report

A vectorization report tells you whether the loops in your code were vectorized, and if not, explains why not.

Add the /Qvec-report1 option to the command line by selecting Project > Properties > C/C++ > Command Line > Additional Options, then adding /Qvec-report1.

Because vectorization is off at /O1, the compiler does not generate a vectorization report, so recompile at /O2 (the default optimization level). Record the new execution time.
The reduction in time is mostly due to auto-vectorization of the inner loop at line 150, as noted in the vectorization report:

Driver.c(150) (col. 4): remark: LOOP WAS VECTORIZED.
Driver.c(164) (col. 2): remark: LOOP WAS VECTORIZED.
Driver.c(81) (col. 2): remark: LOOP WAS VECTORIZED.

The /Qvec-report2 option returns a list that also includes loops that were not vectorized, along with the reason why the compiler did not vectorize them.

For C/C++ > Command Line > Additional Options, change /Qvec-report1 to /Qvec-report2. Also, for Linker > Command Line > Additional Options, add /Qvec-report2.

Rebuild your project.

The vectorization report indicates that the loop at line 45 in Multiply.c did not vectorize because it is not the innermost loop of the loop nest. Two versions of the innermost loop at line 55 were generated, but neither version was vectorized.

Multiply.c(45) (col. 2): remark: loop was not vectorized: not inner loop.
Multiply.c(55) (col. 3): remark: loop was not vectorized: existence of vector dependence.
Multiply.c(55) (col. 3): remark: loop skipped: multiversioned.
Driver.c(140) (col. 2): remark: loop was not vectorized: not inner loop.
Driver.c(140) (col. 2): remark: loop was not vectorized: vectorization possible but seems inefficient.
Driver.c(141) (col. 2): remark: loop was not vectorized: vectorization possible but seems inefficient.
Driver.c(145) (col. 2): remark: loop was not vectorized: not inner loop.
Driver.c(148) (col. 3): remark: loop was not vectorized: not inner loop.
Driver.c(150) (col. 4): remark: LOOP WAS VECTORIZED.
Driver.c(164) (col. 2): remark: LOOP WAS VECTORIZED.
Driver.c(81) (col. 2): remark: LOOP WAS VECTORIZED.
Driver.c(69) (col. 2): remark: loop was not vectorized: vectorization possible but seems inefficient.
Driver.c(54) (col. 2): remark: loop was not vectorized: not inner loop.
Driver.c(55) (col. 3): remark: loop was not vectorized: vectorization possible but seems inefficient.

NOTE. For more information on the /Qvec-report compiler option, see the Compiler Options section in the Compiler User and Reference Guide.

Improving Performance by Pointer Disambiguation

Two pointers are aliased if both point to the same memory location. Storing to memory using a pointer that might be aliased may prevent some optimizations. For example, it may create a dependency between loop iterations that would make vectorization unsafe. Sometimes, the compiler can generate both a vectorized and a non-vectorized version of a loop and test for aliasing at runtime to select the appropriate code path. If you know that pointers do not alias and inform the compiler, it can avoid the runtime check and generate a single vectorized code path.

In Multiply.c, the compiler generates runtime checks to determine whether or not the pointer b in function matvec(FTYPE a[][COLWIDTH], FTYPE b[], FTYPE x[]) is aliased to either a or x. If Multiply.c is compiled with the NOALIAS macro, the restrict qualifier on the argument b informs the compiler that the pointer does not alias with any other pointer, and in particular that the array b does not overlap with a or x.

NOTE. The restrict qualifier requires the use of either the /Qrestrict compiler option for .c or .cpp files, or the /Qstd=c99 compiler option for .c files.

Replace the NOFUNCCALL preprocessor definition with NOALIAS. This conditional compilation replaces the loop in the main program with a function call. Rebuild your project, run the executable, and record the execution time reported in the output.

Multiply.c(45) (col. 2): remark: loop was not vectorized: not inner loop.
Multiply.c(55) (col. 3): remark: LOOP WAS VECTORIZED.
Driver.c(140) (col. 2): remark: loop was not vectorized: not inner loop.
Driver.c(140) (col. 2): remark: loop was not vectorized: vectorization possible but seems inefficient.
Driver.c(141) (col. 2): remark: loop was not vectorized: vectorization possible but seems inefficient.
Driver.c(164) (col. 2): remark: LOOP WAS VECTORIZED.
Driver.c(145) (col. 2): remark: loop was not vectorized: nonstandard loop is not a vectorization candidate.
Driver.c(81) (col. 2): remark: LOOP WAS VECTORIZED.
Driver.c(69) (col. 2): remark: loop was not vectorized: vectorization possible but seems inefficient.
Driver.c(54) (col. 2): remark: loop was not vectorized: not inner loop.
Driver.c(55) (col. 3): remark: loop was not vectorized: vectorization possible but seems inefficient.

Now that the compiler has been told that the arrays do not overlap, it knows that it is safe to vectorize the loop.

Improving Performance by Aligning Data

The vectorizer can generate faster code when operating on aligned data. In this activity you will improve performance by aligning the arrays a, b, and x in Driver.c on a 16-byte boundary, so that the vectorizer can use aligned load instructions for all arrays instead of the slower unaligned load instructions, and can avoid runtime tests of alignment. Defining the ALIGNED macro modifies the declarations of a, b, and x in Driver.c using the __attribute keyword, which has the following syntax:

float array[30] __attribute((aligned(base, [offset])));

This instructs the compiler to create an array that is aligned on a "base"-byte boundary, with an "offset" (default 0) in bytes from that boundary. Example:

FTYPE a[ROW][COLWIDTH] __attribute((aligned(16)));

In addition, the row length of the matrix a needs to be padded out to a multiple of 16 bytes, so that each individual row of a is 16-byte aligned. To derive the maximum benefit from this alignment, we also need to tell the vectorizer that it can safely assume the arrays in Multiply.c are aligned, by using #pragma vector aligned.

NOTE. If you use #pragma vector aligned, you must be sure that all the arrays or subarrays in the loop are 16-byte aligned.
Otherwise, you may get a runtime error. Aligning data may still give a performance benefit even if #pragma vector aligned is not used. See the code under the ALIGNED macro in Multiply.c.

If your compilation targets the Intel® AVX instruction set, you should try to align data on a 32-byte boundary. This may result in improved performance. In this case, #pragma vector aligned advises the compiler that the data is 32-byte aligned.

Rebuild the program after adding the ALIGNED preprocessor definition to ensure consistently aligned data.

Multiply.c(45) (col. 2): remark: loop was not vectorized: not inner loop.
Multiply.c(55) (col. 3): remark: LOOP WAS VECTORIZED.
Driver.c(140) (col. 2): remark: loop was not vectorized: not inner loop.
Driver.c(140) (col. 2): remark: loop was not vectorized: vectorization possible but seems inefficient.
Driver.c(140) (col. 2): remark: LOOP WAS VECTORIZED.
Driver.c(141) (col. 2): remark: loop was not vectorized: vectorization possible but seems inefficient.
Driver.c(164) (col. 2): remark: LOOP WAS VECTORIZED.
Driver.c(145) (col. 2): remark: loop was not vectorized: nonstandard loop is not a vectorization candidate.
Driver.c(81) (col. 2): remark: LOOP WAS VECTORIZED.
Driver.c(69) (col. 2): remark: loop was not vectorized: vectorization possible but seems inefficient.
Driver.c(72) (col. 3): remark: LOOP WAS VECTORIZED.
Driver.c(54) (col. 2): remark: loop was not vectorized: not inner loop.
Driver.c(55) (col. 3): remark: loop was not vectorized: vectorization possible but seems inefficient.
Driver.c(60) (col. 3): remark: loop was not vectorized: not inner loop.
Driver.c(61) (col. 4): remark: LOOP WAS VECTORIZED.

Now, run the executable and record the execution time.

Improving Performance with Interprocedural Optimization

The compiler may be able to perform additional optimizations if it is able to optimize across function and source file boundaries.
These may include, but are not limited to, function inlining. This is enabled with the /Qipo option. Rebuild the program using the /Qipo option to enable interprocedural optimization. Select Optimization > Interprocedural Optimization > Multi-file(/Qipo)

Note that the vectorization messages now appear at the point of inlining in Driver.c (line 155).

Driver.c(145) (col. 2): remark: loop was not vectorized: not inner loop.
Driver.c(155) (col. 3): remark: loop was not vectorized: not inner loop.
Driver.c(155) (col. 3): remark: LOOP WAS VECTORIZED.
Driver.c(164) (col. 2): remark: LOOP WAS VECTORIZED.
Driver.c(54) (col. 2): remark: loop was not vectorized: not inner loop.
Driver.c(55) (col. 3): remark: loop was not vectorized: vectorization possible but seems inefficient.
Driver.c(60) (col. 3): remark: LOOP WAS VECTORIZED.
Driver.c(69) (col. 2): remark: loop was not vectorized: vectorization possible but seems inefficient.

Now, run the executable and record the execution time.

Additional Exercises

The previous examples made use of double precision arrays. They may be built instead with single precision arrays by adding the preprocessor definition FTYPE=float. The non-vectorized versions of the loop execute only slightly faster than the double precision version; however, the vectorized versions are substantially faster. This is because a packed SIMD instruction operating on a 16-byte vector register operates on four single precision data elements at once instead of two double precision data elements.

NOTE. In the example with data alignment, you will need to set COLBUF=3 to ensure 16-byte alignment for each row of the matrix a. Otherwise, #pragma vector aligned will cause the program to fail.

This completes the tutorial for auto-vectorization, where you have seen how the compiler can optimize performance with various vectorization techniques.
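The alignment and padding arithmetic described in this tutorial can be sketched in portable C. This is only an illustration under stated assumptions, not the code from Driver.c or Multiply.c: it uses the standard C11 alignas specifier instead of the __attribute syntax, and the names VEC_BYTES, PADDED, lanes, rows_aligned, and the row length of 101 are hypothetical.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define VEC_BYTES 16   /* width of one SSE vector register, in bytes */
#define ROWS 4         /* hypothetical matrix height */
#define COLS 101       /* hypothetical logical row length (an assumption) */

/* Round a row length up so that one row occupies a multiple of VEC_BYTES. */
#define PADDED(n, type) \
    ((((n) * sizeof(type) + VEC_BYTES - 1) / VEC_BYTES) * VEC_BYTES / sizeof(type))

/* The array start is aligned, and the padding keeps every later row aligned too. */
static alignas(VEC_BYTES) float  af[ROWS][PADDED(COLS, float)];
static alignas(VEC_BYTES) double ad[ROWS][PADDED(COLS, double)];

/* Number of elements of the given size that fit in one vector register. */
size_t lanes(size_t elem_size) {
    assert(elem_size > 0 && VEC_BYTES % elem_size == 0);
    return VEC_BYTES / elem_size;
}

/* Returns 1 if every row of the array starts on a VEC_BYTES boundary. */
int rows_aligned(const void *base, size_t rows, size_t row_bytes) {
    for (size_t r = 0; r < rows; r++) {
        uintptr_t addr = (uintptr_t)base + r * row_bytes;
        if (addr % VEC_BYTES != 0) return 0;
    }
    return 1;
}
```

With the assumed row length of 101, PADDED(101, float) evaluates to 104 (three elements of padding) and PADDED(101, double) to 102, while lanes(sizeof(float)) and lanes(sizeof(double)) give 4 and 2 on platforms with 4-byte float and 8-byte double. A call such as rows_aligned(af, ROWS, sizeof af[0]) checks the property that #pragma vector aligned asks you to guarantee.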
Using Guided Auto-parallelization

Introduction to Guided Auto-parallelization

Guided Auto-parallelization (GAP) is a feature of the Intel® C++ Compiler that offers selective advice and, when correctly applied, results in auto-vectorization or auto-parallelization for serially-coded applications. Using the /Qguide option with your normal compiler options at /O2 or higher is sufficient to enable the GAP technology to generate the advice for auto-vectorization. Using /Qguide in conjunction with /Qparallel will enable the compiler to generate advice for auto-parallelization.

In this tutorial, you will:
1. prepare the project for Guided Auto-parallelization.
2. run Guided Auto-parallelization.
3. analyze Guided Auto-parallelization reports.
4. implement Guided Auto-parallelization recommendations.

Preparing the Project for Guided Auto-parallelization

To begin this tutorial, open the GuidedAutoParallel.zip archive located in the product's Samples directory located at: \Samples\\C++\

The following Visual Studio* 2005 project files and source files are included:
• GAP-c.sln
• GAP-c.vcproj
• main.cpp
• main.h
• scalar_dep.cpp
• scalar_dep.h

Open the Microsoft Visual Studio Solution file, GAP-c.sln, and follow the steps below to prepare the project for Guided Auto-parallelization (GAP).
1. Convert to an Intel project by right-clicking on the GAP-c project and selecting Intel C++ Composer XE > Use Intel C++. Click OK in the Confirmation dialog.
2. Clean the Solution by selecting Build > Clean Solution.
3. Since GAP is enabled only with option /O2 or higher, you will need to change the build configuration to Release using Build > Configuration Manager.

Running Guided Auto-parallelization

There are several ways to run GAP analysis in Visual Studio, depending on whether you want analysis for the whole solution, the project, a single file, a function, or a range of lines in your source code.
In this tutorial, we will use single-file analysis. Follow the steps below to run a single-file analysis on scalar_dep.cpp in the GAP-c project:
1. In the GAP-c project, right-click on scalar_dep.cpp.
2. Select Intel C++ Composer XE > Guided Auto Parallelism > Run Analysis on file "scalar_dep.cpp"
3. If the /Qipo option is enabled, the Analysis with Multi-file optimization dialog appears. Click Run Analysis.
4. On the Configure Analysis dialog, click Run Analysis.

NOTE. If you select Send remarks to a file, GAP messages will not be available in the Output window or Error List window.

See the GAP Report in the Output window. GAP reports in the standard Output window are encapsulated with GAP REPORT LOG OPENED and END OF GAP REPORT LOG. Also, see the GAP Messages in the Error List window.

Analyzing Guided Auto-parallelization Reports

Analyze the output generated by GAP analysis and determine whether or not the specific suggestions are appropriate for the specified source code. For this sample tutorial, GAP generates output for the loop in scalar_dep.cpp:

    for (i=0; i < n; i++) {
        if (A[i] > 0) {b=A[i]; A[i] = 1 / A[i]; }
        if (A[i] > 1) {A[i] += b;}
    }

In this example, the GAP Report generates a recommendation (remark #30761) to add the /Qparallel option to improve auto-parallelization. Remark #30515 indicates that if variable b can be unconditionally assigned, the compiler will be able to vectorize the loop.

Implementing Guided Auto-parallelization Recommendations

The GAP Report in this example recommends using the /Qparallel option to enable parallelization. Follow these steps to enable this option:
1. Right-click on the GAP-c project and select Properties.
2. On the Property Pages dialog, expand the C/C++ heading and select Optimization.
3. In the right-hand pane under Intel Specific, select Parallelization, then choose Enable Parallelization (/Qparallel) and click OK.

Now, run the GAP Analysis again and review the GAP Report. Remark #30521 indicates that the loop at line 50 cannot be parallelized because the variable b is conditionally assigned. Remark #30525 indicates that the loop trip count must be greater than 188 for the compiler to parallelize the loop.

Apply the necessary changes after verifying that the GAP recommendations are appropriate and do not change the semantics of the program. For this loop, the conditional compilation enables parallelization and vectorization of the loop as recommended by GAP:

    #ifdef TEST_GAP
    #pragma loop count min (188)
        for (i=0; i < n; i++) {
            b = A[i];
            if (A[i] > 0) {A[i] = 1 / A[i];}
            if (A[i] > 1) {A[i] += b;}
        }
    #else
        for (i=0; i < n; i++) {
            if (A[i] > 0) {b=A[i]; A[i] = 1 / A[i]; }
            if (A[i] > 1) {A[i] += b;}
        }
    #endif
    }

To verify that the loop is parallelized and vectorized:
1. Add the options /Qvec-report1 /Qpar-report1 to the Linker > Command Line > Additional Options dialog.
2. Add the preprocessor definition TEST_GAP to compile the appropriate code path.
3. Rebuild the GAP-c project and note the reports in the output window.

For more information on using the -guide, -vec-report, and -par-report compiler options, see the Compiler Options section in the Compiler User Guide and Reference.

This completes the tutorial for Guided Auto-parallelization, where you have seen how the compiler can guide you to an optimized solution through auto-parallelization.

Threading Your Applications

Learning Objectives

In this tutorial, we will be building different parallel implementations of the same function with both the Microsoft Visual C++* Compiler and Intel® C++ Composer XE 2011. When executed, the application will display the execution time required to render the object in the window title.
This time is an indication of the speedup obtained with parallel implementations compared to a baseline established with a serial implementation in the first step.

Threading Your Application

Tachyon is a ray-tracer application, rendering objects described in data files. The Tachyon program is located in the product Samples directory: \Samples\\C++\Tachyon.zip. Expand the archive to \Tachyon. By default we use balls.dat as the input file. Data files are stored in the directory \Tachyon\dat\.

Originally, Tachyon was an application with parallelism implemented in function pthread_create() (source file \Tachyon\src\Windows\pthread.cpp) with explicit threads: one for the rendering, and the other for calculations. In this tutorial we implement parallelization on the calculation thread with OpenMP*, Intel® TBB, and Intel® Cilk™ Plus. Parallelization is implemented only for one function, draw_task(), which you can find in the source file build_serial.cpp, in project build_serial.

Open the Microsoft Visual Studio* Solution \Tachyon\vc8\tachyon_compiler.sln. It includes these projects:
• build_serial
• build_with_cilk
• build_with_openmp
• build_with_tbb
• tachyon.common

NOTE. Projects build_with_openmp, build_with_tbb and build_with_cilk use OpenMP, Intel® TBB and Intel® Cilk™ Plus, respectively. In addition to these implementations, there is also an option for users to implement with lambda functionality based on Intel TBB.

Follow the steps below to build the serial and Intel® Cilk™ Plus approaches to Tachyon.

Workflow Steps

In the following, we will be building different parallel implementations of the same function with both the Microsoft Visual C++ Compiler and the Intel® C++ Compiler. When executed, the application will display the execution time required to render the object in the window title.
This time is an indication of the speedup obtained with parallel implementations compared to a baseline established with a serial implementation in the first step.

Building the Serial Project

1. Set the build_serial project as the StartUp project (Project > Set as StartUp Project).
2. Set the configuration to Release mode: Build > Configuration Manager > Active solution configuration: > Release, then build the build_serial project.
3. Execute the application tachyon_compiler.exe with Debug > Start without Debugging. Take a note of the time in seconds displayed in the window title. This time to render the image is the baseline for parallelization with the Microsoft Visual C++ Compiler.
4. For projects build_serial and tachyon.common, change the compiler to the Intel C++ Compiler (Project > Intel C++ Composer XE 2011 > Use Intel C++ ...).
5. Rebuild build_serial in Release mode (now with the Intel C++ Compiler).
6. Execute the application. Note the time to render the image as the baseline for parallelization with the Intel C++ Compiler.

Building with OpenMP*

1. Set the build_with_openmp project as StartUp project.
2. For project build_with_openmp, change the compiler to Intel C++ Composer XE (Project > Intel C++ Composer XE > Use Intel C++...).
3. For the project build_with_openmp, make sure the /Qopenmp compiler option is set (Project > Properties > Configuration Properties > C/C++ > Language > OpenMP Support = Generate Parallel Code (/Qopenmp)).
4. Open source file build_with_openmp.cpp in the project build_with_openmp.
5. Uncomment the OpenMP* pragmas in the routine draw_task which create parallel regions and distribute loop iterations within the team of threads.
6. Comment out the return inside the parallel region in the routine draw_task.
7. Uncomment the zero assignment to variable ison (ison = 0;) inside the parallel region in the routine draw_task.
8. Uncomment the return at the end of the routine draw_task.
9. Build build_with_openmp in Release configuration.
10. Execute the application.
11. Measure performance compared with the serial version.

Options that use OpenMP are available for both Intel® and non-Intel microprocessors, but these options may perform additional optimizations on Intel® microprocessors beyond those they perform on non-Intel microprocessors. The list of major, user-visible OpenMP constructs and features that may perform differently on Intel® vs. non-Intel microprocessors includes: locks (internal and user visible), the SINGLE construct, barriers (explicit and implicit), parallel loop scheduling, reductions, memory allocation, and thread affinity and binding.

Building with Intel® TBB

1. Set build_with_tbb project as StartUp project.
2. For project build_with_tbb, change the compiler to Intel C++ Composer XE (Project > Intel C++ Composer XE > Use Intel C++...).
3. For the project build_with_tbb, make sure the Intel® TBB environment is set (Project > Intel C++ Composer XE > Select Build Components > Use TBB). See the Note below.
4. Open source file build_with_tbb.cpp in the project build_with_tbb.
5. Uncomment TBB header files.
6. Uncomment class draw_task.
7. Comment out routine draw_task.
8. Uncomment lines regarding TBB schedule and number of threads in routine thread_trace.
9. Uncomment lines regarding grain size in routine thread_trace.
10. Uncomment TBB parallel_for routine in routine thread_trace.
11. Comment out call of routine draw_task in routine thread_trace.
12. Build build_with_tbb in Release configuration.
13. Execute the application.
14. Measure performance compared with the serial version.

NOTE. Double check that the following project properties are set:
• Configuration Properties > C/C++ > General > Additional Include Directories: contains $(INTEL_DEF_IA32_INSTALL_DIR)TBB\Include
• Configuration Properties > Linker > General > Additional Library Directories: contains "$(INTEL_DEF_IA32_INSTALL_DIR)TBB\Lib\ia32\vc8" for Visual Studio 2005; "$(INTEL_DEF_IA32_INSTALL_DIR)TBB\Lib\ia32\vc9" for Visual Studio 2008; "$(INTEL_DEF_IA32_INSTALL_DIR)TBB\Lib\ia32\vc10" for Visual Studio 2010
• For platform x64, $(INTEL_DEF_X64_INSTALL_DIR) is used instead of $(INTEL_DEF_IA32_INSTALL_DIR), and the library directory becomes $(INTEL_DEF_X64_INSTALL_DIR)TBB\Lib\intel64\vc8 for Visual Studio 2005.

Building with Intel® Cilk™ Plus

1. Set the build_with_cilk project as the StartUp project.
2. For project build_with_cilk, change the compiler to the Intel C++ Compiler (Project > Intel C++ Composer XE 2011 > Use Intel C++ ...).
3. For the project build_with_cilk, make sure the Intel® Cilk™ Plus additional include directory for the Intel® C++ Compiler is set (Project > Properties > Configuration Properties > C/C++ > General > Additional Include Directories = C:\Program Files\Intel\ComposerXE-2011\compiler\include\cilk\).
4. Open source file build_with_cilk.cpp in the project build_with_cilk.
5. Uncomment Intel® Cilk™ Plus header files.
6. Uncomment routine draw_task related to the Intel® Cilk™ Plus implementation.
7. Comment out the serial draw_task() function.
8. Build build_with_cilk in Release mode.
9. Execute the application.
10. Measure performance compared with the serial version built with the Intel C++ Compiler.

Platform and Other Details

The solution for this example was created in Microsoft Visual Studio 2005. If you open the tachyon_compiler.sln solution in Microsoft Visual Studio 2008, then it will be converted to a Microsoft Visual Studio 2008 solution.
For Platform Win32
• The executable file for all implementations is tachyon_compiler.exe in the \Tachyon\vc8\Release\ directory.
• Object files are stored in \Tachyon\vc8\tachyon_compiler\Release\ directory.

For Platform x64
• The executable file for all implementations is tachyon_compiler.exe in the \Tachyon\vc8\x64\Release\ directory.
• Object files are stored in \Tachyon\vc8\x64\tachyon_compiler\Release\ directory.

Intel® Fortran Composer XE 2011 Getting Started Tutorials

Document Number: 323651-001US
World Wide Web: http://developer.intel.com

Contents

Legal Information
Introducing the Intel® Fortran Composer XE 2011
Prerequisites
Chapter 1: Navigation Quick Start
    Getting Started with the Intel® Fortran Composer XE 2011
    Starting the Intel® Debugger
Chapter 2: Tutorial: Intel® Fortran Compiler
    Using Auto Vectorization
        Introduction to Auto-vectorization
        Establishing a Performance Baseline
        Generating a Vectorization Report
        Improving Performance by Aligning Data
        Improving Performance with Interprocedural Optimization
        Additional Exercises
    Using Guided Auto-parallelization
        Introduction to Guided Auto-parallelization
        Preparing the Project for Guided Auto-parallelization
        Running Guided Auto-parallelization
        Analyzing Guided Auto-parallelization Reports
        Implementing Guided Auto-parallelization Recommendations
    Using Coarray Fortran
        Introduction to Coarray Fortran
        Compiling the Sample Program
        Controlling the Number of Images

Legal Information

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL(R) PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR.

Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined."
Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order. Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or by visiting Intel's Web Site.

Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. See http://www.intel.com/products/processor_number for details.

Centrino, Cilk, Intel, Intel Atom, Intel Core, Intel NetBurst, Itanium, MMX, Pentium, Xeon are trademarks of Intel Corporation in the U.S. and/or other countries.
* Other names and brands may be claimed as the property of others.
Portions Copyright (C) 2001, Hewlett-Packard Development Company, L.P.
Copyright © 2011, Intel Corporation. All rights reserved.

Introducing the Intel® Fortran Composer XE 2011

This guide shows you how to start the Intel® Fortran Composer XE 2011 and begin debugging code using the Intel® Debugger. The Intel® Fortran Composer XE 2011 is a comprehensive set of software development tools that includes the following components:
• Intel® Fortran Compiler
• Intel® Math Kernel Library
• Intel® Debugger

Prerequisites

You need the following tools, skills, and knowledge to effectively use these tutorials.
Required Skills and Knowledge

These tutorials are designed for developers with a basic understanding of the Linux* operating system, including how to:
• install the Intel® Fortran Composer XE 2011 on a supported Linux distribution. See the Release Notes.
• open a Linux shell and execute fundamental commands including make.
• compile and link Fortran source files.

Navigation Quick Start

Getting Started with the Intel® Fortran Composer XE 2011

The Intel® Fortran Compiler XE 12.1 compiles Fortran source files on Linux* operating systems. The compiler is supported on IA-32 and Intel® 64 architectures. The compiler operates only from a command line on Linux* operating systems.
1. Open a terminal session.
2. Set the environment variables for the compiler.
3. Invoke the compiler.

One way to set the environment variables prior to invoking the compiler is to "source" the compiler environment script, compilervars.sh (or compilervars.csh):

    source /bin/compilervars.sh

where the leading path component is the directory structure containing the compiler /bin directory, and the trailing argument is the architecture argument listed below. The environment script takes an argument based on architecture. Valid arguments are as follows:
• ia32: Compiler and libraries for IA-32 architectures only
• intel64: Compiler and libraries for Intel® 64 architectures only

To compile Fortran source files, use a command similar to the following:

    ifort my_source_file.f90

Following successful compilation, an executable is created in the current directory.
Starting the Intel® Debugger

The Intel® Debugger (IDB) is a full-featured symbolic source code application debugger that helps programmers:
• Debug programs
• Disassemble and examine machine code and examine machine register values
• Debug programs with shared libraries
• Debug multithreaded applications

The debugger features include:
• Fortran language support including Fortran 95/90
• Assembler language support
• Access to the registers your application accesses
• Bitfield editor to modify registers
• MMU support

Starting the Intel® Debugger

On Linux*, you can use the Intel Debugger from a Java* GUI application or the command-line.
• To start the GUI for the Intel Debugger, execute the idb command from a Linux shell.
• To start the command-line invocation of the Intel Debugger, execute the idbc command from a Linux shell.

Tutorial: Intel® Fortran Compiler

Using Auto Vectorization

Introduction to Auto-vectorization

For the Intel® Fortran Compiler, vectorization is the unrolling of a loop combined with the generation of packed SIMD instructions. Because the packed instructions operate on more than one data element at a time, the loop can execute more efficiently. It is sometimes referred to as auto-vectorization to emphasize that the compiler automatically identifies and optimizes suitable loops on its own.

Using the -vec (Linux* OS) or the /Qvec (Windows* OS) option enables vectorization at default optimization levels for both Intel® microprocessors and non-Intel microprocessors. Vectorization may call library routines that can result in additional performance gain on Intel microprocessors compared with non-Intel microprocessors. The vectorization can also be affected by certain options, such as /arch or /Qx (Windows) or -m or -x (Linux and Mac OS X).

Vectorization is enabled with the Intel Fortran Compiler at optimization levels of -O2 and higher.
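The definition above (loop unrolling combined with packed operations) can be pictured with a small hand-written sketch in C. This is only an illustration of the idea, not what the compiler emits: the names add_scalar, add_unrolled, and LANES are hypothetical, and the unrolled body uses ordinary scalar statements where the vectorizer would use a single packed SIMD instruction.

```c
#include <stddef.h>

#define LANES 4  /* four single precision values fit in one 16-byte register */

/* Reference version: one element per iteration. */
void add_scalar(const float *a, const float *b, float *c, size_t n) {
    for (size_t i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}

/* Unrolled version: each trip of the main loop handles LANES elements,
 * which is the group a packed add would process at once. */
void add_unrolled(const float *a, const float *b, float *c, size_t n) {
    size_t i = 0;
    for (; i + LANES <= n; i += LANES) {
        c[i]     = a[i]     + b[i];
        c[i + 1] = a[i + 1] + b[i + 1];
        c[i + 2] = a[i + 2] + b[i + 2];
        c[i + 3] = a[i + 3] + b[i + 3];
    }
    /* Remainder loop for trip counts that are not a multiple of LANES. */
    for (; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}
```

Both functions produce the same results; the second one only makes the grouping of iterations explicit.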
Many loops are vectorized automatically, but in cases where this doesn't happen, you may be able to vectorize loops by making simple code modifications. In this tutorial, you will:
• establish a performance baseline
• generate a vectorization report
• improve performance by aligning data
• improve performance using Interprocedural Optimization

Locating the Samples

To begin this tutorial, locate the source files in the product's Samples directory: /Samples//Fortran/vec_samples/

Use these files for this tutorial:
• driver.f90
• matvec.f90

Establishing a Performance Baseline

To set a performance baseline for the improvements that follow in this tutorial, compile your sources with these compiler options:

    ifort -real-size 64 -O1 -vec-report1 matvec.f90 driver.f90 -o MatVector

Execute MatVector and record the execution time reported in the output. This is the baseline against which subsequent improvements will be measured.

Generating a Vectorization Report

A vectorization report tells you whether the loops in your code were vectorized, and if not, explains why not. Because vectorization is off at -O1, the compiler does not generate a vectorization report, so recompile at -O2 (default optimization):

    ifort -real-size 64 -vec-report1 matvec.f90 driver.f90 -o MatVector

Record the new execution time. The reduction in time is mostly due to auto-vectorization of the inner loop at line 32 noted in the vectorization report:

matvec.f90(32) (col. 3): remark: LOOP WAS VECTORIZED.
matvec.f90(38) (col. 6): remark: LOOP WAS VECTORIZED.
driver.f90(59) (col. 5): remark: LOOP WAS VECTORIZED.
driver.f90(61) (col. 5): remark: LOOP WAS VECTORIZED.
driver.f90(80) (col. 29): remark: LOOP WAS VECTORIZED.

The -vec-report2 option returns a list that also includes loops that were not vectorized, along with the reason why the compiler did not vectorize them.

    ifort -real-size 64 -vec-report2 matvec.f90 driver.f90 -o MatVector

The vectorization report indicates that the loop at line 33 in matvec.f90 did not vectorize because it is not the innermost loop of the loop nest.

matvec.f90(32) (col. 3): remark: LOOP WAS VECTORIZED.
matvec.f90(33) (col. 3): remark: loop was not vectorized: not inner loop.
matvec.f90(38) (col. 6): remark: LOOP WAS VECTORIZED.
driver.f90(59) (col. 5): remark: loop was not vectorized: not inner loop.
driver.f90(59) (col. 5): remark: loop was not vectorized: vectorization possible but seems inefficient.
driver.f90(59) (col. 5): remark: loop was not vectorized: not inner loop.
driver.f90(59) (col. 5): remark: loop was not vectorized: subscript too complex.
driver.f90(59) (col. 5): remark: loop was not vectorized: not inner loop.
driver.f90(59) (col. 5): remark: LOOP WAS VECTORIZED.
driver.f90(61) (col. 5): remark: loop was not vectorized: vectorization possible but seems inefficient.
driver.f90(61) (col. 5): remark: LOOP WAS VECTORIZED.
driver.f90(80) (col. 29): remark: LOOP WAS VECTORIZED.
driver.f90(74) (col. 7): remark: loop was not vectorized: nonstandard loop is not a vectorization candidate.

NOTE. For more information on the -vec-report compiler option, see the Compiler Options section in the Compiler User and Reference Guide.

Improving Performance by Aligning Data

The vectorizer can generate faster code when operating on aligned data. In this activity you will improve the vectorizer performance by aligning the arrays a, b, and c in driver.f90 on a 16-byte boundary so the vectorizer can use aligned load instructions for all arrays rather than the slower unaligned load instructions and can avoid runtime tests of alignment.
Using the ALIGNED macro will insert an alignment directive for a, b, and c in driver.f90 with the following syntax:

    !dir$attributes align : 16 :: a,b,c

This instructs the compiler to create arrays that are aligned on a 16-byte boundary, which should facilitate the use of SSE aligned load instructions. In addition, the column height of the matrix a needs to be padded out to a multiple of 16 bytes, so that each individual column of a maintains the same 16-byte alignment. In practice, maintaining a constant alignment between columns is much more important than aligning the start of the arrays. To derive the maximum benefit from this alignment, we also need to tell the vectorizer it can safely assume that the arrays in matvec.f90 are aligned by using the directive !dir$ vector aligned.

NOTE. If you use !dir$ vector aligned, you must be sure that all the arrays or subarrays in the loop are 16-byte aligned. Otherwise, you may get a runtime error. Aligning data may still give a performance benefit even if !dir$ vector aligned is not used. See the code under the ALIGNED macro in matvec.f90.

If your compilation targets the Intel® AVX instruction set, you should try to align data on a 32-byte boundary. This may result in improved performance. In this case, !dir$ vector aligned advises the compiler that the data is 32-byte aligned.

Recompile the program after adding the ALIGNED macro to ensure consistently aligned data:

    ifort -real-size 64 -vec-report2 -DALIGNED matvec.f90 driver.f90 -o MatVector

matvec.f90(32) (col. 3): remark: LOOP WAS VECTORIZED.
matvec.f90(33) (col. 3): remark: loop was not vectorized: not inner loop.
matvec.f90(38) (col. 6): remark: LOOP WAS VECTORIZED.
driver.f90(59) (col. 5): remark: loop was not vectorized: not inner loop.
driver.f90(59) (col. 5): remark: loop was not vectorized: vectorization possible but seems inefficient.
driver.f90(59) (col. 5): remark: loop was not vectorized: not inner loop.
driver.f90(59) (col. 5): remark: loop was not vectorized: subscript too complex.
driver.f90(59) (col. 5): remark: loop was not vectorized: not inner loop.
driver.f90(59) (col. 5): remark: LOOP WAS VECTORIZED.
driver.f90(61) (col. 5): remark: loop was not vectorized: vectorization possible but seems inefficient.
driver.f90(61) (col. 5): remark: LOOP WAS VECTORIZED.
driver.f90(63) (col. 21): remark: loop was not vectorized: not inner loop.
driver.f90(63) (col. 21): remark: LOOP WAS VECTORIZED.
driver.f90(80) (col. 29): remark: LOOP WAS VECTORIZED.
driver.f90(74) (col. 7): remark: loop was not vectorized: nonstandard loop is not a vectorization candidate.

Improving Performance with Interprocedural Optimization

The compiler may be able to perform additional optimizations if it is able to optimize across source line boundaries. These may include, but are not limited to, function inlining. This is enabled with the -ipo option. Recompile the program using the -ipo option to enable interprocedural optimization.

    ifort -real-size 64 -vec-report2 -DALIGNED -ipo matvec.f90 driver.f90 -o MatVector

Note that the vectorization messages now appear at the point of inlining in driver.f90 (line 70).

driver.f90(59) (col. 5): remark: loop was not vectorized: not inner loop.
driver.f90(59) (col. 5): remark: loop was not vectorized: vectorization possible but seems inefficient.
driver.f90(59) (col. 5): remark: loop was not vectorized: not inner loop.
driver.f90(59) (col. 5): remark: loop was not vectorized: subscript too complex.
driver.f90(59) (col. 5): remark: LOOP WAS VECTORIZED.
driver.f90(61) (col. 5): remark: loop was not vectorized: vectorization possible but seems inefficient.
driver.f90(61) (col. 5): remark: LOOP WAS VECTORIZED.
driver.f90(63) (col. 21): remark: loop was not vectorized: not inner loop.
driver.f90(63) (col. 21): remark: LOOP WAS VECTORIZED.
driver.f90(73) (col. 16): remark: loop was not vectorized: not inner loop.
driver.f90(70) (col. 14): remark: LOOP WAS VECTORIZED.
driver.f90(70) (col. 14): remark: loop was not vectorized: not inner loop.
driver.f90(70) (col. 14): remark: LOOP WAS VECTORIZED.
driver.f90(80) (col. 29): remark: LOOP WAS VECTORIZED.

Now, run the executable and record the execution time.

Additional Exercises

The previous examples made use of double precision arrays. They may be built instead with single precision arrays by changing the command-line option -real-size 64 to -real-size 32. The non-vectorized versions of the loop execute only slightly faster than the double precision version; however, the vectorized versions are substantially faster. This is because a packed SIMD instruction operating on a 16-byte vector register operates on four single precision data elements at once instead of two double precision data elements.

NOTE. In the example with data alignment, you will need to set ROWBUF=3 to ensure 16-byte alignment for each column of the matrix a. Otherwise, the directive !dir$ vector aligned will cause the program to fail.

This completes the tutorial for auto-vectorization, where you have seen how the compiler can optimize performance with various vectorization techniques.

Using Guided Auto-parallelization

Introduction to Guided Auto-parallelization

Guided Auto-parallelization (GAP) is a feature of the Intel® Fortran Compiler that offers selective advice and, when correctly applied, results in auto-vectorization or auto-parallelization for serially-coded applications. Using the -guide option with your normal compiler options at -O2 or higher is sufficient to enable the GAP technology to generate the advice for auto-vectorization. Using -guide in conjunction with -parallel will enable the compiler to generate advice for auto-parallelization.

In this tutorial, you will:
1. prepare the project for Guided Auto-parallelization.
2. run Guided Auto-parallelization.
3. analyze Guided Auto-parallelization reports.
4. implement Guided Auto-parallelization recommendations.

Preparing the Project for Guided Auto-parallelization

To begin this tutorial, open the source file archive located at: /Samples//Fortran/guided_auto_parallel.tar.gz

The following files are included:
• Makefile
• main.f90
• scalar_dep.f90

Copy these files to a directory on your system where you have write and execute permissions.

Running Guided Auto-parallelization

You can use the -guide option to generate GAP advice. From a directory where you can compile the sample program, execute make vec from the command-line, or execute:

    ifort -c -guide scalar_dep.f90

The GAP Report appears in the compiler output. GAP reports are encapsulated with GAP REPORT LOG OPENED and END OF GAP REPORT LOG.

GAP REPORT LOG OPENED ON Mon Aug 2 14:04:34 2010
remark #30761: Add -parallel option if you want the compiler to generate recommendations for improving auto-parallelization.
scalar_dep.f90(44): remark #30515: (VECT) Loop at line 44 cannot be vectorized due to conditional assignment(s) into the following variable(s): t. This loop will be vectorized if the variable(s) become unconditionally initialized at the top of every iteration. [VERIFY] Make sure that the value(s) of the variable(s) read in any iteration of the loop must have been written earlier in the same iteration.
Number of advice-messages emitted for this compilation session: 1.
END OF GAP REPORT LOG

Analyzing Guided Auto-parallelization Reports

Analyze the output generated by GAP analysis and determine whether or not the specific suggestions are appropriate for the specified source code. For this sample tutorial, GAP generates output for the loop in scalar_dep.f90:

    do i = 1, n
      if (a(i) >= 0) then
        t = i
      end if
      if (a(i) > 0) then
        a(i) = t * (1 / (a(i) * a(i)))
      end if
    end do

In this example, the GAP Report generates a recommendation (remark #30761) to add the -parallel option to improve auto-parallelization.
Remark #30515 indicates that if variable t can be unconditionally assigned, the compiler will be able to vectorize the loop.

    GAP REPORT LOG OPENED
    remark #30761: Add -parallel option if you want the compiler to generate recommendations for improving auto-parallelization.
    scalar_dep.f90(44): remark #30515: (VECT) Loop at line 44 cannot be vectorized due to conditional assignment(s) into the following variable(s): t. This loop will be vectorized if the variable(s) become unconditionally initialized at the top of every iteration. [VERIFY] Make sure that the value(s) of the variable(s) read in any iteration of the loop must have been written earlier in the same iteration.
    Number of advice-messages emitted for this compilation session: 1.
    END OF GAP REPORT LOG

Implementing Guided Auto-parallelization Recommendations

The GAP Report in this example recommends using the -parallel option to enable parallelization. From the command line, execute make gap_par_report, or run the following:

    ifort -c -parallel -guide scalar_dep.f90

The compiler emits the following:

    GAP REPORT LOG OPENED ON Mon Aug 2 14:04:44 2010
    scalar_dep.f90(44): remark #30523: (PAR) Loop at line 44 cannot be parallelized due to conditional assignment(s) into the following variable(s): t. This loop will be parallelized if the variable(s) become unconditionally initialized at the top of every iteration. [VERIFY] Make sure that the value(s) of the variable(s) read in any iteration of the loop must have been written earlier in the same iteration. [ALTERNATIVE] Another way is to use "!dir$ parallel private(t)" to parallelize the loop. [VERIFY] The same conditions described previously must hold.
    scalar_dep.f90(44): remark #30525: (PAR) If the trip count of the loop at line 44 is greater than 36, then use "!dir$ loop count min(36)" to parallelize this loop. [VERIFY] Make sure that the loop has a minimum of 36 iterations.
    Number of advice-messages emitted for this compilation session: 2.
    END OF GAP REPORT LOG

In the GAP Report, remark #30523 indicates that the loop at line 44 cannot be parallelized because the variable t is conditionally assigned. Remark #30525 indicates that the loop trip count must be greater than 36 for the compiler to parallelize the loop.

Apply the necessary changes after verifying that the GAP recommendations are appropriate and do not change the semantics of the program. For this loop, conditional compilation enables parallelization and vectorization of the loop as recommended by GAP:

    do i = 1, n
    !dir$ if defined(test_gap)
      t = i
    !dir$ else
      if (a(i) >= 0) then
        t = i
      end if
    !dir$ endif
      if (a(i) > 0) then
        a(i) = t * (1 / (a(i) * a(i)))
      end if
    end do

To verify that the loop is parallelized and vectorized:
• Add the compiler options -vec-report1 -par-report1.
• Add the conditional definition test_gap to compile the appropriate code path.

From the command line, execute make w_changes, or run the following:

    ifort -c -parallel -Dtest_gap -vec-report1 -par-report1 scalar_dep.f90

The compiler's -vec-report and -par-report options emit the following output, confirming that the program is vectorized and parallelized:

    scalar_dep.f90(44) (col. 9): remark: LOOP WAS AUTO-PARALLELIZED.
    scalar_dep.f90(44) (col. 9): remark: LOOP WAS VECTORIZED.
    scalar_dep.f90(44) (col. 9): remark: LOOP WAS VECTORIZED.

For more information on using the -guide, -vec-report, and -par-report compiler options, see the Compiler Options section in the Compiler User Guide and Reference.

This completes the tutorial for Guided Auto-parallelization, where you have seen how the compiler can guide you to an optimized solution through auto-parallelization.

Using Coarray Fortran

Introduction to Coarray Fortran

The Intel® Fortran Compiler XE supports parallel programming using coarrays as defined in the Fortran 2008 standard.
As an extension to the Fortran language, coarrays offer one method to use Fortran as a robust and efficient parallel programming language. Coarray Fortran uses a single-program, multiple-data (SPMD) programming model. Coarrays are supported in the Intel® Fortran Compiler XE for Linux* and Intel® Visual Fortran Compiler XE for Windows*.

This tutorial demonstrates how to compile a simple coarray Fortran application using the Intel Fortran Compiler XE, and how to control the number of images (processes) for the application.

Locating the Sample

To begin this tutorial, locate the source file in the product's Samples directory:

    /Samples//Fortran/coarray_samples/hello_image.f90

Copy hello_image.f90 to a working directory, then continue with this tutorial.

NOTE. The Intel Fortran Compiler implementation of coarrays follows the standard provided in a draft version of the Fortran 2008 Standard. Some features present in the Fortran 2008 Standard may not be implemented by Intel. Consult the Release Notes for a list of supported features.

Compiling the Sample Program

The hello_image.f90 sample is a hello world program. Unlike the usual hello world, this coarray Fortran program spawns multiple images, or processes, that run concurrently on the host computer. Examining the source code for this application shows a simple Fortran program:

    program hello_image
      write(*,*) "Hello from image ", this_image(), &
                 "out of ", num_images()," total images"
    end program hello_image

Note the function calls to this_image() and num_images(). These are new Fortran 2008 intrinsic functions. The num_images() function returns the total number of images, or processes, spawned for this program. The this_image() function returns a unique identifier for each image in the range 1 to N, where N is the total number of images created for this program.
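As a further illustration of how this_image() and num_images() are typically used together, the following minimal sketch divides a sum across images. It is not part of the product samples: the program name image_sum, the variable names, and the problem size n_total are hypothetical, and it assumes the scalar coarray declaration and the sync all statement are among the Fortran 2008 features supported by your compiler version (consult the Release Notes).

```fortran
! Sketch only (not an Intel sample): sum the integers 1..n_total across images.
program image_sum
  implicit none
  integer, parameter :: n_total = 1000   ! hypothetical problem size
  integer :: partial[*]                  ! scalar coarray: one copy per image
  integer :: i, me, n_img, total

  me    = this_image()    ! identifier of this image, 1..num_images()
  n_img = num_images()    ! total number of images

  ! Each image adds the values i = me, me + n_img, me + 2*n_img, ...
  partial = 0
  do i = me, n_total, n_img
     partial = partial + i
  end do

  sync all   ! wait until every image has finished computing its partial sum

  ! Image 1 reads each image's partial sum and combines them.
  if (me == 1) then
     total = 0
     do i = 1, n_img
        total = total + partial[i]
     end do
     write(*,*) "Sum over ", n_img, " images: ", total
  end if
end program image_sum
```

Because the work is split by image index, the reported total should be 500500 (the sum of 1 through 1000) regardless of how many images are created; only the number of images printed changes.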
To compile the sample program containing the Coarray Fortran features, use the -coarray compiler option:

    ifort -coarray hello_image.f90 -o hello_image

If you run the hello_image executable, the output will vary depending on the number of processor cores on your system:

    ./hello_image
    Hello from image 1 out of 8 total images
    Hello from image 6 out of 8 total images
    Hello from image 7 out of 8 total images
    Hello from image 2 out of 8 total images
    Hello from image 5 out of 8 total images
    Hello from image 8 out of 8 total images
    Hello from image 3 out of 8 total images
    Hello from image 4 out of 8 total images

By default, when a Coarray Fortran application is compiled with the Intel Fortran Compiler, the invocation creates as many images as there are processor cores on the host platform. The example shown above was run on a dual quad-core host system with eight total cores. As shown, each image is a separately spawned process on the system and executes asynchronously.

NOTE. The -coarray option cannot be used in conjunction with -openmp options. You cannot mix Coarray Fortran language extensions with OpenMP extensions.

Controlling the Number of Images

There are two methods to control the number of images created for a Coarray Fortran application. First, you can use the -coarray-num-images=N compiler option to compile the application, where N is the number of images. This option sets the number of images created for the application at run time. For example, use the -coarray-num-images=2 option to limit the number of images of the hello_image.f90 program to exactly two:

    ifort -coarray -coarray-num-images=2 hello_image.f90 -o hello_image
    ./hello_image
    Hello from image 2 out of 2 total images
    Hello from image 1 out of 2 total images

The second way to control the number of images is to use the environment variable FOR_COARRAY_NUM_IMAGES, setting this to the number of images you want to spawn.
As an example, recompile hello_image.f90 without the -coarray-num-images option. Then, before you run the executable hello_image, set the environment variable FOR_COARRAY_NUM_IMAGES to the number of images you want created during the program run.

For bash shell users, set the environment variable with this command:

    export FOR_COARRAY_NUM_IMAGES=4

For csh/tcsh shell users, set the environment variable with this command:

    setenv FOR_COARRAY_NUM_IMAGES 4

For example, assuming bash shell:

    ifort -coarray hello_image.f90 -o hello_image
    export FOR_COARRAY_NUM_IMAGES=4
    ./hello_image
    Hello from image 1 out of 4 total images
    Hello from image 3 out of 4 total images
    Hello from image 2 out of 4 total images
    Hello from image 4 out of 4 total images
    export FOR_COARRAY_NUM_IMAGES=3
    ./hello_image
    Hello from image 3 out of 3 total images
    Hello from image 2 out of 3 total images
    Hello from image 1 out of 3 total images

NOTE. Setting FOR_COARRAY_NUM_IMAGES=N overrides the -coarray-num-images compiler option.

Intel® Parallel Inspector 2011 Release Notes

Installation Guide and Release Notes
Document number: 320754-002US
7 August 2011

Contents
Introduction
What's New
System Requirements
Installation Notes
Issues and Limitations
Attributions
Disclaimer and Legal Information

1 Introduction

Intel® Parallel Inspector 2011 is a serial and multithreading error checking analysis tool for Microsoft Visual Studio* C/C++ developers. Inspector detects memory leaks and errors as well as threading data races and deadlock errors. This comprehensive developer productivity tool pinpoints errors and provides guidance to help ensure application reliability and quality.

This document provides system requirements, installation instructions, issues and limitations, and legal information.

To learn more about this product, see the Inspector Documentation at:
• Start > All Programs > Intel Parallel Studio 2011 > Parallel Studio Documentation > Inspector Documentation.
• Or \documentation\\documentation_inspector.htm. For example, if you install the product in the default installation path, you can find the documentation at:
  C:\Program Files\Intel\Parallel Studio 2011\Inspector\documentation\en\documentation_inspector.htm

For technical support, including answers to questions not addressed in the installed tool, visit the technical support forum at: http://software.intel.com/sites/support/

Please remember to register your tool at https://registrationcenter.intel.com/ by providing your email address. This helps Intel recognize you as a valued customer in the support forum.

2 What's New

Intel® Parallel Inspector 2011 Update 6:
• Update numbers are now aligned with Intel® Inspector XE 2011. As a result, the Update number skips from Update 2 in the previous release to Update 6 in this release.
• New memory growth reporting: use the new Set Transaction Start and Set Transaction End buttons during analysis to detect whether a block of memory is allocated but not deallocated within a specific time segment during application execution
• Analysis support for C# .NET applications
• New C# .NET sample code
• Added stability improvements

Intel® Parallel Inspector 2011 Update 2:
• Improved GUI:
  • Simpler, more intuitive real-time analysis views, main result data view, and import view
  • Enhanced state management and problem filtering
  • New memory overhead gauge to help choose the optimal preset analysis configuration
• Updates for Operating System and IDE support
  • Added Microsoft Windows 7* SP1
  • Added Microsoft Visual Studio* 2010 SP1
• Added stability improvements

Intel® Parallel Inspector 2011 Update 1:
• Improved analysis configuration (The Collection dialog now contains three levels of analysis.
The level of analysis formerly known as mi4/ti4 is now available as an additional option when you select the mi3 or ti3 level of analysis, respectively)
• New Managing Suppressions tutorial
• Bug fixes

Intel® Parallel Inspector 2011:
• Microsoft Visual Studio* 2010 support
• Resource leak detection
• Intel® Cilk™ Plus support
• Activation tool

See http://software.intel.com/en-us/intel-parallel-inspector/ or the What's New section in the help.

3 System Requirements

For an explanation of architecture names, see http://software.intel.com/en-us/articles/intel-architecture-platform-terminology/

• A system with an IA-32 or Intel® 64 architecture processor supporting the Intel® Streaming SIMD Extensions 2 (Intel® SSE2) instructions (Intel® Pentium® 4 processor or later, or compatible non-Intel processor)
  • Incompatible or proprietary instructions in non-Intel processors may cause the analysis capabilities of this tool to function incorrectly. Any attempt to analyze code not supported by Intel® processors may lead to failures in this tool.
  • For the best experience, a multi-core or multi-processor system is recommended.
• 2GB RAM
• 4GB free disk space for all tool features and architectures
• Software requirements
  • Operating system: Microsoft Windows 7* SP1, Microsoft Windows XP* SP3, Microsoft Windows Vista* SP2, Microsoft Windows Server* 2008 SP2, 32-bit or x64 editions (embedded editions not supported). NOTE: In a future major release of this product, support for installation and use on Microsoft Windows Vista* will be removed.
  • Microsoft Visual Studio* 2005 SP1, 2008 SP1 or 2010 SP1 software with C++ component installed [0] (Microsoft Visual Studio* Express Edition not supported). NOTE: In a future major release of this product, support for installation and use with Microsoft Visual Studio* 2005 will be removed. Intel recommends that customers migrate to Microsoft Visual Studio* 2010 at their earliest convenience.
• Application coding requirements
  • Programming Language: C or C++ (native, not managed code)
  • Threading methodologies supported by the analysis tool:
    • Intel® Threading Building Blocks (Intel® TBB)
    • Win32* Threads on Windows*
    • OpenMP* [1]
    • Intel's C/C++ Parallel Language Extensions
    • Intel® Cilk™ Plus
• To view PDF documents, use a PDF reader, such as Adobe Reader*.

Notes:
[0] Inspector supports analysis of applications built with the Intel® Parallel Composer, Intel® C++ Compiler Professional Edition version 10.0 or higher, and/or Microsoft Visual C++* 2005 SP1, 2008 SP1 or 2010 SP1 software.
[1] Applications that use OpenMP* technology and are built with the Microsoft* compiler must link to the OpenMP* compatibility library as supplied by an Intel® compiler.

4 Installation Notes

If you are installing the Inspector for the first time, have the product serial number available so you can type it in during installation. Inspector updates uninstall your currently installed Inspector version and use the existing valid Inspector license on the system.

Default Installation Folders

The default top-level installation folder for the Inspector is:

    C:\Program Files\Intel\Parallel Studio 2011\Inspector

If you are installing on a system with a non-English language version of the Windows* operating system, the name of the Program Files folder may be different. On Intel® 64 architecture systems, the folder name is Program Files (x86) or the equivalent.

Changing, Updating and Removing the Tool

To remove, modify, or repair the Inspector:
1. Open the Control Panel.
2. Select the Add or Remove Programs applet.
3. Select Intel Parallel Inspector 2011.
4. Click the Change button.

Converting Evaluation-licensed Products to Fully Licensed Products

To convert your evaluation software to a fully licensed product:
1. From the Start menu, click Start > All Programs > Intel Parallel Studio 2011 > Product Activation
2. Supply a valid product serial number
3. Click Activate

Inspector Documentation

Inspector documentation is automatically integrated into supported versions of Microsoft Visual Studio*. If documentation integration does not work or disappears, follow these steps to restore documentation integration:
1. Click Start > All Programs > Intel Parallel Studio 2011 > Command Prompt and choose any shortcut (such as IA-32 Visual Studio 2005 mode).
2. Remove integration:
   • insp-vsreg -d 2005 to remove the Inspector integration with VS2005
   • insp-vsreg -d 2008 to remove the Inspector integration with VS2008
   • insp-vsreg -d 2010 to remove the Inspector integration with VS2010
3. Restore integration:
   • insp-vsreg -i 2005 to restore the Inspector integration with VS2005
   • insp-vsreg -i 2008 to restore the Inspector integration with VS2008
   • insp-vsreg -i 2010 to restore the Inspector integration with VS2010

If you still cannot access integrated Inspector documentation from the Microsoft Visual Studio* Help menu, try accessing Inspector documentation from the Start menu (Start > Intel Parallel Studio 2011 > Parallel Studio Documentation > Inspector Documentation) or directly from the Inspector Documentation Index at \documentation\\documentation_inspector.htm.

Also, the Inspector Help may be unavailable in Microsoft Visual Studio* software if the language for non-Unicode programs does not match the operating system language: for example, the Japanese Windows* operating system with English language set for non-Unicode programs. Workaround: Configure the language for non-Unicode programs to match the operating system language (go to Control Panel > Regional and Language Options > tab: Advanced).

5 Issues and Limitations

Installation
• Inspector may not install correctly if an installation of other software is in progress.
• If you have both Microsoft Visual Studio* 2005 and 2008 integrated development environments (IDEs) installed on your system and integrate the Intel® Parallel Studio 2011 into both IDEs, removing the integration from one IDE can remove the integrated Intel® Parallel Studio documentation from both IDEs. To work around this problem, follow the instructions provided in the Installation Notes/Inspector Documentation subsection. Follow only the steps for VS2005 and VS2008.

General Issues
• Inspector does not guarantee this software tool will detect or report every memory and threading error in an application.
  • Not all logic errors are detectable.
  • Heuristics used to eliminate false positives may hide real issues.
  • Highly correlated events will be grouped into a single problem.
• You can use the Inspector to analyze applications in debug and release modes. To learn more about options necessary to produce the most accurate, complete results, refer to the following two resources:
  • Memory error analysis: http://software.intel.com/en-us/articles/compiler-settings-for-memory-error-analysis-in-intel-parallel-inspector/
  • Threading error analysis: http://software.intel.com/en-us/articles/compiler-settings-for-threading-error-analysis-in-intel-parrallel-inspector/
• If no symbols are found for a module in which a problem is detected, the Inspector displays the call stack and observation source code of the first location where it can find symbols. If it cannot find any location in the call stack with symbols, it displays the module name and relative virtual address (RVA) for the location.
• Inspector analyzes only one process in an application: the initial process created by the execution of the targeted application. This means an application launched by a script results in analysis of the script, not the process the script starts.
• Applications that crash when run outside the Inspector may crash or hang the Inspector runtime analysis engine. For example, a corrupt return address on an application call stack crashes the runtime analysis engine. If a crash occurs, problems detected prior to that time can be viewed, but memory leaks are not reported.
• Inspector uses a socket to communicate between the graphical user interface and the runtime analysis engine. Preventing an application from opening a socket prevents the Inspector from analyzing the application.
• Inspector may report an incorrect call stack following an interruption of normal call flow, such as when an exception is thrown and caught. While the Inspector recognizes and attempts to correct result data when this situation occurs, it is possible for a threading or memory problem to be reported before the call stack is fully corrected.
• You cannot obtain meaningful results if the application under analysis launches a debugger.
• Synchronization, function calls and memory loads/stores that occur before the Inspector takes control of the program are not visible to the Inspector. Missing these events may cause the tool to report false positives. This situation can occur if these constructs occur in DllMain.
• When using the Help Viewer in Visual Studio 2010 SP1, if you click the Where am I in the Workflow? icon in the upper-right of some Inspector help topics, do the following to resume reading the original topic:
  • Click the original tab (where you clicked the Where am I in the Workflow? icon).
  • Click its Back button.

Threading Error Analysis
• Inspector may report false positives and false negatives when analyzing applications that call Microsoft Windows* ThreadpoolWait, ThreadpoolTimer, and ThreadpoolIo APIs (first introduced in the Microsoft Windows Vista* operating system) or UserMode scheduling (UMS) APIs (first introduced in the Microsoft Windows 7* operating system).
• If you use Intel® Threading Building Blocks (Intel® TBB), set the macro TBB_USE_THREADING_TOOLS at compilation time to enable correct analysis of Intel® TBB applications. Otherwise the Inspector may generate false positives during threading error analysis. If you use Intel® TBB debug libraries, do one of the following to set the macro TBB_USE_THREADING_TOOLS:
  • Use the /MDd switch to set the _DEBUG preprocessor symbol (recommended).
  • Set the macro TBB_USE_DEBUG.
  If you use Intel® TBB release libraries, set the TBB_USE_THREADING_TOOLS macro. See the Intel® TBB documentation for more information.
• Inspector does not detect deadlocks or potential deadlocks created with:
  • Some types of locks via Intel's C/C++ parallel extension (__critical) provided by the Intel® Parallel Composer
  • Some types of locks in Intel® TBB (spin_mutex, spin_rw_mutex)
  • Synchronization objects with non-exclusive ownership, for example, condition variables, semaphores, and events
• Inspector may not detect threading issues on data accessed in the C runtime library (such as memmove and memcpy).
• Inspector does not detect inter-process data races, deadlocks, or potential deadlocks.
• Inspector does not capture the main thread creation site if the .pdb symbol file is not in the location specified within the .exe or .dll executable file, or in the location containing the .exe or .dll executable file.
• Inspector may report false positives for analyzed applications using customized synchronization primitives.

Memory Error Analysis
• On the 64-bit version of the Windows 7* operating system, the Inspector may show incorrect call stacks associated with memory leaks detected by the narrow (mi1) analysis setting. Any stack frames corresponding to functions in libraries/executables that call LoadLibrary() will be missing in call stacks associated with memory leaks. Workaround: Analyze your application using a wider memory analysis setting (mi2 or mi3).
• Inspector does not report memory leaks when using the narrow (mi1) analysis setting if the application under analysis circumvents the normal termination flow and does not call ExitProcess() (which is a call normally made by the runtime library when the application's main function ends). Workaround: Analyze your application using a wider memory analysis setting (mi2 or mi3).
• Inspector does not report memory as leaked if a pointer to the memory is available in the application memory space at the time the application exits, because the application has the ability to free this memory. For example, if an application allocates a block of memory and stores a pointer to the memory in a global variable, this memory is not included in a list of reported memory leaks. Only memory that has no pointer to it is considered a leak.
• Inspector may report false positives when the analyzed application uses custom memory allocators.
• In some circumstances, the Inspector does not record the deallocation of memory freed during application shutdown. For example, the Inspector may not record the event if memory is freed from the destructor of an object that is located in global memory, and that destructor does not execute until late in the shutdown process. Such memory may be reported as a memory leak.
• If the semantics of standard C runtime allocators are changed (the application uses non-standard versions) such that the memory returned by the allocator is initialized, the behavior of the Inspector is unknown and could lead to abnormal analysis termination.
• Inspector may report mismatched allocation/deallocation for an array that appears correct with an allocation of new type[] and a matching delete[] if the code uses #include . This occurs because the underlying implementation brought in by this include file may not actually use a matched deallocation to support backward compatibility.
Applications that use #include are non-conforming C++ applications. Workaround: Make the code conform by using #include (which eliminates this problem), or suppress the code.
• Narrow memory error analysis setting (mi1) may not report leaks for the memory allocated with the operator new from mfc90ud.dll (mfc90u.dll). Workaround: Copy the corresponding pdb-file (mfc90ud.i386.pdb or mfc90ud.AMD64.pdb) from the C:\WINDOWS\symbols\dll directory to the directory where mfc90ud.dll is located.
• The behavior of Memory Leak Analysis level 1 (mi1) is undefined and could lead to abnormal analysis termination if the analyzed application links with the release version of tbbmalloc.dll. Workaround: Use the debug version of tbbmalloc.dll.
• When doing Memory Error Analysis on applications that use fibers or user-level threads, the Inspector may not work properly and/or results may be incorrect in some cases. For such an application, if the "analyze stack accesses" feature is turned on, the application will not work properly and/or data collection will fail. If the "analyze stack accesses" feature is not turned on, then in some cases, incorrect call stacks may be reported. Intel® Cilk™ Plus uses fibers or user-level threads, and as such, this caveat applies to any software that uses Intel® Cilk™ Plus.

Command-line Interface
• Options put in a file and passed to the insp-cl command with the -option-file option cannot use the same syntax alternatives used when entering these options on the command line. The restrictions are as follows:
  • Put a newline character after the final line in the file, otherwise the final character is duplicated.
  • Use '=' between the option name and its value(s).

For more information, please refer to Technical Support.

6 Attributions

wxWindows Library

This tool includes wxWindows software which can be downloaded from http://www.wxwidgets.org/downloads.
wxWindows Library Licence, Version 3.1
======================================

Copyright (C) 1998-2005 Julian Smart, Robert Roebling et al

Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed.

WXWINDOWS LIBRARY LICENCE
TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION

This library is free software; you can redistribute it and/or modify it under the terms of the GNU Library General Public Licence as published by the Free Software Foundation; either version 2 of the Licence, or (at your option) any later version.

This library is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Library General Public Licence for more details.

You should have received a copy of the GNU Library General Public Licence along with this software, usually in a file named COPYING.LIB. If not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA.

EXCEPTION NOTICE

1. As a special exception, the copyright holders of this library give permission for additional uses of the text contained in this release of the library as licenced under the wxWindows Library Licence, applying either version 3.1 of the Licence, or (at your option) any later version of the Licence as published by the copyright holders of version 3.1 of the Licence document.

2. The exception is that you may use, copy, link, modify and distribute under your own terms, binary object code versions of works based on the Library.

3. If you copy code from files distributed under the terms of the GNU General Public Licence or the GNU Library General Public Licence into a copy of this library, as this licence permits, the exception does not apply to the code that you add in this way.
To avoid misleading anyone as to the status of such modified files, you must delete this exception notice from such code and/or adjust the licensing conditions notice accordingly.

4. If you write modifications of your own for this library, it is your choice whether to permit this exception to apply to your modifications. If you do not wish that, you must delete the exception notice from such code and/or adjust the licensing conditions notice accordingly.

Libxml2

Except where otherwise noted in the source code (e.g. the files hash.c, list.c and the trio files, which are covered by a similar license but with different Copyright notices) all the files are:

Copyright (C) 1998-2003 Daniel Veillard. All Rights Reserved.

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE DANIEL VEILLARD BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Except as contained in this notice, the name of Daniel Veillard shall not be used in advertising or otherwise to promote the sale, use or other dealings in this Software without prior written authorization from him.
Boost

Boost Software License - Version 1.0 - August 17th, 2003

Permission is hereby granted, free of charge, to any person or organization obtaining a copy of the software and accompanying documentation covered by this license (the "Software") to use, reproduce, display, distribute, execute, and transmit the Software, and to prepare derivative works of the Software, and to permit third-parties to whom the Software is furnished to do so, all subject to the following:

The copyright notices in the Software and this entire statement, including the above license grant, this restriction and the following disclaimer, must be included in all copies of the Software, in whole or in part, and all derivative works of the Software, unless such copies or derivative works are solely in the form of machine-executable object code generated by a source language processor.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, TITLE AND NONINFRINGEMENT. IN NO EVENT SHALL THE COPYRIGHT HOLDERS OR ANYONE DISTRIBUTING THE SOFTWARE BE LIABLE FOR ANY DAMAGES OR OTHER LIABILITY, WHETHER IN CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Apache

Apache License - Version 2.0 - January 2004
http://www.apache.org/licenses/

TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION

1. Definitions.

"License" shall mean the terms and conditions for use, reproduction, and distribution as defined by Sections 1 through 9 of this document.

"Licensor" shall mean the copyright owner or entity authorized by the copyright owner that is granting the License.

"Legal Entity" shall mean the union of the acting entity and all other entities that control, are controlled by, or are under common control with that entity.
For the purposes of this definition, "control" means (i) the power, direct or indirect, to cause the direction or management of such entity, whether by contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the outstanding shares, or (iii) beneficial ownership of such entity.

"You" (or "Your") shall mean an individual or Legal Entity exercising permissions granted by this License.

"Source" form shall mean the preferred form for making modifications, including but not limited to software source code, documentation source, and configuration files.

"Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, and conversions to other media types.

"Work" shall mean the work of authorship, whether in Source or Object form, made available under the License, as indicated by a copyright notice that is included in or attached to the work (an example is provided in the Appendix below).

"Derivative Works" shall mean any work, whether in Source or Object form, that is based on (or derived from) the Work and for which the editorial revisions, annotations, elaborations, or other modifications represent, as a whole, an original work of authorship. For the purposes of this License, Derivative Works shall not include works that remain separable from, or merely link (or bind by name) to the interfaces of, the Work and Derivative Works thereof.

"Contribution" shall mean any work of authorship, including the original version of the Work and any modifications or additions to that Work or Derivative Works thereof, that is intentionally submitted to Licensor for inclusion in the Work by the copyright owner or by an individual or Legal Entity authorized to submit on behalf of the copyright owner.
For the purposes of this definition, "submitted" means any form of electronic, verbal, or written communication sent to the Licensor or its representatives, including but not limited to communication on electronic mailing lists, source code control systems, and issue tracking systems that are managed by, or on behalf of, the Licensor for the purpose of discussing and improving the Work, but excluding communication that is conspicuously marked or otherwise designated in writing by the copyright owner as "Not a Contribution." "Contributor" shall mean Licensor and any individual or Legal Entity on behalf of whom a Contribution has been received by Licensor and subsequently incorporated within the Work. 2. Grant of Copyright License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare Derivative Works of, publicly display, publicly perform, sublicense, and distribute the Work and such Derivative Works in Source or Object form. 3. Grant of Patent License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Work, where such license applies only to those patent claims licensable by such Contributor that are necessarily infringed by their Contribution(s) alone or by combination of their Contribution(s) with the Work to which such Contribution(s) was submitted. 
If You institute patent litigation against any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Work or a Contribution incorporated within the Work constitutes direct or contributory patent infringement, then any patent licenses granted to You under this License for that Work shall terminate as of the date such litigation is filed.

4. Redistribution. You may reproduce and distribute copies of the Work or Derivative Works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions:

(a) You must give any other recipients of the Work or Derivative Works a copy of this License; and

(b) You must cause any modified files to carry prominent notices stating that You changed the files; and

(c) You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark, and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part of the Derivative Works; and

(d) If the Work includes a "NOTICE" text file as part of its distribution, then any Derivative Works that You distribute must include a readable copy of the attribution notices contained within such NOTICE file, excluding those notices that do not pertain to any part of the Derivative Works, in at least one of the following places: within a NOTICE text file distributed as part of the Derivative Works; within the Source form or documentation, if provided along with the Derivative Works; or, within a display generated by the Derivative Works, if and wherever such third-party notices normally appear. The contents of the NOTICE file are for informational purposes only and do not modify the License.
You may add Your own attribution notices within Derivative Works that You distribute, alongside or as an addendum to the NOTICE text from the Work, provided that such additional attribution notices cannot be construed as modifying the License.

You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such Derivative Works as a whole, provided Your use, reproduction, and distribution of the Work otherwise complies with the conditions stated in this License.

5. Submission of Contributions. Unless You explicitly state otherwise, any Contribution intentionally submitted for inclusion in the Work by You to the Licensor shall be under the terms and conditions of this License, without any additional terms or conditions. Notwithstanding the above, nothing herein shall supersede or modify the terms of any separate license agreement you may have executed with Licensor regarding such Contributions.

6. Trademarks. This License does not grant permission to use the trade names, trademarks, service marks, or product names of the Licensor, except as required for reasonable and customary use in describing the origin of the Work and reproducing the content of the NOTICE file.

7. Disclaimer of Warranty. Unless required by applicable law or agreed to in writing, Licensor provides the Work (and each Contributor provides its Contributions) on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NONINFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely responsible for determining the appropriateness of using or redistributing the Work and assume any risks associated with Your exercise of permissions under this License.

8. Limitation of Liability.
In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, unless required by applicable law (such as deliberate and grossly negligent acts) or agreed to in writing, shall any Contributor be liable to You for damages, including any direct, indirect, special, incidental, or consequential damages of any character arising as a result of this License or out of the use or inability to use the Work (including but not limited to damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even if such Contributor has been advised of the possibility of such damages. 9. Accepting Warranty or Additional Liability. While redistributing the Work or Derivative Works thereof, You may choose to offer, and charge a fee for, acceptance of support, warranty, indemnity, or other liability obligations and/or rights consistent with this License. However, in accepting such obligations, You may act only on Your own behalf and on Your sole responsibility, not on behalf of any other Contributor, and only if You agree to indemnify, defend, and hold each Contributor harmless for any liability incurred by, or claims asserted against, such Contributor by reason of your accepting any such warranty or additional liability. END OF TERMS AND CONDITIONS 7 Legal Information INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. 
UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR.

Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.

Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm

Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. See http://www.intel.com/products/processor_number for details.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary.
You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance

BlueMoon, BunnyPeople, Celeron, Celeron Inside, Centrino, Centrino Inside, Cilk, Core Inside, E-GOLD, i960, Intel, the Intel logo, Intel AppUp, Intel Atom, Intel Atom Inside, Intel Core, Intel Inside, Intel Insider, the Intel Inside logo, Intel NetBurst, Intel NetMerge, Intel NetStructure, Intel SingleDriver, Intel SpeedStep, Intel Sponsors of Tomorrow., the Intel Sponsors of Tomorrow. logo, Intel StrataFlash, Intel vPro, Intel XScale, InTru, the InTru logo, the InTru Inside logo, InTru soundmark, Itanium, Itanium Inside, MCS, MMX, Moblin, Pentium, Pentium Inside, Puma, skoool, the skoool logo, SMARTi, Sound Mark, The Creators Project, The Journey Inside, Thunderbolt, Ultrabook, vPro Inside, VTune, Xeon, Xeon Inside, X-GOLD, XMM, X-PMU and XPOSYS are trademarks of Intel Corporation in the U.S. and other countries.

* Other names and brands may be claimed as the property of others.

Microsoft, Windows, Visual Studio, Visual C++, and the Windows logo are trademarks, or registered trademarks of Microsoft Corporation in the United States and/or other countries.

Java and all Java based trademarks and logos are trademarks or registered trademarks of Sun Microsystems, Inc. in the U.S. and other countries.

Microsoft product screen shot(s) reprinted with permission from Microsoft Corporation.

Portions Copyright (C) 2001, Hewlett-Packard Development Company, L.P.

Copyright © 2009-2011, Intel Corporation. All rights reserved.
Intel® Visual Fortran Composer XE 2011 Getting Started Tutorials

Document Number: 323650-001US
World Wide Web: http://developer.intel.com

Contents

Legal Information
Introducing the Intel® Visual Fortran Composer XE 2011
Prerequisites
Chapter 1: Navigation Quick Start
  Getting Started with the Intel® Visual Fortran Composer XE 2011
  Starting the Intel® Parallel Debugger Extension
Chapter 2: Tutorial: Intel® Fortran Compiler
  Using Auto Vectorization
    Introduction to Auto-vectorization
    Establishing a Performance Baseline
    Generating a Vectorization Report
    Improving Performance by Aligning Data
    Improving Performance with Interprocedural Optimization
    Additional Exercises
  Using Guided Auto-parallelization
    Introduction to Guided Auto-parallelization
    Preparing the Project for Guided Auto-parallelization
    Running Guided Auto-parallelization
    Analyzing Guided Auto-parallelization Reports
    Implementing Guided Auto-parallelization Recommendations
  Using Coarray Fortran
    Introduction to Coarray Fortran
    Compiling the Sample Program
    Controlling the Number of Images

Legal Information

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL(R) PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR.

Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.
Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.

Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or by visiting Intel's Web Site.

Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. See http://www.intel.com/products/processor_number for details.

Centrino, Cilk, Intel, Intel Atom, Intel Core, Intel NetBurst, Itanium, MMX, Pentium, Xeon are trademarks of Intel Corporation in the U.S. and/or other countries.

* Other names and brands may be claimed as the property of others.

Portions Copyright (C) 2001, Hewlett-Packard Development Company, L.P.

Microsoft product screen shot(s) reprinted with permission from Microsoft Corporation.

Copyright © 2011, Intel Corporation. All rights reserved.

Introducing the Intel® Visual Fortran Composer XE 2011

This guide shows you how to start the Intel® Visual Fortran Composer XE 2011 and begin debugging code using the Intel® Parallel Debugger Extension. The Intel® Visual Fortran Composer XE 2011 is a comprehensive set of software development tools that includes the following components:
• Intel® Fortran Compiler
• Intel® Math Kernel Library
• Intel® Parallel Debugger Extension

Check http://software.intel.com/en-us/articles/intel-software-product-tutorials/ for the following:
• ShowMe video for using Intel® Visual Fortran Composer XE with Microsoft Visual Studio*

Prerequisites

You need the following tools, skills, and knowledge to effectively use these tutorials.

NOTE.
Although the instructions and screen captures in these tutorials refer to the Visual Studio* 2005 integrated development environment (IDE), you can use these tutorials with later versions of Visual Studio.

Required Tools

You need the following tools to use these tutorials:
• Microsoft Visual Studio 2005 or later.
• Intel® Visual Fortran Composer XE 2011.
• Sample code included with the Intel® Visual Fortran Composer XE 2011.

NOTE.
• Samples are non-deterministic. Your results may vary from the examples shown throughout these tutorials.
• Samples are designed only to illustrate features and do not represent best practices for creating multithreaded code.

Required Skills and Knowledge

These tutorials are designed for developers with a basic understanding of Microsoft Visual Studio, including how to:
• open a project/solution.
• access the Document Explorer (valid in Microsoft Visual Studio 2005/2008).
• display the Solution Explorer.
• compile and link a project.
• ensure a project compiled successfully.

Chapter 1: Navigation Quick Start

Getting Started with the Intel® Visual Fortran Composer XE 2011

The Intel® Visual Fortran Composer XE 2011 integrates into the following versions of the Microsoft Visual Studio* Integrated Development Environment (IDE):
• Microsoft Visual Studio 2010*
• Microsoft Visual Studio 2008*
• Microsoft Visual Studio 2005*

If you do not have one of these Microsoft products on your system, the Intel® Visual Fortran Composer XE 2011 installation can install Microsoft Visual Studio 2008 Shell and Libraries*.

To start the Intel® Visual Fortran Compiler XE 12.1 from Microsoft Visual Studio* IDE, perform the following steps:
1. Launch Microsoft Visual Studio*.
2. Select File > New > Project.
3. In the New Project window select a project type under Intel® Visual Fortran.
4. Select the desired template.
5. Click OK.

Setting Compiler Options

1. Select Project > Properties.
   The Property Pages for your solution display.
2. Locate Fortran in the list and expand the heading.
3. Step through the available properties to select your configuration.

The results of the compilation display in the Output window.

Starting the Intel® Parallel Debugger Extension

The Intel® Parallel Debugger Extension for Microsoft Visual Studio* is a debugging add-on for the Intel® Compiler's parallel code development features. It facilitates developing parallelism into applications based on the Intel® OpenMP* runtime environment.

The Intel® Parallel Debugger Extension provides:
• A new Microsoft Visual Studio* toolbar
• An extension to the Microsoft Visual Studio* Debug menu
• A set of new views and dialogs that are invoked from the toolbar or the menu tree

The debugger features include:
• Fortran language support including Fortran 95/90
• Assembler language support
• Access to the registers your application accesses
• Bitfield editor to modify registers
• MMU support

Preparing Applications for Parallel Debugging

You must enable the parallel debug instrumentation with the compiler to enable parallel debugging, such as analyzing shared data or breaking at re-entrant function calls.

To enable the parallel debug instrumentation:
1. Open your application project in Microsoft Visual Studio*.
2. Select Project > Properties... from the menu. The Projectname Property Pages dialog box opens.
3. Enable Parallel debug checking.
   1. Select Configuration Properties > Fortran > Debugging in the left pane.
   2. Under Enable Parallel Debug Checks, select Yes (/debug:parallel).
4. Click OK.
5. Rebuild your application.

Your application is now instrumented for parallel debugging using the features of the Intel® Parallel Debugger Extension.
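To make the kind of code this instrumentation targets more concrete, here is a minimal sketch of an OpenMP program that works on shared data. The program name, array size, and variable names are illustrative assumptions and are not taken from the product samples. A program like this would typically be built with OpenMP enabled (for example, /Qopenmp on Windows) in addition to the /debug:parallel setting described above; without OpenMP enabled, the directives are treated as comments and the program runs serially.

```fortran
! Hypothetical example (not from the product samples): a small OpenMP
! program with a shared array and a reduction variable.
program shared_sum
  implicit none
  integer, parameter :: n = 1000
  integer :: i
  real(8) :: x(n), total

  total = 0.0d0

  ! x is shared among threads; each iteration writes a distinct element,
  ! so the threads do not conflict.
  !$omp parallel do shared(x) private(i)
  do i = 1, n
     x(i) = real(i, 8)
  end do
  !$omp end parallel do

  ! The reduction clause gives each thread a private partial sum and
  ! combines them at the end, which avoids a data race on total.
  !$omp parallel do shared(x) private(i) reduction(+:total)
  do i = 1, n
     total = total + x(i)
  end do
  !$omp end parallel do

  print '(a,f10.1)', 'sum = ', total
end program shared_sum
```

Removing the reduction clause from the second loop would turn total into unsynchronized shared data, which is the sort of access that shared-data analysis is meant to help find.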
Chapter 2: Tutorial: Intel® Fortran Compiler

Using Auto Vectorization

Introduction to Auto-vectorization

For the Intel® Fortran Compiler, vectorization is the unrolling of a loop combined with the generation of packed SIMD instructions. Because the packed instructions operate on more than one data element at a time, the loop can execute more efficiently. Vectorization is sometimes referred to as auto-vectorization to emphasize that the compiler automatically identifies and optimizes suitable loops on its own.

Using the -vec (Linux* OS) or the /Qvec (Windows* OS) option enables vectorization at default optimization levels for both Intel® microprocessors and non-Intel microprocessors. Vectorization may call library routines that can result in a greater performance gain on Intel microprocessors than on non-Intel microprocessors. Vectorization can also be affected by certain options, such as /arch or /Qx (Windows) or -m or -x (Linux and Mac OS X).

Vectorization is enabled with the Intel Fortran Compiler at optimization levels of /O2 and higher. Many loops are vectorized automatically, but in cases where this doesn't happen, you may be able to vectorize loops by making simple code modifications. In this tutorial, you will:
• establish a performance baseline
• generate a vectorization report
• improve performance by aligning data
• improve performance using Interprocedural Optimization

Locating the Samples

To begin this tutorial, open the vec_samples.zip archive in the product's Samples directory:

\Samples\\Fortran\vec_samples.zip

Use these files for this tutorial:
• matrix_vector_multiplication_f.sln
• matrix_vector_multiplication_f.vcproj
• driver.f90
• matvec.f90

Open the Microsoft Visual Studio solution file, matrix_vector_multiplication_f.sln, and follow the steps below to prepare the project for the vectorization exercises in this tutorial:
1. Change the Active solution configuration to Release using Build > Configuration Manager.
2. Clean the solution by selecting Build > Clean Solution.

Establishing a Performance Baseline

To set a performance baseline for the improvements that follow in this tutorial, build your project with these settings:
1. Select Project > Properties > Fortran > Optimization > Optimization > Minimum Size (/O1).
2. Select Project > Properties > Fortran > Data > Default Real KIND > 8 (/real_size:64).
3. Rebuild the project, then run the executable (Debug > Start Without Debugging) and record the execution time reported in the output. This is the baseline against which subsequent improvements will be measured.

Generating a Vectorization Report

A vectorization report tells you whether the loops in your code were vectorized, and if not, explains why not.

Add the /Qvec-report1 option by selecting Project > Properties > Fortran > Diagnostics > Vectorizer Diagnostic Level > Loops Successfully Vectorized (1) (/Qvec-report1).

Because vectorization is off at /O1, the compiler does not generate a vectorization report, so recompile at /O2 (default optimization): select Fortran > Optimization > Optimization > Maximize Speed.

Record the new execution time. The reduction in time is mostly due to auto-vectorization of the inner loop at line 32 noted in the vectorization report:

matvec.f90(32) (col. 3): remark: LOOP WAS VECTORIZED.
matvec.f90(38) (col. 6): remark: LOOP WAS VECTORIZED.
driver.f90(59) (col. 5): remark: LOOP WAS VECTORIZED.
driver.f90(61) (col. 5): remark: LOOP WAS VECTORIZED.
driver.f90(80) (col. 29): remark: LOOP WAS VECTORIZED.

The /Qvec-report2 option returns a list that also includes loops that were not vectorized, along with the reason why the compiler did not vectorize them. Change /Qvec-report1 to /Qvec-report2.
Also, for Linker > Command Line > Additional Options, add /Qvec-report2.

Rebuild your project.

The vectorization report indicates that the loop at line 33 in matvec.f90 did not vectorize because it is not the innermost loop of the loop nest.

matvec.f90(32) (col. 3): remark: LOOP WAS VECTORIZED.
matvec.f90(33) (col. 3): remark: loop was not vectorized: not inner loop.
matvec.f90(38) (col. 6): remark: LOOP WAS VECTORIZED.
driver.f90(59) (col. 5): remark: loop was not vectorized: not inner loop.
driver.f90(59) (col. 5): remark: loop was not vectorized: vectorization possible but seems inefficient.
driver.f90(59) (col. 5): remark: loop was not vectorized: not inner loop.
driver.f90(59) (col. 5): remark: loop was not vectorized: subscript too complex.
driver.f90(59) (col. 5): remark: loop was not vectorized: not inner loop.
driver.f90(59) (col. 5): remark: LOOP WAS VECTORIZED.
driver.f90(61) (col. 5): remark: loop was not vectorized: vectorization possible but seems inefficient.
driver.f90(61) (col. 5): remark: LOOP WAS VECTORIZED.
driver.f90(80) (col. 29): remark: LOOP WAS VECTORIZED.
driver.f90(74) (col. 7): remark: loop was not vectorized: nonstandard loop is not a vectorization candidate.

NOTE. For more information on the /Qvec-report compiler option, see the Compiler Options section in the Compiler User and Reference Guide.

Improving Performance by Aligning Data

The vectorizer can generate faster code when operating on aligned data. In this activity you will improve the vectorizer performance by aligning the arrays a, b, and c in driver.f90 on a 16-byte boundary so the vectorizer can use aligned load instructions for all arrays rather than the slower unaligned load instructions and can avoid runtime tests of alignment.
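As a rough illustration of the idea, the sketch below declares aligned arrays and pads the column height of a so that every column starts on a 16-byte boundary. The program name, the sizes, and the padding arithmetic are assumptions made for illustration; this is not the code in driver.f90 or matvec.f90. The !dir$ lines are Intel compiler directives, and other compilers treat them as ordinary comments.

```fortran
! Illustrative sketch only; not the tutorial's driver.f90 or matvec.f90.
program align_sketch
  implicit none
  integer, parameter :: rows = 101, cols = 50
  ! With 8-byte reals, two elements fill 16 bytes, so the column height
  ! is rounded up to the next even number (101 becomes 102).
  integer, parameter :: rows_padded = 2 * ((rows + 1) / 2)
  real(8) :: a(rows_padded, cols), b(rows_padded), c(cols)
  integer :: i, j
  ! Request 16-byte alignment for the start of each array.
  !dir$ attributes align : 16 :: a, b, c

  a = 1.0d0
  b = 0.0d0
  c = 2.0d0

  ! Accumulate b = b + a*c one column at a time. Because rows_padded is
  ! even, every column a(:, j) starts on a 16-byte boundary, so the
  ! inner loop can be marked as operating on aligned data.
  do j = 1, cols
     !dir$ vector aligned
     do i = 1, rows
        b(i) = b(i) + a(i, j) * c(j)
     end do
  end do

  print '(a,i0)', 'padded column height: ', rows_padded
  print '(a,f8.1)', 'b(1) = ', b(1)
end program align_sketch
```

The padding rows are allocated but never used in the computation; their only purpose is to keep the alignment of each column constant.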
Using the ALIGNED macro will insert an alignment directive for a, b, and c in driver.f90 with the following syntax:

    !dir$ attributes align : 16 :: a,b,c

This instructs the compiler to create arrays that are aligned on a 16-byte boundary, which should facilitate the use of SSE aligned load instructions.

In addition, the column height of the matrix a needs to be padded out to be a multiple of 16 bytes, so that each individual column of a maintains the same 16-byte alignment. In practice, maintaining a constant alignment between columns is much more important than aligning the start of the arrays.

To derive the maximum benefit from this alignment, we also need to tell the vectorizer it can safely assume that the arrays in matvec.f90 are aligned by using the directive

    !dir$ vector aligned

NOTE. If you use !dir$ vector aligned, you must be sure that all the arrays or subarrays in the loop are 16-byte aligned. Otherwise, you may get a runtime error. Aligning data may still give a performance benefit even if !dir$ vector aligned is not used. See the code under the ALIGNED macro in matvec.f90.

If your compilation targets the Intel® AVX instruction set, you should try to align data on a 32-byte boundary. This may result in improved performance. In this case, !dir$ vector aligned advises the compiler that the data is 32-byte aligned.

Rebuild the program after adding the ALIGNED Preprocessor Definition to ensure consistently aligned data: Fortran > Preprocessor > Preprocessor Definitions

Rebuild your project.

matvec.f90(32) (col. 3): remark: LOOP WAS VECTORIZED.
matvec.f90(33) (col. 3): remark: loop was not vectorized: not inner loop.
matvec.f90(38) (col. 6): remark: LOOP WAS VECTORIZED.
driver.f90(59) (col. 5): remark: loop was not vectorized: not inner loop.
driver.f90(59) (col. 5): remark: loop was not vectorized: vectorization possible but seems inefficient.
driver.f90(59) (col. 5): remark: loop was not vectorized: not inner loop.
driver.f90(59) (col. 5): remark: loop was not vectorized: subscript too complex.
driver.f90(59) (col. 5): remark: loop was not vectorized: not inner loop.
driver.f90(59) (col. 5): remark: LOOP WAS VECTORIZED.
driver.f90(61) (col. 5): remark: loop was not vectorized: vectorization possible but seems inefficient.
driver.f90(61) (col. 5): remark: LOOP WAS VECTORIZED.
driver.f90(63) (col. 21): remark: loop was not vectorized: not inner loop.
driver.f90(63) (col. 21): remark: LOOP WAS VECTORIZED.
driver.f90(80) (col. 29): remark: LOOP WAS VECTORIZED.
driver.f90(74) (col. 7): remark: loop was not vectorized: nonstandard loop is not a vectorization candidate.

Improving Performance with Interprocedural Optimization

The compiler may be able to perform additional optimizations if it is able to optimize across procedure and source-file boundaries. These may include, but are not limited to, function inlining. This is enabled with the /Qipo option.

Rebuild the program using the /Qipo option to enable interprocedural optimization. Select Optimization > Interprocedural Optimization > Multi-file (/Qipo).

Note that the vectorization messages now appear at the point of inlining in driver.f90 (line 70).

driver.f90(59) (col. 5): remark: loop was not vectorized: not inner loop.
driver.f90(59) (col. 5): remark: loop was not vectorized: vectorization possible but seems inefficient.
driver.f90(59) (col. 5): remark: loop was not vectorized: not inner loop.
driver.f90(59) (col. 5): remark: loop was not vectorized: subscript too complex.
driver.f90(59) (col. 5): remark: LOOP WAS VECTORIZED.
driver.f90(61) (col. 5): remark: loop was not vectorized: vectorization possible but seems inefficient.
driver.f90(61) (col. 5): remark: LOOP WAS VECTORIZED.
driver.f90(63) (col. 21): remark: loop was not vectorized: not inner loop.
driver.f90(63) (col. 21): remark: LOOP WAS VECTORIZED.
driver.f90(73) (col. 16): remark: loop was not vectorized: not inner loop.
driver.f90(70) (col. 14): remark: LOOP WAS VECTORIZED.
driver.f90(70) (col. 14): remark: loop was not vectorized: not inner loop.
driver.f90(70) (col. 14): remark: LOOP WAS VECTORIZED.
driver.f90(80) (col. 29): remark: LOOP WAS VECTORIZED.

Now, run the executable and record the execution time.

Additional Exercises

The previous examples made use of double precision arrays. They may be built instead with single precision arrays by changing the command-line option /real-size:64 to /real-size:32.

The non-vectorized versions of the loop execute only slightly faster than the double precision version; however, the vectorized versions are substantially faster. This is because a packed SIMD instruction operating on a 16-byte vector register operates on four single precision data elements at once instead of two double precision data elements.

NOTE. In the example with data alignment, you will need to set ROWBUF=3 to ensure 16-byte alignment for each row of the matrix a. Otherwise, the directive !dir$ vector aligned will cause the program to fail.

This completes the tutorial for auto-vectorization, where you have seen how the compiler can optimize performance with various vectorization techniques.

Using Guided Auto-parallelization

Introduction to Guided Auto-parallelization

Guided Auto-parallelization (GAP) is a feature of the Intel® Fortran Compiler that offers selective advice and, when correctly applied, results in auto-vectorization or auto-parallelization for serially-coded applications. Using the /Qguide option with your normal compiler options at /O2 or higher is sufficient to enable the GAP technology to generate the advice for auto-vectorization. Using /Qguide in conjunction with /Qparallel will enable the compiler to generate advice for auto-parallelization.

In this tutorial, you will:
1. prepare the project for Guided Auto-parallelization.
2. run Guided Auto-parallelization.
3. analyze Guided Auto-parallelization reports.
4. implement Guided Auto-parallelization recommendations.

Preparing the Project for Guided Auto-parallelization

To begin this tutorial, open the GuidedAutoParallel.zip archive in the product's Samples directory at:

\Samples\\Fortran\

The following Visual Studio* 2005 project files and source files are included:
• GAP-f.sln
• GAP-f.vfproj
• main.f90
• scalar_dep.f90

Open the Microsoft Visual Studio Solution file, GAP-f.sln, and follow the steps below to prepare the project for Guided Auto-parallelization (GAP).
1. Clean the Solution by selecting Build > Clean Solution.
2. Since GAP is enabled only with option /O2 or higher, you will need to change the build configuration to Release using Build > Configuration Manager.

Running Guided Auto-parallelization

There are several ways to run GAP analysis in Visual Studio, depending on whether you want analysis for the whole solution, the project, a single file, a function, or a range of lines in your source code. In this tutorial, we will use single-file analysis. Follow the steps below to run a single-file analysis on scalar_dep.f90 in the GAP-f project:
1. In the GAP-f project, right-click on scalar_dep.f90.
2. Select Intel Visual Fortran Composer XE > Guided Auto Parallelism > Run Analysis on file "scalar_dep.f90"
3. If the /Qipo option is enabled, the Analysis with Multi-file optimization dialog appears. Click Run Analysis.
4. On the Configure Analysis dialog, click Run Analysis using the choices shown here:

NOTE. If you select Send remarks to a file, GAP messages will not be available in the Output window or Error List window.

See the GAP Report in the Output window. GAP reports in the standard Output window are encapsulated with GAP REPORT LOG OPENED and END OF GAP REPORT LOG.
Also, see the GAP Messages in the Error List window:

Analyzing Guided Auto-parallelization Reports

Analyze the output generated by GAP analysis and determine whether or not the specific suggestions are appropriate for the specified source code. For this sample tutorial, GAP generates output for the loop in scalar_dep.f90:

    do i = 1, n
      if (a(i) >= 0) then
        t = i
      end if
      if (a(i) > 0) then
        a(i) = t * (1 / (a(i) * a(i)))
      end if
    end do

In this example, the GAP Report generates a recommendation (remark #30761) to add the /Qparallel option to improve auto-parallelization. Remark #30515 indicates that if variable t can be unconditionally assigned, the compiler will be able to vectorize the loop.

Implementing Guided Auto-parallelization Recommendations

The GAP Report in this example recommends using the /Qparallel option to enable parallelization. Follow these steps to enable this option:
1. Right-click on the GAP-f project and select Properties.
2. On the Property Pages dialog, expand the Fortran heading and select Optimization.
3. In the right-hand pane, select Parallelization, then choose Yes (/Qparallel) and click OK.

Now, run the GAP Analysis again and review the GAP Report:

Apply the necessary changes after verifying that the GAP recommendations are appropriate and do not change the semantics of the program.

For this loop, the conditional compilation enables parallelization and vectorization of the loop as recommended by GAP:

    do i = 1, n
    !dir$ if defined(test_gap)
      t = i
    !dir$ else
      if (a(i) >= 0) then
        t = i
      end if
    !dir$ endif
      if (a(i) > 0) then
        a(i) = t * (1 / (a(i) * a(i)))
      end if
    end do

To verify that the loop is parallelized and vectorized:
1. Add the options /Qdiag-enable:par /Qdiag-enable:vec to the Command Line > Additional Options dialog.
2. Add the preprocessor definition test_gap to compile the appropriate code path.
3. Rebuild the GAP-f project and note the reports in the output window:

For more information on using the -guide, -vec-report, and -par-report compiler options, see the Compiler Options section in the Compiler User Guide and Reference.

This completes the tutorial for Guided Auto-parallelization, where you have seen how the compiler can guide you to an optimized solution through auto-parallelization.

Using Coarray Fortran

Introduction to Coarray Fortran

The Intel® Fortran Compiler XE supports parallel programming using coarrays as defined in the Fortran 2008 standard. As an extension to the Fortran language, coarrays offer one method to use Fortran as a robust and efficient parallel programming language. Coarray Fortran uses a single-program, multi-data programming model (SPMD).

Coarrays are supported in the Intel® Fortran Compiler XE for Linux* and Intel® Visual Fortran Compiler XE for Windows*.

This tutorial demonstrates how to compile a simple coarray Fortran application using the Intel Fortran Compiler XE, and how to control the number of images (processes) for the application.

Locating the Sample

To begin this tutorial, locate the source file in the product's Samples directory:

\Samples\\Fortran\coarray_samples.zip

Extract the Visual Studio project files from the .zip archive to a working directory:
• coarray_samples.sln
• coarray_samples.vfproj
• hello_image.f90

NOTE. The Intel Fortran Compiler implementation of coarrays follows the standard provided in a draft version of the Fortran 2008 Standard. Not all features present in the Fortran 2008 Standard may be implemented by Intel. Consult the Release Notes for a list of supported features.

Compiling the Sample Program

The hello_image.f90 sample is a hello world program.
Unlike the usual hello world, this coarray Fortran program will spawn multiple images, or processes, that will run concurrently on the host computer. Examining the source code for this application shows a simple Fortran program:

    program hello_image
      write(*,*) "Hello from image ", this_image(), &
                 "out of ", num_images()," total images"
    end program hello_image

Note the function calls to this_image() and num_images(). These are new Fortran 2008 intrinsic functions. The num_images() function returns the total number of images or processes spawned for this program. The this_image() function returns a unique identifier for each image in the range 1 to N, where N is the total number of images created for this program.

After installing the Intel® Visual Fortran Composer XE 2011, start Microsoft Visual Studio* and open the coarray_samples.sln file. To build the project using coarrays, select:

Project > Properties > Fortran > Command Line > /Qcoarray

Now, build the solution (Build > Build Solution), then run the executable (Debug > Start Without Debugging). Your output should be similar to this:

    Hello from image 1 out of 8 total images
    Hello from image 6 out of 8 total images
    Hello from image 7 out of 8 total images
    Hello from image 2 out of 8 total images
    Hello from image 5 out of 8 total images
    Hello from image 8 out of 8 total images
    Hello from image 3 out of 8 total images
    Hello from image 4 out of 8 total images

By default, when a Coarray Fortran application is compiled with the Intel Fortran Compiler, the invocation creates as many images as there are processor cores on the host platform. The example shown above was run on a dual quad-core host system with eight total cores. As shown, each image is a separately spawned process on the system and executes asynchronously.

NOTE. The /Qcoarray option cannot be used in conjunction with /Qopenmp options.
One cannot mix Coarray Fortran language extensions with OpenMP extensions.

Controlling the Number of Images

There are two methods to control the number of images created for a Coarray Fortran application. First, you can use the /Qcoarray-num-images=N compiler option to compile the application, where N is the number of images. This option sets the number of images created for the application during run time. For example, use the /Qcoarray-num-images=2 option to limit the number of images of the hello_image.f90 program to exactly two.

To use the /Qcoarray-num-images=N option, select:

Project > Properties > Fortran > Command Line > /Qcoarray-num-images=N

In this example, we use /Qcoarray-num-images=2 to generate the following output:

    Hello from image 2 out of 2 total images
    Hello from image 1 out of 2 total images

The second way to control the number of images is to use the environment variable FOR_COARRAY_NUM_IMAGES, setting this to the number of images you want to spawn.

As an example, recompile hello_image.f90 without the /Qcoarray-num-images option. Before running the executable, set the environment variable FOR_COARRAY_NUM_IMAGES to the number of images you want created during the program run.

To set an environment variable in Visual Studio, select Project Properties > Configuration Properties > Debugging > Environment. Then set FOR_COARRAY_NUM_IMAGES=N, where N is the number of images you want to create at runtime. With N set to 3, the output is similar to:

    Hello from image 3 out of 3 total images
    Hello from image 2 out of 3 total images
    Hello from image 1 out of 3 total images

NOTE. Setting FOR_COARRAY_NUM_IMAGES=N overrides the /Qcoarray-num-images compiler option.
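The hello_image.f90 sample only queries this_image() and num_images(); it does not declare a coarray. As a minimal sketch of how images could share data, the following program gives every image its own copy of a coarray variable and lets image 1 read them all. It is not part of the product samples: the names image_sum, partial, total, and img are hypothetical, and the sketch assumes the compiler supports the Fortran 2008 coarray syntax and the sync all statement.

```fortran
program image_sum
  implicit none
  integer :: partial[*]     ! coarray: one copy of partial exists on every image
  integer :: total, img

  partial = this_image()    ! each image stores its own index locally
  sync all                  ! wait until every image has written its copy

  if (this_image() == 1) then
     total = 0
     do img = 1, num_images()
        total = total + partial[img]   ! read the copy held by image img
     end do
     write(*,*) "Sum over", num_images(), "images:", total
  end if
end program image_sum
```

Built with /Qcoarray as described above, the number of images that contribute to the sum would follow the same rules as for hello_image.f90: the core count by default, or the value given through /Qcoarray-num-images=N or FOR_COARRAY_NUM_IMAGES.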
Document Number: XXXXXX

Intel® Rapid Storage Technology
User Guide
August 2011
Revision 1.0

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. Intel products are not intended for use in medical, life saving, or life sustaining applications.

Intel may make changes to specifications and product descriptions at any time, without notice.

Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them.

The Intel® Rapid Storage Technology may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.

Intel, Intel® Rapid Storage Technology, Intel® Matrix Storage Technology, Intel® Rapid Recover Technology, and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.

Other names and brands may be claimed as the property of others.

Copyright © 2011, Intel Corporation.
All rights reserved.

Contents

1 Introduction
  1.1 Terminology
2 Intel® Rapid Storage Technology Features
  2.1 Feature Overview
  2.2 RAID 0 (Striping)
  2.3 RAID 1 (Mirroring)
  2.4 RAID 5 (Striping with Parity)
  2.5 RAID 10
  2.6 Matrix RAID
  2.7 RAID Migration
  2.8 RAID Level Migration
  2.9 Intel® Rapid Recover Technology
  2.10 Advanced Host Controller Interface
    2.10.1 Native Command Queuing
    2.10.2 Hot Plug
3 RAID BIOS Configuration
  3.1 Overview
  3.2 Enabling RAID in BIOS
4 Intel® Rapid Storage Technology Option ROM
  4.1 Overview
  4.2 User Interface
  4.3 Version Identification
  4.4 RAID Volume Creation
5 Loading Driver during Operating System Installation
  5.1 Overview
  5.2 F6 Installation Method
    5.2.1 Automatic F6 Diskette Creation
    5.2.2 Manual F6 Diskette Creation
    5.2.3 F6 Installation Steps
6 Intel® Rapid Storage Technology Installation
  6.1 Overview
  6.2 Where to Obtain the Software
  6.3 Installation Steps
  6.4 Confirming Software Installation
  6.5 Version Identification
7 RAID-Ready Setup
  7.1 Overview
  7.2 System Requirements
  7.3 RAID-Ready System Setup Steps
8 Converting RAID-Ready to Full RAID
  8.1 Overview
  8.2 RAID-Ready to 2-drive RAID 1
9 Verify and Repair
  9.1 Overview
  9.2 Actions during Verify and Repair
Appendix A: Error Messages
  A.1 Incompatible Hardware
  A.2 Operating System Not Supported
  A.3 Source Hard Drive Cannot Be Larger
  A.4 Hard Drive Has System Files
  A.5 Source Hard Drive is Dynamic Disk

1 Introduction

The purpose of this document is to enable a user to properly set up and configure a system using Intel® Rapid Storage Technology. It provides steps for setup and configuration, as well as a brief overview of Intel® Rapid Storage Technology features.

The information in this document is relevant only on systems with a supported Intel chipset and a supported operating system. Supported Intel chipset and operating system information is available at the Intel® Rapid Storage Technology support web page.
Note: The majority of the information in this document is related to either software configuration or hardware integration. Intel is not responsible for the software written by third party vendors or the implementation of Intel components in the products of third party manufacturers. Customers should always contact the place of purchase or system/software manufacturer with support questions about their specific hardware or software configuration.

1.1 Terminology

Term
    Description

AHCI
    Advanced Host Controller Interface: an interface specification that allows the storage driver to enable advanced Serial ATA features such as Native Command Queuing, native hot plug, and power management.
Continuous Update Policy
    When a recovery volume is using this policy, data on the master drive is copied to the recovery drive automatically as long as both drives are connected to the system.
Intel® Rapid Storage Technology Option ROM
    A code module built into the system BIOS that provides boot support for RAID volumes as well as a user interface for configuring and managing RAID volumes.
Master Drive
    The hard drive that is the designated source drive in a recovery volume.
Matrix RAID
    Two independent RAID volumes within a single RAID array.
Member
    A hard drive used within a RAID array.
Migration
    The process of converting a system's data storage configuration from a non-RAID configuration (pass-thru) to a RAID configuration.
Hot Plug
    The unannounced removal and insertion of a Serial ATA hard drive while the system is powered on.
NCQ
    Native Command Queuing: a command protocol in Serial ATA that allows multiple commands to be outstanding within a hard drive at the same time. The commands are dynamically reordered to increase hard drive performance.
On Request Update Policy
    When a recovery volume is using this policy, data on the master drive is copied to the recovery drive when you request it. Only changes since the last update process are copied.
OS
    Operating System
Port0
    A Serial ATA port (connector) on a motherboard identified as Port0.
Port1
    A Serial ATA port (connector) on a motherboard identified as Port1.
Port2
    A Serial ATA port (connector) on a motherboard identified as Port2.
Port3
    A Serial ATA port (connector) on a motherboard identified as Port3.
POST
    Power-On Self Test
RAID
    Redundant Array of Independent Drives: allows data to be distributed across multiple hard drives to provide data redundancy or to enhance data storage performance.
RAID 0 (striping)
    The data in the RAID volume is striped across the array's members. Striping divides data into units and distributes those units across the members. It improves read/write performance but creates no data redundancy.
RAID 1 (mirroring)
    The data in the RAID volume is mirrored across the RAID array's members. Mirroring is the term used to describe the key feature of RAID 1, which writes duplicate data to each member, thereby creating data redundancy and increasing fault tolerance.
RAID 5 (striping with parity)
    The data in the RAID volume and parity are striped across the array's members. Parity information is written with the data in a rotating sequence across the members of the array. This RAID level is a preferred configuration for efficiency, fault-tolerance, and performance.
RAID 10 (striping and mirroring)
    The RAID level where information is striped across a two disk array for system performance. Each of the drives in the array has a mirror for fault tolerance. RAID 10 provides the performance benefits of RAID 0 and the redundancy of RAID 1. However, it requires four hard drives.
RAID Array
    A logical grouping of physical hard drives.
RAID Level Migration
    The process of converting a system's data storage configuration from one RAID level to another.
RAID Volume
    A fixed amount of space across a RAID array that appears as a single physical hard drive to the operating system.
    Each RAID volume is created with a specific RAID level to provide data redundancy or to enhance data storage performance.
Recovery Drive
    The hard drive that is the designated target drive in a recovery volume.
Recovery Volume
    A volume utilizing Intel® Rapid Recover Technology.

2 Intel® Rapid Storage Technology Features

2.1 Feature Overview

The Intel® Rapid Storage Technology software package provides high-performance Serial ATA (SATA) and SATA RAID capabilities for supported operating systems. The key features of the Intel® Rapid Storage Technology are as follows:
• RAID 0
• RAID 1
• RAID 5
• RAID 10
• Matrix RAID
• RAID migration and RAID level migration
• Intel® Rapid Recover Technology
• Advanced Host Controller Interface (AHCI) support

2.2 RAID 0 (Striping)

RAID 0 uses the read/write capabilities of two or more hard drives working in unison to maximize the storage performance of a computer system. The following table provides an overview of the advantages, the level of fault-tolerance provided and the typical usage of RAID 0.

RAID 0 Overview
Hard Drives Required: 2-6
Advantage: Highest transfer rates
Fault-tolerance: None – if one disk fails all data will be lost
Application: Typically used in desktops and workstations for maximum performance for temporary data and high I/O rate. 2-drive RAID 0 available in specific mobile configurations.

2.3 RAID 1 (Mirroring)

A RAID 1 array contains two hard drives where the data between the two is mirrored in real time to provide good data reliability in the case of a single disk failure; when one disk drive fails, all data is immediately available on the other without any impact to the integrity of the data. The following table provides an overview of the advantages, the level of fault-tolerance provided and the typical usage of RAID 1.

RAID 1 Overview
Hard Drives Required: 2
Advantage: 100% redundancy of data. One disk may fail, but data will continue to be accessible.
A rebuild to a new disk is recommended to maintain data redundancy.
Fault-tolerance: Excellent – disk mirroring means that all data on one disk is duplicated on another disk.
Application: Typically used for smaller systems where capacity of one disk is sufficient and for any application(s) requiring very high availability. Available in specific mobile configurations.

2.4 RAID 5 (Striping with Parity)

A RAID 5 array contains three or more hard drives where the data and parity are striped across all the hard drives in the array. Parity is a mathematical method for recreating data that was lost from a single drive, which increases fault-tolerance. The following table provides an overview of the advantages, the level of fault-tolerance provided and the typical usage of RAID 5.

RAID 5 Overview
Hard Drives Required: 3-6
Advantage: Higher percentage of usable capacity and high read performance as well as fault-tolerance.
Fault-tolerance: Excellent - parity information allows data to be rebuilt after replacing a failed hard drive with a new drive.
Application: Storage of large amounts of critical data. Not available in mobile configurations.

2.5 RAID 10

A RAID 10 array uses four hard drives to create a combination of RAID levels 0 and 1. It is a striped set whose members are each a mirrored set. The following table provides an overview of the advantages, the level of fault-tolerance provided and the typical usage of RAID 10.

RAID 10 Overview
Hard Drives Required: 4
Advantage: Combines the read performance of RAID 0 with the fault-tolerance of RAID 1.
Fault-tolerance: Excellent – disk mirroring means that all data on one disk is duplicated on another disk.
Application: High-performance applications requiring data protection, such as video editing. Not available in mobile configurations.

2.6 Matrix RAID

Matrix RAID allows you to create two RAID volumes on a single RAID array.
As an example, on a system with an Intel® 82801GR I/O controller hub (ICH7R), Intel® Rapid Storage Technology allows you to create both a RAID 0 volume and a RAID 5 volume across four Serial ATA hard drives.
[Figure: Example of Matrix RAID]

2.7 RAID Migration
The RAID migration feature enables a properly configured PC, known as a RAID-Ready system, to be converted into a high-performance RAID 0, RAID 1, RAID 5, or RAID 10 configuration by adding one or more Serial ATA hard drives to the system and invoking the RAID migration process from within Windows. The following RAID migrations are supported:
• RAID-Ready to 2-, 3-, 4-, 5-, or 6-drive RAID 0
• RAID-Ready to 2-drive RAID 1
• RAID-Ready to 3-, 4-, 5-, or 6-drive RAID 5
• RAID-Ready to 4-drive RAID 10
Note: Not all migrations may be available, because each migration is supported only on specific platform configurations.
The migrations do not require re-installation of the operating system. All applications and data remain intact. Refer to Supported RAID Migrations for more information on migrations and the platforms on which each migration is supported.

2.8 RAID Level Migration
The RAID level migration feature enables a user to migrate data from a RAID 0, RAID 1, or RAID 10 volume to RAID 5 by adding any additional Serial ATA hard drives necessary and invoking the modify volume process from within Windows. The following RAID level migrations are supported:
• 2-drive RAID 0 to 3-, 4-, 5-, or 6-drive RAID 5
• 3-drive RAID 0 to 4-, 5-, or 6-drive RAID 5
• 4-drive RAID 0 to 5- or 6-drive RAID 5
• 2-drive RAID 1 to 3-, 4-, 5-, or 6-drive RAID 5
• 4-drive RAID 10 to 4-, 5-, or 6-drive RAID 5
Note: Not all migrations may be available, because each migration is supported only on specific platform configurations.
RAID level migrations do not require re-installation of the operating system. All applications and data remain intact.
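The migration lists in sections 2.7 and 2.8 can be read as a lookup table. The sketch below is only an illustration of that reading: the names SUPPORTED_MIGRATIONS and is_listed_migration are hypothetical and are not part of any Intel tool, and a RAID-Ready source is modeled as a single drive, as required in the RAID-Ready system requirements. A listed migration may still be unavailable on a given platform.

```python
# Migrations listed in sections 2.7 and 2.8 of this guide, keyed by
# (source configuration, source drive count). Each value maps a target
# RAID level to the set of allowed target drive counts.
# All names here are illustrative assumptions, not an Intel API.
SUPPORTED_MIGRATIONS = {
    ("RAID-Ready", 1): {
        "RAID 0": {2, 3, 4, 5, 6},
        "RAID 1": {2},
        "RAID 5": {3, 4, 5, 6},
        "RAID 10": {4},
    },
    ("RAID 0", 2): {"RAID 5": {3, 4, 5, 6}},
    ("RAID 0", 3): {"RAID 5": {4, 5, 6}},
    ("RAID 0", 4): {"RAID 5": {5, 6}},
    ("RAID 1", 2): {"RAID 5": {3, 4, 5, 6}},
    ("RAID 10", 4): {"RAID 5": {4, 5, 6}},
}


def is_listed_migration(source_level, source_drives, target_level, target_drives):
    """Return True if the migration appears in the lists of sections 2.7/2.8.

    This says nothing about platform support, which the guide notes can
    further restrict the available migrations.
    """
    targets = SUPPORTED_MIGRATIONS.get((source_level, source_drives), {})
    return target_drives in targets.get(target_level, set())


if __name__ == "__main__":
    print(is_listed_migration("RAID 0", 2, "RAID 5", 3))
    print(is_listed_migration("RAID 0", 4, "RAID 5", 4))
```

Keeping the lists as data rather than as nested conditionals makes it easy to compare the table against the bullets above line by line.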
Refer to Supported RAID Migrations for more information on migrations and the platforms on which each migration is supported.

2.9 Intel® Rapid Recover Technology
Intel® Rapid Recover Technology utilizes RAID 1 (mirroring) functionality to copy data from a designated master drive to a designated recovery drive. The master drive data can be copied to the recovery drive either continuously or on request. When using the continuous update policy, changes made to the data on the master drive while the system is not docked are automatically copied to the recovery drive when the system is re-docked. When using the on request update policy, the master drive data can be restored to a previous state by copying the data on the recovery drive back to the master drive. The following table provides an overview of the advantages, the disadvantages and the typical usage of Intel® Rapid Recover Technology.
Recovery Volume Overview
Hard Drives Required: 2
Advantage: More control over how data is copied between master and recovery drives; fast volume updates (only changes to the master drive since the last update are copied to the recovery drive); member hard drive data can be viewed in Microsoft Windows Explorer*.
Disadvantage: No increase in volume capacity.
Application: Critical data protection for mobile systems; fast restoration of the master drive to a previous or default state.

2.10 Advanced Host Controller Interface
Advanced Host Controller Interface (AHCI) is an interface specification that allows the storage driver to enable advanced SATA features such as Native Command Queuing and Native Hot Plug. Refer to Supported Chipsets for AHCI for more information.

2.10.1 Native Command Queuing
Native Command Queuing (NCQ) is a feature supported by AHCI that allows SATA hard drives to accept more than one command at a time.
NCQ, when used in conjunction with one or more hard drives that support NCQ, increases storage performance on random workloads by allowing the drive to internally optimize the order of commands.
Note: To take advantage of NCQ, you need the following:
• Chipset that supports AHCI
• Intel® Rapid Storage Technology
• One or more SATA hard drives that support NCQ

2.10.2 Hot Plug
Hot plug, also referred to as hot swap, is a feature supported by AHCI that allows SATA hard drives to be removed or inserted while the system is powered on and running. As an example, hot plug may be used to replace a failed hard drive that is in an externally-accessible drive enclosure.
Note: To take advantage of hot plug, you need the following:
• Chipset that supports AHCI
• Intel® Rapid Storage Technology
• Hot plug capability correctly enabled in the system BIOS by the motherboard manufacturer

3 RAID BIOS Configuration

3.1 Overview
To install the Intel® Rapid Storage Technology, the system BIOS must include the SATA RAID option ROM and you must enable RAID in the BIOS.

3.2 Enabling RAID in BIOS
Note: The instructions to enable RAID in the BIOS are specific to motherboards manufactured by Intel with a supported Intel chipset. The specific BIOS settings on non-Intel motherboards may differ. Refer to the motherboard documentation or contact the motherboard manufacturer or your place of purchase for specific instructions. Always follow the instructions that are provided with your motherboard.
Depending on your Intel motherboard model, enable RAID by following one of the two procedures below.
1. Press the F2 key after the Power-On Self Test (POST) memory test begins.
2. Select the Configuration menu, then the SATA Drives menu.
3. Set the Chipset SATA Mode to RAID.
4. Press the F10 key to save the BIOS settings and exit the BIOS Setup program.
OR
1. Press the F2 key after the Power-On Self Test (POST) memory test begins.
2. Select the Advanced menu, then the Drive Configuration menu.
3.
Set the Drive Mode option to Enhanced.
4. Enable Intel® RAID Technology.
5. Press the F10 key to save the BIOS settings and exit the BIOS Setup program.

4 Intel® Rapid Storage Technology Option ROM

4.1 Overview
The Intel® Rapid Storage Technology option ROM provides the following:
• Pre-operating system user interface for RAID volume management
• Ability to create, delete and reset RAID volumes
• RAID recovery

4.2 User Interface
To enter the Intel® Rapid Storage Technology option ROM user interface, press Ctrl-I when prompted during the Power-On Self Test (POST).
[Figure: Option ROM prompt]
In the user interface, the hard drive(s) and hard drive information listed for your system will differ from the example in Figure 3.
[Figure: Option ROM user interface]

4.3 Version Identification
To identify the version of the Intel® Rapid Storage Technology option ROM in the system BIOS, enter the option ROM user interface. The version number is located in the upper right corner.

4.4 RAID Volume Creation
Use the following steps to create a RAID volume using the Intel® Rapid Storage Technology user interface:
Note: The following procedure should only be used with a newly-built system or if you are reinstalling your operating system. The following procedure should not be used to migrate an existing system to RAID 0. If you wish to create matrix RAID volumes after the operating system software is loaded, they should be created using the Intel® Rapid Storage Technology software in Windows.
1. Press Ctrl-I when the following window appears during POST:
2. Select the Create RAID Volume option and press Enter.
3. Type in a volume name and press Enter, or press Enter to accept the default volume name.
4. Select the RAID level by using the up and down arrow keys to scroll through the available values, then press Enter.
5. Press Enter to select the physical disks. A dialog similar to the following will appear:
6.
Select the appropriate number of hard drives by using the up and down arrow keys to scroll through the list of available hard drives. Press the Space bar to select a drive. When you have finished selecting hard drives, press Enter.
7. Unless you have selected RAID 1, select the strip size by using the up and down arrow keys to scroll through the available values and then press Enter.
8. Select the volume capacity and press Enter.
Note: The default value indicates the maximum volume capacity using the selected disks. If less than the maximum volume capacity is chosen, creation of a second volume is needed to utilize the remaining space (i.e. a matrix RAID configuration).
9. At the Create Volume prompt, press Enter to create the volume. The following prompt will appear:
10. Press the key to confirm volume creation.
11. Exit the option ROM user interface by selecting the Exit option.
12. Press the key again to confirm exit.
Note: To change any of the information before the volume creation has been confirmed, you must exit the Create Volume process and restart it. Press the key to exit the Create Volume process.

5 Loading Driver during Operating System Installation

5.1 Overview
The chart below shows the circumstances in which the F6 installation method must be used during an operating system installation.
Operating system    Total drive volume           F6 installation method
Windows 7*          Less than 2 Terabytes        Recommended but not required (1)
                    More than 2 Terabytes (2)    Required
Windows Vista*      Less than 2 Terabytes        Recommended but not required (1)
                    More than 2 Terabytes (2)    Required
Windows XP*         Less than 2 Terabytes        Required
                    More than 2 Terabytes (2)    Required
(1) Windows 7 and Windows Vista both include drivers for RAID/AHCI during installation.
(2) For Intel® Desktop Boards, you must first enable UEFI in the BIOS when using total drive volume greater than two Terabytes. For non-Intel motherboards, refer to the motherboard documentation to see if this is a requirement.
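The chart in section 5.1 can be expressed as a small lookup, which may help when scripting a deployment checklist. This is a minimal sketch under stated assumptions: the function name f6_requirement is hypothetical, "2 Terabytes" is taken as 2 x 10^12 bytes, and a volume of exactly 2 Terabytes (which the chart does not cover) is treated like the "less than" row.

```python
# Assumption: the guide does not say whether "Terabyte" is decimal or binary;
# decimal (10**12 bytes) is used here purely for illustration.
TWO_TERABYTES = 2 * 10**12

# (operating system, total drive volume over 2 TB?) -> F6 installation method,
# transcribed from the chart in section 5.1.
F6_REQUIREMENT = {
    ("Windows 7", False): "Recommended but not required",
    ("Windows 7", True): "Required",
    ("Windows Vista", False): "Recommended but not required",
    ("Windows Vista", True): "Required",
    ("Windows XP", False): "Required",
    ("Windows XP", True): "Required",
}


def f6_requirement(os_name, total_drive_bytes):
    """Look up whether the F6 installation method is needed.

    Raises KeyError for operating systems that are not in the chart.
    """
    over_two_tb = total_drive_bytes > TWO_TERABYTES
    return F6_REQUIREMENT[(os_name, over_two_tb)]


if __name__ == "__main__":
    print(f6_requirement("Windows XP", 500 * 10**9))
    print(f6_requirement("Windows 7", 3 * 10**12))
```

The UEFI requirement from footnote 2 is a BIOS setting rather than an installation method, so it is deliberately left out of the lookup.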
5.2 F6 Installation Method
The F6 installation method requires a 3.5” diskette with the driver files.

5.2.1 Automatic F6 Diskette Creation
To automatically create a diskette that contains the files needed during the F6 installation process, follow these steps:
1. Download the latest F6 Driver Diskette utility from Download Center: http://downloadcenter.intel.com/Product_Filter.aspx?ProductID=2101
2. Run the .EXE file.
3. Follow all on-screen prompts.
Note: Choose either the 32-bit or the 64-bit version, depending on your operating system.

5.2.2 Manual F6 Diskette Creation
To manually create a diskette that contains the files needed during the F6 installation process, follow these steps:
1. Download the Intel® Rapid Storage Technology software and save it to your local drive (or use the CD shipped with your motherboard which contains the Intel® Rapid Storage Technology).
Note: The Intel® Rapid Storage Technology can be downloaded from Download Center at http://downloadcenter.intel.com/Product_Filter.aspx?ProductID=2101
2. Extract the driver files at the command prompt by running the following command:
    {filename} -A -P {path}
Example:
    IATA_CD_10.6.0.1022.EXE -A -P C:\TEMP
3. The following directory structure will be created:
    \Drivers
        \x32
        \x64
4. Copy the IAAHCI.CAT, IAAHCI.INF, IASTOR.CAT, IASTOR.INF, IASTOR.SYS, and TXTSETUP.OEM files to the root directory of a diskette.
Note: If the system has a 32-bit processor, copy the files found in the \x32 folder; if the system has a 64-bit processor, copy the files found in the \x64 folder.

5.2.3 F6 Installation Steps
To install the Intel® Rapid Storage Technology driver using the F6 installation method, complete the following steps:
1. Press F6 at the beginning of Windows setup when prompted in the status line with the "Press F6 if you need to install a third party SCSI or RAID driver" message.
2.
After pressing F6, nothing will happen immediately; setup will temporarily continue loading drivers and then you will be prompted with a screen to load support for mass storage device(s). Press S to "Specify Additional Device".
3. Insert the diskette containing the driver files and press the Enter key. Refer to the Automatic F6 Diskette Creation section above for instructions.
4. Select the RAID or AHCI controller entry that corresponds to your BIOS setup and press Enter.
Note: Not all available selections may appear in the list; use the up and down arrow keys to see additional options.
5. Press Enter to confirm.
Windows setup will now continue. Leave the diskette in the diskette drive until the system reboots itself because Windows setup will need to copy the files again from the diskette. After Windows setup has copied these files again, remove the diskette so that Windows setup can reboot as needed.

6 Intel® Rapid Storage Technology Installation

6.1 Overview
After installing an operating system onto a RAID volume or on a SATA hard drive when in RAID or AHCI mode, the Intel® Rapid Storage Technology can be loaded from within Windows. This installs the following components:
• User interface (i.e. Intel® Rapid Storage Technology software)
• Tray icon service
• Monitor service, allowing you to monitor the health of your RAID volume and/or hard drives.
Warning: The Intel® Rapid Storage Technology driver may be used to operate the hard drive from which the system is booting or a hard drive that contains important data. For this reason, you cannot remove or uninstall this driver from the system; however, you will have the ability to uninstall all other non-driver components.
The following non-driver components can be uninstalled:
• Intel® Rapid Storage Technology software
• Help documentation
• Start Menu shortcuts
• System tray icon service
• RAID monitor service

6.2 Where to Obtain the Software
If a CD or DVD was included with your motherboard or system, it should include the Intel® Rapid Storage Technology software.
The latest version of Intel® Rapid Storage Technology can also be downloaded from Download Center at: http://downloadcenter.intel.com/Product_Filter.aspx?ProductID=2101

6.3 Installation Steps
Note: The instructions below assume that the BIOS has been configured correctly and the RAID driver has been installed using the F6 installation method (if applicable).
1. Run the Intel® Rapid Storage Technology installation file.
2. On the welcome screen, click Next to continue.
3. Review the Warning screen and click Next to continue.
4. Review the License Agreement and click Yes to accept the license agreement terms.
5. Review the Readme File Information and click Next to continue.
6. Click Finish to complete the installation and restart the system.

6.4 Confirming Software Installation
Refer to the image below to confirm that Intel® Rapid Storage Technology has been installed.
If installation was done using F6 or an unattended installation method, you can confirm that the Intel® Rapid Storage Technology was loaded by following these steps:
Note: The following instructions assume Classic mode in Windows* XP.
1. Click the Start button and then Control Panel.
2. Double-click the System icon.
3. Select the Hardware tab.
4. Click the Device Manager button.
5. Expand the SCSI and RAID Controllers entry.
6. Right-click the SATA RAID Controller entry.
7. Select the Driver tab.
8. Click the Driver Details button. The iastor.sys file should be listed.
Example: Refer to Figure 5.
[Figure: Driver details example]
NOTE: The controller shown here may differ from the controller displayed for your system.

6.5 Version Identification
1. Open the Intel® Rapid Storage Technology software.
2. Click the Help button and then the About button.
NOTE: The version information shown here may differ from the information displayed for your system.

7 RAID-Ready Setup

7.1 Overview
A RAID-Ready system is a system configuration that allows a user to perform a RAID migration at a later date.
For more information on RAID migrations, see the RAID Migration section of this User Guide (Section 8).

7.2 System Requirements
In order for a system to be considered RAID-Ready, it must meet all of the following requirements:
• Contains a supported Intel chipset
• Includes a single SATA hard drive
• RAID must be enabled in the BIOS
• Motherboard BIOS must include the Intel® Rapid Storage Technology option ROM
• Intel® Rapid Storage Technology must be loaded
• A partition that does not take up the entire capacity of the hard drive (4-5 MB of free space is sufficient)

7.3 RAID-Ready System Setup Steps
To set up a RAID-Ready system, follow these steps:
1. Enable RAID in system BIOS using the steps listed in Enabling RAID in BIOS (Section 3.2).
2. Install the Intel® Rapid Storage Technology driver using the steps listed in F6 Installation Steps (Section 5.2.3).
3. Install Intel® Rapid Storage Technology using the steps listed in Installation Steps (Section 6.3).

8 Converting RAID-Ready to Full RAID

8.1 Overview
This section explains how to convert (or migrate) from a RAID-Ready system to a fully-functional RAID system. The example in this section describes the migration steps for RAID 1.

8.2 RAID-Ready to 2-drive RAID 1
To convert a RAID-Ready system into a system with a 2-drive RAID 1 volume, follow these steps:
Warning: This operation will delete all existing data from the additional hard drive or drives and the data cannot be recovered. It is critical to back up all important data on the additional drives before proceeding. The data on the source hard drive, however, will be preserved.
1. Install an additional SATA hard drive in the system.
2. Start Windows and open the Intel® Rapid Storage Technology software.
3. Select Create a custom volume.
4. On the Select Volume Type screen, select Real-time data protection (RAID 1) and then click Next.
5. On the Configure Volume screen:
a. Select the two installed disks
b. Choose to keep data on the “System” disk
c. Click Next
6.
Review the warning screen and then click Create Volume.
7. Review the confirmation screen and then click OK.
8. After the volume has been created, click OK on the completion screen.
9. Review the Status screen, now showing the RAID array just created.
10. The data migration will begin and may take some time. During the migration, you can see the current status by holding the mouse pointer over the Intel® Rapid Storage Technology status bar icon.

9 Verify and Repair

9.1 Overview
Verify and Repair checks a volume for inconsistent or bad data. It may also fix any data problems or parity errors. The Verify process happens:
• Automatically after a hard system shutdown or system crash (except when configured for RAID 0)
• Manually when started from within the Intel® Rapid Storage Technology software
The UI displays two functions:
• Verify Only
• Verify and Repair

9.2 Actions during Verify and Repair
The Verify process checks each stripe rather than copying data. The driver walks through every stripe in the volume, starting at the lowest logical block address (LBA).
Array Type    Actions
RAID 0        Verify: checks for any read failures. Repair: cannot repair, because there is no copy of good data.
RAID 1        Verify: checks for data mismatches and read failures. Repair: copies to mirror.
RAID 5        Verify: checks for parity issues and read failures. Repair: updates parity; assumes the data is correct and regenerates and rewrites parity.
RAID 10       Verify: checks for data mismatches and read failures. Repair: copies to mirror.

Appendix A: Error Messages

A.1 Incompatible Hardware
Issue: The following error message appears during installation:
Incompatible hardware. This software is not supported on this chipset. Please select "Yes" to view the Readme file for a list of supported products. Refer to section 2 titled "System Requirements".
To resolve this issue, install the Intel® Rapid Storage Technology software on a system with a supported Intel chipset, or ensure that AHCI or RAID is enabled in the system BIOS.

A.2 Operating System Not Supported
Issue: The following error message appears during installation:
This operating system is not currently supported by this install package. Installer will now exit.
To resolve this issue, install the Intel® Rapid Storage Technology software on a supported operating system.

A.3 Source Hard Drive Cannot Be Larger
Issue: When attempting to migrate from a single hard drive (or a RAID-Ready configuration) to a RAID configuration, the following error message appears and the migration process will not begin:
The source hard drive cannot be larger than the selected hard drive member(s). Do one of the following to correct the problem:
- If already inserted, select larger hard drive member(s).
- Insert larger hard drive(s) into the system, and re-launch the Create RAID Volume from Existing Hard Drive Wizard.
Follow the steps listed in the error message to resolve the problem.

A.4 Hard Drive Has System Files
Issue: The following error message appears after selecting a hard drive as a member hard drive during the Create RAID Volume process:
This hard drive has system files and cannot be used to create a RAID volume. Please select another hard drive.
Solution: Select a new hard drive.

A.5 Source Hard Drive is Dynamic Disk
Issue: When attempting to migrate from a RAID-Ready system to a full-RAID system, an error message is received that says the migration cannot continue because the source drive is a dynamic disk. However, Microsoft* Windows* Disk Management shows the disk as basic, not dynamic.
This issue may occur if there is not enough space for the migration to successfully complete. Instead of reporting that there is not enough space, the Intel® Rapid Storage Technology software reports that the migration cannot continue because the source drive is a dynamic disk.
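The two size-related conditions behind A.3 and A.5 could be checked before a migration is started. The sketch below is a hedged illustration only: the function premigration_problems and its parameters are hypothetical, the sizes would have to come from whatever disk inventory you already have, and the 5 MB threshold is an assumption taken from the "4-5MB of free space is sufficient" remark in the RAID-Ready system requirements, not a documented limit.

```python
MB = 1024 * 1024

# Assumption: upper end of the "4-5MB of free space" figure in section 7.2.
MIN_UNPARTITIONED_BYTES = 5 * MB


def premigration_problems(source_bytes, partitioned_bytes, member_sizes):
    """Return a list of likely problems for a RAID-Ready to RAID migration.

    source_bytes      -- capacity of the source (RAID-Ready) drive
    partitioned_bytes -- total size of all partitions on the source drive
    member_sizes      -- capacities of the additional member drives
    """
    problems = []
    # A.3: the source drive must not be larger than any selected member.
    if any(source_bytes > size for size in member_sizes):
        problems.append("source drive is larger than a selected member drive (see A.3)")
    # A.5: too little unpartitioned space can surface as a "dynamic disk" error.
    if source_bytes - partitioned_bytes < MIN_UNPARTITIONED_BYTES:
        problems.append("too little unpartitioned space on the source drive (see A.5)")
    return problems


if __name__ == "__main__":
    GB = 10**9
    for problem in premigration_problems(500 * GB, 500 * GB, [320 * GB]):
        print(problem)
```

An empty result does not guarantee that the migration will succeed; it only rules out the two conditions described in this appendix.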
Note: This error is not related to the size of the destination hard drive(s). It may be received even if the destination hard drive(s) are equal to or greater in size than the source hard drive.
To resolve this issue:
• If there is a single partition on the source hard drive, reducing the size of the partition by a few MBs may resolve the issue and allow the migration to occur.
• If there are multiple partitions on the source hard drive, reducing the size of the second partition by a few MBs may resolve the issue and allow the migration to occur.

Document Number: XXXXXX
Intel® Matrix Storage Manager 8.x User's Manual
January 2009
Revision 1.0
ver7.0 / User's Manual

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. Intel products are not intended for use in medical, life saving, or life sustaining applications. Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The Intel® Matrix Storage Manager may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.
Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order. Intel, Intel® Matrix Storage Manager, Intel® Matrix Storage Technology, Intel® Rapid Recover Technology, and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.
*Other names and brands may be claimed as the property of others.
Copyright © 2008, Intel Corporation. All rights reserved.

Contents
1 Introduction
1.1 Terminology
1.2 Reference Documents
2 Intel® Matrix Storage Manager Features
2.1 Feature Overview
2.2 RAID 0 (Striping)
2.3 RAID 1 (Mirroring)
2.4 RAID 5 (Striping with Parity)
2.5 RAID 10
2.6 Matrix RAID
2.7 RAID Migration
2.8 RAID Level Migration
2.9 Intel® Rapid Recover Technology
2.10 Advanced Host Controller Interface
2.10.1 Native Command Queuing
2.10.2 Hot Plug
3 RAID BIOS Configuration
3.1 Overview
3.2 Enabling RAID in BIOS
4 Intel® Matrix Storage Manager Option ROM
4.1 Overview
4.2 User Interface
4.3 Version Identification
4.4 RAID Volume Creation
5 Loading Driver During OS Installation
5.1 Overview
5.2 F6 Installation Method
5.2.1 Automatic F6 Floppy Creation
5.2.2 Manual F6 Floppy Creation
5.2.3 F6 Installation Steps
6 Intel® Matrix Storage Manager Installation
6.1 Overview
6.2 Where to Obtain Software
6.3 Installation Steps
6.4 How to Confirm Software Installation
6.5 Version Identification
6.5.1 Version Identification Using Intel® Matrix Storage Console
6.5.2 Version Identification Using Driver File
7 RAID-Ready Setup
7.1 Overview
7.2 System Requirements
7.3 RAID-Ready System Setup Steps
8 RAID Migration
8.1 Overview
8.2 RAID Migration Steps: RAID-Ready to 2-drive RAID 0/1
8.3 RAID Migration Steps: RAID-Ready to 3 or 4-drive RAID 0/5
9 Volume Creation
9.1 RAID Volume Creation
9.2 Recovery Volume Creation
9.2.1 Recovery Volume Creation in Basic Mode
9.2.2 Recovery Volume Creation in Advanced Mode
Appendix A Error Messages
A.1 Incompatible Hardware
A.2 Operating System Not Supported
A.3 Source Hard Drive Cannot Be Larger
A.4 Hard Drive Has System Files
A.5 Source Hard Drive is Dynamic Disk

Figures
Figure 1. Matrix RAID
Figure 2. User Prompt
Figure 3. Start Menu Item
Figure 4. Driver Details Example
Figure 5. Driver Version Information
Figure 6. Tray Icon Status
Figure 7. User Interface Status
Figure 8. Progress Dialog

Tables
Table 1. RAID 0 Overview
Table 2. RAID 1 Overview
Table 3. RAID 5 Overview
Table 4. RAID 10 Overview
Table 5. Recovery Volume Overview

Revision History
Document Number: N/A
Revision Number: 1.0
Description: Aligns with 8.x release
• Clarified RAID-Ready requirements
Revision Date: January 2009

1 Introduction
The purpose of this document is to enable a user to properly set up and configure a system using Intel® Matrix Storage Manager. It provides steps for setup and configuration, as well as a brief overview of Intel® Matrix Storage Manager features.
Note: The information in this document is only relevant on systems with a supported Intel chipset and a supported operating system. Supported Intel chipset and operating system information is available at the Intel® Rapid Storage Technology support web page.
Note: The majority of the information in this document is related to either software configuration or hardware integration.
Intel is not responsible for the software written by third party vendors or the implementation of Intel components in the products of third party manufacturers. Customers should always contact the place of purchase or system/software manufacturer with support questions about their specific hardware or software configuration.

1.1 Terminology
Term: Description
AHCI: Advanced Host Controller Interface: an interface specification that allows the storage driver to enable advanced Serial ATA features such as Native Command Queuing, native hot plug, and power management.
Continuous Update Policy: When a recovery volume is using this policy, data on the master drive is copied to the recovery drive automatically as long as both drives are connected to the system.
Intel® Matrix Storage Manager Option ROM: A code module built into the system BIOS that provides boot support for RAID volumes as well as a user interface for configuring and managing RAID volumes.
Master Drive: The hard drive that is the designated source drive in a recovery volume.
Matrix RAID: Two independent RAID volumes within a single RAID array.
Member: A hard drive used within a RAID array.
Migration: The process of converting a system's data storage configuration from a non-RAID configuration (pass-thru) to a RAID configuration.
Hot Plug: The unannounced removal and insertion of a Serial ATA hard drive while the system is powered on.
NCQ: Native Command Queuing: a command protocol in Serial ATA that allows multiple commands to be outstanding within a hard drive at the same time. The commands are dynamically reordered to increase hard drive performance.
On Request Update Policy: When a recovery volume is using this policy, data on the master drive is copied to the recovery drive when you request it. Only changes since the last update process are copied.
OS: Operating System
Port0: A serial ATA port (connector) on a motherboard identified as Port0.
Port1: A serial ATA port (connector) on a motherboard identified as Port1.
Port2: A serial ATA port (connector) on a motherboard identified as Port2.
Port3: A serial ATA port (connector) on a motherboard identified as Port3.
POST: Power-On Self Test
RAID: Redundant Array of Independent Drives, which allows data to be distributed across multiple hard drives to provide data redundancy or to enhance data storage performance.
RAID 0 (striping): The data in the RAID volume is striped across the array's members. Striping divides data into units and distributes those units across the members, which improves read/write performance but does not create data redundancy.
RAID 1 (mirroring): The data in the RAID volume is mirrored across the RAID array's members. Mirroring is the term used to describe the key feature of RAID 1, which writes duplicate data to each member, thereby creating data redundancy and increasing fault tolerance.
RAID 5 (striping with parity): The data in the RAID volume and parity are striped across the array's members. Parity information is written with the data in a rotating sequence across the members of the array. This RAID level is a preferred configuration for efficiency, fault-tolerance, and performance.
RAID 10 (striping and mirroring): The RAID level where information is striped across a two-disk array for system performance. Each of the drives in the array has a mirror for fault tolerance. RAID 10 provides the performance benefits of RAID 0 and the redundancy of RAID 1. However, it requires four hard drives.
RAID Array: A logical grouping of physical hard drives.
RAID Level Migration: The process of converting a system's data storage configuration from one RAID level to another.
RAID Volume: A fixed amount of space across a RAID array that appears as a single physical hard drive to the operating system.
Each RAID volume is created with a specific RAID level to provide data redundancy or to enhance data storage performance.
Recovery Drive: The hard drive that is the designated target drive in a recovery volume.
Recovery Volume: A volume utilizing Intel® Rapid Recover Technology.

1.2 Reference Documents
Document, Document No./Location: Not Applicable

2 Intel® Matrix Storage Manager Features
2.1 Feature Overview
The Intel® Matrix Storage Manager software package provides high-performance Serial ATA and Serial ATA RAID capabilities for supported operating systems. The key features of the Intel® Matrix Storage Manager are as follows:
• RAID 0
• RAID 1
• RAID 5
• RAID 10
• Matrix RAID
• RAID migration and RAID level migration
• Intel® Rapid Recover Technology
• Advanced Host Controller Interface (AHCI) support

2.2 RAID 0 (Striping)
RAID 0 uses the read/write capabilities of two or more hard drives working in unison to maximize the storage performance of a computer system. Table 1 provides an overview of the advantages, the level of fault-tolerance provided, and the typical usage of RAID 0.

Table 1. RAID 0 Overview
Hard Drives Required: 2-6
Advantage: Highest transfer rates
Fault-tolerance: None; if one disk fails, all data will be lost
Application: Typically used in desktops and workstations for maximum performance for temporary data and high I/O rate. 2-drive RAID 0 available in specific mobile configurations.

2.3 RAID 1 (Mirroring)
A RAID 1 array contains two hard drives where the data between the two is mirrored in real time to provide good data reliability in the case of a single disk failure; when one disk drive fails, all data is immediately available on the other without any impact to the integrity of the data. Table 2 provides an overview of the advantages, the level of fault-tolerance provided, and the typical usage of RAID 1.

Table 2.
RAID 1 Overview
Hard Drives Required: 2
Advantage: 100% redundancy of data. One disk may fail, but data will continue to be accessible. A rebuild to a new disk is recommended to maintain data redundancy.
Fault-tolerance: Excellent; disk mirroring means that all data on one disk is duplicated on another disk.
Application: Typically used for smaller systems where capacity of one disk is sufficient and for any application(s) requiring very high availability. Available in specific mobile configurations.

2.4 RAID 5 (Striping with Parity)
A RAID 5 array contains three or more hard drives where the data and parity are striped across all the hard drives in the array. Parity is a mathematical method for recreating data that was lost from a single drive, which increases fault-tolerance. Table 3 provides an overview of the advantages, the level of fault-tolerance provided, and the typical usage of RAID 5.

Table 3. RAID 5 Overview
Hard Drives Required: 3-6
Advantage: Higher percentage of usable capacity and high read performance as well as fault-tolerance.
Fault-tolerance: Excellent; parity information allows data to be rebuilt after replacing a failed hard drive with a new drive.
Application: Storage of large amounts of critical data. Not available in mobile configurations.

2.5 RAID 10
A RAID 10 array uses four hard drives to create a combination of RAID levels 0 and 1. It is a striped set whose members are each a mirrored set. Table 4 provides an overview of the advantages, the level of fault-tolerance provided, and the typical usage of RAID 10.

Table 4. RAID 10 Overview
Hard Drives Required: 4
Advantage: Combines the read performance of RAID 0 with the fault-tolerance of RAID 1.
Fault-tolerance: Excellent; disk mirroring means that all data on one disk is duplicated on another disk.
Application: High-performance applications requiring data protection, such as video editing.
Not available in mobile configurations.

2.6 Matrix RAID
Matrix RAID allows you to create two RAID volumes on a single RAID array. As an example, on a system with an Intel® 82801GR I/O controller hub (ICH7R), Intel® Matrix Storage Manager allows you to create both a RAID 0 volume and a RAID 5 volume across four Serial ATA hard drives.
Example: Refer to Figure 1.
Figure 1. Matrix RAID

2.7 RAID Migration
The RAID migration feature enables a properly configured PC, known as a RAID-Ready system, to be converted into a high-performance RAID 0, RAID 1, RAID 5, or RAID 10 configuration by adding one or more Serial ATA hard drives to the system and invoking the RAID migration process from within Windows.
Note: Not all migrations may be available, because each migration is supported only on specific platform configurations.
The following RAID migrations are supported:
• RAID-Ready to 2, 3, 4, 5 or 6-drive RAID 0
• RAID-Ready to 2-drive RAID 1
• RAID-Ready to 3, 4, 5 or 6-drive RAID 5
• RAID-Ready to 4-drive RAID 10
The migrations do not require re-installation of the operating system. All applications and data remain intact.

2.8 RAID Level Migration
The RAID level migration feature enables a user to migrate data from a RAID 0, RAID 1, or RAID 10 volume to RAID 5 by adding any additional Serial ATA hard drives necessary and invoking the modify volume process from within Windows.
Note: Not all migrations may be available, because each migration is supported only on specific platform configurations.
The following RAID level migrations are supported:
• 2-drive RAID 0 to 3, 4, 5 or 6-drive RAID 5
• 3-drive RAID 0 to 4, 5 or 6-drive RAID 5
• 4-drive RAID 0 to 5 or 6-drive RAID 5
• 2-drive RAID 1 to 3, 4, 5 or 6-drive RAID 5
• 4-drive RAID 10 to 4, 5 or 6-drive RAID 5
RAID level migrations do not require re-installation of the operating system.
All applications and data remain intact.

2.9 Intel® Rapid Recover Technology
Intel® Rapid Recover Technology utilizes RAID 1 (mirroring) functionality to copy data from a designated master drive to a designated recovery drive. The master drive data can be copied to the recovery drive either continuously or on request.
When using the continuous update policy, changes made to the data on the master drive while the system is not docked are automatically copied to the recovery drive when the system is re-docked. When using the on request update policy, the master drive data can be restored to a previous state by copying the data on the recovery drive back to the master drive.
Table 5 provides an overview of the advantages, the disadvantages, and the typical usage of Intel® Rapid Recover Technology.

Table 5. Recovery Volume Overview
Hard Drives Required: 2
Advantage: More control over how data is copied between master and recovery drives; fast volume updates (only changes to the master drive since the last update are copied to the recovery drive); member hard drive data can be viewed in Microsoft Windows Explorer*.
Disadvantage: No increase in volume capacity.
Application: Critical data protection for mobile systems; fast restoration of the master drive to a previous or default state.

2.10 Advanced Host Controller Interface
Advanced Host Controller Interface (AHCI) is an interface specification that allows the storage driver to enable advanced Serial ATA features such as Native Command Queuing and Native Hot Plug.

2.10.1 Native Command Queuing
Native Command Queuing (NCQ) is a feature supported by AHCI that allows Serial ATA hard drives to accept more than one command at a time. NCQ, when used in conjunction with one or more hard drives that support NCQ, increases storage performance on random workloads by allowing the drive to internally optimize the order of commands.
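The benefit of letting a drive reorder outstanding commands can be illustrated with a toy model. The sketch below is not the algorithm used by any drive or by Intel® Matrix Storage Manager; it assumes a hypothetical one-dimensional "position" for each request and a simple greedy nearest-first policy, and compares the total distance travelled against servicing the same requests in arrival order.

```python
def total_travel(start, positions):
    """Sum of distances travelled when servicing positions in the given order."""
    travel, current = 0, start
    for pos in positions:
        travel += abs(pos - current)
        current = pos
    return travel


def nearest_first(start, positions):
    """Greedy reordering (illustrative only): service the closest pending request next."""
    pending, order, current = list(positions), [], start
    while pending:
        nxt = min(pending, key=lambda p: abs(p - current))
        pending.remove(nxt)
        order.append(nxt)
        current = nxt
    return order


# Hypothetical logical positions of five queued requests, in arrival order.
queued = [90, 10, 80, 20, 70]

fifo_cost = total_travel(0, queued)                        # one command at a time
reordered_cost = total_travel(0, nearest_first(0, queued)) # queue reordered first
print(fifo_cost, reordered_cost)
```

With a random access pattern such as this one, the reordered queue covers far less distance than the arrival-order queue, which is the intuition behind the performance gain on random workloads.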
Note: To take advantage of NCQ, you need the following:
• Chipset that supports AHCI
• Intel® Matrix Storage Manager
• One or more Serial ATA (SATA) hard drives that support NCQ

2.10.2 Hot Plug
Hot plug, also referred to as hot swap, is a feature supported by AHCI that allows Serial ATA hard drives to be removed or inserted while the system is powered on and running. As an example, hot plug may be used to replace a failed hard drive that is in an externally-accessible drive enclosure.
Note: To take advantage of hot plug, you need the following:
• Chipset that supports AHCI
• Intel® Matrix Storage Manager
• Hot plug capability correctly enabled in the system BIOS by the OEM/motherboard manufacturer

3 RAID BIOS Configuration
3.1 Overview
To install the Intel® Matrix Storage Manager, the system BIOS must include the Intel® Matrix Storage Manager option ROM. The Intel® Matrix Storage Manager option ROM is tied to the controller hub. Version 7.0 of the option ROM supports platforms based on the Intel® 82801HEM I/O controller hub.

3.2 Enabling RAID in BIOS
Note: The instructions listed below are specific to motherboards manufactured by Intel with a supported Intel chipset. The specific BIOS settings on non-Intel manufactured motherboards may differ. Refer to the motherboard documentation or contact the motherboard manufacturer or your place of purchase for specific instructions. Always follow the instructions that are provided with your motherboard.
Use the following steps to enable RAID in the system BIOS:
1. Press the F2 key after the Power-On Self Test (POST) memory test begins.
2. Select the Advanced menu, then the Drive Configuration menu.
3. Switch the Drive Mode option from Legacy to Enhanced.
4. Enable Intel(R) RAID Technology.
5.
Press the F10 key to save the BIOS settings and exit the BIOS Setup program.

4 Intel® Matrix Storage Manager Option ROM
4.1 Overview
The Intel® Matrix Storage Manager option ROM is a PnP option ROM that provides a pre-operating system user interface for RAID configurations. It also provides BIOS and DOS disk services (Int13h).

4.2 User Interface
To enter the Intel® Matrix Storage Manager option ROM user interface, press the Ctrl and I keys simultaneously when prompted during the Power-On Self Test (POST).
Example: Refer to Figure 2.
Figure 2. User Prompt
NOTE: The hard drive(s) and hard drive information listed for your system can differ from the following example.

4.3 Version Identification
To identify the specific version of the Intel® Matrix Storage Manager option ROM integrated into the system BIOS, enter the option ROM user interface. The version number is located in the top right corner with the following format: vX.Y.W.XXXX, where X and Y are the major and minor version numbers.

4.4 RAID Volume Creation
Note: The following procedure should only be used with a newly-built system or if you are reinstalling your operating system. The following procedure should not be used to migrate an existing system to RAID 0. If you wish to create matrix RAID volumes after the operating system software is loaded, they should be created using the Intel® Matrix Storage Console in Windows.
Use the following steps to create a RAID volume using the Intel® Matrix Storage Manager user interface:
1. Press the Ctrl and I keys simultaneously when the following window appears during POST:
2. Select option 1. Create RAID Volume and press the Enter key.
3. Type in a volume name and press the Enter key, or press the Enter key to accept the default name.
4.
Select the RAID level by using the Up Arrow or Down Arrow keys to scroll through the available values, then press the Enter key.
5. Press the Enter key to select the physical disks. A dialog similar to the following will appear:
6. Select the appropriate number of hard drives by using the Up Arrow or Down Arrow keys to scroll through the list of available hard drives. Press the Space key to select a drive. When you have finished selecting hard drives, press the Enter key.
7. Unless you have selected RAID 1, select the strip size by using the Up Arrow or Down Arrow keys to scroll through the available values, then press the Enter key.
8. Select the volume capacity and press the Enter key.
Note: The default value indicates the maximum volume capacity using the selected disks. If less than the maximum volume capacity is chosen, creation of a second volume is needed to utilize the remaining space (i.e. a matrix RAID configuration).
9. At the Create Volume prompt, press the Enter key to create the volume. The following prompt will appear:
10. Press the Y key to confirm volume creation.
11. To exit the option ROM user interface, select option 5. Exit and press the Enter key.
12. Press the Y key again to confirm exit.
Note: To change any of the information before the volume creation has been confirmed, you must exit the Create Volume process and restart it. Press the Esc key to exit the Create Volume process.

5 Loading Driver During OS Installation
5.1 Overview
Unless using Microsoft Windows Vista*, the Intel® Matrix Storage Manager driver must be loaded during operating system installation using the F6 installation method. This is required in order to install an operating system onto a hard drive or RAID volume when in RAID mode or onto a hard drive when in AHCI mode.
If using Microsoft Windows Vista, this is not required, as the operating system includes a driver for the AHCI and RAID controllers. Refer to Intel® Matrix Storage Manager Installation for instructions on how to install an updated version of the software after the operating system is installed.

5.2 F6 Installation Method
The F6 installation method requires a floppy with the driver files.

5.2.1 Automatic F6 Floppy Creation
Use the following steps to automatically create a floppy that contains the files needed during the F6 installation process:
1. Download the latest Floppy Configuration Utility from the Intel download site: http://downloadcenter.intel.com/Product_Filter.aspx?ProductID=2101
2. Run the .EXE file.
Note: Use F6flpy32.exe on a 32-bit system. Use F6flpy64.exe on a 64-bit system.

5.2.2 Manual F6 Floppy Creation
Use the following steps to manually create a floppy that contains the files needed during the F6 installation process:
1. Download the Intel® Matrix Storage Manager and save it to your local drive (or use the CD shipped with your motherboard which contains the Intel® Matrix Storage Manager).
Note: The Intel® Matrix Storage Manager can be downloaded from the following website: http://downloadcenter.intel.com/Product_Filter.aspx?ProductID=2101
2. Extract the driver files by running 'C:\IATAXX_CD.EXE -A -A -P C:\'.
Note: This is described in the "Advanced Installation Instructions" section of the README.TXT.
3. Copy the IAAHCI.CAT, IAAHCI.INF, IASTOR.CAT, IASTOR.INF, IASTOR.SYS, and TXTSETUP.OEM files to the root directory of a floppy diskette.
Note: If the system has a 32-bit processor, copy the files found in the Drivers folder; if the system has a 64-bit processor, copy the files found in the Drivers64 folder.

5.2.3 F6 Installation Steps
To install the Intel® Matrix Storage Manager driver using the F6 installation method, complete the following steps:
1.
Press the F6 key at the beginning of Windows XP setup (during text-mode phase) when prompted in the status line with the "Press F6 if you need to install a third party SCSI or RAID driver" message.
Note: After pressing F6, nothing will happen immediately; setup will temporarily continue loading drivers and then you will be prompted with a screen to load support for mass storage device(s).
2. Press the S key to "Specify Additional Device".
3. Insert the floppy disk containing the driver files when you see the following prompt: "Please insert the disk labeled Manufacturer-supplied hardware support disk into Drive A:" and press the Enter key. Refer to Automatic F6 Floppy Creation for instructions.
4. Select the RAID or AHCI controller entry that corresponds to your BIOS setup and press the Enter key.
Note: Not all available selections may appear in the list; use the Up Arrow or Down Arrow keys to see additional options.
5. Press the Enter key to confirm.
At this point, the Intel® Matrix Storage Manager driver has been installed using the F6 method and Windows XP setup should continue. Leave the floppy disk in the floppy drive until the system reboots itself, because Windows setup will need to copy the files again from the floppy to the Windows installation folders. After Windows setup has copied these files again, remove the floppy diskette so that Windows setup can reboot as needed.

6 Intel® Matrix Storage Manager Installation
6.1 Overview
After installing an operating system onto a RAID volume or on a Serial ATA hard drive when in RAID or AHCI mode, the Intel® Matrix Storage Manager can be loaded from within Windows. This installs the user interface (i.e. Intel® Matrix Storage Console), the tray icon service, and the monitor service onto the system, allowing you to monitor the health of your RAID volume and/or hard drives. This method can also be used to upgrade to a newer version of the Intel® Matrix Storage Manager.
Warning: The Intel® Matrix Storage Manager driver may be used to operate the hard drive from which the system is booting or a hard drive that contains important data. For this reason, you cannot remove or un-install this driver from the system; however, you will have the ability to un-install all other non-driver components.
The following non-driver components can be un-installed:
• Intel® Matrix Storage Console
• Help documentation
• Start Menu shortcuts
• System tray icon service
• RAID monitor service

6.2 Where to Obtain Software
If a CD-ROM was included with your motherboard or system, it should include the Intel® Matrix Storage Manager. The Intel® Matrix Storage Manager can be downloaded from the following Intel website: http://downloadcenter.intel.com/Product_Filter.aspx?ProductID=2101

6.3 Installation Steps
Note: The instructions below assume that the BIOS has been configured correctly and the RAID driver has been installed using the F6 installation method (if applicable).
1. Run the Intel® Matrix Storage Manager executable.
2. Click Next to continue.
3. Carefully review the warning and click Next to continue.
4. Click Yes to accept the license agreement terms.
5. Review the readme if needed and click Next to continue.
6. Click Finish to complete installation and reboot the system.

6.4 How to Confirm Software Installation
Refer to Figure 3 to confirm that Intel® Matrix Storage Manager has been installed.
Figure 3.
Start Menu Item
If installation was done by have-disk, F6, or an unattended installation method, you can confirm that the Intel® Matrix Storage Manager has been loaded by completing the following steps:
Note: The following instructions assume Classic mode in Windows* XP Professional.
1. Click on the Start button and then the Control Panel entry.
2. Double-click the System icon.
Note: If using Microsoft Windows Vista, first select Classic View.
3. Select the Hardware tab.
4. Click on the Device Manager button.
5. Expand the SCSI and RAID Controllers entry.
6. Right-click on the Intel(R) 82801XX SATA Controller entry.
7. Select the Driver tab.
8. Click on the Driver Details button. The iastor.sys file should be listed.
Example: Refer to Figure 4.
Figure 4. Driver Details Example
NOTE: The controller shown here may differ from the controller displayed for your system.

6.5 Version Identification
There are two ways to determine which version of the Intel® Matrix Storage Manager is installed:
1. Use the Intel® Matrix Storage Console
2. Locate the RAID driver (iaStor.sys) file and view the file properties

6.5.1 Version Identification Using Intel® Matrix Storage Console
1. To access the Intel® Matrix Storage Console, refer to Figure 3.
2. Under the View menu, select System Report.
3. Select the Intel® RAID Technology tab for the driver version information.
Example: Refer to Figure 5.
Figure 5. Driver Version Information
NOTE: Driver version information shown here may differ from the information displayed for your system.

6.5.2 Version Identification Using Driver File
1. Locate the file iastor.sys in the following path: \Windows\System32\Drivers
2. Right-click on iastor.sys and select Properties.
3. Select the Version tab.
The version number should be listed after the File Version parameter in the following format: x.y.z.aaaa

7 RAID-Ready Setup
7.1 Overview
A "RAID-Ready" system is a specific system configuration that allows a user to perform a RAID migration at a later date. For more information on RAID migrations, see RAID Migration.

7.2 System Requirements
In order for a system to be considered "RAID-Ready", it must meet all of the following requirements:
• Contains a supported Intel chipset
• Includes a single Serial ATA (SATA) hard drive
• RAID controller must be enabled in the BIOS
• Motherboard BIOS must include the Intel® Matrix Storage Manager option ROM
• Intel® Matrix Storage Manager must be loaded
• A partition that does not take up the entire capacity of the hard drive (4-5 MB of free space is sufficient)

7.3 RAID-Ready System Setup Steps
Note: The system must meet all the requirements listed in System Requirements.
1. Enable RAID in system BIOS using the steps listed in Enabling RAID in BIOS.
2. Install Intel® Matrix Storage Manager driver using the steps listed in F6 Installation Steps.
3. Install Intel® Matrix Storage Manager using the steps listed in Installation Steps.

8 RAID Migration
8.1 Overview
The following sections explain how to migrate from a RAID-Ready system to a RAID system.

8.2 RAID Migration Steps: RAID-Ready to 2-drive RAID 0/1
Use the following steps to convert a RAID-Ready system into a system with a 2-drive RAID 0 or 1 volume:
Note: The steps listed in this section assume that the system is a properly configured RAID-Ready system. For more information on how to configure a RAID-Ready system, see RAID-Ready System Setup Steps.
Warning: This operation will delete all existing data from the additional hard drive or drives and the data cannot be recovered. It is critical to back up all important data on the additional drives before proceeding.
The data on the source hard drive, however, will be preserved.
1. Physically add an additional SATA hard drive to the system.
2. Boot into Windows* and open the Intel® Matrix Storage Console. Example: Refer to Figure 3.
3. Select Protect data from a hard drive failure with RAID 1 or Improve storage performance with RAID 0.
4. Select Yes to confirm volume creation. In the following example, RAID 1 was selected. Refer to Figure 6, Figure 7, and Figure 8 for examples of volume creation progress indicators.
5. When the migration is complete, reboot the system if needed.
6. If applicable, use a third party application or the Microsoft* Windows* operating system tools to create and format a new data partition in any unused space or use a third party application to extend the partition to utilize any remaining space.
Figure 6. Tray Icon Status
Figure 7. User Interface Status
Figure 8. Progress Dialog

8.3 RAID Migration Steps: RAID-Ready to 3 or 4-drive RAID 0/5
Use the following steps to convert a RAID-Ready system into a system with a 3 or 4-drive RAID 0/5 volume:
Note: The steps listed in this section assume that the system is a properly configured RAID-Ready system. For more information on how to configure a RAID-Ready system, see RAID-Ready System Setup Steps.
Warning: This operation will delete all existing data from the additional hard drive or drives and the data cannot be recovered. It is critical to back up all important data on the additional drives before proceeding. The data on the source hard drive, however, will be preserved.
Warning: It is very important to note which disk is the source drive (the one containing all of the information to be migrated). On a RAID-Ready system, this can be determined by noting, during POST, the port to which the single hard drive is attached.
You can also use the Intel® Matrix Storage Manager before the additional disks are installed to verify the port and serial number of the drive that contains the data.
1. Physically add two or three additional SATA hard drives to the system.
2. Boot into Windows* and open the Intel® Matrix Storage Console. Example: Refer to Figure 3.
3. Select Advanced Mode from the View menu.
4. Select Create RAID Volume from Existing Hard Drive from the Actions menu.
5. Click Next to continue.
6. Type in a volume name and press the key, or press the key to accept the default name.
7. Select a RAID level.
8. Select a strip size.
9. Click Next to continue.
10. Select a source hard drive.
Note: The source hard drive can be selected by double-clicking on the hard drive, or by single-clicking on the hard drive and then selecting the right arrow key. The data on this hard drive will be preserved and migrated to the new RAID volume.
11. Click Next to continue.
12. Select the member hard drives.
Note: The member hard drives can be selected by double-clicking on the hard drive, or by single-clicking on the hard drive and then selecting the right arrow key.
Warning: The data on the member hard drives will be deleted. Back up all important data before continuing.
13. Click Next to continue.
14. Use the field or the slider bar to specify the amount of available array space that will be used by the volume.
Note: Any remaining space can be used to create a second volume.
15. Click Finish to begin the migration process.
16. Once the migration is complete, reboot if needed.
17.
If applicable, use a third party application or the Microsoft* Windows* operating system tools to create and format a new data partition in any unused space or use a third party application to extend the partition to utilize any remaining space.

9 Volume Creation
RAID and recovery volumes can be created using the Intel® Matrix Storage Console.
Note: RAID volume creation is only available as an option if you have two or more SATA hard drives in addition to another bootable device. If you wish to create a RAID volume using your boot device, you will need to perform a RAID migration. See RAID Migration for instructions on how to perform a migration.

9.1 RAID Volume Creation
Warning: Creating a RAID volume will permanently delete any existing data on the selected hard drives. Back up all important data before beginning these steps. If you wish to preserve the data, see RAID Migration for instructions on how to perform a RAID migration.
To create a RAID volume, use the following steps:
1. Open the Intel® Matrix Storage Console. (Start >> All Programs >> Intel® Matrix Storage Manager >> Intel® Matrix Storage Console)
2. Switch to advanced mode by selecting the Advanced Mode option under the View menu.
3. Select Create RAID Volume under the Actions menu.
4. Select Next.
5. Enter a name for the RAID volume.
6. Select a RAID level.
7. Select a strip size.
8. Select Next to continue.
9. Select the hard drives that will be used to create the RAID volume.
10. When you are finished selecting hard drives, select Next to continue.
11. Enter a size for the RAID volume.
12. Select Next to continue.
13. Select Finish to create the RAID volume.
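When choosing a RAID level and a volume size in the steps above, it can help to estimate the maximum usable capacity beforehand. The following is a minimal sketch, not part of Intel® Matrix Storage Manager: the function name is hypothetical, the drive-count limits are taken from the overview tables in this manual, and the capacity formulas are the standard ones for each RAID level, assuming every member contributes only as much space as the smallest drive and ignoring any metadata overhead.

```python
# Drive-count limits per RAID level, as listed in Tables 1-4 of this manual.
DRIVE_LIMITS = {
    "RAID 0": (2, 6),
    "RAID 1": (2, 2),
    "RAID 5": (3, 6),
    "RAID 10": (4, 4),
}


def usable_capacity_gb(level, drive_sizes_gb):
    """Estimate the maximum usable capacity of an array (hypothetical helper).

    Assumes each member contributes only the size of the smallest drive
    and that no space is reserved for metadata.
    """
    low, high = DRIVE_LIMITS[level]
    count = len(drive_sizes_gb)
    if count not in range(low, high + 1):
        raise ValueError(f"{level} needs {low} to {high} drives, got {count}")
    smallest = min(drive_sizes_gb)
    if level == "RAID 0":
        return count * smallest          # striping only, no redundancy
    if level == "RAID 1":
        return smallest                  # one full mirror copy
    if level == "RAID 5":
        return (count - 1) * smallest    # one drive's worth of parity
    return (count // 2) * smallest       # RAID 10: half the drives hold mirrors


print(usable_capacity_gb("RAID 5", [500, 500, 500, 500]))
```

Entering a volume size smaller than this estimate leaves array space free for a second volume, which is the matrix RAID configuration described earlier.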
9.2 Recovery Volume Creation
A recovery volume can be created using either Basic mode or Advanced mode in the Intel® Matrix Storage Console.

9.2.1 Recovery Volume Creation in Basic Mode
Warning: Creating a recovery volume will permanently delete any existing data on the drive selected as the recovery drive. Back up all important data before beginning these steps.
Note: This option may or may not be available depending on your system configuration. If you do not see the option listed, refer to Recovery Volume Creation in Advanced Mode.
To create a recovery volume in Basic mode, use the following steps:
1. Open the Intel® Matrix Storage Console. (Start >> All Programs >> Intel® Matrix Storage Manager >> Intel® Matrix Storage Console)
2. Select Protect data using Intel® Rapid Recover Technology.
3. Select Yes to confirm volume creation.

9.2.2 Recovery Volume Creation in Advanced Mode
Warning: Creating a recovery volume will permanently delete any existing data on the drive selected as the recovery drive. Back up all important data before beginning these steps.
To create a recovery volume in Advanced mode, use the following steps:
1. Open the Intel® Matrix Storage Console. (Start >> All Programs >> Intel® Matrix Storage Manager >> Intel® Matrix Storage Console)
2. Select Advanced Mode in the View menu.
3. Select Create Recovery Volume in the Actions menu.
4. Select Next to continue.
5. Modify the recovery volume name if you wish.
6. Select a hard drive to be used as the master hard drive for the recovery volume.
7. Select a hard drive to be used as the recovery hard drive for the recovery volume.
8. Select an update policy.
9.
Select Finish to begin recovery volume creation.Volume Creation 56 ver7.0 / User's Manual Appendix A Error Messages A.1 Incompatible Hardware Issue: The following error message appears during installation: Solution: This issue can be resolved by installing the Intel® Matrix Storage Manager on a system with a supported Intel chipset, or by ensuring that AHCI or RAID is enabled in the system BIOS. A.2 Operating System Not Supported Issue: The following error message appears during installation: Solution: This issue can be resolved by installing the Intel® Matrix Storage Manager on a supported operating system. A.3 Source Hard Drive Cannot Be Larger Issue: When attempting to migrate from a single hard drive (or a RAID-Ready configuration) to a RAID configuration, the following error message appears and the migration process will not begin:Volume Creation ver7.0 / User's Manual 57 Solution: Follow the steps listed in the error message to resolve the problem. A.4 Hard Drive Has System Files Issue: The following error message appears after selecting a hard drive as a member hard drive during the Create RAID Volume process: Solution: Select a new hard drive. A.5 Source Hard Drive is Dynamic Disk Issue: When attempting to migrate from a RAID-Ready configuration to a RAID configuration, an error message is received that says the migration cannot continue because the source drive is a dynamic disk. However, Microsoft* Windows* Disk Management shows the disk as basic, not dynamic.Volume Creation 58 ver7.0 / User's Manual Solution: Reduce the size of the partition by a few MBs and see if that resolves the issue.

Intel documentation. Search for an Intel product:

http://software.intel.com/sites/products/search/search.php?q=&x=26&y=18&product=&version=&docos=

Intel ® Math Kernel Library for Linux* OS User's Guide http://software.intel.com/sites/products/documentation/hpc/mkl/mkl_userguide_lnx/mkl_userguide_lnx.pdf

Intel ® Math Kernel Library for Mac OS* X User's Guide http://software.intel.com/sites/products/documentation/hpc/mkl/mkl_userguide_mac/mkl_userguide_mac.pdf

Intel ® Math Kernel Library for Windows* OS User's Guide Intel® MKL - Windows* OS Document Number: 315930-018US http://software.intel.com/sites/products/documentation/hpc/mkl/mkl_userguide_win/mkl_userguide_win.pdf

Intel ® Math Kernel Library Reference Manual Document Number: 630813-045US MKL 10.3 Update 8 http://software.intel.com/sites/products/documentation/hpc/mkl/mklman/mklman.pdf

Intel(R) VTune(TM) Amplifier XE 2011 Getting Started Tutorials for Linux* OS Document Number: 324207-005US http://software.intel.com/sites/products/documentation/hpc/amplifierxe/en-us/2011Update/lin/start/getting_started_amplifier_xe_linux.pdf

Intel(R) VTune(TM) Amplifier XE 2011 Getting Started Tutorials for Windows* OS http://software.intel.com/sites/products/documentation/hpc/amplifierxe/en-us/2011Update/win/start/getting_started_amplifier_xe_windows.pdf

Intel® VTune™ Amplifier XE 2011 Release Notes for Linux Installation Guide and Release Notes Document number: 323591-001U http://software.intel.com/sites/products/documentation/hpc/amplifierxe/en-us/2011Update/lin/start/release_notes_amplifier_xe_linux.pdf

Intel® VTune™ Amplifier XE 2011 Release Notes for Windows* OS Installation Guide and Release Notes Document number: 323401-001U http://software.intel.com/sites/products/documentation/hpc/amplifierxe/en-us/2011Update/win/start/release_notes_amplifier_xe_windows.pdf

Intel(R) Threading Building Blocks Reference Manual http://software.intel.com/sites/products/documentation/hpc/composerxe/en-us/2011Update/tbbxe/Reference.pdf

Intel® Threading Building Blocks Design Patterns Document Number 323512-005U http://software.intel.com/sites/products/documentation/hpc/composerxe/en-us/2011Update/tbbxe/Design_Patterns.pdf

Intel® Parallel Studio 2011 SP1 Installation Guide and Release Notes Document number: 321604-003US 24 July 201 http://software.intel.com/sites/products/documentation/studio/studio/en-us/2011Update/release_notes_studio.pdf

Intel® Math Kernel Library Summary Statistics Application Note http://software.intel.com/sites/products/documentation/hpc/mkl/sslnotes/sslnotes.pdf

Intel® Math Kernel Library Vector Statistical Library Notes http://software.intel.com/sites/products/documentation/hpc/mkl/vslnotes/vslnotes.pdf

Intel® C++ Composer XE 2011 Getting Started Tutorials Document Number: 323648-001US http://software.intel.com/sites/products/documentation/hpc/composerxe/en-us/2011Update/start/lin/getting_started_composerxe2011_cpp_lin.pdf

Intel® C++ Composer XE 2011 Getting Started Tutorials Document Number: 323649-001US http://software.intel.com/sites/products/documentation/hpc/composerxe/en-us/2011Update/start/mac/getting_started_composerxe2011_cpp_mac.pdf

Intel® C++ Composer XE 2011 Getting Started Tutorials Document Number: 323647-001US http://software.intel.com/sites/products/documentation/hpc/composerxe/en-us/2011Update/start/win/getting_started_composerxe2011_cpp_win.pdf

Intel® Fortran Composer XE 2011 Getting Started Tutorials Document Number: 323651-001US http://software.intel.com/sites/products/documentation/hpc/composerxe/en-us/2011Update/start/lin/getting_started_composerxe2011_for_lin.pdf

Intel® Parallel Inspector 2011 Release Notes Installation Guide and Release Notes Document number: 320754-002U http://software.intel.com/sites/products/documentation/studio/inspector/en-us/2011Update/start/release_notes_inspector.pdf

Intel® Visual Fortran Composer XE 2011 Getting Started Tutorials Document Number: 323650-001US http://software.intel.com/sites/products/documentation/hpc/composerxe/en-us/2011Update/start/win/getting_started_composerxe2011_for_win.pdf

Intel® Rapid Storage Technology User Guide August 2011 Revision 1. http://download.intel.com/support/chipsets/imsm/sb/irst_user_guide.pdf

Intel® Matrix Storage Manager 8.x User's Manual January 2009 Revision 1.

http://download.intel.com/support/chipsets/imsm/sb/8_x_raid_ahci_users_manual.pdf

Intel® Math Kernel Library for Linux* OS User's Guide
Intel® MKL - Linux* OS
Document Number: 314774-019US

Contents

Legal Information
Introducing the Intel® Math Kernel Library
Getting Help and Support
Notational Conventions

Chapter 1: Overview
  Document Overview
  What's New
  Related Information

Chapter 2: Getting Started
  Checking Your Installation
  Setting Environment Variables
    Scripts to Set Environment Variables
    Automating the Process of Setting Environment Variables
  Compiler Support
  Using Code Examples
  What You Need to Know Before You Begin Using the Intel® Math Kernel Library

Chapter 3: Structure of the Intel® Math Kernel Library
  Architecture Support
  High-level Directory Structure
  Layered Model Concept
  Accessing the Intel® Math Kernel Library Documentation
    Contents of the Documentation Directories
    Viewing Man Pages

Chapter 4: Linking Your Application with the Intel® Math Kernel Library
  Linking Quick Start
    Using the -mkl Compiler Option
    Using the Single Dynamic Library
    Selecting Libraries to Link with
    Using the Link-line Advisor
    Using the Command-line Link Tool
  Linking Examples
    Linking on IA-32 Architecture Systems
    Linking on Intel(R) 64 Architecture Systems
  Linking in Detail
    Listing Libraries on a Link Line
    Dynamically Selecting the Interface and Threading Layer
    Linking with Interface Libraries
      Using the ILP64 Interface vs. LP64 Interface
      Linking with Fortran 95 Interface Libraries
    Linking with Threading Libraries
      Sequential Mode of the Library
      Selecting the Threading Layer
    Linking with Computational Libraries
    Linking with Compiler Run-time Libraries
    Linking with System Libraries
  Building Custom Shared Objects
    Using the Custom Shared Object Builder
    Composing a List of Functions
    Specifying Function Names
    Distributing Your Custom Shared Object

Chapter 5: Managing Performance and Memory
  Using Parallelism of the Intel® Math Kernel Library
    Threaded Functions and Problems
    Avoiding Conflicts in the Execution Environment
    Techniques to Set the Number of Threads
    Setting the Number of Threads Using an OpenMP* Environment Variable
    Changing the Number of Threads at Run Time
    Using Additional Threading Control
      Intel MKL-specific Environment Variables for Threading Control
      MKL_DYNAMIC
      MKL_DOMAIN_NUM_THREADS
      Setting the Environment Variables for Threading Control
  Tips and Techniques to Improve Performance
    Coding Techniques
    Hardware Configuration Tips
    Managing Multi-core Performance
    Operating on Denormals
    FFT Optimized Radices
  Using Memory Management
    Intel MKL Memory Management Software
    Redefining Memory Functions

Chapter 6: Language-specific Usage Options
  Using Language-Specific Interfaces with Intel® Math Kernel Library
    Interface Libraries and Modules
    Fortran 95 Interfaces to LAPACK and BLAS
    Compiler-dependent Functions and Fortran 90 Modules
  Mixed-language Programming with the Intel Math Kernel Library
    Calling LAPACK, BLAS, and CBLAS Routines from C/C++ Language Environments
    Using Complex Types in C/C++
    Calling BLAS Functions that Return the Complex Values in C/C++ Code
    Support for Boost uBLAS Matrix-matrix Multiplication
    Invoking Intel MKL Functions from Java* Applications
      Intel MKL Java* Examples
      Running the Java* Examples
      Known Limitations of the Java* Examples

Chapter 7: Coding Tips
  Aligning Data for Consistent Results
  Using Predefined Preprocessor Symbols for Intel® MKL Version-Dependent Compilation

Chapter 8: Working with the Intel® Math Kernel Library Cluster Software
  Linking with ScaLAPACK and Cluster FFTs
  Setting the Number of Threads
  Using Shared Libraries
  Building ScaLAPACK Tests
  Examples for Linking with ScaLAPACK and Cluster FFT
    Examples for Linking a C Application
    Examples for Linking a Fortran Application

Chapter 9: Programming with Intel® Math Kernel Library in the Eclipse* Integrated Development Environment (IDE)
  Configuring the Eclipse* IDE CDT to Link with Intel MKL
  Getting Assistance for Programming in the Eclipse* IDE
    Viewing the Intel® Math Kernel Library Reference Manual in the Eclipse* IDE
    Searching the Intel Web Site from the Eclipse* IDE

Chapter 10: LINPACK and MP LINPACK Benchmarks
  Intel® Optimized LINPACK Benchmark for Linux* OS
    Contents of the Intel® Optimized LINPACK Benchmark
    Running the Software
    Known Limitations of the Intel® Optimized LINPACK Benchmark
  Intel® Optimized MP LINPACK Benchmark for Clusters
    Overview of the Intel® Optimized MP LINPACK Benchmark for Clusters
    Contents of the Intel® Optimized MP LINPACK Benchmark for Clusters
    Building the MP LINPACK
    New Features of Intel® Optimized MP LINPACK Benchmark
    Benchmarking a Cluster
    Options to Reduce Search Time

Appendix A: Intel® Math Kernel Library Language Interfaces Support
  Language Interfaces Support, by Function Domain
  Include Files

Appendix B: Support for Third-Party Interfaces
  GMP* Functions
  FFTW Interface Support

Appendix C: Directory Structure in Detail
  Detailed Structure of the IA-32 Architecture Directories
    Static Libraries in the lib/ia32 Directory
    Dynamic Libraries in the lib/ia32 Directory
  Detailed Structure of the Intel® 64 Architecture Directories
    Static Libraries in the lib/intel64 Directory
    Dynamic Libraries in the lib/intel64 Directory

Legal Information

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR.

Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.
Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm

Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. Go to: http://www.intel.com/products/processor_number/

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

BlueMoon, BunnyPeople, Celeron, Celeron Inside, Centrino, Centrino Inside, Cilk, Core Inside, E-GOLD, i960, Intel, the Intel logo, Intel AppUp, Intel Atom, Intel Atom Inside, Intel Core, Intel Inside, Intel Insider, the Intel Inside logo, Intel NetBurst, Intel NetMerge, Intel NetStructure, Intel SingleDriver, Intel SpeedStep, Intel Sponsors of Tomorrow., the Intel Sponsors of Tomorrow. logo, Intel StrataFlash, Intel vPro, Intel XScale, InTru, the InTru logo, the InTru Inside logo, InTru soundmark, Itanium, Itanium Inside, MCS, MMX, Moblin, Pentium, Pentium Inside, Puma, skoool, the skoool logo, SMARTi, Sound Mark, The Creators Project, The Journey Inside, Thunderbolt, Ultrabook, vPro Inside, VTune, Xeon, Xeon Inside, X-GOLD, XMM, X-PMU and XPOSYS are trademarks of Intel Corporation in the U.S. and/or other countries.

*Other names and brands may be claimed as the property of others.

Java is a registered trademark of Oracle and/or its affiliates.

Copyright © 2006 - 2011, Intel Corporation. All rights reserved.
Optimization Notice

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

Introducing the Intel® Math Kernel Library

The Intel® Math Kernel Library (Intel® MKL) improves performance of scientific, engineering, and financial software that solves large computational problems. Among other functionality, Intel MKL provides linear algebra routines, fast Fourier transforms, as well as vectorized math and random number generation functions, all optimized for the latest Intel processors, including processors with multiple cores (see the Intel® MKL Release Notes for the full list of supported processors). Intel MKL also performs well on non-Intel processors.

Intel MKL is thread-safe and extensively threaded using the OpenMP* technology.

Intel MKL provides the following major functionality:

• Linear algebra, implemented in LAPACK (solvers and eigensolvers) plus level 1, 2, and 3 BLAS, offering the vector, vector-matrix, and matrix-matrix operations needed for complex mathematical software. If you prefer the FORTRAN 90/95 programming language, you can call LAPACK driver and computational subroutines through specially designed interfaces with reduced numbers of arguments. A C interface to LAPACK is also available.
• ScaLAPACK (SCAlable LAPACK) with its support functionality including the Basic Linear Algebra Communications Subprograms (BLACS) and the Parallel Basic Linear Algebra Subprograms (PBLAS). ScaLAPACK is available for Intel MKL for Linux* and Windows* operating systems.
• Direct sparse solver, an iterative sparse solver, and a supporting set of sparse BLAS (level 1, 2, and 3) for solving sparse systems of equations.
• Multidimensional discrete Fourier transforms (1D, 2D, 3D) with a mixed radix support (for sizes not limited to powers of 2). Distributed versions of these functions are provided for use on clusters on the Linux* and Windows* operating systems.
• A set of vectorized transcendental functions called the Vector Math Library (VML). For most of the supported processors, the Intel MKL VML functions offer greater performance than the libm (scalar) functions, while keeping the same high accuracy.
• The Vector Statistical Library (VSL), which offers high performance vectorized random number generators for several probability distributions, convolution and correlation routines, and summary statistics functions.
• Data Fitting Library, which provides capabilities for spline-based approximation of functions, derivatives and integrals of functions, and search.

For details see the Intel® MKL Reference Manual.

Getting Help and Support

Intel provides a support web site that contains a rich repository of self help information, including getting started tips, known product issues, product errata, license information, user forums, and more. Visit the Intel MKL support website at http://www.intel.com/software/products/support/.

The Intel MKL documentation integrates into the Eclipse* integrated development environment (IDE). See Getting Assistance for Programming in the Eclipse* IDE.

Notational Conventions

The following term is used in reference to the operating system.

Linux* OS: This term refers to information that is valid on all supported Linux* operating systems.

The following notations are used to refer to Intel MKL directories.

The installation directory for the Intel® C++ Composer XE or Intel® Fortran Composer XE .

The main directory where Intel MKL is installed: =/mkl. Replace this placeholder with the specific pathname in the configuring, linking, and building instructions.

The following font conventions are used in this document.

Italic: Italic is used for emphasis and also indicates document names in body text, for example: see Intel MKL Reference Manual.

Monospace lowercase: Indicates filenames, directory names, and pathnames, for example: ./benchmarks/linpack

Monospace lowercase mixed with uppercase: Indicates:
• Commands and command-line options, for example, icc myprog.c -L$MKLPATH -I$MKLINCLUDE -lmkl -liomp5 -lpthread
• C/C++ code fragments, for example, a = new double [SIZE*SIZE];

UPPERCASE MONOSPACE: Indicates system variables, for example, $MKLPATH.

Monospace italic: Indicates a parameter in discussions, for example, lda. When enclosed in angle brackets, indicates a placeholder for an identifier, an expression, a string, a symbol, or a value, for example, . Substitute one of these items for the placeholder.

[ items ]: Square brackets indicate that the items enclosed in brackets are optional.

{ item | item }: Braces indicate that only one of the items listed between braces should be selected. A vertical bar ( | ) separates the items.

Chapter 1: Overview

Document Overview

The Intel® Math Kernel Library (Intel® MKL) User's Guide provides usage information for the library. The usage information covers the organization, configuration, performance, and accuracy of Intel MKL, specifics of routine calls in mixed-language programming, linking, and more.

This guide describes OS-specific usage of Intel MKL, along with OS-independent features. The document contains usage information for all Intel MKL function domains.

This User's Guide provides the following information:

• Describes post-installation steps to help you start using the library
• Shows you how to configure the library with your development environment
• Acquaints you with the library structure
• Explains how to link your application with the library and provides simple usage scenarios
• Describes how to code, compile, and run your application with Intel MKL

This guide is intended for Linux OS programmers with beginner to advanced experience in software development.

See Also: Language Interfaces Support, by Function Domain

What's New

This User's Guide documents the Intel® Math Kernel Library (Intel® MKL) 10.3 Update 8. The document was updated to reflect the addition of Data Fitting Functions to the product.
Related Information

For reference information on how to use the library in your application, use this guide in conjunction with the following documents:
• The Intel® Math Kernel Library Reference Manual, which provides reference information on routine functionalities, parameter descriptions, interfaces, calling syntaxes, and return values.
• The Intel® Math Kernel Library for Linux* OS Release Notes.

2 Getting Started

Checking Your Installation

After installing the Intel® Math Kernel Library (Intel® MKL), verify that the library is properly installed and configured:

1. Intel MKL installs in . Check that the subdirectory of referred to as was created.
2. If you want to keep multiple versions of Intel MKL installed on your system, update your build scripts to point to the correct Intel MKL version.
3. Check that the following files appear in the /bin directory and its subdirectories:
       mklvars.sh
       mklvars.csh
       ia32/mklvars_ia32.sh
       ia32/mklvars_ia32.csh
       intel64/mklvars_intel64.sh
       intel64/mklvars_intel64.csh
   Use these files to assign Intel MKL-specific values to several environment variables, as explained in Setting Environment Variables.
4. To understand how the Intel MKL directories are structured, see Intel® Math Kernel Library Structure.
5. To make sure that Intel MKL runs on your system, launch an Intel MKL example, as explained in Using Code Examples.

See Also
    Notational Conventions

Setting Environment Variables

See Also
    Setting the Number of Threads Using an OpenMP* Environment Variable

Scripts to Set Environment Variables

When the installation of Intel MKL for Linux* OS is complete, set the INCLUDE, MKLROOT, LD_LIBRARY_PATH, MANPATH, LIBRARY_PATH, CPATH, FPATH, and NLSPATH environment variables in the command shell using one of the script files in the bin subdirectory of the Intel MKL installation directory.

Choose the script corresponding to your system architecture and command shell as explained in the following table:

Architecture           Shell                  Script File
IA-32                  C                      ia32/mklvars_ia32.csh
IA-32                  Bash and Bourne (sh)   ia32/mklvars_ia32.sh
Intel® 64              C                      intel64/mklvars_intel64.csh
Intel® 64              Bash and Bourne (sh)   intel64/mklvars_intel64.sh
IA-32 and Intel® 64    C                      mklvars.csh
IA-32 and Intel® 64    Bash and Bourne (sh)   mklvars.sh

Running the Scripts

The scripts accept parameters to specify the following:
• Architecture.
• Addition of a path to Fortran 95 modules precompiled with the Intel® Fortran compiler to the FPATH environment variable. Supply this parameter only if you are using the Intel® Fortran compiler.
• Interface of the Fortran 95 modules. This parameter is needed only if you requested addition of a path to the modules.

Usage and values of these parameters depend on the script name (regardless of the extension). The following table lists values of the script parameters.

Script            Architecture                   Addition of a Path to           Interface
                  (required, when applicable)    Fortran 95 Modules (optional)   (optional)
mklvars_ia32      n/a †                          mod                             n/a
mklvars_intel64   n/a                            mod                             lp64 (default), ilp64
mklvars           ia32, intel64                  mod                             lp64 (default), ilp64

† Not applicable.
For example:
• The command
      mklvars_ia32.sh
  sets environment variables for the IA-32 architecture and adds no path to the Fortran 95 modules.
• The command
      mklvars_intel64.sh mod ilp64
  sets environment variables for the Intel® 64 architecture and adds the path to the Fortran 95 modules for the ILP64 interface to the FPATH environment variable.
• The command
      mklvars.sh intel64 mod
  sets environment variables for the Intel® 64 architecture and adds the path to the Fortran 95 modules for the LP64 interface to the FPATH environment variable.

NOTE Supply the parameter specifying the architecture first, if it is needed. Values of the other two parameters can be listed in any order.

See Also
    High-level Directory Structure
    Interface Libraries and Modules
    Fortran 95 Interfaces to LAPACK and BLAS
    Setting the Number of Threads Using an OpenMP* Environment Variable

Automating the Process of Setting Environment Variables

To automate setting of the INCLUDE, MKLROOT, LD_LIBRARY_PATH, MANPATH, LIBRARY_PATH, CPATH, FPATH, and NLSPATH environment variables, add mklvars*.*sh to your shell profile so that each time you log in, the script automatically executes and sets the paths to the appropriate Intel MKL directories. To do this, with a local user account, edit the following files by adding the appropriate script to the path manipulation section right before exporting variables:

bash
    Files: ~/.bash_profile, ~/.bash_login or ~/.profile
    Commands:
        # setting up MKL environment for bash
        . /bin [/]/mklvars[].sh [] [mod] [lp64|ilp64]

sh
    Files: ~/.profile
    Commands:
        # setting up MKL environment for sh
        . /bin [/]/mklvars[].sh [] [mod] [lp64|ilp64]

csh
    Files: ~/.login
    Commands:
        # setting up MKL environment for csh
        source /bin [/]/mklvars[].csh [] [mod] [lp64|ilp64]

In the above commands, replace with ia32 or intel64.

If you have super user permissions, add the same commands to a general-system file in /etc/profile (for bash and sh) or in /etc/csh.login (for csh).
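A profile entry of this kind might look like the following minimal sketch for bash. The install location /opt/intel/mkl, the MKL_DIR variable, and the mkl_setup_env helper are assumptions made for illustration and are not part of Intel MKL; only the mklvars.sh script name and its intel64, mod, lp64, and ilp64 parameters come from this guide. The helper sources the script only when it is readable, so an entry left behind after the library is removed is simply skipped.

```shell
# Hypothetical installation directory; replace with your actual MKL location.
MKL_DIR="${MKL_DIR:-/opt/intel/mkl}"

# Source mklvars.sh from the given MKL directory if it is readable.
# $1: MKL directory; remaining arguments are passed to the script,
# for example: intel64 mod ilp64
mkl_setup_env() {
    mkl_script="$1/bin/mklvars.sh"
    shift
    if [ -r "$mkl_script" ]; then
        . "$mkl_script" "$@"
    fi
}

# setting up MKL environment for bash (Intel 64, default LP64 interface)
mkl_setup_env "$MKL_DIR" intel64
```

Passing the architecture first matches the ordering rule for script parameters; additional values such as mod or ilp64 can be appended to the call in any order.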
CAUTION Before uninstalling Intel MKL, remove the above commands from all profile files where the script execution was added. Otherwise you may experience problems logging in.

See Also
    Scripts to Set Environment Variables

Compiler Support

Intel MKL supports compilers identified in the Release Notes. However, the library has been successfully used with other compilers as well.

Intel MKL provides a set of include files to simplify program development by specifying enumerated values and prototypes for the respective functions. Calling Intel MKL functions from your application without an appropriate include file may lead to incorrect behavior of the functions.

See Also
    Include Files

Using Code Examples

The Intel MKL package includes code examples, located in the examples subdirectory of the installation directory. Use the examples to determine:
• Whether Intel MKL is working on your system
• How you should call the library
• How to link the library

The examples are grouped in subdirectories mainly by Intel MKL function domains and programming languages. For example, the examples/spblas subdirectory contains a makefile to build the Sparse BLAS examples and the examples/vmlc subdirectory contains the makefile to build the C VML examples. Source code for the examples is in the next-level sources subdirectory.

See Also
    High-level Directory Structure

What You Need to Know Before You Begin Using the Intel® Math Kernel Library

Target platform
    Identify the architecture of your target machine:
    • IA-32 or compatible
    • Intel® 64 or compatible

    Reason: Because Intel MKL libraries are located in directories corresponding to your particular architecture (see Architecture Support), you should provide proper paths on your link lines (see Linking Examples). To configure your development environment for the use with Intel MKL, set your environment variables using the script corresponding to your architecture (see Setting Environment Variables for details).
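One way to identify the target architecture from a shell is to map the machine type reported by uname -m onto the two directory names used by Intel MKL. The sketch below is illustrative only: the function name mkl_arch_for_machine is hypothetical, and it assumes the usual Linux conventions that x86_64 denotes an Intel® 64 compatible system and i386 through i686 denote an IA-32 compatible system.

```shell
# Map a machine type (as printed by `uname -m`) to the MKL architecture name.
# Prints ia32 or intel64; returns 1 for any other machine type.
mkl_arch_for_machine() {
    case "$1" in
        x86_64) echo intel64 ;;
        i386|i486|i586|i686) echo ia32 ;;
        *) echo "unrecognized machine type: $1" >&2; return 1 ;;
    esac
}

# Report the architecture of the current machine, if it is one of the two.
if mkl_arch=$(mkl_arch_for_machine "$(uname -m)" 2>/dev/null); then
    echo "MKL architecture: $mkl_arch"
else
    echo "This machine type has no matching MKL architecture directory."
fi
```

The resulting name can then be used to pick the lib/ia32 or lib/intel64 directory and the matching environment script.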
Mathematical problem
    Identify all Intel MKL function domains that you require:
    • BLAS
    • Sparse BLAS
    • LAPACK
    • PBLAS
    • ScaLAPACK
    • Sparse Solver routines
    • Vector Mathematical Library functions (VML)
    • Vector Statistical Library functions
    • Fourier Transform functions (FFT)
    • Cluster FFT
    • Trigonometric Transform routines
    • Poisson, Laplace, and Helmholtz Solver routines
    • Optimization (Trust-Region) Solver routines
    • Data Fitting Functions
    • GMP* arithmetic functions. Deprecated and will be removed in a future release

    Reason: The function domain you intend to use narrows the search in the Reference Manual for specific routines you need. Additionally, if you are using the Intel MKL cluster software, your link line is function-domain specific (see Working with the Cluster Software). Coding tips may also depend on the function domain (see Tips and Techniques to Improve Performance).

Programming language
    Intel MKL provides support for both Fortran and C/C++ programming. Identify the language interfaces that your function domains support (see Intel® Math Kernel Library Language Interfaces Support).

    Reason: Intel MKL provides language-specific include files for each function domain to simplify program development (see Language Interfaces Support, by Function Domain). For a list of language-specific interface libraries and modules and an example how to generate them, see also Using Language-Specific Interfaces with Intel® Math Kernel Library.

Range of integer data
    If your system is based on the Intel 64 architecture, identify whether your application performs calculations with large data arrays (of more than 2^31-1 elements).

    Reason: To operate on large data arrays, you need to select the ILP64 interface, where integers are 64-bit; otherwise, use the default, LP64, interface, where integers are 32-bit (see Using the ILP64 Interface vs. LP64 Interface).
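The integer-range decision above comes down to a threshold check against 2^31-1 = 2147483647, the largest value a signed 32-bit integer can hold. The following sketch shows that check; the function name mkl_interface_for_count is hypothetical, and the sketch assumes a shell whose integer tests handle 64-bit values, as bash does.

```shell
# Print the interface (lp64 or ilp64) suited to an array of $1 elements.
# Arrays with more than 2^31-1 elements need 64-bit integers (ILP64).
mkl_interface_for_count() {
    max_lp64=2147483647    # 2^31-1
    if [ "$1" -gt "$max_lp64" ]; then
        echo ilp64
    else
        echo lp64
    fi
}

mkl_interface_for_count 1000000       # small array: lp64
mkl_interface_for_count 3000000000    # more than 2^31-1 elements: ilp64
```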
Threading model
    Identify whether and how your application is threaded:
    • Threaded with the Intel compiler
    • Threaded with a third-party compiler
    • Not threaded

    Reason: The compiler you use to thread your application determines which threading library you should link with your application. For applications threaded with a third-party compiler you may need to use Intel MKL in the sequential mode (for more information, see Sequential Mode of the Library and Linking with Threading Libraries).

Number of threads
    Determine the number of threads you want Intel MKL to use.

    Reason: Intel MKL is based on the OpenMP* threading. By default, the OpenMP* software sets the number of threads that Intel MKL uses. If you need a different number, you have to set it yourself using one of the available mechanisms. For more information, see Using Parallelism of the Intel® Math Kernel Library.

Linking model
    Decide which linking model is appropriate for linking your application with Intel MKL libraries:
    • Static
    • Dynamic

    Reason: The link line syntax and libraries for static and dynamic linking are different. For the list of link libraries for static and dynamic models, linking examples, and other relevant topics, like how to save disk space by creating a custom dynamic library, see Linking Your Application with the Intel® Math Kernel Library.

MPI used
    Decide what MPI you will use with the Intel MKL cluster software. You are strongly encouraged to use Intel® MPI 3.2 or later.

    Reason: To link your application with ScaLAPACK and/or Cluster FFT, the libraries corresponding to your particular MPI should be listed on the link line (see Working with the Cluster Software).

3 Structure of the Intel® Math Kernel Library

Architecture Support

Intel® Math Kernel Library (Intel® MKL) for Linux* OS provides two architecture-specific implementations. The following table lists the supported architectures and directories where each architecture-specific implementation is located.

Architecture               Location
IA-32 or compatible        /lib/ia32
Intel® 64 or compatible    /lib/intel64

See Also
    High-level Directory Structure
    Detailed Structure of the IA-32 Architecture Directories
    Detailed Structure of the Intel® 64 Architecture Directories

High-level Directory Structure

Directory                           Contents
                                    Installation directory of the Intel® Math Kernel Library (Intel® MKL)

Subdirectories of
bin                                 Scripts to set environment variables in the user shell
bin/ia32                            Shell scripts for the IA-32 architecture
bin/intel64                         Shell scripts for the Intel® 64 architecture
benchmarks/linpack                  Shared-memory (SMP) version of the LINPACK benchmark
benchmarks/mp_linpack               Message-passing interface (MPI) version of the LINPACK benchmark
examples                            Examples directory. Each subdirectory has source and data files
include                             INCLUDE files for the library routines, as well as for tests and examples
include/ia32                        Fortran 95 .mod files for the IA-32 architecture and Intel® Fortran compiler
include/intel64/lp64                Fortran 95 .mod files for the Intel® 64 architecture, Intel Fortran compiler, and LP64 interface
include/intel64/ilp64               Fortran 95 .mod files for the Intel® 64 architecture, Intel Fortran compiler, and ILP64 interface
include/fftw                        Header files for the FFTW2 and FFTW3 interfaces
interfaces/blas95                   Fortran 95 interfaces to BLAS and a makefile to build the library
interfaces/fftw2x_cdft              MPI FFTW 2.x interfaces to the Intel MKL Cluster FFTs
interfaces/fftw3x_cdft              MPI FFTW 3.x interfaces to the Intel MKL Cluster FFTs
interfaces/fftw2xc                  FFTW 2.x interfaces to the Intel MKL FFTs (C interface)
interfaces/fftw2xf                  FFTW 2.x interfaces to the Intel MKL FFTs (Fortran interface)
interfaces/fftw3xc                  FFTW 3.x interfaces to the Intel MKL FFTs (C interface)
interfaces/fftw3xf                  FFTW 3.x interfaces to the Intel MKL FFTs (Fortran interface)
interfaces/lapack95                 Fortran 95 interfaces to LAPACK and a makefile to build the library
lib/ia32                            Static libraries and shared objects for the IA-32 architecture
lib/intel64                         Static libraries and shared objects for the Intel® 64 architecture
tests                               Source and data files for tests
tools                               Tools and plug-ins
tools/builder                       Tools for creating custom dynamically linkable libraries
tools/plugins/com.intel.mkl.help    Eclipse* IDE plug-in with Intel MKL Reference Manual in WebHelp format. See mkl_documentation.htm for more information

Subdirectories of
Documentation/en_US/mkl             Intel MKL documentation.
man/en_US/man3                      Man pages for Intel MKL functions. No directory for man pages is created in locales other than en_US even if a directory for the localized documentation is created in the respective locales. For more information, see Contents of the Documentation Directories.
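As a rough sketch, part of the directory layout listed above can be checked from a shell after installation. The helper name mkl_missing_dirs is hypothetical, the checked list is a subset of the table chosen for illustration, and the fallback path /opt/intel/mkl is an assumption; the example call also assumes that the MKLROOT variable set by the environment scripts points at the Intel MKL directory.

```shell
# Print each expected subdirectory that is missing under an MKL directory.
# $1: path to the MKL installation directory
# $2: architecture name (ia32 or intel64)
mkl_missing_dirs() {
    for sub in bin "bin/$2" include "lib/$2" examples tests tools; do
        [ -d "$1/$sub" ] || echo "$sub"
    done
}

# Example call; prints nothing when all listed subdirectories exist.
mkl_missing_dirs "${MKLROOT:-/opt/intel/mkl}" intel64
```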
See Also
    Notational Conventions

Layered Model Concept

Intel MKL is structured to support multiple compilers and interfaces, different OpenMP* implementations, both serial and multiple threads, and a wide range of processors. Conceptually Intel MKL can be divided into distinct parts to support different interfaces, threading models, and core computations:

1. Interface Layer
2. Threading Layer
3. Computational Layer

You can combine Intel MKL libraries to meet your needs by linking with one library in each part layer by layer. Once the interface library is selected, the threading library you select picks up the chosen interface, and the computational library uses interfaces and OpenMP implementation (or non-threaded mode) chosen in the first two layers.

To support threading with different compilers, one more layer is needed, which contains libraries not included in Intel MKL:
• Compiler run-time libraries (RTL).

The following table provides more details of each layer.

Interface Layer
    This layer matches compiled code of your application with the threading and/or computational parts of the library. This layer provides:
    • LP64 and ILP64 interfaces.
    • Compatibility with compilers that return function values differently.
    • A mapping between single-precision names and double-precision names for applications using Cray*-style naming (SP2DP interface).

    SP2DP interface supports Cray-style naming in applications targeted for the Intel 64 architecture and using the ILP64 interface. SP2DP interface provides a mapping between single-precision names (for both real and complex types) in the application and double-precision names in Intel MKL BLAS and LAPACK. Function names are mapped as shown in the following example for BLAS functions ?GEMM:

        SGEMM -> DGEMM
        DGEMM -> DGEMM
        CGEMM -> ZGEMM
        ZGEMM -> ZGEMM

    Mind that no changes are made to double-precision names.

Threading Layer
    This layer:
    • Provides a way to link threaded Intel MKL with different threading compilers.
    • Enables you to link with a threaded or sequential mode of the library.

    This layer is compiled for different environments (threaded or sequential) and compilers (from Intel, GNU*, and so on).

Computational Layer
    This layer is the heart of Intel MKL. It has only one library for each combination of architecture and supported OS. The Computational layer accommodates multiple architectures through identification of architecture features and chooses the appropriate binary code at run time.

Compiler Run-time Libraries (RTL)
    To support threading with Intel compilers, Intel MKL uses RTLs of the Intel® C++ Composer XE or Intel® Fortran Composer XE. To thread using third-party threading compilers, use libraries in the Threading layer or an appropriate compatibility library.

See Also
    Using the ILP64 Interface vs. LP64 Interface
    Linking Your Application with the Intel® Math Kernel Library
    Linking with Threading Libraries

Accessing the Intel® Math Kernel Library Documentation

Contents of the Documentation Directories

Most of Intel MKL documentation is installed at /Documentation//mkl. For example, the documentation in English is installed at /Documentation/en_US/mkl. However, some Intel MKL-related documents are installed one or two levels up. The following table lists MKL-related documentation.
File name                    Comment

Files in /Documentation
/clicense or /flicense       Common end user license for the Intel® C++ Composer XE 2011 or Intel® Fortran Composer XE 2011, respectively
mklsupport.txt               Information on package number for customer support reference

Contents of /Documentation//mkl
redist.txt                   List of redistributable files
mkl_documentation.htm        Overview and links for the Intel MKL documentation
mkl_manual/index.htm         Intel MKL Reference Manual in an uncompressed HTML format
Release_Notes.htm            Intel MKL Release Notes
mkl_userguide/index.htm      Intel MKL User's Guide in an uncompressed HTML format, this document
mkl_link_line_advisor.htm    Intel MKL Link-line Advisor

Viewing Man Pages

To access Intel MKL man pages, add the man pages directory to the MANPATH environment variable. If you performed the Setting Environment Variables step of the Getting Started process, this is done automatically.

To view the man page for an Intel MKL function, enter the following command in your command shell:

    man

In this release, is the function name with omitted prefixes denoting data type, task type, or any other field that may vary for this function.

Examples:
• For the BLAS function ddot, enter man dot
• For the ScaLAPACK function pzgeql2, enter man pgeql2
• For the statistical function vslConvSetMode, enter man vslSetMode
• For the VML function vdPackM, enter man vPack
• For the FFT function DftiCommitDescriptor, enter man DftiCommitDescriptor

NOTE Function names in the man command are case-sensitive.

See Also
    High-level Directory Structure
    Setting Environment Variables

4 Linking Your Application with the Intel® Math Kernel Library

Linking Quick Start

Intel® Math Kernel Library (Intel® MKL) provides several options for quick linking of your application, which depend on the way you link:

Using the Intel® Composer XE compiler
    see Using the -mkl Compiler Option.

Explicit dynamic linking
    see Using the Single Dynamic Library for how to simplify your link line.

Explicitly listing libraries on your link line
    see Selecting Libraries to Link with for a summary of the libraries.

Using an interactive interface
    see Using the Link-line Advisor to determine libraries and options to specify on your link or compilation line.

Using an internally provided tool
    see Using the Command-line Link Tool to determine libraries, options, and environment variables or even compile and build your application.

Using the -mkl Compiler Option

The Intel® Composer XE compiler supports the following variants of the -mkl compiler option:

-mkl or -mkl=parallel
    to link with standard threaded Intel MKL.

-mkl=sequential
    to link with sequential version of Intel MKL.

-mkl=cluster
    to link with Intel MKL cluster components (sequential) that use Intel MPI.

For more information on the -mkl compiler option, see the Intel Compiler User and Reference Guides.

On Intel® 64 architecture systems, for each variant of the -mkl option, the compiler links your application using the LP64 interface.

If you specify any variant of the -mkl compiler option, the compiler automatically includes the Intel MKL libraries.
In cases not covered by the option, use the Link-line Advisor or see Linking in Detail.

See Also
    Listing Libraries on a Link Line
    Using the ILP64 Interface vs. LP64 Interface
    Using the Link-line Advisor
    Intel® Software Documentation Library

Using the Single Dynamic Library

You can simplify your link line through the use of the Intel MKL Single Dynamic Library (SDL).

To use SDL, place libmkl_rt.so on your link line. For example:

    icc application.c -lmkl_rt

SDL enables you to select the interface and threading library for Intel MKL at run time. By default, linking with SDL provides:
• LP64 interface on systems based on the Intel® 64 architecture
• Intel threading

To use other interfaces or change threading preferences, including use of the sequential version of Intel MKL, you need to specify your choices using functions or environment variables as explained in section Dynamically Selecting the Interface and Threading Layer.

Selecting Libraries to Link with

To link with Intel MKL:
• Choose one library from the Interface layer and one library from the Threading layer
• Add the only library from the Computational layer and run-time libraries (RTL)

The following table lists Intel MKL libraries to link with your application.

                                             Interface layer        Threading layer          Computational layer   RTL
IA-32 architecture, static linking           libmkl_intel.a         libmkl_intel_thread.a    libmkl_core.a         libiomp5.so
IA-32 architecture, dynamic linking          libmkl_intel.so        libmkl_intel_thread.so   libmkl_core.so        libiomp5.so
Intel® 64 architecture, static linking       libmkl_intel_lp64.a    libmkl_intel_thread.a    libmkl_core.a         libiomp5.so
Intel® 64 architecture, dynamic linking      libmkl_intel_lp64.so   libmkl_intel_thread.so   libmkl_core.so        libiomp5.so

The Single Dynamic Library (SDL) automatically links interface, threading, and computational libraries and thus simplifies linking. The following table lists Intel MKL libraries for dynamic linking using SDL.
See Dynamically Selecting the Interface and Threading Layer for how to set the interface and threading layers at run time through function calls or environment settings.

                                       SDL            RTL
IA-32 and Intel® 64 architectures      libmkl_rt.so   libiomp5.so †

† Use the Link-line Advisor to check whether you need to explicitly link the libiomp5.so RTL.

For exceptions and alternatives to the libraries listed above, see Linking in Detail.

See Also
    Layered Model Concept
    Using the Link-line Advisor
    Using the -mkl Compiler Option
    Working with the Intel® Math Kernel Library Cluster Software

Using the Link-line Advisor

Use the Intel MKL Link-line Advisor to determine the libraries and options to specify on your link or compilation line.

The latest version of the tool is available at http://software.intel.com/en-us/articles/intel-mkl-link-lineadvisor. The tool is also available in the product.

The Advisor requests information about your system and on how you intend to use Intel MKL (link dynamically or statically, use threaded or sequential mode, etc.). The tool automatically generates the appropriate link line for your application.

See Also
    Contents of the Documentation Directories

Using the Command-line Link Tool

Use the command-line Link tool provided by Intel MKL to simplify building your application with Intel MKL.

The tool not only provides the options, libraries, and environment variables to use, but also performs compilation and building of your application.

The tool mkl_link_tool is installed in the /tools directory.

See the knowledge base article at http://software.intel.com/en-us/articles/mkl-command-line-link-tool for more information.

Linking Examples

See Also
    Using the Link-line Advisor
    Examples for Linking with ScaLAPACK and Cluster FFT

Linking on IA-32 Architecture Systems

The following examples illustrate linking that uses Intel(R) compilers.

The examples use the .f Fortran source file.
C/C++ users should instead specify a .cpp (C++) or .c (C) file and replace ifort with icc.

NOTE If you successfully completed the Setting Environment Variables step of the Getting Started process, you can omit -I$MKLINCLUDE in all the examples and omit -L$MKLPATH in the examples for dynamic linking.

In these examples, MKLPATH=$MKLROOT/lib/ia32, MKLINCLUDE=$MKLROOT/include:

• Static linking of myprog.f and parallel Intel MKL:
    ifort myprog.f -L$MKLPATH -I$MKLINCLUDE -Wl,--start-group $MKLPATH/libmkl_intel.a $MKLPATH/libmkl_intel_thread.a $MKLPATH/libmkl_core.a -Wl,--end-group -liomp5 -lpthread
• Dynamic linking of myprog.f and parallel Intel MKL:
    ifort myprog.f -L$MKLPATH -I$MKLINCLUDE -lmkl_intel -lmkl_intel_thread -lmkl_core -liomp5 -lpthread
• Static linking of myprog.f and sequential version of Intel MKL:
    ifort myprog.f -L$MKLPATH -I$MKLINCLUDE -Wl,--start-group $MKLPATH/libmkl_intel.a $MKLPATH/libmkl_sequential.a $MKLPATH/libmkl_core.a -Wl,--end-group -lpthread
• Dynamic linking of myprog.f and sequential version of Intel MKL:
    ifort myprog.f -L$MKLPATH -I$MKLINCLUDE -lmkl_intel -lmkl_sequential -lmkl_core -lpthread
• Dynamic linking of user code myprog.f and parallel or sequential Intel MKL (Call the mkl_set_threading_layer function or set value of the MKL_THREADING_LAYER environment variable to choose threaded or sequential mode):
    ifort myprog.f -lmkl_rt
• Static linking of myprog.f, Fortran 95 LAPACK interface, and parallel Intel MKL:
    ifort myprog.f -L$MKLPATH -I$MKLINCLUDE -I$MKLINCLUDE/ia32 -lmkl_lapack95 -Wl,--start-group $MKLPATH/libmkl_intel.a $MKLPATH/libmkl_intel_thread.a $MKLPATH/libmkl_core.a -Wl,--end-group -liomp5 -lpthread
• Static linking of myprog.f, Fortran 95 BLAS interface, and parallel Intel MKL:
    ifort myprog.f -L$MKLPATH -I$MKLINCLUDE -I$MKLINCLUDE/ia32 -lmkl_blas95 -Wl,--start-group $MKLPATH/libmkl_intel.a $MKLPATH/libmkl_intel_thread.a $MKLPATH/libmkl_core.a -Wl,--end-group -liomp5 -lpthread

See Also
    Fortran 95 Interfaces to LAPACK and BLAS
    Examples for Linking a C Application
    Examples for Linking a Fortran Application
    Using the Single Dynamic Library

Linking on Intel(R) 64 Architecture Systems

The following examples illustrate linking that uses Intel(R) compilers.

The examples use the .f Fortran source file. C/C++ users should instead specify a .cpp (C++) or .c (C) file and replace ifort with icc.

NOTE If you successfully completed the Setting Environment Variables step of the Getting Started process, you can omit -I$MKLINCLUDE in all the examples and omit -L$MKLPATH in the examples for dynamic linking.

In these examples, MKLPATH=$MKLROOT/lib/intel64, MKLINCLUDE=$MKLROOT/include:

• Static linking of myprog.f and parallel Intel MKL supporting the LP64 interface:
    ifort myprog.f -L$MKLPATH -I$MKLINCLUDE -Wl,--start-group $MKLPATH/libmkl_intel_lp64.a $MKLPATH/libmkl_intel_thread.a $MKLPATH/libmkl_core.a -Wl,--end-group -liomp5 -lpthread
• Dynamic linking of myprog.f and parallel Intel MKL supporting the LP64 interface:
    ifort myprog.f -L$MKLPATH -I$MKLINCLUDE -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread
• Static linking of myprog.f and sequential version of Intel MKL supporting the LP64 interface:
    ifort myprog.f -L$MKLPATH -I$MKLINCLUDE -Wl,--start-group $MKLPATH/libmkl_intel_lp64.a $MKLPATH/libmkl_sequential.a $MKLPATH/libmkl_core.a -Wl,--end-group -lpthread
• Dynamic linking of myprog.f and sequential version of Intel MKL supporting the LP64 interface:
    ifort myprog.f -L$MKLPATH -I$MKLINCLUDE -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lpthread
• Static linking of myprog.f and parallel Intel MKL supporting the ILP64 interface:
    ifort myprog.f -L$MKLPATH -I$MKLINCLUDE -Wl,--start-group $MKLPATH/libmkl_intel_ilp64.a $MKLPATH/libmkl_intel_thread.a $MKLPATH/libmkl_core.a -Wl,--end-group -liomp5 -lpthread
• Dynamic linking of myprog.f and parallel Intel MKL supporting the ILP64 interface:
    ifort myprog.f -L$MKLPATH -I$MKLINCLUDE -lmkl_intel_ilp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread
• Dynamic linking of user code myprog.f and parallel or sequential Intel MKL (Call appropriate functions or set environment variables to choose threaded or sequential mode and to set the interface):
    ifort myprog.f -lmkl_rt
• Static linking of myprog.f, Fortran 95 LAPACK interface, and parallel Intel MKL supporting the LP64 interface:
    ifort myprog.f -L$MKLPATH -I$MKLINCLUDE -I$MKLINCLUDE/intel64/lp64 -lmkl_lapack95_lp64 -Wl,--start-group $MKLPATH/libmkl_intel_lp64.a $MKLPATH/libmkl_intel_thread.a $MKLPATH/libmkl_core.a -Wl,--end-group -liomp5 -lpthread
• Static linking of myprog.f, Fortran 95 BLAS interface, and parallel Intel MKL supporting the LP64 interface:
    ifort myprog.f -L$MKLPATH -I$MKLINCLUDE -I$MKLINCLUDE/intel64/lp64 -lmkl_blas95_lp64 -Wl,--start-group $MKLPATH/libmkl_intel_lp64.a $MKLPATH/libmkl_intel_thread.a $MKLPATH/libmkl_core.a -Wl,--end-group -liomp5 -lpthread

See Also
    Fortran 95 Interfaces to LAPACK and BLAS
    Examples for Linking a C Application
    Examples for Linking a Fortran Application
    Using the Single Dynamic Library

Linking in Detail

This section recommends which libraries to link with depending on your Intel MKL usage scenario and provides details of the linking.

Listing Libraries on a Link Line

To link with Intel MKL, specify paths and libraries on the link line as shown below.

NOTE The syntax below is for dynamic linking. For static linking, replace each library name preceded with "-l" with the path to the library file. For example, replace -lmkl_core with $MKLPATH/libmkl_core.a, where $MKLPATH is the appropriate user-defined environment variable.
    -L<MKL path> -I<MKL include>
    [-I<MKL include>/{ia32|intel64|{ilp64|lp64}}]
    [-lmkl_blas{95|95_ilp64|95_lp64}]
    [-lmkl_lapack{95|95_ilp64|95_lp64}]
    [<cluster components>]
    -lmkl_{intel|intel_ilp64|intel_lp64|intel_sp2dp|gf|gf_ilp64|gf_lp64}
    -lmkl_{intel_thread|gnu_thread|pgi_thread|sequential}
    -lmkl_core
    -liomp5 [-lpthread] [-lm]

In case of static linking, enclose the cluster components, interface, threading, and computational libraries in grouping symbols (for example, -Wl,--start-group $MKLPATH/libmkl_cdft_core.a $MKLPATH/libmkl_blacs_intelmpi_ilp64.a $MKLPATH/libmkl_intel_ilp64.a $MKLPATH/libmkl_intel_thread.a $MKLPATH/libmkl_core.a -Wl,--end-group).

The order of listing libraries on the link line is essential, except for the libraries enclosed in the grouping symbols above.

See Also
Using the Link-line Advisor
Linking Examples
Working with the Intel® Math Kernel Library Cluster Software

Dynamically Selecting the Interface and Threading Layer

The Single Dynamic Library (SDL) enables you to dynamically select the interface and threading layer for Intel MKL.

Setting the Interface Layer

Available interfaces depend on the architecture of your system. On systems based on the Intel® 64 architecture, LP64 and ILP64 interfaces are available. To set one of these interfaces at run time, use the mkl_set_interface_layer function or the MKL_INTERFACE_LAYER environment variable. The following table provides values to be used to set each interface.

    Interface Layer   Value of MKL_INTERFACE_LAYER   Value of the Parameter of mkl_set_interface_layer
    LP64              LP64                           MKL_INTERFACE_LP64
    ILP64             ILP64                          MKL_INTERFACE_ILP64

If the mkl_set_interface_layer function is called, the environment variable MKL_INTERFACE_LAYER is ignored. By default the LP64 interface is used. See the Intel MKL Reference Manual for details of the mkl_set_interface_layer function.
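The precedence described above (a call to mkl_set_interface_layer overrides MKL_INTERFACE_LAYER, and LP64 is the default) can be sketched as a small stand-alone model. This is only an illustration under stated assumptions: the enum, layer_from_env, and resolve_interface_layer are hypothetical names that do not exist in Intel MKL, and treating an unrecognized environment value as "not set" is an assumption, not documented behavior.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model type; the real constants (MKL_INTERFACE_LP64,
   MKL_INTERFACE_ILP64) are defined by the Intel MKL headers. */
typedef enum { LAYER_UNSET = 0, LAYER_LP64, LAYER_ILP64 } interface_layer;

/* Map a value of MKL_INTERFACE_LAYER to a layer.
   Assumption: a missing or unrecognized value counts as "not set". */
interface_layer layer_from_env(const char *value)
{
    if (value == NULL) return LAYER_UNSET;
    if (strcmp(value, "LP64") == 0) return LAYER_LP64;
    if (strcmp(value, "ILP64") == 0) return LAYER_ILP64;
    return LAYER_UNSET;
}

/* Resolve the effective layer: function call first, then the
   environment variable, then the LP64 default. */
interface_layer resolve_interface_layer(interface_layer from_call,
                                        const char *env_value)
{
    interface_layer from_env;

    assert(from_call == LAYER_UNSET || from_call == LAYER_LP64 ||
           from_call == LAYER_ILP64);
    if (from_call != LAYER_UNSET) return from_call;  /* env is ignored */
    from_env = layer_from_env(env_value);
    return from_env != LAYER_UNSET ? from_env : LAYER_LP64;
}
```

For instance, resolve_interface_layer(LAYER_LP64, "ILP64") yields LAYER_LP64 in this model, because the function call wins over the environment variable.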
Setting the Threading Layer

To set the threading layer at run time, use the mkl_set_threading_layer function or the MKL_THREADING_LAYER environment variable. The following table lists available threading layers along with the values to be used to set each layer.

    Threading Layer                Value of MKL_THREADING_LAYER   Value of the Parameter of mkl_set_threading_layer
    Intel threading                INTEL                          MKL_THREADING_INTEL
    Sequential mode of Intel MKL   SEQUENTIAL                     MKL_THREADING_SEQUENTIAL
    GNU threading                  GNU                            MKL_THREADING_GNU
    PGI threading                  PGI                            MKL_THREADING_PGI

If the mkl_set_threading_layer function is called, the environment variable MKL_THREADING_LAYER is ignored. By default Intel threading is used. See the Intel MKL Reference Manual for details of the mkl_set_threading_layer function.

See Also
Using the Single Dynamic Library
Layered Model Concept
Directory Structure in Detail

Linking with Interface Libraries

Using the ILP64 Interface vs. LP64 Interface

The Intel MKL ILP64 libraries use the 64-bit integer type (necessary for indexing large arrays, with more than 2^31-1 elements), whereas the LP64 libraries index arrays with the 32-bit integer type.

The LP64 and ILP64 interfaces are implemented in the Interface layer. Link with the following interface libraries for the LP64 or ILP64 interface, respectively:

• libmkl_intel_lp64.a or libmkl_intel_ilp64.a for static linking
• libmkl_intel_lp64.so or libmkl_intel_ilp64.so for dynamic linking

The ILP64 interface provides for the following:

• Support for large data arrays (with more than 2^31-1 elements)
• Compiling your Fortran code with the -i8 compiler option

The LP64 interface provides compatibility with the previous Intel MKL versions because "LP64" is just a new name for the only interface that the Intel MKL versions lower than 9.1 provided.
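To make the 2^31-1 limit concrete, the sketch below checks whether the element count of a rows x cols array can still be indexed with a 32-bit signed integer, as the LP64 libraries do. The function name fits_32bit_index is hypothetical and is not part of Intel MKL; it only illustrates the arithmetic behind the choice of interface.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if rows*cols elements can be indexed
   with a 32-bit signed integer (at most 2^31-1 elements), 0 otherwise. */
int fits_32bit_index(int64_t rows, int64_t cols)
{
    assert(rows >= 0 && cols >= 0);
    if (rows == 0 || cols == 0) return 1;
    /* Compare by division so the 64-bit product itself cannot overflow. */
    return rows <= (int64_t)INT32_MAX / cols;
}
```

In this model a 46340 x 46340 matrix still fits, while a 46341 x 46341 matrix (2,147,488,281 elements) does not, so an application handling arrays of that size would need the ILP64 interface.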
Choose the ILP64 interface if your application uses Intel MKL for calculations with large data arrays, or may do so in the future.

Intel MKL provides the same include directory for the ILP64 and LP64 interfaces.

Compiling for LP64/ILP64

The table below shows how to compile for the ILP64 and LP64 interfaces:

    Fortran
    Compiling for ILP64   ifort -i8 -I<mkl directory>/include ...
    Compiling for LP64    ifort -I<mkl directory>/include ...

    C or C++
    Compiling for ILP64   icc -DMKL_ILP64 -I<mkl directory>/include ...
    Compiling for LP64    icc -I<mkl directory>/include ...

CAUTION Linking an application compiled with the -i8 or -DMKL_ILP64 option to the LP64 libraries may result in unpredictable consequences and erroneous output.

Coding for ILP64

You do not need to change existing code if you are not using the ILP64 interface.

To migrate to ILP64 or write new code for ILP64, use appropriate types for parameters of the Intel MKL functions and subroutines:

    Integer Types                                Fortran                           C or C++
    32-bit integers                              INTEGER*4 or INTEGER(KIND=4)      int
    Universal integers for ILP64/LP64:           INTEGER without specifying KIND   MKL_INT
      64-bit for ILP64, 32-bit otherwise
    Universal integers for ILP64/LP64:           INTEGER*8 or INTEGER(KIND=8)      MKL_INT64
      64-bit integers
    FFT interface integers for ILP64/LP64        INTEGER without specifying KIND   MKL_LONG

To determine the type of an integer parameter of a function, use appropriate include files. For functions that support only a Fortran interface, use the C/C++ include files *.h.

The above table explains which integer parameters of functions become 64-bit and which remain 32-bit for ILP64. The table applies to most Intel MKL functions except some VML and VSL functions, which require integer parameters to be 64-bit or 32-bit regardless of the interface:

• VML: The mode parameter of VML functions is 64-bit.
• Random Number Generators (RNG): All discrete RNG except viRngUniformBits64 are 32-bit. The viRngUniformBits64 generator function and vslSkipAheadStream service function are 64-bit.
• Summary Statistics: The estimate parameter of the vslsSSCompute/vsldSSCompute function is 64-bit.

Refer to the Intel MKL Reference Manual for more information.

To better understand ILP64 interface details, see also examples and tests.

Limitations

All Intel MKL function domains support ILP64 programming with the following exceptions:

• FFTW interfaces to Intel MKL:
    • FFTW 2.x wrappers do not support ILP64.
    • FFTW 3.2 wrappers support ILP64 by a dedicated set of functions plan_guru64.
• GMP* Arithmetic Functions do not support ILP64.

NOTE GMP Arithmetic Functions are deprecated and will be removed in a future release.

See Also
High-level Directory Structure
Include Files
Language Interfaces Support, by Function Domain
Layered Model Concept
Directory Structure in Detail

Linking with Fortran 95 Interface Libraries

The libmkl_blas95*.a and libmkl_lapack95*.a libraries contain Fortran 95 interfaces for BLAS and LAPACK, respectively, which are compiler-dependent. In the Intel MKL package, they are prebuilt for the Intel® Fortran compiler. If you are using a different compiler, build these libraries before using the interface.

See Also
Fortran 95 Interfaces to LAPACK and BLAS
Compiler-dependent Functions and Fortran 90 Modules

Linking with Threading Libraries

Sequential Mode of the Library

You can use Intel MKL in a sequential (non-threaded) mode. In this mode, Intel MKL runs unthreaded code. However, it is thread-safe (except the LAPACK deprecated routine ?lacon), which means that you can use it in a parallel region in your OpenMP* code. The sequential mode requires no compatibility OpenMP* run-time library and does not respond to the environment variable OMP_NUM_THREADS or its Intel MKL equivalents.

You should use the library in the sequential mode only if you have a particular reason not to use Intel MKL threading.
The sequential mode may be helpful when using Intel MKL with programs threaded with some non-Intel compilers or in other situations where you need a non-threaded version of the library (for instance, in some MPI cases). To set the sequential mode, in the Threading layer, choose the *sequential.* library.

Add the POSIX threads library (pthread) to your link line for the sequential mode because the *sequential.* library depends on pthread.

See Also
Directory Structure in Detail
Using Parallelism of the Intel® Math Kernel Library
Avoiding Conflicts in the Execution Environment
Linking Examples

Selecting the Threading Layer

Several compilers that Intel MKL supports use the OpenMP* threading technology. Intel MKL supports implementations of the OpenMP* technology that these compilers provide. To make use of this support, you need to link with the appropriate library in the Threading Layer and Compiler Support Run-time Library (RTL).

Threading Layer

Each Intel MKL threading library contains the same code compiled by the respective compiler (Intel, gnu and PGI* compilers on Linux OS).

RTL

This layer includes libiomp, the compatibility OpenMP* run-time library of the Intel compiler. In addition to the Intel compiler, libiomp provides support for one more threading compiler on Linux OS (GNU). That is, a program threaded with a GNU compiler can safely be linked with Intel MKL and libiomp.

The table below helps explain what threading library and RTL you should choose under different scenarios when using Intel MKL (static cases only):

    Compiler   Application Threaded?   Threading Layer                              RTL Recommended                              Comment
    Intel      Does not matter         libmkl_intel_thread.a                        libiomp5.so
    PGI        Yes                     libmkl_pgi_thread.a or libmkl_sequential.a   PGI* supplied                                Use of libmkl_sequential.a removes threading from Intel MKL calls.
    PGI        No                      libmkl_intel_thread.a                        libiomp5.so
    PGI        No                      libmkl_pgi_thread.a                          PGI* supplied
    PGI        No                      libmkl_sequential.a                          None
    gnu        Yes                     libmkl_gnu_thread.a                          libiomp5.so or GNU OpenMP run-time library   libiomp5 offers superior scaling performance.
    gnu        Yes                     libmkl_sequential.a                          None
    gnu        No                      libmkl_intel_thread.a                        libiomp5.so
    other      Yes                     libmkl_sequential.a                          None
    other      No                      libmkl_intel_thread.a                        libiomp5.so

Linking with Computational Libraries

If you are not using the Intel MKL cluster software, you need to link your application with only one computational library, depending on the linking method:

    Static Linking   Dynamic Linking
    libmkl_core.a    libmkl_core.so

Computational Libraries for Applications that Use the Intel MKL Cluster Software

ScaLAPACK and Cluster Fourier Transform Functions (Cluster FFT) require more computational libraries, which may depend on your architecture.

The following table lists computational libraries for IA-32 architecture applications that use ScaLAPACK or Cluster FFT.

Computational Libraries for IA-32 Architecture

    Function domain                          Static Linking                          Dynamic Linking
    ScaLAPACK †                              libmkl_scalapack_core.a libmkl_core.a   libmkl_scalapack_core.so libmkl_core.so
    Cluster Fourier Transform Functions †    libmkl_cdft_core.a libmkl_core.a        libmkl_cdft_core.so libmkl_core.so

† Also add the library with BLACS routines corresponding to the MPI used.

The following table lists computational libraries for Intel® 64 architecture applications that use ScaLAPACK or Cluster FFT.

Computational Libraries for the Intel® 64 Architecture

    Function domain                          Static Linking                           Dynamic Linking
    ScaLAPACK, LP64 interface †              libmkl_scalapack_lp64.a libmkl_core.a    libmkl_scalapack_lp64.so libmkl_core.so
    ScaLAPACK, ILP64 interface †             libmkl_scalapack_ilp64.a libmkl_core.a   libmkl_scalapack_ilp64.so libmkl_core.so
    Cluster Fourier Transform Functions †    libmkl_cdft_core.a libmkl_core.a         libmkl_cdft_core.so libmkl_core.so

† Also add the library with BLACS routines corresponding to the MPI used.

See Also
Linking with ScaLAPACK and Cluster FFTs
Using the Link-line Advisor
Using the ILP64 Interface vs. LP64 Interface

Linking with Compiler Run-time Libraries

Dynamically link libiomp, the compatibility OpenMP* run-time library, even if you link other libraries statically.

Linking to libiomp statically can be problematic because the more complex your operating environment or application, the more likely redundant copies of the library are included. This may result in performance issues (oversubscription of threads) and even incorrect results.

To link libiomp dynamically, be sure the LD_LIBRARY_PATH environment variable is defined correctly.

See Also
Scripts to Set Environment Variables
Layered Model Concept

Linking with System Libraries

To use the Intel MKL FFT, Trigonometric Transform, or Poisson, Laplace, and Helmholtz Solver routines, link in the math support system library by adding "-lm" to the link line.

On Linux OS, the libiomp library relies on the native pthread library for multi-threading. Any time libiomp is required, add -lpthread to your link line afterwards (the order of listing libraries is important).

Building Custom Shared Objects

Custom shared objects reduce the collection of functions available in Intel MKL libraries to those required to solve your particular problems, which helps to save disk space and build your own dynamic libraries for distribution.
The Intel MKL custom shared object builder, located in the tools/builder directory, enables you to create a dynamic library (shared object) containing the selected functions. The builder contains a makefile and a definition file with the list of functions.

NOTE The objects in Intel MKL static libraries are position-independent code (PIC), which is not typical for static libraries. Therefore, the custom shared object builder can create a shared object from a subset of Intel MKL functions by picking the respective object files from the static libraries.

Using the Custom Shared Object Builder

To build a custom shared object, use the following command:

    make target [<options>]

The following table lists possible values of target and explains what the command does for each value:

    Value        Comment
    libia32      The builder uses static Intel MKL interface, threading, and core libraries to build a custom shared object for the IA-32 architecture.
    libintel64   The builder uses static Intel MKL interface, threading, and core libraries to build a custom shared object for the Intel® 64 architecture.
    soia32       The builder uses the single dynamic library libmkl_rt.so to build a custom shared object for the IA-32 architecture.
    sointel64    The builder uses the single dynamic library libmkl_rt.so to build a custom shared object for the Intel® 64 architecture.
    help         The command prints Help on the custom shared object builder.

The <options> placeholder stands for the list of parameters that define macros to be used by the makefile. The following table describes these parameters:

    Parameter [Values]                  Description
    interface = {lp64|ilp64}            Defines whether to use the LP64 or ILP64 programming interface for the Intel 64 architecture. The default value is lp64.
    threading = {parallel|sequential}   Defines whether to use Intel MKL in the threaded or sequential mode. The default value is parallel.
    export = <file name>                Specifies the full name of the file that contains the list of entry-point functions to be included in the shared object. The default name is user_example_list (no extension).
    name = <so name>                    Specifies the name of the library to be created. By default, the name of the created library is mkl_custom.so.
    xerbla = <error handler>            Specifies the name of the object file (.o) that contains the user's error handler. The makefile adds this error handler to the library for use instead of the default Intel MKL error handler xerbla. If you omit this parameter, the native Intel MKL xerbla is used. See the description of the xerbla function in the Intel MKL Reference Manual on how to develop your own error handler.
    MKLROOT = <mkl directory>           Specifies the location of Intel MKL libraries used to build the custom shared object. By default, the builder uses the Intel MKL installation directory.

All the above parameters are optional.

In the simplest case, the command line is make libia32, and the missing options have default values. This command creates the mkl_custom.so library for processors using the IA-32 architecture. The command takes the list of functions from the user_example_list file and uses the native Intel MKL error handler xerbla.

An example of a more complex case follows:

    make libia32 export=my_func_list.txt name=mkl_small xerbla=my_xerbla.o

In this case, the command creates the mkl_small.so library for processors using the IA-32 architecture. The command takes the list of functions from the my_func_list.txt file and uses the user's error handler my_xerbla.o.

The process is similar for processors using the Intel® 64 architecture.

See Also
Using the Single Dynamic Library

Composing a List of Functions

To compose a list of functions for a minimal custom shared object needed for your application, you can use the following procedure:

1. Link your application with installed Intel MKL libraries to make sure the application builds.
2. Remove all Intel MKL libraries from the link line and start linking. Unresolved symbols indicate Intel MKL functions that your application uses.
3. Include these functions in the list.

Important Each time your application starts using more Intel MKL functions, update the list to include the new functions.

See Also
Specifying Function Names

Specifying Function Names

In the file with the list of functions for your custom shared object, adjust function names to the required interface. For example, for Fortran functions append an underscore character "_" to the names as a suffix:

    dgemm_
    ddot_
    dgetrf_

For more examples, see domain-specific lists of functions in the <mkl directory>/tools/builder folder.

NOTE The lists of functions are provided in the <mkl directory>/tools/builder folder merely as examples. See Composing a List of Functions for how to compose lists of functions for your custom shared object.

TIP Names of Fortran-style routines (BLAS, LAPACK, etc.) can be either upper-case or lower-case, with or without the trailing underscore. For example, these names are equivalent:
    BLAS: dgemm, DGEMM, dgemm_, DGEMM_
    LAPACK: dgetrf, DGETRF, dgetrf_, DGETRF_

Properly capitalize names of C support functions in the function list. To do this, follow the guidelines below:

1. In the mkl_service.h include file, look up a #define directive for your function.
2. Take the function name from the replacement part of that directive.

For example, the #define directive for the mkl_disable_fast_mm function is
    #define mkl_disable_fast_mm MKL_Disable_Fast_MM
Capitalize the name of this function in the list like this: MKL_Disable_Fast_MM.

For the names of the Fortran support functions, see the tip.

NOTE If selected functions have several processor-specific versions, the builder automatically includes them all in the custom library and the dispatcher manages them.

Distributing Your Custom Shared Object

To enable use of your custom shared object in a threaded mode, distribute libiomp5.so along with the custom shared object.
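Returning to the naming rules in Specifying Function Names: because Fortran-style names are accepted in either case, with or without the trailing underscore, a list of functions can be kept tidy by reducing every spelling to one form. The sketch below does this for a single name. It is only an illustration: normalize_fortran_name is a hypothetical helper, not part of Intel MKL or the builder, and it must not be applied to C support functions such as MKL_Disable_Fast_MM, whose capitalization matters.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: writes the lower-case form of a Fortran-style
   routine name with exactly one trailing underscore into out.
   Returns 0 on success, -1 if out_size is too small. */
int normalize_fortran_name(const char *name, char *out, size_t out_size)
{
    size_t len, i;

    assert(name != NULL && out != NULL);
    len = strlen(name);
    if (len > 0 && name[len - 1] == '_')
        len--;                              /* drop an existing suffix */
    if (len + 2 > out_size)
        return -1;                          /* name + '_' + '\0' must fit */
    for (i = 0; i < len; i++)
        out[i] = (char)tolower((unsigned char)name[i]);
    out[len] = '_';
    out[len + 1] = '\0';
    return 0;
}
```

With this helper, dgemm, DGEMM, dgemm_, and DGEMM_ all become dgemm_, which matches the underscore-suffixed form used in the example list.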
Managing Performance and Memory

Optimization Notice

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

Using Parallelism of the Intel® Math Kernel Library

Intel MKL is extensively parallelized. See Threaded Functions and Problems for lists of threaded functions and problems that can be threaded.

Intel MKL is thread-safe, which means that all Intel MKL functions (except the LAPACK deprecated routine ?lacon) work correctly during simultaneous execution by multiple threads. In particular, any chunk of threaded Intel MKL code provides access for multiple threads to the same shared data, while permitting only one thread at any given time to access a shared piece of data. Therefore, you can call Intel MKL from multiple threads and not worry about the function instances interfering with each other.

The library uses OpenMP* threading software, so you can use the environment variable OMP_NUM_THREADS to specify the number of threads or the equivalent OpenMP run-time function calls. Intel MKL also offers variables that are independent of OpenMP, such as MKL_NUM_THREADS, and equivalent Intel MKL functions for thread management.
The Intel MKL variables are always inspected first, then the OpenMP variables are examined, and if neither is used, the OpenMP software chooses the default number of threads.

By default, Intel MKL uses the number of threads equal to the number of physical cores on the system.

To achieve higher performance, set the number of threads to the number of real processors or physical cores, as summarized in Techniques to Set the Number of Threads.

See Also
Managing Multi-core Performance

Threaded Functions and Problems

The following Intel MKL function domains are threaded:

• Direct sparse solver.
• LAPACK. For the list of threaded routines, see Threaded LAPACK Routines.
• Level1 and Level2 BLAS. For the list of threaded routines, see Threaded BLAS Level1 and Level2 Routines.
• All Level 3 BLAS and all Sparse BLAS routines except Level 2 Sparse Triangular solvers.
• All mathematical VML functions.
• FFT. For the list of FFT transforms that can be threaded, see Threaded FFT Problems.

Threaded LAPACK Routines

In the following list, ? stands for a precision prefix of each flavor of the respective routine and may have the value of s, d, c, or z.

The following LAPACK routines are threaded:

• Linear equations, computational routines:
    • Factorization: ?getrf, ?gbtrf, ?potrf, ?pptrf, ?sytrf, ?hetrf, ?sptrf, ?hptrf
    • Solving: ?dttrsb, ?gbtrs, ?gttrs, ?pptrs, ?pbtrs, ?pttrs, ?sytrs, ?sptrs, ?hptrs, ?tptrs, ?tbtrs
• Orthogonal factorization, computational routines: ?geqrf, ?ormqr, ?unmqr, ?ormlq, ?unmlq, ?ormql, ?unmql, ?ormrq, ?unmrq
• Singular Value Decomposition, computational routines: ?gebrd, ?bdsqr
• Symmetric Eigenvalue Problems, computational routines: ?sytrd, ?hetrd, ?sptrd, ?hptrd, ?steqr, ?stedc
• Generalized Nonsymmetric Eigenvalue Problems, computational routines: chgeqz/zhgeqz
A number of other LAPACK routines, which are based on threaded LAPACK or BLAS routines, make effective use of parallelism: ?gesv, ?posv, ?gels, ?gesvd, ?syev, ?heev, cgegs/zgegs, cgegv/zgegv, cgges/zgges, cggesx/zggesx, cggev/zggev, cggevx/zggevx, and so on.

Threaded BLAS Level1 and Level2 Routines

In the following list, ? stands for a precision prefix of each flavor of the respective routine and may have the value of s, d, c, or z.

The following routines are threaded for Intel® Core™2 Duo and Intel® Core™ i7 processors:

• Level1 BLAS: ?axpy, ?copy, ?swap, ddot/sdot, cdotc, drot/srot
• Level2 BLAS: ?gemv, ?trmv, dsyr/ssyr, dsyr2/ssyr2, dsymv/ssymv

Threaded FFT Problems

The following characteristics of a specific problem determine whether your FFT computation may be threaded:

• rank
• domain
• size/length
• precision (single or double)
• placement (in-place or out-of-place)
• strides
• number of transforms
• layout (for example, interleaved or split layout of complex data)

Most FFT problems are threaded. In particular, computation of multiple transforms in one call (number of transforms > 1) is threaded. Details of which transforms are threaded follow.

One-dimensional (1D) transforms

1D transforms are threaded in many cases.

1D complex-to-complex (c2c) transforms of size N using interleaved complex data layout are threaded under the following conditions depending on the architecture:

    Architecture   Conditions
    Intel® 64      N is a power of 2, log2(N) > 9, the transform is double-precision out-of-place, and input/output strides equal 1.
    IA-32          N is a power of 2, log2(N) > 13, and the transform is single-precision.
                   N is a power of 2, log2(N) > 14, and the transform is double-precision.
    Any            N is composite, log2(N) > 16, and input/output strides equal 1.

1D real-to-complex and complex-to-real transforms are not threaded.

1D complex-to-complex transforms using split-complex layout are not threaded.
Prime-size complex-to-complex 1D transforms are not threaded.

Multidimensional transforms

All multidimensional transforms on large-volume data are threaded.

Avoiding Conflicts in the Execution Environment

Certain situations can cause conflicts in the execution environment that make the use of threads in Intel MKL problematic. This section briefly discusses why these problems exist and how to avoid them.

If you thread the program using OpenMP directives and compile the program with Intel compilers, Intel MKL and the program will both use the same threading library. Intel MKL tries to determine if it is in a parallel region in the program, and if it is, it does not spread its operations over multiple threads unless you specifically request Intel MKL to do so via the MKL_DYNAMIC functionality. However, Intel MKL can be aware that it is in a parallel region only if the threaded program and Intel MKL are using the same threading library. If your program is threaded by some other means, Intel MKL may operate in multithreaded mode, and the performance may suffer due to overuse of the resources.

The following table considers several cases where the conflicts may arise and provides recommendations depending on your threading model:

• Threading model: You thread the program using OS threads (pthreads on Linux* OS).
  Discussion: If more than one thread calls Intel MKL, and the function being called is threaded, it may be important that you turn off Intel MKL threading. Set the number of threads to one by any of the available means (see Techniques to Set the Number of Threads).

• Threading model: You thread the program using OpenMP directives and/or pragmas and compile the program using a compiler other than a compiler from Intel.
  Discussion: This is more problematic because setting of the OMP_NUM_THREADS environment variable affects both the compiler's threading library and libiomp. In this case, choose the threading library that matches the layered Intel MKL with the OpenMP compiler you employ (see Linking Examples on how to do this). If this is not possible, use Intel MKL in the sequential mode. To do this, you should link with the appropriate threading library: libmkl_sequential.a or libmkl_sequential.so (see High-level Directory Structure).

• Threading model: There are multiple programs running on a multiple-CPU system, for example, a parallelized program that runs using MPI for communication in which each processor is treated as a node.
  Discussion: The threading software will see multiple processors on the system even though each processor has a separate MPI process running on it. In this case, one of the solutions is to set the number of threads to one by any of the available means (see Techniques to Set the Number of Threads). Section Intel(R) Optimized MP LINPACK Benchmark for Clusters discusses another solution for a Hybrid (OpenMP* + MPI) mode.

See Also
Using Additional Threading Control
Linking with Compiler Run-time Libraries

Techniques to Set the Number of Threads

Use one of the following techniques to change the number of threads to use in Intel MKL:

• Set one of the OpenMP or Intel MKL environment variables:
    • OMP_NUM_THREADS
    • MKL_NUM_THREADS
    • MKL_DOMAIN_NUM_THREADS
• Call one of the OpenMP or Intel MKL functions:
    • omp_set_num_threads()
    • mkl_set_num_threads()
    • mkl_domain_set_num_threads()

When choosing the appropriate technique, take into account the following rules:

• The Intel MKL threading controls take precedence over the OpenMP controls because they are inspected first.
• A function call takes precedence over any environment variables. The exception, which is a consequence of the previous rule, is the OpenMP subroutine omp_set_num_threads(), which does not have precedence over Intel MKL environment variables, such as MKL_NUM_THREADS. See Using Additional Threading Control for more details.
• You cannot change run-time behavior in the course of the run using the environment variables because they are read only once at the first call to Intel MKL. Setting the Number of Threads Using an OpenMP* Environment Variable You can set the number of threads using the environment variable OMP_NUM_THREADS. To change the number of threads, use the appropriate command in the command shell in which the program is going to run, for example: • For the bash shell, enter: export OMP_NUM_THREADS= • For the csh or tcsh shell, enter: set OMP_NUM_THREADS= See Also Using Additional Threading Control Changing the Number of Threads at Run Time You cannot change the number of threads during run time using environment variables. However, you can call OpenMP API functions from your program to change the number of threads during run time. The following sample code shows how to change the number of threads during run time using the omp_set_num_threads() routine. See also Techniques to Set the Number of Threads. The following example shows both C and Fortran code examples. To run this example in the C language, use the omp.h header file from the Intel(R) compiler package. If you do not have the Intel compiler but wish to explore the functionality in the example, use Fortran API for omp_set_num_threads() rather than the C version. For example, omp_set_num_threads_( &i_one ); // ******* C language ******* #include "omp.h" 5 Intel® Math Kernel Library for Linux* OS User's Guide 44#include "mkl.h" #include #define SIZE 1000 int main(int args, char *argv[]){ double *a, *b, *c; a = (double*)malloc(sizeof(double)*SIZE*SIZE); b = (double*)malloc(sizeof(double)*SIZE*SIZE); c = (double*)malloc(sizeof(double)*SIZE*SIZE); double alpha=1, beta=1; int m=SIZE, n=SIZE, k=SIZE, lda=SIZE, ldb=SIZE, ldc=SIZE, i=0, j=0; char transa='n', transb='n'; for( i=0; i #include ... mkl_set_num_threads ( 1 ); // ******* Fortran language ******* ... 
    call mkl_set_num_threads( 1 )

See the Intel MKL Reference Manual for the detailed description of the threading control functions, their parameters, calling syntax, and more code examples.

MKL_DYNAMIC

The MKL_DYNAMIC environment variable enables Intel MKL to dynamically change the number of threads. The default value of MKL_DYNAMIC is TRUE, regardless of OMP_DYNAMIC, whose default value may be FALSE.

When MKL_DYNAMIC is TRUE, Intel MKL tries to use what it considers the best number of threads, up to the maximum number you specify. For example, MKL_DYNAMIC set to TRUE enables optimal choice of the number of threads in the following cases:
• If the requested number of threads exceeds the number of physical cores (perhaps because of using the Intel® Hyper-Threading Technology), and MKL_DYNAMIC is not changed from its default value of TRUE, Intel MKL will scale down the number of threads to the number of physical cores.
• If you are able to detect the presence of MPI, but cannot determine if it has been called in a thread-safe mode (it is impossible to detect this with MPICH 1.2.x, for instance), and MKL_DYNAMIC has not been changed from its default value of TRUE, Intel MKL will run one thread.

When MKL_DYNAMIC is FALSE, Intel MKL tries not to deviate from the number of threads the user requested. However, setting MKL_DYNAMIC=FALSE does not ensure that Intel MKL will use the number of threads that you request. The library may have no choice on this number for such reasons as system resources. Additionally, the library may examine the problem and use a different number of threads than the value suggested. For example, if you attempt to do a size one matrix-matrix multiply across eight threads, the library may instead choose to use only one thread because it is impractical to use eight threads in this event.

Note also that if Intel MKL is called in a parallel region, it will use only one thread by default.
If you want the library to use nested parallelism, and the thread within a parallel region is compiled with the same OpenMP compiler as Intel MKL is using, you may experiment with setting MKL_DYNAMIC to FALSE and manually increasing the number of threads.

In general, set MKL_DYNAMIC to FALSE only under circumstances that Intel MKL is unable to detect, for example, to use nested parallelism where the library is already called from a parallel section.

MKL_DOMAIN_NUM_THREADS

The MKL_DOMAIN_NUM_THREADS environment variable suggests the number of threads for a particular function domain. MKL_DOMAIN_NUM_THREADS accepts a string value <MKL-env-string>, which must have the following format:

    <MKL-env-string> ::= <MKL-domain-env-string> { <delimiter> <MKL-domain-env-string> }
    <delimiter> ::= [ <space-symbol>* ] ( <space-symbol> | <comma-symbol> | <semicolon-symbol> | <colon-symbol> ) [ <space-symbol>* ]
    <MKL-domain-env-string> ::= <MKL-domain-env-name> <uses> <number-of-threads>
    <MKL-domain-env-name> ::= MKL_DOMAIN_ALL | MKL_DOMAIN_BLAS | MKL_DOMAIN_FFT | MKL_DOMAIN_VML | MKL_DOMAIN_PARDISO
    <uses> ::= [ <space-symbol>* ] ( <space-symbol> | <equality-sign> | <comma-symbol> ) [ <space-symbol>* ]
    <number-of-threads> ::= <positive-number>
    <positive-number> ::= <decimal-positive-number> | <octal-number> | <hexadecimal-number>

In the syntax above, values of <MKL-domain-env-name> indicate function domains as follows:

    MKL_DOMAIN_ALL        All function domains
    MKL_DOMAIN_BLAS       BLAS Routines
    MKL_DOMAIN_FFT        non-cluster Fourier Transform Functions
    MKL_DOMAIN_VML        Vector Mathematical Functions
    MKL_DOMAIN_PARDISO    PARDISO

For example,

    MKL_DOMAIN_ALL 2 : MKL_DOMAIN_BLAS 1 : MKL_DOMAIN_FFT 4
    MKL_DOMAIN_ALL=2 : MKL_DOMAIN_BLAS=1 : MKL_DOMAIN_FFT=4
    MKL_DOMAIN_ALL=2, MKL_DOMAIN_BLAS=1, MKL_DOMAIN_FFT=4
    MKL_DOMAIN_ALL=2; MKL_DOMAIN_BLAS=1; MKL_DOMAIN_FFT=4
    MKL_DOMAIN_ALL = 2 MKL_DOMAIN_BLAS 1 , MKL_DOMAIN_FFT 4
    MKL_DOMAIN_ALL,2: MKL_DOMAIN_BLAS 1, MKL_DOMAIN_FFT,4

The global variables MKL_DOMAIN_ALL, MKL_DOMAIN_BLAS, MKL_DOMAIN_FFT, MKL_DOMAIN_VML, and MKL_DOMAIN_PARDISO, as well as the interface for the Intel MKL threading control functions, can be found in the mkl.h header file.

The table below illustrates how values of MKL_DOMAIN_NUM_THREADS are interpreted.

    Value of MKL_DOMAIN_NUM_THREADS         Interpretation
    MKL_DOMAIN_ALL=4                        All parts of Intel MKL should try four threads. The actual number of threads may be still different because of the MKL_DYNAMIC setting or system resource issues. The setting is equivalent to MKL_NUM_THREADS = 4.
    MKL_DOMAIN_ALL=1, MKL_DOMAIN_BLAS=4     All parts of Intel MKL should try one thread, except for BLAS, which is suggested to try four threads.
    MKL_DOMAIN_VML=2                        VML should try two threads. The setting affects no other part of Intel MKL.

Be aware that the domain-specific settings take precedence over the overall ones. For example, the "MKL_DOMAIN_BLAS=4" value of MKL_DOMAIN_NUM_THREADS suggests trying four threads for BLAS, regardless of later setting MKL_NUM_THREADS, and a function call "mkl_domain_set_num_threads ( 4, MKL_DOMAIN_BLAS );" suggests the same, regardless of later calls to mkl_set_num_threads(). However, a function call with input "MKL_DOMAIN_ALL", such as "mkl_domain_set_num_threads (4, MKL_DOMAIN_ALL);" is equivalent to "mkl_set_num_threads(4)", and thus it will be overwritten by later calls to mkl_set_num_threads. Similarly, the environment setting of MKL_DOMAIN_NUM_THREADS with "MKL_DOMAIN_ALL=4" will be overwritten with MKL_NUM_THREADS = 2.

Whereas the MKL_DOMAIN_NUM_THREADS environment variable enables you to set several values at once, for example, "MKL_DOMAIN_BLAS=4,MKL_DOMAIN_FFT=2", the corresponding function does not take string syntax. So, to do the same with the function calls, you may need to make several calls, which in this example are as follows:

    mkl_domain_set_num_threads ( 4, MKL_DOMAIN_BLAS );
    mkl_domain_set_num_threads ( 2, MKL_DOMAIN_FFT );

Setting the Environment Variables for Threading Control

To set the environment variables used for threading control, in the command shell in which the program is going to run, enter the export or setenv commands, depending on the shell you use.

For example, for a bash shell, use the export commands:

    export <VARIABLE NAME>=<value>

For example:

    export MKL_NUM_THREADS=4
    export MKL_DOMAIN_NUM_THREADS="MKL_DOMAIN_ALL=1, MKL_DOMAIN_BLAS=4"
    export MKL_DYNAMIC=FALSE

For the csh or tcsh shell, use the setenv commands, because set defines only a shell variable that is not passed to the program's environment:

    setenv <VARIABLE NAME> <value>

For example:

    setenv MKL_NUM_THREADS 4
    setenv MKL_DOMAIN_NUM_THREADS "MKL_DOMAIN_ALL=1, MKL_DOMAIN_BLAS=4"
    setenv MKL_DYNAMIC FALSE

Tips and Techniques to Improve Performance

Coding Techniques

To obtain the best performance with Intel MKL, ensure the following data alignment in your source code:
• Align arrays on 16-byte boundaries. See Aligning Addresses on 16-byte Boundaries for how to do it.
• Make sure leading dimension values (n*element_size) of two-dimensional arrays are divisible by 16, where element_size is the size of an array element in bytes.
• For two-dimensional arrays, avoid leading dimension values divisible by 2048 bytes. For example, for a double-precision array, with element_size = 8, avoid leading dimensions 256, 512, 768, 1024, … (elements).

LAPACK Packed Routines

The routines with the names that contain the letters HP, OP, PP, SP, TP, UP in the matrix type and storage position (the second and third letters respectively) operate on the matrices in the packed format (see LAPACK "Routine Naming Conventions" sections in the Intel MKL Reference Manual). Their functionality is strictly equivalent to the functionality of the unpacked routines with the names containing the letters HE, OR, PO, SY, TR, UN in the same positions, but the performance is significantly lower.

If the memory restriction is not too tight, use an unpacked routine for better performance. In this case, you need to allocate about N^2/2 more memory than the memory required by a respective packed routine, where N is the problem size (the number of equations).
For example, to speed up solving a symmetric eigenproblem with an expert driver, use the unpacked routine:

    call dsyevx(jobz, range, uplo, n, a, lda, vl, vu, il, iu, abstol, m, w, z, ldz, work, lwork, iwork, ifail, info)

where a has the dimension lda-by-n, which is at least N^2 elements, instead of the packed routine:

    call dspevx(jobz, range, uplo, n, ap, vl, vu, il, iu, abstol, m, w, z, ldz, work, iwork, ifail, info)

where ap has the dimension N*(N+1)/2.

FFT Functions

Additional conditions can improve performance of the FFT functions. The addresses of the first elements of arrays and the leading dimension values, in bytes (n*element_size), of two-dimensional arrays should be divisible by cache line size, which equals:
• 32 bytes for the Intel® Pentium® III processors
• 64 bytes for the Intel® Pentium® 4 processors and processors using Intel® 64 architecture

Hardware Configuration Tips

Dual-Core Intel® Xeon® processor 5100 series systems

To get the best performance with Intel MKL on Dual-Core Intel® Xeon® processor 5100 series systems, enable the Hardware DPL (streaming data) Prefetcher functionality of this processor. To configure this functionality, use the appropriate BIOS settings, as described in your BIOS documentation.

Intel® Hyper-Threading Technology

Intel® Hyper-Threading Technology (Intel® HT Technology) is especially effective when each thread performs different types of operations and when there are under-utilized resources on the processor. However, Intel MKL fits neither of these criteria because the threaded portions of the library execute at high efficiencies using most of the available resources and perform identical operations on each thread. You may obtain higher performance by disabling Intel HT Technology.

If you run with Intel HT Technology enabled, performance may be especially impacted if you run on fewer threads than physical cores.
Moreover, if, for example, there are two threads to every physical core, the thread scheduler may assign two threads to some cores and ignore the other cores altogether. If you are using the OpenMP* library of the Intel Compiler, read the respective User Guide on how to best set the thread affinity interface to avoid this situation. For Intel MKL, apply the following setting (shown for the bash shell):

    export KMP_AFFINITY=granularity=fine,compact,1,0

See Also
Using Parallelism of the Intel® Math Kernel Library

Managing Multi-core Performance

You can obtain best performance on systems with multi-core processors by requiring that threads do not migrate from core to core. To do this, bind threads to the CPU cores by setting an affinity mask to threads. Use one of the following options:
• OpenMP facilities (recommended, if available), for example, the KMP_AFFINITY environment variable using the Intel OpenMP library
• A system function, as explained below

Consider the following performance issue:
• The system has two sockets with two cores each, for a total of four cores (CPUs)
• The two-thread parallel application that calls the Intel MKL FFT happens to run faster than in four threads, but the performance in two threads is very unstable

The following code example shows how to resolve this issue by setting an affinity mask by operating system means using the Intel compiler. The code calls the system function sched_setaffinity to bind the threads to the cores on different sockets. Then the Intel MKL FFT function is called:

    #define _GNU_SOURCE //for using the GNU CPU affinity
    // (works with the appropriate kernel and glibc)
    // Set affinity mask
    #include <sched.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <omp.h>
    int main(void) {
        int NCPUs = sysconf(_SC_NPROCESSORS_CONF);
        printf("Using thread affinity on %i NCPUs\n", NCPUs);
    #pragma omp parallel default(shared)
        {
            cpu_set_t new_mask;
            cpu_set_t was_mask;
            int tid = omp_get_thread_num();
            CPU_ZERO(&new_mask);
            // 2 packages x 2 cores/pkg x 1 threads/core (4 total cores)
            CPU_SET(tid==0 ? 0 : 2, &new_mask);
            if (sched_getaffinity(0, sizeof(was_mask), &was_mask) == -1) {
                printf("Error: sched_getaffinity(%d, sizeof(was_mask), &was_mask)\n", tid);
            }
            if (sched_setaffinity(0, sizeof(new_mask), &new_mask) == -1) {
                printf("Error: sched_setaffinity(%d, sizeof(new_mask), &new_mask)\n", tid);
            }
            printf("tid=%d new_mask=%08X was_mask=%08X\n", tid,
                   *(unsigned int*)(&new_mask), *(unsigned int*)(&was_mask));
        }
        // Call Intel MKL FFT function
        return 0;
    }

Compile the application with the Intel compiler using the following command:

    icc test_application.c -openmp

where test_application.c is the filename for the application.

Build the application. Run it in two threads, for example, by using the environment variable to set the number of threads:

    env OMP_NUM_THREADS=2 ./a.out

See the Linux Programmer's Manual (in man pages format) for particulars of the sched_setaffinity function used in the above example.

Operating on Denormals

The IEEE 754-2008 standard, "IEEE Standard for Floating-Point Arithmetic", defines denormal (or subnormal) numbers as non-zero numbers smaller than the smallest possible normalized numbers for a specific floating-point format. Floating-point operations on denormals are slower than on normalized operands because denormal operands and results are usually handled through a software assist mechanism rather than directly in hardware. This software processing causes Intel MKL functions that consume denormals to run slower than with normalized floating-point numbers.

You can mitigate this performance issue by setting the appropriate bit fields in the MXCSR floating-point control register to flush denormals to zero (FTZ) or to replace any denormals loaded from memory with zero (DAZ). Check your compiler documentation to determine whether it has options to control FTZ and DAZ. Note that these compiler options may slightly affect accuracy.
FFT Optimized Radices

You can improve the performance of Intel MKL FFT if the length of your data vector permits factorization into powers of optimized radices. In Intel MKL, the optimized radices are 2, 3, 5, 7, 11, and 13.

Using Memory Management

Intel MKL Memory Management Software

Intel MKL has memory management software that controls memory buffers for the use by the library functions. New buffers that the library allocates when your application calls Intel MKL are not deallocated until the program ends. To get the amount of memory allocated by the memory management software, call the mkl_mem_stat() function. If your program needs to free memory, call mkl_free_buffers(). If another call is made to a library function that needs a memory buffer, the memory manager again allocates the buffers and they again remain allocated until either the program ends or the program deallocates the memory. This behavior facilitates better performance. However, some tools may report this behavior as a memory leak.

The memory management software is turned on by default. To turn it off, set the MKL_DISABLE_FAST_MM environment variable to any value or call the mkl_disable_fast_mm() function. Be aware that this change may negatively impact performance of some Intel MKL routines, especially for small problem sizes.

Redefining Memory Functions

In C/C++ programs, you can replace Intel MKL memory functions that the library uses by default with your own functions. To do this, use the memory renaming feature.

Memory Renaming

Intel MKL memory management by default uses standard C run-time memory functions to allocate or free memory. These functions can be replaced using memory renaming.

Intel MKL accesses the memory functions by pointers i_malloc, i_free, i_calloc, and i_realloc, which are visible at the application level.
These pointers initially hold addresses of the standard C run-time memory functions malloc, free, calloc, and realloc, respectively. You can programmatically redefine values of these pointers to the addresses of your application's memory management functions.

Redirecting the pointers is the only correct way to use your own set of memory management functions. If you call your own memory functions without redirecting the pointers, the memory will get managed by two independent memory management packages, which may cause unexpected memory issues.

How to Redefine Memory Functions

To redefine memory functions, use the following procedure:
1. Include the i_malloc.h header file in your code. This header file contains all declarations required for replacing the memory allocation functions. The header file also describes how memory allocation can be replaced in those Intel libraries that support this feature.
2. Redefine values of pointers i_malloc, i_free, i_calloc, and i_realloc prior to the first call to MKL functions, as shown in the following example:

    #include "i_malloc.h"
    . . .
    i_malloc = my_malloc;
    i_calloc = my_calloc;
    i_realloc = my_realloc;
    i_free = my_free;
    . . .
    // Now you may call Intel MKL functions

Language-specific Usage Options

The Intel® Math Kernel Library (Intel® MKL) provides broad support for Fortran and C/C++ programming. However, not all functions support both Fortran and C interfaces. For example, some LAPACK functions have no C interface. You can call such functions from C using mixed-language programming.

If you want to use LAPACK or BLAS functions that support Fortran 77 in the Fortran 95 environment, additional effort may be initially required to build compiler-specific interface libraries and modules from the source code provided with Intel MKL.
Optimization Notice

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

Using Language-Specific Interfaces with Intel® Math Kernel Library

This section discusses mixed-language programming and the use of language-specific interfaces with Intel MKL.

See also Appendix G in the Intel MKL Reference Manual for details of the FFTW interfaces to Intel MKL.

Interface Libraries and Modules

You can create the following interface libraries and modules using the respective makefiles located in the interfaces directory.

    File name                     Contains

    Libraries, in Intel MKL architecture-specific directories
    libmkl_blas95.a (1)           Fortran 95 wrappers for BLAS (BLAS95) for IA-32 architecture.
    libmkl_blas95_ilp64.a (1)     Fortran 95 wrappers for BLAS (BLAS95) supporting ILP64 interface.
    libmkl_blas95_lp64.a (1)      Fortran 95 wrappers for BLAS (BLAS95) supporting LP64 interface.
    libmkl_lapack95.a (1)         Fortran 95 wrappers for LAPACK (LAPACK95) for IA-32 architecture.
    libmkl_lapack95_lp64.a (1)    Fortran 95 wrappers for LAPACK (LAPACK95) supporting LP64 interface.
    libmkl_lapack95_ilp64.a (1)   Fortran 95 wrappers for LAPACK (LAPACK95) supporting ILP64 interface.
    libfftw2xc_intel.a (1)        Interfaces for FFTW version 2.x (C interface for Intel compilers) to call Intel MKL FFTs.
    libfftw2xc_gnu.a              Interfaces for FFTW version 2.x (C interface for GNU compilers) to call Intel MKL FFTs.
    libfftw2xf_intel.a            Interfaces for FFTW version 2.x (Fortran interface for Intel compilers) to call Intel MKL FFTs.
    libfftw2xf_gnu.a              Interfaces for FFTW version 2.x (Fortran interface for GNU compilers) to call Intel MKL FFTs.
    libfftw3xc_intel.a (2)        Interfaces for FFTW version 3.x (C interface for Intel compilers) to call Intel MKL FFTs.
    libfftw3xc_gnu.a              Interfaces for FFTW version 3.x (C interface for GNU compilers) to call Intel MKL FFTs.
    libfftw3xf_intel.a (2)        Interfaces for FFTW version 3.x (Fortran interface for Intel compilers) to call Intel MKL FFTs.
    libfftw3xf_gnu.a              Interfaces for FFTW version 3.x (Fortran interface for GNU compilers) to call Intel MKL FFTs.
    libfftw2x_cdft_SINGLE.a       Single-precision interfaces for MPI FFTW version 2.x (C interface) to call Intel MKL cluster FFTs.
    libfftw2x_cdft_DOUBLE.a       Double-precision interfaces for MPI FFTW version 2.x (C interface) to call Intel MKL cluster FFTs.
    libfftw3x_cdft.a              Interfaces for MPI FFTW version 3.x (C interface) to call Intel MKL cluster FFTs.
    libfftw3x_cdft_ilp64.a        Interfaces for MPI FFTW version 3.x (C interface) to call Intel MKL cluster FFTs supporting the ILP64 interface.

    Modules, in architecture- and interface-specific subdirectories of the Intel MKL include directory
    blas95.mod (1)                Fortran 95 interface module for BLAS (BLAS95).
    lapack95.mod (1)              Fortran 95 interface module for LAPACK (LAPACK95).
    f95_precision.mod (1)         Fortran 95 definition of precision parameters for BLAS95 and LAPACK95.
    mkl95_blas.mod (1)            Fortran 95 interface module for BLAS (BLAS95), identical to blas95.mod. To be removed in one of the future releases.
    mkl95_lapack.mod (1)          Fortran 95 interface module for LAPACK (LAPACK95), identical to lapack95.mod. To be removed in one of the future releases.
    mkl95_precision.mod (1)       Fortran 95 definition of precision parameters for BLAS95 and LAPACK95, identical to f95_precision.mod. To be removed in one of the future releases.
    mkl_service.mod (1)           Fortran 95 interface module for Intel MKL support functions.

(1) Prebuilt for the Intel® Fortran compiler
(2) FFTW3 interfaces are integrated with Intel MKL. Look into /interfaces/fftw3x*/makefile for options defining how to build and where to place the standalone library with the wrappers.

See Also
Fortran 95 Interfaces to LAPACK and BLAS

Fortran 95 Interfaces to LAPACK and BLAS

Fortran 95 interfaces are compiler-dependent. Intel MKL provides the interface libraries and modules precompiled with the Intel® Fortran compiler. Additionally, the Fortran 95 interfaces and wrappers are delivered as sources. (For more information, see Compiler-dependent Functions and Fortran 90 Modules). If you are using a different compiler, build the appropriate library and modules with your compiler and link the library as a user's library:
1. Go to the respective directory /interfaces/blas95 or /interfaces/lapack95
2. Type one of the following commands depending on your architecture:
   • For the IA-32 architecture, make libia32 INSTALL_DIR=<user dir>
   • For the Intel® 64 architecture, make libintel64 [interface=lp64|ilp64] INSTALL_DIR=<user dir>

Important: The parameter INSTALL_DIR is required.

As a result, the required library is built and installed in the /lib directory, and the .mod files are built and installed in the /include/<arch>[/{lp64|ilp64}] directory, where <arch> is one of {ia32, intel64}. Both paths are relative to the directory given in INSTALL_DIR.

By default, the ifort compiler is assumed. You may change the compiler with an additional parameter of make: FC=<compiler>.

For example, the command

    make libintel64 FC=pgf95 INSTALL_DIR=<user dir> interface=lp64

builds the required library and .mod files and installs them in subdirectories of the directory given in INSTALL_DIR.
To delete the library from the building directory, use one of the following commands:
• For the IA-32 architecture, make cleania32 INSTALL_DIR=<user dir>
• For the Intel® 64 architecture, make cleanintel64 [interface=lp64|ilp64] INSTALL_DIR=<user dir>
• For all the architectures, make clean INSTALL_DIR=<user dir>

CAUTION: Even if you have administrative rights, avoid setting INSTALL_DIR=../.. or INSTALL_DIR=<mkl directory> in a build or clean command above because these settings replace or delete the Intel MKL prebuilt Fortran 95 library and modules.

Compiler-dependent Functions and Fortran 90 Modules

Compiler-dependent functions occur whenever the compiler inserts into the object code function calls that are resolved in its run-time library (RTL). Linking of such code without the appropriate RTL will result in undefined symbols. Intel MKL has been designed to minimize RTL dependencies.

In cases where RTL dependencies might arise, the functions are delivered as source code and you need to compile the code with whatever compiler you are using for your application.

In particular, Fortran 90 modules result in the compiler-specific code generation requiring RTL support. Therefore, Intel MKL delivers these modules compiled with the Intel compiler, along with source code, to be used with different compilers.

Mixed-language Programming with the Intel Math Kernel Library

Appendix A: Intel(R) Math Kernel Library Language Interfaces Support lists the programming languages supported for each Intel MKL function domain. However, you can call Intel MKL routines from different language environments.

Calling LAPACK, BLAS, and CBLAS Routines from C/C++ Language Environments

Not all Intel MKL function domains support both C and Fortran environments. To use Intel MKL Fortran-style functions in C/C++ environments, you should observe certain conventions, which are discussed for LAPACK and BLAS in the subsections below.

CAUTION: Avoid calling BLAS 95/LAPACK 95 from C/C++.
Such calls require skills in manipulating the descriptor of a deferred-shape array, which is the Fortran 90 type. Moreover, BLAS95/LAPACK95 routines contain links to a Fortran RTL.

LAPACK and BLAS

Because LAPACK and BLAS routines are Fortran-style, when calling them from C-language programs, follow the Fortran-style calling conventions:
• Pass variables by address, not by value. Function calls in Example "Calling a Complex BLAS Level 1 Function from C++" and Example "Using CBLAS Interface Instead of Calling BLAS Directly from C" illustrate this.
• Store your data in Fortran style, that is, column-major rather than row-major order.

With row-major order, adopted in C, the last array index changes most quickly and the first one changes most slowly when traversing the memory segment where the array is stored. With Fortran-style column-major order, the last index changes most slowly whereas the first index changes most quickly (as illustrated by the figure below for a two-dimensional array).

For example, if a two-dimensional matrix A of size m x n is stored densely in a one-dimensional array B, you can access a matrix element like this:

    A[i][j] = B[i*n+j] in C (i=0, ... , m-1, j=0, ... , n-1)
    A(i,j) = B((j-1)*m+i) in Fortran (i=1, ... , m, j=1, ... , n)

When calling LAPACK or BLAS routines from C, be aware that because the Fortran language is case-insensitive, the routine names can be both upper-case or lower-case, with or without the trailing underscore. For example, the following names are equivalent:
• LAPACK: dgetrf, DGETRF, dgetrf_, and DGETRF_
• BLAS: dgemm, DGEMM, dgemm_, and DGEMM_

See Example "Calling a Complex BLAS Level 1 Function from C++" on how to call BLAS routines from C.

See also the Intel(R) MKL Reference Manual for a description of the C interface to LAPACK functions.

CBLAS

Instead of calling BLAS routines from a C-language program, you can use the CBLAS interface.
CBLAS is a C-style interface to the BLAS routines. You can call CBLAS routines using regular C-style calls. Use the mkl.h header file with the CBLAS interface. The header file specifies enumerated values and prototypes of all the functions. It also determines whether the program is being compiled with a C++ compiler, and if it is, the included file will be correct for use with C++ compilation. Example "Using CBLAS Interface Instead of Calling BLAS Directly from C" illustrates the use of the CBLAS interface.

C Interface to LAPACK

Instead of calling LAPACK routines from a C-language program, you can use the C interface to LAPACK provided by Intel MKL.

The C interface to LAPACK is a C-style interface to the LAPACK routines. This interface supports matrices in row-major and column-major order, which you can define in the first function argument matrix_order. Use the mkl_lapacke.h header file with the C interface to LAPACK. The header file specifies constants and prototypes of all the functions. It also determines whether the program is being compiled with a C++ compiler, and if it is, the included file will be correct for use with C++ compilation. You can find examples of the C interface to LAPACK in the examples/lapacke subdirectory in the Intel MKL installation directory.

Using Complex Types in C/C++

As described in the documentation for the Intel® Fortran Compiler XE, C/C++ does not directly implement the Fortran types COMPLEX(4) and COMPLEX(8). However, you can write equivalent structures. The type COMPLEX(4) consists of two 4-byte floating-point numbers. The first of them is the real-number component, and the second one is the imaginary-number component. The type COMPLEX(8) is similar to COMPLEX(4) except that it contains two 8-byte floating-point numbers.

Intel MKL provides complex types MKL_Complex8 and MKL_Complex16, which are structures equivalent to the Fortran complex types COMPLEX(4) and COMPLEX(8), respectively.
The MKL_Complex8 and MKL_Complex16 types are defined in the mkl_types.h header file. You can use these types to define complex data. You can also redefine the types with your own types before including the mkl_types.h header file. The only requirement is that the types must be compatible with the Fortran complex layout, that is, the complex type must be a pair of real numbers for the values of real and imaginary parts.

For example, you can use the following definitions in your C++ code:

    #define MKL_Complex8 std::complex<float>

and

    #define MKL_Complex16 std::complex<double>

See Example "Calling a Complex BLAS Level 1 Function from C++" for details. You can also define these types in the command line:

    -DMKL_Complex8="std::complex<float>"
    -DMKL_Complex16="std::complex<double>"

See Also
Intel® Software Documentation Library

Calling BLAS Functions that Return the Complex Values in C/C++ Code

Complex values that functions return are handled differently in C and Fortran. Because BLAS is Fortran-style, you need to be careful when handling a call from C to a BLAS function that returns complex values. However, in addition to normal function calls, Fortran enables calling functions as though they were subroutines, which provides a mechanism for returning the complex value correctly when the function is called from a C program. When a Fortran function is called as a subroutine, the return value is the first parameter in the calling sequence. You can use this feature to call a BLAS function from C.
The following example shows how a call to a Fortran function as a subroutine converts to a call from C and the hidden parameter result gets exposed:

    Normal Fortran function call:              result = cdotc( n, x, 1, y, 1 )
    A call to the function as a subroutine:    call cdotc( result, n, x, 1, y, 1)
    A call to the function from C:             cdotc( &result, &n, x, &one, y, &one )

NOTE: Intel MKL has both upper-case and lower-case entry points in the Fortran-style (case-insensitive) BLAS, with or without the trailing underscore. So, all these names are equivalent and acceptable: cdotc, CDOTC, cdotc_, and CDOTC_.

The above example shows one of the ways to call several level 1 BLAS functions that return complex values from your C and C++ applications. An easier way is to use the CBLAS interface. For instance, you can call the same function using the CBLAS interface as follows:

    cblas_cdotc_sub( n, x, 1, y, 1, &result )

NOTE: The complex value comes last on the argument list in this case.

The following examples show use of the Fortran-style BLAS interface from C and C++, as well as the CBLAS (C language) interface:
• Example "Calling a Complex BLAS Level 1 Function from C"
• Example "Calling a Complex BLAS Level 1 Function from C++"
• Example "Using CBLAS Interface Instead of Calling BLAS Directly from C"

Example "Calling a Complex BLAS Level 1 Function from C"

The example below illustrates a call from a C program to the complex BLAS Level 1 function zdotc(). This function computes the dot product of two double-precision complex vectors.

In this example, the complex dot product is returned in the structure c.
#include "mkl.h" #define N 5 int main() { int n = N, inca = 1, incb = 1, i; MKL_Complex16 a[N], b[N], c; for( i = 0; i < n; i++ ){ a[i].real = (double)i; a[i].imag = (double)i * 2.0; b[i].real = (double)(n - i); b[i].imag = (double)i * 2.0; } zdotc( &c, &n, a, &inca, b, &incb ); printf( "The complex dot product is: ( %6.2f, %6.2f)\n", c.real, c.imag ); return 0; } 6 Intel® Math Kernel Library for Linux* OS User's Guide 60Example "Calling a Complex BLAS Level 1 Function from C++" Below is the C++ implementation: #include #include #define MKL_Complex16 std::complex #include "mkl.h" #define N 5 int main() { int n, inca = 1, incb = 1, i; std::complex a[N], b[N], c; n = N; for( i = 0; i < n; i++ ){ a[i] = std::complex(i,i*2.0); b[i] = std::complex(n-i,i*2.0); } zdotc(&c, &n, a, &inca, b, &incb ); std::cout << "The complex dot product is: " << c << std::endl; return 0; } Example "Using CBLAS Interface Instead of Calling BLAS Directly from C" This example uses CBLAS: #include #include "mkl.h" typedef struct{ double re; double im; } complex16; #define N 5 int main() { int n, inca = 1, incb = 1, i; complex16 a[N], b[N], c; n = N; for( i = 0; i < n; i++ ){ a[i].re = (double)i; a[i].im = (double)i * 2.0; b[i].re = (double)(n - i); b[i].im = (double)i * 2.0; } cblas_zdotc_sub(n, a, inca, b, incb, &c ); printf( "The complex dot product is: ( %6.2f, %6.2f)\n", c.re, c.im ); return 0; } Support for Boost uBLAS Matrix-matrix Multiplication If you are used to uBLAS, you can perform BLAS matrix-matrix multiplication in C++ using Intel MKL substitution of Boost uBLAS functions. uBLAS is the Boost C++ open-source library that provides BLAS functionality for dense, packed, and sparse matrices. The library uses an expression template technique for passing expressions as function arguments, which enables evaluating vector and matrix expressions in one pass without temporary matrices. uBLAS provides two modes: • Debug (safe) mode, default. Checks types and conformance. 
• Release (fast) mode. Does not check types and conformance. To enable this mode, use the NDEBUG preprocessor symbol.

The documentation for the Boost uBLAS is available at www.boost.org.

Intel MKL provides overloaded prod() functions for substituting uBLAS dense matrix-matrix multiplication with the Intel MKL gemm calls. Though these functions break uBLAS expression templates and introduce temporary matrices, the performance advantage can be considerable for matrix sizes that are not too small (roughly, over 50).

You do not need to change your source code to use the functions. To call them:
• Include the header file mkl_boost_ublas_matrix_prod.hpp in your code (from the Intel MKL include directory)
• Add appropriate Intel MKL libraries to the link line.

The list of expressions that are substituted follows:
    prod( m1, m2 )
    prod( trans(m1), m2 )
    prod( trans(conj(m1)), m2 )
    prod( conj(trans(m1)), m2 )
    prod( m1, trans(m2) )
    prod( trans(m1), trans(m2) )
    prod( trans(conj(m1)), trans(m2) )
    prod( conj(trans(m1)), trans(m2) )
    prod( m1, trans(conj(m2)) )
    prod( trans(m1), trans(conj(m2)) )
    prod( trans(conj(m1)), trans(conj(m2)) )
    prod( conj(trans(m1)), trans(conj(m2)) )
    prod( m1, conj(trans(m2)) )
    prod( trans(m1), conj(trans(m2)) )
    prod( trans(conj(m1)), conj(trans(m2)) )
    prod( conj(trans(m1)), conj(trans(m2)) )

These expressions are substituted in the release mode only (with NDEBUG preprocessor symbol defined). Supported uBLAS versions are Boost 1.34.1 and higher. To get them, visit www.boost.org.

A code example provided in the /examples/ublas/source/sylvester.cpp file illustrates usage of the Intel MKL uBLAS header file for solving a special case of the Sylvester equation.
To run the Intel MKL ublas examples, specify the BOOST_ROOT parameter in the make command, for instance, when using Boost version 1.37.0:
    make libia32 BOOST_ROOT=/boost_1_37_0

See Also
Using Code Examples

Invoking Intel MKL Functions from Java* Applications

Intel MKL Java* Examples

To demonstrate binding with Java, Intel MKL includes a set of Java examples in the following directory: /examples/java.

The examples are provided for the following MKL functions:
• ?gemm, ?gemv, and ?dot families from CBLAS
• The complete set of non-cluster FFT functions
• ESSL(1)-like functions for one-dimensional convolution and correlation
• VSL Random Number Generators (RNG), except user-defined ones and file subroutines
• VML functions, except GetErrorCallBack, SetErrorCallBack, and ClearErrorCallBack

You can see the example sources in the following directory: /examples/java/examples.

The examples are written in Java. They demonstrate usage of the MKL functions with the following variety of data:
• 1- and 2-dimensional data sequences
• Real and complex types of the data
• Single and double precision

However, the wrappers used in the examples do not:
• Demonstrate the use of large arrays (>2 billion elements)
• Demonstrate processing of arrays in native memory
• Check correctness of function parameters
• Demonstrate performance optimizations

The examples use the Java Native Interface (JNI* developer framework) to bind with Intel MKL. The JNI documentation is available from http://java.sun.com/javase/6/docs/technotes/guides/jni/.

The Java example set includes JNI wrappers that perform the binding. The wrappers do not depend on the examples and may be used in your Java applications. The wrappers for CBLAS, FFT, VML, VSL RNG, and ESSL-like convolution and correlation functions do not depend on each other.

To build the wrappers, just run the examples. The makefile builds the wrapper binaries.
After running the makefile, you can run the examples, which will determine whether the wrappers were built correctly. As a result of running the examples, the following directories will be created in /examples/java:
• docs
• include
• classes
• bin
• _results

The directories docs, include, classes, and bin will contain the wrapper binaries and documentation; the directory _results will contain the testing results.

For a Java programmer, the wrappers are the following Java classes:
• com.intel.mkl.CBLAS
• com.intel.mkl.DFTI
• com.intel.mkl.ESSL
• com.intel.mkl.VML
• com.intel.mkl.VSL

Documentation for the particular wrapper and example classes will be generated from the Java sources while building and running the examples. To browse the documentation, open the index file in the docs directory (created by the build script): /examples/java/docs/index.html.

The Java wrappers for CBLAS, VML, VSL RNG, and FFT establish the interface that directly corresponds to the underlying native functions, so you can refer to the Intel MKL Reference Manual for their functionality and parameters. Interfaces for the ESSL-like functions are described in the generated documentation for the com.intel.mkl.ESSL class.

Each wrapper consists of the interface part for Java and a JNI stub written in C. You can find the sources in the following directory: /examples/java/wrappers.

Both Java and C parts of the wrapper for CBLAS and VML demonstrate the straightforward approach, which you may use to cover additional CBLAS functions.

The wrapper for FFT is more complicated because it needs to support the lifecycle for FFT descriptor objects. To compute a single Fourier transform, an application needs to call the FFT software several times with the same copy of the native FFT descriptor. The wrapper provides the handler class to hold the native descriptor, while the virtual machine runs Java bytecode.

The wrapper for VSL RNG is similar to the one for FFT.
The wrapper provides the handler class to hold the native descriptor of the stream state.

The wrapper for the convolution and correlation functions mitigates the same difficulty of the VSL interface, which assumes a similar lifecycle for "task descriptors". The wrapper utilizes the ESSL-like interface for those functions, which is simpler for the case of 1-dimensional data. The JNI stub additionally encapsulates the MKL functions into the ESSL-like wrappers written in C and so "packs" the lifecycle of a task descriptor into a single call to the native method.

The wrappers meet the JNI Specification versions 1.1 and 5.0 and should work with virtually every modern implementation of Java.

The examples and the Java part of the wrappers are written for the Java language described in "The Java Language Specification (First Edition)" and extended with the feature of "inner classes" (that is, the language as it stood in the late 1990s). This level of language version is supported by all versions of the Sun Java Development Kit* (JDK*) developer toolkit and compatible implementations starting from version 1.1.5, or by all modern versions of Java.

The level of C language is "Standard C" (that is, C89) with additional assumptions about integer and floating-point data types required by the Intel MKL interfaces and the JNI header files. That is, the native float and double data types must be the same as JNI jfloat and jdouble data types, respectively, and the native int must be 4 bytes long.

(1) IBM Engineering Scientific Subroutine Library (ESSL*).

See Also
Running the Java* Examples

Running the Java* Examples

The Java examples support all the C and C++ compilers that Intel MKL does. The makefile intended to run the examples also needs the make utility, which is typically provided with the Linux* OS distribution.

To run Java examples, the JDK* developer toolkit is required for compiling and running Java code. A Java implementation must be installed on the computer or available via the network.
You may download the JDK from the vendor website.

The examples should work for all versions of JDK. However, they were tested only with the following Java implementations for all the supported architectures:
• J2SE* SDK 1.4.2, JDK 5.0 and 6.0 from Sun Microsystems, Inc. (http://sun.com/).
• JRockit* JDK 1.4.2 and 5.0 from Oracle Corporation (http://oracle.com/).

Note that the Java run-time environment* (JRE*) system, which may be pre-installed on your computer, is not enough. You need the JDK* developer toolkit that supports the following set of tools:
• java
• javac
• javah
• javadoc

To make these tools available for the examples makefile, set the JAVA_HOME environment variable and add the JDK binaries directory to the system PATH, for example, using the bash shell:
    export JAVA_HOME=/home//jdk1.5.0_09
    export PATH=${JAVA_HOME}/bin:${PATH}

You may also need to clear the JDK_HOME environment variable, if it is assigned a value:
    unset JDK_HOME

To start the examples, use the makefile found in the Intel MKL Java examples directory:
    make {soia32|sointel64|libia32|libintel64} [function=...] [compiler=...]

If you type the make command and omit the target (for example, soia32), the makefile prints the help info, which explains the targets and parameters.

For the examples list, see the examples.lst file in the Java examples directory.

Known Limitations of the Java* Examples

This section explains limitations of Java examples.

Functionality
Some Intel MKL functions may fail to work if called from the Java environment by using a wrapper, like those provided with the Intel MKL Java examples. Only those specific CBLAS, FFT, VML, VSL RNG, and the convolution/correlation functions listed in the Intel MKL Java Examples section were tested with the Java environment. So, you may use the Java wrappers for these CBLAS, FFT, VML, VSL RNG, and convolution/correlation functions in your Java applications.
Performance
The Intel MKL functions are expected to run faster than similar functions written in pure Java. However, the main goal of these wrappers is to provide code examples, not maximum performance. So, an Intel MKL function called from a Java application will probably run slower than the same function called from a program written in C/C++ or Fortran.

Known bugs
There are a number of known bugs in Intel MKL (identified in the Release Notes), as well as incompatibilities between different versions of JDK. The examples and wrappers include workarounds for these problems. Look at the source code in the examples and wrappers for comments that describe the workarounds.

Coding Tips

This section discusses programming with the Intel® Math Kernel Library (Intel® MKL) to provide coding tips that meet certain specific needs, such as consistent results of computations or conditional compilation.

Aligning Data for Consistent Results

Routines in Intel MKL may return different results from run to run on the same system. This is usually due to a change in the order in which floating-point operations are performed. The two most influential factors are array alignment and parallelism. Array alignment can determine how internal loops order floating-point operations. Non-deterministic parallelism may change the order in which computational tasks are executed. While these results may differ, they should still fall within acceptable computational error bounds. To better assure identical results from run to run, do the following:
• Align input arrays on 16-byte boundaries
• Run Intel MKL in the sequential mode

To align input arrays on 16-byte boundaries, use mkl_malloc() in place of system provided memory allocators, as shown in the code example below. Sequential mode of Intel MKL removes the influence of non-deterministic parallelism.
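As background for the advice above, and before the alignment example itself, the following minimal sketch illustrates why a change in the order of floating-point operations changes a result. It is plain C and does not use Intel MKL; the function names sum_forward and sum_backward are hypothetical and chosen only for this illustration. It assumes IEEE double-precision arithmetic and a compiler that does not reorder floating-point operations (for example, no fast-math options).

```c
#include <assert.h>
#include <stddef.h>

/* Add the elements front to back. */
double sum_forward(const double *x, size_t n)
{
    double s = 0.0;
    assert(x != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Add the same elements back to front. */
double sum_backward(const double *x, size_t n)
{
    double s = 0.0;
    assert(x != NULL || n == 0);
    for (size_t i = n; i > 0; i--)
        s += x[i - 1];
    return s;
}
```

For the three-element array { 1.0, 1.0e30, -1.0e30 }, sum_forward returns 0.0 because the 1.0 is absorbed when it is added to 1.0e30, while sum_backward returns 1.0 because the two large terms cancel first. Both answers are within rounding error of the large terms, which is the sense in which differing results can still be acceptable.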
Aligning Addresses on 16-byte Boundaries

    // ******* C language *******
    ...
    #include "mkl.h"
    ...
    void *darray;
    int workspace;
    ...
    // Allocate workspace aligned on 16-byte boundary
    darray = mkl_malloc( sizeof(double)*workspace, 16 );
    ...
    // call the program using MKL
    mkl_app( darray );
    ...
    // Free workspace
    mkl_free( darray );

    ! ******* Fortran language *******
    ...
    double precision darray
    pointer (p_wrk,darray(1))
    integer workspace
    ...
    ! Allocate workspace aligned on 16-byte boundary
    p_wrk = mkl_malloc( 8*workspace, 16 )
    ...
    ! call the program using MKL
    call mkl_app( darray )
    ...
    ! Free workspace
    call mkl_free(p_wrk)

Using Predefined Preprocessor Symbols for Intel® MKL Version-Dependent Compilation

Preprocessor symbols (macros) substitute values in a program before it is compiled. The substitution is performed in the preprocessing phase.

The following preprocessor symbols are available:

Predefined Preprocessor Symbol    Description
__INTEL_MKL__                     Intel MKL major version
__INTEL_MKL_MINOR__               Intel MKL minor version
__INTEL_MKL_UPDATE__              Intel MKL update number
INTEL_MKL_VERSION                 Intel MKL full version in the following format:
                                  INTEL_MKL_VERSION = (__INTEL_MKL__*100+__INTEL_MKL_MINOR__)*100+__INTEL_MKL_UPDATE__

These symbols enable conditional compilation of code that uses new features introduced in a particular version of the library.

To perform conditional compilation:
1. Include in your code the file where the macros are defined:
   • mkl.h for C/C++
   • mkl.fi for Fortran
2. [Optionally] Use the following preprocessor directives to check whether the macro is defined:
   • #ifdef, #endif for C/C++
   • !DEC$IF DEFINED, !DEC$ENDIF for Fortran
3. Use preprocessor directives for conditional inclusion of code:
   • #if, #endif for C/C++
   • !DEC$IF, !DEC$ENDIF for Fortran

Example

Compile a part of the code if the Intel MKL version is MKL 10.3 update 4, that is, if INTEL_MKL_VERSION equals (10*100+3)*100+4 = 100304:

C/C++:
    #include "mkl.h"
    #ifdef INTEL_MKL_VERSION
    #if INTEL_MKL_VERSION == 100304
    // Code to be conditionally compiled
    #endif
    #endif

Fortran:
    include "mkl.fi"
    !DEC$IF DEFINED INTEL_MKL_VERSION
    !DEC$IF INTEL_MKL_VERSION .EQ. 100304
    * Code to be conditionally compiled
    !DEC$ENDIF
    !DEC$ENDIF

Working with the Intel® Math Kernel Library Cluster Software

Optimization Notice
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804

Linking with ScaLAPACK and Cluster FFTs

The Intel MKL ScaLAPACK and Cluster FFTs support MPI implementations identified in the Intel MKL Release Notes.

To link a program that calls ScaLAPACK or Cluster FFTs, you need to know how to link a message-passing interface (MPI) application first. Use mpi scripts to do this. For example, mpicc or mpif77 are C or FORTRAN 77 scripts, respectively, that use the correct MPI header files. The location of these scripts and the MPI library depends on your MPI implementation.
For example, for the default installation of MPICH, /opt/mpich/bin/mpicc and /opt/mpich/bin/mpif77 are the compiler scripts and /opt/mpich/lib/libmpich.a is the MPI library.

Check the documentation that comes with your MPI implementation for implementation-specific details of linking.

To link with Intel MKL ScaLAPACK and/or Cluster FFTs, use the following general form:
    < linker script> \
    -L [-Wl,--start-group] \
    [-Wl,--end-group]
where the placeholders stand for paths and libraries as explained in the following table:

One of ScaLAPACK or Cluster FFT libraries for the appropriate architecture and programming interface (LP64 or ILP64). Available libraries are listed in Directory Structure in Detail. For example, for the IA-32 architecture, it is either -lmkl_scalapack_core or -lmkl_cdft_core.

The BLACS library corresponding to your architecture, programming interface (LP64 or ILP64), and MPI version. Available BLACS libraries are listed in Directory Structure in Detail. For example, for the IA-32 architecture, choose one of -lmkl_blacs, -lmkl_blacs_intelmpi, or -lmkl_blacs_openmpi, depending on the MPI version you use; specifically, for Intel MPI 3.x, choose -lmkl_blacs_intelmpi.

 for ScaLAPACK, and for Cluster FFTs.

Processor optimized kernels, threading library, and system library for threading support, linked as described in Listing Libraries on a Link Line.

The LAPACK library and .

One of several MPI implementations (MPICH, Intel MPI, and so on).

< linker script> A linker script that corresponds to the MPI version. For instance, for Intel MPI 3.x, use .
For example, if you are using Intel MPI 3.x, want to statically use the LP64 interface with ScaLAPACK, and have only one MPI process per core (and thus do not use threading), specify the following linker options:
    -L$MKLPATH -I$MKLINCLUDE -Wl,--start-group \
    $MKLPATH/libmkl_scalapack_lp64.a \
    $MKLPATH/libmkl_blacs_intelmpi_lp64.a \
    $MKLPATH/libmkl_intel_lp64.a \
    $MKLPATH/libmkl_sequential.a \
    $MKLPATH/libmkl_core.a \
    -static_mpi -Wl,--end-group -lpthread -lm

NOTE Grouping symbols -Wl,--start-group and -Wl,--end-group are required for static linking.

TIP Use the Link-line Advisor to quickly choose the appropriate set of , , and .

See Also
Linking Your Application with the Intel® Math Kernel Library
Examples for Linking with ScaLAPACK and Cluster FFT

Setting the Number of Threads

The OpenMP* software responds to the environment variable OMP_NUM_THREADS. Intel MKL also has other mechanisms to set the number of threads, such as the MKL_NUM_THREADS or MKL_DOMAIN_NUM_THREADS environment variables (see Using Additional Threading Control).

Make sure that the relevant environment variables have the same and correct values on all the nodes. Intel MKL versions 10.0 and higher no longer set the default number of threads to one, but depend on the OpenMP libraries used with the compiler to set the default number. For the threading layer based on the Intel compiler (libmkl_intel_thread.a), this value is the number of CPUs according to the OS.

CAUTION Avoid over-prescribing the number of threads, which may occur, for instance, when the number of MPI ranks per node and the number of threads per node are both greater than one. The product of MPI ranks per node and the number of threads per node should not exceed the number of physical cores per node.

The best way to set an environment variable, such as OMP_NUM_THREADS, is in your login environment.
Remember that changing this value on the head node and then doing your run, as you do on a shared-memory (SMP) system, does not change the variable on all the nodes because mpirun starts a fresh default shell on all the nodes. To change the number of threads on all the nodes, in .bashrc, add a line at the top, as follows:
    OMP_NUM_THREADS=1; export OMP_NUM_THREADS

You can run multiple CPUs per node using MPICH. To do this, build MPICH to enable multiple CPUs per node. Be aware that certain MPICH applications may fail to work perfectly in a threaded environment (see the Known Limitations section in the Release Notes). If you encounter problems with MPICH when the number of threads is set to greater than one, first try setting the number of threads to one and see whether the problem persists.

See Also
Techniques to Set the Number of Threads

Using Shared Libraries

All needed shared libraries must be visible on all the nodes at run time. To achieve this, point to these libraries with the LD_LIBRARY_PATH environment variable in the .bashrc file.

If Intel MKL is installed only on one node, link statically when building your Intel MKL applications rather than use shared libraries.

The Intel compilers or GNU compilers can be used to compile a program that uses Intel MKL. However, make sure that the MPI implementation and compiler match up correctly.

Building ScaLAPACK Tests

To build ScaLAPACK tests,
• For the IA-32 architecture, add libmkl_scalapack_core.a to your link command.
• For the Intel® 64 architecture, add libmkl_scalapack_lp64.a or libmkl_scalapack_ilp64.a, depending on the desired interface.

Examples for Linking with ScaLAPACK and Cluster FFT

This section provides examples of linking with ScaLAPACK and Cluster FFT.

Note that a binary linked with ScaLAPACK runs the same way as any other MPI application (refer to the documentation that comes with your MPI implementation).
For instance, the script mpirun is used in the case of MPICH2 and OpenMPI, and the number of MPI processes is set by -np. In the case of MPICH 2.0 and all Intel MPIs, start the daemon before running your application; the execution is driven by the script mpiexec.

For further linking examples, see the support website for Intel products at http://www.intel.com/software/products/support/.

See Also
Directory Structure in Detail

Examples for Linking a C Application

These examples illustrate linking of an application whose main module is in C under the following conditions:
• MPICH2 1.0.7 or higher is installed in /opt/mpich.
• $MKLPATH is a user-defined variable containing /lib/ia32.
• You use the Intel® C++ Compiler 10.0 or higher.

To link with ScaLAPACK for a cluster of systems based on the IA-32 architecture, use the following link line:
    /opt/mpich/bin/mpicc \
    -L$MKLPATH \
    -lmkl_scalapack_core \
    -lmkl_blacs_intelmpi \
    -lmkl_intel -lmkl_intel_thread -lmkl_core \
    -liomp5 -lpthread

To link with Cluster FFT for a cluster of systems based on the IA-32 architecture, use the following link line:
    /opt/mpich/bin/mpicc \
    -Wl,--start-group \
    $MKLPATH/libmkl_cdft_core.a \
    $MKLPATH/libmkl_blacs_intelmpi.a \
    $MKLPATH/libmkl_intel.a \
    $MKLPATH/libmkl_intel_thread.a \
    $MKLPATH/libmkl_core.a \
    -Wl,--end-group \
    -liomp5 -lpthread

See Also
Linking with ScaLAPACK and Cluster FFTs

Examples for Linking a Fortran Application

These examples illustrate linking of an application whose main module is in Fortran under the following conditions:
• Intel MPI 3.0 is installed in /opt/intel/mpi/3.0.
• $MKLPATH is a user-defined variable containing /lib/intel64 .
• You use the Intel® Fortran Compiler 10.0 or higher.
To link with ScaLAPACK for a cluster of systems based on the Intel® 64 architecture, use the following link line:
    /opt/intel/mpi/3.0/bin/mpiifort \
    -L$MKLPATH \
    -lmkl_scalapack_lp64 \
    -lmkl_blacs_intelmpi_lp64 \
    -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core \
    -liomp5 -lpthread

To link with Cluster FFT for a cluster of systems based on the Intel® 64 architecture, use the following link line:
    /opt/intel/mpi/3.0/bin/mpiifort \
    -Wl,--start-group \
    $MKLPATH/libmkl_cdft_core.a \
    $MKLPATH/libmkl_blacs_intelmpi_ilp64.a \
    $MKLPATH/libmkl_intel_ilp64.a \
    $MKLPATH/libmkl_intel_thread.a \
    $MKLPATH/libmkl_core.a \
    -Wl,--end-group \
    -liomp5 -lpthread

See Also
Linking with ScaLAPACK and Cluster FFTs

Programming with Intel® Math Kernel Library in the Eclipse* Integrated Development Environment (IDE)

Configuring the Eclipse* IDE CDT to Link with Intel MKL

This section explains how to configure the Eclipse* Integrated Development Environment (IDE) C/C++ Development Tools (CDT) to link with Intel® Math Kernel Library (Intel® MKL).

TIP After configuring your CDT, you can benefit from the Eclipse-provided code assist feature. See Code/Content Assist description in the CDT Help for details.

To configure your Eclipse IDE CDT to link with Intel MKL, you need to perform the steps explained below. The specific instructions for performing these steps depend on your version of the CDT and on the tool-chain/compiler integration. Refer to the CDT Help for more details.

To configure your Eclipse IDE CDT, do the following:
1. Open Project Properties for your project.
2. Add the Intel MKL include path, that is, /include, to the project's include paths.
3. Add the Intel MKL library path for the target architecture to the project's library paths. For example, for the Intel® 64 architecture, add /lib/intel64.
4. Specify the names of the Intel MKL libraries to link with your application.
For example, you may need the following libraries: mkl_intel_lp64, mkl_intel_thread, mkl_core, and iomp5.

NOTE Because compilers typically require library names rather than file names, omit the "lib" prefix and the ".a" or ".so" extension.

See Also
Selecting Libraries to Link with
Linking in Detail

Getting Assistance for Programming in the Eclipse* IDE

Intel MKL provides an Eclipse* IDE plug-in (com.intel.mkl.help) that contains the Intel MKL Reference Manual (see High-level Directory Structure for the plug-in location after the library installation). To install the plug-in, do one of the following:
• Use the Eclipse IDE Update Manager (recommended). To invoke the Manager, use the Help > Software Updates command in your Eclipse IDE.
• Copy the plug-in to the plugins folder of your Eclipse IDE directory. In this case, if you use earlier C/C++ Development Tools (CDT) versions (3.x, 4.x), delete or rename the index subfolder in the eclipse/configuration/org.eclipse.help.base folder of your Eclipse IDE to avoid delays in Index updating.

The following Intel MKL features assist you while programming in the Eclipse* IDE:
• The Intel MKL Reference Manual viewable from within the IDE
• Eclipse Help search tuned to target the Intel Web sites
• Code/Content Assist in the Eclipse IDE CDT

The Intel MKL plug-in for Eclipse IDE provides the first two features. The last feature is native to the Eclipse IDE CDT. See the Code Assist description in Eclipse IDE Help for details.

Viewing the Intel® Math Kernel Library Reference Manual in the Eclipse* IDE

To view the Reference Manual, in Eclipse,
1. Select Help > Help Contents from the menu.
2. In the Help tab, under All Topics, click Intel® Math Kernel Library Help.
3. In the Help tree that expands, click Intel Math Kernel Library Reference Manual.
4. The Intel MKL Help Index is also available in Eclipse, and the Reference Manual is included in the Eclipse Help search.
Searching the Intel Web Site from the Eclipse* IDE

The Intel MKL plug-in tunes Eclipse Help search to target http://www.intel.com so that when you are connected to the Internet and run a search from the Eclipse Help pane, the search hits at the site are shown through a separate link. The following figure shows search results for "VML Functions" in Eclipse Help. In the figure, 1 hit means an entry hit to the respective site.

Click "Intel.com (1 hit)" to open the list of actual hits to the Intel Web site.

LINPACK and MP LINPACK Benchmarks

Intel® Optimized LINPACK Benchmark for Linux* OS

Intel® Optimized LINPACK Benchmark is a generalization of the LINPACK 1000 benchmark. It solves a dense (real*8) system of linear equations (Ax=b), measures the amount of time it takes to factor and solve the system, converts that time into a performance rate, and tests the results for accuracy. The generalization is in the number of equations (N) it can solve, which is not limited to 1000.
It uses partial pivoting to assure the accuracy of the results.

Do not use this benchmark to report LINPACK 100 performance because that is a compiled-code only benchmark. This is a shared-memory (SMP) implementation which runs on a single platform. Do not confuse this benchmark with:
• MP LINPACK, which is a distributed memory version of the same benchmark.
• LINPACK, the library, which has been expanded upon by the LAPACK library.

Intel provides optimized versions of the LINPACK benchmarks to help you obtain high LINPACK benchmark results on your genuine Intel processor systems more easily than with the High Performance Linpack (HPL) benchmark. Use this package to benchmark your SMP machine.

Additional information on this software as well as other Intel software performance products is available at http://www.intel.com/software/products/.

Contents of the Intel® Optimized LINPACK Benchmark

The Intel Optimized LINPACK Benchmark for Linux* OS contains the following files, located in the ./benchmarks/linpack/ subdirectory of the Intel® Math Kernel Library (Intel® MKL) directory:

File in ./benchmarks/linpack/    Description
xlinpack_xeon32    The 32-bit program executable for a system based on Intel® Xeon® processor or Intel® Xeon® processor MP with or without Streaming SIMD Extensions 3 (SSE3).
xlinpack_xeon64    The 64-bit program executable for a system with Intel® Xeon® processor using Intel® 64 architecture.
runme_xeon32       A sample shell script for executing a pre-determined problem set for xlinpack_xeon32. OMP_NUM_THREADS set to 2 processors.
runme_xeon64       A sample shell script for executing a pre-determined problem set for xlinpack_xeon64. OMP_NUM_THREADS set to 4 processors.
lininput_xeon32    Input file for pre-determined problem for the runme_xeon32 script.
lininput_xeon64    Input file for pre-determined problem for the runme_xeon64 script.
lin_xeon32.txt     Result of the runme_xeon32 script execution.
lin_xeon64.txt Result of the runme_xeon64 script execution. help.lpk Simple help file. xhelp.lpk Extended help file. See Also High-level Directory Structure Running the Software To obtain results for the pre-determined sample problem sizes on a given system, type one of the following, as appropriate: ./runme_xeon32 ./runme_xeon64 To run the software for other problem sizes, see the extended help included with the program. Extended help can be viewed by running the program executable with the -e option: ./xlinpack_xeon32 -e ./xlinpack_xeon64 -e The pre-defined data input fileslininput_xeon32 and lininput_xeon64 are provided merely as examples. Different systems have different number of processors or amount of memory and thus require new input files. The extended help can be used for insight into proper ways to change the sample input files. Each input file requires at least the following amount of memory: lininput_xeon32 2 GB lininput_xeon64 16 GB If the system has less memory than the above sample data input requires, you may need to edit or create your own data input files, as explained in the extended help. Each sample script uses the OMP_NUM_THREADS environment variable to set the number of processors it is targeting. To optimize performance on a different number of physical processors, change that line appropriately. If you run the Intel Optimized LINPACK Benchmark without setting the number of threads, it will default to the number of cores according to the OS. You can find the settings for this environment variable in the runme_* sample scripts. If the settings do not yet match the situation for your machine, edit the script. Optimization Notice Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. 
Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804

Known Limitations of the Intel® Optimized LINPACK Benchmark

The following limitations are known for the Intel Optimized LINPACK Benchmark for Linux* OS:
• Intel Optimized LINPACK Benchmark is threaded to effectively use multiple processors. So, in multiprocessor systems, best performance will be obtained with the Intel® Hyper-Threading Technology turned off, which ensures that the operating system assigns threads to physical processors only.
• If an incomplete data input file is given, the binaries may either hang or fault. See the sample data input files and/or the extended help for insight into creating a correct data input file.

Intel® Optimized MP LINPACK Benchmark for Clusters

Overview of the Intel® Optimized MP LINPACK Benchmark for Clusters

The Intel® Optimized MP LINPACK Benchmark for Clusters is based on modifications and additions to HPL 2.0 from Innovative Computing Laboratories (ICL) at the University of Tennessee, Knoxville (UTK). The Intel Optimized MP LINPACK Benchmark for Clusters can be used for Top 500 runs (see http://www.top500.org). To use the benchmark you need to be intimately familiar with the HPL distribution and usage. The Intel Optimized MP LINPACK Benchmark for Clusters provides some additional enhancements and bug fixes designed to make the HPL usage more convenient, as well as explanations of Intel® Message-Passing Interface (MPI) settings that may enhance performance.
The ./benchmarks/mp_linpack directory adds techniques to minimize search times frequently associated with long runs.

The Intel® Optimized MP LINPACK Benchmark for Clusters is an implementation of the Massively Parallel MP LINPACK benchmark by means of HPL code. It solves a random dense (real*8) system of linear equations (Ax=b), measures the amount of time it takes to factor and solve the system, converts that time into a performance rate, and tests the results for accuracy. You can solve any size (N) system of equations that fits into memory. The benchmark uses full row pivoting to ensure the accuracy of the results.

Use the Intel Optimized MP LINPACK Benchmark for Clusters on a distributed memory machine. On a shared memory machine, use the Intel Optimized LINPACK Benchmark.

Intel provides optimized versions of the LINPACK benchmarks to help you obtain high LINPACK benchmark results on your systems based on genuine Intel processors more easily than with the HPL benchmark. Use the Intel Optimized MP LINPACK Benchmark to benchmark your cluster. The prebuilt binaries require that Intel® MPI 3.x be installed on the cluster. The run-time version of Intel MPI is free and can be downloaded from www.intel.com/software/products/.

The Intel package includes software developed at the University of Tennessee, Knoxville, Innovative Computing Laboratories and neither the University nor ICL endorse or promote this product. Although HPL 2.0 is redistributable under certain conditions, this particular package is subject to the Intel MKL license.

Intel MKL has introduced a new functionality into MP LINPACK, which is called a hybrid build, while continuing to support the older version. The term hybrid refers to special optimizations added to take advantage of mixed OpenMP*/MPI parallelism. If you want to use one MPI process per node and to achieve further parallelism by means of OpenMP, use the hybrid build.
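As a rough illustration of the two process layouts discussed in this section, the following Python sketch computes how many MPI processes and OpenMP threads each build style would use on a cluster. The helper name process_layout and its parameters are hypothetical and are not part of MP LINPACK or Intel MPI. The sketch assumes that a hybrid run uses one MPI process per node with one OpenMP thread per core, and that a non-hybrid run uses one single-threaded MPI process per core.

```python
def process_layout(nodes, cores_per_node, hybrid):
    """Return (mpi_processes, omp_threads_per_process) for a cluster.

    Hypothetical helper for illustration only; it is not an MP LINPACK API.
    """
    if nodes < 1 or cores_per_node < 1:
        raise ValueError("nodes and cores_per_node must be positive")
    if hybrid:
        # Hybrid layout: one MPI process per node, OpenMP threads fill the cores.
        return nodes, cores_per_node
    # Non-hybrid layout: one single-threaded MPI process per core.
    return nodes * cores_per_node, 1


if __name__ == "__main__":
    for hybrid in (True, False):
        procs, threads = process_layout(nodes=4, cores_per_node=8, hybrid=hybrid)
        label = "hybrid" if hybrid else "non-hybrid"
        print(f"{label}: {procs} MPI processes, OMP_NUM_THREADS={threads}")
```

The returned thread count is the kind of value you might export as OMP_NUM_THREADS before launching the run; the actual launch command depends on your MPI installation.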
In general, the hybrid build is useful when the number of MPI processes per core is less than one. If you want to rely exclusively on MPI for parallelism and use one MPI process per core, use the non-hybrid build.

In addition to supplying certain hybrid prebuilt binaries, Intel MKL supplies some hybrid prebuilt libraries for Intel® MPI to take advantage of the additional OpenMP* optimizations.

If you wish to use an MPI version other than Intel MPI, you can do so by using the MP LINPACK source provided. You can use the source to build a non-hybrid version that may be used in a hybrid mode, but it would be missing some of the optimizations added to the hybrid version.

Non-hybrid builds are the default of the source code makefiles provided. In some cases, the use of the hybrid mode is required for external reasons. If there is a choice, the non-hybrid code may be faster. To use the non-hybrid code in a hybrid mode, use the threaded version of Intel MKL BLAS, link with a thread-safe MPI, and call function MPI_Init_thread() so as to indicate a need for MPI to be thread-safe.

Intel MKL also provides prebuilt binaries that are dynamically linked against Intel MPI libraries.

NOTE Performance of statically and dynamically linked prebuilt binaries may be different. The performance of both depends on the version of Intel MPI you are using. You can build binaries statically linked against a particular version of Intel MPI yourself.

Optimization Notice

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors.
Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804

Contents of the Intel® Optimized MP LINPACK Benchmark for Clusters

The Intel Optimized MP LINPACK Benchmark for Clusters (MP LINPACK Benchmark) includes the HPL 2.0 distribution in its entirety, as well as the modifications delivered in the files listed in the table below and located in the ./benchmarks/mp_linpack/ subdirectory of the Intel MKL directory.

Directory/File in ./benchmarks/mp_linpack/    Contents
testing/ptest/HPL_pdtest.c    HPL 2.0 code modified to display captured DGEMM information in ASYOUGO2_DISPLAY if it was captured (for details, see New Features).
src/blas/HPL_dgemm.c    HPL 2.0 code modified to capture DGEMM information, if desired, from ASYOUGO2_DISPLAY.
src/grid/HPL_grid_init.c    HPL 2.0 code modified to do additional grid experiments originally not in HPL 2.0.
src/pgesv/HPL_pdgesvK2.c    HPL 2.0 code modified to do ASYOUGO and ENDEARLY modifications.
src/pgesv/HPL_pdgesv0.c    HPL 2.0 code modified to do ASYOUGO, ASYOUGO2, and ENDEARLY modifications.
testing/ptest/HPL.dat    HPL 2.0 sample HPL.dat modified.
Make.ia32    (New) Sample architecture makefile for processors using the IA-32 architecture and Linux OS.
Make.intel64    (New) Sample architecture makefile for processors using the Intel® 64 architecture and Linux OS.
HPL.dat    A repeat of testing/ptest/HPL.dat in the top-level directory.

Prebuilt executables readily available for simple performance testing
bin_intel/ia32/xhpl_ia32    (New) Prebuilt binary for the IA-32 architecture and Linux OS. Statically linked against Intel® MPI 3.2.
bin_intel/ia32/xhpl_ia32_dynamic    (New) Prebuilt binary for the IA-32 architecture and Linux OS. Dynamically linked against Intel® MPI 3.2.
bin_intel/intel64/xhpl_intel64    (New) Prebuilt binary for the Intel® 64 architecture and Linux OS. Statically linked against Intel® MPI 3.2.
bin_intel/intel64/xhpl_intel64_dynamic    (New) Prebuilt binary for the Intel® 64 architecture and Linux OS. Dynamically linked against Intel® MPI 3.2.

Prebuilt hybrid executables
bin_intel/ia32/xhpl_hybrid_ia32    (New) Prebuilt hybrid binary for the IA-32 architecture and Linux OS. Statically linked against Intel® MPI 3.2.
bin_intel/ia32/xhpl_hybrid_ia32_dynamic    (New) Prebuilt hybrid binary for the IA-32 architecture and Linux OS. Dynamically linked against Intel® MPI 3.2.
bin_intel/intel64/xhpl_hybrid_intel64    (New) Prebuilt hybrid binary for the Intel® 64 architecture and Linux OS. Statically linked against Intel® MPI 3.2.
bin_intel/intel64/xhpl_hybrid_intel64_dynamic    (New) Prebuilt hybrid binary for the Intel® 64 architecture and Linux OS. Dynamically linked against Intel® MPI 3.2.

Prebuilt libraries
lib_hybrid/ia32/libhpl_hybrid.a    (New) Prebuilt library with the hybrid version of MP LINPACK for the IA-32 architecture and Intel MPI 3.2.
lib_hybrid/intel64/libhpl_hybrid.a    (New) Prebuilt library with the hybrid version of MP LINPACK for the Intel® 64 architecture and Intel MPI 3.2.

Files that refer to run scripts
bin_intel/ia32/runme_ia32    (New) Sample run script for the IA-32 architecture and a pure MPI binary statically linked against Intel MPI 3.2.
bin_intel/ia32/runme_ia32_dynamic    (New) Sample run script for the IA-32 architecture and a pure MPI binary dynamically linked against Intel MPI 3.2.
bin_intel/ia32/HPL_serial.dat    (New) Example of an MP LINPACK benchmark input file for a pure MPI binary and the IA-32 architecture.
bin_intel/ia32/runme_hybrid_ia32    (New) Sample run script for the IA-32 architecture and a hybrid binary statically linked against Intel MPI 3.2.
bin_intel/ia32/runme_hybrid_ia32_dynamic    (New) Sample run script for the IA-32 architecture and a hybrid binary dynamically linked against Intel MPI 3.2.
bin_intel/ia32/HPL_hybrid.dat    (New) Example of an MP LINPACK benchmark input file for a hybrid binary and the IA-32 architecture.
bin_intel/intel64/runme_intel64    (New) Sample run script for the Intel® 64 architecture and a pure MPI binary statically linked against Intel MPI 3.2.
bin_intel/intel64/runme_intel64_dynamic    (New) Sample run script for the Intel® 64 architecture and a pure MPI binary dynamically linked against Intel MPI 3.2.
bin_intel/intel64/HPL_serial.dat    (New) Example of an MP LINPACK benchmark input file for a pure MPI binary and the Intel® 64 architecture.
bin_intel/intel64/runme_hybrid_intel64    (New) Sample run script for the Intel® 64 architecture and a hybrid binary statically linked against Intel MPI 3.2.
bin_intel/intel64/runme_hybrid_intel64_dynamic    (New) Sample run script for the Intel® 64 architecture and a hybrid binary dynamically linked against Intel MPI 3.2.
bin_intel/intel64/HPL_hybrid.dat    (New) Example of an MP LINPACK benchmark input file for a hybrid binary and the Intel® 64 architecture.
nodeperf.c    (New) Sample utility that tests the DGEMM speed across the cluster.

See Also
High-level Directory Structure

Building the MP LINPACK

The MP LINPACK Benchmark contains a few sample architecture makefiles. You can edit them to fit your specific configuration. Specifically:
• Set TOPdir to the directory that MP LINPACK is being built in.
• You may set MPI variables, that is, MPdir, MPinc, and MPlib.
• Specify the location of Intel MKL and of the files to be used (LAdir, LAinc, LAlib).
• Adjust compiler and compiler/linker options.
• Specify the version of MP LINPACK you are going to build (hybrid or non-hybrid) by setting the version parameter for the make command.
For example:
    make arch=intel64 version=hybrid install

For some sample cases, like Linux systems based on the Intel® 64 architecture, the makefiles contain values that should be common. However, you need to be familiar with building an HPL and picking appropriate values for these variables.

New Features of Intel® Optimized MP LINPACK Benchmark

The toolset is basically identical to the HPL 2.0 distribution. There are a few changes that are optionally compiled in and disabled until you specifically request them. These new features are:

ASYOUGO: Provides non-intrusive performance information while runs proceed. There are only a few outputs and this information does not impact performance. This is especially useful because many runs can go for hours without any information.

ASYOUGO2: Provides slightly intrusive additional performance information by intercepting every DGEMM call.

ASYOUGO2_DISPLAY: Displays the performance of all the significant DGEMMs inside the run.

ENDEARLY: Displays a few performance hints and then terminates the run early.

FASTSWAP: Inserts the LAPACK-optimized DLASWP into HPL's code. You can experiment with this to determine best results.

HYBRID: Establishes the Hybrid OpenMP/MPI mode of MP LINPACK, providing the possibility to use threaded Intel MKL and prebuilt MP LINPACK hybrid libraries.

CAUTION Use this option only with an Intel compiler and the Intel® MPI library version 3.1 or higher. It is also recommended that you use the compiler version 10.0 or higher.

Benchmarking a Cluster

To benchmark a cluster, follow the sequence of steps below (some of them are optional). Pay special attention to the iterative steps 3 and 4. They make a loop that searches for HPL parameters (specified in HPL.dat) that enable you to reach the top performance of your cluster.

1. Install HPL and make sure HPL is functional on all the nodes.
2.
You may run nodeperf.c (included in the distribution) to see the performance of DGEMM on all the nodes. Compile nodeperf.c with your MPI and Intel MKL. For example:
    mpiicc -O3 nodeperf.c -L$MKLPATH $MKLPATH/libmkl_intel_lp64.a \
        -Wl,--start-group $MKLPATH/libmkl_sequential.a \
        $MKLPATH/libmkl_core.a -Wl,--end-group -lpthread
Launching nodeperf.c on all the nodes is especially helpful in a very large cluster. nodeperf enables quick identification of the potential problem spot without numerous small MP LINPACK runs around the cluster in search of the bad node. It goes through all the nodes, one at a time, and reports the performance of DGEMM followed by some host identifier. Therefore, the higher the DGEMM performance, the faster that node was performing.
3. Edit HPL.dat to fit your cluster needs. Read through the HPL documentation for ideas on this. Note, however, that you should use at least 4 nodes.
4. Make an HPL run, using compile options such as ASYOUGO, ASYOUGO2, or ENDEARLY to aid in your search. These options enable you to gain insight into the performance sooner than HPL would normally give this insight. When doing so, follow these recommendations:
• Use MP LINPACK, which is a patched version of HPL, to save time in the search. All performance intrusive features are compile-optional in MP LINPACK. That is, if you do not use the new options to reduce search time, these features are disabled. The primary purpose of the additions is to assist you in finding solutions. HPL requires a long time to search for many different parameters. In MP LINPACK, the goal is to get the best possible number. Given that the input is not fixed, there is a large parameter space you must search over. An exhaustive search of all possible inputs is impractically large even for a powerful cluster. MP LINPACK optionally prints information on performance as it proceeds. You can also terminate early.
• Save time by compiling with -DENDEARLY -DASYOUGO2 and using a negative threshold (do not use a negative threshold on the final run that you intend to submit as a Top500 entry). Set the threshold in line 13 of the HPL 2.0 input file HPL.dat.
• If you are going to run a problem to completion, do it with -DASYOUGO.
5. Using the quick performance feedback, return to step 3 and iterate until you are sure that the performance is as good as possible.

See Also
Options to Reduce Search Time

Options to Reduce Search Time

Running large problems to completion on large numbers of nodes can take many hours. The search space for MP LINPACK is also large: not only can you run a problem of any size, but you can also vary block sizes, grid layouts, lookahead steps, factorization methods, and so on. It can be a large waste of time to run a large problem to completion only to discover it ran 0.01% slower than your previous best problem.

Use the following options to reduce the search time:
• -DASYOUGO
• -DENDEARLY
• -DASYOUGO2

Use -DASYOUGO2 cautiously because it does have a marginal performance impact. To see DGEMM internal performance, compile with -DASYOUGO2 and -DASYOUGO2_DISPLAY. These options provide a lot of useful DGEMM performance information at the cost of around 0.2% performance loss.

If you want to use the old HPL, simply omit these options and recompile from scratch. To do this, try "make arch= clean_arch_all".

-DASYOUGO

-DASYOUGO gives performance data as the run proceeds. The performance always starts off higher and then drops because this actually happens in LU decomposition (a decomposition of a matrix into a product of a lower (L) and upper (U) triangular matrices). The ASYOUGO performance estimate is usually an overestimate (because the LU decomposition slows down as it goes), but it gets more accurate as the problem proceeds. The greater the lookahead step, the less accurate the first number may be.
ASYOUGO tries to estimate where one is in the LU decomposition that MP LINPACK performs and this is always an overestimate as compared to ASYOUGO2, which measures actually achieved DGEMM performance. Note that the ASYOUGO output is a subset of the information that ASYOUGO2 provides. So, refer to the description of the -DASYOUGO2 option below for the details of the output.

-DENDEARLY

-DENDEARLY terminates the problem after a few steps, so that you can set up 10 or 20 HPL runs without monitoring them, see how they all do, and then only run the fastest ones to completion. -DENDEARLY assumes -DASYOUGO. You do not need to define both, although it doesn't hurt. To avoid the residual check for a problem that terminates early, set the "threshold" parameter in HPL.dat to a negative number when testing ENDEARLY. It also sometimes gives a better picture to compile with -DASYOUGO2 when using -DENDEARLY.

Usage notes on -DENDEARLY follow:
• -DENDEARLY stops the problem after a few iterations of DGEMM on the block size (the bigger the block size, the further it gets). It prints only 5 or 6 "updates", whereas -DASYOUGO prints about 46 or so output elements before the problem completes.
• Performance for -DASYOUGO and -DENDEARLY always starts off at one speed, slowly increases, and then slows down toward the end (because that is what LU does). -DENDEARLY is likely to terminate before it starts to slow down.
• -DENDEARLY terminates the problem early with an HPL Error exit. It means that you need to ignore the missing residual results, which are wrong because the problem never completed. However, you can get an idea what the initial performance was, and if it looks good, then run the problem to completion without -DENDEARLY. To avoid the error check, you can set HPL's threshold parameter in HPL.dat to a negative number.
• Though -DENDEARLY terminates early, HPL treats the problem as completed and computes Gflop rating as though the problem ran to completion.
Ignore this erroneously high rating.
• The bigger the problem, the more closely the last update that -DENDEARLY returns approximates what happens when the problem runs to completion. -DENDEARLY is a poor approximation for small problems. For this reason, it is suggested that you use ENDEARLY in conjunction with ASYOUGO2, because ASYOUGO2 reports actual DGEMM performance, which can be a closer approximation to problems just starting.

-DASYOUGO2

-DASYOUGO2 gives detailed single-node DGEMM performance information. It captures all DGEMM calls (if you use Fortran BLAS) and records their data. Because of this, the routine has a marginal intrusive overhead. Unlike -DASYOUGO, which is quite non-intrusive, -DASYOUGO2 interrupts every DGEMM call to monitor its performance. Be aware of this overhead, although for big problems it is less than 0.1%.

Here is a sample ASYOUGO2 output. The first 3 non-intrusive numbers can also be found in ASYOUGO and ENDEARLY output, so it suffices to describe these numbers here:
    Col=001280 Fract=0.050 Mflops=42454.99 (DT=9.5 DF=34.1 DMF=38322.78)
The problem size was N=16000 with a block size of 128. After 10 blocks, that is, 1280 columns, an output was sent to the screen. Here, the fraction of columns completed is 1280/16000=0.08.
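To make the fields of the sample line easier to work with, the following Python sketch splits such a line into named numbers and recomputes the column fraction from Col and N. The helper names parse_asyougo2 and column_fraction are hypothetical, and the NAME=number layout is assumed from the single sample line above; actual output may differ in details.

```python
import re

# Sample line copied from the text; the field layout is assumed from this one example.
SAMPLE = "Col=001280 Fract=0.050 Mflops=42454.99 (DT=9.5 DF=34.1 DMF=38322.78)"


def parse_asyougo2(line):
    """Parse NAME=number pairs from one ASYOUGO2-style status line into floats."""
    pairs = re.findall(r"([A-Za-z]+)=(\d+(?:\.\d+)?)", line)
    return {name: float(value) for name, value in pairs}


def column_fraction(columns_done, n):
    """Fraction of the N matrix columns that have been processed so far."""
    return columns_done / n


if __name__ == "__main__":
    fields = parse_asyougo2(SAMPLE)
    print(fields)
    # N=16000 is the problem size quoted in the text for this sample.
    print(column_fraction(fields["Col"], 16000))
```

Collecting these values from several short runs into a table is one simple way to compare candidate HPL.dat settings side by side.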
Only up to 40 outputs are printed, at various places through the matrix decomposition: fractions 0.005 0.010 0.015 0.020 0.025 0.030 0.035 0.040 0.045 0.050 0.055 0.060 0.065 0.070 0.075 0.080 0.085 0.090 0.095 0.100 0.105 0.110 0.115 0.120 0.125 0.130 0.135 0.140 0.145 0.150 0.155 0.160 0.165 0.170 0.175 0.180 0.185 0.190 0.195 0.200 0.205 0.210 0.215 0.220 0.225 0.230 0.235 0.240 0.245 0.250 0.255 0.260 0.265 0.270 0.275 0.280 0.285 0.290 0.295 0.300 0.305 0.310 0.315 0.320 0.325 0.330 0.335 0.340 0.345 0.350 0.355 0.360 0.365 0.370 0.375 0.380 0.385 0.390 0.395 0.400 0.405 0.410 0.415 0.420 0.425 0.430 0.435 0.440 0.445 0.450 0.455 0.460 0.465 0.470 0.475 0.480 0.485 0.490 0.495 0.515 0.535 0.555 0.575 0.595 0.615 0.635 0.655 0.675 0.695 0.795 0.895.

However, this problem size is so small and the block size so big by comparison that as soon as it prints the value for 0.045, it was already through 0.08 fraction of the columns. On a really big problem, the fractional number will be more accurate. It never prints more than the 112 numbers above. So, smaller problems will have fewer than 112 updates, and the biggest problems will have precisely 112 updates.

Mflops is an estimate based on 1280 columns of LU being completed. However, with lookahead steps, sometimes that work is not actually completed when the output is made. Nevertheless, this is a good estimate for comparing identical runs.

The 3 numbers in parentheses are intrusive ASYOUGO2 add-ins. DT is the total time processor 0 has spent in DGEMM. DF is the number of billion operations that have been performed in DGEMM by one processor. Hence, the performance of processor 0 (in Gflops) in DGEMM is always DF/DT.
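As a small worked example of the DF/DT relation, the sketch below applies it to the values from the sample output line (DT=9.5, DF=34.1). The function name dgemm_gflops is hypothetical, and the sketch assumes DT is reported in seconds, which the text does not state explicitly.

```python
def dgemm_gflops(df_billion_ops, dt_seconds):
    """DGEMM rate of processor 0 in Gflops, computed as DF / DT.

    Assumes DF is in billions of operations and DT is in seconds.
    """
    if dt_seconds <= 0:
        raise ValueError("DT must be positive")
    return df_billion_ops / dt_seconds


if __name__ == "__main__":
    # Values taken from the sample ASYOUGO2 line: DT=9.5, DF=34.1.
    rate = dgemm_gflops(34.1, 9.5)
    print(f"{rate:.2f} Gflops")  # roughly 3.59
```

Under these assumptions the sample line corresponds to roughly 3.59 Gflops of DGEMM throughput on processor 0.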
Using the number of DGEMM flops as a basis instead of the number of LU flops, you get a lower bound on performance of the run by looking at DMF, which can be compared to Mflops above. (DMF uses the global LU time, but the DGEMM flops are computed under the assumption that the problem is evenly distributed amongst the nodes, as only HPL's node (0,0) returns any output.)

Note that when using the above performance monitoring tools to compare different HPL.dat input data sets, you should be aware that the pattern of performance drop-off that LU experiences is sensitive to some input data. For instance, when you try very small problems, the performance drop-off from the initial values to end values is very rapid. The larger the problem, the less the drop-off, and it is probably safe to use the first few performance values to estimate the difference between a problem size 700000 and 701000, for instance. Another factor that influences the performance drop-off is the grid dimensions (P and Q). For big problems, the performance tends to fall off less from the first few steps when P and Q are roughly equal in value. You can make use of a large number of parameters, such as broadcast types, and change them so that the final performance is determined very closely by the first few steps.

Using these tools will greatly increase the amount of data you can test.

See Also
Benchmarking a Cluster

Intel® Math Kernel Library Language Interfaces Support A

Language Interfaces Support, by Function Domain

The following table shows language interfaces that Intel® Math Kernel Library (Intel® MKL) provides for each function domain. However, Intel MKL routines can be called from other languages using mixed-language programming. See Mixed-language Programming with Intel® MKL for an example of how to call Fortran routines from C/C++.
Function Domain    FORTRAN 77 interface    Fortran 90/95 interface    C/C++ interface
Basic Linear Algebra Subprograms (BLAS)    Yes    Yes    via CBLAS
BLAS-like extension transposition routines    Yes    -    Yes
Sparse BLAS Level 1    Yes    Yes    via CBLAS
Sparse BLAS Level 2 and 3    Yes    Yes    Yes
LAPACK routines for solving systems of linear equations    Yes    Yes    Yes
LAPACK routines for solving least-squares problems, eigenvalue and singular value problems, and Sylvester's equations    Yes    Yes    Yes
Auxiliary and utility LAPACK routines    Yes    -    Yes
Parallel Basic Linear Algebra Subprograms (PBLAS)    Yes    -    -
ScaLAPACK routines    Yes    -    †
DSS/PARDISO* solvers    Yes    Yes    Yes
Other Direct and Iterative Sparse Solver routines    Yes    Yes    Yes
Vector Mathematical Library (VML) functions    Yes    Yes    Yes
Vector Statistical Library (VSL) functions    Yes    Yes    Yes
Fourier Transform functions (FFT)    -    Yes    Yes
Cluster FFT functions    -    Yes    Yes
Trigonometric Transform routines    -    Yes    Yes
Fast Poisson, Laplace, and Helmholtz Solver (Poisson Library) routines    -    Yes    Yes
Optimization (Trust-Region) Solver routines    Yes    Yes    Yes
Data Fitting functions    Yes    Yes    Yes
GMP* arithmetic functions ††    -    -    Yes
Support functions (including memory allocation)    Yes    Yes    Yes

† Supported using a mixed language programming call. See Intel® MKL Include Files for the respective header file.
†† GMP Arithmetic Functions are deprecated and will be removed in a future release.
Include Files

Function domain    Fortran Include Files    C/C++ Include Files
All function domains    mkl.fi    mkl.h
BLAS Routines    blas.f90, mkl_blas.fi    mkl_blas.h
BLAS-like Extension Transposition Routines    mkl_trans.fi    mkl_trans.h
CBLAS Interface to BLAS    -    mkl_cblas.h
Sparse BLAS Routines    mkl_spblas.fi    mkl_spblas.h
LAPACK Routines    lapack.f90, mkl_lapack.fi    mkl_lapack.h
C Interface to LAPACK    -    mkl_lapacke.h
ScaLAPACK Routines    -    mkl_scalapack.h
All Sparse Solver Routines    mkl_solver.f90    mkl_solver.h
  PARDISO    mkl_pardiso.f77, mkl_pardiso.f90    mkl_pardiso.h
  DSS Interface    mkl_dss.f77, mkl_dss.f90    mkl_dss.h
  RCI Iterative Solvers, ILU Factorization    mkl_rci.fi    mkl_rci.h
Optimization Solver Routines    mkl_rci.fi    mkl_rci.h
Vector Mathematical Functions    mkl_vml.f77, mkl_vml.f90    mkl_vml.h
Vector Statistical Functions    mkl_vsl.f77, mkl_vsl.f90    mkl_vsl_functions.h
Fourier Transform Functions    mkl_dfti.f90    mkl_dfti.h
Cluster Fourier Transform Functions    mkl_cdft.f90    mkl_cdft.h
Partial Differential Equations Support Routines
  Trigonometric Transforms    mkl_trig_transforms.f90    mkl_trig_transforms.h
  Poisson Solvers    mkl_poisson.f90    mkl_poisson.h
Data Fitting functions    mkl_df.f77, mkl_df.f90    mkl_df.h
GMP interface †    -    mkl_gmp.h
Support functions    mkl_service.f90, mkl_service.fi    mkl_service.h
Memory allocation routines    -    i_malloc.h
Intel MKL examples interface    -    mkl_example.h

† GMP Arithmetic Functions are deprecated and will be removed in a future release.

See Also
Language Interfaces Support, by Function Domain

Support for Third-Party Interfaces B

GMP* Functions

Intel® Math Kernel Library (Intel® MKL) implementation of GMP* arithmetic functions includes arbitrary precision arithmetic operations on integer numbers.
The interfaces of such functions fully match the GNU Multiple Precision* (GMP) Arithmetic Library. For specifications of these functions, please see http://software.intel.com/sites/products/documentation/hpc/mkl/gnump/index.htm.

NOTE Intel MKL GMP Arithmetic Functions are deprecated and will be removed in a future release.

If you currently use the GMP* library, you need to modify the include statements in your programs to reference mkl_gmp.h.

FFTW Interface Support

Intel® Math Kernel Library (Intel® MKL) offers two collections of wrappers for the FFTW interface (www.fftw.org). The wrappers are the superstructure of FFTW to be used for calling the Intel MKL Fourier transform functions. These collections correspond to the FFTW versions 2.x and 3.x and the Intel MKL versions 7.0 and later.

These wrappers enable using Intel MKL Fourier transforms to improve the performance of programs that use FFTW without changing the program source code. See the "FFTW Interface to Intel® Math Kernel Library" appendix in the Intel MKL Reference Manual for details on the use of the wrappers.

Important For ease of use, the FFTW3 interface is also integrated in Intel MKL.

Directory Structure in Detail C

Tables in this section show contents of the Intel® Math Kernel Library (Intel® MKL) architecture-specific directories.

Optimization Notice

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors.
Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804

Detailed Structure of the IA-32 Architecture Directories

Static Libraries in the lib/ia32 Directory

File    Contents

Interface layer
libmkl_intel.a    Interface library for the Intel compilers
libmkl_blas95.a    Fortran 95 interface library for BLAS for the Intel® Fortran compiler
libmkl_lapack95.a    Fortran 95 interface library for LAPACK for the Intel Fortran compiler
libmkl_gf.a    Interface library for the GNU* Fortran compiler

Threading layer
libmkl_intel_thread.a    Threading library for the Intel compilers
libmkl_gnu_thread.a    Threading library for the GNU Fortran and C compilers
libmkl_pgi_thread.a    Threading library for the PGI* compiler
libmkl_sequential.a    Sequential library

Computational layer
libmkl_core.a    Kernel library for the IA-32 architecture
libmkl_solver.a    Deprecated. Empty library for backward compatibility
libmkl_solver_sequential.a    Deprecated.
Empty library for backward compatibility
libmkl_scalapack_core.a    ScaLAPACK routines
libmkl_cdft_core.a    Cluster version of FFT functions

Run-time Libraries (RTL)
libmkl_blacs.a    BLACS routines supporting the following MPICH versions:
    • Myricom* MPICH version 1.2.5.10
    • ANL* MPICH version 1.2.5.2
libmkl_blacs_intelmpi.a    BLACS routines supporting Intel MPI and MPICH2
libmkl_blacs_intelmpi20.a    A soft link to lib/32/libmkl_blacs_intelmpi.a
libmkl_blacs_openmpi.a    BLACS routines supporting OpenMPI

Dynamic Libraries in the lib/ia32 Directory

File    Contents
libmkl_rt.so    Single Dynamic Library

Interface layer
libmkl_intel.so    Interface library for the Intel compilers
libmkl_gf.so    Interface library for the GNU Fortran compiler

Threading layer
libmkl_intel_thread.so    Threading library for the Intel compilers
libmkl_gnu_thread.so    Threading library for the GNU Fortran and C compilers
libmkl_pgi_thread.so    Threading library for the PGI* compiler
libmkl_sequential.so    Sequential library

Computational layer
libmkl_core.so    Library dispatcher for dynamic load of processor-specific kernel library
libmkl_def.so    Default kernel library (Intel® Pentium®, Pentium® Pro, Pentium® II, and Pentium® III processors)
libmkl_p4.so    Pentium® 4 processor kernel library
libmkl_p4p.so    Kernel library for the Intel® Pentium® 4 processor with Streaming SIMD Extensions 3 (SSE3), including Intel® Core™ Duo and Intel® Core™ Solo processors.
libmkl_p4m.so                     Kernel library for processors based on the Intel® Core™ microarchitecture (except Intel® Core™ Duo and Intel® Core™ Solo processors, for which mkl_p4p.so is intended)
libmkl_p4m3.so                    Kernel library for the Intel® Core™ i7 processors
libmkl_vml_def.so                 VML/VSL part of default kernel for old Intel® Pentium® processors
libmkl_vml_ia.so                  VML/VSL default kernel for newer Intel® architecture processors
libmkl_vml_p4.so                  VML/VSL part of Pentium® 4 processor kernel
libmkl_vml_p4m.so                 VML/VSL for processors based on the Intel® Core™ microarchitecture
libmkl_vml_p4m2.so                VML/VSL for 45nm Hi-k Intel® Core™2 and Intel Xeon® processor families
libmkl_vml_p4m3.so                VML/VSL for the Intel® Core™ i7 processors
libmkl_vml_p4p.so                 VML/VSL for Pentium® 4 processor with Streaming SIMD Extensions 3 (SSE3)
libmkl_vml_avx.so                 VML/VSL optimized for the Intel® Advanced Vector Extensions (Intel® AVX)
libmkl_scalapack_core.so          ScaLAPACK routines.
libmkl_cdft_core.so               Cluster version of FFT functions.

Run-time Libraries (RTL)
libmkl_blacs_intelmpi.so          BLACS routines supporting Intel MPI and MPICH2
locale/en_US/mkl_msg.cat          Catalog of Intel® Math Kernel Library (Intel® MKL) messages in English
locale/ja_JP/mkl_msg.cat          Catalog of Intel MKL messages in Japanese. Available only if the Intel® MKL package provides Japanese localization. Please see the Release Notes for this information

Detailed Structure of the Intel® 64 Architecture Directories

Static Libraries in the lib/intel64 Directory

File                              Contents

Interface layer
libmkl_intel_lp64.a               LP64 interface library for the Intel compilers
libmkl_intel_ilp64.a              ILP64 interface library for the Intel compilers
libmkl_intel_sp2dp.a              SP2DP interface library for the Intel compilers
libmkl_blas95_lp64.a              Fortran 95 interface library for BLAS for the Intel® Fortran compiler. Supports the LP64 interface
libmkl_blas95_ilp64.a             Fortran 95 interface library for BLAS for the Intel® Fortran compiler. Supports the ILP64 interface
libmkl_lapack95_lp64.a            Fortran 95 interface library for LAPACK for the Intel® Fortran compiler. Supports the LP64 interface
libmkl_lapack95_ilp64.a           Fortran 95 interface library for LAPACK for the Intel® Fortran compiler. Supports the ILP64 interface
libmkl_gf_lp64.a                  LP64 interface library for the GNU Fortran compilers
libmkl_gf_ilp64.a                 ILP64 interface library for the GNU Fortran compilers

Threading layer
libmkl_intel_thread.a             Threading library for the Intel compilers
libmkl_gnu_thread.a               Threading library for the GNU Fortran and C compilers
libmkl_pgi_thread.a               Threading library for the PGI compiler
libmkl_sequential.a               Sequential library

Computational layer
libmkl_core.a                     Kernel library for the Intel® 64 architecture
libmkl_solver_lp64.a              Deprecated. Empty library for backward compatibility
libmkl_solver_lp64_sequential.a   Deprecated. Empty library for backward compatibility
libmkl_solver_ilp64.a             Deprecated. Empty library for backward compatibility
libmkl_solver_ilp64_sequential.a  Deprecated. Empty library for backward compatibility
libmkl_scalapack_lp64.a           ScaLAPACK routine library supporting the LP64 interface
libmkl_scalapack_ilp64.a          ScaLAPACK routine library supporting the ILP64 interface
libmkl_cdft_core.a                Cluster version of FFT functions.

Run-time Libraries (RTL)
libmkl_blacs_lp64.a               LP64 version of BLACS routines supporting the following MPICH versions:
                                  • Myricom* MPICH version 1.2.5.10
                                  • ANL* MPICH version 1.2.5.2
libmkl_blacs_ilp64.a              ILP64 version of BLACS routines supporting the following MPICH versions:
                                  • Myricom* MPICH version 1.2.5.10
                                  • ANL* MPICH version 1.2.5.2
libmkl_blacs_intelmpi_lp64.a      LP64 version of BLACS routines supporting Intel MPI and MPICH2
libmkl_blacs_intelmpi_ilp64.a     ILP64 version of BLACS routines supporting Intel MPI and MPICH2
libmkl_blacs_intelmpi20_lp64.a    A soft link to lib/intel64/libmkl_blacs_intelmpi_lp64.a
libmkl_blacs_intelmpi20_ilp64.a   A soft link to lib/intel64/libmkl_blacs_intelmpi_ilp64.a
libmkl_blacs_openmpi_lp64.a       LP64 version of BLACS routines supporting OpenMPI.
libmkl_blacs_openmpi_ilp64.a      ILP64 version of BLACS routines supporting OpenMPI.
libmkl_blacs_sgimpt_lp64.a        LP64 version of BLACS routines supporting SGI MPT.
libmkl_blacs_sgimpt_ilp64.a       ILP64 version of BLACS routines supporting SGI MPT.
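
As a rough illustration of how the layers in the static library listing above combine, the sketch below prints one library name from each of the interface, threading, and computational layers for a chosen interface and threading mode. The function name mkl_static_layers and its two arguments are hypothetical and not part of Intel MKL; only the file names come from the listing, and the interface library for the Intel compilers is assumed. The output is not a complete link line, because compiler run-time and system libraries are still required.

```shell
#!/bin/sh
# Hypothetical helper (not part of Intel MKL): print one static library
# from each layer listed for the lib/intel64 directory.
#   $1 = interface: lp64 or ilp64 (Intel compiler interface assumed)
#   $2 = threading: intel, gnu, pgi, or sequential
mkl_static_layers() {
    case "$1" in
        lp64|ilp64) interface="libmkl_intel_${1}.a" ;;
        *) echo "unknown interface: $1" >&2; return 1 ;;
    esac
    case "$2" in
        intel|gnu|pgi) threading="libmkl_${2}_thread.a" ;;
        sequential)    threading="libmkl_sequential.a" ;;
        *) echo "unknown threading layer: $2" >&2; return 1 ;;
    esac
    # The computational layer has a single kernel library for Intel 64.
    echo "$interface $threading libmkl_core.a"
}
```

For instance, calling mkl_static_layers with the arguments ilp64 and sequential selects the ILP64 interface library together with the sequential library and the kernel library.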

Dynamic Libraries in the lib/intel64 Directory

File                              Contents

libmkl_rt.so                      Single Dynamic Library

Interface layer
libmkl_intel_lp64.so              LP64 interface library for the Intel compilers
libmkl_intel_ilp64.so             ILP64 interface library for the Intel compilers
libmkl_intel_sp2dp.so             SP2DP interface library for the Intel compilers
libmkl_gf_lp64.so                 LP64 interface library for the GNU Fortran compilers
libmkl_gf_ilp64.so                ILP64 interface library for the GNU Fortran compilers

Threading layer
libmkl_intel_thread.so            Threading library for the Intel compilers
libmkl_gnu_thread.so              Threading library for the GNU Fortran and C compilers
libmkl_pgi_thread.so              Threading library for the PGI* compiler
libmkl_sequential.so              Sequential library

Computational layer
libmkl_core.so                    Library dispatcher for dynamic load of processor-specific kernel
libmkl_def.so                     Default kernel library
libmkl_mc.so                      Kernel library for processors based on the Intel® Core™ microarchitecture
libmkl_mc3.so                     Kernel library for the Intel® Core™ i7 processors
libmkl_avx.so                     Kernel optimized for the Intel® Advanced Vector Extensions (Intel® AVX).
libmkl_vml_def.so                 VML/VSL part of default kernels
libmkl_vml_p4n.so                 VML/VSL for the Intel® Xeon® processor using the Intel® 64 architecture
libmkl_vml_mc.so                  VML/VSL for processors based on the Intel® Core™ microarchitecture
libmkl_vml_mc2.so                 VML/VSL for 45nm Hi-k Intel® Core™2 and Intel Xeon® processor families
libmkl_vml_mc3.so                 VML/VSL for the Intel® Core™ i7 processors
libmkl_vml_avx.so                 VML/VSL optimized for the Intel® Advanced Vector Extensions (Intel® AVX)
libmkl_scalapack_lp64.so          ScaLAPACK routine library supporting the LP64 interface
libmkl_scalapack_ilp64.so         ScaLAPACK routine library supporting the ILP64 interface
libmkl_cdft_core.so               Cluster version of FFT functions.

Run-time Libraries (RTL)
libmkl_intelmpi_lp64.so           LP64 version of BLACS routines supporting Intel MPI and MPICH2
libmkl_intelmpi_ilp64.so          ILP64 version of BLACS routines supporting Intel MPI and MPICH2
locale/en_US/mkl_msg.cat          Catalog of Intel® Math Kernel Library (Intel® MKL) messages in English
locale/ja_JP/mkl_msg.cat          Catalog of Intel MKL messages in Japanese. Available only if the Intel® MKL package provides Japanese localization. Please see the Release Notes for this information

Intel® Math Kernel Library for Mac OS* X User's Guide

Intel® MKL - Mac OS* X
Document Number: 315932-018US

Contents

Legal Information
Introducing the Intel® Math Kernel Library
Getting Help and Support
Notational Conventions

Chapter 1: Overview
    Document Overview
    What's New
    Related Information

Chapter 2: Getting Started
    Checking Your Installation
    Setting Environment Variables
    Compiler Support
    Using Code Examples
    What You Need to Know Before You Begin Using the Intel® Math Kernel Library

Chapter 3: Structure of the Intel® Math Kernel Library
    Architecture Support
    High-level Directory Structure
    Layered Model Concept
    Accessing the Intel® Math Kernel Library Documentation
    Contents of the Documentation Directories
    Viewing Man Pages

Chapter 4: Linking Your Application with the Intel® Math Kernel Library
    Linking Quick Start
    Using the -mkl Compiler Option
    Using the Single Dynamic Library
    Selecting Libraries to Link with
    Using the Link-line Advisor
    Using the Command-line Link Tool
    Linking Examples
    Linking on IA-32 Architecture Systems
    Linking on Intel(R) 64 Architecture Systems
    Linking in Detail
    Listing Libraries on a Link Line
    Dynamically Selecting the Interface and Threading Layer
    Linking with Interface Libraries
    Using the ILP64 Interface vs. LP64 Interface
    Linking with Fortran 95 Interface Libraries
    Linking with Threading Libraries
    Sequential Mode of the Library
    Selecting the Threading Layer
    Linking with Compiler Run-time Libraries
    Linking with System Libraries
    Building Custom Dynamically Linked Shared Libraries
    Using the Custom Dynamically Linked Shared Library Builder
    Composing a List of Functions
    Specifying Function Names
    Distributing Your Custom Dynamically Linked Shared Library

Chapter 5: Managing Performance and Memory
    Using Parallelism of the Intel® Math Kernel Library
    Threaded Functions and Problems
    Avoiding Conflicts in the Execution Environment
    Techniques to Set the Number of Threads
    Setting the Number of Threads Using an OpenMP* Environment Variable
    Changing the Number of Threads at Run Time
    Using Additional Threading Control
    Intel MKL-specific Environment Variables for Threading Control
    MKL_DYNAMIC
    MKL_DOMAIN_NUM_THREADS
    Setting the Environment Variables for Threading Control
    Tips and Techniques to Improve Performance
    Coding Techniques
    Hardware Configuration Tips
    Operating on Denormals
    FFT Optimized Radices
    Using Memory Management
    Intel MKL Memory Management Software
    Redefining Memory Functions

Chapter 6: Language-specific Usage Options
    Using Language-Specific Interfaces with Intel® Math Kernel Library
    Interface Libraries and Modules
    Fortran 95 Interfaces to LAPACK and BLAS
    Compiler-dependent Functions and Fortran 90 Modules
    Mixed-language Programming with the Intel Math Kernel Library
    Calling LAPACK, BLAS, and CBLAS Routines from C/C++ Language Environments
    Using Complex Types in C/C++
    Calling BLAS Functions that Return the Complex Values in C/C++ Code
    Support for Boost uBLAS Matrix-matrix Multiplication
    Invoking Intel MKL Functions from Java* Applications
    Intel MKL Java* Examples
    Running the Java* Examples
    Known Limitations of the Java* Examples

Chapter 7: Coding Tips
    Aligning Data for Consistent Results
    Using Predefined Preprocessor Symbols for Intel® MKL Version-Dependent Compilation

Chapter 8: Configuring Your Integrated Development Environment to Link with Intel Math Kernel Library
    Configuring the Apple Xcode* Developer Software to Link with Intel® Math Kernel Library

Chapter 9: Intel® Optimized LINPACK Benchmark for Mac OS* X
    Contents of the Intel® Optimized LINPACK Benchmark
    Running the Software
    Known Limitations of the Intel® Optimized LINPACK Benchmark

Appendix A: Intel® Math Kernel Library Language Interfaces Support
    Language Interfaces Support, by Function Domain
    Include Files

Appendix B: Support for Third-Party Interfaces
    GMP* Functions
    FFTW Interface Support

Appendix C: Directory Structure in Detail
    Static Libraries in the lib directory
    Dynamic Libraries in the lib directory

Legal Information

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR.

Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.

Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm

Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. Go to: http://www.intel.com/products/processor_number/

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

BlueMoon, BunnyPeople, Celeron, Celeron Inside, Centrino, Centrino Inside, Cilk, Core Inside, E-GOLD, i960, Intel, the Intel logo, Intel AppUp, Intel Atom, Intel Atom Inside, Intel Core, Intel Inside, Intel Insider, the Intel Inside logo, Intel NetBurst, Intel NetMerge, Intel NetStructure, Intel SingleDriver, Intel SpeedStep, Intel Sponsors of Tomorrow., the Intel Sponsors of Tomorrow. logo, Intel StrataFlash, Intel vPro, Intel XScale, InTru, the InTru logo, the InTru Inside logo, InTru soundmark, Itanium, Itanium Inside, MCS, MMX, Moblin, Pentium, Pentium Inside, Puma, skoool, the skoool logo, SMARTi, Sound Mark, The Creators Project, The Journey Inside, Thunderbolt, Ultrabook, vPro Inside, VTune, Xeon, Xeon Inside, X-GOLD, XMM, X-PMU and XPOSYS are trademarks of Intel Corporation in the U.S. and/or other countries.

*Other names and brands may be claimed as the property of others.

Java is a registered trademark of Oracle and/or its affiliates.

Copyright © 2007 - 2011, Intel Corporation. All rights reserved.

Optimization Notice

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

Introducing the Intel® Math Kernel Library

The Intel® Math Kernel Library (Intel® MKL) improves performance of scientific, engineering, and financial software that solves large computational problems. Among other functionality, Intel MKL provides linear algebra routines, fast Fourier transforms, as well as vectorized math and random number generation functions, all optimized for the latest Intel processors, including processors with multiple cores (see the Intel® MKL Release Notes for the full list of supported processors). Intel MKL also performs well on non-Intel processors.

Intel MKL is thread-safe and extensively threaded using the OpenMP* technology.

Intel MKL provides the following major functionality:

• Linear algebra, implemented in LAPACK (solvers and eigensolvers) plus level 1, 2, and 3 BLAS, offering the vector, vector-matrix, and matrix-matrix operations needed for complex mathematical software. If you prefer the FORTRAN 90/95 programming language, you can call LAPACK driver and computational subroutines through specially designed interfaces with reduced numbers of arguments.
A C interface to LAPACK is also available.
• ScaLAPACK (SCAlable LAPACK) with its support functionality including the Basic Linear Algebra Communications Subprograms (BLACS) and the Parallel Basic Linear Algebra Subprograms (PBLAS). ScaLAPACK is available for Intel MKL for Linux* and Windows* operating systems.
• Direct sparse solver, an iterative sparse solver, and a supporting set of sparse BLAS (level 1, 2, and 3) for solving sparse systems of equations.
• Multidimensional discrete Fourier transforms (1D, 2D, 3D) with a mixed radix support (for sizes not limited to powers of 2). Distributed versions of these functions are provided for use on clusters on the Linux* and Windows* operating systems.
• A set of vectorized transcendental functions called the Vector Math Library (VML). For most of the supported processors, the Intel MKL VML functions offer greater performance than the libm (scalar) functions, while keeping the same high accuracy.
• The Vector Statistical Library (VSL), which offers high performance vectorized random number generators for several probability distributions, convolution and correlation routines, and summary statistics functions.
• Data Fitting Library, which provides capabilities for spline-based approximation of functions, derivatives and integrals of functions, and search.

For details see the Intel® MKL Reference Manual.

Getting Help and Support

Intel provides a support web site that contains a rich repository of self-help information, including getting started tips, known product issues, product errata, license information, user forums, and more. Visit the Intel MKL support website at http://www.intel.com/software/products/support/.

Notational Conventions

The following term is used in reference to the operating system.

Mac OS* X
This term refers to information that is valid on all Intel®-based systems running the Mac OS* X operating system.

The following notations are used to refer to Intel MKL directories.

The installation directory for the Intel® C++ Composer XE or Intel® Fortran Composer XE.

The main directory where Intel MKL is installed: =/mkl. Replace this placeholder with the specific pathname in the configuring, linking, and building instructions.

The following font conventions are used in this document.

Italic
Italic is used for emphasis and also indicates document names in body text, for example: see Intel MKL Reference Manual.

Monospace lowercase mixed with uppercase
Indicates:
• Commands and command-line options, for example, icc myprog.c -L$MKLPATH -I$MKLINCLUDE -lmkl -liomp5 -lpthread
• Filenames, directory names, and pathnames, for example, /System/Library/Frameworks/JavaVM.framework/Versions/1.5/Home
• C/C++ code fragments, for example, a = new double [SIZE*SIZE];

UPPERCASE MONOSPACE
Indicates system variables, for example, $MKLPATH.

Monospace italic
Indicates a parameter in discussions, for example, lda. When enclosed in angle brackets, indicates a placeholder for an identifier, an expression, a string, a symbol, or a value, for example, .
Substitute one of these items for the placeholder.

[ items ]
Square brackets indicate that the items enclosed in brackets are optional.

{ item | item }
Braces indicate that only one of the items listed between braces should be selected. A vertical bar ( | ) separates the items.

Chapter 1: Overview

Document Overview

The Intel® Math Kernel Library (Intel® MKL) User's Guide provides usage information for the library. The usage information covers the organization, configuration, performance, and accuracy of Intel MKL, specifics of routine calls in mixed-language programming, linking, and more.

This guide describes OS-specific usage of Intel MKL, along with OS-independent features. The document contains usage information for all Intel MKL function domains.

This User's Guide provides the following information:

• Describes post-installation steps to help you start using the library
• Shows you how to configure the library with your development environment
• Acquaints you with the library structure
• Explains how to link your application with the library and provides simple usage scenarios
• Describes how to code, compile, and run your application with Intel MKL

This guide is intended for Mac OS X programmers with beginner to advanced experience in software development.

See Also
Language Interfaces Support, by Function Domain

What's New

This User's Guide documents the Intel® Math Kernel Library (Intel® MKL) 10.3 Update 8. The document was updated to reflect addition of Data Fitting Functions to the product.

Related Information

To reference how to use the library in your application, use this guide in conjunction with the following documents:

• The Intel® Math Kernel Library Reference Manual, which provides reference information on routine functionalities, parameter descriptions, interfaces, calling syntaxes, and return values.
• The Intel® Math Kernel Library for Mac OS* X Release Notes.

Chapter 2: Getting Started

Checking Your Installation

After installing the Intel® Math Kernel Library (Intel® MKL), verify that the library is properly installed and configured:

1. Intel MKL installs in . Check that the subdirectory of referred to as was created.
2. If you want to keep multiple versions of Intel MKL installed on your system, update your build scripts to point to the correct Intel MKL version.
3. Check that the following files appear in the /bin directory and its subdirectories:
   mklvars.sh
   mklvars.csh
   ia32/mklvars_ia32.sh
   ia32/mklvars_ia32.csh
   intel64/mklvars_intel64.sh
   intel64/mklvars_intel64.csh
   Use these files to assign Intel MKL-specific values to several environment variables, as explained in Setting Environment Variables.
4. To understand how the Intel MKL directories are structured, see Intel® Math Kernel Library Structure.
5. To make sure that Intel MKL runs on your system, launch an Intel MKL example, as explained in Using Code Examples.
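
The file check in step 3 can be scripted. The following is a minimal sketch, assuming a POSIX shell. The function name check_mklvars_scripts and its argument (the path to the Intel MKL directory that contains bin) are hypothetical; only the six file names come from the list above.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with Intel MKL): report which of the
# environment scripts listed in step 3 are missing.
#   $1 = path to the Intel MKL directory (the one that contains bin/)
check_mklvars_scripts() {
    mkl_dir=$1
    missing=0
    for script in \
        mklvars.sh mklvars.csh \
        ia32/mklvars_ia32.sh ia32/mklvars_ia32.csh \
        intel64/mklvars_intel64.sh intel64/mklvars_intel64.csh
    do
        if [ ! -f "$mkl_dir/bin/$script" ]; then
            echo "missing: $mkl_dir/bin/$script"
            missing=1
        fi
    done
    # Return non-zero if at least one script was not found.
    return "$missing"
}
```

Call the function with the path to your own Intel MKL directory. It prints nothing and returns zero when all six scripts are present.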

See Also
Notational Conventions

Setting Environment Variables

When the installation of Intel MKL for Mac OS* X is complete, set the INCLUDE, MKLROOT, DYLD_LIBRARY_PATH, MANPATH, LIBRARY_PATH, CPATH, FPATH, and NLSPATH environment variables in the command shell using one of the script files in the bin subdirectory of the Intel MKL installation directory.

Choose the script corresponding to your system architecture and command shell as explained in the following table:

Architecture            Shell    Script File
IA-32                   C        ia32/mklvars_ia32.csh
IA-32                   Bash     ia32/mklvars_ia32.sh
Intel® 64               C        intel64/mklvars_intel64.csh
Intel® 64               Bash     intel64/mklvars_intel64.sh
IA-32 and Intel® 64     C        mklvars.csh
IA-32 and Intel® 64     Bash     mklvars.sh

Running the Scripts

The scripts accept parameters to specify the following:
• Architecture.
• Addition of a path to Fortran 95 modules precompiled with the Intel® Fortran compiler to the FPATH environment variable. Supply this parameter only if you are using the Intel® Fortran compiler.
• Interface of the Fortran 95 modules. This parameter is needed only if you requested addition of a path to the modules.

Usage and values of these parameters depend on the script name (regardless of the extension). The following table lists values of the script parameters.

Script             Architecture (required,    Addition of a Path to Fortran 95    Interface (optional)
                   when applicable)           Modules (optional)
mklvars_ia32       n/a †                      mod                                 n/a
mklvars_intel64    n/a                        mod                                 lp64 (default), ilp64
mklvars            ia32, intel64              mod                                 lp64 (default), ilp64

† Not applicable.

For example:
• The command mklvars_ia32.sh sets environment variables for the IA-32 architecture and adds no path to the Fortran 95 modules.
• The command mklvars_intel64.sh mod ilp64 sets environment variables for the Intel® 64 architecture and adds the path to the Fortran 95 modules for the ILP64 interface to the FPATH environment variable.
• The command mklvars.sh intel64 mod sets environment variables for the Intel® 64 architecture and adds the path to the Fortran 95 modules for the LP64 interface to the FPATH environment variable.

NOTE Supply the parameter specifying the architecture first, if it is needed. Values of the other two parameters can be listed in any order.

See Also
High-level Directory Structure
Interface Libraries and Modules
Fortran 95 Interfaces to LAPACK and BLAS
Setting the Number of Threads Using an OpenMP* Environment Variable

Compiler Support

Intel MKL supports compilers identified in the Release Notes. However, the library has been successfully used with other compilers as well.

Intel MKL provides a set of include files to simplify program development by specifying enumerated values and prototypes for the respective functions. Calling Intel MKL functions from your application without an appropriate include file may lead to incorrect behavior of the functions.

See Also
Include Files

Using Code Examples

The Intel MKL package includes code examples, located in the examples subdirectory of the installation directory. Use the examples to determine:
• Whether Intel MKL is working on your system
• How you should call the library
• How to link the library

The examples are grouped in subdirectories mainly by Intel MKL function domains and programming languages. For example, the examples/spblas subdirectory contains a makefile to build the Sparse BLAS examples and the examples/vmlc subdirectory contains the makefile to build the C VML examples. Source code for the examples is in the next-level sources subdirectory.
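
Returning to the parameter rules given under Running the Scripts: the ordering rule (architecture first, then the optional values) can be captured in a few lines of shell. This is a minimal sketch under stated assumptions, not a tool shipped with Intel MKL. mklvars_command is a hypothetical helper name, it only composes and prints a command line for the generic mklvars.sh script, and it assumes that the interface value is meaningful only for intel64.

```shell
#!/bin/sh
# mklvars_command ARCH [mod [IFACE]]
# Hypothetical helper: prints an mklvars.sh command line that follows the
# documented ordering rule (architecture first). It does not run the script.
mklvars_command() {
    arch=$1
    mod=${2:-}
    iface=${3:-lp64}            # lp64 is the documented default interface

    case $arch in
        ia32|intel64) ;;
        *) echo "unknown architecture: $arch" >&2; return 1 ;;
    esac
    case $iface in
        lp64|ilp64) ;;
        *) echo "unknown interface: $iface" >&2; return 1 ;;
    esac

    cmd="mklvars.sh $arch"
    if [ "$mod" = mod ]; then
        cmd="$cmd mod"
        # Assumption: the interface is passed only for the Intel 64
        # architecture, where the table lists lp64 and ilp64.
        if [ "$arch" = intel64 ]; then
            cmd="$cmd $iface"
        fi
    fi
    echo "$cmd"
}
```

For instance, mklvars_command intel64 mod ilp64 prints mklvars.sh intel64 mod ilp64. The sketch always spells out lp64 where the guide's own example relies on the default, which is equivalent under the table above.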

See Also
High-level Directory Structure

What You Need to Know Before You Begin Using the Intel® Math Kernel Library

Target platform
    Identify the architecture of your target machine:
    • IA-32 or compatible
    • Intel® 64 or compatible
    Reason: See Linking Examples. To configure your development environment for use with Intel MKL, set your environment variables using the script corresponding to your architecture (see Setting Environment Variables for details).

Mathematical problem
    Identify all Intel MKL function domains that you require:
    • BLAS
    • Sparse BLAS
    • LAPACK
    • Sparse Solver routines
    • Vector Mathematical Library functions (VML)
    • Vector Statistical Library functions
    • Fourier Transform functions (FFT)
    • Trigonometric Transform routines
    • Poisson, Laplace, and Helmholtz Solver routines
    • Optimization (Trust-Region) Solver routines
    • Data Fitting Functions
    • GMP* arithmetic functions (deprecated and will be removed in a future release)
    Reason: The function domain you intend to use narrows the search in the Reference Manual for specific routines you need. Coding tips may also depend on the function domain (see Tips and Techniques to Improve Performance).

Programming language
    Intel MKL provides support for both Fortran and C/C++ programming. Identify the language interfaces that your function domains support (see Intel® Math Kernel Library Language Interfaces Support).
    Reason: Intel MKL provides language-specific include files for each function domain to simplify program development (see Language Interfaces Support, by Function Domain). For a list of language-specific interface libraries and modules and an example of how to generate them, see also Using Language-Specific Interfaces with Intel® Math Kernel Library.

Range of integer data
    If your system is based on the Intel 64 architecture, identify whether your application performs calculations with large data arrays (of more than 2^31 - 1 elements).
    Reason: To operate on large data arrays, you need to select the ILP64 interface, where integers are 64-bit; otherwise, use the default, LP64, interface, where integers are 32-bit (see Using the ILP64 Interface vs. LP64 Interface).

Threading model
    Identify whether and how your application is threaded:
    • Threaded with the Intel compiler
    • Threaded with a third-party compiler
    • Not threaded
    Reason: The compiler you use to thread your application determines which threading library you should link with your application. For applications threaded with a third-party compiler you may need to use Intel MKL in the sequential mode (for more information, see Sequential Mode of the Library and Linking with Threading Libraries).

Number of threads
    Determine the number of threads you want Intel MKL to use.
    Reason: Intel MKL is based on OpenMP* threading. By default, the OpenMP* software sets the number of threads that Intel MKL uses. If you need a different number, you have to set it yourself using one of the available mechanisms. For more information, see Using Parallelism of the Intel® Math Kernel Library.

Linking model
    Decide which linking model is appropriate for linking your application with Intel MKL libraries:
    • Static
    • Dynamic
    Reason: The link line syntax and libraries for static and dynamic linking are different. For the list of link libraries for static and dynamic models, linking examples, and other relevant topics, like how to save disk space by creating a custom dynamic library, see Linking Your Application with the Intel® Math Kernel Library.

3 Structure of the Intel® Math Kernel Library

Architecture Support

Intel® Math Kernel Library (Intel® MKL) for Mac OS* X supports the IA-32, Intel® 64, and compatible architectures in its universal libraries, located in the /lib directory.

NOTE Universal libraries contain both 32-bit and 64-bit code. If these libraries are used for linking, the linker dispatches appropriate code as follows:
• A 32-bit linker dispatches 32-bit code and creates 32-bit executable files.
• A 64-bit linker dispatches 64-bit code and creates 64-bit executable files.

See Also
High-level Directory Structure
Directory Structure in Detail

High-level Directory Structure

Directory                 Contents
                          Installation directory of the Intel® Math Kernel Library (Intel® MKL)
Subdirectories of
bin/                      Scripts to set environment variables in the user shell
bin/ia32                  Shell scripts for the IA-32 architecture
bin/intel64               Shell scripts for the Intel® 64 architecture
benchmarks/linpack        Shared-Memory (SMP) version of LINPACK benchmark
examples                  Examples directory.
                          Each subdirectory has source and data files
include                   INCLUDE files for the library routines, as well as for tests and examples
include/ia32              Fortran 95 .mod files for the IA-32 architecture and Intel® Fortran compiler
include/intel64/lp64      Fortran 95 .mod files for the Intel® 64 architecture, Intel® Fortran compiler, and LP64 interface
include/intel64/ilp64     Fortran 95 .mod files for the Intel® 64 architecture, Intel® Fortran compiler, and ILP64 interface
include/fftw              Header files for the FFTW2 and FFTW3 interfaces
interfaces/blas95         Fortran 95 interfaces to BLAS and a makefile to build the library
interfaces/fftw2xc        FFTW 2.x interfaces to Intel MKL FFTs (C interface)
interfaces/fftw2xf        FFTW 2.x interfaces to Intel MKL FFTs (Fortran interface)
interfaces/fftw3xc        FFTW 3.x interfaces to Intel MKL FFTs (C interface)
interfaces/fftw3xf        FFTW 3.x interfaces to Intel MKL FFTs (Fortran interface)
interfaces/lapack95       Fortran 95 interfaces to LAPACK and a makefile to build the library
lib                       Universal static libraries and shared objects for the IA-32 and Intel® 64 architectures
tests                     Source and data files for tests
tools                     Tools and plug-ins
tools/builder             Tools for creating custom dynamically linkable libraries
tools/plugins/com.intel.mkl.help
                          Eclipse* IDE plug-in with Intel MKL Reference Manual in WebHelp format. See mkl_documentation.htm for more information
Subdirectories of
Documentation/en_US/mkl   Intel MKL documentation
man/en_US/man3            Man pages for Intel MKL functions

See Also
Notational Conventions

Layered Model Concept

Intel MKL is structured to support multiple compilers and interfaces, different OpenMP* implementations, both serial and multiple threads, and a wide range of processors. Conceptually Intel MKL can be divided into distinct parts to support different interfaces, threading models, and core computations:

1. Interface Layer
2. Threading Layer
3. Computational Layer

You can combine Intel MKL libraries to meet your needs by linking with one library in each part layer by layer. Once the interface library is selected, the threading library you select picks up the chosen interface, and the computational library uses interfaces and OpenMP implementation (or non-threaded mode) chosen in the first two layers.

To support threading with different compilers, one more layer is needed, which contains libraries not included in Intel MKL:
• Compiler run-time libraries (RTL).

The following table provides more details of each layer.

Layer    Description

Interface Layer
    This layer matches compiled code of your application with the threading and/or computational parts of the library. This layer provides:
    • LP64 and ILP64 interfaces.
    • Compatibility with compilers that return function values differently.
    • A mapping between single-precision names and double-precision names for applications using Cray*-style naming (SP2DP interface).
    SP2DP interface supports Cray-style naming in applications targeted for the Intel 64 architecture and using the ILP64 interface. SP2DP interface provides a mapping between single-precision names (for both real and complex types) in the application and double-precision names in Intel MKL BLAS and LAPACK. Function names are mapped as shown in the following example for BLAS functions ?GEMM:
        SGEMM -> DGEMM
        DGEMM -> DGEMM
        CGEMM -> ZGEMM
        ZGEMM -> ZGEMM
    Note that no changes are made to double-precision names.

Threading Layer
    This layer:
    • Provides a way to link threaded Intel MKL with different threading compilers.
    • Enables you to link with a threaded or sequential mode of the library.
    This layer is compiled for different environments (threaded or sequential) and compilers (from Intel, GNU*).

Computational Layer
    This layer is the heart of Intel MKL. It has only one library for each combination of architecture and supported OS.
    The Computational layer accommodates multiple architectures through identification of architecture features and chooses the appropriate binary code at run time.

Compiler Run-time Libraries (RTL)
    To support threading with Intel compilers, Intel MKL uses RTLs of the Intel® C++ Composer XE or Intel® Fortran Composer XE. To thread using third-party threading compilers, use libraries in the Threading layer or an appropriate compatibility library.

See Also
Using the ILP64 Interface vs. LP64 Interface
Linking Your Application with the Intel® Math Kernel Library
Linking with Threading Libraries

Accessing the Intel® Math Kernel Library Documentation

Contents of the Documentation Directories

Most of Intel MKL documentation is installed at /Documentation//mkl. For example, the documentation in English is installed at /Documentation/en_US/mkl. However, some Intel MKL-related documents are installed one or two levels up. The following table lists MKL-related documentation.

File name                     Comment
Files in /Documentation
/clicense or /flicense        Common end user license for the Intel® C++ Composer XE 2011 or Intel® Fortran Composer XE 2011, respectively
mklsupport.txt                Information on package number for customer support reference
Contents of /Documentation//mkl
redist.txt                    List of redistributable files
mkl_documentation.htm         Overview and links for the Intel MKL documentation
mkl_manual/index.htm          Intel MKL Reference Manual in an uncompressed HTML format
Release_Notes.htm             Intel MKL Release Notes
mkl_userguide/index.htm       Intel MKL User's Guide in an uncompressed HTML format, this document
mkl_link_line_advisor.htm     Intel MKL Link-line Advisor

Viewing Man Pages

To access Intel MKL man pages, add the man pages directory to the MANPATH environment variable. If you performed the Setting Environment Variables step of the Getting Started process, this is done automatically.
To view the man page for an Intel MKL function, enter the man command followed by the function name in your command shell. In this release, the name to enter is the function name with omitted prefixes denoting data type, task type, or any other field that may vary for this function.

Examples:
• For the BLAS function ddot, enter man dot
• For the statistical function vslConvSetMode, enter man vslSetMode
• For the VML function vdPackM, enter man vPack
• For the FFT function DftiCommitDescriptor, enter man DftiCommitDescriptor

NOTE Function names in the man command are case-sensitive.

See Also
High-level Directory Structure
Setting Environment Variables

4 Linking Your Application with the Intel® Math Kernel Library

Linking Quick Start

Intel® Math Kernel Library (Intel® MKL) provides several options for quick linking of your application, which depend on the way you link:

Using the Intel® Composer XE compiler
    see Using the -mkl Compiler Option.
Explicit dynamic linking
    see Using the Single Dynamic Library for how to simplify your link line.
Explicitly listing libraries on your link line
    see Selecting Libraries to Link with for a summary of the libraries.
Using an interactive interface
    see Using the Link-line Advisor to determine libraries and options to specify on your link or compilation line.
Using an internally provided tool
    see Using the Command-line Link Tool to determine libraries, options, and environment variables or even compile and build your application.

Using the -mkl Compiler Option

The Intel® Composer XE compiler supports the following variants of the -mkl compiler option:

-mkl or -mkl=parallel    to link with standard threaded Intel MKL.
-mkl=sequential          to link with sequential version of Intel MKL.
-mkl=cluster             to link with Intel MKL cluster components (sequential) that use Intel MPI.

For more information on the -mkl compiler option, see the Intel Compiler User and Reference Guides.

On Intel® 64 architecture systems, for each variant of the -mkl option, the compiler links your application using the LP64 interface.

If you specify any variant of the -mkl compiler option, the compiler automatically includes the Intel MKL libraries. In cases not covered by the option, use the Link-line Advisor or see Linking in Detail.

See Also
Listing Libraries on a Link Line
Using the ILP64 Interface vs. LP64 Interface
Using the Link-line Advisor
Intel® Software Documentation Library

Using the Single Dynamic Library

You can simplify your link line through the use of the Intel MKL Single Dynamic Library (SDL). To use SDL, place libmkl_rt.dylib on your link line. For example:

    ic? application.c -lmkl_rt

SDL enables you to select the interface and threading library for Intel MKL at run time. By default, linking with SDL provides:
• LP64 interface on systems based on the Intel® 64 architecture
• Intel threading

To use other interfaces or change threading preferences, including use of the sequential version of Intel MKL, you need to specify your choices using functions or environment variables as explained in section Dynamically Selecting the Interface and Threading Layer.
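
The run-time selection mentioned above can be driven from the shell. The following is a minimal sketch that relies on the MKL_INTERFACE_LAYER and MKL_THREADING_LAYER environment variables and their values as listed in Dynamically Selecting the Interface and Threading Layer. sdl_env is a hypothetical helper name; it only prints export statements and does not start any program.

```shell
#!/bin/sh
# sdl_env [IFACE [THREADING]]
# Hypothetical helper: prints export statements for an application linked
# with -lmkl_rt. Defaults mirror the documented SDL defaults
# (LP64 interface, Intel threading).
sdl_env() {
    iface=${1:-LP64}
    threading=${2:-INTEL}

    case $iface in
        LP64|ILP64) ;;          # ILP64 is available on Intel 64 systems
        *) echo "unknown interface layer: $iface" >&2; return 1 ;;
    esac
    case $threading in
        INTEL|SEQUENTIAL) ;;
        *) echo "unknown threading layer: $threading" >&2; return 1 ;;
    esac

    echo "export MKL_INTERFACE_LAYER=$iface"
    echo "export MKL_THREADING_LAYER=$threading"
}
```

A typical use would be eval "$(sdl_env ILP64 SEQUENTIAL)" before launching the application. Keep in mind that, per the same section, a call to mkl_set_interface_layer or mkl_set_threading_layer inside the program makes the corresponding environment variable ignored.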

Selecting Libraries to Link with

To link with Intel MKL:
• Choose one library from the Interface layer and one library from the Threading layer
• Add the only library from the Computational layer and run-time libraries (RTL)

The following table lists Intel MKL libraries to link with your application.

                                           Interface layer            Threading layer              Computational layer    RTL
IA-32 architecture, static linking         libmkl_intel.a             libmkl_intel_thread.a        libmkl_core.a          libiomp5.dylib
IA-32 architecture, dynamic linking        libmkl_intel.dylib         libmkl_intel_thread.dylib    libmkl_core.dylib      libiomp5.dylib
Intel® 64 architecture, static linking     libmkl_intel_lp64.a        libmkl_intel_thread.a        libmkl_core.a          libiomp5.dylib
Intel® 64 architecture, dynamic linking    libmkl_intel_lp64.dylib    libmkl_intel_thread.dylib    libmkl_core.dylib      libiomp5.dylib

The Single Dynamic Library (SDL) automatically links interface, threading, and computational libraries and thus simplifies linking. The following table lists Intel MKL libraries for dynamic linking using SDL. See Dynamically Selecting the Interface and Threading Layer for how to set the interface and threading layers at run time through function calls or environment settings.

                                        SDL                RTL
IA-32 and Intel® 64 architectures       libmkl_rt.dylib    libiomp5.dylib †

† Use the Link-line Advisor to check whether you need to explicitly link the libiomp5.dylib RTL.

For exceptions and alternatives to the libraries listed above, see Linking in Detail.

See Also
Layered Model Concept
Using the Link-line Advisor
Using the -mkl Compiler Option

Using the Link-line Advisor

Use the Intel MKL Link-line Advisor to determine the libraries and options to specify on your link or compilation line.

The latest version of the tool is available at http://software.intel.com/en-us/articles/intel-mkl-link-lineadvisor. The tool is also available in the product.
The Advisor requests information about your system and on how you intend to use Intel MKL (link dynamically or statically, use threaded or sequential mode, etc.). The tool automatically generates the appropriate link line for your application.

See Also
Contents of the Documentation Directories

Using the Command-line Link Tool

Use the command-line Link tool provided by Intel MKL to simplify building your application with Intel MKL.

The tool not only provides the options, libraries, and environment variables to use, but also performs compilation and building of your application.

The tool mkl_link_tool is installed in the /tools directory.

See the knowledge base article at http://software.intel.com/en-us/articles/mkl-command-line-link-tool for more information.

Linking Examples

See Also
Using the Link-line Advisor

Linking on IA-32 Architecture Systems

The following examples illustrate linking that uses Intel(R) compilers.

The examples use the .f Fortran source file. C/C++ users should instead specify a .cpp (C++) or .c (C) file and replace ifort with icc.

NOTE If you successfully completed the Setting Environment Variables step of the Getting Started process, you can omit -I$MKLINCLUDE in all the examples and omit -L$MKLPATH in the examples for dynamic linking.

In these examples, MKLPATH=$MKLROOT/lib, MKLINCLUDE=$MKLROOT/include:

• Static linking of myprog.f and parallel Intel MKL:
    ifort myprog.f -L$MKLPATH -I$MKLINCLUDE $MKLPATH/libmkl_intel.a $MKLPATH/libmkl_intel_thread.a $MKLPATH/libmkl_core.a $MKLPATH/libmkl_intel.a $MKLPATH/libmkl_intel_thread.a $MKLPATH/libmkl_core.a -liomp5 -lpthread
• Dynamic linking of myprog.f and parallel Intel MKL:
    ifort myprog.f -L$MKLPATH -I$MKLINCLUDE -lmkl_intel -lmkl_intel_thread -lmkl_core -liomp5 -lpthread
• Static linking of myprog.f and sequential version of Intel MKL:
    ifort myprog.f -L$MKLPATH -I$MKLINCLUDE $MKLPATH/libmkl_intel.a $MKLPATH/libmkl_sequential.a $MKLPATH/libmkl_core.a $MKLPATH/libmkl_intel.a $MKLPATH/libmkl_sequential.a $MKLPATH/libmkl_core.a -lpthread
• Dynamic linking of myprog.f and sequential version of Intel MKL:
    ifort myprog.f -L$MKLPATH -I$MKLINCLUDE -lmkl_intel -lmkl_sequential -lmkl_core -lpthread
• Dynamic linking of user code myprog.f and parallel or sequential Intel MKL (call the mkl_set_threading_layer function or set the value of the MKL_THREADING_LAYER environment variable to choose threaded or sequential mode):
    ifort myprog.f -lmkl_rt
• Static linking of myprog.f, Fortran 95 LAPACK interface, and parallel Intel MKL:
    ifort myprog.f -L$MKLPATH -I$MKLINCLUDE -I$MKLINCLUDE/ia32 -lmkl_lapack95 $MKLPATH/libmkl_intel.a $MKLPATH/libmkl_intel_thread.a $MKLPATH/libmkl_core.a $MKLPATH/libmkl_intel_thread.a $MKLPATH/libmkl_core.a $MKLPATH/libmkl_intel_thread.a $MKLPATH/libmkl_core.a -liomp5 -lpthread
• Static linking of myprog.f, Fortran 95 BLAS interface, and parallel Intel MKL:
    ifort myprog.f -L$MKLPATH -I$MKLINCLUDE -I$MKLINCLUDE/ia32 -lmkl_blas95 $MKLPATH/libmkl_intel.a $MKLPATH/libmkl_intel_thread.a $MKLPATH/libmkl_core.a $MKLPATH/libmkl_intel.a $MKLPATH/libmkl_intel_thread.a $MKLPATH/libmkl_core.a -liomp5 -lpthread

See Also
Fortran 95 Interfaces to LAPACK and BLAS

Linking on Intel(R) 64 Architecture Systems

The following examples illustrate linking that uses Intel(R) compilers.

The examples use the .f Fortran source file. C/C++ users should instead specify a .cpp (C++) or .c (C) file and replace ifort with icc.

NOTE If you successfully completed the Setting Environment Variables step of the Getting Started process, you can omit -I$MKLINCLUDE in all the examples and omit -L$MKLPATH in the examples for dynamic linking.

In these examples, MKLPATH=$MKLROOT/lib, MKLINCLUDE=$MKLROOT/include:

• Static linking of myprog.f and parallel Intel MKL supporting the LP64 interface:
    ifort myprog.f -L$MKLPATH -I$MKLINCLUDE $MKLPATH/libmkl_intel_lp64.a $MKLPATH/libmkl_intel_thread.a $MKLPATH/libmkl_core.a $MKLPATH/libmkl_intel_lp64.a $MKLPATH/libmkl_intel_thread.a $MKLPATH/libmkl_core.a -liomp5 -lpthread
• Dynamic linking of myprog.f and parallel Intel MKL supporting the LP64 interface:
    ifort myprog.f -L$MKLPATH -I$MKLINCLUDE -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread
• Static linking of myprog.f and sequential version of Intel MKL supporting the LP64 interface:
    ifort myprog.f -L$MKLPATH -I$MKLINCLUDE $MKLPATH/libmkl_intel_lp64.a $MKLPATH/libmkl_sequential.a $MKLPATH/libmkl_core.a $MKLPATH/libmkl_intel_lp64.a $MKLPATH/libmkl_sequential.a $MKLPATH/libmkl_core.a -lpthread
• Dynamic linking of myprog.f and sequential version of Intel MKL supporting the LP64 interface:
    ifort myprog.f -L$MKLPATH -I$MKLINCLUDE -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lpthread
• Static linking of myprog.f and parallel Intel MKL supporting the ILP64 interface:
    ifort myprog.f -L$MKLPATH -I$MKLINCLUDE $MKLPATH/libmkl_intel_ilp64.a $MKLPATH/libmkl_intel_thread.a $MKLPATH/libmkl_core.a $MKLPATH/libmkl_intel_ilp64.a $MKLPATH/libmkl_intel_thread.a $MKLPATH/libmkl_core.a -liomp5 -lpthread
• Dynamic linking of myprog.f and parallel Intel MKL supporting the ILP64 interface:
    ifort myprog.f -L$MKLPATH -I$MKLINCLUDE -lmkl_intel_ilp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread
• Dynamic linking of user code myprog.f and parallel or sequential Intel MKL (call appropriate functions or set environment variables to choose threaded or sequential mode and to set the interface):
    ifort myprog.f -lmkl_rt
• Static linking of myprog.f, Fortran 95 LAPACK interface, and parallel Intel MKL supporting the LP64 interface:
    ifort myprog.f -L$MKLPATH -I$MKLINCLUDE -I$MKLINCLUDE/intel64/lp64 -lmkl_lapack95_lp64 $MKLPATH/libmkl_intel_lp64.a $MKLPATH/libmkl_intel_thread.a $MKLPATH/libmkl_core.a $MKLPATH/libmkl_intel_thread.a $MKLPATH/libmkl_core.a $MKLPATH/libmkl_intel_thread.a $MKLPATH/libmkl_core.a -liomp5 -lpthread
• Static linking of myprog.f, Fortran 95 BLAS interface, and parallel Intel MKL supporting the LP64 interface:
    ifort myprog.f -L$MKLPATH -I$MKLINCLUDE -I$MKLINCLUDE/intel64/lp64 -lmkl_blas95_lp64 $MKLPATH/libmkl_intel_lp64.a $MKLPATH/libmkl_intel_thread.a $MKLPATH/libmkl_core.a $MKLPATH/libmkl_intel_lp64.a $MKLPATH/libmkl_intel_thread.a $MKLPATH/libmkl_core.a -liomp5 -lpthread

See Also
Fortran 95 Interfaces to LAPACK and BLAS

Linking in Detail

This section recommends which libraries to link with depending on your Intel MKL usage scenario and provides details of the linking.

Listing Libraries on a Link Line

To link with Intel MKL, specify paths and libraries on the link line as shown below.

NOTE The syntax below is for dynamic linking. For static linking, replace each library name preceded with "-l" with the path to the library file. For example, replace -lmkl_core with $MKLPATH/libmkl_core.a, where $MKLPATH is the appropriate user-defined environment variable.

    -L -I
    [-I/{ia32|intel64|{ilp64|lp64}}]
    [-lmkl_blas{95|95_ilp64|95_lp64}]
    [-lmkl_lapack{95|95_ilp64|95_lp64}]
    -lmkl_{intel|intel_ilp64|intel_lp64}
    -lmkl_{intel_thread|sequential}
    -lmkl_core
    -liomp5 [-lpthread] [-lm]

In case of static linking, for all components except BLAS and FFT, repeat interface, threading, and computational libraries two times (for example, libmkl_intel_ilp64.a libmkl_intel_thread.a libmkl_core.a libmkl_intel_ilp64.a libmkl_intel_thread.a libmkl_core.a). For the LAPACK component, repeat the threading and computational libraries three times.

The order of listing libraries on the link line is essential.

See Also
Using the Link-line Advisor
Linking Examples

Dynamically Selecting the Interface and Threading Layer

The Single Dynamic Library (SDL) enables you to dynamically select the interface and threading layer for Intel MKL.

Setting the Interface Layer

Available interfaces depend on the architecture of your system.

On systems based on the Intel® 64 architecture, LP64 and ILP64 interfaces are available. To set one of these interfaces at run time, use the mkl_set_interface_layer function or the MKL_INTERFACE_LAYER environment variable. The following table provides values to be used to set each interface.

Interface Layer    Value of MKL_INTERFACE_LAYER    Value of the Parameter of mkl_set_interface_layer
LP64               LP64                            MKL_INTERFACE_LP64
ILP64              ILP64                           MKL_INTERFACE_ILP64

If the mkl_set_interface_layer function is called, the environment variable MKL_INTERFACE_LAYER is ignored.

By default the LP64 interface is used.

See the Intel MKL Reference Manual for details of the mkl_set_interface_layer function.

Setting the Threading Layer

To set the threading layer at run time, use the mkl_set_threading_layer function or the MKL_THREADING_LAYER environment variable. The following table lists available threading layers along with the values to be used to set each layer.

Threading Layer                 Value of MKL_THREADING_LAYER    Value of the Parameter of mkl_set_threading_layer
Intel threading                 INTEL                           MKL_THREADING_INTEL
Sequential mode of Intel MKL    SEQUENTIAL                      MKL_THREADING_SEQUENTIAL

If the mkl_set_threading_layer function is called, the environment variable MKL_THREADING_LAYER is ignored.

By default Intel threading is used.

See the Intel MKL Reference Manual for details of the mkl_set_threading_layer function.

See Also
Using the Single Dynamic Library
Layered Model Concept
Directory Structure in Detail

Linking with Interface Libraries

Using the ILP64 Interface vs. LP64 Interface

The Intel MKL ILP64 libraries use the 64-bit integer type (necessary for indexing large arrays, with more than 2^31 - 1 elements), whereas the LP64 libraries index arrays with the 32-bit integer type.

The LP64 and ILP64 interfaces are implemented in the Interface layer. Link with the following interface libraries for the LP64 or ILP64 interface, respectively:
• libmkl_intel_lp64.a or libmkl_intel_ilp64.a for static linking
• libmkl_intel_lp64.dylib or libmkl_intel_ilp64.dylib for dynamic linking

The ILP64 interface provides for the following:
• Support for large data arrays (with more than 2^31 - 1 elements)
• Compiling your Fortran code with the -i8 compiler option

The LP64 interface provides compatibility with the previous Intel MKL versions because "LP64" is just a new name for the only interface that the Intel MKL versions lower than 9.1 provided. Choose the ILP64 interface if your application uses Intel MKL for calculations with large data arrays or may do so in the future.

Intel MKL provides the same include directory for the ILP64 and LP64 interfaces.

Compiling for LP64/ILP64

The table below shows how to compile for the ILP64 and LP64 interfaces:

Fortran
    Compiling for ILP64    ifort -i8 -I/include ...
    Compiling for LP64     ifort -I/include ...
C or C++
    Compiling for ILP64    icc -DMKL_ILP64 -I/include ...
    Compiling for LP64     icc -I/include ...

CAUTION Linking of an application compiled with the -i8 or -DMKL_ILP64 option to the LP64 libraries may result in unpredictable consequences and erroneous output.

Coding for ILP64

You do not need to change existing code if you are not using the ILP64 interface.

To migrate to ILP64 or write new code for ILP64, use appropriate types for parameters of the Intel MKL functions and subroutines:

Integer Types                                    Fortran                            C or C++
32-bit integers                                  INTEGER*4 or INTEGER(KIND=4)       int
Universal integers for ILP64/LP64:               INTEGER without specifying KIND    MKL_INT
• 64-bit for ILP64
• 32-bit otherwise
Universal integers for ILP64/LP64:               INTEGER*8 or INTEGER(KIND=8)       MKL_INT64
• 64-bit integers
FFT interface integers for ILP64/LP64            INTEGER without specifying KIND    MKL_LONG

To determine the type of an integer parameter of a function, use appropriate include files. For functions that support only a Fortran interface, use the C/C++ include files *.h.

The above table explains which integer parameters of functions become 64-bit and which remain 32-bit for ILP64. The table applies to most Intel MKL functions except some VML and VSL functions, which require integer parameters to be 64-bit or 32-bit regardless of the interface:
• VML: The mode parameter of VML functions is 64-bit.
• Random Number Generators (RNG): All discrete RNG except viRngUniformBits64 are 32-bit. The viRngUniformBits64 generator function and vslSkipAheadStream service function are 64-bit.
• Summary Statistics: The estimate parameter of the vslsSSCompute/vsldSSCompute function is 64-bit.

Refer to the Intel MKL Reference Manual for more information.

To better understand ILP64 interface details, see also examples and tests.
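
The CAUTION above concerns a mismatch between the compile-time option and the interface library. One way to guard against it in a build script is to derive both from a single setting. The following is a minimal sketch; mkl_iface_flags is a hypothetical helper name, and it covers only the compile options and dynamic interface library names given in this section.

```shell
#!/bin/sh
# mkl_iface_flags LANG IFACE
# Hypothetical helper: prints the ILP64/LP64 compile option together with
# the matching interface library, so the two cannot drift apart.
# LANG is "fortran" or "c"; IFACE is "lp64" or "ilp64".
mkl_iface_flags() {
    lang=$1
    iface=$2

    case $lang in
        fortran) ilp64_flag=-i8 ;;
        c)       ilp64_flag=-DMKL_ILP64 ;;
        *) echo "unknown language: $lang" >&2; return 1 ;;
    esac

    case $iface in
        lp64)  flags="-lmkl_intel_lp64" ;;                 # no extra option
        ilp64) flags="$ilp64_flag -lmkl_intel_ilp64" ;;
        *) echo "unknown interface: $iface" >&2; return 1 ;;
    esac

    printf '%s\n' "$flags"
}
```

For example, mkl_iface_flags fortran ilp64 prints -i8 -lmkl_intel_ilp64, while mkl_iface_flags c lp64 prints only -lmkl_intel_lp64. The remaining libraries on the link line (threading layer, computational layer, RTL) are unaffected by this choice and are left to the caller.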
Limitations

All Intel MKL function domains support ILP64 programming with the following exceptions:
• FFTW interfaces to Intel MKL:
    • FFTW 2.x wrappers do not support ILP64.
    • FFTW 3.2 wrappers support ILP64 by a dedicated set of functions plan_guru64.
• GMP* Arithmetic Functions do not support ILP64.

NOTE GMP Arithmetic Functions are deprecated and will be removed in a future release.

See Also
High-level Directory Structure
Include Files
Language Interfaces Support, by Function Domain
Layered Model Concept
Directory Structure in Detail

Linking with Fortran 95 Interface Libraries

The libmkl_blas95*.a and libmkl_lapack95*.a libraries contain Fortran 95 interfaces for BLAS and LAPACK, respectively, which are compiler-dependent. In the Intel MKL package, they are prebuilt for the Intel® Fortran compiler. If you are using a different compiler, build these libraries before using the interface.

See Also
Fortran 95 Interfaces to LAPACK and BLAS
Compiler-dependent Functions and Fortran 90 Modules

Linking with Threading Libraries

Sequential Mode of the Library

You can use Intel MKL in a sequential (non-threaded) mode. In this mode, Intel MKL runs unthreaded code. However, it is thread-safe (except the deprecated LAPACK routine ?lacon), which means that you can use it in a parallel region in your OpenMP* code. The sequential mode requires no compatibility OpenMP* run-time library and does not respond to the environment variable OMP_NUM_THREADS or its Intel MKL equivalents.

You should use the library in the sequential mode only if you have a particular reason not to use Intel MKL threading. The sequential mode may be helpful when using Intel MKL with programs threaded with some non-Intel compilers or in other situations where you need a non-threaded version of the library (for instance, in some MPI cases). To set the sequential mode, in the Threading layer, choose the *sequential.* library.
Add the POSIX threads library (pthread) to your link line for the sequential mode because the *sequential.* library depends on pthread.

See Also
Directory Structure in Detail
Using Parallelism of the Intel® Math Kernel Library
Avoiding Conflicts in the Execution Environment
Linking Examples

Selecting the Threading Layer

Several compilers that Intel MKL supports use the OpenMP* threading technology. Intel MKL supports implementations of the OpenMP* technology that these compilers provide. To make use of this support, you need to link with the appropriate library in the Threading Layer and Compiler Support Run-time Library (RTL).

Threading Layer
Each Intel MKL threading library contains the same code compiled by the respective compiler (Intel, gnu and PGI* compilers on Mac OS X).

RTL
This layer includes libiomp, the compatibility OpenMP* run-time library of the Intel compiler. In addition to the Intel compiler, libiomp provides support for one more threading compiler on Mac OS X (GNU). That is, a program threaded with a GNU compiler can safely be linked with Intel MKL and libiomp.

The table below helps explain what threading library and RTL you should choose under different scenarios when using Intel MKL (static cases only):

Compiler | Application Threaded? | Threading Layer | RTL Recommended | Comment
Intel | Does not matter | libmkl_intel_thread.a | libiomp5.dylib |
PGI | Yes | libmkl_pgi_thread.a or libmkl_sequential.a | PGI* supplied | Use of libmkl_sequential.a removes threading from Intel MKL calls.
PGI | No | libmkl_intel_thread.a | libiomp5.dylib |
PGI | No | libmkl_pgi_thread.a | PGI* supplied |
PGI | No | libmkl_sequential.a | None |
gnu | Yes | libmkl_sequential.a | None |
gnu | No | libmkl_intel_thread.a | libiomp5.dylib |
other | Yes | libmkl_sequential.a | None |
other | No | libmkl_intel_thread.a | libiomp5.dylib |

Linking with Compiler Run-time Libraries

Dynamically link libiomp, the compatibility OpenMP* run-time library, even if you link other libraries statically.

Linking to libiomp statically can be problematic because the more complex your operating environment or application, the more likely it is that redundant copies of the library are included. This may result in performance issues (oversubscription of threads) and even incorrect results.

To link libiomp dynamically, be sure the DYLD_LIBRARY_PATH environment variable is defined correctly.

See Also
Setting Environment Variables
Layered Model Concept

Linking with System Libraries

To use the Intel MKL FFT, Trigonometric Transform, or Poisson, Laplace, and Helmholtz Solver routines, link in the math support system library by adding "-lm" to the link line.

On Mac OS X, the libiomp library relies on the native pthread library for multi-threading. Any time libiomp is required, add -lpthread to your link line afterwards (the order of listing libraries is important).

Building Custom Dynamically Linked Shared Libraries

Custom dynamically linked shared libraries reduce the collection of functions available in Intel MKL libraries to those required to solve your particular problems, which helps to save disk space and build your own dynamic libraries for distribution.

The Intel MKL custom dynamically linked shared library builder, located in the tools/builder directory, enables you to create a dynamically linked shared library containing the selected functions. The builder contains a makefile and a definition file with the list of functions.
Using the Custom Dynamically Linked Shared Library Builder

To build a custom dynamically linked shared library, use the following command:

    make target [<options>]

The following table lists possible values of target and explains what the command does for each value:

Value | Comment
libuni | The builder uses static Intel MKL interface, threading, and core libraries to build a universal dynamically linked shared library for the IA-32 or Intel® 64 architecture.
dylibuni | The builder uses the single dynamic library libmkl_rt.dylib to build a universal dynamically linked shared library for the IA-32 or Intel® 64 architecture.
help | The command prints Help on the custom dynamically linked shared library builder.

The <options> placeholder stands for the list of parameters that define macros to be used by the makefile. The following table describes these parameters:

Parameter [Values] | Description
interface = {lp64|ilp64} | Defines whether to use the LP64 or ILP64 programming interface for the Intel 64 architecture. The default value is lp64.
threading = {parallel|sequential} | Defines whether to use Intel MKL in the threaded or sequential mode. The default value is parallel.
export = <file name> | Specifies the full name of the file that contains the list of entry-point functions to be included in the shared object. The default name is user_example_list (no extension).
name = <library name> | Specifies the name of the library to be created. By default, the name of the created library is mkl_custom.dylib.
xerbla = <error handler> | Specifies the name of the object file (.o) that contains the user's error handler. The makefile adds this error handler to the library for use instead of the default Intel MKL error handler xerbla. If you omit this parameter, the native Intel MKL xerbla is used. See the description of the xerbla function in the Intel MKL Reference Manual on how to develop your own error handler.
MKLROOT = <mkl directory> | Specifies the location of Intel MKL libraries used to build the custom dynamically linked shared library. By default, the builder uses the Intel MKL installation directory.

All the above parameters are optional.

In the simplest case, the command line is

    make ia32

and the missing options have default values. This command creates the mkl_custom.dylib library for processors using the IA-32 architecture. The command takes the list of functions from the user_list file and uses the native Intel MKL error handler xerbla.

An example of a more complex case follows:

    make ia32 export=my_func_list.txt name=mkl_small xerbla=my_xerbla.o

In this case, the command creates the mkl_small.dylib library for processors using the IA-32 architecture. The command takes the list of functions from the my_func_list.txt file and uses the user's error handler my_xerbla.o.

The process is similar for processors using the Intel® 64 architecture.

See Also
Using the Single Dynamic Library

Composing a List of Functions

To compose a list of functions for a minimal custom dynamically linked shared library needed for your application, you can use the following procedure:
1. Link your application with installed Intel MKL libraries to make sure the application builds.
2. Remove all Intel MKL libraries from the link line and start linking. Unresolved symbols indicate Intel MKL functions that your application uses.
3. Include these functions in the list.

Important Each time your application starts using more Intel MKL functions, update the list to include the new functions.

See Also
Specifying Function Names

Specifying Function Names

In the file with the list of functions for your custom dynamically linked shared library, adjust function names to the required interface. For example, for Fortran functions append an underscore character "_" to the names as a suffix:

    dgemm_
    ddot_
    dgetrf_

For more examples, see domain-specific lists of functions in the /tools/builder folder.
NOTE The lists of functions are provided in the /tools/builder folder merely as examples. See Composing a List of Functions for how to compose lists of functions for your custom dynamically linked shared library.

TIP Names of Fortran-style routines (BLAS, LAPACK, etc.) can be either upper-case or lower-case, with or without the trailing underscore. For example, these names are equivalent:
BLAS: dgemm, DGEMM, dgemm_, DGEMM_
LAPACK: dgetrf, DGETRF, dgetrf_, DGETRF_

Properly capitalize names of C support functions in the function list. To do this, follow the guidelines below:
1. In the mkl_service.h include file, look up a #define directive for your function.
2. Take the function name from the replacement part of that directive.

For example, the #define directive for the mkl_disable_fast_mm function is

    #define mkl_disable_fast_mm MKL_Disable_Fast_MM

Capitalize the name of this function in the list like this: MKL_Disable_Fast_MM.

For the names of the Fortran support functions, see the tip.

NOTE If selected functions have several processor-specific versions, the builder automatically includes them all in the custom library and the dispatcher manages them.

Distributing Your Custom Dynamically Linked Shared Library

To enable use of your custom dynamically linked shared library in a threaded mode, distribute libiomp5.dylib along with the custom dynamically linked shared library.

Managing Performance and Memory

Optimization Notice

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations.
Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

Using Parallelism of the Intel® Math Kernel Library

Intel MKL is extensively parallelized. See Threaded Functions and Problems for lists of threaded functions and problems that can be threaded.

Intel MKL is thread-safe, which means that all Intel MKL functions (except the deprecated LAPACK routine ?lacon) work correctly during simultaneous execution by multiple threads. In particular, any chunk of threaded Intel MKL code provides access for multiple threads to the same shared data, while permitting only one thread at any given time to access a shared piece of data. Therefore, you can call Intel MKL from multiple threads and not worry about the function instances interfering with each other.

The library uses OpenMP* threading software, so you can use the environment variable OMP_NUM_THREADS to specify the number of threads or the equivalent OpenMP run-time function calls. Intel MKL also offers variables that are independent of OpenMP, such as MKL_NUM_THREADS, and equivalent Intel MKL functions for thread management. The Intel MKL variables are always inspected first, then the OpenMP variables are examined, and if neither is used, the OpenMP software chooses the default number of threads.

By default, Intel MKL uses the number of threads equal to the number of physical cores on the system.

To achieve higher performance, set the number of threads to the number of real processors or physical cores, as summarized in Techniques to Set the Number of Threads.
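The inspection order described above (Intel MKL variables first, then OpenMP variables, then a default) can be sketched as a small pure function. This is only an illustration of the documented order, not the library's actual implementation: the function names parse_thread_count and resolve_thread_count, the accepted range, and the decimal-only parsing are all assumptions made for the example.

```c
#include <assert.h>
#include <stdlib.h>

/* Parses a positive decimal thread count. Returns 0 when the value is
   unset (NULL), empty, malformed, or out of the range assumed here. */
static int parse_thread_count(const char *value)
{
    char *end = NULL;
    long n;

    if (value == NULL || *value == '\0')
        return 0;
    n = strtol(value, &end, 10);
    if (*end != '\0' || n <= 0 || n > 65536)
        return 0;
    return (int)n;
}

/* Hypothetical helper: applies the documented order. The Intel MKL
   setting is inspected first, then the OpenMP setting; if neither is
   usable, the caller-supplied default is returned. */
int resolve_thread_count(const char *mkl_value, const char *omp_value,
                         int default_threads)
{
    int n;

    assert(default_threads > 0);
    n = parse_thread_count(mkl_value);
    if (n == 0)
        n = parse_thread_count(omp_value);
    return n > 0 ? n : default_threads;
}
```

A caller could pass getenv("MKL_NUM_THREADS") and getenv("OMP_NUM_THREADS") as the first two arguments and the number of physical cores as the default. Keeping the environment lookup outside the function makes the precedence logic easy to check in isolation.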
Threaded Functions and Problems

The following Intel MKL function domains are threaded:
• Direct sparse solver.
• LAPACK. For the list of threaded routines, see Threaded LAPACK Routines.
• Level1 and Level2 BLAS. For the list of threaded routines, see Threaded BLAS Level1 and Level2 Routines.
• All Level 3 BLAS and all Sparse BLAS routines except Level 2 Sparse Triangular solvers.
• All mathematical VML functions.
• FFT. For the list of FFT transforms that can be threaded, see Threaded FFT Problems.

Threaded LAPACK Routines

In the following list, ? stands for a precision prefix of each flavor of the respective routine and may have the value of s, d, c, or z.

The following LAPACK routines are threaded:
• Linear equations, computational routines:
    • Factorization: ?getrf, ?gbtrf, ?potrf, ?pptrf, ?sytrf, ?hetrf, ?sptrf, ?hptrf
    • Solving: ?dttrsb, ?gbtrs, ?gttrs, ?pptrs, ?pbtrs, ?pttrs, ?sytrs, ?sptrs, ?hptrs, ?tptrs, ?tbtrs
• Orthogonal factorization, computational routines: ?geqrf, ?ormqr, ?unmqr, ?ormlq, ?unmlq, ?ormql, ?unmql, ?ormrq, ?unmrq
• Singular Value Decomposition, computational routines: ?gebrd, ?bdsqr
• Symmetric Eigenvalue Problems, computational routines: ?sytrd, ?hetrd, ?sptrd, ?hptrd, ?steqr, ?stedc
• Generalized Nonsymmetric Eigenvalue Problems, computational routines: chgeqz/zhgeqz

A number of other LAPACK routines, which are based on threaded LAPACK or BLAS routines, make effective use of parallelism: ?gesv, ?posv, ?gels, ?gesvd, ?syev, ?heev, cgegs/zgegs, cgegv/zgegv, cgges/zgges, cggesx/zggesx, cggev/zggev, cggevx/zggevx, and so on.

Threaded BLAS Level1 and Level2 Routines

In the following list, ? stands for a precision prefix of each flavor of the respective routine and may have the value of s, d, c, or z.
The following routines are threaded for Intel® Core™2 Duo and Intel® Core™ i7 processors:
• Level1 BLAS: ?axpy, ?copy, ?swap, ddot/sdot, cdotc, drot/srot
• Level2 BLAS: ?gemv, ?trmv, dsyr/ssyr, dsyr2/ssyr2, dsymv/ssymv

Threaded FFT Problems

The following characteristics of a specific problem determine whether your FFT computation may be threaded:
• rank
• domain
• size/length
• precision (single or double)
• placement (in-place or out-of-place)
• strides
• number of transforms
• layout (for example, interleaved or split layout of complex data)

Most FFT problems are threaded. In particular, computation of multiple transforms in one call (number of transforms > 1) is threaded. Details of which transforms are threaded follow.

One-dimensional (1D) transforms

1D transforms are threaded in many cases.

1D complex-to-complex (c2c) transforms of size N using interleaved complex data layout are threaded under the following conditions depending on the architecture:

Architecture | Conditions
Intel® 64 | N is a power of 2, log2(N) > 9, the transform is double-precision out-of-place, and input/output strides equal 1.
IA-32 | N is a power of 2, log2(N) > 13, and the transform is single-precision.
IA-32 | N is a power of 2, log2(N) > 14, and the transform is double-precision.
Any | N is composite, log2(N) > 16, and input/output strides equal 1.

1D real-to-complex and complex-to-real transforms are not threaded.

1D complex-to-complex transforms using split-complex layout are not threaded.

Prime-size complex-to-complex 1D transforms are not threaded.

Multidimensional transforms

All multidimensional transforms on large-volume data are threaded.

Avoiding Conflicts in the Execution Environment

Certain situations can cause conflicts in the execution environment that make the use of threads in Intel MKL problematic. This section briefly discusses why these problems exist and how to avoid them.
If you thread the program using OpenMP directives and compile the program with Intel compilers, Intel MKL and the program will both use the same threading library. Intel MKL tries to determine if it is in a parallel region in the program, and if it is, it does not spread its operations over multiple threads unless you specifically request Intel MKL to do so via the MKL_DYNAMIC functionality. However, Intel MKL can be aware that it is in a parallel region only if the threaded program and Intel MKL are using the same threading library. If your program is threaded by some other means, Intel MKL may operate in multithreaded mode, and the performance may suffer due to overuse of the resources.

The following table considers several cases where the conflicts may arise and provides recommendations depending on your threading model:

Threading model: You thread the program using OS threads (pthreads on Mac OS* X).
Discussion: If more than one thread calls Intel MKL, and the function being called is threaded, it may be important that you turn off Intel MKL threading. Set the number of threads to one by any of the available means (see Techniques to Set the Number of Threads).

Threading model: You thread the program using OpenMP directives and/or pragmas and compile the program using a compiler other than a compiler from Intel.
Discussion: This is more problematic because setting of the OMP_NUM_THREADS environment variable affects both the compiler's threading library and libiomp. In this case, choose the threading library that matches the layered Intel MKL with the OpenMP compiler you employ (see Linking Examples on how to do this). If this is not possible, use Intel MKL in the sequential mode. To do this, you should link with the appropriate threading library: libmkl_sequential.a or libmkl_sequential.dylib (see High-level Directory Structure).

Threading model: There are multiple programs running on a multiple-CPU system, for example, a parallelized program that runs using MPI for communication in which each processor is treated as a node.
Discussion: The threading software will see multiple processors on the system even though each processor has a separate MPI process running on it. In this case, one of the solutions is to set the number of threads to one by any of the available means (see Techniques to Set the Number of Threads).

See Also
Using Additional Threading Control
Linking with Compiler Run-time Libraries

Techniques to Set the Number of Threads

Use one of the following techniques to change the number of threads to use in Intel MKL:
• Set one of the OpenMP or Intel MKL environment variables:
    • OMP_NUM_THREADS
    • MKL_NUM_THREADS
    • MKL_DOMAIN_NUM_THREADS
• Call one of the OpenMP or Intel MKL functions:
    • omp_set_num_threads()
    • mkl_set_num_threads()
    • mkl_domain_set_num_threads()

When choosing the appropriate technique, take into account the following rules:
• The Intel MKL threading controls take precedence over the OpenMP controls because they are inspected first.
• A function call takes precedence over any environment variables. The exception, which is a consequence of the previous rule, is the OpenMP subroutine omp_set_num_threads(), which does not have precedence over Intel MKL environment variables, such as MKL_NUM_THREADS. See Using Additional Threading Control for more details.
• You cannot change run-time behavior in the course of the run using the environment variables because they are read only once at the first call to Intel MKL.

Setting the Number of Threads Using an OpenMP* Environment Variable

You can set the number of threads using the environment variable OMP_NUM_THREADS. To change the number of threads, in the command shell in which the program is going to run, enter:

    export OMP_NUM_THREADS=<number of threads to use>
See Also
Using Additional Threading Control

Changing the Number of Threads at Run Time

You cannot change the number of threads during run time using environment variables. However, you can call OpenMP API functions from your program to change the number of threads during run time. The following sample code shows how to change the number of threads during run time using the omp_set_num_threads() routine. See also Techniques to Set the Number of Threads.

The following example shows both C and Fortran code examples. To run this example in the C language, use the omp.h header file from the Intel(R) compiler package. If you do not have the Intel compiler but wish to explore the functionality in the example, use the Fortran API for omp_set_num_threads() rather than the C version. For example,

    omp_set_num_threads_( &i_one );

    // ******* C language *******
    #include "omp.h"
    #include "mkl.h"
    #include <stdlib.h>   /* for malloc */
    #define SIZE 1000
    int main(int args, char *argv[]){
        double *a, *b, *c;
        a = (double*)malloc(sizeof(double)*SIZE*SIZE);
        b = (double*)malloc(sizeof(double)*SIZE*SIZE);
        c = (double*)malloc(sizeof(double)*SIZE*SIZE);
        double alpha=1, beta=1;
        int m=SIZE, n=SIZE, k=SIZE, lda=SIZE, ldb=SIZE, ldc=SIZE, i=0, j=0;
        char transa='n', transb='n';
        for( i=0; i ...

    #include ...
    ...
    mkl_set_num_threads ( 1 );

    // ******* Fortran language *******
    ...
    call mkl_set_num_threads( 1 )

See the Intel MKL Reference Manual for the detailed description of the threading control functions, their parameters, calling syntax, and more code examples.

MKL_DYNAMIC

The MKL_DYNAMIC environment variable enables Intel MKL to dynamically change the number of threads.

The default value of MKL_DYNAMIC is TRUE, regardless of OMP_DYNAMIC, whose default value may be FALSE.

When MKL_DYNAMIC is TRUE, Intel MKL tries to use what it considers the best number of threads, up to the maximum number you specify.
For example, MKL_DYNAMIC set to TRUE enables optimal choice of the number of threads in the following cases:
• If the requested number of threads exceeds the number of physical cores (perhaps because of using the Intel® Hyper-Threading Technology), and MKL_DYNAMIC is not changed from its default value of TRUE, Intel MKL will scale down the number of threads to the number of physical cores.
• If you are able to detect the presence of MPI, but cannot determine if it has been called in a thread-safe mode (it is impossible to detect this with MPICH 1.2.x, for instance), and MKL_DYNAMIC has not been changed from its default value of TRUE, Intel MKL will run one thread.

When MKL_DYNAMIC is FALSE, Intel MKL tries not to deviate from the number of threads the user requested. However, setting MKL_DYNAMIC=FALSE does not ensure that Intel MKL will use the number of threads that you request. The library may have no choice on this number for such reasons as system resources.

Additionally, the library may examine the problem and use a different number of threads than the value suggested. For example, if you attempt to do a size one matrix-matrix multiply across eight threads, the library may instead choose to use only one thread because it is impractical to use eight threads in this event.

Note also that if Intel MKL is called in a parallel region, it will use only one thread by default. If you want the library to use nested parallelism, and the thread within a parallel region is compiled with the same OpenMP compiler as Intel MKL is using, you may experiment with setting MKL_DYNAMIC to FALSE and manually increasing the number of threads.

In general, set MKL_DYNAMIC to FALSE only under circumstances that Intel MKL is unable to detect, for example, to use nested parallelism where the library is already called from a parallel section.
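The rules above can be condensed into a simplified model. The sketch below is not how Intel MKL decides internally: the function name suggested_threads and its parameters are hypothetical, and the model covers only three of the documented behaviors (scaling down to the number of physical cores, using one thread inside a parallel region by default, and not deviating from the request when MKL_DYNAMIC is FALSE). As the text notes, the library may still pick a different number because of system resources or the problem size.

```c
#include <assert.h>

/* Simplified, hypothetical model of the thread count that the text
   describes. `mkl_dynamic` is nonzero for TRUE; `in_parallel_region`
   is nonzero when the call is made from inside a parallel region. */
int suggested_threads(int requested, int physical_cores,
                      int mkl_dynamic, int in_parallel_region)
{
    assert(requested > 0 && physical_cores > 0);

    /* MKL_DYNAMIC=FALSE: try not to deviate from the request. */
    if (!mkl_dynamic)
        return requested;

    /* Default behavior inside a parallel region: one thread. */
    if (in_parallel_region)
        return 1;

    /* MKL_DYNAMIC=TRUE: never exceed the number of physical cores. */
    return requested > physical_cores ? physical_cores : requested;
}
```

For instance, a request for 16 threads on a machine with 8 physical cores yields 8 in this model while MKL_DYNAMIC is TRUE, and 16 once it is set to FALSE.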
MKL_DOMAIN_NUM_THREADS

The MKL_DOMAIN_NUM_THREADS environment variable suggests the number of threads for a particular function domain.

MKL_DOMAIN_NUM_THREADS accepts a string value <MKL-env-string>, which must have the following format:

    <MKL-env-string> ::= <MKL-domain-env-string> { <delimiter> <MKL-domain-env-string> }
    <delimiter> ::= [ <space-symbol>* ] ( <space-symbol> | <comma-symbol> | <semicolon-symbol> | <colon-symbol> ) [ <space-symbol>* ]
    <MKL-domain-env-string> ::= <MKL-domain-env-name> <uses> <number-of-threads>
    <MKL-domain-env-name> ::= MKL_DOMAIN_ALL | MKL_DOMAIN_BLAS | MKL_DOMAIN_FFT | MKL_DOMAIN_VML | MKL_DOMAIN_PARDISO
    <uses> ::= [ <space-symbol>* ] ( <space-symbol> | <equality-sign> | <comma-symbol> ) [ <space-symbol>* ]
    <number-of-threads> ::= <positive-number>
    <positive-number> ::= <decimal-positive-number> | <octal-number> | <hexadecimal-number>

In the syntax above, values of <MKL-domain-env-name> indicate function domains as follows:

MKL_DOMAIN_ALL | All function domains
MKL_DOMAIN_BLAS | BLAS Routines
MKL_DOMAIN_FFT | Fourier Transform Functions
MKL_DOMAIN_VML | Vector Mathematical Functions
MKL_DOMAIN_PARDISO | PARDISO

For example,

    MKL_DOMAIN_ALL 2 : MKL_DOMAIN_BLAS 1 : MKL_DOMAIN_FFT 4
    MKL_DOMAIN_ALL=2 : MKL_DOMAIN_BLAS=1 : MKL_DOMAIN_FFT=4
    MKL_DOMAIN_ALL=2, MKL_DOMAIN_BLAS=1, MKL_DOMAIN_FFT=4
    MKL_DOMAIN_ALL=2; MKL_DOMAIN_BLAS=1; MKL_DOMAIN_FFT=4
    MKL_DOMAIN_ALL = 2 MKL_DOMAIN_BLAS 1 , MKL_DOMAIN_FFT 4
    MKL_DOMAIN_ALL,2: MKL_DOMAIN_BLAS 1, MKL_DOMAIN_FFT,4

The global variables MKL_DOMAIN_ALL, MKL_DOMAIN_BLAS, MKL_DOMAIN_FFT, MKL_DOMAIN_VML, and MKL_DOMAIN_PARDISO, as well as the interface for the Intel MKL threading control functions, can be found in the mkl.h header file.

The table below illustrates how values of MKL_DOMAIN_NUM_THREADS are interpreted.

Value of MKL_DOMAIN_NUM_THREADS | Interpretation
MKL_DOMAIN_ALL=4 | All parts of Intel MKL should try four threads. The actual number of threads may still be different because of the MKL_DYNAMIC setting or system resource issues. The setting is equivalent to MKL_NUM_THREADS = 4.
MKL_DOMAIN_ALL=1, MKL_DOMAIN_BLAS=4 | All parts of Intel MKL should try one thread, except for BLAS, which is suggested to try four threads.
MKL_DOMAIN_VML=2 | VML should try two threads. The setting affects no other part of Intel MKL.

Be aware that the domain-specific settings take precedence over the overall ones.
For example, the "MKL_DOMAIN_BLAS=4" value of MKL_DOMAIN_NUM_THREADS suggests trying four threads for BLAS, regardless of a later setting of MKL_NUM_THREADS, and a function call "mkl_domain_set_num_threads ( 4, MKL_DOMAIN_BLAS );" suggests the same, regardless of later calls to mkl_set_num_threads(). However, a function call with input "MKL_DOMAIN_ALL", such as "mkl_domain_set_num_threads (4, MKL_DOMAIN_ALL);" is equivalent to "mkl_set_num_threads(4)", and thus it will be overwritten by later calls to mkl_set_num_threads. Similarly, the environment setting of MKL_DOMAIN_NUM_THREADS with "MKL_DOMAIN_ALL=4" will be overwritten with MKL_NUM_THREADS = 2.

Whereas the MKL_DOMAIN_NUM_THREADS environment variable enables you to set several variables at once, for example, "MKL_DOMAIN_BLAS=4,MKL_DOMAIN_FFT=2", the corresponding function does not take string syntax. So, to do the same with the function calls, you may need to make several calls, which in this example are as follows:

    mkl_domain_set_num_threads ( 4, MKL_DOMAIN_BLAS );
    mkl_domain_set_num_threads ( 2, MKL_DOMAIN_FFT );

Setting the Environment Variables for Threading Control

To set the environment variables used for threading control, in the command shell in which the program is going to run, enter:

    export <VARIABLE NAME>=<value>

For example:

    export MKL_NUM_THREADS=4
    export MKL_DOMAIN_NUM_THREADS="MKL_DOMAIN_ALL=1, MKL_DOMAIN_BLAS=4"
    export MKL_DYNAMIC=FALSE

Tips and Techniques to Improve Performance

Coding Techniques

To obtain the best performance with Intel MKL, ensure the following data alignment in your source code:
• Align arrays on 16-byte boundaries. See Aligning Addresses on 16-byte Boundaries for how to do it.
• Make sure leading dimension values (n*element_size) of two-dimensional arrays are divisible by 16, where element_size is the size of an array element in bytes.
• For two-dimensional arrays, avoid leading dimension values divisible by 2048 bytes.
For example, for a double-precision array, with element_size = 8, avoid leading dimensions 256, 512, 768, 1024, … (elements).

LAPACK Packed Routines

The routines with the names that contain the letters HP, OP, PP, SP, TP, UP in the matrix type and storage position (the second and third letters respectively) operate on the matrices in the packed format (see LAPACK "Routine Naming Conventions" sections in the Intel MKL Reference Manual). Their functionality is strictly equivalent to the functionality of the unpacked routines with the names containing the letters HE, OR, PO, SY, TR, UN in the same positions, but the performance is significantly lower.

If the memory restriction is not too tight, use an unpacked routine for better performance. In this case, you need to allocate N^2/2 more memory than the memory required by a respective packed routine, where N is the problem size (the number of equations).

For example, to speed up solving a symmetric eigenproblem with an expert driver, use the unpacked routine:

    call dsyevx(jobz, range, uplo, n, a, lda, vl, vu, il, iu, abstol, m, w, z, ldz, work, lwork, iwork, ifail, info)

where a has dimension lda-by-n, which is at least N^2 elements, instead of the packed routine:

    call dspevx(jobz, range, uplo, n, ap, vl, vu, il, iu, abstol, m, w, z, ldz, work, iwork, ifail, info)

where ap has dimension N*(N+1)/2.

FFT Functions

Additional conditions can improve performance of the FFT functions.

The addresses of the first elements of arrays and the leading dimension values, in bytes (n*element_size), of two-dimensional arrays should be divisible by the cache line size, which equals 64 bytes.

Hardware Configuration Tips

Dual-Core Intel® Xeon® processor 5100 series systems

To get the best performance with Intel MKL on Dual-Core Intel® Xeon® processor 5100 series systems, enable the Hardware DPL (streaming data) Prefetcher functionality of this processor.
To configure this functionality, use the appropriate BIOS settings, as described in your BIOS documentation.

Intel® Hyper-Threading Technology

Intel® Hyper-Threading Technology (Intel® HT Technology) is especially effective when each thread performs different types of operations and when there are under-utilized resources on the processor. However, Intel MKL fits neither of these criteria because the threaded portions of the library execute at high efficiencies using most of the available resources and perform identical operations on each thread. You may obtain higher performance by disabling Intel HT Technology.

If you run with Intel HT Technology enabled, performance may be especially impacted if you run on fewer threads than physical cores. Moreover, if, for example, there are two threads to every physical core, the thread scheduler may assign two threads to some cores and ignore the other cores altogether. If you are using the OpenMP* library of the Intel Compiler, read the respective User Guide on how to best set the thread affinity interface to avoid this situation. For Intel MKL, apply the following setting:

    export KMP_AFFINITY=granularity=fine,compact,1,0

See Also
Using Parallelism of the Intel® Math Kernel Library

Operating on Denormals

The IEEE 754-2008 standard, "An IEEE Standard for Binary Floating-Point Arithmetic", defines denormal (or subnormal) numbers as non-zero numbers smaller than the smallest possible normalized numbers for a specific floating-point format. Floating-point operations on denormals are slower than on normalized operands because denormal operands and results are usually handled through a software assist mechanism rather than directly in hardware. This software processing causes Intel MKL functions that consume denormals to run slower than with normalized floating-point numbers.
You can mitigate this performance issue by setting the appropriate bit fields in the MXCSR floating-point control register to flush denormals to zero (FTZ) or to replace any denormals loaded from memory with zero (DAZ). Check your compiler documentation to determine whether it has options to control FTZ and DAZ. Note that these compiler options may slightly affect accuracy.

FFT Optimized Radices

You can improve the performance of Intel MKL FFT if the length of your data vector permits factorization into powers of optimized radices. In Intel MKL, the optimized radices are 2, 3, 5, 7, 11, and 13.

Using Memory Management

Intel MKL Memory Management Software

Intel MKL has memory management software that controls memory buffers for the use by the library functions. New buffers that the library allocates when your application calls Intel MKL are not deallocated until the program ends. To get the amount of memory allocated by the memory management software, call the mkl_mem_stat() function. If your program needs to free memory, call mkl_free_buffers(). If another call is made to a library function that needs a memory buffer, the memory manager again allocates the buffers and they again remain allocated until either the program ends or the program deallocates the memory. This behavior facilitates better performance. However, some tools may report this behavior as a memory leak.

The memory management software is turned on by default. To turn it off, set the MKL_DISABLE_FAST_MM environment variable to any value or call the mkl_disable_fast_mm() function. Be aware that this change may negatively impact performance of some Intel MKL routines, especially for small problem sizes.

Redefining Memory Functions

In C/C++ programs, you can replace Intel MKL memory functions that the library uses by default with your own functions. To do this, use the memory renaming feature.
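The general mechanism behind this kind of replacement is a set of function pointers that a library calls through, which the application may redirect. The following is only a minimal sketch of that pattern and does not use Intel MKL: lib_malloc and lib_free are hypothetical stand-ins for the library-side pointers, lib_work is a hypothetical stand-in for a library routine that needs a scratch buffer, and counting_malloc and counting_free are illustrative application allocators.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for library-side allocation pointers.
   They start out pointing at the standard C run-time functions. */
static void *(*lib_malloc)(size_t) = malloc;
static void (*lib_free)(void *) = free;

/* Number of blocks handed out by counting_malloc and not yet freed. */
static size_t live_blocks = 0;

/* Illustrative application allocator: counts live blocks,
   then defers to the C run-time. */
void *counting_malloc(size_t size)
{
    void *p = malloc(size);
    if (p != NULL)
        live_blocks++;
    return p;
}

void counting_free(void *p)
{
    if (p != NULL) {
        assert(live_blocks > 0);
        live_blocks--;
    }
    free(p);
}

/* Hypothetical library routine: allocates and releases a scratch buffer
   through the pointers only, never by calling malloc/free directly.
   Returns 0 on success and -1 if the allocation fails. */
int lib_work(size_t n)
{
    double *buf = lib_malloc(n * sizeof *buf);
    if (buf == NULL)
        return -1;
    buf[0] = 0.0;
    lib_free(buf);
    return 0;
}

/* Redirect both pointers together, before the first library call,
   so that every block is allocated and freed by the same allocator. */
void install_counting_allocator(void)
{
    lib_malloc = counting_malloc;
    lib_free = counting_free;
}
```

Calling install_counting_allocator() once at start-up and then lib_work() leaves live_blocks at zero, because the allocation and the matching release both go through the redirected pointers. Redirecting only one of the two pointers would split ownership of the memory between two allocators.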
Memory Renaming

Intel MKL memory management by default uses standard C run-time memory functions to allocate or free memory. These functions can be replaced using memory renaming.

Intel MKL accesses the memory functions by pointers i_malloc, i_free, i_calloc, and i_realloc, which are visible at the application level. These pointers initially hold addresses of the standard C run-time memory functions malloc, free, calloc, and realloc, respectively. You can programmatically redefine values of these pointers to the addresses of your application's memory management functions.

Redirecting the pointers is the only correct way to use your own set of memory management functions. If you call your own memory functions without redirecting the pointers, the memory will get managed by two independent memory management packages, which may cause unexpected memory issues.

How to Redefine Memory Functions

To redefine memory functions, use the following procedure:

1. Include the i_malloc.h header file in your code. This header file contains all declarations required for replacing the memory allocation functions. The header file also describes how memory allocation can be replaced in those Intel libraries that support this feature.
2. Redefine values of pointers i_malloc, i_free, i_calloc, and i_realloc prior to the first call to MKL functions, as shown in the following example:

    #include "i_malloc.h"
    . . .
    i_malloc = my_malloc;
    i_calloc = my_calloc;
    i_realloc = my_realloc;
    i_free = my_free;
    . . .
    // Now you may call Intel MKL functions

Language-specific Usage Options

The Intel® Math Kernel Library (Intel® MKL) provides broad support for Fortran and C/C++ programming. However, not all functions support both Fortran and C interfaces. For example, some LAPACK functions have no C interface. You can call such functions from C using mixed-language programming.
If you want to use LAPACK or BLAS functions that support Fortran 77 in the Fortran 95 environment, additional effort may be initially required to build compiler-specific interface libraries and modules from the source code provided with Intel MKL.

Optimization Notice

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

Using Language-Specific Interfaces with Intel® Math Kernel Library

This section discusses mixed-language programming and the use of language-specific interfaces with Intel MKL.

See also Appendix G in the Intel MKL Reference Manual for details of the FFTW interfaces to Intel MKL.

Interface Libraries and Modules

You can create the following interface libraries and modules using the respective makefiles located in the interfaces directory.

File name                     Contains

Libraries, in Intel MKL architecture-specific directories
libmkl_blas95.a [1]           Fortran 95 wrappers for BLAS (BLAS95) for IA-32 architecture.
libmkl_blas95_lp64.a [1]      Fortran 95 wrappers for BLAS (BLAS95) supporting LP64 interface.
libmkl_blas95_ilp64.a [1]     Fortran 95 wrappers for BLAS (BLAS95) supporting ILP64 interface.
libmkl_lapack95.a [1]         Fortran 95 wrappers for LAPACK (LAPACK95) for IA-32 architecture.
libmkl_lapack95_lp64.a [1]    Fortran 95 wrappers for LAPACK (LAPACK95) supporting LP64 interface.
libmkl_lapack95_ilp64.a [1]   Fortran 95 wrappers for LAPACK (LAPACK95) supporting ILP64 interface.
libfftw2xc_intel.a [1]        Interfaces for FFTW version 2.x (C interface for Intel compilers) to call Intel MKL FFTs.
libfftw2xc_gnu.a              Interfaces for FFTW version 2.x (C interface for GNU compilers) to call Intel MKL FFTs.
libfftw2xf_intel.a            Interfaces for FFTW version 2.x (Fortran interface for Intel compilers) to call Intel MKL FFTs.
libfftw2xf_gnu.a              Interfaces for FFTW version 2.x (Fortran interface for GNU compilers) to call Intel MKL FFTs.
libfftw3xc_intel.a [2]        Interfaces for FFTW version 3.x (C interface for Intel compilers) to call Intel MKL FFTs.
libfftw3xc_gnu.a              Interfaces for FFTW version 3.x (C interface for GNU compilers) to call Intel MKL FFTs.
libfftw3xf_intel.a [2]        Interfaces for FFTW version 3.x (Fortran interface for Intel compilers) to call Intel MKL FFTs.
libfftw3xf_gnu.a              Interfaces for FFTW version 3.x (Fortran interface for GNU compilers) to call Intel MKL FFTs.

Modules, in architecture- and interface-specific subdirectories of the Intel MKL include directory
blas95.mod [1]                Fortran 95 interface module for BLAS (BLAS95).
lapack95.mod [1]              Fortran 95 interface module for LAPACK (LAPACK95).
f95_precision.mod [1]         Fortran 95 definition of precision parameters for BLAS95 and LAPACK95.
mkl95_blas.mod [1]            Fortran 95 interface module for BLAS (BLAS95), identical to blas95.mod. To be removed in one of the future releases.
mkl95_lapack.mod [1]          Fortran 95 interface module for LAPACK (LAPACK95), identical to lapack95.mod. To be removed in one of the future releases.
mkl95_precision.mod [1]       Fortran 95 definition of precision parameters for BLAS95 and LAPACK95, identical to f95_precision.mod. To be removed in one of the future releases.
mkl_service.mod [1]           Fortran 95 interface module for Intel MKL support functions.

[1] Prebuilt for the Intel® Fortran compiler.
[2] FFTW3 interfaces are integrated with Intel MKL. Look into /interfaces/fftw3x*/ makefile for options defining how to build and where to place the standalone library with the wrappers.

See Also
Fortran 95 Interfaces to LAPACK and BLAS

Fortran 95 Interfaces to LAPACK and BLAS

Fortran 95 interfaces are compiler-dependent. Intel MKL provides the interface libraries and modules precompiled with the Intel® Fortran compiler. Additionally, the Fortran 95 interfaces and wrappers are delivered as sources. (For more information, see Compiler-dependent Functions and Fortran 90 Modules). If you are using a different compiler, build the appropriate library and modules with your compiler and link the library as a user's library:

1. Go to the respective directory /interfaces/blas95 or /interfaces/lapack95
2. Type one of the following commands depending on your architecture:
   • For the IA-32 architecture, make libia32 INSTALL_DIR=
   • For the Intel® 64 architecture, make libintel64 [interface=lp64|ilp64] INSTALL_DIR=

Important
The parameter INSTALL_DIR is required.

As a result, the required library is built and installed in the /lib directory, and the .mod files are built and installed in the /include/[/{lp64|ilp64}] directory, where is one of {ia32, intel64}.

By default, the ifort compiler is assumed. You may change the compiler with an additional parameter of make: FC=.

For example, the command

    make libintel64 FC=pgf95 INSTALL_DIR= interface=lp64

builds the required library and .mod files and installs them in subdirectories of .

To delete the library from the building directory, use one of the following commands:
• For the IA-32 architecture, make cleania32 INSTALL_DIR=
• For the Intel® 64 architecture, make cleanintel64 [interface=lp64|ilp64] INSTALL_DIR=
• For all the architectures, make clean INSTALL_DIR=

CAUTION
Even if you have administrative rights, avoid setting INSTALL_DIR=../..
or INSTALL_DIR= in a build or clean command above because these settings replace or delete the Intel MKL prebuilt Fortran 95 library and modules.

Compiler-dependent Functions and Fortran 90 Modules

Compiler-dependent functions occur whenever the compiler inserts into the object code function calls that are resolved in its run-time library (RTL). Linking of such code without the appropriate RTL will result in undefined symbols. Intel MKL has been designed to minimize RTL dependencies.

In cases where RTL dependencies might arise, the functions are delivered as source code and you need to compile the code with whatever compiler you are using for your application.

In particular, Fortran 90 modules result in the compiler-specific code generation requiring RTL support. Therefore, Intel MKL delivers these modules compiled with the Intel compiler, along with source code, to be used with different compilers.

Mixed-language Programming with the Intel Math Kernel Library

Appendix A: Intel(R) Math Kernel Library Language Interfaces Support lists the programming languages supported for each Intel MKL function domain. However, you can call Intel MKL routines from different language environments.

Calling LAPACK, BLAS, and CBLAS Routines from C/C++ Language Environments

Not all Intel MKL function domains support both C and Fortran environments. To use Intel MKL Fortran-style functions in C/C++ environments, you should observe certain conventions, which are discussed for LAPACK and BLAS in the subsections below.

CAUTION
Avoid calling BLAS 95/LAPACK 95 from C/C++. Such calls require skills in manipulating the descriptor of a deferred-shape array, which is the Fortran 90 type. Moreover, BLAS95/LAPACK95 routines contain links to a Fortran RTL.

LAPACK and BLAS

Because LAPACK and BLAS routines are Fortran-style, when calling them from C-language programs, follow the Fortran-style calling conventions:

• Pass variables by address, not by value. Function calls in Example "Calling a Complex BLAS Level 1 Function from C++" and Example "Using CBLAS Interface Instead of Calling BLAS Directly from C" illustrate this.
• Store your data in Fortran style, that is, column-major rather than row-major order.

With row-major order, adopted in C, the last array index changes most quickly and the first one changes most slowly when traversing the memory segment where the array is stored. With Fortran-style column-major order, the last index changes most slowly whereas the first index changes most quickly (as illustrated by the figure below for a two-dimensional array).

For example, if a two-dimensional matrix A of size m x n is stored densely in a one-dimensional array B, you can access a matrix element like this:

    A[i][j] = B[i*n+j] in C (i=0, ... , m-1, j=0, ... , n-1)
    A(i,j) = B((j-1)*m+i) in Fortran (i=1, ... , m, j=1, ... , n).

When calling LAPACK or BLAS routines from C, be aware that because the Fortran language is case-insensitive, the routine names can be either upper-case or lower-case, with or without the trailing underscore. For example, the following names are equivalent:

• LAPACK: dgetrf, DGETRF, dgetrf_, and DGETRF_
• BLAS: dgemm, DGEMM, dgemm_, and DGEMM_

See Example "Calling a Complex BLAS Level 1 Function from C++" on how to call BLAS routines from C.

See also the Intel(R) MKL Reference Manual for a description of the C interface to LAPACK functions.

CBLAS

Instead of calling BLAS routines from a C-language program, you can use the CBLAS interface.

CBLAS is a C-style interface to the BLAS routines. You can call CBLAS routines using regular C-style calls. Use the mkl.h header file with the CBLAS interface. The header file specifies enumerated values and prototypes of all the functions. It also determines whether the program is being compiled with a C++ compiler, and if it is, the included file will be correct for use with C++ compilation.
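The row-major and column-major indexing rules from the LAPACK and BLAS subsection above can be sketched in C with zero-based offsets. This is an illustrative sketch only: the function names row_major_offset, col_major_offset, and row_to_col_major are hypothetical helpers, not part of Intel MKL.

```c
#include <assert.h>
#include <stddef.h>

/* Zero-based offset of element (i, j) of an m-by-n matrix stored
   densely in row-major (C) order: the last index varies fastest. */
size_t row_major_offset(size_t i, size_t j, size_t m, size_t n)
{
    assert(i < m && j < n);
    return i * n + j;
}

/* Zero-based offset of the same element in column-major (Fortran)
   order: the first index varies fastest. */
size_t col_major_offset(size_t i, size_t j, size_t m, size_t n)
{
    assert(i < m && j < n);
    return j * m + i;
}

/* Copy a dense row-major m-by-n matrix into a column-major buffer of
   the same size, for example before passing it to a Fortran-style routine. */
void row_to_col_major(const double *src, double *dst, size_t m, size_t n)
{
    for (size_t i = 0; i < m; i++)
        for (size_t j = 0; j < n; j++)
            dst[col_major_offset(i, j, m, n)] = src[row_major_offset(i, j, m, n)];
}
```

For a 2-by-3 matrix with rows (1, 2, 3) and (4, 5, 6), the row-major buffer holds 1, 2, 3, 4, 5, 6, while the column-major buffer produced by row_to_col_major holds 1, 4, 2, 5, 3, 6. The leading dimension of the column-major copy is m.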
Example "Using CBLAS Interface Instead of Calling BLAS Directly from C" illustrates the use of the CBLAS interface.

C Interface to LAPACK

Instead of calling LAPACK routines from a C-language program, you can use the C interface to LAPACK provided by Intel MKL.

The C interface to LAPACK is a C-style interface to the LAPACK routines. This interface supports matrices in row-major and column-major order, which you can define in the first function argument matrix_order. Use the mkl_lapacke.h header file with the C interface to LAPACK. The header file specifies constants and prototypes of all the functions. It also determines whether the program is being compiled with a C++ compiler, and if it is, the included file will be correct for use with C++ compilation. You can find examples of the C interface to LAPACK in the examples/lapacke subdirectory in the Intel MKL installation directory.

Using Complex Types in C/C++

As described in the documentation for the Intel® Fortran Compiler XE, C/C++ does not directly implement the Fortran types COMPLEX(4) and COMPLEX(8). However, you can write equivalent structures. The type COMPLEX(4) consists of two 4-byte floating-point numbers. The first of them is the real-number component, and the second one is the imaginary-number component. The type COMPLEX(8) is similar to COMPLEX(4) except that it contains two 8-byte floating-point numbers.

Intel MKL provides complex types MKL_Complex8 and MKL_Complex16, which are structures equivalent to the Fortran complex types COMPLEX(4) and COMPLEX(8), respectively. The MKL_Complex8 and MKL_Complex16 types are defined in the mkl_types.h header file. You can use these types to define complex data. You can also redefine the types with your own types before including the mkl_types.h header file. The only requirement is that the types must be compatible with the Fortran complex layout, that is, the complex type must be a pair of real numbers for the values of real and imaginary parts.
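The layout requirement above can be checked in plain C. The sketch below is an illustration under stated assumptions: pair_complex16 is a hypothetical structure with the same shape as the described complex type (real part first, then imaginary part), and the compiler is assumed to support C99 complex arithmetic and the C11 _Static_assert keyword. It relies on the C99 rule that double complex has the same representation as an array of two double values, real part first.

```c
#include <assert.h>
#include <complex.h>
#include <string.h>

/* Hypothetical structure shaped like the complex type described above:
   a pair of 8-byte reals, real part first, imaginary part second. */
typedef struct {
    double real;
    double imag;
} pair_complex16;

/* The two types can only be layout-compatible if their sizes match. */
_Static_assert(sizeof(pair_complex16) == sizeof(double complex),
               "struct and C99 complex must have the same size");

/* Reinterpret a C99 complex value as the structure by copying its bytes.
   No arithmetic is performed, so the result shows the raw layout. */
pair_complex16 from_c99(double complex z)
{
    pair_complex16 p;
    memcpy(&p, &z, sizeof p);
    return p;
}
```

Converting the value 3.0 + 4.0*I with from_c99 yields real equal to 3.0 and imag equal to 4.0, which confirms that the real part is stored first. A type that stored the parts in the opposite order, or inserted padding between them, would not satisfy the requirement.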
For example, you can use the following definitions in your C++ code:

    #define MKL_Complex8 std::complex<float>

and

    #define MKL_Complex16 std::complex<double>

See Example "Calling a Complex BLAS Level 1 Function from C++" for details. You can also define these types in the command line:

    -DMKL_Complex8="std::complex<float>" -DMKL_Complex16="std::complex<double>"

See Also
Intel® Software Documentation Library

Calling BLAS Functions that Return the Complex Values in C/C++ Code

Complex values that functions return are handled differently in C and Fortran. Because BLAS is Fortran-style, you need to be careful when handling a call from C to a BLAS function that returns complex values. However, in addition to normal function calls, Fortran enables calling functions as though they were subroutines, which provides a mechanism for returning the complex value correctly when the function is called from a C program. When a Fortran function is called as a subroutine, the return value is the first parameter in the calling sequence. You can use this feature to call a BLAS function from C.

The following example shows how a call to a Fortran function as a subroutine converts to a call from C and the hidden parameter result gets exposed:

Normal Fortran function call:

    result = cdotc( n, x, 1, y, 1 )

A call to the function as a subroutine:

    call cdotc( result, n, x, 1, y, 1)

A call to the function from C:

    cdotc( &result, &n, x, &one, y, &one )

NOTE
Intel MKL has both upper-case and lower-case entry points in the Fortran-style (case-insensitive) BLAS, with or without the trailing underscore. So, all these names are equivalent and acceptable: cdotc, CDOTC, cdotc_, and CDOTC_.

The above example shows one of the ways to call several level 1 BLAS functions that return complex values from your C and C++ applications. An easier way is to use the CBLAS interface.
For instance, you can call the same function using the CBLAS interface as follows:

    cblas_cdotc_sub( n, x, 1, y, 1, &result )

NOTE
The complex value comes last on the argument list in this case.

The following examples show use of the Fortran-style BLAS interface from C and C++, as well as the CBLAS (C language) interface:

• Example "Calling a Complex BLAS Level 1 Function from C"
• Example "Calling a Complex BLAS Level 1 Function from C++"
• Example "Using CBLAS Interface Instead of Calling BLAS Directly from C"

Example "Calling a Complex BLAS Level 1 Function from C"

The example below illustrates a call from a C program to the complex BLAS Level 1 function zdotc(). This function computes the dot product of two double-precision complex vectors.

In this example, the complex dot product is returned in the structure c.

    #include <stdio.h>
    #include "mkl.h"
    #define N 5
    int main()
    {
        int n = N, inca = 1, incb = 1, i;
        MKL_Complex16 a[N], b[N], c;
        for( i = 0; i < n; i++ ){
            a[i].real = (double)i; a[i].imag = (double)i * 2.0;
            b[i].real = (double)(n - i); b[i].imag = (double)i * 2.0;
        }
        /* Fortran-style call: the result comes first, all arguments by address */
        zdotc( &c, &n, a, &inca, b, &incb );
        printf( "The complex dot product is: ( %6.2f, %6.2f)\n", c.real, c.imag );
        return 0;
    }

Example "Calling a Complex BLAS Level 1 Function from C++"

Below is the C++ implementation:

    #include <complex>
    #include <iostream>
    #define MKL_Complex16 std::complex<double>
    #include "mkl.h"
    #define N 5
    int main()
    {
        int n, inca = 1, incb = 1, i;
        std::complex<double> a[N], b[N], c;
        n = N;
        for( i = 0; i < n; i++ ){
            a[i] = std::complex<double>(i,i*2.0);
            b[i] = std::complex<double>(n-i,i*2.0);
        }
        zdotc(&c, &n, a, &inca, b, &incb );
        std::cout << "The complex dot product is: " << c << std::endl;
        return 0;
    }

Example "Using CBLAS Interface Instead of Calling BLAS Directly from C"

This example uses CBLAS:

    #include <stdio.h>
    #include "mkl.h"
    typedef struct{ double re; double im; } complex16;
    #define N 5
    int main()
    {
        int n, inca = 1, incb = 1, i;
        complex16 a[N], b[N], c;
        n = N;
        for( i = 0; i < n; i++ ){
            a[i].re = (double)i; a[i].im = (double)i * 2.0;
            b[i].re = (double)(n - i); b[i].im = (double)i * 2.0;
        }
        /* C-style call: scalars by value, the result comes last */
        cblas_zdotc_sub(n, a, inca, b, incb, &c );
        printf( "The complex dot product is: ( %6.2f, %6.2f)\n", c.re, c.im );
        return 0;
    }

Support for Boost uBLAS Matrix-matrix Multiplication

If you are used to uBLAS, you can perform BLAS matrix-matrix multiplication in C++ using Intel MKL substitution of Boost uBLAS functions. uBLAS is the Boost C++ open-source library that provides BLAS functionality for dense, packed, and sparse matrices. The library uses an expression template technique for passing expressions as function arguments, which enables evaluating vector and matrix expressions in one pass without temporary matrices. uBLAS provides two modes:

• Debug (safe) mode, default. Checks types and conformance.
• Release (fast) mode. Does not check types and conformance. To enable this mode, use the NDEBUG preprocessor symbol.

The documentation for the Boost uBLAS is available at www.boost.org.

Intel MKL provides overloaded prod() functions for substituting uBLAS dense matrix-matrix multiplication with the Intel MKL gemm calls. Though these functions break uBLAS expression templates and introduce temporary matrices, the performance advantage can be considerable for matrix sizes that are not too small (roughly, over 50).

You do not need to change your source code to use the functions. To call them:

• Include the header file mkl_boost_ublas_matrix_prod.hpp in your code (from the Intel MKL include directory)
• Add appropriate Intel MKL libraries to the link line.
The list of expressions that are substituted follows:

    prod( m1, m2 )
    prod( trans(m1), m2 )
    prod( trans(conj(m1)), m2 )
    prod( conj(trans(m1)), m2 )
    prod( m1, trans(m2) )
    prod( trans(m1), trans(m2) )
    prod( trans(conj(m1)), trans(m2) )
    prod( conj(trans(m1)), trans(m2) )
    prod( m1, trans(conj(m2)) )
    prod( trans(m1), trans(conj(m2)) )
    prod( trans(conj(m1)), trans(conj(m2)) )
    prod( conj(trans(m1)), trans(conj(m2)) )
    prod( m1, conj(trans(m2)) )
    prod( trans(m1), conj(trans(m2)) )
    prod( trans(conj(m1)), conj(trans(m2)) )
    prod( conj(trans(m1)), conj(trans(m2)) )

These expressions are substituted in the release mode only (with NDEBUG preprocessor symbol defined). Supported uBLAS versions are Boost 1.34.1 and higher. To get them, visit www.boost.org.

A code example provided in the /examples/ublas/source/sylvester.cpp file illustrates usage of the Intel MKL uBLAS header file for solving a special case of the Sylvester equation.

To run the Intel MKL ublas examples, specify the BOOST_ROOT parameter in the make command, for instance, when using Boost version 1.37.0:

    make libia32 BOOST_ROOT = /boost_1_37_0

See Also
Using Code Examples

Invoking Intel MKL Functions from Java* Applications

Intel MKL Java* Examples

To demonstrate binding with Java, Intel MKL includes a set of Java examples in the following directory: /examples/java.

The examples are provided for the following MKL functions:

• ?gemm, ?gemv, and ?dot families from CBLAS
• The complete set of FFT functions
• ESSL [1]-like functions for one-dimensional convolution and correlation
• VSL Random Number Generators (RNG), except user-defined ones and file subroutines
• VML functions, except GetErrorCallBack, SetErrorCallBack, and ClearErrorCallBack

You can see the example sources in the following directory: /examples/java/examples.

The examples are written in Java.
They demonstrate usage of the MKL functions with the following variety of data:

• 1- and 2-dimensional data sequences
• Real and complex types of the data
• Single and double precision

However, the wrappers, used in the examples, do not:

• Demonstrate the use of large arrays (>2 billion elements)
• Demonstrate processing of arrays in native memory
• Check correctness of function parameters
• Demonstrate performance optimizations

The examples use the Java Native Interface (JNI* developer framework) to bind with Intel MKL. The JNI documentation is available from http://java.sun.com/javase/6/docs/technotes/guides/jni/.

The Java example set includes JNI wrappers that perform the binding. The wrappers do not depend on the examples and may be used in your Java applications. The wrappers for CBLAS, FFT, VML, VSL RNG, and ESSL-like convolution and correlation functions do not depend on each other.

To build the wrappers, just run the examples. The makefile builds the wrapper binaries. After running the makefile, you can run the examples, which will determine whether the wrappers were built correctly. As a result of running the examples, the following directories will be created in /examples/java:

• docs
• include
• classes
• bin
• _results

The directories docs, include, classes, and bin will contain the wrapper binaries and documentation; the directory _results will contain the testing results.

For a Java programmer, the wrappers are the following Java classes:

• com.intel.mkl.CBLAS
• com.intel.mkl.DFTI
• com.intel.mkl.ESSL
• com.intel.mkl.VML
• com.intel.mkl.VSL

Documentation for the particular wrapper and example classes will be generated from the Java sources while building and running the examples. To browse the documentation, open the index file in the docs directory (created by the build script): /examples/java/docs/index.html.
The Java wrappers for CBLAS, VML, VSL RNG, and FFT establish the interface that directly corresponds to the underlying native functions, so you can refer to the Intel MKL Reference Manual for their functionality and parameters. Interfaces for the ESSL-like functions are described in the generated documentation for the com.intel.mkl.ESSL class. Each wrapper consists of the interface part for Java and JNI stub written in C. You can find the sources in the following directory: /examples/java/wrappers. Both Java and C parts of the wrapper for CBLAS and VML demonstrate the straightforward approach, which you may use to cover additional CBLAS functions. The wrapper for FFT is more complicated because it needs to support the lifecycle for FFT descriptor objects. To compute a single Fourier transform, an application needs to call the FFT software several times with the same copy of the native FFT descriptor. The wrapper provides the handler class to hold the native descriptor, while the virtual machine runs Java bytecode. The wrapper for VSL RNG is similar to the one for FFT. The wrapper provides the handler class to hold the native descriptor of the stream state. The wrapper for the convolution and correlation functions mitigates the same difficulty of the VSL interface, which assumes a similar lifecycle for "task descriptors". The wrapper utilizes the ESSL-like interface for those functions, which is simpler for the case of 1-dimensional data. The JNI stub additionally encapsulates the MKL functions into the ESSL-like wrappers written in C and so "packs" the lifecycle of a task descriptor into a single call to the native method. The wrappers meet the JNI Specification versions 1.1 and 5.0 and should work with virtually every modern implementation of Java. 
The examples and the Java part of the wrappers are written for the Java language described in "The Java Language Specification (First Edition)" and extended with the feature of "inner classes" (this refers to the late 1990s). This level of language version is supported by all versions of the Sun Java Development Kit* (JDK*) developer toolkit and compatible implementations starting from version 1.1.5, or by all modern versions of Java.

The level of C language is "Standard C" (that is, C89) with additional assumptions about integer and floating-point data types required by the Intel MKL interfaces and the JNI header files. That is, the native float and double data types must be the same as JNI jfloat and jdouble data types, respectively, and the native int must be 4 bytes long.

[1] IBM Engineering Scientific Subroutine Library (ESSL*).

See Also
Running the Java* Examples

Running the Java* Examples

The Java examples support all the C and C++ compilers that Intel MKL does. The makefile intended to run the examples also needs the make utility, which is typically provided with the Mac OS* X distribution.

To run Java examples, the JDK* developer toolkit is required for compiling and running Java code. A Java implementation must be installed on the computer or available via the network. You may download the JDK from the vendor website.

The examples should work for all versions of JDK. However, they were tested only with the following Java implementation for all the supported architectures:

• J2SE* SDK 1.4.2 and JDK 5.0 from Apple Computer, Inc. (http://apple.com/).

Note that the Java run-time environment* (JRE*) system, which may be pre-installed on your computer, is not enough.
You need the JDK* developer toolkit that supports the following set of tools:

• java
• javac
• javah
• javadoc

To make these tools available for the examples makefile, set the JAVA_HOME environment variable and add the JDK binaries directory to the system PATH, for example:

    export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.5/Home
    export PATH=${JAVA_HOME}/bin:${PATH}

You may also need to clear the JDK_HOME environment variable, if it is assigned a value:

    unset JDK_HOME

To start the examples, use the makefile found in the Intel MKL Java examples directory:

    make {dylibia32|libia32} [function=...] [compiler=...]

If you type the make command and omit the target (for example, dylibia32), the makefile prints the help info, which explains the targets and parameters.

For the examples list, see the examples.lst file in the Java examples directory.

Known Limitations of the Java* Examples

This section explains limitations of Java examples.

Functionality

Some Intel MKL functions may fail to work if called from the Java environment by using a wrapper, like those provided with the Intel MKL Java examples. Only those specific CBLAS, FFT, VML, VSL RNG, and the convolution/correlation functions listed in the Intel MKL Java Examples section were tested with the Java environment. So, you may use the Java wrappers for these CBLAS, FFT, VML, VSL RNG, and convolution/correlation functions in your Java applications.

Performance

The Intel MKL functions must work faster than similar functions written in pure Java. However, the main goal of these wrappers is to provide code examples, not maximum performance. So, an Intel MKL function called from a Java application will probably work slower than the same function called from a program written in C/C++ or Fortran.
Known bugs

There are a number of known bugs in Intel MKL (identified in the Release Notes), as well as incompatibilities between different versions of JDK. The examples and wrappers include workarounds for these problems. Look at the source code in the examples and wrappers for comments that describe the workarounds.

Chapter 7: Coding Tips

This section provides coding tips for programming with the Intel® Math Kernel Library (Intel® MKL) that meet specific needs, such as consistent results of computations or conditional compilation.

Aligning Data for Consistent Results

Routines in Intel MKL may return different results from run to run on the same system. This is usually due to a change in the order in which floating-point operations are performed. The two most influential factors are array alignment and parallelism. Array alignment can determine how internal loops order floating-point operations. Non-deterministic parallelism may change the order in which computational tasks are executed. While these results may differ, they should still fall within acceptable computational error bounds. To better ensure identical results from run to run, do the following:
• Align input arrays on 16-byte boundaries
• Run Intel MKL in the sequential mode

To align input arrays on 16-byte boundaries, use mkl_malloc() in place of system provided memory allocators, as shown in the code example below. Sequential mode of Intel MKL removes the influence of non-deterministic parallelism.

Aligning Addresses on 16-byte Boundaries

    // ******* C language *******
    ...
    #include ...
    ...
    void *darray;
    int workspace;
    ...
    // Allocate workspace aligned on 16-byte boundary
    darray = mkl_malloc( sizeof(double)*workspace, 16 );
    ...
    // call the program using MKL
    mkl_app( darray );
    ...
    // Free workspace
    mkl_free( darray );

    ! ******* Fortran language *******
    ...
    double precision darray
    pointer (p_wrk,darray(1))
    integer workspace
    ...
    ! Allocate workspace aligned on 16-byte boundary
    p_wrk = mkl_malloc( 8*workspace, 16 )
    ...
    ! call the program using MKL
    call mkl_app( darray )
    ...
    ! Free workspace
    call mkl_free(p_wrk)

Using Predefined Preprocessor Symbols for Intel® MKL Version-Dependent Compilation

Preprocessor symbols (macros) substitute values in a program before it is compiled. The substitution is performed in the preprocessing phase. The following preprocessor symbols are available:

Predefined Preprocessor Symbol | Description
__INTEL_MKL__ | Intel MKL major version
__INTEL_MKL_MINOR__ | Intel MKL minor version
__INTEL_MKL_UPDATE__ | Intel MKL update number
INTEL_MKL_VERSION | Intel MKL full version in the following format: INTEL_MKL_VERSION = (__INTEL_MKL__*100+__INTEL_MKL_MINOR__)*100+__INTEL_MKL_UPDATE__

For example, Intel MKL 10.3 update 4 gives INTEL_MKL_VERSION = (10*100+3)*100+4 = 100304.

These symbols enable conditional compilation of code that uses new features introduced in a particular version of the library. To perform conditional compilation:
1. Include in your code the file where the macros are defined:
   • mkl.h for C/C++
   • mkl.fi for Fortran
2. [Optionally] Use the following preprocessor directives to check whether the macro is defined:
   • #ifdef, #endif for C/C++
   • !DEC$IF DEFINED, !DEC$ENDIF for Fortran
3. Use preprocessor directives for conditional inclusion of code:
   • #if, #endif for C/C++
   • !DEC$IF, !DEC$ENDIF for Fortran

Example

Compile a part of the code if the Intel MKL version is MKL 10.3 update 4:

C/C++:

    #include "mkl.h"
    #ifdef INTEL_MKL_VERSION
    #if INTEL_MKL_VERSION == 100304
    // Code to be conditionally compiled
    #endif
    #endif

Fortran:

    include "mkl.fi"
    !DEC$IF DEFINED INTEL_MKL_VERSION
    !DEC$IF INTEL_MKL_VERSION .EQ. 100304
    * Code to be conditionally compiled
    !DEC$ENDIF
    !DEC$ENDIF

Chapter 8: Configuring Your Integrated Development Environment to Link with Intel Math Kernel Library

Configuring the Apple Xcode* Developer Software to Link with Intel® Math Kernel Library

This section provides information on linking Intel MKL with the Apple Xcode* developer software. The procedure was illustrated with Apple Xcode* 2.4 and may look different in other versions, but the fundamental steps for configuring Xcode* for use with Intel MKL are more widely applicable:
1. Open your project that uses Intel MKL.
2. Under Targets, double-click the active target. In the Target dialog box, assign values to the build settings as explained in the next steps.
3. Click the plus icon under the Build Settings table, located at the bottom of the dialog box, to add a row. In the new row, type HEADER_SEARCH_PATHS under Name and, under Value, the path to the Intel® MKL include files, that is, the include subdirectory of the Intel MKL directory.
4. Click the plus icon under the Build Settings table to add another row. In that row, type LIBRARY_SEARCH_PATHS under Name and, under Value, the path to the Intel MKL libraries, such as the lib subdirectory of the Intel MKL directory.
5. Double-click OTHER_LDFLAGS under Name and under Value, type linker options for additional libraries (for example, -lmkl_core -lguide -lpthread).
6. (Optional, needed only for dynamic linking) Under Executables, double-click the active executable, click the Arguments tab, and under Variables to be set in the environment, add DYLD_LIBRARY_PATH with the path to the lib subdirectory of the Intel MKL directory as its value.

See Also
Notational Conventions
Linking in Detail

Chapter 9: Intel® Optimized LINPACK Benchmark for Mac OS* X

Intel® Optimized LINPACK Benchmark is a generalization of the LINPACK 1000 benchmark.
It solves a dense (real*8) system of linear equations (Ax=b), measures the amount of time it takes to factor and solve the system, converts that time into a performance rate, and tests the results for accuracy. The generalization is in the number of equations (N) it can solve, which is not limited to 1000. It uses partial pivoting to assure the accuracy of the results.

Do not use this benchmark to report LINPACK 100 performance because that is a compiled-code only benchmark. This is a shared-memory (SMP) implementation which runs on a single platform. Do not confuse this benchmark with LINPACK, the library, which has been expanded upon by the LAPACK library.

Intel provides optimized versions of the LINPACK benchmarks to help you obtain high LINPACK benchmark results on your genuine Intel processor systems more easily than with the High Performance Linpack (HPL) benchmark. Use this package to benchmark your SMP machine.

Additional information on this software as well as other Intel software performance products is available at http://www.intel.com/software/products/.

Contents of the Intel® Optimized LINPACK Benchmark

The Intel Optimized LINPACK Benchmark for Mac OS* X contains the following files, located in the ./benchmarks/linpack/ subdirectory of the Intel® Math Kernel Library (Intel® MKL) directory:

File in ./benchmarks/linpack/ | Description
linpack_cd32.app | The 32-bit program executable for a system using Intel® Core™ Duo processor on Mac OS* X.
linpack_cd64.app | The 64-bit program executable for a system using Intel® Core™ microarchitecture on Mac OS* X.
runme32 | A sample shell script for executing a pre-determined problem set for linpack_cd32.app. OMP_NUM_THREADS set to 2 cores.
runme64 | A sample shell script for executing a pre-determined problem set for linpack_cd64.app. OMP_NUM_THREADS set to 2 cores.
lininput | Input file for pre-determined problem for the runme32 script.
lin_cd32.txt | Result of the runme32 script execution.
lin_cd64.txt | Result of the runme64 script execution.
help.lpk | Simple help file.
xhelp.lpk | Extended help file.

See Also
High-level Directory Structure

Running the Software

To obtain results for the pre-determined sample problem sizes on a given system, type one of the following, as appropriate:

    ./runme32
    ./runme64

To run the software for other problem sizes, see the extended help included with the program. Extended help can be viewed by running the program executable with the -e option:

    ./linpack_cd32.app -e
    ./linpack_cd64.app -e

The pre-defined data input file lininput is provided merely as an example. Different systems have different amounts of memory and thus require new input files. The extended help can be used for insight into proper ways to change the sample input files.

lininput requires at least 2 GB of memory.

If the system has less memory than the above sample data input requires, you may need to edit or create your own data input files, as explained in the extended help.

Each sample script uses the OMP_NUM_THREADS environment variable to set the number of processors it is targeting. To optimize performance on a different number of physical processors, change that line appropriately. If you run the Intel Optimized LINPACK Benchmark without setting the number of threads, it will default to the number of cores according to the OS. You can find the settings for this environment variable in the runme* sample scripts. If the settings do not yet match the situation for your machine, edit the script.

Optimization Notice

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

Known Limitations of the Intel® Optimized LINPACK Benchmark

The following limitations are known for the Intel Optimized LINPACK Benchmark for Mac OS* X:
• If an incomplete data input file is given, the binaries may either hang or fault. See the sample data input files and/or the extended help for insight into creating a correct data input file.
• The binary will hang if it is not given an input file or any other arguments.

Appendix A: Intel® Math Kernel Library Language Interfaces Support

Language Interfaces Support, by Function Domain

The following table shows language interfaces that Intel® Math Kernel Library (Intel® MKL) provides for each function domain. However, Intel MKL routines can be called from other languages using mixed-language programming. See Mixed-language Programming with Intel® MKL for an example of how to call Fortran routines from C/C++.
Function Domain | FORTRAN 77 interface | Fortran 90/95 interface | C/C++ interface
Basic Linear Algebra Subprograms (BLAS) | Yes | Yes | via CBLAS
BLAS-like extension transposition routines | Yes | | Yes
Sparse BLAS Level 1 | Yes | Yes | via CBLAS
Sparse BLAS Level 2 and 3 | Yes | Yes | Yes
LAPACK routines for solving systems of linear equations | Yes | Yes | Yes
LAPACK routines for solving least-squares problems, eigenvalue and singular value problems, and Sylvester's equations | Yes | Yes | Yes
Auxiliary and utility LAPACK routines | Yes | | Yes
DSS/PARDISO* solvers | Yes | Yes | Yes
Other Direct and Iterative Sparse Solver routines | Yes | Yes | Yes
Vector Mathematical Library (VML) functions | Yes | Yes | Yes
Vector Statistical Library (VSL) functions | Yes | Yes | Yes
Fourier Transform functions (FFT) | | Yes | Yes
Trigonometric Transform routines | | Yes | Yes
Fast Poisson, Laplace, and Helmholtz Solver (Poisson Library) routines | | Yes | Yes
Optimization (Trust-Region) Solver routines | Yes | Yes | Yes
Data Fitting functions | Yes | Yes | Yes
GMP* arithmetic functions †† | | | Yes
Support functions (including memory allocation) | Yes | Yes | Yes

†† GMP Arithmetic Functions are deprecated and will be removed in a future release.
Include Files

Function domain | Fortran Include Files | C/C++ Include Files
All function domains | mkl.fi | mkl.h
BLAS Routines | blas.f90, mkl_blas.fi | mkl_blas.h
BLAS-like Extension Transposition Routines | mkl_trans.fi | mkl_trans.h
CBLAS Interface to BLAS | | mkl_cblas.h
Sparse BLAS Routines | mkl_spblas.fi | mkl_spblas.h
LAPACK Routines | lapack.f90, mkl_lapack.fi | mkl_lapack.h
C Interface to LAPACK | | mkl_lapacke.h
All Sparse Solver Routines | mkl_solver.f90 | mkl_solver.h
  PARDISO | mkl_pardiso.f77, mkl_pardiso.f90 | mkl_pardiso.h
  DSS Interface | mkl_dss.f77, mkl_dss.f90 | mkl_dss.h
  RCI Iterative Solvers, ILU Factorization | mkl_rci.fi | mkl_rci.h
Optimization Solver Routines | mkl_rci.fi | mkl_rci.h
Vector Mathematical Functions | mkl_vml.f77, mkl_vml.f90 | mkl_vml.h
Vector Statistical Functions | mkl_vsl.f77, mkl_vsl.f90 | mkl_vsl_functions.h
Fourier Transform Functions | mkl_dfti.f90 | mkl_dfti.h
Partial Differential Equations Support Routines
  Trigonometric Transforms | mkl_trig_transforms.f90 | mkl_trig_transforms.h
  Poisson Solvers | mkl_poisson.f90 | mkl_poisson.h
Data Fitting functions | mkl_df.f77, mkl_df.f90 | mkl_df.h
GMP interface † | | mkl_gmp.h
Support functions | mkl_service.f90, mkl_service.fi | mkl_service.h
Memory allocation routines | | i_malloc.h
Intel MKL examples interface | | mkl_example.h

† GMP Arithmetic Functions are deprecated and will be removed in a future release.

See Also
Language Interfaces Support, by Function Domain

Appendix B: Support for Third-Party Interfaces

GMP* Functions

Intel® Math Kernel Library (Intel® MKL) implementation of GMP* arithmetic functions includes arbitrary precision arithmetic operations on integer numbers. The interfaces of such functions fully match the GNU Multiple Precision* (GMP) Arithmetic Library. For specifications of these functions, please see http://software.intel.com/sites/products/documentation/hpc/mkl/gnump/index.htm.
NOTE Intel MKL GMP Arithmetic Functions are deprecated and will be removed in a future release.

If you currently use the GMP* library, you need to change the INCLUDE statements in your programs to reference mkl_gmp.h.

FFTW Interface Support

Intel® Math Kernel Library (Intel® MKL) offers two collections of wrappers for the FFTW interface (www.fftw.org). The wrappers are a layer on top of FFTW that is used for calling the Intel MKL Fourier transform functions. These collections correspond to the FFTW versions 2.x and 3.x and the Intel MKL versions 7.0 and later.

These wrappers enable using Intel MKL Fourier transforms to improve the performance of programs that use FFTW without changing the program source code. See the "FFTW Interface to Intel® Math Kernel Library" appendix in the Intel MKL Reference Manual for details on the use of the wrappers.

Important For ease of use, FFTW3 interface is also integrated in Intel MKL.

Appendix C: Directory Structure in Detail

Tables in this section show contents of the lib directory.

Static Libraries in the lib directory

File | Contents
Interface layer
libmkl_intel.a | Interface library for the Intel compilers. To be used on Intel® 64 architecture systems to support LP64 interface or on IA-32 architecture systems.
libmkl_intel_lp64.a | Interface library for the Intel compilers. To be used on Intel® 64 architecture systems to support LP64 interface or on IA-32 architecture systems.
libmkl_intel_ilp64.a | Interface library for the Intel compilers. To be used on Intel® 64 architecture systems to support ILP64 interface or on IA-32 architecture systems.
libmkl_intel_sp2dp.a | SP2DP interface library for the Intel compilers.
Threading layer
libmkl_intel_thread.a | Threading library for the Intel compilers
libmkl_pgi_thread.a | Threading library for the PGI* compiler
libmkl_sequential.a | Sequential library
Computational layer
libmkl_core.a | Kernel library
libmkl_solver_lp64.a | Deprecated. Empty library for backward compatibility
libmkl_solver_lp64_sequential.a | Deprecated. Empty library for backward compatibility
libmkl_solver_ilp64.a | Deprecated. Empty library for backward compatibility
libmkl_solver_ilp64_sequential.a | Deprecated. Empty library for backward compatibility

Dynamic Libraries in the lib directory

File | Contents
libmkl_rt.dylib | Single Dynamic Library
Interface layer
libmkl_intel.dylib | Interface library for the Intel compilers. To be used on Intel® 64 architecture systems to support LP64 interface or on IA-32 architecture systems.
libmkl_intel_lp64.dylib | Interface library for the Intel compilers. To be used on Intel® 64 architecture systems to support LP64 interface or on IA-32 architecture systems.
libmkl_intel_ilp64.dylib | Interface library for the Intel compilers. To be used on Intel® 64 architecture systems to support ILP64 interface or on IA-32 architecture systems.
libmkl_intel_sp2dp.dylib | SP2DP interface library for the Intel compilers.
Threading layer
libmkl_intel_thread.dylib | Threading library for the Intel compilers
libmkl_sequential.dylib | Sequential library
Computational layer
libmkl_core.dylib | Contains the dispatcher for dynamic load of the processor-specific kernel library
libmkl_lapack.dylib | LAPACK and DSS/PARDISO routines and drivers
libmkl_mc.dylib | 64-bit kernel for processors based on the Intel® Core™ microarchitecture
libmkl_mc3.dylib | 64-bit kernel for the Intel® Core™ i7 processors
libmkl_p4p.dylib | 32-bit kernel for the Intel® Pentium® 4 processor with Streaming SIMD Extensions 3 (SSE3), including Intel® Core™ Duo and Intel® Core™ Solo processors.
libmkl_p4m.dylib | 32-bit kernel for the Intel® Core™ microarchitecture
libmkl_p4m3.dylib | 32-bit kernel library for the Intel® Core™ i7 processors
libmkl_vml_mc.dylib | 64-bit VML for processors based on the Intel® Core™ microarchitecture
libmkl_vml_mc2.dylib | 64-bit VML/VSL for 45nm Hi-k Intel® Core™2 and the Intel Xeon® processor families
libmkl_vml_mc3.dylib | 64-bit VML/VSL for the Intel® Core™ i7 processors
libmkl_vml_p4p.dylib | 32-bit VML for the Intel® Pentium® 4 processor with Streaming SIMD Extensions 3 (SSE3)
libmkl_vml_p4m.dylib | 32-bit VML for processors based on Intel® Core™ microarchitecture
libmkl_vml_p4m2.dylib | 32-bit VML/VSL for 45nm Hi-k Intel® Core™2 and Intel Xeon® processor families
libmkl_vml_p4m3.dylib | 32-bit VML/VSL for the Intel® Core™ i7 processors
libmkl_vml_avx.dylib | VML/VSL optimized for the Intel® Advanced Vector Extensions (Intel® AVX)
RTL
locale/en_US/mkl_msg.cat | Catalog of Intel® Math Kernel Library (Intel® MKL) messages in English

Intel® Math Kernel Library for Windows* OS User's Guide

Intel® MKL - Windows* OS
Document Number: 315930-018US

Contents

Legal Information
Introducing the Intel® Math Kernel Library
Getting Help and Support
Notational Conventions

Chapter 1: Overview
  Document Overview
  What's New
  Related Information

Chapter 2: Getting Started
  Checking Your Installation
  Setting Environment Variables
  Compiler Support
  Using Code Examples
  What You Need to Know Before You Begin Using the Intel® Math Kernel Library

Chapter 3: Structure of the Intel® Math Kernel Library
  Architecture Support
  High-level Directory Structure
  Layered Model Concept
  Contents of the Documentation Directories

Chapter 4: Linking Your Application with the Intel® Math Kernel Library
  Linking Quick Start
  Using the /Qmkl Compiler Option
  Automatically Linking a Project in the Visual Studio* Integrated Development Environment with Intel® MKL
  Automatically Linking Your Microsoft Visual C/C++* Project with Intel® MKL
  Automatically Linking Your Intel® Visual Fortran Project with Intel® MKL
  Using the Single Dynamic Library
  Selecting Libraries to Link with
  Using the Link-line Advisor
  Using the Command-line Link Tool
  Linking Examples
  Linking on IA-32 Architecture Systems
  Linking on Intel(R) 64 Architecture Systems
  Linking in Detail
  Dynamically Selecting the Interface and Threading Layer
  Linking with Interface Libraries
  Using the cdecl and stdcall Interfaces
  Using the ILP64 Interface vs. LP64 Interface
  Linking with Fortran 95 Interface Libraries
  Linking with Threading Libraries
  Sequential Mode of the Library
  Selecting the Threading Layer
  Linking with Computational Libraries
  Linking with Compiler Run-time Libraries
  Linking with System Libraries
  Building Custom Dynamic-link Libraries
  Using the Custom Dynamic-link Library Builder in the Command-line Mode
  Composing a List of Functions
  Specifying Function Names
  Building a Custom Dynamic-link Library in the Visual Studio* Development System
  Distributing Your Custom Dynamic-link Library

Chapter 5: Managing Performance and Memory
  Using Parallelism of the Intel® Math Kernel Library
  Threaded Functions and Problems
  Avoiding Conflicts in the Execution Environment
  Techniques to Set the Number of Threads
  Setting the Number of Threads Using an OpenMP* Environment Variable
  Changing the Number of Threads at Run Time
  Using Additional Threading Control
  Intel MKL-specific Environment Variables for Threading Control
  MKL_DYNAMIC
  MKL_DOMAIN_NUM_THREADS
  Setting the Environment Variables for Threading Control
  Tips and Techniques to Improve Performance
  Coding Techniques
  Hardware Configuration Tips
  Managing Multi-core Performance
  Operating on Denormals
  FFT Optimized Radices
  Using Memory Management
  Intel MKL Memory Management Software
  Redefining Memory Functions

Chapter 6: Language-specific Usage Options
  Using Language-Specific Interfaces with Intel® Math Kernel Library
  Interface Libraries and Modules
  Fortran 95 Interfaces to LAPACK and BLAS
  Compiler-dependent Functions and Fortran 90 Modules
  Using the stdcall Calling Convention in C/C++
  Compiling an Application that Calls the Intel® Math Kernel Library and Uses the CVF Calling Conventions
  Mixed-language Programming with the Intel Math Kernel Library
  Calling LAPACK, BLAS, and CBLAS Routines from C/C++ Language Environments
  Using Complex Types in C/C++
  Calling BLAS Functions that Return the Complex Values in C/C++ Code
  Support for Boost uBLAS Matrix-matrix Multiplication
  Invoking Intel MKL Functions from Java* Applications
  Intel MKL Java* Examples
  Running the Java* Examples
  Known Limitations of the Java* Examples

Chapter 7: Coding Tips
  Aligning Data for Consistent Results
  Using Predefined Preprocessor Symbols for Intel® MKL Version-Dependent Compilation

Chapter 8: Working with the Intel® Math Kernel Library Cluster Software
  MPI Support
  Linking with ScaLAPACK and Cluster FFTs
  Determining the Number of Threads
  Using DLLs
  Setting Environment Variables on a Cluster
  Building ScaLAPACK Tests
  Examples for Linking with ScaLAPACK and Cluster FFT
  Examples for Linking a C Application
  Examples for Linking a Fortran Application

Chapter 9: Programming with Intel® Math Kernel Library in Integrated Development Environments (IDE)
  Configuring Your Integrated Development Environment to Link with Intel Math Kernel Library
  Configuring the Microsoft Visual C/C++* Development System to Link with Intel® MKL
  Configuring Intel® Visual Fortran to Link with Intel MKL
  Running an Intel MKL Example in the Visual Studio* 2008 IDE
  Creating, Configuring, and Running the Intel® C/C++ and/or Visual C++* 2008 Project
  Creating, Configuring, and Running the Intel Visual Fortran Project
  Support Files for Intel® Math Kernel Library Examples
  Known Limitations of the Project Creation Procedure
  Getting Assistance for Programming in the Microsoft Visual Studio* IDE
  Viewing Intel MKL Documentation in Visual Studio* IDE
  Using Context-Sensitive Help
  Using the IntelliSense* Capability

Chapter 10: LINPACK and MP LINPACK Benchmarks
  Intel® Optimized LINPACK Benchmark for Windows* OS
  Contents of the Intel® Optimized LINPACK Benchmark
  Running the Software
  Known Limitations of the Intel® Optimized LINPACK Benchmark
  Intel® Optimized MP LINPACK Benchmark for Clusters
  Overview of the Intel® Optimized MP LINPACK Benchmark for Clusters
  Contents of the Intel® Optimized MP LINPACK Benchmark for Clusters
  Building the MP LINPACK
  New Features of Intel® Optimized MP LINPACK Benchmark
  Benchmarking a Cluster
  Options to Reduce Search Time

Appendix A: Intel® Math Kernel Library Language Interfaces Support
  Language Interfaces Support, by Function Domain
  Include Files

Appendix B: Support for Third-Party Interfaces
  GMP* Functions
  FFTW Interface Support

Appendix C: Directory Structure in Detail
  Detailed Structure of the IA-32 Architecture Directories
  Static Libraries in the lib\ia32 Directory
  Dynamic Libraries in the lib\ia32 Directory
  Contents of the redist\ia32\mkl Directory
  Detailed Structure of the Intel® 64 Architecture Directories
  Static Libraries in the lib\intel64 Directory
  Dynamic Libraries in the lib\intel64 Directory
  Contents of the redist\intel64\mkl Directory

Legal Information

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT.
EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR.

Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.

Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm

Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. Go to: http://www.intel.com/products/processor_number/

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

BlueMoon, BunnyPeople, Celeron, Celeron Inside, Centrino, Centrino Inside, Cilk, Core Inside, E-GOLD, i960, Intel, the Intel logo, Intel AppUp, Intel Atom, Intel Atom Inside, Intel Core, Intel Inside, Intel Insider, the Intel Inside logo, Intel NetBurst, Intel NetMerge, Intel NetStructure, Intel SingleDriver, Intel SpeedStep, Intel Sponsors of Tomorrow., the Intel Sponsors of Tomorrow. logo, Intel StrataFlash, Intel vPro, Intel XScale, InTru, the InTru logo, the InTru Inside logo, InTru soundmark, Itanium, Itanium Inside, MCS, MMX, Moblin, Pentium, Pentium Inside, Puma, skoool, the skoool logo, SMARTi, Sound Mark, The Creators Project, The Journey Inside, Thunderbolt, Ultrabook, vPro Inside, VTune, Xeon, Xeon Inside, X-GOLD, XMM, X-PMU and XPOSYS are trademarks of Intel Corporation in the U.S. and/or other countries.

*Other names and brands may be claimed as the property of others.

Microsoft, Windows, Visual Studio, Visual C++, and the Windows logo are trademarks, or registered trademarks of Microsoft Corporation in the United States and/or other countries.

Java is a registered trademark of Oracle and/or its affiliates.

Copyright © 2007 - 2011, Intel Corporation. All rights reserved.

Microsoft product screen shot(s) reprinted with permission from Microsoft Corporation.

Optimization Notice

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

Introducing the Intel® Math Kernel Library

The Intel® Math Kernel Library (Intel® MKL) improves performance of scientific, engineering, and financial software that solves large computational problems. Among other functionality, Intel MKL provides linear algebra routines, fast Fourier transforms, as well as vectorized math and random number generation functions, all optimized for the latest Intel processors, including processors with multiple cores (see the Intel® MKL Release Notes for the full list of supported processors). Intel MKL also performs well on non-Intel processors.

Intel MKL is thread-safe and extensively threaded using the OpenMP* technology.

Intel MKL provides the following major functionality:

• Linear algebra, implemented in LAPACK (solvers and eigensolvers) plus level 1, 2, and 3 BLAS, offering the vector, vector-matrix, and matrix-matrix operations needed for complex mathematical software. If you prefer the FORTRAN 90/95 programming language, you can call LAPACK driver and computational subroutines through specially designed interfaces with reduced numbers of arguments. A C interface to LAPACK is also available.
• ScaLAPACK (SCAlable LAPACK) with its support functionality including the Basic Linear Algebra Communications Subprograms (BLACS) and the Parallel Basic Linear Algebra Subprograms (PBLAS). ScaLAPACK is available for Intel MKL for Linux* and Windows* operating systems.
• Direct sparse solver, an iterative sparse solver, and a supporting set of sparse BLAS (level 1, 2, and 3) for solving sparse systems of equations.
• Multidimensional discrete Fourier transforms (1D, 2D, 3D) with a mixed radix support (for sizes not limited to powers of 2). Distributed versions of these functions are provided for use on clusters on the Linux* and Windows* operating systems.
• A set of vectorized transcendental functions called the Vector Math Library (VML). For most of the supported processors, the Intel MKL VML functions offer greater performance than the libm (scalar) functions, while keeping the same high accuracy.
• The Vector Statistical Library (VSL), which offers high performance vectorized random number generators for several probability distributions, convolution and correlation routines, and summary statistics functions.
• Data Fitting Library, which provides capabilities for spline-based approximation of functions, derivatives and integrals of functions, and search.

For details see the Intel® MKL Reference Manual.

Getting Help and Support

Intel provides a support web site that contains a rich repository of self help information, including getting started tips, known product issues, product errata, license information, user forums, and more. Visit the Intel MKL support website at http://www.intel.com/software/products/support/.

The Intel MKL documentation integrates into the Microsoft Visual Studio* integrated development environment (IDE). See Getting Assistance for Programming in the Microsoft Visual Studio* IDE.

Notational Conventions

The following term is used in reference to the operating system.

Windows* OS
    This term refers to information that is valid on all supported Windows* operating systems.

The following notations are used to refer to Intel MKL directories.

    The installation directory for the Intel® C++ Composer XE or Intel® Visual Fortran Composer XE .

    The main directory where Intel MKL is installed: =\mkl. Replace this placeholder with the specific pathname in the configuring, linking, and building instructions.

The following font conventions are used in this document.

Italic
    Italic is used for emphasis and also indicates document names in body text, for example: see Intel MKL Reference Manual.

Monospace lowercase mixed with uppercase
    Indicates:
    • Commands and command-line options, for example,
      ifort myprog.f mkl_blas95.lib mkl_c.lib libiomp5md.lib
    • Filenames, directory names, and pathnames, for example,
      C:\Program Files\Java\jdk1.5.0_09
    • C/C++ code fragments, for example,
      a = new double [SIZE*SIZE];

UPPERCASE MONOSPACE
    Indicates system variables, for example, $MKLPATH.

Monospace italic
    Indicates a parameter in discussions, for example, lda.
    When enclosed in angle brackets, indicates a placeholder for an identifier, an expression, a string, a symbol, or a value, for example, . Substitute one of these items for the placeholder.

[ items ]
    Square brackets indicate that the items enclosed in brackets are optional.

{ item | item }
    Braces indicate that only one of the items listed between braces should be selected. A vertical bar ( | ) separates the items.

Chapter 1: Overview

Document Overview

The Intel® Math Kernel Library (Intel® MKL) User's Guide provides usage information for the library. The usage information covers the organization, configuration, performance, and accuracy of Intel MKL, specifics of routine calls in mixed-language programming, linking, and more.

This guide describes OS-specific usage of Intel MKL, along with OS-independent features. The document contains usage information for all Intel MKL function domains.

This User's Guide provides the following information:
• Describes post-installation steps to help you start using the library
• Shows you how to configure the library with your development environment
• Acquaints you with the library structure
• Explains how to link your application with the library and provides simple usage scenarios
• Describes how to code, compile, and run your application with Intel MKL

This guide is intended for Windows OS programmers with beginner to advanced experience in software development.

See Also
Language Interfaces Support, by Function Domain

What's New

This User's Guide documents the Intel® Math Kernel Library (Intel® MKL) 10.3 Update 8.

The document was updated to reflect addition of Data Fitting Functions to the product and to describe how to build a custom dynamic-link library in the Visual Studio* Development System (see Building a Custom Dynamic-link Library in the Visual Studio* Development System).

Related Information

For information on how to use the library in your application, use this guide in conjunction with the following documents:
• The Intel® Math Kernel Library Reference Manual, which provides reference information on routine functionalities, parameter descriptions, interfaces, calling syntaxes, and return values.
• The Intel® Math Kernel Library for Windows* OS Release Notes.

Chapter 2: Getting Started

Checking Your Installation

After installing the Intel® Math Kernel Library (Intel® MKL), verify that the library is properly installed and configured:

1. Intel MKL installs in . Check that the subdirectory of referred to as was created. Check that subdirectories for Intel MKL redistributable DLLs redist\ia32\mkl and redist\intel64\mkl were created in the directory (See redist.txt in the Intel MKL documentation directory for a list of files that can be redistributed.)
2. If you want to keep multiple versions of Intel MKL installed on your system, update your build scripts to point to the correct Intel MKL version.
3. Check that the following files appear in the \bin directory and its subdirectories:
       mklvars.bat
       ia32\mklvars_ia32.bat
       intel64\mklvars_intel64.bat
   Use these files to assign Intel MKL-specific values to several environment variables, as explained in Setting Environment Variables.
4. To understand how the Intel MKL directories are structured, see Intel® Math Kernel Library Structure.
5. To make sure that Intel MKL runs on your system, do one of the following:
   • Launch an Intel MKL example, as explained in Using Code Examples
   • In the Visual Studio* IDE, create and run a simple project that uses Intel MKL, as explained in Running an Intel MKL Example in the Visual Studio IDE

See Also
Notational Conventions

Setting Environment Variables

When the installation of Intel MKL for Windows* OS is complete, set the PATH, LIB, and INCLUDE environment variables in the command shell using one of the script files in the bin subdirectory of the Intel MKL installation directory:

    ia32\mklvars_ia32.bat          for the IA-32 architecture,
    intel64\mklvars_intel64.bat    for the Intel® 64 architecture,
    mklvars.bat                    for the IA-32 and Intel® 64 architectures.

Running the Scripts

The scripts accept parameters to specify the following:
• Architecture.
• Addition of a path to Fortran 95 modules precompiled with the Intel® Fortran compiler to the INCLUDE environment variable. Supply this parameter only if you are using the Intel® Fortran compiler.
• Interface of the Fortran 95 modules. This parameter is needed only if you requested addition of a path to the modules.

Usage and values of these parameters depend on the script. The following table lists values of the script parameters.

Script             Architecture                  Addition of a Path to             Interface
                   (required, when applicable)   Fortran 95 Modules (optional)     (optional)
mklvars_ia32       n/a †                         mod                               n/a
mklvars_intel64    n/a                           mod                               lp64 (default), ilp64
mklvars            ia32, intel64                 mod                               lp64 (default), ilp64

† Not applicable.

For example:
• The command
      mklvars_ia32
  sets environment variables for the IA-32 architecture and adds no path to the Fortran 95 modules.
• The command
      mklvars_intel64 mod ilp64
  sets environment variables for the Intel® 64 architecture and adds the path to the Fortran 95 modules for the ILP64 interface to the INCLUDE environment variable.
• The command
      mklvars intel64 mod
  sets environment variables for the Intel® 64 architecture and adds the path to the Fortran 95 modules for the LP64 interface to the INCLUDE environment variable.

NOTE Supply the parameter specifying the architecture first, if it is needed. Values of the other two parameters can be listed in any order.

See Also
High-level Directory Structure
Interface Libraries and Modules
Fortran 95 Interfaces to LAPACK and BLAS
Setting the Number of Threads Using an OpenMP* Environment Variable

Compiler Support

Intel MKL supports compilers identified in the Release Notes. However, the library has been successfully used with other compilers as well.

Although Compaq no longer supports the Compaq Visual Fortran* (CVF) compiler, Intel MKL still preserves the CVF interface in the IA-32 architecture implementation. You can use this interface with the Intel® Fortran Compiler. Intel MKL provides both stdcall (default CVF interface) and cdecl (default interface of the Microsoft Visual C* application) interfaces for the IA-32 architecture.

Intel MKL provides a set of include files to simplify program development by specifying enumerated values and prototypes for the respective functions. Calling Intel MKL functions from your application without an appropriate include file may lead to incorrect behavior of the functions.

See Also
Compiling an Application that Calls the Intel® Math Kernel Library and Uses the CVF Calling Conventions
Using the cdecl and stdcall Interfaces
Include Files

Using Code Examples

The Intel MKL package includes code examples, located in the examples subdirectory of the installation directory. Use the examples to determine:
• Whether Intel MKL is working on your system
• How you should call the library
• How to link the library

The examples are grouped in subdirectories mainly by Intel MKL function domains and programming languages. For example, the examples\spblas subdirectory contains a makefile to build the Sparse BLAS examples and the examples\vmlc subdirectory contains the makefile to build the C VML examples. Source code for the examples is in the next-level sources subdirectory.

See Also
High-level Directory Structure
Running an Intel MKL Example in the Visual Studio* 2008 IDE

What You Need to Know Before You Begin Using the Intel® Math Kernel Library

Target platform
    Identify the architecture of your target machine:
    • IA-32 or compatible
    • Intel® 64 or compatible
    Reason: Because Intel MKL libraries are located in directories corresponding to your particular architecture (see Architecture Support), you should provide proper paths on your link lines (see Linking Examples). To configure your development environment for use with Intel MKL, set your environment variables using the script corresponding to your architecture (see Setting Environment Variables for details).

Mathematical problem
    Identify all Intel MKL function domains that you require:
    • BLAS
    • Sparse BLAS
    • LAPACK
    • PBLAS
    • ScaLAPACK
    • Sparse Solver routines
    • Vector Mathematical Library functions (VML)
    • Vector Statistical Library functions
    • Fourier Transform functions (FFT)
    • Cluster FFT
    • Trigonometric Transform routines
    • Poisson, Laplace, and Helmholtz Solver routines
    • Optimization (Trust-Region) Solver routines
    • Data Fitting Functions
    • GMP* arithmetic functions. Deprecated and will be removed in a future release
    Reason: The function domain you intend to use narrows the search in the Reference Manual for specific routines you need. Additionally, if you are using the Intel MKL cluster software, your link line is function-domain specific (see Working with the Cluster Software). Coding tips may also depend on the function domain (see Tips and Techniques to Improve Performance).

Programming language
    Intel MKL provides support for both Fortran and C/C++ programming. Identify the language interfaces that your function domains support (see Intel® Math Kernel Library Language Interfaces Support).
    Reason: Intel MKL provides language-specific include files for each function domain to simplify program development (see Language Interfaces Support, by Function Domain). For a list of language-specific interface libraries and modules and an example of how to generate them, see also Using Language-Specific Interfaces with Intel® Math Kernel Library.

Range of integer data
    If your system is based on the Intel 64 architecture, identify whether your application performs calculations with large data arrays (of more than 2^31 - 1 elements).
    Reason: To operate on large data arrays, you need to select the ILP64 interface, where integers are 64-bit; otherwise, use the default, LP64, interface, where integers are 32-bit (see Using the ILP64 Interface vs. LP64 Interface).

Threading model
    Identify whether and how your application is threaded:
    • Threaded with the Intel compiler
    • Threaded with a third-party compiler
    • Not threaded
    Reason: The compiler you use to thread your application determines which threading library you should link with your application. For applications threaded with a third-party compiler you may need to use Intel MKL in the sequential mode (for more information, see Sequential Mode of the Library and Linking with Threading Libraries).

Number of threads
    Determine the number of threads you want Intel MKL to use.
    Reason: Intel MKL is based on the OpenMP* threading. By default, the OpenMP* software sets the number of threads that Intel MKL uses. If you need a different number, you have to set it yourself using one of the available mechanisms. For more information, see Using Parallelism of the Intel® Math Kernel Library.

Linking model
    Decide which linking model is appropriate for linking your application with Intel MKL libraries:
    • Static
    • Dynamic
    Reason: The link libraries for static and dynamic linking are different. For the list of link libraries for static and dynamic models, linking examples, and other relevant topics, like how to save disk space by creating a custom dynamic library, see Linking Your Application with the Intel® Math Kernel Library.

MPI used
    Decide what MPI you will use with the Intel MKL cluster software. You are strongly encouraged to use Intel® MPI 3.2 or later.
    Reason: To link your application with ScaLAPACK and/or Cluster FFT, the libraries corresponding to your particular MPI should be listed on the link line (see Working with the Cluster Software).

Chapter 3: Structure of the Intel® Math Kernel Library

Architecture Support

Intel® Math Kernel Library (Intel® MKL) for Windows* OS provides two architecture-specific implementations. The following table lists the supported architectures and directories where each architecture-specific implementation is located.

Architecture               Location
IA-32 or compatible        \lib\ia32
                           \redist\ia32\mkl (DLLs)
Intel® 64 or compatible    \lib\intel64
                           \redist\intel64\mkl (DLLs)

See Also
High-level Directory Structure
Detailed Structure of the IA-32 Architecture Directories
Detailed Structure of the Intel® 64 Architecture Directories

High-level Directory Structure

Directory                  Contents
                           Installation directory of the Intel® Math Kernel Library (Intel® MKL)

Subdirectories of
bin                        Batch files to set environment variables in the user shell
bin\ia32                   Batch files for the IA-32 architecture
bin\intel64                Batch files for the Intel® 64 architecture
benchmarks\linpack         Shared-Memory (SMP) version of the LINPACK benchmark
benchmarks\mp_linpack      Message-passing interface (MPI) version of the LINPACK benchmark
lib\ia32                   Static libraries and static interfaces to DLLs for the IA-32 architecture
lib\intel64                Static libraries and static interfaces to DLLs for the Intel® 64 architecture
examples                   Examples directory. Each subdirectory has source and data files
include                    INCLUDE files for the library routines, as well as for tests and examples
include\ia32               Fortran 95 .mod files for the IA-32 architecture and Intel Fortran compiler
include\intel64\lp64       Fortran 95 .mod files for the Intel® 64 architecture, Intel® Fortran compiler, and LP64 interface
include\intel64\ilp64      Fortran 95 .mod files for the Intel® 64 architecture, Intel Fortran compiler, and ILP64 interface
include\fftw               Header files for the FFTW2 and FFTW3 interfaces
interfaces\blas95          Fortran 95 interfaces to BLAS and a makefile to build the library
interfaces\fftw2x_cdft     MPI FFTW 2.x interfaces to Intel MKL Cluster FFTs
interfaces\fftw3x_cdft     MPI FFTW 3.x interfaces to Intel MKL Cluster FFTs
interfaces\fftw2xc         FFTW 2.x interfaces to the Intel MKL FFTs (C interface)
interfaces\fftw2xf         FFTW 2.x interfaces to the Intel MKL FFTs (Fortran interface)
interfaces\fftw3xc         FFTW 3.x interfaces to the Intel MKL FFTs (C interface)
interfaces\fftw3xf         FFTW 3.x interfaces to the Intel MKL FFTs (Fortran interface)
interfaces\lapack95        Fortran 95 interfaces to LAPACK and a makefile to build the library
tests                      Source and data files for tests
tools                      Command-line link tool and tools for creating custom dynamically linkable libraries
tools\builder              Tools for creating custom dynamically linkable libraries

Subdirectories of
redist\ia32\mkl            DLLs for applications running on processors with the IA-32 architecture
redist\intel64\mkl         DLLs for applications running on processors with Intel® 64 architecture
Documentation\en_US\MKL    Intel MKL documentation
Documentation\vshelp\1033\intel.mkldocs
                           Help2-format files for integration of the Intel MKL documentation with the Microsoft Visual Studio* 2005/2008 IDE
Documentation\msvhelp\1033\mkl
                           Microsoft Help Viewer*-format files for integration of the Intel MKL documentation with the Microsoft Visual Studio* 2010 IDE

See Also
Notational Conventions

Layered Model Concept

Intel MKL is structured to support multiple compilers and interfaces, different OpenMP* implementations, both serial and multiple threads, and a wide range of processors. Conceptually Intel MKL can be divided into distinct parts to support different interfaces, threading models, and core computations:

1. Interface Layer
2. Threading Layer
3. Computational Layer

You can combine Intel MKL libraries to meet your needs by linking with one library in each part, layer by layer. Once the interface library is selected, the threading library you select picks up the chosen interface, and the computational library uses interfaces and OpenMP implementation (or non-threaded mode) chosen in the first two layers.

To support threading with different compilers, one more layer is needed, which contains libraries not included in Intel MKL:
• Compiler run-time libraries (RTL).

The following table provides more details of each layer.

Layer / Description

Interface Layer
    This layer matches compiled code of your application with the threading and/or computational parts of the library. This layer provides:
    • cdecl and CVF default interfaces.
    • LP64 and ILP64 interfaces.
    • Compatibility with compilers that return function values differently.
    • A mapping between single-precision names and double-precision names for applications using Cray*-style naming (SP2DP interface).

    SP2DP interface supports Cray-style naming in applications targeted for the Intel 64 architecture and using the ILP64 interface. SP2DP interface provides a mapping between single-precision names (for both real and complex types) in the application and double-precision names in Intel MKL BLAS and LAPACK. Function names are mapped as shown in the following example for BLAS functions ?GEMM:

        SGEMM -> DGEMM
        DGEMM -> DGEMM
        CGEMM -> ZGEMM
        ZGEMM -> ZGEMM

    Note that no changes are made to double-precision names.
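
The ?GEMM name mapping above can be sketched in C as follows. This is only an illustration of the naming rule, not part of Intel MKL: sp2dp_name is a hypothetical helper, and it assumes that the precision is encoded in the first letter of the routine name, as it is for ?GEMM. In Intel MKL the mapping is provided by the interface layer itself, so application code does not need such a helper.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not an Intel MKL function.
   Copies the routine name `name` into `out` (at most out_size - 1
   characters) and rewrites the leading precision letter the way the
   ?GEMM example describes: S -> D (real) and C -> Z (complex).
   Names that already start with D or Z are copied unchanged.
   Assumption: the precision letter is the first character of the name. */
void sp2dp_name(const char *name, char *out, size_t out_size)
{
    assert(name != NULL && out != NULL);
    if (out_size == 0) {
        return;                      /* nothing can be stored */
    }
    strncpy(out, name, out_size - 1);
    out[out_size - 1] = '\0';        /* strncpy does not always terminate */

    switch (out[0]) {
    case 'S': out[0] = 'D'; break;   /* single real    -> double real    */
    case 's': out[0] = 'd'; break;
    case 'C': out[0] = 'Z'; break;   /* single complex -> double complex */
    case 'c': out[0] = 'z'; break;
    default:  break;                 /* D, Z, and other names: unchanged */
    }
}

/* Example use:
       char mapped[16];
       sp2dp_name("SGEMM", mapped, sizeof mapped);   // mapped holds "DGEMM"
       sp2dp_name("ZGEMM", mapped, sizeof mapped);   // mapped holds "ZGEMM"
*/
```

The helper only rewrites the first character, so it does not cover routines whose names carry the precision letter elsewhere; treat it as a reading aid for the table, not as a description of how the SP2DP interface library is implemented.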

Threading Layer
    This layer:
    • Provides a way to link threaded Intel MKL with different threading compilers.
    • Enables you to link with a threaded or sequential mode of the library.
    This layer is compiled for different environments (threaded or sequential) and compilers (from Intel, Microsoft, and so on).

Computational Layer
    This layer is the heart of Intel MKL. It has only one library for each combination of architecture and supported OS. The Computational layer accommodates multiple architectures through identification of architecture features and chooses the appropriate binary code at run time.

Compiler Run-time Libraries (RTL)
    To support threading with Intel compilers, Intel MKL uses RTLs of the Intel® C++ Composer XE or Intel® Visual Fortran Composer XE. To thread using third-party threading compilers, use libraries in the Threading layer or an appropriate compatibility library.

See Also
Using the ILP64 Interface vs. LP64 Interface
Linking Your Application with the Intel® Math Kernel Library
Linking with Threading Libraries

Contents of the Documentation Directories

Most of Intel MKL documentation is installed at \Documentation\ \mkl. For example, the documentation in English is installed at \Documentation\en_US\mkl. However, some Intel MKL-related documents are installed one or two levels up. The following table lists MKL-related documentation.

File name                      Comment

Files in \Documentation
\clicense.rtf or \flicense.rtf
                               Common end user license for the Intel® C++ Composer XE 2011 or Intel® Visual Fortran Composer XE 2011, respectively
mklsupport.txt                 Information on package number for customer support reference

Contents of \Documentation\\mkl
redist.txt                     List of redistributable files
mkl_documentation.htm          Overview and links for the Intel MKL documentation
mkl_manual\index.htm           Intel MKL Reference Manual in an uncompressed HTML format
Release_Notes.htm              Intel MKL Release Notes
mkl_userguide\index.htm        Intel MKL User's Guide in an uncompressed HTML format, this document
mkl_link_line_advisor.htm      Intel MKL Link-line Advisor

Chapter 4: Linking Your Application with the Intel® Math Kernel Library

Linking Quick Start

Intel® Math Kernel Library (Intel® MKL) provides several options for quick linking of your application. The simplest options depend on your development environment:

Intel® Composer XE compiler
    see Using the /Qmkl Compiler Option.
Microsoft Visual Studio* Integrated Development Environment (IDE)
    see Automatically Linking a Project in the Visual Studio* IDE with Intel MKL.
Other options are independent of your development environment, but depend on the way you link:

Explicit dynamic linking
    see Using the Single Dynamic Library for how to simplify your link line.
Explicitly listing libraries on your link line
    see Selecting Libraries to Link with for a summary of the libraries.
Using an interactive interface
    see Using the Link-line Advisor to determine libraries and options to specify on your link or compilation line.
Using an internally provided tool
    see Using the Command-line Link Tool to determine libraries, options, and environment variables or even compile and build your application.

Using the /Qmkl Compiler Option

The Intel® Composer XE compiler supports the following variants of the /Qmkl compiler option:

/Qmkl or /Qmkl:parallel
    to link with standard threaded Intel MKL.
/Qmkl:sequential
    to link with sequential version of Intel MKL.
/Qmkl:cluster
    to link with Intel MKL cluster components (sequential) that use Intel MPI.

For more information on the /Qmkl compiler option, see the Intel Compiler User and Reference Guides.

For each variant of the /Qmkl option, the compiler links your application using the following conventions:
• cdecl for the IA-32 architecture
• LP64 for the Intel® 64 architecture

If you specify any variant of the /Qmkl compiler option, the compiler automatically includes the Intel MKL libraries. In cases not covered by the option, use the Link-line Advisor or see Linking in Detail.

See Also
Using the ILP64 Interface vs. LP64 Interface
Using the Link-line Advisor
Intel® Software Documentation Library

Automatically Linking a Project in the Visual Studio* Integrated Development Environment with Intel® MKL

After a default installation of the Intel® Math Kernel Library (Intel® MKL), Intel® C++ Composer XE, or Intel® Visual Fortran Composer XE, you can easily configure your project to automatically link with Intel MKL.
Automatically Linking Your Microsoft Visual C/C++* Project with Intel® MKL

Configure your Microsoft Visual C/C++* project for automatic linking with Intel MKL as follows:

• For the Visual Studio* 2010 development system:
  1. Go to Project>Properties>Configuration Properties>Intel Performance Libraries.
  2. Change the Use MKL property setting by selecting Parallel, Sequential, or Cluster as appropriate.
• For the Visual Studio 2005/2008 development system:
  1. Go to Project>Intel C++ Composer XE 2011>Select Build Components.
  2. From the Use MKL drop-down menu, select Parallel, Sequential, or Cluster as appropriate.

Specific Intel MKL libraries that link with your application may depend on more project settings. For details, see the Intel® Composer XE documentation.

See Also
Intel® Software Documentation Library

Automatically Linking Your Intel® Visual Fortran Project with Intel® MKL

Configure your Intel® Visual Fortran project for automatic linking with Intel MKL as follows:

Go to Project > Properties > Libraries > Use Intel Math Kernel Library and select Parallel, Sequential, or Cluster as appropriate.

Specific Intel MKL libraries that link with your application may depend on more project settings. For details see the Intel® Visual Fortran Compiler XE User and Reference Guides.

See Also
Intel® Software Documentation Library

Using the Single Dynamic Library

You can simplify your link line through the use of the Intel MKL Single Dynamic Library (SDL).

To use SDL, place mkl_rt.lib on your link line. For example:

    icl.exe application.c mkl_rt.lib

mkl_rt.lib is the import library for mkl_rt.dll.

SDL enables you to select the interface and threading library for Intel MKL at run time.
By default, linking with SDL provides:
• LP64 interface on systems based on the Intel® 64 architecture
• Intel threading

To use other interfaces or change threading preferences, including use of the sequential version of Intel MKL, you need to specify your choices using functions or environment variables as explained in section Dynamically Selecting the Interface and Threading Layer.

Selecting Libraries to Link with

To link with Intel MKL:
• Choose one library from the Interface layer and one library from the Threading layer
• Add the only library from the Computational layer and run-time libraries (RTL)

The following table lists Intel MKL libraries to link with your application.

                                            Interface layer          Threading layer            Computational layer   RTL
IA-32 architecture, static linking          mkl_intel_c.lib          mkl_intel_thread.lib       mkl_core.lib          libiomp5md.lib
IA-32 architecture, dynamic linking         mkl_intel_c_dll.lib      mkl_intel_thread_dll.lib   mkl_core_dll.lib      libiomp5md.lib
Intel® 64 architecture, static linking      mkl_intel_lp64.lib       mkl_intel_thread.lib       mkl_core.lib          libiomp5md.lib
Intel® 64 architecture, dynamic linking     mkl_intel_lp64_dll.lib   mkl_intel_thread_dll.lib   mkl_core_dll.lib      libiomp5md.lib

The Single Dynamic Library (SDL) automatically links interface, threading, and computational libraries and thus simplifies linking. The following table lists Intel MKL libraries for dynamic linking using SDL. See Dynamically Selecting the Interface and Threading Layer for how to set the interface and threading layers at run time through function calls or environment settings.

                                      SDL          RTL
IA-32 and Intel® 64 architectures     mkl_rt.lib   libiomp5md.lib †

† Linking with libiomp5md.lib is not required.

For exceptions and alternatives to the libraries listed above, see Linking in Detail.
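The selection rule above (one interface library, one threading library, the computational library, and the RTL) can be sketched as a small lookup. This is only an illustration in Python: the library names are copied from the first table, while the helper names mkl_libraries and link_line are hypothetical and are not part of Intel MKL or its tools. It covers only the default parallel rows of that table.

```python
# Library names come from the table above; the helper names
# (mkl_libraries, link_line) are hypothetical and not part of Intel MKL.
INTERFACE = {
    ("ia32", "static"): "mkl_intel_c.lib",
    ("ia32", "dynamic"): "mkl_intel_c_dll.lib",
    ("intel64", "static"): "mkl_intel_lp64.lib",
    ("intel64", "dynamic"): "mkl_intel_lp64_dll.lib",
}
THREADING = {"static": "mkl_intel_thread.lib", "dynamic": "mkl_intel_thread_dll.lib"}
COMPUTATIONAL = {"static": "mkl_core.lib", "dynamic": "mkl_core_dll.lib"}
RTL = "libiomp5md.lib"


def mkl_libraries(arch, linking):
    """Return [interface, threading, computational, RTL] for one table row."""
    if (arch, linking) not in INTERFACE:
        raise ValueError(f"unsupported combination: {arch}, {linking}")
    return [INTERFACE[(arch, linking)], THREADING[linking], COMPUTATIONAL[linking], RTL]


def link_line(compiler, source, arch, linking):
    """Join the compiler, the source file, and the libraries into one command."""
    return " ".join([compiler, source] + mkl_libraries(arch, linking))


if __name__ == "__main__":
    print(link_line("ifort", "myprog.f", "intel64", "static"))
```

For the Intel® 64 static case, the sketch produces the libraries in the same order as the corresponding table row.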
See Also
Layered Model Concept
Using the Link-line Advisor
Using the /Qmkl Compiler Option
Working with the Intel® Math Kernel Library Cluster Software

Using the Link-line Advisor

Use the Intel MKL Link-line Advisor to determine the libraries and options to specify on your link or compilation line.

The latest version of the tool is available at http://software.intel.com/en-us/articles/intel-mkl-link-line-advisor. The tool is also available in the product.

The Advisor requests information about your system and on how you intend to use Intel MKL (link dynamically or statically, use threaded or sequential mode, etc.). The tool automatically generates the appropriate link line for your application.

See Also
Contents of the Documentation Directories

Using the Command-line Link Tool

Use the command-line Link tool provided by Intel MKL to simplify building your application with Intel MKL.

The tool not only provides the options, libraries, and environment variables to use, but also performs compilation and building of your application.

The tool mkl_link_tool.exe is installed in the \tools directory.

See the knowledge base article at http://software.intel.com/en-us/articles/mkl-command-line-link-tool for more information.

Linking Examples

See Also
Using the Link-line Advisor
Examples for Linking with ScaLAPACK and Cluster FFT

Linking on IA-32 Architecture Systems

The following examples illustrate linking that uses Intel(R) compilers.

The examples use the .f Fortran source file.
C/C++ users should instead specify a .cpp (C++) or .c (C) file and replace ifort with icl:

• Static linking of myprog.f and parallel Intel MKL supporting the cdecl interface:
    ifort myprog.f mkl_intel_c.lib mkl_intel_thread.lib mkl_core.lib libiomp5md.lib
• Dynamic linking of myprog.f and parallel Intel MKL supporting the cdecl interface:
    ifort myprog.f mkl_intel_c_dll.lib mkl_intel_thread_dll.lib mkl_core_dll.lib libiomp5md.lib
• Static linking of myprog.f and sequential version of Intel MKL supporting the cdecl interface:
    ifort myprog.f mkl_intel_c.lib mkl_sequential.lib mkl_core.lib
• Dynamic linking of myprog.f and sequential version of Intel MKL supporting the cdecl interface:
    ifort myprog.f mkl_intel_c_dll.lib mkl_sequential_dll.lib mkl_core_dll.lib
• Static linking of user code myprog.f and parallel Intel MKL supporting the stdcall interface:
    ifort myprog.f mkl_intel_s.lib mkl_intel_thread.lib mkl_core.lib libiomp5md.lib
• Dynamic linking of user code myprog.f and parallel Intel MKL supporting the stdcall interface:
    ifort myprog.f mkl_intel_s_dll.lib mkl_intel_thread_dll.lib mkl_core_dll.lib libiomp5md.lib
• Dynamic linking of user code myprog.f and parallel or sequential Intel MKL supporting the cdecl or stdcall interface (call the mkl_set_threading_layer function or set the value of the MKL_THREADING_LAYER environment variable to choose threaded or sequential mode):
    ifort myprog.f mkl_rt.lib
• Static linking of myprog.f, Fortran 95 LAPACK interface, and parallel Intel MKL supporting the cdecl interface:
    ifort myprog.f mkl_lapack95.lib mkl_intel_c.lib mkl_intel_thread.lib mkl_core.lib libiomp5md.lib
• Static linking of myprog.f, Fortran 95 BLAS interface, and parallel Intel MKL supporting the cdecl interface:
    ifort myprog.f mkl_blas95.lib mkl_intel_c.lib mkl_intel_thread.lib mkl_core.lib libiomp5md.lib

See Also
Fortran 95 Interfaces to LAPACK and BLAS
Examples for Linking a C Application
Examples for Linking a Fortran Application
Using the Single Dynamic Library

Linking on Intel(R) 64 Architecture Systems

The following examples illustrate linking that uses Intel(R) compilers.

The examples use the .f Fortran source file. C/C++ users should instead specify a .cpp (C++) or .c (C) file and replace ifort with icl:

• Static linking of myprog.f and parallel Intel MKL supporting the LP64 interface:
    ifort myprog.f mkl_intel_lp64.lib mkl_intel_thread.lib mkl_core.lib libiomp5md.lib
• Dynamic linking of myprog.f and parallel Intel MKL supporting the LP64 interface:
    ifort myprog.f mkl_intel_lp64_dll.lib mkl_intel_thread_dll.lib mkl_core_dll.lib libiomp5md.lib
• Static linking of myprog.f and sequential version of Intel MKL supporting the LP64 interface:
    ifort myprog.f mkl_intel_lp64.lib mkl_sequential.lib mkl_core.lib
• Dynamic linking of myprog.f and sequential version of Intel MKL supporting the LP64 interface:
    ifort myprog.f mkl_intel_lp64_dll.lib mkl_sequential_dll.lib mkl_core_dll.lib
• Static linking of myprog.f and parallel Intel MKL supporting the ILP64 interface:
    ifort myprog.f mkl_intel_ilp64.lib mkl_intel_thread.lib mkl_core.lib libiomp5md.lib
• Dynamic linking of myprog.f and parallel Intel MKL supporting the ILP64 interface:
    ifort myprog.f mkl_intel_ilp64_dll.lib mkl_intel_thread_dll.lib mkl_core_dll.lib libiomp5md.lib
• Dynamic linking of user code myprog.f and parallel or sequential Intel MKL supporting the LP64 or ILP64 interface (call appropriate functions or set environment variables to choose threaded or sequential mode and to set the interface):
    ifort myprog.f mkl_rt.lib
• Static linking of myprog.f, Fortran 95 LAPACK interface, and parallel Intel MKL supporting the LP64 interface:
    ifort myprog.f mkl_lapack95_lp64.lib mkl_intel_lp64.lib mkl_intel_thread.lib mkl_core.lib libiomp5md.lib
• Static linking of myprog.f, Fortran 95 BLAS interface, and parallel Intel MKL supporting the LP64 interface:
    ifort myprog.f mkl_blas95_lp64.lib mkl_intel_lp64.lib mkl_intel_thread.lib mkl_core.lib libiomp5md.lib

See Also
Fortran 95 Interfaces to LAPACK and BLAS
Examples for Linking a C Application
Examples for Linking a Fortran Application
Using the Single Dynamic Library

Linking in Detail

This section recommends which libraries to link with depending on your Intel MKL usage scenario and provides details of the linking.

Dynamically Selecting the Interface and Threading Layer

The Single Dynamic Library (SDL) enables you to dynamically select the interface and threading layer for Intel MKL.

Setting the Interface Layer

Available interfaces depend on the architecture of your system.

On systems based on the Intel® 64 architecture, LP64 and ILP64 interfaces are available. To set one of these interfaces at run time, use the mkl_set_interface_layer function or the MKL_INTERFACE_LAYER environment variable. The following table provides values to be used to set each interface.

Interface Layer   Value of MKL_INTERFACE_LAYER   Value of the Parameter of mkl_set_interface_layer
LP64              LP64                           MKL_INTERFACE_LP64
ILP64             ILP64                          MKL_INTERFACE_ILP64

If the mkl_set_interface_layer function is called, the environment variable MKL_INTERFACE_LAYER is ignored.

By default the LP64 interface is used.

See the Intel MKL Reference Manual for details of the mkl_set_interface_layer function.

On systems based on the IA-32 architecture, the cdecl and stdcall interfaces are available. These interfaces have different function naming conventions, and SDL selects between cdecl and stdcall at link time according to the function names.

Setting the Threading Layer

To set the threading layer at run time, use the mkl_set_threading_layer function or the MKL_THREADING_LAYER environment variable. The following table lists available threading layers along with the values to be used to set each layer.
Threading Layer                 Value of MKL_THREADING_LAYER   Value of the Parameter of mkl_set_threading_layer
Intel threading                 INTEL                          MKL_THREADING_INTEL
Sequential mode of Intel MKL    SEQUENTIAL                     MKL_THREADING_SEQUENTIAL
PGI threading                   PGI                            MKL_THREADING_PGI

If the mkl_set_threading_layer function is called, the environment variable MKL_THREADING_LAYER is ignored.

By default Intel threading is used.

See the Intel MKL Reference Manual for details of the mkl_set_threading_layer function.

Replacing Error Handling and Progress Information Routines

You can replace the Intel MKL error handling routine xerbla or progress information routine mkl_progress with your own function. If you are using SDL, to replace xerbla or mkl_progress, call the mkl_set_xerbla or mkl_set_progress function, respectively. See the Intel MKL Reference Manual for details.

NOTE If you are using SDL, you cannot perform the replacement by linking the object file with your implementation of xerbla or mkl_progress.

See Also
Using the Single Dynamic Library
Layered Model Concept
Using the cdecl and stdcall Interfaces
Directory Structure in Detail

Linking with Interface Libraries

Using the cdecl and stdcall Interfaces

Intel MKL provides the following interfaces in its IA-32 architecture implementation:
• stdcall
  Default Compaq Visual Fortran* (CVF) interface. Use it with the Intel® Fortran Compiler.
• cdecl
  Default interface of the Microsoft Visual C/C++* application.

To use each of these interfaces, link with the appropriate library, as specified in the following table:

Interface   Library for Static Linking   Library for Dynamic Linking
cdecl       mkl_intel_c.lib              mkl_intel_c_dll.lib
stdcall     mkl_intel_s.lib              mkl_intel_s_dll.lib

To link with the cdecl or stdcall interface library, use appropriate calling syntax in C applications and appropriate compiler options for Fortran applications.
If you are using a C compiler, to link with the cdecl or stdcall interface library, call Intel MKL routines in your code as explained in the table below:

Interface Library        Calling Intel MKL Routines

mkl_intel_s[_dll].lib    Call a routine with the following statement:
                             extern __stdcall name( , , .. );
                         where stdcall is actually the CVF compiler default compilation, which differs from the regular stdcall compilation in how strings are passed to the routine. Because the default CVF format is not identical with stdcall, you must specially handle strings in the calling sequence. See how to do it in sections on interfaces in the CVF documentation.

mkl_intel_c[_dll].lib    Use the following declaration:
                             name( , , .. );

If you are using a Fortran compiler, to link with the cdecl or stdcall interface library, provide compiler options as explained in the table below:

Interface Library        Compiler Options                       Comment

CVF compiler
mkl_intel_s[_dll].lib    Default
mkl_intel_c[_dll].lib    /iface=(cref, nomixed_str_len_arg)

Intel® Fortran compiler
mkl_intel_c[_dll].lib    Default
mkl_intel_s[_dll].lib    /Gm or /iface:cvf                      /Gm and /iface:cvf options enable compatibility of the CVF and Powerstation calling conventions

See Also
Using the stdcall Calling Convention in C/C++
Compiling an Application that Calls the Intel® Math Kernel Library and Uses the CVF Calling Conventions

Using the ILP64 Interface vs. LP64 Interface

The Intel MKL ILP64 libraries use the 64-bit integer type (necessary for indexing large arrays, with more than 2^31 - 1 elements), whereas the LP64 libraries index arrays with the 32-bit integer type.

The LP64 and ILP64 interfaces are implemented in the Interface layer.
Link with the following interface libraries for the LP64 or ILP64 interface, respectively:
• mkl_intel_lp64.lib or mkl_intel_ilp64.lib for static linking
• mkl_intel_lp64_dll.lib or mkl_intel_ilp64_dll.lib for dynamic linking

The ILP64 interface provides for the following:
• Support large data arrays (with more than 2^31 - 1 elements)
• Enable compiling your Fortran code with the /4I8 compiler option

The LP64 interface provides compatibility with the previous Intel MKL versions because "LP64" is just a new name for the only interface that the Intel MKL versions lower than 9.1 provided. Choose the ILP64 interface if your application uses Intel MKL for calculations with large data arrays or may do so in the future.

Intel MKL provides the same include directory for the ILP64 and LP64 interfaces.

Compiling for LP64/ILP64

The table below shows how to compile for the ILP64 and LP64 interfaces:

Fortran
Compiling for ILP64    ifort /4I8 /I\include ...
Compiling for LP64     ifort /I\include ...

C or C++
Compiling for ILP64    icl /DMKL_ILP64 /I\include ...
Compiling for LP64     icl /I\include ...

CAUTION Linking of an application compiled with the /4I8 or /DMKL_ILP64 option to the LP64 libraries may result in unpredictable consequences and erroneous output.

Coding for ILP64

You do not need to change existing code if you are not using the ILP64 interface.
To migrate to ILP64 or write new code for ILP64, use appropriate types for parameters of the Intel MKL functions and subroutines:

Integer Types                                 Fortran                           C or C++
32-bit integers                               INTEGER*4 or INTEGER(KIND=4)      int
Universal integers for ILP64/LP64:            INTEGER without specifying KIND   MKL_INT
  • 64-bit for ILP64
  • 32-bit otherwise
Universal integers for ILP64/LP64:            INTEGER*8 or INTEGER(KIND=8)      MKL_INT64
  • 64-bit integers
FFT interface integers for ILP64/LP64         INTEGER without specifying KIND   MKL_LONG

To determine the type of an integer parameter of a function, use appropriate include files. For functions that support only a Fortran interface, use the C/C++ include files *.h.

The above table explains which integer parameters of functions become 64-bit and which remain 32-bit for ILP64. The table applies to most Intel MKL functions except some VML and VSL functions, which require integer parameters to be 64-bit or 32-bit regardless of the interface:
• VML: The mode parameter of VML functions is 64-bit.
• Random Number Generators (RNG): All discrete RNG except viRngUniformBits64 are 32-bit. The viRngUniformBits64 generator function and vslSkipAheadStream service function are 64-bit.
• Summary Statistics: The estimate parameter of the vslsSSCompute/vsldSSCompute function is 64-bit.

Refer to the Intel MKL Reference Manual for more information.

To better understand ILP64 interface details, see also examples and tests.

Limitations

All Intel MKL function domains support ILP64 programming with the following exceptions:
• FFTW interfaces to Intel MKL:
  • FFTW 2.x wrappers do not support ILP64.
  • FFTW 3.2 wrappers support ILP64 by a dedicated set of functions plan_guru64.
• GMP* Arithmetic Functions do not support ILP64.

NOTE GMP Arithmetic Functions are deprecated and will be removed in a future release.
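As a rough illustration of the 2^31 - 1 element limit that separates the LP64 and ILP64 interfaces, the following Python sketch checks whether an array of given dimensions has more elements than a 32-bit signed integer can index, and looks up the compiler option from the "Compiling for LP64/ILP64" table. The function names are hypothetical helpers introduced only for this example; the check covers element counts alone and is not a complete test of whether an application requires ILP64.

```python
# 2**31 - 1 is the largest value a 32-bit signed integer can hold.
# The function names below are hypothetical helpers, not Intel MKL APIs.
LP64_MAX_ELEMENTS = 2**31 - 1


def element_count(*dims):
    """Multiply the array dimensions to get the total number of elements."""
    count = 1
    for d in dims:
        count *= d
    return count


def needs_ilp64(*dims):
    """True if the element count exceeds what a 32-bit integer can index."""
    return element_count(*dims) > LP64_MAX_ELEMENTS


def ilp64_compile_option(language):
    """Option from the 'Compiling for LP64/ILP64' table for the given language."""
    options = {"fortran": "/4I8", "c": "/DMKL_ILP64"}
    return options[language.lower()]


if __name__ == "__main__":
    for n in (40000, 50000):
        print(n, "x", n, "elements:", element_count(n, n),
              "needs ILP64:", needs_ilp64(n, n))
    print("C option for ILP64:", ilp64_compile_option("c"))
```

A 40000 x 40000 matrix has 1,600,000,000 elements and stays within the 32-bit range, while a 50000 x 50000 matrix has 2,500,000,000 elements and exceeds it.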
See Also
High-level Directory Structure
Include Files
Language Interfaces Support, by Function Domain
Layered Model Concept
Directory Structure in Detail

Linking with Fortran 95 Interface Libraries

The mkl_blas95*.lib and mkl_lapack95*.lib libraries contain Fortran 95 interfaces for BLAS and LAPACK, respectively, which are compiler-dependent. In the Intel MKL package, they are prebuilt for the Intel® Fortran compiler. If you are using a different compiler, build these libraries before using the interface.

See Also
Fortran 95 Interfaces to LAPACK and BLAS
Compiler-dependent Functions and Fortran 90 Modules

Linking with Threading Libraries

Sequential Mode of the Library

You can use Intel MKL in a sequential (non-threaded) mode. In this mode, Intel MKL runs unthreaded code. However, it is thread-safe (except the LAPACK deprecated routine ?lacon), which means that you can use it in a parallel region in your OpenMP* code. The sequential mode requires no compatibility OpenMP* run-time library and does not respond to the environment variable OMP_NUM_THREADS or its Intel MKL equivalents.

You should use the library in the sequential mode only if you have a particular reason not to use Intel MKL threading. The sequential mode may be helpful when using Intel MKL with programs threaded with some non-Intel compilers or in other situations where you need a non-threaded version of the library (for instance, in some MPI cases). To set the sequential mode, in the Threading layer, choose the *sequential.* library.

See Also
Directory Structure in Detail
Using Parallelism of the Intel® Math Kernel Library
Avoiding Conflicts in the Execution Environment
Linking Examples

Selecting the Threading Layer

Several compilers that Intel MKL supports use the OpenMP* threading technology. Intel MKL supports implementations of the OpenMP* technology that these compilers provide.
To make use of this support, you need to link with the appropriate library in the Threading Layer and Compiler Support Run-time Library (RTL).

Threading Layer

Each Intel MKL threading library contains the same code compiled by the respective compiler (Intel and PGI* compilers on Windows OS).

RTL

This layer includes libiomp, the compatibility OpenMP* run-time library of the Intel compiler. In addition to the Intel compiler, libiomp provides support for one more threading compiler on Windows OS (Microsoft Visual C++*). That is, a program threaded with the Microsoft Visual C++ compiler can safely be linked with Intel MKL and libiomp.

The table below helps explain what threading library and RTL you should choose under different scenarios when using Intel MKL (static cases only):

Compiler    Application Threaded?   Threading Layer                             RTL Recommended   Comment
Intel       Does not matter         mkl_intel_thread.lib                        libiomp5md.lib
PGI         Yes                     mkl_pgi_thread.lib or mkl_sequential.lib    PGI* supplied     Use of mkl_sequential.lib removes threading from Intel MKL calls.
PGI         No                      mkl_intel_thread.lib                        libiomp5md.lib
PGI         No                      mkl_pgi_thread.lib                          PGI* supplied
PGI         No                      mkl_sequential.lib                          None
Microsoft   Yes                     mkl_intel_thread.lib                        libiomp5md.lib    For the OpenMP* library of the Microsoft Visual Studio* IDE version 2005 or later.
Microsoft   Yes                     mkl_sequential.lib                          None              For Win32 threading.
Microsoft   No                      mkl_intel_thread.lib                        libiomp5md.lib
other       Yes                     mkl_sequential.lib                          None
other       No                      mkl_intel_thread.lib                        libiomp5md.lib

TIP To use the threaded Intel MKL, compile your code with the /MT option. The compiler driver will pass the option to the linker and the latter will load multi-thread (MT) run-time libraries.
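The scenario table above can be read as a lookup keyed by compiler and by whether the application itself is threaded. The following Python sketch encodes the rows as data and returns every matching choice; the row contents are taken from the table, while the names ROWS and threading_choices are hypothetical and exist only for this illustration. An RTL of None stands for the "None" entries in the table.

```python
# Each row: (compiler, application threaded?, threading layer libraries, RTL).
# None in the second field means "does not matter"; None as RTL means no RTL.
# ROWS and threading_choices are hypothetical names used for illustration.
ROWS = [
    ("intel", None, ("mkl_intel_thread.lib",), "libiomp5md.lib"),
    ("pgi", True, ("mkl_pgi_thread.lib", "mkl_sequential.lib"), "PGI supplied"),
    ("pgi", False, ("mkl_intel_thread.lib",), "libiomp5md.lib"),
    ("pgi", False, ("mkl_pgi_thread.lib",), "PGI supplied"),
    ("pgi", False, ("mkl_sequential.lib",), None),
    ("microsoft", True, ("mkl_intel_thread.lib",), "libiomp5md.lib"),
    ("microsoft", True, ("mkl_sequential.lib",), None),
    ("microsoft", False, ("mkl_intel_thread.lib",), "libiomp5md.lib"),
    ("other", True, ("mkl_sequential.lib",), None),
    ("other", False, ("mkl_intel_thread.lib",), "libiomp5md.lib"),
]

KNOWN_COMPILERS = {"intel", "pgi", "microsoft"}


def threading_choices(compiler, threaded):
    """Return all (threading libraries, RTL) pairs matching the scenario."""
    key = compiler.lower()
    if key not in KNOWN_COMPILERS:
        key = "other"  # any compiler not named in the table
    return [
        (libs, rtl)
        for name, is_threaded, libs, rtl in ROWS
        if name == key and (is_threaded is None or is_threaded == threaded)
    ]


if __name__ == "__main__":
    for libs, rtl in threading_choices("Microsoft", True):
        print(" or ".join(libs), "with RTL:", rtl)
```

Returning every matching row, instead of a single answer, keeps the comments in the table relevant: for a threaded Microsoft application, for example, the choice between the two rows depends on whether it uses OpenMP* or Win32 threading.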
Linking with Computational Libraries

If you are not using the Intel MKL cluster software, you need to link your application with only one computational library, depending on the linking method:

Static Linking   Dynamic Linking
mkl_core.lib     mkl_core_dll.lib

Computational Libraries for Applications that Use the Intel MKL Cluster Software

ScaLAPACK and Cluster Fourier Transform Functions (Cluster FFT) require more computational libraries, which may depend on your architecture.

The following table lists computational libraries for IA-32 architecture applications that use ScaLAPACK or Cluster FFT.

Computational Libraries for IA-32 Architecture

Function domain                         Static Linking                        Dynamic Linking
ScaLAPACK †                             mkl_scalapack_core.lib mkl_core.lib   mkl_scalapack_core_dll.lib mkl_core_dll.lib
Cluster Fourier Transform Functions †   mkl_cdft_core.lib mkl_core.lib        mkl_cdft_core_dll.lib mkl_core_dll.lib

† Also add the library with BLACS routines corresponding to the MPI used.

The following table lists computational libraries for Intel® 64 architecture applications that use ScaLAPACK or Cluster FFT.

Computational Libraries for the Intel® 64 Architecture

Function domain                         Static Linking                         Dynamic Linking
ScaLAPACK, LP64 interface †             mkl_scalapack_lp64.lib mkl_core.lib    mkl_scalapack_lp64_dll.lib mkl_core_dll.lib
ScaLAPACK, ILP64 interface †            mkl_scalapack_ilp64.lib mkl_core.lib   mkl_scalapack_ilp64_dll.lib mkl_core_dll.lib
Cluster Fourier Transform Functions †   mkl_cdft_core.lib mkl_core.lib         mkl_cdft_core_dll.lib mkl_core_dll.lib

† Also add the library with BLACS routines corresponding to the MPI used.

See Also
Linking with ScaLAPACK and Cluster FFTs
Using the Link-line Advisor
Using the ILP64 Interface vs. LP64 Interface

Linking with Compiler Run-time Libraries

Dynamically link libiomp, the compatibility OpenMP* run-time library, even if you link other libraries statically.
Linking libiomp statically can be problematic because the more complex your operating environment or application, the more likely redundant copies of the library are included. This may result in performance issues (oversubscription of threads) and even incorrect results.

To link libiomp dynamically, be sure the PATH environment variable is defined correctly.

See Also
Setting Environment Variables
Layered Model Concept

Linking with System Libraries

If your system is based on the Intel® 64 architecture, be aware that Microsoft SDK builds 1289 or higher provide the bufferoverflowu.lib library to resolve the __security_cookie external references. Makefiles for examples and tests include this library by using the buf_lib=bufferoverflowu.lib macro. If you are using older SDKs, leave this macro empty on your command line as follows: buf_lib= .

Building Custom Dynamic-link Libraries

Custom dynamic-link libraries (DLL) reduce the collection of functions available in Intel MKL libraries to those required to solve your particular problems, which helps to save disk space and build your own dynamic libraries for distribution.

The Intel MKL custom DLL builder, located in the tools\builder directory, enables you to create a dynamic library containing the selected functions. The builder contains a makefile and a definition file with the list of functions.

Using the Custom Dynamic-link Library Builder in the Command-line Mode

To build a custom DLL, use the following command:

    nmake target []

The following table lists possible values of target and explains what the command does for each value:

Value        Comment
libia32      The builder uses static Intel MKL interface, threading, and core libraries to build a custom DLL for the IA-32 architecture.
libintel64   The builder uses static Intel MKL interface, threading, and core libraries to build a custom DLL for the Intel® 64 architecture.
dllia32      The builder uses the single dynamic library mkl_rt.dll to build a custom DLL for the IA-32 architecture.
dllintel64   The builder uses the single dynamic library mkl_rt.dll to build a custom DLL for the Intel® 64 architecture.
help         The command prints Help on the custom DLL builder.

The placeholder stands for the list of parameters that define macros to be used by the makefile. The following table describes these parameters:

Parameter [Values]    Description

interface
    Defines which programming interface to use. Possible values:
    • For the IA-32 architecture, {cdecl|stdcall}. The default value is cdecl.
    • For the Intel 64 architecture, {lp64|ilp64}. The default value is lp64.

threading = {parallel|sequential}
    Defines whether to use the Intel MKL in the threaded or sequential mode. The default value is parallel.

export =
    Specifies the full name of the file that contains the list of entry-point functions to be included in the DLL. The default name is user_example_list (no extension).

name =
    Specifies the name of the dll and interface library to be created. By default, the names of the created libraries are mkl_custom.dll and mkl_custom.lib.

xerbla =
    Specifies the name of the object file .obj that contains the user's error handler. The makefile adds this error handler to the library for use instead of the default Intel MKL error handler xerbla. If you omit this parameter, the native Intel MKL xerbla is used. See the description of the xerbla function in the Intel MKL Reference Manual on how to develop your own error handler. For the IA-32 architecture, the object file should be in the interface defined by the interface macro (cdecl or stdcall).

MKLROOT =
    Specifies the location of Intel MKL libraries used to build the custom DLL. By default, the builder uses the Intel MKL installation directory.

buf_lib
    Manages resolution of the __security_cookie external references in the custom DLL on systems based on the Intel® 64 architecture. By default, the makefile uses the bufferoverflowu.lib library of Microsoft SDK builds 1289 or higher. This library resolves the __security_cookie external references. To avoid using this library, set the empty value of this parameter. Therefore, if you are using an older SDK, set buf_lib= .
    CAUTION Use the buf_lib parameter only with the empty value. Incorrect value of the parameter causes builder errors.

crt =
    Specifies the name of the Microsoft C run-time library to be used to build the custom DLL. By default, the builder uses msvcrt.lib.

manifest = {yes|no|embed}
    Manages the creation of a Microsoft manifest for the custom DLL:
    • If manifest=yes, the manifest file with the name defined by the name parameter above and the manifest extension will be created.
    • If manifest=no, the manifest file will not be created.
    • If manifest=embed, the manifest will be embedded into the DLL.
    By default, the builder does not use the manifest parameter.

All the above parameters are optional.

In the simplest case, the command line is nmake ia32, and the missing options have default values. This command creates the mkl_custom.dll and mkl_custom.lib libraries with the cdecl interface for processors using the IA-32 architecture. The command takes the list of functions from the functions_list file and uses the native Intel MKL error handler xerbla.

An example of a more complex case follows:

    nmake ia32 interface=stdcall export=my_func_list.txt name=mkl_small xerbla=my_xerbla.obj

In this case, the command creates the mkl_small.dll and mkl_small.lib libraries with the stdcall interface for processors using the IA-32 architecture. The command takes the list of functions from my_func_list.txt file and uses the user's error handler my_xerbla.obj.

The process is similar for processors using the Intel® 64 architecture.
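The parameter rules above lend themselves to a small validation step before nmake is invoked. The Python sketch below assembles a builder command line from a target and optional parameters and rejects values that the tables do not list. The target names and allowed values are taken from the tables in this section; the function builder_command is a hypothetical helper, not part of the Intel MKL builder, and it only formats the command without running it.

```python
# Targets and allowed values are copied from the tables in this section.
# builder_command is a hypothetical helper; it only builds the command string.
TARGET_ARCH = {
    "libia32": "ia32", "dllia32": "ia32",
    "libintel64": "intel64", "dllintel64": "intel64",
}
INTERFACES = {"ia32": ("cdecl", "stdcall"), "intel64": ("lp64", "ilp64")}
ALLOWED = {
    "threading": ("parallel", "sequential"),
    "manifest": ("yes", "no", "embed"),
    "buf_lib": ("",),  # per the CAUTION: use only with the empty value
}
PARAMETERS = ("interface", "threading", "export", "name", "xerbla",
              "MKLROOT", "buf_lib", "crt", "manifest")


def builder_command(target, **params):
    """Return an 'nmake target name=value ...' line after basic validation."""
    if target not in TARGET_ARCH:
        raise ValueError(f"unknown target: {target}")
    for key, value in params.items():
        if key not in PARAMETERS:
            raise ValueError(f"unknown parameter: {key}")
        if key == "interface":
            allowed = INTERFACES[TARGET_ARCH[target]]
        else:
            allowed = ALLOWED.get(key)
        if allowed is not None and value not in allowed:
            raise ValueError(f"invalid value for {key}: {value!r}")
    # Emit parameters in the order of the parameter table.
    parts = ["nmake", target]
    parts += [f"{key}={params[key]}" for key in PARAMETERS if key in params]
    return " ".join(parts)


if __name__ == "__main__":
    print(builder_command("libia32", interface="stdcall",
                          export="my_func_list.txt", name="mkl_small",
                          xerbla="my_xerbla.obj"))
```

Checking the interface value against the architecture of the target catches mistakes such as requesting ilp64 for an IA-32 build before the makefile runs.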
See Also
Linking with System Libraries

Composing a List of Functions

To compose a list of functions for a minimal custom DLL needed for your application, you can use the following procedure:

1. Link your application with installed Intel MKL libraries to make sure the application builds.
2. Remove all Intel MKL libraries from the link line and start linking. Unresolved symbols indicate Intel MKL functions that your application uses.
3. Include these functions in the list.

Important Each time your application starts using more Intel MKL functions, update the list to include the new functions.

See Also
Specifying Function Names

Specifying Function Names

In the file with the list of functions for your custom DLL, adjust function names to the required interface. For example, you can list the cdecl entry points as follows:

    DGEMM
    DTRSM
    DDOT
    DGETRF
    DGETRS
    cblas_dgemm
    cblas_ddot

You can list the stdcall entry points as follows:

    _DGEMM@60
    _DDOT@20
    _DGETRF@24

For more examples, see domain-specific lists of function names in the \tools\builder folder. This folder contains lists of function names for both cdecl and stdcall interfaces.

NOTE The lists of function names are provided in the \tools\builder folder merely as examples. See Composing a List of Functions for how to compose lists of functions for your custom DLL.

TIP Names of Fortran-style routines (BLAS, LAPACK, etc.) can be either upper-case or lower-case, with or without the trailing underscore. For example, these names are equivalent:
BLAS: dgemm, DGEMM, dgemm_, DGEMM_
LAPACK: dgetrf, DGETRF, dgetrf_, DGETRF_.

Properly capitalize names of C support functions in the function list. To do this, follow the guidelines below:

1. In the mkl_service.h include file, look up a #define directive for your function.
2. Take the function name from the replacement part of that directive.
For example, the #define directive for the mkl_disable_fast_mm function is
    #define mkl_disable_fast_mm MKL_Disable_Fast_MM
Capitalize the name of this function in the list like this: MKL_Disable_Fast_MM.
For the names of the Fortran support functions, see the tip.

Building a Custom Dynamic-link Library in the Visual Studio* Development System
You can build a custom dynamic-link library (DLL) in the Microsoft Visual Studio* Development System (VS*). To do this, use projects available in the tools\builder\MSVS_Projects subdirectory of the Intel MKL directory. The directory contains the VS2005, VS2008, and VS2010 subdirectories with projects for the respective versions of the Visual Studio Development System. For each version of VS two solutions are available:
• libia32.sln builds a custom DLL for the IA-32 architecture.
• libintel64.sln builds a custom DLL for the Intel® 64 architecture.

The builder uses the following default settings for the custom DLL:
Interface: cdecl for the IA-32 architecture and LP64 for the Intel 64 architecture
Error handler: Native Intel MKL xerbla
Create Microsoft manifest: yes
List of functions: in the project's source file examples.def

To build a custom DLL:
1. Open the libia32.sln or libintel64.sln solution depending on the architecture of your system. The solution includes the following projects:
   • i_malloc_dll
   • vml_dll_core
   • cdecl_parallel (in libia32.sln) or lp64_parallel (in libintel64.sln)
   • cdecl_sequential (in libia32.sln) or lp64_sequential (in libintel64.sln)
2. [Optional] To change any of the default settings, select the project depending on whether the DLL will use Intel MKL functions in the sequential or multi-threaded mode:
   • In the libia32 solution, select the cdecl_sequential or cdecl_parallel project.
   • In the libintel64 solution, select the lp64_sequential or lp64_parallel project.
3. [Optional] To build the DLL that uses the stdcall interface for the IA-32 architecture or the ILP64 interface for the Intel 64 architecture:
   a. Select Project>Properties>Configuration Properties>Linker>Input>Additional Dependencies.
   b. In the libia32 solution, change mkl_intel_c.lib to mkl_intel_s.lib. In the libintel64 solution, change mkl_intel_lp64.lib to mkl_intel_ilp64.lib.
4. [Optional] To include your own error handler in the DLL:
   a. Select Project>Properties>Configuration Properties>Linker>Input.
   b. Add .obj
5. [Optional] To turn off creation of the manifest:
   a. Select Project>Properties>Configuration Properties>Linker>Manifest File>Generate Manifest.
   b. Select: no.
6. [Optional] To change the list of functions to be included in the DLL:
   a. Select Source Files.
   b. Edit the examples.def file. Refer to Specifying Function Names for how to specify entry points.
7. To build the library:
   • In VS2005 - VS2008, select Build>Project Only>Link Only and link projects in this order: i_malloc_dll, vml_dll_core, cdecl_sequential/lp64_sequential or cdecl_parallel/lp64_parallel.
   • In VS2010, select Build>Build Solution.

See Also
Using the Custom Dynamic-link Library Builder in the Command-line Mode

Distributing Your Custom Dynamic-link Library
To enable use of your custom DLL in a threaded mode, distribute libiomp5md.dll along with the custom DLL.

5 Managing Performance and Memory

Optimization Notice
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors.
Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804

Using Parallelism of the Intel® Math Kernel Library
Intel MKL is extensively parallelized. See Threaded Functions and Problems for lists of threaded functions and problems that can be threaded.

Intel MKL is thread-safe, which means that all Intel MKL functions (except the LAPACK deprecated routine ?lacon) work correctly during simultaneous execution by multiple threads. In particular, any chunk of threaded Intel MKL code provides access for multiple threads to the same shared data, while permitting only one thread at any given time to access a shared piece of data. Therefore, you can call Intel MKL from multiple threads and not worry about the function instances interfering with each other.

The library uses OpenMP* threading software, so you can use the environment variable OMP_NUM_THREADS to specify the number of threads or the equivalent OpenMP run-time function calls. Intel MKL also offers variables that are independent of OpenMP, such as MKL_NUM_THREADS, and equivalent Intel MKL functions for thread management. The Intel MKL variables are always inspected first, then the OpenMP variables are examined, and if neither is used, the OpenMP software chooses the default number of threads.

By default, Intel MKL uses the number of threads equal to the number of physical cores on the system.

To achieve higher performance, set the number of threads to the number of real processors or physical cores, as summarized in Techniques to Set the Number of Threads.

See Also
Managing Multi-core Performance

Threaded Functions and Problems
The following Intel MKL function domains are threaded:
• Direct sparse solver.
• LAPACK. For the list of threaded routines, see Threaded LAPACK Routines.
• Level1 and Level2 BLAS. For the list of threaded routines, see Threaded BLAS Level1 and Level2 Routines.
• All Level 3 BLAS and all Sparse BLAS routines except Level 2 Sparse Triangular solvers.
• All mathematical VML functions.
• FFT. For the list of FFT transforms that can be threaded, see Threaded FFT Problems.

Threaded LAPACK Routines
In the following list, ? stands for a precision prefix of each flavor of the respective routine and may have the value of s, d, c, or z.

The following LAPACK routines are threaded:
• Linear equations, computational routines:
   • Factorization: ?getrf, ?gbtrf, ?potrf, ?pptrf, ?sytrf, ?hetrf, ?sptrf, ?hptrf
   • Solving: ?dttrsb, ?gbtrs, ?gttrs, ?pptrs, ?pbtrs, ?pttrs, ?sytrs, ?sptrs, ?hptrs, ?tptrs, ?tbtrs
• Orthogonal factorization, computational routines: ?geqrf, ?ormqr, ?unmqr, ?ormlq, ?unmlq, ?ormql, ?unmql, ?ormrq, ?unmrq
• Singular Value Decomposition, computational routines: ?gebrd, ?bdsqr
• Symmetric Eigenvalue Problems, computational routines: ?sytrd, ?hetrd, ?sptrd, ?hptrd, ?steqr, ?stedc.
• Generalized Nonsymmetric Eigenvalue Problems, computational routines: chgeqz/zhgeqz.

A number of other LAPACK routines, which are based on threaded LAPACK or BLAS routines, make effective use of parallelism: ?gesv, ?posv, ?gels, ?gesvd, ?syev, ?heev, cgegs/zgegs, cgegv/zgegv, cgges/zgges, cggesx/zggesx, cggev/zggev, cggevx/zggevx, and so on.

Threaded BLAS Level1 and Level2 Routines
In the following list, ? stands for a precision prefix of each flavor of the respective routine and may have the value of s, d, c, or z.
The following routines are threaded for Intel® Core™2 Duo and Intel® Core™ i7 processors:
• Level1 BLAS: ?axpy, ?copy, ?swap, ddot/sdot, cdotc, drot/srot
• Level2 BLAS: ?gemv, ?trmv, dsyr/ssyr, dsyr2/ssyr2, dsymv/ssymv

Threaded FFT Problems
The following characteristics of a specific problem determine whether your FFT computation may be threaded:
• rank
• domain
• size/length
• precision (single or double)
• placement (in-place or out-of-place)
• strides
• number of transforms
• layout (for example, interleaved or split layout of complex data)

Most FFT problems are threaded. In particular, computation of multiple transforms in one call (number of transforms > 1) is threaded. Details of which transforms are threaded follow.

One-dimensional (1D) transforms
1D transforms are threaded in many cases.

1D complex-to-complex (c2c) transforms of size N using interleaved complex data layout are threaded under the following conditions depending on the architecture:

Architecture    Conditions
Intel® 64       N is a power of 2, log2(N) > 9, the transform is double-precision out-of-place, and input/output strides equal 1.
IA-32           N is a power of 2, log2(N) > 13, and the transform is single-precision.
                N is a power of 2, log2(N) > 14, and the transform is double-precision.
Any             N is composite, log2(N) > 16, and input/output strides equal 1.

1D real-to-complex and complex-to-real transforms are not threaded.
1D complex-to-complex transforms using split-complex layout are not threaded.
Prime-size complex-to-complex 1D transforms are not threaded.

Multidimensional transforms
All multidimensional transforms on large-volume data are threaded.

Avoiding Conflicts in the Execution Environment
Certain situations can cause conflicts in the execution environment that make the use of threads in Intel MKL problematic. This section briefly discusses why these problems exist and how to avoid them.
If you thread the program using OpenMP directives and compile the program with Intel compilers, Intel MKL and the program will both use the same threading library. Intel MKL tries to determine if it is in a parallel region in the program, and if it is, it does not spread its operations over multiple threads unless you specifically request Intel MKL to do so via the MKL_DYNAMIC functionality. However, Intel MKL can be aware that it is in a parallel region only if the threaded program and Intel MKL are using the same threading library. If your program is threaded by some other means, Intel MKL may operate in multithreaded mode, and the performance may suffer due to overuse of the resources.

The following table considers several cases where the conflicts may arise and provides recommendations depending on your threading model:

Threading model: You thread the program using OS threads (Win32* threads on Windows* OS).
Discussion: If more than one thread calls Intel MKL, and the function being called is threaded, it may be important that you turn off Intel MKL threading. Set the number of threads to one by any of the available means (see Techniques to Set the Number of Threads).

Threading model: You thread the program using OpenMP directives and/or pragmas and compile the program using a compiler other than a compiler from Intel.
Discussion: This is more problematic because setting of the OMP_NUM_THREADS environment variable affects both the compiler's threading library and libiomp. In this case, choose the threading library that matches the layered Intel MKL with the OpenMP compiler you employ (see Linking Examples on how to do this). If this is not possible, use Intel MKL in the sequential mode. To do this, you should link with the appropriate threading library: mkl_sequential.lib or mkl_sequential.dll (see High-level Directory Structure).

Threading model: There are multiple programs running on a multiple-CPU system, for example, a parallelized program that runs using MPI for communication in which each processor is treated as a node.
Discussion: The threading software will see multiple processors on the system even though each processor has a separate MPI process running on it. In this case, one of the solutions is to set the number of threads to one by any of the available means (see Techniques to Set the Number of Threads). Section Intel(R) Optimized MP LINPACK Benchmark for Clusters discusses another solution for a Hybrid (OpenMP* + MPI) mode.

TIP To get best performance with threaded Intel MKL, compile your code with the /MT option.

See Also
Using Additional Threading Control
Linking with Compiler Run-time Libraries

Techniques to Set the Number of Threads
Use one of the following techniques to change the number of threads to use in Intel MKL:
• Set one of the OpenMP or Intel MKL environment variables:
   • OMP_NUM_THREADS
   • MKL_NUM_THREADS
   • MKL_DOMAIN_NUM_THREADS
• Call one of the OpenMP or Intel MKL functions:
   • omp_set_num_threads()
   • mkl_set_num_threads()
   • mkl_domain_set_num_threads()

When choosing the appropriate technique, take into account the following rules:
• The Intel MKL threading controls take precedence over the OpenMP controls because they are inspected first.
• A function call takes precedence over any environment variables. The exception, which is a consequence of the previous rule, is the OpenMP subroutine omp_set_num_threads(), which does not have precedence over Intel MKL environment variables, such as MKL_NUM_THREADS. See Using Additional Threading Control for more details.
• You cannot change run-time behavior in the course of the run using the environment variables because they are read only once at the first call to Intel MKL.
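The precedence rules above can be sketched as a small self-contained C model. This is only an illustration of the ordering described in this section, not Intel MKL code: the thread_settings structure and the requested_threads() function are hypothetical names introduced here, a value of 0 stands for "not set", and the sketch deliberately ignores domain-specific settings and MKL_DYNAMIC, which may still change the number of threads actually used.

```c
#include <stddef.h>

/* Hypothetical model of the precedence rules; not part of Intel MKL.
   In every field, 0 means "not set". */
typedef struct {
    int mkl_call;  /* value passed to mkl_set_num_threads(), 0 if never called */
    int mkl_env;   /* value of MKL_NUM_THREADS, 0 if unset */
    int omp_call;  /* value passed to omp_set_num_threads(), 0 if never called */
    int omp_env;   /* value of OMP_NUM_THREADS, 0 if unset */
} thread_settings;

/* Returns the number of threads that would be requested under the rules:
   Intel MKL controls are inspected before OpenMP controls, and within each
   group a function call overrides the environment variable. */
int requested_threads(const thread_settings *s, int default_threads)
{
    if (s->mkl_call > 0) return s->mkl_call;  /* Intel MKL function call */
    if (s->mkl_env  > 0) return s->mkl_env;   /* Intel MKL environment variable */
    if (s->omp_call > 0) return s->omp_call;  /* OpenMP function call */
    if (s->omp_env  > 0) return s->omp_env;   /* OpenMP environment variable */
    return default_threads;                   /* chosen by the OpenMP software */
}
```

In this model, a program that calls omp_set_num_threads(2) while MKL_NUM_THREADS is 4 gets a request for four threads, which matches the exception stated in the second rule.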
Setting the Number of Threads Using an OpenMP* Environment Variable
You can set the number of threads using the environment variable OMP_NUM_THREADS. To change the number of threads, in the command shell in which the program is going to run, enter:
set OMP_NUM_THREADS=.
Some shells require the variable and its value to be exported:
export OMP_NUM_THREADS=.
You can alternatively assign a value to the environment variable using Microsoft Windows* OS Control Panel.

Note that you will not benefit from setting this variable on Microsoft Windows* 98 or Windows* ME because multiprocessing is not supported.

See Also
Using Additional Threading Control

Changing the Number of Threads at Run Time
You cannot change the number of threads during run time using environment variables. However, you can call OpenMP API functions from your program to change the number of threads during run time. The following sample code shows how to change the number of threads during run time using the omp_set_num_threads() routine. See also Techniques to Set the Number of Threads.

The following example shows both C and Fortran code examples. To run this example in the C language, use the omp.h header file from the Intel(R) compiler package. If you do not have the Intel compiler but wish to explore the functionality in the example, use Fortran API for omp_set_num_threads() rather than the C version. For example, omp_set_num_threads_( &i_one );

    // ******* C language *******
    #include "omp.h"
    #include "mkl.h"
    #include
    #define SIZE 1000
    int main(int args, char *argv[]){
        double *a, *b, *c;
        a = (double*)malloc(sizeof(double)*SIZE*SIZE);
        b = (double*)malloc(sizeof(double)*SIZE*SIZE);
        c = (double*)malloc(sizeof(double)*SIZE*SIZE);
        double alpha=1, beta=1;
        int m=SIZE, n=SIZE, k=SIZE, lda=SIZE, ldb=SIZE, ldc=SIZE, i=0, j=0;
        char transa='n', transb='n';
        for( i=0; i

    #include
    ...
    mkl_set_num_threads ( 1 );

    // ******* Fortran language *******
    ...
    call mkl_set_num_threads( 1 )

See the Intel MKL Reference Manual for the detailed description of the threading control functions, their parameters, calling syntax, and more code examples.

MKL_DYNAMIC
The MKL_DYNAMIC environment variable enables Intel MKL to dynamically change the number of threads.

The default value of MKL_DYNAMIC is TRUE, regardless of OMP_DYNAMIC, whose default value may be FALSE.

When MKL_DYNAMIC is TRUE, Intel MKL tries to use what it considers the best number of threads, up to the maximum number you specify.

For example, MKL_DYNAMIC set to TRUE enables optimal choice of the number of threads in the following cases:
• If the requested number of threads exceeds the number of physical cores (perhaps because of using the Intel® Hyper-Threading Technology), and MKL_DYNAMIC is not changed from its default value of TRUE, Intel MKL will scale down the number of threads to the number of physical cores.
• If you are able to detect the presence of MPI, but cannot determine if it has been called in a thread-safe mode (it is impossible to detect this with MPICH 1.2.x, for instance), and MKL_DYNAMIC has not been changed from its default value of TRUE, Intel MKL will run one thread.

When MKL_DYNAMIC is FALSE, Intel MKL tries not to deviate from the number of threads the user requested. However, setting MKL_DYNAMIC=FALSE does not ensure that Intel MKL will use the number of threads that you request. The library may have no choice on this number for such reasons as system resources. Additionally, the library may examine the problem and use a different number of threads than the value suggested. For example, if you attempt to do a size one matrix-matrix multiply across eight threads, the library may instead choose to use only one thread because it is impractical to use eight threads in this event.

Note also that if Intel MKL is called in a parallel region, it will use only one thread by default.
If you want the library to use nested parallelism, and the thread within a parallel region is compiled with the same OpenMP compiler as Intel MKL is using, you may experiment with setting MKL_DYNAMIC to FALSE and manually increasing the number of threads.

In general, set MKL_DYNAMIC to FALSE only under circumstances that Intel MKL is unable to detect, for example, to use nested parallelism where the library is already called from a parallel section.

MKL_DOMAIN_NUM_THREADS
The MKL_DOMAIN_NUM_THREADS environment variable suggests the number of threads for a particular function domain.

MKL_DOMAIN_NUM_THREADS accepts a string value <MKL-env-string>, which must have the following format:

    <MKL-env-string> ::= <MKL-domain-env-string> { <delimiter> <MKL-domain-env-string> }
    <delimiter> ::= [ <space-symbol>* ] ( <space-symbol> | <comma-symbol> | <semicolon-symbol> | <colon-symbol> ) [ <space-symbol>* ]
    <MKL-domain-env-string> ::= <MKL-domain-env-name> <uses> <number-of-threads>
    <MKL-domain-env-name> ::= MKL_DOMAIN_ALL | MKL_DOMAIN_BLAS | MKL_DOMAIN_FFT | MKL_DOMAIN_VML | MKL_DOMAIN_PARDISO
    <uses> ::= [ <space-symbol>* ] ( <space-symbol> | <equality-sign> | <comma-symbol> ) [ <space-symbol>* ]
    <number-of-threads> ::= <positive-number>
    <positive-number> ::= <decimal-positive-number> | <octal-number> | <hexadecimal-number>

In the syntax above, values of <MKL-domain-env-name> indicate function domains as follows:

    MKL_DOMAIN_ALL       All function domains
    MKL_DOMAIN_BLAS      BLAS Routines
    MKL_DOMAIN_FFT       non-cluster Fourier Transform Functions
    MKL_DOMAIN_VML       Vector Mathematical Functions
    MKL_DOMAIN_PARDISO   PARDISO

For example,
    MKL_DOMAIN_ALL 2 : MKL_DOMAIN_BLAS 1 : MKL_DOMAIN_FFT 4
    MKL_DOMAIN_ALL=2 : MKL_DOMAIN_BLAS=1 : MKL_DOMAIN_FFT=4
    MKL_DOMAIN_ALL=2, MKL_DOMAIN_BLAS=1, MKL_DOMAIN_FFT=4
    MKL_DOMAIN_ALL=2; MKL_DOMAIN_BLAS=1; MKL_DOMAIN_FFT=4
    MKL_DOMAIN_ALL = 2 MKL_DOMAIN_BLAS 1 , MKL_DOMAIN_FFT 4
    MKL_DOMAIN_ALL,2: MKL_DOMAIN_BLAS 1, MKL_DOMAIN_FFT,4 .

The global variables MKL_DOMAIN_ALL, MKL_DOMAIN_BLAS, MKL_DOMAIN_FFT, MKL_DOMAIN_VML, and MKL_DOMAIN_PARDISO, as well as the interface for the Intel MKL threading control functions, can be found in the mkl.h header file.

The table below illustrates how values of MKL_DOMAIN_NUM_THREADS are interpreted.

Value of MKL_DOMAIN_NUM_THREADS: MKL_DOMAIN_ALL=4
Interpretation: All parts of Intel MKL should try four threads. The actual number of threads may still be different because of the MKL_DYNAMIC setting or system resource issues. The setting is equivalent to MKL_NUM_THREADS = 4.

Value of MKL_DOMAIN_NUM_THREADS: MKL_DOMAIN_ALL=1, MKL_DOMAIN_BLAS=4
Interpretation: All parts of Intel MKL should try one thread, except for BLAS, which is suggested to try four threads.

Value of MKL_DOMAIN_NUM_THREADS: MKL_DOMAIN_VML=2
Interpretation: VML should try two threads. The setting affects no other part of Intel MKL.

Be aware that the domain-specific settings take precedence over the overall ones. For example, the "MKL_DOMAIN_BLAS=4" value of MKL_DOMAIN_NUM_THREADS suggests trying four threads for BLAS, regardless of later setting MKL_NUM_THREADS, and a function call "mkl_domain_set_num_threads ( 4, MKL_DOMAIN_BLAS );" suggests the same, regardless of later calls to mkl_set_num_threads(). However, a function call with input "MKL_DOMAIN_ALL", such as "mkl_domain_set_num_threads (4, MKL_DOMAIN_ALL);" is equivalent to "mkl_set_num_threads(4)", and thus it will be overwritten by later calls to mkl_set_num_threads. Similarly, the environment setting of MKL_DOMAIN_NUM_THREADS with "MKL_DOMAIN_ALL=4" will be overwritten with MKL_NUM_THREADS = 2.

Whereas the MKL_DOMAIN_NUM_THREADS environment variable enables you to set several variables at once, for example, "MKL_DOMAIN_BLAS=4,MKL_DOMAIN_FFT=2", the corresponding function does not take string syntax.
So, to do the same with the function calls, you may need to make several calls, which in this example are as follows:
    mkl_domain_set_num_threads ( 4, MKL_DOMAIN_BLAS );
    mkl_domain_set_num_threads ( 2, MKL_DOMAIN_FFT );

Setting the Environment Variables for Threading Control
To set the environment variables used for threading control, in the command shell in which the program is going to run, enter:
    set <VARIABLE NAME>=<value>
For example,
    set MKL_NUM_THREADS=4
    set MKL_DOMAIN_NUM_THREADS="MKL_DOMAIN_ALL=1, MKL_DOMAIN_BLAS=4"
    set MKL_DYNAMIC=FALSE
Some shells require the variable and its value to be exported:
    export <VARIABLE NAME>=<value>
For example:
    export MKL_NUM_THREADS=4
    export MKL_DOMAIN_NUM_THREADS="MKL_DOMAIN_ALL=1, MKL_DOMAIN_BLAS=4"
    export MKL_DYNAMIC=FALSE
You can alternatively assign values to the environment variables using Microsoft Windows* OS Control Panel.

Tips and Techniques to Improve Performance

Coding Techniques
To obtain the best performance with Intel MKL, ensure the following data alignment in your source code:
• Align arrays on 16-byte boundaries. See Aligning Addresses on 16-byte Boundaries for how to do it.
• Make sure leading dimension values (n*element_size) of two-dimensional arrays are divisible by 16, where element_size is the size of an array element in bytes.
• For two-dimensional arrays, avoid leading dimension values divisible by 2048 bytes. For example, for a double-precision array, with element_size = 8, avoid leading dimensions 256, 512, 768, 1024, … (elements).

LAPACK Packed Routines
The routines with the names that contain the letters HP, OP, PP, SP, TP, UP in the matrix type and storage position (the second and third letters respectively) operate on the matrices in the packed format (see LAPACK "Routine Naming Conventions" sections in the Intel MKL Reference Manual).
Their functionality is strictly equivalent to the functionality of the unpacked routines with the names containing the letters HE, OR, PO, SY, TR, UN in the same positions, but the performance is significantly lower.

If the memory restriction is not too tight, use an unpacked routine for better performance. In this case, you need to allocate N^2/2 more memory than the memory required by a respective packed routine, where N is the problem size (the number of equations).

For example, to speed up solving a symmetric eigenproblem with an expert driver, use the unpacked routine:
    call dsyevx(jobz, range, uplo, n, a, lda, vl, vu, il, iu, abstol, m, w, z, ldz, work, lwork, iwork, ifail, info)
where a is the dimension lda-by-n, which is at least N^2 elements, instead of the packed routine:
    call dspevx(jobz, range, uplo, n, ap, vl, vu, il, iu, abstol, m, w, z, ldz, work, iwork, ifail, info)
where ap is the dimension N*(N+1)/2.

FFT Functions
Additional conditions can improve performance of the FFT functions.

The addresses of the first elements of arrays and the leading dimension values, in bytes (n*element_size), of two-dimensional arrays should be divisible by cache line size, which equals:
• 32 bytes for the Intel® Pentium® III processors
• 64 bytes for the Intel® Pentium® 4 processors and processors using Intel® 64 architecture

Hardware Configuration Tips

Dual-Core Intel® Xeon® processor 5100 series systems
To get the best performance with Intel MKL on Dual-Core Intel® Xeon® processor 5100 series systems, enable the Hardware DPL (streaming data) Prefetcher functionality of this processor. To configure this functionality, use the appropriate BIOS settings, as described in your BIOS documentation.
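Returning to the data alignment rules listed under Coding Techniques, the leading dimension guidance can be sketched in C as follows. This is a minimal illustration, not an Intel MKL routine: padded_leading_dim() is a hypothetical helper introduced here, and it only searches upward from the requested size for the first leading dimension whose size in bytes is divisible by 16 but not by 2048.

```c
#include <stddef.h>

/* Hypothetical helper, not part of Intel MKL.
   Returns the smallest leading dimension ld >= n (in elements) such that
   ld * elem_size is divisible by 16 bytes and is not divisible by 2048 bytes.
   Returns 0 if elem_size is 0 or a multiple of 2048, because no suitable
   leading dimension exists in that case. */
size_t padded_leading_dim(size_t n, size_t elem_size)
{
    size_t ld = n;

    if (elem_size == 0 || elem_size % 2048 == 0)
        return 0;

    /* Step one element at a time until both rules are satisfied. */
    while ((ld * elem_size) % 16 != 0 || (ld * elem_size) % 2048 == 0)
        ++ld;

    return ld;
}
```

For a double-precision array (element_size = 8), a requested leading dimension of 1000 is returned unchanged, while 1024 is padded to 1026, because 1024 elements occupy a multiple of 2048 bytes and 1025 elements are not a multiple of 16 bytes. The array itself would still need to be allocated on a 16-byte boundary by separate means.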
Intel® Hyper-Threading Technology
Intel® Hyper-Threading Technology (Intel® HT Technology) is especially effective when each thread performs different types of operations and when there are under-utilized resources on the processor. However, Intel MKL fits neither of these criteria because the threaded portions of the library execute at high efficiencies using most of the available resources and perform identical operations on each thread. You may obtain higher performance by disabling Intel HT Technology.

If you run with Intel HT Technology enabled, performance may be especially impacted if you run on fewer threads than physical cores. Moreover, if, for example, there are two threads to every physical core, the thread scheduler may assign two threads to some cores and ignore the other cores altogether. If you are using the OpenMP* library of the Intel Compiler, read the respective User Guide on how to best set the thread affinity interface to avoid this situation. For Intel MKL, apply the following setting:
    set KMP_AFFINITY=granularity=fine,compact,1,0

See Also
Using Parallelism of the Intel® Math Kernel Library

Managing Multi-core Performance
You can obtain best performance on systems with multi-core processors by requiring that threads do not migrate from core to core. To do this, bind threads to the CPU cores by setting an affinity mask to threads. Use one of the following options:
• OpenMP facilities (recommended, if available), for example, the KMP_AFFINITY environment variable using the Intel OpenMP library
• A system function, as explained below

Consider the following performance issue:
• The system has two sockets with two cores each, for a total of four cores (CPUs)
• Performance of the four-thread parallel application using the Intel MKL LAPACK is unstable

The following code example shows how to resolve this issue by setting an affinity mask by operating system means using the Intel compiler.
The code calls the system function SetThreadAffinityMask to bind the threads to appropriate cores, thus preventing migration of the threads. Then the Intel MKL LAPACK routine is called:

    // Set affinity mask
    #include <windows.h>   // SetThreadAffinityMask, GetCurrentThread, DWORD_PTR
    #include <omp.h>       // omp_get_thread_num
    int main(void) {
    #pragma omp parallel default(shared)
        {
            int tid = omp_get_thread_num();
            // 2 packages x 2 cores/pkg x 1 threads/core (4 total cores)
            DWORD_PTR mask = (1 << (tid == 0 ? 0 : 2 ));
            SetThreadAffinityMask( GetCurrentThread(), mask );
        }
        // Call Intel MKL LAPACK routine
        return 0;
    }

Compile the application with the Intel compiler using the following command:
    icl /Qopenmp test_application.c
where test_application.c is the filename for the application.

Build the application. Run it in four threads, for example, by using the environment variable to set the number of threads:
    set OMP_NUM_THREADS=4
    test_application.exe

See Windows API documentation at msdn.microsoft.com/ for the restrictions on the usage of Windows API routines and particulars of the SetThreadAffinityMask function used in the above example.

See also a similar example at en.wikipedia.org/wiki/Affinity_mask .

Operating on Denormals
The IEEE 754-2008 standard, "IEEE Standard for Floating-Point Arithmetic", defines denormal (or subnormal) numbers as non-zero numbers smaller than the smallest possible normalized numbers for a specific floating-point format. Floating-point operations on denormals are slower than on normalized operands because denormal operands and results are usually handled through a software assist mechanism rather than directly in hardware. This software processing causes Intel MKL functions that consume denormals to run slower than with normalized floating-point numbers.

You can mitigate this performance issue by setting the appropriate bit fields in the MXCSR floating-point control register to flush denormals to zero (FTZ) or to replace any denormals loaded from memory with zero (DAZ).
Check your compiler documentation to determine whether it has options to control FTZ and DAZ. Note that these compiler options may slightly affect accuracy.

FFT Optimized Radices
You can improve the performance of Intel MKL FFT if the length of your data vector permits factorization into powers of optimized radices.

In Intel MKL, the optimized radices are 2, 3, 5, 7, 11, and 13.

Using Memory Management

Intel MKL Memory Management Software
Intel MKL has memory management software that controls memory buffers for the use by the library functions. New buffers that the library allocates when your application calls Intel MKL are not deallocated until the program ends. To get the amount of memory allocated by the memory management software, call the mkl_mem_stat() function. If your program needs to free memory, call mkl_free_buffers(). If another call is made to a library function that needs a memory buffer, the memory manager again allocates the buffers and they again remain allocated until either the program ends or the program deallocates the memory. This behavior facilitates better performance. However, some tools may report this behavior as a memory leak.

The memory management software is turned on by default. To turn it off, set the MKL_DISABLE_FAST_MM environment variable to any value or call the mkl_disable_fast_mm() function. Be aware that this change may negatively impact performance of some Intel MKL routines, especially for small problem sizes.

Redefining Memory Functions
In C/C++ programs, you can replace Intel MKL memory functions that the library uses by default with your own functions. To do this, use the memory renaming feature.

Memory Renaming
Intel MKL memory management by default uses standard C run-time memory functions to allocate or free memory. These functions can be replaced using memory renaming.
Intel MKL accesses the memory functions by pointers i_malloc, i_free, i_calloc, and i_realloc, which are visible at the application level. These pointers initially hold addresses of the standard C run-time memory functions malloc, free, calloc, and realloc, respectively. You can programmatically redefine values of these pointers to the addresses of your application's memory management functions.

Redirecting the pointers is the only correct way to use your own set of memory management functions. If you call your own memory functions without redirecting the pointers, the memory will get managed by two independent memory management packages, which may cause unexpected memory issues.

How to Redefine Memory Functions
To redefine memory functions, use the following procedure:

If you are using the statically linked Intel MKL,
1. Include the i_malloc.h header file in your code. This header file contains all declarations required for replacing the memory allocation functions. The header file also describes how memory allocation can be replaced in those Intel libraries that support this feature.
2. Redefine values of pointers i_malloc, i_free, i_calloc, and i_realloc prior to the first call to MKL functions, as shown in the following example:

    #include "i_malloc.h"
    . . .
    i_malloc = my_malloc;
    i_calloc = my_calloc;
    i_realloc = my_realloc;
    i_free = my_free;
    . . .
    // Now you may call Intel MKL functions

If you are using the dynamically linked Intel MKL,
1. Include the i_malloc.h header file in your code.
2. Redefine values of pointers i_malloc_dll, i_free_dll, i_calloc_dll, and i_realloc_dll prior to the first call to MKL functions, as shown in the following example:

    #include "i_malloc.h"
    . . .
    i_malloc_dll = my_malloc;
    i_calloc_dll = my_calloc;
    i_realloc_dll = my_realloc;
    i_free_dll = my_free;
    . . .
    // Now you may call Intel MKL functions

6 Language-specific Usage Options

The Intel® Math Kernel Library (Intel® MKL) provides broad support for Fortran and C/C++ programming. However, not all functions support both Fortran and C interfaces. For example, some LAPACK functions have no C interface. You can call such functions from C using mixed-language programming.

If you want to use LAPACK or BLAS functions that support Fortran 77 in the Fortran 95 environment, additional effort may be initially required to build compiler-specific interface libraries and modules from the source code provided with Intel MKL.

Using Language-Specific Interfaces with Intel® Math Kernel Library
This section discusses mixed-language programming and the use of language-specific interfaces with Intel MKL.

See also Appendix G in the Intel MKL Reference Manual for details of the FFTW interfaces to Intel MKL.

Interface Libraries and Modules
You can create the following interface libraries and modules using the respective makefiles located in the interfaces directory.
File name                    Contains

Libraries, in Intel MKL architecture-specific directories

mkl_blas95.lib (1)           Fortran 95 wrappers for BLAS (BLAS95) for IA-32 architecture.
mkl_blas95_lp64.lib (1)      Fortran 95 wrappers for BLAS (BLAS95) supporting LP64 interface.
mkl_blas95_ilp64.lib (1)     Fortran 95 wrappers for BLAS (BLAS95) supporting ILP64 interface.
mkl_lapack95.lib (1)         Fortran 95 wrappers for LAPACK (LAPACK95) for IA-32 architecture.
mkl_lapack95_lp64.lib (1)    Fortran 95 wrappers for LAPACK (LAPACK95) supporting LP64 interface.
mkl_lapack95_ilp64.lib (1)   Fortran 95 wrappers for LAPACK (LAPACK95) supporting ILP64 interface.
fftw2xc_intel.lib (1)        Interfaces for FFTW version 2.x (C interface for Intel compilers) to call Intel MKL FFTs.
fftw2xc_ms.lib               Interfaces for FFTW version 2.x (C interface for Microsoft compilers) to call Intel MKL FFTs.
fftw2xf_intel.lib            Interfaces for FFTW version 2.x (Fortran interface for Intel compilers) to call Intel MKL FFTs.
fftw3xc_intel.lib (2)        Interfaces for FFTW version 3.x (C interface for Intel compiler) to call Intel MKL FFTs.
fftw3xc_ms.lib               Interfaces for FFTW version 3.x (C interface for Microsoft compilers) to call Intel MKL FFTs.
fftw3xf_intel.lib (2)        Interfaces for FFTW version 3.x (Fortran interface for Intel compilers) to call Intel MKL FFTs.
fftw2x_cdft_SINGLE.lib       Single-precision interfaces for MPI FFTW version 2.x (C interface) to call Intel MKL cluster FFTs.
fftw2x_cdft_DOUBLE.lib       Double-precision interfaces for MPI FFTW version 2.x (C interface) to call Intel MKL cluster FFTs.
fftw3x_cdft.lib              Interfaces for MPI FFTW version 3.x (C interface) to call Intel MKL cluster FFTs.
fftw3x_cdft_ilp64.lib        Interfaces for MPI FFTW version 3.x (C interface) to call Intel MKL cluster FFTs supporting the ILP64 interface.

Modules, in architecture- and interface-specific subdirectories of the Intel MKL include directory

blas95.mod (1)               Fortran 95 interface module for BLAS (BLAS95).
lapack95.mod (1)             Fortran 95 interface module for LAPACK (LAPACK95).
f95_precision.mod (1)        Fortran 95 definition of precision parameters for BLAS95 and LAPACK95.
mkl95_blas.mod (1)           Fortran 95 interface module for BLAS (BLAS95), identical to blas95.mod. To be removed in one of the future releases.
mkl95_lapack.mod (1)         Fortran 95 interface module for LAPACK (LAPACK95), identical to lapack95.mod. To be removed in one of the future releases.
mkl95_precision.mod (1)      Fortran 95 definition of precision parameters for BLAS95 and LAPACK95, identical to f95_precision.mod. To be removed in one of the future releases.
mkl_service.mod (1)          Fortran 95 interface module for Intel MKL support functions.

(1) Prebuilt for the Intel® Fortran compiler
(2) FFTW3 interfaces are integrated with Intel MKL. Look into \interfaces\fftw3x*\makefile for options defining how to build and where to place the standalone library with the wrappers.

See Also
Fortran 95 Interfaces to LAPACK and BLAS

Fortran 95 Interfaces to LAPACK and BLAS

Fortran 95 interfaces are compiler-dependent. Intel MKL provides the interface libraries and modules precompiled with the Intel® Fortran compiler. Additionally, the Fortran 95 interfaces and wrappers are delivered as sources. (For more information, see Compiler-dependent Functions and Fortran 90 Modules).

If you are using a different compiler, build the appropriate library and modules with your compiler and link the library as a user's library:
1. Go to the respective directory \interfaces\blas95 or \interfaces\lapack95
2. Type one of the following commands depending on your architecture:
• For the IA-32 architecture,
    nmake libia32 install_dir=
• For the Intel® 64 architecture,
    nmake libintel64 [interface=lp64|ilp64] install_dir=

Important
The parameter install_dir is required.
As a result, the required library is built and installed in the \lib directory, and the .mod files are built and installed in the \include\[\{lp64|ilp64}] directory, where is one of {ia32, intel64}.

By default, the ifort compiler is assumed. You may change the compiler with an additional parameter of nmake: FC=.

For example, the command
    nmake libintel64 FC=f95 install_dir= interface=lp64
builds the required library and .mod files and installs them in subdirectories of .

To delete the library from the building directory, use one of the following commands:
• For the IA-32 architecture,
    nmake cleania32 install_dir=
• For the Intel® 64 architecture,
    nmake cleanintel64 [interface=lp64|ilp64] install_dir=
• For all the architectures,
    nmake clean install_dir=

CAUTION
Even if you have administrative rights, avoid setting install_dir=..\.. or install_dir= in a build or clean command above because these settings replace or delete the Intel MKL prebuilt Fortran 95 library and modules.

Compiler-dependent Functions and Fortran 90 Modules

Compiler-dependent functions occur whenever the compiler inserts into the object code function calls that are resolved in its run-time library (RTL). Linking such code without the appropriate RTL results in undefined symbols. Intel MKL has been designed to minimize RTL dependencies.

In cases where RTL dependencies might arise, the functions are delivered as source code and you need to compile the code with whatever compiler you are using for your application.

In particular, Fortran 90 modules result in compiler-specific code generation that requires RTL support. Therefore, Intel MKL delivers these modules compiled with the Intel compiler, along with source code, to be used with different compilers.
Using the stdcall Calling Convention in C/C++

Intel MKL supports the stdcall calling convention for the following function domains:
• BLAS Routines
• Sparse BLAS Routines
• LAPACK Routines
• Vector Mathematical Functions
• Vector Statistical Functions
• PARDISO
• Direct Sparse Solvers
• RCI Iterative Solvers
• Support Functions

To use the stdcall calling convention in C/C++, follow the guidelines below:
• In your function calls, pass lengths of character strings to the functions. For example, compare the following calls to dgemm:
    cdecl:   dgemm("N", "N", &n, &m, &k, &alpha, b, &ldb, a, &lda, &beta, c, &ldc);
    stdcall: dgemm("N", 1, "N", 1, &n, &m, &k, &alpha, b, &ldb, a, &lda, &beta, c, &ldc);
• Define the MKL_STDCALL macro using either of the following techniques:
  – Define the macro in your source code before including Intel MKL header files:
        ...
        #define MKL_STDCALL
        #include "mkl.h"
        ...
  – Pass the macro to the compiler. For example:
        icl -DMKL_STDCALL foo.c
• Link your application with the following library:
  – mkl_intel_s.lib for static linking
  – mkl_intel_s_dll.lib for dynamic linking

See Also
Using the cdecl and stdcall Interfaces
Compiling an Application that Calls the Intel® Math Kernel Library and Uses the CVF Calling Conventions
Include Files

Compiling an Application that Calls the Intel® Math Kernel Library and Uses the CVF Calling Conventions

The IA-32 architecture implementation of Intel MKL supports the Compaq Visual Fortran* (CVF) calling convention by providing the stdcall interface.

Although Intel MKL does not provide the CVF interface in its Intel® 64 architecture implementation, you can use the Intel® Visual Fortran Compiler to compile your Intel® 64 architecture application that calls Intel MKL and uses the CVF calling convention.
To do this:
• Provide the following compiler options to enable compatibility with the CVF calling convention: /Gm or /iface:cvf
• Additionally provide the following options to enable calling Intel MKL from your application: /iface:nomixed_str_len_arg

See Also
Using the cdecl and stdcall Interfaces
Compiler Support

Mixed-language Programming with the Intel Math Kernel Library

Appendix A: Intel(R) Math Kernel Library Language Interfaces Support lists the programming languages supported for each Intel MKL function domain. However, you can call Intel MKL routines from different language environments.

Calling LAPACK, BLAS, and CBLAS Routines from C/C++ Language Environments

Not all Intel MKL function domains support both C and Fortran environments. To use Intel MKL Fortran-style functions in C/C++ environments, you should observe certain conventions, which are discussed for LAPACK and BLAS in the subsections below.

CAUTION
Avoid calling BLAS 95/LAPACK 95 from C/C++. Such calls require skills in manipulating the descriptor of a deferred-shape array, which is the Fortran 90 type. Moreover, BLAS95/LAPACK95 routines contain links to a Fortran RTL.

LAPACK and BLAS

Because LAPACK and BLAS routines are Fortran-style, when calling them from C-language programs, follow the Fortran-style calling conventions:
• Pass variables by address, not by value. Function calls in Example "Calling a Complex BLAS Level 1 Function from C++" and Example "Using CBLAS Interface Instead of Calling BLAS Directly from C" illustrate this.
• Store your data in Fortran style, that is, column-major rather than row-major order. With row-major order, adopted in C, the last array index changes most quickly and the first one changes most slowly when traversing the memory segment where the array is stored. With Fortran-style column-major order, the last index changes most slowly whereas the first index changes most quickly (as illustrated by the figure below for a two-dimensional array).
For example, if a two-dimensional matrix A of size m x n is stored densely in a one-dimensional array B, you can access a matrix element like this:
    A[i][j] = B[i*n+j] in C (i=0, ... , m-1, j=0, ... , n-1)
    A(i,j) = B((j-1)*m+i) in Fortran (i=1, ... , m, j=1, ... , n).

When calling LAPACK or BLAS routines from C, be aware that because the Fortran language is case-insensitive, the routine names can be either upper-case or lower-case, with or without the trailing underscore. For example, the following names are equivalent:
• LAPACK: dgetrf, DGETRF, dgetrf_, and DGETRF_
• BLAS: dgemm, DGEMM, dgemm_, and DGEMM_

See Example "Calling a Complex BLAS Level 1 Function from C++" on how to call BLAS routines from C. See also the Intel(R) MKL Reference Manual for a description of the C interface to LAPACK functions.

CBLAS

Instead of calling BLAS routines from a C-language program, you can use the CBLAS interface. CBLAS is a C-style interface to the BLAS routines. You can call CBLAS routines using regular C-style calls. Use the mkl.h header file with the CBLAS interface. The header file specifies enumerated values and prototypes of all the functions. It also determines whether the program is being compiled with a C++ compiler, and if it is, the included file will be correct for use with C++ compilation. Example "Using CBLAS Interface Instead of Calling BLAS Directly from C" illustrates the use of the CBLAS interface.

C Interface to LAPACK

Instead of calling LAPACK routines from a C-language program, you can use the C interface to LAPACK provided by Intel MKL. The C interface to LAPACK is a C-style interface to the LAPACK routines. This interface supports matrices in row-major and column-major order, which you can define in the first function argument matrix_order. Use the mkl_lapacke.h header file with the C interface to LAPACK. The header file specifies constants and prototypes of all the functions.
It also determines whether the program is being compiled with a C++ compiler, and if it is, the included file will be correct for use with C++ compilation. You can find examples of the C interface to LAPACK in the examples\lapacke subdirectory in the Intel MKL installation directory.

Using Complex Types in C/C++

As described in the documentation for the Intel® Visual Fortran Compiler XE, C/C++ does not directly implement the Fortran types COMPLEX(4) and COMPLEX(8). However, you can write equivalent structures. The type COMPLEX(4) consists of two 4-byte floating-point numbers. The first of them is the real-number component, and the second one is the imaginary-number component. The type COMPLEX(8) is similar to COMPLEX(4) except that it contains two 8-byte floating-point numbers.

Intel MKL provides complex types MKL_Complex8 and MKL_Complex16, which are structures equivalent to the Fortran complex types COMPLEX(4) and COMPLEX(8), respectively. The MKL_Complex8 and MKL_Complex16 types are defined in the mkl_types.h header file. You can use these types to define complex data. You can also redefine the types with your own types before including the mkl_types.h header file. The only requirement is that the types must be compatible with the Fortran complex layout, that is, the complex type must be a pair of real numbers for the values of real and imaginary parts.

For example, you can use the following definitions in your C++ code:
    #define MKL_Complex8 std::complex<float>
and
    #define MKL_Complex16 std::complex<double>

See Example "Calling a Complex BLAS Level 1 Function from C++" for details. You can also define these types in the command line:
    -DMKL_Complex8="std::complex<float>"
    -DMKL_Complex16="std::complex<double>"

See Also
Intel® Software Documentation Library

Calling BLAS Functions that Return the Complex Values in C/C++ Code

Complex values that functions return are handled differently in C and Fortran.
Because BLAS is Fortran-style, you need to be careful when handling a call from C to a BLAS function that returns complex values. However, in addition to normal function calls, Fortran enables calling functions as though they were subroutines, which provides a mechanism for returning the complex value correctly when the function is called from a C program. When a Fortran function is called as a subroutine, the return value is the first parameter in the calling sequence. You can use this feature to call a BLAS function from C.

The following example shows how a call to a Fortran function as a subroutine converts to a call from C and the hidden parameter result gets exposed:

Normal Fortran function call:
    result = cdotc( n, x, 1, y, 1 )
A call to the function as a subroutine:
    call cdotc( result, n, x, 1, y, 1)
A call to the function from C:
    cdotc( &result, &n, x, &one, y, &one )

NOTE
Intel MKL has both upper-case and lower-case entry points in the Fortran-style (case-insensitive) BLAS, with or without the trailing underscore. So, all these names are equivalent and acceptable: cdotc, CDOTC, cdotc_, and CDOTC_.

The above example shows one of the ways to call several level 1 BLAS functions that return complex values from your C and C++ applications. An easier way is to use the CBLAS interface. For instance, you can call the same function using the CBLAS interface as follows:
    cblas_cdotc_sub( n, x, 1, y, 1, &result )

NOTE
The complex value comes last on the argument list in this case.

The following examples show use of the Fortran-style BLAS interface from C and C++, as well as the CBLAS (C language) interface:
• Example "Calling a Complex BLAS Level 1 Function from C"
• Example "Calling a Complex BLAS Level 1 Function from C++"
• Example "Using CBLAS Interface Instead of Calling BLAS Directly from C"

Example "Calling a Complex BLAS Level 1 Function from C"

The example below illustrates a call from a C program to the complex BLAS Level 1 function zdotc().
This function computes the dot product of two double-precision complex vectors. In this example, the complex dot product is returned in the structure c.

    #include <stdio.h>
    #include "mkl.h"
    #define N 5
    int main()
    {
        int n = N, inca = 1, incb = 1, i;
        MKL_Complex16 a[N], b[N], c;
        for( i = 0; i < n; i++ ){
            a[i].real = (double)i;
            a[i].imag = (double)i * 2.0;
            b[i].real = (double)(n - i);
            b[i].imag = (double)i * 2.0;
        }
        /* The complex result is returned through the first argument */
        zdotc( &c, &n, a, &inca, b, &incb );
        printf( "The complex dot product is: ( %6.2f, %6.2f)\n", c.real, c.imag );
        return 0;
    }

Example "Calling a Complex BLAS Level 1 Function from C++"

Below is the C++ implementation:

    #include <complex>
    #include <iostream>
    // Redefine the MKL complex type before including mkl.h
    #define MKL_Complex16 std::complex<double>
    #include "mkl.h"
    #define N 5
    int main()
    {
        int n, inca = 1, incb = 1, i;
        std::complex<double> a[N], b[N], c;
        n = N;
        for( i = 0; i < n; i++ ){
            a[i] = std::complex<double>(i,i*2.0);
            b[i] = std::complex<double>(n-i,i*2.0);
        }
        zdotc(&c, &n, a, &inca, b, &incb );
        std::cout << "The complex dot product is: " << c << std::endl;
        return 0;
    }

Example "Using CBLAS Interface Instead of Calling BLAS Directly from C"

This example uses CBLAS:

    #include <stdio.h>
    #include "mkl.h"
    typedef struct{ double re; double im; } complex16;
    #define N 5
    int main()
    {
        int n, inca = 1, incb = 1, i;
        complex16 a[N], b[N], c;
        n = N;
        for( i = 0; i < n; i++ ){
            a[i].re = (double)i;
            a[i].im = (double)i * 2.0;
            b[i].re = (double)(n - i);
            b[i].im = (double)i * 2.0;
        }
        /* CBLAS takes sizes and increments by value; the result comes last */
        cblas_zdotc_sub(n, a, inca, b, incb, &c );
        printf( "The complex dot product is: ( %6.2f, %6.2f)\n", c.re, c.im );
        return 0;
    }

Support for Boost uBLAS Matrix-matrix Multiplication

If you are used to uBLAS, you can perform BLAS matrix-matrix multiplication in C++ using Intel MKL substitution of Boost uBLAS functions. uBLAS is the Boost C++ open-source library that provides BLAS functionality for dense, packed, and sparse matrices.
The library uses an expression template technique for passing expressions as function arguments, which enables evaluating vector and matrix expressions in one pass without temporary matrices. uBLAS provides two modes:
• Debug (safe) mode, default. Checks types and conformance.
• Release (fast) mode. Does not check types and conformance. To enable this mode, use the NDEBUG preprocessor symbol.

The documentation for the Boost uBLAS is available at www.boost.org.

Intel MKL provides overloaded prod() functions for substituting uBLAS dense matrix-matrix multiplication with the Intel MKL gemm calls. Though these functions break uBLAS expression templates and introduce temporary matrices, the performance advantage can be considerable for matrix sizes that are not too small (roughly, over 50).

You do not need to change your source code to use the functions. To call them:
• Include the header file mkl_boost_ublas_matrix_prod.hpp in your code (from the Intel MKL include directory)
• Add appropriate Intel MKL libraries to the link line.

The list of expressions that are substituted follows:
    prod( m1, m2 )
    prod( trans(m1), m2 )
    prod( trans(conj(m1)), m2 )
    prod( conj(trans(m1)), m2 )
    prod( m1, trans(m2) )
    prod( trans(m1), trans(m2) )
    prod( trans(conj(m1)), trans(m2) )
    prod( conj(trans(m1)), trans(m2) )
    prod( m1, trans(conj(m2)) )
    prod( trans(m1), trans(conj(m2)) )
    prod( trans(conj(m1)), trans(conj(m2)) )
    prod( conj(trans(m1)), trans(conj(m2)) )
    prod( m1, conj(trans(m2)) )
    prod( trans(m1), conj(trans(m2)) )
    prod( trans(conj(m1)), conj(trans(m2)) )
    prod( conj(trans(m1)), conj(trans(m2)) )

These expressions are substituted in the release mode only (with NDEBUG preprocessor symbol defined). Supported uBLAS versions are Boost 1.34.1 and higher. To get them, visit www.boost.org.
A code example provided in the \examples\ublas\source\sylvester.cpp file illustrates usage of the Intel MKL uBLAS header file for solving a special case of the Sylvester equation.

To run the Intel MKL ublas examples, specify the boost_root parameter in the nmake command, for instance, when using Boost version 1.37.0:
    nmake libia32 boost_root=\boost_1_37_0

Intel MKL ublas examples on default Boost uBLAS configuration support only:
• Microsoft Visual C++* Compiler versions 2005 and higher
• Intel C++ Compiler versions 11.1 and higher with Microsoft Visual Studio IDE versions 2005 and higher

See Also
Using Code Examples

Invoking Intel MKL Functions from Java* Applications

Intel MKL Java* Examples

To demonstrate binding with Java, Intel MKL includes a set of Java examples in the following directory: \examples\java.

The examples are provided for the following MKL functions:
• ?gemm, ?gemv, and ?dot families from CBLAS
• The complete set of non-cluster FFT functions
• ESSL(1)-like functions for one-dimensional convolution and correlation
• VSL Random Number Generators (RNG), except user-defined ones and file subroutines
• VML functions, except GetErrorCallBack, SetErrorCallBack, and ClearErrorCallBack

You can see the example sources in the following directory: \examples\java\examples.

The examples are written in Java. They demonstrate usage of the MKL functions with the following variety of data:
• 1- and 2-dimensional data sequences
• Real and complex types of the data
• Single and double precision

However, the wrappers used in the examples do not:
• Demonstrate the use of large arrays (>2 billion elements)
• Demonstrate processing of arrays in native memory
• Check correctness of function parameters
• Demonstrate performance optimizations

The examples use the Java Native Interface (JNI* developer framework) to bind with Intel MKL. The JNI documentation is available from http://java.sun.com/javase/6/docs/technotes/guides/jni/.
The Java example set includes JNI wrappers that perform the binding. The wrappers do not depend on the examples and may be used in your Java applications. The wrappers for CBLAS, FFT, VML, VSL RNG, and ESSL-like convolution and correlation functions do not depend on each other.

To build the wrappers, just run the examples. The makefile builds the wrapper binaries. After running the makefile, you can run the examples, which will determine whether the wrappers were built correctly. As a result of running the examples, the following directories will be created in \examples\java:
• docs
• include
• classes
• bin
• _results

The directories docs, include, classes, and bin will contain the wrapper binaries and documentation; the directory _results will contain the testing results.

For a Java programmer, the wrappers are the following Java classes:
• com.intel.mkl.CBLAS
• com.intel.mkl.DFTI
• com.intel.mkl.ESSL
• com.intel.mkl.VML
• com.intel.mkl.VSL

Documentation for the particular wrapper and example classes will be generated from the Java sources while building and running the examples. To browse the documentation, open the index file in the docs directory (created by the build script): \examples\java\docs\index.html.

The Java wrappers for CBLAS, VML, VSL RNG, and FFT establish the interface that directly corresponds to the underlying native functions, so you can refer to the Intel MKL Reference Manual for their functionality and parameters. Interfaces for the ESSL-like functions are described in the generated documentation for the com.intel.mkl.ESSL class.

Each wrapper consists of the interface part for Java and a JNI stub written in C. You can find the sources in the following directory: \examples\java\wrappers.

Both Java and C parts of the wrapper for CBLAS and VML demonstrate the straightforward approach, which you may use to cover additional CBLAS functions.
The wrapper for FFT is more complicated because it needs to support the lifecycle for FFT descriptor objects. To compute a single Fourier transform, an application needs to call the FFT software several times with the same copy of the native FFT descriptor. The wrapper provides the handler class to hold the native descriptor while the virtual machine runs Java bytecode.

The wrapper for VSL RNG is similar to the one for FFT. The wrapper provides the handler class to hold the native descriptor of the stream state.

The wrapper for the convolution and correlation functions mitigates the same difficulty of the VSL interface, which assumes a similar lifecycle for "task descriptors". The wrapper utilizes the ESSL-like interface for those functions, which is simpler for the case of 1-dimensional data. The JNI stub additionally encapsulates the MKL functions into the ESSL-like wrappers written in C and so "packs" the lifecycle of a task descriptor into a single call to the native method.

The wrappers meet the JNI Specification versions 1.1 and 5.0 and should work with virtually every modern implementation of Java.

The examples and the Java part of the wrappers are written for the Java language described in "The Java Language Specification (First Edition)" and extended with the feature of "inner classes" (that is, the language level of the late 1990s). This level of language version is supported by all versions of the Sun Java Development Kit* (JDK*) developer toolkit and compatible implementations starting from version 1.1.5, or by all modern versions of Java.

The level of C language is "Standard C" (that is, C89) with additional assumptions about integer and floating-point data types required by the Intel MKL interfaces and the JNI header files. That is, the native float and double data types must be the same as JNI jfloat and jdouble data types, respectively, and the native int must be 4 bytes long.

(1) IBM Engineering Scientific Subroutine Library (ESSL*).
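The data-type assumptions stated above for the C part of the wrappers can be checked with a few lines of standard C. The sketch below is not part of the Intel MKL wrappers: the function name native_types_match_jni_widths is hypothetical, and the check relies on the JNI specification defining jint and jfloat as 32-bit types and jdouble as a 64-bit type. It compares widths only and does not verify the floating-point representation.

```c
#include <limits.h>

/* Hypothetical helper, not part of Intel MKL or JNI.
   Returns 1 when the native C types have the widths that the JNI
   specification gives for jint (32 bits), jfloat (32 bits), and
   jdouble (64 bits); returns 0 otherwise. */
int native_types_match_jni_widths(void)
{
    return sizeof(int) * CHAR_BIT == 32
        && sizeof(float) * CHAR_BIT == 32
        && sizeof(double) * CHAR_BIT == 64;
}
```

A build script could call such a function once before running the examples and stop early if it returns 0, rather than letting a width mismatch surface later as corrupted array data.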
See Also
Running the Java* Examples

Running the Java* Examples

The Java examples support all the C and C++ compilers that Intel MKL does. The makefile intended to run the examples also needs the nmake utility, which is typically provided with the C/C++ compiler package.

To run Java examples, the JDK* developer toolkit is required for compiling and running Java code. A Java implementation must be installed on the computer or available via the network. You may download the JDK from the vendor website.

The examples should work for all versions of JDK. However, they were tested only with the following Java implementations for all the supported architectures:
• J2SE* SDK 1.4.2, JDK 5.0 and 6.0 from Sun Microsystems, Inc. (http://sun.com/).
• JRockit* JDK 1.4.2 and 5.0 from Oracle Corporation (http://oracle.com/).

Note that the Java run-time environment* (JRE*) system, which may be pre-installed on your computer, is not enough. You need the JDK* developer toolkit that supports the following set of tools:
• java
• javac
• javah
• javadoc

To make these tools available for the examples makefile, set the JAVA_HOME environment variable and add the JDK binaries directory to the system PATH, for example:
    SET JAVA_HOME=C:\Program Files\Java\jdk1.5.0_09
    SET PATH=%JAVA_HOME%\bin;%PATH%

You may also need to clear the JDK_HOME environment variable, if it is assigned a value:
    SET JDK_HOME=

To start the examples, use the makefile found in the Intel MKL Java examples directory:
    nmake {dllia32|dllintel64|libia32|libintel64} [function=...] [compiler=...]

If you type the nmake command and omit the target (for example, dllia32), the makefile prints the help info, which explains the targets and parameters.

For the examples list, see the examples.lst file in the Java examples directory.

Known Limitations of the Java* Examples

This section explains limitations of Java examples.
Functionality

Some Intel MKL functions may fail to work if called from the Java environment by using a wrapper, like those provided with the Intel MKL Java examples. Only those specific CBLAS, FFT, VML, VSL RNG, and the convolution/correlation functions listed in the Intel MKL Java Examples section were tested with the Java environment. So, you may use the Java wrappers for these CBLAS, FFT, VML, VSL RNG, and convolution/correlation functions in your Java applications.

Performance

The Intel MKL functions are expected to work faster than similar functions written in pure Java. However, the main goal of these wrappers is to provide code examples, not maximum performance. So, an Intel MKL function called from a Java application will probably work slower than the same function called from a program written in C/C++ or Fortran.

Known bugs

There are a number of known bugs in Intel MKL (identified in the Release Notes), as well as incompatibilities between different versions of JDK. The examples and wrappers include workarounds for these problems. Look at the source code in the examples and wrappers for comments that describe the workarounds.

Coding Tips (Chapter 7)

This section discusses programming with the Intel® Math Kernel Library (Intel® MKL) to provide coding tips that meet certain, specific needs, such as consistent results of computations or conditional compilation.

Aligning Data for Consistent Results

Routines in Intel MKL may return different results from run-to-run on the same system. This is usually due to a change in the order in which floating-point operations are performed. The two most influential factors are array alignment and parallelism. Array alignment can determine how internal loops order floating-point operations. Non-deterministic parallelism may change the order in which computational tasks are executed. While these results may differ, they should still fall within acceptable computational error bounds.
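The effect of operation order described above can be reproduced without Intel MKL, because floating-point addition is not associative. The following is a minimal sketch in standard C; the function names sum_forward and sum_backward are hypothetical and stand in for two loops that visit the same data in a different order, as a differently aligned or differently threaded loop might. The input used in the note after the code is deliberately extreme so that the difference is visible; real Intel MKL results typically differ only in the last few bits.

```c
#include <stddef.h>

/* Hypothetical helper: sums the array front to back,
   that is, ((x[0] + x[1]) + x[2]) + ... */
double sum_forward(const double *x, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Hypothetical helper: sums the same array back to front.
   Mathematically the result is the same; in floating-point
   arithmetic the intermediate roundings differ. */
double sum_backward(const double *x, size_t n)
{
    double s = 0.0;
    for (size_t i = n; i > 0; i--)
        s += x[i - 1];
    return s;
}
```

For the three-element array {1e20, -1e20, 1.0}, the forward sum cancels the two large terms first and then adds 1.0, while the backward sum adds 1.0 to -1e20 first, where it is lost to rounding, so the two functions return different values for identical input.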
To better assure identical results from run-to-run, do the following: • Align input arrays on 16-byte boundaries • Run Intel MKL in the sequential mode To align input arrays on 16-byte boundaries, use mkl_malloc() in place of system provided memory allocators, as shown in the code example below. Sequential mode of Intel MKL removes the influence of nondeterministic parallelism. Aligning Addresses on 16-byte Boundaries // ******* C language ******* ... #include ... void *darray; int workspace; ... // Allocate workspace aligned on 16-byte boundary darray = mkl_malloc( sizeof(double)*workspace, 16 ); ... // call the program using MKL mkl_app( darray ); ... // Free workspace mkl_free( darray ); ! ******* Fortran language ******* ... double precision darray pointer (p_wrk,darray(1)) integer workspace ... ! Allocate workspace aligned on 16-byte boundary p_wrk = mkl_malloc( 8*workspace, 16 ) ... ! call the program using MKL call mkl_app( darray ) ... ! Free workspace call mkl_free(p_wrk) 69Using Predefined Preprocessor Symbols for Intel® MKL Version-Dependent Compilation Preprocessor symbols (macros) substitute values in a program before it is compiled. The substitution is performed in the preprocessing phase. The following preprocessor symbols are available: Predefined Preprocessor Symbol Description __INTEL_MKL__ Intel MKL major version __INTEL_MKL_MINOR__ Intel MKL minor version __INTEL_MKL_UPDATE__ Intel MKL update number INTEL_MKL_VERSION Intel MKL full version in the following format: INTEL_MKL_VERSION = (__INTEL_MKL__*100+__INTEL_MKL_MINOR__)*100+__I NTEL_MKL_UPDATE__ These symbols enable conditional compilation of code that uses new features introduced in a particular version of the library. To perform conditional compilation: 1. Include in your code the file where the macros are defined: • mkl.h for C/C++ • mkl.fi for Fortran 2. 
[Optionally] Use the following preprocessor directives to check whether the macro is defined: • #ifdef, #endif for C/C++ • !DEC$IF DEFINED, !DEC$ENDIF for Fortran 3. Use preprocessor directives for conditional inclusion of code: • #if, #endif for C/C++ • !DEC$IF, !DEC$ENDIF for Fortran Example Compile a part of the code if Intel MKL version is MKL 10.3 update 4: C/C++: #include "mkl.h" #ifdef INTEL_MKL_VERSION #if INTEL_MKL_VERSION == 100304 // Code to be conditionally compiled #endif #endif Fortran: include "mkl.fi" !DEC$IF DEFINED INTEL_MKL_VERSION !DEC$IF INTEL_MKL_VERSION .EQ. 100304 * Code to be conditionally compiled !DEC$ENDIF !DEC$ENDIF 7 Intel® Math Kernel Library for Windows* OS User's Guide 70Working with the Intel® Math Kernel Library Cluster Software 8 Optimization Notice Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804 MPI Support Intel MKL ScaLAPACK and Cluster FFTs support MPI implementations identified in the Intel® Math Kernel Library (Intel® MKL) Release Notes. To link applications with ScaLAPACK or Cluster FFTs, you need to configure your system depending on your message-passing interface (MPI) implementation as explained below. If you are using MPICH2, do the following: 1. 
Add mpich2\include to the include path (assuming the default MPICH2 installation).
2. Add mpich2\lib to the library path.
3. Add mpi.lib to your link command.
4. Add fmpich2.lib to your Fortran link command.
5. Add cxx.lib to your Release target link command and cxxd.lib to your Debug target link command for C++ programs.

If you are using the Microsoft MPI, do the following:
1. Add Microsoft Compute Cluster Pack\include to the include path (assuming the default installation of the Microsoft MPI).
2. Add Microsoft Compute Cluster Pack\Lib\AMD64 to the library path.
3. Add msmpi.lib to your link command.

If you are using the Intel® MPI, do the following:
1. Add the following string to the include path: %ProgramFiles%\Intel\MPI\\\include, where is the directory for a particular MPI version and is ia32 or intel64, for example, %ProgramFiles%\Intel\MPI\3.1\intel64\include.
2. Add the following string to the library path: %ProgramFiles%\Intel\MPI\\\lib, for example, %ProgramFiles%\Intel\MPI\3.1\intel64\lib.
3. Add impi.lib and impicxx.lib to your link command.

Check the documentation that comes with your MPI implementation for implementation-specific details of linking.

Linking with ScaLAPACK and Cluster FFTs

To link with Intel MKL ScaLAPACK and/or Cluster FFTs, use the following commands:

    set lib =;;%lib%

where the placeholders stand for paths and libraries as explained in the following table:

• \lib\{ia32|intel64}, depending on your architecture. If you performed the Setting Environment Variables step of the Getting Started process, you do not need to add this directory to the lib environment variable.
• Typically the lib subdirectory in the MPI installation directory. For example, C:\Program Files (x86)\Intel\MPI\3.2.0.005\ia32\lib for a default installation of Intel MPI 3.2.
• One of icl, ifort, xilink.
• One of ScaLAPACK or Cluster FFT libraries for the appropriate architecture, which are listed in Directory Structure in Detail.
  For example, for the IA-32 architecture, it is one of mkl_scalapack_core.lib or mkl_cdft_core.lib.
• The BLACS library corresponding to your architecture, programming interface (LP64 or ILP64), and MPI version. These libraries are listed in Directory Structure in Detail. For example, for the IA-32 architecture, choose one of mkl_blacs_mpich2.lib or mkl_blacs_intelmpi.lib in case of static linking or mkl_blacs_dll.lib in case of dynamic linking; specifically, for MPICH2, choose mkl_blacs_mpich2.lib in case of static linking.
• Intel MKL libraries other than ScaLAPACK or Cluster FFTs libraries.

TIP Use the Link-line Advisor to quickly choose the appropriate set of libraries.

Intel MPI provides prepackaged scripts for its linkers to help you link using the respective linker. Therefore, if you are using Intel MPI, the best way to link is to use the following commands:

    \mpivars.bat
    set lib = ;%lib%

where the placeholders that are not yet defined are explained in the following table:

• By default, the bin subdirectory in the MPI installation directory. For example, C:\Program Files (x86)\Intel\MPI\3.2.0.005\ia32\lib for a default installation of Intel MPI 3.2.
• mpicl or mpiifort

See Also
Linking Your Application with the Intel® Math Kernel Library
Examples for Linking with ScaLAPACK and Cluster FFT

Determining the Number of Threads

The OpenMP* software responds to the environment variable OMP_NUM_THREADS. Intel MKL also has other mechanisms to set the number of threads, such as the MKL_NUM_THREADS or MKL_DOMAIN_NUM_THREADS environment variables (see Using Additional Threading Control).

Make sure that the relevant environment variables have the same and correct values on all the nodes. Intel MKL versions 10.0 and higher no longer set the default number of threads to one, but depend on the OpenMP libraries used with the compiler to set the default number.
For the threading layer based on the Intel compiler (mkl_intel_thread.lib), this value is the number of CPUs according to the OS.

CAUTION Avoid over-prescribing the number of threads, which may occur, for instance, when the number of MPI ranks per node and the number of threads per node are both greater than one. The product of MPI ranks per node and the number of threads per node should not exceed the number of physical cores per node.

The OMP_NUM_THREADS environment variable is assumed in the discussion below.

Set OMP_NUM_THREADS so that the product of its value and the number of MPI ranks per node equals the number of real processors or cores of a node. If the Intel® Hyper-Threading Technology is enabled on the node, use only half the number of processors that are visible on Windows OS.

See Also
Setting Environment Variables on a Cluster

Using DLLs

All the needed DLLs must be visible on all the nodes at run time, and you should install Intel® Math Kernel Library (Intel® MKL) on each node of the cluster. You can use Remote Installation Services (RIS) provided by Microsoft to remotely install the library on each of the nodes that are part of your cluster. The best way to make the DLLs visible is to point to these libraries in the PATH environment variable. See Setting Environment Variables on a Cluster on how to set the value of the PATH environment variable.

The ScaLAPACK DLLs for the IA-32 and Intel® 64 architectures (in the \redist\ia32\mkl and \redist\intel64\mkl directories, respectively) use the MPI dispatching mechanism. MPI dispatching is based on the MKL_BLACS_MPI environment variable. The BLACS DLL uses MKL_BLACS_MPI for choosing the needed MPI libraries. The table below lists possible values of the variable.

    Value       Comment
    MPICH2      Default value.
                MPICH2 1.0.x for Windows* OS is used for message passing
    INTELMPI    Intel MPI is used for message passing
    MSMPI       Microsoft MPI is used for message passing

If you are using a non-default MPI, assign the same appropriate value to MKL_BLACS_MPI on all nodes.

See Also
Setting Environment Variables on a Cluster

Setting Environment Variables on a Cluster

If you are using MPICH2 or Intel MPI, to set an environment variable on the cluster, use -env, -genv, -genvlist keys of mpiexec.

See the following MPICH2 examples on how to set the value of OMP_NUM_THREADS:

    mpiexec -genv OMP_NUM_THREADS 2 ....
    mpiexec -genvlist OMP_NUM_THREADS ....
    mpiexec -n 1 -host first -env OMP_NUM_THREADS 2 test.exe : -n 1 -host second -env OMP_NUM_THREADS 3 test.exe ....

See the following Intel MPI examples on how to set the value of MKL_BLACS_MPI:

    mpiexec -genv MKL_BLACS_MPI INTELMPI ....
    mpiexec -genvlist MKL_BLACS_MPI ....
    mpiexec -n 1 -host first -env MKL_BLACS_MPI INTELMPI test.exe : -n 1 -host second -env MKL_BLACS_MPI INTELMPI test.exe

When using MPICH2, you may have problems with getting the global environment, such as MKL_BLACS_MPI, by the -genvlist key. In this case, set up user or system environments on each node as follows: From the Start menu, select Settings > Control Panel > System > Advanced > Environment Variables.

If you are using Microsoft MPI, the above ways of setting environment variables are also applicable if the Microsoft Single Program Multiple Data (SPMD) process managers are running in a debug mode on all nodes of the cluster. However, the best way to set environment variables is using the Job Scheduler with the Microsoft Management Console (MMC) and/or the Command Line Interface (CLI) to submit a job and pass environment variables. For more information about MMC and CLI, see the Microsoft Help and Support page at the Microsoft Web site (http://www.microsoft.com/).
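As a rough illustration of how a value for OMP_NUM_THREADS might be chosen before it is passed with -genv, the C sketch below applies the guidance given under Determining the Number of Threads: halve the visible processor count when Hyper-Threading is enabled, then divide by the number of MPI ranks per node. The function name threads_per_rank and its parameters are hypothetical and are not part of Intel MKL or any MPI implementation. Rounding down, and never returning less than one, are assumptions made here so that the product of ranks and threads does not exceed the number of physical cores.

```c
#include <assert.h>

/* Hypothetical helper, not an Intel MKL or MPI function.
 * visible_cpus   - logical processors reported by the OS for one node
 * hyperthreading - nonzero if Hyper-Threading Technology is enabled
 * ranks_per_node - number of MPI ranks started on the node
 * Returns a candidate OMP_NUM_THREADS value for each rank. */
int threads_per_rank(int visible_cpus, int hyperthreading, int ranks_per_node)
{
    assert(visible_cpus > 0 && ranks_per_node > 0);

    /* With Hyper-Threading enabled, use only half of the visible processors. */
    int cores = hyperthreading ? visible_cpus / 2 : visible_cpus;

    /* Integer division rounds down, so ranks * threads never exceeds cores. */
    int threads = cores / ranks_per_node;

    /* Assumption: always run at least one thread per rank. */
    return threads > 0 ? threads : 1;
}
```

For a node that shows 16 processors with Hyper-Threading enabled and runs 2 MPI ranks, the sketch yields 4, which would correspond to a command line beginning with mpiexec -genv OMP_NUM_THREADS 4.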
Building ScaLAPACK Tests

To build ScaLAPACK tests:
• For the IA-32 architecture, add mkl_scalapack_core.lib to your link command.
• For the Intel® 64 architecture, add mkl_scalapack_lp64.lib or mkl_scalapack_ilp64.lib, depending on the desired interface.

Examples for Linking with ScaLAPACK and Cluster FFT

This section provides examples of linking with ScaLAPACK and Cluster FFT.

Note that a binary linked with ScaLAPACK runs the same way as any other MPI application (refer to the documentation that comes with your MPI implementation).

For further linking examples, see the support website for Intel products at http://www.intel.com/software/products/support/.

See Also
Directory Structure in Detail

Examples for Linking a C Application

These examples illustrate linking of an application whose main module is in C under the following conditions:
• MPICH2 1.0.x is installed in c:\mpich2x64.
• You use the Intel® C++ Compiler 10.0 or higher.

To link with ScaLAPACK using LP64 interface for a cluster of Intel® 64 architecture based systems, set the environment variable and use the link line as follows:

    set lib=c:\mpich2x64\lib;\lib\intel64;%lib%
    icl mkl_scalapack_lp64.lib mkl_blacs_mpich2_lp64.lib mkl_intel_lp64.lib mkl_intel_thread.lib mkl_core.lib libiomp5md.lib mpi.lib cxx.lib bufferoverflowu.lib

To link with Cluster FFT using LP64 interface for a cluster of Intel® 64 architecture based systems, set the environment variable and use the link line as follows:

    set lib=c:\mpich2x64\lib;\lib\intel64;%lib%
    icl mkl_cdft_core.lib mkl_blacs_mpich2_lp64.lib mkl_intel_lp64.lib mkl_intel_thread.lib mkl_core.lib libiomp5md.lib mpi.lib cxx.lib bufferoverflowu.lib

See Also
Linking with ScaLAPACK and Cluster FFTs
Linking with System Libraries

Examples for Linking a Fortran Application

These examples illustrate linking of an application whose main module is in Fortran under the following conditions:
• Microsoft Windows Compute Cluster Pack SDK is installed in c:\MS CCP SDK.
• You use the Intel® Fortran Compiler 10.0 or higher.

To link with ScaLAPACK using LP64 interface for a cluster of Intel® 64 architecture based systems, set the environment variable and use the link line as follows:

    set lib="c:\MS CCP SDK\Lib\AMD64";\lib\intel64;%lib%
    ifort mkl_scalapack_lp64.lib mkl_blacs_mpich2_lp64.lib mkl_intel_lp64.lib mkl_intel_thread.lib mkl_core.lib libiomp5md.lib msmpi.lib bufferoverflowu.lib

To link with Cluster FFTs using LP64 interface for a cluster of Intel® 64 architecture based systems, set the environment variable and use the link line as follows:

    set lib="c:\MS CCP SDK\Lib\AMD64";\lib\intel64;%lib%
    ifort mkl_cdft_core.lib mkl_blacs_mpich2_lp64.lib mkl_intel_lp64.lib mkl_intel_thread.lib mkl_core.lib libiomp5md.lib msmpi.lib bufferoverflowu.lib

See Also
Linking with ScaLAPACK and Cluster FFTs
Linking with System Libraries

Programming with Intel® Math Kernel Library in Integrated Development Environments (IDE)

Configuring Your Integrated Development Environment to Link with Intel Math Kernel Library

Configuring the Microsoft Visual C/C++* Development System to Link with Intel® MKL

Steps for configuring Microsoft Visual C/C++* Development System for linking with Intel® Math Kernel Library (Intel® MKL) depend on whether you installed the C++ Integration(s) in Microsoft Visual Studio* component of the Intel® Composer XE:
• If you installed the integration component, see Automatically Linking Your Microsoft Visual C/C++ Project with Intel MKL.
• If you did not install the integration component or need more control over Intel MKL libraries to link, you can configure the Microsoft Visual C++* 2005, Visual C++* 2008, or Visual C++* 2010 development system by performing the following steps.
  Though some versions of the Visual C++* development system may vary slightly in the menu items mentioned below, the fundamental configuring steps are applicable to all these versions.
  1. From the menu, select View > Solution Explorer (and make sure this window is active).
  2. Select Tools > Options > Projects > VC++ Directories.
  3. From the Show directories for list, select Include Files. Add the directory for the Intel MKL include files, that is, \include
  4. From the Show directories for list, select Library Files. Add architecture-specific directories for Intel MKL and OpenMP* libraries, for example: \lib\ia32 and \compiler\lib\ia32
  5. From the Show directories for list, select Executable Files. Add architecture-specific directories with dynamic-link libraries:
     • For OpenMP* support, for example: \redist\ia32\compiler
     • For Intel MKL (only if you link dynamically), for example: \redist\ia32\mkl
  6. Select Project>Properties>Configuration Properties>Linker>Input>Additional Dependencies. Add the libraries required, for example, mkl_intel_c.lib mkl_intel_thread.lib mkl_core.lib libiomp5md.lib

See Also
Intel® Software Documentation Library
Linking in Detail

Configuring Intel® Visual Fortran to Link with Intel MKL

Steps for configuring Intel® Visual Fortran for linking with Intel® Math Kernel Library (Intel® MKL) depend on whether you installed the Visual Fortran Integration(s) in Microsoft Visual Studio* component of the Intel® Composer XE:
• If you installed the integration component, see Automatically Linking Your Intel® Visual Fortran Project with Intel® MKL.
• If you did not install the integration component or need more control over Intel MKL libraries to link, you can configure your project as follows:
  1. Select Project>Properties>Linker>General>Additional Library Directories. Add architecture-specific directories for Intel MKL and OpenMP* libraries, for example: \lib\ia32 and \compiler\lib\ia32
  2. Select Project>Properties>Linker>Input>Additional Dependencies.
     Insert names of the required libraries, for example: mkl_intel_c.lib mkl_intel_thread.lib mkl_core.lib libiomp5md.lib
  3. Select Project>Properties>Debugging>Environment. Add architecture-specific paths to dynamic-link libraries:
     • For OpenMP* support; for example: enter PATH=%PATH%;\redist\ia32\compiler
     • For Intel MKL (only if you link dynamically); for example: enter PATH=%PATH%;\redist\ia32\mkl

See Also
Intel® Software Documentation Library

Running an Intel MKL Example in the Visual Studio* 2008 IDE

This section explains how to create and configure projects with the Intel® Math Kernel Library (Intel® MKL) examples in Microsoft Visual Studio* 2008. For Intel MKL examples where the instructions below do not work, see Known Limitations.

To run the Intel MKL C examples in Microsoft Visual Studio 2008:
1. Do either of the following:
   • Install Intel® C/C++ Compiler and integrate it into Visual Studio (recommended).
   • Use the Microsoft Visual C++* 2008 Compiler integrated into Visual Studio*.
2. Create, configure, and run the Intel C/C++ and/or Microsoft Visual C++* 2008 project.

To run the Intel MKL Fortran examples in Microsoft Visual Studio 2008:
1. Install Intel® Visual Fortran Compiler and integrate it into Visual Studio. The default installation of the Intel Visual Fortran Compiler performs this integration. For more information, see the Intel Visual Fortran Compiler documentation.
2. Create, configure, and run the Intel Visual Fortran project.

Creating, Configuring, and Running the Intel® C/C++ and/or Visual C++* 2008 Project

This section demonstrates how to create a Visual C/C++ project using an Intel® Math Kernel Library (Intel® MKL) example in Microsoft Visual Studio 2008.

The instructions below create a Win32/Debug project running one Intel MKL example in a Console window. For details on creation of different kinds of Microsoft Visual Studio projects, refer to MSDN Visual Studio documentation at http://www.microsoft.com.

To create and configure the Win32/Debug project running an Intel MKL C example with the Intel® C/C++ Compiler integrated into Visual Studio and/or Microsoft Visual C++* 2008, perform the following steps:
1. Create a C Project:
   a. Open Visual Studio 2008.
   b. On the main menu, select File > New > Project to open the New Project window.
   c. Select Project Types > Visual C++ > Win32, then select Templates > Win32 Console Application. In the Name field, type a name, for example, MKL_CBLAS_CAXPYIX, and click OK. The New Project window closes, and the Win32 Application Wizard window opens.
   d. Select Next, then select Application Settings, check Additional options > Empty project, and click Finish. The Win32 Application Wizard window closes.
   The next steps are performed inside the Solution Explorer window. To open it, select View > Solution Explorer from the main menu.
2. (optional) To switch to the Intel C/C++ project, right-click the project and from the drop-down menu, select Convert to use Intel® C++ Project System. (The menu item is available if the Intel® C/C++ Compiler is integrated into Visual Studio.)
3. Add sources of the Intel MKL example to the project:
   a. Right-click the Source Files folder under the project and select Add > Existing Item... from the drop-down menu. The Add Existing Item window opens.
   b. Browse to the Intel MKL example directory, for example, \examples\cblas\source. Select the example file and supporting files with extension ".c" (C sources), for example, select files cblas_caxpyix.c and common_func.c. For the list of supporting files in each example directory, see Support Files for Intel MKL Examples. Click Add. The Add Existing Item window closes, and selected files appear in the Source Files folder in Solution Explorer.
   The next steps adjust the properties of the project.
4. Select the project.
5. On the main menu, select Project > Properties to open the Property Pages window.
6.
Set Intel MKL Include dependencies:
   a. Select Configuration Properties > C/C++ > General. In the right-hand part of the window, select Additional Include Directories > ... (the browse button). The Additional Include Directories window opens.
   b. Click the New Line button (the first button in the uppermost row). When the new line appears in the window, click the browse button. The Select Directory window opens.
   c. Browse to the \include directory and click OK. The Select Directory window closes, and the full path to the Intel MKL include directory appears in the Additional Include Directories window.
   d. Click OK to close the window.
7. Set library dependencies:
   a. Select Configuration Properties > Linker > General. In the right-hand part of the window, select Additional Library Directories > ... (the browse button). The Additional Library Directories window opens.
   b. Click the New Line button (the first button in the uppermost row). When the new line appears in the window, click the browse button. The Select Directory window opens.
   c. Browse to the directory with the Intel MKL libraries for your architecture, which is one of {ia32, intel64}, for example: \lib\ia32. (For most laptop and desktop computers, the architecture is ia32.) Click OK. The Select Directory window closes, and the full path to the Intel MKL libraries appears in the Additional Library Directories window.
   d. Click the New Line button again. When the new line appears in the window, click the browse button. The Select Directory window opens.
   e. Browse to the compiler\lib\ directory for your architecture, which is one of {ia32, intel64}, for example: \compiler\lib\ia32. Click OK. The Select Directory window closes, and the specified full path appears in the Additional Library Directories window.
   f. Click OK to close the Additional Library Directories window.
   g. Select Configuration Properties > Linker > Input. In the right-hand part of the window, select Additional Dependencies > ... (the browse button). The Additional Dependencies window opens.
   h.
Type the libraries required, for example, for the ia32 architecture, type mkl_intel_c.lib mkl_intel_thread.lib mkl_core.lib libiomp5md.lib. For more details, see Linking in Detail.
   i. Click OK to close the Additional Dependencies window.
   j. If the Intel MKL example directory does not contain a data directory, skip the next step.
8. Set data dependencies for the Intel MKL example:
   a. Select Configuration Properties > Debugging. In the right-hand part of the window, select Command Arguments > <Edit...>. The Command Arguments window opens.
   b. Type the path to the proper data file in quotes. The name of the data file is the same as the name of the example file, with a "d" extension, for example, "\examples\cblas\data\cblas_caxpyix.d".
   c. Click OK to close the Command Arguments window.
9. Click OK to close the Property Pages window.
10. Certain examples do not pause before the end of execution. To see the results printed in the Console window, set a breakpoint at the very last 'return 0;' statement or add a call to 'getchar();' before the last 'return 0' statement.
11. To build the solution, select Build > Build Solution.
    NOTE You may see warnings about unsafe functions and variables. To get rid of these warnings, go to Project > Properties, and when the Property Pages window opens, go to Configuration Properties > C/C++ > Preprocessor. In the right-hand part of the window, select Preprocessor Definitions, add _CRT_SECURE_NO_WARNINGS, and click OK.
12. To run the example, select Debug > Start Debugging. The Console window opens.
13. You can see the results of the example in the Console window. If you used the 'getchar();' statement to pause execution of the program, press Enter to complete the run. If you used a breakpoint to pause execution of the program, select Debug > Continue. The Console window closes.
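The pause described in step 10 can be sketched in C as follows. This is only an illustration: wait_for_enter is a hypothetical helper, not part of the Intel MKL examples, and it differs from a bare getchar() call in that it consumes the rest of the input line, so a stray character typed earlier does not end the pause too soon.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper, not part of the Intel MKL examples.
 * Reads from `in` until a newline or end of input is reached and
 * returns the number of characters consumed. */
int wait_for_enter(FILE *in)
{
    assert(in != NULL);

    int consumed = 0;
    int ch;
    while ((ch = getc(in)) != EOF) {
        consumed++;
        if (ch == '\n')
            break;   /* the user pressed Enter */
    }
    return consumed;
}
```

In an example program, a call such as wait_for_enter(stdin); placed just before the last 'return 0;' statement keeps the Console window open until Enter is pressed.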
See Also
Running an Intel MKL Example in the Visual Studio* 2008 IDE

Creating, Configuring, and Running the Intel Visual Fortran Project

This section demonstrates how to create an Intel Visual Fortran project running an Intel MKL example in Microsoft Visual Studio 2008.

The instructions below create a Win32/Debug project running one Intel MKL example in a Console window. For details on creation of different kinds of Microsoft Visual Studio projects, refer to MSDN Visual Studio documentation at http://www.microsoft.com.

To create and configure a Win32/Debug project running the Intel MKL Fortran example with the Intel Visual Fortran Compiler integrated into Visual Studio, perform the following steps:
1. Create a Visual Fortran Project:
   a. Open Visual Studio 2008.
   b. On the main menu, select File > New > Project to open the New Project window.
   c. Select Project Types > Intel® Fortran > Console Application, then select Templates > Empty Project. When done, in the Name field, type a name, for example, MKL_PDETTF_D_TRIG_TRANSFORM_BVP, and click OK. The New Project window closes.
   The next steps are performed inside the Solution Explorer window. To open it, select View > Solution Explorer from the main menu.
2. Add sources of Intel MKL example to the project:
   a. Right-click the Source Files folder under the project and select Add > Existing Item... from the drop-down menu. The Add Existing Item window opens.
   b. Browse to the Intel MKL example directory, for example, \examples\pdettf\source. Select the example file and supporting files with extension ".f" or ".f90" (Fortran sources). For example, select the d_trig_tforms_bvp.f90 file. For the list of supporting files in each example directory, see Support Files for Intel MKL Examples. Click Add. The Add Existing Item window closes, and the selected files appear in the Source Files folder in Solution Explorer.
   Some examples with the "use" statements require the next two steps.
   c.
Right-click the Header Files folder under the project and select Add > Existing Item... from the drop-down menu. The Add Existing Item window opens.
   d. Browse to the \include directory. Select the header files that appear in the "use" statements. For example, select the mkl_dfti.f90 and mkl_trig_transforms.f90 files. Click Add. The Add Existing Item window closes, and the selected files appear in the Header Files folder in Solution Explorer.
   The next steps adjust the properties of the project:
3. Select the project.
4. On the main menu, select Project > Properties to open the Property Pages window.
5. Set the Intel MKL include dependencies:
   a. Select Configuration Properties > Fortran > General. In the right-hand part of the window, select Additional Include Directories > <Edit...>. The Additional Include Directories window opens.
   b. Type the Intel MKL include directory in quotes: "\include". Click OK to close the window.
6. Select Configuration Properties > Fortran > Preprocessor. In the right-hand part of the window, select Preprocess Source File > Yes (default is No). This step is recommended because some examples require preprocessing.
7. Set library dependencies:
   a. Select Configuration Properties > Linker > General. In the right-hand part of the window, select Additional Library Directories > <Edit...>. The Additional Library Directories window opens.
   b. Type the directory with the Intel MKL libraries for your architecture in quotes. The architecture is one of {ia32, intel64}, for example: "\lib\ia32". (For most laptop and desktop computers, the architecture is ia32.) Click OK to close the window.
   c. Select Configuration Properties > Linker > Input. In the right-hand part of the window, select Additional Dependencies and type the libraries required, for example, for the ia32 architecture, type mkl_intel_c.lib mkl_intel_thread.lib mkl_core.lib libiomp5md.lib.
8. In the Property Pages window, click OK to close the window.
9. Some examples do not pause before the end of execution.
   To see the results printed in the Console window, set a breakpoint at the very end of the program or add the 'pause' statement before the last 'end' statement.
10. To build the solution, select Build > Build Solution.
11. To run the example, select Debug > Start Debugging. The Console window opens.
12. You can see the results of the example in the Console window. If you used the 'pause' statement to pause execution of the program, press Enter to complete the run. If you used a breakpoint to pause execution of the program, select Debug > Continue. The Console window closes.

Support Files for Intel® Math Kernel Library Examples

Below is the list of support files that have to be added to the project for respective examples:

    examples\cblas\source:
        common_func.c
    examples\dftc\source:
        dfti_example_status_print.c
        dfti_example_support.c

Known Limitations of the Project Creation Procedure

You cannot create a Visual Studio* project using the instructions from Creating, Configuring, and Running the Intel® C/C++ and/or Visual C++* 2008 Project or Creating, Configuring, and Running the Intel® Visual Fortran Project for examples from the following directories:

    examples\blas
    examples\blas95
    examples\cdftc
    examples\cdftf
    examples\dftf
    examples\fftw2x_cdf
    examples\fftw2xc
    examples\fftw2xf
    examples\fftw3xc
    examples\fftw3xf
    examples\java
    examples\lapack
    examples\lapack95

Getting Assistance for Programming in the Microsoft Visual Studio* IDE

Viewing Intel MKL Documentation in Visual Studio* IDE

Viewing Intel MKL Documentation in Document Explorer (Visual Studio* 2005/2008 IDE)

Intel MKL documentation is integrated in the Visual Studio IDE (VS) help collection. To open Intel MKL help:
1. Select Help > Contents from the menu. This displays the list of VS Help collections.
2. Click Intel Math Kernel Library Help.
3. In the help tree that expands, click Intel MKL Reference Manual.

To open the help index, select Help > Index from the menu.

To search in the help, select Help > Search from the menu and enter a search string.

You can filter Visual Studio Help collections to show only content related to installed Intel tools. To do this, select "Intel" from the Filtered by list. This hides the contents and index entries for all collections that do not refer to Intel.

Accessing Intel MKL Documentation in Visual Studio* 2010 IDE

To access the Intel MKL documentation in Visual Studio* 2010 IDE:
• Configure the IDE to use local help (once). To do this, go to Help > Manage Help Settings and check I want to use local help.
• Use the Help > View Help menu item to view a list of available help collections and open the Intel MKL documentation.

Using Context-Sensitive Help

When typing your code in the Visual Studio* (VS) IDE Code Editor, you can get context-sensitive help using the F1 Help and Dynamic Help features.

F1 Help

To open the help topic relevant to the current selection, press F1. In particular, to open the help topic describing an Intel MKL function called in your code, select the function name and press F1. The topic with the function description opens in the window that displays search results.

Dynamic Help

Dynamic Help also provides access to topics relevant to the current selection or to the text being typed. Links to all relevant topics are displayed in the Dynamic Help window.

To get the list of relevant topics each time you select the Intel MKL function name or as you type it in your code, open the Dynamic Help window by selecting Help > Dynamic Help from the menu.

To open a topic from the list, click the appropriate link in the Dynamic Help window, shown in the above figure. Typically only one link corresponds to each Intel MKL function.

Using the IntelliSense* Capability

IntelliSense is a set of native Visual Studio* (VS) IDE features that make language references easily accessible.

The user programming with Intel MKL in the VS Code Editor can employ two IntelliSense features: Parameter Info and Complete Word.

Both features use header files. Therefore, to benefit from IntelliSense, make sure the path to the include files is specified in the VS or solution settings. For example, see Configuring the Microsoft Visual C/C++* Development System to Link with Intel® MKL on how to do this.

Parameter Info

The Parameter Info feature displays the parameter list for a function to give information on the number and types of parameters. This feature requires adding the include statement with the appropriate Intel MKL header file to your code.

To get the list of parameters of a function specified in the header file:
1. Type the function name.
2. Type the opening parenthesis.

This brings up the tooltip with the list of the function parameters.

Complete Word

For a software library, the Complete Word feature types or prompts for the rest of the name defined in the header file once you type the first few characters of the name in your code. This feature requires adding the include statement with the appropriate Intel MKL header file to your code.

To complete the name of the function or named constant specified in the header file:
1. Type the first few characters of the name.
2. Press Alt+RIGHT ARROW or Ctrl+SPACEBAR. If you have typed enough characters to disambiguate the name, the rest of the name is typed automatically. Otherwise, a pop-up list appears with the names specified in the header file.
3. Select the name from the list, if needed.

LINPACK and MP LINPACK Benchmarks

Intel® Optimized LINPACK Benchmark for Windows* OS

Intel® Optimized LINPACK Benchmark is a generalization of the LINPACK 1000 benchmark. It solves a dense (real*8) system of linear equations (Ax=b), measures the amount of time it takes to factor and solve the system, converts that time into a performance rate, and tests the results for accuracy. The generalization is in the number of equations (N) it can solve, which is not limited to 1000. It uses partial pivoting to assure the accuracy of the results.

Do not use this benchmark to report LINPACK 100 performance because that is a compiled-code only benchmark. This is a shared-memory (SMP) implementation which runs on a single platform. Do not confuse this benchmark with:
• MP LINPACK, which is a distributed memory version of the same benchmark.
• LINPACK, the library, which has been expanded upon by the LAPACK library.
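To make the phrase "converts that time into a performance rate" concrete, here is a minimal C sketch of such a conversion. It assumes the operation count conventionally credited to the LINPACK benchmark for an N-by-N system, 2/3*N^3 + 2*N^2 floating-point operations; this guide does not state the formula used by the Intel Optimized LINPACK Benchmark, so treat it as an assumption. The function names linpack_flops and linpack_gflops are hypothetical.

```c
#include <assert.h>

/* Assumed operation count for factoring and solving a dense n-by-n
 * system: 2/3*n^3 + 2*n^2 floating-point operations. */
double linpack_flops(double n)
{
    return (2.0 / 3.0) * n * n * n + 2.0 * n * n;
}

/* Hypothetical helper: converts a measured solve time in seconds into
 * a rate in GFlop/s (10^9 floating-point operations per second). */
double linpack_gflops(double n, double seconds)
{
    assert(n > 0.0 && seconds > 0.0);
    return linpack_flops(n) / seconds / 1.0e9;
}
```

Under this assumption, a system of 1000 equations solved in one second corresponds to roughly 0.67 GFlop/s.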
Intel provides optimized versions of the LINPACK benchmarks to help you obtain high LINPACK benchmark results on your genuine Intel processor systems more easily than with the High Performance Linpack (HPL) benchmark. Use this package to benchmark your SMP machine. Additional information on this software as well as other Intel software performance products is available at http://www.intel.com/software/products/.

Contents of the Intel® Optimized LINPACK Benchmark

The Intel Optimized LINPACK Benchmark for Windows* OS contains the following files, located in the benchmarks\linpack\ subdirectory of the Intel® Math Kernel Library (Intel® MKL) directory:

File in benchmarks\linpack\    Description
linpack_xeon32.exe    The 32-bit program executable for a system based on Intel® Xeon® processor or Intel® Xeon® processor MP with or without Streaming SIMD Extensions 3 (SSE3).
linpack_xeon64.exe    The 64-bit program executable for a system with Intel® Xeon® processor using Intel® 64 architecture.
runme_xeon32.bat    A sample shell script for executing a pre-determined problem set for linpack_xeon32.exe. OMP_NUM_THREADS set to 2 processors.
runme_xeon64.bat    A sample shell script for executing a pre-determined problem set for linpack_xeon64.exe. OMP_NUM_THREADS set to 4 processors.
lininput_xeon32    Input file for pre-determined problem for the runme_xeon32 script.
lininput_xeon64    Input file for pre-determined problem for the runme_xeon64 script.
win_xeon32.txt    Result of the runme_xeon32 script execution.
win_xeon64.txt    Result of the runme_xeon64 script execution.
help.lpk    Simple help file.
xhelp.lpk    Extended help file.

See Also
High-level Directory Structure

Running the Software

To obtain results for the pre-determined sample problem sizes on a given system, type one of the following, as appropriate:
runme_xeon32.bat
runme_xeon64.bat
To run the software for other problem sizes, see the extended help included with the program.
Extended help can be viewed by running the program executable with the -e option:
linpack_xeon32.exe -e
linpack_xeon64.exe -e

The pre-defined data input files lininput_xeon32 and lininput_xeon64 are provided merely as examples. Different systems have different numbers of processors or amounts of memory and thus require new input files. The extended help can be used for insight into proper ways to change the sample input files.

Each input file requires at least the following amount of memory:
lininput_xeon32    2 GB
lininput_xeon64    16 GB

If the system has less memory than the above sample data input requires, you may need to edit or create your own data input files, as explained in the extended help.

Each sample script uses the OMP_NUM_THREADS environment variable to set the number of processors it is targeting. To optimize performance on a different number of physical processors, change that line appropriately. If you run the Intel Optimized LINPACK Benchmark without setting the number of threads, it will default to the number of cores according to the OS. You can find the settings for this environment variable in the runme_* sample scripts. If the settings do not yet match the situation for your machine, edit the script.

Known Limitations of the Intel® Optimized LINPACK Benchmark

The following limitations are known for the Intel Optimized LINPACK Benchmark for Windows* OS:
• Intel Optimized LINPACK Benchmark is threaded to effectively use multiple processors. So, in multiprocessor systems, best performance will be obtained with the Intel® Hyper-Threading Technology turned off, which ensures that the operating system assigns threads to physical processors only.
• If an incomplete data input file is given, the binaries may either hang or fault. See the sample data input files and/or the extended help for insight into creating a correct data input file.

Intel® Optimized MP LINPACK Benchmark for Clusters

Overview of the Intel® Optimized MP LINPACK Benchmark for Clusters

The Intel® Optimized MP LINPACK Benchmark for Clusters is based on modifications and additions to HPL 2.0 from Innovative Computing Laboratories (ICL) at the University of Tennessee, Knoxville (UTK). The Intel Optimized MP LINPACK Benchmark for Clusters can be used for Top 500 runs (see http://www.top500.org). To use the benchmark you need to be intimately familiar with the HPL distribution and usage. The Intel Optimized MP LINPACK Benchmark for Clusters provides some additional enhancements and bug fixes designed to make the HPL usage more convenient, and explains Intel® Message-Passing Interface (MPI) settings that may enhance performance. The .\benchmarks\mp_linpack directory adds techniques to minimize search times frequently associated with long runs.

The Intel® Optimized MP LINPACK Benchmark for Clusters is an implementation of the Massively Parallel MP LINPACK benchmark by means of HPL code. It solves a random dense (real*8) system of linear equations (Ax=b), measures the amount of time it takes to factor and solve the system, converts that time into a performance rate, and tests the results for accuracy.
You can solve any size (N) system of equations that fits into memory. The benchmark uses full row pivoting to ensure the accuracy of the results.

Use the Intel Optimized MP LINPACK Benchmark for Clusters on a distributed memory machine. On a shared memory machine, use the Intel Optimized LINPACK Benchmark.

Intel provides optimized versions of the LINPACK benchmarks to help you obtain high LINPACK benchmark results on your systems based on genuine Intel processors more easily than with the HPL benchmark. Use the Intel Optimized MP LINPACK Benchmark to benchmark your cluster. The prebuilt binaries require that Intel® MPI 3.x be installed on the cluster first. The run-time version of Intel MPI is free and can be downloaded from www.intel.com/software/products/.

The Intel package includes software developed at the University of Tennessee, Knoxville, Innovative Computing Laboratories, and neither the University nor ICL endorses or promotes this product. Although HPL 2.0 is redistributable under certain conditions, this particular package is subject to the Intel MKL license.

Intel MKL has introduced a new functionality into MP LINPACK, which is called a hybrid build, while continuing to support the older version. The term hybrid refers to special optimizations added to take advantage of mixed OpenMP*/MPI parallelism. If you want to use one MPI process per node and to achieve further parallelism by means of OpenMP, use the hybrid build. In general, the hybrid build is useful when the number of MPI processes per core is less than one. If you want to rely exclusively on MPI for parallelism and use one MPI process per core, use the non-hybrid build.

In addition to supplying certain hybrid prebuilt binaries, Intel MKL supplies some hybrid prebuilt libraries for Intel® MPI to take advantage of the additional OpenMP* optimizations. If you wish to use an MPI version other than Intel MPI, you can do so by using the MP LINPACK source provided.
You can use the source to build a non-hybrid version that may be used in a hybrid mode, but it would be missing some of the optimizations added to the hybrid version.

Non-hybrid builds are the default of the source code makefiles provided. In some cases, the use of the hybrid mode is required for external reasons. If there is a choice, the non-hybrid code may be faster. To use the non-hybrid code in a hybrid mode, use the threaded version of Intel MKL BLAS, link with a thread-safe MPI, and call function MPI_Init_thread() so as to indicate a need for MPI to be thread-safe.

Contents of the Intel® Optimized MP LINPACK Benchmark for Clusters

The Intel Optimized MP LINPACK Benchmark for Clusters (MP LINPACK Benchmark) includes the HPL 2.0 distribution in its entirety, as well as the modifications delivered in the files listed in the table below and located in the benchmarks\mp_linpack\ subdirectory of the Intel MKL directory.

NOTE Because MP LINPACK Benchmark includes the entire HPL 2.0 distribution, which provides a configuration for Linux* OS only, some Linux OS files remain in the directory.
Directory/File in benchmarks\mp_linpack\    Contents
testing\ptest\HPL_pdtest.c    HPL 2.0 code modified to display captured DGEMM information in ASYOUGO2_DISPLAY if it was captured (for details, see New Features).
src\blas\HPL_dgemm.c    HPL 2.0 code modified to capture DGEMM information, if desired, from ASYOUGO2_DISPLAY.
src\grid\HPL_grid_init.c    HPL 2.0 code modified to do additional grid experiments originally not in HPL 2.0.
src\pgesv\HPL_pdgesvK2.c    HPL 2.0 code modified to do ASYOUGO and ENDEARLY modifications.
src\pgesv\HPL_pdgesv0.c    HPL 2.0 code modified to do ASYOUGO, ASYOUGO2, and ENDEARLY modifications.
testing\ptest\HPL.dat    HPL 2.0 sample HPL.dat modified.
makes    All the makefiles in this directory have been rebuilt in the Windows OS distribution.
testing\ptimer\    Some files in here have been modified in the Windows OS distribution.
testing\timer\    Some files in here have been modified in the Windows OS distribution.
Make    (New) Sample architecture makefile for nmake utility to be used on processors based on the IA-32 and Intel® 64 architectures and Windows OS.
bin_intel\ia32\xhpl_ia32.exe    (New) Prebuilt binary for the IA-32 architecture, Windows OS, and Intel® MPI.
bin_intel\intel64\xhpl_intel64.exe    (New) Prebuilt binary for the Intel® 64 architecture, Windows OS, and Intel MPI.
lib_hybrid\ia32\libhpl_hybrid.lib    (New) Prebuilt library with the hybrid version of MP LINPACK for the IA-32 architecture and Intel MPI.
lib_hybrid\intel64\libhpl_hybrid.lib    (New) Prebuilt library with the hybrid version of MP LINPACK for the Intel® 64 architecture and Intel MPI.
bin_intel\ia32\xhpl_hybrid_ia32.exe    (New) Prebuilt hybrid binary for the IA-32 architecture, Windows OS, and Intel MPI.
bin_intel\intel64\xhpl_hybrid_intel64.exe    (New) Prebuilt hybrid binary for the Intel® 64 architecture, Windows OS, and Intel MPI.
nodeperf.c    (New) Sample utility that tests the DGEMM speed across the cluster.

See Also
High-level Directory Structure

Building the MP LINPACK

The MP LINPACK Benchmark contains a few sample architecture makefiles. You can edit them to fit your specific configuration. Specifically:
• Set TOPdir to the directory that MP LINPACK is being built in.
• Set MPI variables, that is, MPdir, MPinc, and MPlib.
• Specify the location of Intel MKL and of the files to be used (LAdir, LAinc, LAlib).
• Adjust compiler and compiler/linker options.
• Specify the version of MP LINPACK you are going to build (hybrid or non-hybrid) by setting the version parameter for the nmake command. For example:
nmake arch=intel64 mpi=intelmpi version=hybrid install

For some sample cases, the makefiles contain values that must be common. However, you need to be familiar with building an HPL and picking appropriate values for these variables.

New Features of Intel® Optimized MP LINPACK Benchmark

The toolset is basically identical to the HPL 2.0 distribution. There are a few changes that are optionally compiled in and disabled until you specifically request them. These new features are:

ASYOUGO: Provides non-intrusive performance information while runs proceed. There are only a few outputs and this information does not impact performance. This is especially useful because many runs can go for hours without any information.

ASYOUGO2: Provides slightly intrusive additional performance information by intercepting every DGEMM call.

ASYOUGO2_DISPLAY: Displays the performance of all the significant DGEMMs inside the run.

ENDEARLY: Displays a few performance hints and then terminates the run early.

FASTSWAP: Inserts the LAPACK-optimized DLASWP into HPL's code. You can experiment with this to determine best results.

HYBRID: Establishes the Hybrid OpenMP/MPI mode of MP LINPACK, providing the possibility to use threaded Intel MKL and prebuilt MP LINPACK hybrid libraries.
CAUTION Use this option only with an Intel compiler and the Intel® MPI library version 3.1 or higher. It is also recommended to use compiler version 10.0 or higher.

Benchmarking a Cluster

To benchmark a cluster, follow the sequence of steps below (some of them are optional). Pay special attention to the iterative steps 3 and 4. They make a loop that searches for HPL parameters (specified in HPL.dat) that enable you to reach the top performance of your cluster.
1. Install HPL and make sure HPL is functional on all the nodes.
2. You may run nodeperf.c (included in the distribution) to see the performance of DGEMM on all the nodes. Compile nodeperf.c with your MPI and Intel MKL. For example:
icl /Za /O3 /w /D_WIN_ /I"\include" "\" "\lib\intel64\mkl_core.lib" "\lib\intel64\libiomp5md.lib" nodeperf.c
where the MPI library is msmpi.lib in the case of Microsoft* MPI and mpi.lib in the case of MPICH.
Launching nodeperf.c on all the nodes is especially helpful in a very large cluster. nodeperf enables quick identification of the potential problem spot without numerous small MP LINPACK runs around the cluster in search of the bad node. It goes through all the nodes, one at a time, and reports the performance of DGEMM followed by some host identifier. Therefore, the higher the DGEMM performance, the faster that node was performing.
3. Edit HPL.dat to fit your cluster needs. Read through the HPL documentation for ideas on this. Note, however, that you should use at least 4 nodes.
4. Make an HPL run, using compile options such as ASYOUGO, ASYOUGO2, or ENDEARLY to aid in your search. These options enable you to gain insight into the performance sooner than HPL would normally give this insight. When doing so, follow these recommendations:
• Use MP LINPACK, which is a patched version of HPL, to save time in the search. All performance intrusive features are compile-optional in MP LINPACK.
That is, if you do not use the new options to reduce search time, these features are disabled. The primary purpose of the additions is to assist you in finding solutions. HPL requires a long time to search for many different parameters. In MP LINPACK, the goal is to get the best possible number. Given that the input is not fixed, there is a large parameter space you must search over. An exhaustive search of all possible inputs is impractically large even for a powerful cluster. MP LINPACK optionally prints information on performance as it proceeds. You can also terminate early.
• Save time by compiling with -DENDEARLY -DASYOUGO2 and using a negative threshold (do not use a negative threshold on the final run that you intend to submit as a Top500 entry). Set the threshold in line 13 of the HPL 2.0 input file HPL.dat.
• If you are going to run a problem to completion, do it with -DASYOUGO.
5. Using the quick performance feedback, return to step 3 and iterate until you are sure that the performance is as good as possible.

See Also
Options to Reduce Search Time

Options to Reduce Search Time

Running large problems to completion on large numbers of nodes can take many hours. The search space for MP LINPACK is also large: not only can you run a problem of any size, but you can also vary block sizes, grid layouts, lookahead steps, factorization methods, and so on. It can be a large waste of time to run a large problem to completion only to discover it ran 0.01% slower than your previous best problem.

Use the following options to reduce the search time:
• -DASYOUGO
• -DENDEARLY
• -DASYOUGO2

Use -DASYOUGO2 cautiously because it does have a marginal performance impact. To see DGEMM internal performance, compile with -DASYOUGO2 and -DASYOUGO2_DISPLAY. These options provide a lot of useful DGEMM performance information at the cost of around 0.2% performance loss.
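To give a feel for how quickly the search space described above grows, the following sketch enumerates candidate runs as combinations of problem size, block size, and process grid. It is only an illustration under stated assumptions: the function names and the sample values are hypothetical, only grids with P <= Q are listed to keep the output short, and lookahead depth, factorization method, and other HPL.dat parameters are left out.

```python
from itertools import product
from math import isqrt


def process_grids(nprocs):
    """Return all (P, Q) pairs with P * Q == nprocs and P <= Q."""
    return [(p, nprocs // p)
            for p in range(1, isqrt(nprocs) + 1)
            if nprocs % p == 0]


def candidate_runs(nprocs, problem_sizes, block_sizes):
    """Cartesian product of problem sizes, block sizes, and process grids."""
    return [(n, nb, p, q)
            for n, nb, (p, q) in product(problem_sizes, block_sizes,
                                         process_grids(nprocs))]


if __name__ == "__main__":
    # Hypothetical values chosen only for illustration.
    runs = candidate_runs(16, [80000, 100000], [128, 192, 256])
    print(len(runs), "candidate runs")
    for run in runs[:3]:
        print(run)
```

Even this small example yields 18 full runs; adding further parameters multiplies the count, which is the motivation for terminating unpromising runs early.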
If you want to use the old HPL, simply omit these options and recompile from scratch. To do this, try "nmake arch= clean_arch_all".

-DASYOUGO

-DASYOUGO gives performance data as the run proceeds. The performance always starts off higher and then drops because this actually happens in LU decomposition (a decomposition of a matrix into a product of lower (L) and upper (U) triangular matrices). The ASYOUGO performance estimate is usually an overestimate (because the LU decomposition slows down as it goes), but it gets more accurate as the problem proceeds. The greater the lookahead step, the less accurate the first number may be. ASYOUGO tries to estimate where one is in the LU decomposition that MP LINPACK performs and this is always an overestimate as compared to ASYOUGO2, which measures actually achieved DGEMM performance. Note that the ASYOUGO output is a subset of the information that ASYOUGO2 provides. So, refer to the description of the -DASYOUGO2 option below for the details of the output.

-DENDEARLY

-DENDEARLY terminates the problem after a few steps, so that you can set up 10 or 20 HPL runs without monitoring them, see how they all do, and then only run the fastest ones to completion. -DENDEARLY assumes -DASYOUGO. You do not need to define both, although it doesn't hurt. To avoid the residual check for a problem that terminates early, set the "threshold" parameter in HPL.dat to a negative number when testing ENDEARLY. It also sometimes gives a better picture to compile with -DASYOUGO2 when using -DENDEARLY.

Usage notes on -DENDEARLY follow:
• -DENDEARLY stops the problem after a few iterations of DGEMM on the block size (the bigger the block size, the further it gets). It prints only 5 or 6 "updates", whereas -DASYOUGO prints about 46 or so output elements before the problem completes.
• Performance for -DASYOUGO and -DENDEARLY always starts off at one speed, slowly increases, and then slows down toward the end (because that is what LU does).
-DENDEARLY is likely to terminate before it starts to slow down.
• -DENDEARLY terminates the problem early with an HPL Error exit. It means that you need to ignore the missing residual results, which are wrong because the problem never completed. However, you can get an idea what the initial performance was, and if it looks good, then run the problem to completion without -DENDEARLY. To avoid the error check, you can set HPL's threshold parameter in HPL.dat to a negative number.
• Though -DENDEARLY terminates early, HPL treats the problem as completed and computes Gflop rating as though the problem ran to completion. Ignore this erroneously high rating.
• The bigger the problem, the closer the last update that -DENDEARLY returns is to what happens when the problem runs to completion. -DENDEARLY is a poor approximation for small problems. For this reason, it is suggested that you use ENDEARLY in conjunction with ASYOUGO2, because ASYOUGO2 reports actual DGEMM performance, which can be a closer approximation to problems just starting.

-DASYOUGO2

-DASYOUGO2 gives detailed single-node DGEMM performance information. It captures all DGEMM calls (if you use Fortran BLAS) and records their data. Because of this, the routine has a marginal intrusive overhead. Unlike -DASYOUGO, which is quite non-intrusive, -DASYOUGO2 interrupts every DGEMM call to monitor its performance. Be aware of this overhead, although for big problems it is less than 0.1%.

Here is a sample ASYOUGO2 output (the first 3 non-intrusive numbers can also be found in ASYOUGO and ENDEARLY output, so it suffices to describe these numbers here):
Col=001280 Fract=0.050 Mflops=42454.99 (DT=9.5 DF=34.1 DMF=38322.78).
The problem size was N=16000 with a block size of 128. After 10 blocks, that is, 1280 columns, an output was sent to the screen. Here, the fraction of columns completed is 1280/16000=0.08.
Only up to 40 outputs are printed, at various places through the matrix decomposition: fractions 0.005 0.010 0.015 0.020 0.025 0.030 0.035 0.040 0.045 0.050 0.055 0.060 0.065 0.070 0.075 0.080 0.085 0.090 0.095 0.100 0.105 0.110 0.115 0.120 0.125 0.130 0.135 0.140 0.145 0.150 0.155 0.160 0.165 0.170 0.175 0.180 0.185 0.190 0.195 0.200 0.205 0.210 0.215 0.220 0.225 0.230 0.235 0.240 0.245 0.250 0.255 0.260 0.265 0.270 0.275 0.280 0.285 0.290 0.295 0.300 0.305 0.310 0.315 0.320 0.325 0.330 0.335 0.340 0.345 0.350 0.355 0.360 0.365 0.370 0.375 0.380 0.385 0.390 0.395 0.400 0.405 0.410 0.415 0.420 0.425 0.430 0.435 0.440 0.445 0.450 0.455 0.460 0.465 0.470 0.475 0.480 0.485 0.490 0.495 0.515 0.535 0.555 0.575 0.595 0.615 0.635 0.655 0.675 0.695 0.795 0.895. However, this problem size is so small and the block size so big by comparison that as soon as it prints the value for 0.045, it was already through 0.08 fraction of the columns. On a really big problem, the fractional number will be more accurate. It never prints more than the 112 numbers above. So, smaller problems will have fewer than 112 updates, and the biggest problems will have precisely 112 updates. Mflops is an estimate based on 1280 columns of LU being completed. However, with lookahead steps, sometimes that work is not actually completed when the output is made. Nevertheless, this is a good estimate for comparing identical runs. The 3 numbers in parenthesis are intrusive ASYOUGO2 addins. DT is the total time processor 0 has spent in DGEMM. DF is the number of billion operations that have been performed in DGEMM by one processor. Hence, the performance of processor 0 (in Gflops) in DGEMM is always DF/DT. 
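The relationships between the fields of the sample output line can be checked with a short script. This is a sketch under assumptions: the regular expression is inferred only from the single sample line quoted above, so the exact field widths and spacing of real output may differ, and the function names are hypothetical rather than part of MP LINPACK.

```python
import re

# Pattern inferred from the one sample line in the text (an assumption).
ASYOUGO2_LINE = re.compile(
    r"Col=(?P<col>\d+)\s+Fract=(?P<fract>[\d.]+)\s+Mflops=(?P<mflops>[\d.]+)"
    r"\s+\(DT=(?P<dt>[\d.]+)\s+DF=(?P<df>[\d.]+)\s+DMF=(?P<dmf>[\d.]+)\)"
)


def parse_asyougo2(line):
    """Extract the numeric fields of one ASYOUGO2 output line."""
    match = ASYOUGO2_LINE.search(line)
    if match is None:
        raise ValueError("not an ASYOUGO2 output line: " + line)
    fields = {name: float(value) for name, value in match.groupdict().items()}
    fields["col"] = int(match.group("col"))  # column count is an integer
    return fields


def dgemm_gflops(fields):
    """DGEMM rate of processor 0 in Gflops, computed as DF / DT."""
    return fields["df"] / fields["dt"]


if __name__ == "__main__":
    sample = ("Col=001280 Fract=0.050 Mflops=42454.99 "
              "(DT=9.5 DF=34.1 DMF=38322.78)")
    fields = parse_asyougo2(sample)
    n = 16000  # problem size from the example in the text
    print("columns completed:", fields["col"])
    print("actual fraction of columns:", fields["col"] / n)
    print("processor 0 DGEMM Gflops:", round(dgemm_gflops(fields), 2))
```

For the sample line, DF/DT is 34.1/9.5, or about 3.59 Gflops for processor 0, and the actual fraction of columns completed is 1280/16000 = 0.08.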
Using the number of DGEMM flops as a basis instead of the number of LU flops, you get a lower bound on performance of the run by looking at DMF, which can be compared to Mflops above. (It uses the global LU time, but the DGEMM flops are computed under the assumption that the problem is evenly distributed amongst the nodes, as only HPL's node (0,0) returns any output.)

Note that when using the above performance monitoring tools to compare different HPL.dat input data sets, you should be aware that the pattern of performance drop-off that LU experiences is sensitive to some input data. For instance, when you try very small problems, the performance drop-off from the initial values to end values is very rapid. The larger the problem, the less the drop-off, and it is probably safe to use the first few performance values to estimate the difference between a problem size 700000 and 701000, for instance. Another factor that influences the performance drop-off is the grid dimensions (P and Q). For big problems, the performance tends to fall off less from the first few steps when P and Q are roughly equal in value. You can make use of a large number of parameters, such as broadcast types, and change them so that the final performance is determined very closely by the first few steps.

Using these tools will greatly increase the amount of data you can test.

See Also
Benchmarking a Cluster

Intel® Math Kernel Library Language Interfaces Support

Language Interfaces Support, by Function Domain

The following table shows language interfaces that Intel® Math Kernel Library (Intel® MKL) provides for each function domain. However, Intel MKL routines can be called from other languages using mixed-language programming. See Mixed-language Programming with Intel® MKL for an example of how to call Fortran routines from C/C++.
Function Domain    FORTRAN 77 interface    Fortran 90/95 interface    C/C++ interface
Basic Linear Algebra Subprograms (BLAS)    Yes Yes via CBLAS
BLAS-like extension transposition routines    Yes Yes
Sparse BLAS Level 1    Yes Yes via CBLAS
Sparse BLAS Level 2 and 3    Yes Yes Yes
LAPACK routines for solving systems of linear equations    Yes Yes Yes
LAPACK routines for solving least-squares problems, eigenvalue and singular value problems, and Sylvester's equations    Yes Yes Yes
Auxiliary and utility LAPACK routines    Yes Yes
Parallel Basic Linear Algebra Subprograms (PBLAS)    Yes
ScaLAPACK routines    Yes †
DSS/PARDISO* solvers    Yes Yes Yes
Other Direct and Iterative Sparse Solver routines    Yes Yes Yes
Vector Mathematical Library (VML) functions    Yes Yes Yes
Vector Statistical Library (VSL) functions    Yes Yes Yes
Fourier Transform functions (FFT)    Yes Yes
Cluster FFT functions    Yes Yes
Trigonometric Transform routines    Yes Yes
Fast Poisson, Laplace, and Helmholtz Solver (Poisson Library) routines    Yes Yes
Optimization (Trust-Region) Solver routines    Yes Yes Yes
Data Fitting functions    Yes Yes Yes
GMP* arithmetic functions ††    Yes
Support functions (including memory allocation)    Yes Yes Yes

† Supported using a mixed language programming call. See Intel® MKL Include Files for the respective header file.
†† GMP Arithmetic Functions are deprecated and will be removed in a future release.
Include Files

Function domain    Fortran Include Files    C/C++ Include Files
All function domains    mkl.fi    mkl.h
BLAS Routines    blas.f90 mkl_blas.fi    mkl_blas.h
BLAS-like Extension Transposition Routines    mkl_trans.fi    mkl_trans.h
CBLAS Interface to BLAS    mkl_cblas.h
Sparse BLAS Routines    mkl_spblas.fi    mkl_spblas.h
LAPACK Routines    lapack.f90 mkl_lapack.fi    mkl_lapack.h
C Interface to LAPACK    mkl_lapacke.h
ScaLAPACK Routines    mkl_scalapack.h
All Sparse Solver Routines    mkl_solver.f90    mkl_solver.h
PARDISO    mkl_pardiso.f77 mkl_pardiso.f90    mkl_pardiso.h
DSS Interface    mkl_dss.f77 mkl_dss.f90    mkl_dss.h
RCI Iterative Solvers ILU Factorization    mkl_rci.fi    mkl_rci.h
Optimization Solver Routines    mkl_rci.fi    mkl_rci.h
Vector Mathematical Functions    mkl_vml.f77 mkl_vml.f90    mkl_vml.h
Vector Statistical Functions    mkl_vsl.f77 mkl_vsl.f90    mkl_vsl_functions.h
Fourier Transform Functions    mkl_dfti.f90    mkl_dfti.h
Cluster Fourier Transform Functions    mkl_cdft.f90    mkl_cdft.h
Partial Differential Equations Support Routines
Trigonometric Transforms    mkl_trig_transforms.f90    mkl_trig_transforms.h
Poisson Solvers    mkl_poisson.f90    mkl_poisson.h
Data Fitting functions    mkl_df.f77 mkl_df.f90    mkl_df.h
GMP interface †    mkl_gmp.h
Support functions    mkl_service.f90 mkl_service.fi    mkl_service.h
Memory allocation routines    i_malloc.h
Intel MKL examples interface    mkl_example.h

† GMP Arithmetic Functions are deprecated and will be removed in a future release.

See Also
Language Interfaces Support, by Function Domain

Support for Third-Party Interfaces

GMP* Functions

Intel® Math Kernel Library (Intel® MKL) implementation of GMP* arithmetic functions includes arbitrary precision arithmetic operations on integer numbers.
The interfaces of such functions fully match the GNU Multiple Precision* (GMP) Arithmetic Library. For specifications of these functions, please see http://software.intel.com/sites/products/documentation/hpc/mkl/gnump/index.htm.

NOTE Intel MKL GMP Arithmetic Functions are deprecated and will be removed in a future release.

If you currently use the GMP* library, you need to modify INCLUDE statements in your programs to mkl_gmp.h.

FFTW Interface Support

Intel® Math Kernel Library (Intel® MKL) offers two collections of wrappers for the FFTW interface (www.fftw.org). The wrappers are the superstructure of FFTW to be used for calling the Intel MKL Fourier transform functions. These collections correspond to the FFTW versions 2.x and 3.x and the Intel MKL versions 7.0 and later.

These wrappers enable using Intel MKL Fourier transforms to improve the performance of programs that use FFTW without changing the program source code. See the "FFTW Interface to Intel® Math Kernel Library" appendix in the Intel MKL Reference Manual for details on the use of the wrappers.

Important For ease of use, FFTW3 interface is also integrated in Intel MKL.

Directory Structure in Detail

Tables in this section show contents of the Intel® Math Kernel Library (Intel® MKL) architecture-specific directories.

Detailed Structure of the IA-32 Architecture Directories

Static Libraries in the lib\ia32 Directory

File    Contents
Interface layer
mkl_intel_c.lib    cdecl interface library
mkl_intel_s.lib    CVF default interface library
mkl_blas95.lib    Fortran 95 interface library for BLAS. Supports the Intel® Fortran compiler
mkl_lapack95.lib    Fortran 95 interface library for LAPACK. Supports the Intel® Fortran compiler
Threading layer
mkl_intel_thread.lib    Threading library for the Intel compilers
mkl_pgi_thread.lib    Threading library for the PGI* compiler
mkl_sequential.lib    Sequential library
Computational layer
mkl_core.lib    Kernel library for IA-32 architecture
mkl_solver.lib    Deprecated. Empty library for backward compatibility
mkl_solver_sequential.lib    Deprecated. Empty library for backward compatibility
mkl_scalapack_core.lib    ScaLAPACK routines
mkl_cdft_core.lib    Cluster version of FFTs
Run-time Libraries (RTL)
mkl_blacs_intelmpi.lib    BLACS routines supporting Intel MPI
mkl_blacs_mpich2.lib    BLACS routines supporting MPICH2

Dynamic Libraries in the lib\ia32 Directory

File    Contents
mkl_rt.lib    Single Dynamic Library to be used for linking
Interface layer
mkl_intel_c_dll.lib    cdecl interface library for dynamic linking
mkl_intel_s_dll.lib    CVF default interface library for dynamic linking
Threading layer
mkl_intel_thread_dll.lib    Threading library for dynamic linking with the Intel compilers
mkl_pgi_thread_dll.lib    Threading library for dynamic linking with the PGI* compiler
mkl_sequential_dll.lib    Sequential library for dynamic linking
Computational layer
mkl_core_dll.lib    Core library for dynamic linking
mkl_scalapack_core_dll.lib    ScaLAPACK routine library for dynamic linking
mkl_cdft_core_dll.lib    Cluster FFT library for dynamic linking
Run-time Libraries (RTL)
mkl_blacs_dll.lib    BLACS interface library
for dynamic linking Contents of the redist\ia32\mkl Directory File Contents mkl_rt.dll Single Dynamic Library Threading layer mkl_intel_thread.dll Dynamic threading library for the Intel compilers mkl_pgi_thread.dll Dynamic threading library for the PGI* compiler mkl_sequential.dll Dynamic sequential library Computational layer mkl_core.dll Core library containing processor-independent code and a dispatcher for dynamic loading of processor-specific code mkl_def.dll Default kernel (Intel® Pentium®, Pentium® Pro, Pentium® II, and Pentium® III processors) C Intel® Math Kernel Library for Windows* OS User's Guide 102File Contents mkl_p4.dll Pentium® 4 processor kernel mkl_p4p.dll Kernel for the Intel® Pentium® 4 processor with Streaming SIMD Extensions 3 (SSE3), including Intel® Core™ Duo and Intel® Core™ Solo processors. mkl_p4m.dll Kernel for processors based on the Intel® Core™ microarchitecture (except Intel® Core™ Duo and Intel® Core™ Solo processors, for which mkl_p4p.dll is intended) mkl_p4m3.dll Kernel for the Intel® Core™ i7 processors mkl_vml_def.dll VML/VSL part of default kernel for old Intel® Pentium® processors mkl_vml_ia.dll VML/VSL default kernel for newer Intel® architecture processors mkl_vml_p4.dll VML/VSL part of Pentium® 4 processor kernel mkl_vml_p4p.dll VML/VSL for Pentium® 4 processor with Streaming SIMD Extensions 3 (SSE3) mkl_vml_p4m.dll VML/VSL for processors based on the Intel® Core™ microarchitecture (except Intel® Core™ Duo and Intel® Core™ Solo processors, for which mkl_vml_p4p.dll is intended). 
mkl_vml_p4m2.dll VML/VSL for 45nm Hi-k Intel® Core™2 and Intel Xeon® processor families mkl_vml_p4m3.dll VML/VSL for the Intel® Core™ i7 processors mkl_vml_avx.dll VML/VSL optimized for the Intel® Advanced Vector Extensions (Intel® AVX) mkl_scalapack_core.dll ScaLAPACK routines mkl_cdft_core.dll Cluster FFT dynamic library libimalloc.dll Dynamic library to support renaming of memory functions Run-time Libraries (RTL) mkl_blacs.dll BLACS routines mkl_blacs_intelmpi.dll BLACS routines supporting Intel MPI mkl_blacs_mpich2.dll BLACS routines supporting MPICH2 1033\mkl_msg.dll Catalog of Intel® Math Kernel Library (Intel® MKL) messages in English 1041\mkl_msg.dll Catalog of Intel MKL messages in Japanese. Available only if the Intel® MKL package provides Japanese localization. Please see the Release Notes for this information Detailed Structure of the Intel® 64 Architecture Directories Directory Structure in Detail C 103Static Libraries in the lib\intel64 Directory File Contents Interface layer mkl_intel_lp64.lib LP64 interface library for the Intel compilers mkl_intel_ilp64.lib ILP64 interface library for the Intel compilers mkl_intel_sp2dp.a SP2DP interface library for the Intel compilers mkl_blas95_lp64.lib Fortran 95 interface library for BLAS. Supports the Intel® Fortran compiler and LP64 interface mkl_blas95_ilp64.lib Fortran 95 interface library for BLAS. Supports the Intel® Fortran compiler and ILP64 interface mkl_lapack95_lp64.lib Fortran 95 interface library for LAPACK. Supports the Intel® Fortran compiler and LP64 interface mkl_lapack95_ilp64.lib Fortran 95 interface library for LAPACK. Supports the Intel® Fortran compiler and ILP64 interface Threading layer mkl_intel_thread.lib Threading library for the Intel compilers mkl_pgi_thread.lib Threading library for the PGI* compiler mkl_sequential.lib Sequential library Computational layer mkl_core.lib Kernel library for the Intel® 64 architecture mkl_solver_lp64.lib Deprecated. 
Empty library for backward compatibility mkl_solver_lp64_sequential.lib Deprecated. Empty library for backward compatibility mkl_solver_ilp64.lib Deprecated. Empty library for backward compatibility mkl_solver_ilp64_sequential.lib Deprecated. Empty library for backward compatibility mkl_scalapack_lp64.lib ScaLAPACK routine library supporting the LP64 interface mkl_scalapack_ilp64.lib ScaLAPACK routine library supporting the ILP64 interface mkl_cdft_core.lib Cluster version of FFTs Run-time Libraries (RTL) mkl_blacs_intelmpi_lp64.lib LP64 version of BLACS routines supporting Intel MPI mkl_blacs_intelmpi_ilp64.lib ILP64 version of BLACS routines supporting Intel MPI mkl_blacs_mpich2_lp64.lib LP64 version of BLACS routines supporting MPICH2 mkl_blacs_mpich2_ilp64.lib ILP64 version of BLACS routines supporting MPICH2 mkl_blacs_msmpi_lp64.lib LP64 version of BLACS routines supporting Microsoft* MPI mkl_blacs_msmpi_ilp64.lib ILP64 version of BLACS routines supporting Microsoft* MPI C Intel® Math Kernel Library for Windows* OS User's Guide 104Dynamic Libraries in the lib\intel64 Directory File Contents mkl_rt.lib Single Dynamic Library to be used for linking Interface layer mkl_intel_lp64_dll.lib LP64 interface library for dynamic linking with the Intel compilers mkl_intel_ilp64_dll.lib ILP64 interface library for dynamic linking with the Intel compilers Threading layer mkl_intel_thread_dll.lib Threading library for dynamic linking with the Intel compilers mkl_pgi_thread_dll.lib Threading library for dynamic linking with the PGI* compiler mkl_sequential_dll.lib Sequential library for dynamic linking Computational layer mkl_core_dll.lib Core library for dynamic linking mkl_scalapack_lp64_dll.lib ScaLAPACK routine library for dynamic linking supporting the LP64 interface mkl_scalapack_ilp64_dll.lib ScaLAPACK routine library for dynamic linking supporting the ILP64 interface mkl_cdft_core_dll.lib Cluster FFT library for dynamic linking Run-time Libraries (RTL) 
mkl_blacs_lp64_dll.lib LP64 version of BLACS interface library for dynamic linking mkl_blacs_ilp64_dll.lib ILP64 version of BLACS interface library for dynamic linking Contents of the redist\intel64\mkl Directory File Contents mkl_rt.dll Single Dynamic Library Threading layer mkl_intel_thread.dll Dynamic threading library for the Intel compilers mkl_pgi_thread.dll Dynamic threading library for the PGI* compiler mkl_sequential.dll Dynamic sequential library Computational layer mkl_core.dll Core library containing processor-independent code and a Directory Structure in Detail C 105File Contents dispatcher for dynamic loading of processor-specific code mkl_def.dll Default kernel for the Intel® 64 architecture mkl_p4n.dll Kernel for the Intel® Xeon® processor using the Intel® 64 architecture mkl_mc.dll Kernel for processors based on the Intel® Core™ microarchitecture mkl_mc3.dll Kernel for the Intel® Core™ i7 processors mkl_avx.dll Kernel optimized for the Intel® Advanced Vector Extensions (Intel® AVX). 
mkl_vml_def.dll VML/VSL part of default kernel mkl_vml_p4n.dll VML/VSL for the Intel® Xeon® processor using the Intel® 64 architecture mkl_vml_mc.dll VML/VSL for processors based on the Intel® Core™ microarchitecture mkl_vml_mc2.dll VML/VSL for 45nm Hi-k Intel® Core™2 and Intel Xeon® processor families mkl_vml_mc3.dll VML/VSL for the Intel® Core® i7 processors mkl_vml_avx.dll VML/VSL optimized for the Intel® Advanced Vector Extensions (Intel® AVX) mkl_scalapack_lp64.dll ScaLAPACK routine library supporting the LP64 interface mkl_scalapack_ilp64.dll ScaLAPACK routine library supporting the ILP64 interface mkl_cdft_core.dll Cluster FFT dynamic library libimalloc.dll Dynamic library to support renaming of memory functions Run-time Libraries (RTL) mkl_blacs_lp64.dll LP64 version of BLACS routines mkl_blacs_ilp64.dll ILP64 version of BLACS routines mkl_blacs_intelmpi_lp64.dll LP64 version of BLACS routines supporting Intel MPI mkl_blacs_intelmpi_ilp64.dll ILP64 version of BLACS routines supporting Intel MPI mkl_blacs_mpich2_lp64.dll LP64 version of BLACS routines supporting MPICH2 mkl_blacs_mpich2_ilp64.dll ILP64 version of BLACS routines supporting MPICH2 mkl_blacs_msmpi_lp64.dll LP64 version of BLACS routines supporting Microsoft* MPI mkl_blacs_msmpi_ilp64.dll ILP64 version of BLACS routines supporting Microsoft* MPI 1033\mkl_msg.dll Catalog of Intel® Math Kernel Library (Intel® MKL) messages in English 1041\mkl_msg.dll Catalog of Intel MKL messages in Japanese. Available only if the Intel® MKL package provides Japanese localization. 
Please see the Release Notes for this information.

Intel® Math Kernel Library Reference Manual
Document Number: 630813-045US
MKL 10.3 Update 8

Contents

Legal Information
Introducing the Intel® Math Kernel Library
Getting Help and Support
What's New
Notational Conventions
Chapter 1: Function Domains
  BLAS Routines
  Sparse BLAS Routines
  LAPACK Routines
  ScaLAPACK Routines
  PBLAS Routines
  Sparse Solver Routines
  VML Functions
  Statistical Functions
  Fourier Transform Functions
  Partial Differential Equations Support
  Nonlinear Optimization Problem Solvers
  Support Functions
  BLACS Routines
  Data Fitting Functions
  GMP Arithmetic Functions
  Performance Enhancements
  Parallelism
  C Datatypes Specific to Intel MKL
Chapter 2: BLAS and Sparse BLAS Routines
  BLAS Routines
    Routine Naming Conventions
    Fortran 95 Interface Conventions
    Matrix Storage Schemes
    BLAS Level 1 Routines and Functions
      ?asum
      ?axpy
      ?copy
      ?dot
      ?sdot
      ?dotc
      ?dotu
      ?nrm2
      ?rot
      ?rotg
      ?rotm
      ?rotmg
      ?scal
      ?swap
      i?amax
      i?amin
      ?cabs1
    BLAS Level 2 Routines
      ?gbmv
      ?gemv
      ?ger
      ?gerc
      ?geru
      ?hbmv
      ?hemv
      ?her
      ?her2
      ?hpmv
      ?hpr
      ?hpr2
      ?sbmv
      ?spmv
      ?spr
      ?spr2
      ?symv
      ?syr
      ?syr2
      ?tbmv
      ?tbsv
      ?tpmv
      ?tpsv
      ?trmv
      ?trsv
    BLAS Level 3 Routines
      ?gemm
      ?hemm
      ?herk
      ?her2k
      ?symm
      ?syrk
      ?syr2k
      ?trmm
      ?trsm
  Sparse BLAS Level 1 Routines
    Vector Arguments
    Naming Conventions
    Routines and Data Types
    BLAS Level 1 Routines That Can Work With Sparse Vectors
    ?axpyi
    ?doti
    ?dotci
    ?dotui
    ?gthr
    ?gthrz
    ?roti
    ?sctr
  Sparse BLAS Level 2 and Level 3 Routines
    Naming Conventions in Sparse BLAS Level 2 and Level 3
    Sparse Matrix Storage Formats
    Routines and Supported Operations
    Interface Consideration
    Sparse BLAS Level 2 and Level 3 Routines
      mkl_?csrgemv
      mkl_?bsrgemv
      mkl_?coogemv
      mkl_?diagemv
      mkl_?csrsymv
      mkl_?bsrsymv
      mkl_?coosymv
      mkl_?diasymv
      mkl_?csrtrsv
      mkl_?bsrtrsv
      mkl_?cootrsv
      mkl_?diatrsv
      mkl_cspblas_?csrgemv
      mkl_cspblas_?bsrgemv
      mkl_cspblas_?coogemv
      mkl_cspblas_?csrsymv
      mkl_cspblas_?bsrsymv
      mkl_cspblas_?coosymv
      mkl_cspblas_?csrtrsv
      mkl_cspblas_?bsrtrsv
      mkl_cspblas_?cootrsv
      mkl_?csrmv
      mkl_?bsrmv
      mkl_?cscmv
      mkl_?coomv
      mkl_?csrsv
      mkl_?bsrsv
      mkl_?cscsv
      mkl_?coosv
      mkl_?csrmm
      mkl_?bsrmm
      mkl_?cscmm
      mkl_?coomm
      mkl_?csrsm
      mkl_?cscsm
      mkl_?coosm
      mkl_?bsrsm
      mkl_?diamv
      mkl_?skymv
      mkl_?diasv
      mkl_?skysv
      mkl_?diamm
      mkl_?skymm
      mkl_?diasm
      mkl_?skysm
      mkl_?dnscsr
      mkl_?csrcoo
      mkl_?csrbsr
      mkl_?csrcsc
      mkl_?csrdia
      mkl_?csrsky
      mkl_?csradd
      mkl_?csrmultcsr
      mkl_?csrmultd
  BLAS-like Extensions
    ?axpby
    ?gem2vu
    ?gem2vc
    ?gemm3m
    mkl_?imatcopy
    mkl_?omatcopy
    mkl_?omatcopy2
    mkl_?omatadd
Chapter 3: LAPACK Routines: Linear Equations
  Routine Naming Conventions
  C Interface Conventions
  Fortran 95 Interface Conventions
    Intel® MKL Fortran 95 Interfaces for LAPACK Routines vs. Netlib Implementation
  Matrix Storage Schemes
  Mathematical Notation
  Error Analysis
  Computational Routines
    Routines for Matrix Factorization
      ?getrf
      ?gbtrf
      ?gttrf
      ?dttrfb
      ?potrf
      ?pstrf
      ?pftrf
      ?pptrf
      ?pbtrf
      ?pttrf
      ?sytrf
      ?hetrf
      ?sptrf
      ?hptrf
    Routines for Solving Systems of Linear Equations
      ?getrs
      ?gbtrs
      ?gttrs
      ?dttrsb
      ?potrs
      ?pftrs
      ?pptrs
      ?pbtrs
      ?pttrs
      ?sytrs
      ?hetrs
      ?sytrs2
      ?hetrs2
      ?sptrs
      ?hptrs
      ?trtrs
      ?tptrs
      ?tbtrs
    Routines for Estimating the Condition Number
      ?gecon
      ?gbcon
      ?gtcon
      ?pocon
      ?ppcon
      ?pbcon
      ?ptcon
      ?sycon
      ?syconv
      ?hecon
      ?spcon
      ?hpcon
      ?trcon
      ?tpcon
      ?tbcon
    Refining the Solution and Estimating Its Error
      ?gerfs
      ?gerfsx
      ?gbrfs
      ?gbrfsx
      ?gtrfs
      ?porfs
      ?porfsx
      ?pprfs
      ?pbrfs
      ?ptrfs
      ?syrfs
      ?syrfsx
      ?herfs
      ?herfsx
      ?sprfs
      ?hprfs
      ?trrfs
      ?tprfs
      ?tbrfs
    Routines for Matrix Inversion
      ?getri
      ?potri
      ?pftri
      ?pptri
      ?sytri
      ?hetri
      ?sytri2
      ?hetri2
      ?sytri2x
      ?hetri2x
      ?sptri
      ?hptri
      ?trtri
      ?tftri
      ?tptri
    Routines for Matrix Equilibration
      ?geequ
      ?geequb
      ?gbequ
      ?gbequb
      ?poequ
      ?poequb
      ?ppequ
      ?pbequ
      ?syequb
      ?heequb
  Driver Routines
    ?gesv
    ?gesvx
    ?gesvxx
    ?gbsv
    ?gbsvx
    ?gbsvxx
    ?gtsv
    ?gtsvx
    ?dtsvb
    ?posv
    ?posvx
    ?posvxx
    ?ppsv
    ?ppsvx
    ?pbsv
    ?pbsvx
    ?ptsv
    ?ptsvx
    ?sysv
    ?sysvx
    ?sysvxx
    ?hesv
    ?hesvx
    ?hesvxx
    ?spsv
    ?spsvx
    ?hpsv
    ?hpsvx
Chapter 4: LAPACK Routines: Least Squares and Eigenvalue Problems
  Routine Naming Conventions
  Matrix Storage Schemes
  Mathematical Notation
  Computational Routines
    Orthogonal Factorizations
      ?geqrf
      ?geqrfp
      ?geqpf
      ?geqp3
      ?orgqr
      ?ormqr
      ?ungqr
      ?unmqr
      ?gelqf
      ?orglq
      ?ormlq
      ?unglq
      ?unmlq
      ?geqlf
      ?orgql
      ?ungql
      ?ormql
      ?unmql
      ?gerqf
      ?orgrq
      ?ungrq
      ?ormrq
      ?unmrq
      ?tzrzf
      ?ormrz
      ?unmrz
      ?ggqrf
      ?ggrqf
    Singular Value Decomposition
      ?gebrd
      ?gbbrd
      ?orgbr
      ?ormbr
      ?ungbr
      ?unmbr
      ?bdsqr
      ?bdsdc
    Symmetric Eigenvalue Problems
      ?sytrd
      ?syrdb
      ?herdb
      ?orgtr
      ?ormtr
      ?hetrd
      ?ungtr
      ?unmtr
      ?sptrd
      ?opgtr
      ?opmtr
      ?hptrd
      ?upgtr
      ?upmtr
      ?sbtrd
      ?hbtrd
      ?sterf
      ?steqr
      ?stemr
      ?stedc
      ?stegr
      ?pteqr
      ?stebz
      ?stein
      ?disna
    Generalized Symmetric-Definite Eigenvalue Problems
      ?sygst
      ?hegst
      ?spgst
      ?hpgst
      ?sbgst
      ?hbgst
      ?pbstf
    Nonsymmetric Eigenvalue Problems
      ?gehrd
      ?orghr
      ?ormhr
      ?unghr
      ?unmhr
      ?gebal
      ?gebak
      ?hseqr
      ?hsein
      ?trevc
      ?trsna
      ?trexc
      ?trsen
      ?trsyl
    Generalized Nonsymmetric Eigenvalue Problems
      ?gghrd
      ?ggbal
      ?ggbak
      ?hgeqz
      ?tgevc
      ?tgexc
      ?tgsen
      ?tgsyl
      ?tgsna
    Generalized Singular Value Decomposition
      ?ggsvp
      ?tgsja
    Cosine-Sine Decomposition
      ?bbcsd
      ?orbdb/?unbdb
  Driver Routines
    Linear Least Squares (LLS) Problems
      ?gels
      ?gelsy
      ?gelss
      ?gelsd
    Generalized LLS Problems
      ?gglse
      ?ggglm
    Symmetric Eigenproblems
      ?syev
      ?heev
      ?syevd
      ?heevd
      ?syevx
      ?heevx
      ?syevr
      ?heevr
      ?spev
      ?hpev
      ?spevd
      ?hpevd
      ?spevx
      ?hpevx
      ?sbev
      ?hbev
      ?sbevd
      ?hbevd
      ?sbevx
      ?hbevx
      ?stev
      ?stevd
      ?stevx
      ?stevr
    Nonsymmetric Eigenproblems
      ?gees
      ?geesx
      ?geev
      ?geevx
    Singular Value Decomposition
      ?gesvd
      ?gesdd
      ?gejsv
      ?gesvj
      ?ggsvd
    Cosine-Sine Decomposition
      ?orcsd/?uncsd
    Generalized Symmetric Definite Eigenproblems
      ?sygv
      ?hegv
      ?sygvd
      ?hegvd
      ?sygvx
      ?hegvx
      ?spgv
      ?hpgv
      ?spgvd
      ?hpgvd
      ?spgvx
      ?hpgvx
      ?sbgv
      ?hbgv
      ?sbgvd
      ?hbgvd
      ?sbgvx
      ?hbgvx
    Generalized Nonsymmetric Eigenproblems
      ?gges
      ?ggesx
      ?ggev
      ?ggevx
Chapter 5: LAPACK Auxiliary and Utility Routines
  Auxiliary Routines
    ?lacgv
    ?lacrm
    ?lacrt
    ?laesy
    ?rot
    ?spmv
    ?spr
    ?symv
    ?syr
    i?max1
    ?sum1
    ?gbtf2
    ?gebd2
    ?gehd2
    ?gelq2
    ?geql2
    ?geqr2
    ?geqr2p
    ?gerq2
    ?gesc2
    ?getc2
    ?getf2
    ?gtts2
    ?isnan
    ?laisnan
    ?labrd
    ?lacn2
    ?lacon
    ?lacpy
    ?ladiv
    ?lae2
    ?laebz
    ?laed0
    ?laed1
    ?laed2
    ?laed3
    ?laed4
    ?laed5
    ?laed6
    ?laed7
    ?laed8
    ?laed9
    ?laeda
    ?laein
    ?laev2
    ?laexc
    ?lag2
    ?lags2
    ?lagtf
    ?lagtm
    ?lagts
    ?lagv2
    ?lahqr
    ?lahrd
    ?lahr2
    ?laic1
    ?laln2
    ?lals0
    ?lalsa
    ?lalsd
    ?lamrg
    ?laneg
    ?langb
    ?lange
    ?langt
    ?lanhs
    ?lansb
    ?lanhb
    ?lansp
    ?lanhp
    ?lanst/?lanht
    ?lansy
    ?lanhe
    ?lantb
    ?lantp
    ?lantr
    ?lanv2
    ?lapll
    ?lapmr
    ?lapmt
    ?lapy2
    ?lapy3
    ?laqgb
    ?laqge
    ?laqhb
    ?laqp2
    ?laqps
    ?laqr0
    ?laqr1
    ?laqr2
    ?laqr3
    ?laqr4
    ?laqr5
?laqsb........................................................................................1285 ?laqsp........................................................................................1286 ?laqsy.........................................................................................1287 ?laqtr.........................................................................................1289 ?lar1v.........................................................................................1290 ?lar2v.........................................................................................1293 ?larf...........................................................................................1294 ?larfb.........................................................................................1295 ?larfg.........................................................................................1298 ?larfgp........................................................................................1299 ?larft..........................................................................................1300 ?larfx..........................................................................................1302 ?largv.........................................................................................1304 ?larnv.........................................................................................1305 ?larra.........................................................................................1306 ?larrb.........................................................................................1307 ?larrc..........................................................................................1309 ?larrd.........................................................................................1310 ?larre.........................................................................................1312 
?larrf..........................................................................................1315 ?larrj..........................................................................................1317 ?larrk.........................................................................................1318 ?larrr..........................................................................................1319 ?larrv.........................................................................................1320 ?lartg.........................................................................................1323 ?lartgp........................................................................................1324 ?lartgs........................................................................................1326 ?lartv.........................................................................................1327 ?laruv.........................................................................................1328 ?larz...........................................................................................1329 ?larzb.........................................................................................1330 ?larzt..........................................................................................1332 ?las2..........................................................................................1334 ?lascl..........................................................................................1335 ?lasd0........................................................................................1336 ?lasd1........................................................................................1338 ?lasd2........................................................................................1340 ?lasd3........................................................................................1342 
?lasd4........................................................................................1344 ?lasd5........................................................................................1346 ?lasd6........................................................................................1347 ?lasd7........................................................................................1350 ?lasd8........................................................................................1353 ?lasd9........................................................................................1354 ?lasda.........................................................................................1356 ?lasdq........................................................................................1358 ?lasdt.........................................................................................1360 ?laset.........................................................................................1361 ?lasq1........................................................................................1362 ?lasq2........................................................................................1363 ?lasq3........................................................................................1364 ?lasq4........................................................................................1365 ?lasq5........................................................................................1366 ?lasq6........................................................................................1367 ?lasr...........................................................................................1368 ?lasrt..........................................................................................1371 ?lassq.........................................................................................1372 ?lasv2.........................................................................................1373 
?laswp........................................................................................1374 ?lasy2.........................................................................................1375 ?lasyf.........................................................................................1377 ?lahef.........................................................................................1378 ?latbs.........................................................................................1380 ?latdf..........................................................................................1382 ?latps.........................................................................................1383 ?latrd.........................................................................................1385 ?latrs..........................................................................................1387 ?latrz..........................................................................................1390 ?lauu2........................................................................................1392 ?lauum.......................................................................................1393 ?org2l/?ung2l..............................................................................1394 ?org2r/?ung2r.............................................................................1395 ?orgl2/?ungl2..............................................................................1396 ?orgr2/?ungr2.............................................................................1397 ?orm2l/?unm2l............................................................................1399 ?orm2r/?unm2r...........................................................................1400 ?orml2/?unml2............................................................................1402 ?ormr2/?unmr2...........................................................................1404 
?ormr3/?unmr3...........................................................................1405 ?pbtf2.........................................................................................1407 ?potf2.........................................................................................1408 ?ptts2.........................................................................................1409 ?rscl...........................................................................................1411 ?syswapr....................................................................................1411 ?heswapr....................................................................................1413 ?sygs2/?hegs2.............................................................................1415 ?sytd2/?hetd2.............................................................................1417 ?sytf2.........................................................................................1418 ?hetf2.........................................................................................1419 ?tgex2........................................................................................1421 ?tgsy2........................................................................................1423 ?trti2..........................................................................................1426 clag2z.........................................................................................1427 dlag2s........................................................................................1427 slag2d........................................................................................1428 zlag2c.........................................................................................1429 ?larfp.........................................................................................1429 ila?lc..........................................................................................1431 
ila?lr...........................................................................................1432 ?gsvj0........................................................................................1432 ?gsvj1........................................................................................1434 ?sfrk...........................................................................................1437 ?hfrk..........................................................................................1438 ?tfsm..........................................................................................1440 ?lansf.........................................................................................1442 ?lanhf.........................................................................................1443 ?tfttp..........................................................................................1444 ?tfttr..........................................................................................1445 ?tpttf..........................................................................................1446 ?tpttr..........................................................................................1448 ?trttf..........................................................................................1449 ?trttp..........................................................................................1450 ?pstf2.........................................................................................1451 dlat2s........................................................................................1453 zlat2c........................................................................................1454 ?lacp2........................................................................................1455 ?la_gbamv..................................................................................1455 
?la_gbrcond................................................................................1457 ?la_gbrcond_c.............................................................................1459 ?la_gbrcond_x.............................................................................1460 ?la_gbrfsx_extended....................................................................1462 ?la_gbrpvgrw...............................................................................1467 ?la_geamv..................................................................................1468 ?la_gercond.................................................................................1470 ?la_gercond_c.............................................................................1471 ?la_gercond_x.............................................................................1472 ?la_gerfsx_extended.....................................................................1473 ?la_heamv..................................................................................1478 ?la_hercond_c.............................................................................1480 ?la_hercond_x.............................................................................1481 ?la_herfsx_extended....................................................................1482 ?la_herpvgrw...............................................................................1487 ?la_lin_berr.................................................................................1488 ?la_porcond................................................................................1489 ?la_porcond_c.............................................................................1490 ?la_porcond_x.............................................................................1492 ?la_porfsx_extended....................................................................1493 ?la_porpvgrw...............................................................................1498 
?laqhe........................................................................................1499 ?laqhp........................................................................................1501 ?larcm........................................................................................1502 ?la_rpvgrw..................................................................................1503 ?larscl2.......................................................................................1504 ?lascl2........................................................................................1504 ?la_syamv...................................................................................1505 ?la_syrcond.................................................................................1507 ?la_syrcond_c..............................................................................1508 ?la_syrcond_x.............................................................................1509 ?la_syrfsx_extended.....................................................................1511 ?la_syrpvgrw...............................................................................1516 ?la_wwaddw................................................................................1517 Utility Functions and Routines................................................................1518 ilaver..........................................................................................1519 ilaenv.........................................................................................1520 iparmq........................................................................................1522 ieeeck.........................................................................................1523 lsamen.......................................................................................1524 ?labad........................................................................................1524 
?lamch.......................................................................................1525 ?lamc1.......................................................................................1526 ?lamc2.......................................................................................1526 ?lamc3.......................................................................................1527 ?lamc4.......................................................................................1528 ?lamc5.......................................................................................1528 second/dsecnd.............................................................................1529 chla_transtype.............................................................................1529 iladiag........................................................................................1530 ilaprec........................................................................................1531 ilatrans.......................................................................................1531 ilauplo........................................................................................1532 xerbla_array................................................................................1532
Chapter 6: ScaLAPACK Routines
Overview.............................................................................................1535 Routine Naming Conventions.................................................................1536 Computational Routines........................................................................1537 Linear Equations..........................................................................1537 Routines for Matrix Factorization....................................................1538 p?getrf...............................................................................1538 p?gbtrf...............................................................................1540 
p?dbtrf...............................................................................1542 p?dttrf................................................................................1543 p?potrf...............................................................................1545 p?pbtrf...............................................................................1546 p?pttrf................................................................................1548 Routines for Solving Systems of Linear Equations.............................1550 p?getrs...............................................................................1550 p?gbtrs...............................................................................1551 p?dbtrs...............................................................................1553 p?dttrs...............................................................................1555 p?potrs...............................................................................1557 p?pbtrs...............................................................................1558 p?pttrs...............................................................................1560 p?trtrs................................................................................1562 Routines for Estimating the Condition Number..................................1563 p?gecon..............................................................................1564 p?pocon..............................................................................1566 p?trcon...............................................................................1568 Refining the Solution and Estimating Its Error..................................1570 p?gerfs...............................................................................1570 p?porfs...............................................................................1573 p?trrfs................................................................................1576 Routines for 
Matrix Inversion.........................................................1578 p?getri...............................................................................1578 p?potri...............................................................................1580 p?trtri.................................................................................1581 Routines for Matrix Equilibration.....................................................1583 p?geequ.............................................................................1583 p?poequ.............................................................................1584 Orthogonal Factorizations..............................................................1586 p?geqrf...............................................................................1587 p?geqpf..............................................................................1589 p?orgqr..............................................................................1591 p?ungqr..............................................................................1592 p?ormqr.............................................................................1594 p?unmqr.............................................................................1596 p?gelqf...............................................................................1598 p?orglq...............................................................................1600 p?unglq..............................................................................1602 p?ormlq..............................................................................1603 p?unmlq.............................................................................1605 p?geqlf...............................................................................1608 p?orgql...............................................................................1609 
p?ungql..............................................................................1611 p?ormql..............................................................................1612 p?unmql.............................................................................1615 p?gerqf...............................................................................1617 p?orgrq..............................................................................1619 p?ungrq..............................................................................1620 p?ormrq.............................................................................1622 p?unmrq.............................................................................1624 p?tzrzf................................................................................1626 p?ormrz..............................................................................1628 p?unmrz.............................................................................1631 p?ggqrf...............................................................................1633 p?ggrqf...............................................................................1636 Symmetric Eigenproblems.............................................................1640 p?sytrd...............................................................................1640 p?ormtr..............................................................................1643 p?hetrd..............................................................................1646 p?unmtr.............................................................................1648 p?stebz..............................................................................1651 p?stein...............................................................................1653 Nonsymmetric Eigenvalue Problems................................................1656 p?gehrd..............................................................................1657 
p?ormhr.............................................................................1659 p?unmhr.............................................................................1662 p?lahqr...............................................................................1664 Singular Value Decomposition........................................................1666 p?gebrd..............................................................................1666 p?ormbr.............................................................................1669 p?unmbr.............................................................................1672 Generalized Symmetric-Definite Eigen Problems...............................1676 p?sygst...............................................................................1676 p?hegst..............................................................................1677 Driver Routines....................................................................................1679 p?gesv........................................................................................1679 p?gesvx......................................................................................1681 p?gbsv........................................................................................1685 p?dbsv........................................................................................1687 p?dtsv........................................................................................1689 p?posv........................................................................................1691 p?posvx......................................................................................1693 p?pbsv........................................................................................1697 p?ptsv........................................................................................1699 p?gels.........................................................................................1701 
p?syev........................................................................................1704 p?syevd......................................................................................1706 p?syevx......................................................................................1708 p?heev.......................................................................................1713 p?heevd......................................................................................1715 p?heevx......................................................................................1717 p?gesvd......................................................................................1723 p?sygvx......................................................................................1726 p?hegvx......................................................................................1732
Chapter 7: ScaLAPACK Auxiliary and Utility Routines
Auxiliary Routines.................................................................................1739 p?lacgv.......................................................................................1743 p?max1......................................................................................1744 ?combamax1...............................................................................1745 p?sum1......................................................................................1745 p?dbtrsv.....................................................................................1746 p?dttrsv......................................................................................1748 p?gebd2......................................................................................1751 p?gehd2.....................................................................................1754 p?gelq2......................................................................................1756 
p?geql2......................................................................................1758 p?geqr2......................................................................................1760 p?gerq2......................................................................................1762 p?getf2.......................................................................................1763 p?labrd.......................................................................................1765 p?lacon.......................................................................................1768 p?laconsb....................................................................................1769 p?lacp2.......................................................................................1770 p?lacp3.......................................................................................1772 p?lacpy.......................................................................................1773 p?laevswp...................................................................................1774 p?lahrd.......................................................................................1775 p?laiect.......................................................................................1778 p?lange.......................................................................................1779 p?lanhs.......................................................................................1780 p?lansy, p?lanhe..........................................................................1782 p?lantr........................................................................................1783 p?lapiv........................................................................................1785 p?laqge.......................................................................................1787 p?laqsy.......................................................................................1789 
p?lared1d....................................................................................1791 p?lared2d....................................................................................1792 p?larf.........................................................................................1793 p?larfb........................................................................................1795 p?larfc........................................................................................1798 p?larfg........................................................................................1800 p?larft........................................................................................1802 p?larz.........................................................................................1804 p?larzb.......................................................................................1807 p?larzc........................................................................................1809 p?larzt........................................................................................1813 p?lascl........................................................................................1815 p?laset.......................................................................................1817 p?lasmsub...................................................................................1818 p?lassq.......................................................................................1819 p?laswp......................................................................................1821 p?latra........................................................................................1822 p?latrd........................................................................................1823 p?latrs........................................................................................1826 
p?latrz........................................................................................1828 p?lauu2......................................................................................1830 p?lauum.....................................................................................1831 p?lawil........................................................................................1832 p?org2l/p?ung2l...........................................................................1833 p?org2r/p?ung2r..........................................................................1835 p?orgl2/p?ungl2...........................................................................1836 p?orgr2/p?ungr2..........................................................................1838 p?orm2l/p?unm2l.........................................................................1840 p?orm2r/p?unm2r........................................................................1843 p?orml2/p?unml2.........................................................................1846 p?ormr2/p?unmr2........................................................................1849 p?pbtrsv.....................................................................................1851 p?pttrsv......................................................................................1854 p?potf2.......................................................................................1857 p?rscl.........................................................................................1858 p?sygs2/p?hegs2.........................................................................1859 p?sytd2/p?hetd2..........................................................................1861 p?trti2........................................................................................1864 ?lamsh.......................................................................................1866 
    ?laref
    ?lasorte
    ?lasrt2
    ?stein2
    ?dbtf2
    ?dbtrf
    ?dttrf
    ?dttrsv
    ?pttrsv
    ?steqr2
  Utility Functions and Routines
    p?labad
    p?lachkieee
    p?lamch
    p?lasnbt
    pxerbla
Chapter 8: Sparse Solver Routines
  PARDISO* - Parallel Direct Sparse Solver Interface
    pardiso
    pardisoinit
    pardiso_64
    pardiso_getenv, pardiso_setenv
    PARDISO Parameters in Tabular Form
  Direct Sparse Solver (DSS) Interface Routines
    DSS Interface Description
    DSS Routines
      dss_create
      dss_define_structure
      dss_reorder
      dss_factor_real, dss_factor_complex
      dss_solve_real, dss_solve_complex
      dss_delete
      dss_statistics
      mkl_cvt_to_null_terminated_str
    Implementation Details
  Iterative Sparse Solvers based on Reverse Communication Interface (RCI ISS)
    CG Interface Description
    FGMRES Interface Description
    RCI ISS Routines
      dcg_init
      dcg_check
      dcg
      dcg_get
      dcgmrhs_init
      dcgmrhs_check
      dcgmrhs
      dcgmrhs_get
      dfgmres_init
      dfgmres_check
      dfgmres
      dfgmres_get
    Implementation Details
  Preconditioners based on Incomplete LU Factorization Technique
    ILU0 and ILUT Preconditioners Interface Description
    dcsrilu0
    dcsrilut
  Calling Sparse Solver and Preconditioner Routines from C/C++
Chapter 9: Vector Mathematical Functions
  Data Types, Accuracy Modes, and Performance Tips
  Function Naming Conventions
    Function Interfaces
      VML Mathematical Functions
      Pack Functions
      Unpack Functions
      Service Functions
      Input Parameters
      Output Parameters
  Vector Indexing Methods
  Error Diagnostics
  VML Mathematical Functions
    Special Value Notations
    Arithmetic Functions
      v?Add
      v?Sub
      v?Sqr
      v?Mul
      v?MulByConj
      v?Conj
      v?Abs
      v?Arg
      v?LinearFrac
    Power and Root Functions
      v?Inv
      v?Div
      v?Sqrt
      v?InvSqrt
      v?Cbrt
      v?InvCbrt
      v?Pow2o3
      v?Pow3o2
      v?Pow
      v?Powx
      v?Hypot
    Exponential and Logarithmic Functions
      v?Exp
      v?Expm1
      v?Ln
      v?Log10
      v?Log1p
    Trigonometric Functions
      v?Cos
      v?Sin
      v?SinCos
      v?CIS
      v?Tan
      v?Acos
      v?Asin
      v?Atan
      v?Atan2
    Hyperbolic Functions
      v?Cosh
      v?Sinh
      v?Tanh
      v?Acosh
      v?Asinh
      v?Atanh
    Special Functions
      v?Erf
      v?Erfc
      v?CdfNorm
      v?ErfInv
      v?ErfcInv
      v?CdfNormInv
      v?LGamma
      v?TGamma
    Rounding Functions
      v?Floor
      v?Ceil
      v?Trunc
      v?Round
      v?NearbyInt
      v?Rint
      v?Modf
  VML Pack/Unpack Functions
    v?Pack
    v?Unpack
  VML Service Functions
    vmlSetMode
    vmlGetMode
    vmlSetErrStatus
    vmlGetErrStatus
    vmlClearErrStatus
    vmlSetErrorCallBack
    vmlGetErrorCallBack
    vmlClearErrorCallBack
Chapter 10: Statistical Functions
  Random Number Generators
    Conventions
      Mathematical Notation
      Naming Conventions
    Basic Generators
      BRNG Parameter Definition
      Random Streams
      Data Types
    Error Reporting
    VSL RNG Usage Model
    Service Routines
      vslNewStream
      vslNewStreamEx
      vsliNewAbstractStream
      vsldNewAbstractStream
      vslsNewAbstractStream
      vslDeleteStream
      vslCopyStream
      vslCopyStreamState
      vslSaveStreamF
      vslLoadStreamF
      vslSaveStreamM
      vslLoadStreamM
      vslGetStreamSize
      vslLeapfrogStream
      vslSkipAheadStream
      vslGetStreamStateBrng
      vslGetNumRegBrngs
    Distribution Generators
      Continuous Distributions
      Discrete Distributions
    Advanced Service Routines
      Data types
      vslRegisterBrng
      vslGetBrngProperties
      Formats for User-Designed Generators
  Convolution and Correlation
    Naming Conventions
    Data Types
    Parameters
    Task Status and Error Reporting
    Task Constructors
      vslConvNewTask/vslCorrNewTask
      vslConvNewTask1D/vslCorrNewTask1D
      vslConvNewTaskX/vslCorrNewTaskX
      vslConvNewTaskX1D/vslCorrNewTaskX1D
    Task Editors
      vslConvSetMode/vslCorrSetMode
      vslConvSetInternalPrecision/vslCorrSetInternalPrecision
      vslConvSetStart/vslCorrSetStart
      vslConvSetDecimation/vslCorrSetDecimation
    Task Execution Routines
      vslConvExec/vslCorrExec
      vslConvExec1D/vslCorrExec1D
      vslConvExecX/vslCorrExecX
      vslConvExecX1D/vslCorrExecX1D
    Task Destructors
      vslConvDeleteTask/vslCorrDeleteTask
    Task Copy
      vslConvCopyTask/vslCorrCopyTask
    Usage Examples
    Mathematical Notation and Definitions
    Data Allocation
  VSL Summary Statistics
    Naming Conventions
    Data Types
    Parameters
    Task Status and Error Reporting
    Task Constructors
      vslSSNewTask
    Task Editors
      vslSSEditTask
      vslSSEditMoments
      vslSSEditCovCor
      vslSSEditPartialCovCor
      vslSSEditQuantiles
      vslSSEditStreamQuantiles
      vslSSEditPooledCovariance
      vslSSEditRobustCovariance
      vslSSEditOutliersDetection
      vslSSEditMissingValues
      vslSSEditCorParameterization
    Task Computation Routines
      vslSSCompute
    Task Destructor
      vslSSDeleteTask
    Usage Examples
    Mathematical Notation and Definitions
Chapter 11: Fourier Transform Functions
  FFT Functions
    Computing an FFT
    FFT Interface
    Descriptor Manipulation Functions
      DftiCreateDescriptor
      DftiCommitDescriptor
      DftiFreeDescriptor
      DftiCopyDescriptor
    FFT Computation Functions
      DftiComputeForward
      DftiComputeBackward
    Descriptor Configuration Functions
      DftiSetValue
      DftiGetValue
    Status Checking Functions
      DftiErrorClass
      DftiErrorMessage
    Configuration Settings
      DFTI_PRECISION
      DFTI_FORWARD_DOMAIN
      DFTI_DIMENSION, DFTI_LENGTHS
      DFTI_PLACEMENT
      DFTI_FORWARD_SCALE, DFTI_BACKWARD_SCALE
      DFTI_NUMBER_OF_USER_THREADS
      DFTI_INPUT_STRIDES, DFTI_OUTPUT_STRIDES
      DFTI_NUMBER_OF_TRANSFORMS
      DFTI_INPUT_DISTANCE, DFTI_OUTPUT_DISTANCE
      DFTI_COMPLEX_STORAGE, DFTI_REAL_STORAGE, DFTI_CONJUGATE_EVEN_STORAGE
      DFTI_PACKED_FORMAT
      DFTI_WORKSPACE
      DFTI_COMMIT_STATUS
      DFTI_ORDERING
  Cluster FFT Functions
    Computing Cluster FFT
    Distributing Data among Processes
    Cluster FFT Interface
    Descriptor Manipulation Functions
      DftiCreateDescriptorDM
      DftiCommitDescriptorDM
      DftiFreeDescriptorDM
    FFT Computation Functions
      DftiComputeForwardDM
      DftiComputeBackwardDM
    Descriptor Configuration Functions
      DftiSetValueDM
      DftiGetValueDM
    Error Codes
Chapter 12: PBLAS Routines
  Overview
  Routine Naming Conventions
  PBLAS Level 1 Routines
    p?amax
    p?asum
    p?axpy
    p?copy
    p?dot
    p?dotc
    p?dotu
    p?nrm2
    p?scal
    p?swap
  PBLAS Level 2 Routines
    p?gemv
    p?agemv
    p?ger
    p?gerc
    p?geru
    p?hemv
    p?ahemv
    p?her
    p?her2
    p?symv
    p?asymv
    p?syr
    p?syr2
    p?trmv
    p?atrmv
    p?trsv
  PBLAS Level 3 Routines
    p?geadd
    p?tradd
    p?gemm
    p?hemm
    p?herk
    p?her2k
    p?symm
    p?syrk
    p?syr2k
    p?tran
    p?tranu
    p?tranc
    p?trmm
    p?trsm
Chapter 13: Partial Differential Equations Support
  Trigonometric Transform Routines
    Transforms Implemented
    Sequence of Invoking TT Routines
    Interface Description
    TT Routines
      ?_init_trig_transform
      ?_commit_trig_transform
      ?_forward_trig_transform
      ?_backward_trig_transform
      free_trig_transform
    Common Parameters
    Implementation Details
  Poisson Library Routines
    Poisson Library Implemented
    Sequence of Invoking PL Routines
    Interface Description
    PL Routines for the Cartesian Solver
      ?_init_Helmholtz_2D/?_init_Helmholtz_3D
      ?_commit_Helmholtz_2D/?_commit_Helmholtz_3D
      ?_Helmholtz_2D/?_Helmholtz_3D
      free_Helmholtz_2D/free_Helmholtz_3D
    PL Routines for the Spherical Solver
      ?_init_sph_p/?_init_sph_np
      ?_commit_sph_p/?_commit_sph_np
      ?_sph_p/?_sph_np
      free_sph_p/free_sph_np
    Common Parameters
    Implementation Details
  Calling PDE Support Routines from Fortran 90
Chapter 14: Nonlinear Optimization Problem Solvers
  Organization and Implementation
  Routine Naming Conventions
  Nonlinear Least Squares Problem without Constraints
    ?trnlsp_init
    ?trnlsp_check
    ?trnlsp_solve
    ?trnlsp_get
    ?trnlsp_delete
  Nonlinear Least Squares Problem with Linear (Bound) Constraints
    ?trnlspbc_init
    ?trnlspbc_check
    ?trnlspbc_solve
    ?trnlspbc_get
    ?trnlspbc_delete
  Jacobian Matrix Calculation Routines
    ?jacobi_init
    ?jacobi_solve
    ?jacobi_delete
    ?jacobi
    ?jacobix
Chapter 15: Support Functions
  Version Information Functions
    mkl_get_version
    mkl_get_version_string
  Threading Control Functions
    mkl_set_num_threads
    mkl_domain_set_num_threads
    mkl_set_dynamic
    mkl_get_max_threads
    mkl_domain_get_max_threads
    mkl_get_dynamic
  Error Handling Functions
    xerbla
    pxerbla
  Equality Test Functions
    lsame
    lsamen
  Timing Functions
    second/dsecnd
    mkl_get_cpu_clocks
    mkl_get_cpu_frequency
    mkl_get_max_cpu_frequency
    mkl_get_clocks_frequency
  Memory Functions
    mkl_free_buffers
    mkl_thread_free_buffers
    mkl_disable_fast_mm
    mkl_mem_stat
    mkl_malloc
    mkl_free
    Examples of mkl_malloc(), mkl_free(), mkl_mem_stat() Usage
  Miscellaneous Utility Functions
    mkl_progress
    mkl_enable_instructions
  Functions Supporting the Single Dynamic Library
    mkl_set_interface_layer
    mkl_set_threading_layer
    mkl_set_xerbla
    mkl_set_progress
Chapter 16: BLACS Routines
  Matrix Shapes
  BLACS Combine Operations
    ?gamx2d
    ?gamn2d
    ?gsum2d
  BLACS Point To Point Communication
    ?gesd2d
    ?trsd2d
    ?gerv2d
    ?trrv2d
  BLACS Broadcast Routines
    ?gebs2d
    ?trbs2d
    ?gebr2d
    ?trbr2d
  BLACS Support Routines
    Initialization Routines
      blacs_pinfo
      blacs_setup
      blacs_get
      blacs_set
      blacs_gridinit
      blacs_gridmap
    Destruction Routines
      blacs_freebuff
      blacs_gridexit
      blacs_abort
      blacs_exit
    Informational Routines
      blacs_gridinfo
      blacs_pnum
      blacs_pcoord
    Miscellaneous Routines
      blacs_barrier
  Examples of BLACS Routines Usage
Chapter 17: Data Fitting Functions
  Naming Conventions
  Data Types
  Mathematical Conventions
  Data Fitting Usage Model
  Data Fitting Usage Examples
  Task Status and Error Reporting
  Task Creation and Initialization Routines
    df?newtask1d
  Task Editors
    df?editppspline1d
    df?editptr
    dfieditval
    df?editidxptr
  Computational Routines
    df?construct1d
    df?interpolate1d/df?interpolateex1d
    df?integrate1d/df?integrateex1d
    df?searchcells1d/df?searchcellsex1d
    df?interpcallback
    df?integrcallback
    df?searchcellscallback
  Task Destructors
    dfdeletetask
Appendix A: Linear Solvers Basics
  Sparse Linear Systems
    Matrix Fundamentals
    Direct Method
    Sparse Matrix Storage Formats
Appendix B: Routine and Function Arguments
  Vector Arguments in BLAS
  Vector Arguments in VML
  Matrix Arguments
Appendix C: Code Examples
  BLAS Code Examples
  Fourier Transform Functions Code Examples
    FFT Code Examples
      Examples of Using Multi-Threading for FFT Computation
    Examples for Cluster FFT Functions
    Auxiliary Data Transformations
Appendix D: CBLAS Interface to the BLAS
  CBLAS Arguments
  Level 1 CBLAS
  Level 2 CBLAS
  Level 3 CBLAS
  Sparse CBLAS
Appendix E: Specific Features of Fortran 95 Interfaces for LAPACK Routines
  Interfaces Identical to Netlib
  Interfaces with Replaced Argument Names
  Modified Netlib Interfaces
  Interfaces Absent From Netlib
  Interfaces of New Functionality
Appendix F: FFTW Interface to Intel® Math Kernel Library
  Notational Conventions
  FFTW2 Interface to Intel® Math Kernel Library
    Wrappers Reference
      One-dimensional Complex-to-complex FFTs
      Multi-dimensional Complex-to-complex FFTs
      One-dimensional Real-to-half-complex/Half-complex-to-real FFTs
      Multi-dimensional Real-to-complex/Complex-to-real FFTs
      Multi-threaded FFTW
      FFTW Support Functions
      Limitations of the FFTW2 Interface to Intel MKL
    Calling Wrappers from Fortran
    Installation
      Creating the Wrapper Library
      Application Assembling
      Running Examples
    MPI FFTW Wrappers
      MPI FFTW Wrappers Reference
      Creating MPI FFTW Wrapper Library
      Application Assembling with MPI FFTW Wrapper Library
      Running Examples
  FFTW3 Interface to Intel® Math Kernel Library
    Using FFTW3 Wrappers
    Calling Wrappers from Fortran
    Building Your Own Wrapper Library
    Building an Application
    Running Examples
    MPI FFTW Wrappers
      Building Your Own Wrapper Library
      Building an Application
      Running Examples
Appendix G: Bibliography
Appendix H: Glossary

Legal Information

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR.

Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.

Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm

Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. Go to: http://www.intel.com/products/processor_number/

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary.
You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

BlueMoon, BunnyPeople, Celeron, Celeron Inside, Centrino, Centrino Inside, Cilk, Core Inside, E-GOLD, i960, Intel, the Intel logo, Intel AppUp, Intel Atom, Intel Atom Inside, Intel Core, Intel Inside, Intel Insider, the Intel Inside logo, Intel NetBurst, Intel NetMerge, Intel NetStructure, Intel SingleDriver, Intel SpeedStep, Intel Sponsors of Tomorrow., the Intel Sponsors of Tomorrow. logo, Intel StrataFlash, Intel vPro, Intel XScale, InTru, the InTru logo, the InTru Inside logo, InTru soundmark, Itanium, Itanium Inside, MCS, MMX, Moblin, Pentium, Pentium Inside, Puma, skoool, the skoool logo, SMARTi, Sound Mark, The Creators Project, The Journey Inside, Thunderbolt, Ultrabook, vPro Inside, VTune, Xeon, Xeon Inside, X-GOLD, XMM, X-PMU and XPOSYS are trademarks of Intel Corporation in the U.S. and/or other countries.

*Other names and brands may be claimed as the property of others.

Microsoft, Windows, Visual Studio, Visual C++, and the Windows logo are trademarks, or registered trademarks of Microsoft Corporation in the United States and/or other countries.

Java is a registered trademark of Oracle and/or its affiliates.

Third Party Content

Intel® Math Kernel Library (Intel® MKL) includes content from several 3rd party sources that was originally governed by the licenses referenced below:

• Portions © Copyright 2001 Hewlett-Packard Development Company, L.P.
• Sections on the Linear Algebra PACKage (LAPACK) routines include derivative work portions that have been copyrighted: © 1991, 1992, and 1998 by The Numerical Algorithms Group, Ltd.
• Intel MKL fully supports LAPACK 3.3 set of computational, driver, auxiliary and utility routines under the following license:

  Copyright © 1992-2010 The University of Tennessee. All rights reserved.
  Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

  • Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
  • Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer listed in this license in the documentation and/or other materials provided with the distribution.
  • Neither the name of the copyright holders nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

  The original versions of LAPACK from which that part of Intel MKL was derived can be obtained from http://www.netlib.org/lapack/index.html. The authors of LAPACK are E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen.

• The original versions of the Basic Linear Algebra Subprograms (BLAS) from which the respective part of Intel® MKL was derived can be obtained from http://www.netlib.org/blas/index.html.
• The original versions of the Basic Linear Algebra Communication Subprograms (BLACS) from which the respective part of Intel MKL was derived can be obtained from http://www.netlib.org/blacs/index.html. The authors of BLACS are Jack Dongarra and R. Clint Whaley.
• The original versions of Scalable LAPACK (ScaLAPACK) from which the respective part of Intel® MKL was derived can be obtained from http://www.netlib.org/scalapack/index.html. The authors of ScaLAPACK are L. S. Blackford, J. Choi, A. Cleary, E. D'Azevedo, J. Demmel, I. Dhillon, J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, and R. C. Whaley.
• The original versions of the Parallel Basic Linear Algebra Subprograms (PBLAS) routines from which the respective part of Intel® MKL was derived can be obtained from http://www.netlib.org/scalapack/html/pblas_qref.html.
• PARDISO (PARallel DIrect SOlver)* in Intel® MKL is compliant with the 3.2 release of PARDISO that is freely distributed by the University of Basel. It can be obtained at http://www.pardiso-project.org.
• Some Fast Fourier Transform (FFT) functions in this release of Intel® MKL have been generated by the SPIRAL software generation system (http://www.spiral.net/) under license from Carnegie Mellon University. The authors of SPIRAL are Markus Puschel, Jose Moura, Jeremy Johnson, David Padua, Manuela Veloso, Bryan Singer, Jianxin Xiong, Franz Franchetti, Aca Gacic, Yevgen Voronenko, Kang Chen, Robert W. Johnson, and Nick Rizzolo.

Copyright © 1994-2011, Intel Corporation. All rights reserved.

Introducing the Intel® Math Kernel Library

The Intel® Math Kernel Library (Intel® MKL) improves performance of scientific, engineering, and financial software that solves large computational problems.
Among other functionality, Intel MKL provides linear algebra routines, fast Fourier transforms, as well as vectorized math and random number generation functions, all optimized for the latest Intel processors, including processors with multiple cores (see the Intel® MKL Release Notes for the full list of supported processors). Intel MKL also performs well on non-Intel processors.

Intel MKL is thread-safe and extensively threaded using the OpenMP* technology.

For more details about functionality provided by Intel MKL, see the Function Domains section.

Optimization Notice

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

Getting Help and Support

Getting Help

The online version of the Intel® Math Kernel Library (Intel® MKL) Reference Manual integrates into the Microsoft Visual Studio* development system help on Windows* OS or into the Eclipse* development system help on Linux* OS. For information on how to use the online help, see the Intel MKL User's Guide.

Getting Technical Support

Intel MKL provides a product web site that offers timely and comprehensive product information, including product features, white papers, and technical articles.
For the latest information, check: http://www.intel.com/software/products/support.

Intel also provides a support web site that contains a rich repository of self-help information, including getting started tips, known product issues, product errata, license information, user forums, and more (visit http://www.intel.com/software/products/).

Registering your product entitles you to one year of technical support and product updates through Intel® Premier Support. Intel Premier Support is an interactive issue management and communication web site providing these services:

• Submit issues and review their status.
• Download product updates at any time of day.

To register your product, contact Intel, or seek product support, please visit http://www.intel.com/software/products/support.

What's New

This Reference Manual documents the Intel® Math Kernel Library (Intel® MKL) 10.3 Update 8 release.

The following function domains were updated in Intel MKL 10.3 Update 8 with new functions, enhancements to the existing functionality, or improvements to the existing documentation:

• New data fitting functions provide spline-based interpolation capabilities that you can use to approximate functions, function derivatives or function integrals, and perform cell search operations. See Data Fitting Functions.
• The Fourier transform documentation has been updated and improved, especially in the descriptions of configuration settings that define the forward domain of the transform (see DFTI_FORWARD_DOMAIN), memory layout of the input/output data (see DFTI_INPUT_STRIDES, DFTI_OUTPUT_STRIDES), distances between consecutive data sets for computing multiple transforms (see DFTI_INPUT_DISTANCE, DFTI_OUTPUT_DISTANCE), and storage schemes (see DFTI_COMPLEX_STORAGE, DFTI_REAL_STORAGE).

Additionally, several minor updates have been made to correct errors in the manual.
Notational Conventions

This manual uses the following terms to refer to operating systems:

Windows* OS
    This term refers to information that is valid on all supported Windows* operating systems.
Linux* OS
    This term refers to information that is valid on all supported Linux* operating systems.
Mac OS* X
    This term refers to information that is valid on Intel®-based systems running the Mac OS* X operating system.

This manual uses the following notational conventions:

• Routine name shorthand (for example, ?ungqr instead of cungqr/zungqr).
• Font conventions used for distinction between the text and the code.

Routine Name Shorthand

For shorthand, names that contain a question mark "?" represent groups of routines with similar functionality. Each group typically consists of routines used with four basic data types: single-precision real, double-precision real, single-precision complex, and double-precision complex. The question mark is used to indicate any or all possible varieties of a function; for example:

?swap
    Refers to all four data types of the vector-vector ?swap routine: sswap, dswap, cswap, and zswap.

Font Conventions

The following font conventions are used:

UPPERCASE COURIER
    Data type used in the description of input and output parameters for Fortran interface. For example, CHARACTER*1.
lowercase courier
    Code examples: a(k+i,j) = matrix(i,j) and data types for C interface, for example, const float*
lowercase courier mixed with UpperCase courier
    Function names for C interface, for example, vmlSetMode
lowercase courier italic
    Variables in arguments and parameters description. For example, incx.
*
    Used as a multiplication symbol in code examples and equations and where required by the Fortran syntax.
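Returning to the routine name shorthand described above, the following minimal C sketch expands a name containing "?" into the four typical variants by substituting the type prefixes s, d, c, and z. The helper name expand_shorthand and the buffer size SHORTHAND_BUF are hypothetical and are not part of Intel MKL. The sketch also assumes that all four variants exist, which does not hold for every group (for example, ?ungqr stands only for cungqr/zungqr).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SHORTHAND_BUF 32  /* illustrative buffer size, not an MKL constant */

/* Hypothetical helper: replace the first '?' in a shorthand name with each
 * of the four type prefixes (s, d, c, z), writing the results to out[0..3].
 * Returns 0 on success, -1 if the name has no '?' or does not fit. */
int expand_shorthand(const char *shorthand, char out[4][SHORTHAND_BUF])
{
    static const char prefixes[4] = { 's', 'd', 'c', 'z' };
    const char *mark;
    size_t pos;
    int i;

    assert(shorthand != NULL && out != NULL);

    mark = strchr(shorthand, '?');
    if (mark == NULL || strlen(shorthand) >= SHORTHAND_BUF)
        return -1;

    pos = (size_t)(mark - shorthand);
    for (i = 0; i < 4; ++i) {
        strcpy(out[i], shorthand);   /* safe: length was checked above */
        out[i][pos] = prefixes[i];   /* substitute the data type prefix */
    }
    return 0;
}
```

Because the helper replaces the first "?" wherever it occurs, it also handles names in which the question mark is not the first character, such as p?trsm.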
Chapter 1: Function Domains

The Intel® Math Kernel Library includes Fortran routines and functions optimized for Intel® processor-based computers running operating systems that support multiprocessing. In addition to the Fortran interface, Intel MKL includes a C-language interface for the Discrete Fourier transform functions, as well as for the Vector Mathematical Library and Vector Statistical Library functions. For hardware and software requirements to use Intel MKL, see Intel® MKL Release Notes.

The Intel® Math Kernel Library includes the following groups of routines:

• Basic Linear Algebra Subprograms (BLAS):
  – vector operations
  – matrix-vector operations
  – matrix-matrix operations
• Sparse BLAS Level 1, 2, and 3 (basic operations on sparse vectors and matrices)
• LAPACK routines for solving systems of linear equations
• LAPACK routines for solving least squares problems, eigenvalue and singular value problems, and Sylvester's equations
• Auxiliary and utility LAPACK routines
• ScaLAPACK computational, driver and auxiliary routines (only in Intel MKL for Linux* and Windows* operating systems)
• PBLAS routines for distributed vector, matrix-vector, and matrix-matrix operations
• Direct and Iterative Sparse Solver routines
• Vector Mathematical Library (VML) functions for computing core mathematical functions on vector arguments (with Fortran and C interfaces)
• Vector Statistical Library (VSL) functions for generating vectors of pseudorandom numbers with different types of statistical distributions and for performing convolution and correlation computations
• General Fast Fourier Transform (FFT) Functions, providing fast computation of Discrete Fourier Transform via the FFT algorithms and having Fortran and C interfaces
• Cluster FFT functions (only in Intel MKL for Linux* and Windows* operating systems)
• Tools for solving partial differential equations: trigonometric transform routines and Poisson solver
• Optimization Solver routines for solving nonlinear least squares problems through the Trust-Region (TR) algorithms and computing the Jacobian matrix by central differences
• Basic Linear Algebra Communication Subprograms (BLACS) that are used to support a linear algebra oriented message passing interface
• Data Fitting functions for spline-based approximation of functions, derivatives and integrals of functions, and search
• GMP arithmetic functions

For specific issues on using the library, also see the Intel® MKL Release Notes.

BLAS Routines

The BLAS routines and functions are divided into the following groups according to the operations they perform:

• BLAS Level 1 Routines perform operations of both addition and reduction on vectors of data. Typical operations include scaling and dot products.
• BLAS Level 2 Routines perform matrix-vector operations, such as matrix-vector multiplication, rank-1 and rank-2 matrix updates, and solution of triangular systems.
• BLAS Level 3 Routines perform matrix-matrix operations, such as matrix-matrix multiplication, rank-k update, and solution of triangular systems.

Starting from release 8.0, Intel® MKL also supports the Fortran 95 interface to the BLAS routines.
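To make the three BLAS levels concrete, here is a minimal reference sketch in plain C with one naive loop nest per level: a dot product (Level 1), a matrix-vector product (Level 2), and a matrix-matrix product (Level 3). These are illustrations only and do not show how Intel MKL implements or exposes its routines. The names ref_dot, ref_gemv, ref_gemm, and the IDX macro are hypothetical. The sketch assumes column-major (Fortran-style) storage, unit increments, and no scaling factors.

```c
#include <assert.h>
#include <stddef.h>

/* Column-major indexing (0-based): element (i,j) of a matrix whose
 * leading dimension is ld. Hypothetical helper macro. */
#define IDX(i, j, ld) ((size_t)(j) * (size_t)(ld) + (size_t)(i))

/* Level 1 (vector-vector): dot product of x and y, both of length n. */
double ref_dot(int n, const double *x, const double *y)
{
    double sum = 0.0;
    int i;

    assert(n >= 0);
    for (i = 0; i < n; ++i)
        sum += x[i] * y[i];
    return sum;
}

/* Level 2 (matrix-vector): y = A*x for an m-by-n matrix A. */
void ref_gemv(int m, int n, const double *a, const double *x, double *y)
{
    int i, j;

    assert(m >= 0 && n >= 0);
    for (i = 0; i < m; ++i)
        y[i] = 0.0;
    for (j = 0; j < n; ++j)          /* walk A column by column */
        for (i = 0; i < m; ++i)
            y[i] += a[IDX(i, j, m)] * x[j];
}

/* Level 3 (matrix-matrix): C = A*B, with A m-by-k, B k-by-n, C m-by-n. */
void ref_gemm(int m, int n, int k,
              const double *a, const double *b, double *c)
{
    int i, j, p;

    assert(m >= 0 && n >= 0 && k >= 0);
    for (j = 0; j < n; ++j) {
        for (i = 0; i < m; ++i)
            c[IDX(i, j, m)] = 0.0;
        for (p = 0; p < k; ++p)      /* accumulate column j of C */
            for (i = 0; i < m; ++i)
                c[IDX(i, j, m)] += a[IDX(i, p, m)] * b[IDX(p, j, k)];
    }
}
```

The loop depth grows with the level: one loop over vector elements, two loops over a matrix, and three loops over a pair of matrices.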
Starting from release 10.1, a number of BLAS-like Extensions are added to enable the user to perform certain data manipulation, including matrix in-place and out-of-place transposition operations combined with simple matrix arithmetic operations.

Sparse BLAS Routines
The Sparse BLAS Level 1 Routines and Functions and the Sparse BLAS Level 2 and Level 3 Routines operate on sparse vectors and matrices. These routines perform operations similar to the BLAS Level 1, 2, and 3 routines. The Sparse BLAS routines take advantage of vector and matrix sparsity: they allow you to store only non-zero elements of vectors and matrices. Intel MKL also supports the Fortran 95 interface to Sparse BLAS routines.

LAPACK Routines
The Intel® Math Kernel Library fully supports the LAPACK 3.1 set of computational, driver, auxiliary and utility routines.
The original versions of LAPACK from which that part of Intel MKL was derived can be obtained from http://www.netlib.org/lapack/index.html. The authors of LAPACK are E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen.
The LAPACK routines can be divided into the following groups according to the operations they perform:
• Routines for solving systems of linear equations, factoring and inverting matrices, and estimating condition numbers (see Chapter 3).
• Routines for solving least squares problems, eigenvalue and singular value problems, and Sylvester's equations (see Chapter 4).
• Auxiliary and utility routines used to perform certain subtasks, common low-level computation or related tasks (see Chapter 5).

Starting from release 8.0, Intel MKL also supports the Fortran 95 interface to LAPACK computational and driver routines. This interface provides an opportunity for simplified calls of LAPACK routines with fewer required arguments.

ScaLAPACK Routines
The ScaLAPACK package (included only with the Intel® MKL versions for Linux* and Windows* operating systems, see Chapter 6 and Chapter 7) runs on distributed-memory architectures and includes routines for solving systems of linear equations, solving linear least squares problems, eigenvalue and singular value problems, as well as performing a number of related computational tasks.
The original versions of ScaLAPACK from which that part of Intel MKL was derived can be obtained from http://www.netlib.org/scalapack/index.html. The authors of ScaLAPACK are L. Blackford, J. Choi, A. Cleary, E. D'Azevedo, J. Demmel, I. Dhillon, J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, and R. Whaley.
The Intel MKL version of ScaLAPACK is optimized for Intel® processors and uses the MPICH version of MPI as well as Intel MPI.

PBLAS Routines
The PBLAS routines perform operations with distributed vectors and matrices.
• PBLAS Level 1 Routines perform operations of both addition and reduction on vectors of data. Typical operations include scaling and dot products.
• PBLAS Level 2 Routines perform distributed matrix-vector operations, such as matrix-vector multiplication, rank-1 and rank-2 matrix updates, and solution of triangular systems.
• PBLAS Level 3 Routines perform distributed matrix-matrix operations, such as matrix-matrix multiplication, rank-k update, and solution of triangular systems.

Intel MKL provides the PBLAS routines with an interface similar to the interface used in the Netlib PBLAS (part of the ScaLAPACK package, see http://www.netlib.org/scalapack/html/pblas_qref.html).

Sparse Solver Routines
Direct sparse solver routines in Intel MKL (see Chapter 8) solve systems with symmetric and symmetrically-structured sparse matrices with real or complex coefficients. For symmetric matrices, these Intel MKL subroutines can solve both positive-definite and indefinite systems. Intel MKL includes the PARDISO* sparse solver interface as well as an alternative set of user callable direct sparse solver routines.
If you use the sparse solver PARDISO* from Intel MKL, please cite:
O. Schenk and K. Gärtner. Solving unsymmetric sparse systems of linear equations with PARDISO. J. of Future Generation Computer Systems, 20(3):475-487, 2004.
Intel MKL also provides an iterative sparse solver (see Chapter 8) that uses Sparse BLAS level 2 and 3 routines and works with different sparse data formats.

VML Functions
The Vector Mathematical Library (VML) functions (see Chapter 9) include a set of highly optimized implementations of certain computationally expensive core mathematical functions (power, trigonometric, exponential, hyperbolic, etc.) that operate on vectors of real and complex numbers.
Application programs that might significantly improve performance with VML include nonlinear programming software, integrals computation, and many others. VML provides interfaces both for Fortran and C languages.

Statistical Functions
The Vector Statistical Library (VSL) contains three sets of functions (see Chapter 10):
• The first set includes a collection of pseudo- and quasi-random number generator subroutines implementing basic continuous and discrete distributions. To provide best performance, the VSL subroutines use calls to highly optimized Basic Random Number Generators (BRNGs) and a library of vector mathematical functions.
• The second set includes a collection of routines that implement a wide variety of convolution and correlation operations.
• The third set includes a collection of routines for initial statistical analysis of raw single and double precision multi-dimensional datasets.

Fourier Transform Functions
The Intel® MKL multidimensional Fast Fourier Transform (FFT) functions with mixed radix support (see Chapter 11) provide uniformity of discrete Fourier transform computation and combine functionality with ease of use. Both Fortran and C interface specifications are given. There is also a cluster version of FFT functions, which runs on distributed-memory architectures and is provided only in Intel MKL versions for the Linux* and Windows* operating systems.
The FFT functions provide fast computation via the FFT algorithms for arbitrary lengths. See the Intel® MKL User's Guide for the specific radices supported.

Partial Differential Equations Support
Intel® MKL provides tools for solving Partial Differential Equations (PDE) (see Chapter 13). These tools are the Trigonometric Transform interface routines and the Poisson Library.
The Trigonometric Transform routines may be helpful to users who implement their own solvers similar to the solver that the Poisson Library provides. The users can improve performance of their solvers by using fast sine, cosine, and staggered cosine transforms implemented in the Trigonometric Transform interface.
The Poisson Library is designed for fast solving of simple Helmholtz, Poisson, and Laplace problems. The Trigonometric Transform interface, which underlies the solver, is based on the Intel MKL FFT interface (refer to Chapter 11), optimized for Intel® processors.

Nonlinear Optimization Problem Solvers
Intel® MKL provides Nonlinear Optimization Problem Solver routines (see Chapter 14) that can be used to solve nonlinear least squares problems with or without linear (bound) constraints through the Trust-Region (TR) algorithms and compute the Jacobi matrix by central differences.
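As an illustration of what computing a Jacobi matrix by central differences means, the C sketch below approximates each entry as J(i,j) = (f_i(x + h*e_j) - f_i(x - h*e_j)) / (2*h), where e_j is the j-th unit vector. The names vec_fn, jacobian_central, and example_fn, the fixed step h, and the column-major output layout are assumptions made for this example only. This is not the interface of the Intel MKL routines described in Chapter 14.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback type: evaluates f: R^n -> R^m at x, writing f[0..m-1]. */
typedef void (*vec_fn)(size_t m, size_t n, const double *x, double *f);

/* Fills jac (m-by-n, stored by columns: entry (i,j) is jac[i + j*m]) with
   central-difference estimates. fp and fm are caller-supplied work arrays
   of length m. x is perturbed temporarily and restored before returning. */
void jacobian_central(vec_fn fn, size_t m, size_t n, double *x, double h,
                      double *fp, double *fm, double *jac)
{
    assert(h > 0.0);
    for (size_t j = 0; j < n; ++j) {
        const double xj = x[j];
        x[j] = xj + h;
        fn(m, n, x, fp);              /* f(x + h*e_j) */
        x[j] = xj - h;
        fn(m, n, x, fm);              /* f(x - h*e_j) */
        x[j] = xj;                    /* restore the original coordinate */
        for (size_t i = 0; i < m; ++i)
            jac[i + j * m] = (fp[i] - fm[i]) / (2.0 * h);
    }
}

/* Hypothetical test function: f(x) = (x0^2 + x1, 3*x0*x1). */
void example_fn(size_t m, size_t n, const double *x, double *f)
{
    (void)m;
    (void)n;
    f[0] = x[0] * x[0] + x[1];
    f[1] = 3.0 * x[0] * x[1];
}
```

Each column costs two function evaluations, so the whole matrix needs 2*n evaluations of f. The truncation error of the central formula is of order h^2, compared with order h for a one-sided difference.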

Support Functions
The Intel® MKL support functions (see Chapter 15) are used to support the operation of the Intel MKL software and provide basic information on the library and library operation, such as the current library version, timing, setting and measuring of CPU frequency, error handling, and memory allocation.
Starting from release 10.0, the Intel MKL support functions provide additional threading control.
Starting from release 10.1, Intel MKL selectively supports a Progress Routine feature to track progress of a lengthy computation and/or interrupt the computation using a callback function mechanism. The user application can define a function called mkl_progress that is regularly called from the Intel MKL routine supporting the progress routine feature. See the Progress Routines section in Chapter 15 for reference. Refer to a specific LAPACK or DSS/PARDISO function description to see whether the function supports this feature or not.

BLACS Routines
The Intel® Math Kernel Library implements routines from the BLACS (Basic Linear Algebra Communication Subprograms) package (see Chapter 16) that are used to support a linear algebra oriented message passing interface that may be implemented efficiently and uniformly across a large range of distributed memory platforms.
The original versions of BLACS from which that part of Intel MKL was derived can be obtained from http://www.netlib.org/blacs/index.html. The authors of BLACS are Jack Dongarra and R. Clint Whaley.

Data Fitting Functions
The Data Fitting component includes a set of highly-optimized implementations of algorithms for the following spline-based computations:
• spline construction
• interpolation including computation of derivatives and integration
• search

The algorithms operate on single- and double-precision vector-valued functions defined at the points of the given partition. You can use Data Fitting algorithms in applications that are based on data approximation.

GMP Arithmetic Functions
Intel® MKL implementation of GMP* arithmetic functions includes arbitrary precision arithmetic operations on integer numbers. The interfaces of such functions fully match the GNU Multiple Precision (GMP*) Arithmetic Library.
NOTE GMP Arithmetic Functions are deprecated and will be removed in a future Intel MKL release.

Performance Enhancements
The Intel® Math Kernel Library has been optimized by exploiting both processor and system features and capabilities. Special care has been given to those routines that most profit from cache-management techniques. These especially include matrix-matrix operation routines such as dgemm().
In addition, code optimization techniques have been applied to minimize dependencies of scheduling integer and floating-point units on the results within the processor.
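As a rough illustration of the kind of cache-management technique that benefits matrix-matrix routines, the C sketch below multiplies matrices in square tiles so that each tile is reused while it is likely still in cache. It is a minimal sketch under stated assumptions: the function name blocked_matmul_add, the tile size BLOCK, and the column-major layout are invented for this example, and the code does not represent how dgemm is implemented in Intel MKL.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK 32  /* hypothetical tile size; a good value depends on the cache */

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* C := C + A*B for n-by-n matrices stored by columns
   (entry (i,j) is at index i + j*n), processed tile by tile. */
void blocked_matmul_add(size_t n, const double *a, const double *b, double *c)
{
    assert(a != c && b != c);  /* the output must not alias an input */
    for (size_t jj = 0; jj < n; jj += BLOCK) {
        const size_t jmax = min_sz(jj + BLOCK, n);
        for (size_t pp = 0; pp < n; pp += BLOCK) {
            const size_t pmax = min_sz(pp + BLOCK, n);
            for (size_t ii = 0; ii < n; ii += BLOCK) {
                const size_t imax = min_sz(ii + BLOCK, n);
                /* Multiply one tile of A by one tile of B. */
                for (size_t j = jj; j < jmax; ++j)
                    for (size_t p = pp; p < pmax; ++p) {
                        const double bpj = b[p + j * n];
                        for (size_t i = ii; i < imax; ++i)
                            c[i + j * n] += a[i + p * n] * bpj;
                    }
            }
        }
    }
}
```

The arithmetic is the same as in an unblocked triple loop. Only the order of the iterations changes, so that the working set of the three innermost loops is bounded by the tile size rather than by n.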
The major optimization techniques used throughout the library include:
• Loop unrolling to minimize loop management costs
• Blocking of data to improve data reuse opportunities
• Copying to reduce chances of data eviction from cache
• Data prefetching to help hide memory latency
• Multiple simultaneous operations (for example, dot products in dgemm) to eliminate stalls due to arithmetic unit pipelines
• Use of hardware features such as the SIMD arithmetic units, where appropriate

These are techniques from which the arithmetic code benefits the most.

Parallelism
In addition to the performance enhancements discussed above, Intel® MKL offers performance gains through parallelism provided by the symmetric multiprocessing (SMP) feature. You can obtain improvements from SMP in the following ways:
• One way is based on user-managed threads in the program and further distribution of the operations over the threads based on data decomposition, domain decomposition, control decomposition, or some other parallelizing technique. Each thread can use any of the Intel MKL functions (except for the deprecated ?lacon LAPACK routine) because the library has been designed to be thread-safe.
• Another method is to use the FFT and BLAS level 3 routines. They have been parallelized and require no alterations of your application to gain the performance enhancements of multiprocessing. Performance using multiple processors on the level 3 BLAS shows excellent scaling. Since the threads are called and managed within the library, the application does not need to be recompiled thread-safe (see also Fortran 95 Interface Conventions in Chapter 2).
• Yet another method is to use tuned LAPACK routines. Currently these include the single- and double-precision flavors of routines for QR factorization of general matrices, triangular factorization of general and symmetric positive-definite matrices, solving systems of equations with such matrices, as well as solving symmetric eigenvalue problems.

For instructions on setting the number of available processors for the BLAS level 3 and LAPACK routines, see Intel® MKL User's Guide.

C Datatypes Specific to Intel MKL
The mkl_types.h file defines datatypes specific to Intel MKL.

C/C++ Type: MKL_INT (MKL integer)
  Fortran Type: INTEGER (default INTEGER)
  LP32 Equivalent (Size in Bytes): C/C++: int; Fortran: INTEGER*4 (4 bytes)
  LP64 Equivalent (Size in Bytes): C/C++: int; Fortran: INTEGER*4 (4 bytes)
  ILP64 Equivalent (Size in Bytes): C/C++: long long (or define the MKL_ILP64 macro); Fortran: INTEGER*8 (8 bytes)

C/C++ Type: MKL_UINT (MKL unsigned integer)
  Fortran Type: N/A
  LP32 Equivalent (Size in Bytes): C/C++: unsigned int (4 bytes)
  LP64 Equivalent (Size in Bytes): C/C++: unsigned int (4 bytes)
  ILP64 Equivalent (Size in Bytes): C/C++: unsigned long long (8 bytes)

C/C++ Type: MKL_LONG (MKL long integer)
  Fortran Type: N/A
  LP32 Equivalent (Size in Bytes): C/C++: long (4 bytes)
  LP64 Equivalent (Size in Bytes): C/C++: long (Windows: 4 bytes; Linux, Mac: 8 bytes)
  ILP64 Equivalent (Size in Bytes): C/C++: long (8 bytes)

C/C++ Type: MKL_Complex8 (like C99 complex float)
  Fortran Type: COMPLEX*8
  LP32 Equivalent (Size in Bytes): (8 bytes)
  LP64 Equivalent (Size in Bytes): (8 bytes)
  ILP64 Equivalent (Size in Bytes): (8 bytes)

C/C++ Type: MKL_Complex16 (like C99 complex double)
  Fortran Type: COMPLEX*16
  LP32 Equivalent (Size in Bytes): (16 bytes)
  LP64 Equivalent (Size in Bytes): (16 bytes)
  ILP64 Equivalent (Size in Bytes): (16 bytes)

You can redefine datatypes specific to Intel MKL. One reason to do this is if you have your own types which are binary-compatible with Intel MKL datatypes, with the same representation or memory layout. To redefine a datatype, use one of these methods:
• Insert the #define statement redefining the datatype before the mkl.h header file #include statement. For example,
    #define MKL_INT size_t
    #include "mkl.h"
• Use the compiler -D option to redefine the datatype. For example,
    ...-DMKL_INT=size_t...

NOTE As the user, if you redefine Intel MKL datatypes you are responsible for making sure that your definition is compatible with that of Intel MKL. Otherwise, the redefinition might cause unpredictable results or crash the application.

2 BLAS and Sparse BLAS Routines

This chapter describes the Intel® Math Kernel Library implementation of the BLAS and Sparse BLAS routines, and BLAS-like extensions.
The routine descriptions are arranged in several sections:
• BLAS Level 1 Routines (vector-vector operations)
• BLAS Level 2 Routines (matrix-vector operations)
• BLAS Level 3 Routines (matrix-matrix operations)
• Sparse BLAS Level 1 Routines (vector-vector operations)
• Sparse BLAS Level 2 and Level 3 Routines (matrix-vector and matrix-matrix operations)
• BLAS-like Extensions

Each section presents the routine and function group descriptions in alphabetical order by routine or function group name; for example, the ?asum group, the ?axpy group. The question mark in the group name corresponds to different character codes indicating the data type (s, d, c, and z or their combination); see Routine Naming Conventions.
When BLAS or Sparse BLAS routines encounter an error, they call the error reporting routine xerbla.
In BLAS Level 1 groups i?amax and i?amin, an "i" is placed before the data-type indicator and corresponds to the index of an element in the vector. These groups are placed at the end of the BLAS Level 1 section.

BLAS Routines

Routine Naming Conventions
BLAS routine names have the following structure:
<character> <name> <mod> ( )
The <character> field indicates the data type:
s    real, single precision
c    complex, single precision
d    real, double precision
z    complex, double precision

Some routines and functions can have combined character codes, such as sc or dz. For example, the function scasum uses a complex input array and returns a real value.
The <name> field, in BLAS level 1, indicates the operation type. For example, the BLAS level 1 routines ?dot, ?rot, ?swap compute a vector dot product, vector rotation, and vector swap, respectively.
In BLAS level 2 and 3, <name> reflects the matrix argument type:
ge   general matrix
gb   general band matrix
sy   symmetric matrix
sp   symmetric matrix (packed storage)
sb   symmetric band matrix
he   Hermitian matrix
hp   Hermitian matrix (packed storage)
hb   Hermitian band matrix
tr   triangular matrix
tp   triangular matrix (packed storage)
tb   triangular band matrix.
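The data-type characters and matrix argument codes listed above lend themselves to a simple table lookup. The C sketch below shows one way to decode them. The functions blas_type_name and blas_matrix_name are hypothetical helpers written for this illustration and are not part of Intel MKL. They cover only the single-character type codes and the two-letter level 2 and 3 matrix codes from the lists above, not the combined codes such as sc or dz.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns a description of a data-type character, or NULL if unknown. */
const char *blas_type_name(char ch)
{
    switch (ch) {
    case 's': return "real, single precision";
    case 'c': return "complex, single precision";
    case 'd': return "real, double precision";
    case 'z': return "complex, double precision";
    default:  return NULL;
    }
}

/* Returns a description of the two-letter matrix argument code at the
   start of `code` (for example "gemv" -> general matrix), or NULL. */
const char *blas_matrix_name(const char *code)
{
    static const struct { const char *code; const char *desc; } table[] = {
        {"ge", "general matrix"},
        {"gb", "general band matrix"},
        {"sy", "symmetric matrix"},
        {"sp", "symmetric matrix (packed storage)"},
        {"sb", "symmetric band matrix"},
        {"he", "Hermitian matrix"},
        {"hp", "Hermitian matrix (packed storage)"},
        {"hb", "Hermitian band matrix"},
        {"tr", "triangular matrix"},
        {"tp", "triangular matrix (packed storage)"},
        {"tb", "triangular band matrix"},
    };
    assert(code != NULL);
    for (size_t i = 0; i < sizeof table / sizeof table[0]; ++i)
        if (strncmp(code, table[i].code, 2) == 0)
            return table[i].desc;
    return NULL;
}
```

For a level 2 or 3 name such as dgemv, the first character would be passed to blas_type_name and the remainder of the name (the string starting at the second character) to blas_matrix_name.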
The <mod> field, if present, provides additional details of the operation. BLAS level 1 names can have the following characters in the <mod> field:
c    conjugated vector
u    unconjugated vector
g    Givens rotation construction
m    modified Givens rotation
mg   modified Givens rotation construction

BLAS level 2 names can have the following characters in the <mod> field:
mv   matrix-vector product
sv   solving a system of linear equations with a single unknown vector
r    rank-1 update of a matrix
r2   rank-2 update of a matrix.

BLAS level 3 names can have the following characters in the <mod> field:
mm   matrix-matrix product
sm   solving a system of linear equations with multiple unknown vectors
rk   rank-k update of a matrix
r2k  rank-2k update of a matrix.

The examples below illustrate how to interpret BLAS routine names:
ddot : double-precision real vector-vector dot product
cdotc : complex vector-vector dot product, conjugated
scasum : sum of magnitudes of vector elements, single precision real output and single precision complex input
cdotu : vector-vector dot product, unconjugated, complex
sgemv : matrix-vector product, general matrix, single precision
ztrmm