MADpre - a MADCAP code for map-making preconditioner pre-computation

Purpose

MADpre computes a preconditioner matrix useful in iterative map-making solvers based on the Preconditioned Conjugate Gradient (PCG) algorithm, such as the one implemented in the MADmap code. For polarized data sets it also permits a determination of the subset of all observed pixels for which the separation of all three Stokes parameters (I, Q, U) is possible.

MADpre can be applied to any data produced by a single dish experiment for which the linear model of the time ordered data, d = A s + n, is applicable. Here d is a measurement, s the sky signal, n the instrumental noise, and A a "pointing" matrix. The preconditioner matrix, P, is then defined as

  P = (A^t diag(N^{-1}) A)^{-1},

where N characterizes the time domain noise and diag(N^{-1}) denotes the diagonal part of its inverse. As a by-product the code computes "naive" (simple binned) maps, i.e., estimates of the sky signal, s. Those are given by

  m = P A^t diag(N^{-1}) d.

Running MADpre

Compilation

Calling syntax

[MADpre executable] [m3 config file (string of characters)] [# of samples per proc (int)] [# of pixels per proc (int)] [processing gang size (int)] [data gang size (int)] [in/out of core (1/0)]

[# data gang size] is useful if the processed data set is made of many single detector data streams which are to provide a single estimate of the sky signal. In such a case the parameter defines how many processors will process each single detector data set, implying that the data of [# of all procs]/[# data gang size] detectors are processed simultaneously. If the full data set contains only one detector's data then [# data gang size] has to be set equal to [# of all procs]. If the number of detectors is larger than [# of all procs]/[# data gang size], each data gang will process multiple single-detector data sets one after another.

[# processing gang size] defines the size of the processing gangs.
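As a toy illustration of these relations (this is not MADpre code: the pointing model, angles, and noise weights below are invented for the example), consider a single pixel observed many times by one polarization-sensitive detector with diagonal noise weights:

```python
import numpy as np

# Toy sketch of P = (A^t diag(N^-1) A)^-1 and m = P A^t diag(N^-1) d
# for one pixel observed by a polarization-sensitive detector.
rng = np.random.default_rng(0)

nsamp = 1000
psi = rng.uniform(0.0, np.pi, nsamp)           # polarizer angles (invented)
A = np.column_stack([np.ones(nsamp),           # pointing matrix rows:
                     np.cos(2 * psi),          # d_t = I + Q cos2psi + U sin2psi
                     np.sin(2 * psi)])
s_true = np.array([10.0, 1.0, -0.5])           # (I, Q, U)
w = np.full(nsamp, 4.0)                        # diagonal noise weights diag(N^-1)
d = A @ s_true + rng.normal(0.0, 0.5, nsamp)   # simulated time-ordered data

AtNA = A.T @ (w[:, None] * A)                  # 3x3 weight block for the pixel
P = np.linalg.inv(AtNA)                        # the preconditioner block
m = P @ (A.T @ (w * d))                        # naive (binned) map estimate

print(m)                                       # close to s_true
```

The 3x3 block AtNA is the per-pixel "weight" matrix whose invertibility decides whether all three Stokes parameters can be separated for that pixel.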
MPI communication in the code is performed only between processors of the same processing gang. Effectively, the time domain data are reduced by the code in two steps. First, each processing gang reduces the data from each of its member procs, producing single output objects (be it a submap or a part of the preconditioner). In the second step the objects produced by different processing gangs are combined together. This second reduction step is done via the disk. At any given time there is at most one I/O process per processing gang. The processing gang size should therefore not be too small, to avoid I/O contention; however, if it is too large, the communication overhead may become important.

Run modes

MADpre uses the disk to store some temporary, intermediate products of the processing in order to save memory. The usage of the disk can be minimized by requesting the in-the-core mode. This can be done by setting the last command line parameter to 1. The in-the-core mode is possible only if the pixel range defined on the command line exceeds the total number of pixels allowed in the pixelization scheme as defined in the input XML config file. If that is not the case, MADpre will continue in the out-of-the-core mode. The latter can be forced by setting the last command line parameter to 0.

Input/Output

The input files are all described in the M3 XML config file, the name of which is provided to MADpre on the command line. Check there for more details.

On output MADpre produces two kinds of 'legitimate' output files, but also leaves some temporary outputs which may (or may not) be useful for a user. All the files are binary and therefore non-portable. (This is because of the endian conventions on one hand, but also the different padding schemes adopted on different platforms.) To alleviate this problem to some extent we provide simple and easy-to-use IDL routines to read the output files.
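The interplay between [# of all procs], [# data gang size], and the number of detectors described above can be summarized in a few lines (a sketch with invented names, not MADpre code):

```python
import math

# Hypothetical helper illustrating the gang layout implied by the
# command line parameters described above (names are ours, not MADpre's).
def gang_layout(n_procs, data_gang_size, n_detectors):
    assert n_procs % data_gang_size == 0, "data gangs must tile the procs"
    n_gangs = n_procs // data_gang_size        # detectors processed at once
    # if there are more detectors than gangs, each gang handles several
    # single-detector data sets one after another
    rounds = math.ceil(n_detectors / n_gangs)
    return n_gangs, rounds

print(gang_layout(64, 8, 20))  # (8, 3): 8 detectors at a time, 3 passes
```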
The official MADpre outputs are:

The temporary (not erased) outputs:

The temporary files contain the complete information about the weights (i.e., the preconditioner matrix prior to the inversion); however, it may be distributed over multiple files, each corresponding to a different pixel range as defined on the command line. Note that the weight matrix includes all pixels ever observed, i.e., also those for which the inversion may fail and which therefore will not be included in the preconditioner matrix stored in the prec.m3 file. The temporary files are called tmpStokesFile_p0_p1.wght, where p0 = i * [specified pixel range] (for i = 0, 1, ...) and p1 = p0 + [specified pixel range]; p0 and p1 thus define the pixel range, p0 <= p < p1, whose weights are stored in the given file. Only files corresponding to the observed pixel ranges are stored. In addition the files also contain the "noise weighted, simple binned map", a.k.a. the rhs of the map-making equation, i.e., A^t diag N^{-1} d.

These files are saved as simple binary dumps with the following content:

# of low resolution pixels (4 byte integer), denoted below np;
# of all pixels (low and high resolution) per low resolution pixel (int4), denoted nb;
# of elements in the upper triangle of each of the diagonal blocks assigned to each low resolution pixel (int4), denoted nm;
# of high resolution pixels per low resolution pixel (int4), denoted nh.

If the last number, nh, equals 1, i.e., we consider a single resolution case, then nb is the number of observed Stokes params, nm = (nb+1)nb/2, and the remaining file content is as follows:

a vector of length np of 4 byte ints, storing the observed pixel numbers;
a vector of length nb np of 8 byte reals, storing the submaps for the observed pixels in the given pixel range;
a vector of length nm np of 8 byte reals, storing all the upper triangles of the weight matrix blocks for each observed pixel from the given range.
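Before turning to the multiresolution case, the single resolution layout just described can be read with a short sketch like the one below (an illustration based on the description above, not official MADpre code; native endianness is assumed, as the files are non-portable, and the per-pixel ordering of the submap and weight vectors is an assumption):

```python
import struct
import numpy as np

def read_wght_single_res(path):
    """Read a single-resolution tmpStokesFile_p0_p1.wght dump.

    Assumed layout (native endianness; see the description above):
      int4: np, nb, nm, nh   (nh must be 1 here)
      int4[np]       observed pixel numbers
      float8[nb*np]  submaps (the rhs), nb Stokes params per pixel
      float8[nm*np]  upper triangles of the nb x nb weight blocks
    """
    with open(path, "rb") as f:
        npix, nb, nm, nh = struct.unpack("4i", f.read(16))
        assert nh == 1, "multiresolution file: see the nh > 1 layout"
        assert nm == nb * (nb + 1) // 2
        pixels = np.fromfile(f, dtype=np.int32, count=npix)
        submaps = np.fromfile(f, dtype=np.float64, count=nb * npix).reshape(npix, nb)
        weights = np.fromfile(f, dtype=np.float64, count=nm * npix).reshape(npix, nm)
    return pixels, submaps, weights
```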
If nh is larger than 1, i.e., we consider a multiresolution case, then nb is either nh+2, if the Stokes I map is of a higher resolution, or 2 nh + 1, if Stokes Q and U are high resolution; nm = (nb+1)nb/2; and the following objects are stored subsequently:

a vector of length np of 4 byte ints, storing the observed low resolution pixel numbers;
a vector of length nh np of 4 byte ints, storing the observed high resolution pixel numbers for each observed low resolution pixel (unobserved high resolution pixels are marked by -1);
a vector of length nb np of 8 byte reals, storing the submaps for the observed pixels in the given pixel range;
a vector of length nm np of 8 byte reals, storing all the upper triangles of the weight matrix blocks for each observed pixel from the given range.

A simple IDL reading routine can be found here. Note that these outputs will be erased by successive runs of MADpre initiated from the same directory.

Performance

MADpre performs the calculations in two steps:

(i) In the first step it reads consecutive segments of the input time domain (measurements and pointing) data and "projects" them onto the pixel domain, storing only the information about those pixels which belong to a given pixel interval. It then loops over the pixel intervals. The information collected for each pixel interval is written to disk as a temporary file (in the out-of-core mode).

(ii) In the second step, MADpre rereads the temporary files, distributes the pixel domain objects over all processors, and performs all the pixel domain operations (matrix block inversion, map estimation, etc.). The final output is saved at the end of this step.

It is the first step which is the most time and memory consuming. The total memory used by MADpre is determined by two command line parameters (but see Memory considerations below):

(i) the number of pixels per interval stored by each proc ([# of pixels per proc]) (= npixPerProc);
(ii) a TOD segment length to be read and stored in memory ([# of samples per proc]) (= ntod).
The number of pixel intervals, and hence of passes over the data, is Ceil[npixTot/npixPerProc], while the number of extra flops added as a result of this is roughly Ceil[npixTot/npixPerProc] * ntotTod (ntotTod is the total length of the time ordered data processed). For this reason it is advantageous to keep this ratio as small as possible. On the other hand, the performance is roughly independent of ntod, at least as long as the disc access latency is shorter than the time required for a single read of ntod data. Consequently, it is often advantageous to limit the memory usage by decreasing ntod as much as is reasonable, and to set npixPerProc as large as allowed by the remaining memory. However, if the disc access is very slow, it may be better to set ntod as large as allowed by the available memory and keep npixPerProc rather small, even if that leads to extra, spurious flops. MADpre estimates the required memory for each run given the input parameters and reports it on stdout.

Memory considerations

The memory allocated by MADpre depends on the two input parameters [# of samples per proc] and [# of pixels per proc]. They define the sizes of the time and pixel domain objects stored in memory by the code. MADpre estimates the memory required for these objects. However, MADpre uses the M3 library to perform the input. M3 routines internally allocate memory as needed, and some of the allocated M3 objects are preserved throughout the entire run-time of the code. They may therefore contribute significantly to the memory budget of MADpre. The specific case relevant for MADpre is that of the computation of the pointing information. Often such information is read from the disk only in a compressed form and then unfolded on-the-fly by the M3 and GCP library routines whenever needed. The unfolding requires some basic information about the pointing to be kept in memory at all times, leading to an often substantial overhead in the memory requirements of the MADpre code.
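The trade-off above can be made concrete with a back-of-the-envelope estimate (illustrative only; the numbers and names below are invented, not MADpre defaults):

```python
import math

# Rough cost of the pixel interval loop described above.
def interval_cost(npix_tot, npix_per_proc, ntot_tod):
    n_intervals = math.ceil(npix_tot / npix_per_proc)  # passes over the TOD
    extra_flops = n_intervals * ntot_tod               # rough extra work
    return n_intervals, extra_flops

# e.g. 3 million observed pixels, 500k pixels per proc, 1e9 samples in total:
print(interval_cost(3_000_000, 500_000, 10**9))  # (6, 6000000000)
```

Doubling npixPerProc halves the number of passes (and the spurious flops) at the price of doubling the pixel domain memory, which is exactly the tuning knob discussed above.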
An important point is that the size of such resident information depends on neither [# of samples per proc] nor [# of pixels per proc], so the memory requirement of MADpre cannot always be tuned by decreasing either of these two parameters. However, the extra memory will in general depend on how many processors are used per single detector data set and therefore can be decreased by increasing the value of [# data gang size]. Note that the data layout of MADpre is such that it attempts to minimize the extra memory burden due to the M3/GCP libraries. If no on-the-fly capability is invoked (this is defined in the XML config file), then the MADpre-provided estimate of the memory required to perform the run is generally adequate.

Dos and don'ts

Though MADpre can (and will) formally run for any combination of single dish detectors, i.e., total power and polarization-sensitive, in multi-resolution cases some of the actually observed pixels may be lost. It is therefore advised to separate the two kinds of detectors and run them one at a time, simply combining the precomputed preconditioners afterwards.

Availability and distribution

The MADpre software is part of the MADCAP suite of codes and requires the M3 and GCP libraries. The software is made available on the NERSC supercomputers as part of the CMB module, which is maintained there by the members of the Computational Cosmology Center of Berkeley Lab.

Contact

radek(at)apc(dot)univ(dash)paris7(dot)fr