Randomized algorithms for the data analysis of next-generation cosmic microwave background polarization observations.


Present-day cosmology is a high-precision science which, over the last few decades, has made great strides in improving our understanding of the nature and physics of the Universe. It has also reached the point where it serves as nature's largest laboratory, in which some of the most fundamental problems are discovered and the fundamental laws of physics can be put to the test under unique conditions that cannot be reproduced in any man-made facility. Indeed, it is cosmology that can bring tantalizing clues to some of the most exciting and mind-boggling questions of modern physics. How did the Universe come into being? What is the nature of dark matter? What is the nature of dark energy? Was Einstein right about gravity? What is the absolute mass scale of neutrinos?

Cosmological experiments involve observations, and cosmological insights are derived from the analysis of the collected data sets. Observations of the Cosmic Microwave Background (CMB) anisotropies have been, and continue to be, one of the primary observational probes of cosmology. The CMB is the light left over from the very early, dense and hot period of the Universe's evolution. Emitted very far away and a long time ago, CMB photons have traveled vast distances to reach our observatories today, bringing with them a snapshot of what the Universe looked like at the time they were born. This snapshot can be, and has been, used to obtain invaluable information about the Universe itself: its composition, geometry and even topology, setting statistically robust constraints on its numerous parameters. Thanks to the CMB observations of the past decade, combined with other complementary probes, the stage has been set for us to start addressing some of the questions listed above in a meaningful and fruitful way. Of these, the question of the origin of the Universe is probably the most exciting, and it is a question on which, at least at present, only measurements of a characteristic, divergence-free mode of the CMB polarization, called the B-mode, can shed light. Consequently, an entire slew of current, forthcoming and planned experimental efforts aims at its detection and characterization. If successful, their impact on modern physics will be truly ground-breaking.

The CMB B-mode polarization signal is very weak compared with other astrophysical and environmental signals. The required signal-to-noise drives the size of the data sets needed for a B-mode detection beyond anything we have grown accustomed to. Indeed, the volume of CMB data sets is projected to grow at Moore's-law rates for at least the next decade, maintaining a twenty-year-long trend. Moreover, the CMB B-mode signal is easily confounded with signals from other, non-cosmological sources, instrumental and astrophysical in origin. This calls for sophisticated observatories generating complex data sets with sufficient redundancy to permit disentangling and separating these different contributions. The forthcoming and planned CMB observatories will observe the sky with as many as O(10^5) detectors over periods of many years, collecting as many as O(10^15) samples and producing data sets with volumes of many petabytes.

Any information relevant to the Universe’s origins will have to be derived from these data sets. Though over the past two decades CMB data analysis has established itself as a sophisticated, autonomous area of cosmology, validated by numerous successful examples and important results, the new data sets will pose an unprecedented challenge to current data analysis pipelines. The challenge is multi-faceted and inherently multidisciplinary: it simultaneously requires computationally efficient numerical algorithms capable of exploiting the power of the largest current and forthcoming supercomputers, advanced implementations of those algorithms, and sophisticated statistical methods.

The project.
The focus of this project is on solvers for very large linear systems. Two specific systems are ubiquitous in CMB data analysis and key to any efficient analysis: the (generalized) map-making problem and the Wiener filter problem. We will target both as part of the proposed work.
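In one standard notation, with time-ordered data d, pointing matrix P, sky map m, and noise n with covariance N, the two systems take the following familiar forms (a schematic summary only, not a prescription of the project's exact formulation; in the second equation N denotes the map-domain noise covariance and S the signal covariance):

```latex
% Map-making: generalized least-squares estimate of the sky map
d = P\,m + n, \qquad \hat{m} = \left(P^{T} N^{-1} P\right)^{-1} P^{T} N^{-1}\, d .

% Wiener filtering of the estimated map
m_{\mathrm{WF}} = \left(S^{-1} + N^{-1}\right)^{-1} N^{-1}\, \hat{m} .
```

Both estimates require the solution of a very large linear system with a symmetric positive (semi-)definite matrix, which is why iterative solvers are the methods of choice.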

The proposed project involves three steps:

(1) development of new techniques for solving these very large systems;

(2) demonstrating and validating them on cutting-edge, realistic simulations;

(3) applying them to the forthcoming data of one of the major experimental efforts in the field, called the Simons Observatory. The experiment will become operational later this year.

In the case of both these problems, the goal is to compress, with minimal loss of information, the huge data sets collected directly by the CMB experiments into smaller, more manageable objects. In the simplest case these are maps of the observed sky signals; more generally they may be maps of signals of different physical origin, or linear combinations thereof. These are inverse problems, typically cast as generalized least-squares equations and solved with iterative algorithms. The most successful and common of these is the preconditioned conjugate gradient (PCG) method combined with advanced preconditioners. These techniques have proven very successful to date but fall short of the performance needed for an efficient analysis of the forthcoming data.
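As a minimal illustration of the baseline approach, the sketch below applies a textbook PCG solver with a simple Jacobi (diagonal) preconditioner to a toy version of the map-making normal equations; all matrix sizes and the white-noise model are purely illustrative and much smaller than in the real problem:

```python
import numpy as np

def pcg(A, b, M_inv, tol=1e-10, max_iter=200):
    """Preconditioned conjugate gradient for a symmetric positive definite A.

    M_inv is a callable applying the preconditioner to a vector.
    """
    x = np.zeros_like(b)
    r = b - A @ x
    z = M_inv(r)
    p = z.copy()
    rz = r @ z
    for i in range(max_iter):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol * np.linalg.norm(b):
            break
        z = M_inv(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x, i + 1

# Toy "map-making" normal equations: (P^T N^-1 P) m = P^T N^-1 d
rng = np.random.default_rng(0)
P = rng.standard_normal((500, 50))    # pointing-like design matrix (hypothetical sizes)
Ninv = np.ones(500)                   # white-noise inverse covariance (illustrative)
d = P @ rng.standard_normal(50) + 0.01 * rng.standard_normal(500)
A = P.T @ (Ninv[:, None] * P)
b = P.T @ (Ninv * d)
M_inv = lambda r: r / np.diag(A)      # Jacobi preconditioner
m_hat, n_iters = pcg(A, b, M_inv)
```

In the real pipelines P is far too large to form explicitly and is applied matrix-free; the structure of the iteration, however, is the same.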

In this project we propose to investigate novel approaches to these problems based on so-called randomized techniques. Randomization lies at the core of many machine learning applications and has been revolutionizing the solution of high-dimensional problems through random projections. Such projections embed high-dimensional vectors into a low-dimensional space while preserving some of the geometry, e.g., inner products. This makes it possible to obtain a so-called sketch of the problem, i.e., a compressed version of the original problem that can be solved at a significantly lower computational and communication cost.
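The sketch-and-solve idea can be illustrated on a small least-squares problem: a dense Gaussian projection (the standard, but expensive, baseline mentioned below) compresses the rows of a tall system before solving it. Sizes and the noise level here are arbitrary choices for the illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, k = 2000, 20, 200               # tall system; sketch dimension k << m
A = rng.standard_normal((m, n))
x_true = rng.standard_normal(n)
b = A @ x_true + 0.01 * rng.standard_normal(m)

# Full least-squares solution, for reference
x_full, *_ = np.linalg.lstsq(A, b, rcond=None)

# Sketch-and-solve: embed the m rows into k dimensions with a Gaussian projection,
# then solve the much smaller k x n least-squares problem
S = rng.standard_normal((k, m)) / np.sqrt(k)
x_sk, *_ = np.linalg.lstsq(S @ A, S @ b, rcond=None)

rel_err = np.linalg.norm(x_sk - x_full) / np.linalg.norm(x_full)
```

The sketched solution agrees with the full one to within the distortion of the embedding, at the cost of solving a system with k rather than m rows.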

We will consider three different applications of the randomization concept in our work:

(1) directly producing a sketch of the original problem and subsequently devising techniques to solve it. This will involve proposing new approaches to constructing random projections adapted to the specifics of our problems. The standard approach uses dense, random Gaussian matrices, which would be too computationally expensive given the data volumes expected in our case;

(2) implementing PCG techniques which capitalize on randomization while performing the Gram-Schmidt orthogonalization of the search directions. This has been shown to bring a theoretical speed-up of a factor of a few, by allowing higher concurrency and reducing the number of floating-point operations, as well as better numerical stability for systems as large as those considered in our applications;

(3) employing randomization to construct efficient preconditioners for the PCG solvers. We will specifically consider two-level preconditioners based on the deflation principle, and use randomized algorithms to construct the respective deflation space.
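To illustrate the idea behind item (3), the sketch below uses a randomized range finder (in the spirit of Halko-Martinsson-Tropp) to recover the dominant eigenspace of a small SPD test matrix; this is the kind of subspace a deflation-based two-level preconditioner would project out. The matrix, spectrum and ranks are invented for the illustration and are not the project's actual construction:

```python
import numpy as np

rng = np.random.default_rng(2)
n, r = 300, 10                        # system size, target deflation-space rank

# SPD test matrix with r dominant (slow-to-converge) eigen-directions
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
eigvals = np.concatenate([np.linspace(1e3, 1e2, r), np.ones(n - r)])
A = (U * eigvals) @ U.T               # A = U diag(eigvals) U^T

# Randomized range finder with small oversampling and one power iteration
Omega = rng.standard_normal((n, r + 5))
Y = A @ (A @ Omega)                   # the power step sharpens the captured subspace
Q, _ = np.linalg.qr(Y)                # orthonormal basis for the deflation space

# How well does span(Q) capture the dominant eigenspace U[:, :r]?
proj_err = np.linalg.norm(U[:, :r] - Q @ (Q.T @ U[:, :r]))
```

Given such a basis Q, a coarse-grid correction of the form Q (Q^T A Q)^{-1} Q^T can be combined with a fine-level preconditioner to deflate the troublesome eigenmodes; the appeal of the randomized construction is that it needs only a small number of (easily parallelized) applications of A.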

We expect that, taken together, these developments will allow us to accelerate the data analysis process by nearly an order of magnitude, enabling the scientific exploitation of the forthcoming data sets at a level not possible heretofore. This will have a major impact on the field, giving the student visibility in the community and putting them in a very good position with regard to their future career.

Our group at APC has been at the forefront of this research for nearly two decades and has developed a massively parallel framework that permits the efficient implementation and validation of novel approaches. The proposed project will benefit from the infrastructure and the know-how of the group.

The project will be carried out in the context of an established, long-term, multidisciplinary collaboration, B3DCMB (Big Bang from Big Data), which involves applied mathematicians and high-performance scientific computing experts from INRIA Paris (led by Dr. L. Grigori) and statisticians from ENSAE (led by Prof. N. Chopin). The student will have the opportunity to interact directly with all of them, benefiting from their expertise and knowledge.

The student will become a full member of the Simons Observatory collaboration, with full access to its data owing to the full institutional membership of the APC team in the collaboration.

The implementation plan is as follows:

1st year:
    •    implementation of items (1) and (2) in the context of the map-making and Wiener filter problems;
2nd year:
    •    implementation of item (3) and validation and demonstration of the techniques on simulated data;
    •    implementation of the developed software within the data analysis pipeline of the Simons Observatory;
    •    tests and validation on the actual data.
3rd year:
    •    application to the analysis of the Simons Observatory data and its scientific exploitation.


Josquin Errard and Radek Stompor






Required level: 


Supervisor's email: