Filled:
Summary
The proposed thesis aims to develop and apply a probabilistic programming language, written in JAX and using SPMD parallelism on GPUs, to scale Bayesian inference to large cosmological datasets. Such a language currently has no equivalent and will be a major asset for the scientific community. The thesis will require the development of low-level code and will be carried out collaboratively, following the open-source philosophy. The language will be applied to simulated and real cosmological datasets from both CMB experiments (Simons Observatory) and galaxy surveys (Rubin/LSST, Euclid). This ambitious project requires strong software development skills and an understanding of machine learning, and it will produce tools and expertise that are valuable to both the research community and industry.
Context
In a field such as cosmology, where the data come from an experiment that cannot be repeated, the best way to use the data to constrain physical models is to run the most realistic numerical simulations possible. These simulations of the Universe are then compared to the observational data using a Bayesian inference approach, in order to determine the initial parameters of the Universe together with their associated uncertainties.
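As a minimal sketch of this comparison between simulations and observations, the toy example below infers a single parameter from mock data. Everything in it is an illustrative assumption rather than part of the proposed language: the forward model `simulate` is a hypothetical one-parameter linear model, the noise is Gaussian with known amplitude, the prior is flat, and the posterior is evaluated by brute force on a grid (which is only feasible in very low dimension).

```python
import math
import random


def simulate(theta, xs):
    """Hypothetical forward model: a deterministic 'simulation' of the data."""
    return [theta * x for x in xs]


def log_likelihood(theta, xs, observed, sigma):
    """Gaussian log-likelihood of the observations given the simulation."""
    predicted = simulate(theta, xs)
    return -0.5 * sum(((o - p) / sigma) ** 2 for o, p in zip(observed, predicted))


def grid_posterior(xs, observed, sigma, grid):
    """Normalised posterior weights on a parameter grid, assuming a flat prior."""
    log_post = [log_likelihood(t, xs, observed, sigma) for t in grid]
    peak = max(log_post)  # subtract the peak for numerical stability
    weights = [math.exp(lp - peak) for lp in log_post]
    total = sum(weights)
    return [w / total for w in weights]


# Mock "observation": the true parameter is 0.3, with Gaussian noise added.
rng = random.Random(0)
true_theta, sigma = 0.3, 0.1
xs = [i / 10 for i in range(1, 51)]
observed = [true_theta * x + rng.gauss(0.0, sigma) for x in xs]

grid = [i / 1000 for i in range(0, 1001)]  # theta in [0, 1]
posterior = grid_posterior(xs, observed, sigma, grid)

# Parameter estimate and its associated uncertainty.
post_mean = sum(t * w for t, w in zip(grid, posterior))
post_std = math.sqrt(sum((t - post_mean) ** 2 * w for t, w in zip(grid, posterior)))
print(f"theta = {post_mean:.3f} +/- {post_std:.3f}")
```

A grid like this grows exponentially with the number of parameters, which is why realistic analyses rely on gradient-based sampling or optimisation instead.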
This is the case, for example, for the analysis of the Cosmic Microwave Background (CMB) signal and polarisation, in light of the upcoming ground-based Simons Observatory and the LiteBIRD satellite, which aim to constrain the early Universe. Another avenue, which can be correlated with the CMB data, is the weak gravitational lensing (WL) effect, a phenomenon in which light is deflected along its path by the presence of baryonic matter and, especially, dark matter. This effect, which we measure statistically in the images of the various cosmological surveys, allows us to constrain the properties of dark matter and the evolution of dark energy in the Universe. Simulating weak lensing maps to analyse images from the upcoming LSST and Euclid surveys, and inferring dark matter and dark energy parameters with the precision required by these experiments, calls for very high-dimensional models, with O(10^10) dimensions.
Since the advent of deep learning and of the associated software libraries for automatic differentiation (AD) of functions, the kind of analysis required by simulation-based Bayesian inference (SBI) has been greatly simplified and made portable to widely available hardware such as GPUs. However, the high dimensionality of the simulations needed for a weak lensing (WL) analysis of future data is still difficult to reach, because until recently the whole model had to fit into the memory of a single GPU, which is fixed and relatively small.
To fill this gap, Google has developed an AD framework that can parallelise models across devices, called Mesh-TensorFlow [1], which was used by [2] to create the first fully differentiable cosmological N-body simulator. This kind of model distribution is now available for the simulations, but it has yet to be developed for the probabilistic programming languages used to perform large-scale Bayesian inference.
Fortunately, this goal is getting closer: a collaboration has recently emerged with the user support team of IDRIS (the French AI supercomputing facility for public research) to develop an MPI-like framework for communication between GPUs (leveraging Mesh-TensorFlow), in order to distribute SPMD (Single Program, Multiple Data) instructions across GPU clusters.
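To make the SPMD pattern concrete, here is a minimal sketch that emulates it serially in plain Python. It is not the framework mentioned above: the helper names (`shard`, `local_program`, `all_reduce_sum`), the Gaussian toy likelihood, and the number of "devices" are all hypothetical. The point is only the structure: every device runs the same program on its own shard of the data, and a collective sum (the analogue of an MPI all-reduce) combines the partial results.

```python
def shard(data, n_devices):
    """Split the data into n_devices contiguous, near-equal shards."""
    size, extra = divmod(len(data), n_devices)
    shards, start = [], 0
    for rank in range(n_devices):
        stop = start + size + (1 if rank < extra else 0)
        shards.append(data[start:stop])
        start = stop
    return shards


def local_program(theta, local_data, sigma):
    """The single program every device runs on its own shard:
    a partial Gaussian log-likelihood (toy model, an assumption)."""
    return -0.5 * sum(((d - theta) / sigma) ** 2 for d in local_data)


def all_reduce_sum(partials):
    """Stand-in for a collective: every device ends up with the global sum."""
    total = sum(partials)
    return [total for _ in partials]


data = [0.1 * i for i in range(1000)]
theta, sigma, n_devices = 50.0, 10.0, 4

shards = shard(data, n_devices)
# Serial emulation: on a real cluster these calls would run concurrently,
# one per GPU, each holding only its own shard in memory.
partials = [local_program(theta, s, sigma) for s in shards]
global_loglike = all_reduce_sum(partials)[0]

# Reference: the same quantity computed on a single device.
single_device = local_program(theta, data, sigma)
print(global_loglike, single_device)
```

Because each device only ever holds its own shard, the memory needed per GPU shrinks as devices are added, which is the property that removes the single-GPU memory limit.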
[1] Shazeer, N., Cheng, Y., Parmar, N., Tran, D., Vaswani, A., Koanantakool, P., ... & Hechtman, B. (2018). Mesh-TensorFlow: Deep learning for supercomputers. Advances in Neural Information Processing Systems, 31. https://arxiv.org/abs/1811.02084
[2] Modi, C., Lanusse, F., & Seljak, U. (2021). FlowPM: Distributed TensorFlow implementation of the FastPM cosmological N-body solver. Astronomy and Computing, 37, 100505. https://arxiv.org/abs/2010.11847
Subject
The proposed topic builds on these recent developments. It focuses on the development, and the application to cosmology, of a probabilistic programming language written in JAX [3] and using the SPMD parallelism described above, so that the models scale with the dimensionality of the simulations. Such a language currently has no equivalent and will be a major asset for the scientific community. The thesis will require writing CUDA (C++) code for very low-level instructions. The whole development process will be collaborative, following the open-source philosophy. Day-to-day supervision will be provided by Alexandre Boucaud and François Lanusse, the lead developer.
During the first year and a half of the thesis, the language will be used to perform Bayesian inference on CMB polarisation data and on simulated galaxy images (weak lensing), in order to test its operation and to become familiar with the treatment of the various cosmological systematics. In the second part of the thesis, the language will be applied to the first year of Simons Observatory and Rubin/LSST data (first quarter of 2026), in order to propose an inference method that optimally exploits these long-awaited datasets. Access to first-year data of this kind is a rare opportunity during a thesis.
This is an ambitious subject, combining a scientific aspect of great interest with a highly technical one. It is aimed at someone with strong software development skills (in particular a good knowledge of low-level languages such as C++) and an understanding of machine learning, who is keen to apply these tools to fundamental physics topics. All of the techniques and knowledge developed during this thesis will be highly valued in the research community as well as in large companies. At the end of the thesis, the candidate will therefore be free either to continue in academic research or to change career path, without any penalty.
[3] https://github.com/google/jax