Astronomers Release Massive Dataset to Accelerate AI Research in Space Science

December 2, 2024


The James Webb Space Telescope; tracks of stars photographed as Earth rotates. JWST credit: NASA/Chris Gunn; star rotation image credit: KPNO/NOIRLab/AURA


A global team of astronomers and machine learning researchers today announced the release of the "Multimodal Universe," a groundbreaking 100-terabyte dataset that brings together hundreds of millions of astronomical observations in unprecedented detail and scale. This massive collection of space data aims to revolutionize how artificial intelligence can be applied to unlock the mysteries of the cosmos.

"The Multimodal Universe makes accessing machine learning-ready astronomical datasets as easy as writing a single line of code," says Helen Qu, a postdoctoral researcher at the Flatiron Institute. "I'm excited to see how this can accelerate new developments in both astronomy and machine learning."

Over decades, astronomers have pushed cutting-edge technologies to their limits across many fields in order to observe the universe and its constituent parts in multiple modes, such as near-infrared images from JWST or measurements of exoplanets from TESS. The Multimodal Universe combines observations from many of astronomy's most important surveys and telescopes, including the Dark Energy Spectroscopic Instrument (DESI), the Sloan Digital Sky Survey (SDSS), and other major space- and ground-based observatories, to enable new science. In total, it contains:

  • Over 120 million galaxy images
  • More than 5 million stellar and galactic spectra
  • Light curves for over 3.5 million astronomical objects
  • Detailed measurements of nearly 220 million stars from ESA's Gaia satellite
  • Compendia of downstream labels, such as supernova and galaxy classifications

"One of Multimodal Universe’s key features is its ability to combine data from multiple astronomical surveys" says Liam Parker, a Ph.D. student at Berkeley and a group member at Polymathic AI, "This will be critical as multimodal machine learning grows in popularity across the physical sciences." 

The data is being released in formats optimized for machine learning research. This is an important step toward enabling broad applications of machine learning in astronomy: until now, researchers often had to create, or re-create, their own datasets, a significant cost for small and large projects alike. Along with the data, the team is publishing benchmarking results that demonstrate its potential applications. These range from classifying galaxies, to better understand how they evolve, to improving early warning systems for supernova explosions so that astronomers don’t miss unique events.

“Our work, from around a dozen institutes and two dozen researchers, paves a path for machine learning to become a core component of modern astronomy,” says Polymathic AI member Micah Bowles, a Schmidt AI in Science Fellow at the University of Oxford. “Assembling this dataset was only possible through a broad collaboration of not only the Polymathic AI team but many expert astronomers from around the world.”

The Multimodal Universe is freely available to researchers worldwide through multiple access points, including Hugging Face. The team has also released extensive documentation and tools to help scientists work with the data effectively.

"We are witnessing a change of paradigm in the way AI is applied to astronomy and science in general," says Marc Huertas-Company, research scientist at the Instituto de Astrofísica de Canarias. "Supervised models trained for a specific task are being replaced by large multi-purpose models trained with large quantities of unlabelled and heterogenous data. The MMU dataset will play a key role in this transition."

"By easing access to astronomical data, we hope to create new opportunities for cross pollination between astronomy and machine learning," says Michael J. Smith, UniverseTBD member and Director of AI at Aspia Space. "Open datasets like the Multimodal Universe will help the community build better, more transparent foundation models. This is essential as we move toward more sophisticated AI applications in astronomy."

This project represents a significant step toward enabling more AI applications in astronomy. By reducing the cost of building and maintaining AI tools and analysis frameworks in astronomy, astrophysics, and cosmology, the Multimodal Universe aims to accelerate discoveries in galaxy evolution, stellar physics, and the nature of the Universe itself.


For more information, visit https://github.com/MultimodalUniverse and find the paper, poster and video recording at https://neurips.cc/virtual/2024/poster/97791

Read "New Datasets Will Train AI Models To Think Like Scientists," published today by the Simons Foundation.