Problem Overview
Satellite imagery is critical for a wide variety of applications, from disaster management and recovery to agriculture and military intelligence. A major obstacle for all of these use cases is the presence of clouds, which cover over 66% of the Earth’s surface (Xie et al., 2020). Clouds introduce noise and inaccuracy into image-based models and usually have to be identified and removed. Improving methods of identifying clouds can unlock a wide range of satellite imagery use cases, enabling faster, more efficient, and more accurate image-based research.
The labeling project used data from the Sentinel-2 mission, which captures wide-swath, high-resolution, multi-spectral imagery used to monitor land surface conditions and how they change over time. For each tile, the data is separated into different bands of light spanning the full visible spectrum, near-infrared, and infrared light.
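To illustrate this per-band layout, the sketch below stacks several band grids into a single multi-band cube. This is a minimal example with synthetic data: the band identifiers (B02, B03, B04, B08 for blue, green, red, and near-infrared) and the array shape are illustrative assumptions, not the official Sentinel-2 product format.

```python
import numpy as np

# Synthetic stand-in for one tile: four bands (blue, green, red,
# near-infrared), each stored as its own 2-D grid of reflectance values.
rng = np.random.default_rng(0)
height, width = 512, 512
band_names = ("B02", "B03", "B04", "B08")
bands = {name: rng.random((height, width), dtype=np.float32)
         for name in band_names}

# Stack the per-band grids into one (bands, height, width) cube,
# the layout most raster tools and ML pipelines expect.
cube = np.stack([bands[name] for name in band_names])
print(cube.shape)  # (4, 512, 512)
```

Keeping the bands separate until this step lets a pipeline choose which spectral bands to feed a given model.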
Sentinel-2 imagery has recently been used for critical applications like:
• Tracking an erupting volcano on the Spanish island of La Palma. Satellite images showed the path of lava flowing
across the land and helped evacuate towns in danger
• Mapping deforestation in the Amazon rainforest and identifying effective interventions
• Monitoring wildfires in California to identify their sources and track air pollutants
The biggest challenges in cloud detection are identifying thin clouds and distinguishing between bright clouds and other bright objects (Kristollari & Karathanassi, 2020). The three most common approaches are threshold methods, handcrafted models, and deep learning.
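A threshold method is the simplest of the three: any pixel brighter than a fixed cutoff in the visible bands is flagged as cloud. The sketch below is a minimal illustration with synthetic data; the 0.3 reflectance cutoff is an assumed value for demonstration, not a tuned detector.

```python
import numpy as np

def threshold_cloud_mask(red, green, blue, cutoff=0.3):
    """Flag a pixel as cloud if its mean visible reflectance exceeds cutoff.

    This also shows why thresholding struggles with the two hard cases
    noted above: thin clouds fall below the cutoff, while bright ground
    features (sand, snow, buildings) can rise above it.
    """
    brightness = (red + green + blue) / 3.0
    return brightness > cutoff

# Synthetic 4x4 tile: dark background with a bright, cloud-like
# 2x2 patch in the top-left corner.
rng = np.random.default_rng(42)
background = rng.uniform(0.0, 0.2, size=(4, 4))
red, green, blue = background.copy(), background.copy(), background.copy()
for band in (red, green, blue):
    band[:2, :2] = 0.8  # bright "cloud" pixels

mask = threshold_cloud_mask(red, green, blue)
print(int(mask.sum()))  # 4 pixels flagged as cloud
```

Handcrafted models refine this idea with engineered spectral features, while deep learning approaches learn the decision boundary directly from labeled data such as the dataset described below.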
The Project
The availability of labeled data has been a major obstacle to cloud detection efforts. Existing models have often been used as a proxy for ground truth, significantly limiting performance (Zupanc, 2017).
The labels for this dataset were generated using human annotation of the optical bands of Sentinel-2 imagery. As a first step, in 2021, Radiant Earth Foundation ran a contest to crowdsource data labels identifying clouds in satellite imagery, sponsored by Planet, Microsoft AI for Earth, and Azavea. The result is a diverse set of Sentinel-2 scenes labeled for cloudy pixels. To simplify the crowdsourcing task, a generic “cloud” / “no cloud” classification was implemented rather than categorizing clouds by type.
The resulting crowdsourced dataset, while extensive, had varying degrees of label quality. As a second step, with support from Microsoft AI for Earth, Radiant Earth worked with expert annotators at HAIVO by B.O.T to validate and, as needed, revise these labels using the Taqadam mobile app, which is designed for geospatial annotation use cases.
Outcome
The final dataset is a high-quality, human-verified set of cloud labels that spans imagery and cloud conditions across three continents (Africa, South America, and Australia). The dataset has an open license (CC BY 4.0) and will be made publicly available after the competition ends. The labeled dataset was used in a competition run by Microsoft AI for Earth and Radiant Earth Foundation, with an award for the best use case.
https://www.drivendata.org/competitions/83/cloud-cover/page/398/