Multi-Temporal Cloud Gap Imputation With HLS Imagery Across CONUS

This dataset contains temporal Harmonized Landsat-Sentinel imagery of diverse land covers across the Contiguous United States for the year 2022 along with binary cloud masks for the same area and year. This dataset's primary purpose is to train machine learning models for cloud gap imputation. The dataset contains 7,852 224x224x18 HLS scenes and 21,642 binary cloud masks of size 224x224.

Product Details

Visibility: Public
Owner: Clark Center for Geospatial Analytics
Created: 24 May 2024
Last Updated: 3 Apr 2025

Multi-Temporal Cloud Gap Imputation With HLS Data Across CONUS

Dataset Description

Dataset Summary

This dataset contains temporal Harmonized Landsat and Sentinel-2 (HLS) imagery of diverse land covers across the Contiguous United States (CONUS) for the year 2022 along with binary cloud masks for the same area and year.

Dataset Structure

hls-multi-temporal-cloud-gap-imputation/
├── chip_catalog.csv
├── cloud_catalog.csv
├── train/
│ ├── hls_chips/
│ ├── cloud_masks/
├── test/
│ ├── hls_chips/
│ ├── hls_chips_masked/

Data Catalogs

Data catalogs containing metadata for both HLS scenes and cloud masks are provided. The catalog for HLS scenes is provided in chip_catalog.csv and the catalog for cloud masks is provided in cloud_catalog.csv.

Cloud Catalog

cloud_catalog.csv contains the following columns:

Column	Contains
cloud_mask_id	'chip_XXX_YYY' corresponding to the file name
cloud_pct	ratio of cloudy pixels within the chip, ranges from 0.0 to 1.0
usage	'train' or 'validate'
bin	which of 10 bins the cloud mask falls within, e.g. 0.1-0.2
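
As a minimal sketch of how the `bin` column could relate to `cloud_pct`, the function below maps a cloud fraction to one of 10 equal-width bin labels. The label format follows the single example given above (0.1-0.2). The helper name `bin_label` and the handling of values that fall exactly on a bin edge are assumptions, so compare the result with the actual `bin` values in cloud_catalog.csv before relying on it.

```python
def bin_label(cloud_pct: float, n_bins: int = 10) -> str:
    """Return a label such as '0.1-0.2' for a cloud fraction in [0.0, 1.0].

    Assumption: bins are half-open [lower, upper), except that 1.0 is
    placed in the last bin. The catalog may treat bin edges differently.
    """
    if not 0.0 <= cloud_pct <= 1.0:
        raise ValueError("cloud_pct must be between 0.0 and 1.0")
    # Clamp so that a fully cloudy mask (1.0) lands in the last bin.
    index = min(int(cloud_pct * n_bins), n_bins - 1)
    lower = index / n_bins
    upper = (index + 1) / n_bins
    return f"{lower:.1f}-{upper:.1f}"


print(bin_label(0.15))
```

Under these assumptions a mask with `cloud_pct` of 0.15 is labelled 0.1-0.2, matching the example in the table.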

HLS Chip Catalog

chip_catalog.csv contains the following columns:

Column	Contains
chip_id	'chip_XXX_YYY' corresponding to the file name
chip_x	x coordinate of bounding box centroid
chip_y	y coordinate of bounding box centroid
tile	HLS tile ID
valid_first	count of valid pixels in the first time step
valid_second	count of valid pixels in the second time step
valid_third	count of valid pixels in the third time step
bad_pct_first	percent of invalid pixels in the first time step
bad_pct_second	percent of invalid pixels in the second time step
bad_pct_third	percent of invalid pixels in the third time step
first_image_date	date of first time step in `YYYY/mm/dd`
second_image_date	date of second time step in `YYYY/mm/dd`
third_image_date	date of third time step in `YYYY/mm/dd`
bad_pct_max	maximum percent of invalid pixels across all time steps
na_count	count of pixels in all time steps with no data
usage	'train' or 'validate'

Invalid pixels are those which intersect with any QA mask.
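
The catalog columns above can be read with the Python standard library. The following is a minimal sketch, assuming chip_catalog.csv is comma-delimited with a header row holding the column names listed above. The helper names `load_chip_rows` and `day_gaps` are hypothetical, and the two sample rows are made up for illustration rather than taken from the dataset.

```python
import csv
import io
from datetime import datetime

DATE_FORMAT = "%Y/%m/%d"  # matches the `YYYY/mm/dd` format described above


def load_chip_rows(csv_file, usage=None):
    """Read chip catalog rows, optionally keeping a single `usage` value."""
    rows = list(csv.DictReader(csv_file))
    if usage is not None:
        rows = [row for row in rows if row["usage"] == usage]
    return rows


def day_gaps(row):
    """Return the number of days between consecutive time steps of one chip."""
    columns = ("first_image_date", "second_image_date", "third_image_date")
    dates = [datetime.strptime(row[column], DATE_FORMAT) for column in columns]
    return [(later - earlier).days for earlier, later in zip(dates, dates[1:])]


# Made-up rows for illustration only; they are not taken from the dataset.
sample = io.StringIO(
    "chip_id,usage,first_image_date,second_image_date,third_image_date\n"
    "chip_001_002,train,2022/03/10,2022/05/09,2022/08/17\n"
    "chip_003_004,validate,2022/04/01,2022/04/17,2022/09/02\n"
)
train_rows = load_chip_rows(sample, usage="train")
print(train_rows[0]["chip_id"], day_gaps(train_rows[0]))
```

With the real catalog, the `io.StringIO` object would be replaced by a file opened with `open("chip_catalog.csv", newline="")`.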

Ground Truth HLS Scenes

The ground truth HLS scenes are stored in GeoTIFF format under train/hls_chips/ and test/hls_chips/. Each GeoTIFF file covers a 224 x 224 pixel area at 30m spatial resolution. Each file contains 18 bands consisting of 6 spectral bands at each of 3 time steps, stacked together. The file name structure is chip_XXX_YYY.tif where XXX and YYY refer to the row and column of a tile grid imposed on the contiguous US. Since the dataset is sampled from this country-wide grid, not all XXXs and YYYs are present in the dataset.
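
The file name convention can be parsed mechanically. This is a minimal sketch, assuming XXX and YYY are always written as plain digits, as in the chip_373_294.tif example used elsewhere in this README. The helper name `parse_chip_name` is hypothetical.

```python
import re

# Assumption: row and column are digit strings of any length.
CHIP_NAME = re.compile(r"^chip_(\d+)_(\d+)\.tif$")


def parse_chip_name(file_name: str) -> tuple[int, int]:
    """Return (row, column) of the tile grid encoded in a chip file name."""
    match = CHIP_NAME.match(file_name)
    if match is None:
        raise ValueError(f"not a chip file name: {file_name!r}")
    return int(match.group(1)), int(match.group(2))


print(parse_chip_name("chip_373_294.tif"))
```

For chip_373_294.tif this returns row 373 and column 294.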

Masked HLS Scenes for Testing

Testing scenes are pre-masked to ensure that all models are evaluated using the same test set. These scenes are stored in GeoTIFF format under test/hls_chips_masked/. The file name structure for masked scenes is chip_XXX_YYY_masked.tif. Each masked scene corresponds to the ground truth scene with the same value of XXX_YYY in the file name. For example, the file test/hls_chips/chip_373_294.tif is the ground truth for test/hls_chips_masked/chip_373_294_masked.tif, with the latter having values of 0 at cloud-masked locations. Cloud masks are present in all possible combinations of time steps in equal proportion. Possible combinations given time steps t1, t2, and t3 are:

t1
t2
t3
t1, t2
t1, t3
t2, t3
t1, t2, t3

So, for example, 1/7th of test scenes are masked at ONLY t2, and 1/7th at t2 AND t3, etc. for each of the possible combinations.
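
The seven combinations listed above are the non-empty subsets of the three time steps, and each masked file name maps to its ground truth file name by dropping the `_masked` suffix. The sketch below illustrates both conventions. The names `MASK_COMBINATIONS` and `ground_truth_name` are hypothetical, and the function works on file names only, leaving directories to the caller.

```python
from itertools import combinations

TIME_STEPS = ("t1", "t2", "t3")

# Every non-empty subset of the three time steps, smallest subsets first.
MASK_COMBINATIONS = [
    combo
    for size in range(1, len(TIME_STEPS) + 1)
    for combo in combinations(TIME_STEPS, size)
]


def ground_truth_name(masked_name: str) -> str:
    """Map 'chip_XXX_YYY_masked.tif' to the matching 'chip_XXX_YYY.tif'."""
    suffix = "_masked.tif"
    if not masked_name.endswith(suffix):
        raise ValueError(f"not a masked chip file name: {masked_name!r}")
    return masked_name[: -len(suffix)] + ".tif"


print(len(MASK_COMBINATIONS))  # 2**3 - 1 = 7 combinations
print(ground_truth_name("chip_373_294_masked.tif"))
```

Because the combinations occur in equal proportion, each one accounts for 1/7 of the test scenes.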

Cloud masks for the test scenes range from 0.01% coverage to 100% coverage, and are equally sampled from 10 equally sized bins between 0-100%.

Cloud Masks

The training cloud masks are stored in GeoTIFF format under train/cloud_masks/. The file name structure for cloud masks is chip_XXX_YYY_T_cmask.tif where XXX and YYY refer to the row and column of a tile grid imposed on the contiguous US. T refers to the time step of each cloud mask and is meant only to distinguish cloud masks derived from the same location from each other. The intent for training is that these cloud masks are randomly paired with training HLS scenes in all time steps. The distribution of cloud mask coverage for the training set does not correspond to the distribution of cloud mask coverage for the validation set, as the distribution of the latter has been equalized. This may lead to higher validation accuracy if the user chooses not to equalize the training dataset; that choice is left to the user's discretion.
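
One possible way to pair masks with training scenes, with or without equalizing coverage, is sketched below. It assumes the rows come from cloud_catalog.csv as dictionaries keyed by the column names in the Cloud Catalog section, and that masks are drawn independently (with replacement) for each time step. The function names and the two-stage sampling scheme are illustrative assumptions, not the procedure used by the dataset authors.

```python
import random
from collections import defaultdict


def group_by_bin(catalog_rows):
    """Group training cloud mask IDs by their `bin` label."""
    groups = defaultdict(list)
    for row in catalog_rows:
        if row["usage"] == "train":
            groups[row["bin"]].append(row["cloud_mask_id"])
    return dict(groups)


def sample_masks(groups, n_steps=3, equalize=True, rng=random):
    """Pick one cloud mask ID per time step for a single training scene.

    equalize=True draws a bin uniformly first and then a mask within it,
    which flattens the coverage distribution. equalize=False draws
    uniformly over all masks and keeps the natural distribution.
    """
    if equalize:
        bins = sorted(groups)
        return [rng.choice(groups[rng.choice(bins)]) for _ in range(n_steps)]
    all_masks = [mask for masks in groups.values() for mask in masks]
    return [rng.choice(all_masks) for _ in range(n_steps)]
```

Passing a seeded generator such as `random.Random(0)` as `rng` makes the pairing reproducible across runs.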

Band Order

In each HLS GeoTIFF the following bands are repeated for each of three observations throughout the year:

Channel	Name	HLS S30 Band number
1	Blue	B02
2	Green	B03
3	Red	B04
4	NIR	B8A
5	SWIR 1	B11
6	SWIR 2	B12
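
To locate a given band at a given time step in the 18-band stack, the index arithmetic can be sketched as follows. It assumes a time-major layout, meaning bands 1-6 hold the first observation, 7-12 the second, and 13-18 the third, which is how the description above reads but should be confirmed against a sample file. The helper name `band_index` is hypothetical, and the result is a 1-based band number as used by GDAL-style readers.

```python
BAND_NAMES = ("Blue", "Green", "Red", "NIR", "SWIR 1", "SWIR 2")
N_TIME_STEPS = 3


def band_index(time_step: int, band_name: str) -> int:
    """Return the 1-based band index within the 18-band stack.

    time_step is 1, 2, or 3. Assumes time-major ordering: bands 1-6 are
    the first observation, 7-12 the second, 13-18 the third.
    """
    if not 1 <= time_step <= N_TIME_STEPS:
        raise ValueError("time_step must be 1, 2, or 3")
    channel = BAND_NAMES.index(band_name) + 1  # 1-based channel from the table
    return (time_step - 1) * len(BAND_NAMES) + channel


print(band_index(2, "NIR"))
```

Under this assumption, NIR at the second time step is band 10 and SWIR 2 at the third time step is band 18.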

Masks are stored as single-band binary images where 1 denotes the presence of the cloud mask and 0 denotes the absence of the cloud mask.
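
Given this encoding, applying cloud masks to a scene amounts to setting pixels to 0 wherever the mask is 1, which mirrors how the masked test scenes are described. The NumPy sketch below assumes the chip is an array of shape (18, H, W) in time-major band order and that one mask of shape (H, W) is supplied per time step. The function name `apply_cloud_masks` is hypothetical and this is not the authors' own masking code.

```python
import numpy as np


def apply_cloud_masks(chip: np.ndarray, masks: np.ndarray) -> np.ndarray:
    """Zero out cloud-masked pixels in an 18-band chip.

    chip:  array of shape (18, H, W), assumed time-major (6 bands per step).
    masks: array of shape (3, H, W) with 1 where the cloud mask is present.
    """
    n_steps = masks.shape[0]
    bands_per_step = chip.shape[0] // n_steps
    # Repeat each time step's mask across that step's spectral bands.
    per_band_mask = np.repeat(masks.astype(bool), bands_per_step, axis=0)
    # np.where builds a new array, so the input chip is left unchanged.
    return np.where(per_band_mask, 0, chip)


chip = np.ones((18, 4, 4), dtype=np.float32)
masks = np.zeros((3, 4, 4), dtype=np.uint8)
masks[1, :2, :2] = 1  # cloud over the top-left corner at the second time step
masked = apply_cloud_masks(chip, masks)
print(masked[6:12, :2, :2].max(), masked[0, 0, 0])
```

To leave a time step unmasked, pass an all-zero mask for that step. Arrays in the (bands, rows, columns) layout used here are what a GeoTIFF reader such as rasterio typically returns.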

Dataset Creation

Code used to generate HLS scenes and cloud masks is available here. Code used to generate masked test scenes is available here. usage='validate' was used along with default parameters when initializing the dataset using the gapfill.py code. Refer to Seeing Through the Clouds: Cloud Gap Imputation with Prithvi Foundation Model for further information about the creation and initial use of this dataset.

Chip Generation and Partitioning

Three HLS scenes were selected between March and September 2022, with the time difference between scenes ranging from 1 to 200 days. After filtering for missing values and cloudy pixels, a total of 7,852 cloud-free chips evenly distributed across the CONUS were generated. This set was randomly partitioned into training (80%) and validation (20%) sets, resulting in 6,231 training chips and 1,621 validation chips.

Cloud Generation and Partitioning

Cloud masks were generated from the same region of CONUS using the HLS cloud mask quality flag and exported as a binary layer of cloudy and non-cloudy pixels. This yielded 21,642 cloud masks, of which 1,600 were randomly selected and reserved for validation, resulting in 20,042 training cloud masks.
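
The counts quoted in the two partitioning sections can be cross-checked with a few lines of arithmetic. The snippet below uses only the numbers stated in this README. It also shows that the realized chip split comes out close to, but not exactly at, the nominal 80/20 ratio.

```python
# Counts as stated in the partitioning sections of this README.
total_chips, train_chips, val_chips = 7852, 6231, 1621
total_masks, val_masks, train_masks = 21642, 1600, 20042

# Both partitions add up to their stated totals.
assert train_chips + val_chips == total_chips
assert total_masks - val_masks == train_masks

train_share = train_chips / total_chips
val_share = val_chips / total_chips
print(f"chip split: {train_share:.1%} train, {val_share:.1%} validation")
```

The chip split works out to roughly 79.4% training and 20.6% validation.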

License and Citation

This dataset is published under a CC-BY-4.0 license. If you find this dataset useful for your application, you can cite it as follows:

@misc{hls-multi-temporal-cloud-gap-imputation,
    author = {Godwin, Denys and Li, Hanxi (Steve) and Alemohammad, Hamed},
    doi    = {10.5281/zenodo.11281740},
    title  = {{Multi-Temporal Cloud Gap Imputation With HLS Data Across CONUS}},
    version = {1.0},
    year   = {2024}
}

Contact

For any questions about the dataset, you can contact Dr. Hamed Alemohammad.

Funding

This dataset is generated with funding from a grant awarded to Clark University Center for Geospatial Analytics (CGA) by NASA.